Attribute Uncertainty Quantification
Building attributes derived from source data (e.g., OpenStreetMap) or machine learning models inherently carry degrees of uncertainty.
To provide a transparent measure of data reliability, each attribute is accompanied by a confidence metric. This documentation details the dual-methodology approach—spatial intersection ratios for merged records and classification probability / regression bootstrapping for predicted values—used to quantify this uncertainty.
Type & Subtype
The confidence of the categorical attributes type and subtype is quantified as follows:
For Attributes Merged From OSM: Footprint Intersection Ratio
When an attribute is merged from a source dataset (e.g., OSM) onto a target building footprint, the confidence is calculated as the Intersection over Area (IoA), the ratio of the overlapping area between source and target geometries to the footprint area:
- 1.0: Perfect spatial overlap.
- < 1.0: Partial overlap, suggesting potential mismatch or misaligned geometries.
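The overlap ratio can be sketched as follows. This is a minimal illustration using axis-aligned bounding boxes and assumes the target footprint's area as the denominator; real footprints are polygons, for which a geometry library such as Shapely would be used instead.

```python
def box_intersection_area(a, b):
    """Intersection area of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def intersection_over_area(source, target):
    """Merge confidence: intersection area over the target footprint area.
    (The choice of denominator is an assumption made for this sketch.)"""
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    if target_area == 0:
        return 0.0
    return box_intersection_area(source, target) / target_area

square = (0.0, 0.0, 1.0, 1.0)
shifted = (0.5, 0.0, 1.5, 1.0)   # half-overlapping, misaligned footprint
print(intersection_over_area(square, square))   # 1.0 -> perfect overlap
print(intersection_over_area(shifted, square))  # 0.5 -> partial overlap
```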
For Attributes Predicted using ML: Calibrated Classification Probabilities
For attributes generated by our classification models, the confidence is the calibrated probability of the predicted class:
- Subtype Confidence: The output of the model after applying calibration to ensure the probability reflects real-world accuracy.
- Type Confidence: Since type (residential/non-residential) is an aggregate, its confidence is the sum of the calibrated probabilities of all subtypes belonging to that category:
\[P(\text{Type}) = \sum P(\text{Subtypes} \in \text{Type})\]
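The aggregation can be sketched as below. The subtype labels, their probabilities, and the subtype-to-type mapping are hypothetical placeholders, not the actual taxonomy.

```python
# Hypothetical calibrated subtype probabilities for one building
subtype_probs = {
    "single_family": 0.55,
    "multi_family": 0.25,
    "office": 0.12,
    "retail": 0.08,
}

# Assumed mapping of subtypes to the aggregate "residential" type
residential_subtypes = {"single_family", "multi_family"}

# P(Type) = sum of calibrated probabilities of the subtypes in that type
p_residential = sum(p for s, p in subtype_probs.items() if s in residential_subtypes)
p_non_residential = sum(p for s, p in subtype_probs.items() if s not in residential_subtypes)

print(round(p_residential, 2))      # 0.8
print(round(p_non_residential, 2))  # 0.2
```

Because the calibrated subtype probabilities sum to one, the two aggregate type confidences do as well.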
Height, Floors, Construction Year
The confidence of the numerical attributes height, floors, and construction year is quantified as follows:
For Attributes Merged From OSM: Value Extremes
If multiple source buildings match a single target footprint with different values:
- Lower: The minimum value found among all matching sources.
- Upper: The maximum value found among all matching sources.
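A minimal sketch of this merge rule, using hypothetical height values from several matched OSM buildings:

```python
# Hypothetical heights (metres) of OSM buildings matched to one target footprint
matched_heights = [12.0, 15.5, 14.0]

# Confidence bounds are simply the extremes of the matched source values
lower = min(matched_heights)
upper = max(matched_heights)

print(lower, upper)  # 12.0 15.5
```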
For Attributes Predicted using ML: Bootstrapped 95% CI
To quantify the uncertainty of our regression models, we use a bootstrap approach:
- Bootstrap Sampling: The model is run \(n=10\) times using different seeds or data subsamples, resulting in a set of predictions \(Y = \{y_1, y_2, \dots, y_{10}\}\).
- Standard Error Calculation: We calculate the sample mean \(\bar{y}\) and the standard error of the mean (SEM):
\[\text{SEM} = \frac{s}{\sqrt{n}}\]
where \(s\) is the sample standard deviation.
- Interval Calculation: The confidence bounds are defined using the \(t\)-statistic for \(n-1\) degrees of freedom (at \(\alpha = 0.05\)):
Lower Confidence:
\[\text{lower} = \bar{y} - t_{0.975, 9} \times \text{SEM}\]
Upper Confidence:
\[\text{upper} = \bar{y} + t_{0.975, 9} \times \text{SEM}\]
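The three steps above can be sketched with the standard library alone. The prediction values are hypothetical; the critical value \(t_{0.975, 9} \approx 2.262\) is hard-coded from a \(t\)-table (with SciPy available one would compute it via `scipy.stats.t.ppf(0.975, n - 1)`).

```python
import statistics
from math import sqrt

def bootstrap_ci_95(predictions, t_crit=2.262):
    """95% confidence interval from bootstrapped predictions.
    t_crit defaults to t_{0.975, 9}, matching n = 10 runs."""
    n = len(predictions)
    mean = statistics.fmean(predictions)            # sample mean y-bar
    sem = statistics.stdev(predictions) / sqrt(n)   # SEM = s / sqrt(n)
    return mean - t_crit * sem, mean + t_crit * sem

# Hypothetical height predictions (metres) from n = 10 bootstrap runs
preds = [14.1, 13.8, 14.4, 14.0, 13.9, 14.2, 14.3, 13.7, 14.0, 14.1]
lower, upper = bootstrap_ci_95(preds)
print(round(lower, 2), round(upper, 2))
```

Note that the interval quantifies the stability of the model's mean prediction across runs, not the spread of individual predictions.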