Attribute Uncertainty Quantification
Building attributes derived from source data (e.g., OpenStreetMap) or machine learning models inherently carry degrees of uncertainty.
To provide a transparent measure of data reliability, each attribute is accompanied by a confidence metric. This documentation details the dual-methodology approach—spatial intersection ratios for merged records and classification probability / regression bootstrapping for predicted values—used to quantify this uncertainty.
Type & Subtype
The confidence of the categorical attributes type and subtype is quantified as follows:
For Attributes Merged From OSM: Footprint Intersection Ratio
When an attribute is merged from a source dataset (e.g., OSM) onto a target building footprint, the confidence is calculated as the Intersection over Area (IoA), the ratio of the overlapping area between source and target geometries to the footprint area:
- 1.0: Perfect spatial overlap.
- < 1.0: Partial overlap, suggesting potential mismatch or misaligned geometries.
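The overlap ratio can be sketched as follows. This is a minimal illustration using axis-aligned bounding boxes and assumes the target footprint's area as the denominator; real footprints are polygons, for which a geometry library such as Shapely would be used instead.

```python
def box_intersection_area(a, b):
    """Intersection area of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def intersection_over_area(source, target):
    """Merge confidence: intersection area over the target footprint area.
    (The choice of denominator is an assumption made for this sketch.)"""
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    if target_area == 0:
        return 0.0
    return box_intersection_area(source, target) / target_area

square = (0.0, 0.0, 1.0, 1.0)
shifted = (0.5, 0.0, 1.5, 1.0)   # half-overlapping, misaligned footprint
print(intersection_over_area(square, square))   # 1.0 -> perfect overlap
print(intersection_over_area(shifted, square))  # 0.5 -> partial overlap
```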
For Attributes Predicted using ML: Calibrated Classification Probabilities
For attributes generated by our classification models, the confidence is the calibrated probability of the predicted class:
- Subtype Confidence: The output of the model after applying calibration to ensure the probability reflects real-world accuracy.
- Type Confidence: Since type (residential/non-residential) is an aggregate, its confidence is the sum of the calibrated probabilities of all subtypes belonging to that category:
\[P(\text{Type}) = \sum P(\text{Subtypes} \in \text{Type})\]
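The aggregation can be sketched as below. The subtype labels, their probabilities, and the subtype-to-type mapping are hypothetical placeholders, not the actual taxonomy.

```python
# Hypothetical calibrated subtype probabilities for one building
subtype_probs = {
    "single_family": 0.55,
    "multi_family": 0.25,
    "office": 0.12,
    "retail": 0.08,
}

# Assumed mapping of subtypes to the aggregate "residential" type
residential_subtypes = {"single_family", "multi_family"}

# P(Type) = sum of calibrated probabilities of the subtypes in that type
p_residential = sum(p for s, p in subtype_probs.items() if s in residential_subtypes)
p_non_residential = sum(p for s, p in subtype_probs.items() if s not in residential_subtypes)

print(round(p_residential, 2))      # 0.8
print(round(p_non_residential, 2))  # 0.2
```

Because the calibrated subtype probabilities sum to one, the two aggregate type confidences do as well.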
Height, Floors, Construction Year
The confidence of the numerical attributes height, floors, and construction year is quantified as follows:
For Attributes Merged From OSM: Value Extremes
If multiple source buildings match a single target footprint with different values:
- Lower: The minimum value found among all matching sources.
- Upper: The maximum value found among all matching sources.
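A minimal sketch of this merge rule, using hypothetical height values from several matched OSM buildings:

```python
# Hypothetical heights (metres) of OSM buildings matched to one target footprint
matched_heights = [12.0, 15.5, 14.0]

# Confidence bounds are simply the extremes of the matched source values
lower = min(matched_heights)
upper = max(matched_heights)

print(lower, upper)  # 12.0 15.5
```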
For Attributes Predicted using ML: Bootstrapped 95% CI
To quantify the uncertainty of our regression models, we use a bootstrap approach:
- Bootstrap Sampling: The model is run \(n=10\) times using different seeds or data subsamples, resulting in a set of predictions \(Y = \{y_1, y_2, \dots, y_{10}\}\).
- Standard Error Calculation: We calculate the sample mean \(\bar{y}\) and the standard error of the mean (SEM):
\[\text{SEM} = \frac{s}{\sqrt{n}}\]
where \(s\) is the sample standard deviation.
- Interval Calculation: The confidence bounds are defined using the \(t\)-statistic for \(n-1\) degrees of freedom (at \(\alpha = 0.05\)):
Lower Confidence:
\[\text{lower} = \bar{y} - t_{0.975, 9} \times \text{SEM}\]
Upper Confidence:
\[\text{upper} = \bar{y} + t_{0.975, 9} \times \text{SEM}\]
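The three steps above can be sketched with the standard library alone. The prediction values are hypothetical; the critical value \(t_{0.975, 9} \approx 2.262\) is hard-coded from a \(t\)-table (with SciPy available one would compute it via `scipy.stats.t.ppf(0.975, n - 1)`).

```python
import statistics
from math import sqrt

def bootstrap_ci_95(predictions, t_crit=2.262):
    """95% confidence interval from bootstrapped predictions.
    t_crit defaults to t_{0.975, 9}, matching n = 10 runs."""
    n = len(predictions)
    mean = statistics.fmean(predictions)            # sample mean y-bar
    sem = statistics.stdev(predictions) / sqrt(n)   # SEM = s / sqrt(n)
    return mean - t_crit * sem, mean + t_crit * sem

# Hypothetical height predictions (metres) from n = 10 bootstrap runs
preds = [14.1, 13.8, 14.4, 14.0, 13.9, 14.2, 14.3, 13.7, 14.0, 14.1]
lower, upper = bootstrap_ci_95(preds)
print(round(lower, 2), round(upper, 2))
```

Note that the interval quantifies the stability of the model's mean prediction across runs, not the spread of individual predictions.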