Efficient Data Loading & Filtering
Filtering at the data-loading stage is the most efficient way to extract specific building subsets based on quality, source, or geography. By using predicate pushdown across partitioned files, you avoid reading unnecessary data into memory.
Prerequisites
The following examples assume that you have already downloaded the complete EUBUCCO dataset, for example via the CLI, into a local directory such as `eubucco_data/`.
Spatial Filtering
Bounding Box Filtering
Limit the dataset to a specific geographic extent using helper columns.
Administrative Regions Filtering
Limit the dataset to specific cities, regions, or countries using the partition keys together with the row-group metadata of the region_id (NUTS 0/1/2) or city_id columns.
Source Filtering
Filtering by Footprint Source
Discard ML-derived footprints (i.e. Microsoft Footprints).
Keep only governmental data by discarding non-authoritative sources (i.e. OSM and Microsoft ML footprints).
Filtering by Attribute Source
Isolate records where a given attribute was not estimated using ML.
Isolate records where a given attribute comes from the same source as the footprint geometry (i.e. the attribute was neither merged nor estimated using ML).
Column Filtering
Select only the footprint geometry and the main building attributes, and discard the remaining metadata columns.
Confidence Filtering
Discard buildings with merged or estimated attributes that carry high uncertainty. We treat authoritative data (where confidence is NaN) as 100% certain.
Performance: Because the data is neither partitioned nor sorted by confidence, row groups cannot be skipped based on metadata statistics, so every row must be scanned. However, applying the filter while reading (rather than after loading) keeps memory usage low, since only matching rows are materialized.
GeoPandas (reads the data into memory, then filters):

```python
import geopandas as gpd

gdf = gpd.read_parquet("eubucco_data/")

# Categorical: keep rows whose type confidence exceeds 80%.
# Authoritative rows carry NaN confidence and are treated as 100% certain.
high_conf = gdf[gdf["type_confidence"].fillna(1.0) > 0.8]

# Numerical: precise height (uncertainty interval < 2 m). NaN bounds
# (authoritative data) would fail the comparison, so keep them explicitly.
spread = gdf["height_confidence_upper"] - gdf["height_confidence_lower"]
precise_height = gdf[(spread < 2.0) | spread.isna()]
```
PyArrow (pushes the filter into the scan, so only matching rows are materialized):

```python
import pyarrow.dataset as ds

# Categorical: type confidence > 80%; authoritative rows (null confidence)
# are treated as certain and kept.
precise_type = (
    (ds.field("type_confidence") > 0.8) | ds.field("type_confidence").is_null()
)

# Numerical: precise height (uncertainty interval < 2 m), again keeping
# authoritative rows whose confidence bounds are null.
height_spread = ds.field("height_confidence_upper") - ds.field("height_confidence_lower")
precise_height = (height_spread < 2.0) | ds.field("height_confidence_upper").is_null()

dataset = ds.dataset("eubucco_data/")
table = dataset.to_table(filter=precise_type & precise_height)
```