Notebook 1 — Data Acquisition¶
Author: Eun-Kyeong Kim (eun-kyeong.kim@lxp.lu), LuxProvide S.A.
Multimodal Flood Inference with TerraMind on MeluXina HPC¶
Goal: Search Microsoft Planetary Computer's STAC catalogue for Sentinel-1 SAR, Sentinel-2 optical, and Copernicus DEM tiles that cover the 2021 Luxembourg flood across four temporal phases (pre-month, pre-event, event, post-event). Save the selected item IDs and asset URLs to a JSON manifest so that Notebook 2 can download and package the data without re-querying the catalogue.
What you will learn¶
- How a STAC catalogue is structured (collections → items → assets)
- How to filter satellite passes by date window, spatial overlap, and image quality
- How to serialise acquisition metadata for reproducible downstream processing
1 · Imports¶
We use:
- pystac_client — Python client for querying STAC catalogues
- planetary_computer — signs asset URLs so they are temporarily downloadable
- geopandas / shapely — work with geographic geometries
- pandas — tabular summaries of candidate satellite scenes
from __future__ import annotations
import json
import math
from pathlib import Path
import geopandas as gpd
import pandas as pd
import planetary_computer
import pystac_client
from shapely.geometry import box, shape
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
2 · Configuration¶
All tunable parameters live here — change them once and the rest of the notebook updates automatically.
| Variable | Meaning |
|---|---|
| `EVENT["bbox"]` | Bounding box of the study area in WGS-84 lon/lat |
| `EVENT["phase_dates"]` | One representative date per temporal phase |
| `S1_WINDOW_DAYS` | Half-width (days) of the search window around each phase date for S1 |
| `S2_WINDOW_DAYS` | Same for Sentinel-2 (wider because clouds reduce availability) |
| `S2_MAX_CLOUD` | Maximum allowed cloud-cover percentage for S2 scenes |
# ── Output paths ────────────────────────────────────────────────────────────
ROOT = Path("data/terramind_flood_lux") # top-level data folder
META_ROOT = ROOT / "metadata" # where the manifest JSON is saved
META_ROOT.mkdir(parents=True, exist_ok=True)
# ── Study-area and event definition ─────────────────────────────────────────
# The 2021 Luxembourg flood (July 14–15) was one of the worst in decades,
# caused by extreme rainfall in the Ahr–Moselle catchment.
EVENT = {
    "name": "luxembourg_flood_2021_multimodal",
    # [west, south, east, north] in WGS-84 degrees
    "bbox": [5.7333, 49.4426, 6.5282, 50.1830],
    # UTM zone 32N — the projected CRS used for raster packaging in Notebook 2
    "target_crs": "EPSG:32632",
    # One representative date per temporal phase
    "phase_dates": {
        "pre_month": "2021-06-15",   # ~30 days before the flood peak
        "pre_event": "2021-07-11",   # a few days before the flood
        "event": "2021-07-15",       # peak flood day
        "post_event": "2021-07-18",  # early recovery
    },
}
# Fixed order — critical for aligning array time-steps later
PHASE_ORDER = ["pre_month", "pre_event", "event", "post_event"]
# ── STAC collection identifiers ──────────────────────────────────────────────
S1_COLLECTION = "sentinel-1-rtc" # Sentinel-1 Radiometrically Terrain Corrected
S2_COLLECTION = "sentinel-2-l2a" # Sentinel-2 Level-2A (surface reflectance)
DEM_COLLECTION = "cop-dem-glo-30" # Copernicus GLO-30 Digital Elevation Model
# ── Band / asset lists ───────────────────────────────────────────────────────
# These are the asset keys used by Planetary Computer for each collection.
S1_ASSETS = ["vv", "vh"] # dual-polarisation SAR backscatter
S2_ASSETS = [ # 12 multispectral bands (10–60 m native)
"B01", "B02", "B03", "B04",
"B05", "B06", "B07", "B08",
"B8A", "B09", "B11", "B12",
]
# ── Search window parameters ─────────────────────────────────────────────────
# Half-width in days: search ±N days around each phase date.
# SAR scenes are available every ~6 days, so a 4-day window is usually enough.
# Optical scenes may be cloudy, so we use a wider window for S2.
S1_WINDOW_DAYS = 4
S2_WINDOW_DAYS = 10
S2_MAX_CLOUD = 25.0 # keep only S2 scenes with < 25 % cloud cover
# ── Convenience: Region-of-Interest as a GeoDataFrame ────────────────────────
roi_geom = box(*EVENT["bbox"]) # Shapely Polygon
roi_gdf = gpd.GeoDataFrame({"geometry": [roi_geom]}, crs="EPSG:4326")
roi_gdf
| | geometry |
|---|---|
| 0 | POLYGON ((6.5282 49.4426, 6.5282 50.183, 5.733... |
Visualize the Region of Interest (covering Luxembourg)¶
# 1. Read the GeoJSON file into a GeoDataFrame
admin_geojson = 'https://download.data.public.lu/resources/limites-administratives-du-grand-duche-de-luxembourg/20260405-030017/limadmin.geojson'
admin_gdf = gpd.read_file(admin_geojson)
# 2. Ensure both are in the same CRS (crucial step!)
if admin_gdf.crs != roi_gdf.crs:
    admin_gdf = admin_gdf.to_crs(roi_gdf.crs)
# 3. Plot together (overlay): ROI extent filled, admin boundaries drawn on top.
# Note: the admin layer is drawn with a transparent face so it does not hide the ROI.
fig, ax = plt.subplots(figsize=(7, 7))
roi_gdf.plot(ax=ax, color='lightgray', edgecolor='gray')
admin_gdf.plot(ax=ax, facecolor='none', edgecolor='black')
plt.show()

3 · STAC catalogue connection and search helpers¶
How STAC works: A STAC catalogue is a web API that lists satellite imagery as items. Each item has a geometry, a date, and a set of named assets (the actual file URLs). We open the Planetary Computer catalogue and define helper functions to:
- Build a date-range string for the search window.
- Compute how much of our region-of-interest (ROI) each candidate scene covers.
- Score and rank candidates so we always pick the "best" scene per phase.
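Before querying the real catalogue, it helps to see the item → assets structure in miniature. The dictionary below is a hand-written, illustrative sketch of a STAC item's shape (field names follow the STAC spec; the IDs, URLs, and values are made up, not a real catalogue record):

```python
# Illustrative shape of a STAC item (values are made up, not a real record)
stac_item = {
    "type": "Feature",
    "collection": "sentinel-2-l2a",   # which collection it belongs to
    "id": "example-scene-id",         # unique item identifier
    "geometry": {"type": "Polygon",   # footprint of the scene on the ground
                 "coordinates": [[[5.7, 49.4], [6.5, 49.4], [6.5, 50.2],
                                  [5.7, 50.2], [5.7, 49.4]]]},
    "properties": {"datetime": "2021-07-15T10:40:00Z", "eo:cloud_cover": 3.3},
    "assets": {                       # named files, each with a download URL
        "B04": {"href": "https://example.com/B04.tif"},
        "B08": {"href": "https://example.com/B08.tif"},
    },
}
print(sorted(stac_item["assets"]))  # ['B04', 'B08']
```

The search helpers below read exactly these fields: the footprint from `geometry`, the acquisition time and cloud cover from `properties`, and the download URLs from `assets`.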
# Open the Planetary Computer STAC catalogue.
# `modifier=planetary_computer.sign_inplace` automatically signs asset URLs
# so they can be downloaded without manual authentication.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

def phase_window(date_str: str, half_window_days: int) -> str:
    """Return an ISO-8601 datetime interval string centred on *date_str*.

    Example: phase_window("2021-07-15", 4)
             → "2021-07-11T00:00:00Z/2021-07-19T23:59:59Z"
    """
    ts = pd.Timestamp(date_str, tz="UTC")
    start = (ts - pd.Timedelta(days=half_window_days)).strftime("%Y-%m-%dT00:00:00Z")
    end = (ts + pd.Timedelta(days=half_window_days)).strftime("%Y-%m-%dT23:59:59Z")
    return f"{start}/{end}"

def item_geometry(item):
    """Extract a Shapely geometry from a STAC item (prefers .geometry over .bbox)."""
    if item.geometry is not None:
        return shape(item.geometry)
    return box(*item.bbox)

def overlap_fraction(item, roi_polygon):
    """Return the fraction of *roi_polygon* that is covered by *item*'s footprint.

    A value of 1.0 means the scene fully covers the ROI.
    """
    scene_geom = item_geometry(item)
    if scene_geom.is_empty:
        return 0.0
    intersection = scene_geom.intersection(roi_polygon)
    return float(intersection.area / roi_polygon.area)

def safe_timestamp(value):
    """Parse a datetime string to a timezone-aware Timestamp; return NaT on failure."""
    return pd.to_datetime(value, utc=True, errors="coerce")

# ── Scene-scoring functions ──────────────────────────────────────────────────
# We want: maximum overlap, minimum distance to the target date, and (for S2)
# minimum cloud cover. Sorting by `score_tuple` puts the best candidate first.
def s1_score(item, phase_date: str) -> dict:
    """Compute a sortable score for a Sentinel-1 RTC scene.

    Priority (ascending): higher overlap first, then closer to the phase date.
    """
    props = item.properties
    dt = safe_timestamp(props.get("datetime") or props.get("start_datetime"))
    target = pd.Timestamp(phase_date, tz="UTC")
    days_off = abs((dt.normalize() - target.normalize()).days) if pd.notnull(dt) else 999
    overlap = overlap_fraction(item, roi_geom)
    return {
        "item": item,
        "datetime": dt,
        "days_from_phase": days_off,
        "aoi_overlap_frac": overlap,
        "relative_orbit": props.get("sat:relative_orbit"),
        "orbit_state": props.get("sat:orbit_state"),
        # Tuple used for sorting: negate overlap so that larger → lower sort key
        "score_tuple": (-overlap, days_off, str(dt),
                        str(props.get("sat:relative_orbit")),
                        str(props.get("sat:orbit_state"))),
    }

def s2_score(item, phase_date: str) -> dict:
    """Compute a sortable score for a Sentinel-2 L2A scene.

    Priority: higher overlap first, then lower cloud cover, then closer date.
    """
    props = item.properties
    dt = safe_timestamp(props.get("datetime") or props.get("start_datetime"))
    target = pd.Timestamp(phase_date, tz="UTC")
    days_off = abs((dt.normalize() - target.normalize()).days) if pd.notnull(dt) else 999
    overlap = overlap_fraction(item, roi_geom)
    cloud = float(props.get("eo:cloud_cover", 1000.0))
    return {
        "item": item,
        "datetime": dt,
        "days_from_phase": days_off,
        "aoi_overlap_frac": overlap,
        "eo:cloud_cover": cloud,
        "score_tuple": (-overlap, cloud, days_off, str(dt)),
    }

def search_phase_items(collection: str, phase_date: str,
                       half_window_days: int, query: dict | None = None):
    """Query the STAC catalogue and return all matching items as a list."""
    search = catalog.search(
        collections=[collection],
        bbox=EVENT["bbox"],
        datetime=phase_window(phase_date, half_window_days),
        query=query or {},
    )
    return list(search.items())
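The overlap computation is easy to sanity-check without touching the catalogue. The helper below is a pure-Python stand-in for `overlap_fraction` that works directly on `[west, south, east, north]` bounding boxes (the function name and test boxes are illustrative, not part of the notebook's pipeline):

```python
def bbox_overlap_fraction(scene, roi):
    """Fraction of the ROI bbox covered by the scene bbox.

    Boxes are [west, south, east, north]; a pure-Python stand-in for the
    Shapely-based overlap_fraction above.
    """
    width = max(0.0, min(scene[2], roi[2]) - max(scene[0], roi[0]))
    height = max(0.0, min(scene[3], roi[3]) - max(scene[1], roi[1]))
    roi_area = (roi[2] - roi[0]) * (roi[3] - roi[1])
    return (width * height) / roi_area

roi = [0.0, 0.0, 2.0, 2.0]                            # 2°×2° ROI
print(bbox_overlap_fraction([1.0, 0.0, 3.0, 2.0], roi))  # 0.5 — eastern half covered
print(bbox_overlap_fraction([5.0, 5.0, 6.0, 6.0], roi))  # 0.0 — no overlap
```

The Shapely version in the notebook does the same thing for arbitrary polygon footprints, which matters for Sentinel scenes whose footprints are rotated relative to the lon/lat grid.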
4 · Select best S1 and S2 scenes for each phase¶
For each of the four temporal phases we:
- Query the catalogue within the date window.
- Keep only candidates that have all required assets (band files present).
- Score and sort the candidates.
- Pick the top-ranked scene.
This loop populates two summary tables — s1_df and s2_df — that list the
chosen scenes with their quality metrics.
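The "score and sort" step relies on Python's lexicographic tuple comparison: candidates are compared on the first tuple element, with ties broken by the next element, and so on. A toy version of the ranking (hypothetical candidates, same negate-the-overlap trick as `score_tuple`):

```python
# Toy candidates: overlap fraction and days from the phase date
candidates = [
    {"id": "scene-a", "overlap": 0.8, "days": 1},
    {"id": "scene-b", "overlap": 1.0, "days": 3},
    {"id": "scene-c", "overlap": 1.0, "days": 0},
]
# Negating overlap makes "larger overlap" sort first under ascending order;
# days then breaks the tie between scene-b and scene-c.
candidates.sort(key=lambda c: (-c["overlap"], c["days"]))
print([c["id"] for c in candidates])  # ['scene-c', 'scene-b', 'scene-a']
```

This is why `score_tuple` stores `-overlap` as its first element: after a plain ascending sort, index 0 is always the best candidate.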
s1_rows = []  # will hold one dict per phase
s2_rows = []

for phase in PHASE_ORDER:
    phase_date = EVENT["phase_dates"][phase]

    # ── Sentinel-1 ────────────────────────────────────────────────────────
    # Require dual-polarisation (VV + VH) acquisitions only.
    s1_items = search_phase_items(
        S1_COLLECTION,
        phase_date=phase_date,
        half_window_days=S1_WINDOW_DAYS,
        query={"sar:polarizations": {"eq": ["VV", "VH"]}},
    )
    # Filter to scenes that actually contain both VV and VH asset files,
    # then score and rank.
    s1_candidates = [
        s1_score(item, phase_date)
        for item in s1_items
        if all(k in item.assets for k in S1_ASSETS)
    ]
    s1_candidates.sort(key=lambda r: r["score_tuple"])
    if not s1_candidates:
        raise RuntimeError(f"No S1 RTC candidates found for phase '{phase}'. "
                           "Try widening S1_WINDOW_DAYS.")
    best_s1 = s1_candidates[0]
    s1_rows.append({
        "phase": phase,
        "collection": S1_COLLECTION,
        "id": best_s1["item"].id,
        "datetime": best_s1["datetime"],
        "days_from_phase": best_s1["days_from_phase"],
        "aoi_overlap_frac": best_s1["aoi_overlap_frac"],
        "orbit_state": best_s1["orbit_state"],
        "relative_orbit": best_s1["relative_orbit"],
        # Signed download URLs — used by Notebook 2
        "vv_href": best_s1["item"].assets["vv"].href,
        "vh_href": best_s1["item"].assets["vh"].href,
    })

    # ── Sentinel-2 ────────────────────────────────────────────────────────
    s2_items = search_phase_items(
        S2_COLLECTION,
        phase_date=phase_date,
        half_window_days=S2_WINDOW_DAYS,
        query={"eo:cloud_cover": {"lt": S2_MAX_CLOUD}},
    )
    s2_candidates = [
        s2_score(item, phase_date)
        for item in s2_items
        if all(k in item.assets for k in S2_ASSETS)
    ]
    s2_candidates.sort(key=lambda r: r["score_tuple"])
    if not s2_candidates:
        raise RuntimeError(f"No S2 L2A candidates found for phase '{phase}'. "
                           "Try increasing S2_MAX_CLOUD or widening S2_WINDOW_DAYS.")
    best_s2 = s2_candidates[0]
    s2_rows.append({
        "phase": phase,
        "collection": S2_COLLECTION,
        "id": best_s2["item"].id,
        "datetime": best_s2["datetime"],
        "days_from_phase": best_s2["days_from_phase"],
        "aoi_overlap_frac": best_s2["aoi_overlap_frac"],
        "eo:cloud_cover": best_s2["eo:cloud_cover"],
        # One signed URL per spectral band
        **{f"{band}_href": best_s2["item"].assets[band].href for band in S2_ASSETS},
    })
# ── Display summary tables ────────────────────────────────────────────────
s1_df = pd.DataFrame(s1_rows).sort_values("phase").reset_index(drop=True)
s2_df = pd.DataFrame(s2_rows).sort_values("phase").reset_index(drop=True)
print("Selected Sentinel-1 scenes:")
display(s1_df[["phase", "id", "datetime", "days_from_phase",
               "aoi_overlap_frac", "orbit_state"]])
print("\nSelected Sentinel-2 scenes:")
display(s2_df[["phase", "id", "datetime", "days_from_phase",
               "aoi_overlap_frac", "eo:cloud_cover"]])
Selected Sentinel-1 scenes:
| | phase | id | datetime | days_from_phase | aoi_overlap_frac | orbit_state |
|---|---|---|---|---|---|---|
| 0 | event | S1A_IW_GRDH_1SDV_20210715T055052_20210715T0551... | 2021-07-15 05:51:04.897499+00:00 | 0 | 1.0 | descending |
| 1 | post_event | S1A_IW_GRDH_1SDV_20210715T055052_20210715T0551... | 2021-07-15 05:51:04.897499+00:00 | 3 | 1.0 | descending |
| 2 | pre_event | S1A_IW_GRDH_1SDV_20210710T054242_20210710T0543... | 2021-07-10 05:42:54.536403+00:00 | 1 | 1.0 | descending |
| 3 | pre_month | S1A_IW_GRDH_1SDV_20210616T054240_20210616T0543... | 2021-06-16 05:42:53.246501+00:00 | 1 | 1.0 | descending |
Selected Sentinel-2 scenes:
| | phase | id | datetime | days_from_phase | aoi_overlap_frac | eo:cloud_cover |
|---|---|---|---|---|---|---|
| 0 | event | S2A_MSIL2A_20210721T104031_R008_T31UGR_2021072... | 2021-07-21 10:40:31.024000+00:00 | 6 | 0.840362 | 3.280537 |
| 1 | post_event | S2A_MSIL2A_20210721T104031_R008_T31UGR_2021072... | 2021-07-21 10:40:31.024000+00:00 | 3 | 0.840362 | 3.280537 |
| 2 | pre_event | S2A_MSIL2A_20210721T104031_R008_T31UGR_2021072... | 2021-07-21 10:40:31.024000+00:00 | 10 | 0.840362 | 3.280537 |
| 3 | pre_month | S2A_MSIL2A_20210611T104021_R008_T31UGR_2021062... | 2021-06-11 10:40:21.024000+00:00 | 4 | 0.840362 | 16.999946 |
5 · Collect DEM tiles¶
The Copernicus GLO-30 DEM comes in 1°×1° tiles. Several tiles may be needed to cover the full ROI — we record all of them so Notebook 2 can mosaic them.
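Because GLO-30 tiles are indexed by the integer latitude/longitude of their south-west corner, you can predict the tile count from the bbox alone. A back-of-envelope check for our ROI (simplified tile labels, zero-padding hard-coded for single-digit longitudes — not a replacement for the STAC query):

```python
import math

bbox = [5.7333, 49.4426, 6.5282, 50.1830]  # [west, south, east, north]
# 1°×1° tiles are identified by the integer lat/lon of their SW corner,
# so flooring each bbox edge gives the range of tile corners needed.
lons = range(math.floor(bbox[0]), math.floor(bbox[2]) + 1)
lats = range(math.floor(bbox[1]), math.floor(bbox[3]) + 1)
tiles = [f"N{lat}_E00{lon}" for lat in lats for lon in lons]
print(tiles)  # ['N49_E005', 'N49_E006', 'N50_E005', 'N50_E006']
```

Four tiles, matching the catalogue search below. (Edge case to be aware of: a bbox edge that falls exactly on an integer degree would make this estimate include one extra row or column of tiles.)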
# Search for all DEM tiles that overlap the ROI (no date filter needed — DEM is static).
dem_items = list(
    catalog.search(
        collections=[DEM_COLLECTION],
        bbox=EVENT["bbox"],
    ).items()
)
# Keep only tiles with non-zero overlap; sort by overlap fraction for reference.
dem_rows = []
for item in dem_items:
    frac = overlap_fraction(item, roi_geom)
    if frac <= 0:
        continue
    dem_rows.append({
        "collection": DEM_COLLECTION,
        "id": item.id,
        "aoi_overlap_frac": frac,
        "data_href": item.assets["data"].href,
    })

dem_df = (pd.DataFrame(dem_rows)
          .sort_values("aoi_overlap_frac", ascending=False)
          .reset_index(drop=True))
print(f"Found {len(dem_df)} DEM tile(s) covering the ROI.")
dem_df
Found 4 DEM tile(s) covering the ROI.
| | collection | id | aoi_overlap_frac | data_href |
|---|---|---|---|---|
| 0 | cop-dem-glo-30 | Copernicus_DSM_COG_10_N49_00_E006_00_DEM | 0.500505 | https://elevationeuwest.blob.core.windows.net/... |
| 1 | cop-dem-glo-30 | Copernicus_DSM_COG_10_N49_00_E005_00_DEM | 0.252518 | https://elevationeuwest.blob.core.windows.net/... |
| 2 | cop-dem-glo-30 | Copernicus_DSM_COG_10_N50_00_E006_00_DEM | 0.164177 | https://elevationeuwest.blob.core.windows.net/... |
| 3 | cop-dem-glo-30 | Copernicus_DSM_COG_10_N50_00_E005_00_DEM | 0.082799 | https://elevationeuwest.blob.core.windows.net/... |
6 · Save acquisition manifest¶
The manifest is a single JSON file that records every decision made in this notebook: which scenes were chosen, their metadata, and the signed download URLs. Notebook 2 reads this file to know exactly which data to download — no need to re-query the catalogue.
manifest = {
    "event": EVENT,
    "phase_order": PHASE_ORDER,
    "collections": {
        "S1RTC": S1_COLLECTION,
        "S2L2A": S2_COLLECTION,
        "DEM": DEM_COLLECTION,
    },
    "assets": {
        "S1RTC": S1_ASSETS,
        "S2L2A": S2_ASSETS,
        "DEM": ["data"],
    },
    "selected": {},   # best scene per phase per sensor
    "dem_items": [],  # all DEM tiles needed for the ROI
}
# ── Populate S1 entries ───────────────────────────────────────────────────
for row in s1_rows:
    phase = row["phase"]
    manifest["selected"].setdefault(phase, {})
    manifest["selected"][phase]["S1RTC"] = {
        "collection": row["collection"],
        "id": row["id"],
        "datetime": pd.Timestamp(row["datetime"]).isoformat(),
        "days_from_phase": int(row["days_from_phase"]),
        "aoi_overlap_frac": float(row["aoi_overlap_frac"]),
        "orbit_state": row["orbit_state"],
        "relative_orbit": None if pd.isna(row["relative_orbit"])
                          else int(row["relative_orbit"]),
        # Signed download URLs — consumed by Notebook 2
        "hrefs": {"vv": row["vv_href"], "vh": row["vh_href"]},
    }
# ── Populate S2 entries ───────────────────────────────────────────────────
for row in s2_rows:
    phase = row["phase"]
    manifest["selected"].setdefault(phase, {})
    manifest["selected"][phase]["S2L2A"] = {
        "collection": row["collection"],
        "id": row["id"],
        "datetime": pd.Timestamp(row["datetime"]).isoformat(),
        "days_from_phase": int(row["days_from_phase"]),
        "aoi_overlap_frac": float(row["aoi_overlap_frac"]),
        "eo:cloud_cover": float(row["eo:cloud_cover"]),
        # Signed download URLs, one per spectral band — consumed by Notebook 2
        "hrefs": {band: row[f"{band}_href"] for band in S2_ASSETS},
    }
# ── Populate DEM entries ──────────────────────────────────────────────────
for row in dem_rows:
    manifest["dem_items"].append({
        "collection": row["collection"],
        "id": row["id"],
        "aoi_overlap_frac": float(row["aoi_overlap_frac"]),
        "data_href": row["data_href"],
    })
# ── Write to disk ─────────────────────────────────────────────────────────
manifest_path = META_ROOT / "multimodal_acquisition_manifest.json"
manifest_path.write_text(json.dumps(manifest, indent=2))
print(f"Manifest saved → {manifest_path}")
print(f" Phases recorded : {list(manifest['selected'].keys())}")
print(f" DEM tiles found : {len(manifest['dem_items'])}")
Manifest saved → data/terramind_flood_lux/metadata/multimodal_acquisition_manifest.json
Phases recorded : ['pre_month', 'pre_event', 'event', 'post_event']
DEM tiles found : 4
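Notebook 2 will read this JSON back with the standard library. A minimal round-trip sketch (written to a temporary directory so it does not touch the real output path; the stand-in manifest mirrors the top-level structure above, with made-up IDs):

```python
import json
import tempfile
from pathlib import Path

# Stand-in manifest with the same top-level structure as the real one
manifest = {
    "phase_order": ["pre_month", "pre_event", "event", "post_event"],
    "selected": {"event": {"S1RTC": {"id": "example-s1-id"}}},
    "dem_items": [{"id": "example-dem-tile"}],
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))       # what this notebook does
    loaded = json.loads(path.read_text())                 # what Notebook 2 would do

print(loaded["phase_order"][2])                    # event
print(loaded["selected"]["event"]["S1RTC"]["id"])  # example-s1-id
```

One caveat worth keeping in mind: the signed URLs carry time-limited tokens, so if too much time passes before Notebook 2 runs, the recorded item IDs can still be used to re-sign the assets.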