Image Database

Script: notebooks/01-prepare-data/create_image_database.py

Ingests JPEG composites into a LanceDB source table and enriches each row with IBTrACS hurricane labels matched at the temporal resolution of the ingested images.


Database Layout

Each project lives in its own subfolder inside shared_source/, keeping the dataset identity in the folder and the table name generic and renameable:

lancedb/shared_source/
  era5_sample_images/     ← SOURCE_PROJECT (LanceDB database)
    images.lance          ← IMG_RAW_TBL_NAME = "images"
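
As a minimal sketch of how this layout composes into paths, assuming the lancedb/ root is relative to the working directory (the helper name table_paths is hypothetical and does not come from the script):

```python
from pathlib import Path

SOURCE_PROJECT = "era5_sample_images"   # project folder (the LanceDB database)
IMG_RAW_TBL_NAME = "images"             # generic, renameable table name


def table_paths(root, project=SOURCE_PROJECT, table=IMG_RAW_TBL_NAME):
    """Return (database directory, on-disk table directory) for a project."""
    db_dir = Path(root) / "shared_source" / project
    return db_dir, db_dir / f"{table}.lance"


db_dir, tbl_dir = table_paths("lancedb")
print(db_dir.as_posix())    # lancedb/shared_source/era5_sample_images
print(tbl_dir.as_posix())   # lancedb/shared_source/era5_sample_images/images.lance
```

With the lancedb package installed, the database directory would typically be opened with lancedb.connect(str(db_dir)) and the table with open_table(IMG_RAW_TBL_NAME).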

Key Configuration

| Variable | Default | Description |
| --- | --- | --- |
| SOURCE_PROJECT | "era5_sample_images" | Project folder name inside shared_source/ |
| IMG_RAW_TBL_NAME | "images" | Table name (rename freely) |
| DT_FORMAT | "%Y%m%d_%H_rgb.jpeg" | strptime pattern for parsing timestamps from filenames |
| INGEST_RESOLUTION | "3h" | Subsample the source folder to this frequency (None = ingest all) |
| TEMPORAL_START / TEMPORAL_END | "2016-01-01" / "2018-12-31" | Date range for IBTrACS hurricane label loading |
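
To illustrate how DT_FORMAT maps a filename to a timestamp, here is a small sketch using the standard library. The helper parse_timestamp and the sample filename are hypothetical; returning None for non-matching names is an assumption, and the script may handle such files differently.

```python
from datetime import datetime

DT_FORMAT = "%Y%m%d_%H_rgb.jpeg"


def parse_timestamp(filename, dt_format=DT_FORMAT):
    """Return the datetime encoded in a filename, or None if it does not match."""
    try:
        return datetime.strptime(filename, dt_format)
    except ValueError:
        # The name does not follow the pattern (assumed here to be skipped).
        return None


print(parse_timestamp("20160914_06_rgb.jpeg"))  # 2016-09-14 06:00:00
print(parse_timestamp("notes.txt"))             # None
```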

Temporal Subsampling

If the source folder contains finer-grained data than needed, INGEST_RESOLUTION filters the file list before any image is decoded, so skipped files incur no decoding cost:

INGEST_RESOLUTION = "3h"   # keep 00Z, 03Z, 06Z, ... from an hourly folder
INGEST_RESOLUTION = "6h"   # keep 00Z, 06Z, 12Z, 18Z
INGEST_RESOLUTION = None    # ingest everything
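
The filtering step can be sketched as follows. This is an illustration only: the function names parse_resolution and subsample_files are hypothetical, only hour-based strings such as "3h" are handled, and a file is assumed to be kept when its time since midnight is a whole multiple of the resolution.

```python
from datetime import datetime, timedelta

DT_FORMAT = "%Y%m%d_%H_rgb.jpeg"


def parse_resolution(resolution):
    """Convert a string such as "3h" into a timedelta (hours only, for brevity)."""
    if not resolution.endswith("h"):
        raise ValueError(f"unsupported resolution: {resolution!r}")
    return timedelta(hours=int(resolution[:-1]))


def subsample_files(filenames, resolution, dt_format=DT_FORMAT):
    """Keep only filenames whose timestamp falls on the resolution grid."""
    if resolution is None:
        return list(filenames)  # ingest everything
    step = parse_resolution(resolution)
    kept = []
    for name in filenames:
        ts = datetime.strptime(name, dt_format)  # filename only, no image decode
        since_midnight = ts - ts.replace(hour=0, minute=0, second=0, microsecond=0)
        if since_midnight % step == timedelta(0):
            kept.append(name)
    return kept


hourly = [f"20160101_{hour:02d}_rgb.jpeg" for hour in range(24)]
print(len(subsample_files(hourly, "3h")))   # 8
print(len(subsample_files(hourly, "6h")))   # 4
print(len(subsample_files(hourly, None)))   # 24
```

Because the decision uses only the filename, no image file needs to be opened for the files that are dropped.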

Hurricane Enrichment

After ingestion, each image row is enriched with storm label columns sourced from IBTrACS v04r01 (North Atlantic basin):

| Column | Type | Description |
| --- | --- | --- |
| hurricane_present | bool | Any storm in the spatial domain at this timestep |
| n_storms | int | Number of distinct storms |
| max_wind_kts | float | Maximum WMO wind speed (knots) |
| max_category | int | Saffir-Simpson code (−1=TD, 0=TS, 1–5=Cat) |
| storm_ids | string | Comma-separated IBTrACS SIDs |
| storm_lats / storm_lons | string | Positions of representative observations |
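
For reference, the max_category coding can be reproduced from a wind speed in knots with the standard Saffir-Simpson thresholds. This is a sketch under assumptions: the function name saffir_simpson_code is hypothetical, and the script may instead read a precomputed category field from IBTrACS rather than deriving it from the wind speed.

```python
def saffir_simpson_code(wind_kts):
    """Map a sustained wind speed in knots to the -1..5 coding used above."""
    # (minimum wind in knots, code), checked from strongest to weakest.
    thresholds = [(137, 5), (113, 4), (96, 3), (83, 2), (64, 1), (34, 0)]
    for minimum, code in thresholds:
        if wind_kts >= minimum:
            return code
    return -1  # below tropical-storm strength: tropical depression


print(saffir_simpson_code(30.0))    # -1
print(saffir_simpson_code(50.0))    # 0
print(saffir_simpson_code(100.0))   # 3
```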

The temporal resolution is auto-detected from the dt column of the ingested table. For each image timestamp and each storm, the IBTrACS observation closest to the center of that timestamp's time bucket is selected as the representative observation.
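
The matching logic can be sketched roughly as below. Several things here are assumptions rather than the script's actual behavior: the resolution is taken as the most common gap between consecutive dt values, each bucket is assumed to be centered on the image timestamp and one resolution wide, and the names detect_resolution and nearest_per_storm are hypothetical. The sample storm ID and wind values are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def detect_resolution(timestamps):
    """Return the most common gap between consecutive distinct timestamps."""
    ordered = sorted(set(timestamps))
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two distinct timestamps")
    return Counter(gaps).most_common(1)[0][0]


def nearest_per_storm(image_ts, resolution, observations):
    """Pick, per storm, the observation closest to the bucket center.

    observations is an iterable of (sid, obs_time, wind_kts) tuples.
    The bucket is assumed to be [image_ts - res/2, image_ts + res/2).
    """
    half = resolution / 2
    best = {}
    for sid, obs_time, wind_kts in observations:
        delta = obs_time - image_ts
        if not (-half <= delta < half):
            continue  # observation lies outside this image's bucket
        distance = abs(delta)
        if sid not in best or distance < best[sid][0]:
            best[sid] = (distance, obs_time, wind_kts)
    return {sid: (t, w) for sid, (_, t, w) in best.items()}


# Illustrative data only (not real IBTrACS records).
dts = [datetime(2016, 9, 1, hour) for hour in (0, 3, 6, 9)]
res = detect_resolution(dts)
obs = [
    ("STORM_A", datetime(2016, 9, 1, 2, 0), 60.0),
    ("STORM_A", datetime(2016, 9, 1, 3, 30), 70.0),
    ("STORM_B", datetime(2016, 9, 1, 6, 0), 40.0),
]
picked = nearest_per_storm(datetime(2016, 9, 1, 3), res, obs)
print(res)             # 3:00:00
print(sorted(picked))  # ['STORM_A']
```

In this example STORM_A is represented by its 03:30 observation, which is 30 minutes from the bucket center, while STORM_B falls outside the 03:00 bucket and is left for the 06:00 image.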

Data attribution

IBTrACS data is provided by NOAA/NCEI and is in the public domain. See the Home page for full attribution.