Embeddings
Orchestrator: notebooks/02-generate-embeddings/generate_dinov3_embeddings.py
Engine: notebooks/02-generate-embeddings/helpers/v5_dino_embeddings_lancedb.py
The orchestrator connects to the source image table, configures an experiment, and runs a high-throughput embedding pipeline that produces image-level and patch-level representations.
Supported Models
| Family | Model | Output dim |
|---|---|---|
| dinov3_rect | DINOv2 ViT-B/14 (rectangular) | 768 |
| dinov3 | DINOv2 ViT-B/14 (square) | 768 |
| openclip | OpenCLIP ViT-B/32 | 512 |
Models are registered in helpers/model_registry.json and loaded via timm.
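As a rough illustration of how a family name could be resolved against the registry, the sketch below mirrors the table above in an in-memory JSON string. The schema (the `model` and `output_dim` keys) and the `resolve_model` helper are assumptions made for illustration; the actual layout of helpers/model_registry.json may differ.

```python
import json

# Hypothetical registry contents mirroring the table above; the real
# helpers/model_registry.json may use a different schema.
REGISTRY_JSON = """
{
  "dinov3_rect": {"model": "DINOv2 ViT-B/14 (rectangular)", "output_dim": 768},
  "dinov3":      {"model": "DINOv2 ViT-B/14 (square)",      "output_dim": 768},
  "openclip":    {"model": "OpenCLIP ViT-B/32",             "output_dim": 512}
}
"""


def resolve_model(family: str, registry: dict) -> dict:
    """Return the registry entry for a family, failing loudly on unknown keys."""
    try:
        return registry[family]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(f"Unknown model family {family!r}; known: {known}") from None


registry = json.loads(REGISTRY_JSON)
entry = resolve_model("openclip", registry)
print(entry["model"], entry["output_dim"])
```

Failing on an unknown family before any weights are loaded keeps a typo in the experiment config from surfacing only after a long job has started.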
Experiment Output Layout
Each experiment writes to its own subfolder under experiments/era5/:
lancedb/experiments/era5/
  <experiment_name>/
    <experiment_name>_config.lance   ← ~35 key/value metadata pairs
    image_embeddings.lance           ← one row per image
    patch_embeddings.lance           ← one row per patch per image
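The layout above can be expressed as a small path helper. This is a minimal sketch: `experiment_paths` is a hypothetical function and `my_experiment` is a placeholder name; only the directory and file names come from the layout shown above.

```python
from pathlib import Path


def experiment_paths(root: str, experiment_name: str) -> dict:
    """Build the three table paths for one experiment (hypothetical helper)."""
    exp_dir = Path(root) / "experiments" / "era5" / experiment_name
    return {
        "config": exp_dir / f"{experiment_name}_config.lance",
        "image_embeddings": exp_dir / "image_embeddings.lance",
        "patch_embeddings": exp_dir / "patch_embeddings.lance",
    }


# "my_experiment" is a placeholder experiment name.
paths = experiment_paths("lancedb", "my_experiment")
for name, path in paths.items():
    print(f"{name}: {path.as_posix()}")
```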
image_embeddings columns
| Column | Description |
|---|---|
| image_id | Foreign key back to source images table |
| embedding | L2-normalized image vector (mean-pooled patches) |
| attention_map | Flat CLS-to-patch attention map (spatial_h × spatial_w) |
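To make the `embedding` column concrete, here is a minimal pure-Python sketch of mean-pooling patch vectors and L2-normalizing the result. The toy 2-D vectors and the helper names are illustrative, and pooling before normalizing is an assumption; the engine may order these steps differently and will operate on tensors rather than lists.

```python
import math


def mean_pool(patches):
    """Average a list of equal-length patch vectors into one vector."""
    n = len(patches)
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / n for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


# Toy 2-D "patch embeddings"; real ones are 768- or 512-dimensional.
patches = [[3.0, 0.0], [0.0, 4.0], [3.0, 4.0]]
image_vec = l2_normalize(mean_pool(patches))
print(image_vec)  # approximately [0.6, 0.8]
```

Because the stored vectors have unit length, cosine similarity between two embeddings reduces to a plain dot product.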
patch_embeddings columns
| Column | Description |
|---|---|
| patch_id | Unique patch identifier |
| image_id | Foreign key back to source images table |
| patch_index | Position within the image grid |
| embedding | L2-normalized patch vector |
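The sketch below shows one way `patch_index` could map to a grid position. It assumes row-major ordering (left to right, then top to bottom), which is not confirmed by the schema above; both helper names are hypothetical.

```python
def patch_index_to_rc(patch_index: int, grid_w: int) -> tuple:
    """Convert a flat patch index to (row, col), assuming row-major order."""
    return divmod(patch_index, grid_w)


def rc_to_patch_index(row: int, col: int, grid_w: int) -> int:
    """Inverse mapping: (row, col) back to the flat index."""
    return row * grid_w + col


# Example on a 3 x 4 grid (3 rows, 4 columns): index 7 is row 1, column 3.
row, col = patch_index_to_rc(7, grid_w=4)
print(row, col)  # 1 3
```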
Inference Pipeline
The engine runs three concurrent components to maximize throughput:
- Worker pool: decodes JPEG blobs and normalizes tensors in parallel (mp.Pool)
- Batch collector: accumulates preprocessed tensors until the batch is full, then flushes to GPU
- Async writer: background thread writing embedding rows to LanceDB while the GPU processes the next batch
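The three components above can be sketched with the standard library alone. This is a simplified stand-in, not the engine's code: a `ThreadPoolExecutor` replaces `mp.Pool` so the example runs anywhere, `preprocess` and `embed_batch` are placeholder functions for JPEG decoding and the GPU forward pass, and the writer appends to a list instead of a LanceDB table.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 4
_STOP = object()  # sentinel telling the writer thread to exit


def preprocess(blob: bytes) -> list:
    """Stand-in for JPEG decode + normalization: scale bytes to [0, 1]."""
    return [b / 255.0 for b in blob]


def embed_batch(batch: list) -> list:
    """Stand-in for the GPU forward pass: one mean value per item."""
    return [sum(t) / len(t) for t in batch]


def writer_loop(q: queue.Queue, sink: list) -> None:
    """Drain embedding rows in the background until the sentinel arrives."""
    while True:
        rows = q.get()
        if rows is _STOP:
            break
        sink.extend(rows)  # the real engine would write to LanceDB here


def run_pipeline(blobs: list) -> list:
    written = []
    write_q = queue.Queue(maxsize=8)  # bounded, so a slow writer applies backpressure
    writer = threading.Thread(target=writer_loop, args=(write_q, written))
    writer.start()

    batch = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map yields preprocessed items in input order.
        for tensor in pool.map(preprocess, blobs):
            batch.append(tensor)
            if len(batch) == BATCH_SIZE:
                write_q.put(embed_batch(batch))  # flush a full batch
                batch = []
        if batch:
            write_q.put(embed_batch(batch))  # flush the final partial batch

    write_q.put(_STOP)
    writer.join()
    return written


blobs = [bytes([i] * 8) for i in range(10)]
embeddings = run_pipeline(blobs)
print(len(embeddings))
```

The bounded queue is the main design choice here: it lets writes overlap with compute while keeping memory use predictable if the writer falls behind.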
For rectangular images, dynamic_img_size=True is set so that the positional embeddings are resampled to the non-square patch grid at inference time, with no retraining.
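To see what the non-square grid looks like, the arithmetic can be sketched as follows. The patch size of 14 comes from the ViT-B/14 models listed earlier; the 224 × 448 input size is an arbitrary example, and requiring dimensions divisible by the patch size is an assumption of this sketch rather than a documented constraint of the engine.

```python
def patch_grid(height: int, width: int, patch_size: int = 14) -> tuple:
    """Return (grid_h, grid_w) for an image split into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"{height}x{width} is not divisible by patch size {patch_size}"
        )
    return height // patch_size, width // patch_size


# Hypothetical rectangular input: 224 pixels high, 448 pixels wide.
grid_h, grid_w = patch_grid(224, 448)
print(grid_h, grid_w, grid_h * grid_w)  # 16 32 512
```

A square 224 × 224 input would give a 16 × 16 grid, so the rectangular case doubles the number of patch positions the positional embeddings have to cover.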
Running on HPC
The orchestrator notebook can generate a ready-to-submit PBS job script, or run the embedding script directly via subprocess for interactive sessions.
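A job script generator could look roughly like the sketch below. `render_pbs_script` is a hypothetical helper, the resource values are placeholders, and the engine is invoked without arguments because its command-line interface is not described here; the `#PBS` directives themselves are standard PBS Professional syntax.

```python
def render_pbs_script(
    job_name: str,
    script_path: str,
    walltime: str = "02:00:00",
    ncpus: int = 8,
    ngpus: int = 1,
    queue: str = "",
) -> str:
    """Render a minimal PBS job script as a string (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l select=1:ncpus={ncpus}:ngpus={ngpus}",
        f"#PBS -l walltime={walltime}",
        "#PBS -j oe",  # merge stderr into stdout
    ]
    if queue:
        lines.append(f"#PBS -q {queue}")
    lines += [
        "",
        'cd "$PBS_O_WORKDIR"',  # run from the directory qsub was called in
        f"python {script_path}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder job name and resources; adjust to the target cluster.
script = render_pbs_script(
    "dino_embed",
    "notebooks/02-generate-embeddings/helpers/v5_dino_embeddings_lancedb.py",
)
print(script)
```

The rendered string can be written to a file and submitted with `qsub`; queue and account settings vary by site and are left out of the defaults.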
Model licenses
DINOv2 is released by Meta AI under Apache 2.0. OpenCLIP is released by LAION under MIT/BSD. Model weights are downloaded automatically at runtime via timm / open_clip.