Code used to generate the intake-esm catalogs used to access various datasets in NCAR’s GDEX and examples of utilizing the generated catalogs.
Overview¶
This repository contains tools and scripts for generating intake-ESM catalogs that provide unified access to diverse Earth science datasets. While intake-ESM was originally designed for Earth System Model output, we extend its use to observations, reanalysis data, and other Earth science datasets.
Repo Usage¶
The primary tool is generator/create_catalog.py. It generates an intake-ESM catalog for a specific dataset directory.
Basic CLI
python generator/create_catalog.py <directory> \
[--out <output directory>] \
[--catalog_name <name>] \
[--description <description>] \
[--exclude <glob> ...] \
[--include <glob> ...] \
[--depth <int>] \
[--ignore_vars <var name> ...] \
[--var_metadata <json string|filename>] \
[--global_metadata <json string|filename>] \
[--output_format <csv_and_json|single_json>] \
[--data_format <netcdf|zarr|reference>] \
[--make_remote]Options (brief)¶
<directory>: One or more root data directories to scan (space-separated).--out,-o: Destination directory for generated catalog files (default:./).--catalog_name,-n: Name to use for the catalog file(s) (default:dnnnnnn-posix).--description: Short human-readable description for the catalog (default:N/A).--exclude,-e: Glob pattern(s) to exclude files or directories (can be repeated).--include,-ic: Glob pattern(s) to include files or directories (can be repeated).--depth,-d: Maximum directory recursion depth (integer, default: 0).--ignore_vars,-i: Variable names to ignore (can be repeated).--var_metadata,-vm: Per-variable metadata as a JSON string or a path to a JSON file.--global_metadata,-gm: Catalog-level metadata as a JSON string or a path to a JSON file.--output_format,-of: Output style;csv_and_jsonemits CSV + JSON index files,single_jsonemits a single JSON catalog (default:csv_and_json).--data_format,-df: Input data/reference type:netcdf,zarr, orreference(default:netcdf).--make_remote,-mr: If set, prepare remote-accessible references for https and osdf (boolean flag).
Example¶
python generator/create_catalog.py \
/gdex/data/need/to/be/cataloged/ \
--data_format reference \
--out /data/path/to/store/catalog \
--output_format csv_and_json \
--catalog_name intake_catalog \
--description "reference catalog" \
--depth 0 \
--include "*.zarr" \
--exclude "*.tmp" \
--ignore_vars utc_date \
--var_metadata var_meta.json \
--global_metadata global_meta.json \
--make_remoteNotes
See
generator/create_catalog.pysource for full option parsing and advanced behaviors.
Key Features¶
1. Custom Catalog Generation Tools (ecgtools)¶
We use a custom fork of ecgtools, Currently, pin to commit SHA = 0b3d5b5d0082812e85c821c00c2d619eed0ae3cd along with custom scripts to generate our catalogs. This allows us to:
Handle diverse data formats and structures
Implement custom parsing logic for different data sources
Maintain consistency across various dataset types
2. Broad Dataset Support¶
Although intake-ESM is primarily meant for Earth System Model output, we leverage the package to generate catalogs for:
Observations (satellite, in-situ, etc.)
Reanalysis datasets (ERA5, JRA-3Q, etc.)
Model output (CESM, CMIP, etc.)
Other Earth science data
We strive to match our vocabulary (column names) with conventions used by other major data providers including:
DKRZ (Deutsches Klimarechenzentrum)
Copernicus Climate Data Store
NASA data repositories
NOAA data services
3. Multiple Access Methods¶
Our catalogs support different data access patterns through three main flavors:
a) POSIX¶
Direct filesystem access for users on NCAR HPC systems (Casper, Derecho)
b) HTTPS¶
Web-based access for remote users and standard HTTP protocols
c) OSDF (Open Science Data Federation)¶
Distributed access through the Open Science Data Federation for broader community access
Catelog Usage Examples¶
For comprehensive usage examples and tutorials for the generated catelog:
NCAR HPC users: Visit gdex-examples
OSDF users: Visit osdf_examples
Support and Contributions¶
Issues and Feature Requests¶
We welcome feedback from the community! Please use GitHub issues for:
Bug reports when something is broken
Feature requests for new functionality or datasets
Note: While we appreciate all feature requests, please understand that we may not be able to fulfill all requests due to resource constraints and project priorities.
Getting Help¶
Check the documentation and examples linked above
Search existing GitHub issues for similar problems
Open a new issue with detailed information about your use case
Repository Structure¶
├── README.md
├── requirements.txt
├── generator/ # Core catalog generation tools
│ ├── create_catalog.py
│ └── modify_catalog.py
├── notebooks/ # Example notebooks and development work
└── test/ # Test scriptsInstallation¶
git clone https://github.com/NCAR/gdex-intake-esm.git
cd gdex-intake-esm
pip install -r requirements.txt