Purpose¶
This repository contains tools and examples for creating Kerchunk reference files for NCAR’s GDEX (Geoscience Data Exchange) data holdings. These reference files provide cloud-optimized access to GDEX data collections, enabling efficient analysis of large climate datasets without requiring full data downloads.
Key Features:
Create Kerchunk reference files for GDEX NetCDF and GRIB data
Generate both individual sidecar files and combined reference files
Support for remote data access via HTTPS and OSDF protocols
Distributed processing capabilities using Dask for large datasets
Integration with GDEX data storage infrastructure at NCAR
Note: Parquet output format is supported for combined references but may have limited functionality compared to JSON format.
Main Scripts¶
create_kerchunk.py¶
The primary tool for creating Kerchunk reference files from GDEX data holdings.
python src/create_kerchunk.py -h
usage: create_kerchunk.py [-h] --action <combine|sidecar> --directory <directory> [--output_directory <directory>] [--filename <output filename>]
[--extensions <extension> [<extension> ...]] [--variables <variable names> [<variable names> ...]]
[--cluster < PBS / single / local >] [--dry_run] [--make_remote] [--regex <regular expression>]
[--output_format < json / parquet >]
Creates kerchunk sidecar files of an entire directory structure.
optional arguments:
-h, --help show this help message and exit
--action <combine|sidecar>, -a <combine|sidecar>
Specify whether to create combined references or create sidecar files.
--directory <directory>, -d <directory>
Directory to scan and create kerchunk reference files.
--output_directory <directory>, -o <directory>
Directory to place output files (default: current directory)
--filename <output filename>, -f <output filename>
Filename for output json.
--extensions <extension> [<extension> ...], -e <extension> [<extension> ...]
Only process files of this extension
--variables <variable names> [<variable names> ...], -v <variable names> [<variable names> ...]
Only gather specific variables. Variable names are case sensitive. Use the special keyword 'ALL' to separate all into individual files.
--cluster < PBS / single / local >, -c < PBS / single / local >
Choose type of dask cluster to use:
PBS - PBSCluster (defaults to 5 workers, uses GDEX queue)
single - singleThreaded
local - localCluster (uses os.ncpus)
--dry_run, -dr Do a dry run of processing
--make_remote, -mr Additionally make a remote accessible copy of json with GDEX URLs
--regex <regular expression>, -r <regular expression>
Combine references that match the specified regular expression
--output_format < json / parquet >, -of < json / parquet >
Specify the output format for combined references (default: json)
Note: Parquet format support is experimental
Example Usage¶
# Create individual sidecar files for a directory
python src/create_kerchunk.py --action sidecar --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output
# Create combined reference file for NetCDF files
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json
# Create combined reference with remote access capability
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename bnd_ocean.194907.json --make_remote
# Dry run to preview processing
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json --dry_run
Additional Tools¶
convert_ref_file_loc.py¶
Converts local file paths in Kerchunk reference files to remote HTTPS or OSDF URLs for cloud access.
Supported remote endpoints:
https://data.gdex.ucar.edu (primary GDEX data portal)
osdf:///ncar/gdex (Open Science Data Federation endpoint)
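The path conversion described above can be sketched in a few lines. This is a minimal illustration, not the code in convert_ref_file_loc.py: it assumes version 1 Kerchunk JSON (a top-level "refs" mapping whose chunk entries are lists of path, offset, and length), and it assumes that the local data root maps directly onto the remote endpoint root. The function name to_remote_refs, the file name example.nc, and the variable name sst are hypothetical.

```python
import json

# Assumed prefixes: the local root and the remote endpoints appear in this
# document, but the exact mapping between them is an assumption of this sketch.
LOCAL_ROOT = "/glade/campaign/collections/gdex/data/"
REMOTE_ROOTS = {
    "https": "https://data.gdex.ucar.edu/",
    "osdf": "osdf:///ncar/gdex/",
}


def to_remote_refs(refs, protocol="https"):
    """Return a copy of a Kerchunk reference dict with local paths rewritten."""
    remote_root = REMOTE_ROOTS[protocol]
    out = {}
    for key, value in refs["refs"].items():
        # Chunk references are lists such as [path, offset, length]; inline
        # data and Zarr metadata are plain strings and are left untouched.
        is_chunk_ref = isinstance(value, list) and value and isinstance(value[0], str)
        if is_chunk_ref and value[0].startswith(LOCAL_ROOT):
            value = [remote_root + value[0][len(LOCAL_ROOT):]] + value[1:]
        out[key] = value
    return {**refs, "refs": out}


# Hypothetical reference content used only to exercise the function.
example = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "sst/0.0": [LOCAL_ROOT + "d640000/example.nc", 4096, 1024],
    },
}
print(json.dumps(to_remote_refs(example)["refs"]["sst/0.0"]))
```

The input dictionary is not modified, so the local and remote references can be written out side by side.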
create_kerchunk_grib.py¶
Specialized tool for creating Kerchunk reference files from GRIB format data, with support for parameter ID filtering.
separate_kerchunk.py¶
Utility for separating combined Kerchunk reference files into individual variable-specific reference files.
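As a rough illustration of what separating a combined reference involves, the sketch below splits a version 1 Kerchunk reference dictionary by the variable-name prefix of each key. It is not the implementation in separate_kerchunk.py. It assumes a flat (non-nested) Zarr layout, and the function name split_by_variable, the coords parameter, and the variable names are all hypothetical.

```python
def split_by_variable(refs, coords=("time", "lat", "lon")):
    """Split a version 1 Kerchunk reference dict into one dict per data variable."""
    entries = refs["refs"]
    # Keys without a "/" (".zgroup", ".zattrs") are group-level metadata.
    shared = {k: v for k, v in entries.items() if "/" not in k}
    # Coordinate variables are copied into every output so each file stays usable alone.
    shared.update({k: v for k, v in entries.items() if k.split("/", 1)[0] in coords})
    data_vars = {k.split("/", 1)[0] for k in entries if "/" in k} - set(coords)
    return {
        name: {
            "version": refs.get("version", 1),
            "refs": {**shared, **{k: v for k, v in entries.items() if k.startswith(name + "/")}},
        }
        for name in sorted(data_vars)
    }


# Hypothetical combined reference; "{}" stands in for real .zarray metadata.
combined = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/.zarray": "{}",
        "time/0": ["file.nc", 0, 8],
        "sst/.zarray": "{}",
        "sst/0.0": ["file.nc", 8, 64],
        "ice/.zarray": "{}",
        "ice/0.0": ["file.nc", 72, 64],
    },
}
separated = split_by_variable(combined, coords=("time",))
print(sorted(separated))
```

Each resulting dictionary can be dumped with json.dump to produce a variable-specific reference file.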
convert_chunks.py¶
Tool for modifying the chunk sizes of files.
Repository Structure¶
├── src/ # Main source code directory
│ ├── create_kerchunk.py # Primary Kerchunk creation tool
│ ├── create_kerchunk_grib.py # GRIB-specific Kerchunk tool
│ ├── convert_ref_file_loc.py # Local to remote path converter
│ ├── separate_kerchunk.py # Reference file separator
│ └── convert_chunks.py # Chunk size modifier
├── examples/ # Usage examples and batch scripts
└── test/ # Test scripts and validation notebooks
GDEX Integration¶
This repository is specifically designed to work with NCAR’s GDEX (Geoscience Data Exchange) infrastructure:
Data Sources: Processes data from /glade/campaign/collections/gdex/data/
Remote Access: Generates reference files compatible with GDEX web services
HPC Integration: Configured for NCAR’s PBS job scheduler with GDEX queue
Protocols: Supports both HTTPS and OSDF data federation protocols
Important Notes¶
Parquet Support: Parquet is available as an output option (--output_format parquet), but support is experimental and it may have limited functionality compared to the default JSON format.
PBS Cluster: When using --cluster PBS, jobs are automatically submitted to the GDEX queue.
Remote References: The --make_remote flag creates additional reference files with GDEX URLs for cloud-native data access.
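For readers who want to consume a remote reference file, the sketch below shows one common way to open a Kerchunk JSON reference with xarray through the fsspec reference:// filesystem. It is an assumption-laden example rather than part of this repository: it requires xarray, zarr, fsspec, and an HTTP backend such as aiohttp, the helper names are hypothetical, and the file name is borrowed from the example usage earlier (the name of the remote copy written by --make_remote is not documented here, so substitute the actual file). Depending on installed versions, a different engine setting may be needed.

```python
def reference_open_kwargs(reference_file, remote_protocol="https"):
    """Build keyword arguments for xarray.open_dataset on a Kerchunk JSON reference."""
    return {
        "engine": "zarr",
        "backend_kwargs": {
            "consolidated": False,
            "storage_options": {
                # Path or URL of the Kerchunk reference JSON.
                "fo": reference_file,
                # Protocol used to read the data files the references point to.
                "remote_protocol": remote_protocol,
            },
        },
    }


def open_reference(reference_file, remote_protocol="https"):
    """Open the referenced dataset lazily; only metadata is read up front."""
    import xarray as xr  # imported here so the helper above has no dependencies

    return xr.open_dataset(
        "reference://", **reference_open_kwargs(reference_file, remote_protocol)
    )


kwargs = reference_open_kwargs("bnd_ocean.194907.json")
print(kwargs["backend_kwargs"]["storage_options"])
```

Calling open_reference("bnd_ocean.194907.json") would then return a lazily loaded dataset, with chunks fetched over HTTPS only when values are actually computed.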