GDEX ARCO Kerchunk

Purpose

This repository contains tools and examples for creating Kerchunk reference files for GDEX (Geoscience Data Exchange) data holdings. These reference files provide cloud-optimized access to NCAR’s GDEX data collections, so large climate datasets can be analyzed without downloading the files in full.

Key Features:

Note: Parquet output format is supported for combined references but may have limited functionality compared to JSON format.

Main Scripts

create_kerchunk.py

The primary tool for creating Kerchunk reference files from GDEX data holdings.

python src/create_kerchunk.py -h
usage: create_kerchunk.py [-h] --action <combine|sidecar> --directory <directory> [--output_directory <directory>] [--filename <output filename>] 
                          [--extensions <extension> [<extension> ...]] [--variables <variable names> [<variable names> ...]] 
                          [--cluster < PBS / single / local >] [--dry_run] [--make_remote] [--regex <regular expression>] 
                          [--output_format < json / parquet >]

Creates kerchunk sidecar files of an entire directory structure.

optional arguments:
  -h, --help            show this help message and exit
  --action <combine|sidecar>, -a <combine|sidecar>
                        Specify whether to create combined references or create sidecar files.
  --directory <directory>, -d <directory>
                        Directory to scan and create kerchunk reference files.
  --output_directory <directory>, -o <directory>
                        Directory to place output files (default: current directory)
  --filename <output filename>, -f <output filename>
                        Filename for output json.
  --extensions <extension> [<extension> ...], -e <extension> [<extension> ...]
                        Only process files of this extension
  --variables <variable names> [<variable names> ...], -v <variable names> [<variable names> ...]
                        Only gather specific variables. Variable names are case sensitive. Use the special keyword 'ALL' to separate all into individual files.
  --cluster < PBS / single / local >, -c < PBS / single / local >
                        Choose type of dask cluster to use:
                        PBS - PBSCluster (defaults to 5 workers, uses GDEX queue)
                        single - singleThreaded
                        local - localCluster (uses os.ncpus)
  --dry_run, -dr        Do a dry run of processing
  --make_remote, -mr    Additionally make a remote accessible copy of json with GDEX URLs
  --regex <regular expression>, -r <regular expression>
                        Combine references that match the specified regular expression
  --output_format < json / parquet >, -of < json / parquet >
                        Specify the output format for combined references (default: json)
                        Note: Parquet format support is experimental

Example Usage

# Create individual sidecar files for a directory
python src/create_kerchunk.py --action sidecar --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output

# Create combined reference file for NetCDF files
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json

# Create combined reference with remote access capability
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename bnd_ocean.194907.json --make_remote

# Dry run to preview processing
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json --dry_run
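
Once a combined reference exists, it can typically be opened through fsspec's reference filesystem and read lazily with xarray. The following is a minimal sketch, not part of this repository: the helper names `reference_storage_options` and `open_reference` are hypothetical, and it assumes xarray, zarr, and fsspec are installed in versions that accept the `reference://` pattern.

```python
def reference_storage_options(reference_path, remote_protocol="file"):
    """Build the fsspec options for a Kerchunk reference filesystem.

    Hypothetical helper, not provided by this repository.
    """
    return {
        # Path to the Kerchunk JSON, e.g. the file named by --filename.
        "fo": reference_path,
        # Protocol used to reach the data files the references point at:
        # "file" for local paths, "https" for a copy written with --make_remote.
        "remote_protocol": remote_protocol,
    }


def open_reference(reference_path, remote_protocol="file"):
    """Open a combined reference as a lazily loaded xarray Dataset."""
    # Imported here so the options helper above works without xarray installed.
    import xarray as xr

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            # Kerchunk references normally carry no consolidated metadata.
            "consolidated": False,
            "storage_options": reference_storage_options(
                reference_path, remote_protocol
            ),
        },
    )
```

For the combined file from the second example above, this would be called as `open_reference("./output/combined_kerchunk.json")`. Only metadata is read at open time; chunk bytes are fetched when values are actually computed.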

Additional Tools

convert_ref_file_loc.py

Converts local file paths in Kerchunk reference files to remote HTTPS or OSDF URLs for cloud access.
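
The idea behind this conversion can be sketched in a few lines. This is an illustration under stated assumptions, not the script's actual implementation: it assumes the version 1 Kerchunk JSON layout, where each chunk entry under `refs` is a list of the form `[path, offset, length]`, and the function name, variable name, file name, and `https://example.org` base URL are all hypothetical.

```python
import json


def to_remote_refs(reference, local_prefix, remote_prefix):
    """Return a copy of a Kerchunk reference dict with local paths swapped for URLs."""
    new_refs = {}
    for key, value in reference["refs"].items():
        # Chunk entries are lists ([path] or [path, offset, length]);
        # metadata and inlined data are plain strings and pass through unchanged.
        if (
            isinstance(value, list)
            and value
            and isinstance(value[0], str)
            and value[0].startswith(local_prefix)
        ):
            remote_path = remote_prefix + value[0][len(local_prefix):]
            value = [remote_path] + value[1:]
        new_refs[key] = value
    return {**reference, "refs": new_refs}


# Hypothetical reference with one metadata key and one chunk entry.
example = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "temp/0.0": [
            "/glade/campaign/collections/gdex/data/d640000/example.nc",
            1024,
            4096,
        ],
    },
}
remote = to_remote_refs(
    example,
    local_prefix="/glade/campaign/collections/gdex/data/",
    remote_prefix="https://example.org/data/",  # placeholder, not a real endpoint
)
print(json.dumps(remote["refs"]["temp/0.0"]))
```

Byte offsets and lengths are left untouched, since the remote copy is the same file; only the location changes.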

Supported remote endpoints:

create_kerchunk_grib.py

Specialized tool for creating Kerchunk reference files from GRIB format data, with support for parameter ID filtering.

separate_kerchunk.py

Utility for separating combined Kerchunk reference files into individual variable-specific reference files.
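
As a rough sketch of what such a separation involves (an assumption about the approach, not the script's actual code), the keys of a combined reference can be grouped by their leading variable name. The function name `split_by_variable` and its `shared` parameter are hypothetical, and the sketch assumes a flat Zarr layout with no nested groups.

```python
def split_by_variable(reference, shared=()):
    """Split a combined Kerchunk reference dict into one reference dict per variable.

    `shared` names variables (typically coordinates such as time) that are
    copied into every output instead of getting an output of their own.
    """
    refs = reference["refs"]
    # Root-level keys such as .zgroup and .zattrs contain no "/" and apply to
    # every output. A consolidated .zmetadata key would be stale after the
    # split, so it is left out here.
    root = {k: v for k, v in refs.items() if "/" not in k and k != ".zmetadata"}

    # Group the remaining keys ("temp/.zarray", "temp/0.0", ...) by variable.
    groups = {}
    for key, value in refs.items():
        if "/" in key:
            groups.setdefault(key.split("/", 1)[0], {})[key] = value

    # Entries that every per-variable reference should carry.
    common = dict(root)
    for name in shared:
        common.update(groups.get(name, {}))

    return {
        name: {**reference, "refs": {**common, **entries}}
        for name, entries in groups.items()
        if name not in shared
    }
```

Each returned dictionary can be written out with `json.dump` as its own reference file. Which coordinate variables to pass in `shared` depends on the dataset.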

convert_chunks.py

Tool for modifying the chunk sizes of files.

Repository Structure

├── src/                    # Main source code directory
│   ├── create_kerchunk.py     # Primary Kerchunk creation tool
│   ├── create_kerchunk_grib.py # GRIB-specific Kerchunk tool
│   ├── convert_ref_file_loc.py # Local to remote path converter
│   ├── separate_kerchunk.py   # Reference file separator
│   └── convert_chunks.py      # Chunk size modifier
├── examples/               # Usage examples and batch scripts
└── test/                   # Test scripts and validation notebooks

GDEX Integration

This repository is specifically designed to work with NCAR’s GDEX (Geoscience Data Exchange) infrastructure:

Important Notes