Tool suite for creating Kerchunk reference files and managing NetCDF chunk consistency for GDEX (Geoscience Data Exchange) data holdings at NCAR.
Overview¶
This repository provides two main CLI tools:
create_kerchunk.py - Generate Kerchunk reference files for cloud-optimized data access
align_chunks.py - Detect and fix NetCDF chunk-size inconsistencies across files
Both tools support distributed processing with Dask for large-scale data operations.
create_kerchunk.py¶
Purpose¶
Create Kerchunk reference files from GDEX NetCDF data holdings. Supports both individual sidecar files and combined aggregated references, with optional remote URL conversion for cloud access.
Usage¶
python src/create_kerchunk.py --action <combine|sidecar> --directory <directory> [OPTIONS]Options & Arguments¶
Required Arguments¶
--action {combine|sidecar}, -a {combine|sidecar}
Specify the operation mode:
- combine: Aggregate multiple files into single reference
- sidecar: Create individual reference file per data file
--directory <directory>, -d <directory>
Root directory to scan for NetCDF filesOutput Options¶
--output_directory <directory>, -o <directory>
Directory for output files (default: current directory)
--filename <filename>, -f <filename>
Output filename for combined references (default: auto-generated)
--output_format {json|parquet}, -of {json|parquet}
Output format for combined references (default: json)
Note: Parquet support is experimentalFile Selection Options¶
--extensions <ext> [ext ...], -e <ext> [ext ...]
Process only files with these extensions (e.g., nc grib)
--regex <pattern>, -r <pattern>
Include files matching this regex pattern (full path)
Example: "wrf2d_.*\.nc"
--regex_exclude <pattern>, -re <pattern>
Exclude files matching this regex pattern (full path)
Example: ".*_backup\.nc$"Variable Selection Options¶
--variables <var> [var ...], -v <var> [var ...]
Include only specific variables (space-separated)
Example: temperature precipitation humidity
--exclude_variables <var> [var ...], -ev <var> [var ...]
Exclude specific variables from combined reference
Example: XTIME XLONG XLATProcessing Options¶
--cluster {PBS|single|local}, -c {PBS|single|local}
Dask cluster type (default: local):
- PBS: PBSCluster (NCAR HPC, uses gdex queue)
- single: Single-threaded
- local: LocalCluster (multi-process, all CPUs)
--num_processes <int>, -n <int>
Number of Dask workers/jobs (default: 8)
--dry_run, -dr
Preview processing without creating files
--make_remote, -mr
Create additional remote-accessible copies with GDEX URLsUse Cases & Examples¶
Use Case 1: Create Individual Sidecar Files¶
Create one reference file per NetCDF file in a directory structure.
python src/create_kerchunk.py \
--action sidecar \
--directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 \
--output_directory ./kerchunk_refs \
--extensions ncOutput: ./kerchunk_refs/194907/file1.nc.json, ./kerchunk_refs/194907/file2.nc.json, etc.
Use Case 2: Combine All Files into Single Reference¶
Aggregate all matching files into one combined reference for easier management.
python src/create_kerchunk.py \
--action combine \
--directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 \
--output_directory ./output \
--extensions nc \
--filename bnd_ocean_combined.jsonOutput: ./output/bnd_ocean_combined.json with all files aggregated by time
Use Case 3: Filter Files by Pattern¶
Combine only files matching a specific pattern, excluding backups.
python src/create_kerchunk.py \
--action combine \
--directory /glade/campaign/collections/gdex/data/d640000 \
--output_directory ./output \
--extensions nc \
--regex "wrf2d_.*\.nc" \
--regex_exclude ".*_backup\.nc$" \
--filename wrf2d_combined.jsonOutput: Combined reference containing only wrf2d_*.nc files (excluding backups)
Use Case 4: Select Specific Variables Only¶
Reduce reference file size by including only variables of interest.
python src/create_kerchunk.py \
--action combine \
--directory /glade/campaign/collections/gdex/data/d640000 \
--output_directory ./output \
--extensions nc \
--variables T Q U V PSFC \
--filename atmosphere_only.jsonOutput: Reference containing only temperature, humidity, wind, and pressure variables
Use Case 5: Exclude Coordinate Variables¶
Create reference without spatial/temporal coordinates to reduce size.
python src/create_kerchunk.py \
--action combine \
--directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean \
--output_directory ./output \
--extensions nc \
--exclude_variables XTIME XLONG XLAT XLONG_U XLAT_U XLONG_V XLAT_V \
--filename data_only.jsonOutput: Reference with only data variables (coordinates available separately)
Use Case 6: Create Remote-Accessible References¶
Generate references with GDEX URLs for cloud-native data access.
python src/create_kerchunk.py \
--action combine \
--directory /glade/campaign/collections/gdex/data/d640000 \
--output_directory ./output \
--extensions nc \
--filename d640000.json \
--make_remoteOutput:
./output/d640000.json(local file paths)./output/d640000-remote-https.json(GDEX HTTPS URLs)
Use Case 7: Parquet Format with PBS Cluster¶
Combine files to parquet format using HPC cluster for large datasets.
python src/create_kerchunk.py \
--action combine \
--directory /glade/campaign/collections/gdex/data/d640000 \
--output_directory ./output \
--extensions nc \
--output_format parquet \
--filename d640000.parq \
--cluster PBS \
--num_processes 32Output: ./output/d640000.parq (Parquet reference file processed on 32 HPC nodes)
Use Case 8: Dry Run Preview¶
Preview what files will be processed without creating references.
python src/create_kerchunk.py \
--action combine \
--directory /glade/campaign/collections/gdex/data/d640000 \
--extensions nc \
--regex "wrf2d_.*\.nc" \
--dry_runOutput: Shows number of files and structure without processing
Command-Line Reference¶
| Argument | Short | Type | Default | Description |
|---|---|---|---|---|
| action | -a | combine|sidecar | required | Operation mode |
| directory | -d | str | required | Root directory to scan |
| output_directory | -o | str | . | Output directory |
| filename | -f | str | auto | Output filename |
| extensions | -e | str[] | [] | File extensions to process |
| regex | -r | str | None | Include pattern (full path) |
| regex_exclude | -re | str | None | Exclude pattern (full path) |
| variables | -v | str[] | [] | Variables to include |
| exclude_variables | -ev | str[] | [] | Variables to exclude |
| cluster | -c | PBS|single|local | local | Dask cluster type |
| num_processes | -n | int | 8 | Number of workers |
| output_format | -of | json|parquet | json | Output format |
| dry_run | -dr | flag | False | Preview mode |
| make_remote | -mr | flag | False | Create remote URLs |
align_chunks.py¶
Purpose¶
Detect NetCDF chunk-size inconsistencies across a directory of files and automatically rechunk them to match the majority pattern. Uses a two-stage workflow: plan (scan + identify) and execute (rechunk + rewrite).
Usage¶
# Stage 1: Plan
python src/align_chunks.py plan <top_directory> [OPTIONS]
# Stage 2: Execute
python src/align_chunks.py execute <top_directory> [OPTIONS]Options & Arguments¶
Common Arguments (Both Stages)¶
<top_directory>
Root directory containing NetCDF files to process
--log_file <path>, --log-file <path>
Log filename (default: align_chunks.log)
--dask_cluster {local|pbs}, --dask-cluster {local|pbs}
Dask cluster backend (default: pbs):
- local: LocalCluster (multiprocessing)
- pbs: PBSCluster (NCAR HPC)
--num_workers <int>, --num-workers <int>
Number of workers/jobs (default: 8)
--walltime <HH:MM:SS>
Walltime for PBS jobs (default: 24:00:00)Plan Stage Options¶
--pattern <glob>
Filename glob pattern to match (default: *.nc)
Applied to filename only (not full path)
Example: "wrf2d_*.nc"
--exclude_pattern <glob> [...], --exclude-pattern <glob> [...]
Filename patterns to exclude (default: None)
Applied as second filter after --pattern
Example: "*_backup.nc" "*_old.nc"
--exclude_variables <var> [...], --exclude-variables <var> [...]
Variable names to exclude from consistency checks (default: None)
Example: XTIME REFERENCE_TIME
--output_parquet <path>, --output-parquet <path>
Path to write rechunk plan parquet file (required)
Example: ./rechunk_plan.parqExecute Stage Options¶
--input_parquet <path>, --input-parquet <path>
Path to rechunk plan parquet from plan stage (required)
Example: ./rechunk_plan.parq
--output_directory <path>, --output-directory <path>
Output directory for rechunked files (required)
Subdirectory structure relative to top_directory is preserved
Example: ./rechunked_dataUse Cases & Examples¶
Use Case 1: Scan for Chunk Mismatches¶
Identify chunk inconsistencies without making changes.
python src/align_chunks.py plan /gdex/data/d651007 \
--output-parquet ./rechunk_plan.parqOutput:
./rechunk_plan.parq(detailed plan with all inconsistencies)align_chunks.log(human-readable report of mismatches)
Log Output Example:
MISMATCH FOUND: temperature
Unique chunk patterns: 3
Target chunks (most common): (10, 20, 30) (450 files) ← TARGET
Chunks (10, 20, 30): 450 files
Chunks (10, 20, 50): 30 files
Chunks (5, 20, 30): 20 filesUse Case 2: Plan with File Pattern Filtering¶
Scan only specific files, excluding backups and old versions.
python src/align_chunks.py plan /gdex/data/d651007 \
--pattern "wrf2d_*.nc" \
--exclude-pattern "*_backup.nc" "*_v1.nc" \
--output-parquet ./rechunk_plan.parq \
--log-file ./plan_scan.logOutput: Plan containing only wrf2d_*.nc files (excluding backups and v1)
Use Case 3: Exclude Coordinate Variables from Checks¶
Skip checking coordinates that are known to differ.
python src/align_chunks.py plan /gdex/data/d651007 \
--exclude-variables XTIME REFERENCE_TIME lon lat \
--output-parquet ./rechunk_plan.parqOutput: Plan checking only data variables, ignoring time and spatial coords
Use Case 4: Execute Rechunking¶
Apply the plan to rechunk all identified files.
python src/align_chunks.py execute /gdex/data/d651007 \
--input-parquet ./rechunk_plan.parq \
--output-directory ./d651007_rechunkedOutput: Rechunked NetCDF files in ./d651007_rechunked with directory structure preserved
Use Case 5: Execute with Local Cluster¶
Rechunk using local multiprocessing instead of PBS.
python src/align_chunks.py execute /gdex/data/d651007 \
--input-parquet ./rechunk_plan.parq \
--output-directory ./d651007_rechunked \
--dask-cluster local \
--num-workers 4Output: Rechunked files processed on 4 local workers
Use Case 6: Execute with Custom Walltime¶
Run execute stage with extended walltime on PBS cluster.
python src/align_chunks.py execute /gdex/data/d651007 \
--input-parquet ./rechunk_plan.parq \
--output-directory ./d651007_rechunked \
--dask-cluster pbs \
--num-workers 32 \
--walltime 72:00:00Output: Rechunking processed on 32 HPC nodes with 72-hour walltime
Use Case 7: Complete Workflow - Plan and Execute¶
Full pipeline from detection to rechunking with large dataset.
# Step 1: Scan for mismatches (with filtering)
python src/align_chunks.py plan /gdex/data/d651007 \
--pattern "wrf2d_*.nc" \
--exclude-pattern "*_backup.nc" \
--exclude-variables XTIME \
--output-parquet /scratch/rechunk_plan.parq \
--log-file /scratch/plan_d651007.log \
--dask-cluster pbs \
--num-workers 32
# Step 2: Review plan (check log for mismatches)
cat /scratch/plan_d651007.log
# Step 3: Execute rechunking
python src/align_chunks.py execute /gdex/data/d651007 \
--input-parquet /scratch/rechunk_plan.parq \
--output-directory /scratch/d651007_rechunked \
--log-file /scratch/execute_d651007.log \
--dask-cluster pbs \
--num-workers 32 \
--walltime 48:00:00Output:
Plan log showing all mismatches found
Rechunked NetCDF files in output directory
Execute log showing all operations performed
Use Case 8: Verify Results¶
Check rechunked files for consistency after execution.
# Re-run plan stage on output directory to verify all fixed
python src/align_chunks.py plan /scratch/d651007_rechunked \
--pattern "wrf2d_*.nc" \
--output-parquet /scratch/verify_plan.parq \
--log-file /scratch/verify.log
# Check log - should show "No chunk mismatches found!"
cat /scratch/verify.logOutput: Verification log confirming all chunks are now consistent
Command-Line Reference¶
| Argument | Type | Default | Stage | Description |
|---|---|---|---|---|
| top_directory | str | required | Both | Root directory to process |
| log_file | str | align_chunks.log | Both | Log output file |
| dask_cluster | local|pbs | pbs | Both | Cluster backend |
| num_workers | int | 8 | Both | Number of workers |
| walltime | HH:MM:SS | 24:00:00 | Both | PBS job walltime |
| pattern | str | *.nc | Plan | File glob pattern |
| exclude_pattern | str[] | None | Plan | Exclude patterns |
| exclude_variables | str[] | None | Plan | Variables to skip |
| output_parquet | str | required | Plan | Plan output file |
| input_parquet | str | required | Execute | Plan input file |
| output_directory | str | required | Execute | Rechunked output dir |
Repository Structure¶
├── src/
│ ├── create_kerchunk.py # Create Kerchunk references
│ ├── align_chunks.py # Detect & fix chunk inconsistencies
│ ├── create_kerchunk_grib.py # GRIB format support
│ ├── convert_ref_file_loc.py # Convert local → remote URLs
│ ├── separate_kerchunk.py # Split combined references
│ └── convert_chunks.py # Modify chunk sizes
├── examples/ # Usage examples & scripts
├── test/ # Tests & validation
└── README.md # This fileKey Features¶
Distributed Processing¶
PBS Cluster: Automatic job scheduling on NCAR HPC (gdex queue)
Local Cluster: Multiprocessing on single machine
Single-threaded: Debug mode
Memory Management¶
Batch processing (5000 files for create_kerchunk, 100 files for align_chunks)
Real-time memory monitoring
Intermediate progress reporting
Pattern Matching¶
create_kerchunk.py: Full-path regex matching
align_chunks.py: Filename glob patterns
Output Formats¶
JSON (default, fully supported)
Parquet (experimental for create_kerchunk)
Important Notes¶
Variable Names: Case sensitive
Parquet Format: Experimental support, use JSON for production
Time Variable: Automatically detected; manually check if detection fails
Remote URLs:
--make_remotecreates separate reference files with GDEX URLsChunk Majority Rule: align_chunks uses most common pattern as target
Directory Structure: Preserved in all operations
GDEX Integration¶
Data Source:
/glade/campaign/collections/gdex/data/Remote Protocols: HTTPS (https://
data .gdex .ucar .edu) and OSDF