Virtual aggregate CESM MOM6 datasets with kerchunk#
This notebook is adapted from the work by Lucas Sterzinger (an NCAR SIParCS intern in 2021).
Note
This notebook was updated to
discuss
inline_threshold
,add a link to the Project Pythia Cookbok on kerchunk,
add timing information to a few cells,
and add a little more discussion throughout.
What is kerchunk?#
From the docs
Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …), allowing efficient access to the data from traditional file systems or cloud object storage.
It also provides a flexible way to create virtual datasets from multiple files.
It does this by extracting the byte ranges, compression information and other information about the data and storing this metadata in a new, separate object.
This means that you can create a virtual aggregate dataset over potentially many source files, for efficient, parallel and cloud-friendly in-situ access without having to copy or translate the originals. …
For binary storage of array data, essentially all formats involve taking blocks of in-memory C buffers and encoding/compressing them to disc, with some additional metadata describing the details of that buffer plus any other attributes. This description can be applied to a very wide variety of data formats.
The primary purpose of kerchunk is to find where these binary blocks are, and how to decode them, so that blocks from one or more files can be arranged into aggregate datasets accessed via the zarr library and the power of fsspec
We use kerchunk to generate a virtual Zarr dataset that represents a collection of netCDF files:
Practically, this aggregate dataset is a JSON file stored on disk containing “references” to binary blocks stored elsewhere.
The JSON file is structured to look like a Zarr dataset.
Such a file can be interpreted as an aggregate Zarr dataset using fsspec and zarr.
kerchunk
provides utilities to generate these JSON files.
Tip
The Project Pythia Cookbook on kerchunk is a great resource!
Summary#
We’ll create a virtual aggregate Zarr dataset to represent CESM MOM6 ocean component outputs in the netCDF3 format.
Output streams for this particular simulation are:
static
file with time-invariant grid variablessfc
files with daily average surface informationh
files with monthly averages of full 3D fields at fixed depth levels
For analysis reasons, we’d like the information in the static
file to be merged with the h
Dataset and the sfc
dataset.
So we’ll merge them using kerchunk.combine.merge_vars
.
Then we generate aggregate datasets (JSON files) for the h
and sfc
datasets independently.
Note
These two datasets cannot be combined into a single Dataset without renaming the time
dimension because of the different time frequency. In general, it’s possible that the same variable name appears in different output streams, so merging is usually not a good idea.
We can use Zarr to represent both sfc
and h
in a single dataset using multiple groups.
To do so, we generate a new JSON file that represents all output streams using a Zarr group for each stream (the h
dataset forms one group, and the sfc
dataset another group).
The Zarr specification for groups is quite simple, so this turns out to be easy.
We then demo reading the aggregate Dataset in two ways:
Individual groups using
xarray.open_dataset
with thegroup
kwargAll groups at once using the datatree library.
Setup#
%load_ext watermark
from glob import glob
import dask
import fsspec
import kerchunk
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.netCDF3 import NetCDF3ToZarr
%watermark -iv
sys : 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
kerchunk: 0.1.0
json : 2.0.9
ujson : 5.7.0
fsspec : 2022.11.0
dask : 2023.1.0
xarray : 2023.2.0
I requested 8 cores for my session.
from dask.distributed import Client
client = Client(threads_per_worker=4)
client
Client
Client-ebb8a0fc-c761-11ed-ad98-3cecef1b12d4
Connection method: Cluster object | Cluster type: distributed.LocalCluster |
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/proxy/8787/status |
Cluster Info
LocalCluster
f43c56e0
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/proxy/8787/status | Workers: 2 |
Total threads: 8 | Total memory: 32.00 GiB |
Status: running | Using processes: True |
Scheduler Info
Scheduler
Scheduler-89ef705a-60fd-4cbf-858f-b06a26fe6702
Comm: tcp://127.0.0.1:32950 | Workers: 2 |
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/proxy/8787/status | Total threads: 8 |
Started: Just now | Total memory: 32.00 GiB |
Workers
Worker: 0
Comm: tcp://127.0.0.1:43851 | Total threads: 4 |
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/proxy/35514/status | Memory: 16.00 GiB |
Nanny: tcp://127.0.0.1:36575 | |
Local directory: /glade/scratch/dcherian/tmp/dask/dask-worker-space/worker-_v6quk4w |
Worker: 1
Comm: tcp://127.0.0.1:34664 | Total threads: 4 |
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/proxy/39546/status | Memory: 16.00 GiB |
Nanny: tcp://127.0.0.1:45121 | |
Local directory: /glade/scratch/dcherian/tmp/dask/dask-worker-space/worker-q_pbuh7n |
CESM MOM6 output#
There are a large number of files. Usually we use intake-esm to catalog and access the files.
The downside is that navigating the catalog can be painful, and reading from disk involves touching many files wiith xarray.open_mfdataset
.
This can take a while.
root = "/glade/campaign/cgd/oce/projects/pump/cesm/"
casename = "gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods"
There’s a lot of output here, we’ll read a subset.
from glob import glob
files = glob(f"{root}/{casename}/run/*mom6.*")
len(files)
2232
This static file (gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc
) has grid information
(staticfile,) = glob(f"{root}/{casename}/run/*static*")
print(staticfile)
/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc
Simple example: generate references for the static file#
kerchunk provides a number of “backends” or helper functions to generate the “references” for a file format.
CESM output uses netCDF3 so we’ll use NetCDF3ToZarr
. Call .translate
on the returned object to create a dictionary representation of a Zarr dataset.
Tip
Zarr is not a “file format” strictly speaking. It is a format for storing array data in things that look like a Python dictionary (formally called MutableMapping
). A hierarchy of folders/sub-folders on disk is one such “thing”.
See the Zarr docs for more.
refs_static = NetCDF3ToZarr(staticfile)
refs = refs_static.translate()
refs
{'version': 1,
'refs': {'.zgroup': '{"zarr_format":2}',
'xh/.zarray': '{"chunks":[540],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[540],"zarr_format":2}',
'xh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
23048,
4320],
'xh/.zattrs': '{"_ARRAY_DIMENSIONS":["xh"],"cartesian_axis":"X","long_name":"h point nominal longitude","units":"degrees_east"}',
'yh/.zarray': '{"chunks":[458],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[458],"zarr_format":2}',
'yh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
27368,
3664],
'yh/.zattrs': '{"_ARRAY_DIMENSIONS":["yh"],"cartesian_axis":"Y","long_name":"h point nominal latitude","units":"degrees_north"}',
'xq/.zarray': '{"chunks":[540],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[540],"zarr_format":2}',
'xq/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
31032,
4320],
'xq/.zattrs': '{"_ARRAY_DIMENSIONS":["xq"],"cartesian_axis":"X","long_name":"q point nominal longitude","units":"degrees_east"}',
'yq/.zarray': '{"chunks":[458],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[458],"zarr_format":2}',
'yq/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
35352,
3664],
'yq/.zattrs': '{"_ARRAY_DIMENSIONS":["yq"],"cartesian_axis":"Y","long_name":"q point nominal latitude","units":"degrees_north"}',
'geolon/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolon/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
39016,
989280],
'geolon/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"Longitude of tracer (T) points","units":"degrees_east"}',
'geolat/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolat/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
1028296,
989280],
'geolat/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"Latitude of tracer (T) points","units":"degrees_north"}',
'geolon_c/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolon_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
2017576,
989280],
'geolon_c/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Longitude of corner (Bu) points","units":"degrees_east"}',
'geolat_c/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolat_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
3006856,
989280],
'geolat_c/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Latitude of corner (Bu) points","units":"degrees_north"}',
'geolon_u/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolon_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
3996136,
989280],
'geolon_u/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Longitude of zonal velocity (Cu) points","units":"degrees_east"}',
'geolat_u/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolat_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
4985416,
989280],
'geolat_u/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Latitude of zonal velocity (Cu) points","units":"degrees_north"}',
'geolon_v/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolon_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
5974696,
989280],
'geolon_v/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"time: point","interp_method":"none","long_name":"Longitude of meridional velocity (Cv) points","units":"degrees_east"}',
'geolat_v/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'geolat_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
6963976,
989280],
'geolat_v/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"time: point","interp_method":"none","long_name":"Latitude of meridional velocity (Cv) points","units":"degrees_north"}',
'deptho/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'deptho/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
7953256,
989280],
'deptho/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_measures":"area: areacello","cell_methods":"area:mean yh:mean xh:mean time: point","long_name":"Sea Floor Depth","standard_name":"sea_floor_depth_below_geoid","units":"m"}',
'wet/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'wet/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
8942536,
989280],
'wet/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_measures":"area: areacello","cell_methods":"time: point","long_name":"0 if land, 1 if ocean at tracer points","units":"none"}',
'wet_c/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'wet_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
9931816,
989280],
'wet_c/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"0 if land, 1 if ocean at corner (Bu) points","units":"none"}',
'wet_u/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'wet_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
10921096,
989280],
'wet_u/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"0 if land, 1 if ocean at zonal velocity (Cu) points","units":"none"}',
'wet_v/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'wet_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
11910376,
989280],
'wet_v/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"time: point","interp_method":"none","long_name":"0 if land, 1 if ocean at meridional velocity (Cv) points","units":"none"}',
'Coriolis/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'Coriolis/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
12899656,
989280],
'Coriolis/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Coriolis parameter at corner (Bu) points","units":"s-1"}',
'areacello/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'areacello/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
13888936,
989280],
'areacello/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"area:sum yh:sum xh:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
'areacello_cu/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'areacello_cu/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
14878216,
989280],
'areacello_cu/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"area:sum yh:sum xq:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
'areacello_cv/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'areacello_cv/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
15867496,
989280],
'areacello_cv/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"area:sum yq:sum xh:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
'areacello_bu/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'areacello_bu/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
16856776,
989280],
'areacello_bu/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"area:sum yq:sum xq:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
'sin_rot/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'sin_rot/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
17846056,
989280],
'sin_rot/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"sine of the clockwise angle of the ocean grid north to true north","units":"none"}',
'cos_rot/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
'cos_rot/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
18835336,
989280],
'cos_rot/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"cosine of the clockwise angle of the ocean grid north to true north","units":"none"}',
'time/.zarray': '{"chunks":[1],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}',
'time/.zattrs': '{"_ARRAY_DIMENSIONS":["time"],"calendar":"NOLEAP","calendar_type":"NOLEAP","cartesian_axis":"T","long_name":"time","units":"days since 0001-01-01 00:00:00"}',
'time/0': '\x00\x00\x00\x00\x00\x00\x00\x00',
'.zattrs': '{"grid_tile":"N\\/A","grid_type":"regular","title":"MOM6 diagnostic fields table for CESM case: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods"}'}}
Understanding the references dictionary#
Consider the entries: xh/.zarray
and 'xh/0'
If this dataset were indeed stored on disk as a Zarr DirectoryStore
, then
there would be a subfolder named
xh
.The
xh/.zarray
file idntifiesxh
as an array.The
xh/0
file would contain allxh
values that are stored as a single chunk.
The value associated with xh/0
identifies a byte range in a file that contains the actual values.
Inlining data#
First note that the xh
variable is stored as a reference to a byte range in the static file.
refs["refs"]["xh/0"]
['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
23048,
4320]
The data for time
is stored as bytes (here 0)
refs["refs"]["time/0"]
'\x00\x00\x00\x00\x00\x00\x00\x00'
The inline_threshold
kwarg to NetCDF3ToZarr
controls whether the data is included in the JSON file. By default the value is 100 (I think the units are bytes).
We can bump it up to make sure certain variables are stored in the refereces and can be read without touching the netCDF3 files.
We see that the data is base64 encoded.
NetCDF3ToZarr(staticfile, inline_threshold=5000).translate()["refs"]["xh/0"]
'base64:wHHqqqqqqqrAceAAAAAAAMBx1VVVVVVWwHHKqqqqqqrAccAAAAAAAMBxtVVVVVVUwHGqqqqqqqzAcaAAAAAAAMBxlVVVVVVUwHGKqqqqqqzAcYAAAAAAAMBxdVVVVVVUwHFqqqqqqqzAcWAAAAAAAMBxVVVVVVVUwHFKqqqqqqzAcUAAAAAAAMBxNVVVVVVUwHEqqqqqqqzAcSAAAAAAAMBxFVVVVVVUwHEKqqqqqqzAcQAAAAAAAMBw9VVVVVVUwHDqqqqqqqzAcOAAAAAAAMBw1VVVVVVUwHDKqqqqqqzAcMAAAAAAAMBwtVVVVVVUwHCqqqqqqqzAcKAAAAAAAMBwlVVVVVVUwHCKqqqqqqzAcIAAAAAAAMBwdVVVVVVUwHBqqqqqqqzAcGAAAAAAAMBwVVVVVVVUwHBKqqqqqqzAcEAAAAAAAMBwNVVVVVVUwHAqqqqqqqzAcCAAAAAAAMBwFVVVVVVUwHAKqqqqqqzAcAAAAAAAAMBv6qqqqqqowG/VVVVVVVjAb8AAAAAAAMBvqqqqqqqowG+VVVVVVVjAb4AAAAAAAMBvaqqqqqqowG9VVVVVVVjAb0AAAAAAAMBvKqqqqqqowG8VVVVVVVjAbwAAAAAAAMBu6qqqqqqowG7VVVVVVVjAbsAAAAAAAMBuqqqqqqqowG6VVVVVVVjAboAAAAAAAMBuaqqqqqqowG5VVVVVVVjAbkAAAAAAAMBuKqqqqqqowG4VVVVVVVjAbgAAAAAAAMBt6qqqqqqowG3VVVVVVVjAbcAAAAAAAMBtqqqqqqqowG2VVVVVVVjAbYAAAAAAAMBtaqqqqqqowG1VVVVVVVjAbUAAAAAAAMBtKqqqqqqowG0VVVVVVVjAbQAAAAAAAMBs6qqqqqqowGzVVVVVVVjAbMAAAAAAAMBsqqqqqqqowGyVVVVVVVjAbIAAAAAAAMBsaqqqqqqowGxVVVVVVVjAbEAAAAAAAMBsKqqqqqqowGwVVVVVVVjAbAAAAAAAAMBr6qqqqqqowGvVVVVVVVjAa8AAAAAAAMBrqqqqqqqwwGuVVVVVVVjAa4AAAAAAAMBraqqqqqqwwGtVVVVVVVjAa0AAAAAAAMBrKqqqqqqwwGsVVVVVVVjAawAAAAAAAMBq6qqqqqqwwGrVVVVVVVjAasAAAAAAAMBqqqqqqqqwwGqVVVVVVVjAaoAAAAAAAMBqaqqqqqqwwGpVVVVVVVjAakAAAAAAAMBqKqqqqqqwwGoVVVVVVVjAagAAAAAAAMBp6qqqqqqwwGnVVVVVVVjAacAAAAAAAMBpqqqqqqqwwGmVVVVVVVjAaYAAAAAAAMBpaqqqqqqwwGlVVVVVVVjAaUAAAAAAAMBpKqqqqqqwwGkVVVVVVVjAaQAAAAAAAMBo6qqqqqqwwGjVVVVVVVjAaMAAAAAAAMBoqqqqqqqwwGiVVVVVVVjAaIAAAAAAAMBoaqqqqqqwwGhVVVVVVVjAaEAAAAAAAMBoKqqqqqqwwGgVVVVVVVjAaAAAAAAAAMBn6qqqqqqwwGfVVVVVVVjAZ8AAAAAAAMBnqqqqqqqwwGeVVVVVVVjAZ4AAAAAAAMBnaqqqqqqwwGdVVVVVVVjAZ0AAAAAAAMBnKqqqqqqwwGcVVVVVVVjAZwAAAAAAAMBm6qqqqqqwwGbVVVVVVVjAZsAAAAAAAMBmqqqqqqqwwGaVVVVVVVjAZoAAAAAAAMBmaqqqqqqwwGZVVVVVVVjAZkAAAAAAAMBmKqqqqqqwwGYVVVVVVVjAZgAAAAAAAMBl6qqqqqqwwGXVVVVVVVjAZcAAAAAAAMBlqqqqqqqwwGWVVVVVVVjAZYAAAAAAAMBlaqqqqqqwwGVVVVVVVVjAZUAAAAAAAMBlKqqqqqqwwGUVVVVVVVjAZQAAAAAAAMBk6qqqqqqwwGTVVVVVVVjAZMAAAAAAAMBkqqqqqqqwwGSVVVVVVVjAZIAAAAAAAMBkaqqqqqqwwGRVVVVVVVjAZEAAAAAAAMBkKqqqqqqwwGQVVVVVVVjAZAAAAAAAAMBj6qqqqqqwwGPVVVVVVVjAY8AAAAAAAMBjqqqqqqqwwGOVVVVVVVjAY4AAAAAAAMBjaqqqqqqwwGNVVVVVVVjAY0AAAAAAAMBjKqqqqqqwwGMVVVVVVVjAYwAAAAAAAMBi6qqqqqqwwGLVVVVVVVjAYsAAAAAAAMBiqqqqqqqwwGKVVVVVVVjAYoAAAAAAAMBiaqqqqqqwwGJVVVVVVVjAYkAAAAAAAMBiKqqqqqqwwGIVVVVVVVjAYgAAAAAAAMBh6qqqqqqwwGHVVVVVVVjAYcAAAAAAAMBhqqqqqqqwwGGVVVVVVVjAYYAAAAAAAMBhaqqqqqqwwGFVVVVVVVjAYUAAAAAAAMBhKqqqqqqwwGEVVVVVVVjAYQAAAAAAAMBg6qqqqqqwwGDVVVVVVVjAYMAAAAAAAMBgqqqqqqqwwGCVVVVVVVjAYIAAAAAAAMBgaqqqqqqwwGBVVVVVVVjAYEAAAAAAAMBgKqqqqqqwwGAVVVVVVVjAYAAAAAAAAMBf1VVVVVVgwF+qqqqqqrDAX4AAAAAAAMBfVVVVVVVgwF8qqqqqqrDAXwAAAAAAAMBe1VVVVVVgwF6qqqqqqrDAXoAAAAAAAMBeVVVVVVVgwF4qqqqqqrDAXgAAAAAAAMBd1VVVVVVgwF2qqqqqqrDAXYAAAAAAAMBdVVVVVVVgwF0qqqqqqrDAXQAAAAAAAMBc1VVVVVVgwFyqqqqqqrDAXIAAAAAAAMBcVVVVVVVgwFwqqqqqqrDAXAAAAAAAAMBb1VVVVVVgwFuqqqqqqrDAW4AAAAAAAMBbVVVVVVVgwFsqqqqqqrDAWwAAAAAAAMBa1VVVVVVgwFqqqqqqqrDAWoAAAAAAAMBaVVVVVVVgwFoqqqqqqrDAWgAAAAAAAMBZ1VVVVVVgwFmqqqqqqrDAWYAAAAAAAMBZVVVVVVVgwFkqqqqqqrDAWQAAAAAAAMBY1VVVVVVgwFiqqqqqqrDAWIAAAAAAAMBYVVVVVVVgwFgqqqqqqrDAWAAAAAAAAMBX1VVVVVVgwFeqqqqqqrDAV4AAAAAAAMBXVVVVVVVgwFcqqqqqqrDAVwAAAAAAAMBW1VVVVVVgwFaqqqqqqrDAVoAAAAAAAMBWVVVVVVVgwFYqqqqqqrDAVgAAAAAAAMBV1VVVVVVgwFWqqqqqqrDAVYAAAAAAAMBVVVVVVVVgwFUqqqqqqrDAVQAAAAAAAMBU1VVVVVVgwFSqqqqqqrDAVIAAAAAAAMBUVVVVVVVgwFQqqqqqqrDAVAAAAAAAAMBT1VVVVVVgwFOqqqqqqrDAU4AAAAAAAMBTVVVVVVVgwFMqqqqqqrDAUwAAAAAAAMBS1VVVVVVgwFKqqqqqqrDAUoAAAAAAAMBSVVVVVVVgwFIqqqqqqrDAUgAAAAAAAMBR1VVVVVVgwFGqqqqqqrDAUYAAAAAAAMBRVVVVVVVgwFEqqqqqqrDAUQAAAAAAAMBQ1VVVVVVgwFCqqqqqqrDAUIAAAAAAAMBQVVVVVVVgwFAqqqqqqrDAUAAAAAAAAMBPqqqqqqrAwE9VVVVVVWDATwAAAAAAAMBOqqqqqqrAwE5VVVVVVWDATgAAAAAAAMBNqqqqqqrAwE1VVVVVVWDATQAAAAAAAMBMqqqqqqrAwExVVVVVVWDATAAAAAAAAMBLqqqqqqrAwEtVVVVVVWDASwAAAAAAAMBKqqqqqqrAwEpVVVVVVWDASgAAAAAAAMBJqqqqqqrAwElVVVVVVWDASQAAAAAAAMBIqqqqqqrAwEhVVVVVVWDASAAAAAAAAMBHqqqqqqrAwEdVVVVVVWDARwAAAAAAAMBGqqqqqqrAwEZVVVVVVWDARgAAAAAAAMBFqqqqqqrAwEVVVVVVVWDARQAAAAAAAMBEqqqqqqrAwERVVVVVVWDARAAAAAAAAMBDqqqqqqrAwENVVVVVVWDAQwAAAAAAAMBCqqqqqqrAwEJVVVVVVWDAQgAAAAAAAMBBqqqqqqrAwEFVVVVVVWDAQQAAAAAAAMBAqqqqqqrAwEBVVVVVVWDAQAAAAAAAAMA/VVVVVVWAwD6qqqqqqsDAPgAAAAAAAMA9VVVVVVWAwDyqqqqqqsDAPAAAAAAAAMA7VVVVVVWAwDqqqqqqqsDAOgAAAAAAAMA5VVVVVVWAwDiqqqqqqsDAOAAAAAAAAMA3VVVVVVWAwDaqqqqqqsDANgAAAAAAAMA1VVVVVVWAwDSqqqqqqsDANAAAAAAAAMAzVVVVVVWAwDKqqqqqqsDAMgAAAAAAAMAxVVVVVVWAwDCqqqqqqsDAMAAAAAAAAMAuqqqqqqsAwC1VVVVVVYDALAAAAAAAAMAqqqqqqqsAwClVVVVVVYDAKAAAAAAAAMAmqqqqqqsAwCVVVVVVVYDAJAAAAAAAAMAiqqqqqqsAwCFVVVVVVYDAIAAAAAAAAMAdVVVVVVYAwBqqqqqqqwDAGAAAAAAAAMAVVVVVVVYAwBKqqqqqqwDAEAAAAAAAAMAKqqqqqqwAwAVVVVVVVgDAAAAAAAAAAL/1VVVVVVgAv+VVVVVVWAAAAAAAAAAAAD/lVVVVVVAAP/VVVVVVVABAAAAAAAAAAEAFVVVVVVQAQAqqqqqqqgBAEAAAAAAAAEASqqqqqqoAQBVVVVVVVQBAGAAAAAAAAEAaqqqqqqoAQB1VVVVVVQBAIAAAAAAAAEAhVVVVVVUAQCKqqqqqqoBAJAAAAAAAAEAlVVVVVVUAQCaqqqqqqoBAKAAAAAAAAEApVVVVVVUAQCqqqqqqqoBALAAAAAAAAEAtVVVVVVUAQC6qqqqqqoBAMAAAAAAAAEAwqqqqqqqAQDFVVVVVVUBAMgAAAAAAAEAyqqqqqqqAQDNVVVVVVUBANAAAAAAAAEA0qqqqqqqAQDVVVVVVVUBANgAAAAAAAEA2qqqqqqqAQDdVVVVVVUBAOAAAAAAAAEA4qqqqqqqAQDlVVVVVVUBAOgAAAAAAAEA6qqqqqqqAQDtVVVVVVUBAPAAAAAAAAEA8qqqqqqqAQD1VVVVVVUBAPgAAAAAAAEA+qqqqqqqAQD9VVVVVVUBAQAAAAAAAAEBAVVVVVVVAQECqqqqqqqBAQQAAAAAAAEBBVVVVVVVAQEGqqqqqqqBAQgAAAAAAAEBCVVVVVVVAQEKqqqqqqqBAQwAAAAAAAEBDVVVVVVVAQEOqqqqqqqBARAAAAAAAAEBEVVVVVVVAQESqqqqqqqBARQAAAAAAAEBFVVVVVVVAQEWqqqqqqqBARgAAAAAAAEBGVVVVVVVAQEaqqqqqqqBARwAAAAAAAEBHVVVVVVVAQEeqqqqqqqBASAAAAAAAAEBIVVVVVVVAQEiqqqqqqqBASQAAAAAAAEBJVVVVVVVAQEmqqqqqqqBASgAAAAAAAEBKVVVVVVVAQEqqqqqqqqBASwAAAAAAAEBLVVVVVVVAQEuqqqqqqqBATAAAAAAAAEBMVVVVVVVAQEyqqqqqqqBATQAAAAAAAEBNVVVVVVVAQE2qqqqqqqBATgAAAAAAAEBOVVVVVVVAQE6qqqqqqqBATwAAAAAAAEBPVVVVVVVAQE+qqqqqqqBAUAAAAAAAAEBQKqqqqqqgQFBVVVVVVVBAUIAAAAAAAEBQqqqqqqqgQFDVVVVVVVBAUQAAAAAAAEBRKqqqqqqgQFFVVVVVVVBAUYAAAAAAAEBRqqqqqqqgQFHVVVVVVVBAUgAAAAAAAEBSKqqqqqqg'
Utilities#
Make this a function for reuse later.
def gen_ref(f):
return NetCDF3ToZarr(f, inline_threshold=5000).translate()
Manipulating the references dictionary can be painful. kerchunk comes with some useful pre-processors.
Here we’ll use kerchunk.combine.drop
to drop the time
variable to avoid some problems later on.
# The static file with time-invariant variables has a useless `time` dimension.
# This messes up kerchunk's heuristics.
# kerchunk.combine.drop returns a function ...
drop_time = kerchunk.combine.drop("time")
staticdict = drop_time(gen_ref(staticfile))
staticdict["refs"].keys()
dict_keys(['.zgroup', 'xh/.zarray', 'xh/0', 'xh/.zattrs', 'yh/.zarray', 'yh/0', 'yh/.zattrs', 'xq/.zarray', 'xq/0', 'xq/.zattrs', 'yq/.zarray', 'yq/0', 'yq/.zattrs', 'geolon/.zarray', 'geolon/0.0', 'geolon/.zattrs', 'geolat/.zarray', 'geolat/0.0', 'geolat/.zattrs', 'geolon_c/.zarray', 'geolon_c/0.0', 'geolon_c/.zattrs', 'geolat_c/.zarray', 'geolat_c/0.0', 'geolat_c/.zattrs', 'geolon_u/.zarray', 'geolon_u/0.0', 'geolon_u/.zattrs', 'geolat_u/.zarray', 'geolat_u/0.0', 'geolat_u/.zattrs', 'geolon_v/.zarray', 'geolon_v/0.0', 'geolon_v/.zattrs', 'geolat_v/.zarray', 'geolat_v/0.0', 'geolat_v/.zattrs', 'deptho/.zarray', 'deptho/0.0', 'deptho/.zattrs', 'wet/.zarray', 'wet/0.0', 'wet/.zattrs', 'wet_c/.zarray', 'wet_c/0.0', 'wet_c/.zattrs', 'wet_u/.zarray', 'wet_u/0.0', 'wet_u/.zattrs', 'wet_v/.zarray', 'wet_v/0.0', 'wet_v/.zattrs', 'Coriolis/.zarray', 'Coriolis/0.0', 'Coriolis/.zattrs', 'areacello/.zarray', 'areacello/0.0', 'areacello/.zattrs', 'areacello_cu/.zarray', 'areacello_cu/0.0', 'areacello_cu/.zattrs', 'areacello_cv/.zarray', 'areacello_cv/0.0', 'areacello_cv/.zattrs', 'areacello_bu/.zarray', 'areacello_bu/0.0', 'areacello_bu/.zattrs', 'sin_rot/.zarray', 'sin_rot/0.0', 'sin_rot/.zattrs', 'cos_rot/.zarray', 'cos_rot/0.0', 'cos_rot/.zattrs', 'time/.zarray', 'time/.zattrs', 'time/0', '.zattrs'])
Generate references for the sfc
and h
datasets#
This bit generates individual JSONs for the sfc
and h
datasets:
For each
.nc
file generate references withgen_ref
Use
kerchunk.combine.MultiZarrToZarr
to consolidate to a single Zarr dataset.Merge in the static dataset references using
kerchunk.combine.merge_vars
.Write a new JSON file
def generate_json(root, casename, stream, static_refs):
"""
Generate Kerchunk references for CESM output.
"""
import copy
from pathlib import Path
import dask.bag
import ujson
# Get list of files
flist = sorted(glob(f"{root}/{casename}/run/*mom6.{stream}_*"))
# parallelize generating references using dask.bag
# Alternatively this could be dask.delayed
bag = dask.bag.from_sequence(flist, npartitions=len(flist)).map(gen_ref)
dicts = bag.compute()
# Combine multiple Zarr references (one per file) to
# a single aggregate reference file
mzz = MultiZarrToZarr(dicts, inline_threshold=5000, concat_dims="time")
# merge in the static variable references
# TODO: this deep-copy is necessary because static_refs gets modified in-place otherwise
merged = kerchunk.combine.merge_vars([copy.deepcopy(static_refs), mzz.translate()])
# create the output directory if needed
Path(f"{root}/{casename}/run/jsons/").mkdir(parents=True, exist_ok=True)
# write the JSON
with open(f"{root}/{casename}/run/jsons/{stream}.json", "wb") as f:
f.write(ujson.dumps(merged).encode())
Now we generate the JSON files in parallel with dask:
For the
sfc
stream, it takes 2s per file.For the
h
stream, it takes 200ms per file.
generate_json(root, casename, stream="sfc", static_refs=staticdict)
generate_json(root, casename, stream="h", static_refs=staticdict)
Demo: reading a dataset#
To read the dataset with Xarray, the JSON files needs to be represented as a Zarr dataset.
Use fsspec
to do this.
fs = fsspec.filesystem(
"reference", # protocol
fo=f"{root}/{casename}/run/jsons/sfc.json", # json
skip_instance_cache=True, # skip caching, this is useful when building catalogs.
)
mapper = fs.get_mapper(root="")
Mapper is a dictionary-like object. We can ask it for the .zgroup
“file” for example
mapper[".zgroup"]
b'{"zarr_format":2}'
Magic! The zarr library asks the mapper
for a ‘file’, the fsspec
library responds with data from the appropriate bytes stored in a file somewhere else.
xr.open_zarr(mapper, use_cftime=True, consolidated=False)
<xarray.Dataset> Dimensions: (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2) Coordinates: * nv (nv) float64 1.0 2.0 * time (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00 * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67 * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0 * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74 * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74 Data variables: (12/32) Coriolis (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray> SSH (time, yh, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray> SSU (time, yh, xq) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray> SSV (time, yq, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray> areacello (yh, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray> areacello_bu (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray> ... ... time_bnds (time, nv) timedelta64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray> tos (time, yh, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray> wet (yh, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray> wet_c (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray> wet_u (yh, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray> wet_v (yq, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray> Attributes: associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli... grid_tile: N/A grid_type: regular title: MOM6 diagnostic fields table for CESM case: gmom.e23.G...
(^v^) Looks like surface variables with static variables merged in.
Tip
it is a bit annoying to type 4 lines to read the dataset, but this can be hidden away in an intake catalog.
Combine datasets to single Zarr with groups#
To create the sfc
group, we read the sfc.json
file and add sfc/
to every key.
Repeat for the h
dataset, and add a top level .zgroup
entry.
Now we have dict representation of a virtual Zarr dataset! Write that to a JSON file.
def combine_stream_jsons_as_groups(streams):
ZARR_GROUP_ENTRY = {".zgroup": '{"zarr_format":2}'}
import ujson
newrefs = {}
for stream in streams:
# read in existing JSON references
with open(f"{root}/{casename}/run/jsons/{stream}.json", "rb") as f:
d = ujson.loads(f.read())
# Add a new group by renaming the keys
newrefs.update({f"{stream}/{k}": v for k, v in d["refs"].items()})
# Add top-level .zgroup entry
newrefs.update(ZARR_GROUP_ENTRY)
# This is now the combined dataset
combined = {"version": 1, "refs": newrefs}
# write a new reference JSON file
with open(f"{root}/{casename}/run/jsons/combined.json", "wb") as f:
f.write(ujson.dumps(combined).encode())
Combining them is fast
%%time
combine_stream_jsons_as_groups(streams=["sfc", "h"])
CPU times: user 411 ms, sys: 46.6 ms, total: 458 ms
Wall time: 469 ms
Reading the combined dataset#
Create the filesystem and mapper#
fs = fsspec.filesystem(
"reference",
fo=f"{root}/{casename}/run/jsons/combined.json",
skip_instance_cache=True,
)
mapper = fs.get_mapper(root="")
Simple xarray.open_dataset#
Specify the group
kwarg to extract a single group
%%time
xr.open_dataset(mapper, engine="zarr", group="sfc", use_cftime=True, consolidated=False)
CPU times: user 340 ms, sys: 9.08 ms, total: 349 ms
Wall time: 344 ms
<xarray.Dataset> Dimensions: (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2) Coordinates: * nv (nv) float64 1.0 2.0 * time (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00 * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67 * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0 * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74 * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74 Data variables: (12/32) Coriolis (yq, xq) float32 ... SSH (time, yh, xh) float32 ... SSU (time, yh, xq) float32 ... SSV (time, yq, xh) float32 ... areacello (yh, xh) float32 ... areacello_bu (yq, xq) float32 ... ... ... time_bnds (time, nv) timedelta64[ns] ... tos (time, yh, xh) float32 ... wet (yh, xh) float32 ... wet_c (yq, xq) float32 ... wet_u (yh, xq) float32 ... wet_v (yq, xh) float32 ... Attributes: associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli... grid_tile: N/A grid_type: regular title: MOM6 diagnostic fields table for CESM case: gmom.e23.G...
Using datatree#
Open all groups at one go using datatree
import datatree
%%time
tree = datatree.open_datatree(mapper, engine="zarr", use_cftime=True, consolidated=False)
tree
CPU times: user 611 ms, sys: 8.22 ms, total: 619 ms
Wall time: 612 ms
<xarray.DatasetView> Dimensions: () Data variables: *empty*
tree["h"]
<xarray.DatasetView> Dimensions: (yq: 458, xq: 540, yh: 458, xh: 540, time: 300, z_l: 34, nv: 2, z_i: 35) Coordinates: * nv (nv) float64 1.0 2.0 * time (time) object 0046-01-22 12:00:00 ... 0070-12-22 12:00:00 * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67 * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0 * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74 * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74 * z_i (z_i) float64 0.0 5.0 15.0 25.0 ... 5.25e+03 5.75e+03 6.25e+03 * z_l (z_l) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03 Data variables: (12/36) Coriolis (yq, xq) float32 ... areacello (yh, xh) float32 ... areacello_bu (yq, xq) float32 ... areacello_cu (yh, xq) float32 ... areacello_cv (yq, xh) float32 ... average_DT (time) timedelta64[ns] ... ... ... vo (time, z_l, yq, xh) float32 ... volcello (time, z_l, yh, xh) float32 ... wet (yh, xh) float32 ... wet_c (yq, xq) float32 ... wet_u (yh, xq) float32 ... wet_v (yq, xh) float32 ... Attributes: associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli... grid_tile: N/A grid_type: regular title: MOM6 diagnostic fields table for CESM case: gmom.e23.G...
tree["sfc"]
<xarray.DatasetView> Dimensions: (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2) Coordinates: * nv (nv) float64 1.0 2.0 * time (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00 * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67 * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0 * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74 * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74 Data variables: (12/32) Coriolis (yq, xq) float32 ... SSH (time, yh, xh) float32 ... SSU (time, yh, xq) float32 ... SSV (time, yq, xh) float32 ... areacello (yh, xh) float32 ... areacello_bu (yq, xq) float32 ... ... ... time_bnds (time, nv) timedelta64[ns] ... tos (time, yh, xh) float32 ... wet (yh, xh) float32 ... wet_c (yq, xq) float32 ... wet_u (yh, xq) float32 ... wet_v (yq, xh) float32 ... Attributes: associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli... grid_tile: N/A grid_type: regular title: MOM6 diagnostic fields table for CESM case: gmom.e23.G...
Next#
We could even consider adding higher-level
lnd
,atm
,ocn
groups so that single virtual dataset represents all output streams for all components from a single simulation.In
intake-esm
terminology, a single JSON file representating an aggregate dataset could be a single asset.