Virtual aggregate CESM MOM6 datasets with kerchunk#

This notebook is adapted from the work by Lucas Sterzinger (an NCAR SIParCS intern in 2021).

Note

This notebook was updated to

What is kerchunk?#

From the docs

  1. Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …), allowing efficient access to the data from traditional file systems or cloud object storage.

  2. It also provides a flexible way to create virtual datasets from multiple files.

  3. It does this by extracting the byte ranges, compression information and other information about the data and storing this metadata in a new, separate object.

  4. This means that you can create a virtual aggregate dataset over potentially many source files, for efficient, parallel and cloud-friendly in-situ access without having to copy or translate the originals. …

  5. For binary storage of array data, essentially all formats involve taking blocks of in-memory C buffers and encoding/compressing them to disc, with some additional metadata describing the details of that buffer plus any other attributes. This description can be applied to a very wide variety of data formats.

  6. The primary purpose of kerchunk is to find where these binary blocks are, and how to decode them, so that blocks from one or more files can be arranged into aggregate datasets accessed via the zarr library and the power of fsspec

We use kerchunk to generate a virtual Zarr dataset that represents a collection of netCDF files:

  • Practically, this aggregate dataset is a JSON file stored on disk containing “references” to binary blocks stored elsewhere.

  • The JSON file is structured to look like a Zarr dataset.

  • Such a file can be interpreted as an aggregate Zarr dataset using fsspec and zarr.

  • kerchunk provides utilities to generate these JSON files.

Tip

The Project Pythia Cookbook on kerchunk is a great resource!

Summary#

We’ll create a virtual aggregate Zarr dataset to represent CESM MOM6 ocean component outputs in the netCDF3 format.

Output streams for this particular simulation are:

  1. static file with time-invariant grid variables

  2. sfc files with daily average surface information

  3. h files with monthly averages of full 3D fields at fixed depth levels

For analysis reasons, we’d like the information in the static file to be merged with the h Dataset and the sfc dataset. So we’ll merge them using kerchunk.combine.merge_vars.

Then we generate aggregate datasets (JSON files) for the h and sfc datasets independently.

Note

These two datasets cannot be combined into a single Dataset without renaming the time dimension because of the different time frequency. In general, it’s possible that the same variable name appears in different output streams, so merging is usually not a good idea.

We can use Zarr to represent both sfc and h in a single dataset using multiple groups. To do so, we generate a new JSON file that represents all output streams using a Zarr group for each stream (the h dataset forms one group, and the sfc dataset another group). The Zarr specification for groups is quite simple, so this turns out to be easy.

We then demo reading the aggregate Dataset in two ways:

  1. Individual groups using xarray.open_dataset with the group kwarg

  2. All groups at once using the datatree library.

Setup#

%load_ext watermark

from glob import glob

import dask
import fsspec
import kerchunk
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.netCDF3 import NetCDF3ToZarr

%watermark -iv
sys     : 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]
kerchunk: 0.1.0
json    : 2.0.9
ujson   : 5.7.0
fsspec  : 2022.11.0
dask    : 2023.1.0
xarray  : 2023.2.0

I requested 8 cores for my session.

from dask.distributed import Client

client = Client(threads_per_worker=4)
client

Client

Client-ebb8a0fc-c761-11ed-ad98-3cecef1b12d4

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/proxy/8787/status

Cluster Info

CESM MOM6 output#

There are a large number of files. Usually we use intake-esm to catalog and access the files. The downside is that navigating the catalog can be painful, and reading from disk involves touching many files wiith xarray.open_mfdataset. This can take a while.

root = "/glade/campaign/cgd/oce/projects/pump/cesm/"
casename = "gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods"

There’s a lot of output here, we’ll read a subset.

from glob import glob

files = glob(f"{root}/{casename}/run/*mom6.*")
len(files)
2232

This static file (gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc) has grid information

(staticfile,) = glob(f"{root}/{casename}/run/*static*")
print(staticfile)
/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc

Simple example: generate references for the static file#

kerchunk provides a number of “backends” or helper functions to generate the “references” for a file format.

CESM output uses netCDF3 so we’ll use NetCDF3ToZarr. Call .translate on the returned object to create a dictionary representation of a Zarr dataset.

Tip

Zarr is not a “file format” strictly speaking. It is a format for storing array data in things that look like a Python dictionary (formally called MutableMapping). A hierarchy of folders/sub-folders on disk is one such “thing”.

See the Zarr docs for more.

refs_static = NetCDF3ToZarr(staticfile)
refs = refs_static.translate()
refs
{'version': 1,
 'refs': {'.zgroup': '{"zarr_format":2}',
  'xh/.zarray': '{"chunks":[540],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[540],"zarr_format":2}',
  'xh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   23048,
   4320],
  'xh/.zattrs': '{"_ARRAY_DIMENSIONS":["xh"],"cartesian_axis":"X","long_name":"h point nominal longitude","units":"degrees_east"}',
  'yh/.zarray': '{"chunks":[458],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[458],"zarr_format":2}',
  'yh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   27368,
   3664],
  'yh/.zattrs': '{"_ARRAY_DIMENSIONS":["yh"],"cartesian_axis":"Y","long_name":"h point nominal latitude","units":"degrees_north"}',
  'xq/.zarray': '{"chunks":[540],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[540],"zarr_format":2}',
  'xq/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   31032,
   4320],
  'xq/.zattrs': '{"_ARRAY_DIMENSIONS":["xq"],"cartesian_axis":"X","long_name":"q point nominal longitude","units":"degrees_east"}',
  'yq/.zarray': '{"chunks":[458],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[458],"zarr_format":2}',
  'yq/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   35352,
   3664],
  'yq/.zattrs': '{"_ARRAY_DIMENSIONS":["yq"],"cartesian_axis":"Y","long_name":"q point nominal latitude","units":"degrees_north"}',
  'geolon/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolon/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   39016,
   989280],
  'geolon/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"Longitude of tracer (T) points","units":"degrees_east"}',
  'geolat/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolat/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   1028296,
   989280],
  'geolat/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"Latitude of tracer (T) points","units":"degrees_north"}',
  'geolon_c/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolon_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   2017576,
   989280],
  'geolon_c/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Longitude of corner (Bu) points","units":"degrees_east"}',
  'geolat_c/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolat_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   3006856,
   989280],
  'geolat_c/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Latitude of corner (Bu) points","units":"degrees_north"}',
  'geolon_u/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolon_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   3996136,
   989280],
  'geolon_u/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Longitude of zonal velocity (Cu) points","units":"degrees_east"}',
  'geolat_u/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolat_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   4985416,
   989280],
  'geolat_u/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Latitude of zonal velocity (Cu) points","units":"degrees_north"}',
  'geolon_v/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolon_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   5974696,
   989280],
  'geolon_v/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"time: point","interp_method":"none","long_name":"Longitude of meridional velocity (Cv) points","units":"degrees_east"}',
  'geolat_v/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'geolat_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   6963976,
   989280],
  'geolat_v/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"time: point","interp_method":"none","long_name":"Latitude of meridional velocity (Cv) points","units":"degrees_north"}',
  'deptho/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'deptho/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   7953256,
   989280],
  'deptho/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_measures":"area: areacello","cell_methods":"area:mean yh:mean xh:mean time: point","long_name":"Sea Floor Depth","standard_name":"sea_floor_depth_below_geoid","units":"m"}',
  'wet/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'wet/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   8942536,
   989280],
  'wet/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_measures":"area: areacello","cell_methods":"time: point","long_name":"0 if land, 1 if ocean at tracer points","units":"none"}',
  'wet_c/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'wet_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   9931816,
   989280],
  'wet_c/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"0 if land, 1 if ocean at corner (Bu) points","units":"none"}',
  'wet_u/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'wet_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   10921096,
   989280],
  'wet_u/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"0 if land, 1 if ocean at zonal velocity (Cu) points","units":"none"}',
  'wet_v/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'wet_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   11910376,
   989280],
  'wet_v/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"time: point","interp_method":"none","long_name":"0 if land, 1 if ocean at meridional velocity (Cv) points","units":"none"}',
  'Coriolis/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'Coriolis/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   12899656,
   989280],
  'Coriolis/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"time: point","interp_method":"none","long_name":"Coriolis parameter at corner (Bu) points","units":"s-1"}',
  'areacello/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'areacello/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   13888936,
   989280],
  'areacello/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"area:sum yh:sum xh:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
  'areacello_cu/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'areacello_cu/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   14878216,
   989280],
  'areacello_cu/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xq"],"cell_methods":"area:sum yh:sum xq:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
  'areacello_cv/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'areacello_cv/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   15867496,
   989280],
  'areacello_cv/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xh"],"cell_methods":"area:sum yq:sum xh:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
  'areacello_bu/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'areacello_bu/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   16856776,
   989280],
  'areacello_bu/.zattrs': '{"_ARRAY_DIMENSIONS":["yq","xq"],"cell_methods":"area:sum yq:sum xq:sum time: point","long_name":"Ocean Grid-Cell Area","standard_name":"cell_area","units":"m2"}',
  'sin_rot/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'sin_rot/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   17846056,
   989280],
  'sin_rot/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"sine of the clockwise angle of the ocean grid north to true north","units":"none"}',
  'cos_rot/.zarray': '{"chunks":[458,540],"compressor":null,"dtype":">f4","fill_value":1.0000000200408773e+20,"filters":null,"order":"C","shape":[458,540],"zarr_format":2}',
  'cos_rot/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
   18835336,
   989280],
  'cos_rot/.zattrs': '{"_ARRAY_DIMENSIONS":["yh","xh"],"cell_methods":"time: point","long_name":"cosine of the clockwise angle of the ocean grid north to true north","units":"none"}',
  'time/.zarray': '{"chunks":[1],"compressor":null,"dtype":">f8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}',
  'time/.zattrs': '{"_ARRAY_DIMENSIONS":["time"],"calendar":"NOLEAP","calendar_type":"NOLEAP","cartesian_axis":"T","long_name":"time","units":"days since 0001-01-01 00:00:00"}',
  'time/0': '\x00\x00\x00\x00\x00\x00\x00\x00',
  '.zattrs': '{"grid_tile":"N\\/A","grid_type":"regular","title":"MOM6 diagnostic fields table for CESM case: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods"}'}}

Understanding the references dictionary#

Consider the entries: xh/.zarray and 'xh/0'

If this dataset were indeed stored on disk as a Zarr DirectoryStore, then

  • there would be a subfolder named xh.

  • The xh/.zarray file idntifies xh as an array.

  • The xh/0 file would contain all xh values that are stored as a single chunk.

The value associated with xh/0 identifies a byte range in a file that contains the actual values.

Inlining data#

First note that the xh variable is stored as a reference to a byte range in the static file.

refs["refs"]["xh/0"]
['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',
 23048,
 4320]

The data for time is stored as bytes (here 0)

refs["refs"]["time/0"]
'\x00\x00\x00\x00\x00\x00\x00\x00'

The inline_threshold kwarg to NetCDF3ToZarr controls whether the data is included in the JSON file. By default the value is 100 (I think the units are bytes).

We can bump it up to make sure certain variables are stored in the refereces and can be read without touching the netCDF3 files.

We see that the data is base64 encoded.

NetCDF3ToZarr(staticfile, inline_threshold=5000).translate()["refs"]["xh/0"]
'base64:wHHqqqqqqqrAceAAAAAAAMBx1VVVVVVWwHHKqqqqqqrAccAAAAAAAMBxtVVVVVVUwHGqqqqqqqzAcaAAAAAAAMBxlVVVVVVUwHGKqqqqqqzAcYAAAAAAAMBxdVVVVVVUwHFqqqqqqqzAcWAAAAAAAMBxVVVVVVVUwHFKqqqqqqzAcUAAAAAAAMBxNVVVVVVUwHEqqqqqqqzAcSAAAAAAAMBxFVVVVVVUwHEKqqqqqqzAcQAAAAAAAMBw9VVVVVVUwHDqqqqqqqzAcOAAAAAAAMBw1VVVVVVUwHDKqqqqqqzAcMAAAAAAAMBwtVVVVVVUwHCqqqqqqqzAcKAAAAAAAMBwlVVVVVVUwHCKqqqqqqzAcIAAAAAAAMBwdVVVVVVUwHBqqqqqqqzAcGAAAAAAAMBwVVVVVVVUwHBKqqqqqqzAcEAAAAAAAMBwNVVVVVVUwHAqqqqqqqzAcCAAAAAAAMBwFVVVVVVUwHAKqqqqqqzAcAAAAAAAAMBv6qqqqqqowG/VVVVVVVjAb8AAAAAAAMBvqqqqqqqowG+VVVVVVVjAb4AAAAAAAMBvaqqqqqqowG9VVVVVVVjAb0AAAAAAAMBvKqqqqqqowG8VVVVVVVjAbwAAAAAAAMBu6qqqqqqowG7VVVVVVVjAbsAAAAAAAMBuqqqqqqqowG6VVVVVVVjAboAAAAAAAMBuaqqqqqqowG5VVVVVVVjAbkAAAAAAAMBuKqqqqqqowG4VVVVVVVjAbgAAAAAAAMBt6qqqqqqowG3VVVVVVVjAbcAAAAAAAMBtqqqqqqqowG2VVVVVVVjAbYAAAAAAAMBtaqqqqqqowG1VVVVVVVjAbUAAAAAAAMBtKqqqqqqowG0VVVVVVVjAbQAAAAAAAMBs6qqqqqqowGzVVVVVVVjAbMAAAAAAAMBsqqqqqqqowGyVVVVVVVjAbIAAAAAAAMBsaqqqqqqowGxVVVVVVVjAbEAAAAAAAMBsKqqqqqqowGwVVVVVVVjAbAAAAAAAAMBr6qqqqqqowGvVVVVVVVjAa8AAAAAAAMBrqqqqqqqwwGuVVVVVVVjAa4AAAAAAAMBraqqqqqqwwGtVVVVVVVjAa0AAAAAAAMBrKqqqqqqwwGsVVVVVVVjAawAAAAAAAMBq6qqqqqqwwGrVVVVVVVjAasAAAAAAAMBqqqqqqqqwwGqVVVVVVVjAaoAAAAAAAMBqaqqqqqqwwGpVVVVVVVjAakAAAAAAAMBqKqqqqqqwwGoVVVVVVVjAagAAAAAAAMBp6qqqqqqwwGnVVVVVVVjAacAAAAAAAMBpqqqqqqqwwGmVVVVVVVjAaYAAAAAAAMBpaqqqqqqwwGlVVVVVVVjAaUAAAAAAAMBpKqqqqqqwwGkVVVVVVVjAaQAAAAAAAMBo6qqqqqqwwGjVVVVVVVjAaMAAAAAAAMBoqqqqqqqwwGiVVVVVVVjAaIAAAAAAAMBoaqqqqqqwwGhVVVVVVVjAaEAAAAAAAMBoKqqqqqqwwGgVVVVVVVjAaAAAAAAAAMBn6qqqqqqwwGfVVVVVVVjAZ8AAAAAAAMBnqqqqqqqwwGeVVVVVVVjAZ4AAAAAAAMBnaqqqqqqwwGdVVVVVVVjAZ0AAAAAAAMBnKqqqqqqwwGcVVVVVVVjAZwAAAAAAAMBm6qqqqqqwwGbVVVVVVVjAZsAAAAAAAMBmqqqqqqqwwGaVVVVVVVjAZoAAAAAAAMBmaqqqqqqwwGZVVVVVVVjAZkAAAAAAAMBmKqqqqqqwwGYVVVVVVVjAZgAAAAAAAMBl6qqqqqqwwGXVVVVVVVjAZcAAAAAAAMBlqqqqqqqwwGWVVVVVVVjAZYAAAAAAAMBlaqqqqqqwwGVVVVVVVVjAZUAAAAAAAMBlKqqqqqqwwGUVVVVVVVjAZQAAAAAAAMBk6qqqqqqwwGTVVVVVVVjAZMAAAAAAAMBkqqqqqqqwwGSVVVVVVVjAZIAAAAAAAMBkaqqqqqqwwGRVVVVVVVjAZEAAAAAAAMBkKqqqqqqwwGQVVVVVVVjAZAAAAAAAAMBj6qqqqqqwwGPVVVVVVVjAY8AAAAAAAMBjqqqqqqqwwGOVVVVVVVjAY4AAAAAAAMBjaqqqqqqwwGNVVVVVVVjAY0AAAAAAAMBjKqqqqqqwwGMVVVVVVVjAYwAAAAAAAMBi6qqqqqqwwGLVVVVVVVjAYsAAAAAAAMBiqqqqqqqwwGKVVVVVVVjAYoAAAAAAAMBiaqqqqqqwwGJVVVVVVVjAYkAAAAAAAMBiKqqqqqqwwGIVVVVVVVjAYgAAAAAAAMBh6qqqqqqwwGHVVVVVVVjAYcAAAAAAAMBhqqqqqqqwwGGVVVVVVVjAYYAAAAAAAMBhaqqqqqqwwGFVVVVVVVjAYUAAAAAAAMBhKqqqqqqwwGEVVVVVVVjAYQAAAAAAAMBg6qqqqqqwwGDVVVVVVVjAYMAAAAAAAMBgqqqqqqqwwGCVVVVVVVjAYIAAAAAAAMBgaqqqqqqwwGBVVVVVVVjAYEAAAAAAAMBgKqqqqqqwwGAVVVVVVVjAYAAAAAAAAMBf1VVVVVVgwF+qqqqqqrDAX4AAAAAAAMBfVVVVVVVgwF8qqqqqqrDAXwAAAAAAAMBe1VVVVVVgwF6qqqqqqrDAXoAAAAAAAMBeVVVVVVVgwF4qqqqqqrDAXgAAAAAAAMBd1VVVVVVgwF2qqqqqqrDAXYAAAAAAAMBdVVVVVVVgwF0qqqqqqrDAXQAAAAAAAMBc1VVVVVVgwFyqqqqqqrDAXIAAAAAAAMBcVVVVVVVgwFwqqqqqqrDAXAAAAAAAAMBb1VVVVVVgwFuqqqqqqrDAW4AAAAAAAMBbVVVVVVVgwFsqqqqqqrDAWwAAAAAAAMBa1VVVVVVgwFqqqqqqqrDAWoAAAAAAAMBaVVVVVVVgwFoqqqqqqrDAWgAAAAAAAMBZ1VVVVVVgwFmqqqqqqrDAWYAAAAAAAMBZVVVVVVVgwFkqqqqqqrDAWQAAAAAAAMBY1VVVVVVgwFiqqqqqqrDAWIAAAAAAAMBYVVVVVVVgwFgqqqqqqrDAWAAAAAAAAMBX1VVVVVVgwFeqqqqqqrDAV4AAAAAAAMBXVVVVVVVgwFcqqqqqqrDAVwAAAAAAAMBW1VVVVVVgwFaqqqqqqrDAVoAAAAAAAMBWVVVVVVVgwFYqqqqqqrDAVgAAAAAAAMBV1VVVVVVgwFWqqqqqqrDAVYAAAAAAAMBVVVVVVVVgwFUqqqqqqrDAVQAAAAAAAMBU1VVVVVVgwFSqqqqqqrDAVIAAAAAAAMBUVVVVVVVgwFQqqqqqqrDAVAAAAAAAAMBT1VVVVVVgwFOqqqqqqrDAU4AAAAAAAMBTVVVVVVVgwFMqqqqqqrDAUwAAAAAAAMBS1VVVVVVgwFKqqqqqqrDAUoAAAAAAAMBSVVVVVVVgwFIqqqqqqrDAUgAAAAAAAMBR1VVVVVVgwFGqqqqqqrDAUYAAAAAAAMBRVVVVVVVgwFEqqqqqqrDAUQAAAAAAAMBQ1VVVVVVgwFCqqqqqqrDAUIAAAAAAAMBQVVVVVVVgwFAqqqqqqrDAUAAAAAAAAMBPqqqqqqrAwE9VVVVVVWDATwAAAAAAAMBOqqqqqqrAwE5VVVVVVWDATgAAAAAAAMBNqqqqqqrAwE1VVVVVVWDATQAAAAAAAMBMqqqqqqrAwExVVVVVVWDATAAAAAAAAMBLqqqqqqrAwEtVVVVVVWDASwAAAAAAAMBKqqqqqqrAwEpVVVVVVWDASgAAAAAAAMBJqqqqqqrAwElVVVVVVWDASQAAAAAAAMBIqqqqqqrAwEhVVVVVVWDASAAAAAAAAMBHqqqqqqrAwEdVVVVVVWDARwAAAAAAAMBGqqqqqqrAwEZVVVVVVWDARgAAAAAAAMBFqqqqqqrAwEVVVVVVVWDARQAAAAAAAMBEqqqqqqrAwERVVVVVVWDARAAAAAAAAMBDqqqqqqrAwENVVVVVVWDAQwAAAAAAAMBCqqqqqqrAwEJVVVVVVWDAQgAAAAAAAMBBqqqqqqrAwEFVVVVVVWDAQQAAAAAAAMBAqqqqqqrAwEBVVVVVVWDAQAAAAAAAAMA/VVVVVVWAwD6qqqqqqsDAPgAAAAAAAMA9VVVVVVWAwDyqqqqqqsDAPAAAAAAAAMA7VVVVVVWAwDqqqqqqqsDAOgAAAAAAAMA5VVVVVVWAwDiqqqqqqsDAOAAAAAAAAMA3VVVVVVWAwDaqqqqqqsDANgAAAAAAAMA1VVVVVVWAwDSqqqqqqsDANAAAAAAAAMAzVVVVVVWAwDKqqqqqqsDAMgAAAAAAAMAxVVVVVVWAwDCqqqqqqsDAMAAAAAAAAMAuqqqqqqsAwC1VVVVVVYDALAAAAAAAAMAqqqqqqqsAwClVVVVVVYDAKAAAAAAAAMAmqqqqqqsAwCVVVVVVVYDAJAAAAAAAAMAiqqqqqqsAwCFVVVVVVYDAIAAAAAAAAMAdVVVVVVYAwBqqqqqqqwDAGAAAAAAAAMAVVVVVVVYAwBKqqqqqqwDAEAAAAAAAAMAKqqqqqqwAwAVVVVVVVgDAAAAAAAAAAL/1VVVVVVgAv+VVVVVVWAAAAAAAAAAAAD/lVVVVVVAAP/VVVVVVVABAAAAAAAAAAEAFVVVVVVQAQAqqqqqqqgBAEAAAAAAAAEASqqqqqqoAQBVVVVVVVQBAGAAAAAAAAEAaqqqqqqoAQB1VVVVVVQBAIAAAAAAAAEAhVVVVVVUAQCKqqqqqqoBAJAAAAAAAAEAlVVVVVVUAQCaqqqqqqoBAKAAAAAAAAEApVVVVVVUAQCqqqqqqqoBALAAAAAAAAEAtVVVVVVUAQC6qqqqqqoBAMAAAAAAAAEAwqqqqqqqAQDFVVVVVVUBAMgAAAAAAAEAyqqqqqqqAQDNVVVVVVUBANAAAAAAAAEA0qqqqqqqAQDVVVVVVVUBANgAAAAAAAEA2qqqqqqqAQDdVVVVVVUBAOAAAAAAAAEA4qqqqqqqAQDlVVVVVVUBAOgAAAAAAAEA6qqqqqqqAQDtVVVVVVUBAPAAAAAAAAEA8qqqqqqqAQD1VVVVVVUBAPgAAAAAAAEA+qqqqqqqAQD9VVVVVVUBAQAAAAAAAAEBAVVVVVVVAQECqqqqqqqBAQQAAAAAAAEBBVVVVVVVAQEGqqqqqqqBAQgAAAAAAAEBCVVVVVVVAQEKqqqqqqqBAQwAAAAAAAEBDVVVVVVVAQEOqqqqqqqBARAAAAAAAAEBEVVVVVVVAQESqqqqqqqBARQAAAAAAAEBFVVVVVVVAQEWqqqqqqqBARgAAAAAAAEBGVVVVVVVAQEaqqqqqqqBARwAAAAAAAEBHVVVVVVVAQEeqqqqqqqBASAAAAAAAAEBIVVVVVVVAQEiqqqqqqqBASQAAAAAAAEBJVVVVVVVAQEmqqqqqqqBASgAAAAAAAEBKVVVVVVVAQEqqqqqqqqBASwAAAAAAAEBLVVVVVVVAQEuqqqqqqqBATAAAAAAAAEBMVVVVVVVAQEyqqqqqqqBATQAAAAAAAEBNVVVVVVVAQE2qqqqqqqBATgAAAAAAAEBOVVVVVVVAQE6qqqqqqqBATwAAAAAAAEBPVVVVVVVAQE+qqqqqqqBAUAAAAAAAAEBQKqqqqqqgQFBVVVVVVVBAUIAAAAAAAEBQqqqqqqqgQFDVVVVVVVBAUQAAAAAAAEBRKqqqqqqgQFFVVVVVVVBAUYAAAAAAAEBRqqqqqqqgQFHVVVVVVVBAUgAAAAAAAEBSKqqqqqqg'

Utilities#

Make this a function for reuse later.

def gen_ref(f):
    return NetCDF3ToZarr(f, inline_threshold=5000).translate()

Manipulating the references dictionary can be painful. kerchunk comes with some useful pre-processors.

Here we’ll use kerchunk.combine.drop to drop the time variable to avoid some problems later on.

# The static file with time-invariant variables has a useless `time` dimension.
# This messes up kerchunk's heuristics.
# kerchunk.combine.drop returns a function ...
drop_time = kerchunk.combine.drop("time")
staticdict = drop_time(gen_ref(staticfile))
staticdict["refs"].keys()
dict_keys(['.zgroup', 'xh/.zarray', 'xh/0', 'xh/.zattrs', 'yh/.zarray', 'yh/0', 'yh/.zattrs', 'xq/.zarray', 'xq/0', 'xq/.zattrs', 'yq/.zarray', 'yq/0', 'yq/.zattrs', 'geolon/.zarray', 'geolon/0.0', 'geolon/.zattrs', 'geolat/.zarray', 'geolat/0.0', 'geolat/.zattrs', 'geolon_c/.zarray', 'geolon_c/0.0', 'geolon_c/.zattrs', 'geolat_c/.zarray', 'geolat_c/0.0', 'geolat_c/.zattrs', 'geolon_u/.zarray', 'geolon_u/0.0', 'geolon_u/.zattrs', 'geolat_u/.zarray', 'geolat_u/0.0', 'geolat_u/.zattrs', 'geolon_v/.zarray', 'geolon_v/0.0', 'geolon_v/.zattrs', 'geolat_v/.zarray', 'geolat_v/0.0', 'geolat_v/.zattrs', 'deptho/.zarray', 'deptho/0.0', 'deptho/.zattrs', 'wet/.zarray', 'wet/0.0', 'wet/.zattrs', 'wet_c/.zarray', 'wet_c/0.0', 'wet_c/.zattrs', 'wet_u/.zarray', 'wet_u/0.0', 'wet_u/.zattrs', 'wet_v/.zarray', 'wet_v/0.0', 'wet_v/.zattrs', 'Coriolis/.zarray', 'Coriolis/0.0', 'Coriolis/.zattrs', 'areacello/.zarray', 'areacello/0.0', 'areacello/.zattrs', 'areacello_cu/.zarray', 'areacello_cu/0.0', 'areacello_cu/.zattrs', 'areacello_cv/.zarray', 'areacello_cv/0.0', 'areacello_cv/.zattrs', 'areacello_bu/.zarray', 'areacello_bu/0.0', 'areacello_bu/.zattrs', 'sin_rot/.zarray', 'sin_rot/0.0', 'sin_rot/.zattrs', 'cos_rot/.zarray', 'cos_rot/0.0', 'cos_rot/.zattrs', 'time/.zarray', 'time/.zattrs', 'time/0', '.zattrs'])

Generate references for the sfc and h datasets#

This bit generates individual JSONs for the sfc and h datasets:

  1. For each .nc file generate references with gen_ref

  2. Use kerchunk.combine.MultiZarrToZarr to consolidate to a single Zarr dataset.

  3. Merge in the static dataset references using kerchunk.combine.merge_vars.

  4. Write a new JSON file

def generate_json(root, casename, stream, static_refs):
    """
    Generate Kerchunk references for CESM output.
    """

    import copy
    from pathlib import Path

    import dask.bag
    import ujson

    # Get list of files
    flist = sorted(glob(f"{root}/{casename}/run/*mom6.{stream}_*"))

    # parallelize generating references using dask.bag
    # Alternatively this could be dask.delayed
    bag = dask.bag.from_sequence(flist, npartitions=len(flist)).map(gen_ref)
    dicts = bag.compute()

    # Combine multiple Zarr references (one per file) to
    # a single aggregate reference file
    mzz = MultiZarrToZarr(dicts, inline_threshold=5000, concat_dims="time")

    # merge in the static variable references
    # TODO: this deep-copy is necessary because static_refs gets modified in-place otherwise
    merged = kerchunk.combine.merge_vars([copy.deepcopy(static_refs), mzz.translate()])

    # create the output directory if needed
    Path(f"{root}/{casename}/run/jsons/").mkdir(parents=True, exist_ok=True)

    # write the JSON
    with open(f"{root}/{casename}/run/jsons/{stream}.json", "wb") as f:
        f.write(ujson.dumps(merged).encode())

Now we generate the JSON files in parallel with dask:

  • For the sfc stream, it takes 2s per file.

  • For the h stream, it takes 200ms per file.

generate_json(root, casename, stream="sfc", static_refs=staticdict)
generate_json(root, casename, stream="h", static_refs=staticdict)

Demo: reading a dataset#

To read the dataset with Xarray, the JSON files needs to be represented as a Zarr dataset.

Use fsspec to do this.

fs = fsspec.filesystem(
    "reference",  # protocol
    fo=f"{root}/{casename}/run/jsons/sfc.json",  # json
    skip_instance_cache=True,  # skip caching, this is useful when building catalogs.
)
mapper = fs.get_mapper(root="")

Mapper is a dictionary-like object. We can ask it for the .zgroup “file” for example

mapper[".zgroup"]
b'{"zarr_format":2}'

Magic! The zarr library asks the mapper for a ‘file’, the fsspec library responds with data from the appropriate bytes stored in a file somewhere else.

xr.open_zarr(mapper, use_cftime=True, consolidated=False)
<xarray.Dataset>
Dimensions:       (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)
Coordinates:
  * nv            (nv) float64 1.0 2.0
  * time          (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00
  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67
  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0
  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74
  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74
Data variables: (12/32)
    Coriolis      (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>
    SSH           (time, yh, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>
    SSU           (time, yh, xq) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>
    SSV           (time, yq, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>
    areacello     (yh, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>
    areacello_bu  (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>
    ...            ...
    time_bnds     (time, nv) timedelta64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
    tos           (time, yh, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>
    wet           (yh, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>
    wet_c         (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>
    wet_u         (yh, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>
    wet_v         (yq, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>
Attributes:
    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...
    grid_tile:         N/A
    grid_type:         regular
    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...

(^v^) Looks like surface variables with static variables merged in.

Tip

it is a bit annoying to type 4 lines to read the dataset, but this can be hidden away in an intake catalog.

Combine datasets to single Zarr with groups#

To create the sfc group, we read the sfc.json file and add sfc/ to every key.

Repeat for the h dataset, and add a top level .zgroup entry.

Now we have dict representation of a virtual Zarr dataset! Write that to a JSON file.

def combine_stream_jsons_as_groups(streams):
    ZARR_GROUP_ENTRY = {".zgroup": '{"zarr_format":2}'}

    import ujson

    newrefs = {}
    for stream in streams:
        # read in existing JSON references
        with open(f"{root}/{casename}/run/jsons/{stream}.json", "rb") as f:
            d = ujson.loads(f.read())

        # Add a new group by renaming the keys
        newrefs.update({f"{stream}/{k}": v for k, v in d["refs"].items()})

    # Add top-level .zgroup entry
    newrefs.update(ZARR_GROUP_ENTRY)

    # This is now the combined dataset
    combined = {"version": 1, "refs": newrefs}

    # write a new reference JSON file
    with open(f"{root}/{casename}/run/jsons/combined.json", "wb") as f:
        f.write(ujson.dumps(combined).encode())

Combining them is fast

%%time

combine_stream_jsons_as_groups(streams=["sfc", "h"])
CPU times: user 411 ms, sys: 46.6 ms, total: 458 ms
Wall time: 469 ms

Reading the combined dataset#

Create the filesystem and mapper#

fs = fsspec.filesystem(
    "reference",
    fo=f"{root}/{casename}/run/jsons/combined.json",
    skip_instance_cache=True,
)
mapper = fs.get_mapper(root="")

Simple xarray.open_dataset#

Specify the group kwarg to extract a single group

%%time

xr.open_dataset(mapper, engine="zarr", group="sfc", use_cftime=True, consolidated=False)
CPU times: user 340 ms, sys: 9.08 ms, total: 349 ms
Wall time: 344 ms
<xarray.Dataset>
Dimensions:       (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)
Coordinates:
  * nv            (nv) float64 1.0 2.0
  * time          (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00
  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67
  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0
  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74
  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74
Data variables: (12/32)
    Coriolis      (yq, xq) float32 ...
    SSH           (time, yh, xh) float32 ...
    SSU           (time, yh, xq) float32 ...
    SSV           (time, yq, xh) float32 ...
    areacello     (yh, xh) float32 ...
    areacello_bu  (yq, xq) float32 ...
    ...            ...
    time_bnds     (time, nv) timedelta64[ns] ...
    tos           (time, yh, xh) float32 ...
    wet           (yh, xh) float32 ...
    wet_c         (yq, xq) float32 ...
    wet_u         (yh, xq) float32 ...
    wet_v         (yq, xh) float32 ...
Attributes:
    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...
    grid_tile:         N/A
    grid_type:         regular
    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...

Using datatree#

Open all groups at one go using datatree

import datatree
%%time

tree = datatree.open_datatree(mapper, engine="zarr", use_cftime=True, consolidated=False)
tree
CPU times: user 611 ms, sys: 8.22 ms, total: 619 ms
Wall time: 612 ms
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    *empty*
tree["h"]
<xarray.DatasetView>
Dimensions:       (yq: 458, xq: 540, yh: 458, xh: 540, time: 300, z_l: 34,
                   nv: 2, z_i: 35)
Coordinates:
  * nv            (nv) float64 1.0 2.0
  * time          (time) object 0046-01-22 12:00:00 ... 0070-12-22 12:00:00
  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67
  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0
  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74
  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74
  * z_i           (z_i) float64 0.0 5.0 15.0 25.0 ... 5.25e+03 5.75e+03 6.25e+03
  * z_l           (z_l) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03
Data variables: (12/36)
    Coriolis      (yq, xq) float32 ...
    areacello     (yh, xh) float32 ...
    areacello_bu  (yq, xq) float32 ...
    areacello_cu  (yh, xq) float32 ...
    areacello_cv  (yq, xh) float32 ...
    average_DT    (time) timedelta64[ns] ...
    ...            ...
    vo            (time, z_l, yq, xh) float32 ...
    volcello      (time, z_l, yh, xh) float32 ...
    wet           (yh, xh) float32 ...
    wet_c         (yq, xq) float32 ...
    wet_u         (yh, xq) float32 ...
    wet_v         (yq, xh) float32 ...
Attributes:
    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...
    grid_tile:         N/A
    grid_type:         regular
    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...
tree["sfc"]
<xarray.DatasetView>
Dimensions:       (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)
Coordinates:
  * nv            (nv) float64 1.0 2.0
  * time          (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00
  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67
  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0
  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74
  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74
Data variables: (12/32)
    Coriolis      (yq, xq) float32 ...
    SSH           (time, yh, xh) float32 ...
    SSU           (time, yh, xq) float32 ...
    SSV           (time, yq, xh) float32 ...
    areacello     (yh, xh) float32 ...
    areacello_bu  (yq, xq) float32 ...
    ...            ...
    time_bnds     (time, nv) timedelta64[ns] ...
    tos           (time, yh, xh) float32 ...
    wet           (yh, xh) float32 ...
    wet_c         (yq, xq) float32 ...
    wet_u         (yh, xq) float32 ...
    wet_v         (yq, xh) float32 ...
Attributes:
    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...
    grid_tile:         N/A
    grid_type:         regular
    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...

Next#

  1. We could even consider adding higher-level lnd, atm, ocn groups so that single virtual dataset represents all output streams for all components from a single simulation.

  2. In intake-esm terminology, a single JSON file representating an aggregate dataset could be a single asset.