{ "cells": [ { "cell_type": "markdown", "id": "a09bb1a6-7d6b-4f94-ae53-7cfaa13d731f", "metadata": {}, "source": [ "# Virtual aggregate CESM MOM6 datasets with kerchunk\n", "\n", "This notebook is adapted from the [work](https://github.com/lsterzinger/2022-esip-kerchunk-tutorial/blob/main/01-Create_References.ipynb) by [Lucas Sterzinger](https://lucassterzinger.com/) (an NCAR SIParCS intern in 2021).\n", "\n", "```{note}\n", "This notebook was updated to \n", "- discuss `inline_threshold`, \n", "- add a link to [the Project Pythia Cookbook on kerchunk](https://projectpythia.org/kerchunk-cookbook/README.html), \n", "- add timing information to a few cells, \n", "- and add a little more discussion throughout.\n", "```\n", "\n", "## What is kerchunk?\n", "\n", "From the [docs](https://fsspec.github.io/kerchunk/)\n", "\n", "> 1. Kerchunk is a library that provides a unified way to represent a variety of chunked, compressed data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …), allowing efficient access to the data from traditional file systems or cloud object storage. \n", "> 2. It also provides a flexible way to create virtual datasets from multiple files. \n", "> 3. It does this by extracting the byte ranges, compression information and other information about the data and storing this metadata in a new, separate object. \n", "> 4. This means that you can create a virtual aggregate dataset over potentially many source files, for efficient, parallel and cloud-friendly in-situ access without having to copy or translate the originals.\n", "> …\n", "> 5. For binary storage of array data, essentially all formats involve taking blocks of in-memory C buffers and encoding/compressing them to disc, with some additional metadata describing the details of that buffer plus any other attributes. This description can be applied to a very wide variety of data formats.\n", "> 6. 
The primary purpose of kerchunk is to find where these binary blocks are, and how to decode them, so that blocks from one or more files can be arranged into aggregate datasets accessed via the zarr library and the power of fsspec\n", "\n", "\n", "We use kerchunk to generate a virtual Zarr dataset that represents a collection of netCDF files:\n", "- Practically, this aggregate dataset is a JSON file stored on disk containing \"references\" to binary blocks stored elsewhere. \n", "- The JSON file is structured to look like a [Zarr dataset](https://zarr.readthedocs.io/en/stable/).\n", "- Such a file can be interpreted as an aggregate Zarr dataset using [fsspec](https://filesystem-spec.readthedocs.io/en/latest/?badge=latest) and zarr.\n", "- `kerchunk` provides utilities to generate these JSON files.\n", "\n", "```{tip}\n", "The [Project Pythia Cookbook on kerchunk](https://projectpythia.org/kerchunk-cookbook/README.html) is a great resource!\n", "```\n", "\n", "\n", "## Summary\n", "\n", "We'll create a virtual aggregate Zarr dataset to represent CESM MOM6 ocean component outputs in the netCDF3 format.\n", "\n", "Output streams for this particular simulation are:\n", "1. `static` file with time-invariant grid variables\n", "2. `sfc` files with daily average surface information\n", "3. `h` files with monthly averages of full 3D fields at fixed depth levels\n", "\n", "For analysis reasons, we'd like the information in the `static` file to be merged with the `h` Dataset and the `sfc` dataset.\n", "So we'll merge them using `kerchunk.combine.merge_vars`.\n", "\n", "Then we generate aggregate datasets (JSON files) for the `h` and `sfc` datasets independently. \n", "\n", "```{note}\n", "These two datasets cannot be combined into a single Dataset without renaming the `time` dimension because of the different time frequency. 
In general, it's possible that the same variable name appears in different output streams, so merging is usually not a good idea.\n", "```\n", "\n", "We can use Zarr to represent both `sfc` and `h` in a single dataset using multiple [groups](https://zarr.readthedocs.io/en/stable/spec/v2.html#groups).\n", "To do so, we generate a new JSON file that represents all output streams using a Zarr group for each stream (the `h` dataset forms one group, and the `sfc` dataset another group).\n", "The Zarr specification for [groups](https://zarr.readthedocs.io/en/stable/spec/v2.html#groups) is quite simple, so this turns out to be easy.\n", "\n", "We then demo reading the aggregate Dataset in two ways:\n", "1. Individual groups using `xarray.open_dataset` with the `group` kwarg\n", "2. All groups at once using the [datatree](https://xarray-datatree.readthedocs.io/en/latest/) library.\n" ] }, { "cell_type": "markdown", "id": "9c7398f3-6ec9-4dc0-8010-c4acaca007b7", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "e620d6b7-aebf-48db-a6b1-6d559451bbe9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sys : 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:26:04) [GCC 10.4.0]\n", "kerchunk: 0.1.0\n", "json : 2.0.9\n", "ujson : 5.7.0\n", "fsspec : 2022.11.0\n", "dask : 2023.1.0\n", "xarray : 2023.2.0\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "\n", "from glob import glob\n", "\n", "import dask\n", "import fsspec\n", "import kerchunk\n", "import ujson\n", "import xarray as xr\n", "from kerchunk.combine import MultiZarrToZarr\n", "from kerchunk.netCDF3 import NetCDF3ToZarr\n", "\n", "%watermark -iv" ] }, { "cell_type": "markdown", "id": "98434fb5-62bb-4baa-be2a-1a3e31a34a47", "metadata": {}, "source": [ "I requested 8 cores for my session." 
] }, { "cell_type": "code", "execution_count": 2, "id": "0d145d10-6818-4ed2-a06c-0c3a5d942852", "metadata": {}, "outputs": [ { "data": { "text/html": [ "Client: Client-ebb8a0fc-c761-11ed-ad98-3cecef1b12d4\n", "Connection method: Cluster object; Cluster type: distributed.LocalCluster\n", "Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/proxy/8787/status\n", "LocalCluster f43c56e0: Workers: 2; Total threads: 8; Total memory: 32.00 GiB; Status: running; Using processes: True\n", "Scheduler: Scheduler-89ef705a-60fd-4cbf-858f-b06a26fe6702; Comm: tcp://127.0.0.1:32950\n", "Worker 0 and Worker 1: Total threads: 4 and Memory: 16.00 GiB each" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dask.distributed import Client\n", "\n", "client = Client(threads_per_worker=4)\n", "client" ] }, { "cell_type": "markdown", "id": "b3ea1224-954e-4d60-bdb3-7f39e74249ba", "metadata": {}, "source": [ "## CESM MOM6 output\n", "\n", "There are many files. Usually we use [intake-esm](https://intake-esm.readthedocs.io/en/stable/) to catalog and access the files.\n", "The downside is that navigating the catalog can be painful, and reading from disk involves touching many files with `xarray.open_mfdataset`. \n", "This can take a while." ] }, { "cell_type": "code", "execution_count": 3, "id": "6f8fdd8f-cfdf-4419-bf2e-5ba52193aab3", "metadata": {}, "outputs": [], "source": [ "root = \"/glade/campaign/cgd/oce/projects/pump/cesm/\"\n", "casename = \"gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods\"" ] }, { "cell_type": "markdown", "id": "3957db44-5302-4686-bc8e-c7026e9c3518", "metadata": {}, "source": [ "There's a lot of output here, so we'll read a subset." 
] }, { "cell_type": "code", "execution_count": 4, "id": "528c9bc3-4b3c-493b-90c7-fbd6a9152a6f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2232" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from glob import glob\n", "\n", "files = glob(f\"{root}/{casename}/run/*mom6.*\")\n", "len(files)" ] }, { "cell_type": "markdown", "id": "bb5f5f45-5ee7-47ce-8c48-de64e6821df7", "metadata": {}, "source": [ "This static file (`gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc`) has grid information" ] }, { "cell_type": "code", "execution_count": 5, "id": "ff94eddc-8bc3-4357-9b80-1721386eba75", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc\n" ] } ], "source": [ "(staticfile,) = glob(f\"{root}/{casename}/run/*static*\")\n", "print(staticfile)" ] }, { "cell_type": "markdown", "id": "c8ef9032-72a5-49d7-b5c7-a59c4549c609", "metadata": {}, "source": [ "## Simple example: generate references for the static file\n", "\n", "kerchunk provides a [number of \"backends\"](https://fsspec.github.io/kerchunk/reference.html) or helper functions to generate the \"references\" for a file format.\n", "\n", "CESM output uses netCDF3 so we'll use `NetCDF3ToZarr`. Call `.translate` on the returned object to create a dictionary representation of a Zarr dataset.\n", "\n", "\n", "```{tip}\n", "Zarr is not a \"file format\" strictly speaking. It is a format for storing array data in things that look like a Python dictionary (formally called `MutableMapping`). 
A hierarchy of folders/sub-folders on disk is one such \"thing\".\n", "\n", "See the [Zarr docs](https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives) for more.\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "id": "29651be7-a638-4094-a379-9707a9bc5165", "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "data": { "text/plain": [ "{'version': 1,\n", " 'refs': {'.zgroup': '{\"zarr_format\":2}',\n", " 'xh/.zarray': '{\"chunks\":[540],\"compressor\":null,\"dtype\":\">f8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[540],\"zarr_format\":2}',\n", " 'xh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 23048,\n", " 4320],\n", " 'xh/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"xh\"],\"cartesian_axis\":\"X\",\"long_name\":\"h point nominal longitude\",\"units\":\"degrees_east\"}',\n", " 'yh/.zarray': '{\"chunks\":[458],\"compressor\":null,\"dtype\":\">f8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[458],\"zarr_format\":2}',\n", " 'yh/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 27368,\n", " 3664],\n", " 'yh/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\"],\"cartesian_axis\":\"Y\",\"long_name\":\"h point nominal latitude\",\"units\":\"degrees_north\"}',\n", " 'xq/.zarray': '{\"chunks\":[540],\"compressor\":null,\"dtype\":\">f8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[540],\"zarr_format\":2}',\n", " 'xq/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 31032,\n", " 4320],\n", " 'xq/.zattrs': 
'{\"_ARRAY_DIMENSIONS\":[\"xq\"],\"cartesian_axis\":\"X\",\"long_name\":\"q point nominal longitude\",\"units\":\"degrees_east\"}',\n", " 'yq/.zarray': '{\"chunks\":[458],\"compressor\":null,\"dtype\":\">f8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[458],\"zarr_format\":2}',\n", " 'yq/0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 35352,\n", " 3664],\n", " 'yq/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\"],\"cartesian_axis\":\"Y\",\"long_name\":\"q point nominal latitude\",\"units\":\"degrees_north\"}',\n", " 'geolon/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolon/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 39016,\n", " 989280],\n", " 'geolon/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xh\"],\"cell_methods\":\"time: point\",\"long_name\":\"Longitude of tracer (T) points\",\"units\":\"degrees_east\"}',\n", " 'geolat/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolat/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 1028296,\n", " 989280],\n", " 'geolat/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xh\"],\"cell_methods\":\"time: point\",\"long_name\":\"Latitude of tracer (T) points\",\"units\":\"degrees_north\"}',\n", " 'geolon_c/.zarray': 
'{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolon_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 2017576,\n", " 989280],\n", " 'geolon_c/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xq\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"Longitude of corner (Bu) points\",\"units\":\"degrees_east\"}',\n", " 'geolat_c/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolat_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 3006856,\n", " 989280],\n", " 'geolat_c/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xq\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"Latitude of corner (Bu) points\",\"units\":\"degrees_north\"}',\n", " 'geolon_u/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolon_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 3996136,\n", " 989280],\n", " 'geolon_u/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xq\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"Longitude of zonal velocity (Cu) points\",\"units\":\"degrees_east\"}',\n", " 'geolat_u/.zarray': 
'{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolat_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 4985416,\n", " 989280],\n", " 'geolat_u/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xq\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"Latitude of zonal velocity (Cu) points\",\"units\":\"degrees_north\"}',\n", " 'geolon_v/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolon_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 5974696,\n", " 989280],\n", " 'geolon_v/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xh\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"Longitude of meridional velocity (Cv) points\",\"units\":\"degrees_east\"}',\n", " 'geolat_v/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'geolat_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 6963976,\n", " 989280],\n", " 'geolat_v/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xh\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"Latitude of meridional velocity (Cv) points\",\"units\":\"degrees_north\"}',\n", " 'deptho/.zarray': 
'{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'deptho/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 7953256,\n", " 989280],\n", " 'deptho/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xh\"],\"cell_measures\":\"area: areacello\",\"cell_methods\":\"area:mean yh:mean xh:mean time: point\",\"long_name\":\"Sea Floor Depth\",\"standard_name\":\"sea_floor_depth_below_geoid\",\"units\":\"m\"}',\n", " 'wet/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'wet/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 8942536,\n", " 989280],\n", " 'wet/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xh\"],\"cell_measures\":\"area: areacello\",\"cell_methods\":\"time: point\",\"long_name\":\"0 if land, 1 if ocean at tracer points\",\"units\":\"none\"}',\n", " 'wet_c/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'wet_c/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 9931816,\n", " 989280],\n", " 'wet_c/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xq\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"0 if land, 1 if ocean at corner (Bu) points\",\"units\":\"none\"}',\n", " 
'wet_u/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'wet_u/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 10921096,\n", " 989280],\n", " 'wet_u/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xq\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"0 if land, 1 if ocean at zonal velocity (Cu) points\",\"units\":\"none\"}',\n", " 'wet_v/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'wet_v/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 11910376,\n", " 989280],\n", " 'wet_v/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xh\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"0 if land, 1 if ocean at meridional velocity (Cv) points\",\"units\":\"none\"}',\n", " 'Coriolis/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'Coriolis/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 12899656,\n", " 989280],\n", " 'Coriolis/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xq\"],\"cell_methods\":\"time: point\",\"interp_method\":\"none\",\"long_name\":\"Coriolis parameter at corner (Bu) points\",\"units\":\"s-1\"}',\n", " 
'areacello/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'areacello/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 13888936,\n", " 989280],\n", " 'areacello/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xh\"],\"cell_methods\":\"area:sum yh:sum xh:sum time: point\",\"long_name\":\"Ocean Grid-Cell Area\",\"standard_name\":\"cell_area\",\"units\":\"m2\"}',\n", " 'areacello_cu/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'areacello_cu/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 14878216,\n", " 989280],\n", " 'areacello_cu/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xq\"],\"cell_methods\":\"area:sum yh:sum xq:sum time: point\",\"long_name\":\"Ocean Grid-Cell Area\",\"standard_name\":\"cell_area\",\"units\":\"m2\"}',\n", " 'areacello_cv/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'areacello_cv/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 15867496,\n", " 989280],\n", " 'areacello_cv/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xh\"],\"cell_methods\":\"area:sum yq:sum xh:sum time: point\",\"long_name\":\"Ocean Grid-Cell 
Area\",\"standard_name\":\"cell_area\",\"units\":\"m2\"}',\n", " 'areacello_bu/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'areacello_bu/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 16856776,\n", " 989280],\n", " 'areacello_bu/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yq\",\"xq\"],\"cell_methods\":\"area:sum yq:sum xq:sum time: point\",\"long_name\":\"Ocean Grid-Cell Area\",\"standard_name\":\"cell_area\",\"units\":\"m2\"}',\n", " 'sin_rot/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'sin_rot/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 17846056,\n", " 989280],\n", " 'sin_rot/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xh\"],\"cell_methods\":\"time: point\",\"long_name\":\"sine of the clockwise angle of the ocean grid north to true north\",\"units\":\"none\"}',\n", " 'cos_rot/.zarray': '{\"chunks\":[458,540],\"compressor\":null,\"dtype\":\">f4\",\"fill_value\":1.0000000200408773e+20,\"filters\":null,\"order\":\"C\",\"shape\":[458,540],\"zarr_format\":2}',\n", " 'cos_rot/0.0': ['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 18835336,\n", " 989280],\n", " 'cos_rot/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"yh\",\"xh\"],\"cell_methods\":\"time: point\",\"long_name\":\"cosine of the clockwise angle of the ocean grid 
north to true north\",\"units\":\"none\"}',\n", " 'time/.zarray': '{\"chunks\":[1],\"compressor\":null,\"dtype\":\">f8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[1],\"zarr_format\":2}',\n", " 'time/.zattrs': '{\"_ARRAY_DIMENSIONS\":[\"time\"],\"calendar\":\"NOLEAP\",\"calendar_type\":\"NOLEAP\",\"cartesian_axis\":\"T\",\"long_name\":\"time\",\"units\":\"days since 0001-01-01 00:00:00\"}',\n", " 'time/0': '\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00',\n", " '.zattrs': '{\"grid_tile\":\"N\\\\/A\",\"grid_type\":\"regular\",\"title\":\"MOM6 diagnostic fields table for CESM case: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods\"}'}}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "refs_static = NetCDF3ToZarr(staticfile)\n", "refs = refs_static.translate()\n", "refs" ] }, { "cell_type": "markdown", "id": "16ae805e-70fd-4ddf-9875-7b29dd4ad05c", "metadata": {}, "source": [ "### Understanding the references dictionary\n", "Consider the entries `xh/.zarray` and `xh/0`.\n", "\n", "If this dataset were indeed stored on disk as a Zarr `DirectoryStore`, then\n", "- there would be a subfolder named `xh`,\n", "- the `xh/.zarray` file would identify `xh` as an array, and\n", "- the `xh/0` file would contain all `xh` values, stored as a single chunk.\n", "\n", "The value associated with `xh/0` is a list `[path, offset, length]` that identifies the byte range in the source file containing the actual values (here 4320 bytes, i.e. 540 values × 8 bytes each)." ] }, { "cell_type": "markdown", "id": "bd6c71ec-d969-40de-bfc3-517924df0e74", "metadata": {}, "source": [ "### Inlining data\n", "\n", "First note that the `xh` variable is stored as a reference to a byte range in the static file." 
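, "\n",
"\n",
"A reference of the form `[path, offset, length]` can be resolved by hand: open the file, seek to `offset`, read `length` bytes, and decode them using the `dtype` recorded in `.zarray`. Below is a minimal sketch using only the standard library. The helper `read_reference` and the small stand-in file are hypothetical and not part of kerchunk, and the format `>d` assumes big-endian float64 data (`>f8`).\n",
"\n",
"```python\n",
"import os\n",
"import struct\n",
"import tempfile\n",
"\n",
"\n",
"def read_reference(ref, fmt=\">d\"):\n",
"    # Hypothetical helper: read the byte range [path, offset, length] and decode it.\n",
"    path, offset, length = ref\n",
"    with open(path, \"rb\") as f:\n",
"        f.seek(offset)\n",
"        raw = f.read(length)\n",
"    # One value per struct.calcsize(fmt) bytes; \">d\" is big-endian float64.\n",
"    return [value for (value,) in struct.iter_unpack(fmt, raw)]\n",
"\n",
"\n",
"# Stand-in for a data file: a 16-byte header followed by three float64 values.\n",
"values = [-286.5, -286.0, -285.5]\n",
"with tempfile.NamedTemporaryFile(suffix=\".bin\", delete=False) as tmp:\n",
"    tmp.write(bytes(16) + struct.pack(\">3d\", *values))\n",
"\n",
"decoded = read_reference([tmp.name, 16, 24])\n",
"os.remove(tmp.name)\n",
"print(decoded)\n",
"```\n",
"\n",
"Conceptually, reading a chunk through the references amounts to the same seek-and-read, which is why the original files never need to be copied or rewritten."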
] }, { "cell_type": "code", "execution_count": 7, "id": "94db7f38-21f7-468b-bceb-777249f11474", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/campaign/cgd/oce/projects/pump/cesm//gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods/run/gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseline.kpp.lmd.004.mixpods.mom6.static.nc',\n", " 23048,\n", " 4320]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "refs[\"refs\"][\"xh/0\"]" ] }, { "cell_type": "markdown", "id": "ac37b811-1060-4e0f-a6e5-2fe8ce38392f", "metadata": {}, "source": [ "The data for `time`, in contrast, is stored inline as raw bytes (here eight zero bytes, i.e. the float64 value 0)." ] }, { "cell_type": "code", "execution_count": 8, "id": "8d1e019f-cd2c-47de-8bd3-934d0e44b877", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "refs[\"refs\"][\"time/0\"]" ] }, { "cell_type": "markdown", "id": "dc0aba83-3865-4507-a875-e998470273fb", "metadata": {}, "source": [ "The `inline_threshold` kwarg to `NetCDF3ToZarr` controls whether the data is included in the JSON file: chunks smaller than this threshold (in bytes; 100 by default) are embedded directly in the references instead of being stored as byte ranges.\n", "\n", "We can bump it up to make sure certain variables are stored in the references and can be read without touching the netCDF3 files.\n", "\n", "We see that the inlined data is base64 encoded." 
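, "\n",
"\n",
"An inline entry of the form `base64:...` can be decoded with the standard library. The sketch below uses only the first 32 characters (24 bytes) of the base64 payload shown in this notebook. The helper `decode_inline` is hypothetical, and decoding as big-endian float64 (`>d`) is an assumption based on the `>f8` dtype in `xh/.zarray`.\n",
"\n",
"```python\n",
"import base64\n",
"import struct\n",
"\n",
"# First 32 base64 characters of the inlined payload (24 bytes, three values).\n",
"entry = \"base64:wHHqqqqqqqrAceAAAAAAAMBx1VVVVVVW\"\n",
"\n",
"\n",
"def decode_inline(entry, fmt=\">d\"):\n",
"    # Hypothetical helper: strip the prefix, decode base64, unpack the values.\n",
"    raw = base64.b64decode(entry[len(\"base64:\"):])\n",
"    return [value for (value,) in struct.iter_unpack(fmt, raw)]\n",
"\n",
"\n",
"decoded = decode_inline(entry)\n",
"print(decoded)\n",
"```\n",
"\n",
"Under these assumptions the three values come out near -286.67, -286.0 and -285.33, which is plausible for a nominal longitude axis."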
] }, { "cell_type": "code", "execution_count": 9, "id": "8d52f127-008a-45f7-8b53-e908c7aab2fc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'base64:wHHqqqqqqqrAceAAAAAAAMBx1VVVVVVWwHHKqqqqqqrAccAAAAAAAMBxtVVVVVVUwHGqqqqqqqzAcaAAAAAAAMBxlVVVVVVUwHGKqqqqqqzAcYAAAAAAAMBxdVVVVVVUwHFqqqqqqqzAcWAAAAAAAMBxVVVVVVVUwHFKqqqqqqzAcUAAAAAAAMBxNVVVVVVUwHEqqqqqqqzAcSAAAAAAAMBxFVVVVVVUwHEKqqqqqqzAcQAAAAAAAMBw9VVVVVVUwHDqqqqqqqzAcOAAAAAAAMBw1VVVVVVUwHDKqqqqqqzAcMAAAAAAAMBwtVVVVVVUwHCqqqqqqqzAcKAAAAAAAMBwlVVVVVVUwHCKqqqqqqzAcIAAAAAAAMBwdVVVVVVUwHBqqqqqqqzAcGAAAAAAAMBwVVVVVVVUwHBKqqqqqqzAcEAAAAAAAMBwNVVVVVVUwHAqqqqqqqzAcCAAAAAAAMBwFVVVVVVUwHAKqqqqqqzAcAAAAAAAAMBv6qqqqqqowG/VVVVVVVjAb8AAAAAAAMBvqqqqqqqowG+VVVVVVVjAb4AAAAAAAMBvaqqqqqqowG9VVVVVVVjAb0AAAAAAAMBvKqqqqqqowG8VVVVVVVjAbwAAAAAAAMBu6qqqqqqowG7VVVVVVVjAbsAAAAAAAMBuqqqqqqqowG6VVVVVVVjAboAAAAAAAMBuaqqqqqqowG5VVVVVVVjAbkAAAAAAAMBuKqqqqqqowG4VVVVVVVjAbgAAAAAAAMBt6qqqqqqowG3VVVVVVVjAbcAAAAAAAMBtqqqqqqqowG2VVVVVVVjAbYAAAAAAAMBtaqqqqqqowG1VVVVVVVjAbUAAAAAAAMBtKqqqqqqowG0VVVVVVVjAbQAAAAAAAMBs6qqqqqqowGzVVVVVVVjAbMAAAAAAAMBsqqqqqqqowGyVVVVVVVjAbIAAAAAAAMBsaqqqqqqowGxVVVVVVVjAbEAAAAAAAMBsKqqqqqqowGwVVVVVVVjAbAAAAAAAAMBr6qqqqqqowGvVVVVVVVjAa8AAAAAAAMBrqqqqqqqwwGuVVVVVVVjAa4AAAAAAAMBraqqqqqqwwGtVVVVVVVjAa0AAAAAAAMBrKqqqqqqwwGsVVVVVVVjAawAAAAAAAMBq6qqqqqqwwGrVVVVVVVjAasAAAAAAAMBqqqqqqqqwwGqVVVVVVVjAaoAAAAAAAMBqaqqqqqqwwGpVVVVVVVjAakAAAAAAAMBqKqqqqqqwwGoVVVVVVVjAagAAAAAAAMBp6qqqqqqwwGnVVVVVVVjAacAAAAAAAMBpqqqqqqqwwGmVVVVVVVjAaYAAAAAAAMBpaqqqqqqwwGlVVVVVVVjAaUAAAAAAAMBpKqqqqqqwwGkVVVVVVVjAaQAAAAAAAMBo6qqqqqqwwGjVVVVVVVjAaMAAAAAAAMBoqqqqqqqwwGiVVVVVVVjAaIAAAAAAAMBoaqqqqqqwwGhVVVVVVVjAaEAAAAAAAMBoKqqqqqqwwGgVVVVVVVjAaAAAAAAAAMBn6qqqqqqwwGfVVVVVVVjAZ8AAAAAAAMBnqqqqqqqwwGeVVVVVVVjAZ4AAAAAAAMBnaqqqqqqwwGdVVVVVVVjAZ0AAAAAAAMBnKqqqqqqwwGcVVVVVVVjAZwAAAAAAAMBm6qqqqqqwwGbVVVVVVVjAZsAAAAAAAMBmqqqqqqqwwGaVVVVVVVjAZoAAAAAAAMBmaqqqqqqwwGZVVVVVVVjAZkAAAAAAAMBmKqqqqqqwwGYVVVVVVVjAZgAAAAAAAMBl6qqqqqqwwGXVVVVVVVjAZcAAAAAAAMBlqqqqqqqwwGWVVVVVVVjAZY
AAAAAAAMBlaqqqqqqwwGVVVVVVVVjAZUAAAAAAAMBlKqqqqqqwwGUVVVVVVVjAZQAAAAAAAMBk6qqqqqqwwGTVVVVVVVjAZMAAAAAAAMBkqqqqqqqwwGSVVVVVVVjAZIAAAAAAAMBkaqqqqqqwwGRVVVVVVVjAZEAAAAAAAMBkKqqqqqqwwGQVVVVVVVjAZAAAAAAAAMBj6qqqqqqwwGPVVVVVVVjAY8AAAAAAAMBjqqqqqqqwwGOVVVVVVVjAY4AAAAAAAMBjaqqqqqqwwGNVVVVVVVjAY0AAAAAAAMBjKqqqqqqwwGMVVVVVVVjAYwAAAAAAAMBi6qqqqqqwwGLVVVVVVVjAYsAAAAAAAMBiqqqqqqqwwGKVVVVVVVjAYoAAAAAAAMBiaqqqqqqwwGJVVVVVVVjAYkAAAAAAAMBiKqqqqqqwwGIVVVVVVVjAYgAAAAAAAMBh6qqqqqqwwGHVVVVVVVjAYcAAAAAAAMBhqqqqqqqwwGGVVVVVVVjAYYAAAAAAAMBhaqqqqqqwwGFVVVVVVVjAYUAAAAAAAMBhKqqqqqqwwGEVVVVVVVjAYQAAAAAAAMBg6qqqqqqwwGDVVVVVVVjAYMAAAAAAAMBgqqqqqqqwwGCVVVVVVVjAYIAAAAAAAMBgaqqqqqqwwGBVVVVVVVjAYEAAAAAAAMBgKqqqqqqwwGAVVVVVVVjAYAAAAAAAAMBf1VVVVVVgwF+qqqqqqrDAX4AAAAAAAMBfVVVVVVVgwF8qqqqqqrDAXwAAAAAAAMBe1VVVVVVgwF6qqqqqqrDAXoAAAAAAAMBeVVVVVVVgwF4qqqqqqrDAXgAAAAAAAMBd1VVVVVVgwF2qqqqqqrDAXYAAAAAAAMBdVVVVVVVgwF0qqqqqqrDAXQAAAAAAAMBc1VVVVVVgwFyqqqqqqrDAXIAAAAAAAMBcVVVVVVVgwFwqqqqqqrDAXAAAAAAAAMBb1VVVVVVgwFuqqqqqqrDAW4AAAAAAAMBbVVVVVVVgwFsqqqqqqrDAWwAAAAAAAMBa1VVVVVVgwFqqqqqqqrDAWoAAAAAAAMBaVVVVVVVgwFoqqqqqqrDAWgAAAAAAAMBZ1VVVVVVgwFmqqqqqqrDAWYAAAAAAAMBZVVVVVVVgwFkqqqqqqrDAWQAAAAAAAMBY1VVVVVVgwFiqqqqqqrDAWIAAAAAAAMBYVVVVVVVgwFgqqqqqqrDAWAAAAAAAAMBX1VVVVVVgwFeqqqqqqrDAV4AAAAAAAMBXVVVVVVVgwFcqqqqqqrDAVwAAAAAAAMBW1VVVVVVgwFaqqqqqqrDAVoAAAAAAAMBWVVVVVVVgwFYqqqqqqrDAVgAAAAAAAMBV1VVVVVVgwFWqqqqqqrDAVYAAAAAAAMBVVVVVVVVgwFUqqqqqqrDAVQAAAAAAAMBU1VVVVVVgwFSqqqqqqrDAVIAAAAAAAMBUVVVVVVVgwFQqqqqqqrDAVAAAAAAAAMBT1VVVVVVgwFOqqqqqqrDAU4AAAAAAAMBTVVVVVVVgwFMqqqqqqrDAUwAAAAAAAMBS1VVVVVVgwFKqqqqqqrDAUoAAAAAAAMBSVVVVVVVgwFIqqqqqqrDAUgAAAAAAAMBR1VVVVVVgwFGqqqqqqrDAUYAAAAAAAMBRVVVVVVVgwFEqqqqqqrDAUQAAAAAAAMBQ1VVVVVVgwFCqqqqqqrDAUIAAAAAAAMBQVVVVVVVgwFAqqqqqqrDAUAAAAAAAAMBPqqqqqqrAwE9VVVVVVWDATwAAAAAAAMBOqqqqqqrAwE5VVVVVVWDATgAAAAAAAMBNqqqqqqrAwE1VVVVVVWDATQAAAAAAAMBMqqqqqqrAwExVVVVVVWDATAAAAAAAAMBLqqqqqqrAwEtVVVVVVWDASwAAAAAAAMBKqqqqqqrAwEpVVVVVVWDASgAAAAAAAMBJqqqqqqrAwElVVVVVVWDASQAAAAAAAMBIqqqqqqrAwEhVVVVVVWDASAAAAAAAAMBHqqqqqq
rAwEdVVVVVVWDARwAAAAAAAMBGqqqqqqrAwEZVVVVVVWDARgAAAAAAAMBFqqqqqqrAwEVVVVVVVWDARQAAAAAAAMBEqqqqqqrAwERVVVVVVWDARAAAAAAAAMBDqqqqqqrAwENVVVVVVWDAQwAAAAAAAMBCqqqqqqrAwEJVVVVVVWDAQgAAAAAAAMBBqqqqqqrAwEFVVVVVVWDAQQAAAAAAAMBAqqqqqqrAwEBVVVVVVWDAQAAAAAAAAMA/VVVVVVWAwD6qqqqqqsDAPgAAAAAAAMA9VVVVVVWAwDyqqqqqqsDAPAAAAAAAAMA7VVVVVVWAwDqqqqqqqsDAOgAAAAAAAMA5VVVVVVWAwDiqqqqqqsDAOAAAAAAAAMA3VVVVVVWAwDaqqqqqqsDANgAAAAAAAMA1VVVVVVWAwDSqqqqqqsDANAAAAAAAAMAzVVVVVVWAwDKqqqqqqsDAMgAAAAAAAMAxVVVVVVWAwDCqqqqqqsDAMAAAAAAAAMAuqqqqqqsAwC1VVVVVVYDALAAAAAAAAMAqqqqqqqsAwClVVVVVVYDAKAAAAAAAAMAmqqqqqqsAwCVVVVVVVYDAJAAAAAAAAMAiqqqqqqsAwCFVVVVVVYDAIAAAAAAAAMAdVVVVVVYAwBqqqqqqqwDAGAAAAAAAAMAVVVVVVVYAwBKqqqqqqwDAEAAAAAAAAMAKqqqqqqwAwAVVVVVVVgDAAAAAAAAAAL/1VVVVVVgAv+VVVVVVWAAAAAAAAAAAAD/lVVVVVVAAP/VVVVVVVABAAAAAAAAAAEAFVVVVVVQAQAqqqqqqqgBAEAAAAAAAAEASqqqqqqoAQBVVVVVVVQBAGAAAAAAAAEAaqqqqqqoAQB1VVVVVVQBAIAAAAAAAAEAhVVVVVVUAQCKqqqqqqoBAJAAAAAAAAEAlVVVVVVUAQCaqqqqqqoBAKAAAAAAAAEApVVVVVVUAQCqqqqqqqoBALAAAAAAAAEAtVVVVVVUAQC6qqqqqqoBAMAAAAAAAAEAwqqqqqqqAQDFVVVVVVUBAMgAAAAAAAEAyqqqqqqqAQDNVVVVVVUBANAAAAAAAAEA0qqqqqqqAQDVVVVVVVUBANgAAAAAAAEA2qqqqqqqAQDdVVVVVVUBAOAAAAAAAAEA4qqqqqqqAQDlVVVVVVUBAOgAAAAAAAEA6qqqqqqqAQDtVVVVVVUBAPAAAAAAAAEA8qqqqqqqAQD1VVVVVVUBAPgAAAAAAAEA+qqqqqqqAQD9VVVVVVUBAQAAAAAAAAEBAVVVVVVVAQECqqqqqqqBAQQAAAAAAAEBBVVVVVVVAQEGqqqqqqqBAQgAAAAAAAEBCVVVVVVVAQEKqqqqqqqBAQwAAAAAAAEBDVVVVVVVAQEOqqqqqqqBARAAAAAAAAEBEVVVVVVVAQESqqqqqqqBARQAAAAAAAEBFVVVVVVVAQEWqqqqqqqBARgAAAAAAAEBGVVVVVVVAQEaqqqqqqqBARwAAAAAAAEBHVVVVVVVAQEeqqqqqqqBASAAAAAAAAEBIVVVVVVVAQEiqqqqqqqBASQAAAAAAAEBJVVVVVVVAQEmqqqqqqqBASgAAAAAAAEBKVVVVVVVAQEqqqqqqqqBASwAAAAAAAEBLVVVVVVVAQEuqqqqqqqBATAAAAAAAAEBMVVVVVVVAQEyqqqqqqqBATQAAAAAAAEBNVVVVVVVAQE2qqqqqqqBATgAAAAAAAEBOVVVVVVVAQE6qqqqqqqBATwAAAAAAAEBPVVVVVVVAQE+qqqqqqqBAUAAAAAAAAEBQKqqqqqqgQFBVVVVVVVBAUIAAAAAAAEBQqqqqqqqgQFDVVVVVVVBAUQAAAAAAAEBRKqqqqqqgQFFVVVVVVVBAUYAAAAAAAEBRqqqqqqqgQFHVVVVVVVBAUgAAAAAAAEBSKqqqqqqg'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" 
} ], "source": [ "NetCDF3ToZarr(staticfile, inline_threshold=5000).translate()[\"refs\"][\"xh/0\"]" ] }, { "cell_type": "markdown", "id": "9e06196b-1e6e-4ace-98eb-442c63f3a300", "metadata": {}, "source": [ "### Utilities" ] }, { "cell_type": "markdown", "id": "5264394d-a633-41a8-ba83-a8a77e7f87c9", "metadata": {}, "source": [ "Make this a function for reuse later." ] }, { "cell_type": "code", "execution_count": 10, "id": "c1825836-722c-426a-b38e-163fdec4c7f9", "metadata": {}, "outputs": [], "source": [ "def gen_ref(f):\n", " return NetCDF3ToZarr(f, inline_threshold=5000).translate()" ] }, { "cell_type": "markdown", "id": "4f3c2098-2edd-4686-ba42-7470802eae05", "metadata": {}, "source": [ "Manipulating the references dictionary can be painful. kerchunk comes with some useful pre-processors.\n", "\n", "Here we'll use `kerchunk.combine.drop` to drop the `time` variable to avoid some problems later on." ] }, { "cell_type": "code", "execution_count": 11, "id": "1413f10a-d139-42d0-a8ae-e0ed466730a0", "metadata": {}, "outputs": [], "source": [ "# The static file with time-invariant variables has a useless `time` dimension.\n", "# This messes up kerchunk's heuristics.\n", "# kerchunk.combine.drop returns a function ...\n", "drop_time = kerchunk.combine.drop(\"time\")\n", "staticdict = drop_time(gen_ref(staticfile))" ] }, { "cell_type": "code", "execution_count": 12, "id": "35c94aab-1888-457b-9e18-0b280d9c059a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['.zgroup', 'xh/.zarray', 'xh/0', 'xh/.zattrs', 'yh/.zarray', 'yh/0', 'yh/.zattrs', 'xq/.zarray', 'xq/0', 'xq/.zattrs', 'yq/.zarray', 'yq/0', 'yq/.zattrs', 'geolon/.zarray', 'geolon/0.0', 'geolon/.zattrs', 'geolat/.zarray', 'geolat/0.0', 'geolat/.zattrs', 'geolon_c/.zarray', 'geolon_c/0.0', 'geolon_c/.zattrs', 'geolat_c/.zarray', 'geolat_c/0.0', 'geolat_c/.zattrs', 'geolon_u/.zarray', 'geolon_u/0.0', 'geolon_u/.zattrs', 'geolat_u/.zarray', 'geolat_u/0.0', 'geolat_u/.zattrs', 'geolon_v/.zarray', 
'geolon_v/0.0', 'geolon_v/.zattrs', 'geolat_v/.zarray', 'geolat_v/0.0', 'geolat_v/.zattrs', 'deptho/.zarray', 'deptho/0.0', 'deptho/.zattrs', 'wet/.zarray', 'wet/0.0', 'wet/.zattrs', 'wet_c/.zarray', 'wet_c/0.0', 'wet_c/.zattrs', 'wet_u/.zarray', 'wet_u/0.0', 'wet_u/.zattrs', 'wet_v/.zarray', 'wet_v/0.0', 'wet_v/.zattrs', 'Coriolis/.zarray', 'Coriolis/0.0', 'Coriolis/.zattrs', 'areacello/.zarray', 'areacello/0.0', 'areacello/.zattrs', 'areacello_cu/.zarray', 'areacello_cu/0.0', 'areacello_cu/.zattrs', 'areacello_cv/.zarray', 'areacello_cv/0.0', 'areacello_cv/.zattrs', 'areacello_bu/.zarray', 'areacello_bu/0.0', 'areacello_bu/.zattrs', 'sin_rot/.zarray', 'sin_rot/0.0', 'sin_rot/.zattrs', 'cos_rot/.zarray', 'cos_rot/0.0', 'cos_rot/.zattrs', 'time/.zarray', 'time/.zattrs', 'time/0', '.zattrs'])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "staticdict[\"refs\"].keys()" ] }, { "cell_type": "markdown", "id": "edee72c9-7d26-4518-9bb5-4c4b1fc8cff7", "metadata": {}, "source": [ "## Generate references for the `sfc` and `h` datasets\n", "\n", "This bit generates individual JSONs for the `sfc` and `h` datasets:\n", "1. For each `.nc` file generate references with `gen_ref`\n", "2. Use `kerchunk.combine.MultiZarrToZarr` to consolidate to a single Zarr dataset.\n", "3. Merge in the static dataset references using `kerchunk.combine.merge_vars`.\n", "4. 
Write a new JSON file" ] }, { "cell_type": "code", "execution_count": 13, "id": "1f3f4faf-d929-442c-8411-866ea32b87b5", "metadata": {}, "outputs": [], "source": [ "def generate_json(root, casename, stream, static_refs):\n", "    \"\"\"\n", "    Generate Kerchunk references for CESM output.\n", "    \"\"\"\n", "\n", "    import copy\n", "    from pathlib import Path\n", "\n", "    import dask.bag\n", "    import ujson\n", "\n", "    # Get list of files\n", "    flist = sorted(glob(f\"{root}/{casename}/run/*mom6.{stream}_*\"))\n", "\n", "    # parallelize generating references using dask.bag\n", "    # Alternatively this could be dask.delayed\n", "    bag = dask.bag.from_sequence(flist, npartitions=len(flist)).map(gen_ref)\n", "    dicts = bag.compute()\n", "\n", "    # Combine multiple Zarr references (one per file) to\n", "    # a single aggregate reference file\n", "    mzz = MultiZarrToZarr(dicts, inline_threshold=5000, concat_dims=\"time\")\n", "\n", "    # merge in the static variable references\n", "    # TODO: this deep-copy is necessary because static_refs gets modified in-place otherwise\n", "    merged = kerchunk.combine.merge_vars([copy.deepcopy(static_refs), mzz.translate()])\n", "\n", "    # create the output directory if needed\n", "    Path(f\"{root}/{casename}/run/jsons/\").mkdir(parents=True, exist_ok=True)\n", "\n", "    # write the JSON\n", "    with open(f\"{root}/{casename}/run/jsons/{stream}.json\", \"wb\") as f:\n", "        f.write(ujson.dumps(merged).encode())" ] }, { "cell_type": "markdown", "id": "59d3616b-18ca-4432-8f3c-8592b488d878", "metadata": {}, "source": [ "Now we generate the JSON files in parallel with dask:\n", "- For the `sfc` stream, it takes 2s per file.\n", "- For the `h` stream, it takes 200ms per file." 
] }, { "cell_type": "code", "execution_count": 14, "id": "5bb97678-d2f5-4061-88d9-b97bd845eb93", "metadata": {}, "outputs": [], "source": [ "generate_json(root, casename, stream=\"sfc\", static_refs=staticdict)" ] }, { "cell_type": "code", "execution_count": 15, "id": "8b21ee90-9809-4be1-9fac-fba7500e693e", "metadata": {}, "outputs": [], "source": [ "generate_json(root, casename, stream=\"h\", static_refs=staticdict)" ] }, { "cell_type": "markdown", "id": "2e68c26f-eaa5-4aa1-b065-10a529dc6e23", "metadata": {}, "source": [ "## Demo: reading a dataset" ] }, { "cell_type": "markdown", "id": "6aa9ece7-2b2f-4612-ac5a-03d2088868b5", "metadata": {}, "source": [ "To read the dataset with Xarray, the JSON file needs to be represented as a Zarr dataset.\n", "\n", "Use `fsspec` to do this." ] }, { "cell_type": "code", "execution_count": 16, "id": "77cb3daf-05b0-4bc4-a3bd-e77b4b35fc03", "metadata": {}, "outputs": [], "source": [ "fs = fsspec.filesystem(\n", "    \"reference\",  # protocol\n", "    fo=f\"{root}/{casename}/run/jsons/sfc.json\",  # path to the reference JSON\n", "    skip_instance_cache=True,  # skip caching, this is useful when building catalogs.\n", ")\n", "mapper = fs.get_mapper(root=\"\")" ] }, { "cell_type": "markdown", "id": "f461e189-e2f6-4792-8029-d26a38138054", "metadata": {}, "source": [ "The `mapper` is a dictionary-like object. For example, we can ask it for the `.zgroup` \"file\":" ] }, { "cell_type": "code", "execution_count": 17, "id": "cc3025fb-c936-46eb-9877-5a8e38d4de0e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'{\"zarr_format\":2}'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mapper[\".zgroup\"]" ] }, { "cell_type": "markdown", "id": "cce53d00-e1ca-43db-b5be-b1666e0e0c90", "metadata": {}, "source": [ "Magic! The zarr library asks the `mapper` for a \"file\", and the `fsspec` library responds with data from the appropriate bytes stored in a file somewhere else." 
] }, { "cell_type": "code", "execution_count": 18, "id": "042deea7-4d3f-4d42-ba9a-0ff432940323", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset>\n",
       "Dimensions:       (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)\n",
       "Coordinates:\n",
       "  * nv            (nv) float64 1.0 2.0\n",
       "  * time          (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00\n",
       "  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n",
       "  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n",
       "  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n",
       "  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n",
       "Data variables: (12/32)\n",
       "    Coriolis      (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>\n",
       "    SSH           (time, yh, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>\n",
       "    SSU           (time, yh, xq) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>\n",
       "    SSV           (time, yq, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>\n",
       "    areacello     (yh, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>\n",
       "    areacello_bu  (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>\n",
       "    ...            ...\n",
       "    time_bnds     (time, nv) timedelta64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>\n",
       "    tos           (time, yh, xh) float32 dask.array<chunksize=(1, 458, 540), meta=np.ndarray>\n",
       "    wet           (yh, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>\n",
       "    wet_c         (yq, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>\n",
       "    wet_u         (yh, xq) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>\n",
       "    wet_v         (yq, xh) float32 dask.array<chunksize=(458, 540), meta=np.ndarray>\n",
       "Attributes:\n",
       "    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n",
       "    grid_tile:         N/A\n",
       "    grid_type:         regular\n",
       "    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...
" ], "text/plain": [ "\n", "Dimensions: (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)\n", "Coordinates:\n", " * nv (nv) float64 1.0 2.0\n", " * time (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00\n", " * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n", " * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n", " * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n", " * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n", "Data variables: (12/32)\n", " Coriolis (yq, xq) float32 dask.array\n", " SSH (time, yh, xh) float32 dask.array\n", " SSU (time, yh, xq) float32 dask.array\n", " SSV (time, yq, xh) float32 dask.array\n", " areacello (yh, xh) float32 dask.array\n", " areacello_bu (yq, xq) float32 dask.array\n", " ... ...\n", " time_bnds (time, nv) timedelta64[ns] dask.array\n", " tos (time, yh, xh) float32 dask.array\n", " wet (yh, xh) float32 dask.array\n", " wet_c (yq, xq) float32 dask.array\n", " wet_u (yh, xq) float32 dask.array\n", " wet_v (yq, xh) float32 dask.array\n", "Attributes:\n", " associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n", " grid_tile: N/A\n", " grid_type: regular\n", " title: MOM6 diagnostic fields table for CESM case: gmom.e23.G..." 
] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xr.open_zarr(mapper, use_cftime=True, consolidated=False)" ] }, { "cell_type": "markdown", "id": "02b4789c-1f82-4100-9101-ee4fe033810e", "metadata": {}, "source": [ "(^v^) Looks like the surface variables with the static variables merged in.\n", "\n", "\n", "```{tip}\n", "\n", "It is a bit annoying to type 4 lines to read the dataset, but this can be hidden away in an intake catalog.\n", "```" ] }, { "cell_type": "markdown", "id": "3f8dd3e0-c4bc-4f02-9e63-7c4639c20405", "metadata": {}, "source": [ "## Combine datasets into a single Zarr with groups\n", "\n", "To create the `sfc` group, we read the `sfc.json` file and add `sfc/` to every key.\n", "\n", "Repeat for the `h` dataset, and add a top-level `.zgroup` entry.\n", "\n", "Now we have a dict representation of a virtual Zarr dataset! Write that to a JSON file." ] }, { "cell_type": "code", "execution_count": 19, "id": "da04fb56-c6c1-481e-9264-bbe060605511", "metadata": {}, "outputs": [], "source": [ "def combine_stream_jsons_as_groups(streams):\n", "    ZARR_GROUP_ENTRY = {\".zgroup\": '{\"zarr_format\":2}'}\n", "\n", "    import ujson\n", "\n", "    newrefs = {}\n", "    for stream in streams:\n", "        # read in existing JSON references\n", "        with open(f\"{root}/{casename}/run/jsons/{stream}.json\", \"rb\") as f:\n", "            d = ujson.loads(f.read())\n", "\n", "        # Add a new group by renaming the keys\n", "        newrefs.update({f\"{stream}/{k}\": v for k, v in d[\"refs\"].items()})\n", "\n", "    # Add top-level .zgroup entry\n", "    newrefs.update(ZARR_GROUP_ENTRY)\n", "\n", "    # This is now the combined dataset\n", "    combined = {\"version\": 1, \"refs\": newrefs}\n", "\n", "    # write a new reference JSON file\n", "    with open(f\"{root}/{casename}/run/jsons/combined.json\", \"wb\") as f:\n", "        f.write(ujson.dumps(combined).encode())" ] }, { "cell_type": "markdown", "id": "7790be66-8377-4c5b-a38c-e073fe68a21a", "metadata": {}, "source": [ "Combining them is fast:" ] }, 
{ "cell_type": "code", "execution_count": 20, "id": "4200f80a-b26f-4d9e-958d-d42530e56eb8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 411 ms, sys: 46.6 ms, total: 458 ms\n", "Wall time: 469 ms\n" ] } ], "source": [ "%%time\n", "\n", "combine_stream_jsons_as_groups(streams=[\"sfc\", \"h\"])" ] }, { "cell_type": "markdown", "id": "34e0f0eb-a87f-47c5-a809-3ed1255da5db", "metadata": { "tags": [] }, "source": [ "## Reading the combined dataset\n", "\n", "### Create the filesystem and mapper" ] }, { "cell_type": "code", "execution_count": 21, "id": "fae476e0-ed41-4e6a-be45-d9ad3189f337", "metadata": {}, "outputs": [], "source": [ "fs = fsspec.filesystem(\n", " \"reference\",\n", " fo=f\"{root}/{casename}/run/jsons/combined.json\",\n", " skip_instance_cache=True,\n", ")\n", "mapper = fs.get_mapper(root=\"\")" ] }, { "cell_type": "markdown", "id": "eb9525ef-be7d-4696-9893-79573cf5e3c6", "metadata": {}, "source": [ "### Simple xarray.open_dataset\n", "\n", "Specify the `group` kwarg to extract a single group" ] }, { "cell_type": "code", "execution_count": 22, "id": "53eb5e37-569c-495b-b7b5-c8b8d0a0e292", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 340 ms, sys: 9.08 ms, total: 349 ms\n", "Wall time: 344 ms\n" ] }, { "data": { "text/html": [ "
<xarray.Dataset>\n",
       "Dimensions:       (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)\n",
       "Coordinates:\n",
       "  * nv            (nv) float64 1.0 2.0\n",
       "  * time          (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00\n",
       "  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n",
       "  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n",
       "  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n",
       "  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n",
       "Data variables: (12/32)\n",
       "    Coriolis      (yq, xq) float32 ...\n",
       "    SSH           (time, yh, xh) float32 ...\n",
       "    SSU           (time, yh, xq) float32 ...\n",
       "    SSV           (time, yq, xh) float32 ...\n",
       "    areacello     (yh, xh) float32 ...\n",
       "    areacello_bu  (yq, xq) float32 ...\n",
       "    ...            ...\n",
       "    time_bnds     (time, nv) timedelta64[ns] ...\n",
       "    tos           (time, yh, xh) float32 ...\n",
       "    wet           (yh, xh) float32 ...\n",
       "    wet_c         (yq, xq) float32 ...\n",
       "    wet_u         (yh, xq) float32 ...\n",
       "    wet_v         (yq, xh) float32 ...\n",
       "Attributes:\n",
       "    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n",
       "    grid_tile:         N/A\n",
       "    grid_type:         regular\n",
       "    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...
" ], "text/plain": [ "\n", "Dimensions: (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)\n", "Coordinates:\n", " * nv (nv) float64 1.0 2.0\n", " * time (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00\n", " * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n", " * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n", " * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n", " * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n", "Data variables: (12/32)\n", " Coriolis (yq, xq) float32 ...\n", " SSH (time, yh, xh) float32 ...\n", " SSU (time, yh, xq) float32 ...\n", " SSV (time, yq, xh) float32 ...\n", " areacello (yh, xh) float32 ...\n", " areacello_bu (yq, xq) float32 ...\n", " ... ...\n", " time_bnds (time, nv) timedelta64[ns] ...\n", " tos (time, yh, xh) float32 ...\n", " wet (yh, xh) float32 ...\n", " wet_c (yq, xq) float32 ...\n", " wet_u (yh, xq) float32 ...\n", " wet_v (yq, xh) float32 ...\n", "Attributes:\n", " associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n", " grid_tile: N/A\n", " grid_type: regular\n", " title: MOM6 diagnostic fields table for CESM case: gmom.e23.G..." 
] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "xr.open_dataset(mapper, engine=\"zarr\", group=\"sfc\", use_cftime=True, consolidated=False)" ] }, { "cell_type": "markdown", "id": "de222dbc-7507-43d9-b953-f1fe131728a5", "metadata": {}, "source": [ "### Using datatree\n", "\n", "Open all groups in one go using [datatree](https://xarray-datatree.readthedocs.io/en/latest/)." ] }, { "cell_type": "code", "execution_count": 23, "id": "3a4fc2e0-3c71-46a3-a990-e65530299477", "metadata": {}, "outputs": [], "source": [ "import datatree" ] }, { "cell_type": "code", "execution_count": 24, "id": "d09f80f6-f330-4386-b317-8175f4e438c5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 611 ms, sys: 8.22 ms, total: 619 ms\n", "Wall time: 612 ms\n" ] }, { "data": { "text/html": [ "
<xarray.DatasetView>\n",
       "Dimensions:  ()\n",
       "Data variables:\n",
       "    *empty*
" ], "text/plain": [ "DataTree('None', parent=None)\n", "├── DataTree('h')\n", "│ Dimensions: (yq: 458, xq: 540, yh: 458, xh: 540, time: 300, z_l: 34,\n", "│ nv: 2, z_i: 35)\n", "│ Coordinates:\n", "│ * nv (nv) float64 1.0 2.0\n", "│ * time (time) object 0046-01-22 12:00:00 ... 0070-12-22 12:00:00\n", "│ * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n", "│ * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n", "│ * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n", "│ * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n", "│ * z_i (z_i) float64 0.0 5.0 15.0 25.0 ... 5.25e+03 5.75e+03 6.25e+03\n", "│ * z_l (z_l) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03\n", "│ Data variables: (12/36)\n", "│ Coriolis (yq, xq) float32 ...\n", "│ areacello (yh, xh) float32 ...\n", "│ areacello_bu (yq, xq) float32 ...\n", "│ areacello_cu (yh, xq) float32 ...\n", "│ areacello_cv (yq, xh) float32 ...\n", "│ average_DT (time) timedelta64[ns] ...\n", "│ ... ...\n", "│ vo (time, z_l, yq, xh) float32 ...\n", "│ volcello (time, z_l, yh, xh) float32 ...\n", "│ wet (yh, xh) float32 ...\n", "│ wet_c (yq, xq) float32 ...\n", "│ wet_u (yh, xq) float32 ...\n", "│ wet_v (yq, xh) float32 ...\n", "│ Attributes:\n", "│ associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n", "│ grid_tile: N/A\n", "│ grid_type: regular\n", "│ title: MOM6 diagnostic fields table for CESM case: gmom.e23.G...\n", "└── DataTree('sfc')\n", " Dimensions: (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)\n", " Coordinates:\n", " * nv (nv) float64 1.0 2.0\n", " * time (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00\n", " * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n", " * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n", " * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n", " * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 
87.68 87.73 87.74\n", " Data variables: (12/32)\n", " Coriolis (yq, xq) float32 ...\n", " SSH (time, yh, xh) float32 ...\n", " SSU (time, yh, xq) float32 ...\n", " SSV (time, yq, xh) float32 ...\n", " areacello (yh, xh) float32 ...\n", " areacello_bu (yq, xq) float32 ...\n", " ... ...\n", " time_bnds (time, nv) timedelta64[ns] ...\n", " tos (time, yh, xh) float32 ...\n", " wet (yh, xh) float32 ...\n", " wet_c (yq, xq) float32 ...\n", " wet_u (yh, xq) float32 ...\n", " wet_v (yq, xh) float32 ...\n", " Attributes:\n", " associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n", " grid_tile: N/A\n", " grid_type: regular\n", " title: MOM6 diagnostic fields table for CESM case: gmom.e23.G..." ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "tree = datatree.open_datatree(mapper, engine=\"zarr\", use_cftime=True, consolidated=False)\n", "tree" ] }, { "cell_type": "code", "execution_count": 25, "id": "d2ebadfa-6b1b-4237-83fb-f55ac3b2235d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DatasetView>\n",
       "Dimensions:       (yq: 458, xq: 540, yh: 458, xh: 540, time: 300, z_l: 34,\n",
       "                   nv: 2, z_i: 35)\n",
       "Coordinates:\n",
       "  * nv            (nv) float64 1.0 2.0\n",
       "  * time          (time) object 0046-01-22 12:00:00 ... 0070-12-22 12:00:00\n",
       "  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n",
       "  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n",
       "  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n",
       "  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n",
       "  * z_i           (z_i) float64 0.0 5.0 15.0 25.0 ... 5.25e+03 5.75e+03 6.25e+03\n",
       "  * z_l           (z_l) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03\n",
       "Data variables: (12/36)\n",
       "    Coriolis      (yq, xq) float32 ...\n",
       "    areacello     (yh, xh) float32 ...\n",
       "    areacello_bu  (yq, xq) float32 ...\n",
       "    areacello_cu  (yh, xq) float32 ...\n",
       "    areacello_cv  (yq, xh) float32 ...\n",
       "    average_DT    (time) timedelta64[ns] ...\n",
       "    ...            ...\n",
       "    vo            (time, z_l, yq, xh) float32 ...\n",
       "    volcello      (time, z_l, yh, xh) float32 ...\n",
       "    wet           (yh, xh) float32 ...\n",
       "    wet_c         (yq, xq) float32 ...\n",
       "    wet_u         (yh, xq) float32 ...\n",
       "    wet_v         (yq, xh) float32 ...\n",
       "Attributes:\n",
       "    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n",
       "    grid_tile:         N/A\n",
       "    grid_type:         regular\n",
       "    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...
" ], "text/plain": [ "DataTree('h', parent=\"None\")\n", " Dimensions: (yq: 458, xq: 540, yh: 458, xh: 540, time: 300, z_l: 34,\n", " nv: 2, z_i: 35)\n", " Coordinates:\n", " * nv (nv) float64 1.0 2.0\n", " * time (time) object 0046-01-22 12:00:00 ... 0070-12-22 12:00:00\n", " * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n", " * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n", " * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n", " * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n", " * z_i (z_i) float64 0.0 5.0 15.0 25.0 ... 5.25e+03 5.75e+03 6.25e+03\n", " * z_l (z_l) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03\n", " Data variables: (12/36)\n", " Coriolis (yq, xq) float32 ...\n", " areacello (yh, xh) float32 ...\n", " areacello_bu (yq, xq) float32 ...\n", " areacello_cu (yh, xq) float32 ...\n", " areacello_cv (yq, xh) float32 ...\n", " average_DT (time) timedelta64[ns] ...\n", " ... ...\n", " vo (time, z_l, yq, xh) float32 ...\n", " volcello (time, z_l, yh, xh) float32 ...\n", " wet (yh, xh) float32 ...\n", " wet_c (yq, xq) float32 ...\n", " wet_u (yh, xq) float32 ...\n", " wet_v (yq, xh) float32 ...\n", " Attributes:\n", " associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n", " grid_tile: N/A\n", " grid_type: regular\n", " title: MOM6 diagnostic fields table for CESM case: gmom.e23.G..." ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree[\"h\"]" ] }, { "cell_type": "code", "execution_count": 26, "id": "6ff8ce1a-fd0f-480c-ac2f-d79a38f59ad0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.DatasetView>\n",
       "Dimensions:       (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)\n",
       "Coordinates:\n",
       "  * nv            (nv) float64 1.0 2.0\n",
       "  * time          (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00\n",
       "  * xh            (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n",
       "  * xq            (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n",
       "  * yh            (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n",
       "  * yq            (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n",
       "Data variables: (12/32)\n",
       "    Coriolis      (yq, xq) float32 ...\n",
       "    SSH           (time, yh, xh) float32 ...\n",
       "    SSU           (time, yh, xq) float32 ...\n",
       "    SSV           (time, yq, xh) float32 ...\n",
       "    areacello     (yh, xh) float32 ...\n",
       "    areacello_bu  (yq, xq) float32 ...\n",
       "    ...            ...\n",
       "    time_bnds     (time, nv) timedelta64[ns] ...\n",
       "    tos           (time, yh, xh) float32 ...\n",
       "    wet           (yh, xh) float32 ...\n",
       "    wet_c         (yq, xq) float32 ...\n",
       "    wet_u         (yh, xq) float32 ...\n",
       "    wet_v         (yq, xh) float32 ...\n",
       "Attributes:\n",
       "    associated_files:  areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n",
       "    grid_tile:         N/A\n",
       "    grid_type:         regular\n",
       "    title:             MOM6 diagnostic fields table for CESM case: gmom.e23.G...
" ], "text/plain": [ "DataTree('sfc', parent=\"None\")\n", " Dimensions: (yq: 458, xq: 540, time: 9101, yh: 458, xh: 540, nv: 2)\n", " Coordinates:\n", " * nv (nv) float64 1.0 2.0\n", " * time (time) object 0046-01-07 12:00:00 ... 0071-01-06 12:00:00\n", " * xh (xh) float64 -286.7 -286.0 -285.3 -284.7 ... 71.33 72.0 72.67\n", " * xq (xq) float64 -286.3 -285.7 -285.0 -284.3 ... 71.67 72.33 73.0\n", " * yh (yh) float64 -79.2 -79.08 -78.95 -78.82 ... 87.64 87.71 87.74\n", " * yq (yq) float64 -79.14 -79.01 -78.89 -78.76 ... 87.68 87.73 87.74\n", " Data variables: (12/32)\n", " Coriolis (yq, xq) float32 ...\n", " SSH (time, yh, xh) float32 ...\n", " SSU (time, yh, xq) float32 ...\n", " SSV (time, yq, xh) float32 ...\n", " areacello (yh, xh) float32 ...\n", " areacello_bu (yq, xq) float32 ...\n", " ... ...\n", " time_bnds (time, nv) timedelta64[ns] ...\n", " tos (time, yh, xh) float32 ...\n", " wet (yh, xh) float32 ...\n", " wet_c (yq, xq) float32 ...\n", " wet_u (yh, xq) float32 ...\n", " wet_v (yq, xh) float32 ...\n", " Attributes:\n", " associated_files: areacello: gmom.e23.GJRAv3.TL319_t061_zstar_N65.baseli...\n", " grid_tile: N/A\n", " grid_type: regular\n", " title: MOM6 diagnostic fields table for CESM case: gmom.e23.G..." ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree[\"sfc\"]" ] }, { "cell_type": "markdown", "id": "6e5248b7-e4ed-4c12-b00c-26cc839c7dd5", "metadata": {}, "source": [ "## Next\n", "\n", "1. We could even consider adding higher-level `lnd`, `atm`, `ocn` groups so that a single virtual dataset represents all output streams for all components from a single simulation.\n", "2. In `intake-esm` terminology, a single JSON file representing an aggregate dataset could be a single asset." 
] } ], "metadata": { "author": "Deepak Cherian", "date": "March 7, 2023", "kernelspec": { "display_name": "Python [conda env:miniconda3-pump]", "language": "python", "name": "conda-env-miniconda3-pump-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }