{ "cells": [ { "cell_type": "markdown", "id": "4cb0f486-9005-4fb6-b725-9f63d305b812", "metadata": {}, "source": [ "# Thinking through CESM data access\n", "\n", "We want to read a large number of netCDF files, combine them to form a single dataset, and then analyze that. How do we think about it?\n", "\n", "In pseudocode we want\n", "```python\n", "# loop over every file and read in metadata\n", "datasets = [xr.open_dataset(file) for file in files]\n", "# optionally make modifications\n", "preprocessed = [preprocess(dataset) for dataset in datasets]\n", "# combine to create a single dataset\n", "combined = xr.combine_XXX(preprocessed, ...)\n", "```\n", "\n", "Xarray's [`open_mfdataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html#xarray.open_mfdataset) implements this pattern with the option of parallelizing the loop over all files using `dask`. This can be quite handy." ] }, { "cell_type": "markdown", "id": "72c4c5db-fbe2-4772-ace9-7435647075d0", "metadata": { "tags": [] }, "source": [ "## Creating a new data pipeline\n", "\n", "### First create a list of files\n", "\n", "The `glob` package is good for this ([docs](https://docs.python.org/3/library/glob.html))\n", "\n", "> The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order. No tilde expansion is done, but *, ?, and character ranges expressed with [] will be correctly matched. \n", "\n", "```{important}\n", "- The tilde `~` is not expanded to the user's home directory. Use [`os.path.expanduser`](https://docs.python.org/3/library/os.path.html#os.path.expanduser) for that.\n", "- The list of files returned by `glob` is not sorted ! Use [`sorted`](https://docs.python.org/3/library/functions.html#sorted) to sort the list.\n", "```\n", "\n", "Here's a list of files: these are timeseries files with output for years 1850-2100, and 50 ensemble members." 
] }, { "cell_type": "code", "execution_count": 1, "id": "aeca51e1-7553-443a-8579-c43932a40e31", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.011.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.012.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.019.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1071.004.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1171.009.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1151.008.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.018.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.014.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1111.006.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.013.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.018.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.015.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.016.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1091.005.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.020.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.015.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.013.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.017.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.011.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1011.001.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.011.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.018.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.018.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.014.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.013.cam.h1.PRECT.18500101-21001231.nc',\n", " 
'/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.019.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.015.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.016.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1131.007.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.012.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.020.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1051.003.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1191.010.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.016.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.013.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.014.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.019.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.011.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.016.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.012.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.015.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.019.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.017.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.014.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1231.017.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1301.020.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.020.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1251.017.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1031.002.cam.h1.PRECT.18500101-21001231.nc',\n", " '/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/b.e21.BHISTsmbb.f09_g17.LE2-1281.012.cam.h1.PRECT.18500101-21001231.nc']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import glob\n", "\n", "files = glob.glob(\n", " \"/glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/PRECT/*smbb**h1**18500101-21001231*\"\n", ")\n", "files" ] }, { 
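"cell_type": "markdown", "id": "3b9d2c1a-7e4f-4f2a-9c1d-2a6b8e0f1c3d", "metadata": {}, "source": [ "Each filename embeds an ensemble member tag (e.g. `LE2-1251.011`). As a quick, hypothetical sanity check that we really have one file per member, we can pull those tags out with a regex that follows the naming pattern visible above:\n", "\n", "```python\n", "import os\n", "import re\n", "\n", "members = [\n", "    re.search(r\"LE2-(\\d{4}\\.\\d{3})\", os.path.basename(f)).group(1) for f in files\n", "]\n", "assert len(set(members)) == len(files)\n", "```" ] }, {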
"cell_type": "markdown", "id": "0956b136-1705-484e-bdb9-0df8492fa56c", "metadata": {}, "source": [ "There are 50 files, one per ensemble member" ] }, { "cell_type": "code", "execution_count": 2, "id": "caddcf0b-0caf-44b9-b0cb-97e65b9de01e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(files)" ] }, { "cell_type": "markdown", "id": "57f470ba-c9d1-4b78-8357-1844b1a11a9c", "metadata": { "tags": [] }, "source": [ "### Start by opening a single file" ] }, { "cell_type": "code", "execution_count": 3, "id": "1ceb31c0-1498-495d-840d-62710e300582", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:    (time: 8761, bnds: 2, lon: 288, lat: 192)\n",
       "Coordinates:\n",
       "  * time       (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
       "Dimensions without coordinates: bnds\n",
       "Data variables:\n",
       "    time_bnds  (time, bnds) object ...\n",
       "    PRECT      (time, lat, lon) float32 ...\n",
       "Attributes: (12/13)\n",
       "    CDI:               Climate Data Interface version 2.0.2 (https://mpimet.m...\n",
       "    Conventions:       CF-1.0\n",
       "    source:            CAM\n",
       "    case:              b.e21.BHISTsmbb.f09_g17.LE2-1251.011\n",
       "    logname:           sunseon\n",
       "    host:              mom1\n",
       "    ...                ...\n",
       "    topography_file:   /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n",
       "    model_doi_url:     https://doi.org/10.5065/D67H1H0V\n",
       "    time_period_freq:  day_1\n",
       "    history:           Wed May 17 11:54:18 2023: cdo selvar,PRECT tmp.nc PREC...\n",
       "    NCO:               netCDF Operators version 5.0.3 (Homepage = http://nco....\n",
       "    CDO:               Climate Data Operators version 2.0.1 (https://mpimet.m...
" ], "text/plain": [ "\n", "Dimensions: (time: 8761, bnds: 2, lon: 288, lat: 192)\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n", "Dimensions without coordinates: bnds\n", "Data variables:\n", " time_bnds (time, bnds) object ...\n", " PRECT (time, lat, lon) float32 ...\n", "Attributes: (12/13)\n", " CDI: Climate Data Interface version 2.0.2 (https://mpimet.m...\n", " Conventions: CF-1.0\n", " source: CAM\n", " case: b.e21.BHISTsmbb.f09_g17.LE2-1251.011\n", " logname: sunseon\n", " host: mom1\n", " ... ...\n", " topography_file: /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n", " model_doi_url: https://doi.org/10.5065/D67H1H0V\n", " time_period_freq: day_1\n", " history: Wed May 17 11:54:18 2023: cdo selvar,PRECT tmp.nc PREC...\n", " NCO: netCDF Operators version 5.0.3 (Homepage = http://nco....\n", " CDO: Climate Data Operators version 2.0.1 (https://mpimet.m..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import xarray as xr\n", "\n", "single = xr.open_dataset(files[0])\n", "single" ] }, { "cell_type": "markdown", "id": "6a414ced-0214-42af-acfa-128e9f979abe", "metadata": {}, "source": [ "First check data size" ] }, { "cell_type": "code", "execution_count": 4, "id": "f9a9e75a-d92e-4290-997f-3ff38d525096", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.938007128" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "single.nbytes / 1e9 # approx GB" ] }, { "cell_type": "markdown", "id": "f2911db3-045b-49db-8d5e-6f6febc2a1f9", "metadata": {}, "source": [ "Each single file is 20GB and we have 50 of them, so approximately a terabyte in total. We will have to use dask.\n", "\n", "That means we need to make chunking decisions.\n", "\n", "Later on, we will extract time series at a single point, so let's chunk in space, and choosing chunksizes for the data variable `PRECT`.\n", "\n", "Start by looking at dimension names for `PRECT`" ] }, { "cell_type": "code", "execution_count": 5, "id": "79d5044a-4ff9-42d1-8ccd-d948d239664d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'PRECT' (time: 8761, lat: 192, lon: 288)>\n",
       "[484448256 values with dtype=float32]\n",
       "Coordinates:\n",
       "  * time     (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon      (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat      (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n",
       "Attributes:\n",
       "    long_name:     Total (convective and large-scale) precipitation rate (liq...\n",
       "    units:         m/s\n",
       "    cell_methods:  time: mean
" ], "text/plain": [ "\n", "[484448256 values with dtype=float32]\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n", "Attributes:\n", " long_name: Total (convective and large-scale) precipitation rate (liq...\n", " units: m/s\n", " cell_methods: time: mean" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "single.PRECT" ] }, { "cell_type": "markdown", "id": "bc1e457f-a069-410c-89e4-b1c3bfbf6f76", "metadata": { "tags": [] }, "source": [ "### Choosing a chunk size\n", "\n", "This is a timeseries file with daily average output using the `noleap` calendar.\n", "\n", "We will concatenate ensemble members together to create a single dataset along a new dimension `\"ensemble\"`. Today, we *cannot* create an xarray dataset with chunksizes that span files. In other words, because there is one file per ensemble member, and we are concatenating ensemble members along a new dimension, the chunksize for the new dimension **will** be one. It is possible to rechunk later, but that will involve expensive communication that is best to avoid unless you really need to do so.\n", "\n", "We *could* chunk along space because we want to plot time series at a single point later. After some experimenting we choose a size of 16 along `lat`, 32 along `lon`, and all timesteps in a single chunk, for a chunksize of ~180MB.\n", "\n", "```{tip}\n", "Many other chunking choices are possible, it all depends on what you want to do later. For example we could have bigger spatial chunks, and smaller chunks along time. Here is some reading material on chunking/xarray/dask:\n", "- [dask docs on best practices](https://docs.dask.org/en/stable/array-best-practices.html#select-a-good-chunk-size)\n", "- [xarray docs](https://docs.xarray.dev/en/stable/user-guide/dask.html#optimization-tips)\n", "- [dask docs on chunking](https://docs.dask.org/en/latest/array-chunks.html)\n", "- [dask blog](https://blog.dask.org/2020/07/30/beginners-config)\n", "```\n", "\n", "> When choosing the size of chunks it is best to make them neither too small, nor too big (around 100MB is often reasonable). Each chunk needs to be able to fit into the worker memory and operations on that chunk should take some non-trivial amount of time (more than 100ms). For many more recommendations take a look at the docs on [chunks](https://docs.dask.org/en/latest/array-chunks.html)...\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "65bce946-544c-44c8-a5ce-92d3839b0fad", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'PRECT' (time: 8761, lat: 192, lon: 288)>\n",
       "dask.array<xarray-<this-array>, shape=(8761, 192, 288), dtype=float32, chunksize=(1825, 192, 288), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * time     (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon      (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat      (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n",
       "Attributes:\n",
       "    long_name:     Total (convective and large-scale) precipitation rate (liq...\n",
       "    units:         m/s\n",
       "    cell_methods:  time: mean
" ], "text/plain": [ "\n", "dask.array, shape=(8761, 192, 288), dtype=float32, chunksize=(1825, 192, 288), chunktype=numpy.ndarray>\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n", "Attributes:\n", " long_name: Total (convective and large-scale) precipitation rate (liq...\n", " units: m/s\n", " cell_methods: time: mean" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "single.PRECT.chunk({\"time\": 365 * 5})" ] }, { "cell_type": "markdown", "id": "f42d2d97-c14f-4581-8e04-625770eae538", "metadata": {}, "source": [ "### Test with a small subset\n", "\n", "Let's trying reading just 3 files to make sure the output looks as we expect.\n", "\n", "We choose `combine=\"nested\"` instead of `combine=\"by_coords\"`. This will simply concatenate the files in the order provided. The term \"nested\" is used becasue it can accept a nested list-of-lists as input and concatenate along multiple dimensions. In contrast, `by_coords` will look at coordinate location values and make decisions about which dimensions to concatenate along. This can sometimes backfire, and it is almost always better to be explicit by providing the files in the right order, and specify `combine=\"nested\"`.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "d83f270d-3cfb-4426-b545-43f86350c6ca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:    (time: 8761, ensemble: 3, bnds: 2, lon: 288, lat: 192)\n",
       "Coordinates:\n",
       "  * time       (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
       "Dimensions without coordinates: ensemble, bnds\n",
       "Data variables:\n",
       "    time_bnds  (ensemble, time, bnds) object dask.array<chunksize=(1, 8761, 2), meta=np.ndarray>\n",
       "    PRECT      (ensemble, time, lat, lon) float32 dask.array<chunksize=(1, 8761, 16, 32), meta=np.ndarray>\n",
       "Attributes: (12/13)\n",
       "    CDI:               Climate Data Interface version 2.0.2 (https://mpimet.m...\n",
       "    Conventions:       CF-1.0\n",
       "    source:            CAM\n",
       "    case:              b.e21.BHISTsmbb.f09_g17.LE2-1011.001\n",
       "    logname:           sunseon\n",
       "    host:              mom2\n",
       "    ...                ...\n",
       "    topography_file:   /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n",
       "    model_doi_url:     https://doi.org/10.5065/D67H1H0V\n",
       "    time_period_freq:  day_1\n",
       "    history:           Wed May 17 11:17:29 2023: cdo selvar,PRECT tmp.nc PREC...\n",
       "    NCO:               netCDF Operators version 5.0.3 (Homepage = http://nco....\n",
       "    CDO:               Climate Data Operators version 2.0.1 (https://mpimet.m...
" ], "text/plain": [ "\n", "Dimensions: (time: 8761, ensemble: 3, bnds: 2, lon: 288, lat: 192)\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n", "Dimensions without coordinates: ensemble, bnds\n", "Data variables:\n", " time_bnds (ensemble, time, bnds) object dask.array\n", " PRECT (ensemble, time, lat, lon) float32 dask.array\n", "Attributes: (12/13)\n", " CDI: Climate Data Interface version 2.0.2 (https://mpimet.m...\n", " Conventions: CF-1.0\n", " source: CAM\n", " case: b.e21.BHISTsmbb.f09_g17.LE2-1011.001\n", " logname: sunseon\n", " host: mom2\n", " ... ...\n", " topography_file: /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n", " model_doi_url: https://doi.org/10.5065/D67H1H0V\n", " time_period_freq: day_1\n", " history: Wed May 17 11:17:29 2023: cdo selvar,PRECT tmp.nc PREC...\n", " NCO: netCDF Operators version 5.0.3 (Homepage = http://nco....\n", " CDO: Climate Data Operators version 2.0.1 (https://mpimet.m..." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xr.open_mfdataset(\n", " # make sure we sort\n", " sorted(files)[:3],\n", " # concatenate along a new dimension called \"ensemble\"\n", " concat_dim=\"ensemble\",\n", " # just concatenate them together\n", " combine=\"nested\",\n", " chunks={\"lat\": 16, \"lon\": 32},\n", " parallel=True,\n", ")" ] }, { "cell_type": "markdown", "id": "b3b4e71f-61be-400f-9d3f-a53700a9dab3", "metadata": {}, "source": [ "Notice that *all* variables have been concatenated along the `ensemble` dimension even if we know it to be a constant: e.g. `P0`.\n", "\n", "\n", "### Choosing combine options\n", "\n", "\n", "Xarray has a number of options to control this concatenation behaviour. The [normal recommendation](https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets) is the hard-to-interpret sequence `data_vars=\"minimal\", coords=\"minimal\", compat=\"override\"`. What does this mean?\n", "1. `\"minimal\"` for `data_vars` and `coords` means only concatenate variables that have the concatenation dimension already.\n", "2. For those variables without the concatenation dimension, xarray will look at the `compat` kwarg. For `compat=\"different\"`, the default, Xarray will check for equality of the variable across all files. Those that are different get concatenated, those that are the same, are simply copied over. This can get quite expensive, so `compat=\"override\"` allows you to skip equality checking and simply pick the variable from the first file. This is great for so-called 'static variables' such as grid variables that are invariant in time (and ensemble member).\n", "\n", "Let's try that" ] }, { "cell_type": "code", "execution_count": 8, "id": "8c8040ad-9f4b-479e-8ba7-0ac4db1fe967", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:    (time: 8761, bnds: 2, lon: 288, lat: 192)\n",
       "Coordinates:\n",
       "  * time       (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
       "Dimensions without coordinates: bnds\n",
       "Data variables:\n",
       "    time_bnds  (time, bnds) object dask.array<chunksize=(8761, 2), meta=np.ndarray>\n",
       "    PRECT      (time, lat, lon) float32 dask.array<chunksize=(8761, 16, 32), meta=np.ndarray>\n",
       "Attributes: (12/13)\n",
       "    CDI:               Climate Data Interface version 2.0.2 (https://mpimet.m...\n",
       "    Conventions:       CF-1.0\n",
       "    source:            CAM\n",
       "    case:              b.e21.BHISTsmbb.f09_g17.LE2-1251.011\n",
       "    logname:           sunseon\n",
       "    host:              mom1\n",
       "    ...                ...\n",
       "    topography_file:   /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n",
       "    model_doi_url:     https://doi.org/10.5065/D67H1H0V\n",
       "    time_period_freq:  day_1\n",
       "    history:           Wed May 17 11:54:18 2023: cdo selvar,PRECT tmp.nc PREC...\n",
       "    NCO:               netCDF Operators version 5.0.3 (Homepage = http://nco....\n",
       "    CDO:               Climate Data Operators version 2.0.1 (https://mpimet.m...
" ], "text/plain": [ "\n", "Dimensions: (time: 8761, bnds: 2, lon: 288, lat: 192)\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n", "Dimensions without coordinates: bnds\n", "Data variables:\n", " time_bnds (time, bnds) object dask.array\n", " PRECT (time, lat, lon) float32 dask.array\n", "Attributes: (12/13)\n", " CDI: Climate Data Interface version 2.0.2 (https://mpimet.m...\n", " Conventions: CF-1.0\n", " source: CAM\n", " case: b.e21.BHISTsmbb.f09_g17.LE2-1251.011\n", " logname: sunseon\n", " host: mom1\n", " ... ...\n", " topography_file: /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n", " model_doi_url: https://doi.org/10.5065/D67H1H0V\n", " time_period_freq: day_1\n", " history: Wed May 17 11:54:18 2023: cdo selvar,PRECT tmp.nc PREC...\n", " NCO: netCDF Operators version 5.0.3 (Homepage = http://nco....\n", " CDO: Climate Data Operators version 2.0.1 (https://mpimet.m..." ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined = xr.open_mfdataset(\n", " # make sure we sort\n", " sorted(files[:3]),\n", " # concatenate along a new dimension called \"ensemble\"\n", " concat_dim=\"ensemble\",\n", " chunks={\"lat\": 16, \"lon\": 32},\n", " data_vars=\"minimal\",\n", " coords=\"minimal\",\n", " compat=\"override\",\n", " # just concatenate them together\n", " combine=\"nested\",\n", " parallel=True,\n", ")\n", "combined" ] }, { "cell_type": "markdown", "id": "8ce75a6f-6b89-4340-82ed-57f50cced64c", "metadata": {}, "source": [ "Oops this doesn't work for us! We didn't concatenate `PRECT` along the new `ensemble` dimension." ] }, { "cell_type": "code", "execution_count": 9, "id": "6ba30a3d-e6e5-4b87-a0ea-e2b1722e3a58", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('time', 'lat', 'lon')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined.PRECT.dims" ] }, { "cell_type": "markdown", "id": "2658bc9c-5a3c-4d37-9574-8a90946d8f47", "metadata": {}, "source": [ "### Try preprocessing the dataset to make it work better\n", "\n", "Our dataset doesn't really fit the assumptions of `open_mfdataset`. Luckily we can modify our datasets before the concatenation stage using the `preprocess` kwarg ([docs](https://docs.xarray.dev/en/stable/generated/xarray.open_mfdataset.html#xarray.open_mfdataset))\n", "> `preprocess`: If provided, call this function on each dataset prior to concatenation. You can find the file-name from which each dataset was loaded in ds.encoding[\"source\"].\n", "\n", "What we'll do is to add a new dimension `ensemble` to the `PRECT` variable using [`expand_dims`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.expand_dims.html#xarray.DataArray.expand_dims).\n", "\n", "This makes it clear that only `PRECT` should be concatenated along the new `ensemble` dimension" ] }, { "cell_type": "code", "execution_count": 10, "id": "e33bec6f-036d-47a8-8c5e-e8ce41911a83", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:    (time: 8761, bnds: 2, lon: 288, lat: 192, ensemble: 3)\n",
       "Coordinates:\n",
       "  * time       (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
       "Dimensions without coordinates: bnds, ensemble\n",
       "Data variables:\n",
       "    time_bnds  (time, bnds) object dask.array<chunksize=(8761, 2), meta=np.ndarray>\n",
       "    PRECT      (ensemble, time, lat, lon) float32 dask.array<chunksize=(1, 8761, 16, 32), meta=np.ndarray>\n",
       "Attributes: (12/13)\n",
       "    CDI:               Climate Data Interface version 2.0.2 (https://mpimet.m...\n",
       "    Conventions:       CF-1.0\n",
       "    source:            CAM\n",
       "    case:              b.e21.BHISTsmbb.f09_g17.LE2-1011.001\n",
       "    logname:           sunseon\n",
       "    host:              mom2\n",
       "    ...                ...\n",
       "    topography_file:   /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n",
       "    model_doi_url:     https://doi.org/10.5065/D67H1H0V\n",
       "    time_period_freq:  day_1\n",
       "    history:           Wed May 17 11:17:29 2023: cdo selvar,PRECT tmp.nc PREC...\n",
       "    NCO:               netCDF Operators version 5.0.3 (Homepage = http://nco....\n",
       "    CDO:               Climate Data Operators version 2.0.1 (https://mpimet.m...
" ], "text/plain": [ "\n", "Dimensions: (time: 8761, bnds: 2, lon: 288, lat: 192, ensemble: 3)\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n", "Dimensions without coordinates: bnds, ensemble\n", "Data variables:\n", " time_bnds (time, bnds) object dask.array\n", " PRECT (ensemble, time, lat, lon) float32 dask.array\n", "Attributes: (12/13)\n", " CDI: Climate Data Interface version 2.0.2 (https://mpimet.m...\n", " Conventions: CF-1.0\n", " source: CAM\n", " case: b.e21.BHISTsmbb.f09_g17.LE2-1011.001\n", " logname: sunseon\n", " host: mom2\n", " ... ...\n", " topography_file: /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n", " model_doi_url: https://doi.org/10.5065/D67H1H0V\n", " time_period_freq: day_1\n", " history: Wed May 17 11:17:29 2023: cdo selvar,PRECT tmp.nc PREC...\n", " NCO: netCDF Operators version 5.0.3 (Homepage = http://nco....\n", " CDO: Climate Data Operators version 2.0.1 (https://mpimet.m..." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add_ensemble_dim(ds):\n", " ds[\"PRECT\"] = ds.PRECT.expand_dims(\"ensemble\")\n", " return ds\n", "\n", "\n", "combined = xr.open_mfdataset(\n", " # make sure we sort\n", " sorted(files)[:3],\n", " # chunk the dataset from each file properly\n", " chunks={\"lat\": 16, \"lon\": 32},\n", " # concatenate along a new dimension called \"ensemble\"\n", " concat_dim=\"ensemble\",\n", " data_vars=\"minimal\",\n", " coords=\"minimal\",\n", " compat=\"override\",\n", " combine=\"nested\",\n", " parallel=True,\n", " preprocess=add_ensemble_dim,\n", ")\n", "combined" ] }, { "cell_type": "markdown", "id": "bf62a026-91db-4077-8845-7353bd14601f", "metadata": {}, "source": [ "Much better!" ] }, { "cell_type": "code", "execution_count": 11, "id": "83042a2d-26c8-47f6-9b93-10f970823b73", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.DataArray 'PRECT' (ensemble: 3, time: 8761, lat: 192, lon: 288)>\n",
       "dask.array<concatenate, shape=(3, 8761, 192, 288), dtype=float32, chunksize=(1, 8761, 16, 32), chunktype=numpy.ndarray>\n",
       "Coordinates:\n",
       "  * time     (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon      (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat      (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n",
       "Dimensions without coordinates: ensemble\n",
       "Attributes:\n",
       "    long_name:     Total (convective and large-scale) precipitation rate (liq...\n",
       "    units:         m/s\n",
       "    cell_methods:  time: mean
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0\n", "Dimensions without coordinates: ensemble\n", "Attributes:\n", " long_name: Total (convective and large-scale) precipitation rate (liq...\n", " units: m/s\n", " cell_methods: time: mean" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined.PRECT" ] }, { "cell_type": "markdown", "id": "4e813a97-95f7-4606-87cf-c9cc88831d02", "metadata": {}, "source": [ "## Read and concatenate the whole dataset" ] }, { "cell_type": "markdown", "id": "9a0f5df4-4253-46f2-9aff-87301201d7de", "metadata": { "tags": [] }, "source": [ "### Create a dask cluster\n", "\n", "We'll use an adaptive cluster to be polite.\n", "\n", "The dask cluster helps by parallelizing the initial reading of every file." ] }, { "cell_type": "code", "execution_count": 12, "id": "548b7a97-dbc1-4cb9-922b-5634caa6afc8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/glade/u/home/dcherian/miniconda3/envs/pump/lib/python3.10/site-packages/dask_jobqueue/core.py:20: FutureWarning: tmpfile is deprecated and will be removed in a future release. Please use dask.utils.tmpfile instead.\n", " from distributed.utils import tmpfile\n", "/glade/u/home/dcherian/miniconda3/envs/pump/lib/python3.10/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.\n", "Perhaps you already have a cluster running?\n", "Hosting the HTTP server on port 45848 instead\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "
\n", "
\n", "

Client

\n", "

Client-6ae07e09-3c6e-11ee-81c8-3cecef1b11f8

\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "
Connection method: Cluster objectCluster type: dask_jobqueue.PBSCluster
\n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/casper/proxy/45848/status\n", "
\n", "\n", " \n", " \n", " \n", "\n", " \n", "
\n", "

Cluster Info

\n", "
\n", "
\n", "
\n", "
\n", "

PBSCluster

\n", "

f59e9f61

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/casper/proxy/45848/status\n", " \n", " Workers: 0\n", "
\n", " Total threads: 0\n", " \n", " Total memory: 0 B\n", "
\n", "\n", "
\n", " \n", "

Scheduler Info

\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", "

Scheduler

\n", "

Scheduler-7028ad57-67ed-4993-a974-e3a9b95111fb

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " Comm: tcp://10.12.206.35:45783\n", " \n", " Workers: 0\n", "
\n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/casper/proxy/45848/status\n", " \n", " Total threads: 0\n", "
\n", " Started: Just now\n", " \n", " Total memory: 0 B\n", "
\n", "
\n", "
\n", "\n", "
\n", " \n", "

Workers

\n", "
\n", "\n", " \n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", " \n", "\n", "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask_jobqueue\n", "\n", "cluster = dask_jobqueue.PBSCluster(\n", " cores=4, # The number of cores you want\n", " memory=\"23GB\", # Amount of memory\n", " processes=1, # How many processes\n", " queue=\"casper\", # The type of queue to utilize (/glade/u/apps/dav/opt/usr/bin/execcasper)\n", " local_directory=\"/local_scratch/pbs.$PBS_JOBID/dask/spill\",\n", " log_directory=\"/glade/scratch/dcherian/dask/\",\n", " resource_spec=\"select=1:ncpus=4:mem=23GB\", # Specify resources\n", " project=\"ncgd0011\", # Input your project ID here\n", " walltime=\"02:00:00\", # Amount of wall time\n", " interface=\"ib0\", # Interface to use\n", ")\n", "# create an adaptive cluster with one job always requested,\n", "# scale to a maximum of 6 jobs\n", "# and hold on to each job for 600 seconds of idle time\n", "cluster.adapt(minimum_jobs=1, maximum_jobs=6, wait_count=600)\n", "\n", "import distributed\n", "\n", "client = distributed.Client(cluster)\n", "\n", "client" ] }, { "cell_type": "code", "execution_count": 13, "id": "0bf926e2-dc30-412f-9b88-a3ac72bfd145", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "
\n", "

Client

\n", "

Client-6ae5eeca-3c6e-11ee-81c8-3cecef1b11f8

\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "
Connection method: Cluster objectCluster type: dask_jobqueue.PBSCluster
\n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/casper/proxy/45848/status\n", "
\n", "\n", " \n", " \n", " \n", "\n", " \n", "
\n", "

Cluster Info

\n", "
\n", "
\n", "
\n", "
\n", "

PBSCluster

\n", "

f59e9f61

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/casper/proxy/45848/status\n", " \n", " Workers: 0\n", "
\n", " Total threads: 0\n", " \n", " Total memory: 0 B\n", "
\n", "\n", "
\n", " \n", "

Scheduler Info

\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", "

Scheduler

\n", "

Scheduler-7028ad57-67ed-4993-a974-e3a9b95111fb

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " Comm: tcp://10.12.206.35:45783\n", " \n", " Workers: 0\n", "
\n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/dcherian/casper/proxy/45848/status\n", " \n", " Total threads: 0\n", "
\n", " Started: Just now\n", " \n", " Total memory: 0 B\n", "
\n", "
\n", "
\n", "\n", "
\n", " \n", "

Workers

\n", "
\n", "\n", " \n", "\n", "
\n", "
\n", "\n", "
\n", "
\n", "
\n", "
\n", " \n", "\n", "
\n", "
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import distributed\n", "\n", "client = distributed.Client(cluster)\n", "\n", "client" ] }, { "cell_type": "markdown", "id": "89490b69-45a2-4661-b656-47eacb5ac9ac", "metadata": {}, "source": [ "### Read\n", "\n", "Now we can scale it up. " ] }, { "cell_type": "markdown", "id": "948459c8-5978-42ad-874a-9d2c1d7fbef0", "metadata": {}, "source": [ "We generalize a little by having `add_ensemble_dim` expand the dimensions of any variable with 3 or more dimensions." ] }, { "cell_type": "code", "execution_count": 14, "id": "34eadbcf-9b96-459e-9341-43ddb9a1f2ec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset>\n",
       "Dimensions:    (time: 8761, bnds: 2, lon: 288, lat: 192, ensemble: 50)\n",
       "Coordinates:\n",
       "  * time       (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n",
       "  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
       "  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
       "Dimensions without coordinates: bnds, ensemble\n",
       "Data variables:\n",
       "    time_bnds  (time, bnds) object dask.array<chunksize=(1825, 2), meta=np.ndarray>\n",
       "    PRECT      (ensemble, time, lat, lon) float32 dask.array<chunksize=(1, 1825, 192, 288), meta=np.ndarray>\n",
       "Attributes: (12/13)\n",
       "    CDI:               Climate Data Interface version 2.0.2 (https://mpimet.m...\n",
       "    Conventions:       CF-1.0\n",
       "    source:            CAM\n",
       "    case:              b.e21.BHISTsmbb.f09_g17.LE2-1011.001\n",
       "    logname:           sunseon\n",
       "    host:              mom2\n",
       "    ...                ...\n",
       "    topography_file:   /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n",
       "    model_doi_url:     https://doi.org/10.5065/D67H1H0V\n",
       "    time_period_freq:  day_1\n",
       "    history:           Wed May 17 11:17:29 2023: cdo selvar,PRECT tmp.nc PREC...\n",
       "    NCO:               netCDF Operators version 5.0.3 (Homepage = http://nco....\n",
       "    CDO:               Climate Data Operators version 2.0.1 (https://mpimet.m...
" ], "text/plain": [ "\n", "Dimensions: (time: 8761, bnds: 2, lon: 288, lat: 192, ensemble: 50)\n", "Coordinates:\n", " * time (time) object 2000-01-01 00:00:00 ... 2023-12-31 00:00:00\n", " * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n", " * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n", "Dimensions without coordinates: bnds, ensemble\n", "Data variables:\n", " time_bnds (time, bnds) object dask.array\n", " PRECT (ensemble, time, lat, lon) float32 dask.array\n", "Attributes: (12/13)\n", " CDI: Climate Data Interface version 2.0.2 (https://mpimet.m...\n", " Conventions: CF-1.0\n", " source: CAM\n", " case: b.e21.BHISTsmbb.f09_g17.LE2-1011.001\n", " logname: sunseon\n", " host: mom2\n", " ... ...\n", " topography_file: /mnt/lustre/share/CESM/cesm_input/atm/cam/topo/fv_0.9x...\n", " model_doi_url: https://doi.org/10.5065/D67H1H0V\n", " time_period_freq: day_1\n", " history: Wed May 17 11:17:29 2023: cdo selvar,PRECT tmp.nc PREC...\n", " NCO: netCDF Operators version 5.0.3 (Homepage = http://nco....\n", " CDO: Climate Data Operators version 2.0.1 (https://mpimet.m..." ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add_ensemble_dim(ds):\n", " # find all 3D variables\n", " names = [name for name, variable in ds.variables.items() if variable.ndim >= 3]\n", " # add a new dimension `ensemble` of size 1\n", " # and replace the existing 3D variables.\n", " ds = ds.update(ds[names].expand_dims(\"ensemble\"))\n", " return ds\n", "\n", "\n", "combined = xr.open_mfdataset(\n", " # make sure we sort\n", " sorted(files),\n", " # chunk each individual file\n", " chunks={\"time\": 365 * 5},\n", " # Add the ensemble dimension to 3D variables\n", " preprocess=add_ensemble_dim,\n", " # concatenate along a new dimension called \"ensemble\"\n", " concat_dim=\"ensemble\",\n", " # only concatenate variables with the `ensemble` dimension.\n", " data_vars=\"minimal\",\n", " coords=\"minimal\",\n", " compat=\"override\",\n", " combine=\"nested\",\n", " # parallelize reading of each file using dask\n", " parallel=True,\n", ")\n", "combined" ] }, { "cell_type": "markdown", "id": "d40c4c32-c155-4675-bc08-7a030bf2f1d7", "metadata": {}, "source": [ "## Note that on-disk chunking matters\n", "\n", "Running\n", "```\n", "ncdump -sh /glade/scratch/anukesh/CESM-LE/PRECT/ENSEMBLE/b.e21.BHISTsmbb.f09_g17.LE2-1251.011.cam.h1.PRECT.18500101-21001231.nc\n", "```\n", "shows\n", "```\n", "float PRECT(time, lat, lon) ;\n", "\t\tPRECT:units = \"m/s\" ;\n", "\t\tPRECT:long_name = \"Total (convective and large-scale) precipitation rate (liq + ice)\" ;\n", "\t\tPRECT:cell_methods = \"time: mean\" ;\n", "\t\tPRECT:_Storage = \"chunked\" ;\n", "\t\tPRECT:_ChunkSizes = 1, 192, 288 ;\n", "\t\tPRECT:_DeflateLevel = 1 ;\n", "\t\tPRECT:_Shuffle = \"true\" ;\n", "\t\tPRECT:_Endianness = \"little\" ;\n", "```\n", "\n", "This bit is important: `PRECT:_ChunkSizes = 1, 192, 288 ;` The data on-disk is chunked to have a chunksize of 1 along time, and all spatial points in one chunk. 
This is orthogonal to our proposed chunking scheme of chunking small in space and big in time (`chunks={\"lat\": 16, \"lon\": 32}`).\n", "\n", "Actually reading data with `chunks={\"lat\": 16, \"lon\": 32}` will be quite slow: building a single dask chunk then requires reading (and decompressing) every on-disk chunk in the file, so we effectively read the whole file for each chunk we construct. This is why the full read above used `chunks={\"time\": 365 * 5}` instead, which lines each dask chunk up with many whole on-disk chunks.\n", "\n", "\n", "```{tip}\n", "See this Unidata [blogpost](https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters) on netCDF chunking for more.\n", "```\n",
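 "\n",
 "You can also check the on-disk chunking from Python rather than with `ncdump`: for netCDF4 files the backend records it in the variable's `encoding`. A quick sketch:\n",
 "\n",
 "```python\n",
 "single = xr.open_dataset(files[0])\n",
 "# the encoding dict typically includes \"chunksizes\", \"zlib\", \"shuffle\", etc.\n",
 "print(single.PRECT.encoding.get(\"chunksizes\"))\n",
 "```"
] } ], "metadata": { "author": "Deepak Cherian", "date": "Aug 16, 2023", "kernelspec": { "display_name": "miniconda3-pump", "language": "python", "name": "conda-env-miniconda3-pump-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }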