{ "cells": [ { "cell_type": "markdown", "id": "starting-spokesman", "metadata": {}, "source": [ "# Building an Intake-esm catalog from CESM2 History Files\n", "\n", "As mentioned in a couple of ESDS posts ([intake-esm and Dask](https://ncar.github.io/esds/posts/intake_esm_dask/), [debugging intake-esm](https://ncar.github.io/esds/posts/intake_cmip6_debug/)), [intake-esm](https://intake-esm.readthedocs.io/en/latest/) can be a helpful tool to work with when dealing with model data, especially CESM. One of the requirements for using intake-esm is having a catalog, which consists of two pieces:\n", "* A table of the relevant metadata (e.g., file path, variable, stream, etc.)\n", "* A JSON file describing the dataset, including how to aggregate the variables\n", "\n", "Typically, these pieces are constructed \"manually\" using information within the file path, on a very ad-hoc basis. Also, these catalogs are typically only created for \"larger\" community datasets, and are not necessarily used within smaller model runs or daily workflows. A new package (currently a prototype) called [ecgtools](https://ecgtools.readthedocs.io/en/latest/) works to solve the issues of generating these intake-esm catalogs. Ecgtools stands for Earth System Model (ESM) Catalog Generation tools. The current catalog generation tools supported are:\n", "* CMIP6 models\n", "* CESM \"history\" files\n", "* CESM \"timeseries\" files\n", "\n", "This package has not officially been released yet on [PyPI](https://pypi.org/) or [conda-forge](https://conda-forge.org/), but this release will likely happen soon. This post will give an overview of using [ecgtools](https://ecgtools.readthedocs.io/en/latest/) for parsing CESM history file model output, and reading in the data using \n", "[intake-esm](https://intake-esm.readthedocs.io/en/latest/). 
In this example, we use model output using the default component-set (compset) detailed in the [CESM Quickstart Guide](https://escomp.github.io/CESM/versions/cesm2.1/html/).\n", "\n", "## What's a \"history\" file?\n", "A history file is the default output from CESM, where each file is a single time \"slice\" with every variable from the component of interest. These types of files can be difficult to work with, since one is often interested in a time series of a single variable. Building a catalog can be helpful in accessing your data, querying for certain variables, and potentially creating timeseries files later down the road.\n", "\n", "Let's get started!\n", "\n", "## Downloading the beta version of ecgtools\n", "In order to access this in its current state, you will need to clone the repository from GitHub. On the machine with the data you are planning on creating a catalog for, run the following:\n", "\n", "```\n", "git clone https://github.com/NCAR/ecgtools.git\n", "```\n", "\n", "This will create a clone of the repository on your machine. After you clone the repository, run\n", "\n", "```\n", "pip install -e ecgtools\n", "```\n", "\n", "This will install the package into the Python environment you currently have activated (for more on dealing with conda environments, check out the [faq page](https://ncar.github.io/esds/faq/#conda-environments)).\n", "\n", "You will also want to install intake-esm, which you can install from `conda-forge`\n", "\n", "```\n", "conda install -c conda-forge intake-esm\n", "```\n", "\n", "## Imports\n", "The only parts of ecgtools we need are the `Builder` object and the `parse_cesm_history` parser from the CESM parsers! We import `glob` to take a look at the files we are parsing." 
] }, { "cell_type": "code", "execution_count": 1, "id": "nominated-battle", "metadata": {}, "outputs": [], "source": [ "import glob\n", "\n", "from ecgtools import Builder\n", "from ecgtools.parsers.cesm import parse_cesm_history" ] }, { "cell_type": "markdown", "id": "czech-stylus", "metadata": {}, "source": [ "### Understanding the Directory Structure\n", "\n", "The first step to setting up the `Builder` object is determining where your files are stored. As mentioned previously, we have a sample dataset of CESM2 model output, which is stored in `/glade/work/mgrover/cesm_test_data/`\n", "\n", "Taking a look at that directory, we see that there is a single case `b.e20.B1850.f19_g17.test`" ] }, { "cell_type": "code", "execution_count": 2, "id": "south-nancy", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/*')" ] }, { "cell_type": "markdown", "id": "literary-scholarship", "metadata": {}, "source": [ "Once we go into that directory, we see all the different components, including the atmosphere (atm), ocean (ocn), and land (lnd)!" 
] }, { "cell_type": "code", "execution_count": 3, "id": "equipped-recycling", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/logs',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/cpl',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/lnd',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/esp',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/glc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/rof',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/rest',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/wav',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ice']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/*')" ] }, { "cell_type": "markdown", "id": "relevant-conflict", "metadata": {}, "source": [ "If we go one step further, we notice that within each component there is a `hist` directory, which contains the model output" ] }, { "cell_type": "code", "execution_count": 4, "id": "velvet-synthesis", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/hist/b.e20.B1850.f19_g17.test.cam.h0.0002-08.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/hist/b.e20.B1850.f19_g17.test.cam.h0.0001-09.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/hist/b.e20.B1850.f19_g17.test.cam.h0.0002-07.nc']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/*/*.nc')[0:3]" ] }, { 
"cell_type": "markdown", "id": "viral-stroke", "metadata": {}, "source": [ "If we take a look at the `ocn` component though, we notice that there are a few timeseries files in there..." ] }, { "cell_type": "code", "execution_count": 5, "id": "medium-bishop", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/tseries/b.e20.B1850.f19_g17.test.pop.h.pCO2SURF.000101-001012.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/tseries/b.e20.B1850.f19_g17.test.pop.h.SiO3_RIV_FLUX.000101-001012.nc',\n", " '/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/tseries/b.e20.B1850.f19_g17.test.pop.h.graze_sp_zootot.000101-001012.nc']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "glob.glob('/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/*/*.nc')[0:3]" ] }, { "cell_type": "markdown", "id": "sunset-animation", "metadata": {}, "source": [ "When we set up our catalog builder, we will need to exclude the timeseries (`tseries`) and restart (`rest`) directories!\n", "\n", "Now that we understand the directory structure, let's make the catalog." 
] }, { "cell_type": "markdown", "id": "growing-anthony", "metadata": {}, "source": [ "## Build the catalog!\n", "\n", "Let's start by inspecting the builder object" ] }, { "cell_type": "code", "execution_count": 6, "id": "promising-installation", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\u001b[0;31mInit signature:\u001b[0m\n", "\u001b[0mBuilder\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mroot_path\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mpydantic\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDirectoryPath\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mextension\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'.nc'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mdepth\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mexclude_patterns\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mList\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m \u001b[0mnjobs\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mint\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\n", "\u001b[0;34m\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m \n", "Generates a catalog from a list of files.\n", "\n", "Parameters\n", "----------\n", "root_path : str\n", " Path of root directory.\n", "extension : str, optional\n", " File extension, by default None. 
If None, the builder will look for files with\n", " \"*.nc\" extension.\n", "depth : int, optional\n", " Recursion depth. Recursively crawl `root_path` up to a specified depth, by default 0\n", "exclude_patterns : list, optional\n", " Directory, file patterns to exclude during catalog generation.\n", " These could be substring or regular expressions. by default None\n", "njobs : int, optional\n", " The maximum number of concurrently running jobs,\n", " by default -1 meaning all CPUs are used.\n", "\u001b[0;31mFile:\u001b[0m /glade/work/mgrover/git_repos/ecgtools/ecgtools/builder.py\n", "\u001b[0;31mType:\u001b[0m type\n", "\u001b[0;31mSubclasses:\u001b[0m \n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Builder?" ] }, { "cell_type": "markdown", "id": "ccf85513-d76a-4b31-9908-dc6af39afc12", "metadata": {}, "source": [ "
Info: Note that as of 21 June 2021, the `parsing_func` parameter is now used in the `.build()` method!" ] }

| | component | stream | case | date | frequency | variables | path |
|---|---|---|---|---|---|---|---|
| 0 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-08 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 1 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0001-09 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 2 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-07 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 3 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0003-05 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 4 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-01 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 259 | ice | cice.h | b.e20.B1850.f19_g17.test | 0001-08 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 260 | ice | cice.h | b.e20.B1850.f19_g17.test | 0001-03 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 261 | ice | cice.h | b.e20.B1850.f19_g17.test | 0002-11 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 262 | ice | cice.h | b.e20.B1850.f19_g17.test | 0002-10 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 263 | ice | cice.h | b.e20.B1850.f19_g17.test | 0003-12 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |

259 rows × 7 columns
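
The `component`, `stream`, `case`, and `date` columns above are the kind of fields a parsing function pulls out of each file path. The following is a minimal, standard-library-only sketch of that idea, not the actual `parse_cesm_history` implementation. It first drops `tseries` and `rest` paths using glob-style patterns (the pattern strings are an assumption for illustration) and then splits a history file name into its pieces. The `KNOWN_MODELS` tuple and both helper names are hypothetical, and columns such as `variables`, which presumably require opening the file, are skipped here.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Assumed glob-style patterns for directories that should not be cataloged.
EXCLUDE_PATTERNS = ["*/tseries/*", "*/rest/*"]

# Hypothetical, incomplete list of model names that mark the start of a stream.
KNOWN_MODELS = ("cam", "pop", "clm2", "cice", "mosart", "cism")


def is_excluded(path, patterns=EXCLUDE_PATTERNS):
    """Return True if the path matches any exclusion pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def parse_history_path(path):
    """Split .../<component>/hist/<case>.<stream>.<date>.nc into fields."""
    p = PurePosixPath(path)
    # Drop the ".nc" suffix (assumed present) and split on dots.
    tokens = p.name[: -len(".nc")].split(".")
    # The stream starts at the first token that names a known model.
    start = next(i for i, token in enumerate(tokens) if token in KNOWN_MODELS)
    return {
        "component": p.parts[-3],
        "stream": ".".join(tokens[start:-1]),
        "case": ".".join(tokens[:start]),
        "date": tokens[-1],
        "path": path,
    }


paths = [
    "/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/atm/hist/"
    "b.e20.B1850.f19_g17.test.cam.h0.0002-08.nc",
    "/glade/work/mgrover/cesm_test_data/b.e20.B1850.f19_g17.test/ocn/tseries/"
    "b.e20.B1850.f19_g17.test.pop.h.pCO2SURF.000101-001012.nc",
]

# Only the history file survives the exclusion step and gets parsed.
rows = [parse_history_path(p) for p in paths if not is_excluded(p)]
print(rows)
```

Keeping the exclusion step separate from the parsing step means the parser only ever sees paths that follow the history file naming convention.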

| | INVALID_ASSET | TRACEBACK |
|---|---|---|
| 15 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 28 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 34 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 130 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
| 191 | /glade/work/mgrover/cesm_test_data/b.e20.B1850... | Traceback (most recent call last):\n File "/g... |
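
A table like this can be produced by catching parser failures instead of letting them stop the build. Below is a minimal sketch under that assumption. `toy_parser` and `parse_all` are hypothetical names, and the only things taken from the output above are the `INVALID_ASSET` and `TRACEBACK` column names; the way ecgtools handles this internally may differ.

```python
import traceback


def toy_parser(path):
    """Hypothetical parser: expects names like <case>.<model>.<stream>.<date>.nc."""
    name = path.rsplit("/", 1)[-1]
    tokens = name[: -len(".nc")].split(".")
    if not name.endswith(".nc") or len(tokens) < 4:
        raise ValueError(f"Unexpected file name: {name}")
    return {"path": path, "date": tokens[-1]}


def parse_all(paths, parser):
    """Parse every path, collecting failures rather than raising."""
    valid, invalid = [], []
    for path in paths:
        try:
            valid.append(parser(path))
        except Exception:
            # Keep the offending path and the full traceback text for later inspection.
            invalid.append(
                {"INVALID_ASSET": path, "TRACEBACK": traceback.format_exc()}
            )
    return valid, invalid


valid, invalid = parse_all(
    ["/data/case.cam.h0.0001-01.nc", "/data/bad.nc"], toy_parser
)
print(len(valid), "valid;", len(invalid), "invalid")
```

Dropping failed entries in this way would also be consistent with the catalog index running to 263 while only 259 rows remain: 264 assets minus the 5 invalid ones listed above.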

| | component | stream | case | date | frequency | variables | path |
|---|---|---|---|---|---|---|---|
| 0 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-08 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 1 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0001-09 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 2 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-07 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 3 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0003-05 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 4 | atm | cam.h0 | b.e20.B1850.f19_g17.test | 0002-01 | month_1 | [date, datesec, date_written, time_written, nd... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 259 | ice | cice.h | b.e20.B1850.f19_g17.test | 0001-08 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 260 | ice | cice.h | b.e20.B1850.f19_g17.test | 0001-03 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 261 | ice | cice.h | b.e20.B1850.f19_g17.test | 0002-11 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 262 | ice | cice.h | b.e20.B1850.f19_g17.test | 0002-10 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |
| 263 | ice | cice.h | b.e20.B1850.f19_g17.test | 0003-12 | month_1 | [hi, hs, snowfrac, Tsfc, aice, uvel, vvel, uat... | /glade/work/mgrover/cesm_test_data/b.e20.B1850... |

259 rows × 7 columns

None catalog with 9 dataset(s) from 259 asset(s):

| | unique |
|---|---|
| component | 6 |
| stream | 9 |
| case | 1 |
| date | 79 |
| frequency | 4 |
| variables | 1447 |
| path | 259 |
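
The `unique` column counts the distinct values found in each catalog column. Here is a sketch of that count in plain Python. The rows are made up, `count_unique` is a hypothetical helper, and flattening list-valued columns such as `variables` before counting is an assumption about how a figure like 1447 arises.

```python
def count_unique(rows):
    """Count distinct values per column, flattening list-valued cells."""
    unique = {}
    for row in rows:
        for column, value in row.items():
            # Treat a list cell as several values, anything else as one value.
            values = value if isinstance(value, list) else [value]
            unique.setdefault(column, set()).update(values)
    return {column: len(values) for column, values in unique.items()}


# Made-up rows shaped like the catalog table.
rows = [
    {"component": "atm", "stream": "cam.h0", "date": "0001-01", "variables": ["T", "U"]},
    {"component": "atm", "stream": "cam.h0", "date": "0001-02", "variables": ["T", "V"]},
    {"component": "ice", "stream": "cice.h", "date": "0001-01", "variables": ["hi", "hs"]},
]

summary = count_unique(rows)
print(summary)
```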
None catalog with 1 dataset(s) from 36 asset(s):
\n", " | unique | \n", "
---|---|
component | \n", "1 | \n", "
stream | \n", "1 | \n", "
case | \n", "1 | \n", "
date | \n", "36 | \n", "
frequency | \n", "1 | \n", "
variables | \n", "434 | \n", "
path | \n", "36 | \n", "
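
A summary with a single component and a single stream is what one would expect after narrowing a catalog with a query. The sketch below mimics that idea on plain dictionaries. The `search` helper and the sample rows are hypothetical and only illustrate keyword-based filtering; this is not the intake-esm API.

```python
def search(rows, **query):
    """Keep only the rows whose columns equal every requested value."""
    return [
        row
        for row in rows
        if all(row.get(column) == value for column, value in query.items())
    ]


# Made-up rows shaped like the catalog table.
rows = [
    {"component": "atm", "stream": "cam.h0", "date": "0001-01"},
    {"component": "atm", "stream": "cam.h0", "date": "0001-02"},
    {"component": "ocn", "stream": "pop.h", "date": "0001-01"},
    {"component": "ocn", "stream": "pop.h", "date": "0001-02"},
]

subset = search(rows, component="ocn")
streams = {row["stream"] for row in subset}
print(len(subset), "assets across", len(streams), "stream(s)")
```

Passing more than one keyword, for example `search(rows, component="ocn", date="0001-01")`, narrows the result further because every condition must hold.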