Comparing Atmospheric Model Output with Observations Using Intake-ESM#

Comparing models and observations is a critical component of climate diagnostic packages. The process can be challenging, though, given the number of observational datasets to compare against and their differing spatiotemporal resolutions. The previous iteration of the diagnostics package for atmospheric data from the Community Earth System Model (CESM) used pre-computed observational datasets stored in a directory on the GLADE filesystem (/glade/p/cesm/amwg/amwg_diagnostics/obs_data).

In this example, we walk through generating an intake-esm catalog from the observational data, reading in CESM data, and comparing models and observations.

Imports#

import ast
import glob
import pathlib
import traceback
import warnings
from datetime import datetime

import cartopy.crs as ccrs
import geocat.comp
import geoviews.feature as gf
import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
from distributed import Client
from ecgtools import Builder
from ecgtools.builder import INVALID_ASSET, TRACEBACK
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')

Build an `Intake-ESM` catalog from the observational dataset#

Before we jump into building the catalog, let’s take a look at what data we are working with
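Later cells index into a list called files, but the cell that builds it is not shown. A minimal sketch of how such a list might be gathered with glob, assuming the climatologies are netCDF files sitting directly in the observational directory (the helper name list_obs_files and the '*.nc' pattern are assumptions, not part of the original notebook):

```python
import glob
import pathlib

# Observational data directory quoted in the introduction
OBS_DIR = '/glade/p/cesm/amwg/amwg_diagnostics/obs_data'


def list_obs_files(directory, pattern='*.nc'):
    """Return a sorted list of file paths in `directory` matching `pattern`."""
    return sorted(glob.glob(str(pathlib.Path(directory) / pattern)))


# Off GLADE this directory does not exist, so the list is simply empty
files = list_obs_files(OBS_DIR)
print(f'Found {len(files)} files')
```

Sorting keeps the order reproducible between runs, which matters when the examples refer to files[0].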

Build a parser using `ecgtools`#

We can use the ecgtools package to help build an intake-esm catalog for this dataset!

We start first by using pathlib to generically look at the filepaths

path = pathlib.Path(files[0])
path.stem

'AIRS_01_climo'

Splitting the stem with .split('_') separates it into the following:

observational dataset source
month number or season
“climo”

path.stem.split('_')

['AIRS', '01', 'climo']
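The naming convention above can be turned into a small lookup before any file is opened. The sketch below is a standard-library illustration only; describe_stem and the fixed month table are hypothetical helpers, and the seasonal stem 'AIRS_DJF_climo' is an assumed example of the "season" form rather than a file name confirmed in the text:

```python
# Fixed table avoids locale-dependent month names
MONTH_ABBREVIATIONS = [
    'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
    'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC',
]


def describe_stem(stem):
    """Split a '<source>_<month or season>_climo' stem into labeled parts."""
    parts = stem.split('_')
    unknown = {'source': parts[0], 'time_period': None, 'temporal': None}
    if len(parts) < 2:
        return unknown

    temporal = parts[-2]
    # Two digits: a month number such as '01'
    if len(temporal) == 2 and temporal.isdigit() and 1 <= int(temporal) <= 12:
        month = MONTH_ABBREVIATIONS[int(temporal) - 1]
        return {'source': parts[0], 'time_period': 'monthly', 'temporal': month}
    # Three characters: a season label such as 'DJF' or 'ANN'
    if len(temporal) == 3:
        return {'source': parts[0], 'time_period': 'seasonal', 'temporal': temporal}
    return unknown


print(describe_stem('AIRS_01_climo'))
print(describe_stem('AIRS_DJF_climo'))
```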

We can also gather useful insight by opening the file!

ds = xr.open_dataset(files[0])

Let’s look at the variable “Temperature” (T)

You’ll notice that we have additional attributes about the dataset which we can use, including:

units
long_name
climatology, which includes more information about what dates were used

Adding the parsing function#

def parse_amwg_obs(file):
    """Atmospheric observational data used within the AMWG diagnostic package

    Parameters
    ----------
    file: str
      filepath from the data directory including observational climatologies

    Returns
    -------
    info: dict
       Dictionary with information to add as columns in the data catalog

    """
    file = pathlib.Path(file)
    info = {}

    try:
        stem = file.stem
        split = stem.split('_')
        info['source'] = split[0]
        # Second-to-last token: a two-digit month ('01') or a season ('DJF')
        temporal = split[-2]
        if len(temporal) == 2:
            month_number = int(temporal)
            info['time_period'] = 'monthly'
            info['temporal'] = datetime(2020, month_number, 1).strftime('%b').upper()

        elif len(temporal) == 3:
            info['time_period'] = 'seasonal'
            info['temporal'] = temporal

        else:
            info['time_period'] = np.nan
            info['temporal'] = np.nan

        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:
            # Keep only the variables that have a time dimension
            variables = [v for v, da in ds.variables.items() if 'time' in da.dims]
            info['variables'] = variables

            # Record the file being parsed, not the notebook-level `path` variable
            info['path'] = str(file)

        return info

    except Exception:
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}

'time' in ds['T'].dims

True

info = parse_amwg_obs(files[0])
info

{'source': 'AIRS',
 'time_period': 'monthly',
 'temporal': 'JAN',
 'variables': ['T', 'time', 'RELHUM', 'O3', 'SHUM'],
 'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}

pd.DataFrame(info)

	source	time_period	temporal	variables	path
0	AIRS	monthly	JAN	T	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1	AIRS	monthly	JAN	time	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
2	AIRS	monthly	JAN	RELHUM	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3	AIRS	monthly	JAN	O3	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
4	AIRS	monthly	JAN	SHUM	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

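Above, we tested the parser on a single file. Passing the resulting dictionary straight to pandas.DataFrame repeats the scalar fields across one row per variable. Wrapping the dictionaries in a list instead gives one row per file, with the variables kept together as a list: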

pd.DataFrame([info, info])

	source	time_period	temporal	variables	path
0	AIRS	monthly	JAN	[T, time, RELHUM, O3, SHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1	AIRS	monthly	JAN	[T, time, RELHUM, O3, SHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

Use `ecgtools` to build the catalog#

Now that we have our parser ready to go, we can use ecgtools to build the catalog

b = Builder(
    # Directory containing the observational files
    "/glade/p/cesm/amwg/amwg_diagnostics/obs_data/",
    depth=3,
    # Number of jobs to execute - should be equal to # threads you are using
    njobs=-1,
)

Once the builder is set up, we can call .build, passing in the parser we built

b.build(parse_amwg_obs)

b.df

	source	time_period	temporal	variables	path
0	AIRS	monthly	JAN	[T, time, RELHUM, O3, SHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1	AIRS	monthly	FEB	[T, time, RELHUM, O3, SHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
2	AIRS	monthly	MAR	[T, time, RELHUM, O3, SHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3	AIRS	monthly	APR	[T, time, RELHUM, O3, SHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
4	AIRS	monthly	MAY	[T, time, RELHUM, O3, SHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
...	...	...	...	...	...
3778	mlsw	seasonal	ANN	[T, time, Z3, H2O, O3, RELHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3779	mlsw	seasonal	DJF	[T, time, Z3, H2O, O3, RELHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3780	mlsw	seasonal	JJA	[T, time, Z3, H2O, O3, RELHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3781	mlsw	seasonal	MAM	[T, time, Z3, H2O, O3, RELHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3782	mlsw	seasonal	SON	[T, time, Z3, H2O, O3, RELHUM]	/glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

1243 rows × 5 columns

Save the catalog#

b.save(
    # File path - could save as .csv (uncompressed csv) or .csv.gz (compressed csv)
    "/glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.csv",
    # Column name including filepath
    path_column_name='path',
    # Column name including variables
    variable_column_name='variables',
    # Data file format - could be netcdf or zarr (in this case, netcdf)
    data_format="netcdf",
    # Which attributes to groupby when reading in variables using intake-esm
    groupby_attrs=["source", "time_period"],
    # Aggregations which are fed into xarray when reading in data using intake
    aggregations=[
        {
            'type': 'join_new',
            'attribute_name': 'temporal',
            'options': {'coords': 'minimal', 'compat': 'override'},
        },
    ],
)

Saved catalog location: /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json and /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.csv

/glade/scratch/mgrover/ipykernel_140729/3449059562.py:1: UserWarning: Unable to parse 2541 assets/files. A list of these assets can be found in /glade/work/mgrover/intake-esm-catalogs/invalid_assets_amwg_obs_datasets.csv.
  b.save(
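The join_new aggregation on temporal asks intake-esm to stack the matching files along a new temporal dimension when the data are read. As a rough illustration of that idea, and not a copy of intake-esm's internals, the sketch below applies xarray.concat with the same options to two small synthetic datasets (make_month and the temperature values are made up for this example):

```python
import numpy as np
import xarray as xr


def make_month(value):
    """Build a tiny stand-in for one monthly climatology file."""
    return xr.Dataset(
        {'T': (('time', 'lat'), np.full((1, 2), value))},
        coords={'time': [0], 'lat': [-45.0, 45.0]},
    )


# One dataset per catalog row, keyed by its `temporal` value
monthly = {'JAN': make_month(250.0), 'FEB': make_month(252.0)}

# Stack along a new dimension, then label it with the catalog values
combined = xr.concat(
    list(monthly.values()), dim='temporal', coords='minimal', compat='override'
).assign_coords(temporal=list(monthly))

print(combined['T'].dims)
```

Grouping by source and time_period while joining on temporal is what later yields one dataset per source and time period, each with a temporal dimension.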

Read in Observational Data from the Catalog#

We use intake-esm to read in the observational data from the catalog we just created

obs_catalog = intake.open_esm_datastore(
    "/glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json",
    csv_kwargs={"converters": {"variables": ast.literal_eval}},
    sep="/",
)
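The converters argument is needed because a list-valued column such as variables is stored in the CSV as its string representation. The sketch below shows that round trip using only the standard library; the in-memory CSV and its single row are illustrative and not taken from the real catalog:

```python
import ast
import csv
import io

rows = [{'source': 'AIRS', 'variables': ['T', 'RELHUM']}]

# Write the row; the csv module stringifies the list with str()
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['source', 'variables'])
writer.writeheader()
writer.writerows(rows)

# Read it back: the column is now plain text, not a list
buffer.seek(0)
raw = next(csv.DictReader(buffer))
print(type(raw['variables']).__name__, raw['variables'])

# ast.literal_eval safely turns the text back into a Python list
restored = ast.literal_eval(raw['variables'])
print(type(restored).__name__, restored)
```

Without the conversion, a search such as variables='T' would be comparing against a single string instead of the individual variable names.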

We are interested in Temperature (T), from the AIRS dataset, which comes from NASA

obs_catalog_subset = obs_catalog.search(variables='T', source='AIRS')

dsets = obs_catalog_subset.to_dataset_dict()

--> The keys in the returned dictionary of datasets are constructed as follows:
	'source/time_period'

100.00% [2/2 00:00<00:00]

Our dictionary of datasets has two keys, one for seasonal and the other for monthly climatologies

dsets.keys()

dict_keys(['AIRS/seasonal', 'AIRS/monthly'])

seasonal_obs_ds = dsets['AIRS/seasonal']
monthly_obs_ds = dsets['AIRS/monthly']

Plot Observational Temperature Climatologies#

Now that we read in our data, we can plot up our results!

Plot Seasonal Temperature Climatologies from Observations#

seasonal_obs_plot = (
    seasonal_obs_ds.isel(time=0, lev=range(3)).T.hvplot.quadmesh(
        rasterize=True, groupby=['lev', 'temporal'], cmap='magma', projection=ccrs.Robinson()
    )
    * gf.coastline
)
seasonal_obs_plot

Plot Monthly Temperature Climatologies from Observations#

monthly_obs_plot = (
    monthly_obs_ds.isel(time=0, lev=range(3)).T.hvplot.quadmesh(
        rasterize=True, groupby=['lev', 'temporal'], cmap='magma', projection=ccrs.Robinson()
    )
    * gf.coastline
)
monthly_obs_plot

Read in CESM Output#

cesm_data_catalog = intake.open_esm_datastore(
    '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
)

/glade/work/mgrover/miniconda3/envs/cesm-collections-dev/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (8,9) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)

cluster = NCARCluster(memory='10 GB')
cluster.scale(20)
client = Client(cluster)

client

Client

Connection method: Cluster object	Cluster type: PBSCluster
Workers: 20	Total threads: 40	Total memory: 186.20 GiB
Threads per worker: 2	Memory per worker: 9.31 GiB

Query for monthly temperature (`T`) values, using the historical experiment#

cesm_data_catalog_subset = cesm_data_catalog.search(
    variable='T', control_branch_year=1001, frequency='month_1', experiment='historical'
)

Since we are taking the average over time, we choose to chunk by vertical level (lev)

dsets = cesm_data_catalog_subset.to_dataset_dict(cdf_kwargs={'chunks': {'lev': 5}})

--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.stream.forcing_variant.control_branch_year.variable'

100.00% [1/1 00:00<00:00]

We only have a single key, which follows the component.experiment.stream.forcing_variant.control_branch_year.variable pattern

ds = dsets['atm.historical.cam.h0.cmip6.1001.T']

Our dataset has a chunk size of ~127 MiB, which is close to the chunk size of around 100 MB that is often suggested for Dask arrays!

ds.T

<xarray.DataArray 'T' (member_id: 1, time: 1980, lev: 32, lat: 192, lon: 288)>
dask.array<broadcast_to, shape=(1, 1980, 32, 192, 288), dtype=float32, chunksize=(1, 120, 5, 192, 288), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * lev        (lev) float64 3.643 7.595 14.36 24.61 ... 936.2 957.5 976.3 992.6
  * time       (time) object 1860-02-01 00:00:00 ... 1980-01-01 00:00:00
  * member_id  (member_id) <U11 'r1i1001p1f1'
Attributes:
    mdims:         1
    units:         K
    long_name:     Temperature
    cell_methods:  time: mean

	Array	Chunk
Bytes	13.05 GiB	126.56 MiB
Shape	(1, 1980, 32, 192, 288)	(1, 120, 5, 192, 288)
Count	374 Tasks	119 Chunks
Type	float32	numpy.ndarray
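The chunk figures reported above follow directly from the array shape, the chunk shape, and the 4-byte float32 element size. A short standard-library check (chunk_stats is a hypothetical helper written for this illustration):

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, full-chunk size in MiB, total size in GiB)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    total_gib = math.prod(shape) * itemsize / 2**30
    return n_chunks, chunk_mib, total_gib


shape = (1, 1980, 32, 192, 288)   # member_id, time, lev, lat, lon
chunks = (1, 120, 5, 192, 288)    # chunk shape from the repr above
n_chunks, chunk_mib, total_gib = chunk_stats(shape, chunks, itemsize=4)

print(f'{n_chunks} chunks, {chunk_mib:.2f} MiB per full chunk, {total_gib:.2f} GiB total')
```

Chunks at the edges of the time and lev dimensions are smaller than a full chunk, because 1980 and 32 are not exact multiples of 120 and 5.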

Plot CESM2-LE Temperature Climatologies#
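The plots in this section use cesm_seasonal_mean_temperature and cesm_monthly_mean_temperature, but the step that computes them is not shown. The imports include geocat.comp, which may have been used for this. As a stand-in, the sketch below computes unweighted seasonal and monthly means with plain xarray groupby on a small synthetic dataset. The helper simple_climatology and the demo data are assumptions for illustration:

```python
import numpy as np
import pandas as pd
import xarray as xr


def simple_climatology(ds, freq):
    """Average all time steps that share the same 'season' or 'month'."""
    if freq not in ('season', 'month'):
        raise ValueError("freq must be 'season' or 'month'")
    return ds.groupby(f'time.{freq}').mean('time')


# Two years of synthetic monthly data standing in for the CESM output
time = pd.date_range('2000-01-01', periods=24, freq='MS')
demo = xr.Dataset(
    {'T': (('time', 'lat'), np.arange(48, dtype='float32').reshape(24, 2))},
    coords={'time': time, 'lat': [-45.0, 45.0]},
)

demo_seasonal = simple_climatology(demo, 'season')
demo_monthly = simple_climatology(demo, 'month')
print(demo_seasonal['T'].dims, demo_monthly['T'].dims)

# With the CESM dataset `ds` loaded above, the equivalent calls would be:
# cesm_seasonal_mean_temperature = simple_climatology(ds, 'season')
# cesm_monthly_mean_temperature = simple_climatology(ds, 'month')
```

This sketch weights every month equally and takes the time stamps at face value. The CESM time coordinate above starts at 1860-02-01, which suggests monthly means are stamped at the end of each averaging interval, so a shift may be needed before grouping.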

Plot up the Seasonal Mean Temperature#

We select our single member_id and choose levels near the end of the lev dimension (indices -5 through -2), since the model's vertical levels are ordered from the top of the atmosphere to the surface, which is the inverse of how the observational data is sorted

cesm_seasonal_temperature_plot = (
    cesm_seasonal_mean_temperature.isel(member_id=0, lev=range(-5, -1)).T.hvplot.quadmesh(
        rasterize=True,
        groupby=['lev', 'season'],
        x='lon',
        y='lat',
        cmap='magma',
        projection=ccrs.Robinson(),
        project=True,
    )
    * gf.coastline
)
cesm_seasonal_temperature_plot

Plot up the Monthly Mean Temperature#

cesm_monthly_temperature_plot = (
    cesm_monthly_mean_temperature.isel(member_id=0, lev=range(-5, -1)).T.hvplot.quadmesh(
        rasterize=True,
        groupby=['lev', 'month'],
        x='lon',
        y='lat',
        cmap='magma',
        projection=ccrs.Robinson(),
        project=True,
    )
    * gf.coastline
)
cesm_monthly_temperature_plot

Conclusion#

Throughout this example, we showed how useful intake-esm can be for querying both observational and model datasets! We also showed that you can use hvPlot to plot interactive maps, even transforming to different coordinate reference systems. While the observational and model datasets did not have matching spatial or vertical coordinates, plotting these comparisons as a first cut can still be useful!

If you are interested in using the data catalogs, go for it! Here are the paths:

AMWG Observational Comparison Catalog - /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json
CESM2 Large Ensemble Catalog - /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json


27 August 2021
