Comparing Atmospheric Model Output with Observations Using Intake-ESM#

Comparing models and observations is a critical component of climate diagnostic packages. This process can be challenging, though, given the number of observational datasets to compare against and the differences in spatiotemporal resolution. The previous iteration of the diagnostics package used for atmospheric data from the Community Earth System Model (CESM) relied on pre-computed observational datasets stored in a directory on the GLADE filesystem (/glade/p/cesm/amwg/amwg_diagnostics/obs_data).

Within this example, we walk through generating an intake-esm catalog from the observational data, reading in CESM data, and comparing the models and observations.

Imports#

import ast
import glob
import pathlib
import traceback
import warnings
from datetime import datetime

import cartopy.crs as ccrs
import geocat.comp
import geoviews.feature as gf
import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
from distributed import Client
from ecgtools import Builder
from ecgtools.builder import INVALID_ASSET, TRACEBACK
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')

Build an Intake-ESM catalog from the observational dataset#

Before we jump into building the catalog, let’s take a look at what data we are working with

Investigate the File Structure and Sample Dataset#

files = sorted(glob.glob('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/*'))
files[::20]
['/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES2_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSATCOSP_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSAT_10_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ECMWF_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/EP.ERAI_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERAI_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERBE_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERS_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/GPCP_JJA_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_CL_03_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PD_02_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PI_05_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPCOSP_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPFD_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCP_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/JRA25_SON_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/LEGATES_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRAW_19x2_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRA_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MISRCOSP_JJA_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MODIS_ANN_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/NVAP_03_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/PRECL_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_SEAICE_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/TRMM_MAM_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WARREN_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WILLMOTT_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/XIEARKIN_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsg_10_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlso_ANN_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsw_MAM_climo.nc']

Observational datasets in this directory follow the convention source_(month or season)_climo.nc.

Let’s open up one of those datasets

ds = xr.open_dataset('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc')
ds
<xarray.Dataset>
Dimensions:  (lon: 360, lat: 180, time: 1)
Coordinates:
  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * time     (time) float32 1.0
Data variables:
    SOLIN    (time, lat, lon) float32 495.0 495.0 495.0 495.0 ... 0.0 0.0 0.0
    FLUT     (time, lat, lon) float32 187.4 187.4 187.4 ... 170.7 170.7 170.7
    FLUTC    (time, lat, lon) float32 188.8 188.8 188.8 ... 178.9 178.9 178.9
    FSNTOA   (time, lat, lon) float32 147.8 147.8 147.8 ... -0.049 -0.049 -0.049
    FSNTOAC  (time, lat, lon) float32 150.9 150.9 150.9 ... -0.006 -0.006 -0.006
    SWCF     (time, lat, lon) float32 -3.149 -3.149 -3.149 ... -0.043 -0.043
    LWCF     (time, lat, lon) float32 1.391 1.391 1.391 ... 8.272 8.272 8.272
    RESTOA   (time, lat, lon) float32 -39.6 -39.6 -39.6 ... -170.7 -170.7 -170.7
    ALBEDO   (time, lat, lon) float32 0.7015 0.7015 0.7015 ... nan nan nan
    ALBEDOC  (time, lat, lon) float32 0.6951 0.6951 0.6951 ... nan nan nan
    gw       (lat) float64 0.0001523 0.0004569 0.0007613 ... 0.0004569 0.0001523
Attributes:
    version:             This is version 2.8: March 7, 2014
    institution:         NASA Langley Research Center
    comment:             Data is from East to West and South to North. Climat...
    title:               CERES EBAF (Energy Balanced and Filled) Fluxes. Mont...
    AMWG_author:         Cecile Hannay
    AMWG_creation_date:  Thu Jul 24 16:08:10 MDT 2014 for AMWG package
    history:             Thu Jul 24 16:08:10 2014: ncks -A -v gw CERES2_01_cl...
    NCO:                 20140724

We see that this dataset is gridded on a global 1° grid, with several variables related to radiative fluxes (e.g. TOA net shortwave).
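
As a quick sanity check on the grid spacing, we can difference neighboring coordinate values of the ds opened above:

# The coordinates run 0.5, 1.5, 2.5, ... so the spacing is 1 degree in both directions
print(float(ds.lon[1] - ds.lon[0]), float(ds.lat[1] - ds.lat[0]))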

Build a parser using ecgtools#

We can use the ecgtools package to help build an intake-esm catalog for this dataset!

We start by using pathlib to take a generic look at the filepaths

path = pathlib.Path(files[0])
path.stem
'AIRS_01_climo'

The stem of this path can be split using .split('_'), which separates it into the following:

  • observational dataset source

  • month number or season

  • “climo”

path.stem.split('_')
['AIRS', '01', 'climo']
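
The two-digit month code can be turned into a month abbreviation with datetime, which is the same trick the parser below will use; a small sketch:

from datetime import datetime

# Map a two-digit month code such as '01' to an upper-case month abbreviation ('JAN')
month_number = int('01')
print(datetime(2020, month_number, 1).strftime('%b').upper())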

We can also gather useful insight by opening the file!

ds = xr.open_dataset(files[0])

Let’s look at the variable “Temperature” (T)

ds.T
<xarray.DataArray 'T' (time: 1, lev: 13, lat: 94, lon: 192)>
[234624 values with dtype=float32]
Coordinates:
  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54
  * time     (time) int32 1
  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0
  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
Attributes:
    units:        K
    long_name:    Temperature
    climatology:  AIRS monthly climatology 9/2002-8/2006

You’ll notice that we have additional attributes about the dataset which we can use (accessed via .attrs, as shown below), including:

  • units

  • long_name

  • climatology, which includes more information about what dates were used
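
These are exposed through the variable's .attrs dictionary, for example:

# Variable-level metadata lives in the .attrs dictionary
print(ds.T.attrs['units'])        # K
print(ds.T.attrs['long_name'])    # Temperature
print(ds.T.attrs['climatology'])  # AIRS monthly climatology 9/2002-8/2006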

Adding the parsing function#

def parse_amwg_obs(file):
    """Atmospheric observational data used within the AMWG diagnostic package

    Parameters
    ----------
    file: str
      filepath from the data directory including observational climatologies

    Returns
    -------
    info: dict
       Dictionary with information to add as columns in the data catalog

    """
    file = pathlib.Path(file)
    info = {}

    try:
        stem = file.stem
        split = stem.split('_')
        info['source'] = split[0]
        temporal = split[-2]
        if len(temporal) == 2:
            # Two-digit code (e.g. '01') indicates a monthly climatology
            month_number = int(temporal)
            info['time_period'] = 'monthly'
            info['temporal'] = datetime(2020, month_number, 1).strftime('%b').upper()

        elif len(temporal) == 3:
            # Three-letter code (e.g. 'DJF' or 'ANN') indicates a seasonal climatology
            info['time_period'] = 'seasonal'
            info['temporal'] = temporal

        else:
            info['time_period'] = np.nan
            info['temporal'] = np.nan

        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:
            # Keep only the variables that include a time dimension
            variables = [v for v, da in ds.variables.items() if 'time' in da.dims]
            info['variables'] = variables

        info['path'] = str(file)

        return info

    except Exception:
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}
We can quickly verify the check used in the parser: whether 'time' appears in a variable's dimensions

'time' in ds['T'].dims
True

We can test this on a single file!

info = parse_amwg_obs(files[0])
info
{'source': 'AIRS',
 'time_period': 'monthly',
 'temporal': 'JAN',
 'variables': ['T', 'time', 'RELHUM', 'O3', 'SHUM'],
 'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}

This dictionary can easily be converted into a pandas.DataFrame

pd.DataFrame(info)
source time_period temporal variables path
0 AIRS monthly JAN T /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1 AIRS monthly JAN time /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
2 AIRS monthly JAN RELHUM /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3 AIRS monthly JAN O3 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
4 AIRS monthly JAN SHUM /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

Passing a list of dictionaries (one per file) instead yields a single row per file, which is the layout the catalog will use

pd.DataFrame([info, info])
source time_period temporal variables path
0 AIRS monthly JAN [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1 AIRS monthly JAN [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

Use ecgtools to build the catalog#

Now that we have our parser ready to go, we can use ecgtools to build the catalog

b = Builder(
    # Directory containing the observational data
    "/glade/p/cesm/amwg/amwg_diagnostics/obs_data/",
    depth=3,
    # Number of jobs to execute in parallel; -1 uses all available cores
    njobs=-1,
)

Once we have set up the builder, we can call .build, passing in the parser we defined above

b.build(parse_amwg_obs)
b.df
source time_period temporal variables path
0 AIRS monthly JAN [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1 AIRS monthly FEB [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
2 AIRS monthly MAR [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3 AIRS monthly APR [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
4 AIRS monthly MAY [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
... ... ... ... ... ...
3778 mlsw seasonal ANN [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/m...
3779 mlsw seasonal DJF [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/m...
3780 mlsw seasonal JJA [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/m...
3781 mlsw seasonal MAM [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/m...
3782 mlsw seasonal SON [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/m...

1243 rows × 5 columns

Save the catalog#

b.save(
    # File path - could save as .csv (uncompressed csv) or .csv.gz (compressed csv)
    "/glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.csv",
    # Column name including filepath
    path_column_name='path',
    # Column name including variables
    variable_column_name='variables',
    # Data file format - could be netcdf or zarr (in this case, netcdf)
    data_format="netcdf",
    # Which attributes to groupby when reading in variables using intake-esm
    groupby_attrs=["source", "time_period"],
    # Aggregations which are fed into xarray when reading in data using intake
    aggregations=[
        {
            'type': 'join_new',
            'attribute_name': 'temporal',
            'options': {'coords': 'minimal', 'compat': 'override'},
        },
    ],
)
Saved catalog location: /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json and /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.csv
/glade/scratch/mgrover/ipykernel_140729/3449059562.py:1: UserWarning: Unable to parse 2541 assets/files. A list of these assets can be found in /glade/work/mgrover/intake-esm-catalogs/invalid_assets_amwg_obs_datasets.csv.
  b.save(
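
The warning above notes that a number of files could not be parsed and points to a CSV listing them; if you want to see which assets were skipped, that file can be read with pandas. A quick sketch using the path from the warning:

# Inspect the assets the parser could not handle
invalid = pd.read_csv('/glade/work/mgrover/intake-esm-catalogs/invalid_assets_amwg_obs_datasets.csv')
invalid.head()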

Read in Observational Data from the Catalog#

We use intake-esm to read in the observational data from the catalog we just created

obs_catalog = intake.open_esm_datastore(
    "/glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json",
    csv_kwargs={"converters": {"variables": ast.literal_eval}},
    sep="/",
)
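
The csv_kwargs argument is needed because the variables column is stored in the CSV as a stringified Python list; ast.literal_eval converts each entry back into an actual list as the catalog is read. A minimal illustration (the string below mirrors the AIRS entry shown earlier):

import ast

# Each entry in the `variables` column of the CSV is a string like this
raw = "['T', 'time', 'RELHUM', 'O3', 'SHUM']"

# ast.literal_eval safely evaluates it back into a Python list
variables = ast.literal_eval(raw)
print(variables, type(variables))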

We are interested in Temperature (T) from the AIRS dataset, which comes from NASA

obs_catalog_subset = obs_catalog.search(variables='T', source='AIRS')
dsets = obs_catalog_subset.to_dataset_dict()
--> The keys in the returned dictionary of datasets are constructed as follows:
	'source/time_period'
100.00% [2/2 00:00<00:00]

Our dictionary of datasets has two keys, one for seasonal and the other for monthly climatologies

dsets.keys()
dict_keys(['AIRS/seasonal', 'AIRS/monthly'])
seasonal_obs_ds = dsets['AIRS/seasonal']
monthly_obs_ds = dsets['AIRS/monthly']

Plot Observational Temperature Climatologies#

Now that we have read in our data, we can plot up the results!

Plot Seasonal Temperature Climatologies from Observations#

seasonal_obs_plot = (
    seasonal_obs_ds.isel(time=0, lev=range(3)).T.hvplot.quadmesh(
        rasterize=True, groupby=['lev', 'temporal'], cmap='magma', projection=ccrs.Robinson()
    )
    * gf.coastline
)
seasonal_obs_plot
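
If you would like to keep a static copy of one of these interactive figures, HoloViews can write it out to a standalone HTML file (the filename here is just an example):

# Save the interactive overlay to a standalone HTML file
hv.save(seasonal_obs_plot, 'seasonal_obs_temperature.html')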

Plot Monthly Temperature Climatologies from Observations#

monthly_obs_plot = (
    monthly_obs_ds.isel(time=0, lev=range(3)).T.hvplot.quadmesh(
        rasterize=True, groupby=['lev', 'temporal'], cmap='magma', projection=ccrs.Robinson()
    )
    * gf.coastline
)
monthly_obs_plot

Read in CESM Output#

cesm_data_catalog = intake.open_esm_datastore(
    '/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json'
)
/glade/work/mgrover/miniconda3/envs/cesm-collections-dev/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3441: DtypeWarning: Columns (8,9) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)
# Spin up a Dask cluster on NCAR's HPC system and connect a client to it
cluster = NCARCluster(memory='10 GB')
cluster.scale(20)  # request 20 workers
client = Client(cluster)
client

Client: Client-a916e23c-077a-11ec-83b6-3cecef1b11fa
Connection method: Cluster object (PBSCluster)
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/mgrover/proxy/37300/status

Query for monthly temperature (T) values, using the historical experiment#

cesm_data_catalog_subset = cesm_data_catalog.search(
    variable='T', control_branch_year=1001, frequency='month_1', experiment='historical'
)

Since we are taking the average over time, we choose to chunk along the vertical levels (lev)

dsets = cesm_data_catalog_subset.to_dataset_dict(cdf_kwargs={'chunks': {'lev': 5}})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.stream.forcing_variant.control_branch_year.variable'
100.00% [1/1 00:00<00:00]

We only have a single key, which follows the component.experiment.stream.forcing_variant.control_branch_year.variable template

ds = dsets['atm.historical.cam.h0.cmip6.1001.T']

Our dataset has a chunk size of ~125 MiB, which is around the optimal size!

ds.T
<xarray.DataArray 'T' (member_id: 1, time: 1980, lev: 32, lat: 192, lon: 288)>
dask.array<broadcast_to, shape=(1, 1980, 32, 192, 288), dtype=float32, chunksize=(1, 120, 5, 192, 288), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * lev        (lev) float64 3.643 7.595 14.36 24.61 ... 936.2 957.5 976.3 992.6
  * time       (time) object 1860-02-01 00:00:00 ... 1980-01-01 00:00:00
  * member_id  (member_id) <U11 'r1i1001p1f1'
Attributes:
    mdims:         1
    units:         K
    long_name:     Temperature
    cell_methods:  time: mean
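
As a quick back-of-the-envelope check on that chunk size, multiply the chunk shape shown above by 4 bytes per float32 value:

# Chunk shape from the repr above: (1, 120, 5, 192, 288) float32 values
chunk_bytes = 1 * 120 * 5 * 192 * 288 * 4
print(f'{chunk_bytes / 2**20:.0f} MiB')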

Compute Monthly and Seasonal Averages#

Over the summer, additional functionality was added to GeoCAT, including functions that support climatology calculations

cesm_monthly_mean_temperature = geocat.comp.climatology(ds.T, 'month')
cesm_seasonal_mean_temperature = geocat.comp.climatology(ds.T, 'season')
cesm_monthly_mean_temperature.load()
cesm_seasonal_mean_temperature.load()
<xarray.DataArray 'T' (member_id: 1, season: 4, lev: 32, lat: 192, lon: 288)>
array([[[[[271.27997, 271.27997, 271.27997, ..., 271.27997, 271.27997,
           271.27997],
          [271.22708, 271.22745, 271.22787, ..., 271.226  , 271.22632,
           271.2267 ],
          [271.1701 , 271.1706 , 271.1711 , ..., 271.16895, 271.16928,
           271.1697 ],
          ...,
          [204.17365, 204.15831, 204.14384, ..., 204.22487, 204.20694,
           204.18985],
          [204.95549, 204.9503 , 204.94557, ..., 204.97383, 204.96724,
           204.96115],
          [206.02718, 206.02718, 206.02718, ..., 206.02718, 206.02718,
           206.02718]],

         [[255.13527, 255.13527, 255.13527, ..., 255.13527, 255.13527,
           255.13527],
          [255.05142, 255.05226, 255.05318, ..., 255.0489 , 255.04971,
           255.05055],
          [254.94571, 254.94685, 254.94814, ..., 254.94264, 254.94356,
           254.9446 ],
...
          [265.77487, 265.7838 , 265.7927 , ..., 265.747  , 265.75638,
           265.76566],
          [265.74924, 265.7531 , 265.757  , ..., 265.7369 , 265.7411 ,
           265.7452 ],
          [265.73276, 265.73276, 265.73276, ..., 265.7328 , 265.73276,
           265.73276]],

         [[217.79984, 217.79984, 217.79984, ..., 217.79984, 217.79984,
           217.79984],
          [218.30846, 218.30481, 218.30156, ..., 218.32202, 218.31757,
           218.3122 ],
          [218.94579, 218.91786, 218.88835, ..., 219.02962, 219.     ,
           218.97177],
          ...,
          [265.71143, 265.72287, 265.73416, ..., 265.67624, 265.6879 ,
           265.69962],
          [265.69626, 265.701  , 265.70557, ..., 265.68204, 265.68677,
           265.6915 ],
          [265.68402, 265.68405, 265.68408, ..., 265.68396, 265.684  ,
           265.684  ]]]]], dtype=float32)
Coordinates:
  * lat        (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lon        (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * lev        (lev) float64 3.643 7.595 14.36 24.61 ... 936.2 957.5 976.3 992.6
  * member_id  (member_id) <U11 'r1i1001p1f1'
  * season     (season) object 'DJF' 'JJA' 'MAM' 'SON'
Attributes:
    mdims:         1
    units:         K
    long_name:     Temperature
    cell_methods:  time: mean
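
For context, a monthly climatology is essentially a groupby over the calendar month. A rough xarray-only sketch of the same idea (not the GeoCAT implementation; note that CESM monthly history files timestamp each average at the end of the interval, so a careful version would first adjust the time coordinate using the time bounds):

# Approximate the monthly climatology with plain xarray:
# average together all timesteps that share the same calendar month
approx_monthly_mean = ds.T.groupby('time.month').mean('time')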

Plot CESM2-LE Temperature Climatologies#

Plot up the Seasonal Mean Temperature#

We select our single member_id and choose the last few vertical levels, since the vertical dimension is ordered from the top of the atmosphere down to the surface, which is the inverse of how the observational data is ordered
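
A quick look at the first and last values of each level coordinate confirms this:

# CESM levels run from the model top (~3.6 hPa) down toward the surface (~992.6 hPa)
print(cesm_seasonal_mean_temperature.lev.values[[0, -1]])

# The AIRS observational levels run the other way (1000 hPa up to 70 hPa)
print(seasonal_obs_ds.lev.values[[0, -1]])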

cesm_seasonal_temperature_plot = (
    cesm_seasonal_mean_temperature.isel(member_id=0, lev=range(-5, -1)).T.hvplot.quadmesh(
        rasterize=True,
        groupby=['lev', 'season'],
        x='lon',
        y='lat',
        cmap='magma',
        projection=ccrs.Robinson(),
        project=True,
    )
    * gf.coastline
)
cesm_seasonal_temperature_plot

Plot up the Monthly Mean Temperature#

cesm_monthly_temperature_plot = (
    cesm_monthly_mean_temperature.isel(member_id=0, lev=range(-5, -1)).T.hvplot.quadmesh(
        rasterize=True,
        groupby=['lev', 'month'],
        x='lon',
        y='lat',
        cmap='magma',
        projection=ccrs.Robinson(),
        project=True,
    )
    * gf.coastline
)
cesm_monthly_temperature_plot

Conclusion#

Throughout this example, we showed how useful intake-esm can be when querying both observational and model datasets! We also showed how you can use hvPlot to plot interactive maps, even transforming to different coordinate reference systems. While the observational and model datasets did not have matching spatial or vertical coordinates, plotting these comparisons as a first cut can still be useful!
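
If you wanted to take the comparison a step further, one option is to interpolate the model climatology onto the observational grid with xarray. The sketch below is only illustrative: it reuses the variables defined above and crudely matches a single pressure level with method='nearest', ignoring the differences between the two vertical coordinates.

# Interpolate the CESM seasonal climatology onto the AIRS latitude/longitude grid
cesm_on_obs_grid = cesm_seasonal_mean_temperature.interp(
    lat=seasonal_obs_ds.lat, lon=seasonal_obs_ds.lon
)

# Difference for one season at a roughly comparable pressure level
cesm_djf = cesm_on_obs_grid.sel(season='DJF').sel(lev=850.0, method='nearest').isel(member_id=0)
obs_djf = seasonal_obs_ds['T'].sel(temporal='DJF').sel(lev=850.0, method='nearest').isel(time=0)
bias = cesm_djf - obs_djf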

If you are interested in using the data catalogs, go for it! Here are the paths:

  • AMWG Observational Comparison Catalog - /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json

  • CESM2 Large Ensemble Catalog - /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cesm2-le.json