Comparing Atmospheric Model Output with Observations Using Intake-ESM#

Comparing models and observations is a critical component of climate diagnostic packages. This process can be challenging though - given the number of observational datasets to compare against, and the difference in spatiotemporal resolutions. In the previous iteration of the diagnostics package used for atmospheric data from the Community Earth System Model (CESM), they used pre-computed, observational datasets stored in a directory on the GLADE filesystem (/glade/p/cesm/amwg/amwg_diagnostics/obs_data)

Within this example, we walk though generating an intake-esm catalog from the observational data, reading in CESM data, and compare models and observations

Imports#

import ast
import glob
import pathlib
import traceback
import warnings
from datetime import datetime

import cartopy.crs as ccrs
import geocat.comp
import geoviews.feature as gf
import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import matplotlib.pyplot as plt
import pandas as pd
import xarray as xr
from distributed import Client
from ecgtools import Builder
from ecgtools.builder import INVALID_ASSET, TRACEBACK
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')

Build an Intake-ESM catalog from the observational dataset#

Before we jump into building the catalog, let’s take a look at what data we are working with

Investigate the File Structure and Sample Dataset#

files = sorted(glob.glob('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/*'))
files[::20]
['/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES2_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSATCOSP_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSAT_10_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ECMWF_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/EP.ERAI_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERAI_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERBE_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERS_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/GPCP_JJA_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_CL_03_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PD_02_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_PI_05_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPCOSP_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCPFD_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ISCCP_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/JRA25_SON_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/LEGATES_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRAW_19x2_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MERRA_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MISRCOSP_JJA_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/MODIS_ANN_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/NVAP_03_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/PRECL_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/SSMI_SEAICE_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/TRMM_MAM_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WARREN_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/WILLMOTT_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/XIEARKIN_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsg_10_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlso_ANN_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/mlsw_MAM_climo.nc']

Observational datasetsets in this directory follow the convention source_(month or season)_climo.nc.

Let’s open up one of those datasets

ds = xr.open_dataset('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc')
ds
<xarray.Dataset>
Dimensions:  (lon: 360, lat: 180, time: 1)
Coordinates:
  * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
  * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * time     (time) float32 1.0
Data variables:
    SOLIN    (time, lat, lon) float32 495.0 495.0 495.0 495.0 ... 0.0 0.0 0.0
    FLUT     (time, lat, lon) float32 187.4 187.4 187.4 ... 170.7 170.7 170.7
    FLUTC    (time, lat, lon) float32 188.8 188.8 188.8 ... 178.9 178.9 178.9
    FSNTOA   (time, lat, lon) float32 147.8 147.8 147.8 ... -0.049 -0.049 -0.049
    FSNTOAC  (time, lat, lon) float32 150.9 150.9 150.9 ... -0.006 -0.006 -0.006
    SWCF     (time, lat, lon) float32 -3.149 -3.149 -3.149 ... -0.043 -0.043
    LWCF     (time, lat, lon) float32 1.391 1.391 1.391 ... 8.272 8.272 8.272
    RESTOA   (time, lat, lon) float32 -39.6 -39.6 -39.6 ... -170.7 -170.7 -170.7
    ALBEDO   (time, lat, lon) float32 0.7015 0.7015 0.7015 ... nan nan nan
    ALBEDOC  (time, lat, lon) float32 0.6951 0.6951 0.6951 ... nan nan nan
    gw       (lat) float64 0.0001523 0.0004569 0.0007613 ... 0.0004569 0.0001523
Attributes:
    version:             This is version 2.8: March 7, 2014
    institution:         NASA Langley Research Center
    comment:             Data is from East to West and South to North. Climat...
    title:               CERES EBAF (Energy Balanced and Filled) Fluxes. Mont...
    AMWG_author:         Cecile Hannay
    AMWG_creation_date:  Thu Jul 24 16:08:10 MDT 2014 for AMWG package
    history:             Thu Jul 24 16:08:10 2014: ncks -A -v gw CERES2_01_cl...
    NCO:                 20140724

We see that this dataset is gridded on a global 0.5° grid, with several variables related to solar fluxes (ex. TOA net shortwave)

Build a parser using ecgtools#

We can use the ecgtools package to help build an intake-esm catalog for this dataset!

We start first by using pathlib to generically look at the filepaths

path = pathlib.Path(files[0])
path.stem
'AIRS_01_climo'

This path can be split using .split('_'), separates the path into the following:

  • observational dataset source

  • month number or season

  • “climo”

path.stem.split('_')
['AIRS', '01', 'climo']

We can also gather useful insight by opening the file!

ds = xr.open_dataset(files[0])

Let’s look at the variable “Temperature” (T)

ds.T
<xarray.DataArray 'T' (time: 1, lev: 13, lat: 94, lon: 192)>
[234624 values with dtype=float32]
Coordinates:
  * lat      (lat) float64 -88.54 -86.65 -84.75 -82.85 ... 84.75 86.65 88.54
  * time     (time) int32 1
  * lev      (lev) float32 1e+03 925.0 850.0 700.0 ... 200.0 150.0 100.0 70.0
  * lon      (lon) float32 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
Attributes:
    units:        K
    long_name:    Temperature
    climatology:  AIRS monthly climatology 9/2002-8/2006

You’ll notice that we have additional attributes about the dataset which we can use, including:

  • units

  • long_name

  • climatology, which includes more information about what dates were used

Adding the parsing function#

def parse_amwg_obs(file):
    """Atmospheric observational data used within the AMWG diagnostic package

    Parameters
    ----------
    file: str
      filepath from the data directory including observational climatologies

    Returns
    -------
    info: dict
       Dictionary with information to add as columns in the data catalog

    """
    file = pathlib.Path(file)
    info_list = []
    info = {}

    try:
        stem = file.stem
        split = stem.split('_')
        info['source'] = split[0]
        temporal = split[-2]
        if len(split[-2]) == 2:
            month_number = int(temporal)
            info['time_period'] = 'monthly'
            info['temporal'] = datetime(2020, month_number, 1).strftime('%b').upper()

        elif len(temporal) == 3:
            info['time_period'] = 'seasonal'
            info['temporal'] = temporal

        else:
            info['time_period'] = np.nan
            info['temporal'] = np.nan

        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:
            variables = [v for v, da in ds.variables.items() if 'time' in da.dims]
            info['variables'] = variables

            info['path'] = str(path)

        return info

    except Exception:
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}
'time' in ds['T'].dims
True
info = parse_amwg_obs(files[0])
info
{'source': 'AIRS',
 'time_period': 'monthly',
 'temporal': 'JAN',
 'variables': ['T', 'time', 'RELHUM', 'O3', 'SHUM'],
 'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}
pd.DataFrame(info)
source time_period temporal variables path
0 AIRS monthly JAN T /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1 AIRS monthly JAN time /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
2 AIRS monthly JAN RELHUM /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3 AIRS monthly JAN O3 /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
4 AIRS monthly JAN SHUM /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

We can test this on a single file!

This dictionary can easily be converted into a pandas.Dataframe

pd.DataFrame([info, info])
source time_period temporal variables path
0 AIRS monthly JAN [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1 AIRS monthly JAN [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

Use ecgtools to build the catalog#

Now that we have our parser ready to go, we can use ecgtools to build the catalog

b = Builder(
    # Directory with the output
    f"/glade/p/cesm/amwg/amwg_diagnostics/obs_data/",
    depth=3,
    # Number of jobs to execute - should be equal to # threads you are using
    njobs=-1,
)

Once we set up the builder, can call .build, passing in the parser we built

b.build(parse_amwg_obs)
b.df
source time_period temporal variables path
0 AIRS monthly JAN [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
1 AIRS monthly FEB [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
2 AIRS monthly MAR [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3 AIRS monthly APR [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
4 AIRS monthly MAY [T, time, RELHUM, O3, SHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
... ... ... ... ... ...
3778 mlsw seasonal ANN [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3779 mlsw seasonal DJF [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3780 mlsw seasonal JJA [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3781 mlsw seasonal MAM [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...
3782 mlsw seasonal SON [T, time, Z3, H2O, O3, RELHUM] /glade/p/cesm/amwg/amwg_diagnostics/obs_data/A...

1243 rows × 5 columns

Save the catalog#

b.save(
    # File path - could save as .csv (uncompressed csv) or .csv.gz (compressed csv)
    "/glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.csv",
    # Column name including filepath
    path_column_name='path',
    # Column name including variables
    variable_column_name='variables',
    # Data file format - could be netcdf or zarr (in this case, netcdf)
    data_format="netcdf",
    # Which attributes to groupby when reading in variables using intake-esm
    groupby_attrs=["source", "time_period"],
    # Aggregations which are fed into xarray when reading in data using intake
    aggregations=[
        {
            'type': 'join_new',
            'attribute_name': 'temporal',
            'options': {'coords': 'minimal', 'compat': 'override'},
        },
    ],
)
Saved catalog location: /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json and /glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.csv
/glade/scratch/mgrover/ipykernel_140729/3449059562.py:1: UserWarning: Unable to parse 2541 assets/files. A list of these assets can be found in /glade/work/mgrover/intake-esm-catalogs/invalid_assets_amwg_obs_datasets.csv.
  b.save(

Read in Observational Data from the Catalog#

We use intake-esm to read in the observational data from the catalog we just created

obs_catalog = intake.open_esm_datastore(
    "/glade/work/mgrover/intake-esm-catalogs/amwg_obs_datasets.json",
    csv_kwargs={"converters": {"variables": ast.literal_eval}},
    sep="/",
)

We are interested in Temperature (T), from the AIRS dataset, which comes from NASA

obs_catalog_subset = obs_catalog.search(variables='T', source='AIRS')
dsets = obs_catalog_subset.to_dataset_dict()
--> The keys in the returned dictionary of datasets are constructed as follows:
	'source/time_period'
100.00% [2/2 00:00<00:00]

Our dictionary of datasets has two keys, one for seasonal and the other for monthly climatologies

dsets.keys()
dict_keys(['AIRS/seasonal', 'AIRS/monthly'])
seasonal_obs_ds = dsets['AIRS/seasonal']
monthly_obs_ds = dsets['AIRS/monthly']

Plot Observational Temperature Climatologies#

Now that we read in our data, we can plot up our results!

Plot Seasonal Temperature Climatologies from Observations#

seasonal_obs_plot = (
    seasonal_obs_ds.isel(time=0, lev=range(3)).T.hvplot.quadmesh(
        rasterize=True, groupby=['lev', 'temporal'], cmap='magma', projection=ccrs.Robinson()
    )
    * gf.coastline
)
seasonal_obs_plot

Plot Monthly Temperature Climatologies from Observations#

monthly_obs_plot = (
    monthly_obs_ds.isel(time=0, lev=range(3)).T.hvplot.quadmesh(
        rasterize=True, groupby=['lev', 'temporal'], cmap='magma', projection=ccrs.Robinson()
    )
    * gf.coastline
)
monthly_obs_plot