Examining Diagnostics Using Intake-ESM and hvPlot#

In previous weeks, we have looked at building Intake-ESM catalogs from history files and visualizing CESM output using HoloViews and Datashader, but this week we are putting together a few of those pieces to visualize a comparison between model runs.

One of the first ESDS blog posts looked at building an interactive dashboard for browsing plots, using high resolution ocean model output as the dataset. One of the limitations of that approach is that the images are static - we are pulling in PNGs and rendering them on the page, as opposed to more interactive options. In this example, we will read in data using a catalog generated with ecgtools, from a directory only accessible via NCAR’s high performance computing center.

The main plotting library used here is hvPlot, which is “a high-level plotting API for the PyData ecosystem built on HoloViews”. Essentially, it is built on top of key components of the Python visualization stack, such as HoloViews, Datashader, and Bokeh, with seamless integration with Xarray (check out the hvPlot documentation for more details).

Imports#

import ast

import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import matplotlib.pyplot as plt
import panel as pn
import xarray as xr
from distributed import Client
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')
pn.extension()

Spin up a Dask Cluster#

We spin up a cluster, including 20 total workers

cluster = NCARCluster()
cluster.scale(20)
client = Client(cluster)
cluster

Read in the data#

As mentioned previously, we are reading in data from within the GLADE filesystem at NCAR. The special line

ast.literal_eval

is included since there are multiple variables in the variables column, an artifact of using the default “history” file output (check out our previous post here for more information about this type of model output).
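To see why this converter is needed: in the catalog CSV, a multi-variable entry is stored as a plain string that merely looks like a Python list, and ast.literal_eval safely parses it back into an actual list (a small standalone sketch):

```python
import ast

# How a multi-variable entry appears in the raw CSV column
raw = "['TEMP', 'SALT']"

# literal_eval parses the string into a real Python list,
# evaluating only literals (safe, unlike eval)
variables = ast.literal_eval(raw)
print(variables)  # ['TEMP', 'SALT']
```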

catalog = intake.open_esm_datastore(
    "/glade/work/mgrover/cesm-validation-catalog.json",
    csv_kwargs={"converters": {"variables": ast.literal_eval}},
    sep="/",
)
catalog

None catalog with 12 dataset(s) from 11103 asset(s):

unique
component 1
stream 4
date 2501
case 3
member_id 2
frequency 4
variables 545
path 11103

We see that there are three cases included in this catalog

catalog.df.case.unique()
array(['b1850.f19_g17.validation_nuopc.004',
       'b1850.f19_g17.validation_mct.004',
       'b1850.f19_g17.validation_mct.002'], dtype=object)

Subset for monthly output from the last 20 years#

We start by subsetting for monthly output

catalog_subset = catalog.search(frequency='month_1')

Next, we subset for the last 20 years of data. Since this is monthly output, the last twenty years corresponds to

12 months * 20 years = 240
dates = sorted(catalog_subset.df.date.unique())[-240:]
print(dates[:12], dates[-12:])
['0081-01', '0081-02', '0081-03', '0081-04', '0081-05', '0081-06', '0081-07', '0081-08', '0081-09', '0081-10', '0081-11', '0081-12'] ['0100-01', '0100-02', '0100-03', '0100-04', '0100-05', '0100-06', '0100-07', '0100-08', '0100-09', '0100-10', '0100-11', '0100-12']
catalog_subset = catalog_subset.search(frequency='month_1', date=dates)

Now, our catalog includes our 3 cases with monthly data from the last 20 years of the simulation

catalog_subset

None catalog with 3 dataset(s) from 720 asset(s):

unique
component 1
stream 1
date 240
case 3
member_id 2
frequency 1
variables 434
path 720

Read in a dictionary of datasets#

Using the .to_dataset_dict() method, we read in the datasets

dsets = catalog_subset.to_dataset_dict(cdf_kwargs={'use_cftime': True, 'chunks': {'time': 60}})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component/stream/case'
100.00% [3/3 00:06<00:00]

Taking a look at the keys, we see that they follow the convention component/stream/case

dsets.keys()
dict_keys(['ocn/pop.h/b1850.f19_g17.validation_mct.002', 'ocn/pop.h/b1850.f19_g17.validation_mct.004', 'ocn/pop.h/b1850.f19_g17.validation_nuopc.004'])

Compute the Temporal Average#

We are interested in the average over the last 20 years, so we take the mean with respect to time. For now, we are interested in two variables:

  • SALT (salinity)

  • TEMP (potential temperature)

variables = ['TEMP', 'SALT']

We also want to make sure that when we operate on the datasets, we hold onto the attributes. We can do this using:

xr.set_options(keep_attrs=True)
<xarray.core.options.set_options at 0x2ba103a57eb0>
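As a tiny standalone sketch of what this option changes: by default, a reduction such as .mean() drops a variable’s attributes, while with keep_attrs=True (set globally as above, or as a context manager) they survive:

```python
import xarray as xr

# A synthetic DataArray with a units attribute
da = xr.DataArray([1.0, 2.0, 3.0], dims='time', attrs={'units': 'degC'})

# By default, a reduction like .mean() drops the attributes
print(da.mean().attrs)  # {}

# With keep_attrs=True, the attributes are carried through the reduction
with xr.set_options(keep_attrs=True):
    print(da.mean().attrs)  # {'units': 'degC'}
```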
ds_list = []
for key in dsets.keys():

    # Extract the dataset from the dictionary of datasets
    ds = dsets[key]

    # Compute the mean over time
    mean = ds.mean(dim='time')

    # Subset for the variables we are interested in
    out = mean[variables]

    # Modify which variables we include in the attributes
    out.attrs['intake_esm_varname'] = variables

    # Append to the list of datasets
    ds_list.append(out)

Concatenate the datasets from the various cases

merged_ds = xr.concat(ds_list, dim='case')

Each dataset has an attribute title which contains the casename, which we can use as one of the main dimensions of our new merged dataset

merged_ds.title
'b1850.f19_g17.validation_mct.002'
cases = []
for dset in ds_list:
    cases.append(dset.title)
merged_ds['case'] = cases

Load in our computation#

Before we start plotting, we can go ahead and call .persist() on our dataset, which will load our temporally averaged dataset, with its subset of the original variables, into memory on the cluster

merged_ds.persist()
<xarray.Dataset>
Dimensions:  (case: 3, z_t: 60, nlat: 384, nlon: 320)
Coordinates:
  * z_t      (z_t) float32 500.0 1.5e+03 2.5e+03 ... 5.125e+05 5.375e+05
    ULONG    (nlat, nlon) float64 321.1 322.3 323.4 324.5 ... 319.2 319.6 320.0
    ULAT     (nlat, nlon) float64 -78.95 -78.95 -78.95 ... 72.42 72.41 72.41
    TLONG    (nlat, nlon) float64 320.6 321.7 322.8 323.9 ... 318.9 319.4 319.8
    TLAT     (nlat, nlon) float64 -79.22 -79.22 -79.22 ... 72.2 72.19 72.19
  * case     (case) <U34 'b1850.f19_g17.validation_mct.002' ... 'b1850.f19_g1...
Dimensions without coordinates: nlat, nlon
Data variables:
    TEMP     (case, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 60, 384, 320), meta=np.ndarray>
    SALT     (case, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 60, 384, 320), meta=np.ndarray>
Attributes:
    revision:                $Id$
    title:                   b1850.f19_g17.validation_mct.002
    contents:                Diagnostic and Prognostic Variables
    calendar:                All years have exactly  365 days.
    Conventions:             CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf...
    history:                 none
    time_period_freq:        month_1
    model_doi_url:           https://doi.org/10.5065/D67H1H0V
    cell_methods:            cell_methods = time: mean ==> the variable value...
    source:                  CCSM POP2, the CCSM Ocean Component
    intake_esm_varname:      ['TEMP', 'SALT']
    intake_esm_dataset_key:  ocn/pop.h/b1850.f19_g17.validation_mct.002

Plot a Comparison using Holoviews#

We can set up a couple of helper functions before visualizing our output. The first helps with an issue that might not be obvious when first trying to plot POP ocean model output with HoloViews.

When first checking to see what TEMP’s dimensions are, we see that its spatial dimensions are z_t (vertical) and nlat/nlon (horizontal)

ds.TEMP.dims
('time', 'z_t', 'nlat', 'nlon')

When HoloViews sees this, however, it does not pick up the same dimensions…

first_case = merged_ds.isel(case=0)
first_case.TEMP.hvplot.quadmesh()