Examining Diagnostics Using Intake-ESM and hvPlot#

In previous weeks, we have looked at building Intake-ESM catalogs from history files and visualizing CESM output using HoloViews and Datashader, but this week we are putting together a few of those pieces to visualize a comparison between model runs.

One of the first ESDS blog posts looked at building an interactive dashboard for browsing plots, using high resolution ocean model output as the dataset. One of the limitations of that approach is that the images are static - we are pulling in PNGs and rendering them on the page, as opposed to more interactive options. In this example, we will read in data using a catalog generated with ecgtools, from a directory only accessible via NCAR’s high performance computing center.

The main plotting library used here is hvPlot, which is “a high-level plotting API for the PyData ecosystem built on HoloViews”. Essentially, it is built on top of key components of the Python visualization stack, such as HoloViews, Datashader, and Bokeh, with seamless integration with Xarray (check out the hvPlot documentation for more details).

Imports#

import ast

import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import matplotlib.pyplot as plt
import panel as pn
import xarray as xr
from distributed import Client
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')
pn.extension()

Spin up a Dask Cluster#

We spin up a cluster, including 20 total workers

cluster = NCARCluster()
cluster.scale(20)
client = Client(cluster)
cluster

Read in the data#

As mentioned previously, we are reading in data from within the GLADE filesystem at NCAR. The special line

ast.literal_eval

is included since there are multiple variables in the variables column, an artifact of using the default “history” file output (check out our previous post here for more information about this type of model output).
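To see why this converter is needed: in the catalog CSV, a multi-variable entry is stored as a plain string that merely looks like a Python list, and ast.literal_eval safely parses it back into an actual list (a small standalone sketch):

```python
import ast

# How a multi-variable entry appears in the raw CSV column
raw = "['TEMP', 'SALT']"

# literal_eval parses the string into a real Python list,
# evaluating only literals (safe, unlike eval)
variables = ast.literal_eval(raw)
print(variables)  # ['TEMP', 'SALT']
```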

catalog = intake.open_esm_datastore(
    "/glade/work/mgrover/cesm-validation-catalog.json",
    csv_kwargs={"converters": {"variables": ast.literal_eval}},
    sep="/",
)
catalog

None catalog with 12 dataset(s) from 11103 asset(s):

unique
component 1
stream 4
date 2501
case 3
member_id 2
frequency 4
variables 545
path 11103

We see that there are three cases included in this catalog

catalog.df.case.unique()
array(['b1850.f19_g17.validation_nuopc.004',
       'b1850.f19_g17.validation_mct.004',
       'b1850.f19_g17.validation_mct.002'], dtype=object)

Subset for monthly output from the last 20 years#

We start by subsetting for monthly output

catalog_subset = catalog.search(frequency='month_1')

Next, we subset for the last 20 years of data. Since this is monthly output, the last twenty years corresponds to

12 months * 20 years = 240
dates = sorted(catalog_subset.df.date.unique())[-240:]
print(dates[:12], dates[-12:])
['0081-01', '0081-02', '0081-03', '0081-04', '0081-05', '0081-06', '0081-07', '0081-08', '0081-09', '0081-10', '0081-11', '0081-12'] ['0100-01', '0100-02', '0100-03', '0100-04', '0100-05', '0100-06', '0100-07', '0100-08', '0100-09', '0100-10', '0100-11', '0100-12']
catalog_subset = catalog_subset.search(frequency='month_1', date=dates)

Now, our catalog includes our 3 cases with monthly data from the last 20 years of the simulation

catalog_subset

None catalog with 3 dataset(s) from 720 asset(s):

unique
component 1
stream 1
date 240
case 3
member_id 2
frequency 1
variables 434
path 720

Read in a dictionary of datasets#

Using the .to_dataset_dict() method, we read in the datasets

dsets = catalog_subset.to_dataset_dict(cdf_kwargs={'use_cftime': True, 'chunks': {'time': 60}})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component/stream/case'
100.00% [3/3 00:06<00:00]

Taking a look at the keys, we see that they follow the convention component/stream/case

dsets.keys()
dict_keys(['ocn/pop.h/b1850.f19_g17.validation_mct.002', 'ocn/pop.h/b1850.f19_g17.validation_mct.004', 'ocn/pop.h/b1850.f19_g17.validation_nuopc.004'])

Compute the Temporal Average#

We are interested in the average over the last 20 years, so we take the mean with respect to time. For now, we are interested in two variables:

  • SALT (salinity)

  • TEMP (potential temperature)

variables = ['TEMP', 'SALT']

We also want to make sure that when we operate on the datasets, we hold onto the attributes. We can do this using:

xr.set_options(keep_attrs=True)
<xarray.core.options.set_options at 0x2ba103a57eb0>
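As a tiny standalone sketch of what this option changes: by default, a reduction such as .mean() drops a variable’s attributes, while with keep_attrs=True (set globally as above, or as a context manager) they survive:

```python
import xarray as xr

# A synthetic DataArray with a units attribute
da = xr.DataArray([1.0, 2.0, 3.0], dims='time', attrs={'units': 'degC'})

# By default, a reduction like .mean() drops the attributes
print(da.mean().attrs)  # {}

# With keep_attrs=True, the attributes are carried through the reduction
with xr.set_options(keep_attrs=True):
    print(da.mean().attrs)  # {'units': 'degC'}
```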
ds_list = []
for key in dsets.keys():

    # Extract the dataset from the dictionary of datasets
    ds = dsets[key]

    # Compute the mean over time
    mean = ds.mean(dim='time')

    # Subset for the variables we are interested in
    out = mean[variables]

    # Modify which variables we include in the attributes
    out.attrs['intake_esm_varname'] = variables

    # Append to the list of datasets
    ds_list.append(out)

Concatenate the datasets from the various cases

merged_ds = xr.concat(ds_list, dim='case')

Each dataset has an attribute title which contains the casename, which we can use as one of the main dimensions of our new merged dataset

merged_ds.title
'b1850.f19_g17.validation_mct.002'
cases = []
for dset in ds_list:
    cases.append(dset.title)
merged_ds['case'] = cases

Load in our computation#

Before we start plotting, we can go ahead and call .persist() on our dataset, which will load our temporally averaged dataset, with its subset of the original variables, into memory on the cluster

merged_ds.persist()
<xarray.Dataset>
Dimensions:  (case: 3, z_t: 60, nlat: 384, nlon: 320)
Coordinates:
  * z_t      (z_t) float32 500.0 1.5e+03 2.5e+03 ... 5.125e+05 5.375e+05
    ULONG    (nlat, nlon) float64 321.1 322.3 323.4 324.5 ... 319.2 319.6 320.0
    ULAT     (nlat, nlon) float64 -78.95 -78.95 -78.95 ... 72.42 72.41 72.41
    TLONG    (nlat, nlon) float64 320.6 321.7 322.8 323.9 ... 318.9 319.4 319.8
    TLAT     (nlat, nlon) float64 -79.22 -79.22 -79.22 ... 72.2 72.19 72.19
  * case     (case) <U34 'b1850.f19_g17.validation_mct.002' ... 'b1850.f19_g1...
Dimensions without coordinates: nlat, nlon
Data variables:
    TEMP     (case, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 60, 384, 320), meta=np.ndarray>
    SALT     (case, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 60, 384, 320), meta=np.ndarray>
Attributes:
    revision:                $Id$
    title:                   b1850.f19_g17.validation_mct.002
    contents:                Diagnostic and Prognostic Variables
    calendar:                All years have exactly  365 days.
    Conventions:             CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf...
    history:                 none
    time_period_freq:        month_1
    model_doi_url:           https://doi.org/10.5065/D67H1H0V
    cell_methods:            cell_methods = time: mean ==> the variable value...
    source:                  CCSM POP2, the CCSM Ocean Component
    intake_esm_varname:      ['TEMP', 'SALT']
    intake_esm_dataset_key:  ocn/pop.h/b1850.f19_g17.validation_mct.002

Plot a Comparison using Holoviews#

We can set up a couple of helper functions before visualizing our output. The first helps with an issue that might not be obvious when first trying to plot POP ocean model output with HoloViews.

When first checking to see what TEMP’s dimensions are, we see that its spatial dimensions are z_t (vertical) and nlat/nlon (horizontal)

ds.TEMP.dims
('time', 'z_t', 'nlat', 'nlon')

When HoloViews sees this, however, it does not pick up the same dimensions…

first_case = merged_ds.isel(case=0)
first_case.TEMP.hvplot.quadmesh()