Using Intake-ESM’s New Derived Variable Functionality#

Last week, Anderson Banihirwe added a new feature to intake-esm, enabling users to add “derived variables” to catalogs! This is an exciting addition, and although this has not been included in a release yet, you are welcome to test out the functionality!

Keep in mind that this is an initial prototype and the API is likely to change.

What is a “Derived Variable”?#

A “derived variable” in this case is a variable that doesn’t itself exist in an intake-esm catalog, but can be computed (i.e., “derived”) from variables that do exist in the catalog. Currently, the derived variable implementation requires that all variables involved be on the same grid; i.e., it assumes they can be merged within the same dataset.

An example of a derived variable could be temperature in degrees Fahrenheit. Oftentimes, climate models write temperature in Celsius or Kelvin, but the user may want degrees Fahrenheit! This is a really simple example; derived variables could include more sophisticated diagnostic output, like aggregations of terms in a tracer budget or gradients in a particular field.
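For that simple case, the derivation is just a one-line arithmetic formula (the function name below is only for illustration):

```python
def celsius_to_fahrenheit(temp_c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9 / 5 + 32

celsius_to_fahrenheit(100.0)  # boiling point of water: 212.0
```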

A traditional workflow for “derived variables” might consist of the following:

  • Load the data

  • Apply some function to the loaded datasets

  • Plot the output

This has shortcomings in the context of an automated process: the function that derives the quantity lives outside the data-loading step, so every user and every script has to find the dependent variables, load them, and apply the calculation themselves.

But what if we could couple those first two steps? What if we could have some set of **variable definitions**, each consisting of the variable's requirements, such as its `dependent variables`, and a function which derives the quantity? This is what the `derived_variable` functionality offers in `intake-esm`! This enables users to share a "registry" of derived variables across catalogs!

Let's get started with an example!

Installing the Development Version of Intake-ESM#

Since this has not yet been released in a version of Intake-ESM, you can install the development version using the following:

pip install git+https://github.com/intake/intake-esm.git

Imports#

import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import numpy as np
from distributed import Client
from intake_esm.derived import DerivedVariableRegistry
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')

Spin up a Dask Cluster#

cluster = NCARCluster()
cluster.scale(10)
client = Client(cluster)
client
/glade/work/mgrover/miniconda3/envs/cesm-collections-dev/lib/python3.9/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40217 instead
  warnings.warn(

Client: Client-a73eb1b3-3352-11ec-8fa4-3cecef1b11fa
Connection method: Cluster object
Cluster type: dask_jobqueue.PBSCluster
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/mgrover/proxy/40217/status

How to add a Derived Variable#

Let’s compute a derived variable - wind speed! This can be derived using the zonal (U) and meridional (V) components of the wind.

def calc_wind_speed(u, v):
    return np.sqrt(u**2 + v**2)
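As a quick sanity check of the formula (re-stating the function so the snippet is self-contained), a 3-4-5 triangle gives a speed of exactly 5:

```python
import numpy as np

def calc_wind_speed(u, v):
    return np.sqrt(u**2 + v**2)

# For u=3, v=4 the speed is sqrt(9 + 16) = 5; it also broadcasts over arrays
calc_wind_speed(3.0, 4.0)
```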

Creating our Derived Variable Registry#

We need to instantiate our derived variable registry, which will store our derived variable information! We use the variable dvr for this (DerivedVariableRegistry).

dvr = DerivedVariableRegistry()

In order to register this derived variable we need to add a decorator for our function, as seen below. This allows us to define our derived variable, dependent variables, and the function associated with the calculation.

@dvr.register(variable='wind_speed', dependent_variables=['U', 'V'])
def calc_wind_speed(ds):
    ds['wind_speed'] = np.sqrt(ds.U**2 + ds.V**2)
    return ds
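To make the registration mechanics concrete, here is a minimal sketch of how a decorator-based registry can work (illustrative only, not intake-esm's actual implementation):

```python
class MiniRegistry:
    """A stripped-down registry mapping variable names to derivation recipes."""

    def __init__(self):
        self._registry = {}

    def register(self, variable, dependent_variables):
        def decorator(func):
            # Record the function alongside the variables it needs,
            # then hand the function back unchanged
            self._registry[variable] = {
                'func': func,
                'dependent_variables': dependent_variables,
            }
            return func
        return decorator

registry = MiniRegistry()

@registry.register(variable='wind_speed', dependent_variables=['U', 'V'])
def derive_wind_speed(ds):
    return ds

# registry._registry now maps 'wind_speed' to the function and its dependencies
```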

You’ll notice dvr now has a registered variable, wind_speed, which was defined in the cell above!

dvr
DerivedVariableRegistry({'wind_speed': DerivedVariable(func=<function calc_wind_speed at 0x2b84d46c84c0>, variable='wind_speed', dependent_variables=['U', 'V'])})

Read in Data with our Registry#

In this case, we will use data from the CESM Large Ensemble. We load in our derived variable catalog using the registry argument.

data_catalog = intake.open_esm_datastore(
    'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json',
    registry=dvr,
)

You’ll notice we have a new field, derived_variable, which has 1 unique value.

data_catalog

aws-cesm1-le catalog with 56 dataset(s) from 442 asset(s):

unique
variable 78
long_name 75
component 5
experiment 4
frequency 6
vertical_levels 3
spatial_domain 5
units 25
start_time 12
end_time 13
path 427
derived_variable 1

Let’s also subset for monthly frequency, as well as the 20th century (20C) and RCP 8.5 (RCP85) experiments.

catalog_subset = data_catalog.search(
    variable=['wind_speed'], frequency='monthly', experiment=['20C', 'RCP85']
)
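Note that this search can be satisfied even though no asset in the catalog stores a variable named `wind_speed`: conceptually, a query for a derived variable expands to the variables it depends on. A minimal sketch of that expansion (illustrative, not intake-esm's actual search logic):

```python
def expand_query(requested, registry):
    """Replace derived variables in a query with the variables they depend on."""
    expanded = []
    for variable in requested:
        if variable in registry:
            expanded.extend(registry[variable])  # substitute the dependencies
        else:
            expanded.append(variable)            # plain catalog variable
    return expanded

expand_query(['wind_speed', 'T'], {'wind_speed': ['U', 'V']})  # ['U', 'V', 'T']
```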

catalog_subset

aws-cesm1-le catalog with 2 dataset(s) from 4 asset(s):

unique
variable 2
long_name 2
component 1
experiment 2
frequency 1
vertical_levels 1
spatial_domain 1
units 1
start_time 2
end_time 2
path 4
derived_variable 1

Calling to_dataset_dict to Load in the Data#

We load in the data, which lazily adds our calculation for wind_speed to the datasets!

dsets = catalog_subset.to_dataset_dict(
    xarray_open_kwargs={'backend_kwargs': {'storage_options': {'anon': True}}}
)
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.frequency'
100.00% [2/2 00:00<00:00]
ds = dsets['atm.RCP85.monthly']
ds
<xarray.Dataset>
Dimensions:     (member_id: 40, time: 1140, lev: 30, lat: 192, lon: 288, nbnd: 2)
Coordinates:
  * lat         (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
  * lev         (lev) float64 3.643 7.595 14.36 24.61 ... 957.5 976.3 992.6
  * lon         (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * member_id   (member_id) int64 1 2 3 4 5 6 7 8 ... 34 35 101 102 103 104 105
  * time        (time) object 2006-01-16 12:00:00 ... 2100-12-16 12:00:00
    time_bnds   (time, nbnd) object dask.array<chunksize=(1140, 2), meta=np.ndarray>
Dimensions without coordinates: nbnd
Data variables:
    U           (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
    V           (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
    wind_speed  (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
Attributes: (12/21)
    Conventions:                       CF-1.0
    NCO:                               4.3.4
    Version:                           $Name$
    host:                              tcs-f02n07
    important_note:                    This data is part of the project 'Blin...
    initial_file:                      b.e11.B20TRC5CNBDRD.f09_g16.105.cam.i....
    ...                                ...
    intake_esm_attrs/vertical_levels:  30.0
    intake_esm_attrs/spatial_domain:   global
    intake_esm_attrs/units:            m/s
    intake_esm_attrs/start_time:       2006-01-16 12:00:00
    intake_esm_attrs/end_time:         2100-12-16 12:00:00
    intake_esm_dataset_key:            atm.RCP85.monthly

Apply an Annual Mean Calculation#

Let’s apply an annual average to the data. Since all the years are 365 days long, we do not need any special weighting.

annual_mean = ds.groupby('time.year').mean('time')
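The grouped mean above reduces each run of 12 monthly values to a single annual value. On a plain NumPy array the same reduction looks like this (a sketch of the arithmetic, not what xarray does internally):

```python
import numpy as np

# Two years of fake monthly values: 0..11 for year one, 12..23 for year two
monthly = np.arange(24, dtype=float)

# Reshape to (years, months) and average over the month axis
annual = monthly.reshape(2, 12).mean(axis=1)
annual  # one mean per year
```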

Plot the Output#

We use hvPlot here to plot a global map, selecting only the bottom 8 levels of the atmosphere.

annual_mean.isel(lev=range(-8, 0)).wind_speed.hvplot.quadmesh(
    x='lon',
    y='lat',
    geo=True,
    rasterize=True,
    cmap='viridis',
    coastline=True,
    title='Wind Speed (m/s)',
)