Using Intake-ESM’s New Derived Variable Functionality#
Last week, Anderson Banihirwe added a new feature to intake-esm, enabling users to add “derived variables” to catalogs! This is an exciting addition, and although this has not been included in a release yet, you are welcome to test out the functionality!
Keep in mind that this is an initial prototype and the API is likely to change.
What is a “Derived Variable”#
A “derived variable” in this case is a variable that doesn’t itself exist in an intake-esm catalog, but can be computed (i.e., “derived”) from variables that do exist in the catalog. Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within the same dataset.
An example of a derived variable could be temperature in degrees Fahrenheit. Often times, climate model models write temperature in Celsius or Kelvin, but the user may want degrees Fahrenheit! This is a really simple example; derived variables could include more sophsticated diagnostic output like aggregations of terms in a tracer budget or gradients in a particular field.
A traditional workflow for “derived variables” might consist of the following:
Load the data
Apply some function to the loaded datasets
Plot the output
This has shortcomings in the context of an automated process. For example, imagine the following.
for
But what if we could couple those first two steps? What if we could have some set of **variable definitions**, consisting of variable requirements, such as `dependent variables`, and a function which derives the quantity. This is what the `derived_variable` funtionality offers in `intake-esm`! This enables users to share a "registry" of derived variables across catalogs!
Let's get started with an example!
Installing the Development Version of Intake-ESM#
Since this has not yet been released in a version of Intake-ESM, you can install the development version using the following:
pip install git+https://github.com/intake/intake-esm.git
Imports#
import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import numpy as np
from distributed import Client
from intake_esm.derived import DerivedVariableRegistry
from ncar_jobqueue import NCARCluster
hv.extension('bokeh')
Spin up a Dask Cluster#
cluster = NCARCluster()
cluster.scale(10)
client = Client(cluster)
client
/glade/work/mgrover/miniconda3/envs/cesm-collections-dev/lib/python3.9/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40217 instead
warnings.warn(
Client
Client-a73eb1b3-3352-11ec-8fa4-3cecef1b11fa
Connection method: Cluster object | Cluster type: dask_jobqueue.PBSCluster |
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/mgrover/proxy/40217/status |
Cluster Info
PBSCluster
0bcacd1c
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/mgrover/proxy/40217/status | Workers: 0 |
Total threads: 0 | Total memory: 0 B |
Scheduler Info
Scheduler
Scheduler-1d915812-9bca-44c2-aa26-a00dc4ddb19f
Comm: tcp://10.12.206.54:38580 | Workers: 0 |
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/mgrover/proxy/40217/status | Total threads: 0 |
Started: Just now | Total memory: 0 B |
Workers
How to add a Derived Variable#
Let’s compute a derived variable - wind speed! This can be derived from using the zonal (U
) and meridional (V
) components of the wind.
def calc_wind_speed(u, v):
return np.sqrt(u**2 + v**2)
Creating our Derived Variable Registry#
We need to instantiate our derived variable registry, which will store our derived variable information! We use the variable dvr
for this (DerivedVariableRegistry).
dvr = DerivedVariableRegistry()
In order to register this derived variable we need to add a decorator for our function, as seen below. This allows us to define our derived variable, dependent variables, and the function associated with the calculation.
@dvr.register(variable='wind_speed', dependent_variables=['U', 'V'])
def calc_wind_speed(ds):
ds['wind_speed'] = np.sqrt(ds.U**2 + ds.V**2)
return ds
You’ll notice dvr
now has a registered variable, wind_speed
, which was defined in the cell above!
dvr
DerivedVariableRegistry({'wind_speed': DerivedVariable(func=<function calc_wind_speed at 0x2b84d46c84c0>, variable='wind_speed', dependent_variables=['U', 'V'])})
Read in Data with our Registry#
In this case, we will use data from the CESM Large Ensemble. We load in our derived variable catalog using the registry
argument.
data_catalog = intake.open_esm_datastore(
'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json',
registry=dvr,
)
You’ll notice we have a new field - derived_variable
which has 1 unique value.
data_catalog
aws-cesm1-le catalog with 56 dataset(s) from 442 asset(s):
unique | |
---|---|
variable | 78 |
long_name | 75 |
component | 5 |
experiment | 4 |
frequency | 6 |
vertical_levels | 3 |
spatial_domain | 5 |
units | 25 |
start_time | 12 |
end_time | 13 |
path | 427 |
derived_variable | 1 |
Let’s also subset for monthly frequency, as well as the 20th century (20C) and RCP 8.5 (RCP85) experiments.
catalog_subset = data_catalog.search(
variable=['wind_speed'], frequency='monthly', experiment=['20C', 'RCP85']
)
catalog_subset
aws-cesm1-le catalog with 2 dataset(s) from 4 asset(s):
unique | |
---|---|
variable | 2 |
long_name | 2 |
component | 1 |
experiment | 2 |
frequency | 1 |
vertical_levels | 1 |
spatial_domain | 1 |
units | 1 |
start_time | 2 |
end_time | 2 |
path | 4 |
derived_variable | 1 |
Calling to_dataset_dict
to Load in the Data#
We load in the data, which lazily adds our calculation for wind_speed
to the datasets!
dsets = catalog_subset.to_dataset_dict(
xarray_open_kwargs={'backend_kwargs': {'storage_options': {'anon': True}}}
)
--> The keys in the returned dictionary of datasets are constructed as follows:
'component.experiment.frequency'
ds = dsets['atm.RCP85.monthly']
ds
<xarray.Dataset> Dimensions: (member_id: 40, time: 1140, lev: 30, lat: 192, lon: 288, nbnd: 2) Coordinates: * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0 * lev (lev) float64 3.643 7.595 14.36 24.61 ... 957.5 976.3 992.6 * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8 * member_id (member_id) int64 1 2 3 4 5 6 7 8 ... 34 35 101 102 103 104 105 * time (time) object 2006-01-16 12:00:00 ... 2100-12-16 12:00:00 time_bnds (time, nbnd) object dask.array<chunksize=(1140, 2), meta=np.ndarray> Dimensions without coordinates: nbnd Data variables: U (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray> V (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray> wind_speed (member_id, time, lev, lat, lon) float32 dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray> Attributes: (12/21) Conventions: CF-1.0 NCO: 4.3.4 Version: $Name$ host: tcs-f02n07 important_note: This data is part of the project 'Blin... initial_file: b.e11.B20TRC5CNBDRD.f09_g16.105.cam.i.... ... ... intake_esm_attrs/vertical_levels: 30.0 intake_esm_attrs/spatial_domain: global intake_esm_attrs/units: m/s intake_esm_attrs/start_time: 2006-01-16 12:00:00 intake_esm_attrs/end_time: 2100-12-16 12:00:00 intake_esm_dataset_key: atm.RCP85.monthly
- member_id: 40
- time: 1140
- lev: 30
- lat: 192
- lon: 288
- nbnd: 2
- lat(lat)float64-90.0 -89.06 -88.12 ... 89.06 90.0
- axis :
- Y
- bounds :
- lat_bnds
- long_name :
- latitude
- standard_name :
- latitude
- units :
- degrees_north
array([-90. , -89.057594, -88.115181, -87.172775, -86.23037 , -85.287956, -84.345551, -83.403145, -82.460732, -81.518326, -80.575912, -79.633507, -78.691101, -77.748688, -76.806282, -75.863876, -74.921463, -73.979057, -73.036652, -72.094238, -71.151833, -70.209427, -69.267014, -68.324608, -67.382202, -66.439789, -65.497383, -64.554977, -63.612564, -62.670158, -61.727749, -60.785339, -59.842934, -58.900524, -57.958115, -57.015705, -56.073299, -55.13089 , -54.18848 , -53.246075, -52.303665, -51.361256, -50.41885 , -49.47644 , -48.534031, -47.591621, -46.649216, -45.706806, -44.764397, -43.821991, -42.879581, -41.937172, -40.994766, -40.052357, -39.109947, -38.167538, -37.225132, -36.282722, -35.340313, -34.397907, -33.455498, -32.513088, -31.570681, -30.628273, -29.685863, -28.743456, -27.801046, -26.858639, -25.916231, -24.973822, -24.031414, -23.089005, -22.146597, -21.204189, -20.26178 , -19.319372, -18.376963, -17.434555, -16.492147, -15.549738, -14.607329, -13.664922, -12.722513, -11.780105, -10.837696, -9.895288, -8.95288 , -8.010471, -7.068063, -6.125654, -5.183246, -4.240838, -3.298429, -2.356021, -1.413613, -0.471204, 0.471204, 1.413613, 2.356021, 3.298429, 4.240838, 5.183246, 6.125654, 7.068063, 8.010471, 8.95288 , 9.895288, 10.837696, 11.780105, 12.722513, 13.664922, 14.607329, 15.549738, 16.492147, 17.434555, 18.376963, 19.319372, 20.26178 , 21.204189, 22.146597, 23.089005, 24.031414, 24.973822, 25.916231, 26.858639, 27.801046, 28.743456, 29.685863, 30.628273, 31.570681, 32.513088, 33.455498, 34.397907, 35.340313, 36.282722, 37.225132, 38.167538, 39.109947, 40.052357, 40.994766, 41.937172, 42.879581, 43.821991, 44.764397, 45.706806, 46.649216, 47.591621, 48.534031, 49.47644 , 50.41885 , 51.361256, 52.303665, 53.246075, 54.18848 , 55.13089 , 56.073299, 57.015705, 57.958115, 58.900524, 59.842934, 60.785339, 61.727749, 62.670158, 63.612564, 64.554977, 65.497383, 66.439789, 67.382202, 68.324608, 69.267014, 70.209427, 71.151833, 72.094238, 73.036652, 73.979057, 74.921463, 75.863876, 76.806282, 77.748688, 78.691101, 79.633507, 80.575912, 81.518326, 82.460732, 83.403145, 84.345551, 85.287956, 86.23037 , 87.172775, 88.115181, 89.057594, 90. ])
- lev(lev)float643.643 7.595 14.36 ... 976.3 992.6
- formula_terms :
- a: hyam b: hybm p0: P0 ps: PS
- long_name :
- hybrid level at midpoints (1000*(A+B))
- positive :
- down
- standard_name :
- atmosphere_hybrid_sigma_pressure_coordinate
- units :
- level
array([ 3.643466, 7.59482 , 14.356632, 24.61222 , 38.2683 , 54.59548 , 72.012451, 87.82123 , 103.317127, 121.547241, 142.994039, 168.22508 , 197.908087, 232.828619, 273.910817, 322.241902, 379.100904, 445.992574, 524.687175, 609.778695, 691.38943 , 763.404481, 820.858369, 859.534767, 887.020249, 912.644547, 936.198398, 957.48548 , 976.325407, 992.556095])
- lon(lon)float640.0 1.25 2.5 ... 356.2 357.5 358.8
- long_name :
- longitude
- units :
- degrees_east
array([ 0. , 1.25, 2.5 , ..., 356.25, 357.5 , 358.75])
- member_id(member_id)int641 2 3 4 5 6 ... 101 102 103 104 105
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 101, 102, 103, 104, 105])
- time(time)object2006-01-16 12:00:00 ... 2100-12-...
- bounds :
- time_bnds
- long_name :
- time
array([cftime.DatetimeNoLeap(2006, 1, 16, 12, 0, 0, 0, has_year_zero=True), cftime.DatetimeNoLeap(2006, 2, 15, 0, 0, 0, 0, has_year_zero=True), cftime.DatetimeNoLeap(2006, 3, 16, 12, 0, 0, 0, has_year_zero=True), ..., cftime.DatetimeNoLeap(2100, 10, 16, 12, 0, 0, 0, has_year_zero=True), cftime.DatetimeNoLeap(2100, 11, 16, 0, 0, 0, 0, has_year_zero=True), cftime.DatetimeNoLeap(2100, 12, 16, 12, 0, 0, 0, has_year_zero=True)], dtype=object)
- time_bnds(time, nbnd)objectdask.array<chunksize=(1140, 2), meta=np.ndarray>
- long_name :
- time interval endpoints
Array Chunk Bytes 17.81 kiB 17.81 kiB Shape (1140, 2) (1140, 2) Count 5 Tasks 1 Chunks Type object numpy.ndarray
- U(member_id, time, lev, lat, lon)float32dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
- cell_methods :
- time: mean
- long_name :
- Zonal wind
- mdims :
- 1
- units :
- m/s
Array Chunk Bytes 281.80 GiB 113.91 MiB Shape (40, 1140, 30, 192, 288) (1, 18, 30, 192, 288) Count 2561 Tasks 2560 Chunks Type float32 numpy.ndarray - V(member_id, time, lev, lat, lon)float32dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
- cell_methods :
- time: mean
- long_name :
- Meridional wind
- mdims :
- 1
- units :
- m/s
Array Chunk Bytes 281.80 GiB 113.91 MiB Shape (40, 1140, 30, 192, 288) (1, 18, 30, 192, 288) Count 2561 Tasks 2560 Chunks Type float32 numpy.ndarray - wind_speed(member_id, time, lev, lat, lon)float32dask.array<chunksize=(1, 18, 30, 192, 288), meta=np.ndarray>
- intake_esm_attrs/derivation :
- dependent_variables: ['U', 'V']
Array Chunk Bytes 281.80 GiB 113.91 MiB Shape (40, 1140, 30, 192, 288) (1, 18, 30, 192, 288) Count 15362 Tasks 2560 Chunks Type float32 numpy.ndarray
- Conventions :
- CF-1.0
- NCO :
- 4.3.4
- Version :
- $Name$
- host :
- tcs-f02n07
- important_note :
- This data is part of the project 'Blind Evaluation of Lossy Data-Compression in LENS'. Please exercise caution before using this data for other purposes.
- initial_file :
- b.e11.B20TRC5CNBDRD.f09_g16.105.cam.i.2006-01-01-00000.nc
- logname :
- mudryk
- nco_openmp_thread_number :
- 1
- revision_Id :
- $Id$
- source :
- CAM
- title :
- UNSET
- topography_file :
- /scratch/p/pjk/mudryk/cesm1_1_2_LENS/inputdata/atm/cam/topo/USGS-gtopo30_0.9x1.25_remap_c051027.nc
- intake_esm_attrs/component :
- atm
- intake_esm_attrs/experiment :
- RCP85
- intake_esm_attrs/frequency :
- monthly
- intake_esm_attrs/vertical_levels :
- 30.0
- intake_esm_attrs/spatial_domain :
- global
- intake_esm_attrs/units :
- m/s
- intake_esm_attrs/start_time :
- 2006-01-16 12:00:00
- intake_esm_attrs/end_time :
- 2100-12-16 12:00:00
- intake_esm_dataset_key :
- atm.RCP85.monthly
Apply an Annual Mean Calculation#
Let’s apply an annual average to the data - since all the years are 365 days long, we do not need any special weighting
annual_mean = ds.groupby('time.year').mean('time')
Plot the Output#
We use hvPlot
here to plot a global map, selecting only the bottom 8 levels of the atmosphere.
annual_mean.isel(lev=range(-8, 0)).wind_speed.hvplot.quadmesh(
x='lon',
y='lat',
geo=True,
rasterize=True,
cmap='viridis',
coastline=True,
title='Wind Speed (m/s)',
)