CESM2-Large Ensemble Reproduction of Kay et al. 2015

Introduction

This Jupyter Notebook demonstrates how one might use the NCAR Community Earth System Model v2 (CESM2) Large Ensemble (CESM2-LE) data hosted on AWS S3. The notebook shows how to reproduce Figure 2 from the Kay et al. (2015) paper describing the original CESM Large Ensemble (LENS) dataset (doi:10.1175/BAMS-D-13-00255.1), using the newer CESM2-LE (LENS2) dataset instead.

A previous notebook exploring this use case, put together by Joe Hamman and Anderson Banihirwe, is available in the Pangeo Gallery. The specific figure we are replicating is shown below.

[Figure: Kay et al. (2015) Figure 2 - global surface temperature anomaly from the original CESM1 Large Ensemble (cesm1-lens)]

With the CESM2-LE dataset, we have 100 members spanning 1850 to 2100, whereas the original LENS dataset’s 40-member ensemble covers 1920 to 2100. In this notebook, we include the ensemble spread from 1850 to 2100, comparing against observations from the HadCRUT4 dataset where available. We also explore a more interactive version of this figure!

Imports

%matplotlib inline
import warnings

warnings.filterwarnings("ignore")

import intake
import numpy as np
import pandas as pd
import xarray as xr
import hvplot.pandas, hvplot.xarray
import holoviews as hv
from distributed import LocalCluster, Client
from ncar_jobqueue import NCARCluster
hv.extension('bokeh')

Spin up a Cluster

# If not using NCAR HPC, use the LocalCluster
#cluster = LocalCluster()
cluster = NCARCluster()
cluster.scale(40)

client = Client(cluster)
client

Client

Client-8d8f26fe-570c-11ec-8bf1-3cecef1b11fa
Connection method: Cluster object
Cluster type: dask_jobqueue.PBSCluster
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/mgrover/proxy/33398/status

Read in Data

Use the Intake-ESM Catalog to Access the Data

catalog = intake.open_esm_datastore(
    'https://raw.githubusercontent.com/NCAR/cesm2-le-aws/main/intake-catalogs/aws-cesm2-le.json'
)
catalog

aws-cesm2-le catalog with 34 dataset(s) from 318 asset(s):

unique
variable 51
long_name 49
component 4
experiment 2
forcing_variant 2
frequency 3
vertical_levels 3
spatial_domain 3
units 20
start_time 4
end_time 7
path 297

Subset for Daily Temperature Data (TREFHT)

We use Intake-ESM here to query for our datasets; the search returns both the cmip6 and smoothed biomass burning (smbb) forcing variants.

catalog_subset = catalog.search(variable='TREFHT', frequency='daily')
catalog_subset

aws-cesm2-le catalog with 4 dataset(s) from 4 asset(s):

unique
variable 1
long_name 1
component 1
experiment 2
forcing_variant 2
frequency 1
vertical_levels 1
spatial_domain 1
units 1
start_time 2
end_time 2
path 4

Taking a look at the dataframe, we see there are four Zarr stores - a historical run and a future scenario (ssp370) for each of the two forcing variants (cmip6 and smbb).

catalog_subset.df
variable long_name component experiment forcing_variant frequency vertical_levels spatial_domain units start_time end_time path
0 TREFHT reference height temperature atm historical cmip6 daily 1.0 global K 1850-01-01 12:00:00 2014-12-31 12:00:00 s3://ncar-cesm2-lens/atm/daily/cesm2LE-histori...
1 TREFHT reference height temperature atm historical smbb daily 1.0 global K 1850-01-01 12:00:00 2014-12-31 12:00:00 s3://ncar-cesm2-lens/atm/daily/cesm2LE-histori...
2 TREFHT reference height temperature atm ssp370 cmip6 daily 1.0 global K 2015-01-01 12:00:00 2100-12-31 12:00:00 s3://ncar-cesm2-lens/atm/daily/cesm2LE-ssp370-...
3 TREFHT reference height temperature atm ssp370 smbb daily 1.0 global K 2015-01-01 12:00:00 2100-12-31 12:00:00 s3://ncar-cesm2-lens/atm/daily/cesm2LE-ssp370-...

Load in our datasets using .to_dataset_dict()

We use .to_dataset_dict() to return a dictionary of datasets, which we can now use for the analysis

dsets = catalog_subset.to_dataset_dict(storage_options={'anon':True})
--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.frequency.forcing_variant'
100.00% [4/4 00:07<00:00]

Here are the keys for each dataset - you will notice they follow the groupby attributes (component.experiment.frequency.forcing_variant), giving one historical and one future (ssp370) dataset for each forcing variant.

dsets.keys()
dict_keys(['atm.historical.daily.smbb', 'atm.ssp370.daily.smbb', 'atm.historical.daily.cmip6', 'atm.ssp370.daily.cmip6'])

Let’s pull these out into separate variables, then merge them together!

historical_smbb = dsets['atm.historical.daily.smbb']
future_smbb = dsets['atm.ssp370.daily.smbb']

historical_cmip6 = dsets['atm.historical.daily.cmip6']
future_cmip6 = dsets['atm.ssp370.daily.cmip6']

Now, we can merge the historical and future scenarios, concatenating along the time dimension. We will drop any members that do not include the full 1850-2100 output!

merge_ds_smbb = xr.concat([historical_smbb, future_smbb], dim='time')
merge_ds_smbb = merge_ds_smbb.dropna(dim='member_id')

merge_ds_cmip6 = xr.concat([historical_cmip6, future_cmip6], dim='time')
merge_ds_cmip6 = merge_ds_cmip6.dropna(dim='member_id')

For ease of plotting, we convert the cftime.datetime time axis to a regular DatetimeIndex.

merge_ds_smbb['time'] = merge_ds_smbb.indexes['time'].to_datetimeindex()
merge_ds_cmip6['time'] = merge_ds_cmip6.indexes['time'].to_datetimeindex()

Finally, we can load in just the variable we are interested in, reference height temperature (TREFHT). We will separate this into three arrays:

  • t_smbb - the entire period of data for the smoothed biomass burning experiment (1850 to 2100)

  • t_cmip6 - the entire period of data for the cmip6 biomass burning experiment (1850 to 2100)

  • t_ref - the reference period used in the Kay et al. 2015 paper (1961 to 1990)

t_smbb = merge_ds_smbb.TREFHT
t_cmip6 = merge_ds_cmip6.TREFHT
t_ref = t_cmip6.sel(time=slice('1961', '1990'))

Read in the Grid Data

We also have Zarr stores of grid data, such as the area of each grid cell. We will need to follow a similar process here, querying for our experiment and extracting the dataset

grid_subset = catalog.search(component='atm', frequency='static', experiment='historical', forcing_variant='cmip6')

We can load in the dataset by calling .popitem(), which returns the single (key, dataset) pair; we assign the key to _ since we don’t need it.

_, grid = grid_subset.to_dataset_dict(aggregate=False, zarr_kwargs={"consolidated": True}, storage_options={'anon':True}).popitem()
grid
--> The keys in the returned dictionary of datasets are constructed as follows:
	'variable.long_name.component.experiment.forcing_variant.frequency.vertical_levels.spatial_domain.units.start_time.end_time.path'
100.00% [1/1 00:00<00:00]
<xarray.Dataset>
Dimensions:   (lat: 192, lon: 288, ilev: 31, lev: 30, bnds: 2, slat: 191, slon: 288)
Coordinates: (12/21)
    P0        float64 ...
    area      (lat, lon) float32 dask.array<chunksize=(192, 288), meta=np.ndarray>
    gw        (lat) float64 dask.array<chunksize=(192,), meta=np.ndarray>
    hyai      (ilev) float64 dask.array<chunksize=(31,), meta=np.ndarray>
    hyam      (lev) float64 dask.array<chunksize=(30,), meta=np.ndarray>
    hybi      (ilev) float64 dask.array<chunksize=(31,), meta=np.ndarray>
    ...        ...
    ntrn      int32 ...
  * slat      (slat) float64 -89.53 -88.59 -87.64 -86.7 ... 87.64 88.59 89.53
  * slon      (slon) float64 -0.625 0.625 1.875 3.125 ... 355.6 356.9 358.1
    time      object ...
    w_stag    (slat) float64 dask.array<chunksize=(191,), meta=np.ndarray>
    wnummax   (lat) int32 dask.array<chunksize=(192,), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables:
    *empty*
Attributes:
    intake_esm_varname:      None
    intake_esm_dataset_key:  nan.nan.atm.historical.cmip6.static.nan.global.n...

Extract the Grid Area and Compute the Total Area

Here, we extract the grid area and compute the total area across all grid cells - this will help when computing the weights…

cell_area = grid.area.load()
total_area = cell_area.sum()
cell_area
<xarray.DataArray 'area' (lat: 192, lon: 288)>
array([[2.9948368e+07, 2.9948368e+07, 2.9948368e+07, ..., 2.9948368e+07,
        2.9948368e+07, 2.9948368e+07],
       [2.3957478e+08, 2.3957478e+08, 2.3957478e+08, ..., 2.3957478e+08,
        2.3957478e+08, 2.3957478e+08],
       [4.7908477e+08, 4.7908477e+08, 4.7908477e+08, ..., 4.7908477e+08,
        4.7908477e+08, 4.7908477e+08],
       ...,
       [4.7908477e+08, 4.7908477e+08, 4.7908477e+08, ..., 4.7908477e+08,
        4.7908477e+08, 4.7908477e+08],
       [2.3957478e+08, 2.3957478e+08, 2.3957478e+08, ..., 2.3957478e+08,
        2.3957478e+08, 2.3957478e+08],
       [2.9948368e+07, 2.9948368e+07, 2.9948368e+07, ..., 2.9948368e+07,
        2.9948368e+07, 2.9948368e+07]], dtype=float32)
Coordinates:
    P0       float64 nan
    area     (lat, lon) float32 2.995e+07 2.995e+07 ... 2.995e+07 2.995e+07
    gw       (lat) float64 nan nan nan nan nan nan ... nan nan nan nan nan nan
  * lat      (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0
  * lon      (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
    ntrk     int32 0
    ntrm     int32 0
    ntrn     int32 0
    time     object 1850-02-01 00:00:00
    wnummax  (lat) int32 1543505128 10970 1543505128 ... 2146959360 0 2146959360
Attributes:
    cell_methods:  time: mean
    long_name:     area of grid box
    units:         m2

Compute the Weighted Annual Averages

Here, we use the resample(time="AS") method, which resamples to annual means anchored at the start of each calendar year.

Note we also weight by the cell area: each value is multiplied by its corresponding cell area, summed over the grid (lat, lon), then divided by the total area. xarray’s .weighted() method handles this bookkeeping for us - the sketch below spells out the equivalence.
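
To make explicit what .weighted() does, here is a minimal, self-contained sketch on a tiny synthetic grid (not the CESM2-LE data), showing that it matches multiplying by the cell area, summing over the grid, and dividing by the total area:

import numpy as np
import xarray as xr

# Tiny synthetic grid: 2x3 temperatures and cell areas (illustration only)
temp = xr.DataArray([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dims=("lat", "lon"))
area = xr.DataArray([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]], dims=("lat", "lon"))

# Written out by hand: multiply by cell area, sum over the grid, divide by total area
manual = (temp * area).sum(("lat", "lon")) / area.sum()

# xarray's .weighted() API computes the same area-weighted mean
weighted = temp.weighted(area).mean(("lat", "lon"))

np.testing.assert_allclose(manual.values, weighted.values)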

Set up the Computation - Below, we lazily prepare the calculation!

You’ll notice how quickly this cell runs - which is suspicious 👀 - we didn’t actually do any computation, but rather prepared the calculation (there’s a quick check of this just after the timing output below).

%%time
t_ref_ts = (
    (t_ref.resample(time="AS").mean("time")).weighted(cell_area).mean(('lat', 'lon'))
).mean(dim=("time", "member_id"))

t_smbb_ts = (
    (t_smbb.resample(time="AS").mean("time")).weighted(cell_area).mean(('lat', 'lon')))

t_cmip6_ts = (
    (t_cmip6.resample(time="AS").mean("time")).weighted(cell_area).mean(('lat', 'lon')))
CPU times: user 1.63 s, sys: 20.7 ms, total: 1.65 s
Wall time: 1.74 s
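
As an optional sanity check, we can confirm that the result is still backed by a lazy Dask array - nothing has been read from S3 yet:

import dask.array as da

# t_smbb_ts is still backed by a lazy Dask array at this point;
# no data is pulled from S3 until we call .compute() below
print(type(t_smbb_ts.data))                   # <class 'dask.array.core.Array'>
print(isinstance(t_smbb_ts.data, da.Array))   # True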

Compute the Averages

At this point, we actually compute the averages - using %%time to help us measure how long it takes to compute…

%%time
t_ref_mean = t_ref_ts.compute()
CPU times: user 17.5 s, sys: 552 ms, total: 18 s
Wall time: 41.4 s
%%time
t_smbb_mean = t_smbb_ts.compute()
CPU times: user 1min 37s, sys: 3.41 s, total: 1min 41s
Wall time: 3min 14s
%%time
t_cmip6_mean = t_cmip6_ts.compute()
CPU times: user 2min 32s, sys: 4.84 s, total: 2min 37s
Wall time: 5min 36s

Convert to Dataframes

Our output data format is still an xarray.DataArray, which isn’t necessarily ideal… we can convert this to a dataframe, which is easier to export and share.

For example, you could call anomaly_smbb.to_csv('some_output_file.csv') to export to csv, which you could then share with others!

We can convert our xarray.DataArray to a pandas dataframe using the following:

t_smbb_ts_df = t_smbb_mean.to_series().unstack().T
t_cmip6_ts_df = t_cmip6_mean.to_series().unstack().T

Now that we have our dataframes, we can compute the anomaly by subtracting the reference-period mean, t_ref_mean.

anomaly_smbb = (t_smbb_ts_df - t_ref_mean.data)
anomaly_cmip6 = (t_cmip6_ts_df - t_ref_mean.data)
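
As mentioned above, these anomaly dataframes are easy to share. A minimal sketch of exporting to CSV and reading it back (the file name here is arbitrary):

# Export the smbb anomalies to CSV (file name is just an example)
anomaly_smbb.to_csv('cesm2le_smbb_temperature_anomaly.csv')

# Read it back in later, restoring the time index
anomaly_check = pd.read_csv('cesm2le_smbb_temperature_anomaly.csv', index_col=0, parse_dates=True)
anomaly_check.head()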

Grab some Observational Data

In this case, we use the HadCRUT4 dataset. We include the URL here, which can be used to remotely access the data.

# Observational time series data for comparison with ensemble average
obsDataURL = "https://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/cru/hadcrut4/air.mon.anom.median.nc"
ds = xr.open_dataset(obsDataURL).load()
ds
<xarray.Dataset>
Dimensions:    (lat: 36, lon: 72, time: 2061, nbnds: 2)
Coordinates:
  * lat        (lat) float32 87.5 82.5 77.5 72.5 ... -72.5 -77.5 -82.5 -87.5
  * lon        (lon) float32 -177.5 -172.5 -167.5 -162.5 ... 167.5 172.5 177.5
  * time       (time) datetime64[ns] 1850-01-01 1850-02-01 ... 2021-09-01
Dimensions without coordinates: nbnds
Data variables:
    time_bnds  (time, nbnds) datetime64[ns] 1850-01-01 1850-01-31 ... 2021-09-30
    air        (time, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan
Attributes:
    platform:                        Surface
    title:                           HADCRUT4 Combined Air Temperature/SST An...
    history:                         Originally created at NOAA/ESRL PSD by C...
    Conventions:                     CF-1.0
    Comment:                         This dataset supersedes V3
    Source:                          Obtained from http://hadobs.metoffice.co...
    version:                         4.2.0
    dataset_title:                   HadCRUT4
    References:                      https://www.psl.noaa.gov/data/gridded/da...
    DODS_EXTRA.Unlimited_Dimension:  time

Compute the Weighted Temporal Mean from Monthly Data

This observational dataset comes in as monthly values, but we want yearly averages; this requires weighting by the number of days in each month, so we include a helper function, weighted_temporal_mean!

def weighted_temporal_mean(ds):
    """
    Compute yearly averages from monthly data, weighting by the number of days in each month
    """
    # Length of each month, taken from the time bounds
    time_bound_diff = ds.time_bnds.diff(dim="nbnds")[:, 0]

    # Normalize so the weights within each calendar year sum to 1
    wgts = time_bound_diff.groupby("time.year") / time_bound_diff.groupby(
        "time.year"
    ).sum(xr.ALL_DIMS)
    np.testing.assert_allclose(wgts.groupby("time.year").sum(xr.ALL_DIMS), 1.0)

    obs = ds["air"]

    # Mask of valid (non-NaN) observations, so missing months don't bias the average
    cond = obs.isnull()
    ones = xr.where(cond, 0.0, 1.0)

    # Weighted annual sums of the observations and of the valid-data mask
    obs_sum = (obs * wgts).resample(time="AS").sum(dim="time")
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")

    # Normalize by the weights actually present, then average over lat/lon
    obs_s = (obs_sum / ones_out).mean(("lat", "lon")).to_series()
    return obs_s

Let’s apply this to our dataset now!

obs_s = weighted_temporal_mean(ds)
obs_s
time
1850-01-01   -0.338822
1851-01-01   -0.245482
1852-01-01   -0.291014
1853-01-01   -0.342457
1854-01-01   -0.276820
                ...   
2017-01-01    0.777498
2018-01-01    0.641953
2019-01-01    0.809306
2020-01-01    0.865248
2021-01-01    0.677946
Freq: AS-JAN, Length: 172, dtype: float64

Convert the series into a dataframe

obs_df = pd.DataFrame(obs_s).rename(columns={0:'value'})
obs_df
value
time
1850-01-01 -0.338822
1851-01-01 -0.245482
1852-01-01 -0.291014
1853-01-01 -0.342457
1854-01-01 -0.276820
... ...
2017-01-01 0.777498
2018-01-01 0.641953
2019-01-01 0.809306
2020-01-01 0.865248
2021-01-01 0.677946

172 rows × 1 columns

Plot the Output

In this case, we use hvPlot to plot the output! We set up the plots in the following cell.

smbb_plot = (
    anomaly_smbb.hvplot.scatter(
        'time',
        height=300,
        width=500,
        color='lightgrey',
        legend=False,
        xlabel='Year',
        ylabel='Global Mean \n Temperature Anomaly (K)',
        ylim=(-1, 5),
        hover=False,
    )
    * anomaly_smbb.mean(1).hvplot.line('time', color='k', line_width=3, label='ensemble mean')
    * obs_df.hvplot.line('time', color='red', label='observations')
).opts(title='Smoothed Biomass Burning')


cmip6_plot = (
    anomaly_cmip6.hvplot.scatter(
        'time',
        height=300,
        width=500,
        color='lightgrey',
        legend=False,
        xlabel='Year',
        ylabel='Global Mean \n Temperature Anomaly (K)',
        ylim=(-1, 5),
        hover=False,
    )
    * anomaly_cmip6.mean(1).hvplot.line('time', color='k', line_width=3, label='ensemble mean')
    * obs_df.hvplot.line('time', color='red', label='observations')
).opts(title='CMIP6 Biomass Burning')

# Include both plots in the same column
(smbb_plot + cmip6_plot).cols(1)

Spin Down the Cluster

After we are done, we can spin down our cluster

cluster.close()
client.close()