Data Access¶
- This notebook illustrates how to make diagnostic plots using the dataset produced by the South America Affinity Group (SAAG), hosted on NCAR’s Geoscience Data Exchange (GDEX): https://gdex.ucar.edu/datasets/d616000/
- This data is open access and can be accessed via three protocols:
  - POSIX (if you have access to NCAR’s HPC systems like Casper or Derecho),
  - HTTPS, or
  - OSDF, using intake-ESM catalogs.
- Learn about intake-ESM: https://intake-esm.readthedocs.io/en/stable/
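For readers without access to NCAR’s filesystems, here is a minimal sketch of opening the catalog remotely, using the HTTPS catalog URL that also appears in a cell below; the rest of the notebook works the same way regardless of protocol:
import intake

# Open the same intake-ESM catalog over HTTPS; no NCAR filesystem access is required.
col = intake.open_esm_datastore(
    'https://osdf-data.gdex.ucar.edu/ncar/gdex/d616000/catalogs/d616000_catalog-http.json'
)
print(col)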
# Imports
import os

import intake
import numpy as np
import pandas as pd
import xarray as xr
import seaborn as sns
import matplotlib.pyplot as plt

# import fsspec.implementations.http as fshttp
# from pelicanfs.core import PelicanFileSystem, PelicanMap, OSDFFileSystem

import dask
from dask_jobqueue import PBSCluster
from dask.distributed import Client
from dask.distributed import performance_report

cat_url = '/gdex/data/d616000/catalogs/d616000_catalog.json' # POSIX access on NCAR
# cat_url = 'https://osdf-data.gdex.ucar.edu/ncar/gdex/d616000/catalogs/d616000_catalog-http.json' #HTTPS access
# cat_url = 'https://osdf-data.gdex.ucar.edu/ncar/gdex/d616000/catalogs/d616000_catalog-osdf.json' #OSDF access
print(cat_url)
/gdex/data/d616000/catalogs/d616000_catalog.json
# Set up your scratch folder path
username = os.environ["USER"]
glade_scratch = "/glade/derecho/scratch/" + username
print(glade_scratch)
/glade/derecho/scratch/harshah
Create a PBS cluster¶
# Create a PBS cluster object
cluster = PBSCluster(
    job_name        = 'dask-wk25-hpc',
    cores           = 1,
    memory          = '8GiB',
    processes       = 1,
    local_directory = glade_scratch + '/dask/spill/',
    log_directory   = glade_scratch + '/dask/logs/',
    resource_spec   = 'select=1:ncpus=1:mem=8GB',
    queue           = 'casper',
    walltime        = '5:00:00',
    # interface     = 'ib0'
    interface       = 'ext'
)

# Scale the cluster and display cluster dashboard URL
n_workers = 5
client = Client(cluster)
cluster.scale(n_workers)
client.wait_for_workers(n_workers = n_workers)
cluster
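If the interactive cluster widget does not render in your environment, you can print the dashboard URL directly; dashboard_link is a standard dask.distributed Client attribute:
# Print the Dask dashboard URL, useful for monitoring workers and tasks
print(client.dashboard_link)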
Load SAAG data from NCAR’s GDEX using an intake catalog¶
col = intake.open_esm_datastore(cat_url)
col
- col.df turns the catalog object into a pandas DataFrame!
- (More precisely, it accesses the dataframe attribute of the catalog)
col.df
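Since col.df is an ordinary pandas DataFrame, all the usual pandas operations apply. A minimal sketch, assuming the catalog columns shown above:
# Quick overview of the catalog using plain pandas
print(col.df['variable'].unique())  # all variable short names in the catalog
print(len(col.df))                  # number of catalog entries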
Select data and plot¶
What if you don’t know the variable names?¶
- Use pandas logic to print out the short_name and long_name
col.df[['variable','long_name']]
- We notice that long_name is not available for some variables like ‘V’
- In such cases, please look at the dataset documentation for additional information: https://gdex.ucar.edu/datasets/d616000/documentation/
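A minimal sketch for listing every variable whose long_name is missing, assuming missing entries appear as NaN in the DataFrame:
# Variables that have no long_name in the catalog
missing = col.df[col.df['long_name'].isna()]
print(missing['variable'].unique())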
Temperature¶
- Plot temperature for a random date
cat_temp = col.search(variable='T2')
cat_temp.df.head()
- The data is organized in (virtual) Zarr stores, with one year’s worth of data per file
- Select a year. This is done by setting the start time to Jan 1st of that year, or the end time to Dec 31st of the same year (see the sketch after this list)
- This also means that if you want to request data for another day, say Oct 1st of year YYYY, you first have to load the data for the whole year YYYY and then select that particular day. This example is discussed below.
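Here is the sketch referenced above: the equivalent selection using the end of the year instead of the start, assuming the catalog’s end_time entries are formatted the same way as start_time:
# Equivalent to searching by start_time = "2020-01-01"
# (assumes end_time entries look like "2020-12-31")
cat_temp_subset = cat_temp.search(end_time="2020-12-31")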
date = "2020-01-01"
# year = "2021"
cat_temp_subset = cat_temp.search(start_time = date)
cat_temp_subset
Load data into xarray¶
# Load catalog entries for subset into a dictionary of xarray datasets, and open the first one.
dsets = cat_temp_subset.to_dataset_dict(zarr_kwargs={"consolidated": True})
print(f"\nDataset dictionary keys:\n {dsets.keys()}")Loading...
# Load the first dataset and display a summary.
dataset_key = list(dsets.keys())[0]
# store_name = dataset_key + ".zarr"
print(dsets.keys())
ds = dsets[dataset_key]
ds = ds.T2
ds
%%time
desired_date = "2020-10-01"
ds_subset = ds.sel(Time=desired_date,method='nearest')
ds_subset
%%time
ds_subset.plot(cmap='inferno')
cluster.close()