ESDS Update October 2021
October has been an active month! There were a variety of talks, plenty of Python questions answered during office hours, and a Python tutorial!
Check out the following ESDS update for the month of October 2021.
Xdev has made some important advances on Intake-ESM, which is a data catalog utility comprising an API to data assets. Essentially, intake-esm “abstracts away” the file system, enabling data search and discovery, automated queries and dataset construction, and portability across cloud and HPC platforms. We’re now working on a set of ideas we’re calling Funnel; this extends the data catalog with “analysis recipes”, providing an effective strategy for modularization and extensibility of workflows.
We also held our first discussion on xwrf, which is a new package meant to bring Weather Research and Forecasting (WRF) data into the Pangeo Ecosystem! Using this tool, users can read WRF output directly into Xarray, enabling the use of hvPlot. If you are interested in following along with that development, be sure to check out the
Python Package Overviews
A Jupyter Based Diagnostics Prototype (4 October 2021) - Max Grover (CGD)
An Overview of Xdev and Analysis Pain Points (18 October 2021) - Kevin Paul (CISL)
ESDS Blog Posts
End to End Workflow
Office Hour Questions
During the month of October 2021, our team answered a total of 14 questions at our weekly Xdev Office Hours.
Below is a summary of the most common questions brought up during office hours!
How do you get dask to work with stacking CESM2-LE data?
Worked through an example of subsetting the data and developing a pipeline
How do you submit jobs with different schedulers?
Suggested checking out the Dask-Jobqueue options
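As a rough sketch of that suggestion: Dask-Jobqueue provides one cluster class per scheduler (for example `PBSCluster` and `SLURMCluster`), so switching schedulers can come down to picking a different class. The `make_cluster` helper, the `CLUSTER_CLASSES` table, and the resource values in the usage comment are hypothetical names and placeholders chosen for illustration, not something recorded from office hours.

```python
import importlib

# Scheduler name -> dask_jobqueue cluster class name (illustrative subset).
CLUSTER_CLASSES = {
    "pbs": "PBSCluster",
    "slurm": "SLURMCluster",
}


def make_cluster(scheduler, **job_kwargs):
    """Build a dask_jobqueue cluster for the named scheduler (hypothetical helper)."""
    key = scheduler.lower()
    if key not in CLUSTER_CLASSES:
        raise ValueError(f"unsupported scheduler: {scheduler!r}")
    # Imported lazily so the name lookup above works without dask_jobqueue installed.
    dask_jobqueue = importlib.import_module("dask_jobqueue")
    cluster_class = getattr(dask_jobqueue, CLUSTER_CLASSES[key])
    return cluster_class(**job_kwargs)


# Example usage on a PBS system (resource values are placeholders):
# cluster = make_cluster("pbs", cores=1, memory="10GB", walltime="01:00:00")
# cluster.scale(jobs=4)  # ask the scheduler for four worker jobs
```

Keeping the scheduler choice in one place means the rest of a workflow only sees a cluster object, whichever batch system sits underneath.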
What is the most efficient way to compute annual means from a bunch of Earth System Prediction (ESP) data?
In some cases it makes sense to use the preprocess function, provided the files are big enough (e.g., ESP Decadal Prediction datasets)
Good case for preprocess - calculating annual means with files ~10s of GB in size
Bad case for preprocess - working with many smaller files, which leads to a large number of tasks and a slower process
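The preprocess approach can be sketched as follows, assuming "preprocess" refers to the `preprocess` argument of `xarray.open_mfdataset`. The function names, the `tas` variable, and the `member` dimension are illustrative assumptions, and the mean is unweighted (it ignores differing month lengths). A small synthetic dataset stands in for a single large file.

```python
import numpy as np
import pandas as pd
import xarray as xr


def annual_mean(ds):
    """Reduce one file's dataset to annual means along its time axis."""
    # Unweighted mean over the time steps that fall in each calendar year.
    return ds.groupby("time.year").mean("time")


def open_annual_means(paths):
    """Open many files lazily, applying annual_mean to each file as it is opened."""
    # Requires dask; the "member" dimension name is an assumption.
    return xr.open_mfdataset(
        paths, preprocess=annual_mean, combine="nested", concat_dim="member"
    )


# Synthetic stand-in for a single file: two years of monthly values.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
ds = xr.Dataset({"tas": ("time", np.arange(24.0))}, coords={"time": time})

reduced = annual_mean(ds)
print(reduced["tas"].values)  # [ 5.5 17.5]
```

Because the reduction runs once per file, each large file shrinks before the datasets are combined; with many small files the same pattern mostly adds tasks, which matches the "bad case" noted above.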
How do you optimize file reads with ESP data?
Make sure you know when to use the preprocess function for computations
How do you use one dataset to mask another when they have different dimensions?
Needed to create a loop and create new dimensions for the datasets
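For the simpler situation where the mask's dimensions are a subset of the data's dimensions and share the same names, a loop may not be needed. The sketch below is an assumption-laden alternative rather than the solution worked out at office hours; the array shapes and dimension names are invented for illustration.

```python
import numpy as np
import xarray as xr

# Data on (time, lat, lon); the mask only has (lat, lon).
data = xr.DataArray(np.arange(12.0).reshape(3, 2, 2), dims=("time", "lat", "lon"))
mask = xr.DataArray([[True, False], [True, True]], dims=("lat", "lon"))

# Add the missing dimension explicitly so the mask has the same dims as the data.
mask_3d = mask.expand_dims({"time": data.sizes["time"]})

# Keep values where the mask is True; everything else becomes NaN.
masked = data.where(mask_3d)
```

Since xarray broadcasts by dimension name, `data.where(mask)` gives the same result here; the explicit `expand_dims` step is shown only to mirror the "create new dimensions" part of the answer. When the two datasets use different dimension names or grids, more work (such as renaming or looping) is still needed.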