Dask Distributed Summit 2021 Takeaways#

Pangeo Session#

  • Cloud-optimized datasets help improve the speed of analysis

    • Entire 4 TB datasets open up in a few seconds

    • It is important to chunk your data and store it in object storage (see the Zarr sketch following this list)

  • Reading netCDF data remotely, especially from the cloud, can take a while…

    • Even when using dask.delayed, it is quite slow

  • Open-source projects, such as XGCM, make answering big science questions easier

    • Ex. looking at oxygen minimum zones by converting to density coordinates

  • Persistent pain points

    • Using dask vs. understanding dask - can we make this transition easier for non-daskers?

    • It can be tough to pin down the specific parts of code that are causing workers to die

  • We should get more science workflows in the cloud

    • Ensure these workflows are documented, especially in relation to dask

    • Document cases of Dask workers being killed and how to go about solving these issues

      • “Dask worker serial killers”

  • Minimal reproducible examples don’t always work - instead, curate domain-specific examples

  • STAC (SpatioTemporal Asset Catalog) can be helpful when working with satellite imagery, making it easy to search for specific datasets

    • Similar to Intake, but able to search spatially as well

    • The recently developed stackstac library allows for fast loading of STAC items into Xarray data structures (see the stackstac sketch following this list)
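
To illustrate the cloud-optimized point above, here is a minimal sketch of opening a chunked Zarr store directly from object storage with Xarray. The bucket path is a hypothetical placeholder, and reading a gs:// URL assumes gcsfs is installed.

```python
import xarray as xr

# Hypothetical object-store path; not a real dataset
store = "gs://my-bucket/my-4tb-dataset.zarr"

# Only metadata is read at open time, so even multi-terabyte stores open in
# seconds; the chunked data itself is fetched lazily by dask when computed
ds = xr.open_zarr(store, consolidated=True)
print(ds)
```

Because the data are chunked and the metadata consolidated, a couple of small reads are enough to discover the entire layout, which is what makes the “opens in a few seconds” experience possible.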
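
Similarly, here is a sketch of the STAC workflow: searching a catalog spatially and temporally with pystac-client, then lazily stacking the results with stackstac. The endpoint, collection name, and bounding box are illustrative, and pystac-client’s API has changed across versions, so treat this as a sketch rather than canonical usage.

```python
import pystac_client
import stackstac

# Example public STAC API endpoint and collection; adjust for your data
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-105.3, 39.9, -105.1, 40.1],  # spatial search near Boulder, CO
    datetime="2021-06-01/2021-06-30",   # temporal search
)

# stackstac lazily assembles the matching scenes into a dask-backed
# DataArray with (time, band, y, x) dimensions; nothing is downloaded
# until the array is actually computed
da = stackstac.stack(search.item_collection())
print(da)
```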

Growth and History of Anaconda + Numpy + SciPy + Dask#

This session was a keynote focused on how these packages and their surrounding ecosystem grew.

  • Key question: how do you get from the passionate few to the productive many?

    • Keep your motivated few to fewer than five people

  • Where efforts should be focused

    • Dedicating full-time employees to work on community development projects

    • Build the community as you go - this is essential

      • Sometimes it takes a while to build up your product… that’s okay!

  • Really successful projects limit scope and provide interfaces

    • Maybe success means that a project becomes an interface and serves as an underlying layer for something else

  • Working code solves most arguments

    • At the end of the day, someone has to sit down and code something up… if it works, great!

  • Great projects start with a need

    • The conceptual idea is not always important - specific use cases are more important

  • Start by solving something

    • This may require shrinking the scope to a more manageable size

  • Let teams go and do their own thing

    • Take scope -> give it to others -> let them do that thing

  • You need to be humble with your project, but not too humble; simple but not too simple

    • It is all about balance

  • Top down approaches typically do not work

    • You will always want more resources…

  • The power of Python is the community

    • Key idea - there are more smart people outside of your organization than inside your organization

    • If you try to do everything, it will inevitably fail - you need to keep things simple

    • Unlock the community - turn users into contributors

  • At some point, it transitions to where you “give away your legos”

  • Provide scope and build people

    • Move toward “grassroots” rather than top-down approaches

Xarray User Forum + Updates#

  • Xarray now includes a backend API, allowing contributors to add new backends for reading and operating on datasets (see the sketch following this list)

    • Including the raw_indexing method is important to enable lazy loading

  • Flexible indexing is coming to xarray soon

  • Duck arrays are now supported in Xarray

    • Duck arrays are arrays that look like NumPy arrays in the sense that they have a similar API, but are fundamentally different data types (e.g. a pint array, which looks like a NumPy array but carries units)

    • Helps when working with pint (see the pint sketch following this list)

    • Other examples of duck arrays include sparse and CuPy arrays

    • Most of the Xarray API supports duck arrays, though a few aspects still don’t
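
As a rough sketch of the new backend API, here is a toy BackendEntrypoint. The “.toy” format and the fabricated data are hypothetical, and a real backend would also wrap its variables in a BackendArray that exposes a raw indexing method to enable lazy loading; that part is omitted here for brevity.

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class ToyBackendEntrypoint(BackendEntrypoint):
    """Hypothetical backend for a made-up ".toy" file format."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # A real backend would parse filename_or_obj here; this sketch
        # fabricates data and loads it eagerly (no lazy BackendArray)
        ds = xr.Dataset({"temperature": ("x", np.arange(10.0))})
        if drop_variables:
            ds = ds.drop_vars(drop_variables)
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".toy")


# The entrypoint class can be passed directly as the engine
ds = xr.open_dataset("example.toy", engine=ToyBackendEntrypoint)
```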
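
And a minimal duck-array sketch with pint: the Quantity below mimics the NumPy API closely enough that Xarray can wrap it directly, and operations carry the units through (assuming reasonably recent versions of pint and Xarray).

```python
import numpy as np
import pint
import xarray as xr

ureg = pint.UnitRegistry(force_ndarray_like=True)

# A pint Quantity is a duck array: NumPy-like API, but it carries units
speed = ureg.Quantity(np.array([1.0, 2.0, 3.0]), "m/s")
da = xr.DataArray(speed, dims="x", name="speed")

# Most xarray operations work on the wrapped duck array and keep the units
print(da.mean().data)  # -> 2.0 meter / second
```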

Dask on High Performance Computing Systems#

  • Dask-jobqueue provides a uniform API for spinning up Dask clusters and interfacing with the job scheduler (see the sketch following this list)

  • MPI-based communication with Dask shows promising results

  • We are fortunate to have the HPC system we have (Cheyenne/Casper)

    • In terms of data access, integration of compute and login nodes

    • NCAR’s JupyterHub, which enables improved access to the Jupyter environment
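
Here is a minimal Dask-jobqueue sketch for a PBS system such as Cheyenne or Casper. The queue, project code, and resource requests are placeholders to adapt to your own allocation (note that newer dask-jobqueue releases rename project to account).

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Placeholder queue/project/resources; adapt to your system and allocation
cluster = PBSCluster(
    queue="regular",
    project="PROJECT_CODE",  # PBS project/account code (placeholder)
    cores=4,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=2)  # submit two PBS jobs' worth of dask workers

client = Client(cluster)  # computations now run on the cluster's workers
```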

How can we tie this back to ESDS?#

Several of these sessions included speakers/organizers from within NCAR, such as Anderson Banihirwe (@andersy005) and Deepak Cherian (@dcherian). They, along with several others within NCAR, could be considered the “passionate few”. They are core developers of upstream projects such as Xarray and Dask, which are critical components of the scientific Python ecosystem, especially for geoscience. The key to ESDS is growing the user base, moving toward the “productive many”.

We should see the ESDS initiative as an opportunity to test out these ideas and push ahead with a grassroots movement focused on promoting open science and open development, empowering people to be more productive in their data workflows. Here are a few key points for us to consider:

  • Developing in small teams

  • Continuing to use Dask for scalable analysis on our HPC resources

  • Stressing the importance of open science and providing examples of what it looks like

  • Documenting complex scientific workflows

  • Continuing to explore the move toward cloud-native datasets

  • Focusing efforts on training people and turning them into contributors

We should continue to foster collaboration across NCAR, finding opportunities to share our workflows, improving the ways we share code, and moving outside our typical “silos” toward a more open, inclusive, and collaborative environment. We can build on the tools already available within the scientific Python ecosystem, such as Dask, to continue gathering insight from large datasets within the geoscience community.