Dask Distributed Summit 2021 Takeaways#
Pangeo Session#
Cloud optimized datasets help improve speed of analysis
Entire 4 TB datasets open up in a few seconds
Important to chunk your data and store it in an object store
Reading netCDF data remotely, especially from the cloud, can take a while…
Even with dask.delayed, it is quite slow - cloud-optimized formats such as Zarr avoid this (see the sketch below)
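As a concrete illustration, here is a minimal sketch of opening a chunked Zarr store directly from object storage with Xarray. The bucket path is hypothetical, and reading `gs://` URLs also requires gcsfs:

```python
import xarray as xr

# Open a cloud-optimized (Zarr) dataset lazily; only metadata is read here.
ds = xr.open_zarr(
    "gs://some-bucket/some-dataset.zarr",  # hypothetical object-store path
    consolidated=True,  # consolidated metadata avoids many small object reads
)
print(ds)  # the dataset "opens" in seconds; data loads on demand
```

Because the data are chunked and the metadata consolidated, even a multi-terabyte dataset opens in seconds, and Dask pulls only the chunks a computation actually needs.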
Open-source projects, such as xgcm, make answering big science questions easier
e.g., studying oxygen minimum zones by converting data to density coordinates (a sketch follows below)
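For instance, here is a minimal sketch of a vertical coordinate transform with xgcm, using made-up 1D profiles; `Grid.transform` requires a recent xgcm with numba installed:

```python
import numpy as np
import xarray as xr
from xgcm import Grid

# Synthetic oxygen and potential-density profiles on 10 depth levels
ds = xr.Dataset(
    {
        "o2": ("z", np.linspace(0.30, 0.05, 10)),      # made-up oxygen values
        "sigma0": ("z", np.linspace(24.0, 27.5, 10)),  # made-up density values
    },
    coords={"z": np.arange(10)},
)
grid = Grid(ds, coords={"Z": {"center": "z"}}, periodic=False)

# Interpolate oxygen from depth levels onto target density surfaces
target_sigma = np.linspace(24.0, 27.5, 20)
o2_on_sigma = grid.transform(ds.o2, "Z", target_sigma, target_data=ds.sigma0)
```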
Persistent pain points
Using dask vs. understanding dask - can we make this transition easier for non-daskers?
It can be tough to pin down the specific parts of code that cause workers to die
We should get more science workflows in the cloud
Ensure these workflows are documented, especially in relation to dask
Document cases of Dask workers being killed and how to go about solving these issues
“Dask worker serial killers”
Minimal reproducible examples don’t always work - instead, curate domain-specific examples
STAC (SpatioTemporal Asset Catalog) can be helpful when working with satellite imagery, making it easy to search for specific datasets
Similar to Intake, but able to search spatially as well
The recently developed stackstac library allows fast loading of STAC search results into Xarray (see the sketch below)
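A hedged sketch of the workflow: search a public STAC API with pystac-client, then load the matching items lazily with stackstac. The endpoint, collection name, and bounding box are illustrative only, and exact function names may differ across library versions:

```python
import pystac_client
import stackstac

# Search a public STAC API for imagery over a small region and time window
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v0")
search = catalog.search(
    collections=["sentinel-s2-l2a-cogs"],  # example collection
    bbox=[-105.3, 39.9, -105.1, 40.1],     # example box near Boulder, CO
    datetime="2021-05-01/2021-05-31",
)
items = search.get_all_items()

# stackstac turns the STAC items into a lazy, dask-backed DataArray
stack = stackstac.stack(items)
print(stack)  # (time, band, y, x) - nothing is downloaded until computed
```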
Pangeo Session Talk Links#
Chelle: notebook can be found in folder AWS-notebooks
Julius:
Rob:
Deepak:
Haiying:
Damien:
Growth and History of Anaconda + NumPy + SciPy + Dask#
This session was a keynote focused on how these packages + ecosystem grew
Key question: how do you get from a passionate few to a productive many?
Keep your motivated few to fewer than 5 people
Where efforts should be focused
Dedicating full-time employees to work on community dev projects
Build the community as you go - this is essential
Sometimes it takes a while to build up your product… that’s okay!
Really successful projects limit scope and provide interfaces
Maybe success means that a project becomes an interface and becomes an underlying layer for something else
Working code solves most arguments
At the end of the day, someone has to sit down and code something up… if it works, great!
Great projects start with a need
The conceptual idea is not always important - specific use cases are more important
Start by solving something
This may require shrinking the scope to a more manageable size
Let teams go and do their own thing
Take a piece of the scope -> give it to others -> let them do that thing
You need to be humble with your project, but not too humble; simple but not too simple
It is all about balance
Top down approaches typically do not work
You will always want more resources…
The power of Python is the community
Key idea - there are more smart people outside of your organization than inside your organization
If you try to do everything, it will inevitably fail - you need to keep things simple
Unlock the community - entrain users so that they become contributors
At some point, it transitions to where you “give away your legos”
Provide scope and build people
Move towards “grassroots” not so much top down
Xarray User Forum + Updates#
Xarray now includes a backend API, allowing contributors to add new backends for reading and operating on datasets
Including the `raw_indexing` method is important to enable lazy loading (see the sketch below)
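To make this concrete, here is a minimal sketch of a custom backend entrypoint. The class name and toy `.txt` format are hypothetical, and a production backend would also implement a `BackendArray` exposing a raw indexing method so that data loads lazily:

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint

class MyBackendEntrypoint(BackendEntrypoint):
    """Hypothetical backend that reads 1D data from plain-text files."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # A real backend would parse its own file format here (lazily, ideally)
        data = np.loadtxt(filename_or_obj)
        return xr.Dataset({"var": (("x",), data)})

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".txt")
```

Backends like this are advertised to Xarray through the `xarray.backends` entry-point group in the package metadata, after which users can select them via `xr.open_dataset(..., engine=...)`.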
Flexible indexing is coming to xarray soon
Helps when dealing with irregularly spaced data, projected coordinate systems, and staggered grids (e.g., 2D lat-lon coordinate variables)
Currently, Xarray uses pandas for indexing, which only supports 1D in-memory indexes and is therefore problematic for flexible indexing
A current workaround is the Xoak Xarray accessor, which can index multi-dimensional coordinates (a sketch follows below)
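A hedged sketch of what Xoak enables, using random 2D coordinates; the `sklearn_geo_balltree` index requires scikit-learn, and details may vary with the xoak version:

```python
import numpy as np
import xarray as xr
import xoak  # noqa: F401 - importing registers the .xoak accessor

# A dataset with 2D (curvilinear) lat-lon coordinates, which pandas-backed
# indexing cannot handle directly
ds = xr.Dataset(
    coords={
        "lat": (("y", "x"), np.random.uniform(-90, 90, (20, 20))),
        "lon": (("y", "x"), np.random.uniform(-180, 180, (20, 20))),
    }
)
ds.xoak.set_index(["lat", "lon"], "sklearn_geo_balltree")

# Select the grid points nearest to some target locations
targets = xr.Dataset({"lat": ("points", [40.0]), "lon": ("points", [-105.0])})
nearest = ds.xoak.sel(lat=targets.lat, lon=targets.lon)
```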
Duck arrays are now supported in Xarray
Duck arrays are arrays that look like numpy arrays in the sense that they have a similar API, but are fundamentally different data types (e.g., a pint array that looks like a numpy array)
Helps when working with pint
Other examples of duck arrays include sparse and CuPy arrays
Most of the Xarray API supports duck arrays, though a few aspects still don’t (see the sketch below)
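As a small illustration of what duck-array support means in practice, here is a sketch wrapping a pint Quantity in a DataArray (the values are made up):

```python
import numpy as np
import pint
import xarray as xr

ureg = pint.UnitRegistry()

# A pint Quantity is a duck array: numpy-like API, but it carries units
temps = ureg.Quantity(np.array([271.3, 273.9, 280.1]), "kelvin")

da = xr.DataArray(temps, dims=["time"], name="temperature")
print(da.data.units)  # kelvin - units survive inside the DataArray
print((da * 2).data)  # arithmetic goes through xarray and back to pint
```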
Xarray User Forum Talk Links#
Aureliana Barghini’s Talk
Links from Benoit Bovy’s talk
Davis Bennett’s talk
Ravi Kumar’s talk:
Dask on High Performance Computing Systems#
Dask-jobqueue provides a uniform API for spinning up Dask clusters and interfacing with HPC job schedulers (see the sketch at the end of this list)
MPI-based communication with Dask shows promising results
We are fortunate to have the HPC system we have (Cheyenne/Casper)
Particularly in terms of data access and the integration of compute and login nodes
The NCAR JupyterHub enables improved access to Jupyter environments
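For reference, a minimal sketch of spinning up a Dask cluster on a PBS-based system like Cheyenne with dask-jobqueue; the queue, project code, and resource numbers are placeholders:

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each PBS job launched by the cluster runs Dask workers with these resources
cluster = PBSCluster(
    cores=36,
    memory="100GB",
    queue="regular",          # placeholder queue name
    project="PROJECT_CODE",   # placeholder project/account code
    walltime="01:00:00",
)
cluster.scale(jobs=2)  # submit two worker jobs to the PBS scheduler
client = Client(cluster)  # connect so computations run on the cluster
```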
Dask on HPC Talk Links#
Anderson: The Current State of Deploying Dask on HPC Systems
Tina: Real Time Monitoring of HPC Simulation Outputs using Dask and Xarray
How can we tie this back to ESDS?#
Several of these sessions included speakers/organizers from within NCAR, such as Anderson Banihirwe (@andersy005) and Deepak Cherian (@dcherian). They, along with several others at NCAR, could be considered the “passionate few”. They are core developers of upstream projects such as Xarray and Dask, which are critical components of the scientific Python ecosystem, especially for geoscience. The key to ESDS is growing the user base, moving toward the “productive many”.
We should see the ESDS initiative as an opportunity to test out these ideas and push to forge ahead with a grassroots movement focused on promoting open science and open development, empowering people to be more productive in their data workflows. Here are a few key points for us to consider:
Developing in small teams
Continue to use Dask for scalable analysis using our HPC resources
Stressing the importance of open-science, providing examples of what this looks like
Documenting complex scientific workflows
Continue to explore moving toward cloud-native datasets
Focus efforts on training people, making them contributors
We should continue to foster collaboration across NCAR, finding opportunities to share our workflows, improving how we share code, and moving outside of our typical “silos” toward a more open, inclusive, and collaborative environment. We can build on the tools already available within the scientific Python ecosystem, such as Dask, to continue to gather insight from large datasets within the geoscience community.