Dask Distributed Summit 2021 Takeaways#

Pangeo Session#

  • Cloud-optimized datasets help improve the speed of analysis

    • Entire 4 TB datasets open up in a few seconds

    • It is important to chunk your data and store it in object storage (see the Zarr sketch following this list)

  • Reading netCDF data remotely, especially from the cloud, can take a while…

    • Even when using dask.delayed, it is quite slow

  • Open-source projects, such as XGCM, make answering big science questions easier

    • Ex. looking at oxygen minimum zones by converting to density coordinates

  • Persistent pain points

    • Using dask vs. understanding dask - can we make this transition easier for non-daskers?

    • It can be tough to pin down the specific parts of code that are causing workers to die

  • We should get more science workflows in the cloud

    • Ensure these workflows are documented, especially in relation to dask

    • Document cases of Dask workers being killed and how to go about solving these issues

      • “Dask worker serial killers”

  • Minimal reproducible examples don’t always work - instead, curate domain-specific examples

  • STAC (SpatioTemporal Asset Catalog) can be helpful when working with satellite imagery, making it easy to search for specific datasets

    • Similar to Intake, but able to search spatially as well

    • The recently developed stackstac library allows for fast loading of STAC items into Xarray data structures (see the stackstac sketch following this list)
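
To illustrate the cloud-optimized point above, here is a minimal sketch of opening a chunked Zarr store directly from object storage with Xarray. The bucket path is a hypothetical placeholder, and reading a gs:// URL assumes gcsfs is installed.

```python
import xarray as xr

# Hypothetical object-store path; not a real dataset
store = "gs://my-bucket/my-4tb-dataset.zarr"

# Only metadata is read at open time, so even multi-terabyte stores open in
# seconds; the chunked data itself is fetched lazily by dask when computed
ds = xr.open_zarr(store, consolidated=True)
print(ds)
```

Because the data are chunked and the metadata consolidated, a couple of small reads are enough to discover the entire layout, which is what makes the “opens in a few seconds” experience possible.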
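
Similarly, here is a sketch of the STAC workflow: searching a catalog spatially and temporally with pystac-client, then lazily stacking the results with stackstac. The endpoint, collection name, and bounding box are illustrative, and pystac-client’s API has changed across versions, so treat this as a sketch rather than canonical usage.

```python
import pystac_client
import stackstac

# Example public STAC API endpoint and collection; adjust for your data
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-105.3, 39.9, -105.1, 40.1],  # spatial search near Boulder, CO
    datetime="2021-06-01/2021-06-30",   # temporal search
)

# stackstac lazily assembles the matching scenes into a dask-backed
# DataArray with (time, band, y, x) dimensions; nothing is downloaded
# until the array is actually computed
da = stackstac.stack(search.item_collection())
print(da)
```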

Growth and History of Anaconda + Numpy + SciPy + Dask#

This session was a keynote focused on how these packages and their surrounding ecosystem grew.

  • Key question: how do you get from the passionate few to the productive many?

    • Keep your motivated few to fewer than five people

  • Where efforts should be focused

    • Dedicating full-time employees to work on community development projects

    • Build the community as you go - this is essential

      • Sometimes it takes a while to build up your product… that’s okay!

  • Really successful projects limit scope and provide interfaces

    • Maybe success means that a project becomes an interface and serves as an underlying layer for something else

  • Working code solves most arguments

    • At the end of the day, someone has to sit down and code something up… if it works, great!

  • Great projects start with a need

    • The conceptual idea is not always important - specific use cases are more important

  • Start by solving something

    • This may require shrinking the scope to a more manageable size

  • Let teams go and do their own thing

    • Take scope -> give it to others -> let them do that thing

  • You need to be humble with your project, but not too humble; simple but not too simple

    • It is all about balance

  • Top down approaches typically do not work

    • You will always want more resources…

  • The power of Python is the community

    • Key idea - there are more smart people outside of your organization than inside your organization

    • If you try to do everything, it will inevitably fail - you need to keep things simple

    • Unlock the community - turn users into contributors

  • At some point, it transitions to where you “give away your legos”

  • Provide scope and build people

    • Move toward “grassroots” rather than top-down approaches

Xarray User Forum + Updates#

  • Xarray now includes a backend API, allowing contributors to add new backends for reading and operating on datasets (see the sketch following this list)

    • Including the raw_indexing method is important to enable lazy loading

  • Flexible indexing is coming to xarray soon

  • Duck arrays are now supported in Xarray

    • Duck arrays are arrays that look like NumPy arrays in the sense that they have a similar API, but are fundamentally different data types (e.g. a pint array, which looks like a NumPy array but carries units)

    • Helps when working with pint (see the pint sketch following this list)

    • Other examples of duck arrays include sparse and CuPy arrays

    • Most of the Xarray API supports duck arrays, though a few aspects still don’t
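
As a rough sketch of the new backend API, here is a toy BackendEntrypoint. The “.toy” format and the fabricated data are hypothetical, and a real backend would also wrap its variables in a BackendArray that exposes a raw indexing method to enable lazy loading; that part is omitted here for brevity.

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class ToyBackendEntrypoint(BackendEntrypoint):
    """Hypothetical backend for a made-up ".toy" file format."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # A real backend would parse filename_or_obj here; this sketch
        # fabricates data and loads it eagerly (no lazy BackendArray)
        ds = xr.Dataset({"temperature": ("x", np.arange(10.0))})
        if drop_variables:
            ds = ds.drop_vars(drop_variables)
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".toy")


# The entrypoint class can be passed directly as the engine
ds = xr.open_dataset("example.toy", engine=ToyBackendEntrypoint)
```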
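
And a minimal duck-array sketch with pint: the Quantity below mimics the NumPy API closely enough that Xarray can wrap it directly, and operations carry the units through (assuming reasonably recent versions of pint and Xarray).

```python
import numpy as np
import pint
import xarray as xr

ureg = pint.UnitRegistry(force_ndarray_like=True)

# A pint Quantity is a duck array: NumPy-like API, but it carries units
speed = ureg.Quantity(np.array([1.0, 2.0, 3.0]), "m/s")
da = xr.DataArray(speed, dims="x", name="speed")

# Most xarray operations work on the wrapped duck array and keep the units
print(da.mean().data)  # -> 2.0 meter / second
```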

Dask on High Performance Computing Systems#

  • Dask-jobqueue provides a uniform API for spinning up Dask clusters and interfacing with the job scheduler (see the sketch following this list)

  • MPI-based communication with Dask shows promising results

  • We are fortunate to have the HPC system we have (Cheyenne/Casper)

    • In terms of data access, integration of compute and login nodes

    • NCAR’s JupyterHub, which enables improved access to the Jupyter environment
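
Here is a minimal Dask-jobqueue sketch for a PBS system such as Cheyenne or Casper. The queue, project code, and resource requests are placeholders to adapt to your own allocation (note that newer dask-jobqueue releases rename project to account).

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Placeholder queue/project/resources; adapt to your system and allocation
cluster = PBSCluster(
    queue="regular",
    project="PROJECT_CODE",  # PBS project/account code (placeholder)
    cores=4,
    memory="16GB",
    walltime="01:00:00",
)
cluster.scale(jobs=2)  # submit two PBS jobs' worth of dask workers

client = Client(cluster)  # computations now run on the cluster's workers
```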

How can we tie this back to ESDS?#

Several of these sessions included speakers/organizers from within NCAR, such as Anderson Banihirwe (@andersy005) and Deepak Cherian (@dcherian). They, along with several others within NCAR, could be considered the “passionate few”. They are core developers of upstream projects such as Xarray and Dask, which are critical components of the scientific Python ecosystem, especially for geoscience. The key to ESDS is growing the user base, moving toward the “productive many”.

We should see the ESDS initiative as an opportunity to test out these ideas and push ahead with a grassroots movement focused on promoting open science and open development, empowering people to be more productive in their data workflows. Here are a few key points for us to consider:

  • Developing in small teams

  • Continuing to use Dask for scalable analysis on our HPC resources

  • Stressing the importance of open science and providing examples of what it looks like

  • Documenting complex scientific workflows

  • Continuing to explore the move toward cloud-native datasets

  • Focusing efforts on training people and turning them into contributors

We should continue to foster collaboration across NCAR, finding opportunities to share our workflows, improving the ways we share code, and moving outside our typical “silos” toward a more open, inclusive, and collaborative environment. We can build on the tools already available within the scientific Python ecosystem, such as Dask, to continue gathering insight from large datasets within the geoscience community.