Stream: dask

Topic: Dask Performance Issues?


Kevin Paul (May 06 2020 at 16:57):

Hello, Dask Users!

I'm writing to solicit feedback from Dask+Xarray users about any performance issues you run into when doing parallel operations with Dask. If you have a particular issue, please respond to this message with a detailed description of the problem. If you can provide a "minimal example" so that others can reproduce the problem, even better. And if someone posts an issue that you are experiencing too, tag their message with a :+1:.

Thanks for the feedback!

-- Kevin

Brian Bonnlander (May 06 2020 at 18:10):

I don't know how simple I can make this, but it is reproducible and relatively easy to set up.

We have a notebook we use to create Zarr stores from NetCDF files here: https://github.com/NCAR/cesm-lens-zarrification/blob/master/notebooks/zarrify.ipynb. In the same folder is a config.yaml file that specifies which LENS variables we want to work with. When saving a particularly large Zarr store (for example, SALT for the 20C experiment), the Zarr creation step fails if the task graph is created while only 2-4 workers are allocated, even if more than 10 workers become available within a minute or two. Those 10 workers must be available when the task graph is first created; otherwise the result is a KilledWorker error and a failed Zarr save.
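For illustration, here is a minimal sketch (not the actual zarrify.ipynb code) of the kind of NetCDF-to-Zarr workflow being described; the file pattern, chunk sizes, and output path are hypothetical:

```python
import xarray as xr
from dask.distributed import Client

# Cluster setup elided; on Casper this would be a dask-jobqueue style cluster
# scaled up to ~10 workers, which may arrive slowly when the machine is busy.
client = Client()

# Open the source NetCDF files lazily; the glob pattern and chunk sizes here
# are illustrative, not taken from config.yaml.
ds = xr.open_mfdataset("SALT.*.nc", combine="by_coords", chunks={"time": 6})

# Writing the store builds and executes the task graph; if only 2-4 workers
# exist at this point, the KilledWorker failure described above shows up even
# though more workers join a minute or two later.
ds[["SALT"]].to_zarr("SALT-20C.zarr", mode="w", consolidated=True)
```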

It would seem that Dask does not redistribute tasks evenly when more workers become available. I realize there is probably a subjective choice to be made about when to redistribute tasks, and I don't understand the underlying technical issues that might keep this from being a simple matter.
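One possible workaround sketch (not something proposed in this thread; the worker count is illustrative): dask.distributed's Client.wait_for_workers can block until a minimum number of workers have registered, so the graph is not first scheduled with only 2-4 workers.

```python
from dask.distributed import Client

client = Client()  # assumes a scheduler/cluster is already running

# Block until at least 10 workers have joined the scheduler before building
# and submitting the Zarr-writing graph.
client.wait_for_workers(10)

# ... open the dataset and call to_zarr() afterwards ...
```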

Let me know if I can provide more information.

Kevin Paul (May 06 2020 at 19:58):

@Brian Bonnlander Yeah. That's odd.

Brian Bonnlander (May 06 2020 at 20:03):

The Dask version being used is 2.14.0, partly because 2.15.0 seems to cause Dask dashboard issues on Casper. I have seen some discussion of recent changes to Dask's concurrent futures, but I'm not sure whether the problem goes away with newer versions.

Kevin Paul (May 06 2020 at 20:03):

Ok. Let's keep an eye on this, then.

Brian Bonnlander (May 06 2020 at 20:05):

And to be clear, the problem I'm having is not minor, because the load on Casper is pretty high most of the time. That means I have to try creating these stores at odd hours, since 2-4 workers is often all that is available until the evening.

Kevin Paul (May 06 2020 at 20:09):

Understood.

Deepak Cherian (May 06 2020 at 20:33):

@Brian Bonnlander Are you rechunking? This smells like a memory management problem, which is common in rechunking workloads (see https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588).
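For illustration, a hedged sketch of why rechunking workloads are memory-hungry: when the on-disk chunking (say, one chunk per time step) is rechunked into larger blocks before the Zarr write, each output chunk depends on many input chunks, and a worker assigned many output chunks early can exceed its memory limit. All sizes and file patterns below are hypothetical.

```python
import xarray as xr

# Hypothetical on-disk layout: one chunk per time step.
ds = xr.open_mfdataset("SALT.*.nc", combine="by_coords", chunks={"time": 1})

# Rechunk to 120-step blocks for the Zarr store: each output chunk now
# depends on 120 input chunks, which Dask may hold in memory simultaneously.
rechunked = ds.chunk({"time": 120})

print(ds["SALT"].data.chunksize, "->", rechunked["SALT"].data.chunksize)
```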

Brian Bonnlander (May 06 2020 at 20:46):

Yes, rechunking is part of the overall graph, and the workers are killed when one of the original workers runs out of memory. But it appears that the workers added after the initial graph is created have plenty of memory. Thanks for the link; I will look at it today when I get a chance.

