Stream: ESDS

Topic: Dask: KilledWorker Error


view this post on Zulip Notification Bot (Mar 17 2022 at 13:23):

This topic was moved by Anderson Banihirwe to #dask > KilledWorker Error

view this post on Zulip Alice DuVivier (Mar 29 2022 at 04:29):

@Michael Levy, following up from what we were working on today. I used the Dask setup you suggested, split up the notebook, and am trying to just process the small experiment with 5 ensemble members, but I'm getting the KilledWorker error when trying to write my 500 MB file. If you're willing to look at this again, I'd appreciate your input.
Notebook: /glade/p/cgd/ppc/duvivier/cesm2_arctic_cyclones/rufmod_analysis/version_3/wind_vertical/rufmod_wind_vertical_processing.ipynb

Error: KilledWorker: ("('_vertical_remap-12d6f63cd38f9fd5cbb1291da0026378', 0, 0, 0, 0, 0)", <WorkerState 'tcp://10.12.206.37:33228', name: PBSCluster-0-5, status: closed, memory: 0, processing: 1>)

view this post on Zulip Michael Levy (Mar 29 2022 at 14:21):

@Alice DuVivier I'm happy to look at this again -- if you want to chat about it, my only meeting today is at 2:00 so I'm available any other time.

view this post on Zulip Alice DuVivier (Mar 30 2022 at 01:19):

@Michael Levy and I worked on this today. Basically it was a question of making sure the chunk size and number of chunks were reasonable and then asking for more memory, processes, and cores in the dask request. With those changes everything works now.
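
A minimal sketch of the kind of chunk sanity check this involved (the file list, variable name, and chunk sizes below are placeholders, not the exact values from the notebook):

import numpy as np
import xarray as xr

my_files = ['member_001.nc', 'member_002.nc']  # placeholder list of netCDF paths
ds = xr.open_mfdataset(my_files, chunks={'time': 129}, parallel=True)

arr = ds['U'].data  # underlying dask array; 'U' is a hypothetical variable name
chunk_mb = np.prod(arr.chunksize) * arr.dtype.itemsize / 1e6
print(f"~{chunk_mb:.0f} MB per chunk, {np.prod(arr.numblocks)} chunks total")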

view this post on Zulip Deepak Cherian (Mar 31 2022 at 16:42):

Alice DuVivier said:

Michael Levy and I worked on this today. Basically it was a question of making sure the chunk size and number of chunks were reasonable and then asking for more memory, processes, and cores in the dask request. With those changes everything works now.

This would be a cool short blog post! Just a before/after comparison.

view this post on Zulip Michael Levy (Mar 31 2022 at 16:50):

Deepak Cherian said:

Alice DuVivier said:

Michael Levy and I worked on this today. Basically it was a question of making sure the chunk size and number of chunks were reasonable and then asking for more memory, processes, and cores in the dask request. With those changes everything works now.

This would be a cool short blog post! Just a before/after comparison.

I agree that it would be a great blog post to have - I could give it a shot, but I worry that my justification for the chunk choices and some arguments we picked for PBSCluster() will be "well, we tried it and it worked so that was cool." I think the big thing was running the cluster with

from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=36,                                     # total cores per PBS job
    memory='300 GB',                              # total memory per PBS job
    processes=9,                                  # 9 workers per job (~4 threads, ~33 GB each)
    resource_spec='select=1:ncpus=36:mem=300GB',  # matching PBS resource request
)
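
A minimal sketch of what comes next, assuming dask-jobqueue and dask.distributed are installed (the number of jobs here is illustrative, not necessarily what we used):

from dask.distributed import Client

cluster.scale(jobs=1)     # submit one PBS job matching the spec above (9 workers)
client = Client(cluster)  # connect this notebook's session to the scheduler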

I would love to have a permanent link to that reference instead of needing to remember which of my notebooks actually use dask-jobqueue and are configured correctly.

Another concern about writing the blog post is that it is still unclear to me if we should explicitly recommend other arguments to avoid potential conflicts with ~/.config/dask/jobqueue.yaml (or if we should be recommending any settings for that file instead / in addition). @Alice DuVivier might have been explicitly setting queue, walltime, and / or interface, and obviously we should mention project
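
For concreteness, these are the kinds of arguments in question; the values below are placeholders rather than recommendations, and anything not passed explicitly falls back to the defaults in ~/.config/dask/jobqueue.yaml (newer dask-jobqueue releases spell project as account):

cluster = PBSCluster(
    queue='regular',         # placeholder queue name
    walltime='02:00:00',     # placeholder walltime
    interface='ib0',         # placeholder network interface for worker communication
    project='PROJECT_CODE',  # placeholder project/account code
    # ... plus the cores / memory / processes / resource_spec settings above
)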

view this post on Zulip Deepak Cherian (Mar 31 2022 at 16:53):

my justification for the chunk choices and some arguments we picked for PBSCluster() will be "well, we tried it and it worked so that was cool."

I think this is totally fine.

Another concern about writing the blog post is that it is still unclear to me if we should explicitly recommend other arguments to avoid potential conflicts

I think we can view the blogpost as trying to reason about when such modifications might be useful.

view this post on Zulip Alice DuVivier (Mar 31 2022 at 17:55):

It was also things like explicitly setting the chunks argument, e.g.

ds4_1 = xr.open_mfdataset(my_files, combine='by_coords', chunks={'time': 129}, parallel=True, compat='override', coords='minimal')

I do think some guidance on when is the "best" time to load data would be good. Like how would one choose when it's appropriate to do so? A lot of the calculations are the "lazy" kind in python, and this was new to me after using NCL and such. So some guidance on when and how to actually load the data would be great.

view this post on Zulip Katie Dagon (Mar 31 2022 at 17:58):

Alice DuVivier said:

I do think some guidance on when is the "best" time to load data would be good. Like how would one choose when it's appropriate to do so? A lot of the calculations are the "lazy" kind in python, and this was new to me after using NCL and such. So some guidance on when and how to actually load the data would be great.

Yeah hearing from others about this would be great. I tend to wait until I need to start plotting something, and then use .persist() on the final data array (after all the computations) so that it's easier to re-run the plot code over and over as I'm tweaking the visualization.
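
A minimal sketch of that pattern (the dataset, dimensions, and chunk sizes are toy placeholders, just to make the idea concrete):

import dask.array as da
import xarray as xr

# toy stand-in for the real data
ds = xr.Dataset({'U': (('time', 'lev'), da.random.random((1000, 30), chunks=(129, 30)))})

final = ds['U'].mean('lev')  # lazy: nothing has been computed yet
final = final.persist()      # compute on the workers and keep the result in their memory

final.plot()                 # re-running plotting cells is now cheap; no recomputation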

view this post on Zulip Alice DuVivier (Mar 31 2022 at 20:54):

I guess I don't really know why one would choose .persist() vs. .load() in different circumstances. I often try one or the other if it isn't working. For this particular notebook, the calculations took long enough that I saved the data I wanted to plot as a netCDF file and then do the plotting in a separate notebook, so as I tweak the visualization I don't have to worry about re-running the data manipulation if I time out or something. I don't always do that, but in this case I did.
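
A minimal sketch of that two-notebook pattern, assuming final is the dask-backed DataArray at the end of the calculations (the filename is a placeholder):

import xarray as xr

# processing notebook: writing out triggers the computation once
final.to_netcdf('winds_for_plotting.nc')

# plotting notebook: read the (much smaller) file back in; no cluster needed
plot_ready = xr.open_dataarray('winds_for_plotting.nc')
plot_ready.plot()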

view this post on Zulip Matt Long (Mar 31 2022 at 21:05):

.persist() keeps the data in distributed memory across the workers; .load() pulls everything into local memory.
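
A tiny sketch of the difference, assuming arr is a dask-backed DataArray and a distributed client is running:

arr_persisted = arr.persist()  # runs on the workers; results stay in distributed (worker) memory,
                               # and arr_persisted is still a chunked, dask-backed array
arr_local = arr.load()         # runs, then pulls the result into this process as a NumPy-backed
                               # array (.load() also fills arr in place; .compute() returns a copy)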

view this post on Zulip Katie Dagon (Mar 31 2022 at 21:15):

I found these documentation pages to be helpful in understanding the differences between .persist(), .compute(), and .load():
https://distributed.dask.org/en/stable/memory.html
https://distributed.dask.org/en/latest/manage-computation.html
https://xarray.pydata.org/en/stable/user-guide/dask.html#using-dask-with-xarray

