This topic was moved by Anderson Banihirwe to #dask > KilledWorker Error
@Michael Levy, following up from what we were working on today. I used the Dask setup you suggested and split up the notebook, and am trying to just process the small experiment with 5 ensemble members, but I'm getting the KilledWorker error when trying to write my 500 MB file. If you're willing to look at this again, I'd appreciate your input.
Notebook: /glade/p/cgd/ppc/duvivier/cesm2_arctic_cyclones/rufmod_analysis/version_3/wind_vertical/rufmod_wind_vertical_processing.ipynb
Error: KilledWorker: ("('_vertical_remap-12d6f63cd38f9fd5cbb1291da0026378', 0, 0, 0, 0, 0)", <WorkerState 'tcp://10.12.206.37:33228', name: PBSCluster-0-5, status: closed, memory: 0, processing: 1>)
@Alice DuVivier I'm happy to look at this again -- if you want to chat about it, my only meeting today is at 2:00, so I'm available any other time.
@Michael Levy and I worked on this today. Basically it was a question of making sure the chunk-size and number of chunks was reasonable and then asking for more memory, processes, and cores in the dask request. With those changes everything works now.
Alice DuVivier said:
Michael Levy and I worked on this today. Basically it was a question of making sure the chunk-size and number of chunks was reasonable and then asking for more memory, processes, and cores in the dask request. With those changes everything works now.
This would be a cool short blog post! Just a before/after comparison.
Deepak Cherian said:
Alice DuVivier said:
Michael Levy and I worked on this today. Basically it was a question of making sure the chunk-size and number of chunks was reasonable and then asking for more memory, processes, and cores in the dask request. With those changes everything works now.
This would be a cool short blog post! Just a before/after comparison.
I agree that it would be a great blog post to have - I could give it a shot, but I worry that my justification for the chunk choices and some arguments we picked for PBSCluster() will be "well, we tried it and it worked so that was cool." I think the big thing was running the cluster with
cluster = PBSCluster(
    cores=36,
    memory='300 GB',
    processes=9,
    resource_spec='select=1:ncpus=36:mem=300GB',
)
I would love to have a permanent link to that reference instead of needing to remember which of my notebooks actually use dask-jobqueue and are configured correctly.
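For context, a minimal sketch of how a cluster like that is typically used, assuming dask-jobqueue and dask.distributed are available (the number of jobs requested below is just an example):

from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(
    cores=36,
    memory='300 GB',
    processes=9,
    resource_spec='select=1:ncpus=36:mem=300GB',
)
cluster.scale(jobs=2)     # ask PBS for 2 jobs; with processes=9 that gives 18 Dask workers
client = Client(cluster)  # connect, so subsequent xarray/dask operations run on those workers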
Another concern about writing the blog post is that it is still unclear to me if we should explicitly recommend other arguments to avoid potential conflicts with ~/.config/dask/jobqueue.yaml (or if we should be recommending any settings for that file instead / in addition). @Alice DuVivier might have been explicitly setting queue, walltime, and / or interface, and obviously we should mention project.
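If those were passed to the constructor rather than set in the YAML file, it might look roughly like the sketch below; the queue, walltime, interface, and project values are placeholders rather than recommendations, and in newer dask-jobqueue releases project has been renamed to account:

cluster = PBSCluster(
    cores=36,
    memory='300 GB',
    processes=9,
    resource_spec='select=1:ncpus=36:mem=300GB',
    queue='casper',          # placeholder queue name
    walltime='02:00:00',     # placeholder walltime
    interface='ib0',         # placeholder network interface for worker communication
    project='PROJECT_CODE',  # replace with your project/account code
)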
my justification for the chunk choices and some arguments we picked for PBSCluster() will be "well, we tried it and it worked so that was cool."
I think this is totally fine.
Another concern about writing the blog post is that it is still unclear to me if we should explicitly recommend other arguments to avoid potential conflicts
I think we can view the blog post as trying to reason about when such modifications might be useful.
It was also things like explicitly setting the chunks argument, e.g.
ds4_1 = xr.open_mfdataset(my_files, combine='by_coords', chunks={'time': 129}, parallel=True, compat='override', coords='minimal')
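One way to sanity-check a chunking choice like that before computing anything is to look at how many chunks a variable has and how big each one is. This sketch continues from the open_mfdataset call above; the variable name 'U' is hypothetical:

import numpy as np

da = ds4_1['U']                                                   # hypothetical variable name
nchunks = np.prod(da.data.numblocks)                              # total number of chunks in the dask array
chunk_mb = np.prod(da.data.chunksize) * da.dtype.itemsize / 1e6   # size of the largest chunk, in MB
print(f'{nchunks} chunks of up to ~{chunk_mb:.0f} MB each')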
I do think some guidance on when is the "best" time to load data would be good. Like how would one choose when it's appropriate to do so? A lot of the calculations are the "lazy" kind in python, and this was new to me after using NCL and such. So some guidance on when and how to actually load the data would be great.
Alice DuVivier said:
I do think some guidance on when is the "best" time to load data would be good. Like how would one choose when it's appropriate to do so? A lot of the calculations are the "lazy" kind in python, and this was new to me after using NCL and such. So some guidance on when and how to actually load the data would be great.
Yeah hearing from others about this would be great. I tend to wait until I need to start plotting something, and then use .persist() on the final data array (after all the computations) so that it's easier to re-run the plot code over and over as I'm tweaking the visualization.
I guess I don't really know why one would choose .persist() vs. .load() in different circumstances; I often try one or the other if it isn't working. For this particular notebook, it took long enough to do the calculations that I saved the data I wanted to plot as a netCDF file and then do the plotting in a separate notebook, so that as I tweak the visualization I don't have to worry about re-running the data manipulation if I time out or something. I don't always do that, but in this case I did.
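A minimal sketch of that save-then-plot workflow, with made-up variable, dimension, and file names:

# processing notebook: do the expensive calculation once and write the result out
result = ds4_1['U'].mean(dim='ens')             # hypothetical reduction over a hypothetical dimension
result.load()                                   # trigger the computation and pull values into memory
result.to_netcdf('wind_vertical_processed.nc')  # hypothetical output file name

# plotting notebook: cheap to re-run while tweaking the figure
import xarray as xr
plot_data = xr.open_dataarray('wind_vertical_processed.nc')
plot_data.plot()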
.persist() keeps the memory distributed; .load() loads everything locally.
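Roughly, the difference looks like this in code (a sketch, assuming a distributed Client is running and da is a dask-backed DataArray):

da_workers = da.persist()   # start computing now; results stay spread across the workers' memory
da_in_mem = da.compute()    # compute and return a new DataArray backed by an in-memory numpy array
da.load()                   # like .compute(), but fills in da itself (in place)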
I found these documentation pages to be helpful in understanding the differences between .persist(), .compute(), and .load():
https://distributed.dask.org/en/stable/memory.html
https://distributed.dask.org/en/latest/manage-computation.html
https://xarray.pydata.org/en/stable/user-guide/dask.html#using-dask-with-xarray