I'm trying to do some dask-enabled computations on casper/dav, which I'm using because the input files are on campaign storage. I recently realized that my scripts have been consistently slower than they used to be. After digging in, I found that the computations are essentially serial, because the dask workers are being canceled shortly after they start up. The Slurm job standard error files for the workers all contain:
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
I'm trying to figure out whether this is being caused by a change in the system or by a change in my environment and/or scripts. I'm not sure how long this has been going on. Have others seen this behavior on casper?
Have others seen this behavior on casper?
I haven't encountered this behavior myself, but I've seen similar reports online before. Can you point me to the location of the log file?
I haven't encountered this behavior myself
Scratch that :slight_smile: I just started seeing this behavior myself. I don't know what's going on yet:
distributed.worker - INFO - Start worker at: tcp://10.12.205.25:36228
distributed.worker - INFO - Listening to: tcp://10.12.205.25:36228
distributed.worker - INFO - dashboard at: 10.12.205.25:40625
distributed.worker - INFO - Waiting to connect to: tcp://128.117.181.206:36764
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 50.00 GB
distributed.worker - INFO - Local Directory: /glade/scratch/abanihi/worker-nww69z_s
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Closing Nanny at 'tcp://10.12.205.25:44881'
distributed.worker - INFO - Stopping worker at tcp://10.12.205.25:36228
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
@Keith Lindsay Were you using cluster.adapt() or cluster.scale()?
cluster.scale()
The client is being instantiated in function _tseries_gen of /glade/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/src/tseries_mod.py
I used cluster.scale()
Sorry, "this is a different conversation"
cluster.scale()
The client is being instantiated in function _tseries_gen of /glade/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/src/tseries_mod.py
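For context, the relevant pattern in _tseries_gen is roughly the sketch below; it's simplified, the resource values are placeholders, and the function name is just for illustration, not the actual code.

```python
# Simplified sketch only -- not the actual tseries_mod.py code.
# Resource values are placeholders; real defaults come from ~/.dask/jobqueue.yaml.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def _tseries_gen_sketch(num_workers):
    cluster = SLURMCluster(cores=1, memory="50GB")
    cluster.scale(num_workers)  # fixed-size scaling, rather than cluster.adapt(minimum=..., maximum=...)
    client = Client(cluster)    # the client is instantiated here, inside the function
    return client, cluster
```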
After some digging and looking at tseries_mod.py, my speculation so far is that this happens just when a worker comes online. The error is logged by this line. Throughout tseries_mod.py, the cluster is scaled up and down in a few places, and it's likely that in some cases the following is happening:
1. cluster.scale_up(num_workers) is called, requesting num_workers workers. This leaves a few workers in the queue (Slurm or PBS hasn't provisioned the resources for some of the workers yet).
2. cluster.scale_down(num_workers) is called. A signal is then sent to some workers to shut down. The workers which are still in the PBS/SLURM queue receive the shutdown signal before getting a chance to start, which could explain the logs:
distributed.nanny - INFO - Closing Nanny at 'tcp://10.12.205.25:44881'
distributed.worker - INFO - Stopping worker at tcp://10.12.205.25:36228
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
Scaling the cluster only once (i.e. calling cluster.scale() once) throughout the computation may provide more insights.
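Put differently, the pattern I suspect looks roughly like the sketch below, where the cluster is resized while some worker jobs are still sitting in the queue; the worker counts and cluster settings are made up.

```python
# Rough illustration of the suspected sequence; worker counts are made up
# and the cluster settings are placeholders.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=1, memory="50GB")

cluster.scale(36)  # scale up; Slurm/PBS may still be holding some of these jobs in the queue
# ... part of the computation runs with whichever workers have actually started ...
cluster.scale(12)  # scale down; workers still queued get the shutdown signal before
                   # they start -> "Closed worker has not yet started: None"

# The experiment suggested above: pick one size and call cluster.scale() exactly once.
```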
I will do more digging and will let you know if I find anything.
Thanks for taking a look, @Anderson Banihirwe.
While tseries_mod.py can end up calling cluster.scale() multiple times with different arguments, the problems that I'm currently encountering are occurring with cluster.scale() having been called just once. I just added a print statement to tseries_mod.py to confirm this and reran one of my notebooks. Now I'm getting an error from distributed that I don't recall encountering previously:
distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/utils.py", line 662, in log_errors
    yield
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/scheduler.py", line 2122, in remove_worker
    address = self.coerce_address(address)
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/scheduler.py", line 4831, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
I see some GitHub issues on this (GH #3374 and GH #3386) that I'll take a look at.
Perhaps I picked up some incompatibilities in a recent environment update.
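As a sanity check on that, something like the following (nothing specific to this problem) at least confirms which versions the notebook kernel is actually picking up:

```python
# Quick sanity check: print the dask/distributed versions the notebook kernel
# is actually importing, to spot an unintended environment change.
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)
```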
@Anderson Banihirwe, I tried using various earlier versions of distributed, which also meant using earlier versions of dask. These attempts didn't pan out.
However, I realized that my script was working on some datasets; the workers didn't fail for all of them. The datasets that are leading to failed workers are 4D (3D space + time), while the ones that work are 3D (2D space + time). For 4D datasets, I'm using smaller chunk sizes in time, in an attempt to keep memory per chunk from getting too big. I tried using small time chunks for 3D datasets and this also led to failed workers, telling me that I was on to something. So I tried increasing the time chunk size for 4D datasets. This worked, and I'm now up and running.
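For concreteness, the change is just the time chunk size used when opening the datasets; the path and numbers below are illustrative, not my actual settings.

```python
# Illustrative only -- the path and chunk sizes are placeholders, not my actual settings.
import xarray as xr

# Small time chunks (the kind I had been using for 4D variables) led to failed workers:
# ds = xr.open_dataset("tseries_input.nc", chunks={"time": 4})

# Larger time chunks are what got things running again:
ds = xr.open_dataset("tseries_input.nc", chunks={"time": 24})
```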
I have to admit that I don't understand why the small time chunk sizes are a problem now, while they previously weren't.
I don't know if this is related at all to why you are seeing the symptom of failed workers.
With the larger chunks, I did find it necessary to increase memory per worker in ~/.dask/jobqueue.yaml.
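For reference, the same per-worker memory setting can also be overridden where the cluster is constructed, if anyone prefers to keep it per-script; the value below is illustrative.

```python
# Illustrative: the memory kwarg overrides the per-worker memory that would
# otherwise come from the jobqueue settings in ~/.dask/jobqueue.yaml.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=1, memory="100GB")
```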