Stream: python-questions
Topic: dask workers on casper timing out during start up
Keith Lindsay (Jan 25 2020 at 14:24):
I'm trying to do some dask-enabled computations on casper/dav, which I'm using because the input files are on campaign storage. I recently realized that my scripts have been consistently slower than they used to be. After digging in, I realized that the computations are essentially serial, because the dask workers are being canceled shortly after they start up. The slurm job standard error files for the workers all contain
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
I'm trying to figure out whether this is being caused by a change in the system or by a change in my environment and/or scripts. I'm not sure how long this has been going on. Have others seen this behavior on casper?
Anderson Banihirwe (Jan 25 2020 at 18:18):
Have others seen this behavior on casper?
I haven't encountered this behavior myself, but I've seen similar reports online before. Can you point me to the location of the log file?
Anderson Banihirwe (Jan 25 2020 at 19:39):
I haven't encountered this behavior myself
Scratch that :slight_smile: I just started seeing this behavior myself. I don't know what's going on yet:
distributed.worker - INFO - Start worker at: tcp://10.12.205.25:36228
distributed.worker - INFO - Listening to: tcp://10.12.205.25:36228
distributed.worker - INFO - dashboard at: 10.12.205.25:40625
distributed.worker - INFO - Waiting to connect to: tcp://128.117.181.206:36764
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 50.00 GB
distributed.worker - INFO - Local Directory: /glade/scratch/abanihi/worker-nww69z_s
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Closing Nanny at 'tcp://10.12.205.25:44881'
distributed.worker - INFO - Stopping worker at tcp://10.12.205.25:36228
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
Anderson Banihirwe (Jan 25 2020 at 23:46):
@Keith Lindsay
Anderson Banihirwe (Jan 25 2020 at 23:46):
Were you using cluster.adapt() or cluster.scale()?
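For reference, a minimal sketch of the difference between the two, assuming dask_jobqueue's SLURMCluster (the resource numbers and worker counts here are placeholders, not the actual settings):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; in practice the defaults come from ~/.dask/jobqueue.yaml
cluster = SLURMCluster(cores=1, memory="50GB", walltime="01:00:00")
client = Client(cluster)

# cluster.scale(): ask for a fixed number of worker jobs up front
cluster.scale(8)

# cluster.adapt(): let dask grow/shrink the pool between bounds as load changes
# cluster.adapt(minimum=2, maximum=16)
```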
Keith Lindsay (Jan 25 2020 at 23:49):
cluster.scale()
The client is being instantiated in function _tseries_gen of /glade/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/src/tseries_mod.py
Precious Mongwe (Jan 25 2020 at 23:49):
I used cluster.scale()
Precious Mongwe (Jan 25 2020 at 23:50):
Sorry, "this is a different conversation"
Anderson Banihirwe (Jan 26 2020 at 00:25):
cluster.scale()
The client is being instantiated in function _tseries_gen of /glade/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/src/tseries_mod.py

After some digging and looking at tseries_mod.py, my speculation so far is that this happens just when a worker comes online. The error is logged by this line. Throughout tseries_mod.py, the cluster is scaled up and down in a few places, and it's likely that in some cases the following is happening:
- cluster.scale_up(num_workers) is called
- xarray and dask go on to do the computation with fewer workers than the requested num_workers. This leaves a few workers in the queue (Slurm or PBS hasn't provisioned the resources for some of the workers yet)
- The computation finishes, and then cluster.scale_down(num_workers) is called. A signal is then sent to some workers to shut down. The workers that are still in the Slurm/PBS queue receive the shutdown signal before getting a chance to start, which could explain the logs:
distributed.nanny - INFO - Closing Nanny at 'tcp://10.12.205.25:44881'
distributed.worker - INFO - Stopping worker at tcp://10.12.205.25:36228
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
Scaling the cluster once (i.e. calling cluster.scale() once) throughout the computation may provide more insights.
I will do more digging, and will let you know if I find anything
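A minimal sketch of the "scale once" idea above, assuming dask_jobqueue plus distributed's Client.wait_for_workers (the worker count is a placeholder):

```python
from dask.distributed import Client

num_workers = 8  # placeholder; tseries_mod.py would choose this

# Assumes `cluster` is the dask_jobqueue cluster already created in tseries_mod.py
cluster.scale(num_workers)   # request the full pool once
client = Client(cluster)

# Block until the requested workers have actually connected, so a later
# shutdown can't reach worker jobs still sitting in the Slurm/PBS queue
client.wait_for_workers(num_workers)

# ... xarray/dask computation ...

client.close()
cluster.close()  # tear everything down at the end instead of scaling down mid-run
```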
Keith Lindsay (Jan 26 2020 at 01:24):
Thanks for taking a look @Anderson Banihirwe
While tseries_mod.py can end up calling cluster.scale() multiple times with different arguments, the problems that I'm currently encountering are occurring with cluster.scale() having been called just once. I just added a print statement to tseries_mod.py to confirm this and reran one of my notebooks. Now I'm getting an error from distributed that I don't recall encountering previously:
distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/utils.py", line 662, in log_errors
    yield
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/scheduler.py", line 2122, in remove_worker
    address = self.coerce_address(address)
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/scheduler.py", line 4831, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None
I see some GitHub issues on this (GH #3374 and GH #3386) that I'll take a look at.
Perhaps I introduced some incompatibilities in a recent environment update.
Keith Lindsay (Jan 26 2020 at 05:08):
@Anderson Banihirwe , I tried using various earlier versions of distributed, which also led me to use earlier versions of dask. These attempts didn't pan out.
However, I realized that my script was working on some datasets; the workers didn't fail for all of them. The datasets that are leading to failed workers are 4D (3D space + time), while the ones that work are 3D (2D space + time). For 4D datasets, I'm using smaller chunk sizes in time, in an attempt to keep memory per chunk from getting too big. I tried using small time chunks for 3D datasets and this led to failed workers, telling me that I was on to something. So I tried increasing the time chunk size for 4D datasets. This worked, and I'm now up and running.
I have to admit that I don't understand why the small time chunk sizes are a problem now, while they previously weren't.
I don't know if this is related at all to why you are seeing the symptom of failed workers.
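A hedged sketch of the kind of chunking change described above (the file name and chunk sizes are illustrative, not the values used in tseries_mod.py):

```python
import xarray as xr

# Small time chunks on a 4D (time, z, y, x) variable -> many tiny tasks;
# this is the configuration that was leaving workers failing
ds = xr.open_dataset("example_4d_field.nc", chunks={"time": 4})

# Larger time chunks -> fewer, bigger tasks; this is the change that got things running again
ds = xr.open_dataset("example_4d_field.nc", chunks={"time": 48})
```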
Keith Lindsay (Jan 26 2020 at 05:24):
With the larger chunks, I did find it necessary to increase memory per worker in ~/.dask/jobqueue.yaml.
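For reference, the same knob can also be set programmatically; a sketch, assuming the Slurm backend of dask_jobqueue (the value is a placeholder, not the actual setting):

```python
import dask
from dask_jobqueue import SLURMCluster

# Mirrors the `jobqueue: slurm: memory:` entry in ~/.dask/jobqueue.yaml;
# the value below is a placeholder
dask.config.set({"jobqueue.slurm.memory": "100GB"})

# Remaining settings (cores, queue, walltime, ...) still come from jobqueue.yaml
cluster = SLURMCluster()
```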