Stream: python-questions

Topic: dask workers on casper timing out during start up


view this post on Zulip Keith Lindsay (Jan 25 2020 at 14:24):

I'm trying to do some dask-enabled computations on casper/dav, which I'm using because the input files are on campaign storage. I recently realized that my scripts have been consistently slower than they used to be. After digging in, I realized that the computations are essentially serial, because the dask workers are being canceled shortly after they start up. The slurm job standard error files for the workers all contain

distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker

I'm trying to figure out whether this is being caused by a change in the system or by a change in my environment and/or scripts. I'm not sure how long this has been going on. Have others seen this behavior on casper?

view this post on Zulip Anderson Banihirwe (Jan 25 2020 at 18:18):

Have others seen this behavior on casper?

I haven't encountered this behavior myself, but I've seen similar reports online before. Can you point me to the location of the log file?

view this post on Zulip Anderson Banihirwe (Jan 25 2020 at 19:39):

I haven't encountered this behavior myself

Scratch that :slight_smile: I just started seeing this behavior myself. I don't know what's going on yet:

distributed.worker - INFO -       Start worker at:   tcp://10.12.205.25:36228
distributed.worker - INFO -          Listening to:   tcp://10.12.205.25:36228
distributed.worker - INFO -          dashboard at:         10.12.205.25:40625
distributed.worker - INFO - Waiting to connect to: tcp://128.117.181.206:36764
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                   50.00 GB
distributed.worker - INFO -       Local Directory: /glade/scratch/abanihi/worker-nww69z_s
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Closing Nanny at 'tcp://10.12.205.25:44881'
distributed.worker - INFO - Stopping worker at tcp://10.12.205.25:36228
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker

view this post on Zulip Anderson Banihirwe (Jan 25 2020 at 23:46):

@Keith Lindsay

view this post on Zulip Anderson Banihirwe (Jan 25 2020 at 23:46):

Were you using cluster.adapt() or cluster.scale() ??

view this post on Zulip Keith Lindsay (Jan 25 2020 at 23:49):

cluster.scale()
The client is being instantiated in function _tseries_gen of /glade/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/src/tseries_mod.py
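
(For context, a rough sketch of what a dask-jobqueue setup scaled with cluster.scale() typically looks like on a Slurm system such as casper; this is not the actual code in tseries_mod.py, and the queue, walltime, and worker count below are placeholders. The threads and memory values match the worker log shown earlier.)

# Hypothetical sketch of a SLURMCluster scaled once with cluster.scale();
# queue, walltime, and worker count are placeholders, not values from tseries_mod.py.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=1,                       # threads per worker ("Threads: 1" in the log)
    processes=1,
    memory="50GB",                 # "Memory: 50.00 GB" in the log
    queue="dav",                   # placeholder queue name
    walltime="02:00:00",
    local_directory="/glade/scratch/$USER",
)

cluster.scale(8)                   # ask Slurm for 8 worker jobs
client = Client(cluster)           # computations are submitted through this client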

view this post on Zulip Precious Mongwe (Jan 25 2020 at 23:49):

I used cluster.scale()

view this post on Zulip Precious Mongwe (Jan 25 2020 at 23:50):

Sorry, "this is a different conversation"

view this post on Zulip Anderson Banihirwe (Jan 26 2020 at 00:25):

cluster.scale()
The client is being instantiated in function _tseries_gen of /glade/work/klindsay/analysis/CESM2_coup_carb_cycle_JAMES/src/tseries_mod.py

After some digging and looking at tseries_mod.py, my speculation so far is that this happens just when a worker comes online. The error is logged by this line. Throughout tseries_mod.py, the cluster is scaled up and down in a few places, and it's likely that in some cases the following is happening:

distributed.nanny - INFO - Closing Nanny at 'tcp://10.12.205.25:44881'
distributed.worker - INFO - Stopping worker at tcp://10.12.205.25:36228
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker

Scaling the cluster only once (i.e., calling cluster.scale() a single time) throughout the computation may provide more insight.

I will do more digging and will let you know if I find anything.
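
(The "Timed out starting worker" message typically means the worker job started under Slurm but gave up waiting to connect to the scheduler within its death timeout, which defaults to 60 seconds in dask-jobqueue. Below is a hedged sketch of the experiment suggested above: scale once, then check how many workers actually arrive, with the death timeout raised as an extra diagnostic. The worker count and timeout value are placeholders.)

# Sketch: scale the cluster exactly once, wait for the workers, and report how
# many connected.  death_timeout is the number of seconds a worker waits for the
# scheduler before shutting itself down; 120 is an arbitrary, larger value.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=1, processes=1, memory="50GB", death_timeout=120)
cluster.scale(8)                                  # called a single time

client = Client(cluster)
client.wait_for_workers(8)                        # block until all workers connect
print(len(client.scheduler_info()["workers"]), "workers connected")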

view this post on Zulip Keith Lindsay (Jan 26 2020 at 01:24):

Thanks for taking a look @Anderson Banihirwe

While tseries_mod.py can end up calling cluster.scale() multiple times with different arguments, the problems that I'm currently encountering are occurring with cluster.scale() having been called just once. I just added a print statement to tseries_mod.py to confirm this and reran one of my notebooks. Now I'm getting an error from distributed that I don't recall encountering previously:

distributed.utils - ERROR - addresses should be strings or tuples, got None
Traceback (most recent call last):
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/utils.py", line 662, in log_errors
    yield
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/scheduler.py", line 2122, in remove_worker
    address = self.coerce_address(address)
  File "/glade/work/klindsay/miniconda3/envs/CESM2_coup_carb_cycle_JAMES/lib/python3.7/site-packages/distributed/scheduler.py", line 4831, in coerce_address
    raise TypeError("addresses should be strings or tuples, got %r" % (addr,))
TypeError: addresses should be strings or tuples, got None

I see some GitHub issues on this, GH #3374 and GH #3386, that I'll take a look at.
Perhaps I picked up some incompatibilities in a recent environment update.
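
(A quick way to check for that kind of incompatibility is to compare the dask and distributed versions seen by the client against those on the scheduler and workers; a minimal sketch, with placeholder resources and worker count:)

# Print the locally installed versions, then let distributed compare versions
# across the whole cluster; get_versions(check=True) flags mismatched packages.
import dask
import distributed
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

print("dask:", dask.__version__, "distributed:", distributed.__version__)

cluster = SLURMCluster(cores=1, processes=1, memory="50GB")
cluster.scale(2)
client = Client(cluster)
client.wait_for_workers(2)
client.get_versions(check=True)    # client vs. scheduler vs. workers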

view this post on Zulip Keith Lindsay (Jan 26 2020 at 05:08):

@Anderson Banihirwe , I tried various earlier versions of distributed, which also meant using earlier versions of dask. These attempts didn't pan out.

However, I realized that my script was working on some datasets; the workers didn't fail for all of them. The datasets that lead to failed workers are 4D (3D space + time), while the ones that work are 3D (2D space + time). For the 4D datasets, I'm using smaller chunk sizes in time, in an attempt to keep the memory per chunk from getting too big. I tried using small time chunks for the 3D datasets, and that also led to failed workers, which told me I was on to something. So I tried increasing the time chunk size for the 4D datasets. That worked, and I'm now up and running.

I have to admit that I don't understand why the small time chunk sizes are a problem now when they previously weren't.

I don't know if this is related at all to why you are seeing the symptom of failed workers.
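
(A hypothetical illustration of the chunking change described above; the file pattern, variable layout, and chunk sizes are placeholders rather than Keith's actual setup:)

# Small time chunks on a 4D (time, depth, lat, lon) variable produce many tiny
# tasks: the configuration that was leading to failed workers.
import xarray as xr

ds_small = xr.open_mfdataset("case.pop.h.*.nc", combine="by_coords",
                             chunks={"time": 1})

# Larger time chunks mean fewer, bigger tasks per worker: the change that got
# the computation running again, at the cost of more memory per chunk.
ds_large = xr.open_mfdataset("case.pop.h.*.nc", combine="by_coords",
                             chunks={"time": 12})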

view this post on Zulip Keith Lindsay (Jan 26 2020 at 05:24):

With the larger chunks, I did find it necessary to increase memory per worker in ~/.dask/jobqueue.yaml.
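
(The same memory-per-worker setting can also be overridden programmatically instead of editing ~/.dask/jobqueue.yaml; a minimal sketch, assuming the SLURM section of the dask-jobqueue config and a placeholder value of 100GB, with the remaining cluster settings still coming from the config file:)

# Override the configured memory per worker before building the cluster.
# "jobqueue.slurm.memory" is the key that ~/.dask/jobqueue.yaml populates for
# SLURMCluster; "100GB" is a placeholder value.
import dask
from dask_jobqueue import SLURMCluster

dask.config.set({"jobqueue.slurm.memory": "100GB"})
cluster = SLURMCluster()    # cores, queue, walltime, etc. still come from the config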

