Stream: dask

Topic: terminate dask cluster?

view this post on Zulip Stephen Yeager (Nov 19 2020 at 17:46):

I started a dask cluster that I no longer need. How do I terminate all my dask-worker jobs?

view this post on Zulip Deepak Cherian (Nov 19 2020 at 18:00):

in the notebook you can do cluster.close() assuming you started with cluster = NCARCluster(...) or something similar

view this post on Zulip Deepak Cherian (Nov 19 2020 at 18:01):

I think you'll want to do client.close() also to avoid a bunch of error messages (assuming client = distributed.Client(cluster))
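
A minimal sketch of the two calls above wrapped in one helper. The name shutdown is hypothetical, and closing the client before the cluster is an assumption (the thread does not specify an order). The helper relies only on both objects exposing a close() method, as distributed.Client and the dask cluster objects do.

```python
def shutdown(client, cluster):
    """Close a dask client and then the cluster it was attached to."""
    # Assumption: closing the client first keeps it from complaining
    # about a scheduler that has already gone away.
    client.close()
    # Closing the cluster shuts down the scheduler and the workers it manages.
    cluster.close()
```

In a notebook this would be called as shutdown(client, cluster) with the objects created earlier in the session.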

view this post on Zulip Keith Lindsay (Nov 19 2020 at 18:05):

If you want to kill the jobs from the command line the following will kill ALL of your dask-workers:

squeue -u $USER | grep dask-wor | awk '{print $1}' | xargs scancel

Don't run this if you have other dask-workers that you want to keep around.

This is assuming that your workers are on casper/slurm.
Replace squeue and scancel with corresponding PBS commands if your workers are on cheyenne/PBS.
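
For anyone who prefers to script this, here is a rough Python equivalent of the grep/awk/xargs pipeline above. The function names are hypothetical, and the sketch assumes the job ID is the first whitespace-separated column of the listing and that the job name contains "dask-wor", exactly as the shell one-liner does. Unlike a plain xargs, it skips the cancel command when nothing matches.

```python
import subprocess


def dask_worker_job_ids(listing, pattern="dask-wor"):
    """Mimic `grep dask-wor | awk '{print $1}'` on a job listing."""
    ids = []
    for line in listing.splitlines():
        fields = line.split()
        # Keep the first column of every non-empty line that mentions the pattern.
        if fields and pattern in line:
            ids.append(fields[0])
    return ids


def cancel_dask_workers(list_cmd, cancel_cmd, pattern="dask-wor"):
    """Run the listing command, then pass matching job IDs to the cancel command."""
    listing = subprocess.run(
        list_cmd, capture_output=True, text=True, check=True
    ).stdout
    ids = dask_worker_job_ids(listing, pattern)
    if ids:  # do not call the cancel command with an empty argument list
        subprocess.run(cancel_cmd + ids, check=True)
    return ids
```

On casper/slurm the call would look like cancel_dask_workers(["squeue", "-u", os.environ["USER"]], ["scancel"]) after import os. As with the shell version, this cancels ALL matching jobs.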

view this post on Zulip Stephen Yeager (Nov 19 2020 at 18:06):

Thanks, that worked to kill the second batch of workers, but I invoked cluster = NCARCluster(...) and client = Client(cluster) twice in the same notebook. The first batch of workers seem to be in limbo and no longer connected to my notebook?

view this post on Zulip Deepak Cherian (Nov 19 2020 at 18:07):

ah yes i don't know a good notebook-only solution for the two cluster problem. @Anderson Banihirwe any ideas?

view this post on Zulip Anderson Banihirwe (Nov 19 2020 at 18:32):

There's a hidden _global_clients object which holds references to clients instantiated in a Python session. You can loop over this object to see which clients have been registered.

In [13]: from distributed.client import _global_clients

In [31]: for k, v in _global_clients.items():
    ...:     print(k, v)
0 <Client: 'tcp://' processes=4 threads=8, memory=17.18 GB>
1 <Client: 'tcp://' processes=4 threads=8, memory=17.18 GB>
2 <Client: 'tcp://' processes=4 threads=8, memory=17.18 GB>
3 <Client: 'tcp://' processes=4 threads=8, memory=17.18 GB>

Once you know the key corresponding to the client/cluster you want to terminate, you can retrieve that entry and close the cluster associated with it:

In [38]: _global_clients[1].cluster.close()
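
A sketch that applies the lookup above to every registered client at once, for the case where several clusters were started in one notebook. The function name close_registered_clusters and its parameters are hypothetical. _global_clients is a private attribute of distributed, so its location and behaviour may change between releases; the sketch also assumes a client's cluster attribute can be None when the client was not created from a cluster object.

```python
def close_registered_clusters(registry=None, keys=None):
    """Close clients (and their clusters) found in a client registry.

    registry: mapping of key -> client; defaults to distributed's private
              _global_clients when not given.
    keys:     optional iterable of keys to close; None means all of them.
    """
    if registry is None:
        # Private API: imported lazily and only when no registry is passed in.
        from distributed.client import _global_clients as registry
    wanted = None if keys is None else set(keys)
    closed = []
    # Snapshot the items: closing a client may drop its registry entry.
    for key, client in list(registry.items()):
        if wanted is not None and key not in wanted:
            continue
        cluster = getattr(client, "cluster", None)
        client.close()
        if cluster is not None:
            cluster.close()
        closed.append(key)
    return closed
```

Calling close_registered_clusters(keys=[1]) would be roughly the loop-free one-liner above plus a client.close(); calling it with no arguments closes everything the session still knows about.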

view this post on Zulip Stephen Yeager (Jan 11 2022 at 17:49):

For reference, note that the wide-format option of qstat is needed to terminate dask workers on cheyenne/PBS:

qstat -w -u $USER | grep dask-wor | awk '{print $1}' | xargs qdel

view this post on Zulip Jared Baker (Jan 11 2022 at 22:39):

golf? :grinning:
qselect -N dask-worker -u $USER | xargs qdel

view this post on Zulip Deepak Cherian (Jan 12 2022 at 17:29):

This should be on the ESDS FAQ!

view this post on Zulip Brian Bonnlander (Jan 13 2022 at 19:18):

The command doesn't seem to work for me:

qselect -N dask-worker -u bonnland | xargs qdel
    qdel [-W force|suppress_email=X] [-x] job_identifier...
    qdel --version
: 123

qdel --version
pbs_version = 2021.1.1.20210518090504

view this post on Zulip Jared Baker (Jan 13 2022 at 21:39):

are you getting output from qselect? If not, which host are you on?

view this post on Zulip Brian Bonnlander (Jan 13 2022 at 21:41):

Thanks Jared, hostname is crhtc53 (via execcasper) and output from qselect looks like this:


view this post on Zulip Jared Baker (Jan 13 2022 at 22:01):

odd, if you're getting newline-delimited output, piping it into xargs qdel should be fine. What is interesting is that ': 123' output. I'm trying to even generate that output.

view this post on Zulip Brian Bonnlander (Jan 19 2022 at 21:34):

Hi Jared,

the dask terminate invocation is working for me now. It's possible it wasn't working because I invoked it from a Bash shell inside of JupyterHub. If I log in directly to Casper using Terminal instead, the command seems to work.

view this post on Zulip Jared Baker (Jan 20 2022 at 18:56):

Still should have been fine regardless. I mean, as long as you were within the casper domain, should have worked.

Last updated: Jan 30 2022 at 12:01 UTC