Stream: dask
Topic: terminate dask cluster?
Stephen Yeager (Nov 19 2020 at 17:46):
I started a dask cluster that I no longer need. How do I terminate all my dask-worker jobs?
Deepak Cherian (Nov 19 2020 at 18:00):
In the notebook you can run cluster.close(), assuming you started with cluster = NCARCluster(...) or something similar.
Deepak Cherian (Nov 19 2020 at 18:01):
I think you'll also want to do client.close() to avoid a bunch of error messages (assuming client = distributed.Client(cluster)).
Keith Lindsay (Nov 19 2020 at 18:05):
If you want to kill the jobs from the command line, the following will kill ALL of your dask-workers:
squeue -u $USER | grep dask-wor | awk '{print $1}' | xargs scancel
Don't run this if you have other dask-workers that you want to keep around.
This assumes your workers are on casper/slurm. Replace squeue and scancel with the corresponding PBS commands if your workers are on cheyenne/PBS.
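(A safe way to check what the pipeline above would cancel is to prefix the final command with echo, so the job IDs are printed instead of passed to scancel. A minimal dry-run sketch using mocked squeue output; the job IDs and column layout here are invented for illustration:)

```shell
# Mocked squeue output: first column is the job ID, second the job name.
# Prefixing scancel with echo turns the destructive pipeline into a dry run.
printf '%s\n' \
  '1001 dask-worker R 0:10' \
  '1002 dask-worker R 0:12' \
  '1003 my-analysis R 1:02' \
  | grep dask-wor | awk '{print $1}' | xargs echo scancel
# prints: scancel 1001 1002
```

Once the printed list looks right, drop the echo to actually cancel the jobs.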
Stephen Yeager (Nov 19 2020 at 18:06):
Thanks, that worked to kill the second batch of workers, but I invoked cluster = NCARCluster(...) and client = Client(cluster) twice in the same notebook. The first batch of workers seems to be in limbo and no longer connected to my notebook?
Deepak Cherian (Nov 19 2020 at 18:07):
Ah yes, I don't know a good notebook-only solution for the two-cluster problem. @Anderson Banihirwe any ideas?
Anderson Banihirwe (Nov 19 2020 at 18:32):
There's a hidden _global_clients object which holds references to clients instantiated in a Python session... You can loop over this object to see which clients have been registered:
In [13]: from distributed.client import _global_clients

In [31]: for k, v in _global_clients.items():
    ...:     print(k, v)
    ...:
0 <Client: 'tcp://127.0.0.1:59227' processes=4 threads=8, memory=17.18 GB>
1 <Client: 'tcp://127.0.0.1:59227' processes=4 threads=8, memory=17.18 GB>
2 <Client: 'tcp://127.0.0.1:59374' processes=4 threads=8, memory=17.18 GB>
3 <Client: 'tcp://127.0.0.1:59374' processes=4 threads=8, memory=17.18 GB>
Once you know the key corresponding to the client/cluster you want to terminate, you can close the cluster associated with it by retrieving the entry and deleting the corresponding cluster:
In [38]: _global_clients[1].cluster.close()
Stephen Yeager (Jan 11 2022 at 17:49):
For reference, note that the wide-format option of qstat is needed to terminate dask workers on cheyenne/PBS:
qstat -w -u $USER | grep dask-wor | awk '{print $1}' | xargs qdel
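(The same echo dry-run trick works for the PBS variant. A sketch against mocked wide-format qstat output; the job IDs and columns are invented for illustration, and the first column is assumed to be the full job identifier, which is what -w preserves:)

```shell
# Mocked `qstat -w` output: the first column is the full job ID that qdel needs.
# echo before qdel prints the command instead of running it.
printf '%s\n' \
  '1921640.casper-pbs  user  htc  dask-worker  1  4gb  R  00:10' \
  '1921641.casper-pbs  user  htc  dask-worker  1  4gb  R  00:12' \
  '1921642.casper-pbs  user  htc  postproc     1  4gb  R  01:02' \
  | grep dask-wor | awk '{print $1}' | xargs echo qdel
# prints: qdel 1921640.casper-pbs 1921641.casper-pbs
```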
Jared Baker (Jan 11 2022 at 22:39):
golf? :grinning:
qselect -N dask-worker -u $USER | xargs qdel
Deepak Cherian (Jan 12 2022 at 17:29):
This should be on the ESDS FAQ!
Brian Bonnlander (Jan 13 2022 at 19:18):
The command doesn't seem to work for me:
qselect -N dask-worker -u bonnland | xargs qdel
usage:
        qdel [-W force|suppress_email=X] [-x] job_identifier...
        qdel --version
: 123

qdel --version
pbs_version = 2021.1.1.20210518090504
Jared Baker (Jan 13 2022 at 21:39):
Are you getting output from qselect? If not, which host are you on?
Brian Bonnlander (Jan 13 2022 at 21:41):
Thanks Jared, hostname is crhtc53 (via execcasper) and output from qselect looks like this:
1921640.casper-pbs
1921641.casper-pbs
1921642.casper-pbs
1921643.casper-pbs
1921644.casper-pbs
1921645.casper-pbs
1921646.casper-pbs
1921647.casper-pbs
1921648.casper-pbs
1921649.casper-pbs
1921650.casper-pbs
Jared Baker (Jan 13 2022 at 22:01):
Odd. If you're getting newline-delimited output, piping it into xargs qdel should be fine. What is interesting is that ': 123' output. I'm trying to even generate that output.
Brian Bonnlander (Jan 19 2022 at 21:34):
Hi Jared,
the dask terminate invocation is working for me now. It's possible it wasn't working because I invoked it from a Bash shell inside JupyterHub. If I log in to Casper directly using a terminal, the command works.
Jared Baker (Jan 20 2022 at 18:56):
It still should have been fine regardless; as long as you were within the Casper domain, it should have worked.
Last updated: Jan 30 2022 at 12:01 UTC