I am trying to spin up a dask cluster with 72 workers, and when the machine gets busy it can take a long time for all the workers to be ready at once... so several workers may sit idle for a while (and, in extreme cases, the first workers to launch may run out of walltime before the rest of the cluster is available).
I'm currently running
cluster = PBSCluster(
    memory='20 GB',
    processes=1,
    cores=1,
    queue='casper',
    walltime='1:30:00',
    resource_spec='select=1:ncpus=1:mem=20GB',
    log_directory='./dask-logs',
)
cluster.scale(72)
client = Client(cluster)
client
so I get 72 jobs, each requesting a single core and being treated as a single worker. Is there a way to have cluster.scale() submit a single job request? If I were to blindly change things, I would try
cluster = PBSCluster(
    memory='1440 GB',
    processes=72,
    cores=72,
    queue='casper',
    walltime='1:30:00',
    resource_spec='select=72:ncpus=1:mem=20GB',
    log_directory='./dask-logs',
)
cluster.scale(1)
client = Client(cluster)
client
but I'm pretty confused about the interaction between dask and PBS. I want to avoid a situation where I request 72 cores and dask treats them as a single worker.
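For what it's worth, my reading of the dask-jobqueue docs is that scale() counts workers and submits ceil(workers / processes) jobs, so with processes=72 either cluster.scale(72) or the explicit cluster.scale(jobs=1) should come out to a single qsub. I also believe the generated job script just launches the worker command on the first node of the allocation, so a multi-chunk spec like select=72:ncpus=1 might land all 72 processes on one node anyway; a single-node spec seems safer if a big enough node exists. Here's an untested sketch of what I'd try (the select=1:ncpus=72:mem=1440GB spec is an assumption that one node can hold 72 cores and ~1.4 TB):

from dask.distributed import Client
from dask_jobqueue import PBSCluster

# One PBS job hosting all 72 workers (1 core / 20 GB each).
# NOTE: the single-node resource_spec below is an assumption -- it only
# works if a node with 72 cores and ~1.4 TB of memory is available.
cluster = PBSCluster(
    memory='1440 GB',
    processes=72,
    cores=72,
    queue='casper',
    walltime='1:30:00',
    resource_spec='select=1:ncpus=72:mem=1440GB',
    log_directory='./dask-logs',
)

# Sanity-check the generated PBS script before anything is submitted
print(cluster.job_script())

cluster.scale(jobs=1)   # one qsub; should be equivalent to cluster.scale(72) here
client = Client(cluster)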
@Michael Levy Have you seen this notebook from the dask tutorial led by @Negin Sobhani and @Brian Vanderwende this past spring? I found it to be a very helpful resource for understanding PBS and dask. I'm not sure if it answers your exact question, though. Maybe tweaking the resource_spec relative to the cores parameter? Could adaptive scaling be useful here?
Thanks @Katie Dagon ! I hadn't seen that notebook, but I'll read through it and follow some of the links. I think my use-case is different from most, though, so I'd like to stay away from adaptive scaling. While typical scripts can start running with a few workers and then scale up, I'd really like to make sure we have all the memory / cores we requested before doing any computation - and if we're going to wait for all of the workers to be available, it makes sense to let PBS treat it as a single job request instead of running one worker at a time as the resources become available.
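In the meantime, I can at least block until the full pool is connected before kicking off any computation, something like:

# Don't start computing until all 72 workers have checked in
client.wait_for_workers(n_workers=72)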