Stream: dask

Topic: Problem selecting proper resource


Michael Levy (Apr 01 2022 at 14:57):

I'm not sure if this is a dask problem, a JupyterHub problem, or a PBS issue... @Jared Baker, I'm tagging you in case it's a Hub or PBS problem, since I don't know if you check out this channel regularly. I was chatting with @Holly Olivarez, who ran into this issue first, but I was able to reproduce it on my own:

from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(
    cores=36,
    memory='300 GB',
    processes=9,
    resource_spec='select=1:ncpus=36:mem=300GB',
)

cluster.scale(1)

Works fine

client = Client(cluster)

fails with

RuntimeError: Command exited with non-zero exit code.
Exit code: 1
Command:
qsub /glade/scratch/mlevy/tmpdir/tmpaxepzplf.sh
stdout:
There was a problem selecting the proper resource. Please open a research computing ticket.

stderr:

This is dask 2022.01.0; I'm not sure whether version numbers from anything else would be useful. I was using the Hub to run from a Casper PBS node, and Holly was on the Casper login node.

Has anyone seen this before? As mentioned earlier in this stream (in a different topic), this PBSCluster() command was working fine just a few days ago... I believe out of the same conda environment I'm using to reproduce Holly's error.
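
For anyone reporting a similar problem, a minimal sketch for gathering the version numbers mentioned above. The package list is an assumption about what is relevant here, and `collect_versions` is a hypothetical helper, not part of dask; it uses only the standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("dask", "distributed", "dask-jobqueue")):
    """Map each distribution name to its installed version, or None if absent.

    The default package list is an assumption; adjust it to your environment.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the same conda environment as the notebook kernel gives the versions that the failing `PBSCluster` call actually used.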

Jared Baker (Apr 01 2022 at 14:58):

Where are you running this? JupyterHub Stable?

Michael Levy (Apr 01 2022 at 14:59):

yup, JupyterHub Stable

Jared Baker (Apr 01 2022 at 15:00):

okay. I know where that message comes from. Give me just a few seconds.

Jared Baker (Apr 01 2022 at 15:05):

What about now?

Michael Levy (Apr 01 2022 at 15:09):

looks like it's working, thanks! (I needed to add the queue argument, but that's probably an issue in my configuration :)

Matt Long (Apr 01 2022 at 15:58):

Yeah, it looks to me like the queuing system was not getting sufficient information (i.e., the queue!).
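
A minimal sketch of the fix discussed above: pass `queue` explicitly to `PBSCluster` and confirm that the generated job script carries a `#PBS -q` directive before anything is submitted. The queue name `"casper"` is an assumption (use whatever your site expects), and `find_pbs_queue` and `build_cluster` are hypothetical helpers, not part of dask-jobqueue.

```python
import re


def find_pbs_queue(job_script):
    """Return the queue named by a '#PBS -q' directive, or None if there is none."""
    match = re.search(r"^#PBS\s+-q\s+(\S+)", job_script, flags=re.MULTILINE)
    return match.group(1) if match else None


def build_cluster(queue):
    """Create the cluster from the original post, plus an explicit queue."""
    # Imported lazily so find_pbs_queue can be used without dask-jobqueue installed.
    from dask_jobqueue import PBSCluster

    return PBSCluster(
        cores=36,
        memory='300 GB',
        processes=9,
        resource_spec='select=1:ncpus=36:mem=300GB',
        queue=queue,  # assumption: the site-specific queue name goes here
    )


if __name__ == "__main__":
    cluster = build_cluster("casper")  # "casper" is an assumed queue name
    # job_script() returns the text that would be handed to qsub.
    queue_in_script = find_pbs_queue(cluster.job_script())
    print("queue in job script:", queue_in_script)
```

If the helper reports `None`, the job script has no queue directive, which would match the situation described in this thread. The queue can also be set once in the dask-jobqueue configuration file rather than in every call.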


Last updated: May 16 2025 at 17:14 UTC