Hi,
Up until a few days ago, when I would request a cluster of Dask workers using NCARCluster on Casper, I would see pending requests for those workers using the squeue --me command. Lately, however, I am not seeing the pending requests for workers, and I am never granted those workers.
Is anyone aware of this issue? Does SLURM not add worker requests to the queue if it's too full, maybe?
Are you having trouble accessing the actual JupyterHub? Accessing an environment with ncar_jobqueue installed? Or is it more an issue of being able to actually access compute nodes?
I am using my own conda environment with JupyterLab installed. It has been working great for months until recently.
So are you able to get into JupyterLab/notebook? Or is this an issue of actually running the notebook? When you log in, are you able to request the resources you need?
My question is about whether others have ever seen Dask worker requests not show up as pending in the SLURM queue. I am running JupyterLab without issues using execdav, as recommended.
Have you run cluster.scale(...)?
Could it also be the migration from SLURM to PBS on Casper? https://www2.cisl.ucar.edu/resources/computational-systems/casper/migrating-casper-jobs-slurm-pbs
Ah, I commented out the cluster.scale() line! I didn't realize its importance. I take it that this is what actually creates the SLURM requests.
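For anyone who lands on this thread later, the pattern I was missing is roughly the following (a minimal sketch; the worker count is just a placeholder):

from ncar_jobqueue import NCARCluster
from dask.distributed import Client

cluster = NCARCluster()   # cluster options come from your ncar-jobqueue/dask-jobqueue config or keyword arguments
client = Client(cluster)

# Nothing is submitted to the batch scheduler until you ask for workers:
cluster.scale(10)         # submits the worker jobs, which should then appear in squeue/qstat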
If we use scale(), do we still need to use wait_for_workers?
It depends... By default, Dask doesn't mind proceeding with work as long as it has access to at least one worker. So, if your workload requires a minimum amount of resources/workers, you may still need the wait_for_workers() call, because there's no guarantee that the resources requested via .scale() will all become available at the same time.
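For example (a minimal sketch; the worker count here is just an illustrative value):

cluster.scale(10)                       # request 10 workers from the batch scheduler
client = Client(cluster)                # client attached to the cluster created above
client.wait_for_workers(n_workers=10)   # block until all 10 workers have actually connected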
But recently, when I scale to a specific total number of workers, my code still gets stuck at wait_for_workers; then one of the nodes retires after some timeout, so my code hangs forever.
Could you share a minimal script to reproduce the problem, or point me to the code you are currently using? It's hard to know what's going on unless I can see your code.
I feel really daft asking this question, but I'm requesting 10 workers and 400 GB of memory with the following.
I'd assumed the cluster.scale(2) would get me 20 workers and 800 GB of memory, but this isn't happening.
What am I doing wrong?
from dask_jobqueue import PBSCluster
from dask.distributed import Client

ncores = 10      # not shown in the original snippet; presumably 10 cores per job, matching the 10 workers described above
nmem = '400GB'   # presumably 400 GB per job, i.e. 40 GB per worker

cluster = PBSCluster(
    cores=ncores,          # The number of cores you want
    memory=nmem,           # Amount of memory
    processes=ncores,      # How many processes
    queue='casper',        # The type of queue to utilize (/glade/u/apps/dav/opt/usr/bin/execcasper)
    resource_spec='select=1:ncpus=' + str(ncores) + ':mem=' + nmem,  # Specify resources
    project='P93300641',   # Input your project ID here
    walltime='1:00:00',    # Amount of wall time
    interface='ib0',       # Interface to use
    extra=["--lifetime", "55m", "--lifetime-stagger", "4m"],
)

# Scale up
cluster.scale(2)
Are you on Casper PBS Batch?
yes
What is happening?
I have a little wrapper for convenience (since ncar-jobqueue is currently broken):
def get_ClusterClient():
    import dask
    from dask_jobqueue import PBSCluster
    from dask.distributed import Client

    cluster = PBSCluster(
        cores=1,
        memory='25GB',
        processes=1,
        queue='casper',
        local_directory='$TMPDIR',
        log_directory='$TMPDIR',
        resource_spec='select=1:ncpus=1:mem=25GB',
        project='NCGD0011',
        walltime='01:00:00',
        interface='ib0',
    )
    dask.config.set({
        'distributed.dashboard.link': 'https://jupyterhub.hpc.ucar.edu/stable/user/{USER}/proxy/{port}/status'
    })
    client = Client(cluster)
    return cluster, client
Which I call like this:

cluster, client = utils.get_ClusterClient()
cluster.scale(12)  # adapt(minimum_jobs=0, maximum_jobs=24)
Works!
qstat -u $USER tells me my worker jobs are running. Dashboard is active, etc.
So are you requesting 40 GB of memory per worker then? Or are you looking for 400 GB per worker?
The following

from dask_jobqueue import PBSCluster
from dask.distributed import Client

ncores = 1
nmem = '40GB'

cluster = PBSCluster(
    cores=ncores,          # The number of cores you want
    memory=nmem,           # Amount of memory
    processes=ncores,      # How many processes
    queue='casper',        # The type of queue to utilize (/glade/u/apps/dav/opt/usr/bin/execcasper)
    resource_spec='select=1:ncpus=' + str(ncores) + ':mem=' + nmem,  # Specify resources
    project='NCGD0011',    # Input your project ID here
    walltime='1:00:00',    # Amount of wall time
    interface='ib0',       # Interface to use
)

cluster.scale(10)

results in 10 workers with 10 cores and 400 GB of memory. If you increase cluster.scale(n) to n=20, the result is 20 workers with 800 GB of memory.
No, 40 GB per worker. Matt's example seems to work fine; if I use .scale(20), I get the 800 GB of memory needed.
I guess the key would be to take however much memory you are looking for in total, divided by the number of workers you would like, to get the suggested nmem. Thanks for asking this question!
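In other words, the arithmetic looks something like this (numbers purely illustrative):

total_mem_gb = 800                       # total memory you want across all workers
n_workers = 20                           # value you will pass to cluster.scale(...)
nmem = f'{total_mem_gb // n_workers}GB'  # memory per worker, i.e. '40GB'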
Thanks @Max Grover and @Matt Long
FYI I'm using NCARCluster for the first time, launching a "casper batch" session via https://jupyterhub.hpc.ucar.edu/, and finding that 'identify_host()' cannot identify casper as "casper".
This is a known issue, related to the switch to PBS. See here: https://github.com/NCAR/ncar-jobqueue/issues/40
This issue has been resolved in the latest release of ncar-jobqueue. You can upgrade to the latest version via
conda install -c conda-forge ncar-jobqueue==2021.4.14
or
python -m pip install ncar-jobqueue --upgrade
You're awesome, @Anderson Banihirwe! Thank you!
I second that assessment!!