Stream: dask

Topic: Not able to get NCARCluster resources allocated on Casper


Orhan Eroglu (Oct 07 2021 at 16:45):

Hi,

I am trying to run a geocat-comp function via Dask on Casper, but for the last two days I haven't been able to get the requested cluster/client resources allocated, i.e. I always get zero workers and zero memory. I am using the same ncar-jobqueue configs as a colleague, who is able to run Dask clusters successfully. My config file is located at /glade/u/home/oero/.config/dask for reference. Here is what I do, in order:

import dask.distributed as dd
from ncar_jobqueue import NCARCluster

cluster = NCARCluster()
cluster.scale(jobs=5)
client = dd.Client(cluster)

And this is what I have been getting for the client for the last two days:

[image: client report showing 0 workers, 0 memory]

I couldn't get it fixed. Any thoughts?

Anderson Banihirwe (Oct 07 2021 at 16:54):

The cluster.scale(jobs=5) call is asynchronous. As a result, you may get a report with zero workers if you print the client right after calling cluster.scale(), while the submitted jobs are still in a pending state...

Do you see any jobs via qstat -u $USER from the command line?
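
One way to avoid reading the report too early is to block until the workers have actually registered (a minimal sketch; Client.wait_for_workers is part of dask.distributed, and this assumes one worker process per job, so jobs=5 yields five workers):

cluster.scale(jobs=5)
client = dd.Client(cluster)
# Block until the PBS jobs start and the workers register with the
# scheduler; until then the client reports zero workers and zero memory.
client.wait_for_workers(5)
print(client)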

Anderson Banihirwe (Oct 07 2021 at 16:54):

When you get a moment, can you post here the output of print(cluster.job_script())?

Orhan Eroglu (Oct 07 2021 at 16:58):

Yeah, I figured it might be asynchronous, so I kept running the remaining cells with the dashboard open. The dashboard never showed any real-time resources either.

qstat -u $USER returned nothing this morning, but now it gives me:

[image: qstat output]

Brian Bonnlander (Oct 07 2021 at 16:59):

Hi @Orhan Eroglu

My own experience is that NCARCluster behaved differently on Casper after the switch from Slurm to PBS. I could be wrong, but the config settings for PBS were originally designed for Cheyenne. You may want to start by specifying everything (number of workers, cores, etc.) in your NCARCluster call so you know exactly which settings you are getting. Keep in mind that cluster.scale() requests multiple copies of your original settings.

Your default settings are found in ~/.dask/jobqueue.yaml, I believe.
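
For example, something along these lines (a sketch only: NCARCluster passes these keyword arguments through to the underlying dask-jobqueue cluster, and the values here are placeholders rather than recommendations):

from ncar_jobqueue import NCARCluster

# Spell out the per-job resources instead of relying on the YAML defaults;
# cluster.scale(jobs=n) then requests n copies of this specification.
cluster = NCARCluster(
    cores=2,                 # CPU cores per job
    processes=1,             # worker processes per job
    memory='25GB',           # memory per job
    queue='casper',
    walltime='01:00:00',
    resource_spec='select=1:ncpus=2:mem=25GB',
)
cluster.scale(jobs=5)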

Orhan Eroglu (Oct 07 2021 at 16:58):

And this is what I get from print(cluster.job_script()):

#PBS -N dask-worker-casper-dav
#PBS -q casper
#PBS -A NVST0001
#PBS -l select=1:ncpus=6:mem=128GB
#PBS -l walltime=01:00:00
#PBS -e /glade/scratch/oero/dask/casper-dav/logs/
#PBS -o /glade/scratch/oero/dask/casper-dav/logs/

/glade/work/oero/miniconda3/envs/s_at_s/bin/python -m distributed.cli.dask_worker tcp://10.12.206.18:36731 --nthreads 0 --nprocs 12 --memory-limit 9.93GiB --name dummy-name --nanny --death-timeout 60 --local-directory $TMPDIR --interface ib0 --protocol tcp://

Anderson Banihirwe (Oct 07 2021 at 17:03):

My own experience is that NCARCluster behaved differently on Casper after the switch from Slurm to PBS. I could be wrong, but the config settings for PBS were originally designed for Cheyenne.

This issue got fixed a while ago...

Brian Bonnlander (Oct 07 2021 at 17:04):

OK thanks, that was a total guess on my part.

Anderson Banihirwe (Oct 07 2021 at 17:10):

@Orhan Eroglu,

/glade/work/oero/miniconda3/envs/s_at_s/bin/python -m distributed.cli.dask_worker tcp://10.12.206.18:36731 --nthreads 0 --nprocs 12 --memory-limit 9.93GiB

The dask-worker launch script has --nthreads set to 0, and this is likely what's causing your issues...

What is the output of

In [7]: import dask

In [8]: dask.config.get('jobqueue.pbs')
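
For context: dask-jobqueue derives the thread count per worker process as (roughly) cores divided by processes, so --nthreads 0 suggests the loaded configuration has more processes than cores. A sketch of the arithmetic, using the numbers visible in the job script above:

# select=1:ncpus=6 in the job script, but 12 worker processes per job:
cores, processes = 6, 12
nthreads = cores // processes   # 6 // 12 == 0  ->  "--nthreads 0"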

Orhan Eroglu (Oct 07 2021 at 17:16):

Here it is:

{'name': 'dask-worker',
 'cores': 36,
 'memory': '109GB',
 'processes': 1,
 'interface': 'ib0',
 'queue': 'regular',
 'walltime': '01:00:00',
 'resource-spec': 'select=1:ncpus=36:mem=109GB',
 'local-directory': '/glade/scratch/oero',
 'project': '<our VAST project ID>',
 'job-extra': [],
 'log-directory': '/glade/scratch/oero'}

Brian Bonnlander (Oct 07 2021 at 17:18):

Yes, like me, you might still have personal NCARCluster config settings that were originally designed for Cheyenne. I see that in the resource-spec line.

Orhan Eroglu (Oct 07 2021 at 17:19):

Brian Bonnlander said:

Yes, like me, you might still have personal NCARCluster config settings that were originally designed for Cheyenne. I see that in the resource-spec line.

Oh yes, interesting: it is reading the Cheyenne configs instead of the Casper ones.

Brian Bonnlander (Oct 07 2021 at 17:20):

My way around this is to be explicit in the NCARCluster call; don't rely on defaults at first. Then modify the defaults when you find a good configuration.

Brian Bonnlander (Oct 07 2021 at 17:21):

Actually, I switched to PBSCluster because, for the moment, all of the HPC resources are managed with PBS.
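
For reference, the direct equivalent looks roughly like this (a sketch: dask_jobqueue.PBSCluster is the class ncar-jobqueue wraps on PBS machines, and the values shown are placeholders):

import dask.distributed as dd
from dask_jobqueue import PBSCluster

# Talk to dask-jobqueue directly instead of going through ncar-jobqueue.
cluster = PBSCluster(
    cores=2,
    processes=1,
    memory='25GB',
    queue='casper',
    walltime='01:00:00',
    project='NVST0001',   # PBS project/account code
)
cluster.scale(jobs=5)
client = dd.Client(cluster)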

Orhan Eroglu (Oct 07 2021 at 17:26):

But keep in mind, the output above is what I get when I run dask.config.get('jobqueue.pbs') right after import dask. When I run it after the NCARCluster initialization instead, the output is (which seems good, I think):

#!/usr/bin/env bash

#PBS -N dask-worker-casper-dav
#PBS -q casper
#PBS -A NVST0001
#PBS -l select=1:ncpus=1:mem=50GB
#PBS -l walltime=01:30:00
#PBS -e /glade/scratch/oero/dask/casper-dav/logs/
#PBS -o /glade/scratch/oero/dask/casper-dav/logs/

/glade/work/oero/miniconda3/envs/s_at_s/bin/python -m distributed.cli.dask_worker tcp://10.12.206.18:35301 --nthreads 2 --memory-limit 46.57GiB --name dummy-name --nanny --death-timeout 60 --local-directory /glade/scratch/oero/dask/casper-dav/local-dir --interface ib0 --protocol tcp://

Brian Bonnlander (Oct 07 2021 at 17:27):

OK, my theory is easily falsified, making it a great theory in one sense :smile:

Anderson Banihirwe (Oct 07 2021 at 17:33):

But keep in mind, the output above is what I get when I run dask.config.get('jobqueue.pbs') right after import dask. When I run it after the NCARCluster initialization instead, the output is (which seems good, I think):

This looks great, and I expect it to work...

Because Casper and Cheyenne both use the same scheduler (PBS) and dask doesn't know how to tell them apart, ncar-jobqueue introduces a hack that

(1) loads the right configurations,
(2) modifies dask's default configurations and then passes the modified configurations on to dask, and
(3) prevents dask from re-loading the default configurations.

So, in some cases (e.g. when you import dask after ncar-jobqueue), the changes made by ncar-jobqueue get overridden by dask (because dask reloads its configuration without being aware of ncar-jobqueue's custom configurations).
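
In practice that makes the import order matter. A defensive pattern, given the behavior described above (a sketch, not an official recommendation), is to import dask before ncar-jobqueue and only then build the cluster:

import dask                             # dask loads its default config first
import dask.distributed as dd
from ncar_jobqueue import NCARCluster   # ncar-jobqueue then overlays the
                                        # machine-specific (Casper) settings

cluster = NCARCluster()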

Anderson Banihirwe (Oct 07 2021 at 17:36):

Orhan Eroglu said:

Yeah, I figured it might be asynchronous, so I kept running the remaining cells with the dashboard open. The dashboard never showed any real-time resources either.

qstat -u $USER returned nothing this morning, but now it gives me:

[image: qstat output]

I missed this message :frown:

Anderson Banihirwe (Oct 07 2021 at 17:38):

Since you don't have any jobs in the queue, my hunch is that the batch script submission is failing with some error... I am going to take a look at the logs in /glade/scratch/oero/dask/casper-dav/logs to see if I can figure out what's going on.

Anderson Banihirwe (Oct 07 2021 at 17:41):

I found this in one of your logs:

$ cat /glade/scratch/oero/dask/casper-dav/logs/1242944.casper-pbs.ER
Traceback (most recent call last):
  File "/glade/work/oero/miniconda3/envs/s_at_s/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/glade/work/oero/miniconda3/envs/s_at_s/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/glade/work/oero/miniconda3/envs/s_at_s/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 466, in <module>
    go()
  File "/glade/work/oero/miniconda3/envs/s_at_s/lib/python3.7/site-packages/distributed/cli/dask_worker.py", line 461, in go
    check_python_3()
  File "/glade/work/oero/miniconda3/envs/s_at_s/lib/python3.7/site-packages/distributed/cli/utils.py", line 32, in check_python_3
    _unicodefun._verify_python3_env()

Brian Bonnlander (Oct 07 2021 at 17:46):

Anderson Banihirwe said:

Because Casper and Cheyenne both use the same scheduler (PBS) and dask doesn't know how to tell them apart, ncar-jobqueue introduces a hack that

(1) loads the right configurations,

Just so we're clear: which configurations are the "right" ones? The ones originally intended for Slurm?

Anderson Banihirwe (Oct 07 2021 at 17:48):

Anderson Banihirwe said:

I found this in one of your logs: [traceback snipped; see above]

The version of click (v8.0.1) in your environment appears to be incompatible with the version of distributed you have installed, which is what triggers this traceback. Can you try downgrading to an earlier version of click (e.g. v7.1.2) to see if the error goes away?

Anderson Banihirwe (Oct 07 2021 at 17:52):

Another option is to upgrade distributed to a more recent version (e.g. v2021.09.1).
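
For example, assuming a conda-based environment, either of:

conda install -c conda-forge "click=7.1.2"            # pin click back, or
conda install -c conda-forge "distributed=2021.9.1"   # move distributed forward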

Orhan Eroglu (Oct 07 2021 at 17:54):

Thanks @Anderson Banihirwe! Let me try these. I will keep you posted.

Anderson Banihirwe (Oct 07 2021 at 18:10):

Just so we're clear: which configurations are the "right" ones? The ones originally intended for Slurm?

@Brian Bonnlander, https://github.com/NCAR/ncar-jobqueue/blob/main/ncar_jobqueue/ncar-jobqueue.yaml provides a starting point... and if you are using ncar-jobqueue, there is a copy in your home directory at ~/.config/dask/ncar-jobqueue.yaml. You will notice that the jobqueue.yaml and ncar-jobqueue.yaml files use different structures.
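
Roughly speaking, jobqueue.yaml keys the settings by scheduler only, while ncar-jobqueue.yaml adds a per-machine level on top (a paraphrased sketch, not a verbatim copy of either file):

# jobqueue.yaml (dask-jobqueue)
jobqueue:
  pbs:
    cores: 36
    ...

# ncar-jobqueue.yaml (ncar-jobqueue)
casper-dav:
  pbs:
    cores: 2
    ...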

Orhan Eroglu (Oct 07 2021 at 22:44):

Thanks a lot @Anderson Banihirwe for figuring this out! Installing a fresh conda environment fixed it (I went this route since I noticed mine was several months old). FYI, the new environment has distributed v2021.09.1 and click v8.0.1.
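
For anyone hitting the same thing, a fresh environment along these lines should pull in compatible versions (a sketch; the channels and package list may need adjusting):

conda create -n s_at_s -c conda-forge -c ncar python=3.9 geocat-comp dask distributed ncar-jobqueue
conda activate s_at_s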

