Stream: dask

Topic: missing workers


view this post on Zulip Else Schlerman (Mar 28 2022 at 18:57):

Hi everyone -- I'm trying to run some code that uses CESM2, but am having issues getting workers... I'm working with @Will Wieder, running through a notebook he wrote using the same conda environment he uses, but with different results.

I tried running a cell with the following:

cluster, client = get_ClusterClient(nmem='20GB')
cluster.scale(10)
cluster

On Will's machine, a window pops up after the code runs, and workers start appearing in the dask worker window. I'm getting neither of these outputs. No error messages right now, just no workers.

Any thoughts?
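A quick way to check for workers programmatically, in case the widget just isn't rendering -- a sketch, assuming client is the distributed Client returned by get_ClusterClient:

# Ask the scheduler directly for its view of the cluster;
# an empty "workers" dict means no workers have connected yet
info = client.scheduler_info()
print(info["address"])
print(len(info["workers"]))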

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:21):

you may want to make sure you have both ipywidgets and dask-labextension installed in the environment you are using within the notebook

mamba install -c conda-forge ipywidgets dask-labextension

or

conda install -c conda-forge ipywidgets dask-labextension

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:22):

restart the notebook/kernel after the installation

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:30):

Still running into the same issues... Is there a way to double check that I have both of them installed?

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:31):

from the notebook, do you get any output when you run

import ipywidgets
print(ipywidgets.__version__)
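And to verify the labextension side as well, one option is to list what JupyterLab has installed -- dask-labextension should show up in the output:

# list installed JupyterLab extensions (dask-labextension provides the dask panel)
!jupyter labextension list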

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:35):

I get
7.6.5

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:35):

okay... looks good.

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:36):

regarding the missing workers, you may want to check if you already have some pending jobs in the queue

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:37):

from the command line

qstat -u $USER

or within a notebook cell

!qstat -u $USER

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:39):

                                                           Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
2259179.casper* eschlerm jhublog* cr-login-*  97095   1   1    4gb 720:0 R 454:4

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:39):

^ This is the output from the command line

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:40):

it appears you don't have any pending dask-worker jobs

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:40):

what's the output of

print(cluster.job_script())

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:42):

#!/usr/bin/env bash

#PBS -N dask-worker
#PBS -q casper
#PBS -A P93300041
#PBS -l select=1:ncpus=1:mem=20GB
#PBS -l walltime=2:00:00

/glade/work/eschlerm/opt/miniconda/envs/lens-py/bin/python -m distributed.cli.dask_worker tcp://10.12.1.3:43534 --nthreads 1 --memory-limit 18.63GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://

Also I'm now getting the missing qsub error again

[Errno 2] No such file or directory: 'qsub': 'qsub'

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:46):

this issue appears to be related to https://zulip.ucar.edu/#narrow/stream/16-jupyterlab-hub/topic/qsub.20missing.20from.20.24PATH.20when.20using.20JupyterHub. @Jared Baker, do you happen to have a hint about why @Else Schlerman doesn't have qsub on their PATH???
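One quick way to check whether qsub is visible from the notebook's Python process -- a standard-library sketch:

import shutil
# shutil.which returns the full path to qsub if it's on PATH, or None otherwise
print(shutil.which("qsub"))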

view this post on Zulip Jared Baker (Mar 28 2022 at 19:58):

I see it on the path for the base jupyter server on crhtc45. Very much in the PATH variable at the end.

view this post on Zulip Jared Baker (Mar 28 2022 at 20:06):

when launching the submitted job, is that where the error with qsub is coming from? If submitting, I wouldn't guarantee variables are in the environment without -V on qsub or in the script.

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 20:11):

Jared Baker said:

I see it on the path for the base jupyter server on crhtc45. Very much in the PATH variable at the end.

@Else Schlerman, are you using the jupyterhub (https://jupyterhub.hpc.ucar.edu/) or launching the jupyter server yourself (via jupyter-forward)?

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:17):

I'm launching the jupyter server via jupyter-forward

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:19):

@Jared Baker I'm not quite sure what you're asking, but here is the git repository of the code with the error message, if that is helpful
https://github.com/eschlerm/permafrost/blob/master/.ipynb_checkpoints/LocalChange-ARC-checkpoint.ipynb

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:23):

I'm noticing that the qsub error occurs when I add print(cluster.job_script()) to the cell and run it, but I'm not currently getting the error otherwise. However, I am still not getting any workers

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:24):

The only output I get is

Tab(children=(HTML(value='\n            <div class="jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-Ou…

view this post on Zulip Jared Baker (Mar 28 2022 at 20:27):

can you just add #PBS -V to the dask script?

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:28):

in jupyter notebooks?

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 20:31):

@Else Schlerman, you will need to modify the code in the get_ClusterClient() function, which I assume contains the code responsible for instantiating the dask cluster

and pass job_extra=["-V"]

cluster = PBSCluster(..., job_extra=["-V"])

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:36):

Thank you @Anderson Banihirwe
I now have:

def get_ClusterClient(ncores=1, nmem='25GB'):
    import dask
    from dask_jobqueue import PBSCluster
    from dask.distributed import Client

    cluster = PBSCluster(
        cores=ncores, # The number of cores you want
        memory=nmem, # Amount of memory
        processes=ncores, # How many processes
        queue='casper', # The type of queue to utilize (/glade/u/apps/dav/opt/usr/bin/execcasper)
        resource_spec='select=1:ncpus='+str(ncores)+':mem='+nmem, # Specify resources
        project='P93300041', # Input your project ID here
        walltime='2:00:00', # Amount of wall time
        interface='ib0', # Interface to use
        job_extra=["-V"]
    )

    dask.config.set({
        'distributed.dashboard.link':
        'https://jupyterhub.hpc.ucar.edu/stable/user/{USER}/proxy/{port}/status'
    })

    client = Client(cluster)
    return cluster, client

This did seem to fix the qsub error when I use the print(cluster.job_script()) command, but I'm still not getting any workers

view this post on Zulip Katie Dagon (Mar 28 2022 at 22:01):

@Else Schlerman It's possible you're missing a cluster.scale(x) command (where x = number of workers) after the cluster = PBSCluster() call. I think that is the call that actually requests the workers.
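Putting it together, the full flow would look something like this -- a sketch; client.wait_for_workers blocks until the requested number have actually connected, which makes a missing-worker problem obvious right away:

cluster, client = get_ClusterClient(nmem='20GB')
cluster.scale(10)            # submit 10 dask-worker jobs to PBS
client.wait_for_workers(10)  # block here until all 10 workers connect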

view this post on Zulip Else Schlerman (Mar 29 2022 at 13:50):

Thanks @Katie Dagon I do have that command in the next cell, not copied above. However, I went to the xdev office hours last night -- it seems like the issue was coming from my jupyter forwarding configuration and things are now working as expected!

view this post on Zulip Will Wieder (Mar 29 2022 at 15:43):

to add to this, Else cloned my conda environment with a .yml file created from the environment that's working for me. She's running the identical notebook, but is unable to get any workers to show up. Is there something else we're potentially missing here?

view this post on Zulip Will Wieder (Mar 29 2022 at 15:44):

Ah, I posted before reading this last note; I wondered if it was a jupyter-forward issue. Thanks for digging in @Else Schlerman !

view this post on Zulip Michael Levy (Mar 29 2022 at 16:09):

Yeah, it turned out that jupyter-forward was having trouble with the TMPDIR environment variable (printenv TMPDIR showed the variable pointing to her scratch space, but when jupyter-forward checked whether $TMPDIR was writable, it was reverting to an empty string and therefore trying to create files in /). Once we explicitly defined TMPDIR in her .bashrc file, everything worked as expected... though it's still unclear to me why we needed to do that. (I should let @Anderson Banihirwe know about this :smile: )
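For anyone debugging something similar, comparing what the shell environment claims against what Python actually resolves can expose this -- a standard-library sketch:

import os, tempfile
# what the environment claims:
print(os.environ.get("TMPDIR"))
# what Python actually resolves: gettempdir() tries TMPDIR, TEMP, TMP, then
# platform defaults, skipping any candidate directory it cannot write to
print(tempfile.gettempdir())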

view this post on Zulip Katie Dagon (Mar 29 2022 at 17:39):

Can I ask a general question about when we would want to use jupyter-forward over the jupyterhub? Is it to avoid hub stability issues? I used to do a lot of port forwarding to launch jupyter lab, but since the hub stability has improved I find myself just logging on to the hub. Curious about when jupyter-forward might be preferable though.

view this post on Zulip Michael Levy (Mar 29 2022 at 18:43):

Katie Dagon said:

Can I ask a general question about when we would want to use jupyter-forward over the jupyterhub? Is it to avoid hub stability issues? I used to do a lot of port forwarding to launch jupyter lab, but since the hub stability has improved I find myself just logging on to the hub. Curious about when jupyter-forward might be preferable though.

At this point, I'm really only using jupyter-forward when the Hub is down. It's proving to be a useful tool for systems that don't have JupyterHub installed - I haven't really done any analysis on andre but suspect jupyter-forward would be the best tool for launching a notebook on that machine

view this post on Zulip Deepak Cherian (Mar 31 2022 at 16:09):

Michael Levy said:

It's proving to be a useful tool for systems that don't have JupyterHub installed - I haven't really done any analysis on andre but suspect jupyter-forward would be the best tool for launching a notebook on that machine

I agree.


Last updated: May 16 2025 at 17:14 UTC