Stream: dask

Topic: missing workers


view this post on Zulip Else Schlerman (Mar 28 2022 at 18:57):

Hi everyone -- I'm trying to run some code that uses CESM2, but am having issues getting workers... I'm working with @Will Wieder, running through a notebook he wrote using the same conda environment he uses, but with different results.

I tried running a cell with the following:

cluster, client = get_ClusterClient(nmem='20GB')
cluster.scale(10)
cluster

On Will's machine, a window pops up after the code runs, and workers start appearing in the dask worker window. I'm getting neither of these outputs. No error messages right now, just no workers.

Any thoughts?
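A quick way to check for workers programmatically, in case the widget just isn't rendering -- a sketch, assuming client is the distributed Client returned by get_ClusterClient:

# Ask the scheduler directly for its view of the cluster;
# an empty "workers" dict means no workers have connected yet
info = client.scheduler_info()
print(info["address"])
print(len(info["workers"]))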

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:21):

you may want to make sure you have both ipywidgets and dask-labextension installed in the environment you are using within the notebook

mamba install -c conda-forge ipywidgets dask-labextension

or

conda install -c conda-forge ipywidgets dask-labextension

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:22):

restart the notebook/kernel after the installation

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:30):

Still running into the same issues... Is there a way to double check that I have both of them installed?

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:31):

from the notebook, do you get any output when you run

import ipywidgets
print(ipywidgets.__version__)
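And to verify the labextension side as well, one option is to list what JupyterLab has installed -- dask-labextension should show up in the output:

# list installed JupyterLab extensions (dask-labextension provides the dask panel)
!jupyter labextension list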

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:35):

I get
7.6.5

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:35):

okay... looks good.

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:36):

regarding the missing workers, you may want to check if you already have some pending jobs in the queue

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:37):

from the command line

qstat -u $USER

or within a notebook cell

!qstat -u $USER

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:39):

                                                           Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
2259179.casper* eschlerm jhublog* cr-login-*  97095   1   1    4gb 720:0 R 454:4

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:39):

^ This is the output from the command line

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:40):

it appears you don't have any pending dask-worker jobs

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:40):

what's the output of

print(cluster.job_script())

view this post on Zulip Else Schlerman (Mar 28 2022 at 19:42):

#!/usr/bin/env bash

#PBS -N dask-worker
#PBS -q casper
#PBS -A P93300041
#PBS -l select=1:ncpus=1:mem=20GB
#PBS -l walltime=2:00:00

/glade/work/eschlerm/opt/miniconda/envs/lens-py/bin/python -m distributed.cli.dask_worker tcp://10.12.1.3:43534 --nthreads 1 --memory-limit 18.63GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://

Also I'm now getting the missing qsub error again

[Errno 2] No such file or directory: 'qsub': 'qsub'

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 19:46):

this issue appears to be related to https://zulip.ucar.edu/#narrow/stream/16-jupyterlab-hub/topic/qsub.20missing.20from.20.24PATH.20when.20using.20JupyterHub. @Jared Baker, do you happen to have a hint about why @Else Schlerman doesn't have qsub on their PATH???
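One quick way to check whether qsub is visible from the notebook's Python process -- a standard-library sketch:

import shutil
# shutil.which returns the full path to qsub if it's on PATH, or None otherwise
print(shutil.which("qsub"))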

view this post on Zulip Jared Baker (Mar 28 2022 at 19:58):

I see it on the path for the base jupyter server on crhtc45. Very much in the PATH variable at the end.

view this post on Zulip Jared Baker (Mar 28 2022 at 20:06):

when launching the submitted job, is that where the error with qsub is coming from? If submitting, I wouldn't guarantee variables are in the environment without -V on qsub or in the script.

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 20:11):

Jared Baker said:

I see it on the path for the base jupyter server on crhtc45. Very much in the PATH variable at the end.

@Else Schlerman, are you using the jupyterhub (https://jupyterhub.hpc.ucar.edu/) or launching the jupyter server yourself (via jupyter-forward)?

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:17):

I'm launching the jupyter server via jupyter-forward

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:19):

@Jared Baker I'm not quite sure what you're asking, but here is the git repository of the code with the error message, if that is helpful
https://github.com/eschlerm/permafrost/blob/master/.ipynb_checkpoints/LocalChange-ARC-checkpoint.ipynb

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:23):

I'm noticing that the qsub error occurs when I add print(cluster.job_script()) to the cell and run it, but I'm not currently getting the error otherwise. However, I am still not getting any workers

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:24):

The only output I get is

Tab(children=(HTML(value='\n            <div class="jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-Ou…

view this post on Zulip Jared Baker (Mar 28 2022 at 20:27):

can you just add #PBS -V to the dask script?

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:28):

in jupyter notebooks?

view this post on Zulip Anderson Banihirwe (Mar 28 2022 at 20:31):

@Else Schlerman, you will need to modify the code in the get_ClusterClient() function, which I assume contains the code responsible for instantiating the dask cluster

and pass job_extra=["-V"]

cluster = PBSCluster(..., job_extra=["-V"])

view this post on Zulip Else Schlerman (Mar 28 2022 at 20:36):

Thank you @Anderson Banihirwe
I now have:

def get_ClusterClient(ncores=1, nmem='25GB'):
    import dask
    from dask_jobqueue import PBSCluster
    from dask.distributed import Client

    cluster = PBSCluster(
        cores=ncores, # The number of cores you want
        memory=nmem, # Amount of memory
        processes=ncores, # How many processes
        queue='casper', # The type of queue to utilize (/glade/u/apps/dav/opt/usr/bin/execcasper)
        resource_spec='select=1:ncpus='+str(ncores)+':mem='+nmem, # Specify resources
        project='P93300041', # Input your project ID here
        walltime='2:00:00', # Amount of wall time
        interface='ib0', # Interface to use
        job_extra=["-V"]
    )

    dask.config.set({
        'distributed.dashboard.link':
        'https://jupyterhub.hpc.ucar.edu/stable/user/{USER}/proxy/{port}/status'
    })

    client = Client(cluster)
    return cluster, client

This did seem to fix the qsub error when I use the print(cluster.job_script()) command, but I'm still not getting any workers

view this post on Zulip Katie Dagon (Mar 28 2022 at 22:01):

@Else Schlerman It's possible you're missing a cluster.scale(x) command (where x = number of workers) after the cluster = PBSCluster() call. I think that is the call that actually requests the workers.
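Putting it together, the full flow would look something like this -- a sketch; client.wait_for_workers blocks until the requested number have actually connected, which makes a missing-worker problem obvious right away:

cluster, client = get_ClusterClient(nmem='20GB')
cluster.scale(10)            # submit 10 dask-worker jobs to PBS
client.wait_for_workers(10)  # block here until all 10 workers connect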

view this post on Zulip Else Schlerman (Mar 29 2022 at 13:50):

Thanks @Katie Dagon I do have that command in the next cell, not copied above. However, I went to the xdev office hours last night -- it seems like the issue was coming from my jupyter forwarding configuration and things are now working as expected!

view this post on Zulip Will Wieder (Mar 29 2022 at 15:43):

to add to this, Else cloned my conda environment with a .yml file created from the environment that's working for me. She's running the identical notebook, but is unable to get any workers to show up. Is there something else we're potentially missing here?

view this post on Zulip Will Wieder (Mar 29 2022 at 15:44):

Ah, I posted before reading this last note; I wondered if it was a jupyter-forward issue. Thanks for digging in @Else Schlerman !

view this post on Zulip Michael Levy (Mar 29 2022 at 16:09):

Yeah, it turned out that jupyter-forward was having trouble with the TMPDIR environment variable (printenv TMPDIR showed the variable pointing to her scratch space, but when jupyter-forward checked whether $TMPDIR was writable, it was reverting to an empty string and therefore trying to create files in /). Once we explicitly defined TMPDIR in her .bashrc file, everything worked as expected... though it's still unclear to me why we needed to do that. (I should let @Anderson Banihirwe know about this :smile: )
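For anyone debugging something similar, comparing what the shell environment claims against what Python actually resolves can expose this -- a standard-library sketch:

import os, tempfile
# what the environment claims:
print(os.environ.get("TMPDIR"))
# what Python actually resolves: gettempdir() tries TMPDIR, TEMP, TMP, then
# platform defaults, skipping any candidate directory it cannot write to
print(tempfile.gettempdir())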

view this post on Zulip Katie Dagon (Mar 29 2022 at 17:39):

Can I ask a general question about when we would want to use jupyter-forward over the jupyterhub? Is it to avoid hub stability issues? I used to do a lot of port forwarding to launch jupyter lab, but since the hub stability has improved I find myself just logging on to the hub. Curious about when jupyter-forward might be preferable though.

view this post on Zulip Michael Levy (Mar 29 2022 at 18:43):

Katie Dagon said:

Can I ask a general question about when we would want to use jupyter-forward over the jupyterhub? Is it to avoid hub stability issues? I used to do a lot of port forwarding to launch jupyter lab, but since the hub stability has improved I find myself just logging on to the hub. Curious about when jupyter-forward might be preferable though.

At this point, I'm really only using jupyter-forward when the Hub is down. It's proving to be a useful tool for systems that don't have JupyterHub installed - I haven't really done any analysis on andre but suspect jupyter-forward would be the best tool for launching a notebook on that machine

view this post on Zulip Deepak Cherian (Mar 31 2022 at 16:09):

Michael Levy said:

It's proving to be a useful tool for systems that don't have JupyterHub installed - I haven't really done any analysis on andre but suspect jupyter-forward would be the best tool for launching a notebook on that machine

I agree.


Last updated: May 16 2025 at 17:14 UTC