Stream: jupyter

Topic: Dask on cheyenne Example?


Brian Bonnlander (Sep 09 2020 at 21:57):

I'm unable to get the Dask dashboard working on Cheyenne.

Is anyone able to post an example of importing and running Dask with the dashboard using JupyterHub on Cheyenne? I would be very grateful, as Casper resources are very hard to get right now. I am using Dask through a conda environment that I built on Casper, so I'm a little worried that it might not be configured properly to work on Cheyenne.

Anderson Banihirwe (Sep 09 2020 at 22:04):

> I'm unable to get the Dask dashboard working on Cheyenne.

Are you getting a 404 error or some other error?

Brian Bonnlander (Sep 09 2020 at 22:05):

I don't know what the dashboard URL is supposed to be. I'm not seeing an HTTP error (yet).

Anderson Banihirwe (Sep 09 2020 at 22:10):

> I don't know what the dashboard URL is supposed to be.

The template of the dashboard link should be "https://jupyterhub.ucar.edu/ch/user/{USER}/proxy/{port}/status"
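
For example, you can set that template by hand (a sketch; distributed expands {USER} from the environment and fills in {port} at runtime):

import dask

# Point Dask's dashboard links at JupyterHub's per-user proxy on Cheyenne
dask.config.set({
    'distributed.dashboard.link': 'https://jupyterhub.ucar.edu/ch/user/{USER}/proxy/{port}/status'
})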

Anderson Banihirwe (Sep 09 2020 at 22:10):

Are you using ncar-jobqueue or dask-jobqueue to instantiate the cluster?

Brian Bonnlander (Sep 09 2020 at 22:12):

Thank you, I will try that. Do you also happen to know what parameters to pass when creating the cluster, so that CPU and memory resources are reasonable for the Cheyenne architecture?

Anderson Banihirwe (Sep 09 2020 at 22:16):

Try

cores = 20
processes = 10
memory = '109GB'
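
Passed to a dask-jobqueue-style cluster constructor, that looks like this (a sketch, shown here with NCARCluster; 20 cores split across 10 processes gives each worker 2 threads, with the 10 workers sharing 109 GB per job):

from ncar_jobqueue import NCARCluster

# One batch job = 10 worker processes x 2 threads each, 109 GB total
cluster = NCARCluster(cores=20, processes=10, memory='109GB')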

Anderson Banihirwe (Sep 09 2020 at 22:18):

If you find yourself getting KilledWorker errors, try reducing the number of processes.

Brian Bonnlander (Sep 09 2020 at 22:29):

Thank you. I am assuming I have to use ncar-jobqueue on Cheyenne. I will post the code I am trying shortly.

Anderson Banihirwe (Sep 09 2020 at 22:32):

> I am assuming I have to use ncar-jobqueue on Cheyenne

When using ncar-jobqueue in conjunction with the hub, the dashboard link is set for you, i.e., it should work out of the box without you needing to set it manually.
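
One quick way to verify this (a sketch) is to check the link the cluster reports:

from ncar_jobqueue import NCARCluster

cluster = NCARCluster()
# Should already be the proxied JupyterHub URL; no dask.config.set needed
print(cluster.dashboard_link)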

Brian Bonnlander (Sep 09 2020 at 22:35):

I was worried that, by using my own Dask configuration, the dashboard link would not be set by default, but I will try it as you suggest.

Brian Bonnlander (Sep 09 2020 at 22:39):

Does this code look reasonable?

import dask

machine = 'cheyenne'  # or 'casper'

if machine == 'cheyenne':
    # The following is supposedly set when using NCARCluster
    #dask.config.set({'distributed.dashboard.link': "https://jupyterhub.ucar.edu/ch/user/{USER}/proxy/{port}/status"})
    from ncar_jobqueue import NCARCluster
    cluster = NCARCluster(cores=20, processes=10, memory='109GB', project='STDD0003')
    cluster.scale(jobs=20)
else:
    # Assume machine is Casper.
    dask.config.set({'distributed.dashboard.link': '/proxy/{port}/status'})
    from dask_jobqueue import SLURMCluster
    cluster = SLURMCluster(cores=8, memory='200GB', project='STDD0003')
    cluster.scale(jobs=8)

from distributed import Client
client = Client(cluster)
cluster

Anderson Banihirwe (Sep 09 2020 at 22:45):

The primary objective of ncar-jobqueue is to abstract away the if statements...

My recommendation is to just use

from ncar_jobqueue import NCARCluster
cluster = NCARCluster(cores=20, processes=10, memory='109GB', project='STDD0003')

If you want separate configurations (such as memory, cores, etc.) for Casper and/or Cheyenne, you can put them in the ~/.config/dask/jobqueue.yaml file.
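
For instance, a hypothetical ~/.config/dask/jobqueue.yaml (illustrative values; this assumes Cheyenne picks up the pbs section and Casper the slurm section, matching the clusters in the code above):

jobqueue:
  pbs:            # used on Cheyenne
    cores: 20
    processes: 10
    memory: 109GB
    project: STDD0003
  slurm:          # used on Casper
    cores: 8
    memory: 200GB
    project: STDD0003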

Brian Bonnlander (Sep 09 2020 at 22:55):

OK, that is good information, I will give it a try on both systems and report back if there are problems. Thank you again!

Brian Bonnlander (Sep 09 2020 at 23:25):

OK, the dashboard works on Cheyenne now, but Dask freezes with errors. It might be related to the operations I am trying, but I still wonder whether I have the right package versions for Dask.

I am using these package versions:

bokeh                     1.4.0            py38h32f6830_1    conda-forge
dask                      2.14.0                     py_0    conda-forge
dask-core                 2.14.0                     py_0    conda-forge
dask-jobqueue             0.7.1                      py_0    conda-forge
distributed               2.14.0           py38h32f6830_0    conda-forge
ncar-jobqueue             2020.3.4                 pypi_0    pypi

The errors look like this:

tornado.application - ERROR - Exception in callback <bound method BokehTornado._keep_alive of <bokeh.server.tornado.BokehTornado object at 0x2b3a03b7a790>>
Traceback (most recent call last):
  File "/glade/u/home/bonnland/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/glade/u/home/bonnland/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/bokeh/server/tornado.py", line 579, in _keep_alive
    c.send_ping()
  File "/glade/u/home/bonnland/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/bokeh/server/connection.py", line 80, in send_ping
    self._socket.ping(codecs.encode(str(self._ping_count), "utf-8"))
  File "/glade/u/home/bonnland/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/tornado/websocket.py", line 447, in ping
    raise WebSocketClosedError()
tornado.websocket.WebSocketClosedError

I am also getting warnings from intake-esm's to_dataset_dict() like this:

/glade/u/home/bonnland/miniconda3/envs/lens-conversion/lib/python3.8/site-packages/dask/array/core.py:3911: PerformanceWarning: Increasing number of chunks by factor of 65
  result = blockwise(

I don't think I am specifying too few chunks...

Brian Bonnlander (Sep 10 2020 at 20:31):

UPDATE: I was specifying too few chunks after all, when trying to create a Zarr store. After increasing the number of chunks and being patient, I was able to get past the original errors. The error messages still appear, but do not cause the overall computation to halt.
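
Roughly, the fix looked like this (a sketch with made-up dimension names, sizes, and paths, not my actual dataset):

import numpy as np
import xarray as xr

# Stand-in for the real dataset coming out of intake-esm
ds = xr.Dataset({'t': (('time', 'lat', 'lon'),
                       np.zeros((1200, 192, 288), dtype='f4'))})

# More (smaller) chunks before writing the Zarr store
ds = ds.chunk({'time': 120, 'lat': 192, 'lon': 288})
ds.to_zarr('example.zarr', mode='w')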

Users should note that the Dask dashboard can produce "500: Internal Server Error" messages and go "dead" (all buttons turn grey) for minutes at a time before "coming back alive" (all buttons turn orange). It is possible that the computation is still active while these errors are occurring.

Brian Bonnlander (Sep 11 2020 at 16:41):

I believe I understand better now why the Dask dashboard is not working for me on Cheyenne, while it is working for me on Casper.

On Casper, I was ssh-tunneling to my own installed version of JupyterLab, which is more recent, whereas on Cheyenne I am logging in via JupyterHub and using an older version of Jupyter. Apparently the older Jupyter behind JupyterHub is not configured properly, or is out of date, because the Dask dashboard there is constantly "going dead".

If anyone knows how to run their own JupyterLab instance on Cheyenne, I would really appreciate some tips. Thanks!

Deepak Cherian (Sep 11 2020 at 17:28):

You can use the same tunnelling approach on Cheyenne...
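
A sketch of that approach (the port and username are illustrative; pick any free port):

# On a Cheyenne login node (or inside an interactive job):
jupyter lab --no-browser --port=8888

# On your laptop, forward the port:
ssh -N -L 8888:localhost:8888 username@cheyenne.ucar.edu

# then open http://localhost:8888 in your browser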

Brian Bonnlander (Sep 11 2020 at 18:26):

Ah, thanks, I didn't know that!

