Stream: dask

Topic: /glade/scratch issue with dask cluster


Isla Simpson (Mar 01 2024 at 02:10):

Hello, I'm trying to get a dask cluster going, but a command that worked earlier today is now giving a permission-denied error related to /glade/scratch. I assume this is because /glade/scratch is no more. I'm using the PBSCluster command from dask_jobqueue. Is it possible that something in there is hardcoded to use /glade/scratch and needs to be updated?

Kristen Krumhardt (Mar 01 2024 at 14:16):

I'm also getting that same error.

Gustavo M Marques (Mar 01 2024 at 17:41):

I was able to solve this by modifying the following file:
~/.config/dask/ncar-jobqueue.yaml

Update the following lines under casper-dav:
log-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/logs'
local-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/local-dir'
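When several entries reference the old path, a single substitution over the whole file avoids missing one. A sketch, demonstrated on a temporary copy; point CFG at ~/.config/dask/ncar-jobqueue.yaml to apply it to the real config:

```shell
# Demonstration on a temp file; set CFG to the real config path to apply.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
log-directory: '/glade/scratch/${USER}/dask/casper-dav/logs'
local-directory: '/glade/scratch/${USER}/dask/casper-dav/local-dir'
EOF

# Replace the retired prefix everywhere, keeping a .bak backup.
sed -i.bak 's|/glade/scratch|/glade/derecho/scratch|g' "$CFG"
grep derecho "$CFG"
```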

Kristen Krumhardt (Mar 01 2024 at 17:52):

That fixed it! Thank you so much, Gustavo!! :)

Isla Simpson (Mar 01 2024 at 18:13):

Hmm, this is not resolving the issue for me. I didn't have an ncar-jobqueue.yaml file in ~/.config/dask. I had a jobqueue.yaml file, and I changed all the occurrences of /glade/scratch in there to /glade/derecho/scratch. That didn't work. I then copied Gustavo's ncar-jobqueue.yaml file into ~/.config/dask, and that still didn't work. I restarted JupyterHub each time. Any other thoughts?
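Part of the confusion here is that dask can pick up settings from more than one location, so it is easy to edit the wrong file. A small stdlib-only sketch (the candidate paths are the ones mentioned in this thread) that flags any config still pointing at the retired path:

```python
from pathlib import Path

# Config locations mentioned in this thread; dask may read any of them.
CANDIDATES = [
    Path.home() / ".dask" / "jobqueue.yaml",
    Path.home() / ".config" / "dask" / "jobqueue.yaml",
    Path.home() / ".config" / "dask" / "ncar-jobqueue.yaml",
]

def stale_configs(paths):
    """Return the files that still reference the retired /glade/scratch."""
    return [p for p in paths
            if p.is_file() and "/glade/scratch" in p.read_text()]

for path in stale_configs(CANDIDATES):
    print(f"still references /glade/scratch: {path}")
```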

Negin Sobhani (Mar 01 2024 at 19:52):

@Isla Simpson Yes, /glade/scratch is no longer available. Your dask default settings are either in ~/.config/dask/ncar-jobqueue.yaml (as @Gustavo M Marques suggested) or in ~/.dask/jobqueue.yaml.

For you specifically, I checked, and you need to update the local-directory, log-directory, and job-extra entries in the following file:
cat ~/.dask/jobqueue.yaml

distributed:
  comm:
    compression: null
  scheduler:
    bandwidth: 1000000000
  worker:
    memory:
      pause: 0.8
      spill: false
      target: 0.9
      terminate: 0.95
jobqueue:
  pbs:
    cores: 36
    interface: ib0
    job-extra: []
    local-directory: /glade/scratch/islas
    log-directory: /glade/scratch/islas
    memory: 109GB
    name: dask-worker
    processes: 1
    queue: regular
    resource-spec: select=1:ncpus=36:mem=109GB
    walltime: 01:00:00
  slurm:
    cores: 1
    interface: ib0
    job-extra:
    - -C casper
    - -o /glade/scratch/islas/dask-worker.o%J
    - -e /glade/scratch/islas/dask-worker.e%J
    local-directory: /glade/scratch/islas
    log-directory: /glade/scratch/islas
    memory: 25GB
    name: dask-worker
    processes: 1
    walltime: 06:00:00

Negin Sobhani (Mar 01 2024 at 19:57):

Also, I noticed the default values in this file are not optimal. I would suggest the following values instead:

  pbs:
    cores: 1
    interface: ext
    job-extra: []
    local-directory: /glade/derecho/scratch/islas
    log-directory: /glade/derecho/scratch/islas
    memory: 4GiB
    name: dask-worker
    processes: 1
    queue: casper
    resource-spec: select=1:ncpus=1:mem=4GB
    walltime: 01:00:00

I would also remove the slurm section, as we are not using it. Please let me know if you have any questions or concerns. :-)
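The values above map directly onto PBSCluster keyword arguments, which is another way to sidestep stale YAML defaults entirely. A hedged sketch; the PBSCluster call itself is left commented out since it submits real PBS jobs, and the keyword names follow dask_jobqueue's documented API:

```python
import os

# Build the updated scratch prefix for the current user.
scratch = f"/glade/derecho/scratch/{os.environ.get('USER', 'unknown')}"

# Keyword arguments mirroring the suggested config values above.
cluster_kwargs = dict(
    queue="casper",
    cores=1,
    processes=1,
    memory="4GiB",
    walltime="01:00:00",
    resource_spec="select=1:ncpus=1:mem=4GB",
    interface="ext",
    local_directory=f"{scratch}/dask/local-dir",
    log_directory=f"{scratch}/dask/logs",
)

# On Casper (with dask_jobqueue installed) this would start workers:
# from dask_jobqueue import PBSCluster
# cluster = PBSCluster(**cluster_kwargs)
# cluster.scale(jobs=4)
```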

Isla Simpson (Mar 01 2024 at 21:26):

Oh, I see. I had two jobqueue.yaml files: one in ~/.dask and one in ~/.config/dask, and I only changed the one in ~/.config/dask. I've changed the correct one now and I'm not getting the error any more. Thanks a lot!


Last updated: May 16 2025 at 17:14 UTC