Stream: xarray

Topic: Problem Reading netCDF WRF file with Xarray / Dask Cluster

Julien Chastang (Apr 04 2022 at 16:09):

Problem Reading RDA netCDF WRF file with Xarray / Dask Cluster / Kubernetes / JupyterHub

Hello. Here is my first thread on the Zulip @ UCAR so here goes:

I am trying to perform some analysis and visualization of WRF output from the RDA on a Dask cluster (my own, not NCAR's).

I fetched this data file:

https://rda.ucar.edu/data/ds612.0/PGW3D/2004/wrf3d_d01_PGW_U_20040602.nc

which you can find here:

https://rda.ucar.edu/datasets/ds612.0/index.html#sfol-wl-/data/ds612.0?g=33200406

I attempt to open the file with the following two lines of code,

ds = xr.open_dataset('./data/wrf3d_d01_PGW_U_20040601.nc',
                     chunks={'bottom_top': 10})
ds.load()

When I try to load this file, I obtain a:

FileNotFoundError: [Errno 2] No such file or directory: b'/home/jovyan/data/wrf3d_d01_PGW_U_20040601.nc'

A few observations:

The file exists as is evident from the following line of code so that error message is a red herring:

!ls -arlt ./data/wrf3d_d01_PGW_U_20040601.nc
-rw-rw-r-- 1 jovyan users 1578843964 Jul  4  2017 /home/jovyan/data/wrf3d_d01_PGW_U_20040601.nc

The worker node log reveals a more accurate source of the problem:

distributed.worker - WARNING - Compute Failed Function: getter args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f3cc159d0c0>, key=BasicIndexer((slice(None, None, None), slice(None, None, None)))))), (slice(0, 1015, None), slice(0, 1359, None))) kwargs: {} Exception: FileNotFoundError(2, 'No such file or directory')

When I change the engine to h5netcdf, I get a different error message: TypeError: decode_vlen_strings is an invalid keyword argument for this function. The worker node yields the same error.
When I remove the chunking option, I have no problem reading in the data, e.g.,

ds = xr.open_dataset('./data/wrf3d_d01_PGW_U_20040601.nc')
ds.load()

though the lack of chunking will create additional problems later on. I would rather have chunking work.

Oddly, when I load the data via a THREDDS server, I have no problems reading in the file:

ds = xr.open_dataset('https://rda.ucar.edu/thredds/dodsC/files/g/ds612.0/PGW3D/2004/wrf3d_d01_PGW_U_20040601.nc',
                      chunks={'bottom_top': 10})
ds.load()

though I am sort of assuming this is not ideal due to the voluminous nature of the data and subsequent calculations (wind speed) on other files (total 60).

Has anyone seen such an error or know what to do about it?

Here are my conda environments from the client , scheduler and worker. I include them separately since they are long.

Kevin Paul (Apr 04 2022 at 16:21):

@Julien Chastang: Can you provide more detail about your Dask cluster? Is it distributed? (Looks like you are creating it using Dask Gateway. How is your Dask Gateway configured?)

I ask because it is almost like you don't have a shared filesystem. That is, it's like you downloaded the file to the container where the JupyterLab session is running, but it isn't accessible to the worker containers. Could that be the case?

Kevin Paul (Apr 04 2022 at 16:39):

Incidentally, a different filesystem for each worker would also explain why the open_dataset option without the chunks parameter works (i.e., you read and load data from the JupyterLab container only), and why the THREDDS use works (i.e., the dask client passes the URL/Filepath to the workers and the workers read from the URL/Filepath).

Julien Chastang (Apr 04 2022 at 16:48):

Thank you @Kevin Paul for the super-quick response :-) I will try to answer your questions. I have also added you to the allow list on my JupyterHub which you can login with your GitHub credentials.. Run the wrf.ipynb notebook which should be ready to go including data.

Can you provide more detail about your Dask cluster? Is it distributed?

Yes, I believe so. I can have the following bit of code run and I see the dashboard update as expected:

import dask.array as da
a = da.random.normal(size=(40000, 40000), chunks=(500, 500))
a.mean().compute()

I ask because it is almost like you don't have a shared filesystem. That is, it's like you downloaded the file to the container where the JupyterLab session is running, but it isn't accessible to the worker containers. Could that be the case?

That may be the case and you may be onto something. I will dig into this idea further. Does your running of wrf.ipynb confirm this idea?

Again, thank you very much! :-) I've really been stuck here.

Kevin Paul (Apr 04 2022 at 16:50):

I’ll take a look. Thanks for the access!

Kevin Paul (Apr 04 2022 at 20:55):

Hey, @Julien Chastang: I'm stuck starting up a server. Maybe there aren't enough resources?

Kevin Paul (Apr 04 2022 at 20:55):

Screen-Shot-2022-04-04-at-2.54.50-PM.png

Julien Chastang (Apr 04 2022 at 21:11):

Thanks @Kevin Paul yeah, I think I am hogging up all the resources, but basically, you hit the nail on the head. I was able to confirm with my colleague (Andrea Zonca, SDSC) that indeed these data volumes will not support multi-attach. In retrospect, I think my line of reasoning was a bit off with the local data solution. I am finding that a THREDDS server may be sufficient for my purposes moreover, it seems like zarr may also offer another avenue of exploration (though I admittedly do not know much about this area). By the way, I did not mention that I am running all of this on the NSF Jetstream Cloud which is about to be decommissioned and replaced with Jetstream2.

As an aside, I had a lot of trouble setting up these dask clusters. Basically, it was a huge challenge to sync up client / worker / scheduler environments, but I believe I am past that now and I am seeing some pretty nice performance when accessing the data via a THREDDS server (definitely way better than single threaded).

Thanks again for your insight. :-) I'd still be stuck otherwise.

Kevin Paul (Apr 04 2022 at 21:16):

@Julien Chastang: I'm glad that helped! Yeah. The general cloud model does not assume that you have a shared filesystem. This is generally true because you typically have access to data that is available via object store, which is accessed via a REST API. Hence, the workers only need internet access to the object store! And local storage on the workers is used only for Dask's "spill" capability.

It takes a bit to wrap your head around not having a shared filesystem, if you are used to HPC systems.

Last updated: May 16 2025 at 17:14 UTC