Stream: xarray

Topic: Multiprocessing with Xarray Dataset objects


Daniel Adriaansen (Sep 29 2022 at 20:01):

I am working on a Linux workstation (not HPC), so I tend to use Python's multiprocessing module frequently. Recently I encountered an error I do not understand when passing a Dataset object to a function called via multiprocessing.Pool. What's more interesting is that if I subset the Dataset before passing it to the function (via isel or sel, or by selecting a single variable like ds[['variable']]), I do not get the error.

Here is some pseudocode:

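import glob
import multiprocessing

import xarray as xr

def calc_nearest_stat(ds, stat_name):
    # stat_name is not used in this simplified version; the max is hard-coded.
    masked = ds.where(ds.mask > 0)
    return masked.max(dim=['x0', 'y0', 'z0']).to_dataframe()

def preprocess(ds):
    # Drop a variable from each file as it is opened.
    return ds.drop_vars('z1')

ncfiles = glob.glob('/path/to/files/*.nc')
# One open_mfdataset call per file, used here only for its preprocess hook.
datasets = [xr.open_mfdataset(f, preprocess=preprocess) for f in ncfiles]
mp = multiprocessing.Pool(20)
# Each Dataset is pickled to be sent to a worker process; the error below is raised here.
results = mp.starmap(calc_nearest_stat, [(ds, 'max') for ds in datasets])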

The error:

  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
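
For context, the last line of the traceback is pickle's general complaint about a function defined inside another function: such a function has no importable name, so it cannot be serialized by reference. The following is a minimal standard-library sketch of that behaviour, not xarray's actual code; `_close_all`, `make_local_closer`, `make_partial_closer`, and `can_pickle` are hypothetical names chosen for illustration.

```python
import functools
import pickle


def _close_all(closers):
    # Module-level function: pickle can store it by its importable name.
    for closer in closers:
        closer()


def make_local_closer(closers):
    # Hypothetical stand-in for a closer defined inside another function.
    def local_closer():
        for closer in closers:
            closer()
    return local_closer


def make_partial_closer(closers):
    # Same behaviour, but built from the module-level function above.
    return functools.partial(_close_all, closers)


def can_pickle(obj):
    # Report whether pickle can serialize obj. The exception type raised
    # for local objects differs between Python versions, so catch both.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


if __name__ == "__main__":
    print("local closer picklable:", can_pickle(make_local_closer([])))
    print("partial closer picklable:", can_pickle(make_partial_closer([])))
```

Run as a script, the local closer should be reported as not picklable, while the functools.partial version should serialize because it only refers to a module-level function.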

If I alter the Dataset object in any manner it works fine. For example, the line using mp.starmap works if I do this:

results = mp.starmap(calc_nearest_stat,[(ds.isel(x0=10,y0=20),'max') for ds in datasets])

or

results = mp.starmap(calc_nearest_stat,[(ds[['var_name']],'max') for ds in datasets])

My main question is whether there is a way to pass the Dataset as is, without the multi_file_closer object attached to it and without modifying the Dataset via subsetting. I could get around this by using open_dataset instead of open_mfdataset, I suppose, but I like the functionality of preprocess, so that's why I chose open_mfdataset.

Thank you!

Deepak Cherian (Sep 29 2022 at 20:08):

Can you post this on the Xarray issue tracker please? This seems like a bug.

Daniel Adriaansen (Sep 29 2022 at 20:45):

Just to confirm, you mean here: https://github.com/pydata/xarray/issues ? Sure thing, thank you @Deepak Cherian!

Deepak Cherian (Sep 29 2022 at 20:49):

Yes, thanks.

Daniel Adriaansen (Sep 30 2022 at 02:44):

For reference: https://github.com/pydata/xarray/issues/7109
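
On the question of passing the Dataset as is, one possible workaround is to clear the registered close callback with Dataset.set_close(None) before handing the Dataset to the pool. The sketch below is untested against the files in this thread and rests on two assumptions: that the close callback is the only unpicklable part of the Dataset, and that a locally defined function registered through set_close behaves like multi_file_closer for pickling purposes. It uses a small in-memory Dataset, and `attach_local_closer`, `drop_closer`, and `can_pickle` are hypothetical helper names.

```python
import pickle

import numpy as np
import xarray as xr


def can_pickle(obj):
    # Report whether pickle can serialize obj. The exception type raised
    # for local objects differs between Python versions, so catch both.
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


def attach_local_closer(ds):
    # Hypothetical stand-in for the closer seen in the traceback:
    # register a function defined inside another function.
    def local_closer():
        pass
    ds.set_close(local_closer)
    return ds


def drop_closer(ds):
    # Clear the registered close callback without touching the data.
    # After this, ds.close() no longer releases anything, so open file
    # handles have to be managed some other way.
    ds.set_close(None)
    return ds


if __name__ == "__main__":
    ds = attach_local_closer(xr.Dataset({"var_name": ("x0", np.arange(3.0))}))
    print("with local closer:", can_pickle(ds))
    print("after set_close(None):", can_pickle(drop_closer(ds)))
```

In the original snippet this would amount to calling ds.set_close(None) on each element of datasets before mp.starmap. Whether the lazily loaded variables then pickle cleanly may depend on the backend and the xarray version, so treat it as something to try rather than a confirmed fix.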


Last updated: May 16 2025 at 17:14 UTC