Multiprocessing with Xarray Dataset objects · xarray

I am working on a Linux workstation (not HPC), so I tend to use Python's multiprocessing module frequently. Recently I encountered an error I do not understand when passing a Dataset object to a function called via multiprocessing.Pool. What's more interesting is that if I subset the Dataset before passing it to the function (via isel or sel, or by selecting a single variable like ds[['variable']]), I do not get the error.

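import glob
import multiprocessing

import xarray as xr

def calc_nearest_stat(ds, stat_name):
  # Set points where mask <= 0 to NaN, then reduce over the spatial dimensions.
  # Note: stat_name is not used yet; the reduction is always max.
  return ds.where(ds.mask > 0).max(dim=['x0', 'y0', 'z0']).to_dataframe()

def preprocess(ds):
  # Drop the z1 variable from each file as it is opened.
  return ds.drop_vars('z1')

ncfiles = glob.glob('/path/to/files/*.nc')
# One Dataset per file; open_mfdataset is used here for its preprocess hook.
datasets = [xr.open_mfdataset(f, preprocess=preprocess) for f in ncfiles]
mp = multiprocessing.Pool(20)
# Each (ds, 'max') tuple is pickled and sent to a worker process.
results = mp.starmap(calc_nearest_stat, [(ds, 'max') for ds in datasets])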

  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'

If I alter the Dataset object in any way before passing it, the call works fine. For example, the mp.starmap line succeeds with either of the following:

results = mp.starmap(calc_nearest_stat,[(ds.isel(x0=10,y0=20),'max') for ds in datasets])

results = mp.starmap(calc_nearest_stat,[(ds[['var_name']],'max') for ds in datasets])
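The behaviour above is consistent with how pickle handles functions defined inside another function: they cannot be looked up by name at import time, so pickling any object that holds a reference to one fails. The following is a minimal sketch using only the standard library, not xarray itself. It assumes that open_mfdataset attaches its closer to the returned Dataset and that objects derived by subsetting do not carry it over, which the traceback and the working examples suggest but do not prove. SimpleNamespace stands in for a Dataset, and open_many, subset and can_pickle are hypothetical names used only for illustration.

```python
import pickle
from types import SimpleNamespace


def open_many(values):
    """Hypothetical opener that attaches a locally defined closer function."""

    def multi_file_closer():
        pass  # a real closer would release the underlying files

    # SimpleNamespace is a stand-in for a Dataset, not xarray's real class.
    return SimpleNamespace(values=list(values), _close=multi_file_closer)


def subset(ds, n):
    """Build a new object from ds without carrying over its closer."""
    return SimpleNamespace(values=ds.values[:n], _close=None)


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError):
        # Local functions cannot be pickled; the exception type
        # depends on the Python version.
        return False
    return True


ds = open_many([1, 2, 3, 4])
print(can_pickle(ds))             # False: the local closer blocks pickling
print(can_pickle(subset(ds, 2)))  # True: the derived object has no closer
```

Since multiprocessing.Pool pickles every argument before sending it to a worker, the first case corresponds to the failing starmap call and the second to the calls that pass a subset.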

My main question is whether there is a way to pass the Dataset as is, without the multi_file_closer object attached to it and without modifying the Dataset via subsetting. I could get around this by using open_dataset instead of open_mfdataset, I suppose, but I like the preprocess functionality, which is why I chose open_mfdataset.

Daniel Adriaansen (Sep 29 2022 at 20:01):

Deepak Cherian (Sep 29 2022 at 20:08):

Daniel Adriaansen (Sep 29 2022 at 20:45):

Deepak Cherian (Sep 29 2022 at 20:49):

Daniel Adriaansen (Sep 30 2022 at 02:44):