I am working on a Linux workstation (not HPC), so I tend to use Python's multiprocessing module frequently. Recently I encountered an error I do not understand when passing a Dataset object to a function called via multiprocessing.Pool. What's more interesting is that if I subset the Dataset (via isel or sel, or by selecting a single variable like ds[['variable']]) prior to passing it to the function, I do not get the error.
Here is some pseudocode:
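import glob
import multiprocessing
import xarray as xr

def calc_nearest_stat(ds, stat_name):
    # stat_name is unused in this pseudocode; the max is hard-coded
    return ds.where(ds.mask > 0).max(dim=['x0', 'y0', 'z0']).to_dataframe()

def preprocess(ds):
    ds = ds.drop_vars('z1')
    return ds

ncfiles = glob.glob('/path/to/files/*.nc')
# One Dataset per file, opened with open_mfdataset so that preprocess can be used
datasets = [xr.open_mfdataset(f, preprocess=preprocess) for f in ncfiles]
mp = multiprocessing.Pool(20)
results = mp.starmap(calc_nearest_stat, [(ds, 'max') for ds in datasets])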
The error:
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/dadriaan/.conda/envs/icing/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'open_mfdataset.<locals>.multi_file_closer'
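For context, the last frames of the traceback are in the standard pickle machinery that multiprocessing uses to send arguments to worker processes. Pickle stores a function by its importable name, so a function defined inside another function (a "local object") cannot be serialized. The sketch below reproduces that behaviour with the standard library only; `make_closer` and `try_pickle` are hypothetical names for illustration, not part of xarray.

```python
import pickle


def make_closer(resources):
    # Hypothetical stand-in for a function defined inside another function,
    # similar in shape to the 'multi_file_closer' named in the traceback.
    def closer():
        for resource in resources:
            resource.close()

    return closer


def try_pickle(obj):
    """Return True if obj can be pickled, False if pickle rejects it."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # The exception type for local objects depends on the Python version.
        return False


# A builtin is pickled by name and succeeds; the nested function does not.
print("builtin len picklable:", try_pickle(len))
print("nested closer picklable:", try_pickle(make_closer([])))
```

Any object that holds a reference to such a nested function fails to pickle in the same way, which is consistent with the error appearing only for the Dataset returned directly by open_mfdataset.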
If I alter the Dataset object in any manner, it works fine. For example, the mp.starmap line works if I do this:
results = mp.starmap(calc_nearest_stat,[(ds.isel(x0=10,y0=20),'max') for ds in datasets])
or
results = mp.starmap(calc_nearest_stat,[(ds[['var_name']],'max') for ds in datasets])
My main question is whether there is a way to pass the Dataset as is, but without the multi_file_closer object attached to it and without modifying the Dataset via subsetting. I suppose I could get around this by using open_dataset instead of open_mfdataset, but I like the functionality of preprocess, so that's why I chose open_mfdataset.
Thank you!
Can you post this on the Xarray issue tracker please? This seems like a bug.
Just to confirm, you mean here: https://github.com/pydata/xarray/issues? Sure thing, thank you @Deepak Cherian!
Yes. thanks
For reference: https://github.com/pydata/xarray/issues/7109
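Independent of how the issue is resolved, one way to sidestep the pickling step is to send file paths to the pool and open each file inside the worker, so that no Dataset has to be pickled on the way in. This is only a sketch under assumptions, not the fix tracked in the issue: `calc_stat_from_path` and `build_tasks` are hypothetical names, `preprocess` is applied by hand after `open_dataset`, and the path, variable, and dimension names are the placeholders from the question.

```python
import glob
import multiprocessing


def preprocess(ds):
    # Same preprocessing as in the question: drop the unwanted variable.
    return ds.drop_vars("z1")


def calc_stat_from_path(path, stat_name):
    # xarray is imported in the worker here; a top-level import works as well.
    import xarray as xr

    # Open the file inside the worker so only the path string is pickled.
    with xr.open_dataset(path) as ds:
        ds = preprocess(ds)
        # stat_name is unused, as in the question; the max is hard-coded.
        # to_dataframe() reads the data before the file is closed.
        return ds.where(ds.mask > 0).max(dim=["x0", "y0", "z0"]).to_dataframe()


def build_tasks(paths, stat_name="max"):
    # One (path, stat_name) argument tuple per file, in a stable order.
    return [(path, stat_name) for path in sorted(paths)]


def main():
    ncfiles = glob.glob("/path/to/files/*.nc")
    with multiprocessing.Pool(20) as pool:
        return pool.starmap(calc_stat_from_path, build_tasks(ncfiles))


if __name__ == "__main__":
    results = main()
```

Each worker returns a pandas DataFrame, which pickles without trouble on the way back to the parent process.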
Last updated: May 16 2025 at 17:14 UTC