I have two Zarr stores whose data I want to combine along the time dimension, and then save as a new Zarr store. The second Zarr store has time values that start just after when the first Zarr store's time values end.
Both datasets have the same data variables and chunk sizes, but I expect the combined chunks will need re-alignment, because the final chunks of the first Zarr store are not completely filled.
Is calling unify_chunks() after concatenation going to de-fragment and rechunk the data using the existing chunk information, as I hope it will?
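For context, here is a rough sketch of the workflow I have in mind (the store paths and the rechunk sizes are placeholders, not my real setup):
import xarray as xr

# open both stores lazily (dask-backed)
ds1 = xr.open_zarr('store_part1.zarr')   # earlier time range
ds2 = xr.open_zarr('store_part2.zarr')   # later time range, starts just after ds1 ends

# concatenate along time; the partially filled final chunks of ds1 stay as-is here
combined = xr.concat([ds1, ds2], dim='time')

# rechunk to uniform sizes before writing (placeholder size)
combined = combined.chunk({'time': 1000})

combined.to_zarr('store_combined.zarr', mode='w')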
Thanks for any insights.
I guess I should have tried this before asking, but unify_chunks() does not do what I need. I need to call chunk() with the coordinate dimensions and chunk sizes I am interested in.
All I need to figure out is how to produce the dictionary describing my chunk sizes.
I can get the data variable dimension names with this expression:
ds['my_var'].dims
('member_id', 'time', 'lat', 'lon')
And I can get the data variable chunk sizes with this expression:
ds['my_var'].data.chunksize
(4, 1000, 65, 120)
But I can't seem to get the dictionary that puts these two things together, i.e.
{'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}
For now, I will assume that the ordering of values matches and I can construct the dictionary myself, but I don't know if this is a safe assumption.
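Concretely, the plan would look something like this (a rough sketch; 'combined' is the concatenated dataset and the sizes are hard-coded by hand for now):
# hand-built mapping of dimension name -> chunk size, in the order reported by dims
target_chunks = {'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}

combined = combined.chunk(target_chunks)
combined.to_zarr('store_combined.zarr', mode='w')   # placeholder path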
Are you looking for this:
In [1]: dims = ('member_id', 'time', 'lat', 'lon')
In [2]: chunksizes = (4, 1000, 65, 120)
In [3]: dict(zip(dims, chunksizes))
Out[3]: {'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}
That works as long as the values for dims and chunksizes are always produced in the same order. Is this a fair assumption?
Yes, the ordering matters
Do you mean that I can assume the above code will always work? I don't want to associate a key with the wrong value...
Instead of ds['my_var'].data.chunksize, try the following:
dict(zip(ds['my_var'].dims, ds['my_var'].chunks))
This is what I get:
dict(zip(ds['uas'].dims, ds['uas'].chunks))
{'member_id': (4, 4, 4), 'time': (1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 698), 'lat': (65, 65, 65, 63), 'lon': (120, 120, 120, 120, 120)}
I don't think that is what I want... I just want to know if I can trust that dims and chunksize will order the values correctly for me.
I've always used both and haven't run into any issues... It's my understanding that dims and chunksize are returned in the same order.
Thanks, I appreciate knowing your experience so far. I will print some diagnostics to help make sure that things look correct.
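Something along these lines, just to be safe (a rough sketch; 'uas' is one of the variables in my dataset above):
# sanity check: one chunk size per dimension, printed side by side
var = ds['uas']
assert len(var.dims) == len(var.data.chunksize)
for dim, size in zip(var.dims, var.data.chunksize):
    print(f"{dim}: {size}")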
Hi @Brian Bonnlander and @Anderson Banihirwe, I'm adding to this thread since I'm currently having issues saving something to zarr because of chunk size. Happy to open a new issue if needed. I am trying to interpolate some 4D data onto isotherms; I made sure to keep the vertical dimension intact, and chunked the time, X, and Y dimensions in the same manner between the two datasets (one is the data I want to interpolate, the other contains the isotherms I want to interpolate to).
# then interpolate to those isotherms
ds_wdt = xr.open_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/wdt_z.zarr') # data to interpolate
ds_iso_set = xr.open_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/z_isotherms_05.zarr') # isotherms to interpolate to
wdt_isos = dc.interpolate.pchip(ds_wdt.wdt.chunk({'Z':-1, 'time':5, 'YC':-1, 'XC':-1}), 'Z',
ds_iso_set.z_iso.chunk({'target':-1, 'time':5, 'YC':-1, 'XC':-1}),
core_dim='target')
When I try to save my data
(wdt_isos.to_dataset(name='wdt_iso').unify_chunks()
.to_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/wdt_iso.zarr'))
I get this error:
NotImplementedError: Specified zarr chunks encoding['chunks']=(10,) for variable named 'iter' would overlap multiple dask chunks ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5),). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
I have tried
del ds_wdt.wdt.encoding['chunks'] # data to interpolate
del ds_iso_set.z_iso.encoding['chunks'] # isotherms to interpolate to
I have also tried
(wdt_isos.to_dataset(name='wdt_iso').unify_chunks()
.to_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/wdt_iso.zarr'))
with no result. Not really sure how to proceed, any ideas?
pinging @Deepak Cherian too :upside_down:
Specified zarr chunks encoding['chunks']=(10,) for variable named 'iter'
I think you need to delete encoding for iter
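Something like this, roughly (untested sketch, reusing the names from your snippet above):
out = wdt_isos.to_dataset(name='wdt_iso').unify_chunks()

# drop the stale zarr chunk encoding on the variable named in the error message
if 'iter' in out.variables:
    out['iter'].encoding.pop('chunks', None)

out.to_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' + str(year) + '/wdt_iso.zarr')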
Great, thanks, will try. What is 'iter'? I assumed it was a stand-in for all the variables, because I don't know of a variable called 'iter' in the dataset.