I have two Zarr stores whose data I want to combine along the time dimension, and then save as a new Zarr store. The second Zarr store has time values that start just after when the first Zarr store's time values end.
Both datasets have the same data variables and chunk sizes, but I expect the combined chunks will need re-alignment, because the final chunks of the first Zarr store are not completely filled.
Is calling unify_chunks() after concatenation going to de-fragment and rechunk the data using the existing chunk information, as I hope it will?
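For context, here is a rough sketch of the workflow I have in mind (the store paths and the rechunk sizes are placeholders, not my real setup):
import xarray as xr

# open both stores lazily (dask-backed)
ds1 = xr.open_zarr('store_part1.zarr')   # earlier time range
ds2 = xr.open_zarr('store_part2.zarr')   # later time range, starts just after ds1 ends

# concatenate along time; the partially filled final chunks of ds1 stay as-is here
combined = xr.concat([ds1, ds2], dim='time')

# rechunk to uniform sizes before writing (placeholder size)
combined = combined.chunk({'time': 1000})

combined.to_zarr('store_combined.zarr', mode='w')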
Thanks for any insights.
I guess I should have tried this before asking, but unify_chunks() does not do what I need. I need to call chunk() with the coordinate dimensions and chunk sizes I am interested in.
All I need to figure out is how to produce the dictionary describing my chunk sizes.
I can get the data variable dimension names with this expression:
ds['my_var'].dims
('member_id', 'time', 'lat', 'lon')
And I can get the data variable chunk sizes with this expression:
ds['my_var'].data.chunksize
(4, 1000, 65, 120)
But I can't seem to get the dictionary that puts these two things together, i.e.
{'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}
For now, I will assume that the ordering of values matches and I can construct the dictionary myself, but I don't know if this is a safe assumption.
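Concretely, the plan would look something like this (a rough sketch; 'combined' is the concatenated dataset and the sizes are hard-coded by hand for now):
# hand-built mapping of dimension name -> chunk size, in the order reported by dims
target_chunks = {'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}

combined = combined.chunk(target_chunks)
combined.to_zarr('store_combined.zarr', mode='w')   # placeholder path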
Are you looking for this:
In [1]: dims = ('member_id', 'time', 'lat', 'lon')
In [2]: chunksizes = (4, 1000, 65, 120)
In [3]: dict(zip(dims, chunksizes))
Out[3]: {'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}
That works as long as the values for dims and chunksizes are always produced in the same order. Is this a fair assumption?
Yes, the ordering matters
Do you mean that I can assume the above code will always work? I don't want to associate a key with the wrong value...
Instead of ds['my_var'].data.chunksize, try the following:
dict(zip(ds['my_var'].dims, ds['my_var'].chunks))
This is what I get:
dict(zip(ds['uas'].dims, ds['uas'].chunks))
{'member_id': (4, 4, 4), 'time': (1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 698), 'lat': (65, 65, 65, 63), 'lon': (120, 120, 120, 120, 120)}
I don't think that is what I want... I just want to know if I can trust that dims and chunksize will order the values correctly for me.
I've always used both and haven't run into any issues... It's my understanding that dims and chunksize are returned in the same order.
Thanks, I appreciate knowing your experience so far. I will print some diagnostics to help make sure that things look correct.
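Something along these lines, just to be safe (a rough sketch; 'uas' is one of the variables in my dataset above):
# sanity check: one chunk size per dimension, printed side by side
var = ds['uas']
assert len(var.dims) == len(var.data.chunksize)
for dim, size in zip(var.dims, var.data.chunksize):
    print(f"{dim}: {size}")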
Hi @Brian Bonnlander and @Anderson Banihirwe, I'm adding to this thread since I'm currently having issues saving something to zarr because of chunk size. Happy to open a new issue if needed. I am trying to interpolate some 4D data onto isotherms; I made sure to keep the vertical dimension intact, and chunked the time, X, and Y dimensions in the same manner between the two datasets (one is the data I want to interpolate, the other contains the isotherms I want to interpolate to).
# then interpolate to those isotherms
ds_wdt = xr.open_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/wdt_z.zarr') # data to interpolate
ds_iso_set = xr.open_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/z_isotherms_05.zarr') # isotherms to interpolate to
wdt_isos = dc.interpolate.pchip(ds_wdt.wdt.chunk({'Z':-1, 'time':5, 'YC':-1, 'XC':-1}), 'Z',
ds_iso_set.z_iso.chunk({'target':-1, 'time':5, 'YC':-1, 'XC':-1}),
core_dim='target')
When I try to save my data
(wdt_isos.to_dataset(name='wdt_iso').unify_chunks()
.to_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/wdt_iso.zarr'))
I get this error:
NotImplementedError: Specified zarr chunks encoding['chunks']=(10,) for variable named 'iter' would overlap multiple dask chunks ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5),). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
I have tried
del ds_wdt.wdt.encoding['chunks'] # data to interpolate
del ds_iso_set.z_iso.encoding['chunks'] # isotherms to interpolate to
I have also tried
(wdt_isos.to_dataset(name='wdt_iso').unify_chunks()
.to_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' +str(year)+ '/wdt_iso.zarr'))
with no result. Not really sure how to proceed, any ideas?
pinging @Deepak Cherian too :upside_down:
Specified zarr chunks encoding['chunks']=(10,) for variable named 'iter'
I think you need to delete encoding for iter
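Something like this, roughly (untested sketch, reusing the names from your snippet above):
out = wdt_isos.to_dataset(name='wdt_iso').unify_chunks()

# drop the stale zarr chunk encoding on the variable named in the error message
if 'iter' in out.variables:
    out['iter'].encoding.pop('chunks', None)

out.to_zarr('/project/oce/deppenme/process-dat/tpose/WDT/' + str(year) + '/wdt_iso.zarr')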
Great, thanks, will try. What is 'iter'? I assumed it was a stand-in for all the variables, because I don't know of a variable called 'iter' in the dataset.