Xarray unify_chunks() · python-questions

I have two Zarr stores whose data I want to combine along the time dimension, and then save as a new Zarr store. The second Zarr store has time values that start just after when the first Zarr store's time values end.

Both datasets have the same data variables and chunk sizes. But I know that the combined chunks from both will likely need re-alignment, because the final chunks from the first Zarr store are not completely filled.

Is calling unify_chunks() after concatenation going to de-fragment and rechunk the data using the existing chunk information, as I hope it will?

Thanks for any insights.

Brian Bonnlander (Oct 28 2020 at 19:12):

I guess I should have tried this before asking, but unify_chunks() does not do what I need. I need to call chunk() with the coordinate dimensions and chunk sizes I am interested in.
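A minimal sketch of the difference, assuming xarray with dask installed. It uses two small in-memory datasets as stand-ins for the two Zarr stores; the helper make_part, the variable name my_var, and the sizes are all made up for illustration. Concatenation keeps the partially filled final chunk of the first dataset in the middle of the time axis, unify_chunks() leaves that layout alone here, and an explicit chunk() call produces regular chunks again.

```python
import numpy as np
import xarray as xr


def make_part(n_time, start):
    """Hypothetical stand-in for one Zarr store: 25 time steps, chunked by 10."""
    data = np.zeros((n_time, 4), dtype="float32")
    ds = xr.Dataset(
        {"my_var": (("time", "lat"), data)},
        coords={"time": np.arange(start, start + n_time)},
    )
    # Chunking by 10 leaves a final, partially filled chunk of 5.
    return ds.chunk({"time": 10, "lat": 4})


first = make_part(25, 0)
second = make_part(25, 25)  # time values start right after the first part

combined = xr.concat([first, second], dim="time")

# Chunk sizes along the time axis (axis 0 of my_var) for each variant.
fragmented = combined["my_var"].chunks[0]
unified = combined.unify_chunks()["my_var"].chunks[0]
rechunked = combined.chunk({"time": 10})["my_var"].chunks[0]

print("after concat:       ", fragmented)
print("after unify_chunks: ", unified)
print("after chunk():      ", rechunked)
```

unify_chunks() only makes the chunks of different variables in a dataset agree with each other; it does not merge short chunks, which is why chunk() is the call to reach for in this situation.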

All I need to figure out is how to produce the dictionary describing my chunk sizes.
I can get the data variable dimension names with this expression:

ds['my_var'].dims
('member_id', 'time', 'lat', 'lon')

And I can get the data variable chunk sizes with this expression:

ds['my_var'].data.chunksize
(4, 1000, 65, 120)

But I can't seem to get the dictionary that puts these two things together, i.e.

{'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}

For now, I will assume that the ordering of values matches and I can construct the dictionary myself, but I don't know if this is a safe assumption.

Anderson Banihirwe (Oct 28 2020 at 19:16):

"But I can't seem to get the dictionary that puts these two things together, i.e."

Are you looking for this:

In [1]: dims = ('member_id', 'time', 'lat', 'lon')

In [2]: chunksizes = (4, 1000, 65, 120)

In [3]: dict(zip(dims, chunksizes))
Out[3]: {'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}

Brian Bonnlander (Oct 28 2020 at 19:18):

That works as long as the values for dims and chunksizes are always produced in the same order. Is this a fair assumption?

Anderson Banihirwe (Oct 28 2020 at 19:18):

Yes, the ordering matters

Brian Bonnlander (Oct 28 2020 at 19:21):

Do you mean that I can assume the above code will always work? I don't want to associate a key with the wrong value...

Anderson Banihirwe (Oct 28 2020 at 19:22):

Instead of ds['my_var'].data.chunksize, try the following:

dict(zip(ds['my_var'].dims, ds['my_var'].chunks))

Brian Bonnlander (Oct 28 2020 at 19:26):

This is what I get:

dict(zip(ds['uas'].dims, ds['uas'].chunks))
{'member_id': (4, 4, 4),
 'time': (1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  1000,
  698),
 'lat': (65, 65, 65, 63),
 'lon': (120, 120, 120, 120, 120)}

I don't think that is what I want...I just want to know if I can trust that dims and chunksize will order the values correctly for me.
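The output above can still be reduced to the one-size-per-dimension dictionary that chunk() accepts. A minimal sketch, using only the values printed in this thread (no xarray needed): .chunks holds every chunk length along each axis, so taking the largest length per axis recovers the nominal chunk size. Using max rather than the first entry is a choice made here, on the assumption that only trailing chunks are short.

```python
# Values copied from the thread: dims of ds['uas'] and its .chunks output.
dims = ("member_id", "time", "lat", "lon")
chunks = (
    (4, 4, 4),                # member_id
    (1000,) * 34 + (698,),    # time: 34 full chunks plus a short final one
    (65, 65, 65, 63),         # lat
    (120,) * 5,               # lon
)

# Entry i of `chunks` describes axis i, and dims[i] names that same axis,
# so zipping the two pairs each dimension with its own chunk lengths.
chunk_dict = {dim: max(sizes) for dim, sizes in zip(dims, chunks)}

print(chunk_dict)
# {'member_id': 4, 'time': 1000, 'lat': 65, 'lon': 120}
```

The resulting dictionary can then be passed to chunk().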

Anderson Banihirwe (Oct 28 2020 at 19:29):

"I can trust that dims and chunksize will order the values correctly for me."

I've always used both and haven't run into any issues. It's my understanding that dims and chunksize are returned in the same order.

Brian Bonnlander (Oct 28 2020 at 19:30):

Thanks, I appreciate knowing your experience so far. I will print some diagnostics to help make sure that things look correct.
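One possible diagnostic, sketched in plain Python: for every dimension, the chunk lengths paired with it should add up to that dimension's length. The function name find_misaligned_dims is made up for this example, and the shape below is inferred from the chunk output in this thread rather than read from the real dataset. The check can catch a mismatch in ordering, but it cannot prove the ordering correct when two dimensions happen to have the same length.

```python
def find_misaligned_dims(dims, shape, chunks):
    """Return the dimension names whose chunk lengths do not sum to their length."""
    if not (len(dims) == len(shape) == len(chunks)):
        raise ValueError("dims, shape and chunks need one entry per axis")
    return [
        dim
        for dim, length, sizes in zip(dims, shape, chunks)
        if sum(sizes) != length
    ]


dims = ("member_id", "time", "lat", "lon")
shape = (12, 34698, 258, 600)  # assumed: sums of the chunk lengths printed above
chunks = (
    (4, 4, 4),
    (1000,) * 34 + (698,),
    (65, 65, 65, 63),
    (120,) * 5,
)

aligned = find_misaligned_dims(dims, shape, chunks)
# Deliberately reverse the chunk tuples to simulate a wrong ordering.
scrambled = find_misaligned_dims(dims, shape, chunks[::-1])

print("misaligned (correct order): ", aligned)
print("misaligned (reversed order):", scrambled)
```

With a real dataset, the three arguments would come from ds['my_var'].dims, ds['my_var'].shape and ds['my_var'].chunks.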

Last updated: Jan 30 2022 at 12:01 UTC


Brian Bonnlander (Oct 28 2020 at 00:12):