Stream: python-questions
Topic: peculiar behavior when saving xarray dataset to zarr
Anderson Banihirwe (Mar 21 2020 at 00:15):
I am trying to write an xarray dataset after manipulating its coordinates, and for reasons unknown to me, xarray seems to be doing something I didn't tell it to do. Here's a minimal example:
- Read the data
In [14]: import xarray as xr

In [15]: ds = xr.tutorial.open_dataset('rasm').chunk()

In [16]: ds
Out[16]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...
- Define a function for restoring non-dim coordinates:
In [17]: def _restore_non_dim_coords(ds):
    ...:     """restore non_dim_coords to variables"""
    ...:     non_dim_coords_reset = set(ds.coords) - set(ds.dims)
    ...:     ds = ds.reset_coords(non_dim_coords_reset)
    ...:     return ds

In [18]: ds2 = _restore_non_dim_coords(ds)
- As you can see, at this point (xc, yc) have been promoted to data_variables:
In [19]: ds2
Out[19]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...
When I write the data and read it back in, I get an unexpected result, i.e. (xc, yc) have been moved back to coordinates:
In [20]: ds2.to_zarr("test.zarr", consolidated=True)
Out[20]: <xarray.backends.zarr.ZarrStore at 0x2b6dd7a08e90>

In [21]: xr.open_zarr("test.zarr", consolidated=True)
Out[21]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
Attributes:
    NCO:                       "4.6.0"
    comment:                   Output from the Variable Infiltration Capacity...
    convention:                CF-1.4
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...
    institution:               U.W.
    nco_openmp_thread_number:  1
    output_frequency:          daily
    output_mode:               averaged
    references:                Based on the initial model of Liang et al., 19...
    source:                    RACM R1002RBRxaaa01a
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
Am I missing something? I am trying to determine whether this is a bug that needs to be reported upstream (in xarray) or whether this is the expected behavior.
Cc @Deepak Cherian, @Joe Hamman in case you have suggestions on addressing this issue.
Deepak Cherian (Mar 21 2020 at 14:27):
The coordinates attribute is set in either attrs or encoding of those variables (most likely). [I think] This behaviour exists so you can get perfect roundtripping if decode_coords=False (i.e. a coordinates attribute was set but variables were not converted to coordinates). We could raise a SerializationWarning when this is the case. See https://github.com/pydata/xarray/pull/3487.
Anderson Banihirwe (Mar 21 2020 at 19:50):
Thank you for the clarification, @Deepak Cherian! I wasn't aware of the changes introduced by https://github.com/pydata/xarray/pull/3487 (my confusion was mostly due to the fact that in the past, prior to v0.14.1, I had done this manipulation successfully).
The coordinates attribute is set in either attrs or encoding of those variables (most likely).
Indeed:
In [43]: ds.Tair.encoding
Out[43]:
{'source': '/Users/abanihi/.xarray_tutorial_data/rasm.nc',
 'original_shape': (36, 205, 275),
 'dtype': dtype('float64'),
 '_FillValue': 9.969209968386869e+36,
 'coordinates': 'yc xc'}
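To check every variable at once rather than inspecting them one by one, a small helper along these lines might do. This is only a sketch: `find_coordinates_metadata` is a hypothetical name, not an xarray function, and it assumes only that the dataset exposes `.variables` and that each variable has `.encoding` and `.attrs` dicts, as xarray objects do.

```python
def find_coordinates_metadata(ds):
    """Report which variables carry a 'coordinates' entry, and where.

    Hypothetical helper (not part of xarray). Returns a dict mapping each
    variable name to {"encoding": value} and/or {"attrs": value}.
    """
    found = {}
    for name, var in ds.variables.items():
        hits = {}
        # The entry may live in either place, so look at both.
        for where in ("encoding", "attrs"):
            value = getattr(var, where).get("coordinates")
            if value is not None:
                hits[where] = value
        if hits:
            found[name] = hits
    return found
```

Called on the dataset from this thread, it would be expected to list Tair with the 'yc xc' value under "encoding".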
Deleting the coordinates key from the encoding gave me the outcome I was looking for:
In [45]: del ds2.Tair.encoding['coordinates']

In [48]: ds2.to_zarr('test.zarr', consolidated=True)
Out[48]: <xarray.backends.zarr.ZarrStore at 0x12a960a10>

In [50]: xr.open_zarr('test.zarr', consolidated=True)
Out[50]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
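The same cleanup could be generalised so that no variable carries a stale coordinates entry before writing. The following is a minimal sketch under the same assumptions as the session above; `drop_coordinates_metadata` is a hypothetical helper, not an xarray API, and it modifies the variables' `encoding` and `attrs` dicts in place.

```python
def drop_coordinates_metadata(ds):
    """Remove any 'coordinates' entry from every variable's encoding and attrs.

    Hypothetical helper (not part of xarray). Mutates the metadata dicts in
    place and returns the same dataset object for convenient chaining.
    """
    for var in ds.variables.values():
        # pop(..., None) is a no-op when the key is absent.
        var.encoding.pop("coordinates", None)
        var.attrs.pop("coordinates", None)
    return ds
```

With this, `drop_coordinates_metadata(ds2).to_zarr('test.zarr', consolidated=True)` would be expected to behave like the manual `del` above. Note that it also removes a coordinates entry from `attrs`, which may not be wanted if that attribute was set deliberately.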
Last updated: Jan 30 2022 at 12:01 UTC