Stream: python-questions

Topic: peculiar behavior when saving xarray dataset to zarr


Anderson Banihirwe (Mar 21 2020 at 00:15):

I am trying to write an xarray dataset after manipulating its coordinates, and for reasons unknown to me xarray seems to be doing something I didn't tell it to do. Here's a minimal example:

In [14]: import xarray as xr

In [15]: ds = xr.tutorial.open_dataset('rasm').chunk()

In [16]: ds
Out[16]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...

In [17]: def _restore_non_dim_coords(ds):
    ...:     """restore non_dim_coords to variables"""
    ...:     non_dim_coords_reset = set(ds.coords) - set(ds.dims)
    ...:     ds = ds.reset_coords(non_dim_coords_reset)
    ...:     return ds

In [18]: ds2 = _restore_non_dim_coords(ds)

In [19]: ds2
Out[19]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...

When I write the data and read it back in, I get an unexpected result: xc and yc have been moved back to coordinates:

In [20]: ds2.to_zarr("test.zarr", consolidated=True)
Out[20]: <xarray.backends.zarr.ZarrStore at 0x2b6dd7a08e90>

In [21]: xr.open_zarr("test.zarr", consolidated=True)
Out[21]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
Attributes:
    NCO:                       "4.6.0"
    comment:                   Output from the Variable Infiltration Capacity...
    convention:                CF-1.4
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...
    institution:               U.W.
    nco_openmp_thread_number:  1
    output_frequency:          daily
    output_mode:               averaged
    references:                Based on the initial model of Liang et al., 19...
    source:                    RACM R1002RBRxaaa01a
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...

Am I missing something? I am trying to determine whether this is a bug that needs to be reported upstream (in xarray) or whether this is the expected behavior.

Cc @Deepak Cherian, @Joe Hamman in case you have suggestions on addressing this issue.

Deepak Cherian (Mar 21 2020 at 14:27):

The coordinates attribute is most likely set in either the attrs or the encoding of those variables. I think this behaviour exists so that you get perfect roundtripping with decode_coords=False (i.e. when a coordinates attribute was set but the variables were not converted to coordinates). We could raise a SerializationWarning when this is the case. See https://github.com/pydata/xarray/pull/3487.
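A minimal sketch of how one might check this, assuming a small made-up dataset in place of rasm. The helper name find_coordinates_entries is hypothetical and not part of xarray; it only reads the attrs and encoding dictionaries that xarray exposes on each variable.

```python
import numpy as np
import xarray as xr


def find_coordinates_entries(ds):
    """Map each variable name to the places ('attrs' or 'encoding')
    where it carries a 'coordinates' entry, with the stored value."""
    found = {}
    for name, var in ds.variables.items():
        hits = [
            (where, mapping["coordinates"])
            for where, mapping in (("attrs", var.attrs), ("encoding", var.encoding))
            if "coordinates" in mapping
        ]
        if hits:
            found[name] = hits
    return found


# Small stand-in for the rasm dataset (values are made up).
demo = xr.Dataset(
    {"Tair": (("y", "x"), np.zeros((2, 3)))},
    coords={
        "xc": (("y", "x"), np.ones((2, 3))),
        "yc": (("y", "x"), np.ones((2, 3))),
    },
)
# Mimic the entry that decoding a CF file leaves on the data variable.
demo["Tair"].encoding["coordinates"] = "yc xc"

# Demote xc and yc to data variables, then look for leftover entries.
demo2 = demo.reset_coords(["xc", "yc"])
report = find_coordinates_entries(demo2)
print(report)
```

Any variable listed in the report still names xc and yc as its coordinates, which would explain why they come back as coordinates after a write and read.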

Anderson Banihirwe (Mar 21 2020 at 19:50):

Thank you for the clarification, @Deepak Cherian! I wasn't aware of the changes introduced by https://github.com/pydata/xarray/pull/3487. My confusion was mostly due to the fact that in the past (prior to v0.14.1) I had done this manipulation successfully.

Deepak Cherian said: "The coordinates attribute is most likely set in either the attrs or the encoding of those variables."

Indeed:

In [43]: ds.Tair.encoding
Out[43]:
{'source': '/Users/abanihi/.xarray_tutorial_data/rasm.nc',
 'original_shape': (36, 205, 275),
 'dtype': dtype('float64'),
 '_FillValue': 9.969209968386869e+36,
 'coordinates': 'yc xc'}

Deleting the coordinates key from the encoding gave me the outcome I was looking for:

In [45]: del ds2.Tair.encoding['coordinates']

In [48]: ds2.to_zarr('test.zarr', consolidated=True)
Out[48]: <xarray.backends.zarr.ZarrStore at 0x12a960a10>

In [50]: xr.open_zarr('test.zarr', consolidated=True)
Out[50]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(36, 205, 275), meta=np.ndarray>
    xc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(205, 275), meta=np.ndarray>
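
The fix above can be generalized to every variable in a dataset. The following is a sketch under stated assumptions, not an xarray feature: drop_coordinates_encoding is a hypothetical helper name, the demo dataset is made up, and removing the key from attrs as well as encoding is an extra precaution based on the explanation earlier in the thread.

```python
import numpy as np
import xarray as xr


def drop_coordinates_encoding(ds):
    """Return a copy of ds in which no variable carries a 'coordinates'
    entry in its encoding or attrs (hypothetical helper, not xarray API)."""
    # Shallow copy: the data is shared, but each variable gets its own
    # attrs and encoding dictionaries, so the input dataset is not modified.
    ds = ds.copy()
    for var in ds.variables.values():
        var.encoding.pop("coordinates", None)
        var.attrs.pop("coordinates", None)
    return ds


# Small stand-in for the rasm dataset (values are made up).
demo = xr.Dataset(
    {"Tair": (("y", "x"), np.zeros((2, 3)))},
    coords={
        "xc": (("y", "x"), np.ones((2, 3))),
        "yc": (("y", "x"), np.ones((2, 3))),
    },
)
demo["Tair"].encoding["coordinates"] = "yc xc"

demo2 = demo.reset_coords(["xc", "yc"])
cleaned = drop_coordinates_encoding(demo2)
print("coordinates" in cleaned["Tair"].encoding)
```

With the session above, the equivalent call would be along the lines of drop_coordinates_encoding(ds2).to_zarr('test.zarr', consolidated=True), after which xc and yc should be read back as data variables.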

Last updated: Jan 30 2022 at 12:01 UTC