Stream: python-questions

Topic: xr.to_netcdf adding fillvalues


Allison Baker (Mar 03 2020 at 21:45):

I am trying to use xarray to read in a netcdf file (a CAM timeseries file), modify a
few fields, and then save as a new netcdf file. Here's a simplified
version:

import xarray as xr
myfile_orig = '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc'
ds = xr.open_dataset(myfile_orig)
new_file = 'Copy.nc'
ds.to_netcdf(new_file, engine = 'netcdf4', format='NETCDF4')

It appears that the xr.to_netcdf() function adds a _FillValue: nan to every variable.
(This causes another code I have to fail as the coordinate variables
are not supposed to have this). For example:

(py372) cisl-duluth:prect abaker$ ncinfo -v lon Copy.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
_FillValue: nan
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on

(py372) cisl-duluth:prect abaker$ ncinfo -v lon orig-data/PRECT.orig.ts10000.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on, default _FillValue of 9.969209968386869e+36 used

In this Stack Overflow thread:
https://stackoverflow.com/questions/45693688/xarray-automatically-applying-fillvalue-to-coordinates-on-netcdf-output
It suggests setting the _FillValue encoding of each variable to False before calling
to_netcdf(), e.g.: ds.lon.encoding['_FillValue'] = False

While this works, it seems that there might be a more global way to
do this without having to modify the encoding for each variable.
(I tried setting ds.encoding['_FillValue'] = False, but that had no effect.)

There is another related post here:
https://github.com/pydata/xarray/issues/1598

But it still seems that this must be done on a variable by variable basis.
Is that true or does anyone know a different way? Thanks.

Anderson Banihirwe (Mar 03 2020 at 22:50):

@Allison Baker,

But it still seems that this must be done on a variable by variable basis.
Is that true or does anyone know a different way?

Internally, xarray checks the data type of each variable in order to assign it a default _FillValue. Because of this, setting ds.encoding['_FillValue'] = False globally does not work. One solution is to loop through all variables:

for v in ds.variables:
    if '_FillValue' not in ds[v].encoding:
        ds[v].encoding['_FillValue'] = None
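
A minimal sketch of the same loop wrapped in a reusable helper, assuming ds is the xarray Dataset opened above. The function name disable_default_fill_values is hypothetical and not part of xarray; it only touches ds.variables and each variable's encoding dict, and it leaves any explicitly set _FillValue alone:

```python
def disable_default_fill_values(ds):
    """Set encoding['_FillValue'] = None where no fill value is recorded.

    `ds` is expected to be an xarray.Dataset (an assumption of this sketch).
    Returns the names of the variables whose encoding was changed.
    """
    changed = []
    for name, var in ds.variables.items():
        # Keep any fill value that was read from the file or set by hand.
        if "_FillValue" not in var.encoding:
            var.encoding["_FillValue"] = None
            changed.append(name)
    return changed
```

Calling disable_default_fill_values(ds) right before ds.to_netcdf(new_file) should then write lon and the other coordinate variables without the _FillValue: nan attribute.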

Allison Baker (Mar 04 2020 at 17:07):

@Anderson Banihirwe - Thanks for clarifying and for the suggestion.

Allison Baker (Mar 04 2020 at 17:27):

Following up on this same example, the other thing happening is that
my dimension "chars=8" gets changed to "string8 = 8". For example,

(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

(py372) cisl-duluth:prect abaker$ ncdump -h orig-data/PRECT.orig.ts10000.nc
netcdf PRECT.orig.ts10000 {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
chars = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

I see that when I read in the dataset, chars is not in the dims:
In [12]: ds.dims
Out[12]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))

I have been post-processing the file with an NCO tool to change it back
(ncrename -d string8,chars Copy.nc), but would rather fix it in my
Python script if anyone knows how to do that....

Thanks!

Kevin Paul (Mar 04 2020 at 17:38):

Yeah, I think this is clearly a bug in Xarray. The Xarray documentation says to handle this by adding char_dim_name to the variable's encoding (e.g., var.encoding['char_dim_name'] = 'chars'), but this does not work.

And, as a separate issue, the char_dim_name encoding value should be stored in the variable's encoding at read-time, when the Xarray Dataset is created. But it is not.

Anderson Banihirwe (Mar 05 2020 at 01:22):

@Allison Baker, when you get a chance, can you point me to the file or send me the file you are using? I'd like to use it to open an issue in the xarray repo. Thanks!

Kevin Paul (Mar 05 2020 at 03:46):

@Allison Baker’s file might be too big to upload. However, I’m sure we can extract a smaller piece of the file (e.g., 1 timestep) that can reproduce the problem.

Allison Baker (Mar 05 2020 at 16:50):

@Anderson Banihirwe Yes - I attached PRECT.orig.ts10000.nc, my single-timeslice file.

Anderson Banihirwe (Mar 05 2020 at 18:55):

I see that when I read in the dataset, chars is not in the dims:
In [12]: ds.dims
Out[12]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))

I have been post-processing the file with an NCO tool to change it back
(ncrename -d string8,chars Copy.nc), but would rather fix it in my
Python script if anyone knows how to do that....

@Allison Baker

I can confirm that chars is not in the dims; however, I am not getting the string8 = 8 issue. According to the documentation, xarray keeps track of the chars dimension in the encoding of every variable that has this dimension. When I write the dataset back, chars is present in the netCDF file:

In [4]: ds = xr.open_dataset("PRECT.orig.ts10000.nc")

In [5]: ds.dims
Out[5]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))

In [6]: ds.date_written
Out[6]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:
  * time     (time) object 1947-05-26 00:00:00

In [7]: ds.date_written.encoding
Out[7]:
{'zlib': True,
 'shuffle': True,
 'complevel': 1,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (1, 8),
 'source': '/Users/abanihi/devel/tmp_notebooks/xarray/PRECT.orig.ts10000.nc',
 'original_shape': (1, 8),
 'dtype': dtype('S1'),
 'char_dim_name': 'chars'}
In [8]: ds.to_netcdf("test.nc")

In [9]: %%bash
   ...: ncdump -h test.nc
   ...:
   ...:
   ...:
   ...:
netcdf test {
dimensions:
    time = UNLIMITED ; // (1 currently)
    lat = 192 ;
    lon = 288 ;
    chars = 8 ;
    ilev = 31 ;
    lev = 30 ;
    slat = 191 ;
    slon = 288 ;
    nbnd = 2 ;
variables:
    double P0 ;
        P0:_FillValue = NaN ;
        P0:long_name = "reference pressure" ;
        P0:units = "Pa" ;
    float PRECT(time, lat, lon) ;
...
    char date_written(time, chars) ;
    int datesec(time) ;
        datesec:long_name = "current seconds of current date" ;
    double f11vmr(time) ;
        f11vmr:_FillValue = NaN ;
        f11vmr:long_name = "f11 volume mixing ratio" ;

Anderson Banihirwe (Mar 05 2020 at 18:57):

Can you elaborate on what operations you are applying on the read dataset that result in you getting:

(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

Kevin Paul (Mar 05 2020 at 18:57):

We were not seeing that in our testing. In fact, the chars dimension was not being saved in the variable encoding.

@Allison Baker What version of Xarray are you using?
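
One way to check what Kevin describes across the whole dataset, rather than one variable at a time, is to collect char_dim_name from every variable's encoding after open_dataset. This is only a sketch: the helper name char_dim_names is hypothetical and not part of xarray, and ds is assumed to be an xarray Dataset:

```python
def char_dim_names(ds):
    """Map variable name -> recorded 'char_dim_name', for variables that have one.

    `ds` is expected to be an xarray.Dataset (an assumption of this sketch).
    An empty result means no character dimension name was kept at read time.
    """
    return {
        name: var.encoding["char_dim_name"]
        for name, var in ds.variables.items()
        if "char_dim_name" in var.encoding
    }
```

In a session like the 0.15.0 one above, the result should include 'date_written': 'chars'. An empty dict would match the behavior Kevin reports, where the original dimension name is not available when the file is written back.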

Anderson Banihirwe (Mar 05 2020 at 18:58):

I should add that I am using

In [10]: xr.__version__
Out[10]: '0.15.0'

Allison Baker (Mar 05 2020 at 20:39):

I just did the same thing as you and mine turned to string8. Maybe I have a different version of some package?

In [1]: import xarray as xr
...: import numpy as np
...: ds = xr.open_dataset("orig-data/PRECT.orig.ts10000.nc")

In [2]: ds.dims
Out[2]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))

In [3]: ds.date_written
Out[3]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:
  * time     (time) object 1947-05-26 00:00:00

In [4]: ds.date_written.encoding
Out[4]:
{'zlib': True,
'shuffle': True,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (1, 8),
'source': '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc',
'original_shape': (1, 8),
'dtype': dtype('S1')}

In [5]: ds.to_netcdf("test.nc")

In [6]: ds.to_netcdf("test.nc")

In [7]: %%bash
...: ncdump -h test.nc
...:
...:
netcdf test {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

Allison Baker (Mar 05 2020 at 20:41):

@Kevin Paul @Anderson Banihirwe
OK, I have an older version - grrrr
xr.__version__
Out[8]: '0.12.1'

I will update it and re-try.....

Kevin Paul (Mar 05 2020 at 20:42):

@Allison Baker Yes. Update regularly!

Allison Baker (Mar 05 2020 at 21:21):

@Anderson Banihirwe @Kevin Paul
OK, that fixed it - thank you!


Last updated: Jan 30 2022 at 12:01 UTC