xr.to_netcdf adding fillvalues · python-questions

I am trying to use xarray to read in a netcdf file (a CAM timeseries file), modify a
few fields, and then save as a new netcdf file. Here's a simplified
version:

import xarray as xr
myfile_orig = '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc'
ds = xr.open_dataset(myfile_orig)
new_file = 'Copy.nc'
ds.to_netcdf(new_file, engine='netcdf4', format='NETCDF4')

It appears that the xr.to_netcdf() function adds a _FillValue: nan to every variable.
(This causes another program of mine to fail, because the coordinate variables
are not supposed to have this attribute.) For example:

(py372) cisl-duluth:prect abaker$ ncinfo -v lon Copy.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
_FillValue: nan
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on

(py372) cisl-duluth:prect abaker$ ncinfo -v lon orig-data/PRECT.orig.ts10000.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on, default _FillValue of 9.969209968386869e+36 used

Modifying the encoding of each variable individually avoids this, but it
seems that there should be a more global way to do it.
(I tried setting ds.encoding['_FillValue'] = False, but that had no effect.)

Is it true that this must be done on a variable-by-variable basis,
or does anyone know a different way? Thanks.

Anderson Banihirwe (Mar 03 2020 at 22:50):

Internally, xarray checks the data type of each variable in order to assign it a default _FillValue. Because of this, setting ds.encoding['_FillValue'] = False globally does not work. One solution is to loop through all variables:

for v in ds.variables:
    if '_FillValue' not in ds[v].encoding:
        ds[v].encoding['_FillValue'] = None
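
For reuse, the same loop can be wrapped in a small helper. This is only a sketch: the names disable_default_fill and copy_without_default_fill are made up for illustration and are not part of xarray. It assumes only that the dataset exposes a .variables mapping whose values carry an .encoding dict, as xarray.Dataset does.

```python
def disable_default_fill(ds):
    """Set encoding['_FillValue'] = None wherever no fill value is recorded.

    Works on any object whose .variables maps names to objects carrying an
    .encoding dict. Returns the names of the variables that were changed.
    """
    changed = []
    for name, var in ds.variables.items():
        if "_FillValue" not in var.encoding:
            # None tells the writer not to add a _FillValue attribute.
            var.encoding["_FillValue"] = None
            changed.append(name)
    return changed


def copy_without_default_fill(src, dst):
    """Hypothetical wrapper: open src, suppress default fills, write dst."""
    import xarray as xr  # imported here so the helper above has no dependencies

    with xr.open_dataset(src) as ds:
        disable_default_fill(ds)
        ds.to_netcdf(dst, engine="netcdf4", format="NETCDF4")
```

Variables that already carry an explicit _FillValue in their encoding are left untouched, so only the defaults are suppressed.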

Allison Baker (Mar 04 2020 at 17:07):

Allison Baker (Mar 04 2020 at 17:27):

Following up on this same example, the other thing happening is that
my dimension "chars=8" gets changed to "string8 = 8". For example,

(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

(py372) cisl-duluth:prect abaker$ ncdump -h orig-data/PRECT.orig.ts10000.nc
netcdf PRECT.orig.ts10000 {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
chars = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

I see that when I read in the dataset, chars is not in the dims:
In [12]: ds.dims
Out[12]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))

I have been post-processing the file with an NCO tool to change it back
(ncrename -d string8,chars Copy.nc), but would rather fix it in my
Python script if anyone knows how to do that...
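
As a stopgap, the same ncrename call can be issued from inside the Python script. This is a minimal sketch, assuming ncrename is on the PATH; the function names are hypothetical, and the only flag used is the -d old,new form quoted above.

```python
import subprocess


def ncrename_dim_command(path, old, new):
    """Build the argument list for: ncrename -d old,new path"""
    return ["ncrename", "-d", f"{old},{new}", path]


def rename_dim_in_place(path, old="string8", new="chars"):
    """Run ncrename on path; raises CalledProcessError if the tool fails."""
    subprocess.run(ncrename_dim_command(path, old, new), check=True)
```

Calling rename_dim_in_place('Copy.nc') right after ds.to_netcdf('Copy.nc') would then replace the manual post-processing step.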

Kevin Paul (Mar 04 2020 at 17:38):

Yeah, I think this is clearly a bug in Xarray. The way that Xarray documentation says to handle this is to add char_dim_name to the variable's encoding (e.g., var.encoding['char_dim_name'] = 'chars'), but this does not work.

And, as a separate issue, the char_dim_name encoding value should be stored in the variable's encoding at read-time, when the Xarray Dataset is created. But it is not.
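
One way to check whether the encoding was populated at read time is to list the byte-string variables that lack the key. This is a rough sketch: missing_char_dim_name is a made-up helper name, and it assumes character variables show up with a dtype whose kind is 'S' (as date_written does, with dtype '|S8').

```python
def missing_char_dim_name(ds):
    """Return names of byte-string variables without 'char_dim_name' encoding.

    Expects ds.variables to map names to objects with .dtype and .encoding,
    which is how xarray.Dataset exposes its variables.
    """
    missing = []
    for name, var in ds.variables.items():
        is_bytes = getattr(var.dtype, "kind", None) == "S"
        if is_bytes and "char_dim_name" not in var.encoding:
            missing.append(name)
    return missing
```

An empty list means every character variable remembers the name of its character dimension; any name returned is a variable whose dimension name may not survive a round trip.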

Anderson Banihirwe (Mar 05 2020 at 01:22):

@Allison Baker, when you get a chance, can you point me to the file or send me the file you are using? I'd like to use it to open an issue in the xarray repo. Thanks!

Kevin Paul (Mar 05 2020 at 03:46):

@Allison Baker’s file might be too big to upload. However, I’m sure we can extract a smaller piece of the file (e.g., 1 timestep) that can reproduce the problem.

Allison Baker (Mar 05 2020 at 16:50):

Anderson Banihirwe (Mar 05 2020 at 18:55):

I can confirm that chars is not in the dims; however, I am not getting the string8 = 8 issue. According to the documentation, xarray keeps track of the chars dimension in the encoding of every variable that has this dimension. When I write the dataset back, chars is present in the netCDF file:

In [4]: ds = xr.open_dataset("PRECT.orig.ts10000.nc")

In [5]: ds.dims
Out[5]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))

In [6]: ds.date_written
Out[6]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:
  * time     (time) object 1947-05-26 00:00:00

In [7]: ds.date_written.encoding
Out[7]:
{'zlib': True,
 'shuffle': True,
 'complevel': 1,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (1, 8),
 'source': '/Users/abanihi/devel/tmp_notebooks/xarray/PRECT.orig.ts10000.nc',
 'original_shape': (1, 8),
 'dtype': dtype('S1'),
 'char_dim_name': 'chars'}

In [8]: ds.to_netcdf("test.nc")

In [9]: %%bash
   ...: ncdump -h test.nc
   ...:
   ...:
   ...:
   ...:
netcdf test {
dimensions:
    time = UNLIMITED ; // (1 currently)
    lat = 192 ;
    lon = 288 ;
    chars = 8 ;
    ilev = 31 ;
    lev = 30 ;
    slat = 191 ;
    slon = 288 ;
    nbnd = 2 ;
variables:
    double P0 ;
        P0:_FillValue = NaN ;
        P0:long_name = "reference pressure" ;
        P0:units = "Pa" ;
    float PRECT(time, lat, lon) ;
...
    char date_written(time, chars) ;
    int datesec(time) ;
        datesec:long_name = "current seconds of current date" ;
    double f11vmr(time) ;
        f11vmr:_FillValue = NaN ;
        f11vmr:long_name = "f11 volume mixing ratio" ;

Anderson Banihirwe (Mar 05 2020 at 18:57):

Can you elaborate on what operations you are applying on the read dataset that result in you getting:

(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

Kevin Paul (Mar 05 2020 at 18:57):

We were not seeing that in our testing. In fact, the chars dimension was not being saved in the variable encoding.

Anderson Banihirwe (Mar 05 2020 at 18:58):

In [10]: xr.__version__
Out[10]: '0.15.0'

Allison Baker (Mar 05 2020 at 20:39):

I just did the same thing as you and mine turned to string8. Maybe I have a different version of some package?

In [1]: import xarray as xr
...: import numpy as np
...: ds = xr.open_dataset("orig-data/PRECT.orig.ts10000.nc")

In [2]: ds.dims
Out[2]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))

In [3]: ds.date_written
Out[3]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:

In [4]: ds.date_written.encoding
Out[4]:
{'zlib': True,
'shuffle': True,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (1, 8),
'source': '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc',
'original_shape': (1, 8),
'dtype': dtype('S1')}

In [7]: %%bash
...: ncdump -h test.nc
...:
...:
netcdf test {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;

Allison Baker (Mar 05 2020 at 20:41):

@Kevin Paul @Anderson Banihirwe
OK, I have an older version - grrrr
xr.__version__
Out[8]: '0.12.1'
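
A script can guard against this by comparing the installed version with the one that worked above. This is a minimal sketch: the helper names are hypothetical, and (0, 15, 0) is simply the version reported as working in this thread, not necessarily the first release with the fix.

```python
import re

KNOWN_GOOD = (0, 15, 0)  # version reported as working in this thread


def version_tuple(version):
    """Turn '0.12.1' into (0, 12, 1); stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least_known_good(version):
    """True if version is at or above the known-good release."""
    return version_tuple(version) >= KNOWN_GOOD
```

In a script, at_least_known_good(xr.__version__) could be checked before writing, with a warning emitted when it returns False.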

Allison Baker (Mar 03 2020 at 21:45):