Stream: python-questions
Topic: xr.to_netcdf adding fillvalues
Allison Baker (Mar 03 2020 at 21:45):
I am trying to use xarray to read in a netcdf file (a CAM timeseries file), modify a
few fields, and then save as a new netcdf file. Here's a simplified
version:
import xarray as xr
myfile_orig = '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc'
ds = xr.open_dataset(myfile_orig)
new_file = 'Copy.nc'
ds.to_netcdf(new_file, engine = 'netcdf4', format='NETCDF4')
It appears that the xr.to_netcdf() function adds a _FillValue: nan to every variable.
(This causes another code I have to fail as the coordinate variables
are not supposed to have this). For example:
(py372) cisl-duluth:prect abaker$ ncinfo -v lon Copy.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
_FillValue: nan
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on
(py372) cisl-duluth:prect abaker$ ncinfo -v lon orig-data/PRECT.orig.ts10000.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on, default _FillValue of 9.969209968386869e+36 used
In this Stack Overflow thread:
https://stackoverflow.com/questions/45693688/xarray-automatically-applying-fillvalue-to-coordinates-on-netcdf-output
It suggests setting the _FillValue encoding of each variable to False before calling
to_netcdf(), e.g.: ds.lon.encoding['_FillValue'] = False
While this works, it seems that there might be a more global way to
do this without having to modify the encoding for each variable.
(I tried setting ds.encoding['_FillValue'] = False, but that did not affect anything.)
There is another related post here:
https://github.com/pydata/xarray/issues/1598
But it still seems that this must be done on a variable by variable basis.
Is that true or does anyone know a different way? Thanks.
Anderson Banihirwe (Mar 03 2020 at 22:50):
@Allison Baker,
But it still seems that this must be done on a variable by variable basis.
Is that true or does anyone know a different way?
Internally, xarray checks the data type of each variable in order to assign it a default _FillValue. Because of this, setting ds.encoding['_FillValue'] = False globally does not work. One solution is to loop through all variables:
for v in ds.variables:
    if '_FillValue' not in ds[v].encoding:
        ds[v].encoding['_FillValue'] = None
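A minimal sketch of the loop above on a small in-memory dataset, assuming xarray and NumPy are installed. The variable names and values (PRECT, lat, lon, the -999.0 fill value) are illustrative stand-ins, not the contents of the original CAM file. It also shows that a variable that already has an explicit _FillValue in its encoding is left alone.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the CAM file: one data variable and two coordinates.
ds = xr.Dataset(
    {"PRECT": (("lat", "lon"), np.zeros((2, 3)))},
    coords={"lat": [-45.0, 45.0], "lon": [0.0, 120.0, 240.0]},
)

# Assume one variable already carries an explicit fill value that must be kept.
ds.variables["PRECT"].encoding["_FillValue"] = -999.0

# Mark every remaining variable so that no default _FillValue is requested on write.
for name, var in ds.variables.items():
    if "_FillValue" not in var.encoding:
        var.encoding["_FillValue"] = None

# Collect the resulting per-variable settings for inspection.
fill_settings = {
    name: var.encoding["_FillValue"] for name, var in ds.variables.items()
}
print(fill_settings)
```

A later ds.to_netcdf("Copy.nc") call should then leave the _FillValue attribute off the variables marked with None, which can be confirmed with ncinfo or ncdump -h as in the question.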
Allison Baker (Mar 04 2020 at 17:07):
@Anderson Banihirwe - Thanks for clarifying and for the suggestion.
Allison Baker (Mar 04 2020 at 17:27):
Following up on this same example, the other thing happening is that
my dimension "chars=8" gets changed to "string8 = 8". For example,
(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
(py372) cisl-duluth:prect abaker$ ncdump -h orig-data/PRECT.orig.ts10000.nc
netcdf PRECT.orig.ts10000 {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
chars = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
I see that when I read in the dataset, chars is not in the dims:
In [12]: ds.dims
Out[12]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
I have been post-processing the file with a nco tool to change it back
(ncrename -d string8,chars Copy.nc), but would rather fix it in my
python script if anyone knows how to do that....
Thanks!
Kevin Paul (Mar 04 2020 at 17:38):
Yeah, I think this is clearly a bug in Xarray. The way that Xarray documentation says to handle this is to add the char_dim_name to the variable's encoding (e.g., var.encoding['char_dim_name'] = 'chars'), but this does not work.
And, as a separate issue, the char_dim_name encoding value should be stored in the variable's encoding at read-time, when the Xarray Dataset is created. But it is not.
Anderson Banihirwe (Mar 05 2020 at 01:22):
@Allison Baker, when you get a chance, can you point me to the file or send me the file you are using? I'd like to use it to open an issue in the xarray repo. Thanks!
Kevin Paul (Mar 05 2020 at 03:46):
@Allison Baker’s file might be too big to upload. However, I’m sure we can extract a smaller piece of the file (e.g., 1 timestep) that can reproduce the problem.
Allison Baker (Mar 05 2020 at 16:50):
@Anderson Banihirwe Yes - I attached PRECT.orig.ts10000.nc my single timeslice file.
Anderson Banihirwe (Mar 05 2020 at 18:55):
I see that when I read in the dataset, chars is not in the dims:
In [12]: ds.dims
Out[12]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
I have been post-processing the file with a nco tool to change it back
(ncrename -d string8,chars Copy.nc), but would rather fix it in my
python script if anyone knows how to do that....
@Allison Baker
I can confirm that chars is not in the dims; however, I am not getting the string8 = 8 issue. According to the documentation, xarray keeps track of the chars dim in the encoding of every variable that has this dimension. When I write the dataset back, chars is present in the netCDF file:
In [4]: ds = xr.open_dataset("PRECT.orig.ts10000.nc")
In [5]: ds.dims
Out[5]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
In [6]: ds.date_written
Out[6]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:
* time (time) object 1947-05-26 00:00:00
In [7]: ds.date_written.encoding
Out[7]:
{'zlib': True,
'shuffle': True,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (1, 8),
'source': '/Users/abanihi/devel/tmp_notebooks/xarray/PRECT.orig.ts10000.nc',
'original_shape': (1, 8),
'dtype': dtype('S1'),
'char_dim_name': 'chars'}
In [8]: ds.to_netcdf("test.nc")
In [9]: %%bash
...: ncdump -h test.nc
...:
...:
...:
...:
netcdf test {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
chars = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
variables:
double P0 ;
P0:_FillValue = NaN ;
P0:long_name = "reference pressure" ;
P0:units = "Pa" ;
float PRECT(time, lat, lon) ;
...
char date_written(time, chars) ;
int datesec(time) ;
datesec:long_name = "current seconds of current date" ;
double f11vmr(time) ;
f11vmr:_FillValue = NaN ;
f11vmr:long_name = "f11 volume mixing ratio" ;
Anderson Banihirwe (Mar 05 2020 at 18:57):
Can you elaborate on what operations you are applying on the read dataset that result in you getting:
(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
Kevin Paul (Mar 05 2020 at 18:57):
We were not seeing that in our testing. In fact, the chars dimension was not being saved in the variable encoding.
@Allison Baker What version of Xarray are you using?
Anderson Banihirwe (Mar 05 2020 at 18:58):
I should add that I am using
In [10]: xr.__version__
Out[10]: '0.15.0'
Allison Baker (Mar 05 2020 at 20:39):
I just did the same thing as you and mine turned to string8. Maybe I have a different version of some package?
In [1]: import xarray as xr
...: import numpy as np
...: ds = xr.open_dataset("orig-data/PRECT.orig.ts10000.nc")
In [2]: ds.dims
Out[2]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
In [3]: ds.date_written
Out[3]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:
* time (time) object 1947-05-26 00:00:00
In [4]: ds.date_written.encoding
Out[4]:
{'zlib': True,
'shuffle': True,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (1, 8),
'source': '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc',
'original_shape': (1, 8),
'dtype': dtype('S1')}
In [5]: ds.to_netcdf("test.nc")
In [6]: ds.to_netcdf("test.nc")
In [7]: %%bash
...: ncdump -h test.nc
...:
...:
netcdf test {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
Allison Baker (Mar 05 2020 at 20:41):
@Kevin Paul @Anderson Banihirwe
OK, I have an older version - grrrr
xr.__version__
Out[8]: '0.12.1'
I will update it and re-try.....
Kevin Paul (Mar 05 2020 at 20:42):
@Allison Baker Yes. Update regularly!
Allison Baker (Mar 05 2020 at 21:21):
@Anderson Banihirwe @Kevin Paul
OK, that fixed it - thank you!
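The difference between the two sessions above came down to whether 'char_dim_name' appeared in the variable's encoding after reading. A rough sketch of a check for that, assuming xarray and NumPy are installed: recorded_char_dims is a hypothetical helper name, and the in-memory dataset (including the datesec value) only imitates what xr.open_dataset returned under 0.15.0 in this thread, so the encoding entry is set by hand rather than read from a file.

```python
import numpy as np
import xarray as xr


def recorded_char_dims(ds):
    """Map variable name -> char_dim_name for variables whose encoding has one."""
    return {
        name: var.encoding["char_dim_name"]
        for name, var in ds.variables.items()
        if "char_dim_name" in var.encoding
    }


# Stand-in for: ds = xr.open_dataset("PRECT.orig.ts10000.nc")
demo = xr.Dataset(
    {
        "date_written": (("time",), np.array([b"03/02/20"], dtype="S8")),
        "datesec": (("time",), np.array([0], dtype="int32")),
    }
)

# Set by hand here (an assumption for the demo); on a real file this entry
# would come from open_dataset, if the installed version records it.
demo.variables["date_written"].encoding["char_dim_name"] = "chars"

print(xr.__version__)
print(recorded_char_dims(demo))
```

Running the helper on a dataset opened from the real file and getting an empty dict for a character variable would suggest the installed xarray does not record the name, as in the 0.12.1 session above.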
Last updated: Jan 30 2022 at 12:01 UTC