I am trying to use xarray to read in a netcdf file (a CAM timeseries file), modify a
few fields, and then save as a new netcdf file. Here's a simplified
version:
import xarray as xr
myfile_orig = '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc'
ds = xr.open_dataset(myfile_orig)
new_file = 'Copy.nc'
ds.to_netcdf(new_file, engine = 'netcdf4', format='NETCDF4')
It appears that to_netcdf() adds a _FillValue of nan to every variable.
(This causes another program I use to fail, because the coordinate variables
are not supposed to have this attribute.) For example:
(py372) cisl-duluth:prect abaker$ ncinfo -v lon Copy.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
_FillValue: nan
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on
(py372) cisl-duluth:prect abaker$ ncinfo -v lon orig-data/PRECT.orig.ts10000.nc
<class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
long_name: longitude
units: degrees_east
unlimited dimensions:
current shape = (288,)
filling on, default _FillValue of 9.969209968386869e+36 used
This Stack Overflow thread:
https://stackoverflow.com/questions/45693688/xarray-automatically-applying-fillvalue-to-coordinates-on-netcdf-output
suggests setting the _FillValue encoding of each variable to False before calling
to_netcdf(), e.g.: ds.lon.encoding['_FillValue'] = False
While this works, it seems that there might be a more global way to
do this without having to modify the encoding for each variable.
(I tried setting ds.encoding['_FillValue'] = False, but that had no effect.)
There is another related post here:
https://github.com/pydata/xarray/issues/1598
But it still seems that this must be done on a variable by variable basis.
Is that true or does anyone know a different way? Thanks.
@Allison Baker,
But it still seems that this must be done on a variable by variable basis.
Is that true or does anyone know a different way?
Internally, xarray checks the data type of each variable in order to assign it a default _FillValue. Because of this, setting ds.encoding['_FillValue'] = False globally does not work. One solution is to loop through all variables:
for v in ds.variables:
    if '_FillValue' not in ds[v].encoding:
        ds[v].encoding['_FillValue'] = None
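If the same fix is needed in several scripts, the loop above can be wrapped in a small helper. This is only a sketch: the name disable_default_fill_values is hypothetical, and the stand-in objects below merely mimic the .encoding attribute of xarray variables so the logic can be checked without a data file.

```python
from types import SimpleNamespace


def disable_default_fill_values(variables):
    """Set encoding['_FillValue'] = None on each variable lacking an explicit one.

    `variables` is a mapping of name -> object with an `.encoding` dict
    (for xarray, the intended argument is `ds.variables`).
    Returns the list of variable names that were changed.
    """
    changed = []
    for name, var in variables.items():
        if "_FillValue" not in var.encoding:
            var.encoding["_FillValue"] = None
            changed.append(name)
    return changed


# Stand-ins for xarray variables (assumption: only `.encoding` matters here).
demo_variables = {
    "lon": SimpleNamespace(encoding={}),
    "PRECT": SimpleNamespace(encoding={"_FillValue": 1e36}),
}
changed = disable_default_fill_values(demo_variables)
print(changed)  # ['lon']
```

With a real dataset, the intended call would be disable_default_fill_values(ds.variables) just before ds.to_netcdf(new_file). Variables that already carry an explicit _FillValue in their encoding are left alone, as in the original loop.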
@Anderson Banihirwe - Thanks for clarifying and for the suggestion.
Following up on this same example, the other thing happening is that
my dimension "chars=8" gets changed to "string8 = 8". For example,
(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
(py372) cisl-duluth:prect abaker$ ncdump -h orig-data/PRECT.orig.ts10000.nc
netcdf PRECT.orig.ts10000 {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
chars = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
I see that when I read in the dataset, chars is not in the dims:
In [12]: ds.dims
Out[12]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
I have been post-processing the file with an NCO tool to change it back
(ncrename -d string8,chars Copy.nc), but would rather fix it in my
Python script if anyone knows how to do that....
Thanks!
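One way to do the same rename from Python instead of ncrename is the netCDF4 package's renameDimension method, applied to the file after xarray has written it. The sketch below is a workaround under assumptions, not a fix for the underlying behavior: the helper name rename_dimension is hypothetical, and it accepts any object exposing a dimensions mapping and a renameDimension method, which is how an open netCDF4.Dataset behaves.

```python
def rename_dimension(nc, old_name, new_name):
    """Rename a dimension on an open, writable netCDF4.Dataset-like object.

    Returns True if the rename happened, False if `old_name` was not present
    (for example, because the file already uses `new_name`).
    """
    if old_name not in nc.dimensions:
        return False
    nc.renameDimension(old_name, new_name)
    return True


# Intended use (assumes the netCDF4 package is installed; not executed here):
#   import netCDF4
#   with netCDF4.Dataset("Copy.nc", "r+") as nc:
#       rename_dimension(nc, "string8", "chars")
```

Checking for the old name first keeps the call harmless on files where the dimension is already named chars.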
Yeah, I think this is clearly a bug in Xarray. The Xarray documentation says to handle this by adding 'char_dim_name' to the variable's encoding (e.g., var.encoding['char_dim_name'] = 'chars'), but this does not work.
And, as a separate issue, the char_dim_name encoding value should be stored in the variable's encoding at read time, when the Xarray Dataset is created. But it is not.
@Allison Baker, when you get a chance, can you point me to the file or send me the file you are using? I'd like to use it to open an issue in the xarray repo. Thanks!
@Allison Baker’s file might be too big to upload. However, I’m sure we can extract a smaller piece of the file (e.g., 1 timestep) that can reproduce the problem.
@Anderson Banihirwe Yes - I attached PRECT.orig.ts10000.nc, my single-timeslice file.
I see that when I read in the dataset, chars is not in the dims:
In [12]: ds.dims
Out[12]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
I have been post-processing the file with a nco tool to change it back
(ncrename -d string8,chars Copy.nc), but would rather fix it in my
python script if anyone knows how to do that....
@Allison Baker
I can confirm that chars is not in the dims; however, I am not getting the string8 = 8 issue. According to the documentation, xarray keeps track of the chars dim in the encoding of every variable that has this dimension. When I write the dataset back, chars is present in the netCDF file:
In [4]: ds = xr.open_dataset("PRECT.orig.ts10000.nc")
In [5]: ds.dims
Out[5]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
In [6]: ds.date_written
Out[6]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:
* time (time) object 1947-05-26 00:00:00
In [7]: ds.date_written.encoding
Out[7]:
{'zlib': True,
'shuffle': True,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (1, 8),
'source': '/Users/abanihi/devel/tmp_notebooks/xarray/PRECT.orig.ts10000.nc',
'original_shape': (1, 8),
'dtype': dtype('S1'),
'char_dim_name': 'chars'}
In [8]: ds.to_netcdf("test.nc")
In [9]: %%bash
...: ncdump -h test.nc
...:
...:
...:
...:
netcdf test {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
chars = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
variables:
double P0 ;
P0:_FillValue = NaN ;
P0:long_name = "reference pressure" ;
P0:units = "Pa" ;
float PRECT(time, lat, lon) ;
...
char date_written(time, chars) ;
int datesec(time) ;
datesec:long_name = "current seconds of current date" ;
double f11vmr(time) ;
f11vmr:_FillValue = NaN ;
f11vmr:long_name = "f11 volume mixing ratio" ;
Can you elaborate on what operations you are applying on the read dataset that result in you getting:
(py372) cisl-duluth:prect abaker$ ncdump -h Copy.nc
netcdf Copy {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
We were not seeing that in our testing. In fact, the chars dimension was not being saved in the variable encoding.
@Allison Baker What version of Xarray are you using?
I should add that I am using
In [10]: xr.__version__
Out[10]: '0.15.0'
I just did the same thing as you and mine turned to string8. Maybe I have a different version of some package?
In [1]: import xarray as xr
...: import numpy as np
...: ds = xr.open_dataset("orig-data/PRECT.orig.ts10000.nc")
In [2]: ds.dims
Out[2]: Frozen(SortedKeysDict({'time': 1, 'lat': 192, 'lon': 288, 'ilev': 31, 'lev': 30, 'slat': 191, 'slon': 288, 'nbnd': 2}))
In [3]: ds.date_written
Out[3]:
<xarray.DataArray 'date_written' (time: 1)>
array([b'03/02/20'], dtype='|S8')
Coordinates:
* time (time) object 1947-05-26 00:00:00
In [4]: ds.date_written.encoding
Out[4]:
{'zlib': True,
'shuffle': True,
'complevel': 1,
'fletcher32': False,
'contiguous': False,
'chunksizes': (1, 8),
'source': '/Users/abaker/alli/compression/statistical/prect/orig-data/PRECT.orig.ts10000.nc',
'original_shape': (1, 8),
'dtype': dtype('S1')}
In [5]: ds.to_netcdf("test.nc")
In [6]: ds.to_netcdf("test.nc")
In [7]: %%bash
...: ncdump -h test.nc
...:
...:
netcdf test {
dimensions:
time = UNLIMITED ; // (1 currently)
lat = 192 ;
lon = 288 ;
string8 = 8 ;
ilev = 31 ;
lev = 30 ;
slat = 191 ;
slon = 288 ;
nbnd = 2 ;
@Kevin Paul @Anderson Banihirwe
OK, I have an older version - grrrr
xr.__version__
Out[8]: '0.12.1'
I will update it and re-try.....
@Allison Baker Yes. Update regularly!
@Anderson Banihirwe @Kevin Paul
OK, that fixed it - thank you!
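Since the behavior differed between the two xarray versions reported in this thread, a script could guard against older installs. A minimal sketch, assuming only what the thread shows: 0.12.1 wrote string8 and 0.15.0 kept chars. The release in which the behavior actually changed is not established here, so the threshold is a conservative assumption, and the helper names are hypothetical.

```python
import re

# Version reported in this thread as preserving the 'chars' dimension name.
# Assumption: anything older is treated as unverified, not as known-broken.
KNOWN_GOOD = (0, 15, 0)


def version_tuple(version):
    """Turn a version string such as '0.12.1' into a tuple of up to three ints."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def older_than_known_good(version):
    """True if `version` predates the release confirmed to work in the thread."""
    return version_tuple(version) < KNOWN_GOOD


print(older_than_known_good("0.12.1"))  # True
print(older_than_known_good("0.15.0"))  # False
```

In a script, the check would be called as older_than_known_good(xr.__version__) before writing, with a warning or an upgrade prompt if it returns True.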
Last updated: May 16 2025 at 17:14 UTC