I am trying to analyze and plot some high res (0.25 deg) CAM output from the iHESP simulations. It looks like the data is indexed by ncol, rather than being on a lat/lon grid. I've never dealt with such data before.
I found this really helpful post on dealing with CAM-SE output, which seems to have the same format:
https://ncar.github.io/esds/posts/2023/cam-se-analysis/#section3
I'm trying to follow the third example on that blog post, and a few questions/issues pop up:
1) Is it fair to use the weights file provided in the example on this blog post? The ncol dimensions I have are the same as theirs.
2) I receive an error when trying to regrid (method 3 in the blog post), which says:
import xesmf
regridded = regrid_cam_se(mfds, f"{map_path}/{map_file}")
MissingDimensionsError: 'lat' has more than 1-dimension and the same name as one of its dimensions ('lat', 'lon'). xarray disallows such variables because they conflict with the coordinates used to label dimensions
Anyone have thoughts or pointers?
Hi @Mira Berdahl. I can't answer your first question, but you may be able to solve your second one by using a version of Xarray >= 2023.08.0. It appears that version and later can handle data like yours. See this Xarray GitHub issue for more details: https://github.com/pydata/xarray/issues/2233 -- the error there looks exactly like the one you have.
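If you want to double-check which version your notebook is actually picking up (independent of what conda reports), a quick check from Python is:

import xarray as xr

# The fix referenced in the GitHub issue needs xarray >= 2023.08.0
print(xr.__version__)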
Thanks @Daniel Adriaansen. I just updated my xarray package in the hope it would help, but now I receive the following error when importing packages (PBSCluster specifically):
from dask_jobqueue import PBSCluster
ImportError: cannot import name 'tmpfile' from 'distributed.utils' (/glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/distributed/utils.py)
Anyone know how to resolve this?
Hi @Mira Berdahl, when you updated Xarray, did you happen to notice if it updated either Dask or Distributed or both? When you do:
conda list dask
and
conda list distributed
what is the output?
Hi @Daniel Adriaansen, yeah, it seems both dask and distributed were upgraded:
- dask-core 2022.1.0 pyhd8ed1ab_0 conda-forge Cached
+ dask-core 2023.10.1 pyhd8ed1ab_0 conda-forge 863kB
- distributed 2022.1.0 py39hf3d152e_0 conda-forge Cached
+ distributed 2023.10.1 pyhd8ed1ab_0 conda-forge 791kB
- xarray 2022.9.0 pyhd8ed1ab_0 conda-forge Cached
+ xarray 2023.12.0 pyhd8ed1ab_0 conda-forge 725kB
- dask 2022.1.0 pyhd8ed1ab_0 conda-forge Cached
+ dask 2023.10.1 pyhd8ed1ab_0 conda-forge 7kB
Also,
(pangeo) mberdahl@casper-login2:/glade/work/mberdahl/miniconda/envs/pangeo> conda list dask
# packages in environment at /glade/work/mberdahl/miniconda/envs/pangeo:
#
# Name Version Build Channel
dask 2023.10.1 pyhd8ed1ab_0 conda-forge
dask-core 2023.10.1 pyhd8ed1ab_0 conda-forge
dask-jobqueue 0.7.3 pyhd8ed1ab_0 conda-forge
dask-labextension 5.2.0 pyhd8ed1ab_0 conda-forge
(pangeo) mberdahl@casper-login2:/glade/work/mberdahl/miniconda/envs/pangeo> conda list distributed
# packages in environment at /glade/work/mberdahl/miniconda/envs/pangeo:
#
# Name Version Build Channel
distributed 2023.10.1 pyhd8ed1ab_0 conda-forge
Maybe the version of dask_jobqueue is the problem? This quickly gets tricky, though, as you are seeing. Maybe see what the latest dask_jobqueue release is and consider updating it also?
@Daniel Adriaansen, alright, so after updating dask_jobqueue separately I can load the packages again. However, when trying to regrid, I get a new error that is triggered at the same line in my function as the former one:
---> regridded = regridder(updated.rename({"dummy": "lat", "ncol": "lon"}))
ValueError: Dimension 1 has 2 blocks, adjust_chunks specified with 1 blocks
I assume this might have something to do with how I specified chunking when reading the files in, which was:
mfds = xr.open_mfdataset(dfiles, combine='by_coords', parallel=True, chunks={"lat": 180, "lon": 240}, data_vars=['U','time_bnds'], drop_variables=['lon'], decode_times=True, compat='override', coords='minimal', preprocess=preprocess)
Actually, maybe it's not related to the chunking on read-in, as I've played around with that option but the error remains.
Where is the ValueError originating? Is it from xesmf?
Hah, I just answered similarly to the question about the import error over on the dask stream.
@Mira Berdahl for your question about the weights file, I think it would be fair to use the one from the blog post on the iHESP output. The file in the blog post is from a very similar CESM1.3 0.25deg simulation setup that John Truesdale made for another project. FWIW it uses bilinear interpolation, so if you want a different method, there is some information in the blog post about how to create your own map file.
@Katie Dagon I'm fine with bilinear interp, so using the weights file from the blog post is good with me - thanks for letting me know. @Daniel Adriaansen I'm trying a new method now (#2 in the blog post https://ncar.github.io/esds/posts/2023/cam-se-analysis/#section3). I can remap now, but I'm running into other errors. I'll post again when I have a better handle on what is going on.
Thanks for testing out the methods in the blog post! It's good to know if they don't work on other datasets.
Oh, by the way, the package versions I was using to test out those methods are listed at the top - if that helps with debugging.
It looks like I can remap CAM-SE output with the following function:
import numpy as np
import scipy.sparse as sps
import xarray as xr

def remap_camse(ds, dsw, varlst=[]):
    # Template output dataset: everything except the ncol-dimensioned variables, filled with NaN
    dso = xr.full_like(ds.drop_dims('ncol'), np.nan)
    # Destination grid cell-center lon/lat from the map file, reshaped to 2D (nlat, nlon)
    lonb = dsw.xc_b.values.reshape([dsw.dst_grid_dims[1].values, dsw.dst_grid_dims[0].values])
    latb = dsw.yc_b.values.reshape([dsw.dst_grid_dims[1].values, dsw.dst_grid_dims[0].values])
    # Sparse matrix of remapping weights; row/col indices are 1-based in the ESMF map file
    weights = sps.coo_matrix((dsw.S, (dsw.row-1, dsw.col-1)), shape=[dsw.dims['n_b'], dsw.dims['n_a']])
    # If no variable list is given, remap every variable that has an ncol dimension
    if not varlst:
        for varname in list(ds):
            if 'ncol' in ds[varname].dims:
                varlst.append(varname)
        if 'lon' in varlst: varlst.remove('lon')
        if 'lat' in varlst: varlst.remove('lat')
        if 'area' in varlst: varlst.remove('area')
    for varname in varlst:
        shape = ds[varname].shape
        # Flatten all leading dimensions so the data is (everything_else, ncol)
        invar_flat = ds[varname].values.reshape(-1, shape[-1])
        # Apply the mapping weights, then restore the leading dims plus the 2D lat/lon grid
        remapped_flat = weights.dot(invar_flat.T).T
        remapped = remapped_flat.reshape([*shape[0:-1], dsw.dst_grid_dims[1].values, dsw.dst_grid_dims[0].values])
        dimlst = list(ds[varname].dims[0:-1])
        dims = {}
        coords = {}
        for it in dimlst:
            dims[it] = dso.dims[it]
            coords[it] = dso.coords[it]
        dims['lat'] = int(dsw.dst_grid_dims[1])
        dims['lon'] = int(dsw.dst_grid_dims[0])
        coords['lat'] = latb[:, 0]
        coords['lon'] = lonb[0, :]
        remapped = xr.DataArray(remapped, coords=coords, dims=dims, attrs=ds[varname].attrs)
        dso = xr.merge([dso, remapped.to_dataset(name=varname)])
    return dso
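To call it, I open the map/weights file as an ordinary dataset and pass it in as dsw, something like this (paths are placeholders for the map file from the blog post):

# Open the ESMF map (weights) file as a regular xarray dataset
weights = xr.open_dataset(f"{map_path}/{map_file}")

# mfds is the CAM-SE dataset with the 'ncol' dimension, opened with open_mfdataset as above
out = remap_camse(mfds, weights)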
However, say I read in two files, a U-wind file and a V-wind file, and run the remap function on the first file (U-wind): it works fine. But if I then call the function on the V-wind file, I get the following error:
/glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/numpy/core/numeric.py:407: RuntimeWarning: invalid value encountered in cast
multiarray.copyto(res, fill_value, casting='unsafe')
/glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/xarray/core/utils.py:494: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`.
warnings.warn(
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/xarray/core/dataset.py:1452, in Dataset._construct_dataarray(self, name)
1451 try:
-> 1452 variable = self._variables[name]
1453 except KeyError:
KeyError: 'U'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
File /glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/xarray/core/dataset.py:1553, in Dataset.__getitem__(self, key)
1552 try:
-> 1553 return self._construct_dataarray(key)
1554 except KeyError as e:
File /glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/xarray/core/dataset.py:1454, in Dataset._construct_dataarray(self, name)
1453 except KeyError:
-> 1454 _, name, variable = _get_virtual_variable(self._variables, name, self.sizes)
1456 needed_dims = set(variable.dims)
File /glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/xarray/core/dataset.py:218, in _get_virtual_variable(variables, key, dim_sizes)
217 if len(split_key) != 2:
--> 218 raise KeyError(key)
220 ref_name, var_name = split_key
KeyError: 'U'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Input In [26], in <module>
1 import scipy as sp
----> 2 out2 = remap_camse(mfds2, weights)
Input In [14], in remap_camse(ds, dsw, varlst)
15 if 'area' in varlst: varlst.remove('area')
16 for varname in varlst:
---> 17 shape = ds[varname].shape
18 invar_flat = ds[varname].values.reshape(-1, shape[-1])
19 remapped_flat = weights.dot(invar_flat.T).T
File /glade/work/mberdahl/miniconda/envs/pangeo/lib/python3.9/site-packages/xarray/core/dataset.py:1555, in Dataset.__getitem__(self, key)
1553 return self._construct_dataarray(key)
1554 except KeyError as e:
-> 1555 raise KeyError(
1556 f"No variable named {key!r}. Variables on the dataset include {shorten_list_repr(list(self.variables.keys()), max_items=10)}"
1557 ) from e
1559 if utils.iterable_of_hashable(key):
1560 return self._copy_listed(key)
KeyError: "No variable named 'U'. Variables on the dataset include ['lev', 'ilev', 'hyam', 'hybm', 'P0', ..., 'f12vmr', 'sol_tsi', 'nsteph', 'V', 'time']"
If I re-run the function definition and then try the V-wind file again, it works. Is there something I need to reset, or change in that script, so that it doesn't keep looking for variable "U" on all subsequent files?
I think @Stephen Yeager wrote this function originally, I'd be curious to hear if he has any thoughts about using it repeatedly for multiple variables.
For others who may encounter this in the future, it appears that adding the line varlst.clear() at the end of the function does the trick. Not clear exactly why; it shouldn't be necessary. Thanks to Fred Castruccio for the help.
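If it helps explain the mystery: this looks like Python's mutable default argument behaviour rather than anything xarray-specific. The default varlst=[] is created once, when the function is defined, and the same list object is reused on every call, so the 'U' appended during the first call is still in the list when the function is called on the V-wind file. A tiny illustration of the mechanism (toy names, not the real function):

def collect(name, acc=[]):
    # 'acc' defaults to a single list object created at definition time,
    # so it accumulates values across separate calls
    acc.append(name)
    return acc

print(collect('U'))  # ['U']
print(collect('V'))  # ['U', 'V'] -- the 'U' from the first call is still there

Defaulting to varlst=None and creating a fresh list inside the function (if varlst is None: varlst = []) would avoid the need for varlst.clear().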
Hi @Mira Berdahl. I just wanted to call your attention to NCAR's Project Raijin effort, which is working with the DOE to develop UXarray: an Xarray-compatible Python package for analyzing and visualizing unstructured grid data without resampling. We just published a new Pythia cookbook that demonstrates plotting with UXarray.
Thanks @John Clyne! I think an example of UXarray applied to CAM-SE unstructured output could be really helpful. I think the cookbook only shows MPAS?
Thanks @John Clyne, good to know about this! Agreed with @Katie Dagon, an example with CAM-SE would be super helpful.
Thanks for the input! @Philip Chmielowiec @Orhan Eroglu (in case you didn't already see this)
UXarray currently supports CAM-SE just like MPAS. Please see this
We will definitely consider putting together CAM-SE examples as well, but you should be able to run any functions on CAM-SE in the way they are documented for MPAS.
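For a CAM-SE file it should look essentially the same as the MPAS examples; roughly like the sketch below (file paths are placeholders, and this follows the documented MPAS pattern rather than a tested CAM-SE script):

import uxarray as ux

# Placeholder paths: a file describing the unstructured CAM-SE grid and a history file with the data
grid_file = "/path/to/cam_se_grid.nc"
data_file = "/path/to/cam_se_history.nc"

# Pair the grid definition with the data so ncol-dimensioned variables become grid-aware
uxds = ux.open_dataset(grid_file, data_file)
print(uxds)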
Hi again, I'm having a strange issue crop up from time to time when using this function (remap_camse). Sometimes it works fine, and other times I end up with the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/glade/u/home/mberdahl/FirstLook_LIGruns/DASK_scripts/HistoricalAnalysis/dask-worker-space/worker-5wchvnxg/storage/%28%27open_dataset-concatenate-b1b2ed4bd191a85fa8152a06f16ef99d%27%2C%206%2C%200%2C%200%29'
The particular line in the function that gets highlighted is this one:
---> 16 invar_flat = ds[varname].values.reshape(-1, shape[-1])
Does anyone know what might be causing this?
@Mira Berdahl and I chatted about this a little bit. We fixed the issue by removing /glade/u/home/mberdahl/FirstLook_LIGruns/DASK_scripts/HistoricalAnalysis/dask-worker-space and replacing it with a softlink to a directory on scratch, but it raises a couple of questions:
1) dask.config.set(temporary_directory='/path/to/dir') didn't work -- it changed temporary_directory (verified via dask.config.get()), but that isn't where these worker files went.
2) Why is it that /glade/u/home would fail, but using /glade/derecho/scratch would be okay? Or did our directory switch coincide with some file system issues working themselves out?
For #1, are there other folder settings you can change in the dask config files (~/.config/dask or ~/.dask)? Maybe in jobqueue.yaml? I think there are log-directory and local_directory settings.
For #2, is home full or close to full?
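If it's easier than editing the YAML directly, I believe those same keys can also be set from Python before the cluster is created; a rough sketch (untested, and the paths are placeholders):

import dask

# These mirror the jobqueue.yaml settings and must be set before PBSCluster() is created
dask.config.set({
    "jobqueue.pbs.local-directory": "/glade/derecho/scratch/<user>/dask-local",
    "jobqueue.pbs.log-directory": "/glade/derecho/scratch/<user>/dask-logs",
})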
Thanks for the suggestions, @Katie Dagon! I'll look in my config files and see if I can find where I set my temp location. I think I also set $TMPDIR in my .bashrc file, but that might be for VSCode instead of dask. Disk space was my first thought as well, but we checked and @Mira Berdahl was only using 2% of her quota; I should have mentioned that in my first reply.
Agreed with @Katie Dagon's response.
@Mira Berdahl and @Michael Levy This indicates spilling to disk: when worker memory hits the spill threshold (default 70%), data gets written to disk. That explains why you only see this occasionally: it happens only when you hit that threshold.
You can set this by using jobqueue.yaml (as Katie suggested) or by passing the local_directory argument to PBSCluster to automatically set the spill directory:
cluster = PBSCluster(
job_name = 'dask-wrkr',
cores = 1,
memory = '4GiB',
processes = 1,
local_directory = '/local_scratch/pbs.$PBS_JOBID/dask/spill',
)
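If you want to confirm where the workers are actually writing their spill files once the cluster is up, you can ask them directly; a quick sketch (the client/cluster names are just whatever you already use in your script):

from dask.distributed import Client

# Attach a client to the PBSCluster created above
client = Client(cluster)

# Each worker reports the directory it uses for spilled data
print(client.run(lambda dask_worker: dask_worker.local_directory))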
Thanks @Negin Sobhani, @Katie Dagon and @Michael Levy. The new line setting the local_directory, in conjunction with limiting the size of the files I'm processing, seems to do the trick.