Hi all,
I've been using the excellent solution documented here to work with 1D output from CLM. This works great when converting from 1D to 4D output, where the dimensions are (time, pft, lat, lon). My question is how to expand this to deal with PFT-level output with more than 4 dimensions. I'm looking at VEGWP, a field with 5 dimensions (time, nvegwcs, pft, lat, lon), and so far I've not been able to crowbar the 1D output into an xarray.
I've tried the following:
# extract coordinate index locations
ixy = dataset.pfts1d_ixy.astype(int)
jxy = dataset.pfts1d_jxy.astype(int)
nvegwcs = dataset.VEGWP.nvegwcs.astype(int)
vegtype = dataset.pfts1d_itype_veg.astype(int)

result = xr.Dataset()
# we loop over variables so we can specify the appropriate dtype
for var in pfts:
    result[var] = xr.apply_ufunc(
        to_sparse,
        pfts[var],
        nvegwcs,
        vegtype,
        jxy,
        ixy,
        kwargs={"shape": (4, 79, 109, 116)},
        input_core_dims=[["pft"]] * 5,
        output_core_dims=[["nvegwcs", "vegtype", "lat", "lon"]],
        dask="parallelized",
        dask_gufunc_kwargs=dict(
            meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
            output_sizes={"nvegwcs": 4, "vegtype": 79, "lat": 109, "lon": 116},
        ),
        keep_attrs=True,
    )
but this gives me
ValueError: operand to apply_ufunc encountered unexpected dimensions ['nvegwcs'] on an input variable: these are core dimensions on other input or output variables
I wondered if anyone else has tried to do this?
I bet nvegwcs needs to be a core dimension in input_core_dims for the nvegwcs variable
Thanks Deepak! I've tried
for var in pfts:
    result[var] = xr.apply_ufunc(
        to_sparse,
        pfts[var],
        nvegwcs,
        vegtype,
        jxy,
        ixy,
        kwargs={"shape": (4, 79, 109, 116)},
        input_core_dims=[["nvegwcs"]] * 5,
        output_core_dims=[["nvegwcs", "vegtype", "lat", "lon"]],
        dask="parallelized",
        dask_gufunc_kwargs=dict(
            meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
            output_sizes={"nvegwcs": 4, "vegtype": 79, "lat": 109, "lon": 116},
        ),
        keep_attrs=True,
    )
which gets me:
ValueError: operand to apply_ufunc has required core dimensions ['nvegwcs'], but some of these dimensions are absent on an input variable: ['nvegwcs']
I'm wondering if this is happening because nvegwcs doesn't seem to be indexed in the same way as the other dimensions on my input file? Here's the result of printing inputfile.VEGWP:
<xarray.DataArray 'VEGWP' (time: 1020, nvegwcs: 4, pft: 67373)>
dask.array<open_dataset-29b908a318577f8a2e4da34f01186702VEGWP, shape=(1020, 4, 67373), dtype=float32, chunksize=(100, 4, 67373), chunktype=numpy.ndarray>
Coordinates:
* time (time) object 2015-02-01 00:00:00 ... 2100-01-01 00:00:00
Dimensions without coordinates: nvegwcs, pft
Hi @James King, this is an interesting use case. I have not looked at VEGWP output in depth myself. Do you think both nvegwcs and pft need to be listed in input_core_dims?
Tagging @Daniel Kennedy since he may have some insight. This is also a great question to bring up at the CLM science or software engineering meeting, I wonder if anyone else there has encountered this before.
Is that error popping up when you are regridding VEGWP or when you are regridding another variable that doesn't have nvegwcs as a dimension?
On a side note, since it looks like you are doing the same regridding task repeatedly, it might be faster to just explicitly compute the map from 1d->3d, save it somewhere, and reuse it every time you want to regrid.
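That precompute-and-reuse idea could look roughly like this (a minimal sketch; the index values and the file name are made up for illustration): build the stacked integer index map once from the pfts1d_* arrays, save it to disk, and load it on later runs instead of recomputing it for every regridding call.

```python
import numpy as np

# Made-up stand-ins for pfts1d_itype_veg, pfts1d_jxy, pfts1d_ixy
vegtype = np.array([0, 1, 1])
jxy = np.array([1, 1, 2])  # 1-based lat index
ixy = np.array([2, 3, 1])  # 1-based lon index

# compute the 1d->3d index map once and save it (file name is hypothetical)
coords = np.stack([vegtype, jxy - 1, ixy - 1], axis=0)
np.savez("pft_regrid_map.npz", coords=coords)

# ...in a later script, just reload it instead of recomputing:
reloaded = np.load("pft_regrid_map.npz")["coords"]
print(reloaded.shape)  # (3, 3)
```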
input_core_dims=[["nvegwcs"]] * 5,
This isn't right. input_core_dims is a list of lists, where each inner list gives the core dimensions for the corresponding input variable, in order. I think you need
input_core_dims=[
["nvegwcs", "pft"], # for pfts[var] which I assume is VEGWP
["nvegwcs"], # for nvegwcs
["pft"], # for vegtype
["pft"], # for jxy
["pft"], # for ixy
]
For more on apply_ufunc and core dimensions, start here
The [["pft"]] * 5 is a shortcut for [["pft"], ["pft"], ["pft"], ["pft"], ["pft"]]
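To see how per-input core dims behave, here's a self-contained toy example (the function, names, and sizes are made up and are not CLM data): the first input carries two core dimensions, the second only "pft", and apply_ufunc moves each input's core dims to the end before calling the function.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(8.0).reshape(2, 4), dims=["nvegwcs", "pft"])
b = xr.DataArray(np.ones(4), dims=["pft"])

def weighted_sum(x, w):
    # x arrives as (..., nvegwcs, pft), w as (..., pft); reduce over pft
    return (x * w).sum(axis=-1)

out = xr.apply_ufunc(
    weighted_sum,
    a,
    b,
    input_core_dims=[["nvegwcs", "pft"], ["pft"]],  # one list per input
    output_core_dims=[["nvegwcs"]],  # the result keeps nvegwcs
)
print(out.values)  # [ 6. 22.]
```

Note how the inner lists have different lengths: core dims are matched per input, not shared across all of them.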
Thanks a lot for this! Re Daniel's point, the error occurs when attempting to regrid an xarray dataset that contains VEGWP (which has the nvegwcs dimension) as well as other variables which don't have it (e.g. the various pft weights and grid indices). This is supported by the error message I get when I apply Deepak's correction to input_core_dims above:
ValueError: operand to apply_ufunc has required core dimensions ['nvegwcs', 'pft'], but some of these dimensions are absent on an input variable: ['nvegwcs']
I'm attending the CLM meeting this week with some science questions related to this work, so I can ask people there if they've come across this problem before - thanks for the suggestion Katie. I vaguely remember that there might be a solution knocking around which uses IDL but that's not something I know much about
Sounds like you need to dynamically determine input_core_dims and output_core_dims for each variable inside the loop. This should be straightforward.
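That per-variable dispatch could be sketched like this (a hypothetical helper, not from the original code; dimension names follow the thread): inspect each variable's dims and pick the matching core-dim lists before calling apply_ufunc.

```python
import numpy as np
import xarray as xr

def core_dims_for(da):
    """Pick apply_ufunc core dims based on whether `da` carries nvegwcs."""
    if "nvegwcs" in da.dims:
        # five inputs: the variable itself, nvegwcs, vegtype, jxy, ixy
        return (
            [["nvegwcs", "pft"], ["nvegwcs"], ["pft"], ["pft"], ["pft"]],
            [["nvegwcs", "vegtype", "lat", "lon"]],
        )
    # four inputs: the variable itself, vegtype, jxy, ixy
    return [["pft"]] * 4, [["vegtype", "lat", "lon"]]

da = xr.DataArray(np.zeros((2, 3)), dims=["nvegwcs", "pft"])
in_dims, out_dims = core_dims_for(da)
print(in_dims[0], out_dims[0])
```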
I'm looking at VEGWP, a field with 5 (time, nvegwcs, pft, lat, lon)
Ah, the core to_sparse function will need some generalization too. It can only handle 3D and 4D output.
@Deepak Cherian, I'm sure you have ideas for how to do the generalization, but just in case, here is how I did it for the (non-sparse) gridding function in utils.py. It loops through the dimensions of the target (gridded) array to build a list of ndarrays, fill_indices, that will be used to fill the target array with the ungridded array. The key bit is appending ellipses for any dimension other than latitude and longitude. (I also have special handling for vegetation type ivt_str, which might not be necessary.)
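The fill-index loop described above might look roughly like this (a sketch under assumptions: the function and variable names are made up, and slice(None) plays the role of the "ellipsis" for any dimension other than lat/lon):

```python
import numpy as np

def build_fill_indices(target_dims, jxy, ixy):
    """Build an index tuple that fills a gridded target from 1D data."""
    fill_indices = []
    for dim in target_dims:
        if dim == "lat":
            fill_indices.append(jxy - 1)  # 1-based -> 0-based
        elif dim == "lon":
            fill_indices.append(ixy - 1)
        else:
            fill_indices.append(slice(None))  # take the whole axis
    return tuple(fill_indices)

# fill a toy (time, lat, lon) target from two ungridded points
target = np.zeros((2, 3, 4))
idx = build_fill_indices(["time", "lat", "lon"], np.array([1, 2]), np.array([1, 3]))
target[idx] = np.array([[1.0, 2.0], [3.0, 4.0]])
print(target[0, 0, 0], target[0, 1, 2])  # 1.0 2.0
```

Because lat and lon get paired integer arrays while the other axes get full slices, every non-lat/lon dimension is filled for all its values, matching the "totally filled" assumption mentioned below for time.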
Nice. Is preserving the sparsity not that important?
The non-sparse method was acceptable for me because I have a beefy laptop, but even so it makes things take a long time. I had never used sparse arrays before, as I was more experienced in MATLAB where (at least when I was first learning) they had too many tradeoffs. However, I think it'd be great to have a generalized sparse function to handle this.
as I was more experienced in MATLAB where (at least when I was first learning) they had too many tradeoffs.
Reminds me of my thesis! :grinning:
These generalized sparse arrays also have tradeoffs (not every operation is supported or efficient), but you can always convert to dense when needed with DataArray.as_numpy and control memory usage that way.
However, I think it'd be great to have a generalized sparse function to handle this.
If you're interested, I'm happy to help a little with this. The core idea is exactly the same: "figure out the indices where data exists", except here they get passed to the sparse.COO constructor instead of being used to assign values in a dense array.
I don't have the bandwidth to really work on this at the moment, sorry! But I think the solution could be like how you handle time: just assume that any axis other than lon/lat is totally filled.
Thanks all for the suggestions - I'd already had a go at generalising the core to_sparse function, but now have a better idea of how to approach it
As an update, I think this works. In the to_sparse function, we add an extra branch to handle data with 3 non-time dimensions:
if data.ndim == 1:
    coords = np.stack([vegtype, jxy - 1, ixy - 1], axis=0)
elif data.ndim == 2:
    itime = np.repeat(np.arange(data.shape[0]), data.shape[1])
    # expand vegtype and friends for all time instants
    # by sequentially concatenating each array for each time instant
    tostack = [np.concatenate([array] * data.shape[0]) for array in [vegtype, jxy - 1, ixy - 1]]
    coords = np.stack([itime] + tostack, axis=0)
elif data.ndim == 3:
    itime = np.repeat(np.arange(data.shape[0]), data.shape[1])
    # expand vegtype and friends for all time instants
    # by sequentially concatenating each array for each time instant
    tostack = [np.concatenate([array] * data.shape[0]) for array in [nvegwcs, vegtype, jxy - 1, ixy - 1]]
    coords = np.stack([itime] + tostack, axis=0)
else:
    raise NotImplementedError
Then, in the convert_pft_variables_to_sparse function, we add a branch to pass the additional inputs when the extra nvegwcs dimension is present:
for var in pfts:
    if "nvegwcs" not in pfts[var].dims:
        result[var] = xr.apply_ufunc(
            to_sparse,
            pfts[var],
            vegtype,
            jxy,
            ixy,
            kwargs={"shape": (79, 109, 116)},
            input_core_dims=[["pft"]] * 4,
            output_core_dims=[["vegtype", "lat", "lon"]],
            dask="parallelized",
            dask_gufunc_kwargs=dict(
                meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
                output_sizes={"vegtype": 79, "lat": 109, "lon": 116},
            ),
            keep_attrs=True,
        )
    else:
        result[var] = xr.apply_ufunc(
            to_sparse,
            pfts[var],
            nvegwcs,
            vegtype,
            jxy,
            ixy,
            kwargs={"shape": (4, 79, 109, 116)},
            input_core_dims=[
                ["nvegwcs", "pft"],  # for pfts[var], i.e. VEGWP
                ["nvegwcs"],  # for nvegwcs
                ["pft"],  # for vegtype
                ["pft"],  # for jxy
                ["pft"],  # for ixy
            ],
            output_core_dims=[["nvegwcs", "vegtype", "lat", "lon"]],
            dask="parallelized",
            dask_gufunc_kwargs=dict(
                meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
                output_sizes={"nvegwcs": 4, "vegtype": 79, "lat": 109, "lon": 116},
            ),
            keep_attrs=True,
        )
Checking the processed output:
output.VEGWP.isel(vegtype=14)
<xarray.DataArray 'VEGWP' (time: 1020, nvegwcs: 4, lat: 109, lon: 116)>
dask.array<getitem, shape=(1020, 4, 109, 116), dtype=float32, chunksize=(100, 4, 109, 116), chunktype=sparse.COO>
Coordinates:
* lat (lat) float64 -35.01 -34.54 -34.07 -33.6 ... 14.33 14.8 15.27 15.74
* lon (lon) float64 0.0 0.625 1.25 1.875 2.5 ... 357.5 358.1 358.8 359.4
* time (time) object 2015-02-01 00:00:00 ... 2100-01-01 00:00:00
vegtype |S40 b'c4_grass '
Dimensions without coordinates: nvegwcs
However, I get some weird errors when I try to convert the output to a numpy array (which I need for plotting later on in my code):
output.VEGWP.isel(vegtype=14).values
Traceback (most recent call last):
File "/home/users/j_king25/scripts/plot_1d_array_updated_vegwp.py", line 264, in <module>
print (datab_VEGWP.values)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/xarray/core/dataarray.py", line 642, in values
return self.variable.values
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/xarray/core/variable.py", line 512, in values
return _as_array_or_item(self._data)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/xarray/core/variable.py", line 252, in _as_array_or_item
data = np.asarray(data)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/array/core.py", line 1689, in __array__
x = self.compute()
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/base.py", line 315, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/base.py", line 603, in compute
results = schedule(dsk, keys, **kwargs)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
results = get_async(
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
raise_exception(exc, tb)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
raise exc
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
result = _execute_task(task, data)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
return func(*(_execute_task(a, cache) for a in args))
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
return func(*(_execute_task(a, cache) for a in args))
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/optimization.py", line 990, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/home/users/j_king25/scripts/plot_1d_array_updated_vegwp.py", line 138, in to_sparse
coords = np.stack([itime] + tostack, axis=0)
File "<__array_function__ internals>", line 180, in stack
File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/numpy/core/shape_base.py", line 426, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
Hmm this is tricky to debug without seeing the full code/source data. @James King this might also be a good question for office hours!
I know this is a bit hacky, but you might just turn VEGWP into four 'normal' variables before doing the regridding:
for v in ds.nvegwcs.values:
    ds['VEGWP' + str(v)] = ds.VEGWP.sel(nvegwcs=v)
ds = ds.drop('VEGWP')
Then VEGWP0,VEGWP1,... would behave as you are used to
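Since the file reports "Dimensions without coordinates: nvegwcs", a positional variant of the same idea using isel/sizes works without needing an nvegwcs coordinate. A toy demo (the array sizes here are made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"VEGWP": (("time", "nvegwcs", "pft"), np.zeros((2, 4, 3)))})

# split VEGWP into one "normal" 2D variable per nvegwcs level
for v in range(ds.sizes["nvegwcs"]):
    ds["VEGWP" + str(v)] = ds.VEGWP.isel(nvegwcs=v)
ds = ds.drop_vars("VEGWP")

print(list(ds.data_vars))  # ['VEGWP0', 'VEGWP1', 'VEGWP2', 'VEGWP3']
```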
can you point me to a dataset?
@Daniel Kennedy it's less hacky than what I just posted! I'll try this and see if it improves the outcome
@Deepak Cherian I've put some test data in /glade/work/jamesking/VEGWP/vegwp_merge_test
, and the raw output files are in /glade/scratch/jamesking/archive/i.clm5.AfrSSP370_climate_only.004/lnd/hist/i.clm5.AfrSSP370_climate_only.004.clm2.h0*
You should have access, but let me know if you look at them and run into any problems with permissions etc.
Last updated: May 16 2025 at 17:14 UTC