Stream: python-questions

Topic: Question about working with sparse PFT output


James King (May 12 2023 at 14:01):

Hi all,

I've been using the excellent solution documented here to work with 1D output from CLM. This works great when converting from 1D to 4D output, where the dimensions are (time, pft, lat, lon). My question is how to expand this to deal with PFT-level output with more than 4 dimensions. I'm looking at VEGWP, a field with 5 (time, nvegwcs, pft, lat, lon) and so far I've not been able to crowbar the 1D output into an xarray.
I've tried the following:

# extract coordinate index locations
ixy = dataset.pfts1d_ixy.astype(int)
jxy = dataset.pfts1d_jxy.astype(int)
nvegwcs = dataset.VEGWP.nvegwcs.astype(int)
vegtype = dataset.pfts1d_itype_veg.astype(int)

result = xr.Dataset()
# we loop over variables so we can specify the appropriate dtype
for var in pfts:
    result[var] = xr.apply_ufunc(
        to_sparse,
        pfts[var],
        nvegwcs,
        vegtype,
        jxy,
        ixy,
        kwargs={"shape": (4, 79, 109, 116)},
        input_core_dims=[["pft"]] * 5,
        output_core_dims=[["nvegwcs", "vegtype", "lat", "lon"]],
        dask="parallelized",
        dask_gufunc_kwargs=dict(
            meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
            output_sizes={"nvegwcs": 4, "vegtype": 79, "lat": 109, "lon": 116},
        ),
        keep_attrs=True,
    )

but this gives me

ValueError: operand to apply_ufunc encountered unexpected dimensions ['nvegwcs'] on an input variable: these are core dimensions on other input or output variables

I wondered if anyone else has tried to do this?


Deepak Cherian (May 12 2023 at 15:37):

I bet nvegwcs needs to be a core dimension in input_core_dims for the nvegwcs variable

James King (May 15 2023 at 13:53):

Thanks Deepak! I've tried

for var in pfts:
    result[var] = xr.apply_ufunc(
        to_sparse,
        pfts[var],
        nvegwcs,
        vegtype,
        jxy,
        ixy,
        kwargs={"shape": (4, 79, 109, 116)},
        input_core_dims=[["nvegwcs"]] * 5,
        output_core_dims=[["nvegwcs", "vegtype", "lat", "lon"]],
        dask="parallelized",
        dask_gufunc_kwargs=dict(
            meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
            output_sizes={"nvegwcs": 4, "vegtype": 79, "lat": 109, "lon": 116},
        ),
        keep_attrs=True,
    )

which gets me:

ValueError: operand to apply_ufunc has required core dimensions ['nvegwcs'], but some of these dimensions are absent on an input variable: ['nvegwcs']

I'm wondering if this is happening because nvegwcs doesn't seem to be indexed in the same way as the other dimensions on my input file? Here's the result of printing inputfile.VEGWP:

<xarray.DataArray 'VEGWP' (time: 1020, nvegwcs: 4, pft: 67373)>
dask.array<open_dataset-29b908a318577f8a2e4da34f01186702VEGWP, shape=(1020, 4, 67373), dtype=float32, chunksize=(100, 4, 67373), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 2015-02-01 00:00:00 ... 2100-01-01 00:00:00
Dimensions without coordinates: nvegwcs, pft

Katie Dagon (May 15 2023 at 17:28):

Hi @James King this is an interesting use case. I have not looked at VEGWP output in depth myself. Do you think both nvegwcs and pft need to be listed in input_core_dims?

Tagging @Daniel Kennedy since he may have some insight. This is also a great question to bring up at the CLM science or software engineering meeting, I wonder if anyone else there has encountered this before.

Daniel Kennedy (May 15 2023 at 17:41):

Is that error popping up when you are regridding VEGWP or when you are regridding another variable that doesn't have nvegwcs as a dimension?

On a side note, since it looks like you are doing the same regridding task repeatedly, it might be faster to just explicitly compute the map from 1d->3d, save it somewhere, and reuse it every time you want to regrid.
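Daniel's "compute the map once and reuse it" idea can be sketched with plain numpy. This is a hedged illustration, not CLM code: the names `ixy`/`jxy`/`vegtype` mimic the `pfts1d_*` index variables, but the sizes and values here are made up, and `regrid_1d` is a hypothetical helper.

```python
# Sketch of precomputing the 1d -> gridded index map once and reusing it for
# every variable. All shapes and values are synthetic stand-ins for the CLM
# pfts1d_ixy / pfts1d_jxy / pfts1d_itype_veg index arrays.
import numpy as np

npft, nlat, nlon, nveg = 6, 3, 4, 2
ixy = np.array([1, 2, 3, 4, 1, 2])       # 1-based lon index per pft point
jxy = np.array([1, 1, 2, 2, 3, 3])       # 1-based lat index per pft point
vegtype = np.array([0, 1, 0, 1, 0, 1])   # vegetation type per pft point

# compute the flat target index once; this is the reusable "map"
flat = np.ravel_multi_index((vegtype, jxy - 1, ixy - 1), (nveg, nlat, nlon))

def regrid_1d(data_1d, flat, shape):
    """Scatter a 1D pft vector into a dense (vegtype, lat, lon) array."""
    out = np.full(shape, np.nan)
    out.reshape(-1)[flat] = data_1d
    return out

gridded = regrid_1d(np.arange(npft, dtype=float), flat, (nveg, nlat, nlon))
```

Once `flat` is saved (e.g. to disk), every subsequent variable only needs the cheap scatter step.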

Deepak Cherian (May 15 2023 at 19:48):

input_core_dims=[["nvegwcs"]] * 5,

This isn't right. input_core_dims is a list of lists, where each inner list names the core dimensions of the corresponding input variable, in order. I think you need

input_core_dims=[
    ["nvegwcs", "pft"], # for pfts[var] which I assume is VEGWP
    ["nvegwcs"], # for nvegwcs
    ["pft"], # for vegtype
    ["pft"], # for jxy
    ["pft"],  # for ixy
]

Deepak Cherian (May 15 2023 at 19:50):

For more on apply_ufunc and core dimensions, start here

Deepak Cherian (May 15 2023 at 19:51):

The [["pft"]] * 5 is a shortcut for [ ["pft"], ["pft"], ["pft"], ["pft"], ["pft"] ]
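The per-argument semantics Deepak describes can be seen in a tiny self-contained example (purely illustrative, nothing from the CLM workflow): each inner list of `input_core_dims` names the core dimensions of the matching positional input, and those dimensions are moved to the end before the function is called.

```python
# Minimal apply_ufunc demo: one inner list per input, in order.
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "k"))
b = xr.DataArray(np.ones(3), dims="k")

# a contributes core dim "k"; b contributes core dim "k"; "x" is broadcast
res = xr.apply_ufunc(
    lambda a, b: (a * b).sum(axis=-1),
    a,
    b,
    input_core_dims=[["k"], ["k"]],
)
```

Here `res` has dims `("x",)`: the core dimension "k" is consumed by the function, while "x" passes through.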

James King (May 16 2023 at 11:15):

Thanks a lot for this! Re Daniel's point, the error occurs when attempting to regrid an xarray dataset that contains VEGWP (which has the nvegwcs dimension) as well as other variables which don't have it (e.g. the various pft weights and grid indices). This is supported by the error message I get when I apply Deepak's correction to input_core_dims above:

James King (May 16 2023 at 11:15):

ValueError: operand to apply_ufunc has required core dimensions ['nvegwcs', 'pft'], but some of these dimensions are absent on an input variable: ['nvegwcs']

James King (May 16 2023 at 11:18):

I'm attending the CLM meeting this week with some science questions related to this work, so I can ask people there if they've come across this problem before - thanks for the suggestion Katie. I vaguely remember that there might be a solution knocking around which uses IDL but that's not something I know much about

Deepak Cherian (May 16 2023 at 15:22):

Sounds like you need to dynamically determine input_core_dims and output_core_dims for each variable inside the loop. This should be straightforward.
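One way to do that per-variable dispatch is a small helper keyed on the variable's dims. This is a hedged sketch: `core_dims_for` is a hypothetical function name, and `var_dims` stands in for `pfts[var].dims`.

```python
# Choose input/output core dims per variable, depending on whether it carries
# the extra nvegwcs dimension. Illustrative only, not the canonical solution.
def core_dims_for(var_dims):
    if "nvegwcs" in var_dims:
        # 5 inputs: data, nvegwcs, vegtype, jxy, ixy
        input_core_dims = [["nvegwcs", "pft"], ["nvegwcs"], ["pft"], ["pft"], ["pft"]]
        output_core_dims = [["nvegwcs", "vegtype", "lat", "lon"]]
    else:
        # 4 inputs: data, vegtype, jxy, ixy
        input_core_dims = [["pft"]] * 4
        output_core_dims = [["vegtype", "lat", "lon"]]
    return input_core_dims, output_core_dims

inp, out = core_dims_for(("time", "nvegwcs", "pft"))
```

The returned lists can then be passed straight to `xr.apply_ufunc` inside the loop.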

Deepak Cherian (May 16 2023 at 15:27):

I'm looking at VEGWP, a field with 5 (time, nvegwcs, pft, lat, lon)

Ah the core to_sparse function will need some generalization too. It can only handle 3D and 4D output.

Sam Rabin (May 16 2023 at 15:38):

@Deepak Cherian, I'm sure you have ideas for how to do the generalization, but just in case, here is how I did it for the (non-sparse) gridding function in utils.py. It loops through the dimensions of the target (gridded) array to build a list of ndarrays, fill_indices, that will be used to fill the target array with the ungridded array. The key bit is appending ellipses for any dimension other than latitude and longitude. (I also have special handling for vegetation type ivt_str, which might not be necessary.)

Deepak Cherian (May 16 2023 at 15:46):

Nice. Is preserving the sparsity not that important?

Sam Rabin (May 16 2023 at 15:52):

The non-sparse method was acceptable for me because I have a beefy laptop, but even so it makes things take a long time. I had never used sparse arrays before, as I was more experienced in MATLAB where (at least when I was first learning) they had too many tradeoffs. However, I think it'd be great to have a generalized sparse function to handle this.

Deepak Cherian (May 16 2023 at 15:56):

as I was more experienced in MATLAB where (at least when I was first learning) they had too many tradeoffs.

Reminds me of my thesis! :grinning:

These generalized sparse arrays also have tradeoffs (not every operation is supported or efficient) but you can always convert to dense when needed with DataArray.as_numpy and control memory usage that way.

However, I think it'd be great to have a generalized sparse function to handle this.

If you're interested, I'm happy to help a little with this. The core idea is exactly the same: "figure out the indices where data exists", except here they get passed to the sparse.COO constructor instead of using them for assigning values in a dense array.
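The "figure out the indices where data exists" idea can be sketched with numpy alone. Synthetic sizes throughout; the `sparse.COO` call is shown as a comment since only the index construction is being illustrated.

```python
# Build the per-point (vegtype, lat, lon) coordinates that, together with the
# data values, would be handed to the sparse.COO constructor instead of being
# used to assign into a dense array. All values here are made up.
import numpy as np

nveg, nlat, nlon = 2, 3, 4
vegtype = np.array([0, 1, 0])
jxy = np.array([1, 2, 3])    # 1-based lat index
ixy = np.array([1, 1, 2])    # 1-based lon index
values = np.array([10.0, 20.0, 30.0])

# one column per nonzero entry: (vegtype, lat, lon)
coords = np.stack([vegtype, jxy - 1, ixy - 1], axis=0)

# with the sparse package installed, this would become:
#   arr = sparse.COO(coords, values, shape=(nveg, nlat, nlon))
```

The dense scatter and the sparse constructor consume exactly the same index information; only the destination differs.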

Sam Rabin (May 16 2023 at 16:17):

I don't have the bandwidth to really work on this at the moment, sorry! But I think the solution could be like how you handle time: just assume that any axis other than lon/lat is totally filled.

James King (May 17 2023 at 15:10):

Thanks all for the suggestions - I'd already had a go at generalising the core to_sparse function but now have a better idea of how to approach it

James King (May 18 2023 at 11:33):

As an update, I think this works. In the to_sparse function, we add an extra branch to handle data with the additional nvegwcs dimension:

if data.ndim == 1:
    coords = np.stack([vegtype, jxy - 1, ixy - 1], axis=0)
elif data.ndim == 2:
    itime = np.repeat(np.arange(data.shape[0]), data.shape[1])
    # expand vegtype and friends for all time instants
    # by sequentially concatenating each array for each time instant
    tostack = [np.concatenate([array] * data.shape[0]) for array in [vegtype, jxy - 1, ixy - 1]]
    coords = np.stack([itime] + tostack, axis=0)
elif data.ndim == 3:
    itime = np.repeat(np.arange(data.shape[0]), data.shape[1])
    # expand vegtype and friends for all time instants
    # by sequentially concatenating each array for each time instant
    tostack = [np.concatenate([array] * data.shape[0]) for array in [nvegwcs, vegtype, jxy - 1, ixy - 1]]
    coords = np.stack([itime] + tostack, axis=0)
else:
    raise NotImplementedError

Then, in the convert_pft_variables_to_sparse function, we add a branch to pass the additional arguments when the extra nvegwcs dimension is present:

for var in pfts:
    if "nvegwcs" not in pfts[var].dims:
        result[var] = xr.apply_ufunc(
            to_sparse,
            pfts[var],
            vegtype,
            jxy,
            ixy,
            kwargs={"shape": (79, 109, 116)},
            input_core_dims=[["pft"]] * 4,
            output_core_dims=[["vegtype", "lat", "lon"]],
            dask="parallelized",
            dask_gufunc_kwargs=dict(
                meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
                output_sizes={"vegtype": 79, "lat": 109, "lon": 116},
            ),
            keep_attrs=True,
        )
    else:
        result[var] = xr.apply_ufunc(
            to_sparse,
            pfts[var],
            nvegwcs,
            vegtype,
            jxy,
            ixy,
            kwargs={"shape": (4, 79, 109, 116)},
            input_core_dims=[
                ["nvegwcs", "pft"],  # for pfts[var], i.e. VEGWP
                ["nvegwcs"],  # for nvegwcs
                ["pft"],  # for vegtype
                ["pft"],  # for jxy
                ["pft"],  # for ixy
            ],
            output_core_dims=[["nvegwcs", "vegtype", "lat", "lon"]],
            dask="parallelized",
            dask_gufunc_kwargs=dict(
                meta=sparse.COO(np.array([], dtype=pfts[var].dtype)),
                output_sizes={"nvegwcs": 4, "vegtype": 79, "lat": 109, "lon": 116},
            ),
            keep_attrs=True,
        )

James King (May 18 2023 at 11:35):

Checking the processed output:

output.VEGWP.isel(vegtype=14)
<xarray.DataArray 'VEGWP' (time: 1020, nvegwcs: 4, lat: 109, lon: 116)>
dask.array<getitem, shape=(1020, 4, 109, 116), dtype=float32, chunksize=(100, 4, 109, 116), chunktype=sparse.COO>
Coordinates:
  * lat      (lat) float64 -35.01 -34.54 -34.07 -33.6 ... 14.33 14.8 15.27 15.74
  * lon      (lon) float64 0.0 0.625 1.25 1.875 2.5 ... 357.5 358.1 358.8 359.4
  * time     (time) object 2015-02-01 00:00:00 ... 2100-01-01 00:00:00
    vegtype  |S40 b'c4_grass                                '
Dimensions without coordinates: nvegwcs

James King (May 18 2023 at 12:12):

However, I get some weird errors when I try to convert the output to a numpy array (which I need for plotting later on in my code):

output.VEGWP.isel(vegtype=14).values
Traceback (most recent call last):
  File "/home/users/j_king25/scripts/plot_1d_array_updated_vegwp.py", line 264, in <module>
    print (datab_VEGWP.values)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/xarray/core/dataarray.py", line 642, in values
    return self.variable.values
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/xarray/core/variable.py", line 512, in values
    return _as_array_or_item(self._data)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/xarray/core/variable.py", line 252, in _as_array_or_item
    data = np.asarray(data)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/array/core.py", line 1689, in __array__
    x = self.compute()
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/base.py", line 315, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/base.py", line 603, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/threaded.py", line 89, in get
    results = get_async(
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/local.py", line 319, in reraise
    raise exc
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/local.py", line 224, in execute_task
    result = _execute_task(task, data)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in <genexpr>
    return func(*(_execute_task(a, cache) for a in args))
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/optimization.py", line 990, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/dask/core.py", line 119, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/users/j_king25/scripts/plot_1d_array_updated_vegwp.py", line 138, in to_sparse
    coords = np.stack([itime] + tostack, axis=0)
  File "<__array_function__ internals>", line 180, in stack
  File "/apps/jasmin/jaspy/miniconda_envs/jaspy3.10/m3-4.9.2/envs/jaspy3.10-m3-4.9.2-r20220721/lib/python3.10/site-packages/numpy/core/shape_base.py", line 426, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
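A guess at the cause, based on the ndim == 3 branch above: `nvegwcs` has length 4 while `vegtype`, `jxy`, and `ixy` have length pft, so after concatenation the arrays passed to `np.stack` have different lengths. One way to build coordinate arrays that all share the flattened length `ntime * nveg * npft` is to repeat/tile each index at the rate matching C-order raveling of `(time, nvegwcs, pft)` data. Synthetic sizes below; this is a sketch, not a tested fix for the real files.

```python
# Coordinate arrays for 3D (time, nvegwcs, pft) data, all the same length,
# ordered to match data.ravel() in C order. Sizes are made up for illustration.
import numpy as np

ntime, nveg, npft = 2, 3, 4
vegtype = np.arange(npft)      # stand-ins for the pft-level index arrays
jxy = np.arange(npft) + 1      # 1-based
ixy = np.arange(npft) + 1      # 1-based

itime = np.repeat(np.arange(ntime), nveg * npft)        # slowest-varying
iveg = np.tile(np.repeat(np.arange(nveg), npft), ntime) # middle axis
tostack = [np.tile(arr, ntime * nveg) for arr in (vegtype, jxy - 1, ixy - 1)]
coords = np.stack([itime, iveg] + tostack, axis=0)
```

Each column of `coords` then lines up with the corresponding element of `data.ravel()`.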

Katie Dagon (May 18 2023 at 16:27):

Hmm this is tricky to debug without seeing the full code/source data. @James King this might also be a good question for office hours!

Daniel Kennedy (May 18 2023 at 21:40):

I know this is a bit hacky, but you might just turn VEGWP into four 'normal' variables before doing the regridding:

for v in ds.nvegwcs.values:
    ds['VEGWP' + str(v)] = ds.VEGWP.sel(nvegwcs=v)
ds = ds.drop('VEGWP')

Then VEGWP0,VEGWP1,... would behave as you are used to
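If the split-variable workaround is used, the pieces can be stitched back into a single VEGWP after regridding. A hedged sketch with a synthetic dataset, following Daniel's `VEGWP0`, `VEGWP1`, ... naming convention:

```python
# Recombine split variables along a new nvegwcs dimension with xr.concat.
# The dataset here is synthetic: each VEGWPv is a (lat, lon) field of value v.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {f"VEGWP{v}": (("lat", "lon"), np.full((2, 2), float(v))) for v in range(4)}
)
vegwp = (
    xr.concat([ds[f"VEGWP{v}"] for v in range(4)], dim="nvegwcs")
    .assign_coords(nvegwcs=np.arange(4))
    .rename("VEGWP")
)
```

The new `nvegwcs` dimension is prepended, giving a `(nvegwcs, lat, lon)` array.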

Deepak Cherian (May 19 2023 at 02:02):

can you point me to a dataset?

James King (May 19 2023 at 09:04):

@Daniel Kennedy it's less hacky than what I just posted! I'll try this and see if it improves the outcome

James King (May 19 2023 at 09:12):

@Deepak Cherian I've put some test data in /glade/work/jamesking/VEGWP/vegwp_merge_test, and the raw output files are in /glade/scratch/jamesking/archive/i.clm5.AfrSSP370_climate_only.004/lnd/hist/i.clm5.AfrSSP370_climate_only.004.clm2.h0*
You should have access but let me know if you look at them and run into any problems with permissions etc.


Last updated: May 16 2025 at 17:14 UTC