Stream: python-questions

Topic: ufunc for scipy.stats


view this post on Zulip Will Wieder (Jul 01 2020 at 12:34):

Does anyone have an example of getting ufunc to work for scipy.stats? Specifically I'm trying to calculate a correlation coefficient (r) and p-values for two xarray data arrays. This example is good for for the correlation coefficient, http://xarray.pydata.org/en/stable/dask.html, but I'd like to just use scipy.stats

view this post on Zulip Will Wieder (Jul 01 2020 at 12:50):

xarray even has a .corr function now, but still not sure how to get significance levels from this?

view this post on Zulip Keith Lindsay (Jul 01 2020 at 12:59):

@Will Wieder , I use xr.apply_ufunc to compute regression slope in this code. I'm passing regression_slope_np_1d_2d, a function I wrote that works on plain numpy arrays, to xr.apply_ufunc. This might serve as a template for you, where you would similarly pass a scipy.stats function to xr.apply_ufunc.

view this post on Zulip Deepak Cherian (Jul 01 2020 at 14:49):

@Will Wieder do you have an example notebook with your attempt? One complication here is that apply_ufunc cannot deal with multiple return values with dask arrays yet. so you'll have to write a wrapper function that stacks r and p into a (2xN) array and return that. And then unpack it to two dataarrays.

This example (https://xarray.pydata.org/en/stable/examples/apply_ufunc_vectorize_1d.html) discusses a lot of the errors you may run in to while doing this.

view this post on Zulip Riley Brady (Jul 01 2020 at 15:45):

@Will Wieder --

(1) A non-apply-ufunc solution would be to use xskillscore (https://github.com/raybellwaves/xskillscore/). I've done a lot of development here, and we have it working pretty efficiently. We use apply_ufunc and pure numpy functions under the hood. And it works with dask. So you could just do r = xskillscore.pearson_r(da1, da2, 'time') to get gridded results of correlation coefficients. Then p = xskillscore.pearson_r_p_value(da1, da2, 'time') to get the significance values.

(2) Getting at what @Deepak Cherian is saying, you can get around the multiple output issue with dask/apply_ufunc with the following:

from scipy import stats
def new_linregress(x, y):
    # Wrapper around scipy linregress to use in apply_ufunc
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    return np.array([slope, intercept, r_value, p_value, std_err])

# return a new DataArray
stats = xr.apply_ufunc(new_linregress, ds[x], ds[y],
                        input_core_dims=[['time'], ['time']],
                        output_core_dims=[["parameter"]],
                        vectorize=True,
                        dask="parallelized",
                        output_dtypes=['float64'],
                        output_sizes={"parameter": 5},
                        )

This will return a dataset with dimension parameter that has slope, intercept, etc. in it.

view this post on Zulip Will Wieder (Jul 01 2020 at 16:43):

Thanks for the suggestions. I'll give this a try. For the the purposes of my current project I just manually calculated the critical values for p<0.05 and masked out grid cells that didn't meet this criteria.

view this post on Zulip Will Wieder (Jul 01 2020 at 17:27):

@Riley Brady thanks for this example. It's really helpful! Can more examples like this be provided in the xarray documentation?!


Last updated: Jan 30 2022 at 12:01 UTC