Stream: python-questions
Topic: reshaping data in dictionary
Danica Lombardozzi (Apr 21 2020 at 16:25):
A question from a new python convert: I have a dictionary of 4 CMIP6 models and I'm trying to reshape the time dimension into year x month. I've done this successfully in the past, but using similar code I'm getting an error that I can't figure out.
for sim_name, data in dict.items():
    num_years = data.values.shape[0]//12
    reshaped = data.values.reshape(num_years, 12, *data.values.shape[1:])
The error message I get is related to the "num_years": "'function' object has no attribute 'shape'". Any idea why I'm getting this error message and how to fix the problem? Thanks!
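For reference, a minimal sketch of the year x month reshape being attempted, using a plain NumPy array. The array here is synthetic and assumes time is the first axis with a whole number of years; neither is confirmed for the CMIP6 data in question.

```python
import numpy as np

# Synthetic stand-in for one model: 3 years of monthly values on a 2 x 4 grid,
# with time as the FIRST axis (an assumption; check your own data).
n_years, n_lat, n_lon = 3, 2, 4
monthly = np.arange(n_years * 12 * n_lat * n_lon, dtype=float).reshape(
    n_years * 12, n_lat, n_lon
)

# Split the time axis into (year, month); the spatial axes are carried along.
num_years = monthly.shape[0] // 12
reshaped = monthly.reshape(num_years, 12, *monthly.shape[1:])

print(reshaped.shape)
```

The reshape only works on an actual array, which is why the type of `data` matters in the replies that follow.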
Deepak Cherian (Apr 21 2020 at 16:43):
It thinks data is a function object, which does not have shape. What does data look like?
Also, using dict as a variable name is really weird. You will overwrite Python's built-in dict → the consequences are probably quite bad.
Danica Lombardozzi (Apr 21 2020 at 18:26):
I'm not sure how to tell what data looks like. It is a dictionary with 4 keys and 4 groups (each for a different model). I'm new to python and haven't used dictionaries much. How do I find out this information?
Note that I don't actually use dict as the name of the dictionary -- I just used this in the example so that it was more obvious. I will make a note to never use dict as a name.
I've been trying to get the script onto my GitHub repository, but still having some trouble with that. If you'd like to look at it in the meantime, you can find it on cheyenne here: /glade/work/dll/CTSM_py/notebooks/CMIP6_CO2.ipynb. All I've done so far is read the catalog and put the data into a dictionary. Reshaping the data is the first step I need to start my analysis.
Thanks so much for your help!
Precious Mongwe (Apr 21 2020 at 18:51):
Hi Danica
I'm not sure if you were aware of the locally hosted CMIP6 catalog (Cheyenne glade/cloud also works) and intake-esm, a Python-based platform we use to interact with the data.
Below is a GitHub repo example we wrote as part of the CMIP6 hackathon; it shows how to access the data and do basic operations, including reshaping year x month. Hope this helps.
https://github.com/mara-freilich/cmip6hack-ocean-bgc/blob/master/notebooks/final_analysis.ipynb
Deepak Cherian (Apr 21 2020 at 19:46):
Ah, so data is a Dataset. Each of the DataArrays in a Dataset has a shape attribute, but because all DataArrays in a Dataset need not have the same shape, Dataset.shape does not exist.
Deepak Cherian (Apr 21 2020 at 19:47):
Changing data to data.co2 should do what you want.
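A small sketch of the Dataset versus DataArray distinction described here. The Dataset below is a hypothetical stand-in with made-up sizes, not the actual CMIP6 output. It also suggests a likely reading of the original error: because a Dataset behaves like a mapping, `data.values` is the mapping method rather than an array.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for one entry of the dictionary: a Dataset whose
# variables have different shapes.
ds = xr.Dataset(
    {
        "co2": (("time", "lat"), np.zeros((24, 3))),
        "time_bnds": (("time", "bnds"), np.zeros((24, 2))),
    }
)

# A Dataset behaves like a mapping, so `values` is the mapping method,
# not an array, and there is no Dataset-wide shape.
print(callable(ds.values))
print(hasattr(ds, "shape"))

# Selecting one variable gives a DataArray, which wraps a single array.
co2 = ds.co2
print(type(co2.values).__name__, co2.values.shape)
```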
Deepak Cherian (Apr 21 2020 at 20:06):
"I'm not sure how to tell what data looks like."
Sticking print(data) in the loop is what I did :D
Danica Lombardozzi (Apr 21 2020 at 21:45):
Thanks for sharing this example, Precious! I have been using the CMIP6 catalog, although it seems quite incomplete (only a few models have CO2 data for the historical simulation, and some of these don't have the associated area and land fraction variables required for analysis). It looks like the script you pointed me to might have examples of more powerful ways of searching the catalog than I was previously aware of, so I will play around with some of those!
Danica Lombardozzi (Apr 21 2020 at 21:51):
Looks like this did the trick! It hasn't actually finished yet (it's taking a long time!), but didn't immediately exit with an error. I didn't have to do this in my other script, but I did create the dictionary in a different way. I think I still have to figure out the difference between a Dataset and a dictionary, and when to use each. It seems that a dictionary can actually be a Dataset, at least in this case it is. Thanks for your help!
Danica Lombardozzi (Apr 21 2020 at 21:52):
Thanks! This helped me realize that the time dimension is actually the last dimension, not the first. Will save me a headache later on!
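A sketch of how the position of the time dimension could be checked and handled before reshaping. The DataArray is synthetic, and its dimension order (time last) is only assumed to mirror what was seen in the printed output.

```python
import numpy as np
import xarray as xr

# Hypothetical DataArray with time as the LAST dimension.
co2 = xr.DataArray(
    np.arange(2 * 3 * 24, dtype=float).reshape(2, 3, 24),
    dims=("lat", "lon", "time"),
    name="co2",
)

print(co2.dims)                       # order of the dimensions
time_axis = co2.get_axis_num("time")  # integer position of "time"

# Move time to the front before splitting it into (year, month).
values = np.moveaxis(co2.values, time_axis, 0)
reshaped = values.reshape(values.shape[0] // 12, 12, *values.shape[1:])
print(reshaped.shape)
```

Looking the axis up by name avoids hard-coding `shape[0]` when different models order their dimensions differently.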
Deepak Cherian (Apr 21 2020 at 22:11):
Well, Dataset mimics a dictionary mapping variable names to DataArrays. This makes it both confusing and convenient. In your case you have a dictionary mapping a simulation name to a Dataset containing simulation output (this dataset is in turn a mapping from variable name to actual values).
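The two levels of mapping can be sketched as follows. The model names and the `make_dataset` helper are made up for illustration; only the name `co2_ds` and the variable `co2` come from the thread.

```python
import numpy as np
import xarray as xr


def make_dataset(offset):
    """Build a tiny synthetic Dataset standing in for one model's output."""
    return xr.Dataset({"co2": (("time",), np.arange(12.0) + offset)})


# Outer mapping: simulation name -> Dataset (model names are hypothetical).
co2_ds = {"model_a": make_dataset(0.0), "model_b": make_dataset(100.0)}

for sim_name, data in co2_ds.items():
    # Inner mapping: variable name -> DataArray.
    print(sim_name, type(data).__name__, list(data.keys()))
    print(type(data["co2"]).__name__, data["co2"].shape)
```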
Danica Lombardozzi (Apr 21 2020 at 22:19):
Thanks Deepak! I appreciate this explanation! I think it will take a bit for me to fully wrap my head around this, and this is a great start!
Matt Long (Apr 22 2020 at 15:38):
@Danica Lombardozzi, I wonder: are your intentions with reshaping the data to ultimately compute annual means? Here's a gist that demonstrates how one might do this in xarray:
https://gist.github.com/matt-long/9c1efa02ad08e5f5d29539b4cab54d3c
I expect this capability will be available in GeoCAT.
cc @xdev, @geocat
Danica Lombardozzi (Apr 22 2020 at 15:50):
Thanks Matt! I will take a look to see if it gives me some ideas. I'm actually trying to difference the max and min value for each year, so a bit different from annual means.
Matt Long (Apr 22 2020 at 15:53):
That's even easier. You should be able to apply min and max functions to the xarray.groupby objects.
http://xarray.pydata.org/en/stable/groupby.html
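A minimal sketch of the groupby approach, on synthetic monthly data with a pandas datetime index. CMIP6 output often uses cftime calendars instead, so treat the time axis here as an assumption.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of synthetic monthly values (0, 1, ..., 23).
time = pd.date_range("2000-01-01", periods=24, freq="MS")
co2 = xr.DataArray(np.arange(24.0), coords={"time": time}, dims="time", name="co2")

# Group the months by calendar year, then reduce each group over time.
by_year = co2.groupby("time.year")
amp = by_year.max(dim="time") - by_year.min(dim="time")

print(amp.dims)
print(amp.values)
```

The result is indexed by a `year` coordinate rather than by timestamps.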
Deepak Cherian (Apr 22 2020 at 16:59):
Since you want 'each year', I'd look at resample instead (https://xarray.pydata.org/en/stable/time-series.html#resampling-and-grouped-operations) and do:
data.resample(time="Y").max() - data.resample(time="Y").min()
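The resample suggestion can be sketched on synthetic data as below. This sketch uses the "YS" (year start) frequency alias rather than "Y", on the assumption that recent pandas releases deprecate "Y" in favor of "YE"; the Dataset itself is made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of synthetic monthly values in a one-variable Dataset.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
ds = xr.Dataset({"co2": (("time",), np.arange(24.0))}, coords={"time": time})

# Annual maximum minus annual minimum of the co2 variable.
# "YS" labels each yearly bin by the first day of that year.
amp = ds.co2.resample(time="YS").max() - ds.co2.resample(time="YS").min()

print(amp.sizes["time"])
print(amp.values)
```

Unlike groupby, the result keeps a `time` coordinate with one timestamp per year.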
Danica Lombardozzi (Apr 22 2020 at 18:20):
Thanks for all these great suggestions! Looking forward to trying them out this afternoon!
Brian Bonnlander (Apr 22 2020 at 21:06):
Sorry! I spaced the time! Be right there.
Danica Lombardozzi (Apr 24 2020 at 21:34):
@Deepak Cherian I've successfully figured out the use of resample (thanks!), but I'm getting an error with the subtraction that I can't seem to figure out. I'm wondering if there is something basic that I'm missing here.
for sim_name, data in co2_ds.items():
    datamax = data.resample(time="Y").max()
    datamin = data.resample(time="Y").min()
    amp = datamax - datamin
I printed both datamax and datamin to ensure they are the same size, but I still get an error pointing to amp = datamax - datamin:
TypeError: unsupported operand type(s) for -: 'Array' and 'Array'
I can't figure out why I can't subtract two arrays that are the same size. Any thoughts why this might be happening? Note that I also tried
amp = data.resample(time="Y").max() - data.resample(time="Y").min()
with the same error message.
Thank you!
Deepak Cherian (Apr 24 2020 at 21:35):
What does print(datamax) show?
Danica Lombardozzi (Apr 24 2020 at 21:36):
Here is the output:
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 64, lon: 128, member_id: 1, plev: 19, time: 165)
Coordinates:
  * time       (time) object 1850-12-31 00:00:00 ... 2014-12-31 00:00:00
  * plev       (plev) float64 1e+05 9.25e+04 8.5e+04 7e+04 ... 1e+03 500.0 100.0
  * lon        (lon) float64 0.0 2.812 5.625 8.438 ... 348.8 351.6 354.4 357.2
  * lat        (lat) float64 -87.86 -85.1 -82.31 -79.53 ... 82.31 85.1 87.86
  * member_id  (member_id) <U8 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lon_bnds   (time, lon, bnds) float64 dask.array<chunksize=(1, 128, 2), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 dask.array<chunksize=(1, 64, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object dask.array<chunksize=(1, 2), meta=np.ndarray>
    co2        (time, member_id, plev, lat, lon) float32 dask.array<chunksize=(1, 1, 19, 64, 128), meta=np.ndarray>
Deepak Cherian (Apr 24 2020 at 21:38):
Hmm... I think time_bnds is the culprit. Notice it says object instead of float64. Can you try with amp = datamax.co2 - datamin.co2?
Danica Lombardozzi (Apr 24 2020 at 21:38):
Oh, is this because it's a Dataset? So I have to use amp = datamax.co2 - datamin.co2.
Looks like this works. Since I only have one variable here, is there a way to convert these to DataArrays?
Deepak Cherian (Apr 24 2020 at 21:38):
You can subtract datasets... it will subtract corresponding arrays.
Danica Lombardozzi (Apr 24 2020 at 21:39):
Oh, I see! I'm not sure why the time_bnds are an object, but knowing the solution works. Thanks for your help!
Brian Bonnlander (Apr 24 2020 at 21:40):
Hopefully this is right, but these days I think of Xarray objects as logically equivalent to NetCDF files. You could conceptually subtract one file from another, but rarely is that what you want.
Matt Long (Apr 25 2020 at 13:59):
time_bnds is an object because time has been decoded. You could do something like
ds = ds.set_coords(['time_bnds', 'lat_bnds', 'lon_bnds'])
xarray will not apply the subtraction to the coordinate variables. @Brian Bonnlander, actually it's quite convenient and common to do math on datasets.
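A small sketch of the `set_coords` idea on a synthetic Dataset. Here `lat_bnds` stands in for the bounds variables; the object-dtype `time_bnds` from the real data is not reproduced, and the values are made up.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "co2": (("time", "lat"), np.arange(6.0).reshape(3, 2)),
        "lat_bnds": (("lat", "bnds"), np.array([[-90.0, 0.0], [0.0, 90.0]])),
    }
)

# Promote the bounds to a coordinate so arithmetic leaves it alone.
ds = ds.set_coords(["lat_bnds"])

# Dataset arithmetic now only touches the remaining data variables.
amp = ds.max("time") - ds.min("time")

print(list(amp.data_vars))
print(amp.co2.values)
```

With the bounds stored as coordinates, the whole-Dataset subtraction only has to operate on `co2`.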
Danica Lombardozzi (Apr 27 2020 at 16:33):
Thanks Matt! It's helpful to know how to set the coordinate variables.
Last updated: Jan 30 2022 at 12:01 UTC