Introduction to Xarray#

ESDS 2024 Annual Event Xarray-Dask Tutorial | January 19th, 2024

Negin Sobhani and Brian Vanderwende, Computational & Information Systems Lab (CISL)
negins@ucar.edu, vanderwb@ucar.edu

In this tutorial, you will learn:#

What is Xarray?
The basic data structures in Xarray.
Read and write netCDF files using Xarray.
Basic computations with Xarray.
High-level computations with Xarray.
Xarray wrapping other array types.

Prerequisites#

Concepts	Importance	Notes
Basic familiarity with NumPy	Necessary
Basic familiarity with Pandas	Necessary
Understanding of NetCDF	Helpful

Time to learn: 75 minutes

Xarray Indexing and Selecting Data#

Xarray supports different ways to index and select data. Xarray indexing is what makes it so powerful for data analysis tasks. 💪

In total, xarray supports four different kinds of indexing, as explained below and summarized in this table:

Dimension lookup	Index lookup	`DataArray` syntax	`Dataset` syntax
Positional	By integer	`da[:,0]`	not available
Positional	By label	`da.loc[:,'IA']`	not available
By name	By integer	`da.isel(space=0)` or `da[dict(space=0)]`	`ds.isel(space=0)` or `ds[dict(space=0)]`
By name	By label	`da.sel(space='IA')` or `da.loc[dict(space='IA')]`	`ds.sel(space='IA')` or `ds.loc[dict(space='IA')]`
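The four rows of the table can be compared side by side on a small array. This is a minimal sketch: the array `da`, its `time` and `space` dimensions, and the state labels are made up for illustration and are not part of the tutorial dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical 2-D array: 4 time steps x 3 locations labelled by state code
da = xr.DataArray(
    np.arange(12).reshape(4, 3),
    dims=("time", "space"),
    coords={"time": [2001, 2002, 2003, 2004], "space": ["IA", "IL", "IN"]},
)

# 1. Positional dimension lookup, integer index lookup
by_pos_int = da[:, 0]
# 2. Positional dimension lookup, label index lookup
by_pos_label = da.loc[:, "IA"]
# 3. Dimension lookup by name, integer index lookup
by_name_int = da.isel(space=0)
# 4. Dimension lookup by name, label index lookup
by_name_label = da.sel(space="IA")

# All four expressions select the same column of values
print(by_name_label.values)  # [0 3 6 9]
```

Only the last two forms are independent of the order in which the dimensions are stored, which is why they also work on a `Dataset`.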

Positional Indexing Using Dimension Names (`.isel`)#

Xarray eliminates much of the mental overhead of remembering dimension orders by allowing indexing using dimension names instead:

tref.isel(time=0).plot();

Slicing by integer position is also possible. For example, we can plot the first 20 time steps of the variable at a single grid point:

tref.isel(time=slice(0, 20), lat=20, lon=40).plot();

This is great! But it still requires you to know which index corresponds to your desired label. What if you don’t know the index of the label you want?

For example, what if you want to select the data at 25 °N, 210 °E, but you don’t know which indices correspond to that point?
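One way to answer this question is label-based selection with `.sel`, using `method="nearest"` when the requested coordinates are not exact grid values. The sketch below uses a synthetic field on a hypothetical 192 x 288 grid (chosen to resemble the shape of the tutorial data); the variable names are illustrative only.

```python
import numpy as np
import xarray as xr

# Hypothetical regular grid; not the tutorial's actual coordinates
lat = np.linspace(-90, 90, 192)
lon = np.arange(0, 360, 1.25)
field = xr.DataArray(
    np.random.default_rng(0).normal(size=(lat.size, lon.size)),
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
)

# An exact lookup with lat=25 would raise a KeyError because 25.0 is not
# a grid latitude here; method="nearest" picks the closest grid point.
point = field.sel(lat=25, lon=210, method="nearest")

# The selected point carries the coordinates that were actually matched
print(float(point.lat), float(point.lon))
```

No integer index appears anywhere in the selection, so the code keeps working if the grid resolution changes.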

Basic Computations with Xarray#

Xarray DataArrays and Datasets are compatible with arithmetic operators and NumPy array functions.

For example, we can use the arithmetic operators +, -, *, /, ** to add, subtract, multiply, divide, and exponentiate two Xarray objects with the same dimensions, or an Xarray object and a scalar. Let’s convert the temperature values from Kelvin to Celsius by subtracting 273.15 from them:

# convert the units from Kelvin to degrees Celsius
tref_c = tref - 273.15
tref_c[0, :, :].plot();

# (tref[0, :, :] - 273.15).plot();  # this also works

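NumPy functions can be applied in the same way as the arithmetic operators. The following sketch uses a small made-up temperature array (`temps_k` is a hypothetical name) to show that the result of a NumPy universal function is still a DataArray with its dimensions and coordinates intact.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 2 temperature field in Kelvin
temps_k = xr.DataArray(
    [[273.15, 283.15], [293.15, 303.15]],
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0], "lon": [100.0, 110.0]},
)

# Arithmetic with a scalar applies element-wise
temps_c = temps_k - 273.15

# A NumPy ufunc applied to a DataArray returns a DataArray
distance_from_20c = np.abs(temps_c - 20.0)

# Arithmetic between two Xarray objects: anomaly relative to the overall mean
anomaly = temps_c - temps_c.mean()

print(type(distance_from_20c).__name__, distance_from_20c.dims)
```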

Aggregation or Reduction Operations#

Similar to NumPy, Xarray provides a set of basic statistical functions that operate on arrays, such as the mean, standard deviation, variance, minimum, and maximum.

For example, we can compute the mean of the temperature values over the time dimension:

mean_temp = tref.mean(dim="time")
mean_temp.plot();

# standard deviation over all grid points at every time step
tref.std(dim=["lat", "lon"]).plot();
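The effect of the `dim` argument is easiest to see by checking which dimensions survive a reduction. This is a small sketch on synthetic data; the array `data` is hypothetical and much smaller than the tutorial's `tref`.

```python
import numpy as np
import xarray as xr

# Hypothetical array with 2 time steps on a 3 x 4 grid
data = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
)

# Reducing over "time" leaves a 2-D map
time_mean = data.mean(dim="time")

# Reducing over both spatial dimensions leaves a time series
spatial_std = data.std(dim=["lat", "lon"])

# Omitting dim reduces over every dimension and returns a 0-d DataArray
overall_max = data.max()

print(time_mean.dims, spatial_std.dims, overall_max.item())
# ('lat', 'lon') ('time',) 23.0
```

Because dimensions are referred to by name, the same calls work regardless of the order in which the dimensions are stored.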

Visualization (`.plot`)#

We have seen very simple plots earlier. Xarray also lets you easily visualize 3D and 4D datasets by presenting multiple facets (or panels or subplots) showing variations across rows and/or columns.

# facet the seasonal_mean
seasonal_mean.plot(col="season", col_wrap=2);

# contours
seasonal_mean.plot.contour(col="season", levels=20, add_colorbar=True);

# cool line plots too? wut !
seasonal_mean.mean("lon").plot.line(hue="season", y="lat");

Boolean indexing & masking#

Boolean masking, also known as boolean indexing, filters the values of an array based on a specific condition.

A boolean mask refers to a binary array or a boolean-valued (True/False) array that is used as a filter to select specific elements from another array. The boolean mask acts as a criterion or condition, where each element in the mask corresponds to an element in the target array. An element in the target array is selected when the corresponding mask value is True.
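As a minimal sketch of this idea, the example below builds a boolean mask from a comparison and passes it to `.sel`, the same pattern this tutorial uses to pick out July time steps. The array `da` and its values are made up for illustration.

```python
import xarray as xr

# Hypothetical 1-D series with five time steps
da = xr.DataArray(
    [3.0, -1.0, 4.0, -2.0, 5.0],
    dims="time",
    coords={"time": [1, 2, 3, 4, 5]},
)

# A comparison returns a boolean DataArray with the same shape as the data
mask = da > 0

# Passing the mask to .sel keeps only the elements where it is True
selected = da.sel(time=mask)

print(mask.values)      # [ True False  True False  True]
print(selected.values)  # [3. 4. 5.]
```

Note that selecting with a mask changes the length of the `time` dimension, from five elements to three in this case.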

Masking with `where()`#

Indexing methods on Xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked.

By applying .where(), the original data’s shape is maintained, with values masked based on a Boolean condition. Values that satisfy the condition (True) are returned unchanged, while values that do not meet the condition (False) are replaced with a predefined value.

# compute the mean July temperature
t_july = tref_c.sel(time=ds.time.dt.month == 7).mean(dim="time")

# mask out temperatures at or below 0 °C
t_masked = t_july.where(t_july > 0)
t_masked.shape, t_july.shape

((192, 288), (192, 288))

By default, Xarray sets the masked values to NaN. We can supply a different fill value as the second argument to .where():

# replace temperatures at or below 0 °C with 0
t_fillmasked = t_july.where(t_july > 0, 0)

As you can see, in the example above .where() preserved the shape of the original data by masking the values with a boolean condition.

.where() broadcasts the condition against the data before masking, so the condition may have fewer dimensions than the data while the result keeps the data’s full shape.
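The behaviour described above can be checked on a tiny array. This is a sketch with made-up values; `t` and its `lat` coordinate are hypothetical stand-ins for the July temperature field.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 temperature field in degrees Celsius
t = xr.DataArray(
    [[-5.0, 2.0, 7.0], [1.0, -3.0, 4.0]],
    dims=("lat", "lon"),
    coords={"lat": [-10.0, 10.0]},
)

# Values failing the condition become NaN; the shape is unchanged
masked = t.where(t > 0)

# A second argument replaces the NaN fill value
filled = t.where(t > 0, 0)

# A 1-D condition on "lat" is broadcast across "lon"
north = t.where(t.lat > 0)

print(masked.shape, filled.shape, north.shape)  # (2, 3) (2, 3) (2, 3)
```

In `north`, the whole first row is NaN because its latitude is negative, even though the condition itself only has a `lat` dimension.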

from matplotlib import pyplot as plt

# -- make three plots for comparison:
fig, axes = plt.subplots(ncols=3, figsize=(17, 3))

# -- for reference (without masking):
t_july.plot(ax=axes[0], vmin=-45, vmax=45);  # cmap = viridis

# -- masked DataArray (NaN where the condition is False)
t_masked.plot(ax=axes[1], vmin=-45, vmax=45);

# -- masked DataArray with filled values (0 where the condition is False)
t_fillmasked.plot(ax=axes[2], vmin=-45, vmax=45);

Xarray can wrap many NumPy-like arrays#

This notebook has focused on NumPy arrays, but Xarray can wrap other array types too. For example:

dask : distributed parallel arrays & Xarray user guide on Dask
pydata/sparse : sparse arrays
cupy : GPU arrays & cupy-xarray
pint : unit-aware arrays & pint-xarray

We will learn more about Xarray wrapping Dask arrays in the next section.

Here is a quick intro:

Xarray also provides open_mfdataset, which opens multiple files as a single Xarray dataset backed by Dask arrays (instead of NumPy arrays). Passing the argument parallel=True speeds up reading multiple files by opening them in parallel using Dask Delayed under the hood. Working with Dask arrays is especially helpful when the data does not fit into memory.

# Let's check the type of the data underlying our DataArray so far
type(tref.data)

numpy.ndarray

%%time
ds = xr.open_mfdataset(
    sorted(files),
    # concatenate along this dimension
    concat_dim="time",
    # concatenate files in the order provided
    combine="nested",
    # parallelize the reading of individual files using dask
    # This means the returned arrays will be dask arrays
    parallel=True,
    # these are netCDF4 files, use the h5netcdf package to read them
    engine="h5netcdf",
    # hold off on decoding time
    decode_cf=False,
    # specify that data should be automatically chunked
    chunks="auto",
)
ds = xr.decode_cf(ds)
tref_all = ds.TREFHT
tref_all

CPU times: user 1.8 s, sys: 31.5 ms, total: 1.83 s
Wall time: 1.95 s

<xarray.DataArray 'TREFHT' (time: 1032, lat: 192, lon: 288)>
dask.array<chunksize=(120, 192, 288), meta=np.ndarray>
Coordinates:
  * lat      (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 87.17 88.12 89.06 90.0
  * lon      (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8
  * time     (time) object 2015-02-01 00:00:00 ... 2101-01-01 00:00:00
Attributes: (3)

type(tref_all.data)

dask.array.core.Array
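The same switch from NumPy to Dask can be reproduced without any files by chunking an in-memory array. The sketch below assumes Dask is installed; the array shape and the chunk size of 30 are arbitrary choices for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical in-memory array backed by NumPy
da = xr.DataArray(np.ones((120, 4, 8)), dims=("time", "lat", "lon"))
print(type(da.data))

# .chunk() converts the underlying data to a Dask array (requires dask)
chunked = da.chunk({"time": 30})
print(type(chunked.data))
print(chunked.chunks)

# Operations on Dask-backed arrays are lazy: this only builds a task graph
lazy_mean = chunked.mean(dim="time")

# .compute() runs the graph and returns a NumPy-backed DataArray
result = lazy_mean.compute()
print(type(result.data))
```

The Xarray API is the same in both cases; only the type of the wrapped array and the timing of the computation differ.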

Summary#

In this notebook, we have learned about:

Xarray data structures: DataArray and Dataset
Indexing and selecting data
Reading and writing data with Xarray
Basic computations with Xarray
Broadcasting and alignment
Customized workflows using apply_ufunc
