This notebook was put together by Anderson Banihirwe as part of 2017 CISL/SIParCS Research Project : PySpark for Big Atmospheric & Oceanic Data Analysis
El Niño 3.4 Index Calculation
Introduction¶
This notebook introduces the El Niño index calculation using PySpark to parallelize tasks such as computing monthly means for each grid chunk.
NOAA's operational definitions of El Niño and La Niña conditions are based upon the Oceanic Niño Index [ONI]. The ONI is defined as the 3-month running means of SST anomalies in the Niño 3.4 region [5N-5S, 120-170W].
The ONI is one measure of the El Niño-Southern Oscillation, and other indices can confirm whether features consistent with a coupled ocean-atmosphere phenomenon accompanied these periods.
Computational Recipe¶
- Compute area averaged total SST from Niño 3.4 region.
- Compute monthly climatology (1854 - 2016) for area averaged total SST from Niño 3.4 region, and subtract climatology from area averaged total SST time series to obtain anomalies.
- Smooth the anomalies with a 3-month running mean.
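The three steps above can be sketched without Spark on synthetic data. Everything below (the series, the 6-month warm event, the magnitudes) is made up purely for illustration of the recipe, not part of the real dataset:

```python
import numpy as np

# Synthetic area-averaged monthly SSTs: a seasonal cycle plus an
# artificial 6-month warm event starting at month 60
n_years = 10
months = np.arange(n_years * 12)
sst = 27.0 + 1.5 * np.sin(2 * np.pi * months / 12)
sst[60:66] += 1.0

# Steps 1-2: monthly climatology, then subtract it to get anomalies
climatology = sst.reshape(n_years, 12).mean(axis=0)
anomalies = sst - np.tile(climatology, n_years)

# Step 3: centered 3-month running mean (ONI-style smoothing)
kernel = np.ones(3) / 3.0
oni = np.convolve(anomalies, kernel, mode='same')

print(anomalies[60:66].round(2))
```

Note how the warm event also inflates the climatology slightly (it contributes to the mean of its calendar months), so the anomalies during the event come out just under +1.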
Import Packages¶
from __future__ import print_function
# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')
from glob import glob
import numpy as np
import pandas as pd
import xarray as xr
import dask
from pyspark.sql import SparkSession
from pyspark import SparkConf
from sparkxarray import reader
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
# Setup PySpark Session
spark = SparkSession.builder.appName("el-nino-index-spark").getOrCreate()
sc = spark.sparkContext
Load Data¶
The detailed description of the dataset used in this notebook can be found at this link: NCEP Version 2.0 OI Global SST and NCDC Version 4.0 Extended Reconstructed SST Analyses
paths = "/glade/p/rda/data/ds277.0/ersst.v4.nc/*.nc"
# Explore file metadata with ncdump
!ncdump -h /glade/p/rda/data/ds277.0/ersst.v4.nc/ersst.v4.196201.nc
rdd = reader.nc_multi_read(sc, paths, data_splitting_mode='slice')
rdd.count()
There are 1961 files, one per month, covering January 1854 through May 2017.
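The file count can be checked with a quick bit of arithmetic:

```python
# Months from 1854-01 through 2017-05, inclusive
n_months = (2017 - 1854) * 12 + 5
print(n_months)  # 1961
```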
# Get the first element
rdd.first()
Helper function to create a new RDD whose elements contain only values from our region of interest, the Niño 3.4 box [5N-5S, 120-170W]¶
def get_region_of_interest(dset):
    # Nino 3.4 box: 5S-5N, 170W-120W (lon 190-240 on the 0-360 grid).
    # The latitude slice is slightly wider than +/-5 so the 2-degree
    # ERSST grid points bounding the box are included.
    return dset.sel(lat=slice(-6, 6), lon=slice(190, 240))
region_of_interest = rdd.map(get_region_of_interest).cache()
region_of_interest.count()
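Why `lon=slice(190, 240)`? The dataset stores longitudes on a 0–360 grid, so 170W–120W maps to 190–240. A tiny sketch (the helper name `to_lon_360` is just for illustration):

```python
def to_lon_360(lon):
    """Convert a longitude in [-180, 180] to the [0, 360) convention."""
    return lon % 360

# Nino 3.4 spans 170W to 120W, i.e. -170 to -120 in signed degrees
west_edge = to_lon_360(-170)
east_edge = to_lon_360(-120)
print(west_edge, east_edge)  # 190 240
```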
Area-averaged SST calculation and plotting¶
To compute the area-averaged SST for each month, we first take the simple mean of the temperatures over all longitudes at a given latitude:
$$\bar{T}_{lat} = \frac{1}{nLon}\sum_{i=1}^{nLon}T_{lon, i}$$
Once we have the mean at each latitude, we take the cosine-of-latitude weighted mean of those values, giving the area-averaged temperature for that month:
$$\bar{T}_{month} = \frac{\sum_{j=1}^{nLat} \cos(lat_j) \bar{T}_{lat, j}}{\sum_{j=1}^{nLat}\cos(lat_j)}$$
def global_mean_sst(dset):
    # Mean temperature at each latitude (average over longitude)
    mean_sst_lat = dset.sst.mean(dim='lon')
    # Cosine-of-latitude weights; latitudes are in degrees, so
    # convert to radians before taking the cosine
    weights = np.cos(np.deg2rad(dset.lat))
    num = (weights * mean_sst_lat).sum(dim='lat')
    denom = weights.sum()
    # Area-averaged temperature for this month
    return num / denom
mean_global_temp = region_of_interest.map(global_mean_sst)
mean_global_temp.take(5)
mean_global_temp.count()
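The cosine weighting in `global_mean_sst` can be checked on a few synthetic values (the latitudes and temperatures below are made up). The weighted mean should sit slightly closer to the equatorial value than the unweighted mean of 27.0:

```python
import numpy as np

# Synthetic SSTs at three latitudes, already averaged over longitude
lats = np.array([-4.0, 0.0, 4.0])       # degrees
t_lat = np.array([26.0, 28.0, 27.0])    # degC

# Cosine-of-latitude weights (convert degrees to radians first)
w = np.cos(np.deg2rad(lats))
t_weighted = (w * t_lat).sum() / w.sum()

print(round(float(t_weighted), 4))
```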
Collect the mean SST values for further analysis with xarray¶
There is a slight mismatch in the vertical dimension: files for times before 2017 name it zlev, while files from 2017 rename it to lev. This is a problem when concatenating the xarray DataArrays, so we drop the 2017 months.
mean_sst = mean_global_temp.collect()
mean_sst[0:2]
mean_sst[1956:]
- Concatenate all the mean SST data into one DataArray
a = xr.concat(mean_sst[:1956], dim='time')
- Convert the DataArray into a Dataset
ds = a.to_dataset(name="mean_sst")
ds
Calculate monthly anomalies¶
In climatology, "anomalies" refer to the difference between observations and the typical value (the climatology) for that time of year. Unlike raw observations, anomalies should not show a seasonal cycle.
climatology = ds.groupby('time.month').mean('time')
climatology
anomalies = ds.groupby('time.month') - climatology
anomalies
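The groupby subtraction above can be mimicked in plain pandas on a synthetic series (the dates and seasonal amplitude are made up). With a purely seasonal signal, subtracting the monthly climatology should leave anomalies of essentially zero:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a known, purely seasonal cycle
time = pd.date_range("2000-01-01", periods=48, freq="MS")
seasonal = 2.0 * np.sin(2 * np.pi * (time.month - 1) / 12)
sst = pd.Series(25.0 + seasonal, index=time)

# Monthly climatology, then subtract it month by month
climatology = sst.groupby(sst.index.month).mean()
anomalies = sst - sst.index.month.map(climatology).values

print(float(np.abs(anomalies).max()))
```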
Compute the 3 month running mean for the anomalies¶
df = anomalies['mean_sst'].to_dataframe()
df_anom = anomalies.rolling(min_periods=1, time=3, center=True).mean().to_dataframe()
df.head()
df_anom.head()
# For better visualization, drop the zlev index
df = df.set_index(df.index.droplevel(level=1))
df_anom = df_anom.set_index(df_anom.index.droplevel(level=1))
ax = df['mean_sst'].plot(legend=True, label="unsmoothed anom", figsize=(16, 8))
ax = df_anom['mean_sst'].plot(legend=True, label="smoothed anom")
ax.set_xlabel("Year")
ax.set_ylabel("Oceanic Niño Index (ONI) [°C]")
plt.show()
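The centered rolling mean used above behaves like this minimal pandas example (the values are arbitrary). With `center=True` each point is averaged with its two neighbors, and `min_periods=1` keeps the endpoints instead of producing NaN:

```python
import pandas as pd

anom = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Centered 3-point running mean; endpoints average the 2 available values
smoothed = anom.rolling(window=3, min_periods=1, center=True).mean()
print(smoothed.tolist())  # [1.5, 2.0, 3.0, 4.0, 4.5]
```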