edit Oceanic-Niño-Index

This notebook was put together by Anderson Banihirwe as part of 2017 CISL/SIParCS Research Project : PySpark for Big Atmospheric & Oceanic Data Analysis

El Niño 3.4 Index Calculation

Introduction¶

This notebook will introduce El Niño Index Calculation using PySpark to parallelize a number of tasks like computation of monthly averages for a given grid chunk, etc.

NOAA's operational definitions of El Niño and La Niña conditions are based upon the Oceanic Niño Index [ONI]. The ONI is defined as the 3-month running means of SST anomalies in the Niño 3.4 region [5N-5S, 120-170W].

The ONI is one measure of the El Niño-Southern Oscillation, and other indices can confirm whether features consistent with a coupled ocean-atmosphere phenomenon accompanied these periods.

Computational Recipe¶

  • Compute area averaged total SST from Niño 3.4 region.
  • Compute monthly climatology (1854 - 2016) for area averaged total SST from Niño 3.4 region, and subtract climatology from area averaged total SST time series to obtain anomalies.
  • Smooth the anomalies with a 3-month running mean.

Import Packages¶

In [1]:
# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession
from pyspark import SparkConf
from sparkxarray import reader
import numpy as np
import xarray as xr
import dask
from __future__ import print_function
from glob import glob
import pandas as pd
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
import matplotlib.pyplot as plt
In [2]:
# Setup PySpark Session
spark = SparkSession.builder.appName("el-nino-index-spark").getOrCreate()
In [3]:
sc = spark.sparkContext

Load Data¶

back to top

The detailed description of the dataset used in this notebook can be found at this link: NCEP Version 2.0 OI Global SST and NCDC Version 4.0 Extended Reconstructed SST Analyses

In [4]:
paths = "/glade/p/rda/data/ds277.0/ersst.v4.nc/*.nc"
In [5]:
# Explore file metadata with ncdump
!ncdump -h /glade/p/rda/data/ds277.0/ersst.v4.nc/ersst.v4.196201.nc 
netcdf ersst.v4.196201 {
dimensions:
    time = 1 ;
    zlev = 1 ;
    lat = 89 ;
    lon = 180 ;
    nv = 2 ;
variables:
    float time(time) ;
        time:long_name = "Center time of the day" ;
        time:standard_name = "time" ;
        time:axis = "T" ;
        time:units = "days since 1854-01-15" ;
        time:delta_t = "0000-01-00" ;
        time:avg_period = "0000-01-00" ;
    float zlev(zlev) ;
        zlev:long_name = "Depth of sea surface temperature measurements" ;
        zlev:standard_name = "depth" ;
        zlev:units = "meters" ;
        zlev:_CoordinateAxisType = "Height" ;
        zlev:comment = "Measurement depth of in situ sea surface temperature varies" ;
        zlev:axis = "Z" ;
        zlev:positive = "down" ;
    float lat(lat) ;
        lat:long_name = "Latitude" ;
        lat:standard_name = "latitude" ;
        lat:axis = "Y" ;
        lat:bounds = "lat_bnds" ;
        lat:units = "degrees_north" ;
        lat:grids = "Uniform grid from -88 to 88 by 2" ;
    float lon(lon) ;
        lon:long_name = "Longitude" ;
        lon:standard_name = "longitude" ;
        lon:axis = "X" ;
        lon:bounds = "lon_bnds" ;
        lon:units = "degrees_east" ;
        lon:grids = "Uniform grid from 0 to 358 by 2" ;
    float lat_bnds(lat, nv) ;
        lat_bnds:units = "degrees_north" ;
        lat_bnds:comment = "This variable defines the latitude values at the south and north bounds of every 2-degree pixel" ;
    float lon_bnds(lon, nv) ;
        lon_bnds:units = "degrees_east" ;
        lon_bnds:comment = "This variable defines the longitude values at the west and east bounds of every 2-degree pixel" ;
    short sst(time, zlev, lat, lon) ;
        sst:long_name = "Extended reconstructed sea surface temperature" ;
        sst:standard_name = "sea_surface_temperature" ;
        sst:units = "degree_C" ;
        sst:_FillValue = -999s ;
        sst:add_offset = 0.f ;
        sst:scale_factor = 0.01f ;
        sst:valid_min = -300s ;
        sst:valid_max = 4500s ;
    short anom(time, zlev, lat, lon) ;
        anom:long_name = "Extended reconstructed SST anomalies" ;
        anom:units = "degree_C" ;
        anom:_FillValue = -999s ;
        anom:add_offset = 0.f ;
        anom:scale_factor = 0.01f ;
        anom:valid_min = -1200s ;
        anom:valid_max = 1200s ;

// global attributes:
        :Conventions = "CF-1.6" ;
        :Metadata_Conventions = "CF-1.6, Unidata Dataset Discovery v1.0" ;
        :metadata_link = "C00884" ;
        :id = "ersst.v4.196201" ;
        :naming_authority = "gov.noaa.ncdc" ;
        :title = "NOAA Extended Reconstructed Sea Surface Temperature (ERSST), Version 4 (in situ only)" ;
        :summary = "ERSST.v4 is developped based on v3b after revisions of 11 parameters using updated data sets and advanced knowledge of ERSST analysis" ;
        :institution = "NOAA/NESDIS/NCDC" ;
        :creator_name = "Boyin Huang" ;
        :creator_email = "boyin.huang@noaa.gov" ;
        :date_created = "2014-10-24" ;
        :production_version = "Beta Version 4" ;
        :history = "Version 4 based on Version 3b" ;
        :publisher_name = "Boyin Huang" ;
        :publisher_email = "boyin.huang@noaa.gov" ;
        :publisher_url = "http://www.ncdc.noaa.gov" ;
        :creator_url = "http://www.ncdc.noaa.gov" ;
        :license = "No constraints on data access or use" ;
        :time_coverage_start = "1962-01-15T000000Z" ;
        :time_coverage_end = "1962-01-15T000000Z" ;
        :geospatial_lon_min = "-1.0f" ;
        :geospatial_lon_max = "359.0f" ;
        :geospatial_lat_min = "-89.0f" ;
        :geospatial_lat_max = "89.0f" ;
        :geospatial_lat_units = "degrees_north" ;
        :geospatial_lat_resolution = "2.0" ;
        :geospatial_lon_units = "degrees_east" ;
        :geospatial_lon_resolution = "2.0" ;
        :spatial_resolution = "2.0 degree grid" ;
        :cdm_data_type = "Grid" ;
        :processing_level = "L4" ;
        :standard_name_vocabulary = "CF Standard Name Table v27" ;
        :keywords = "Earth Science > Oceans > Ocean Temperature > Sea Surface Temperature &gt" ;
        :keywords_vocabulary = "NASA Global Change Master Directory (GCMD) Science Keywords" ;
        :project = "NOAA Extended Reconstructed Sea Surface Temperature (ERSST)" ;
        :platform = "Ship and Buoy SSTs from ICOADS R2.5 and NCEP GTS" ;
        :instrument = "Conventional thermometers" ;
        :source = "ICOADS R2.5 SST, NCEP GTS SST, HadISST ice, NCEP ice" ;
        :comment = "SSTs were observed by conventional thermometers in Buckets (insulated or un-insulated canvas and wooded buckets) or Engine Room Intaker" ;
        :references = "Huang et al, 2014: Extended Reconstructed Sea Surface Temperatures Version 4 (ERSST.v4), Part I. Upgrades and Intercomparisons. Journal of Climate, DOI: 10.1175/JCLI-D-14-00006.1." ;
        :climatology = "Climatology is based on 1971-2000 SST, Xue, Y., T. M. Smith, and R. W. Reynolds, 2003: Interdecadal changes of 30-yr SST normals during 1871.2000. Journal of Climate, 16, 1601-1612." ;
        :description = "In situ data: ICOADS2.5 before 2007 and NCEP in situ data from 2008 to present. Ice data: HadISST ice before 2010 and NCEP ice after 2010." ;
}

RDD creation with spark-xarray¶

Create Spark RDD using spark-xarray package. You can get it from here

In [6]:
rdd = reader.nc_multi_read(sc, paths, data_splitting_mode='slice')
In [7]:
rdd.count()
Out[7]:
1961

There are 1961 files for months from 01-01-1854 to 05-31-2017. Each file contains monthly data for a particular month.

In [8]:
# Get the first element
rdd.first()
Out[8]:
<xarray.Dataset>
Dimensions:   (lat: 89, lon: 180, nv: 2, time: 1, zlev: 1)
Coordinates:
  * time      (time) datetime64[ns] 1854-01-15
  * zlev      (zlev) float32 0.0
  * lat       (lat) float32 -88.0 -86.0 -84.0 -82.0 -80.0 -78.0 -76.0 -74.0 ...
  * lon       (lon) float32 0.0 2.0 4.0 6.0 8.0 10.0 12.0 14.0 16.0 18.0 ...
Dimensions without coordinates: nv
Data variables:
    lat_bnds  (lat, nv) float32 -89.0 -87.0 -87.0 -85.0 -85.0 -83.0 -83.0 ...
    lon_bnds  (lon, nv) float32 -1.0 1.0 1.0 3.0 3.0 5.0 5.0 7.0 7.0 9.0 9.0 ...
    sst       (time, zlev, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    anom      (time, zlev, lat, lon) float64 nan nan nan nan nan nan nan nan ...
Attributes:
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0
    metadata_link:              C00884
    id:                         ersst.v4.185401
    naming_authority:           gov.noaa.ncdc
    title:                      NOAA Extended Reconstructed Sea Surface Tempe...
    summary:                    ERSST.v4 is developped based on v3b after rev...
    institution:                NOAA/NESDIS/NCDC
    creator_name:               Boyin Huang
    creator_email:              boyin.huang@noaa.gov
    date_created:               2014-10-24
    production_version:         Beta Version 4
    history:                    Version 4 based on Version 3b
    publisher_name:             Boyin Huang
    publisher_email:            boyin.huang@noaa.gov
    publisher_url:              http://www.ncdc.noaa.gov
    creator_url:                http://www.ncdc.noaa.gov
    license:                    No constraints on data access or use
    time_coverage_start:        1854-01-15T000000Z
    time_coverage_end:          1854-01-15T000000Z
    geospatial_lon_min:         -1.0f
    geospatial_lon_max:         359.0f
    geospatial_lat_min:         -89.0f
    geospatial_lat_max:         89.0f
    geospatial_lat_units:       degrees_north
    geospatial_lat_resolution:  2.0
    geospatial_lon_units:       degrees_east
    geospatial_lon_resolution:  2.0
    spatial_resolution:         2.0 degree grid
    cdm_data_type:              Grid
    processing_level:           L4
    standard_name_vocabulary:   CF Standard Name Table v27
    keywords:                   Earth Science &gt; Oceans &gt; Ocean Temperat...
    keywords_vocabulary:        NASA Global Change Master Directory (GCMD) Sc...
    project:                    NOAA Extended Reconstructed Sea Surface Tempe...
    platform:                   Ship and Buoy SSTs from ICOADS R2.5 and NCEP GTS
    instrument:                 Conventional thermometers
    source:                     ICOADS R2.5 SST, NCEP GTS SST, HadISST ice, N...
    comment:                    SSTs were observed by conventional thermomete...
    references:                 Huang et al, 2014: Extended Reconstructed Sea...
    climatology:                Climatology is based on 1971-2000 SST, Xue, Y...
    description:                In situ data: ICOADS2.5 before 2007 and NCEP ...

Helper Function to create a new RDD with elements containing values from our region of interest [5N-5S, 120-170W]¶

Image Source:(http://www.ubergizmo.com/how-to/read-gps-coordinates/)

In [9]:
def get_region_of_interest(dset):
    return dset.sel(lat=slice(-6, 6), lon=slice(190, 240))
In [10]:
region_of_interest = rdd.map(get_region_of_interest).cache()
In [11]:
region_of_interest.count()
Out[11]:
1961

Global mean Temperature Calculation and plotting¶

back to top

To compute the global mean temperature for each month, We calculate the mean temperature at each latitude for that month, i.e., take the simple mean of the temperatures of all longitudes at a given latitude:

$$\bar{T}_{lat} = \frac{1}{nLon}\sum_{i=1}^{nLon}T_{lon, i}$$

Once we have the mean at each latitude, take the cosine weighted mean of those, giving the mean global temperature for this month:

$$\bar{T}_{month} = \frac{\sum_{j=1}^{nLat} \cos(lat_j) \bar{T}_{lat, j}}{\sum_{j=1}^{nLat}\cos(lat_j)}$$
In [12]:
def global_mean_sst(dset):
    # Find mean temperature for each latitude
    mean_sst_lat = dset.sst.mean(dim='lon')

    # Find Weighted mean of those values
    num =(np.cos(dset.lat) * mean_sst_lat).sum(dim='lat')
    denom = np.sum(np.cos(dset.lat))

    # Find mean global temperature
    mean_global_temp = num / denom

    return mean_global_temp
In [13]:
mean_global_temp = region_of_interest.map(global_mean_sst)
In [14]:
mean_global_temp.take(5)
Out[14]:
[<xarray.DataArray (time: 1, zlev: 1)>
 array([[ 27.226424]])
 Coordinates:
   * time     (time) datetime64[ns] 1854-01-15
   * zlev     (zlev) float32 0.0, <xarray.DataArray (time: 1, zlev: 1)>
 array([[ 27.234468]])
 Coordinates:
   * time     (time) datetime64[ns] 1854-02-15
   * zlev     (zlev) float32 0.0, <xarray.DataArray (time: 1, zlev: 1)>
 array([[ 27.370031]])
 Coordinates:
   * time     (time) datetime64[ns] 1854-03-15
   * zlev     (zlev) float32 0.0, <xarray.DataArray (time: 1, zlev: 1)>
 array([[ 27.562093]])
 Coordinates:
   * time     (time) datetime64[ns] 1854-04-15
   * zlev     (zlev) float32 0.0, <xarray.DataArray (time: 1, zlev: 1)>
 array([[ 27.820273]])
 Coordinates:
   * time     (time) datetime64[ns] 1854-05-15
   * zlev     (zlev) float32 0.0]
In [15]:
mean_global_temp.count()
Out[15]:
1961

Collect the mean sst values for further analysis with xarray¶

There is a slight mismatch in the zlev dimension. For files for times before Year = 2017, zlev is named zlev and any file from 2017 has zlev renamed to lev. This is a potential issue when concatenating xarray dataarray.

As a result we drop those months for year of 2017.

In [16]:
mean_sst = mean_global_temp.collect()
mean_sst[0:2]
Out[16]:
[<xarray.DataArray (time: 1, zlev: 1)>
 array([[ 27.226424]])
 Coordinates:
   * time     (time) datetime64[ns] 1854-01-15
   * zlev     (zlev) float32 0.0, <xarray.DataArray (time: 1, zlev: 1)>
 array([[ 27.234468]])
 Coordinates:
   * time     (time) datetime64[ns] 1854-02-15
   * zlev     (zlev) float32 0.0]
In [17]:
mean_sst[1956:]
Out[17]:
[<xarray.DataArray (time: 1, lev: 1)>
 array([[ 27.68285]])
 Coordinates:
   * lev      (lev) float64 0.0
   * time     (time) datetime64[ns] 2017-01-01,
 <xarray.DataArray (time: 1, lev: 1)>
 array([[ 27.924024]])
 Coordinates:
   * lev      (lev) float64 0.0
   * time     (time) datetime64[ns] 2017-02-01,
 <xarray.DataArray (time: 1, lev: 1)>
 array([[ 28.118005]])
 Coordinates:
   * lev      (lev) float64 0.0
   * time     (time) datetime64[ns] 2017-03-01,
 <xarray.DataArray (time: 1, lev: 1)>
 array([[ 28.527588]])
 Coordinates:
   * lev      (lev) float64 0.0
   * time     (time) datetime64[ns] 2017-04-01,
 <xarray.DataArray (time: 1, lev: 1)>
 array([[ 28.697374]])
 Coordinates:
   * lev      (lev) float64 0.0
   * time     (time) datetime64[ns] 2017-05-01]
  • Concatenate all the mean sst data in one datarray
In [18]:
a = xr.concat(mean_sst[:1956], dim='time')
  • Convert the dataarray into a dataset
In [19]:
ds = a.to_dataset(name="mean_sst")
In [20]:
ds
Out[20]:
<xarray.Dataset>
Dimensions:   (time: 1956, zlev: 1)
Coordinates:
  * zlev      (zlev) float32 0.0
  * time      (time) datetime64[ns] 1854-01-15 1854-02-15 1854-03-15 ...
Data variables:
    mean_sst  (time, zlev) float64 27.23 27.23 27.37 27.56 27.82 27.56 27.37 ...

Calculate monthly anomalies¶

In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.

In [21]:
climatology = ds.groupby('time.month').mean('time')
climatology
Out[21]:
<xarray.Dataset>
Dimensions:   (month: 12, zlev: 1)
Coordinates:
  * zlev      (zlev) float32 0.0
  * month     (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    mean_sst  (month, zlev) float64 27.28 27.33 27.53 27.75 27.9 27.78 27.68 ...
In [22]:
anomalies = ds.groupby('time.month') - climatology
anomalies
Out[22]:
<xarray.Dataset>
Dimensions:   (time: 1956, zlev: 1)
Coordinates:
  * zlev      (zlev) float32 0.0
  * time      (time) datetime64[ns] 1854-01-15 1854-02-15 1854-03-15 ...
    month     (time) int64 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 ...
Data variables:
    mean_sst  (time, zlev) float64 -0.052 -0.09665 -0.1622 -0.1913 -0.07647 ...

Compute the 3 month running mean for the anomalies¶

In [23]:
df = anomalies['mean_sst'].to_dataframe()
In [24]:
df_anom = anomalies.rolling(min_periods=1, time=3, center=True).mean().to_dataframe()
In [25]:
df.head()
Out[25]:
month mean_sst
time zlev
1854-01-15 00:00:00 0.0 1 -0.052002
1854-02-15 00:00:00 0.0 2 -0.096649
1854-03-15 00:00:00 0.0 3 -0.162221
1854-04-15 00:00:00 0.0 4 -0.191310
1854-05-15 00:00:00 0.0 5 -0.076473
In [26]:
df_anom.head()
Out[26]:
month mean_sst
time zlev
1854-01-15 00:00:00 0.0 1 -0.074325
1854-02-15 00:00:00 0.0 2 -0.103624
1854-03-15 00:00:00 0.0 3 -0.150060
1854-04-15 00:00:00 0.0 4 -0.143335
1854-05-15 00:00:00 0.0 5 -0.165060
In [27]:
# For better visualization, drop the zlev index
df = df.set_index(df.index.droplevel(level=1))
df_anom = df_anom.set_index(df_anom.index.droplevel(level=1))
In [28]:
ax = df['mean_sst'].plot(legend=True, label="unsmoothed anom", figsize=(16, 8))
ax = df_anom['mean_sst'].plot(legend=True, label="smoothed anom")
ax.set_xlabel("Year")
ax.set_ylabel("Oceanic Nino Index (ONI)[C]")
plt.show()