This notebook was put together by Anderson Banihirwe as part of 2017 CISL/SIParCS Research Project : PySpark for Big Atmospheric & Oceanic Data Analysis
Trimmed Mean¶
- Discards a percentage of the outlying values before calculating the arithmetic average.
- A measure that incorporates characteristics of the mean and the median.
- Less affected by outliers than the untrimmed average.
- An $x\%$ trimmed mean eliminates the largest $x\%$ and the smallest $x\%$ of the sample before calculating the mean.
- Typical range for $x\%$ is $5\%$ to $25\%$.
Source: Wilks, Daniel S. Statistical Methods in the Atmospheric Sciences, p. 26.
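The definition above can be illustrated with a plain-NumPy sketch (not part of the notebook's Spark workflow; `scipy.stats.trim_mean` provides an equivalent ready-made calculation):

```python
import numpy as np

def trimmed_mean(values, trim_fraction=0.2):
    """Drop the lowest and highest `trim_fraction` of the sorted
    sample, then average what remains."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(len(x) * trim_fraction)  # points cut from each end
    return x[k:len(x) - k].mean()

sample = [1.0, 2.0, 3.0, 4.0, 100.0]   # 100.0 is an outlier
trimmed_mean(sample, 0.2)              # drops 1.0 and 100.0 -> 3.0
```

Note how the untrimmed mean of this sample (22.0) is pulled far from the bulk of the data by the single outlier, while the 20% trimmed mean (3.0) is not.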
Example:¶
- Calculate the $20\%$ trimmed mean of spatially averaged temperature data.
The dataset can be found on NCAR's Glade:
/glade/p/CMIP/CMIP5/output1/NOAA-GFDL/GFDL-ESM2M/rcp85/mon/atmos/Amon/r1i1p1/v20111228/ta/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc
Step 1: Load the dataset into a Spark DataFrame¶
In [1]:
from pyspark4climate import read
from pyspark4climate.functions import shift_lon_udf
from pyspark.sql import SparkSession
import geopandas as gpd
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (12, 15)
%matplotlib inline
import matplotlib.pyplot as plt
jet=plt.get_cmap('coolwarm') # Used for multiple scatter plots
In [2]:
spark = SparkSession.builder.appName("trimmed-mean").getOrCreate()
sc = spark.sparkContext
In [3]:
!ncdump -h ../data/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc
In [4]:
filename='../data/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc'
var = 'ta'
In [5]:
data = read.DataFrame(sc, (filename, var), mode='single')
The pyspark4climate `DataFrame` class returns a wrapper object. To access the underlying Spark DataFrame, use its `df` attribute:
In [6]:
type(data)
Out[6]:
In [7]:
data_df = data.df
type(data_df)
Out[7]:
In [8]:
data_df.show()
In [9]:
# Print the schema of data_df dataframe
data_df.printSchema()
Step 2: Shift longitudes on the grid so that they are in the range [-180, 180]¶
To achieve this we will use the pyspark4climate
builtin function shift_lon_udf()
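Conceptually, the shift wraps longitudes from $[0, 360)$ into $[-180, 180]$. A minimal plain-Python sketch of that mapping (the actual shift_lon_udf implementation in pyspark4climate may differ in detail):

```python
def shift_lon(lon):
    """Map a longitude from [0, 360) into [-180, 180]."""
    return lon - 360.0 if lon > 180.0 else lon

shift_lon(300.0)  # -> -60.0 (i.e. 60W)
shift_lon(90.0)   # -> 90.0  (unchanged)
```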
In [10]:
# Shift the longitude grid, then keep the shifted values under the name 'lon'
data_df = data_df.withColumn("shifted_lon", shift_lon_udf(data_df["lon"])).cache()
data_df = data_df.selectExpr("time", "plev", "lat", "shifted_lon as lon", "ta")
In [11]:
data_df.show()
Step 3: Select Temporal and Spatial Domains¶
Select the North America region: 130°W to 60°W, 20°N to 70°N
In [12]:
import pyspark.sql.functions as F
In [13]:
df = data_df.filter((data_df["lon"] <= -60) & (data_df["lon"] >=-130) &\
(data_df["lat"] >=20) & (data_df["lat"] <=70))\
.orderBy(F.col('time'), F.col('lat'), F.col('lon'))
df.show()
Step 4: Calculate the Spatial Average¶
This operation computes a spatial mean. For each month, the temperature data at each spatial grid point is averaged together to generate one value.
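The same reduction can be illustrated on a tiny pandas DataFrame with made-up values (hypothetical toy data, not the GFDL dataset): group by time, then average `ta` over all grid points in each group.

```python
import pandas as pd

# Toy grid: two months, two grid points each (hypothetical values)
toy = pd.DataFrame({
    "time": ["2006-01", "2006-01", "2006-02", "2006-02"],
    "lat":  [25.0, 30.0, 25.0, 30.0],
    "lon":  [-100.0, -100.0, -100.0, -100.0],
    "ta":   [280.0, 290.0, 270.0, 274.0],
})

# One mean value per month, averaged over the spatial grid points
spatial_avg = toy.groupby("time", as_index=False)["ta"].mean()
```

This mirrors the Spark `groupby('time').agg(F.avg('ta'))` below, just on an in-memory frame.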
In [14]:
spatial_avg = df.groupby('time')\
.agg(F.avg('ta').alias('mean_ta'))\
.orderBy(F.col('time')).cache()
spatial_avg.show()
In [17]:
spatial_avg.count()
Out[17]:
Step 5: Calculate the Trimmed Mean¶
- Use Spark's
approxQuantile(col, probabilities, relativeError)
to calculate approximate quantiles of a numerical column of a DataFrame. With probabilities [0.2, 0.8] it returns the approximate 20th and 80th percentile values, i.e. the cutoffs below which the lowest $20\%$ of the values lie and above which the highest $20\%$ lie.
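The trimming logic can be sketched with exact NumPy quantiles on toy data (Spark's approxQuantile trades exactness for speed via its relativeError parameter; `numpy.quantile` is exact):

```python
import numpy as np

values = np.arange(1.0, 11.0)  # toy data: 1.0 .. 10.0

# Analogue of approxQuantile("mean_ta", [0.2, 0.8], ...):
lo, hi = np.quantile(values, [0.2, 0.8])

# Keep only the middle of the distribution, then average it
kept = values[(values > lo) & (values < hi)]
trimmed = kept.mean()
```

Here `lo` ≈ 2.8 and `hi` ≈ 8.2, so the values 3 through 8 are kept and averaged, mirroring the `where` filter applied to `spatial_avg` below.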
In [15]:
lowest_highest_20th_values = spatial_avg.approxQuantile("mean_ta", [0.2, 0.8], 0.001)
lowest_highest_20th_values
Out[15]:
- Discard the lowest $20\%$ of the values and the highest $20\%$ of the values from the dataset.
In [16]:
trimmed_mean = spatial_avg.where((spatial_avg["mean_ta"] > lowest_highest_20th_values[0])\
& (spatial_avg["mean_ta"] < lowest_highest_20th_values[1])).cache()
trimmed_mean.show()
In [18]:
trimmed_mean.count()
Out[18]:
5.1 Plot the Trimmed Mean temperature values¶
In [19]:
# convert the spark dataframe to pandas dataframe for visualization
df = trimmed_mean.toPandas()
df.head()
Out[19]:
In [20]:
df.describe()
Out[20]:
In [21]:
df = df.set_index('time')
df.head()
Out[21]:
In [23]:
ax = df['mean_ta'].plot(legend=True, figsize=(16, 8))
ax.set_xlabel("Time range [Jan-01-2006; ....; Dec-31-2010]")
ax.set_ylabel("Trimmed Mean Temperature [K]")
ax.set_title("Trimmed Mean of Temperature at 60W to 130W, 20N to 70N for Jan 2006 - Dec 2010")
plt.show()
5.2 Plot the Spatial Mean temperature values¶
In [24]:
spatial_avg_df = spatial_avg.toPandas()
spatial_avg_df.describe()
Out[24]:
In [25]:
spatial_avg_df = spatial_avg_df.set_index('time')
ax = spatial_avg_df['mean_ta'].plot(legend=True, figsize=(16, 8))
ax.set_xlabel("Time range [Jan-01-2006; ....; Dec-31-2010]")
ax.set_ylabel("Spatial Average [K]")
plt.show()
5.3 Plot of Trimmed Mean and Spatial Mean Temperatures¶
In [35]:
ax = df['mean_ta'].plot(legend=True, figsize=(16, 8), label='Trimmed Mean')
ax = spatial_avg_df['mean_ta'].plot(legend=True, figsize=(16, 8), label='Spatial Mean')
ax.set_ylabel("Mean Temperature [K]")
ax.set_xlabel("Time range [Jan-01-2006; ....; Dec-31-2010]")
plt.show()