This notebook was put together by Anderson Banihirwe as part of 2017 CISL/SIParCS Research Project : PySpark for Big Atmospheric & Oceanic Data Analysis
Mean¶
- Defined as the arithmetic average of the set.
- Calculated by summing all values, then dividing by the number of values.
- One of the simplest measures of center to calculate.
- May provide an incomplete description of the central tendency if not accompanied by other measures.
- Greatly affected by extreme values.
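The sensitivity to extreme values is easy to demonstrate with a few plain-Python numbers (hypothetical temperatures in kelvin):

```python
# Four ordinary temperature values (K) and their mean
values = [280.0, 281.0, 282.0, 283.0]
mean = sum(values) / len(values)
print(mean)  # 281.5

# A single extreme value pulls the mean up by almost 14 K
with_outlier = values + [350.0]
skewed_mean = sum(with_outlier) / len(with_outlier)
print(skewed_mean)  # 295.2
```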
Example:¶
- Calculate the temporal, spatial, and zonal mean of temperature data over eastern North America for the period 2006-2010.
The dataset can be found on NCAR's Glade:
/glade/p/CMIP/CMIP5/output1/NOAA-GFDL/GFDL-ESM2M/rcp85/mon/atmos/Amon/r1i1p1/v20111228/ta/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc
Step 1: Load Dataset in a Spark dataframe¶
from pyspark4climate import read
from pyspark4climate.functions import shift_lon_udf
from pyspark.sql import SparkSession
import geopandas as gpd
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (12, 15)
%matplotlib inline
import matplotlib.pyplot as plt
jet = plt.get_cmap('coolwarm') # Colormap used for the scatter plots
spark = SparkSession.builder.appName("mean").getOrCreate()
sc = spark.sparkContext
!ncdump -h ../data/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc
filename='../data/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc'
var = 'ta'
data = read.DataFrame(sc, (filename, var), mode='single')
The pyspark4climate DataFrame class returns a wrapper object. To access the underlying Spark DataFrame, we use its df attribute:
type(data)
data_df = data.df
type(data_df)
data_df.show()
# Print the schema of data_df dataframe
data_df.printSchema()
1.1 Shift longitudes on the grid so that they are in the range [-180, 180]¶
To achieve this we will use the pyspark4climate built-in function shift_lon_udf().
# Shift the grid, then replace the lon column with the shifted values
data_df = data_df.withColumn("shifted_lon", shift_lon_udf(data_df["lon"])).cache()
data_df = data_df.selectExpr("time", "plev", "lat", "shifted_lon as lon", "ta")
data_df.show()
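The exact transform shift_lon_udf applies is not shown here, but a common convention for moving longitudes from the [0, 360) convention into [-180, 180) is the modular shift below (a sketch, not the library's implementation):

```python
def shift_lon(lon):
    """Map a longitude from the [0, 360) convention to [-180, 180)."""
    return ((lon + 180.0) % 360.0) - 180.0

print(shift_lon(200.0))  # -160.0: 200 degrees E is 160 degrees W
print(shift_lon(90.0))   # 90.0: values already below 180 are unchanged
```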
Step 2: Select Temporal and Spatial Domains¶
Select North America: keep only grid points from 130°W to 60°W and from 20°N to 70°N.
import pyspark.sql.functions as F
df = data_df.filter((data_df["lon"] <= -60) & (data_df["lon"] >= -130) &
                    (data_df["lat"] >= 20) & (data_df["lat"] <= 70))\
            .orderBy(F.col('time'), F.col('lat'), F.col('lon'))
df.show()
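The same bounding-box predicate can be sketched in plain pandas on a tiny hypothetical set of points, to show what survives the filter:

```python
import pandas as pd

# Three hypothetical points; only the middle one lies inside
# the 130W-60W, 20N-70N box
pts = pd.DataFrame({"lat": [10.0, 45.0, 75.0],
                    "lon": [-150.0, -90.0, -65.0]})
inside = pts[(pts["lon"] <= -60) & (pts["lon"] >= -130) &
             (pts["lat"] >= 20) & (pts["lat"] <= 70)]
print(inside)  # a single row: lat 45.0, lon -90.0
```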
Step 3: Calculate Temporal Average¶
- This operation computes a temporal mean of the data by calculating the mean at each spatial grid point over the time range.
temp_avg = df.groupby('lat', 'lon')\
             .agg(F.avg('ta').alias('mean_ta'))\
             .orderBy(F.col('lat'), F.col('lon')).cache()
temp_avg.show()
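The groupby-then-average pattern is the heart of this step. A pandas analogue on a tiny hypothetical grid (two months, two grid points) shows how the time axis collapses to one mean per grid point:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2006-01", "2006-01", "2006-02", "2006-02"],
    "lat":  [20.0, 25.0, 20.0, 25.0],
    "lon":  [-100.0, -100.0, -100.0, -100.0],
    "ta":   [280.0, 278.0, 282.0, 280.0],
})

# One mean per (lat, lon) grid point, averaged over time
temporal = df.groupby(["lat", "lon"], as_index=False)["ta"].mean()
print(temporal)  # lat 20.0 -> 281.0, lat 25.0 -> 279.0
```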
3.1 Visualize Temporal Average¶
temporal_avg_df = temp_avg.toPandas()
temporal_avg_df.describe()
data = temporal_avg_df['mean_ta']
x = temporal_avg_df['lon']
y = temporal_avg_df['lat']
def plot_scatter(data, x, y):
    plt.scatter(x, y, c=data, cmap=jet, vmin=data.min(), vmax=data.max())
    plt.clim(data.min(), data.max())
    plt.colorbar()
    #plt.title('Temporal Average')
    plt.ylabel('Latitude')
    plt.xlabel('Longitude')
    plt.show()
plot_scatter(data, x, y)
# plot the distribution of the temporal mean
ax = sns.distplot(temporal_avg_df['mean_ta'], rug=True, hist=False)
ax.set_title("Temporal Average distribution")
plt.show()
Step 4: Calculate Zonal Average¶
This operation computes a zonal mean of the data by averaging over longitude, so the only independent variables remaining are latitude and time.
zonal_avg = df.groupby('lat', 'time')\
              .agg(F.avg('ta').alias('mean_ta'))\
              .orderBy(F.col('lat'), F.col('time')).cache()
zonal_avg.show()
4.1 Visualize Zonal Average¶
zonal_avg_df = zonal_avg.toPandas()
zonal_avg_df.describe()
zonal_avg_df.dtypes
zonal_avg_df = zonal_avg_df.set_index('time')
ax = zonal_avg_df['mean_ta'].plot(legend=True, figsize=(16, 8))
ax.set_xlabel("Time range [Jan 2006 to Dec 2010]")
ax.set_ylabel("Zonal Average [K]")
plt.show()
# plot the distribution of the zonal average
ax = sns.distplot(zonal_avg_df['mean_ta'], rug=True, hist=False)
ax.set_title("Zonal Average distribution")
plt.show()
Step 5: Calculate Spatial Average¶
This operation computes a spatial mean: for each month, the temperature values at all grid points in the region are averaged together to produce a single value.
spatial_avg = df.groupby('time')\
                .agg(F.avg('ta').alias('mean_ta'))\
                .orderBy(F.col('time')).cache()
spatial_avg.show()
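Zonal and spatial means are the same aggregation with different grouping keys: grouping by (lat, time) averages over longitude, while grouping by time alone averages over every grid point in a month. A pandas sketch on a tiny hypothetical one-month grid:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2006-01"] * 4,
    "lat":  [20.0, 20.0, 25.0, 25.0],
    "lon":  [-100.0, -90.0, -100.0, -90.0],
    "ta":   [280.0, 282.0, 284.0, 286.0],
})

# Zonal mean: one value per (lat, time), averaged over longitude
zonal = df.groupby(["lat", "time"], as_index=False)["ta"].mean()
print(zonal)  # lat 20.0 -> 281.0, lat 25.0 -> 285.0

# Spatial mean: one value per month, averaged over every grid point
spatial = df.groupby("time", as_index=False)["ta"].mean()
print(spatial)  # 2006-01 -> 283.0
```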
5.1 Visualize Spatial Average¶
spatial_avg_df = spatial_avg.toPandas()
spatial_avg_df.describe()
spatial_avg_df = spatial_avg_df.set_index('time')
ax = spatial_avg_df['mean_ta'].plot(legend=True, figsize=(16, 8))
ax.set_xlabel("Time range [Jan 2006 to Dec 2010]")
ax.set_ylabel("Spatial Average [K]")
plt.show()
# plot the distribution of the spatial average
ax = sns.distplot(spatial_avg_df['mean_ta'], rug=True, hist=False)
ax.set_title("Spatial Average distribution")
plt.show()