This notebook was put together by Anderson Banihirwe as part of the 2017 CISL/SIParCS Research Project: PySpark for Big Atmospheric & Oceanic Data Analysis.

Mean

  • Defined as the arithmetic average of the set.
  • Calculated by summing all values, then dividing by the number of values (formalized below).
  • One of the simplest measures of center to calculate.
  • May provide an incomplete description of the central tendency if not accompanied by other measures.
  • Greatly affected by extreme values.
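
Formally, for a sample of $n$ values $x_1, \dots, x_n$,

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

so, for example, the mean of $\{2, 4, 9\}$ is $15/3 = 5$.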

Example:

  • Calculate the temporal, spatial, and zonal mean of temperature data over eastern North America for the period 2006-2010.
  • The dataset can be found on NCAR's GLADE storage system:

    • /glade/p/CMIP/CMIP5/output1/NOAA-GFDL/GFDL-ESM2M/rcp85/mon/atmos/Amon/r1i1p1/v20111228/ta/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc

Step 1: Load the Dataset into a Spark DataFrame

In [1]:
from pyspark4climate import read
from pyspark4climate.functions import shift_lon_udf
from pyspark.sql import SparkSession
import geopandas as gpd
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (12, 15)
%matplotlib inline
import matplotlib.pyplot as plt
jet = plt.get_cmap('coolwarm')  # colormap used for the scatter plots below
In [2]:
spark = SparkSession.builder.appName("mean").getOrCreate()
sc = spark.sparkContext
In [3]:
!ncdump -h ../data/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc
netcdf ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012 {
dimensions:
    plev = 17 ;
    time = UNLIMITED ; // (60 currently)
    lat = 90 ;
    lon = 144 ;
    bnds = 2 ;
variables:
    double plev(plev) ;
        plev:units = "Pa" ;
        plev:long_name = "pressure" ;
        plev:axis = "Z" ;
        plev:positive = "down" ;
        plev:standard_name = "air_pressure" ;
    double average_DT(time) ;
        average_DT:long_name = "Length of average period" ;
        average_DT:units = "days" ;
    double average_T1(time) ;
        average_T1:long_name = "Start time for average period" ;
        average_T1:units = "days since 2006-01-01 00:00:00" ;
    double average_T2(time) ;
        average_T2:long_name = "End time for average period" ;
        average_T2:units = "days since 2006-01-01 00:00:00" ;
    double lat(lat) ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
        lat:standard_name = "latitude" ;
        lat:axis = "Y" ;
        lat:bounds = "lat_bnds" ;
    double lon(lon) ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
        lon:standard_name = "longitude" ;
        lon:axis = "X" ;
        lon:bounds = "lon_bnds" ;
    double bnds(bnds) ;
        bnds:long_name = "vertex number" ;
        bnds:cartesian_axis = "N" ;
    float ta(time, plev, lat, lon) ;
        ta:long_name = "Air Temperature" ;
        ta:units = "K" ;
        ta:valid_range = 100.f, 350.f ;
        ta:missing_value = 1.e+20f ;
        ta:cell_methods = "time: mean" ;
        ta:_FillValue = 1.e+20f ;
        ta:standard_name = "air_temperature" ;
        ta:original_units = "K" ;
        ta:original_name = "temp" ;
        ta:cell_measures = "area: areacella" ;
        ta:associated_files = "baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation areacella: areacella_fx_GFDL-ESM2M_rcp85_r0i0p0.nc" ;
    double time(time) ;
        time:long_name = "time" ;
        time:units = "days since 2006-01-01 00:00:00" ;
        time:cartesian_axis = "T" ;
        time:calendar_type = "noleap" ;
        time:calendar = "noleap" ;
        time:bounds = "time_bnds" ;
        time:standard_name = "time" ;
        time:axis = "T" ;
    double time_bnds(time, bnds) ;
        time_bnds:long_name = "time axis boundaries" ;
        time_bnds:units = "days since 2006-01-01 00:00:00" ;
    double lat_bnds(lat, bnds) ;
    double lon_bnds(lon, bnds) ;

// global attributes:
        :title = "NOAA GFDL GFDL-ESM2M, RCP8.5 (run 1) experiment output for CMIP5 AR5" ;
        :institute_id = "NOAA GFDL" ;
        :source = "GFDL-ESM2M 2010 ocean: MOM4 (MOM4p1_x1_Z50_cCM2M,Tripolar360x200L50); atmosphere: AM2 (AM2p14,M45L24); sea ice: SIS (SISp2,Tripolar360x200L50); land: LM3 (LM3p7_cESM,M45)" ;
        :contact = "gfdl.climate.model.info@noaa.gov" ;
        :project_id = "CMIP5" ;
        :table_id = "Table Amon (31 Jan 2011)" ;
        :experiment_id = "rcp85" ;
        :realization = 1 ;
        :modeling_realm = "atmos" ;
        :tracking_id = "095bd34a-27ca-4f5a-aaf2-ac2edba4317e" ;
        :Conventions = "CF-1.4" ;
        :references = "The GFDL Data Portal (http://nomads.gfdl.noaa.gov/) provides access to NOAA/GFDL\'s publicly available model input and output data sets. From this web site one can view and download data sets and documentation, including those related to the GFDL coupled models experiments run for the IPCC\'s 5th Assessment Report and the US CCSP." ;
        :comment = "GFDL experiment name = ESM2M-HC1_2006-2100_all_rcp85_ZC1. PCMDI experiment name = RCP8.5 (run1). Initial conditions for this experiment were taken from 1 January 2006 of the parent experiment, ESM2M-C1_all_historical_HC1 (historical). Several forcing agents varied during the 95 year duration of the RCP8.5 experiment based upon the MESSAGE integrated assessment model for the 21st century. The time-varying forcing agents include the well-mixed greenhouse gases (CO2, CH4, N2O, halons), tropospheric and stratospheric O3, model-derived aerosol concentrations (sulfate, black and organic carbon, sea salt and dust), and land use transitions. Volcanic aerosols were zero and solar irradiance varied seasonally based upon late 20th century averages but with no interannual variation. The direct effect of tropospheric aerosols is calculated by the model, but not the indirect effects." ;
        :gfdl_experiment_name = "ESM2M-HC1_2006-2100_all_rcp85_ZC1" ;
        :creation_date = "2011-08-12T06:08:37Z" ;
        :model_id = "GFDL-ESM2M" ;
        :branch_time = "52925" ;
        :experiment = "RCP8.5" ;
        :forcing = "GHG,SD,Oz,LU,SS,BC,MD,OC (GHG includes CO2, CH4, N2O, CFC11, CFC12, HCFC22, CFC113)" ;
        :frequency = "mon" ;
        :initialization_method = 1 ;
        :parent_experiment_id = "historical" ;
        :physics_version = 1 ;
        :product = "output1" ;
        :institution = "NOAA GFDL(201 Forrestal Rd, Princeton, NJ, 08540)" ;
        :history = "File was processed by fremetar (GFDL analog of CMOR). TripleID: [exper_id_NzrgHxy3PV,realiz_id_ntKPD70REo,run_id_fBzP3eBa0W]" ;
        :parent_experiment_rip = "r1i1p1" ;
}
In [4]:
filename = '../data/ta_Amon_GFDL-ESM2M_rcp85_r1i1p1_200601-201012.nc'
var = 'ta'
In [5]:
data = read.DataFrame(sc, (filename, var), mode='single')

The pyspark4climate DataFrame class returns a wrapper object rather than a Spark DataFrame. To access the underlying Spark DataFrame, we use its df attribute:

In [6]:
type(data)
Out[6]:
pyspark4climate.read.DataFrame
In [7]:
data_df = data.df
type(data_df)
Out[7]:
pyspark.sql.dataframe.DataFrame
In [8]:
data_df.show()
+-------------------+--------+------------------+------------------+----+
|               time|    plev|               lat|               lon|  ta|
+-------------------+--------+------------------+------------------+----+
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            256.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            258.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            261.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            263.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            266.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            268.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            271.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|273.74999999999994|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            276.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            278.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|236.24999999999997|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            238.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            241.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            243.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            246.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            248.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            251.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            253.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            256.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            258.75|null|
+-------------------+--------+------------------+------------------+----+
only showing top 20 rows

In [9]:
# Print the schema of the data_df DataFrame
data_df.printSchema()
root
 |-- time: string (nullable = true)
 |-- plev: double (nullable = true)
 |-- lat: double (nullable = true)
 |-- lon: double (nullable = true)
 |-- ta: double (nullable = true)

Step 2: Shift longitudes so that they fall in the range [-180, 180]

To achieve this we will use pyspark4climate's built-in function shift_lon_udf().
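
Under the hood this is just a wrap of longitudes from [0, 360) into [-180, 180]. A minimal sketch of the equivalent logic (an illustration only; the actual shift_lon_udf implementation may differ):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Hypothetical equivalent of shift_lon_udf: wrap [0, 360) into [-180, 180]
def _shift_lon(lon):
    return lon - 360.0 if lon > 180.0 else lon

shift_lon_sketch = udf(_shift_lon, DoubleType())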

In [10]:
# Shift the longitudes, then replace the original lon column with the shifted values
data_df = data_df.withColumn("shifted_lon", shift_lon_udf(data_df["lon"])).cache()
data_df = data_df.selectExpr("time", "plev", "lat", "shifted_lon as lon", "ta")
In [11]:
data_df.show()
+-------------------+--------+------------------+-------------------+----+
|               time|    plev|               lat|                lon|  ta|
+-------------------+--------+------------------+-------------------+----+
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            -103.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|            -101.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|             -98.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|             -96.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|             -93.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|             -91.25|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|             -88.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527| -86.25000000000006|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|             -83.75|null|
|2006-01-16 12:00:00|100000.0|-85.95505617977527|             -81.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|-123.75000000000003|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -121.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -118.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -116.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -113.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -111.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -108.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -106.25|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -103.75|null|
|2006-01-16 12:00:00|100000.0|-71.79775280898876|            -101.25|null|
+-------------------+--------+------------------+-------------------+----+
only showing top 20 rows

Step 3: Select Temporal and Spatial Domains

Select North America: keep only grid points from 130°W to 60°W and from 20°N to 70°N.

In [12]:
import pyspark.sql.functions as F
In [13]:
df = data_df.filter((data_df["lon"] <= -60) & (data_df["lon"] >= -130) &
                    (data_df["lat"] >= 20) & (data_df["lat"] <= 70))\
            .orderBy(F.col('time'), F.col('lat'), F.col('lon'))
df.show()
+-------------------+--------+------------------+-------------------+------------------+
|               time|    plev|               lat|                lon|                ta|
+-------------------+--------+------------------+-------------------+------------------+
|2006-01-16 12:00:00| 92500.0|21.235955056179776|-128.74999999999997| 284.8731384277344|
|2006-01-16 12:00:00| 60000.0|21.235955056179776|-128.74999999999997|  271.363037109375|
|2006-01-16 12:00:00| 85000.0|21.235955056179776|-128.74999999999997|  284.496337890625|
|2006-01-16 12:00:00|100000.0|21.235955056179776|-128.74999999999997|290.64520263671875|
|2006-01-16 12:00:00| 30000.0|21.235955056179776|-128.74999999999997|235.12181091308594|
|2006-01-16 12:00:00| 50000.0|21.235955056179776|-128.74999999999997| 261.9516906738281|
|2006-01-16 12:00:00| 70000.0|21.235955056179776|-128.74999999999997|278.41375732421875|
|2006-01-16 12:00:00| 25000.0|21.235955056179776|-128.74999999999997|226.87403869628906|
|2006-01-16 12:00:00|  2000.0|21.235955056179776|-128.74999999999997|      224.27734375|
|2006-01-16 12:00:00|  3000.0|21.235955056179776|-128.74999999999997|218.32974243164062|
|2006-01-16 12:00:00|  7000.0|21.235955056179776|-128.74999999999997|205.61280822753906|
|2006-01-16 12:00:00| 10000.0|21.235955056179776|-128.74999999999997|200.11734008789062|
|2006-01-16 12:00:00| 20000.0|21.235955056179776|-128.74999999999997|  217.828369140625|
|2006-01-16 12:00:00| 15000.0|21.235955056179776|-128.74999999999997| 208.1819610595703|
|2006-01-16 12:00:00| 40000.0|21.235955056179776|-128.74999999999997|250.10055541992188|
|2006-01-16 12:00:00|  5000.0|21.235955056179776|-128.74999999999997|210.75625610351562|
|2006-01-16 12:00:00|  1000.0|21.235955056179776|-128.74999999999997| 232.8317108154297|
|2006-01-16 12:00:00| 40000.0|21.235955056179776|-126.24999999999997|250.02354431152344|
|2006-01-16 12:00:00| 10000.0|21.235955056179776|-126.24999999999997| 200.3651580810547|
|2006-01-16 12:00:00| 20000.0|21.235955056179776|-126.24999999999997|217.65023803710938|
+-------------------+--------+------------------+-------------------+------------------+
only showing top 20 rows

Step 4: Calculate Temporal Average

  • This operation computes a temporal mean: the values at each horizontal grid point are averaged over all 60 time steps (and, because plev is not in the grouping, over all 17 pressure levels as well).
In [14]:
temp_avg = df.groupby('lat', 'lon')\
             .agg(F.avg('ta').alias('mean_ta'))\
             .orderBy(F.col('lat'), F.col('lon')).cache()
temp_avg.show()
+------------------+-------------------+------------------+
|               lat|                lon|           mean_ta|
+------------------+-------------------+------------------+
|21.235955056179776|-128.74999999999997|242.76008623908547|
|21.235955056179776|-126.24999999999997|242.82546629064223|
|21.235955056179776|-123.75000000000003| 242.9022134070303|
|21.235955056179776|            -121.25|243.00799309225644|
|21.235955056179776|            -118.75|243.12291736976772|
|21.235955056179776|            -116.25|243.25170994178922|
|21.235955056179776|            -113.75|243.35953623453776|
|21.235955056179776|            -111.25|243.53641993204752|
|21.235955056179776|            -108.75|243.57952067057292|
|21.235955056179776|            -106.25|243.17834014438898|
|21.235955056179776|            -103.75| 232.6847396123977|
|21.235955056179776|            -101.25|232.76408983866375|
|21.235955056179776|             -98.75|236.32689595540364|
|21.235955056179776|             -96.25|239.75000178019206|
|21.235955056179776|             -93.75| 243.1135140063716|
|21.235955056179776|             -91.25|243.16259406594668|
|21.235955056179776|             -88.75|243.21412064795402|
|21.235955056179776| -86.25000000000006|243.14876427743948|
|21.235955056179776|             -83.75|243.19449928134097|
|21.235955056179776|             -81.25|243.24257423550475|
+------------------+-------------------+------------------+
only showing top 20 rows
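
temp_avg is in long (tidy) format, one row per (lat, lon) pair. The scatter plot in the next subsection works on this directly; if you prefer a filled, gridded map instead, one option (my addition, not part of the original notebook) is to pivot back to a 2-D array:

# Hypothetical alternative: pivot the long-format means into a 2-D lat x lon grid
grid = temp_avg.toPandas().pivot(index='lat', columns='lon', values='mean_ta')
plt.pcolormesh(grid.columns, grid.index, grid.values, cmap=jet)
plt.colorbar(label='Mean ta [K]')
plt.show()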

4.1 Visualize Temporal Average

In [15]:
temporal_avg_df = temp_avg.toPandas()
temporal_avg_df.describe()
Out[15]:
              lat         lon     mean_ta
count  700.000000  700.000000  700.000000
mean    45.505618  -95.000000  236.619842
std     14.594681   20.208808    3.805626
min     21.235955 -128.750000  230.788871
25%     33.370787 -111.875000  233.394349
50%     45.505618  -95.000000  235.729313
75%     57.640449  -78.125000  240.352682
max     69.775281  -61.250000  243.579521
In [16]:
data = temporal_avg_df['mean_ta']
x = temporal_avg_df['lon']
y = temporal_avg_df['lat']
In [17]:
def plot_scatter(data, x, y):
    # scatter the grid-point means, colored by value
    plt.scatter(x, y, c=data, cmap=jet, vmin=data.min(), vmax=data.max())
    plt.colorbar()
    plt.ylabel('Latitude')
    plt.xlabel('Longitude')
    plt.show()
In [18]:
plot_scatter(data, x, y)
In [19]:
# plot the distribution of the temporal mean
ax = sns.distplot(temporal_avg_df['mean_ta'], rug=True, hist=False)
ax.set_title("Temporal Average distribution")
plt.show()

Step 5: Calculate Zonal Average

This operation computes a zonal mean as a function of latitude and time: for each latitude band and time step, the values are averaged over longitude (and over pressure level, since plev is not in the grouping). Latitude and time are the only independent variables that remain.

In [20]:
zonal_avg = df.groupby('lat', 'time')\
              .agg(F.avg('ta').alias('mean_ta'))\
              .orderBy(F.col('lat'), F.col('time')).cache()
zonal_avg.show()
+------------------+-------------------+------------------+
|               lat|               time|           mean_ta|
+------------------+-------------------+------------------+
|21.235955056179776|2006-01-16 12:00:00|240.63244262956704|
|21.235955056179776|2006-02-15 00:00:00| 241.0041486842658|
|21.235955056179776|2006-03-16 12:00:00|241.01563464749245|
|21.235955056179776|2006-04-16 00:00:00|241.38070229150915|
|21.235955056179776|2006-05-16 12:00:00|241.71975753849668|
|21.235955056179776|2006-06-16 00:00:00|242.10359871156754|
|21.235955056179776|2006-07-16 12:00:00|  243.025920173596|
|21.235955056179776|2006-08-16 12:00:00|243.39395702731224|
|21.235955056179776|2006-09-16 00:00:00|243.37296900255927|
|21.235955056179776|2006-10-16 12:00:00| 242.5644244818852|
|21.235955056179776|2006-11-16 00:00:00|241.55782674153645|
|21.235955056179776|2006-12-16 12:00:00| 241.3238802017704|
|21.235955056179776|2007-01-16 12:00:00|240.65969933950774|
|21.235955056179776|2007-02-15 00:00:00|241.17574446483326|
|21.235955056179776|2007-03-16 12:00:00| 241.7991564191621|
|21.235955056179776|2007-04-16 00:00:00|242.11138463174143|
|21.235955056179776|2007-05-16 12:00:00| 242.3503931517242|
|21.235955056179776|2007-06-16 00:00:00|243.00820559391136|
|21.235955056179776|2007-07-16 12:00:00|243.38494422144757|
|21.235955056179776|2007-08-16 12:00:00|243.59986857752645|
+------------------+-------------------+------------------+
only showing top 20 rows

5.1 Visualize Zonal Average

In [21]:
zonal_avg_df = zonal_avg.toPandas()
zonal_avg_df.describe()
Out[21]:
               lat      mean_ta
count  1500.000000  1500.000000
mean     45.505618   236.722027
std      14.589116     5.043879
min      21.235955   220.987149
25%      33.370787   233.258303
50%      45.505618   237.660850
75%      57.640449   240.841177
max      69.775281   243.973577
In [22]:
zonal_avg_df.dtypes
Out[22]:
lat        float64
time        object
mean_ta    float64
dtype: object
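
Note that time came back from Spark as plain strings (dtype object). The plot below still works, but parsing the strings first yields a real DatetimeIndex with properly spaced date ticks. A small sketch (my addition, not in the original):

# Hypothetical: parse the string timestamps into real datetimes before indexing
zonal_avg_df['time'] = pd.to_datetime(zonal_avg_df['time'])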
In [23]:
zonal_avg_df = zonal_avg_df.set_index('time')
In [24]:
ax = zonal_avg_df['mean_ta'].plot(legend=True, figsize=(16, 8))
ax.set_xlabel("Time [Jan 2006 - Dec 2010]")
ax.set_ylabel("Zonal Average [K]")
plt.show()
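
Because zonal_avg_df interleaves 25 latitude bands in a single column (1500 rows / 60 months), the series plotted above mixes them all together. Pivoting to one column per latitude gives a separate trace for each band; a sketch (my addition, not in the original):

# Hypothetical: one line per latitude band instead of one interleaved series
per_lat = zonal_avg_df.reset_index().pivot(index='time', columns='lat', values='mean_ta')
per_lat.plot(figsize=(16, 8), legend=False, colormap=jet)
plt.ylabel("Zonal Average [K]")
plt.show()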
In [25]:
# plot the distribution of the zonal average
ax = sns.distplot(zonal_avg_df['mean_ta'], rug=True, hist=False)
ax.set_title("Zonal Average distribution")
plt.show()

Step 6: Calculate Spatial Average

This operation computes a spatial mean: for each month, the temperature values at every grid point (and every pressure level) in the domain are averaged together into a single value.

In [26]:
spatial_avg = df.groupby('time')\
                .agg(F.avg('ta').alias('mean_ta'))\
                .orderBy(F.col('time')).cache()
spatial_avg.show()
+-------------------+------------------+
|               time|           mean_ta|
+-------------------+------------------+
|2006-01-16 12:00:00|231.59926081445633|
|2006-02-15 00:00:00|232.37893242814513|
|2006-03-16 12:00:00| 234.1541619403338|
|2006-04-16 00:00:00|235.78591857085357|
|2006-05-16 12:00:00|238.01055345426494|
|2006-06-16 00:00:00|240.42150196670758|
|2006-07-16 12:00:00| 241.9528985937708|
|2006-08-16 12:00:00| 241.2648552284146|
|2006-09-16 00:00:00| 239.6473727575904|
|2006-10-16 12:00:00|237.19174075318733|
|2006-11-16 00:00:00|234.90129093010106|
|2006-12-16 12:00:00|232.37434658849983|
|2007-01-16 12:00:00| 233.7475353484458|
|2007-02-15 00:00:00|234.06008903985523|
|2007-03-16 12:00:00|234.13952266820945|
|2007-04-16 00:00:00|236.35817862969358|
|2007-05-16 12:00:00|238.67507688947597|
|2007-06-16 00:00:00| 240.9092952795669|
|2007-07-16 12:00:00|242.37356627412072|
|2007-08-16 12:00:00|241.68422413936645|
+-------------------+------------------+
only showing top 20 rows
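
One caveat: F.avg gives every grid point equal weight, but on a regular latitude-longitude grid, cell area shrinks toward the poles, so high latitudes are overrepresented. A cos(latitude)-weighted variant would look roughly like this (a sketch under that assumption; the original notebook uses the unweighted mean):

# Hypothetical area-weighted spatial mean: weight each value by cos(latitude)
weighted = df.filter(F.col('ta').isNotNull())\
             .withColumn('w', F.cos(F.radians(F.col('lat'))))
spatial_avg_w = weighted.groupby('time')\
                        .agg((F.sum(F.col('ta') * F.col('w')) / F.sum('w')).alias('mean_ta'))\
                        .orderBy(F.col('time'))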

6.1 Visualize Spatial Average

In [27]:
spatial_avg_df = spatial_avg.toPandas()
spatial_avg_df.describe()
Out[27]:
          mean_ta
count   60.000000
mean   236.757677
std      3.538398
min    231.549194
25%    233.926710
50%    236.512822
75%    239.926692
max    242.373566
In [28]:
spatial_avg_df = spatial_avg_df.set_index('time')
In [29]:
ax = spatial_avg_df['mean_ta'].plot(legend=True, figsize=(16, 8))
ax.set_xlabel("Time [Jan 2006 - Dec 2010]")
ax.set_ylabel("Spatial Average [K]")
plt.show()
In [30]:
sns.distplot(spatial_avg_df['mean_ta'], rug=True, hist=False)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x11bb0a090>