edit

Build Status codecov

spark-xarray

spark-xarray is an open source project and Python package that seeks to integrate PySpark and xarray for Climate Data Analysis. It is built on top of PySpark - Spark Python API and xarray.

spark-xarray was originally conceived during the Summer of 2017 as part of PySpark for "Big" Atmospheric & Oceanic Data Analysis - A CISL/SIParCS Research Project.

It is currently maintained by Anderson Banihirwe.

Documentation is available at https://andersy005.github.io/spark-xarray/.

Installation

We will guide you how to install spark-xarray. However, we will assume that an Apache Spark installation is available.

Install

Requirements

For the installation of spark-xarray, the following packages are required:

Install

Clone the repository directly from GitHub and install it aftwards using $ python setup.py. This will also resolve possible missing dependencies.

$ git clone https://github.com/andersy005/spark-xarray.git
$ cd spark-xarray
$ python setup.py install

Development

We welcome new contributors of all experience levels.

Examples

Single file

>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.SparkContext
>>> filepath='spark-xarray/sparkxarray/tests/data/air.sig995.2012.nc'
>>> # Create an RDD
>>> rdd = ncread(sc, filepath, mode='single', partition_on=['time'], partitions=100)
>>> rdd.first()  # Get the first element
<xarray.Dataset>
Dimensions:  (lat: 73, lon: 144, time: 1)
Coordinates:
  * lat      (lat) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 70.0 67.5 ...
  * lon      (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 ...
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    air      (time, lat, lon) float64 234.5 234.5 234.5 234.5 234.5 234.5 ...
Attributes:
    Conventions:  COARDS
    title:        mean daily NMC reanalysis (2012)
    history:      created 2011/12 by Hoop (netCDF2.3)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
>>> rdd.count()   # Get a count of elements in the rdd
366
>>> # The count above corresponds to number of timesteps in the netCDF file 
>>> rdd.getNumPartitions()  # Get the number of partitions
100
>>> # Compute the daily average for each day (element) in RDD
>>> daily_average = rdd.map(lambda x: x.mean(dim=['lat', 'lon']))
>>> daily_average.take(3)
[<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    air      (time) float64 277.0, <xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-02
Data variables:
    air      (time) float64 276.8, <xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-03
Data variables:
    air     

Multiple files

>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.SparkContext
>>> paths='spark-xarray/sparkxarray/tests/data/NCEP/*.nc'
>>> multi_rdd = ncread(sc, paths, mode='multi', partition_on=['lat', 'lon'], partitions=300)
>>> multi_rdd.count()
16020
>>> multi_rdd.first()
<xarray.Dataset>
Dimensions:   (lat: 1, lon: 1, nv: 2, time: 4, zlev: 1)
Coordinates:
  * zlev      (zlev) float32 0.0
  * lat       (lat) float32 -88.0
  * lon       (lon) float32 0.0
  * time      (time) datetime64[ns] 1854-01-15 1854-02-15 1854-03-15 1854-04-15
Dimensions without coordinates: nv
Data variables:
    lat_bnds  (time, lat, nv) float32 -89.0 -87.0 -89.0 -87.0 -89.0 -87.0 ...
    lon_bnds  (time, lon, nv) float32 -1.0 1.0 -1.0 1.0 -1.0 1.0 -1.0 1.0
    sst       (time, zlev, lat, lon) float64 nan nan nan nan
    anom      (time, zlev, lat, lon) float64 nan nan nan nan
Attributes:
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0
    metadata_link:              C00884
    id:                         ersst.v4.185401
    naming_authority:           gov.noaa.ncdc
    title:                      NOAA Extended Reconstructed Sea Surface Tempe...
    summary:                    ERSST.v4 is developped based on v3b after rev...
    institution:                NOAA/NESDIS/NCDC
    creator_name:               Boyin Huang
    creator_email:              boyin.huang@noaa.gov
    date_created:               2014-10-24
    production_version:         Beta Version 4
    history:                    Version 4 based on Version 3b
    publisher_name:             Boyin Huang
    publisher_email:            boyin.huang@noaa.gov
    publisher_url:              http://www.ncdc.noaa.gov
    creator_url:                http://www.ncdc.noaa.gov
    license:                    No constraints on data access or use
    time_coverage_start:        1854-01-15T000000Z
    time_coverage_end:          1854-01-15T000000Z
    geospatial_lon_min:         -1.0f
    geospatial_lon_max:         359.0f
    geospatial_lat_min:         -89.0f
    geospatial_lat_max:         89.0f
    geospatial_lat_units:       degrees_north
    geospatial_lat_resolution:  2.0
    geospatial_lon_units:       degrees_east
    geospatial_lon_resolution:  2.0
    spatial_resolution:         2.0 degree grid
    cdm_data_type:              Grid
    processing_level:           L4
    standard_name_vocabulary:   CF Standard Name Table v27
    keywords:                   Earth Science &gt; Oceans &gt; Ocean Temperat...
    keywords_vocabulary:        NASA Global Change Master Directory (GCMD) Sc...
    project:                    NOAA Extended Reconstructed Sea Surface Tempe...
    platform:                   Ship and Buoy SSTs from ICOADS R2.5 and NCEP GTS
    instrument:                 Conventional thermometers
    source:                     ICOADS R2.5 SST, NCEP GTS SST, HadISST ice, N...
    comment:                    SSTs were observed by conventional thermomete...
    references:                 Huang et al, 2014: Extended Reconstructed Sea...
    climatology:                Climatology is based on 1971-2000 SST, Xue, Y...
    description:                In situ data: ICOADS2.5 before 2007 and NCEP ...