
PySpark for "Big" Atmospheric & Oceanic Data Analysis - A CISL/SIParCS Research Project

Processing and analyzing climate data can be intimidating due to the large size and high dimensionality of the datasets. Some climate model components produce individual monthly mean files that exceed 10 GB, while some observationally based reanalysis datasets reach terabytes in size. In addition, climate data is growing faster than processing speeds. This discrepancy stems from the complex nature of climate data as well as the scientific questions climate science brings forth. Legacy software tools for scientific analysis cannot handle these massive datasets in a reasonable amount of time. To bridge this gap, there is a need to parallelize data analysis tasks on large clusters.

Typically, tools installed with a supercomputer's High Performance Computing (HPC) software stack are used to parallelize climate data analyses. This research project explores an alternate approach to parallelizing embarrassingly parallel tasks. We used the PySpark API as a Python interface to Apache Spark, a modern framework for fast distributed computing on Big Data. Our research deployed Apache Spark on NCAR's Cheyenne and Yellowstone supercomputers. We designed and developed a Python package (spark-xarray) to bridge the I/O gap between Spark and scientific data stored in netCDF format. We evaluated use cases representative of analyses commonly performed by atmospheric and oceanic scientists, such as temporal averaging and the computation of climatologies.
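As a minimal sketch of the general pattern (not the actual spark-xarray API), the snippet below shows how an embarrassingly parallel climatology computation can be distributed with PySpark: each Spark worker opens one netCDF file with xarray, reduces it locally, and returns the result to the driver. The file paths and the variable name `tas` are hypothetical placeholders.

```python
from pyspark import SparkContext
import xarray as xr

sc = SparkContext(appName="monthly-climatology")

# Hypothetical list of netCDF files; replace with real paths.
files = [
    "/glade/scratch/user/tas_2000.nc",
    "/glade/scratch/user/tas_2001.nc",
]

def monthly_climatology(path):
    """Open one file and compute its per-month mean of an assumed variable."""
    with xr.open_dataset(path) as ds:
        # Force computation before the result is sent back to the driver.
        return ds["tas"].groupby("time.month").mean("time").load()

# Distribute the per-file work across the cluster and gather the reduced
# DataArrays on the driver, where they could be combined (e.g. with xr.concat).
results = (
    sc.parallelize(files, numSlices=len(files))
      .map(monthly_climatology)
      .collect()
)

sc.stop()
```

Because each file is processed independently, this maps cleanly onto Spark's data-parallel model; only the small, reduced climatologies travel back over the network rather than the raw data.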

Our experience with Apache Spark for climate data analysis indicates that a combination of emerging Big Data frameworks such as Apache Spark and the Python data analysis stack is a useful high-performance alternative for distributed parallel computing.