Quick Start: Yellowstone¶
Prerequisites:
- If this is your first time running interactive jobs on multiple nodes, or if you have never installed SSH keys in your Yellowstone/Cheyenne user environment, installing SSH keys on Yellowstone/Cheyenne will simplify the process of running Spark jobs. For more details on how to install SSH keys, go here.
Logging in¶
- Log in to the Yellowstone system from the terminal:
$ ssh -X -l username yellowstone.ucar.edu
- Run the following commands:
$ module use /glade/p/work/abanihi/ys/modulefiles/
$ module load spark
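If you want to confirm that the environment is set up before moving on (an optional check; it assumes the spark module sets SPARK_HOME, as the batch script later on this page suggests), you can run:
$ module list
$ echo $SPARK_HOME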
Submitting jobs¶
Spark can be run interactively (via an IPython shell or Jupyter notebook) or in batch mode.
1. Interactive jobs¶
To start an interactive job, use the bsub command with the necessary options:
$ bsub -Is -W 01:00 -q small -P Project_code -R "span[ptile=1]" -n 4 bash
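For readability, the same request can be spelled out over several lines; here -Is asks for an interactive job attached to your terminal, -W sets the wall-clock limit, -q and -P select the queue and project code (adjust both to your own allocation), -R "span[ptile=1]" places one task per node, and -n 4 requests four tasks:
bsub -Is \
    -W 01:00 \
    -q small \
    -P Project_code \
    -R "span[ptile=1]" \
    -n 4 \
    bash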
1.1. Load IPython shell with PySpark¶
- To start an IPython shell with PySpark, run the following:
$ start-pyspark.sh
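Once the shell comes up, the spark and sc objects are already created for you (see the note below), so a quick sanity check like this sketch should work:
In [1]: rdd = sc.parallelize(range(100))
In [2]: rdd.sum()
Out[2]: 4950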
1.2. Run PySpark in a Jupyter notebook¶
- To run PySpark in a Jupyter notebook, run the following:
$ start-sparknotebook
and follow the instructions given.
Note:
When you run the PySpark shell, a SparkSession (the single point of entry for interacting with the underlying Spark functionality) is created for you. This is not the case for the Jupyter notebook. Once the notebook is running, you will need to create and initialize a SparkSession and SparkContext before starting to use Spark:
# Import SparkSession
from pyspark.sql import SparkSession

# Initialize a SparkSession and attach a SparkContext to it
spark = SparkSession.builder.appName("pyspark").getOrCreate()
sc = spark.sparkContext
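As a quick check that the notebook's SparkSession is working (a minimal sketch, not part of the original example), you can build a trivial DataFrame and display it:
# spark.range(5) creates a single-column DataFrame with ids 0 through 4
df = spark.range(5)
df.show()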
NOTE: We have not been able to get the Spark Web UI working on Yellowstone yet.
- If you need the Spark Master Web UI, consider running Spark on Cheyenne; as of now, it is not available on Yellowstone.
2. Batch jobs¶
To submit a Spark batch job, redirect your LSF batch script file to the bsub command:
- bsub < script_name
2.1. Spark job script example¶
Batch script to run a Spark job:
spark-test.sh
#!/usr/bin/env bash
#BSUB -P project_code           # project code
#BSUB -W 00:20                  # wall-clock time (hrs:min)
#BSUB -n 4                      # number of tasks in job
#BSUB -R "span[ptile=1]"        # run 1 task per node
#BSUB -J spark_example          # job name
#BSUB -o spark_example.%J.out   # output file name in which %J is replaced by the job ID
#BSUB -e spark_example.%J.err   # error file name in which %J is replaced by the job ID
#BSUB -q queue_name             # queue

module use /glade/p/work/abanihi/ys/modulefiles/
module load spark

source spark-cluster.sh start

$SPARK_HOME/bin/spark-submit --master $MASTER spark-test.py
spark-test.py
from __future__ import print_function

from read import RDD, DataFrame
from pyspark.sql import SparkSession

# Create a SparkSession and get its SparkContext
spark = SparkSession.builder.appName('spark-test').getOrCreate()
sc = spark.sparkContext

# Ship read.py to the executors so its classes are available on the workers
sc.addPyFile("/glade/p/work/abanihi/pyspark4climate/read.py")

# Input file and variable to read
filepath = '/glade/u/home/abanihi/data/pres_monthly_1948-2008.nc'
var = 'pres'

# Build a DataFrame from the netCDF file and print its contents
data_df = DataFrame(sc, (filepath, var), 'single')
df = data_df.df
df.show()
To run this Spark job, run:
bsub < spark-test.sh
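After submitting, you can monitor the job with standard LSF commands; the output file name below assumes the #BSUB -o directive used in spark-test.sh:
$ bjobs                      # list your pending and running jobs
$ cat spark_example.*.out    # inspect the job output once it completes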