Quick Start: Yellowstone¶
Prerequisites:
- If this is your first time running interactive jobs on multiple nodes, or if you have never installed SSH keys in your Yellowstone/Cheyenne user environment, installing SSH keys on Yellowstone/Cheyenne will simplify the process of running Spark jobs. For more details on how to install SSH keys, go here.
Logging in¶
- Log in to the Yellowstone system from the terminal:
$ ssh -X -l username yellowstone.ucar.edu
- Run the following commands:
$ module use /glade/p/work/abanihi/ys/modulefiles/
$ module load spark
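If you want to confirm that the environment is set up before moving on (an optional check; it assumes the spark module sets SPARK_HOME, as the batch script later on this page suggests), you can run:
$ module list
$ echo $SPARK_HOME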
Submitting jobs¶
Spark can be run interactively (via an IPython shell or Jupyter notebook) or in batch mode.
1. Interactive jobs¶
To start an interactive job, use the bsub command with the necessary options:
$ bsub -Is -W 01:00 -q small -P Project_code -R "span[ptile=1]" -n 4 bash
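For readability, the same request can be spelled out over several lines; here -Is asks for an interactive job attached to your terminal, -W sets the wall-clock limit, -q and -P select the queue and project code (adjust both to your own allocation), -R "span[ptile=1]" places one task per node, and -n 4 requests four tasks:
bsub -Is \
    -W 01:00 \
    -q small \
    -P Project_code \
    -R "span[ptile=1]" \
    -n 4 \
    bash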
1.1. Load IPython shell with PySpark¶
- To start an IPython shell with PySpark, run the following:
$ start-pyspark.sh
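Once the shell comes up, the spark and sc objects are already created for you (see the note below), so a quick sanity check like this sketch should work:
In [1]: rdd = sc.parallelize(range(100))
In [2]: rdd.sum()
Out[2]: 4950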
1.2. Run PySpark in a Jupyter notebook¶
- To run PySpark in a Jupyter notebook, run the following:
$ start-sparknotebook
and follow the instructions given.
Note:
When you run the PySpark shell, a SparkSession (the single point of entry for interacting with the underlying Spark functionality) is created for you. This is not the case for the Jupyter notebook. Once the notebook is running, you will need to create and initialize a SparkSession and SparkContext before starting to use Spark:
# Import SparkSession
from pyspark.sql import SparkSession

# Initialize a SparkSession and attach a SparkContext to it
spark = SparkSession.builder.appName("pyspark").getOrCreate()
sc = spark.sparkContext
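As a quick check that the notebook's SparkSession is working (a minimal sketch, not part of the original example), you can build a trivial DataFrame and display it:
# spark.range(5) creates a single-column DataFrame with ids 0 through 4
df = spark.range(5)
df.show()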
NOTE: We have not been able to get the Spark Web UI working on Yellowstone yet.
- If you need the Spark Master Web UI, consider running Spark on Cheyenne; as of now, it is not available on Yellowstone.
2. Batch jobs¶
To submit a Spark batch job, redirect your LSF batch script file to the bsub command:
- bsub < script_name
2.1. Spark job script example¶
Batch script to run a Spark job:
spark-test.sh
#!/usr/bin/env bash
#BSUB -P project_code           # project code
#BSUB -W 00:20                  # wall-clock time (hrs:min)
#BSUB -n 4                      # number of tasks in job
#BSUB -R "span[ptile=1]"        # run 1 task per node
#BSUB -J spark_example          # job name
#BSUB -o spark_example.%J.out   # output file name in which %J is replaced by the job ID
#BSUB -e spark_example.%J.err   # error file name in which %J is replaced by the job ID
#BSUB -q queue_name             # queue

module use /glade/p/work/abanihi/ys/modulefiles/
module load spark

source spark-cluster.sh start

$SPARK_HOME/bin/spark-submit --master $MASTER spark-test.py
spark-test.py
from __future__ import print_function

from read import RDD, DataFrame
from pyspark.sql import SparkSession

# Create a SparkSession and get its SparkContext
spark = SparkSession.builder.appName('spark-test').getOrCreate()
sc = spark.sparkContext

# Ship read.py to the executors so its classes are available on the workers
sc.addPyFile("/glade/p/work/abanihi/pyspark4climate/read.py")

# Input file and variable to read
filepath = '/glade/u/home/abanihi/data/pres_monthly_1948-2008.nc'
var = 'pres'

# Build a DataFrame from the netCDF file and print its contents
data_df = DataFrame(sc, (filepath, var), 'single')
df = data_df.df
df.show()
To run this Spark job, run:
bsub < spark-test.sh
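After submitting, you can monitor the job with standard LSF commands; the output file name below assumes the #BSUB -o directive used in spark-test.sh:
$ bjobs                      # list your pending and running jobs
$ cat spark_example.*.out    # inspect the job output once it completes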