Apache Spark 2.1.1 on Yellowstone
The first task of our research project was to install the newest version of Apache Spark on Yellowstone.
At the time (May 2017), Yellowstone had an older installation of Apache Spark (Spark 1.6), set up by Davide Del Vento.
As of June 2017, we have installed the newest version of Apache Spark: Spark 2.1.1 with Hadoop 2.7.
Below are the steps we took to get Spark up and running on Yellowstone.
If all you need is to run the existing Apache Spark on Yellowstone, just skip to Section 2 of this page.
1. Installation
1.1 Downloading Spark and setting up Spark's directory and necessary files
- Log into Yellowstone
- Change working directory to /glade/p/work/abanihi:
    cd /glade/p/work/abanihi/
- Go to Apache Spark's official website and follow steps 1-4 to get a download link for Spark. Copy the download link.
- Go to /glade/p/work/abanihi/ and download Spark:
    wget https://d3kbcqa49mib13.cloudfront.net/spark-2.1.1-bin-hadoop2.7.tgz
  - Untar spark-2.1.1-bin-hadoop2.7.tgz:
      tar -xzf spark-2.1.1-bin-hadoop2.7.tgz
- Change directory to spark-2.1.1-bin-hadoop2.7/conf and open log4j.properties.template:
    cd spark-2.1.1-bin-hadoop2.7/conf
    nano log4j.properties.template
  - Go to the following line and change INFO to ERROR:
      log4j.rootCategory=INFO, console
  - Save the change and exit the editor
- Rename log4j.properties.template to log4j.properties:
    mv log4j.properties.template log4j.properties
- Change working directory to /glade/p/work/abanihi:
    cd /glade/p/work/abanihi/
- Download the H5Spark package. This package supports the Hierarchical Data Format (HDF5/netCDF4) and a rich parallel I/O interface in Spark. For more details, please see this page.
    git clone https://github.com/valiantljk/h5spark.git
  - This package will be added to the Python path in the spark-cluster.sh script.
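The log-level change described above can also be made without opening an editor. The following is a minimal sketch, not the procedure we used: a temporary directory with a short stand-in file takes the place of spark-2.1.1-bin-hadoop2.7/conf (an assumption, so the sketch can be tried anywhere), and the edited copy is written straight to log4j.properties, leaving the template untouched instead of renaming it.

```shell
#!/usr/bin/env bash
# Sketch: switch the root log level from INFO to ERROR non-interactively.
set -e

# Assumption: a temporary directory stands in for spark-2.1.1-bin-hadoop2.7/conf.
conf_dir=$(mktemp -d)
printf '%s\n' \
  '# Set everything to be logged to the console' \
  'log4j.rootCategory=INFO, console' \
  'log4j.appender.console=org.apache.log4j.ConsoleAppender' \
  > "$conf_dir/log4j.properties.template"

# Rewrite only the rootCategory line and save the result as log4j.properties.
sed 's/^log4j\.rootCategory=INFO, console$/log4j.rootCategory=ERROR, console/' \
  "$conf_dir/log4j.properties.template" > "$conf_dir/log4j.properties"

# Show the edited line.
grep '^log4j.rootCategory' "$conf_dir/log4j.properties"
```

On the real installation, the same sed command would be run inside the conf directory, with the two file names given relative to it.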
1.2 Scripts
Note: The scripts in this directory are based on Davide's previous scripts for Spark 1.6.
- Change working directory to /glade/p/work/abanihi:
    cd /glade/p/work/abanihi
- Create a new directory called yellowstone and move into it:
    mkdir yellowstone
    cd yellowstone
  - Create the spark-cluster-scripts directory and move into it:
      mkdir spark-cluster-scripts
      cd spark-cluster-scripts/
- Create a new script and name it spark-env.sh:
    nano spark-env.sh
  spark-env.sh should have the following content:

    #!/usr/bin/env bash

    # Set up the module environment
    source /etc/profile.d/modules.sh
    module restore system
    module swap intel gnu/5.3.0
    module load java
    module load python all-python-libs
    module load h5py

    # Keep Spark's work, log and temporary files on scratch space
    export SPARK_WORKER_DIR=/glade/scratch/$USER/spark/work
    export SPARK_LOG_DIR=/glade/scratch/$USER/spark/logs
    export SPARK_LOCAL_DIRS=/glade/scratch/$USER/spark/temp

    # Short host name (everything before the first dot) with "-ib" appended
    export SPARK_LOCAL_IP=$(sed -e 's/\([^.]*\).*$/\1-ib/' <<< $(hostname))
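To see what the SPARK_LOCAL_IP line in spark-env.sh produces, the same sed expression can be applied to fixed strings instead of the output of hostname. This is only an illustrative sketch: the host names are made up, the helper to_ib_name is hypothetical, and a printf pipe replaces the here-string so it also runs in plain sh. We assume the "-ib" suffix selects the node's InfiniBand interface name.

```shell
#!/usr/bin/env bash
# Same sed expression as in spark-env.sh, wrapped in a hypothetical helper.
to_ib_name() {
  # Keep everything before the first dot, drop the rest, append "-ib".
  printf '%s\n' "$1" | sed -e 's/\([^.]*\).*$/\1-ib/'
}

to_ib_name ys0101            # short host name (made up)
to_ib_name ys0101.ucar.edu   # fully qualified host name (made up)
```

Both calls print ys0101-ib, so the result does not depend on whether hostname returns the short or the fully qualified name.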
- Create a new script file and name it spark-cluster.sh:
    nano spark-cluster.sh

    #!/usr/bin/env bash
    export SPARK_HOME=/glade/p/work/abanihi/spark-2.1.1-bin-hadoop2.7/
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$PYTHONPATH:/glade/p/work/abanihi/pyspark4climate/
    export SPARK_CONF_DIR=~/yellowstone/spark/conf
    export SPARK_HOSTFILE=$SPARK_CONF_DIR/spark_hostfile

    # create temp hostfile
    export SPARK_TEMP_HOSTFILE=$SPARK_CONF_DIR/spark_temp_hostfile
    rm $SPARK_HOSTFILE $SPARK_CONF_DIR/slaves

    # one line per allocated task slot, with "-ib" appended to each host name
    mpirun.lsf hostname | grep -v Execute | sort >> $SPARK_TEMP_HOSTFILE
    sed -i 's/$/-ib/' $SPARK_TEMP_HOSTFILE
    cat $SPARK_TEMP_HOSTFILE | sort -u >> $SPARK_HOSTFILE
    # skip the first slot, then list the workers and count the slots per node
    tail -n +2 $SPARK_TEMP_HOSTFILE | sort -u >> $SPARK_CONF_DIR/slaves
    tail -n +2 $SPARK_TEMP_HOSTFILE | uniq -c > $SPARK_CONF_DIR/temp_ncores_slaves
    rm $SPARK_TEMP_HOSTFILE

    # the first host in the hostfile runs the master
    export SPARK_MASTER_HOST=$(head -n 1 $SPARK_HOSTFILE)
    export MASTER=spark://$SPARK_MASTER_HOST:7077

    cp ~/yellowstone/spark/spark-cluster-scripts/spark-env.sh $SPARK_CONF_DIR/spark-env.sh
    source $SPARK_CONF_DIR/spark-env.sh

    if [ "$1" == "start" ]; then
        cmd_master="$SPARK_HOME/sbin/start-master.sh"
        cmd_slave="$SPARK_HOME/sbin/spark-daemon.sh --config $SPARK_CONF_DIR start org.apache.spark.deploy.worker.Worker 1 $MASTER"
    elif [ "$1" == "stop" ]; then
        cmd_master="$SPARK_HOME/sbin/stop-master.sh"
        cmd_slave="$SPARK_HOME/sbin/spark-daemon.sh --config $SPARK_CONF_DIR stop org.apache.spark.deploy.worker.Worker 1"
    else
        exit 1
    fi

    $cmd_master

    # start (or stop) one worker per node with its counted number of cores
    while read ncore_slave
    do
        ncore=$(echo $ncore_slave | cut -d' ' -f1)
        slave=$(echo $ncore_slave | cut -d' ' -f2)
        if [ "$slave" == "$SPARK_MASTER_HOST" ]; then
            echo "On Master node. Running: cmd_slave --cores $ncore"
            $cmd_slave --cores $ncore
        else
            echo "On Worker node. Running: cmd_slave --cores $ncore"
            ssh $slave "$cmd_slave" --cores $ncore </dev/null &
        fi
    done <$SPARK_CONF_DIR/temp_ncores_slaves
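The host-file handling in spark-cluster.sh can be tried without a cluster. The sketch below is a simplified stand-in, not the script itself: a printf with made-up host names (and a made-up "Execute ..." status line) replaces mpirun.lsf hostname, all files go to a temporary directory, and the file names are shortened. It shows how the master is chosen and how many cores each worker is given.

```shell
#!/usr/bin/env bash
# Sketch of the hostfile processing, using fake input instead of mpirun.lsf.
set -e
work=$(mktemp -d)

# Assumption: one output line per task slot, plus a status line to filter out.
printf '%s\n' 'Execute poe command line (hypothetical)' \
  ys0102 ys0101 ys0102 ys0101 ys0102 ys0101 \
  | grep -v Execute | sort | sed 's/$/-ib/' > "$work/temp_hostfile"

sort -u "$work/temp_hostfile" > "$work/spark_hostfile"       # every node once
tail -n +2 "$work/temp_hostfile" | sort -u > "$work/slaves"  # worker nodes
tail -n +2 "$work/temp_hostfile" | uniq -c > "$work/ncores"  # slots per node

# The first node in the de-duplicated list becomes the master.
master=$(head -n 1 "$work/spark_hostfile")
echo "master: $master"

# Each line of the uniq -c output is "<count> <host>".
while read -r ncore slave; do
  echo "$slave gets $ncore core(s)"
done < "$work/ncores"
```

With three slots on each of the two nodes, the master node ys0101-ib ends up with 2 worker cores and ys0102-ib with 3, because tail -n +2 drops the first slot before counting. Presumably that slot is left free for the master process.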
- Create a new script file and name it start-pyspark.sh:

    #!/usr/bin/env bash
    source ~/yellowstone/spark/spark-cluster-scripts/spark-cluster.sh start
    $SPARK_HOME/bin/pyspark --master $MASTER
- Create a new script file and name it start-sparknotebook. This script file is an extension of the /glade/apps/opt/jupyter/5.0.0/gnu/4.8.2/bin/start-notebook script file.

    #!/usr/bin/env bash

    #source spark-cluster.sh start
    source ~/yellowstone/spark/spark-cluster-scripts/spark-cluster.sh start

    # Add the PySpark classes to the Python path:
    export PATH="$SPARK_HOME:$PATH"
    export PATH=$PATH:$SPARK_HOME/bin
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH

    # Create trap to kill notebook when user is done
    kill_server() {
        if [[ "$JNPID" != -1 ]]; then
            echo -en "\nKilling Jupyter Notebook Server with PID=$JNPID ... "
            kill "$JNPID"
            echo "done"
            exit 0
        else
            exit 1
        fi
    }

    JNPID=-1
    trap kill_server SIGHUP SIGINT SIGTERM

    # Begin server creation
    JNHOST=$(hostname)
    LOGDIR=/glade/scratch/${USER}/.jupyter-notebook
    LOGFILE=${LOGDIR}/log.$(date +%Y%m%dT%H%M%S)
    mkdir -p "$LOGDIR"

    if [[ $JNHOST == ch* || $JNHOST == r* ]]; then
        STHOST=cheyenne
    else
        STHOST=yellowstone
    fi

    echo "Logging this session in $LOGFILE"

    # Check if running on login nodes
    if [[ $JNHOST == yslogin* ]]; then
    cat << EOF

    See "Use of login nodes" here before running Jupyter Notebook on this
    node: https://www2.cisl.ucar.edu/resources/yellowstone/using_resources.

    Consider running on Geyser instead by using execgy to start a session. (Run execgy -help.)
    EOF
    elif [[ $JNHOST == cheyenne* ]]; then
    cat << EOF

    See "Use of login nodes" here before running Jupyter Notebook on this
    node: https://www2.cisl.ucar.edu/resources/computational-systems/cheyenne/running-jobs.

    Consider running in an interactive job instead by using qinteractive. (Run qinteractive -help.)
    EOF
    fi

    jupyter notebook "$@" --no-browser --ip="$JNHOST"> "$LOGFILE" 2>&1 &
    JNPID=$!

    echo -en "\nStarting jupyter notebook server, please wait ... "

    ELAPSED=0
    ADDRESS=

    # Poll the log until its last line contains the server address
    while [[ $ADDRESS != *"${JNHOST}"* ]]; do
        sleep 1
        ELAPSED=$((ELAPSED+1))
        ADDRESS=$(tail -n 1 "$LOGFILE")

        if [[ $ELAPSED -gt 30 ]]; then
            echo -e "something went wrong\n---"
            cat "$LOGFILE"
            echo "---"
            kill_server
        fi
    done

    echo -e "done\n---\n"

    ADDRESS=${ADDRESS##*:}
    PORT=${ADDRESS%/*}
    TOKEN=${ADDRESS#*=}

    cat << EOF
    Run the following command on your desktop or laptop:

       ssh -N -l $USER -L 8888:${JNHOST}:$PORT ${STHOST}.ucar.edu

    Log in with your YubiKey/Cryptocard (there will be no prompt).
    Then open a browser and go to http://localhost:8888. The Jupyter web
    interface will ask you for a token. Use the following:

        $TOKEN

    Note that anyone to whom you give the token can access (and modify/delete)
    files in your GLADE spaces, regardless of the file permissions you
    have set. SHARE TOKENS RARELY AND WISELY!

    To stop the server, press Ctrl-C.
    EOF

    # Wait for user kill command
    sleep inf
Note: Make the above scripts executable (chmod +x script_name.sh)
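Near its end, start-sparknotebook extracts the port and the token from the last line of the Jupyter log using three parameter expansions. The sketch below replays those expansions on a made-up log line. The host name, port and token are hypothetical, and the exact log format is an assumption based on what the expansions expect.

```shell
#!/usr/bin/env bash
# Hypothetical last line of the Jupyter log (host, port and token are made up).
ADDRESS='        http://geyser01:8889/?token=0123456789abcdef'

ADDRESS=${ADDRESS##*:}   # strip through the last ':'  -> 8889/?token=0123456789abcdef
PORT=${ADDRESS%/*}       # strip from the last '/' on  -> 8889
TOKEN=${ADDRESS#*=}      # strip through the first '=' -> 0123456789abcdef

echo "port=$PORT token=$TOKEN"
```

This prints port=8889 token=0123456789abcdef. These are the two values the script substitutes into the ssh tunnel command and the token prompt.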