Setting up a scalable data exploration environment with Spark and Jupyter Lab

Whether you are a data scientist training a model on a large feature set, or a data engineer building features out of a data lake, combining the scalability of a Spark cluster on HDFS with the convenience of Jupyter notebooks has effectively become the preferred approach.

This note walks through the configuration steps required to create a data exploration environment with open source versions of HDFS, Spark and Jupyter Lab. For integrating Spark and Jupyter we will use Apache Livy and the sparkmagic Jupyter extension. Setup was done on CentOS 7.4 as the centos user.


Step 0: Provision hosts and assign roles

Before anything else, provision hosts or VMs and decide which node(s) will play special roles:

  • For HDFS, one node must act as the NameNode. We will not be using a cluster manager. Refer to the architecture documentation.
  • For Spark, one node must act as the Spark Master. We will be running in standalone mode. Refer to the architecture documentation.
  • Spark and HDFS nodes will be co-located for performance.
  • Livy is a REST service on top of Spark. It needs to run on at least one node in order to serve Jupyter requests.
  • Jupyter is a web-based notebook application. It needs to run on some host, although that host does not need to be part of the Spark/HDFS cluster. A hypothetical role assignment is sketched below.
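
For concreteness, here is one hypothetical role assignment (node names are placeholders; the IP addresses loosely match the examples used later in this guide, so substitute your own):

# Hypothetical role assignment (illustrative only)
#   node1 (10.10.80.4)   - HDFS NameNode, Spark Master
#   node2 (10.10.80.13)  - HDFS DataNode, Spark Worker
#   node3 (10.10.80.14)  - HDFS DataNode, Spark Worker
#   node4 (10.10.80.9)   - HDFS DataNode, Spark Worker
#   node5 (10.10.80.12)  - HDFS DataNode, Spark Worker
#   any cluster node     - Livy server
#   any reachable host   - Jupyter Lab (need not be part of the cluster)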

Step 1: Prepare your hosts, download software

Generate an SSH public/private key pair for passwordless authentication:

ssh-keygen -t rsa
Generating public/private rsa key pair.
...
centos@spark3.novalocal
The key's randomart image is:
+---[RSA 2048]----+
|   (randomart    |
|    omitted)     |
+----[SHA256]-----+
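
Next, copy the public key to every node in the cluster so that the cluster start scripts can reach each host over SSH without a password. A minimal sketch, assuming the centos user and the worker IPs listed later in this guide (substitute your own hosts):

# Distribute the public key to each cluster node (repeat for every host)
ssh-copy-id centos@10.10.80.13
ssh-copy-id centos@10.10.80.14
ssh-copy-id centos@10.10.80.9
ssh-copy-id centos@10.10.80.12
# Verify that passwordless login works
ssh centos@10.10.80.13 hostname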

Install the unzip and bzip2 utilities, as well as several packages required in subsequent steps:

sudo yum install unzip
sudo yum install bzip2
sudo yum install gcc python-devel krb5-devel krb5-workstation

Install Java (OpenJDK 1.8 works just fine):

sudo yum install java-1.8.0-openjdk-devel
java -version
openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

Stage the software binaries for Hadoop, Spark, Livy and Anaconda, and place them under /opt:

curl -LO https://archive.apache.org/dist/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
curl -LO http://apache.mirrors.tds.net/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
curl -LO http://archive.apache.org/dist/incubator/livy/0.4.0-incubating/livy-0.4.0-incubating-bin.zip
curl -LO https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh
tar -xvzf hadoop-2.7.5.tar.gz
tar -xvzf spark-2.2.1-bin-hadoop2.7.tgz
unzip livy-0.4.0-incubating-bin.zip
sudo mv hadoop-2.7.5/ /opt/
sudo mv spark-2.2.1-bin-hadoop2.7/ /opt/
sudo mv livy-0.4.0-incubating-bin/ /opt/
cd /opt
sudo ln -s hadoop-2.7.5/ hadoop
sudo ln -s spark-2.2.1-bin-hadoop2.7/ spark
sudo ln -s livy-0.4.0-incubating-bin/ livy

Configure environment variables to point to the new binaries:

vi .bash_profile
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export HADOOP_PREFIX=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export SPARK_HOME=/opt/spark
export JAVA_LIBRARY_PATH=/opt/hadoop/lib/native
export LD_LIBRARY_PATH=/opt/hadoop/lib/native
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$JAVA_HOME/bin:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PATH
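
Reload the profile and run a quick sanity check that the new binaries resolve on the PATH (version output will vary with the releases you staged):

source .bash_profile
hadoop version
spark-submit --version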

Step 2: Configure HDFS

Create a physical location for HDFS data, with individual directories for NameNode metadata, DataNode blocks, and temp space:

sudo mkdir /var/lib/hdfs
sudo mkdir /var/lib/hdfs/name
sudo mkdir /var/lib/hdfs/data
sudo mkdir /var/lib/hdfs/tmp
sudo chown -R centos:centos /var/lib/hdfs

Modify $HADOOP_PREFIX/etc/hadoop/core-site.xml to specify:

  • The network address and port of the NameNode
  • A non-default location for temp files (otherwise /tmp will run out of space on your first large file operation)
  • Optionally, the Snappy OS library for file compression

vi $HADOOP_PREFIX/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.10.80.4:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hdfs/tmp</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>

Modify $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml to specify the file locations for NameNode and DataNodes created in a previous step:

vi $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/lib/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/lib/hdfs/data</value>
  </property>
</configuration>

Create a new $HADOOP_PREFIX/etc/hadoop/slaves file in order to run cluster-wide commands:

vi $HADOOP_PREFIX/etc/hadoop/slaves
10.10.80.13
10.10.80.14
10.10.80.9
10.10.80.12

Start HDFS daemons:

  • The first time you bring up HDFS, it must be formatted (i.e. metadata in the NameNode gets initialized):
$HADOOP_PREFIX/bin/hdfs namenode -format <your_cluster_name>
  • All daemons in the cluster: Assuming that passwordless SSH has been configured from the node where the command is issued to all hosts in $HADOOP_PREFIX/etc/hadoop/slaves, simply issue:
 start-dfs.sh
  • Local daemon(s): Issue the following command to control local HDFS processes:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

Verify that daemons started normally:

jps
32512 Jps
11665 NameNode
21742 DataNode
11877 SecondaryNameNode
4279 Master
4395 Worker
ps -ef|grep java
centos   11665     1  1 Apr13 ?        04:37:27 /usr/lib/jvm/java-1.8.0/bin/java ... org.apache.hadoop.hdfs.server.namenode.NameNode
centos   21742     1  1 05:56 ?        00:07:14 /usr/lib/jvm/java-1.8.0/bin/java ... org.apache.hadoop.hdfs.server.datanode.DataNode
  • Note: For anything other than testing, it is advisable to run the NameNode daemon on a dedicated host.

Check the logs for error messages, for example:

tail -f $HADOOP_PREFIX/logs/hadoop-centos-namenode-spark1.novalocal.log
tail -f $HADOOP_PREFIX/logs/hadoop-centos-datanode-spark3.novalocal.log

Exercise the file system by uploading a file and reading it back:

# Upload /etc/hosts to the root directory
hdfs dfs -put /etc/hosts /
# List the root directory
hdfs dfs -ls /
# View file contents
hdfs dfs -cat /hosts

At this point you should be able to see the HDFS web console running on your NameNode, port 50070.
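
If you prefer the command line, a cluster summary is also available via dfsadmin; it should report one live DataNode per entry in your slaves file:

hdfs dfsadmin -report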

Step 3: Configure Spark

Create a copy and modify the following configuration files:

$SPARK_HOME/conf/spark-env.sh : Sets the environment for several programs run locally (it will not take effect on jobs submitted remotely by spark-submit)

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
vi $SPARK_HOME/conf/spark-env.sh
# Worker cores available for Spark on this host
export SPARK_WORKER_CORES=3
# Memory available for Spark on this host
export SPARK_WORKER_MEMORY=6G
# Location of Python binaries
export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python
export PYSPARK3_PYTHON=/usr/bin/python3
export PYSPARK3_DRIVER_PYTHON=/usr/bin/python3
# Default number of cores given to a job (avoids allocating all cores to a given job, which is the default behavior)
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"

$SPARK_HOME/conf/slaves : Lists the worker nodes in the cluster in order to run cluster-wide commands

cp $SPARK_HOME/conf/slaves.template $SPARK_HOME/conf/slaves
vi $SPARK_HOME/conf/slaves
10.10.80.13
10.10.80.14
10.10.80.9
10.10.80.12

Start Spark daemons: From the master node, run:

start-master.sh
start-slaves.sh

Verify that daemons started normally:

jps
32512 Jps
11665 NameNode
21742 DataNode
11877 SecondaryNameNode
4279 Master
4395 Worker
ps -ef|grep java
centos   19244     1  2 03:01 pts/0    00:00:06 /usr/lib/jvm/java-1.8.0/bin/java ... org.apache.spark.deploy.master.Master ...
centos    5453     1  7 03:02 ?        00:00:04 /usr/lib/jvm/java-1.8.0/bin/java ... org.apache.spark.deploy.worker.Worker ...

Test Spark by opening a local spark-shell session:

$ spark-shell --master spark://spark1.novalocal:7077
Spark context available as 'sc' (master = spark://spark1.novalocal:7077, app id = app-20180502031621-0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

Browse to the master node, port 8080, for Spark's web console.
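
As an additional smoke test, you can submit the SparkPi example that ships with the distribution to the cluster. The jar name below assumes the Spark 2.2.1 / Scala 2.11 build staged earlier; adjust it to your version:

spark-submit --master spark://spark1.novalocal:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.1.jar 100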

Step 4: Configure Livy

Perform the following steps on one of the cluster nodes in order to point Livy to the Spark master:

cp /opt/livy/conf/livy.conf.template /opt/livy/conf/livy.conf
vi /opt/livy/conf/livy.conf
# What spark master Livy sessions should use.
livy.spark.master = spark://spark1.novalocal:7077

Start the Livy daemon, which by default listens on 8998:

/opt/livy/bin/livy-server start
starting /usr/lib/jvm/java-1.8.0/bin/java -cp /opt/livy/jars/*:/opt/livy/conf:/opt/hadoop/etc/hadoop: org.apache.livy.server.LivyServer, logging to /opt/livy/logs/livy-centos-server.out

Browse the Livy web console (also on port 8998), which won't display much initially.
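
You can also hit the REST API directly to confirm that the server is responding; with no sessions created yet, it should return an empty session list along the lines of:

curl http://localhost:8998/sessions
{"from":0,"total":0,"sessions":[]}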

Step 5: Configure Anaconda (Python and Jupyter)

Start by creating a dedicated directory for the Anaconda distribution:

sudo mkdir /opt/anaconda
sudo chown centos:centos /opt/anaconda

Grant execute permissions to the installer file and run:

chmod +x Anaconda3-5.1.0-Linux-x86_64.sh
./Anaconda3-5.1.0-Linux-x86_64.sh
Welcome to Anaconda3 5.1.0
In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>

Specify a custom location for the software when prompted:

Anaconda3 will now be installed into this location:
/home/centos/anaconda3
- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below
[/home/centos/anaconda3] >>> /opt/anaconda/conda

Add the conda binaries to your environment by adding the following line to the end of .bash_profile:

export PATH=/opt/anaconda/conda/bin:$PATH

Have the environment changes take effect:

source .bash_profile 

Create a conda environment named jupyter_spark that will contain our versions of Python, Jupyter and other dependencies (a number of package dependencies will be downloaded):

conda create -n jupyter_spark pip python=3.6
The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pip-10.0.1                 |           py36_0         1.8 MB
    openssl-1.0.2o             |      h20670df_0          3.4 MB
    sqlite-3.23.1              |      he433501_0          1.5 MB
    xz-5.2.3                   |      h5e939de_4          365 KB
    libgcc-ng-7.2.0            |      hdf63c60_3          6.1 MB
    libstdcxx-ng-7.2.0         |      hdf63c60_3          2.5 MB
    python-3.6.5               |      hc3d631a_2         29.4 MB
    certifi-2018.4.16          |           py36_0         142 KB
    ca-certificates-2018.03.07 |                0         124 KB
    setuptools-39.1.0          |           py36_0         550 KB
    wheel-0.31.0               |           py36_0          62 KB
    ------------------------------------------------------------
                                           Total:        46.0 MB

Activate your new environment (prompt will change accordingly):

source activate jupyter_spark
(jupyter_spark) $

Once the environment is active, install Jupyter Lab:

pip install jupyterlab

Configure Jupyter to listen on all interfaces rather than just localhost:

jupyter notebook --generate-config
Writing default config to: .jupyter/jupyter_notebook_config.py
vi .jupyter/jupyter_notebook_config.py
## The IP address the notebook server will listen on.
c.NotebookApp.ip = '*'

Configure Jupyter to use a fixed password (instead of a new autogenerated token on every start):

jupyter notebook password
Enter password:
Verify password:
Wrote hashed password to .jupyter/jupyter_notebook_config.json

Launch Jupyter and access it on HTTP port 8888:

jupyter lab
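
If you want Jupyter Lab to keep running after your SSH session ends, one simple option is to start it in the background with nohup (a systemd unit or a terminal multiplexer such as tmux works just as well):

nohup jupyter lab --no-browser > jupyterlab.log 2>&1 &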

At this point we have a functional Jupyter Lab, but notebooks are restricted to local Python processes.

Step 6: Configure Sparkmagic

The last part of the setup adds the Sparkmagic Jupyter extension and points it to Livy’s REST endpoint (configured in Step 4):

pip install sparkmagic
jupyter nbextension enable --py --sys-prefix widgetsnbextension

Find the install location of sparkmagic and register the additional Scala, PySpark, PySpark3 and SparkR kernels with Jupyter:

pip show sparkmagic
cd /opt/anaconda/conda/envs/jupyter_spark/lib/python3.6/site-packages
jupyter-kernelspec install sparkmagic/kernels/sparkkernel --user
jupyter-kernelspec install sparkmagic/kernels/pysparkkernel --user
jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel --user
jupyter-kernelspec install sparkmagic/kernels/sparkrkernel --user
jupyter serverextension enable --py sparkmagic
mkdir ~/.sparkmagic
vi ~/.sparkmagic/config.json
{
  "kernel_python_credentials" : {
    "username": "",
    "password": "",
    "url": "http://192.168.40.161:8998",
    "auth": "None"
  },
  "kernel_scala_credentials" : {
    "username": "",
    "password": "",
    "url": "http://192.168.40.161:8998",
    "auth": "None"
  },
  "kernel_r_credentials": {
    "username": "",
    "password": "",
    "url": "http://192.168.40.161:8998",
    "auth": "None"
  }
}

Restart Jupyter; four new notebook kernels become available.
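
Before restarting, you can double-check that the kernels were registered (exact kernel names may vary slightly between sparkmagic versions):

jupyter kernelspec list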

Step 7: Validate your environment

To test the configuration end to end, we run a simple Scala program that creates an RDD from a collection and displays its contents:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
distData.count
distData.collect.foreach(println(_))

As the first cell runs (Ctrl+Enter), the notebook will connect to Spark through Livy, submit your code, send back the results, and maintain your session state for subsequent commands.

Back in the Livy web console, you can see the session and its work:

At the same time, you will see your Livy session as a running Spark application:

Sparkmagic includes several magics, or special commands prefixed with %% (%%help is a good place to start). You can control the resources available to your session with %%configure:

%%configure -f
{"numExecutors": 2, "executorMemory": "3G", "executorCores": 2}

Most notably, using %%sql you can browse through the contents of your data frames with arbitrary SQL (including joins among them), once you register them as temp views:

val df = spark.read.csv("/adsb/02242018/metars.csv")
df.createOrReplaceTempView("metars")
%%sql
SELECT * FROM metars ORDER BY _c1 DESC
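
Note that the CSV path above is only an example; it assumes a file has already been uploaded to HDFS, for instance:

# Upload a local metars.csv (hypothetical file) into HDFS
hdfs dfs -mkdir -p /adsb/02242018
hdfs dfs -put metars.csv /adsb/02242018/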

The sky is the limit!

From here on, you can load arbitrary amounts of data into HDFS (your very own data lake) and analyze them with Spark in Scala, Python (PySpark), Java or SQL, all from the comfort of a Jupyter notebook. For large data sets, you may want to convert your data to Parquet for performance.

This setup is ideally suited for small multi-tenant dev/modeling environments, where users explore data, create features and train models.