Below is a walkthrough of the entire setup. For a quick reference, jump here.
Recently I set up pyspark on my local machine. While there are many resources for setting up spark locally, there was nothing specific to my workflow of using virtualenv for different projects. Pyspark also requires an older version of python for compatibility, which makes virtualenv a good approach. And while the entire premise of spark is to be deployed on a distributed system, having a local install of pyspark is helpful for learning and development.
Specifically, the goal was to set up pyspark locally on my machine in a python virtual environment with jupyter notebook functionality.
A major reason for setting up a virtual environment for pyspark is that there are many compatibility issues between the versions of spark, java and python.
Here we will use the following versions: python 3.6.10, Spark 2.4.5 (pre-built for Hadoop 2.7) and Java 8 (OpenJDK).
Pyspark does not support the later versions of python, so we must download an earlier version. (This is also helpful for other ML projects, as important libraries like sklearn did not yet support the later versions at the time this was written.) I use arch linux, so I will download it from the AUR (Manjaro can also use this); download it with the package manager of your distribution.
Download the PKGBUILD snapshot from the AUR, unpack the tar file and build the package (e.g. with makepkg).
$ tar xfv path/to/tar/file/python36.tar.gz
The python binaries are located at 'path/to/tar/file/python36/pkg/python36/usr/bin/'.
(Optional) Copy the binary files into a folder in the home directory.
$ mkdir ~/.python36
$ cp -r path/to/tar/file/python36/pkg ~/.python36
Add the python36 binary directory to the path.
### Add python 3.6 binary directory to .bash_profile OR .bashrc file
export PATH=~/.python36/pkg/python36/usr/bin:$PATH
Source the .bashrc or .bash_profile to apply the changes to the path
source ~/.bashrc
We use the release "Spark 2.4.5 pre-built for Apache Hadoop 2.7". This can be selected and downloaded here.
Unpack the Spark .tgz file
$ tar zxvf path/to/spark/download/spark-2.4.5-bin-hadoop2.7.tgz
Move the unpacked folder to the /opt/ directory
$ sudo mv path/to/spark/download/spark-2.4.5-bin-hadoop2.7 /opt/spark-2.4
Create a symbolic link so that /opt/spark points to the spark-2.4 directory
$ sudo ln -s /opt/spark-2.4 /opt/spark
Spark 2.4.5 runs on Java 8. Install Java 8 (here the headless OpenJDK 8 runtime) with your package manager.
$ sudo pacman -S jre8-openjdk-headless
Find where this package is located. On arch you can list the installed Java environments with archlinux-java status; for example, mine is located at /usr/lib/jvm/java-8-openjdk/jre.
You can skip this step if you have already set up virtualenv.
We use the python module virtualenv to manage virtual environments. Python has many options for this, some of which ship as standard; a good comparison of them can be found in this stack exchange answer. We also use virtualenvwrapper for ease of use. Read the virtualenvwrapper docs for more information or help.
$ pip install virtualenv
$ pip install virtualenvwrapper
$ mkdir ~/Virtual_Envs
Add the following to the ~/.bashrc file:
## Add to .bashrc or bash_profile
export WORKON_HOME=$HOME/Virtual_Envs
source /usr/bin/virtualenvwrapper.sh
Create a new virtual environment which runs python 3.6.10 (or the version you downloaded above).
$ mkvirtualenv pyspark_env --python=/home/User/.python36/pkg/python36/usr/bin/python3.6
created virtual environment CPython3.6.10.final.0-64 in 94ms
creator CPython3Posix(dest=/home/User/Virtual_Envs/pyspark_env, clear=False, global=False)
seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/home/User/.local/share/virtualenv/seed-app-data/v1.0.1)
activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/predeactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/postdeactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/preactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/postactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/get_env_details
Now we need to edit the environment variables of the virtual environment. Navigate to the directory with the virtual environments (I set mine to be ~/Virtual_Envs) and cd into the pyspark virtual environment folder. We edit the postactivate file, which acts similarly to a .bashrc file when the virtual environment is activated.
$ cd ~/Virtual_Envs/pyspark_env
$ cd bin
To set the new environment variables after activating the virtual environment we edit the postactivate file. Add the following to it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk/jre
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python
After this your postactivate file should have the following:
#!/bin/bash
# This hook is sourced after this virtualenv is activated.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk/jre
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python
That's it for the primary installation of Spark. Now activate the virtual environment and check that the environment variables are set:
$ workon pyspark_env
(pyspark_env) $ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk/jre/
(pyspark_env) $ echo $SPARK_HOME
/opt/spark
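If you prefer to check from python itself, a short snippet like the following (a minimal sketch; it only reads the variables exported in postactivate) should print the 3.6 interpreter version and the two paths when run with python inside the activated environment:
import os
import sys

# The interpreter should report 3.6.x and the variables exported in
# postactivate should be visible to the process.
print(sys.version)
print(os.environ.get("JAVA_HOME"))
print(os.environ.get("SPARK_HOME"))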
To test that it works you can run
(pyspark_env) $ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
You can also run pyspark from the command line
(pyspark_env) $ pyspark
Python 3.6.10 (default, Apr 21 2020, 02:51:13)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Python version 3.6.10 (default, Apr 21 2020 02:51:13)
SparkSession available as 'spark'.
>>>
You can now run pyspark commands. While these would normally be executed through a script or jupyter notebook, you can test that the raw pyspark works with the canonical example of calculating pi. The SparkContext and SparkSession are already initialized.
>>> print(sc)
<SparkContext master=local[*] appName=PySparkShell>
>>> print(spark)
<pyspark.sql.session.SparkSession object at 0x7f1e2d93a390>
>>> import random
>>> NUM_SAMPLES = 10000000
>>> def inside(p):
... x, y = random.random(), random.random()
... return x*x + y*y < 1
...
>>> count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
[Stage 1:> (0 + 16) / 16]
>>> pi = 4 * count / NUM_SAMPLES
>>> print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Pi is roughly 3.1414352
In order to use pyspark in a script you can install the pyspark module with pip
(pyspark_env) $ pip install pyspark
and import pyspark like other modules.
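As a minimal sketch of what such a script might look like (the file name and app name are arbitrary), the following creates its own SparkSession, which the interactive shell otherwise provides for you, and runs a trivial job:
# example_app.py - a minimal standalone pyspark script (hypothetical name)
from pyspark.sql import SparkSession

# A script has to create its own SparkSession; the pyspark shell does this for you.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("example_app") \
    .getOrCreate()

# Trivial job: count the numbers 0-99 in parallel.
count = spark.sparkContext.parallelize(range(100)).count()
print("Counted %d elements" % count)

spark.stop()
Run it inside the virtual environment with python example_app.py.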
Jupyter notebooks are popular in data science workflows, especially for testing and development, so it is important to be able to use pyspark in a jupyter notebook. Since using pyspark in the command line is impractical and inefficient for projects of any size, we will set up the environment so that pyspark will open a jupyter notebook.
Add the following to the postactivate file of the virtual environment:
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
If you installed jupyter-notebook through your package manager or on your main python interpreter, jupyter will use the main instance of python by default. We need to add a new jupyter kernel for our virtual environment.
Install ipykernel inside the virtual environment.
(pyspark_env) $ pip install ipykernel
Install a new kernel for the new virtual environment.
(pyspark_env) $ python -m ipykernel install --user --name=pyspark_env
You should now be able to open a jupyter notebook in the virtual environment by running pyspark. You can then start a new notebook with the kernel 'pyspark_env'. You don't even need to be in the virtual environment to use the new kernel; it will be available in any jupyter-notebook.
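As a quick check in the new notebook, a first cell along these lines (a sketch; unlike the pyspark shell, a kernel started this way may not have sc and spark pre-defined, so the session is created explicitly) confirms everything is wired up:
# First cell of a notebook using the 'pyspark_env' kernel.
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the notebook.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("notebook_test") \
    .getOrCreate()

# Smoke test: a one-column DataFrame with the numbers 0-4.
spark.range(5).show()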
Quick reference:
1. Download the python 3.6 binaries and add them to the path.
2. Download Spark 2.4.5 pre-built for Hadoop 2.7, move it to /opt/spark-2.4 and symlink /opt/spark to it.
3. Install Java 8.
4. Create a virtual environment using virtualenv and virtualenvwrapper: mkvirtualenv -p='path/to/python36/binary' pyspark_env
Add the following to the .bashrc file:
## virtualenvwrapper setup
export WORKON_HOME=$HOME/Virtual_Envs
source /usr/bin/virtualenvwrapper.sh
## Add python 3.6 to binary path. Should be done before virtual env is created.
export PATH=~/.python36/pkg/python36/usr/bin:$PATH
5. The postactivate file of the virtual environment should have the following:
#!/bin/bash
# This hook is sourced after this virtualenv is activated.
## Set the Java and spark locations and add spark to the binary path
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk/jre
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python
## Set the jupyter notebook for pyspark
export PYTHON_PATH=$SPARK_HOME/python:PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
6. Inside the virtual environment, install a new python kernel for jupyter with pip install ipykernel and python -m ipykernel install --user --name=pyspark_env
Problem: the postactivate, preactivate etc. hooks are not created with the virtual environment.
Possible solution: the python36 binary directory was not added to the path. Add it to the path in .bashrc (see above).