Below is a walkthrough of the entire setup. For a quick reference, jump here.
Recently I set up pyspark on my local machine. While there are many resources for setting up spark locally, there was nothing specific to my workflow of using virtualenv for different projects. Pyspark also requires an older version of python for compatibility, which makes virtualenv a good approach. And while the entire premise of spark is to be deployed on a distributed system, having a local install of pyspark is helpful for learning and development.
Specifically, the goal was to set up pyspark locally on my machine in a python virtual environment with jupyter notebook functionality.
A major reason for setting up a virtual environment for pyspark is that there are many compatibility issues between the versions of spark, java and python.
Here we will use the following versions: python 3.6.10, Spark 2.4.5 (pre-built for Hadoop 2.7) and Java 8 (OpenJDK).
Pyspark does not support the later versions of python, so we must download an earlier version. (This is also helpful for other ML projects, as important libraries like sklearn did not yet support the later versions at the time this was written.) I use arch linux, so I will download it from the AUR (Manjaro can also use this); download it with the package manager of your distribution.
Download the PKGBUILD snapshot from the AUR, unpack the tar file and build the package (e.g. with makepkg).
$ tar xfv path/to/tar/file/python36.tar.gz
The python binaries are located at 'path/to/tar/file/python36/pkg/python36/usr/bin/'.
(Optional) Copy the binary files into a folder in the home directory.
$ mkdir ~/.python36
$ cp -r path/to/tar/file/python36/pkg ~/.python36
Add the python36 binary directory to the path.
### Add python 3.6 binary directory to .bash_profile OR .bashrc file
export PATH=~/.python36/pkg/python36/usr/bin:$PATH
Source the .bashrc or .bash_profile to apply the changes to the path
source ~/.bashrc
We use the release "Spark 2.4.5 pre-built for Apache Hadoop 2.7". This can be selected and downloaded here.
Unpack the Spark .tgz file
$ tar zxvf path/to/spark/download/spark-2.4.5-bin-hadoop2.7.tgz
Move the unpacked folder to the /opt/ directory
$ sudo mv path/to/spark/download/spark-2.4.5-bin-hadoop2.7 /opt/spark-2.4
Create a symbolic link so that /opt/spark points to the spark-2.4 directory
$ sudo ln -s /opt/spark-2.4 /opt/spark
Spark 2.4.5 runs on Java 8. Install Java 8 (here the headless OpenJDK 8 runtime) with your package manager.
$ sudo pacman -S jre8-openjdk-headless
Find where this package is located. On arch you can list the installed Java environments with archlinux-java status; for example, mine is located at /usr/lib/jvm/java-8-openjdk/jre.
You can skip this step if you have already set up virtualenv.
We use the python module virtualenv to manage virtual environments. Python has many options for this, some of which ship as standard; a good comparison of them can be found in this stack exchange answer. We also use virtualenvwrapper for ease of use. Read the virtualenvwrapper docs for more information or help.
$ pip install virtualenv
$ pip install virtualenvwrapper
$ mkdir ~/Virtual_Envs
Add the following to the ~/.bashrc file:
## Add to .bashrc or bash_profile
export WORKON_HOME=$HOME/Virtual_Envs
source /usr/bin/virtualenvwrapper.sh
Create a new virtual environment which runs python 3.6.10 (or the version you downloaded above).
$ mkvirtualenv pyspark_env --python=/home/User/.python36/pkg/python36/usr/bin/python3.6
created virtual environment CPython3.6.10.final.0-64 in 94ms
creator CPython3Posix(dest=/home/User/Virtual_Envs/pyspark_env, clear=False, global=False)
seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/home/User/.local/share/virtualenv/seed-app-data/v1.0.1)
activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/predeactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/postdeactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/preactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/postactivate
virtualenvwrapper.user_scripts creating /home/User/Virtual_Envs/pyspark_env/bin/get_env_details
Now we need to edit the environment variables of the virtual environment. Navigate to the directory with the virtual environments (I set mine to be ~/Virtual_Envs) and cd into the pyspark virtual environment folder. We edit the postactivate file, which acts similarly to a .bashrc file when the virtual environment is activated.
$ cd ~/Virtual_Envs/pyspark_env
$ cd bin
To set the new environment variables after activating the virtual environment we edit the postactivate file. Add the following to it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk/jre
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python
After this your postactivate file should have the following:
#!/bin/bash
# This hook is sourced after this virtualenv is activated.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk/jre
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python
That's it for the primary installation of Spark. Now activate the virtual environment and check that the environment variables are set:
$ workon pyspark_env
(pyspark_env) $ echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk/jre/
(pyspark_env) $ echo $SPARK_HOME
/opt/spark
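If you prefer to check from python itself, a short snippet like the following (a minimal sketch; it only reads the variables exported in postactivate) should print the 3.6 interpreter version and the two paths when run with python inside the activated environment:
import os
import sys

# The interpreter should report 3.6.x and the variables exported in
# postactivate should be visible to the process.
print(sys.version)
print(os.environ.get("JAVA_HOME"))
print(os.environ.get("SPARK_HOME"))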
To test that it works you can run
(pyspark_env) $ spark-shell --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
You can also run pyspark from the command line
(pyspark_env) $ pyspark
Python 3.6.10 (default, Apr 21 2020, 02:51:13)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Python version 3.6.10 (default, Apr 21 2020 02:51:13)
SparkSession available as 'spark'.
>>>
You can now run pyspark commands. While these would normally be executed through a script or jupyter notebook, you can test that the raw pyspark works with the canonical example of calculating pi. The SparkContext and SparkSession are already initialized.
>>> print(sc)
<SparkContext master=local[*] appName=PySparkShell>
>>> print(spark)
<pyspark.sql.session.SparkSession object at 0x7f1e2d93a390>
>>> import random
>>> NUM_SAMPLES = 10000000
>>> def inside(p):
... x, y = random.random(), random.random()
... return x*x + y*y < 1
...
>>> count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
[Stage 1:> (0 + 16) / 16]
>>> pi = 4 * count / NUM_SAMPLES
>>> print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Pi is roughly 3.1414352
In order to use pyspark in a script you can install the pyspark module with pip
(pyspark_env) $ pip install pyspark
and import pyspark like other modules.
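As a minimal sketch of what such a script might look like (the file name and app name are arbitrary), the following creates its own SparkSession, which the interactive shell otherwise provides for you, and runs a trivial job:
# example_app.py - a minimal standalone pyspark script (hypothetical name)
from pyspark.sql import SparkSession

# A script has to create its own SparkSession; the pyspark shell does this for you.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("example_app") \
    .getOrCreate()

# Trivial job: count the numbers 0-99 in parallel.
count = spark.sparkContext.parallelize(range(100)).count()
print("Counted %d elements" % count)

spark.stop()
Run it inside the virtual environment with python example_app.py.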
Jupyter notebooks are popular in data science workflows, especially for testing and development, so it is important to be able to use pyspark in a jupyter notebook. Since using pyspark in the command line is impractical and inefficient for projects of any size, we will set up the environment so that pyspark will open a jupyter notebook.
Add the following to the postactivate file of the virtual environment:
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
If you installed jupyter-notebook through your package manager or on your main python interpreter, jupyter will use the main instance of python by default. We need to add a new jupyter kernel for our virtual environment.
Install ipykernel inside the virtual environment.
(pyspark_env) $ pip install ipykernel
Install a new kernel for the new virtual environment.
(pyspark_env) $ python -m ipykernel install --user --name=pyspark_env
You should now be able to open a jupyter notebook in the virtual environment by running pyspark. You can then start a new notebook with the kernel 'pyspark_env'. You don't even need to be in the virtual environment to use the new kernel; it will be available in any jupyter-notebook.
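As a quick check in the new notebook, a first cell along these lines (a sketch; unlike the pyspark shell, a kernel started this way may not have sc and spark pre-defined, so the session is created explicitly) confirms everything is wired up:
# First cell of a notebook using the 'pyspark_env' kernel.
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the notebook.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("notebook_test") \
    .getOrCreate()

# Smoke test: a one-column DataFrame with the numbers 0-4.
spark.range(5).show()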
Quick reference:
1. Download the python 3.6 binaries and add them to the path.
2. Download Spark 2.4.5 pre-built for Hadoop 2.7, move it to /opt/spark-2.4 and symlink /opt/spark to it.
3. Install Java 8.
4. Create a virtual environment using virtualenv and virtualenvwrapper: mkvirtualenv -p='path/to/python36/binary' pyspark_env
Add the following to the .bashrc file:
## virtualenvwrapper setup
export WORKON_HOME=$HOME/Virtual_Envs
source /usr/bin/virtualenvwrapper.sh
## Add python 3.6 to binary path. Should be done before virtual env is created.
export PATH=~/.python36/pkg/python36/usr/bin:$PATH
5. The postactivate file of the virtual environment should have the following:
#!/bin/bash
# This hook is sourced after this virtualenv is activated.
## Set the Java and spark locations and add spark to the binary path
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk/jre
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python
## Set the jupyter notebook for pyspark
export PYTHON_PATH=$SPARK_HOME/python:PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
6. Inside the virtual environment, install a new python kernel for jupyter with pip install ipykernel and python -m ipykernel install --user --name=pyspark_env
Problem: the postactivate, preactivate etc. hooks are not created with the virtual environment.
Possible solution: the python36 binary directory was not added to the path. Add it to the path in .bashrc (see above).