Issue with IPython/Jupyter on Spark (Unrecognized alias)

I am working on setting up a set of VMs to experiment with Spark before I go out and spend money on building a cluster with real hardware. A quick note: I am an academic with a background in applied machine learning, and I work quite a bit in data science. I use these tools for computing; I rarely need to set them up.

I've created 3 VMs (1 master, 2 slaves) and installed Spark successfully. Everything appears to be working as it should. My problem lies in creating a Jupyter server that can be reached from a browser running on a machine outside the cluster.

I've installed Jupyter notebook successfully... and it runs. I've added a new IPython profile for connecting to the remote Spark server.
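For anyone reproducing this, the profile followed the usual recipe, roughly (a sketch, not my exact files; the SPARK_HOME default and the py4j zip version are placeholders that depend on your Spark install):

$ ipython profile create pyspark

with a startup script such as ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py:

import os
import sys

# Assumed Spark install location; adjust to your layout.
spark_home = os.environ.get('SPARK_HOME', '/usr/local/spark')

# Put Spark's Python bindings and its bundled py4j on sys.path.
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Run Spark's interactive-shell bootstrap so `sc` is defined in the session.
exec(open(os.path.join(spark_home, 'python/pyspark/shell.py')).read())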

Now the problem:

The command

$ ipython --profile=pyspark

runs fine and connects to the Spark cluster. However,

$ ipython notebook --profile=pyspark

prints

[stuff is here] Unrecognized alias: "profile=pyspark", it will probably have no effect.

and falls back to the default profile, not the pyspark profile.

My notebook config for pyspark has:

c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8880
c.NotebookApp.server_extensions.append('ipyparallel.nbextension')
c.NotebookApp.password = u'some password is here'
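(Aside: NotebookApp.password expects the hashed form of the password, not plain text. A sketch of generating it with the IPython of that era:)

# Run once in a Python session; prompts for the password twice
# and prints the salted hash to paste into the config.
from IPython.lib import passwd
print(passwd())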



The issue is that pyspark is not on sys.path by default. After several failed attempts to add it manually to my config files/paths/etc., I came across a GitHub repository called findspark.

I cloned this repository using

git clone https://github.com/minrk/findspark.git

Then I ran pip install findspark from the findspark root.
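(findspark is also published on PyPI, so pip install findspark should work on its own, without cloning the repository first.)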

I started a Jupyter notebook, created a new Python 3 notebook, and added:

import findspark

# Locates Spark (via SPARK_HOME) and adds pyspark to sys.path.
findspark.init()

import pyspark
sc = pyspark.SparkContext()

Before calling findspark.init(), import pyspark failed with an ImportError.

To test, I just typed sc and got back:

pyspark.context.SparkContext at 0x4526d30
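Beyond inspecting the repr, a one-liner confirms the context can actually run a job (a sketch; any small computation will do):

# Distribute a small range and sum it on the workers; expect 4950.
rdd = sc.parallelize(range(100))
print(rdd.sum())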

All working for me now.


Assuming your configuration file is ~/.ipython/profile_pyspark/ipython_notebook_config.py, you can still use it by running:

ipython notebook --config=~/.ipython/profile_pyspark/ipython_notebook_config.py

or

jupyter-notebook --config=~/.ipython/profile_pyspark/ipython_notebook_config.py

IPython has now moved to version 4.0, which means that if you are using it, it will be reading its configuration from ~/.jupyter, not ~/.ipython. You have to create a new configuration file with

jupyter notebook --generate-config

and then edit the resulting ~/.jupyter/jupyter_notebook_config.py file according to your needs.
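The settings from the old profile carry over almost verbatim. A sketch of the resulting file, reusing the values from the question:

# ~/.jupyter/jupyter_notebook_config.py
c = get_config()

# Listen on all interfaces so browsers outside the cluster can connect.
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8880
c.NotebookApp.server_extensions.append('ipyparallel.nbextension')

# Should be the hashed form, e.g. from IPython.lib.passwd().
c.NotebookApp.password = u'some password is here'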

More installation instructions here.
