My Spark & Jupyter Notebook EDA Workflow

I’ve been interning @mozilla for the past six months, and I’ve finally found my ideal setup for staying productive while performing exploratory data analysis on our clusters! I’ll work towards presenting my solution by walking through the various problems I’ve run into when doing analysis with Spark.

Problems With Other Solutions

“Programming” in Notebooks

Whenever I limit myself to working only in a Jupyter or Zeppelin notebook, I begin to feel claustrophobic once the cells start to fill up. I end up jumping from the bottom of the notebook to the top to redefine a map operation or a query, and after a while it becomes difficult to program effectively.

Submitting Spark Jobs With Python Files or Scala Jars

On previous projects where I haven’t had to create any visualizations, I’ve worked with spark-submit. These projects were great to work on because I was no longer restricted to a single notebook file, spark-shell, etc.

However, notebooks are fantastic for sharing knowledge through generated reports. In those previous projects our output was JSON consumed by a web front-end, rather than graphics for other analysts to view. When performing exploratory data analysis, the inline plots supported by matplotlib & seaborn are easy to take advantage of.
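For instance, with matplotlib’s inline backend enabled in the notebook, a cell along these lines (a rough sketch; the df DataFrame and its columns are hypothetical) aggregates in Spark and plots the result inline:

import seaborn as sns

# aggregate in Spark, then collect the small result to pandas for plotting
daily_counts = (df.groupBy('submission_date')
                  .count()
                  .toPandas())

sns.barplot(x='submission_date', y='count', data=daily_counts)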

I also wanted to take advantage of caching/checkpointing DataFrames in Spark while exploring, so that I don’t have to re-run the entire application whenever I discover an error in my logic. When using spark-submit, I don’t have this benefit.
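As a rough sketch of what I mean (the raw DataFrame and the filter are hypothetical):

# cache an expensive intermediate result so later cells can reuse it
cleaned = raw.filter(raw.client_id.isNotNull())
cleaned.cache()
cleaned.count()  # the first action materialises the cache

# downstream cells can now be re-run without recomputing `cleaned`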

Zeppelin Notebooks

I’ve been experimenting with Zeppelin notebooks, as our Spark cluster now has support for them. In Zeppelin you can add dependencies, which is a great feature when using Scala.

However, Zeppelin notebooks are currently not rendered by gist.github.com, and Zeppelin is somewhat difficult to get running locally. This makes it a little harder to pass a notebook around for another analyst to review. Let me know if you have had any success with Zeppelin notebooks!

Spark’s sc.addPyFile Method

sc.addPyFile has quickly become one of my favourite functions in standard Spark. At @mozilla, our cluster launches PySpark with a Jupyter server during the bootstrapping process, so including Python files with the --py-files argument would require launching another Spark application. sc.addPyFile solves this problem by distributing dependencies programmatically. I’ve had success in the past by specifying all my source files in a cell at the top of a notebook.
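That cell looks something like the following (the file and module names here are hypothetical):

# distribute source files to the executors and make them importable
for path in ['etl.py', 'plotting_helpers.py']:
    sc.addPyFile(path)

import etl
import plotting_helpers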

However, this doesn’t allow me to “hot-swap” code in the manner I’m used to outside of Spark notebooks, with a workflow something along the lines of…

import mymodule
reload(mymodule)  # importlib.reload in python 3

As an example, I might have made a mistake in mymodule, fixed the mistake, and wanted to reimport the module. Unfortunately, from what I have found, sc.addPyFile doesn’t redistribute the updated Python module. This leads me to my somewhat hacky solution.

Solution: Temporarily Using execfile or exec With Later Replacement Using sc.addPyFile

# python 2
execfile('mymodule.py')

# python 3
exec(open('./mymodule.py').read())

I didn’t promise that this solution would be pretty, but I’ve had the best results with it. With a code cell that runs execfile on each of my source files, I get to structure my code the way I want while still being able to interactively explore data in a notebook. Since the files are executed in the PySpark session, the functions defined in them are available to all of the executors. I can also “hot-swap” updates to those files by editing them, re-executing the cell, and continuing my analysis. This lets me separate ETL logic from analysis logic, resulting in shorter and easier-to-follow notebooks.
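The cell itself is nothing fancy; something like this (again, the file names are hypothetical):

# python 2
for path in ['etl.py', 'plotting_helpers.py']:
    execfile(path)

# python 3
for path in ['etl.py', 'plotting_helpers.py']:
    exec(open(path).read())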

After finalising a notebook/application, I recommend replacing the execfile or exec statements with sc.addPyFile, or using the --py-files argument, and performing imports where necessary.
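For the notebook case, the finalised cell might look something like this (the module name is hypothetical):

# distribute the module once it’s stable, then import it as usual
sc.addPyFile('mymodule.py')
import mymodule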

Please Post Other Workflows/Setups

I’m eager to explore other ways of structuring code with Spark, both for exploratory and production work!

You're awesome for taking time out of your day to read this! Please consider sharing below!