A Software Developer's Take on Zeppelin Notebooks

For the past two months I have used Zeppelin-based notebooks as my primary IDE for data exploration, experiment documentation and prototyping of production code. While the former two notebook use cases are perfectly legitimate, I would argue that Zeppelin is poorly suited for production code prototyping, primarily because of the high barrier to extracting common code into libraries. In Jupyter, for example, all it takes to extract a function into a shareable module is to copy it from the notebook to a Python file stored in the same directory as the notebook. That file can then be incorporated into the production code base, unit tests can be written for it using the standard testing frameworks, and so on. In Zeppelin + Scala Spark, for an extracted function to be available to the notebooks, the function has to be copied to a file, compiled to a .class file, added to the classpath of the interpreter, and the interpreter has to be restarted. As a result, no one bothers; the code lives on in the notebook, and any unit tests developed (in the notebook) around it have to be ported to whatever testing framework is used in the production code base.

The reality of the environment I ended up working in, though, was that prototyping in a Zeppelin notebook appeared to be the least cumbersome option. In the process, I have developed a few working practices which I would like to share here. They are applicable to both prototyping and purely experimental notebooks.

Version Control

Zeppelin notebooks are stored as JSON files, code and output together, with the code encoded in such a way that line breaks are not present in the JSON. This format is poorly suited to comparing versions with standard tools such as diff. Collaboration on a single notebook is also tricky, since merging of changes does not work. Still, version control is essential for reproducibility of the experiments. I store all notebooks in git, and after running each experiment I create a tag in the git repository, with the tag name corresponding to the experiment id. The exact way of working with a git repository depends on the specific Zeppelin setup; in my case, the Zeppelin server runs on a remote machine, so instead of SSH-ing in every time I need to checkpoint my work, I found it convenient to create a dedicated notebook that performs the git committing and tagging using a shell cell.
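A minimal sketch of such a checkpointing cell, assuming a hypothetical notebook directory and experiment id (in practice both would be parameterised):

%sh
# where the Zeppelin server stores the notebook JSON files (placeholder path)
cd /path/to/zeppelin/notebook
git add --all
git commit --message "experiment exp-042"
git tag exp-042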

Structure

As noted, sharing code across Zeppelin notebooks is inconvenient. Things look a bit better when it comes to avoiding unnecessary code duplication within a single notebook, although here we also run into an obvious restriction: since the cells are meant to execute in sequence, the functionality must be built bottom-up. This means gradually creating more and more complex building blocks from those defined previously, and eventually putting them all together into the desired, top-level functionality. It is the opposite of how I prefer to read code: first the high-level logic, then the implementation details. That concern aside, it is still perfectly possible to apply the usual practices of extracting common code into reusable functions and keeping each function small and to the point. Scala offers good syntactic support for that.
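As a hypothetical illustration of this ordering, the cells defining the lower-level blocks have to precede the cell that composes them:

def tokenize(text: String): Seq[String] =
    text.split("""\s+""").toSeq

def normalize(tokens: Seq[String]): Seq[String] =
    tokens.map(_.toLowerCase)

// Only with both building blocks in place can the top-level function be defined
def preprocess(text: String): Seq[String] =
    normalize(tokenize(text))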

I attempt to ensure that the experiment or prototyping notebooks I develop are safe to run in their entirety; the side effects should not destroy any previously created data. If a notebook writes out any data – and all experiment notebooks should save their results in some form – I write it under a file name or Parquet key that is either derived from the experiment id or the input parameters, or is a GUID. Where possible, I try to make the notebooks deterministic: wherever a random number generator is used, I always pass a constant seed in. This acts as an additional safeguard against accidental overwrites of previous results: if they do get overwritten, then as long as nothing significant has changed in the notebook, the new results will be the same. Finally, all writing in my notebooks tends to happen in the final cell. If there is any risk that a rerun could overwrite some data, or if I just don’t want to save on every run, it is easy to disable execution of this single cell.
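A minimal sketch of this naming scheme, with hypothetical parameter names (experimentId, resultsRoot) and placeholder values:

import java.util.UUID
import scala.util.Random

val experimentId = "exp-042"        // hypothetical; set in the parameters cell
val resultsRoot  = "/data/results"  // hypothetical output location
val rng          = new Random(42L)  // constant seed keeps reruns reproducible

// Derive the output key from the experiment id where possible...
val outputPath = s"$resultsRoot/$experimentId.parquet"
// ...and fall back to a GUID otherwise
val guidPath = s"$resultsRoot/${UUID.randomUUID()}.parquet"

// The write itself happens in the final cell, e.g. results.write.parquet(outputPath)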

In summary, this is the high-level structure that works well for the majority of notebooks I develop:

  • description: one-paragraph summary of what the notebook does
  • frequently changing parameters: experiment id, the inputs or keys to read them from
  • infrequently changing parameters and constants: paths of data sources, RNG seed
  • functions and their unit tests
  • invocation of the main experiment/prototype function
  • output and visualisation of results
  • storage of results

Testing

Ease of unit testing is a well-known property of code factored into small, focussed functions. As such, I tend to write a set of test cases for each such function, in a cell that immediately follows it. This cuts down on the time spent debugging: as we move on to building higher-level functionality, we can be confident that the blocks work as intended. Scala’s assert function is an adequate substitute for a fully-fledged testing framework. By way of illustration, say we had a cell with the following function:

// stopWords is assumed to be defined in an earlier cell, for example:
val stopWords = Set("a", "an", "the")

def denoise(text: String): String = text
    .toLowerCase
    .replaceAll("""[^\s\w]""", "")
    .split("""\s+""")
    .filter(w => !stopWords.contains(w))
    .mkString(" ")

The cell that follows could contain:

{
    println("test case: stop words")
    val result = denoise("a holiday a holiday and the first one of the year")
    assert(result == "holiday holiday and first one of year", "result: " + result)
}

{
    println("test case: lower case")
    val result = denoise("FOO BAR")
    assert(result == "foo bar", "result: " + result)
}

// etc.

println("Passed.")

The curly braces around each test case make all variables defined in it local, so that they do not pollute the global namespace. Additionally, they ensure that only the explicitly printed output appears in the notebook, cutting down on the noise. I tend to hide the output of the function cell (it is not interesting unless the function fails to compile) and the code of the unit test cell, to make the information visible in the notebook more relevant to the reader.

One issue with this setup that I have not yet figured out how to overcome is that in Zeppelin a failure of one cell does not prevent subsequent ones from executing. Consequently, when running the notebook, we have to manually check each unit test cell for failures; otherwise, if the code compiles, the experiment could run and write its (potentially erroneous) results despite some of the unit test assertions failing.

* * *

While I appreciate the convenience of a rich, heterogeneous REPL, I am still not quite sold on notebooks. I have seen them used as vehicles for building data transformation pipelines – one notebook to import data from flat files into Parquet, another to transform it to a different format, yet another to apply a classifier, and so on. A pipeline constructed in this way is just a collection of disjoint pieces with no overall control over the whole, and as such makes the provenance of a given piece of data very difficult to establish. I would be interested in experimenting with tools such as Pachyderm that address this specific issue.

02/09/2017