RAPIDS 2018

Having for the last year incessantly worried about the day when someone asks how we achieved a certain model metric and we are unable to answer, or even replicate the result, I found the subject of the Reproducibility and Provenance in Data Science mini-conference – RAPIDS for short – very much to my interest. The solutions to the provenance tracking and reproducibility problem that I have come up with so far have been hand-cranked, built on top of tools whose selection was influenced more by my personal preferences than by prevailing industry practice. As a result, I looked forward to learning what fellow data scientists in the broader community do.

Many of the speakers, independently, came to the conclusion that as far as machine learning model reproducibility is concerned, the three types of artifacts that need to be tracked and versioned are

  1. the code,
  2. the data, and
  3. the environment, meaning the configuration of the machine and all libraries used to train the model.
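
None of the speakers prescribed a particular format for recording these, but to make the list concrete, a hand-rolled sketch of capturing all three for a single training run might look something like the following (the data path and the output format are hypothetical):

import hashlib
import json
import subprocess
from pathlib import Path

def run_manifest(data_path):
    # Code: the Git commit the training script was run from.
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    # Data: a content hash of the training set.
    data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    # Environment: the installed Python packages (a Docker image digest
    # would capture the machine configuration more fully).
    packages = subprocess.run(
        ["pip", "freeze"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    return {"code": commit, "data": {data_path: data_hash}, "environment": packages}

print(json.dumps(run_manifest("combined-houses.csv"), indent=2))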

There was also general consensus on the tools of choice for each of the three: Git for the code, AWS S3 for the data, and Docker for the environment. Each comes with caveats: while everyone seemed happy with Docker, it was acknowledged that the Git workflow might not be well suited to fast iteration in a notebook (would we commit after every parameter change?), not to mention the infamous git command-line user experience. As for Amazon S3, while it does offer versioning, that is a fairly low-level feature and requires some abstraction layer on top for convenient use.
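
To illustrate that last point – purely as a sketch, with a hypothetical bucket and key – pinning a dataset to a particular version in S3 with boto3 means listing the per-object version identifiers and recording the chosen VersionId alongside the experiment yourself:

import boto3

# Hypothetical bucket and object key, used purely for illustration.
BUCKET = "my-datasets"
KEY = "combined-houses.csv"

s3 = boto3.client("s3")

# S3 versioning operates per object: every overwrite creates a new VersionId.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)["Versions"]
latest = next(v for v in versions if v["IsLatest"])

# To reproduce a run later, the VersionId has to be stored with the experiment
# metadata and passed back explicitly when the data is fetched again.
obj = s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=latest["VersionId"])
data = obj["Body"].read()

Keeping track of those identifiers across many datasets and runs is precisely the bookkeeping an abstraction layer would hide.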

This is where dedicated data science workflow software comes in. Admission to the conference being free, one could expect it to be a showcase for some such product. Indeed, RAPIDS 2018 was put together by Dotmesh, the vendor of an upcoming product called Dotscience, which coordinates Docker, Git and S3 to deliver a data, code, environment, model and performance tracking system. In its current form the system spins up a Docker container running a Jupyter notebook, and tracks metadata about the inputs and outputs of each notebook by way of specially formatted cell output, for example:

DOTSCIENCE_INPUTS=["combined-houses"]
DOTSCIENCE_OUTPUTS=["model"]
DOTSCIENCE_LABELS={"model_type": "linear_regression"}

This output is then parsed to build the provenance graph of data and models and to record experiments and their results. The provenance graph can be viewed in the Dotscience web UI.

While the demonstrations of Dotscience relied on Jupyter, the relevant API can be invoked explicitly from scripts if required. In fairness to the organisers, the demonstration and workshop on Dotscience did not take up too much of the conference time; the majority of the talks and workshops were dedicated to general practices and more fundamental tools, such as Docker and Git.
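
I did not note down the exact script-level API, but the marker format itself is plain enough that – assuming the parser simply picks these lines up from standard output, which is an assumption on my part – a script could emit the same annotations directly:

import json

def dotscience_annotations(inputs, outputs, labels):
    # Reproduces the marker format shown above; the real Dotscience API
    # may well look different.
    print("DOTSCIENCE_INPUTS=" + json.dumps(inputs))
    print("DOTSCIENCE_OUTPUTS=" + json.dumps(outputs))
    print("DOTSCIENCE_LABELS=" + json.dumps(labels))

dotscience_annotations(
    inputs=["combined-houses"],
    outputs=["model"],
    labels={"model_type": "linear_regression"},
)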

A data scientist/devops engineer duo from the recruitment company Headstart described the continuous model deployment setup they use. It involves automated training of the model in CircleCI, inside a Docker container built to a specification stored alongside the code in the Git repository. Model metrics are then captured and compared with those of the production model, and if the results are satisfactory, a production Docker image is built, load-tested and deployed to production. The container exposes the model via a simple Flask-based web API. That last solution is something I had heard proposed before as a model deployment architecture that allows mixing, for example, a JVM-based application with a Python-based model runtime environment. I have not been a fan, due to concerns about ownership and testing of the HTTP service code, but with data science and devops/software engineering working closely together, it might be a convenient deployment model.
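
I did not see their actual service code, but the shape of such an endpoint is simple; a minimal sketch, assuming a scikit-learn model pickled into a hypothetical model.pkl baked into the image:

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical model artifact produced by the training job and copied into
# the Docker image at build time.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[1200.0, 3.0]]}.
    features = request.get_json()["features"]
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Wrapping the model in its own container this way is what lets, say, a JVM-based application call it without sharing a runtime.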

Finally, on a subject further removed from my day-to-day work, but fascinating nonetheless, a number of talks gave an insight into the challenges of research reproducibility in an academic setting. With its focus on “innovative” results and publications, the academic system of incentives does not reward investment in reproducibility. As a consequence, many researchers, in particular those working in less technical fields such as the humanities or social sciences, are not familiar with the available tooling – they exchange data via email, copy and paste formulae, data and graphs, keep multiple copies on local hard drives, and so on. The resulting errors and lack of reproducibility can not only damage the reputation of science and lead to public mistrust in, for example, climate science and vaccination, but can also have a very dramatic and immediate impact on tens of millions of lives. Reinhart and Rogoff’s 2010 paper Growth in a Time of Debt was cited as one of the scientific arguments for austerity policies in the early 2010s. Later, significant data processing errors were discovered in the Excel spreadsheet used by the researchers, to the extent that the data did not support the conclusions of the paper.

In summary, the sense in the community seems to be that while a number of targeted data science workflow products, such as Pachyderm, DVC, and now Dotscience, are starting to compete for a share of the market, we are still best off with the lower-level tools that have seen widespread use in the software development and devops communities, and should wait on the sidelines until a clear leader emerges.

19/07/2018