KDD 2018: Common Model Infrastructure

Go away and have fun building models like there is no tomorrow. After all, we data scientists do the real creative work, the innovation, the science. The software and operations engineers just deal with the plumbing and have an easy, predictable job compared to us. This is not a mindset many would admit to, but subconsciously we do tend to underestimate the importance of others’ jobs. To be fair, the engineering and operations folks are not much better in this respect – everyone tends to think their own work is more demanding and their output more valuable than average.

The laissez-faire approach to building and deploying machine learning models is not unfamiliar to software developers, who over the last twenty years of explosive growth in information technology got used to pushing half-baked, poorly tested code with no formal guarantees to production. It does make the old-school civil engineer raise an eyebrow, though. After all, she does not get to dream up a bridge, build it, and let people loose onto it. We do it with machine learning models all the time, with no concern for the social consequences of either technical or conceptual failure.

The Common Model Infrastructure workshop was put together by machine learning practitioners particularly concerned with the state of tooling and the resultant brittleness and waste of machine learning workflows. The format was unusual, with short talks lasting 10 minutes, long talks 15 minutes, and keynotes allotted whopping 25-minute slots. All that in one morning.

Practices and Concerns

As far as deployment of models to production was concerned, the two main approaches advocated were containers (read: Docker) and portable model formats (PMML, PFA, ONNX) that can be deployed into scoring engines. While containers are all the rage, one should take a lesson from the virtualisation hype and not assume that a container constructed today will still run five years down the road – the longevity of those solutions might be an illusion. Also, while standing up a Docker+Kubernetes platform might not take much effort, what is really difficult is managing the entire lifecycle: how data is submitted, how exceptions are handled, how quality control is ensured, how monitoring is set up.
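To make the portable-format route a bit more concrete, here is a minimal sketch of exporting a scikit-learn model to ONNX and scoring it with a separate runtime. The libraries and names are my own illustrative choices, not ones endorsed at the workshop.

```python
# A minimal sketch of the "portable model format" route: train a model,
# export it to ONNX, and score it in an engine that knows nothing about
# the training framework. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

# Train any model offline...
X = np.random.rand(200, 4).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
clf = LogisticRegression().fit(X, y)

# ...export it to a portable format...
onnx_model = convert_sklearn(clf, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# ...and score it with a standalone runtime.
sess = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
preds = sess.run(None, {input_name: X[:5]})[0]
```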


A famous illustration from Hidden Technical Debt in Machine Learning Systems: ML code in the context of the entire system.


There was a mention of company-wide feature stores (Twitter has one), where embeddings are kept for reuse by different teams in different models. Twitter advocates for “pipeline as a product” (as opposed to “model as a product”): treating the entire machine learning pipeline, from data collection to model evaluation, as an artifact that is shipped and reused. A very specific practice suggested by two guys from a small startup was to put content hashes in file names, to avoid accidentally overwriting a file or using a different version of it when reproducing a model (a sketch of this follows below). Databases should be temporal, or a snapshot should be taken, so that queries are reproducible. The same code should be used online (in production) and offline (when working on the models). And the context should be captured and recorded – who produced a given piece of data, when, and so on.
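The content-hash trick is simple enough to sketch in a few lines. The helper below is a hypothetical illustration, not code from the talk:

```python
# Hypothetical helper illustrating the "content hash in the file name" practice:
# the stored name of a data file is derived from its bytes, so two different
# versions can never silently shadow each other.
import hashlib
import shutil
from pathlib import Path

def store_with_content_hash(src: str, data_dir: str = "data") -> Path:
    src_path = Path(src)
    digest = hashlib.sha256(src_path.read_bytes()).hexdigest()[:12]
    dest = Path(data_dir) / f"{src_path.stem}-{digest}{src_path.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():          # identical content -> identical name, nothing to overwrite
        shutil.copy2(src_path, dest)
    return dest

# store_with_content_hash("features.parquet")  ->  data/features-3fa2b1c4d5e6.parquet
```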

A concern mentioned in passing, but quite an important one, was privacy. Provenance tracking is essential here, since, unless proven otherwise, any machine learning model should be assumed to be tainted by its training data – in other words, since that data can in principle be reverse engineered from the model, the model is as sensitive as the training data. This is not a common perception among practitioners in the industry, as far as I can tell.

Epsilon Spaces

Epsilon spaces is a concept only tenuously related to the subject of the workshop, but which, ironically, happened to be the most interesting part as far as I am concerned – probably because the problem is close to one I have encountered in practice. Under the guise of a clever-sounding name hides a fairly simple idea. Imagine a setting where we need to build a model based on data from multiple sources, but cannot aggregate this data in a single place – perhaps these are patient records from different hospitals, or data on the smartphone of a user concerned about their privacy. The idea behind epsilon spaces is that an agent runs in each such location, close to the data, and calculates the set of models from a specified family – for example, 2-layer neural networks, or logistic regression – that achieve loss lower than a given epsilon. Those sets are then submitted back to a central agent, who takes their intersection (if one exists) and chooses a model close to the centre of this intersection. There are some complications to account, for example, for neural networks that might include neurons with functionality specific to individual agents, but, in a nutshell, that is it. It appears simple and useful, though it is not immediately clear to me how feature selection is performed, or how the scheme squares with the dictum that data taints the model.
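To pin the idea down, here is a toy sketch of my own reading of it, with a finite sample of candidate parameter vectors standing in for the model family and a plain centroid standing in for “close to the centre”. None of this comes from the talk itself.

```python
# My own toy reading of the epsilon-spaces idea (not code from the talk):
# candidate models are sampled weight vectors for a linear regressor; each
# local agent keeps the candidates whose loss on its private data is below
# epsilon, and the central agent intersects those sets and picks a model
# near the centre of the intersection.
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.normal(size=(5000, 2))          # candidate weight vectors

def local_epsilon_set(X, y, epsilon):
    """Indices of candidates with mean squared error below epsilon on local data."""
    losses = ((X @ candidates.T - y[:, None]) ** 2).mean(axis=0)
    return set(np.flatnonzero(losses < epsilon))

# Two "hospitals" with data drawn from the same underlying relationship.
true_w = np.array([1.0, -0.5])
def make_site(n):
    X = rng.normal(size=(n, 2))
    return X, X @ true_w + 0.1 * rng.normal(size=n)

sets = [local_epsilon_set(*make_site(300), epsilon=0.05) for _ in range(2)]
common = set.intersection(*sets)                 # models acceptable to every site

if common:
    chosen = candidates[list(common)].mean(axis=0)   # crude "centre" of the intersection
    print("agreed model:", chosen)
else:
    print("no model satisfies epsilon at every site; relax epsilon")
```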

Products

A number of talks mentioned particular products developed by organisations large and small. The big players design platforms around offerings that already have a significant user base. There are TensorFlow Extended and TensorFlow Hub from Google: the first offers the infrastructure around ML models – ingestion pipelines, feature engineering, evaluation, serving – that forms 99% of the complete system; the second allows reuse of not just models, but parts of models – individual layers, embeddings, transformers. All built around TensorFlow, of course. Microsoft has Azure ML, with its own system for data preparation based on Azure Databricks, and Docker-based deployment. Beaker from AI2 is again a Docker-based platform with dataset tracking and a workflow system, somewhat similar to Dotscience. Baidu has AutoDL, a system for automated architecture search, currently just for computer vision. Both Beaker and AutoDL might become publicly available later this year.
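As a flavour of the “parts of models” reuse that TensorFlow Hub enables, a published text-embedding module can be dropped into a new Keras model as an ordinary layer. The module handle below is just an illustrative choice, not one discussed at the workshop.

```python
# A hedged sketch of reusing a pre-trained component from TensorFlow Hub:
# a published text-embedding module becomes an ordinary Keras layer inside
# a new model. Illustrative only; the handle is an example from tfhub.dev.
import tensorflow as tf
import tensorflow_hub as hub

embedding = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2",
                           input_shape=[], dtype=tf.string, trainable=False)

model = tf.keras.Sequential([
    embedding,                                   # reused, pre-trained component
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```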

* * *

My impression from RAPIDS stands: there is currently no clear leader in the space of data science workflow management. There is a lot of activity, with established players (Google, Microsoft, Amazon) building solutions that tie users to their platforms, a number of fledgling startup-like outfits (Beaker, Dotscience) trying to make their mark, and open source solutions (Pachyderm) that are half-measures at best, since they still require a dedicated ops team to run. Building comprehensive and widely applicable tooling for data science is a hard task – as one of the panel members observed, the solution must suit organisations of different sizes and structures, with different processes, constraints and legacy technology. It will be really hard to come up with one size that fits all.

20/08/2018