Datapalooza, or IBM's Land Grab

IBM Watson’s spectacular success in Jeopardy a couple of years ago was widely reported. It highlighted the fact that the company, not generally considered to be at the forefront of technological innovation for the past 15 years or so, is attempting to carve out for itself a large slice of the big data/data science/machine learning/cognitive computing/whatever term is fashionable this week pie. I did not appreciate the scope of their foray into this space until the Datapalooza Mashup event today.

Resources and Tooling for Data Science

To start with, IBM offers a wealth of free educational material on all things data science at Big Data University. On the surface it looks similar to Coursera, but solely focussed on big data et al. The selection of paths and courses seems impressive, although on closer inspection most of the ones in the Machine Learning track are not available yet. That said, Rav Ahuja from IBM Canada claims the course completion rate is 60%, which compares rather favourably with the industry average of less than 20%. Another free offering from IBM, Data Scientist Workbench, provides data storage and cleaning facilities (OpenRefine) and web-based notebooks (Zeppelin and Jupyter), as well as a web version of RStudio.

Wimbledon Social Data

PaaS is a perfect match for data analysis tasks: we can experiment for free on small data sets and then pay to have the large set processed quickly. One caveat is that our data set is going to end up in a third-party service, so it might not be appropriate for sensitive data. BlueMix is IBM’s “PaaS for data science” offering, built on the popular CloudFoundry stack. In order to prove that it is not just for messing around with tiny examples, Darren Shaw demonstrated the Cognitive Social Command Center. This sinister-sounding dashboard gives Wimbledon tournament organisers insight into what the community of tennis fans is talking about in the context of the tournament; this in turn allows them to better target the content they produce and, ultimately, make more money.

Architecturally, it sources data from the YouTube, Facebook, Instagram and Gnip APIs. The messages are then pushed onto MessageHub, which is IBM’s version of Apache Kafka. MessageHub acts as the data fabric behind a pipeline that first runs a classifier to identify which messages are of interest (i.e. about the Wimbledon tournament), then annotates them with information such as inferred sentiment, topics and people that occur in the message (e.g. Andy Murray). Finally, the analytics stage produces 60-second aggregations with additional annotations such as source, country etc. and pushes them into ElasticSearch, which is then queried by a Node.js-based web UI. Behind the scenes, the classifier uses Watson Natural Language Classifier (NLC) and the annotator is backed by Alchemy API. All of those, together with ElasticSearch, are available as services on BlueMix.
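To make the classify, annotate and aggregate stages concrete, here is a rough Python sketch of the same shape of pipeline. It is not the code that was demonstrated: the keyword filter stands in for Watson NLC, the word-list sentiment stands in for Alchemy API, and the names Message, is_relevant, annotate and aggregate are all made up for illustration. Only the 60-second window comes from the talk.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 60  # aggregation interval mentioned in the talk


@dataclass
class Message:
    timestamp: float  # seconds since the epoch
    source: str       # e.g. "twitter", "instagram"
    text: str


def is_relevant(msg):
    """Stand-in for the classifier stage (a keyword match, not Watson NLC)."""
    return "wimbledon" in msg.text.lower()


def annotate(msg):
    """Stand-in for the annotator stage (a toy word list, not Alchemy API)."""
    text = msg.text.lower()
    if any(word in text for word in ("great", "love", "brilliant")):
        sentiment = "positive"
    elif any(word in text for word in ("awful", "boring", "terrible")):
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {"timestamp": msg.timestamp, "source": msg.source, "sentiment": sentiment}


def aggregate(messages):
    """Count relevant messages per (source, sentiment) in 60-second tumbling windows."""
    windows = defaultdict(Counter)
    for msg in messages:
        if not is_relevant(msg):
            continue
        record = annotate(msg)
        # Round the timestamp down to the start of its window.
        window_start = int(record["timestamp"] // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start][(record["source"], record["sentiment"])] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}
```

In the real system each stage would presumably be a separate consumer reading from and writing to MessageHub topics, and the aggregated windows would be indexed into ElasticSearch rather than returned as a dict.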

Node-RED

BlueMix-based Node-RED seems more like a toy, but is very impressive nonetheless. It allows wiring together data inputs, processing components and outputs by moving boxes and drawing lines in a diagram. Within fifteen minutes we were able to put together a simple flow that takes a constant sentence and runs one of the BlueMix Watson services to translate it to another language. It’s easy to see how it can be extended to translate e.g. tweets or web pages. BlueMix offers a bunch of data processing algorithms branded as “Watson services”: language classification, translation, personality and concept insights, dialogue and conversation. Beyond NLP, it provides image recognition and speech-to-text as well as text-to-speech. The latter uses SSML to annotate the text with expression hints, specifying whether a given passage should sound apologetic, tentative or happy. IBM appears to slap the “Watson” brand on everything ML-related, in the apparent belief that the positive publicity from Jeopardy will carry over to their more run-of-the-mill, commercial offerings.
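For a feel of what the expression hints look like, here is a small Python sketch that builds such an SSML document. Treat the details as assumptions: as far as I recall, IBM's extension is an express-as element with a type attribute taking values like "Apology", "Uncertainty" and "GoodNews", but that is from memory rather than from the session, so check the service documentation before relying on it. The expressive_ssml helper is made up for this example.

```python
import xml.etree.ElementTree as ET


def expressive_ssml(passages):
    """Build an SSML string from (expression, text) pairs.

    An expression of None produces a plain sentence with no hint.
    """
    speak = ET.Element("speak")
    for expression, text in passages:
        if expression is None:
            node = ET.SubElement(speak, "s")  # standard SSML sentence element
        else:
            # Assumed IBM extension element and attribute, not verified here.
            node = ET.SubElement(speak, "express-as", {"type": expression})
        node.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(speak, encoding="unicode")
```

Calling expressive_ssml([("Apology", "Sorry, that flight is full."), (None, "Shall I check tomorrow?")]) yields a speak element with one hinted passage and one plain sentence, which would then be sent as the text to synthesise.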

Maps and Mashups

Web mashups are so 2008. Dan Cunnington didn’t seem to mind, and showed a map of London where you could select a bus line and it would attempt to predict whether there are any delays on that service. Initially he tried to process tweets that had been sent from the vicinity of a bus stop and mentioned “bus”, but that turned out not to be very informative. Live feeds from traffic CCTV proved more useful: Dan trained an image classifier to recognise congestion from CCTV stills, and that ended up working pretty well. The app was (of course) built on the BlueMix platform.

One of the data sources for Dan’s app was TransportAPI, which is actually very neat. You can, for example, query bus stops in the area surrounding given geographical coordinates; for each bus stop you can get the scheduled arrivals (that’s real, live data, not a fixed timetable), and for each arrival you can get the entire route timetable as of a given time. Specifying edge_geometry=true will provide the line segments that form the exact route of the bus, so it can be drawn on a map. There is even a journey planning API. All of this is available for free for non-commercial uses and in a tiered payment plan for commercial uses, starting with 1000 free calls a day.
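As a rough illustration of what calling such an API involves, the Python sketch below only builds request URLs and makes no network calls. The base URL, the bus/stops/near.json path and the app_id/app_key credential parameters are my recollection of the TransportAPI conventions and should be treated as assumptions; edge_geometry is the one parameter taken from the session. The helper names build_url and stops_near_url are made up.

```python
from urllib.parse import urlencode

BASE_URL = "https://transportapi.com/v3/uk"  # assumed base URL, check the docs


def build_url(path, app_id, app_key, **params):
    """Return a request URL for `path` with credentials and query parameters."""
    query = {"app_id": app_id, "app_key": app_key}
    for name, value in params.items():
        # Booleans are sent as lowercase strings, e.g. edge_geometry=true.
        query[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE_URL}/{path.lstrip('/')}?{urlencode(query)}"


def stops_near_url(lat, lon, app_id, app_key):
    """URL for bus stops around a coordinate (assumed endpoint path)."""
    return build_url("bus/stops/near.json", app_id, app_key, lat=lat, lon=lon)
```

The resulting URL can be fetched with any HTTP client; passing edge_geometry=True to build_url for a route timetable request would add the flag that returns the route's line segments.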

Ordnance Survey is the ultimate source of geographical data for the UK. Their maps, with 0.5 metre accuracy, building outlines, a massive amount of detail and low-contrast colour schemes, are perfect for overlaying data from other sources. In addition, “premium” places and property APIs allow searching for coordinates by parts of an address, finding out what purpose a given building serves and locating listed buildings in a given area. An app that demonstrates the basics of interaction with TransportAPI, OS and Weather can be found on GitHub.

Databases

BlueMix provides a smorgasbord of data storage options. One of them is Cloudant, a document-oriented database which looks a bit like IBM’s take on MongoDB, with its id-only access and indices specified as JavaScript functions run by map/reduce jobs. Somewhat more interesting is the not very imaginatively named IBM Graph, a graph database built on Apache TinkerPop, Titan, Cassandra (as a storage engine) and ElasticSearch (for indices). All are perhaps sound picks, but given a choice between TinkerPop’s Gremlin and Neo4j’s Cypher as a query language, I’d go for the latter. With graph databases becoming more relevant due to applications such as social graph analysis, fraud detection and logistics optimisation, we might eventually end up with a standard, algebraic language for querying graphs that plays the same role as SQL does for relational databases.
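The idea of an index defined by a map function is easier to see in code. The Python sketch below emulates it in memory: a map function emits key/value pairs for each document, and the sorted pairs form the index. It is an illustration of the concept only, not Cloudant's API (where the function would be written in JavaScript); build_view, by_player and query are hypothetical names, and the "players" field is invented.

```python
def build_view(documents, map_fn):
    """Run map_fn over every document and return index rows sorted by key."""
    rows = []
    for doc in documents:
        for key, value in map_fn(doc):
            rows.append((key, value, doc["_id"]))
    # Sort by key, then by document id, so lookups return a stable order.
    rows.sort(key=lambda row: (row[0], row[2]))
    return rows


def by_player(doc):
    """Map function: emit one (player, 1) pair per player mentioned."""
    for player in doc.get("players", []):
        yield player, 1


def query(rows, key):
    """Return the ids of the documents that emitted `key`."""
    return [doc_id for k, _, doc_id in rows if k == key]
```

Documents without the field simply emit nothing and so never appear in the index, which is how a single map function can both select and key the documents.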

BlueMix

So, how is BlueMix? I have not kicked the tyres properly and have not developed and deployed a complete application; with that caveat, I would say it looks encouraging. It largely builds on familiar, open-source products. Its data science focus sets it apart from dozens of other PaaS offerings, although Microsoft Azure ML and Amazon Machine Learning are close contenders. While not aimed at non-technical users (Watson Analytics, a Watson-branded product that has nothing to do with the Jeopardy winner, is for those), it inherits CloudFoundry’s developer-centric ease of wiring things together using REST APIs. Most of the demos presented were toy examples, but the Wimbledon social media analysis app shows that it can be used for serious work.

30/06/2016