What is the best way to implement data science use cases? Which processes and technologies are needed, and how do the two work together? These and similar questions, currently the subject of lively debate in the data science community, dominated the second day of Predictive Analytics World 2019.
The first day of PAW Berlin 2019 had closed with a party in Berlin's "Prince Charles" nightclub. The venue in Kreuzberg, with its bare concrete walls, is typically Berlin: the club used to be a swimming pool for the staff of the "Bechstein" piano manufacturer. The second day of PAW brought some very technical topics, including a dedicated track on "data operations & engineering".
Before the deep-dive presentations, however, Dr. Frank Block from Roche Diagnostics gave the keynote speech. His presentation, "From exploration to productionization", addressed the challenge of turning data science use cases into robust production applications. At Roche Diagnostics, the data science lifecycle is divided into three parts:
- In the first part – Create Awareness – ideas are collected and their potential is analysed.
- In the second part – Proof of Value (PoV) – the actual data science work takes place: a project is developed and its added value is demonstrated following the CRISP-DM model.
- In the last part – Productionize – the model is implemented, including UI and user support.
The projects have to prove themselves at various "gates". These act as filters, ensuring that only projects with proven added value reach the deployment and maintenance phase. This approach allows a large number of experimental projects to be tried out, at least initially, and delivers on the mantra of "fail fast, fail often".
Experiences with data science: the development lifecycle at GfK
René Traue and Christian Lindenlaub (GfK) reported on their approaches to going live in a completely different corporate environment. The title of their presentation was provocatively chosen: "Data science development lifecycle: everyone talks about it, nobody really knows how to do it and everyone thinks everyone else is doing it". At GfK, Scrum has finally reached the data science world as the project management methodology. Acceptance criteria and definitions of done are integral to how new projects are approached, and the "product vision" is developed iteratively through sprint planning, daily standups and retrospectives. Here, too, proof of value must be provided; once it has been, the project moves to the engineering backlog. After implementation comes quality assurance: QPIs (Quality Performance Indicators) are used to decide, for example, whether a predictive model should make a prediction at all. Finally, a "scaling team" takes responsibility for the necessary scaling and for applying the model to other tasks.
René Traue and Christian Lindenlaub (GfK)
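The QPI idea can be illustrated with a short sketch: the model only answers when a quality indicator clears a threshold, and abstains otherwise. The indicator used here (top-class probability of a scikit-learn-style classifier) and the threshold are purely illustrative assumptions, not GfK's actual QPIs.

```python
# Minimal sketch of a QPI-gated prediction, assuming a scikit-learn-style classifier.
# The indicator (top-class probability) and the threshold are illustrative assumptions,
# not GfK's actual Quality Performance Indicators.

QPI_THRESHOLD = 0.8  # hypothetical minimum confidence required before answering


def gated_predict(model, X):
    """Return a label per sample, or None where the QPI falls below the threshold."""
    proba = model.predict_proba(X)             # class probabilities, shape (n_samples, n_classes)
    confidence = proba.max(axis=1)             # QPI: probability of the most likely class
    labels = proba.argmax(axis=1).astype(object)
    labels[confidence < QPI_THRESHOLD] = None  # abstain instead of returning a weak prediction
    return labels, confidence
```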
From data lab to data ops
"From data lab to data ops" was one of the catchphrases of the presentation "From sandbox to production – how to deploy machine learning models". The worst-case scenario for Michael Oettinger (freelancer), is the complete separation of data science and development – carried out by different staff and using different programming languages. There are several ways of preventing this worst-case scenario.
- Using an agile approach, data science and development can work within a network, with coordination via feedback loops.
- The use of web services is another good way to build a bridge. For Python projects, Flask can expose the model via an HTTP interface (a minimal sketch follows after this list). Deployment via Docker (perhaps in conjunction with Kubernetes) then enables trouble-free rollout and scaling.
- Finally, the cloud providers offer an all-in-one solution: AWS, Azure and the Google Cloud Platform provide complete pipelines right through to deployment. Databricks is a special case which, thanks to its Spark integration, can often be an exciting alternative.
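As a rough idea of the Flask approach mentioned above, a model-serving endpoint might look like the following sketch; the file name "model.pkl", the "/predict" route and the request format are illustrative assumptions rather than details from the talk.

```python
# Minimal sketch of serving a pickled model via Flask; file name, route and
# request format are illustrative assumptions, not details from the talk.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # a previously trained and pickled model
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A service like this can then be packaged as a Docker image and, where necessary, scaled with Kubernetes – the combination described in the talk.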
The talk was rounded off with a presentation of some sample projects, such as fraud detection, credit scoring and sales forecasting, including the use of KNIME and KNIME Server.
On the second day of Predictive Analytics World, we were able to see for ourselves how wide the range of options for implementing data science projects currently is. A uniform methodology does not (yet) exist, and a generally accepted best practice is still a long way off. It was therefore exciting to see how different approaches have established themselves in environments with different operating conditions.