With its six tracks and dozens of sessions, Predictive Analytics World (PAW) 2018 in Berlin is impossible to summarize without a clear focus. I therefore want to frame three exciting presentations in light of a question that Dean Abbott raised during his keynote on the second day: is it possible to define recipes for data science?
Dean Abbott himself focused on the relationship between the data preparation pipeline and the modeling further down the road. It would be highly efficient to build one, and only one, data pipeline and then train a whole model zoo on the results. After that, we pick the best model and roll with it. Sounds tempting? It is! However, there are several pitfalls in how different algorithms react to scaling or missing values (to cite two of the five examples he gave). There are guidelines and best practices. Yet Abbott emphasized that there is no general set of recipes that can be blindly applied in all cases.
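To make the "one pipeline, many models" idea concrete, here is a minimal sketch in Python with scikit-learn on synthetic data. The preprocessing choices (median imputation, standard scaling) and the candidate models are illustrative assumptions on my part, not taken from Abbott's talk; the point is that a single preprocessing recipe treats scale-sensitive and scale-indifferent models exactly the same.

```python
# A minimal sketch (scikit-learn, synthetic data) of one shared preprocessing
# pipeline feeding a small model zoo. Illustrative assumptions throughout.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few missing values sprinkled in.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),                             # sensitive to feature scaling
    "random_forest": RandomForestClassifier(random_state=0),   # indifferent to scaling
}

for name, model in candidates.items():
    # One imputation and one scaling choice, applied identically to every
    # model in the zoo: this is exactly where the "no universal recipe"
    # caveat bites.
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", model),
    ])
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```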
Hans Werner Zimmermann complemented this view with a highly stimulating presentation on how to build a forecasting model. Having spent his whole career in this field, he combined theoretical considerations with practical applications. The list of challenges in this area is extensive. To pick two examples:
- A successful model needs to account for both the effects of the environment and the momentum of the system itself (see the sketch after this list).
- Predicting an outcome means figuring out where the causal effects actually come from.
From Zimmermann's perspective, more data cannot help here. Instead, he argued for theoretical reasoning and in-depth domain knowledge.
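To illustrate the first bullet, here is a minimal sketch in Python with statsmodels on synthetic data: an AR(1) component stands in for the system's own momentum, and an exogenous "temperature" series for an environmental effect. The data, variable names, and model order are illustrative assumptions, not taken from Zimmermann's talk.

```python
# A minimal sketch (statsmodels, synthetic data): momentum as an AR(1)
# component, the environment as an exogenous regressor. All numbers are made up.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
n = 200
temperature = 10 + 5 * np.sin(np.arange(n) * 2 * np.pi / 52)  # the "environment"

# Momentum: an AR(1) process that carries the system's own dynamics.
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal(scale=1.0)

# Observed series: environmental effect plus the system's momentum.
sales = 0.5 * temperature + u

# Regression with AR(1) errors: the fit should roughly recover the momentum
# coefficient (0.7) and the temperature effect (0.5).
model = SARIMAX(sales, exog=temperature, order=(1, 0, 0))
result = model.fit(disp=False)
print(result.summary())
```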
How domains shape the specific challenges of machine learning problems was further demonstrated by Malte Pietsch in the field of natural language processing (NLP). His talk focused on named entity recognition (NER). In recent years, there have been impressive improvements in this field.
However, there is still an annoying tradeoff: some areas have a lot of text, but it is hard to build a business use case; where there is a valid use case, there is often not enough data. More recently, deep learning architectures combined with embeddings have tried to mitigate this problem. For instance, Malte Pietsch referred to Google's BERT project, which aims to provide a basis for successful transfer learning. The related GitHub repository was starred nearly 8,000 times in just two weeks!
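For readers new to NER, here is a minimal sketch using spaCy's small pretrained English model. This is an illustrative tool choice on my part, not the stack from Pietsch's talk; it simply tags entities such as persons, organizations, and places in raw text.

```python
# A minimal NER sketch with spaCy (illustrative choice, not the tooling from
# the talk). Requires the small English model:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Malte Pietsch presented Google's BERT at Predictive Analytics World in Berlin.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, ORG, GPE
```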
For me, there are three general takeaways from this day:
- A clean problem definition and in-depth domain knowledge are crucial for determining the best approach to a machine learning problem. Best practices are not recipes.
- The value of feature engineering is higher than you probably think, no matter the problem in question.
- Data science and AI remain fast-paced, rapidly growing fields, with NLP as a prime example.
We're looking forward to Predictive Analytics World 2019!