The curse of the PoC
Anyone who works on machine-learning use cases will presumably know this problem: there is now an abundance of solutions and engines, many of them open source. They allow initial analyses and models to be developed quickly, but integrating the solution into business processes requires significant additional effort. Unfortunately, the outcome is often only a PoC, and the trained model is never applied. A productive roll-out means transferring the prepared data, the developed features and the trained models to a deployment or application process. This frequently happens in a different infrastructure and, in the worst case, on a different technology. Existing work steps must then be re-implemented, usually by people other than those who developed the model, so that business logic ends up being maintained in several places.
On top of that, consistent versioning of models and monitoring of model performance are usually neglected. Often only the latest model exists, with no history or traceable development, so results may no longer be reproducible.
Enter the stage: MLflow
Drawing on their own experience with customers and users, a team led by Matei Zaharia, the creator of Spark, has addressed this problem and published a Python-based open-source machine-learning platform called MLflow (https://mlflow.org/). It was presented at the Spark Summit back in June.
I find it a pity that the announcement has not had the impact that the vision behind this platform deserves. I will therefore briefly introduce the fundamentals of MLflow here. The product is still in alpha status and therefore a long way from production use, but I am already excited about the framework's current functionality and generic approach. A new release (0.5), which further extends the scope, came out just recently.
Three modules
The framework currently consists of three components:
Tracking
Tracking provides a set of Python functions for logging everything from hyper-parameters and results (model performance) to artefacts such as charts.
An included tracking server provides a neat user interface that displays the results and makes them searchable. Tracking thus makes it possible to collect settings, parameters and results for each training or scoring run of a model and to monitor them in one place. In Python, values are recorded with simple function calls similar to logging calls. Alternatively, the tracking server can be addressed directly via a REST API, which allows use from outside Python.
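A minimal sketch of what such logging calls can look like. It assumes the `mlflow` module is passed in as `tracker`, so the function itself runs without a tracking server. The parameter name `alpha`, the metric `rmse` and the file `notes.txt` are hypothetical; the calls used are `start_run`, `log_param`, `log_metric` and `log_artifact`.

```python
import tempfile
from pathlib import Path


def log_training_run(tracker, params, metrics, notes):
    """Record one run via an MLflow-style tracking module.

    `tracker` is expected to be the imported `mlflow` module.
    """
    with tracker.start_run():
        # Hyper-parameters are logged as key/value pairs.
        for name, value in params.items():
            tracker.log_param(name, value)
        # Metrics are numeric results such as model performance.
        for name, value in metrics.items():
            tracker.log_metric(name, value)
        # Artefacts are files; a chart image would be logged the same way.
        with tempfile.TemporaryDirectory() as tmp_dir:
            notes_file = Path(tmp_dir) / "notes.txt"
            notes_file.write_text(notes)
            tracker.log_artifact(str(notes_file))


# Usage, assuming MLflow is installed (pip install mlflow):
#   import mlflow
#   log_training_run(mlflow, {"alpha": 0.5}, {"rmse": 0.78}, "baseline run")
```

Passing the module in as an argument is only a choice made for this sketch; in a real training script the `mlflow.*` calls would normally be written inline.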
Projects
This module allows an ML project to be encapsulated as a self-contained package or Git repository. Users of the project do not need to know anything about its internal workings. Entry points and basic configuration are defined in a YAML file. This file can also control, for example, the layout of a Conda environment that MLflow creates when needed.
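To give an impression of such a YAML file, the following sketch writes a small `MLproject` definition to disk. The project name, `conda.yaml`, `train.py` and the two parameters are hypothetical. The keys `name`, `conda_env`, `entry_points`, `parameters` and `command` follow the MLproject format as I understand it, so please check them against the documentation of the release you use.

```python
from pathlib import Path

# Illustrative MLproject content. The project name, conda.yaml, train.py
# and both parameters are made up for this example.
MLPROJECT_TEXT = """\
name: example_project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      data_path: path
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha} {data_path}"
"""


def write_mlproject(project_dir):
    """Write the definition to <project_dir>/MLproject and return its path."""
    target = Path(project_dir) / "MLproject"
    target.write_text(MLPROJECT_TEXT)
    return target
```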
This means that data preparation and model training can be started with a single command-line call that specifies a Git repository. The actual data set can be passed as a parameter, for example as a path. Because the project is tied to Git, different versions of a model can also be executed without any problems.
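As a sketch of that single call, the helper below assembles the arguments for the `mlflow run` command, using its `--version` option (a Git commit reference) and `-P name=value` parameters. The repository URL and parameter names in the usage note are placeholders, and `run_project` assumes the `mlflow` command-line tool is on the PATH.

```python
import subprocess


def build_run_command(project_uri, params, version=None):
    """Assemble the argument list for `mlflow run`."""
    command = ["mlflow", "run", project_uri]
    if version is not None:
        # Pin the project to a specific Git commit.
        command += ["--version", version]
    for name, value in params.items():
        # Each project parameter is passed as -P name=value.
        command += ["-P", f"{name}={value}"]
    return command


def run_project(project_uri, params, version=None):
    """Execute the project; requires the mlflow CLI to be installed."""
    return subprocess.run(
        build_run_command(project_uri, params, version), check=True
    )
```

For example, `build_run_command("https://example.com/repo.git", {"data_path": "data/train.csv"}, version="abc123")` yields the argument list for one pinned version of a hypothetical project; changing `version` runs another one.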
Models of every flavour
The third module provides the interface to downstream technologies and simplifies deployment. "Models" supports different flavours of models. A model is stored in a flavour, represented by a binary file, for example as a pure Python function or as a scikit-learn, Keras, TensorFlow, PyTorch, Spark MLlib or H2O model.
In addition, there is already support for cloud platforms such as Azure ML and Amazon SageMaker. As with "Projects", an accompanying configuration file helps the consumer apply the stored model to new data.
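The idea of flavours can be sketched as follows: a model is saved in its native flavour and later scored through the generic Python-function flavour, without the consumer knowing which framework produced it. The sketch assumes that `mlflow.sklearn` is passed as `native_flavour` and `mlflow.pyfunc` as `generic_flavour`. `save_model` and `load_model` are the names used in recent releases, and early 0.x releases named the loader differently, so treat the exact calls as an assumption to verify.

```python
def save_and_score(model, model_dir, rows, native_flavour, generic_flavour):
    """Save `model` in its native flavour, then score via the generic one.

    Intended arguments: `mlflow.sklearn` as `native_flavour` and
    `mlflow.pyfunc` as `generic_flavour`.
    """
    # Stores the serialised model together with its descriptor file.
    native_flavour.save_model(model, model_dir)
    # The consumer loads the model without knowing the training framework.
    generic_model = generic_flavour.load_model(model_dir)
    return generic_model.predict(rows)


# Usage, assuming MLflow, scikit-learn and pandas are installed:
#   import mlflow.pyfunc, mlflow.sklearn
#   save_and_score(fitted_model, "model_dir", new_rows_df,
#                  mlflow.sklearn, mlflow.pyfunc)
```

With the real modules, `rows` would typically be a pandas DataFrame and `model_dir` a directory that does not exist yet.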
Application
Simple standard installation
At the moment, MLflow is used primarily as a Python module. I installed it via pip in a Conda environment and was able to run the examples from the Git repository without any problems. MLflow itself also uses the Conda package manager to set up projects with a predefined Python interpreter.
APIs
However, the final application does not necessarily have to go through Python, because MLflow offers a REST API in addition to a command-line interface. It can be used to run models and to log information via the tracking server.
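The following sketch shows how a metric could be logged through the REST API with nothing but the standard library. The request is only prepared, not sent. The endpoint path `/api/2.0/mlflow/runs/log-metric` and the field names `run_id`, `key`, `value` and `timestamp` reflect my understanding of later releases; early 0.x releases used a `preview` path and a `run_uuid` field, so check the REST documentation for your version. The server address in the usage note is an assumption.

```python
import json
import time
import urllib.request


def build_log_metric_request(server_url, run_id, key, value):
    """Prepare (but do not send) a POST request that logs one metric."""
    payload = {
        "run_id": run_id,
        "key": key,
        "value": value,
        "timestamp": int(time.time() * 1000),  # milliseconds since the epoch
    }
    return urllib.request.Request(
        url=f"{server_url.rstrip('/')}/api/2.0/mlflow/runs/log-metric",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, assuming a tracking server at http://localhost:5000 and an
# existing run ID:
#   request = build_log_metric_request("http://localhost:5000", run_id, "rmse", 0.78)
#   urllib.request.urlopen(request)
```

Because the request is plain JSON over HTTP, the same call can be made from any language that has an HTTP client.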
One ring to link everything?
As mentioned above, the platform is still in an early alpha stage. The available functionality is nonetheless impressive, and this is just the beginning. The vision of a framework that unites other frameworks is certainly not new, and such efforts have not always been crowned with success. The offered solution, encapsulating ML projects built with various frameworks so that they can be distributed and used regardless of their implementation, has yet to prove itself, both in terms of compatibility with target environments and in terms of performance and responsiveness.
Two things nevertheless make me very confident:
Matei is on board
Matei, the mastermind behind Spark, is personally involved in the project and, symbolically, was probably the first committer to the code base. Although we are currently dealing with an open-source product, the crew behind it has already succeeded, with Spark, in building the largest open-source project in the big-data space.
I therefore have no fear that MLflow will fizzle out after the initial euphoria. On the contrary, with a view to a possible future commercial version, I expect the platform to leave the alpha phase quickly.
It's Python
In a blog post several years ago, I already recommended learning Python as the future lingua franca of data science. Developments in Python over the last three years have nevertheless exceeded my wildest expectations. In July, for example, "The Economist" published an article stating that Python is becoming the most popular programming language of all.
Apart from Spark, where the latest features are always provided first in the primary Scala API, the ML world and especially the deep-learning frameworks appear to agree on Python as a common language. That MLflow has relied on Python from the outset, despite the developers' Spark/Scala history, is an important signal for me. Python is here to stay.