I am writing this blog post not in a quiet moment at our b.telligent offices, but live from the Spark Summit in Brussels. For data scientists, Spark offers an enormous range of machine learning methods, both for traditional static data sets and for streaming data in real time. Anyone with practical experience in the Python library sklearn will feel at home immediately, as it served as a model.
Spark Recap
Actually, Spark should no longer need an introduction. However, for everyone who tuned in late: Spark is an open source framework for distributed computing. It is frequently mentioned in the same breath as many other big data technologies and is often wrongly described as a Hadoop competitor. In fact, it is rather a useful complement to Hadoop: while Hadoop is more of a framework for the robust storage of mass data distributed across clusters, Spark processes those data volumes in a highly performant, distributed manner. Spark can be seen as a big data technology in its own right, but it has no data storage of its own.
The Spark Summit
Measured against traditional commercial software, Spark is still very new. The first version was released only four years ago, but in the USA it has been celebrated as the "Taylor Swift of Big Data Software" for a year now, and there have already been various international Spark conferences. I am currently at one of them, the European Spark Summit 2016. It is one of what are now three events per year and the only one outside the USA. With 1,000+ visitors and 100 speakers from industry and academia, it covers developments and trends, technical details and use cases of Spark.
The Master Speaks
For me, it is impressive to witness Matei Zaharia's keynote speech. Matei released the first version of Spark only a few years ago as university work during his PhD studies at Berkeley. He now holds a professorship at Stanford and is co-founder and CTO of Databricks. The company carries most of the Spark development work and is the world leader in providing commercial work environments for Spark. The summit in Brussels is, so to speak, a Databricks in-house exhibition.
At the Spark Summits, Matei takes the opportunity to give the opening speech himself and to share his latest vision for Spark with a visibly excited and respectfully attentive audience. He has the aura of an IT nerd, is highly focused on his topic and talks non-stop; you can literally hear him catch his breath between long sentences. Nevertheless, he is an experienced and very confident presenter. One could even say that, in the end, the Spark Summit is his event.
The Innovations in Spark 2.0
The current summit is the first since the release of Spark 2.0. Above all, this version brings significant performance gains through further fundamental optimizations (Project Tungsten: "Bringing Apache Spark Closer to Bare Metal") as well as a more unified API, in particular through the merging of DataFrames and Datasets.
In addition, a new interface for streaming data was introduced. A second keynote showed a live demonstration of it using Twitter feeds, for us Europeans on Brexit instead of the Clinton-Trump campaign. The machine learning model, which had initially been created for statically loaded data and batch calculations, could also be applied to a streaming source simply by adjusting one line. The speaker continually switched between Scala and Python code in the same notebook. Thanks to this more generic API, Spark users no longer have to decide in advance whether they are developing for streaming or for batch processing.
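To make the "one line" a bit more tangible, here is a minimal sketch in PySpark. It is not the code from the keynote: instead of a machine learning model it uses a simple count per hashtag, and the schema, the column names and the helper names (load_tweets, count_by_hashtag, run_demo) are my own assumptions for illustration. It also assumes a PySpark version of 2.3 or later, where a schema can be passed as a plain string.

```python
# Sketch only: not the code shown in the keynote. The schema, column names
# and helper names are assumptions made for illustration.

TWEET_SCHEMA = "hashtag STRING, text STRING"  # assumed layout of the JSON files


def load_tweets(spark, path, streaming=False):
    """Return a DataFrame over the JSON files found in `path`.

    The only difference between batch and streaming is the reader:
    spark.read versus spark.readStream.
    """
    reader = spark.readStream if streaming else spark.read
    return reader.schema(TWEET_SCHEMA).json(path)


def count_by_hashtag(tweets):
    """The actual logic, identical for a static and a streaming DataFrame."""
    return tweets.groupBy("hashtag").count()


def run_demo(path, streaming=False):
    """Run the same aggregation in batch or streaming mode (needs PySpark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hashtag-counts").getOrCreate()
    counts = count_by_hashtag(load_tweets(spark, path, streaming))
    if streaming:
        # A streaming aggregation is written out continuously until stopped.
        query = (counts.writeStream
                 .outputMode("complete")
                 .format("console")
                 .start())
        query.awaitTermination()
    else:
        counts.show()
```

Calling run_demo("/some/folder") processes the files once as a batch, while run_demo("/some/folder", streaming=True) keeps watching the folder for new files. The function count_by_hashtag stays untouched in both cases, which is the point of the more generic API.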
Interesting Presentation Topics
The presentations following the keynotes are divided into four tracks: data science, developer, Spark ecosystem and enterprise. With a total of approximately 1,000 summit visitors, the presentations are always well attended. The topics range from predictive maintenance (for example, forecasting track switch failures) and the integration of Spark with related technologies (R wrapper, Sparkling Water/H2O, Apache Ignite) to "Internet of Things" topics. A presentation by mmmmoooogle even reports on using Spark to analyze a wide range of biometric data from connected cows in order to increase milk production: an "Internet of Cows".
The Hype Goes On
Spark has spread very quickly over the last three years, both in academia and in industry. In a meeting, a leading Databricks business developer told me that practically no Databricks customer has tried Spark and then concluded that it did not work as desired or was unsuitable for them. He also explained that, due to the rapidly growing community, the volume of requests and ideas for new Spark features is so enormous that Databricks always has to prioritize which development offers the greatest benefit to as many users as possible before taking up a topic.
In my opinion, calling Spark a hype does the technology an injustice. For me, hype always implies a bubble around the topic. However, Spark does not promise anything it cannot deliver, and its success and rapid spread speak for themselves. It remains exciting.
"ONE MORE THING": UPDATE FOR DAY 2
The latest keynotes on day two of the Spark Summit bring a few interesting announcements.
APACHE SPARK + GOOGLE TENSORFLOW + DATABRICKS = TENSORFRAMES
Databricks has just announced a new project, TensorFrames. Anyone using a Databricks notebook can rely on the convenient service of having a scalable cluster of Spark nodes started in the Amazon AWS cloud (or soon also in Microsoft Azure!) within minutes when needed. This relieves the user of all the time-consuming infrastructure work and makes complex, pre-configured computing systems available almost immediately. This offering is now being extended with newly available Spark nodes that are based on GPUs, i.e. graphics cards, and tailored to deep learning tasks.
In a live demonstration, the speaker applied Google's DeepDream algorithm via TensorFlow within a Databricks Spark notebook to an image of Boris Johnson that had been downloaded from the internet just before. Impressive!
By the way, Spark is cooperating with the graphics card manufacturer NVIDIA in order to further align the technologies.
IBM WATSON DATA PLATFORM
In the second keynote, IBM introduces its new machine learning workbench. The product took only half a year of development to reach its current release as a "closed beta" version. It is part of the IBM Watson family and offers an "end to end collaboration platform" for machine learning. There is even talk of a democratization of machine learning, as the handling is so easy that anyone could work with it. At the same time, it offers sufficient options and points of intervention for professional data scientists.
The product has been built entirely on Spark. The fact that Spark is successfully used as the core of an IBM Watson product only approximately three years after its release shows its potential.