The Innovations in Spark 2.0
This summit is the first since the release of Spark 2.0. Above all, the new version delivers significant performance gains through further fundamental optimizations (Project Tungsten: "Bringing Apache Spark Closer to Bare Metal") and standardizes the APIs, most notably by unifying DataFrames and Datasets.
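Two concrete facets of this consolidation: in Scala, a DataFrame is now simply a Dataset[Row], and across languages SparkSession replaces the separate SQLContext and HiveContext as the single entry point. A minimal PySpark sketch of the latter (the app name and toy data are illustrative):

```python
from pyspark.sql import SparkSession

# Spark 2.0: one entry point instead of SparkContext + SQLContext + HiveContext.
spark = (SparkSession.builder
         .appName("spark2-entry-point")
         .getOrCreate())

# DataFrames are created directly from the session ...
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# ... and SQL, catalog access, and configuration hang off the same object.
df.createOrReplaceTempView("items")
spark.sql("SELECT count(*) AS n FROM items").show()
```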
In addition, Spark 2.0 introduces a new interface for streaming data, Structured Streaming. A second keynote demonstrated it live on Twitter feeds, for us Europeans on the Brexit vote rather than the Clinton-Trump campaign. A machine learning model that had originally been built for statically loaded data and batch computation could be applied to a streaming source simply by changing one line of code. The speaker switched back and forth between Scala and Python code in the same notebook. Thanks to this more generic API, Spark users no longer have to decide up front whether they are developing for streaming or for batch processing.
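A minimal PySpark sketch of that one-line switch (paths, schema, and column names are illustrative assumptions, not taken from the keynote):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch variant: a static directory of JSON tweets.
static_tweets = spark.read.json("/data/tweets")

# Streaming variant: the one line that changes (reusing the batch schema).
streaming_tweets = spark.readStream.schema(static_tweets.schema).json("/data/tweets")

# Identical DataFrame logic works on either source, e.g. counts per language.
batch_counts = static_tweets.groupBy("lang").count()
stream_counts = streaming_tweets.groupBy("lang").count()

batch_counts.show()

# A streaming DataFrame is executed as a continuous query instead.
query = (stream_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```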
Interesting Presentation Topics
The presentations following the keynotes are divided into four tracks: data science, developer, Spark ecosystem, and enterprise. With a total of roughly 1,000 summit visitors, the talks are consistently well attended. The topics range from predictive maintenance, such as forecasting track switch failures, through integrations of Spark with related technologies (R wrappers, Sparkling Water/H2O, Apache Ignite), to "Internet of Things" use cases. A presentation by mmmmoooogle even reported on using Spark to analyze extensive biometric data from connected cows in order to increase milk production: an "Internet of Cows", so to speak.
The Hype Goes On
Spark has spread very quickly over the last three years, in academia as well as in industry. In a meeting, a senior Databricks business developer told me that practically none of Databricks' customers have tried Spark and concluded that it did not work as desired or was unsuitable for their purposes. He also explained that the stream of feature requests and ideas from the rapidly growing community is so large that Databricks always has to prioritize which developments offer the greatest benefit to the most users before taking on a topic.
In my opinion, calling Spark a hype does the technology an injustice. To me, hype always implies a bubble around a topic. Spark, however, does not promise anything it cannot deliver, and its success and rapid adoption speak for themselves. It remains exciting.
"One More Thing", Update for Day 2
The keynotes on day two of the Spark Summit bring a few interesting announcements.
Apache Spark + Google TensorFlow + Databricks = TensorFrames
Databricks has just announced a new project: TensorFrames. Anyone using a Databricks notebook can rely on the convenience that, when needed, a scalable cluster of Spark nodes is spun up in the AWS cloud (and soon also in Microsoft Azure!) within minutes. This spares the user the entire time-consuming infrastructure setup and provides complex, pre-configured computing systems almost immediately. This offering is now extended by Spark nodes backed by GPUs, i.e. graphics cards, tailored for deep learning workloads.
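To give an idea of the programming model (a sketch following the early TensorFrames Python API from the project's README; names such as tfs.block and tfs.map_blocks are taken from there and may have evolved since): a TensorFlow graph is applied block-wise to the columns of a Spark DataFrame.

```python
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("tensorframes-sketch").getOrCreate()

# A toy DataFrame with one numeric column 'x'.
df = spark.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # TensorFlow placeholder bound to the DataFrame column 'x'.
    x = tfs.block(df, "x")
    # A trivial computation on the column: z = x + 3.
    z = tf.add(x, tf.constant(3.0), name="z")
    # Execute the graph over the DataFrame; the result gains a column 'z'.
    df2 = tfs.map_blocks(z, df)

df2.show()
```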
In a live demonstration, the speaker ran Google's DeepDream algorithm via TensorFlow inside a Databricks Spark notebook, on an image of Boris Johnson downloaded from the internet just moments before. Impressive!
By the way, the Spark project is working with the graphics card manufacturer NVIDIA to further align the two technologies.