As stated in part one of this blog series on the reference architecture for our cloud data platform, we will describe the different parts of this model and then translate it for the three major cloud providers – Google Cloud Platform, AWS, and Azure. If you came across this blog post without having seen the first one, which covers the ingestion part of our model, you can still read it here first. For everyone else, we'll start by looking at the data lake part of the b.telligent reference architecture before diving deeper into analytics and DWH in part 3.
And before you start complaining: here is the reference architecture in its entirety. Go for it!
We have successfully ingested data into the platform, which by itself is usually a major obstacle in many organizations we’ve dealt with. But now we need to tackle a myriad of topics in the field of data management. Specifically, we must transform raw data sets into a standardized and curated form (think data cleansing, data quality, GDPR, etc.), or act upon incoming events, such as sending a mobile notification about an update to your bank account.
The stream processing service handles incoming events from the message queue on an event or micro-batch basis, and applies transformations, lookups, or other operations. This keeps the footprint of these services small, which means lower CPU and RAM requirements. Whether in managed or serverless form, both options must account for cold starts (it may take a few seconds or minutes to spin up resources in the background before the first event is processed) and timeouts (they are not suitable for long-running processes). Another option is to run a self-managed stream processing engine (like Spark), but here you need to seriously consider whether the configuration freedom you gain outweighs the increased operational effort. On the other hand, you may have existing streaming pipelines written in Spark, and you need to lift & shift them quickly. In that case, your obvious choice would be a managed big data cluster service.
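To make this a bit more concrete, here is a minimal sketch of such a micro-batch streaming pipeline in PySpark Structured Streaming, assuming events arrive on a Kafka topic. The broker address, topic name, event schema, and lake paths are purely illustrative assumptions, not part of the reference architecture:

```python
# Minimal sketch of a micro-batch streaming job (PySpark Structured Streaming).
# Broker, topic, schema, and paths are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, current_timestamp
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("stream-processing-sketch").getOrCreate()

# Assumed event schema for messages arriving on the queue.
event_schema = (StructType()
                .add("account_id", StringType())
                .add("amount", DoubleType())
                .add("currency", StringType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "account-events")             # hypothetical topic
          .load())

# Parse the raw payload, filter invalid records, enrich with a processing timestamp.
parsed = (events
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*")
          .where(col("amount").isNotNull())
          .withColumn("processed_at", current_timestamp()))

# Write each micro-batch to the curated zone of the data lake as Parquet.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3a://data-lake/curated/account-events/")               # hypothetical path
         .option("checkpointLocation", "s3a://data-lake/_checkpoints/account-events/")
         .trigger(processingTime="1 minute")                                      # micro-batch interval
         .start())

query.awaitTermination()
```

The same logic could just as well run in a managed or serverless streaming service; the point is that each invocation processes a small slice of events, which is what keeps the resource footprint low.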
Batch processing is almost a mirror image of the characteristics stated above. It typically operates on large data sets, thus placing more importance on CPU and RAM configuration, while pushing cold starts and timeouts aside. With processing times running into minutes or hours, parallel processing with high availability and durability of intermediate results takes precedence (remember Hadoop?). From this you may have already guessed that what was deemed tedious work just a few years ago (setting up your cluster) has become a matter of low single-digit minutes in the cloud-native world. The outcome: transient, single-purpose clusters have become more relevant than 24/7 general-purpose clusters.
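As a rough illustration of the kind of job such a transient, single-purpose cluster would run, here is a minimal PySpark batch sketch that standardizes a raw data set and writes a curated result back to the lake. The paths, column names, and quality rules are assumptions made for the example:

```python
# Minimal sketch of a batch curation job for a transient cluster.
# Paths, columns, and rules are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-curation-sketch").getOrCreate()

# Read a large raw data set (e.g. one day's landed files).
raw = spark.read.json("s3a://data-lake/raw/account-events/load_date=2024-01-31/")

# Standardize: drop duplicates, enforce a basic quality rule, normalize column names.
standardized = (raw
                .dropDuplicates(["event_id"])
                .where(F.col("amount").isNotNull())
                .withColumnRenamed("acct", "account_id"))

# Aggregate into a curated data set and write it back to the lake, partitioned by country.
(standardized
 .groupBy("account_id", "country")
 .agg(F.sum("amount").alias("total_amount"),
      F.count("*").alias("event_count"))
 .write.mode("overwrite")
 .partitionBy("country")
 .parquet("s3a://data-lake/curated/account-balances/"))

spark.stop()
```

Because the cluster exists only for the duration of this job, it can be sized for exactly this workload and shut down afterwards.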
Once we have processed our data into its final form, we need to store the results in a multi-purpose persistent storage, a data lake. Technically, this component is a scalable and highly durable storage service, with requirements such as geofencing, encryption at rest, fine-grained security policies, etc. From a logical perspective, we can structure our data lake into areas based on the data processing stage (from a raw to a standardized, curated, and trusted layer, for instance), and further by business dimensions such as country (for organizations with a global footprint). Most importantly, our data lake service must be tightly integrated with other cloud services to enable AI/ML, business, or DWH teams to work with consistent and reliable data sets.
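As an illustration of such a logical structure, here is a small Python sketch that builds object-storage prefixes by processing stage and business dimension. The zone names and path scheme are just one possible convention, not a fixed part of the reference architecture:

```python
# Illustrative zone/partition layout for the data lake, assuming object storage
# with prefixes for processing stage, data domain, country, and load date.
from datetime import date

ZONES = {"raw", "standardized", "curated", "trusted"}  # processing stages, as in the text

def lake_path(zone: str, domain: str, country: str, load_date: date) -> str:
    """Build an object-storage prefix for a data set in a given lake zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return (f"s3a://data-lake/{zone}/{domain}/"
            f"country={country}/load_date={load_date.isoformat()}/")

print(lake_path("curated", "account-events", "DE", date(2024, 1, 31)))
# s3a://data-lake/curated/account-events/country=DE/load_date=2024-01-31/
```

A consistent path convention like this is what lets downstream AI/ML, business, and DWH teams find the same data sets in the same place, regardless of which cloud provider hosts the storage.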
That’s two out of three parts. Before we proceed to the vendor-specific versions of our model, we’ll first dive into the analytics platform, data warehouse, and BI elements of our architecture in part 3. Any questions so far?