In our free series of online events under the banner of Data Firework Days, we introduced you to the b.telligent reference architecture for cloud data platforms. In this blog series, we'd now like to take a closer look at the cloud and the individual providers of cloud services. In the first part of this three-part series, Blueprint: Cloud Data Platform Architecture, we looked at the architecture of cloud data platforms in general, including building blocks such as:
Central services – supplementary services for building and running a data platform and ensuring data governance
Data services – on-demand data access
Spotlight on AWS
In this blog post we look at how Amazon's Lake House approach fits in with our reference architecture and how it can be used to implement a cloud data platform.
So What Exactly Is the Lake House Approach and What Services Does It Offer?
The AWS Lake House architecture approach describes how various services are integrated in the AWS cloud. This architecture makes it possible to create an integrated and scalable solution for processing, storage, analysis and governance of large amounts of data. Unlike other providers' lakehouse concepts, this architecture goes beyond the mere integration of data lake and data warehouse technologies. It also includes streaming, data governance and analytics.
In addition to the cost-effective scalability of the overall system, the use of demand-oriented data services is also important for AWS. AWS maintains that a product based on a single service can never be more than a compromise solution. The provider is therefore a keen advocate of combining a variety of services to meet the user's requirements as optimally and comprehensively as possible.
The above diagram shows the individual AWS services available as part of the Lake House architecture, with the data lake at its core. The following services are used to build a data lake on AWS:
Amazon S3 – data storage
AWS Glue – data catalog & ETL
AWS Lake Formation – data governance
Amazon Athena – direct access to the data lake via an SQL interface
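To make this core a little more concrete, here is a minimal boto3 sketch that runs an SQL query with Athena against a table registered in the Glue Data Catalog. The database, table and result bucket names (sales_lake, orders_raw, datalake-demo-results) are hypothetical placeholders, not taken from the article.

```python
import time
import boto3

athena = boto3.client("athena")

# Start an Athena query against a table registered in the Glue Data Catalog.
# Database, table and output bucket are hypothetical placeholders.
response = athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM orders_raw LIMIT 10",
    QueryExecutionContext={"Database": "sales_lake"},
    ResultConfiguration={"OutputLocation": "s3://datalake-demo-results/athena/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query has finished, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```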
The other services are then arranged around this central core. The advantage here is that, depending on the application, they can be used individually or in combination – and the cost is usage-based.
In this integrated system, communication between the services is also possible in different directions. A service such as a relational database can not only supply data to the data lake but can also receive reprocessed data back from the data lake or data warehouse. The AWS Glue Data Catalog ensures that data is available to all services via a common metadata view, and the data in the catalog is governed by AWS Lake Formation.
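As a hedged illustration of this shared metadata view and its governance, the following boto3 sketch lists the tables of a hypothetical Glue database and grants an analyst role SELECT rights on one of them via Lake Formation; all names and ARNs are placeholders.

```python
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# List the tables that the Glue Data Catalog exposes for a (hypothetical) database.
tables = glue.get_tables(DatabaseName="sales_lake")["TableList"]
for table in tables:
    print(table["Name"], table["StorageDescriptor"]["Location"])

# Grant an analyst role SELECT on one table; Lake Formation enforces this
# for every service that reads the table through the catalog.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={"Table": {"DatabaseName": "sales_lake", "Name": "orders_raw"}},
    Permissions=["SELECT"],
)
```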
The concept is easy to understand by looking at the interoperability between the data lake and Amazon Redshift, the AWS cloud data warehouse service. The metadata in the data catalog allows a specific part of the data lake to be accessed transparently as part of the data warehouse; Amazon Redshift Spectrum is the feature used to do this. With Spectrum, a database defined in the Glue Data Catalog appears as a schema within the Amazon Redshift database, and the tables in this schema can be combined with Redshift-native tables in SQL queries, joins included. This access remains subject to governance by AWS Lake Formation, so users can only reach the tables, columns and data for which they have access rights.
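A sketch of this interoperability, assuming a provisioned Redshift cluster and using the Redshift Data API via boto3: the first statement registers the Glue database as an external schema, the second joins an external table with a native Redshift table. Cluster, role and object names are hypothetical.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Expose the Glue database "sales_lake" as an external schema inside Redshift.
# The IAM role must be allowed to read the catalog and the underlying S3 data.
redshift_data.execute_statement(
    ClusterIdentifier="demo-cluster",
    Database="analytics",
    DbUser="admin",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'sales_lake'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
    """,
)

# Join an external (S3) table with a native Redshift table in one SQL statement.
redshift_data.execute_statement(
    ClusterIdentifier="demo-cluster",
    Database="analytics",
    DbUser="admin",
    Sql="""
        SELECT c.customer_id, SUM(o.amount) AS revenue
        FROM lake.orders_raw o                         -- external table via Spectrum
        JOIN public.customers c USING (customer_id)    -- native Redshift table
        GROUP BY c.customer_id;
    """,
)
```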
Data Lake, Data Warehouse, Data Catalog, Etc. – What Parts of the Cloud Data Platform Are Covered by the Lake House Approach?
The services and functionalities described in the Lake House approach can now be mapped onto most of the b.telligent reference architecture for cloud data platforms:
Ingestion and processing
Streaming – Amazon Kinesis (a producer sketch follows this overview)
Batch – AWS Glue ETL
Data lake – data storage
Amazon S3
Amazon Athena
Data warehouse
Amazon Redshift
Analytical platform
Amazon SageMaker
Metadata management and GDPR services
AWS Glue Data Catalog
AWS Lake Formation
Parts of the reference architecture are covered by AWS Lake House architecture
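For the streaming ingestion layer listed above, a minimal producer that pushes JSON events into a hypothetical Kinesis data stream might look like the following sketch; the stream name and event structure are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# A single event from a source system; stream name and payload are placeholders.
event = {"order_id": "A-1001", "amount": 49.90}

kinesis.put_record(
    StreamName="sales-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["order_id"],  # determines the shard the record lands on
)
```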
The remaining parts of the b.telligent reference architecture are covered by other services. These are not explicitly mentioned in the Amazon Lake House architecture approach, but they are part of the extensive AWS service catalog:
Data visualization/reporting – Amazon QuickSight
Automation & scheduling – Amazon Managed Workflows for Apache Airflow (MWAA); a DAG sketch follows this list
Continuous integration and deployment – AWS CodePipeline and AWS CodeDeploy
Process & cost monitoring & logging – Amazon CloudWatch, AWS CloudTrail and AWS Cost Explorer
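For automation and scheduling with MWAA, a minimal Airflow DAG sketch that starts an existing Glue ETL job once a day could look like this; the DAG id, schedule and job name are assumptions rather than part of the article.

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def start_glue_job():
    # Kick off an existing Glue ETL job; the job name is a hypothetical placeholder.
    glue = boto3.client("glue")
    run = glue.start_job_run(JobName="load_orders_to_lake")
    print("Started Glue job run:", run["JobRunId"])


with DAG(
    dag_id="daily_lake_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="start_glue_job", python_callable=start_glue_job)
```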
A complete implementation of the reference architecture with AWS services could look like this:
If you're about to implement a cloud data platform and would like to know more about implementing it on AWS, just contact us and we'll be happy to explain how we can guide you on your journey into the world of the AWS cloud.