As we have seen in the previous blog post, we should now have our transformed data in the data lake and have it available in the Glue Data Catalogue. In this blog post, we will first discuss what AWS Lake Formation is and see how we can use it to securely share access to the data.
What is AWS Lake Formation?
AWS Lake Formation is a fully managed service that makes it easy to set up, secure, and manage a data lake on AWS. With AWS Lake Formation, you can define security and access controls, set up data management and governance, and integrate with other AWS services for data processing and analysis.
Some of the benefits of AWS Lake Formation include:
Easy setup: Lake Formation simplifies the process of setting up a data lake by making it easy to import databases from within AWS as well as from external sources. It provides several pre-built data connectors for sources such as RDS, DynamoDB, and S3. It is also possible to use an existing data lake that is already set up in S3, for example.
Catalog your data: Lake Formation crawls your data sources via Glue to extract the metadata and creates a searchable data catalogue.
Manage Access Controls: Lake Formation allows you to manage access to the data in your data lake. It supports fine-grained access controls, which let you define who can access which data and at what level of granularity (table, column, row, and cell level). These policies apply to IAM users and roles (see the sketch after this list).
Cross-account access: Lake Formation simplifies sharing access to data across different AWS accounts.
AWS service integration: Lake Formation also integrates with AWS Glue, AWS IAM, Amazon Redshift, Amazon Athena, AWS KMS, and several other services.
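To make the fine-grained access controls concrete, here is a minimal boto3 sketch that grants an analyst role SELECT access to a subset of columns. The database, table, column, and role names are illustrative assumptions, not part of our setup:

```python
import boto3

# Illustrative sketch: grant column-level SELECT to an analyst role.
# All names (database, table, columns, role ARN) are assumptions.
lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst-role"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "iot_data",
            "Name": "sensor_readings",
            # Restrict the grant to non-sensitive columns
            "ColumnNames": ["device_id", "timestamp", "temperature"],
        }
    },
    Permissions=["SELECT"],
)
```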
To get an overview of how AWS Lake Formation handles user requests, we can have a look at the diagram below. When a user tries to access data managed by AWS Lake Formation, the service returns temporary credentials to the user, scoped to the access limitations defined in Lake Formation.
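This credential vending is transparent to the user: querying through an integrated service such as Amazon Athena just works, and the results are limited to what Lake Formation allows. A minimal sketch, assuming an illustrative database name and results bucket:

```python
import boto3

# Run a query as the calling principal; Lake Formation restricts the
# visible columns/rows to what this principal has been granted.
athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM sensor_readings LIMIT 10",
    QueryExecutionContext={"Database": "iot_data"},  # assumed database name
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)
print(response["QueryExecutionId"])
```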
How can it help us?
As we have seen in the previous blog post, we now have the PII and Non-PII data separated into two different buckets. PII data needs to be handled with the utmost care, as infringements can be expensive, as cases in several countries have shown. To make sure that access to the PII and Non-PII data is regulated, we will use AWS Lake Formation.
First, we need to set up a data lake administrator, which is the only user that can grant Lake Formation permissions on data locations and Data Catalogue resources to any principal. It is best practice not to use a user with administrator access for this. The managed policy required for the data lake administrator user is “AWSLakeFormationDataAdmin”. Other policies can be added to the user as needed, e.g., for cross-account data sharing.
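Instead of using the console, the administrator can also be registered programmatically. A minimal sketch, assuming an IAM user named “datalake-admin” that already has the “AWSLakeFormationDataAdmin” policy attached:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Fetch the current settings and set the admin; the user ARN is illustrative.
settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]
settings["DataLakeAdmins"] = [
    {"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/datalake-admin"}
]
lakeformation.put_data_lake_settings(DataLakeSettings=settings)
```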
Once we have the user set up, we need to register the S3 locations in Lake Formation. To do that, we navigate to the “Data Lake locations” section in Lake Formation and register one location for each bucket. This is necessary so that permissions to access the data in the bucket are handled by Lake Formation rather than by the user's own IAM permissions.
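Registering the locations can likewise be scripted. A minimal sketch with illustrative bucket names standing in for our PII and Non-PII buckets; UseServiceLinkedRole tells Lake Formation to access the locations via its service-linked role rather than the caller's permissions:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register one data lake location per bucket (bucket names are assumptions).
for bucket in ["my-pii-bucket", "my-non-pii-bucket"]:
    lakeformation.register_resource(
        ResourceArn=f"arn:aws:s3:::{bucket}",
        UseServiceLinkedRole=True,
    )
```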
Now that everything is set up, we can look at how to change the permissions. There are several ways to do this, with varying granularity: we can set permissions on the database, table, column, row, or tag level (see screenshot below). With data filters (see screenshot below on the right), we can define columns to include or exclude in a permission; here, it is also possible to use a where-clause to filter on the row level. Using tags to manage permissions gives us the option to easily grant groups permissions for all data carrying that tag. If you want to find out more about all the permission options, check out the documentation (AWS Lake Formation Documentation). Furthermore, AWS has announced that Lake Formation now supports centralized access control for Amazon Redshift (AWS Announcement).
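Data filters can also be created via the API. The sketch below, with illustrative names throughout, excludes two PII columns, keeps only EU rows, and then grants SELECT on the filter instead of on the whole table:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Define a data cells filter: exclude PII columns, keep only EU rows.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",  # AWS account ID (assumed)
        "DatabaseName": "iot_data",
        "TableName": "sensor_readings",
        "Name": "eu_rows_no_pii",
        "ColumnWildcard": {"ExcludedColumnNames": ["email", "full_name"]},
        "RowFilter": {"FilterExpression": "region = 'EU'"},  # where-clause style
    }
)

# Grant SELECT on the filter rather than on the full table.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/eu-analyst-role"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "123456789012",
            "DatabaseName": "iot_data",
            "TableName": "sensor_readings",
            "Name": "eu_rows_no_pii",
        }
    },
    Permissions=["SELECT"],
)
```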
In this post, we illustrated what Lake Formation can be used for and how it is beneficial for our use case. We have now set up the config bucket, the crawlers, the ETL jobs, the PII and Non-PII buckets, and AWS Lake Formation to manage all the permissions. We have also seen the granularity of the permissions that we can grant.
In the next blog post, we will discuss how the framework becomes further GDPR-compliant by implementing the “right to be forgotten” and the “right of access”. If you have any questions, please contact us!