How To Set Up a GDPR Compliant Data Lake From Scratch – Part 2

As we saw in the previous blog post, our transformed data should now be in the data lake and available in the Glue Data Catalog. In this post, we first discuss what AWS Lake Formation is and then see how we can use it to securely share access to the data.

What is AWS Lake Formation?

AWS Lake Formation is a fully managed service that makes it easy to set up, secure, and manage a data lake on AWS. With AWS Lake Formation, you can define security and access controls, set up data management and governance, and integrate with other AWS services for data processing and analysis.

Some of the benefits of AWS Lake Formation include:

  • Easy setup: Lake Formation simplifies the process of setting up a data lake by making it easy to import databases from within AWS as well as from external sources. It provides several pre-built data connectors for sources such as RDS, DynamoDB, and S3. It is also possible to use an existing data lake, for example one that is already set up in S3.
  • Catalog your data: Lake Formation crawls your data sources via Glue to extract the metadata and creates a searchable data catalogue.
  • Manage Access Controls: Lake Formation allows you to manage access to your data in the data lake. It supports fine-grained access controls, which allow you to define who can access which data and at what level of granularity e.g., table, column, row, and cell level. These policies apply to IAM users and roles.
  • Cross-account access: Lake Formation simplifies sharing access to data across different AWS accounts.
  • AWS service integration: Lake Formation also integrates with AWS Glue, AWS IAM, Amazon Redshift, Amazon Athena, AWS KMS and several other services.

To get an overview of how AWS Lake Formation handles user requests, we can look at the diagram below. When a user tries to access data that is managed by AWS Lake Formation, Lake Formation returns temporary credentials that are scoped to the access limitations defined for that user in Lake Formation.

docs.aws.amazon.com/lake-formation/latest/dg/how-vending-works.html

How can it help us?

As we saw in the previous blog post, the PII and Non-PII data are now separated into two different buckets. PII data needs to be handled with the utmost care, as infringements can be expensive, as cases in several countries have shown. To make sure that access to the PII and Non-PII data is regulated, we will use AWS Lake Formation.

First, we need to set up a data lake administrator, which is the only user that can grant Lake Formation permissions on data locations and Data Catalog resources to any principal. It is best practice not to use a user with Administrator Access for that. The policy required for the data lake administrator user is “AWSLakeFormationDataAdmin”. Other policies can be added to the user as well, e.g., if you want to do cross-account data sharing.
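If you prefer to script this step instead of using the console, a minimal sketch with boto3 could look like the following. The user name and account ID in the usage note are hypothetical placeholders, and the helper names are ours. Because `put_data_lake_settings` replaces the existing settings, the sketch reads the current settings first and only appends the new administrator.

```python
import copy

# AWS managed policy named in the text above.
DATA_ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AWSLakeFormationDataAdmin"


def with_data_lake_admin(settings: dict, admin_arn: str) -> dict:
    """Return a copy of the Lake Formation settings with admin_arn listed as an admin."""
    updated = copy.deepcopy(settings)
    admins = updated.setdefault("DataLakeAdmins", [])
    entry = {"DataLakePrincipalIdentifier": admin_arn}
    if entry not in admins:  # keep the call idempotent
        admins.append(entry)
    return updated


def make_data_lake_admin(iam, lakeformation, user_name: str, admin_arn: str) -> None:
    """Attach the managed policy and register the user as data lake administrator.

    `iam` and `lakeformation` are boto3 clients, e.g. boto3.client("lakeformation").
    """
    iam.attach_user_policy(UserName=user_name, PolicyArn=DATA_ADMIN_POLICY_ARN)
    # put_data_lake_settings overwrites the settings, so merge with the current ones.
    current = lakeformation.get_data_lake_settings()["DataLakeSettings"]
    lakeformation.put_data_lake_settings(
        DataLakeSettings=with_data_lake_admin(current, admin_arn)
    )
```

With real clients this would be called as `make_data_lake_admin(boto3.client("iam"), boto3.client("lakeformation"), "lf-admin", "arn:aws:iam::111122223333:user/lf-admin")`, where both the user name and the ARN are examples only.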

Once we have the user set up, we need to register the S3 locations in Lake Formation. To do that, we navigate to the “Data Lake locations” section in Lake Formation and register one location for each bucket. This is necessary so that permissions to access the data in the bucket are handled by Lake Formation and not by the user's own permissions.
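The registration can also be done programmatically. The sketch below builds one `register_resource` request per bucket. The bucket names in the usage note are hypothetical, and using the service-linked role is an assumption; a custom role could be passed via `RoleArn` instead.

```python
def s3_arn(bucket: str) -> str:
    """Build the ARN of an S3 bucket from its name."""
    return f"arn:aws:s3:::{bucket}"


def registration_requests(buckets):
    """One register_resource request per bucket, using the service-linked role."""
    return [
        {"ResourceArn": s3_arn(bucket), "UseServiceLinkedRole": True}
        for bucket in buckets
    ]


def register_buckets(lakeformation, buckets) -> None:
    """Register every bucket as a data lake location.

    `lakeformation` is a boto3 client, e.g. boto3.client("lakeformation").
    """
    for request in registration_requests(buckets):
        lakeformation.register_resource(**request)
```

For the two buckets of this series, the call might be `register_buckets(boto3.client("lakeformation"), ["example-pii-bucket", "example-non-pii-bucket"])`, with the actual bucket names substituted.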

Lake Formation options for granting to principals

Now that everything is set up, we can look at how to change the permissions. There are several ways to do this, with varying granularity: we can set permissions on the database, table, column, row, or tag level (see screenshot below). With data filters (see screenshot below on the right), we can define which columns are included or excluded when the filter is added to a permission. It is also possible to use a where-clause to filter on the row level. Managing permissions with tags lets us easily grant groups access to all data that carries a given tag. If you want to find out more about all the options for permissions, check out the documentation (AWS Lake Formation Documentation). Furthermore, AWS has announced support for centralized access control for Amazon Redshift with AWS Lake Formation (AWS Announcement).
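To make the different granularities more concrete, here is a hedged sketch of the request shapes that boto3's Lake Formation client expects for a column-level grant, a data filter with a row-level where-clause, and a tag-based grant. All database, table, column, tag, and principal names used with these helpers are hypothetical, and the helper functions are ours, not part of boto3.

```python
def principal(arn: str) -> dict:
    """Wrap an IAM user or role ARN in the structure Lake Formation expects."""
    return {"DataLakePrincipalIdentifier": arn}


def column_grant(arn, database, table, excluded_columns):
    """grant_permissions request: SELECT on a table with some columns hidden."""
    return {
        "Principal": principal(arn),
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


def data_filter(catalog_id, database, table, name, columns, where):
    """create_data_cells_filter request: included columns plus a row-level where-clause."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,  # the AWS account ID owning the catalog
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "ColumnNames": list(columns),
            "RowFilter": {"FilterExpression": where},
        }
    }


def filter_grant(arn, filter_request):
    """grant_permissions request: SELECT through a previously created data filter."""
    data = filter_request["TableData"]
    keys = ("TableCatalogId", "DatabaseName", "TableName", "Name")
    return {
        "Principal": principal(arn),
        "Resource": {"DataCellsFilter": {key: data[key] for key in keys}},
        "Permissions": ["SELECT"],
    }


def tag_grant(arn, tag_key, tag_values):
    """grant_permissions request: access to all tables carrying the given LF-tag."""
    return {
        "Principal": principal(arn),
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }
```

With a client created via `boto3.client("lakeformation")`, a filter would first be created with `lakeformation.create_data_cells_filter(**data_filter(...))` and then granted with `lakeformation.grant_permissions(**filter_grant(...))`. Keeping the requests as plain dictionaries makes it easy to review or version-control who gets access to what before anything is applied.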

Table Permissions that can be defined

In this post, we illustrated what Lake Formation can be used for and how it is beneficial for our use case. We have now set up the config bucket, the crawlers, the ETL jobs, the PII and Non-PII buckets, and AWS Lake Formation to manage all the permissions. We have also seen the granularity of the permissions that we can grant.

In the next blog post, we will discuss how the framework becomes further GDPR compliant by implementing the “right to be forgotten” and the “right of access”. If you have any questions, please contact us!

Want To Learn More? Contact Us!

Your contact person

Helene Fuchs

Domain Lead Data Platform & Data Management

Your contact person

Pia Ehrnlechner

Domain Lead Data Platform & Data Management
