How To Set Up a GDPR Compliant Data Lake From Scratch – Part 3

In part 1 and part 2, we demonstrated how to build an AWS data lake from scratch and how to ingest data into a Data Lakehouse. In this post, we describe how to enforce the GDPR’s Right to be Forgotten (RTBF) in a Data Lakehouse, so that both the data lake and the data warehouse built in the previous posts can honor a user’s request to be forgotten. Let us first understand what RTBF is.

What Is Right To Be Forgotten?

Well, the name is fairly self-explanatory! A user can ask for their data to be “forgotten”, that is, removed from all storage systems. If you are building a data lake or a data warehouse, you are likely to store users’ personal data. Remember that we split the data into separate PII buckets in part 2; those buckets may contain a variety of personal information about a user. If a user asks you to remove their personal data, RTBF requires you to remove it permanently.

So, What’s the Issue? Let’s Just Remove the Personal Data!

We wish it were that simple! For starters, how do we know which buckets, and which files inside those buckets, hold the user’s data? A data lake is huge, and we might not know which file has the user’s address and which file has their contact number. And how do we know that the address and the contact number belong to the same person?

Furthermore, all those files in the S3 buckets are objects, and S3 objects are immutable. A file containing the addresses of all users might have 100k records, of which we want to delete only one. We can’t simply delete that one record in place; we need to replace the entire file! What if we delete the wrong record or replace the wrong file? What if we miss some records? Last but not least, how do we do it automatically? Surely, you don’t want to go through each bucket and look for the data in each file manually?
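To make the “replace the entire file” point concrete, here is a minimal sketch of the filtering step, working on CSV text in memory. The column name `email` and the sample rows are assumptions for illustration only. In a real pipeline the input would be read from S3 and the result written back under the same key (for example with boto3’s `get_object` and `put_object`), which is left out here so the example runs without AWS access.

```python
import csv
import io
from typing import Tuple


def drop_user_rows(csv_text: str, column: str, value: str) -> Tuple[str, int]:
    """Return the CSV content without rows where `column` equals `value`,
    together with the number of rows that were removed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    removed = 0
    for row in reader:
        if row[column] == value:
            removed += 1  # skip the row instead of copying it
            continue
        writer.writerow(row)
    return out.getvalue(), removed


# Hypothetical file content: two users, one of whom asks to be forgotten.
original = "email,city\nanna@example.com,Berlin\nbob@example.com,Munich\n"
rewritten, removed = drop_user_rows(original, "email", "anna@example.com")
```

Returning the number of removed rows gives a cheap sanity check before the old object is overwritten: zero or unexpectedly many removals is a signal to stop.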

So, What Do We Do?

As always, it is best to proceed step by step, dividing the task into three steps:

  • First, we focus on finding the complete and accurate user data.
  • Second, we create a comprehensive, shareable, and editable user data report.
  • Lastly, once the user data report is reviewed and approved, we proceed to delete the reported user data.

To achieve all three, we developed a highly customizable, flexible, automated, and easy-to-use tool to enforce RTBF in a Data Lakehouse. The tool, Forget Me Easy (FME), is built on AWS Glue and exploits the metadata generated by AWS Glue Crawlers. FME provides an end-to-end automated solution for searching, reporting, and deleting user data from an AWS data lake and Amazon Redshift.

A typical workflow is illustrated below:

Searching User Data

FME can search for user data with minimal input. A primary identifier, such as an email address, is mandatory for the search to begin. Two further parameters are optional: Tag Name and Potential Key Candidates. When a Tag Name such as ‘PII’ is provided, the search is narrowed to the PII buckets only; otherwise FME looks for user data in all buckets. The Potential Key Candidates parameter expects substrings such as ‘id’ or ‘key’. For example, the substring ‘id’ selects all columns whose names contain ‘id’, such as ‘EMP_ID’ and ‘ADDRESS_ID’, as candidate keys for joining two different files.
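The two optional parameters can be sketched as simple filters over table metadata. This is not FME’s code: the `catalog` structure, the bucket and table names, and both function names are hypothetical stand-ins for what a Glue Crawler would record.

```python
def select_tables(catalog, tag_name=None):
    """Keep tables whose bucket carries the tag; keep all when no tag is given."""
    if tag_name is None:
        return list(catalog)
    return [t for t in catalog if tag_name in t["tags"]]


def key_candidates(columns, substrings):
    """Return columns whose name contains any substring, ignoring case."""
    lowered = [s.lower() for s in substrings]
    return [c for c in columns if any(s in c.lower() for s in lowered)]


# Hypothetical metadata for three tables in two buckets.
catalog = [
    {"bucket": "lake-pii", "table": "employees", "tags": ["PII"],
     "columns": ["EMP_ID", "EMAIL", "ADDRESS_ID"]},
    {"bucket": "lake-pii", "table": "addresses", "tags": ["PII"],
     "columns": ["ADDRESS_ID", "STREET", "CITY"]},
    {"bucket": "lake-public", "table": "products", "tags": [],
     "columns": ["PRODUCT_ID", "NAME"]},
]

pii_tables = select_tables(catalog, tag_name="PII")
join_keys = {t["table"]: key_candidates(t["columns"], ["id"]) for t in pii_tables}
```

Plain substring matching is deliberately loose: a column such as ‘VALID’ would also match ‘id’, which is one reason an editable intermediate step is useful.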

Sounds easy to use, right? But wait, FME offers another level of customization. Based on the input parameters, it generates an intermediate, editable metadata file that lists all the bucket names, file names, and column names in which FME will search for user data. This file can be edited to add more column names or file names. FME then joins the search results for a single user from the data lake and Redshift and generates an HTML file. The HTML file contains the complete search results, with bucket name, file name, and location, along with an editable field called ‘TO_BE_DELETED’.
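How a join over candidate keys ties records in different files to one person can be sketched as follows. This is a simplified, single-hop illustration under our own assumptions, not FME’s implementation: tables are plain in-memory lists of dicts, the primary identifier is assumed to live in a column named `EMAIL`, and every report entry starts with ‘TO_BE_DELETED’ set to False until a reviewer changes it.

```python
def find_user_records(tables, email, key_columns):
    """Collect rows matching the email, then rows in any table that share
    a key value with those rows (one hop only)."""
    report = []
    known_keys = {}  # key column -> set of values tied to the user

    # Pass 1: direct matches on the primary identifier.
    for name, rows in tables.items():
        for row in rows:
            if row.get("EMAIL") == email:
                report.append({"table": name, "row": row, "TO_BE_DELETED": False})
                for col in key_columns:
                    if col in row:
                        known_keys.setdefault(col, set()).add(row[col])

    # Pass 2: rows linked to the user through a shared key column.
    for name, rows in tables.items():
        for row in rows:
            if row.get("EMAIL") == email:
                continue  # already reported in pass 1
            if any(row.get(col) in values for col, values in known_keys.items()):
                report.append({"table": name, "row": row, "TO_BE_DELETED": False})
    return report


# Hypothetical data: the address file has no email column at all.
tables = {
    "employees": [
        {"EMP_ID": 1, "EMAIL": "anna@example.com", "ADDRESS_ID": 10},
        {"EMP_ID": 2, "EMAIL": "bob@example.com", "ADDRESS_ID": 11},
    ],
    "addresses": [
        {"ADDRESS_ID": 10, "CITY": "Berlin"},
        {"ADDRESS_ID": 11, "CITY": "Munich"},
    ],
}
report = find_user_records(tables, "anna@example.com", ["EMP_ID", "ADDRESS_ID"])
```

Note that a shared key can also pull in rows that belong to someone else, for example two users registered at the same address, so the result is a list of candidates for review rather than a deletion list.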

Reviewing User Data

The HTML report generated for a user can be shared with the responsible administrators. We added this manual approval step before deletion for transparency. If the report is inaccurate or is missing information, FME can be re-run with adjusted input parameters. Once every record has been approved, the HTML file is ingested back into FME.

The HTML report contains the complete information (personal and non-personal) about a given user stored in a Data Lakehouse. A user’s non-personal data may include, for example, order history and browsing history. When shared with a user, the HTML report provides complete and transparent access to the data an organization stores about them.

Delete Me!

At this point, we have two options: we can either mask the record or delete it entirely from the data lake. FME does whichever is specified. In the case of masking, the record value is replaced with a hash value. FME can then be re-run from the beginning to verify that all relevant data has been deleted or masked.
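The difference between the two options can be sketched on in-memory rows. The choice of SHA-256, the function names, and the `EMAIL` column are assumptions made for this example; the hash function FME actually uses is not specified here.

```python
import hashlib


def mask_value(value: str) -> str:
    """Replace a value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_action(rows, column, value, action):
    """Delete matching rows, or mask the matching column, as requested."""
    if action not in ("delete", "mask"):
        raise ValueError(f"unknown action: {action}")
    result = []
    for row in rows:
        if row.get(column) != value:
            result.append(row)  # unrelated row, keep as is
        elif action == "mask":
            result.append({**row, column: mask_value(value)})
        # action == "delete": the matching row is simply not copied
    return result


# Hypothetical rows for two users.
rows = [
    {"EMAIL": "anna@example.com", "CITY": "Berlin"},
    {"EMAIL": "bob@example.com", "CITY": "Munich"},
]
masked = apply_action(rows, "EMAIL", "anna@example.com", "mask")
deleted = apply_action(rows, "EMAIL", "anna@example.com", "delete")
```

Masking keeps the row count and the non-personal columns intact, which helps when downstream aggregates depend on them. An unsalted hash of a guessable value such as an email address can be reproduced by anyone who knows the address, so a salted or keyed hash is the safer variant of this sketch.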

Why Trust FME?

You don’t have to take our word for it. We compared the results of FME with those of Amazon Macie, a machine-learning-based service that can detect and report the presence of sensitive data at bucket level. As described in link1, Macie can be used to identify PII columns. FME reports user data in the same buckets, files, and columns that Macie identifies as containing sensitive data. In addition, FME can tell whether an address in a file belongs to a given email address, whereas Macie only tells us that the file contains some addresses.

FME is highly customizable and can function even if a Data Lakehouse doesn’t have pre-existing GDPR measures. The two-phase approach of generating and reviewing a user data report brings transparency and guards against faults in the process. Presenting a comprehensive report with all instances of a user’s data builds trust and minimizes the chance of violating GDPR. With a few knobs to turn and tweak, an automated tool like FME makes bringing a Data Lakehouse into GDPR compliance painless and easy.

In the future, we may choose to encrypt the entire data lake using an AWS KMS key (an encrypted data lake). The FME framework can easily be extended to employ an AWS Lambda function that first decrypts the data and then finds the data of a given user. The decrypted user data can then be reported and anonymized by FME, while the rest of the data lake stays encrypted and untouched.

To conclude our series, we have demonstrated how straightforward it is to build an AWS Data Lakehouse, keep it up and running, and make it GDPR compliant. Customers with no pre-existing infrastructure or GDPR measures don’t have to think twice about moving to the AWS cloud. Customers new to the cloud can take advantage of our tools like FME to implement the Right to be Forgotten.

Want To Learn More? Contact Us!

Helene Fuchs

Your contact person

Domain Lead Data Platform & Data Management

Pia Ehrnlechner

Your contact person

Domain Lead Data Platform & Data Management
