In part 1 and part 2, we demonstrated how an AWS data lake is built from scratch and how data is ingested into a Data Lakehouse. In this blog, we describe how to enforce the GDPR Right to be Forgotten (RTBF) in a Data Lakehouse. We make both the data lake and the data warehouse built in the previous blogs capable of honoring a user's request to be forgotten. Let us first understand what RTBF is.
What Is Right To Be Forgotten?
Well, the name is fairly self-explanatory! A user can simply ask for their data to be “forgotten”, a.k.a. removed from all storage systems. Now, if you are building a data lake or a data warehouse, you are likely to have users' personal data stored. Remember that we split the data into separate PII buckets in part 2. Yes, those buckets may contain a variety of personal information about a user. If a user asks you to remove their personal data, RTBF requires you to remove it permanently.
So, What’s the Issue? Let’s Remove the Personal Data!
We wish it were that simple! For starters, how do we know which buckets, and which files inside those buckets, hold the data for the user? A data lake is huge, and we might not know which file has the address of the user and which file has the contact number. How do we know the address and the contact number belong to the same person?
Furthermore, all those files in the S3 buckets are objects, and S3 objects are immutable. A file containing the addresses of all users might have 100k records, and we only want to delete one of them. We can’t simply delete that one record in place; we need to replace the entire file! What if we delete the wrong record or replace the wrong file? What if we miss some records? Last but not least, how do we do it automatically? Surely, you don’t want to go to each bucket and look for the data in each file manually?
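To make the "replace the entire file" point concrete, here is a minimal sketch that assumes the file is a CSV with a header row. The function name `remove_user_rows` and the sample data are hypothetical and only illustrate the idea; this is not FME's actual implementation.

```python
import csv
import io

def remove_user_rows(csv_text, column, value):
    """Return a new CSV body without the rows where `column` equals `value`,
    together with the number of rows that were dropped."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    removed = 0
    for row in reader:
        if row[column] == value:
            removed += 1  # skip the record that must be forgotten
            continue
        writer.writerow(row)
    return out.getvalue(), removed

# Hypothetical sample data standing in for the body of one S3 object.
original = "email,address\nann@example.com,1 Main St\nbob@example.com,2 Oak Ave\n"
rewritten, removed = remove_user_rows(original, "email", "bob@example.com")
print(rewritten)
print(removed)
```

In a real data lake, the original body would typically be read with boto3's `get_object` and the rewritten body uploaded under the same key with `put_object`, so the whole object is replaced rather than edited.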
So, What Do We Do?
As always, it is best to proceed step by step, dividing the task into three steps:
First, we focus on finding the complete and accurate user data.
Second, we create a comprehensive, shareable, and editable user data report.
Lastly, once the user data report is reviewed and approved, we proceed to delete the user data reported.
To achieve all three, we developed Forget Me Easy (FME), a highly customizable, flexible, automated, and easy-to-use tool to enforce RTBF in a Data Lakehouse. FME is built on AWS Glue and exploits the metadata generated by AWS Glue Crawlers. It provides an end-to-end automated solution for searching, reporting, and deleting user data from an AWS data lake and Amazon Redshift.
A typical workflow is illustrated below:
Searching User Data
FME can search user data with minimal input. A primary identifier such as an email address is mandatory for the search to begin. Two further parameters are optional: Tag Name and Potential Key Candidates. When a Tag Name such as ‘PII’ is provided, the search is narrowed down to the PII buckets only; otherwise FME looks for user data in all buckets. The Potential Key Candidates parameter expects substrings such as ‘id’ or ‘key’. For example, the substring ‘id’ selects every column whose name contains ‘id’, such as ‘EMP_ID’ or ‘ADDRESS_ID’, as a candidate key for joining two different files.
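The Potential Key Candidates idea can be sketched as follows. We assume here that two files are joined on columns that carry the same name in both files and contain one of the given substrings; the function `candidate_join_keys` and the column lists are hypothetical and not taken from FME.

```python
def candidate_join_keys(columns_a, columns_b, substrings):
    """Columns present in both files whose name contains any of the
    candidate substrings (case-insensitive)."""
    def is_candidate(name):
        return any(s.lower() in name.lower() for s in substrings)

    # Assumption: a usable join key has the same (case-insensitive) name in both files.
    common = {c.upper() for c in columns_a} & {c.upper() for c in columns_b}
    return sorted(c for c in common if is_candidate(c))

# Hypothetical column lists, as a Glue Crawler might report them for two files.
employees = ["EMP_ID", "EMAIL", "NAME"]
addresses = ["ADDRESS_ID", "EMP_ID", "STREET", "CITY"]
print(candidate_join_keys(employees, addresses, ["id", "key"]))
```

Plain substring matching can over-match (a column such as ‘VALID_FROM’ also contains ‘id’), which is one reason a reviewable list of columns is useful before any search is run.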
Sounds easy to use, right? Wait, FME provides another level of customization. Based on the input parameters, FME generates an intermediate, editable metadata file. This file lists all the bucket names, file names, and column names in which FME will search for user data, and it can be edited to add more column names or file names. FME then joins the search results for a single user from the data lake and Redshift and generates an HTML file. The HTML file contains the complete search results with bucket name, file name, and location, along with an editable field called ‘TO_BE_DELETED’.
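A rough sketch of what such an intermediate metadata file could look like is shown below. The JSON layout, the `build_search_metadata` function, and the catalog entries are assumptions made for illustration; the actual format used by FME is not described here.

```python
import json

def build_search_metadata(catalog, tag_name=None):
    """Select the files to search. If tag_name is given, keep only entries
    whose bucket carries that tag; otherwise keep every entry."""
    selected = []
    for table in catalog:
        if tag_name is not None and tag_name not in table["tags"]:
            continue
        selected.append({
            "bucket": table["bucket"],
            "file": table["file"],
            "columns": list(table["columns"]),
        })
    return selected

# Hypothetical catalog entries standing in for Glue Crawler metadata.
catalog = [
    {"bucket": "lake-pii", "file": "customers.csv",
     "columns": ["EMAIL", "EMP_ID"], "tags": ["PII"]},
    {"bucket": "lake-orders", "file": "orders.csv",
     "columns": ["ORDER_ID", "EMP_ID"], "tags": []},
]

metadata = build_search_metadata(catalog, tag_name="PII")
metadata_json = json.dumps(metadata, indent=2)  # editable before the search runs
print(metadata_json)
```

Keeping this step as a plain, human-editable file means an administrator can add a file or column that the automatic selection missed before the search starts.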
Reviewing User Data
The HTML report generated for a user can be shared with the responsible administrators. We added this manual approval step to make the deletion of user data transparent. If the user data report is inaccurate or missing some information, FME can be re-run after adjusting the input parameters. Once every record has been approved, the HTML file is ingested back into FME.
The HTML report contains the complete information (personal and non-personal) stored in a Data Lakehouse about a given user. Non-personal data may include, for example, order history and browsing history. When shared with the user, the HTML report provides complete and transparent access to the data an organization stores about them.
Delete Me!
At this point, we have two options: we can either mask the record or delete it entirely from the data lake. FME does whichever is specified. In the case of masking, the record value is replaced with a hash value. FME can then be re-run from the beginning to make sure that all relevant data has been deleted or masked.
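The two options can be sketched as follows. SHA-256 is used purely as an assumption, since the hash function FME applies is not stated here, and `mask_value`, `apply_action`, and the sample rows are hypothetical names and data.

```python
import hashlib

def mask_value(value):
    """Replace a value with its SHA-256 hex digest (assumed hash function)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def apply_action(rows, column, target, action, pii_columns):
    """Mask or delete the rows where `column` equals `target`.
    Rows that belong to other users are returned unchanged."""
    if action not in ("mask", "delete"):
        raise ValueError(f"unknown action: {action}")
    result = []
    for row in rows:
        if row[column] != target:
            result.append(dict(row))
        elif action == "mask":
            result.append({k: mask_value(v) if k in pii_columns else v
                           for k, v in row.items()})
        # action == "delete": the row is simply left out
    return result

# Hypothetical records from one file.
rows = [
    {"email": "ann@example.com", "address": "1 Main St"},
    {"email": "bob@example.com", "address": "2 Oak Ave"},
]
masked = apply_action(rows, "email", "bob@example.com", "mask", {"email", "address"})
deleted = apply_action(rows, "email", "bob@example.com", "delete", {"email", "address"})
print(masked[1]["email"])
print(deleted)
```

Masking keeps the row count and file layout intact, while deletion removes the record altogether; in both cases the result still has to be written back as a complete new file.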
Why Trust FME?
You don’t have to take our word for it. We compare the results of FME with those of Amazon Macie, a machine-learning-based service that can detect and report the presence of sensitive data at bucket level. As described in link1, Amazon Macie can be used to identify PII columns. FME reports user data in the same buckets, files, and columns that Amazon Macie identifies as containing sensitive data. In addition, FME can tell whether an address in a file in a bucket belongs to a given email address, while Amazon Macie only tells us that the file contains some addresses.
FME is highly customizable and can function even if a Data Lakehouse doesn’t have pre-existing GDPR measures. The two-phase approach of generating and reviewing a user data report brings transparency and guards against faults in the process. Presenting a comprehensive report with all instances of a user's data builds trust and minimizes the chances of violating GDPR. With a few knobs to turn and tweak, an automated tool like FME makes it easy and painless to bring a Data Lakehouse into GDPR compliance.
In the future, we could choose to encrypt the entire data lake using an AWS KMS key (encrypted data lakes). The FME framework can easily be extended to employ an AWS Lambda function that first decrypts the data and then finds the data of a given user. The decrypted user data can then be reported and anonymized by FME. The rest of the data lake stays encrypted and untouched.
Over the course of this series, we demonstrated how straightforward it is to build an AWS Data Lakehouse, keep it up and running, and make it GDPR compliant. Customers with no pre-existing infrastructure and GDPR measures don’t have to think twice about moving to the AWS cloud. Customers new to the cloud can take advantage of tools like FME to implement the Right to be Forgotten.