More and more business-relevant decisions are made on the basis of automatically generated data. Testing only the correctness of the source code of transformations is no longer sufficient; the correctness of the data itself, which is typically compiled from different, rapidly changing sources, must also be checked. In the worst case, a wrong decision is made because of poor data quality. This is not necessarily a new insight, but tool support for data quality checks in the open-source world has so far been unsatisfactory. This blog post presents a methodology and a Python tool for so-called pipeline tests, which fill exactly this gap.
What is pipeline debt?
The great importance of data as a raw material in all areas of an enterprise is by now well known. Data warehouse structures are fed with information from all of these areas, ranging from sensor data to target figures. A machine-learning model works on the basis of the data it has seen during training. Data that lies far away from what the model saw during training can lead to wrong behaviour, and data that has been corrupted by error-prone sources or processing steps will certainly lead to wrong results. From the raw data source to the dashboard, a data set typically passes through multiple processing steps (transformation, loading, data models, machine-learning models), and monitoring and validation of data quality are often neglected along the way. What follows is a closer look at a new open-source tool called Great Expectations for validating data before running a machine-learning model or otherwise using the numbers.
Data scientists are aware of the problem: we have a dashboard with key figures derived from different data sources. Underlying some of these key figures is a machine-learning model, e.g. for scoring. Suddenly, one of the key figures changes so drastically that it no longer makes sense, or it becomes NaN because the value can no longer be calculated at all.
Why do I need pipeline tests?
Test-driven development has become a standard approach in the field of software development. Writing unit tests is part of good style and helps keep code quality high and avoid errors.
In the meantime, however, complexity often no longer manifests itself only in the code, but also in the data. Unexpected behaviour while running a machine-learning model may reflect genuinely anomalous data, or the data may simply be erroneous. Often the cause is an uncommunicated change in the data model of a source system.
Machine-learning models themselves are now embedded in a pipeline with several transformation steps (a minimal sketch follows the list):
- Loading of data (from a raw-data source or a data warehouse / another DBMS)
- Cleaning and preparation
- Aggregations
- Features
- Dimensionality reduction
- Normalization or scaling of data
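As a rough, purely illustrative sketch of such a pipeline (column names, file layout and pandas as the processing library are all assumptions made for this example):

import pandas as pd

def run_pipeline(path):
    df = pd.read_csv(path)                                    # loading of data
    df = df.dropna(subset=["id"]).drop_duplicates()           # cleaning and preparation
    daily = df.groupby("day", as_index=False)["value"].sum()  # aggregations
    daily["value_lag1"] = daily["value"].shift(1)             # a simple feature
    # dimensionality reduction (e.g. PCA) would follow here for wider data sets
    vmin, vmax = daily["value"].min(), daily["value"].max()
    daily["value_scaled"] = (daily["value"] - vmin) / (vmax - vmin)  # scaling
    return daily

Each of these steps can silently degrade data quality, which is exactly where pipeline tests come in.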
The need for a data quality verification tool in all these steps is obvious. Unlike unit tests, which check executable code (mostly at compile or build/deploy time), these data tests, called pipeline tests, are executed batch-wise, e.g. whenever a new batch of data arrives.
Great Expectations to the rescue!
Python has established itself as a development language in the area of machine learning. The Great Expectations tool is a Python package, installable via pip or conda.
pip install great-expectations
conda install conda-forge::great-expectations
Because its field of application is broad and complex, the tool has a rather abstract and generic structure. This may initially result in a steep learning curve, but the curve flattens as soon as the first use case has been worked through. The developers of Great Expectations have set up a discussion forum with lucid introductory examples, and there is also a Slack channel for questions.
After installation of the Python package, each new project is started by:
great_expectations init
This command creates the basic folder structure of a Great Expectations (GE) project (also called a DataContext in GE jargon); it can be followed immediately by a
git init
to manage the project's versions with git.
Example of the folder structure created by Great Expectations after an initialization call
During the initialization process, you are asked for the location of your data sources; this step can also be skipped and carried out manually later. A wide range of data sources is supported, from local CSV files via the SQLAlchemy connector to Spark and Hadoop (e.g. a partitioned Parquet file on HDFS).
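If the step was skipped during init, a data source can also be added later via the CLI; in the classic (0.x) versions of the tool the interactive command for this is:

great_expectations datasource new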
The basic building blocks of GE are expectations and validations. Once a project has been initialized and data sources have been selected, you can immediately start loading a batch (an excerpt of the data) and defining expectations. Here are some examples of expectations (the snippet after the list shows how they look in code):
- Column "ID" must not be null.
- Column "ID" must increase monotonically.
- Column "ID" must be of type integer.
- Column "A" must always be greater than column "B".
- Column "B" must lie between 0 and 1000.
- ....
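Expressed in code against a batch, the list above reads roughly as follows; this is a sketch that assumes the classic batch API and a hypothetical local file data.csv:

import great_expectations as ge

batch = ge.read_csv("data.csv")  # pandas-backed batch with expectation methods
batch.expect_column_values_to_not_be_null("ID")
batch.expect_column_values_to_be_increasing("ID")
batch.expect_column_values_to_be_of_type("ID", "int")  # type names depend on the backend
batch.expect_column_pair_values_A_to_be_greater_than_B("A", "B")
batch.expect_column_values_to_be_between("B", min_value=0, max_value=1000)

Each call immediately checks the expectation against the loaded batch and records it in the batch's expectation suite.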
A large number of expectation templates are available, including statistical (distribution) tests. For exploratory development of expectations on a new data set, example notebooks for the various connectors (CSV, SQL, PySpark) can be found in the folder great_expectations/notebooks/.
Once a set of expectations has been defined, it needs to be validated against a new data batch (for example, after new data has arrived for the last day).
During the validation process it is also possible to parameterize expectations. Suppose you have a streaming data source containing a time column with Unix timestamps. The data is persisted, and once a day the batch of the previous day is to be validated. The requirement that the timestamps lie within the bounds of that day can easily be validated by setting the expectation
batch.expect_column_values_to_be_between('timestamp', {'$PARAMETER': 'start_time'}, {'$PARAMETER': 'end_time'})
and supplying the concrete values during the validation process:
batch.validate(evaluation_parameters={'start_time': unixtime_start, 'end_time': unixtime_end})
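As a sketch of the daily run (UTC and the variable names are assumptions of this example), the day bounds of the previous day could be computed and passed like this:

from datetime import datetime, timedelta, timezone

midnight_today = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
unixtime_start = (midnight_today - timedelta(days=1)).timestamp()  # start of yesterday
unixtime_end = midnight_today.timestamp()                          # end of yesterday

results = batch.validate(evaluation_parameters={'start_time': unixtime_start, 'end_time': unixtime_end})
# 'results' contains an overall success flag plus details for every expectation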
A nice feature of the tool is Data Docs, which are created and then maintained in the course of the pipeline tests. A catalogue of the data sources, expectations and validations tracked by the GE project is rendered there as a readable website (stored in the GE folder under great_expectations/uncommitted/data_docs) and versioned. This way we can always see which data sources are being monitored, which tests (expectations) are defined, and whether the tests passed or failed.
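The Data Docs can also be (re)built manually from the CLI; in the classic versions the command for this is:

great_expectations docs build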
Configuration keys that we do not want to commit to a GitHub repository can simply be kept in a YAML file inside the uncommitted folder, so that passwords and access keys remain securely stored locally.
Lifecycle and version management of a DataContext
Once we have arrived at a set of expectations against which the data sources are to be validated in the future, we save our expectation suite. It is stored as a JSON file in the project folder. This is a good moment for a git commit, to put the expectation suite under version management.
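With the classic batch API, saving the suite looks roughly like this (whether failed exploratory expectations are kept is a matter of taste):

batch.save_expectation_suite(discard_failed_expectations=False)  # writes the suite as a JSON file into the project folder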
Great Expectations is not intended to be a pipeline automation tool (pipeline execution), but it can be integrated into one (Airflow, Oozie, ...) in order to run validations on a schedule, e.g. daily.
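As a minimal sketch of such an integration, assuming the classic batch-based API and purely hypothetical names for the project path, datasource, file and suite, the scheduled job could be a plain Python callable (e.g. for an Airflow PythonOperator):

import great_expectations as ge

def validate_daily_batch():
    # path, datasource name and suite name below are illustrative assumptions
    context = ge.data_context.DataContext("/path/to/project/great_expectations")
    batch = context.get_batch(
        batch_kwargs={"datasource": "my_datasource", "path": "/data/new_batch.csv"},
        expectation_suite_name="my_suite",
    )
    results = batch.validate()
    # the result object carries an overall success flag that the orchestrator can act on
    return results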
Once the pattern by which GE operates has been understood, a project can be extended very easily: additional expectations can be defined, or new data sources can be integrated. It also makes sense to monitor the same data at several points on its way through the processing pipeline, in order to identify intermediate steps that affect data quality.
An end-to-end example of validating a data source while loading it into a database can be studied in the following GitHub repository. There, a CSV file that is newly generated every day is first validated as raw data and then validated again after loading into a MySQL database.
Conclusion
Great Expectations is a fast-growing tool that can be used comprehensively to ensure data quality for the operation of a machine-learning model. Special importance has been attached to providing as generic a framework as possible and to offering many interfaces, so that users can adapt Great Expectations to their own projects and extend it according to their own needs. For example, there is an action that can automatically send a Slack notification after a validation. This and other templates can be used to build interfaces to other systems.
The pace of development, and the fact that major players like Databricks promote the tool at conferences, speak for its importance and quality. At the same time, there is a great need to bring data quality testing into the world of modern data sources and development approaches with the help of a modern tool. It is therefore worth getting to know Great Expectations and integrating it into your own data pipelines; the tool will certainly be with us for a while.