In business intelligence, "quality" is a perennially important topic, yet it is dealt with in very different ways. The reason is a lack of consistent understanding in this area and the consequent absence of a standardized approach to BI quality.
That this is becoming increasingly problematic is also signalled by the renaming of the current Gartner Magic Quadrant: data quality "tools" are now called data quality "solutions". The reason: while many vendors now proclaim themselves bearers of "quality", the common denominator behind the term is usually vanishingly small: to find, understand and solve problems. To their credit, this is still the source of all technical progress according to the philosopher Karl Popper. However, it does not necessarily have anything to do with data quality per se. In this article, I will therefore bring some order and structure to the topic of BI quality, and accordingly call the whole thing "BI quality management". This leaves plenty of scope for "total BI quality management", as it has been applied and taught in the engineering disciplines for some time now.
Firstly: Differentiation from software development
When it comes to BI quality, it is almost a reflex to look at software development, which has advanced a great deal in the field of testing; IT books on the subject already fill metres of shelf space. For some time now, BI has also picked up the (welcome) trend of increasingly using principles and tools from software development and designing BI landscapes as CI/CD environments. Numerous test variants along the deployment pipeline, such as automatic builds and code check-in tests, modularization, and the test pyramid are now part of the standard BI repertoire in many places.
Nevertheless, this approach leaves an unsolved problem in BI: we are already one level further down, talking about "testing" instead of "quality". For many operational software applications, this may work wonderfully: if the tests cover almost everything, the quality is usually right as well. Unfortunately, this just doesn't hold for BI. But why not?
In BI, we are dealing with data streams that often allow tremendous variation in possible input and output vectors, with numerous transformations implementing the business logic in between. Testing with predefined data during development is simply not feasible at this scale – pure execution of the classic test pyramid pushes BI to its limits. That this has already become an issue in software development, where it is discussed as "pipeline debt", is shown, for example, in this blog post.
Balancing quality management
For me, this firstly means that quality control in the production system is highly important, which brings us to the subject of validation, monitoring and balancing.
Secondly, there is unfortunately a permanent dilemma with regard to test data: production data may not be used (due to data protection) and/or cannot be used (due to their sheer volume). Automatically deriving an anonymized subset of the data is therefore very elaborate – albeit still possible depending on the situation. The remaining problem with tests based on production-derived data is that the developer or product owner cannot be sure whether these data cover all requirements. For test-driven development (TDD) and meaningful user-acceptance tests (UAT), detailed test data are therefore fundamentally important.
Figure 1: Product quality vs. testing
Secondly: Arrangement
The next important point for creating order in the topic of quality is the distinction between production and development.
Production: Synchronization with loading cycles
Production is essentially about observing, validating and monitoring the previously implemented processes with their business logic across the entire volume of production data – that is to say, time series synchronized with the loading cycles. Validation of source data at the field level (e.g. address and e-mail address) is offered by many tools; in my opinion, however, this is a normal cleansing task and is sometimes already implemented directly during loading (ELT/ETL). Dedicated tools such as the Great Expectations package in Python also enable this.
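To illustrate what such a field-level check can look like, here is a minimal sketch using the classic pandas-based API of Great Expectations; the column names email and postal_code and the rules are placeholders of mine, and newer versions of the package use a different, context-based API:

```python
import pandas as pd
import great_expectations as ge

# Example source extract with placeholder column names
df = pd.DataFrame({
    "email": ["anna@example.com", "invalid-address", None],
    "postal_code": ["80331", "1234", "50667"],
})

# Wrap the DataFrame so that expectations can be evaluated on it
batch = ge.from_pandas(df)

# Field-level cleansing checks: mandatory e-mail in a plausible format,
# postal code consisting of exactly five digits
batch.expect_column_values_to_not_be_null("email")
batch.expect_column_values_to_match_regex("email", r"[^@\s]+@[^@\s]+\.[^@\s]+")
batch.expect_column_values_to_match_regex("postal_code", r"^\d{5}$")

# Evaluate all registered expectations; the result reports overall success
# and the unexpected values per expectation
print(batch.validate())
```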
Balancing is about checking completeness at defined loading or processing points, and thus ensuring that the data remain complete along the entire data lineage.
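A very simple form of such a balancing check could look like the following sketch; the table names stg_orders and dwh_orders and the rounding tolerance are assumptions for illustration, and any DB-API-compatible connection works the same way:

```python
import sqlite3

# In-memory demo database with two hypothetical stations of the data lineage
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_orders (order_id INT, amount REAL, load_date TEXT);
    CREATE TABLE dwh_orders (order_id INT, amount REAL, load_date TEXT);
    INSERT INTO stg_orders VALUES (1, 100.0, '2024-01-31'), (2, 50.5, '2024-01-31');
    INSERT INTO dwh_orders VALUES (1, 100.0, '2024-01-31'), (2, 50.5, '2024-01-31');
""")

def balance_check(con, source_table, target_table, amount_col, load_date):
    """Compare row count and amount total between two stations of the data lineage."""
    sql = "SELECT COUNT(*), COALESCE(SUM({col}), 0) FROM {tbl} WHERE load_date = ?"
    src = con.execute(sql.format(col=amount_col, tbl=source_table), (load_date,)).fetchone()
    tgt = con.execute(sql.format(col=amount_col, tbl=target_table), (load_date,)).fetchone()
    return {
        "rows_match": src[0] == tgt[0],
        "amount_match": abs(src[1] - tgt[1]) < 0.01,   # small tolerance for rounding
        "source": src,
        "target": tgt,
    }

# Compare staging and core layer for one loading cycle
print(balance_check(con, "stg_orders", "dwh_orders", "amount", "2024-01-31"))
```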
Basically, I recommend implementing these higher-level quality tests in the production environment with a separate tool rather than, for example, in the ETL/ELT tool itself (of course, the ETL/ELT tool should still perform its share of data validation). The advantage: in the event of an extension or migration, existing quality tests remain in place and can continue to run independently of the tools used.
A few simple SQL queries are usually enough for the monitoring logic. Dashboard tools such as Grafana are available for reporting the results. More important to me personally, however, is being able to track the test metrics over time and define tolerance ranges. Because no BI-specific solution is available for this yet, we at b.telligent use our specially developed DQM tool to perform these tasks. More on this can be found in our white paper "Increasing data quality", available (in German) for download free of charge from our library.
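To give an idea of such monitoring logic, here is a minimal sketch of a time-series check with a tolerance range; the metric values and the ±20% band around the recent average are assumptions of mine and would need tuning per metric:

```python
import statistics

# Hypothetical metric time series: row counts per loading cycle,
# as produced by a simple "SELECT load_date, COUNT(*) ... GROUP BY load_date"
history = [10120, 10230, 9980, 10310, 10150, 10040, 10205]
latest = 7450  # row count of the most recent loading cycle

# Tolerance range: +/- 20 % around the average of the recent history
mean = statistics.mean(history)
lower, upper = mean * 0.8, mean * 1.2

if not lower <= latest <= upper:
    print(f"ALERT: latest load has {latest} rows, expected between {lower:.0f} and {upper:.0f}")
else:
    print("Load volume within tolerance range")
```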
Development: The problem is always the test data
Development, on the other hand, involves continuous, repeated testing of small individual parts (unit tests) before the comprehensive treatment (integration and system tests). Here, it helps if the BI system is modular enough to be operated in a CI/CD pipeline and to allow automated testing wherever possible. A very good introduction for BI is provided by the webinar of my colleagues Beyer and Schuster, "Quality offensive – DWH development" (also available free of charge, in German, from the b.telligent webinar repository).
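To make this concrete: if a transformation is encapsulated as a testable unit, a unit test with small, constructed data can run automatically on every check-in. Below is a minimal sketch with pandas and pytest; the transformation deduplicate_customers and its rule are purely illustrative, not a prescribed pattern:

```python
import pandas as pd

def deduplicate_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative transformation: keep only the most recent record per customer."""
    return (df.sort_values("valid_from")
              .drop_duplicates(subset="customer_id", keep="last")
              .reset_index(drop=True))

def test_deduplication_keeps_latest_record():
    df = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "city":        ["Munich", "Cologne", "Hamburg"],
        "valid_from":  ["2023-01-01", "2024-05-01", "2023-03-15"],
    })
    result = deduplicate_customers(df)
    assert len(result) == 2                                          # one row per customer
    assert result.loc[result["customer_id"] == 1, "city"].iloc[0] == "Cologne"
```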
Categorization of the quality steps is indicated in the extended BI test pyramid:
Figure 2: BI test pyramid
Thirdly: Testing with data
There is no way around this: Data lineage
The challenge in BI is that we usually need a great deal of test data to reproduce an end-to-end process. However, because much of the data is interrelated through business logic, even a small change often requires adapting large quantities of test data. In most cases this is extremely laborious and makes full testing with constructed data disproportionately costly. If the data lineage is not available in automated form, but has to be traced manually and laboriously each time, the costs for setting up and maintaining test data rise immeasurably. The result is test data with inadequate coverage and numerous sections of untested code.
Pipeline debt
In modern development environments, the individual developer usually deploys a specific release version in a virtual environment, develops and tests there, and then checks the code back into the repository for the next release. This is usually the crux of the matter: which data can be tested against in a freshly established environment? And how do the data get there?
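One pragmatic answer is to let the test framework build and seed a miniature database itself, so that every freshly established environment starts from the same, versioned test data. Here is a minimal sketch with pytest and an in-memory SQLite database; table names and values are illustrative assumptions:

```python
import sqlite3
import pytest

@pytest.fixture
def seeded_dwh():
    """Fresh in-memory database, seeded with a small, versioned set of test rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE stg_customer (customer_id INT, country TEXT, email TEXT);
        -- Test data live in the repository and travel with the release version
        INSERT INTO stg_customer VALUES
            (1, 'DE', 'anna@example.com'),
            (2, 'AT', NULL);                 -- deliberate edge case: missing e-mail
    """)
    yield con
    con.close()

def test_customer_load_is_complete(seeded_dwh):
    cnt = seeded_dwh.execute("SELECT COUNT(*) FROM stg_customer").fetchone()[0]
    assert cnt == 2
```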
The dilemma with test data is that we do not know which cases are actually covered by data derived from production. Synthetic test data, by contrast, are notorious for covering too few cases and for being difficult to adapt when changes occur.
In most cases, this leads to a painful compromise: in the development environments, often only very small data sets are used for unit or component tests, covering only an isolated part of the existing process steps. Further testing is then performed in a pre-production environment with production data (anonymized or production-like). The lack of continuity in this test track for the developer is the so-called pipeline debt.
Test-data management
What is still missing is a powerful test-data management system tailored to the special needs of BI. I consider the following functions to have priority here (a minimal sketch of such a structure follows the list):
- Definition of test data along the data lineage
- Technical description and commentary options for the individual cases
- Clustering of test data into sub-groups to simulate specific use cases
- Historicization of test data for their assignment to release versions
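To illustrate how such test-data management might be structured, here is a rough sketch as a plain Python data model; all field names and the example entry are assumptions of mine, not an existing tool:

```python
from dataclasses import dataclass, field

@dataclass
class TestDataCase:
    """One logical test case, defined along the data lineage."""
    case_id: str
    description: str                 # technical description / commentary
    lineage: list[str]               # stations of the data lineage this case covers
    cluster: str                     # sub-group used to simulate a specific use case
    release: str                     # release version the data set belongs to
    rows: dict[str, list[dict]] = field(default_factory=dict)  # table name -> rows

# Illustrative entry: a cancelled order travelling from staging to the data mart
cancelled_order = TestDataCase(
    case_id="ORD-042",
    description="Cancelled order must not contribute to revenue",
    lineage=["stg_orders", "dwh_orders", "mart_revenue"],
    cluster="cancellations",
    release="2024.1",
    rows={"stg_orders": [{"order_id": 42, "status": "cancelled", "amount": 99.0}]},
)
```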
Last but not least: Test-case engineering
What must not be missing under any circumstances is a joint definition of test cases with the specialist department. This leads to a better understanding on both sides – BI development and specialist department – and consequently to much higher quality from the very first line of code onward. Recommendable from practice is a procedure along the lines of the Gherkin scenario with its "given – when – then", which is widely used in software development but unfortunately not directly compatible with the data and structures in BI.
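The structure of such jointly defined test cases can nevertheless be mirrored in ordinary test code. Here is a hedged sketch in pytest style; the scenario and the function aggregate_revenue are invented purely for illustration:

```python
import pandas as pd

def aggregate_revenue(orders: pd.DataFrame) -> float:
    """Illustrative business rule agreed with the specialist department."""
    return float(orders.loc[orders["status"] == "ok", "amount"].sum())

def test_cancelled_orders_do_not_count_towards_revenue():
    # GIVEN two orders, one of which was cancelled
    orders = pd.DataFrame({
        "order_id": [1, 2],
        "status":   ["ok", "cancelled"],
        "amount":   [100.0, 250.0],
    })
    # WHEN revenue is aggregated
    revenue = aggregate_revenue(orders)
    # THEN only the valid order contributes
    assert revenue == 100.0
```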
Conclusion
When it comes to data quality in BI, it is essential to understand the production system and the development process as a single unit. Neither helps if the other is neglected. An independent monitoring and balancing tool should be used in production. For development, the BI system is best designed so as to allow the use of methods from modern software development.
Unfortunately, there is no universal, automated solution for constructed test data, and the limit at which maintenance and extension effort exceeds the available resources can be reached quickly. Regrettably, the market currently offers no decent solution for BI test-data management that minimizes the use of resources. The only remaining option is therefore to build your own solution which strikes a good balance between test requirements, test automation and costs, and climbs the test pyramid as high as possible.
Are you seeking a competent partner for support here? Then by all means contact me!