In this article, I propose a way for physical organization of historical tables, which makes it possible to effectively use partition pruning to optimize query performance. The way is specifically designed for data warehouses, therefore it presumes relatively complicated data loads yet productive selections.
Table of Contents
Physical Organisation of Historical Tables
In this article, I propose a way for physical organization of historical tables, which makes it possible to effectively use partition pruning to optimize query performance. The way is specifically designed for data warehouses, therefore it presumes relatively complicated data loads yet productive selections.
I will use two metrics to evaluate the effectiveness of each described solution:
Total Row Count (TRC)
Rows per Single Read (RSR)
The Total Row Count determines the overall size of a table. The "Rows per Single Read" metric gives us an idea on how much data should be scanned to satisfy a typical query. The typical query against a historical table is, in my view, the one that returns values of historized attributes as of given date; it affects the significant part of key values which makes index access ineffective.
Does the Change in the Demography of Data Impact on the Indicators?
It's obvious that effectiveness of physical design depends on data demographics. I will use a one case for an initial comparison, and later on we'll see how changes in data demographics can influence the metrics. The case is: a table that keeps the history of account balances in a bank. That data has following dimensions:
Count of Accounts (CA): 1 000 000
History Depth (HD): 5 years
Frequency of Changes (FC): 5 changes per account per month
First Possibility: The Interval Table
Firstly, let's recall traditional ways to keep historucal data. The most frequently used solution is an interval table. each row in the table has two dates which represent the vailidity onterval of a record.
The balace for a given date could be selceted using query like this:
Based on this query, we can identify the main drawback of interval tables: there is no way to use date-partitioning, because VALID_FROM/VALID_TO of necessary rows can unconstrainedly differ from the given date.
The performance metrics could be calculated using following formulas:
TRC = CA * FC * HD * 12
RSR = TRC (because the whole table should be read)
In summary, we have:
Total row count = 300 000 000
Rows per single read = 300 000 000
Second Possibility: The Snapshot Table
The second option is a snapshot table, which means saving balances for each account for each day.
Doing this way means storing a lot of redundant informations, but it permits using with equity condition to select balances on a given date:
For this kind of queries, Oracle can easily use partition pruning if the base table is partitioned by BALANCE_DT. Therefore, we can calculate the performance metrics like so:
TRC = CA * HD * 365
RSR = CA (there is one record per account per partition)
That determines our data given below:
Each single read costs much less for a snapshot table than for nterval one, but this advantage is ruined by increasing overall table's bigness. The table of this size has less chances to be cached, requires more time for backup, restore and maintenance operations.
You can find out which solution can help you overcome these factors in our next blog post!
Who is b.telligent?
Do you want to replace the IoT core with a multi-cloud solution and utilise the benefits of other IoT services from Azure or Amazon Web Services? Then get in touch with us and we will support you in the implementation with our expertise and the b.telligent partner network.
Exasol is a leading manufacturer of analytical database systems. Its core product is a high-performance, in-memory, parallel processing software specifically designed for the rapid analysis of data. It normally processes SQL statements sequentially in an SQL script. But how can you execute several statements simultaneously? Using the simple script contained in this blog post, we show you how.
Many companies with SAP source systems are familiar with this challenge: They want to integrate their data into an Azure data lake in order to process them there with data from other source systems and applications for reporting and advanced analytics. The new SAP notice on use of the SAP ODP framework has also raised questions among b.telligent's customers. This blog post presents three good approaches to data integration (into Microsoft's Azure cloud) which we recommend at b.telligent and which are supported by SAP.
First of all, let us summarize the customers' requirements. In most cases, enterprises want to integrate their SAP data into a data lake in order to process them further in big-data scenarios and for advanced analytics (usually also in combination with data from other source systems).
As part of their current modernization and digitization initiatives, many companies are deciding to move their data warehouse (DWH) or data platform to the cloud. This article discusses from a technical/organizational perspective which aspects areof particularly important for this and which strategies help to minimize anyrisks. Migration should not be seen as a purely technical exercise. "Soft" factors and business use-cases have a much higher impact.