Sizing a Redshift Database for a Cloud DWH

As an AWS advanced partner, our intention is to regularly document our findings on solutions for data & analytics solutions in this blog and share them with parties interested in AWS. As a cloud provider, AWS offers us and our clients an ecosystem and services allowing very useful solutions for modern data architecture. Our goal here is not to repeat generally accessible AWS documentation. We instead want to condense our technical and professional experience "on paper".

This blog series starts with the topic: "Sizing a Redshift database for a cloud DWH".

Table of Contents

Motivation

Having now become indispensable for enterprises over more than 35 years, data-warehouse solutions are a central component at "data-driven" companies, whether for classic business intelligence for strategic and tactical management or as a basis for data products allowing operational control.

For a very long time, the topic of "big data" was a real obstacle for on-premises databases, due to the resultant higher costs and/or slower data processing. This has changed dramatically in recent years with the introduction of cloud DWH solutions which now make storage and computation results available within minutes or seconds. Workloads of any size can be processed.

On the other hand, however, it is precisely this flexibility which poses risks pertaining to infrastructure costs and sustainability, and is discussed in the context of this article, i.e. how does one sensibly adjust the size of a Redshift without being shocked by a cost explosion?

Preliminary Considerations or “Does Software Beat Hardware”

The dominant criterion in the performance of a Redshift and cloud DWH is generally the speed of data communications:

  • How rapidly do Redshift nodes receive data from Redshift's managed S3 storage?
  • How rapidly are data from individual tables exchanged between cluster nodes?

So, regardless of whether you work with a cluster or Redshift Serverless, reduction in I/O load is always key! I/O requirements play a major role in database sizing.

The size of a DWH thus depends on the following factors:

  • Volume of compressed data (requiring processing), usually in addition to query magnitude and response on the largest fact tables, i.e., if only 5% of such facts are usually queried, then sorting by the time attribute, for example, helps significantly reduce I/O.
  • Requirements concerning the duration of daily or hourly processing. The following standard requirements are meaningful here:
    • A daily process should take no longer than 2h - 4h to allow any processes aborted in the morning to be restarted after troubleshooting, thus preventing loss of an entire day.
    • For hourly processing to make sense at all, the involved processes should run for just a few minutes each. The requirement here is to process the freshest possible data and keep reporting up-to-date.
    • Processes needed at intervals of minutes/seconds, i.e. streaming processes, should also be able to run correspondingly fast to prevent data backlogs.

This necessitates an understanding of data use cases (requirements), query patterns and changes in data volume over time in order to enable proactive adjustments where necessary.

  • Factors influencing query performance for end-users / BI tools
    • Complexity of SQLs and user expectations in the areas of reporting and ad-hoc analytics: >95% of all queries should be within the range of seconds, while the remaining 5% (if they do not run constantly) may also take minutes.
    • Number of queries & concurrent users
    • Long-running queries (e.g. month-end closing statement) which are executed very rarely and should not affect sizing.

Ultimately, the sizing of a DWH is concluded precisely once all these requirements have been fulfilled. Storage of data in general is a secondary aspect, because separation (at least Serverless and new RA3 nodes) makes memory capacity more or less "unlimited" and only has a minor impact on the total costs of a Redshift. 20h of ra3.xlplus or 8h 8 RPU of Redshift Serverless cost about as much as 1TB month. Usage costs for servers, and not data volumes, are generally the cost drivers.

The ability to scale a database seems to solve these issues easily. However, we do not recommend using this factor exclusively as a solution, experience having shown that the development patterns described next usually have a much greater influence on query performance and are much more cost-effective to implement:

These development patterns are quite easy to consider and should always be included in data engineering. Delta processes and sort keys are usually the biggest "levers", as they easily allow performance gains of 100 -1000 by significantly lowering data processing requirements. Instead of 10 years of history, only the chronology of the last month or days is processed, according to needs.

The other best practices have less of an impact:

  • Import from / export to S3 using compressed or column-based data formats (e.g. Parquet).
  • Reduction in data communications between cluster nodes through the right choice of distribution keys.
  • Reduction in data through a wise choice of data model (e.g. aggregate tables) or materialized views.
  • Outsourcing to the lake as well as use of Redshift Spectrum and caching in BI tools or proprietary data storage in these BI tools to reduce query volumes.
  • Processing of mass data outside Redshift, e.g. via AWS Glue, in order to reduce peak usage accordingly.
  • Materialization of intermediate results in temporary tables to allow accurate performance analysis and "facilitate" Redshift query optimization.

In our experience, the influence of data vault vs. dimensional vs. 3NF modelling plays a secondary role, so that the measures described above can be applied independently of this influence.

"Help me choose" is carried out once, while steps 2-4 are repeated as needed or monthly in order to regularly monitor performance parameters.

The problem with step (1) is that estimation focuses heavily on individual query patterns (time-based) and the size of the Redshift is overrated. We therefore recommend starting relatively small (50%) and scaling only when necessary.

In our opinion, an analysis of long processes (3) (SQL from data engineers, BI tools, ad-hoc by power users) should be carried out on a monthly basis. The decision whether to either modify software or scale hardware should be reserved exclusively for the DWH architect.

Of course, mechanisms such as concurrency scaling, limits in Redshift Serverless and Redshift advisor (for sort keys, distribution keys, compression and workload management) are also available; however, it should be noted that whereas these can improve performance, they ultimately do not optimize "bad" SQL code or a data model.

Conclusion

When using cloud services, it is tempting to simply augment potentially unlimited resources, but experience has shown that this leads to permanent cost growth and neglect of the topic of sustainability (in terms of conserving resources and quality of ETL processes).

We therefore always recommend first examining and optimizing the structure of SQL queries / processes. Obvious here is a need for a proper trade-off between development expenditure and infrastructure costs. Certain development patterns can help to keep development expenditure low and counteract creeping "ballooning". Keep in mind that sizing a Redshift database for a cloud data warehouse is not a one-time process. When data, query patterns and even workloads change, sizing needs to be reviewed and adjustments made.

At b.telligent , we are happy to support you in the introduction of Redshift for your data warehouse use cases and also look forward to helping fine-tune your Redshift cluster and the associated data processes!

Want To Learn More? Contact Us!

Helene Fuchs

Your contact person

Helene Fuchs

Domain Lead Data Platform & Data Management

Pia Ehrnlechner

Your contact person

Pia Ehrnlechner

Domain Lead Data Platform & Data Management

Related Posts

chevron left icon
Previous post
Next post
chevron right icon

No previous post

No next post