AWS Big Data Blog

Architecture patterns to optimize Amazon Redshift performance at scale

Tens of thousands of customers use Amazon Redshift as a fully managed, petabyte-scale data warehouse service in the cloud. As an organization’s business data grows in volume, its data analytics needs grow with it. Amazon Redshift performance needs to be optimized at scale to achieve faster, near real-time business intelligence (BI). You might also consider optimizing Amazon Redshift performance when your data analytics workloads or user base increase, or to meet a data analytics performance service level agreement (SLA). You can also look for ways to optimize Amazon Redshift data warehouse performance after you complete an online analytical processing (OLAP) migration from another system to Amazon Redshift.

In this post, we will show you five Amazon Redshift architecture patterns that you can consider to optimize your Amazon Redshift data warehouse performance at scale using features such as Amazon Redshift Serverless, Amazon Redshift data sharing, Amazon Redshift Spectrum, zero-ETL integrations, and Amazon Redshift streaming ingestion.

Use Amazon Redshift Serverless to automatically provision and scale your data warehouse capacity

To start, let’s review using Amazon Redshift Serverless to automatically provision and scale your data warehouse capacity. The architecture is shown in the following diagram and includes different components within Amazon Redshift Serverless like ML-based workload monitoring and automatic workload management.

Amazon Redshift Serverless architecture diagram

Amazon Redshift Serverless is a deployment model that you can use to run and scale your Redshift data warehouse without managing infrastructure. Amazon Redshift Serverless will automatically provision and scale your data warehouse capacity to deliver fast performance for even the most demanding, unpredictable, or massive workloads.

Amazon Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs). You pay for the workloads you run in RPU-hours on a per-second basis. You can optionally configure the base capacity, max capacity (Max RPUs), and maximum RPU-hours usage limit parameters to manage your warehouse performance and costs. This post dives deep into understanding cost mechanisms to consider when managing Amazon Redshift Serverless.
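For example, here is a minimal sketch of adjusting these settings programmatically with boto3 (Python). The workgroup name, ARN, and RPU values are placeholders for illustration, not recommendations:

```python
import boto3

serverless = boto3.client("redshift-serverless")

# Raise the base capacity (RPUs used to process queries) and cap the
# maximum capacity the workgroup is allowed to scale to.
serverless.update_workgroup(
    workgroupName="analytics-wg",  # hypothetical workgroup name
    baseCapacity=64,               # baseline RPUs
    maxCapacity=256,               # upper bound on RPU scaling
)

# Optionally cap monthly RPU-hour usage; breachAction controls whether
# Redshift logs, emits a metric, or deactivates queries past the limit.
serverless.create_usage_limit(
    resourceArn=("arn:aws:redshift-serverless:us-east-1:123456789012:"
                 "workgroup/analytics-wg"),  # placeholder workgroup ARN
    usageType="serverless-compute",
    amount=5000,  # RPU-hours per period
    period="monthly",
    breachAction="log",
)
```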

Amazon Redshift Serverless scaling is automatic and based on your RPU capacity. To further optimize scaling for large-scale datasets, Amazon Redshift Serverless offers AI-driven scaling and optimization. It uses AI to scale automatically as your workload changes across key metrics such as data volume, concurrent users, and query complexity, to meet your price-performance targets.

There is no maintenance window in Amazon Redshift Serverless, because software version updates are applied automatically, without interrupting existing connections or running queries. Make sure to consult the considerations guide to better understand how Amazon Redshift Serverless operates.

You can migrate from an existing provisioned Amazon Redshift data warehouse to Amazon Redshift Serverless by creating a snapshot of your provisioned data warehouse and then restoring that snapshot in Amazon Redshift Serverless. Note that Amazon Redshift automatically converts interleaved sort keys to compound sort keys when you restore a provisioned data warehouse snapshot to a Serverless namespace. You can also get started with a new Amazon Redshift Serverless data warehouse.
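The restore itself can be scripted. The following is a minimal sketch using boto3; the namespace, workgroup, and snapshot ARN are placeholders:

```python
import boto3

serverless = boto3.client("redshift-serverless")

# Restore a snapshot taken from a provisioned cluster into an existing
# Amazon Redshift Serverless namespace and workgroup.
serverless.restore_from_snapshot(
    namespaceName="analytics-ns",  # hypothetical namespace
    workgroupName="analytics-wg",  # hypothetical workgroup
    snapshotArn=("arn:aws:redshift:us-east-1:123456789012:"
                 "snapshot:my-cluster/my-snapshot"),  # placeholder
)
```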

Amazon Redshift Serverless use cases

You can use Amazon Redshift Serverless for:

  • Self-service analytics
  • Auto scaling for unpredictable or variable workloads
  • New applications
  • Multi-tenant applications

With Amazon Redshift, you can access and query data stored in Amazon S3 Tables, fully managed Apache Iceberg tables optimized for analytics workloads. Amazon Redshift also supports querying data stored in Apache Iceberg tables and other open table formats such as Apache Hudi and Linux Foundation Delta Lake. For more information, see External tables for Redshift Spectrum and Expand data access through Apache Iceberg using Delta Lake UniForm on AWS.
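As an illustration, the following sketch queries a hypothetical Apache Iceberg table that is already registered in the AWS Glue Data Catalog, using the Redshift Data API; the workgroup, database, IAM role ARN, and table names are all placeholders. Iceberg tables need no CREATE EXTERNAL TABLE statement: once the external schema is mounted, you query them directly.

```python
import boto3

client = boto3.client("redshift-data")

# Mount a Glue Data Catalog database that contains Iceberg tables.
client.execute_statement(
    WorkgroupName="analytics-wg", Database="dev",
    Sql="""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
    FROM DATA CATALOG DATABASE 'lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftGlueRole'
    """,
)

# Query the Iceberg table with standard SQL.
client.execute_statement(
    WorkgroupName="analytics-wg", Database="dev",
    Sql="SELECT COUNT(*) FROM lake.orders_iceberg",
)
```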

You can also use Amazon Redshift Serverless with Amazon Redshift data sharing, which lets you scale access to large datasets through independent datashares while maintaining workload isolation controls.

Amazon Redshift data sharing to share live data between separate Amazon Redshift data warehouses

Next, we will look at an Amazon Redshift data sharing architecture pattern, shown in the following diagram, to share data between a hub Amazon Redshift data warehouse and spoke Amazon Redshift data warehouses, and to share data across multiple Amazon Redshift data warehouses with each other.

Amazon Redshift data sharing architecture patterns diagram

With Amazon Redshift data sharing, you can securely share access to live data between separate Amazon Redshift data warehouses without manually moving or copying the data. Because the data is live, all users see the most up-to-date and consistent information as soon as it’s updated, with each workload using its own dedicated resources. Because the compute accessing the data is isolated, you can size each data warehouse configuration for an individual workload’s price-performance requirements rather than the aggregate of all workloads. This also provides additional flexibility to scale with new workloads without affecting the workloads already running on Amazon Redshift.

A datashare is the unit of sharing data in Amazon Redshift. A producer data warehouse administrator can create datashares and add datashare objects to share data with other data warehouses, referred to as outbound shares. A consumer data warehouse administrator can receive datashares from other data warehouses, referred to as inbound shares.

To get started, a producer data warehouse administrator adds the objects (and any required permissions) that another data warehouse needs to access to a datashare, and shares that datashare with a consumer. After the consumer creates a database from the datashare, the shared objects can be accessed with three-part notation (consumer_database_name.schema_name.table_name) on the consumer, using the consumer’s compute.
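The following is a minimal sketch of that producer/consumer flow using the Redshift Data API from Python. The workgroup names, table, and namespace GUIDs are placeholders:

```python
import boto3

client = boto3.client("redshift-data")

def run(workgroup: str, database: str, sql: str) -> str:
    """Submit one statement asynchronously and return its statement ID."""
    resp = client.execute_statement(
        WorkgroupName=workgroup, Database=database, Sql=sql
    )
    return resp["Id"]

# On the producer: create a datashare, add objects, grant it to the consumer.
for sql in [
    "CREATE DATASHARE sales_share",
    "ALTER DATASHARE sales_share ADD SCHEMA public",
    "ALTER DATASHARE sales_share ADD TABLE public.sales",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE "
    "'11111111-2222-3333-4444-555555555555'",  # consumer namespace GUID
]:
    run("producer-wg", "dev", sql)

# On the consumer: create a database from the datashare, then query the
# shared table with three-part notation using the consumer's own compute.
run("consumer-wg", "dev",
    "CREATE DATABASE sales_db FROM DATASHARE sales_share "
    "OF NAMESPACE '99999999-8888-7777-6666-555555555555'")  # producer GUID
run("consumer-wg", "dev", "SELECT COUNT(*) FROM sales_db.public.sales")
```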

Amazon Redshift data sharing use cases

Amazon Redshift data sharing, including multi-warehouse writes in Amazon Redshift, can be used to:

  • Support different kinds of business-critical workloads, including workload isolation and chargeback for individual workloads.
  • Enable cross-group collaboration across teams for broader analytics, data science, and cross-product impact analysis.
  • Deliver data as a service.
  • Share data between environments such as development, test, and production, at different levels of granularity, to improve team agility.
  • License access to data in Amazon Redshift by listing Amazon Redshift data sets in the AWS Data Exchange catalog so that customers can find, subscribe to, and query the data in minutes.
  • Update business source data on the producer. You can share data as a service across your organization, but then consumers can also perform actions on the source data.
  • Insert additional records on the producer. Consumers can add records to the original source data.

The following articles provide examples of how you can use Amazon Redshift data sharing to scale performance:

Amazon Redshift Spectrum to query data in Amazon S3

You can use Amazon Redshift Spectrum to query data in Amazon S3 through the AWS Glue Data Catalog, as shown in the following diagram.

Amazon Redshift Spectrum architecture diagram

You can use Amazon Redshift Spectrum to efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Using the massively parallel scale of the Amazon Redshift Spectrum layer, you can run fast, parallel queries against large datasets while most of the data remains in Amazon S3. This can significantly improve the performance and cost-effectiveness of massive analytics workloads, because you can use the scalable storage of Amazon S3 to handle large volumes of data while still benefiting from the powerful query processing capabilities of Amazon Redshift.

Amazon Redshift Spectrum uses separate infrastructure independent of your Amazon Redshift data warehouse, offloading many compute-intensive tasks, such as predicate filtering and aggregation. This means that Spectrum queries can use significantly less of your data warehouse’s processing capacity than other queries. Amazon Redshift Spectrum also scales automatically to potentially thousands of instances, based on the demands of your queries.

When implementing Amazon Redshift Spectrum, make sure to consult the considerations guide, which details how to configure networking, create external tables, and set up the required permissions.

Review this best practices guide and this blog post, which outline recommendations for optimizing performance, including the impact of different file types, how to design around the scaling behavior, and how to partition files efficiently. You can check out an example architecture in Accelerate self-service analytics with Amazon Redshift Query Editor V2.

To get started with Amazon Redshift Spectrum, you define the structure for your files and register them as an external table in an external data catalog (AWS Glue, Amazon Athena, and Apache Hive metastore are supported). After creating your external table, you can query your data in Amazon S3 directly from Amazon Redshift.
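For illustration, here is a minimal sketch of that flow with the Redshift Data API: register an external schema backed by the AWS Glue Data Catalog, define an external table over Parquet files, and run an aggregation that is largely executed in the Spectrum layer. The workgroup, database, IAM role, and S3 location are placeholders:

```python
import boto3

client = boto3.client("redshift-data")

def run(sql: str) -> str:
    """Submit one statement asynchronously and return its statement ID."""
    resp = client.execute_statement(
        WorkgroupName="analytics-wg", Database="dev", Sql=sql
    )
    return resp["Id"]

# External schema backed by the Glue Data Catalog; creates the catalog
# database if it does not exist yet.
run("""
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# External table describing Parquet files that stay in S3.
run("""
CREATE EXTERNAL TABLE spectrum.sales (
    sale_id   BIGINT,
    amount    DECIMAL(10,2),
    sale_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/sales/'
""")

# The scan and aggregation run in the Spectrum layer; only the small
# filtered result reaches the data warehouse.
run("SELECT sale_date, SUM(amount) FROM spectrum.sales "
    "WHERE sale_date >= '2024-01-01' GROUP BY sale_date")
```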

Amazon Redshift Spectrum use cases

You can use Amazon Redshift Spectrum in the following use cases:

  • Querying huge volumes of less frequently accessed data, such as building a lake house architecture to query exabytes of data in an S3 data lake
  • Running heavy scan- and aggregation-intensive queries
  • Running selective queries that can use partition pruning and predicate pushdown, so the output is fairly small

Zero-ETL to unify all data and achieve near real-time analytics

You can use zero-ETL integration with Amazon Redshift to integrate with your transactional databases, such as Amazon Aurora MySQL-Compatible Edition, so you can run near real-time analytics in Amazon Redshift, BI in Amazon QuickSight, or machine learning workloads in Amazon SageMaker AI, as shown in the following diagram.

Zero-ETL integration with Amazon Redshift architecture diagram

Zero-ETL integration with Amazon Redshift removes the undifferentiated heavy lifting to build and manage complex extract, transform, and load (ETL) data pipelines; unifies data across databases, data lakes, and data warehouses; and makes data available in Amazon Redshift in near real time for analytics, artificial intelligence (AI) and machine learning (ML) workloads.

Currently, Amazon Redshift supports zero-ETL integrations with sources including Amazon Aurora MySQL-Compatible Edition, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon DynamoDB.

To create a zero-ETL integration, you specify an integration source, such as an Amazon Aurora DB cluster, and an Amazon Redshift data warehouse as the target. The target can be an Amazon Redshift Serverless workgroup or a provisioned data warehouse, including Multi-AZ deployments on RA3 clusters, which automatically recover from infrastructure or Availability Zone failures to help keep your workloads uninterrupted. The integration replicates data from the source to the target and makes the data available in the target data warehouse within seconds. The integration also monitors the health of the integration pipeline and recovers from issues when possible.
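As a sketch, creating an Aurora-to-Redshift integration programmatically can look like the following; both ARNs are placeholders, and the target namespace must already be configured for zero-ETL (for example, with case-sensitive identifiers enabled and a resource policy authorizing the source):

```python
import boto3

rds = boto3.client("rds")

# Create the zero-ETL integration from an Aurora DB cluster (source)
# to a Redshift Serverless namespace (target).
rds.create_integration(
    IntegrationName="orders-zero-etl",  # hypothetical name
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:orders-cluster",
    TargetArn=("arn:aws:redshift-serverless:us-east-1:123456789012:"
               "namespace/11111111-2222-3333-4444-555555555555"),
)
```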

Make sure to review considerations, limitations, and quotas on both the data source and target when using zero-ETL integrations with Amazon Redshift.

Zero-ETL integration use cases

You can use zero-ETL integration with Amazon Redshift as an architecture pattern to boost analytical query performance at scale and to enable a straightforward, secure way to run near real-time analytics on petabytes of transactional data with continuous change data capture (CDC). Plus, you can use other Amazon Redshift capabilities such as built-in machine learning, materialized views, data sharing, and federated access to multiple data stores and data lakes. You can see more zero-ETL integration use cases at What is ETL.

Ingest streaming data into Amazon Redshift data warehouse for near real-time analytics

You can ingest streaming data with Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) to Amazon Redshift and run near real-time analytics in Amazon Redshift, as shown in the following diagram.

Amazon Redshift data streaming architecture diagram

Amazon Redshift streaming ingestion provides low-latency, high-speed data ingestion directly from Amazon Kinesis Data Streams or Amazon MSK to an Amazon Redshift provisioned or Amazon Redshift Serverless data warehouse, without staging data in Amazon S3. You can connect to and access the data from the stream using standard SQL and simplify data pipelines by creating materialized views in Amazon Redshift on top of the data stream. For best practices, you can review these blog posts:

To get started with Amazon Redshift streaming ingestion, you create an external schema that maps to the streaming data source and create a materialized view that references the external schema. For details on how to set up Amazon Redshift streaming ingestion for Kinesis Data Streams, see Getting started with streaming ingestion from Amazon Kinesis Data Streams. For details on how to set up Amazon Redshift streaming ingestion for Amazon MSK, see Getting started with streaming ingestion from Apache Kafka sources.
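For a concrete, minimal sketch, the statements can be submitted through the Redshift Data API. The workgroup, IAM role ARN, stream name, and JSON fields here are hypothetical:

```python
import boto3

client = boto3.client("redshift-data")

def run(sql: str) -> str:
    """Submit one statement asynchronously and return its statement ID."""
    resp = client.execute_statement(
        WorkgroupName="analytics-wg", Database="dev", Sql=sql
    )
    return resp["Id"]

# Map an external schema to Kinesis Data Streams.
run("""
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole'
""")

# Materialized view over the stream; kinesis_data holds the raw record
# payload as VARBYTE, decoded here from UTF-8 JSON.
run("""
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'),
                              'user_id') AS user_id,
       JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'),
                              'page') AS page
FROM kds."clickstream"
""")

# Query the near real-time data with standard SQL.
run("SELECT page, COUNT(*) FROM clickstream_mv GROUP BY page")
```

With AUTO REFRESH YES, Amazon Redshift keeps the view up to date as new records arrive on the stream.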

Amazon Redshift streaming ingestion use cases

You can use Amazon Redshift streaming ingestion to:

  • Improve gaming experience by analyzing real-time data from gamers
  • Analyze real-time IoT data and use machine learning (ML) within Amazon Redshift to improve operations, predict customer churn, and grow your business
  • Analyze clickstream user data
  • Conduct real-time troubleshooting by analyzing streaming data from log files
  • Perform near real-time retail analytics on streaming point of sale (POS) data

Other Amazon Redshift features to optimize performance

There are other Amazon Redshift features that you can use to optimize performance.

  • You can resize Amazon Redshift provisioned clusters to optimize data warehouse compute and storage use.
  • You can use concurrency scaling, where Amazon Redshift automatically adds capacity to process increases in read operations, such as dashboard queries, and write operations, such as data ingestion and processing.
  • You can also consider materialized views in Amazon Redshift, applicable to both provisioned and serverless data warehouses, which contain a precomputed result set based on a SQL query over one or more base tables. They are especially useful for speeding up queries that are predictable and repeated (see the sketch after this list).
  • You can use auto-copy for Amazon Redshift to set up continuous file ingestion from your Amazon S3 prefix and automatically load new files to tables in your Amazon Redshift data warehouse without the need for additional tools or custom solutions.
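To make the last two items concrete, here is a minimal sketch that creates an auto-refreshing materialized view and an auto-copy job; the workgroup, table, bucket, and role names are placeholders:

```python
import boto3

client = boto3.client("redshift-data")

def run(sql: str) -> str:
    """Submit one statement asynchronously and return its statement ID."""
    resp = client.execute_statement(
        WorkgroupName="analytics-wg", Database="dev", Sql=sql
    )
    return resp["Id"]

# Materialized view: precomputes a repeated aggregation so dashboards
# read the stored result instead of rescanning the base table.
run("""
CREATE MATERIALIZED VIEW daily_sales_mv AUTO REFRESH YES AS
SELECT sale_date, SUM(amount) AS total_amount
FROM public.sales
GROUP BY sale_date
""")

# Auto-copy: a COPY JOB watches the S3 prefix and loads new files to the
# table as they land, with no external scheduler.
run("""
COPY public.sales
FROM 's3://my-ingest-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET
JOB CREATE sales_copy_job AUTO ON
""")
```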

Cloud security at AWS is the highest priority. Amazon Redshift offers broad security-related configurations and controls to help ensure information is appropriately protected. See Amazon Redshift Security Best Practices for a comprehensive guide to Amazon Redshift security best practices.

Conclusion

In this post, we reviewed Amazon Redshift architecture patterns and features that you can use to help scale your data warehouse to dynamically accommodate different workload combinations, volumes, and data sources to achieve optimal price performance. You can use them alone or together, choosing the best infrastructure setup for your use case requirements, and scale to accommodate future growth.

Get started with these Amazon Redshift architecture patterns and features today by following the instructions provided in each section. If you have questions or suggestions, leave a comment below.


About the authors

Eddie Yao is a Principal Technical Account Manager (TAM) at AWS. He helps enterprise customers build scalable, high-performance cloud applications and optimize cloud operations. With over a decade of experience in web application engineering, digital solutions, and cloud architecture, Eddie currently focuses on Media & Entertainment (M&E) and Sports industries and AI/ML and generative AI.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She supports customers in validating analytics solutions by architecting proof of concept workloads designed to meet their specific needs.

Scott St. Martin is a Solutions Architect at AWS who is passionate about helping customers build modern applications. Scott uses his decade of experience in the cloud to guide organizations in adopting best practices around operational excellence and reliability, with a focus on the manufacturing and financial services spaces. Outside of work, Scott enjoys traveling, spending time with family, and playing piano.