AWS Big Data Blog

Category: Analytics

Unlock self-serve streaming SQL with Amazon Managed Service for Apache Flink

In this post, we present Riskified’s journey toward enabling self-service streaming SQL pipelines. We walk through the motivations behind the shift from Confluent ksqlDB to Apache Flink, the architecture Riskified built using Amazon Managed Service for Apache Flink, the technical challenges they faced, and the solutions that helped them make streaming accessible, scalable, and production-ready.

Unify streaming and analytical data with Amazon Data Firehose and Amazon SageMaker Lakehouse

In this post, we show you how to create Iceberg tables in Amazon SageMaker Unified Studio and stream data to these tables using Firehose. With this integration, data engineers, analysts, and data scientists can seamlessly collaborate and build end-to-end analytics and ML workflows using SageMaker Unified Studio, removing traditional silos and accelerating the journey from data ingestion to production ML models.

OpenSearch UI: Six months in review

OpenSearch UI has been adopted by thousands of customers for various use cases since its launch in November 2024. Customer stories and feedback have helped shape our feature improvements. Six months after general availability, this post shares the major enhancements that have improved OpenSearch UI’s capabilities, especially in observability and security analytics.

Scalable analytics and centralized governance for Apache Iceberg tables using Amazon S3 Tables and Amazon Redshift

In this post, we’ll build on the first post in this series to show you how to set up an Apache Iceberg data lake catalog using Amazon S3 Tables and provide different levels of access control to your data. Through this example, you’ll set up fine-grained access controls for multiple users and see how they work using Amazon Redshift. We’ll also review an example that simultaneously uses data residing in both Amazon Redshift and Amazon S3 Tables, enabling a unified analytics experience.

Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift

In this post, we showcase how financial planners, advisors, or bankers can now ask questions in natural language and receive precise data from customer databases for accounts, investments, loans, and transactions. Amazon Bedrock Knowledge Bases automatically translates these natural language queries into optimized SQL statements, thereby accelerating time to insight and enabling faster discoveries and efficient decision-making.

Simplify enterprise data access using the Amazon Redshift integration with Amazon S3 Access Grants

In this post, we show how to grant Amazon S3 permissions to IAM Identity Center users and groups using S3 Access Grants. We also test the integration using an IAM Identity Center federated user to unload data from Amazon Redshift to Amazon S3 and load data from Amazon S3 to Amazon Redshift.

Access Amazon Redshift Managed Storage tables through Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

Zero-copy, Coordination-free approach to OpenSearch Snapshots

In this blog post, we explain how we enhanced snapshot efficiency in Amazon OpenSearch Service while carefully maintaining these critical operational aspects. These snapshot optimizations are enabled for all OpenSearch optimized instance family (OR1, OR2, OM2) domains running version 2.17 or later.

Enhance governance with asset type usage policies in Amazon SageMaker

In this post, we introduce authorization policies for custom asset types—a new governance capability in Amazon SageMaker that gives organizations fine-grained control over who can create and manage assets using specific templates. This feature enhances data governance by allowing teams to enforce usage policies that align with business and security requirements across the organization.

Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless

In this post, we share how AppsFlyer successfully migrated their massive data infrastructure from self-managed Hadoop clusters to Amazon EMR Serverless, detailing their best practices, challenges to overcome, and lessons learned that can help guide other organizations in similar transformations.