Unify streaming and analytical data with Amazon Data Firehose and Amazon SageMaker Lakehouse

Organizations are increasingly required to derive real-time insights from their data while maintaining the ability to perform analytics. This dual requirement presents a significant challenge: how to effectively bridge the gap between streaming data and analytical workloads without creating complex, hard-to-maintain data pipelines. In this post, we demonstrate how to simplify this process using Amazon Data Firehose (Firehose) to deliver streaming data directly to Apache Iceberg tables in Amazon SageMaker Lakehouse, creating a streamlined pipeline that reduces complexity and maintenance overhead.

Streaming data empowers AI and machine learning (ML) models to learn and adapt in real time, which is crucial for applications that require immediate insights or dynamic responses to changing conditions. This creates new opportunities for business agility and innovation. Key use cases include predicting equipment failures based on sensor data, monitoring supply chain processes in real time, and enabling AI applications to respond dynamically to changing conditions. Real-time streaming data helps customers make quick decisions, fundamentally changing how businesses compete in real-time markets.

Amazon Data Firehose seamlessly acquires, transforms, and delivers data streams to lakehouses, data lakes, data warehouses, and analytics services, with automatic scaling and delivery within seconds. For analytical workloads, a lakehouse architecture has emerged as an effective solution, combining the best elements of data lakes and data warehouses. Apache Iceberg, an open table format, enables this transformation by providing transactional guarantees, schema evolution, and efficient metadata handling that were previously only available in traditional data warehouses. SageMaker Lakehouse unifies your data across Amazon Simple Storage Service (Amazon S3) data lakes, Amazon Redshift data warehouses, and other sources, and gives you the flexibility to access your data in-place with Iceberg-compatible tools and engines. By using SageMaker Lakehouse, organizations can harness the power of Iceberg while benefiting from the scalability and flexibility of a cloud-based solution. This integration removes the traditional barriers between data storage and ML processes, so data workers can work directly with Iceberg tables in their preferred tools and notebooks.

In this post, we show you how to create Iceberg tables in Amazon SageMaker Unified Studio and stream data to these tables using Firehose. With this integration, data engineers, analysts, and data scientists can seamlessly collaborate and build end-to-end analytics and ML workflows using SageMaker Unified Studio, removing traditional silos and accelerating the journey from data ingestion to production ML models.

Solution overview

The following diagram illustrates the architecture of how Firehose can deliver real-time data to SageMaker Lakehouse.

This post includes an AWS CloudFormation template to set up supporting resources so Firehose can deliver streaming data to Iceberg tables. You can review and customize it to suit your needs. The template performs the following operations:

Creates an AWS Identity and Access Management (IAM) role with permissions needed for Firehose to write to an S3 bucket
Creates resources for the Amazon Kinesis Data Generator to send sample streaming data to Firehose
Grants AWS Lake Formation permissions to the Firehose IAM role for Iceberg tables created in SageMaker Unified Studio
Creates an S3 bucket to backup records failed to deliver

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account – If you don’t have an account, you can create one.
A SageMaker Unified Studio domain – For instructions, refer to Create an Amazon SageMaker Unified Studio domain – quick setup.
A demo project – Create a demo project in your SageMaker Unified Studio domain. For instructions, see Create a project. For this example, we choose All capabilities in the project profile section and use streaming_datalake as the AWS Glue database name.

After you create the prerequisites, verify you can log in to SageMaker Unified Studio and the project is created successfully. Every project created in SageMaker Unified Studio gets a project location and project IAM role, as highlighted in the following screenshot.

Create an Iceberg table

For this solution, we use Amazon Athena as the engine for our query editor. Complete the following steps to create your Iceberg table:

In SageMaker Unified Studio, on the Build menu, choose Query Editor.

Choose Athena as the engine for query editor and choose the AWS Glue database created for the project.

Use the following SQL statement to create the Iceberg table. Make sure to provide your project AWS Glue database and project Amazon S3 location (can be found on the project overview page):

CREATE TABLE firehose_events (
type struct<device: string, event: string, action: string>,
customer_id string,
event_timestamp timestamp,
region string)
LOCATION '<PROJECT_S3_LOCATION>/iceberg/events'
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='zstd'
);

Deploy the supporting resources

The next step is to deploy the required resources into your AWS environment by using a CloudFormation template. Complete the following steps:

Choose Launch Stack.
Choose Next.
Leave the stack name as firehose-lakehouse.
Provide the user name and password that you want to use for accessing the Amazon Kinesis Data Generator application.
For DatabaseName, enter the AWS Glue database name.
For ProjectBucketName, enter the project bucket name (located on the SageMaker Unified Studio project details page).
For TableName, enter the table name created in SageMaker Unified Studio.
Choose Next.

Select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.

Complete the stack.

Create a Firehose stream

Complete the following steps to create a Firehose stream to deliver data to Amazon S3:

On the Firehose console, choose Create Firehose stream.

For Source, choose Direct PUT.
For Destination, choose Apache Iceberg Tables.

This example chooses Direct PUT as the source, but you can apply the same steps for other Firehose sources, such as Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

For Firehose stream name, enter firehose-iceberg-events.

Collect the database name and table name from the SageMaker Unified Studio project to use in the next step.

In the Destination settings section, enable Inline parsing for routing information and provide the database name and table name from the previous step.

Make sure you enclose the database and table names in double quotes if you want to deliver data to a single database and table. Amazon Data Firehose can also route records to different tables based on the content of the record. For more information, refer to Route incoming records to different Iceberg tables.

Under Buffer hints, reduce the buffer size to 1 MiB and the buffer interval to 60 seconds. You can fine-tune these settings based on your use case latency needs.

In the Backup settings section, enter the S3 bucket created by the CloudFormation template (s3://firehose-demo-iceberg-<account_id>-<region>) and the error output prefix (error/events-1/).

In the Advanced settings section, enable Amazon CloudWatch error logging to troubleshoot any failures, and in for Existing IAM roles, choose the role that starts with Firehose-Iceberg-Stack-FirehoseIamRole-*, created by the CloudFormation template.
Choose Create Firehose stream.

Generate streaming data

Use the Amazon Kinesis Data Generator to publish data records into your Firehose stream:

On the AWS CloudFormation console, choose Stacks in the navigation pane and open your stack.
Select the nested stack for the generator, and go to the Outputs tab.
Choose the Amazon Kinesis Data Generator URL.

Enter the credentials that you defined when deploying the CloudFormation stack.

Choose the AWS Region where you deployed the CloudFormation stack and choose your Firehose stream.
For the template, replace the default values with the following code:

{
"type": {
"device": "{{random.arrayElement(["mobile", "desktop", "tablet"])}}",
"event": "{{random.arrayElement(["firehose_events_1", "firehose_events_2"])}}",
"action": "update"
},
"customer_id": "{{random.number({ "min": 1, "max": 1500})}}",
"event_timestamp": "{{date.now("YYYY-MM-DDTHH:mm:ss.SSS")}}",
"region": "{{random.arrayElement(["pdx", "nyc"])}}"
}

Before sending data, choose Test template to see an example payload.
Choose Send data.

You can monitor the progress of the data stream.

Query the table in SageMaker Unified Studio

Now that Firehose is delivering data to SageMaker Lakehouse, you can perform analytics on that data in SageMaker Unified Studio using different AWS analytics services.

Clean up

It’s generally a good practice to clean up the resources created as part of this post to avoid additional cost. Complete the following steps:

On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select the stack firehose-lakehouse* and on the Actions menu, choose Delete Stack.
In SageMaker Unified Studio, delete the domain created for this post.

Conclusion

Streaming data allows models to make predictions or decisions based on the latest information, which is crucial for time-sensitive applications. By incorporating real-time data, models can make more accurate predictions and decisions. Streaming data can help organizations avoid the costs associated with storing and processing large datasets, because it focuses on the most relevant information. Amazon Data Firehose makes it straightforward to bring real-time streaming data to data lakes in Iceberg format and unifying it with other data assets in SageMaker Lakehouse, making streaming data accessible by various analytics and AI services in SageMaker Unified Studio to deliver real-time insights. Try out the solution for your own use case, and share your feedback and questions in the comments.

About the Authors

Kalyan Janaki is Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.

Maria Ho is a Product Marketing Manager for Streaming and Messaging services at AWS. She works with services including Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Managed Service for Apache Flink, Amazon Data Firehose, Amazon Kinesis Data Streams, Amazon MQ, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Services (Amazon SNS).

AWS Big Data Blog