AWS Big Data Blog
Build a secure serverless streaming pipeline with Amazon MSK Serverless, Amazon EMR Serverless and IAM
The exponential growth and vast volume of streaming data have made it a vital resource for organizations worldwide. To unlock its full potential, real-time analytics are essential for extracting actionable insights. Derived from a wide range of sources, including social media, Internet of Things (IoT) sensors, and user interactions, streaming data empowers businesses to respond promptly to emerging trends and events, make informed decisions, and stay ahead of the competition.
Streaming applications commonly use Apache Kafka for data ingestion and Apache Spark Structured Streaming for processing. However, integrating and securing these components poses considerable challenges. The complexity of managing certificates, keystores, and TLS configurations to connect Spark Structured Streaming to Kafka brokers demands specialized expertise. A managed, serverless framework greatly simplifies this process, alleviating the need for manual configuration and streamlining the integration of these critical components.
To simplify the management and security of traditional streaming architectures, you can use Amazon Managed Streaming for Apache Kafka (Amazon MSK). This fully managed service simplifies data ingestion and processing. Amazon MSK Serverless alleviates the need for cluster management and scaling, and further enhances security by integrating AWS Identity and Access Management (IAM) for authentication and authorization. This consolidated approach replaces the complex certificate and key management required by TLS client authentication through AWS Certificate Manager, streamlining operations and bolstering data protection. For instance, when a client attempts to write data to the cluster, MSK Serverless verifies both the client's identity and its permissions using IAM.
For efficient data processing, you can use Amazon EMR Serverless with a Spark application built on the Spark Structured Streaming framework, enabling near real-time data processing. This setup seamlessly handles large volumes of data from MSK Serverless, using IAM authentication for secure and swift data processing.
This post demonstrates a comprehensive, end-to-end solution for processing data from MSK Serverless using an EMR Serverless Spark Streaming job, secured with IAM authentication. It also shows how to query the processed data using Amazon Athena, providing a seamless and integrated workflow for data processing and analysis. The solution enables near real-time querying of the latest data processed from MSK Serverless and EMR Serverless, delivering instant insights and analytics.
Solution overview
The following diagram illustrates the architecture that you implement through this post.
The workflow consists of the following steps:
- The architecture begins with an MSK Serverless cluster set up with IAM authentication. An Amazon Elastic Compute Cloud (Amazon EC2) instance runs a Python script, `producer.py`, that acts as a data producer, sending sample data to a Kafka topic within the cluster.
- The Spark Streaming job retrieves data from the Kafka topic, stores it in Amazon Simple Storage Service (Amazon S3), and creates a corresponding table in the AWS Glue Data Catalog. Because it continuously consumes data from the Kafka topic, the job stays up to date with the latest streaming data. With checkpointing enabled, the job tracks processed records, allowing it to resume from where it left off in case of a failure, providing seamless data processing.
- To analyze this data, users can use Athena, a serverless query service. Athena enables interactive SQL-based exploration of data directly in Amazon S3 without the need for complex infrastructure management.
Prerequisites
Before getting started, make sure you have the following:
- An active AWS account with billing enabled
- An IAM user with administrator access (`AdministratorAccess` policy) or specific permissions to create and manage resources such as a virtual private cloud (VPC), subnet, security group, IAM roles, NAT gateway, internet gateway, EC2 client, MSK Serverless, EMR Serverless, Amazon EMR Studio, and S3 buckets
- Sufficient VPC capacity in your chosen AWS Region

Although using an IAM user with administrator access will work, it's recommended to follow the principle of least privilege in production environments by creating custom IAM policies with only the necessary permissions. The IAM user we create has the `AdministratorAccess` policy attached to it. However, you might not need such elevated access.
For this post, we create the solution resources in the `us-east-2` Region using AWS CloudFormation templates. In the following sections, we show you how to configure your resources and implement the solution.
Create MSK Serverless and EMR Serverless resources
The `vpc-msk-emr-serverless-studio.yaml` stack creates a VPC, subnet, security group, IAM roles, NAT gateway, internet gateway, EC2 client, MSK Serverless, EMR Serverless, EMR Studio, and S3 buckets. To create the solution resources, complete the following steps:
- Launch the stack `vpc-msk-emr-serverless-studio` using the CloudFormation template (see the CLI sketch after the following table).
- Provide the parameter values as listed in the following table.
| Parameters | Description | Sample value |
|---|---|---|
| `EnvironmentName` | An environment name that is prefixed to resource names. | `msk-emr-serverless-pipeline` |
| `InstanceType` | Amazon MSK client EC2 instance type. | `t2.micro` |
| `LatestAmiId` | Latest AMI ID of Amazon Linux 2023 for the EC2 instance. You can use the default value. | `/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64` |
| `VpcCIDR` | IP range (CIDR notation) for this VPC. | `10.192.0.0/16` |
| `PublicSubnet1CIDR` | IP range (CIDR notation) for the public subnet in the first Availability Zone. | `10.192.10.0/24` |
| `PublicSubnet2CIDR` | IP range (CIDR notation) for the public subnet in the second Availability Zone. | `10.192.11.0/24` |
| `PrivateSubnet1CIDR` | IP range (CIDR notation) for the private subnet in the first Availability Zone. | `10.192.20.0/24` |
| `PrivateSubnet2CIDR` | IP range (CIDR notation) for the private subnet in the second Availability Zone. | `10.192.21.0/24` |
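If you prefer the AWS CLI, a command along these lines creates the stack. This is a sketch, assuming the template file is saved locally; parameters you don't specify fall back to their defaults:

```bash
# Create the stack from the local template; unspecified parameters use defaults
aws cloudformation create-stack \
  --stack-name vpc-msk-emr-serverless-studio \
  --template-body file://vpc-msk-emr-serverless-studio.yaml \
  --parameters ParameterKey=EnvironmentName,ParameterValue=msk-emr-serverless-pipeline \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-east-2
```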
The stack creation process takes approximately 10 minutes to complete. After the stack is created, you can review the resource details on the stack's Outputs tab.
Next, you set up the data ingestion to the Kafka topic from the Kafka EC2 instance.
Produce records to Kafka topic
Complete the following steps to set up data ingestion:
- On the Amazon EC2 console, go to the EC2 instance that you created using the CloudFormation template.
- Log in to the EC2 instance using Session Manager, a capability of AWS Systems Manager.
- Choose the instance `msk-emr-serverless-blog` and then choose Connect.
- Create a Kafka topic in MSK Serverless from the EC2 instance.
- In the following `export` command, replace `my-endpoint` with the `MSKBootstrapServers` value from the CloudFormation stack output.
- Run the following command on the EC2 instance to create a topic called `sales_data_topic`.
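The following is a sketch of both commands; the `BS` variable name and partition count are illustrative, and the Kafka client path matches the preinstalled location described in the note that follows:

```bash
# Replace my-endpoint with the MSKBootstrapServers value from the stack output
export BS=my-endpoint

# Create the topic using the preinstalled Kafka client and IAM client properties
/home/ec2-user/kafka_2.12-2.8.1/bin/kafka-topics.sh --create \
  --bootstrap-server $BS \
  --command-config /home/ec2-user/kafka_2.12-2.8.1/bin/client.properties \
  --topic sales_data_topic \
  --partitions 3
```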
The Kafka client is already installed in the ec2-user home directory (`/home/ec2-user`) along with the MSK IAM authentication JAR, and a client configuration file with IAM authentication properties has already been created at `/home/ec2-user/kafka_2.12-2.8.1/bin/client.properties`.
The following shows the typical contents of `client.properties`:
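```properties
# Standard Amazon MSK IAM authentication properties
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```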
- Run the following command to produce records to the Kafka topic using the `syntheticSalesDataProducer.py` Python script present on the EC2 instance. Update the Region accordingly.
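A representative invocation follows; the argument names shown here are assumptions, so check the script itself for its actual parameters:

```bash
# Hypothetical arguments; inspect syntheticSalesDataProducer.py for the real ones
python3 syntheticSalesDataProducer.py \
  --bootstrapServers $BS \
  --topic sales_data_topic \
  --region us-east-2
```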
Understanding Amazon MSK IAM authentication with EMR Serverless
Amazon MSK IAM authentication enables secure authentication and authorization for Kafka clusters (MSK Serverless) using IAM roles. When integrating with EMR Serverless Spark Streaming, Amazon MSK IAM authentication allows Spark jobs to access Kafka topics securely, using IAM roles for fine-grained access control. This provides secure data processing and streaming.
IAM policy configuration
To enable EMR Serverless jobs to authenticate with an MSK Serverless cluster using IAM, you need to attach specific Kafka-related IAM permissions to the EMR Serverless job execution role. These permissions allow the job to perform essential operations on the Kafka cluster, topics, and consumer groups. The following IAM policy must be attached to the EMR Serverless job execution role to grant the necessary permissions.
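A sketch of such a policy follows; the account ID is a placeholder, and the wildcard resources should be scoped down to your specific cluster, topic, and group ARNs in production:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:DescribeCluster",
        "kafka-cluster:DescribeTopic",
        "kafka-cluster:ReadData",
        "kafka-cluster:WriteData",
        "kafka-cluster:CreateTopic",
        "kafka-cluster:AlterGroup",
        "kafka-cluster:DescribeGroup"
      ],
      "Resource": [
        "arn:aws:kafka:us-east-2:111122223333:cluster/*/*",
        "arn:aws:kafka:us-east-2:111122223333:topic/*/*/*",
        "arn:aws:kafka:us-east-2:111122223333:group/*/*/*"
      ]
    }
  ]
}
```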
This code refers to the following actions:
- Connect, DescribeCluster – Required to initiate a secure connection and obtain metadata
- DescribeTopic, ReadData, WriteData – Enables data consumption and production
- CreateTopic (optional) – Allows dynamic topic creation
- AlterGroup, DescribeGroup – Needed for consumer group management in streaming jobs
These permissions make sure that the Spark Streaming job can securely authenticate and interact with MSK Serverless resources using its IAM role.
Required dependencies
To enable Amazon MSK IAM authentication in Spark (especially on EMR Serverless), specific JAR dependencies must be included in your Spark Streaming job using `sparkSubmitParameters`:
- `spark-sql-kafka-0-10_2.12` – This is the Kafka connector for Spark Structured Streaming. It provides the DataFrame API to read from and write to Kafka.
- `aws-msk-iam-auth` – This JAR provides the IAM authentication mechanism required to connect to MSK Serverless using the `AWS_MSK_IAM` SASL mechanism.
You can include these dependencies directly by specifying them in the `--packages` argument when submitting the EMR Serverless job. For example (the versions shown are illustrative; match them to your EMR release and Scala version):
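```
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,software.amazon.msk:aws-msk-iam-auth:2.2.0
```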
When the job is submitted, EMR Serverless will automatically download these JARs from Maven Central (or another configured repository) at runtime. You don’t need to bundle them manually unless offline usage or specific versions are required.
Spark Streaming job configuration for Amazon MSK IAM authentication
In your Spark Streaming application, configure the Kafka source with SASL properties to enable IAM-based authentication. The following code shows the relevant configuration.
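This is a minimal PySpark sketch, assuming `bootstrap_servers` holds the `MSKBootstrapServers` stack output value and the topic is `sales_data_topic`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MskIamStreamingJob").getOrCreate()

# bootstrap_servers is a placeholder for the MSKBootstrapServers stack output value
df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", "sales_data_topic")
    # IAM authentication over an encrypted SASL_SSL channel
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "AWS_MSK_IAM")
    .option("kafka.sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;")
    .option("kafka.sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
    .option("startingOffsets", "earliest")
    .load()
)
```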
Key properties include:
- `kafka.security.protocol = SASL_SSL` – Enables encrypted communication over SSL with SASL authentication
- `kafka.sasl.mechanism = AWS_MSK_IAM` – Tells Kafka to use the IAM-based SASL mechanism
- `kafka.sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;` – Specifies the login module provided by AWS for IAM integration
- `kafka.sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler` – Handles the actual signing and authentication using the IAM role
With these settings, Spark uses the IAM credentials attached to the EMR Serverless job execution role to authenticate to MSK Serverless without needing additional credentials, certificates, or secrets.
Data processing using an EMR Serverless streaming job with Amazon MSK IAM authentication
Complete the following steps to run a Spark Streaming job to process the data from MSK Serverless:
- Submit the Spark Streaming job to EMR Serverless using the AWS Command Line Interface (AWS CLI), which is already installed on the EC2 instance.
- Log in to the EC2 instance using Session Manager. Choose the instance `msk-emr-serverless-blog` and then choose Connect.
- Run the following command to submit the streaming job, providing the parameters from the CloudFormation stack output (see the sketch after these steps).
- After you submit the job, log in to EMR Studio using the URL in the `EmrServerlessStudioURL` value from the CloudFormation stack output.
- In the navigation pane, choose Applications under Serverless.
- Choose the application ID in the `EmrServerlessSparkApplicationID` value from the CloudFormation stack output.
- On the Streaming job runs tab, verify that the job has been submitted and wait for it to begin running.
Validate the data in Athena
After the EMR Serverless Spark Streaming job is running and has created the table for the processed data in the Data Catalog, follow these steps to validate the data using Athena:
- On the Athena console, open the query editor.
- Choose the Data Catalog as the data source.
- Choose the database `emrblog` that the streaming job created.
- To validate the data, run the following query.
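For example (the table name `sales_data` is an assumption; substitute the table the streaming job created in `emrblog`):

```sql
-- Inspect the most recently processed records
SELECT *
FROM emrblog.sales_data
LIMIT 10;
```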
Clean up
To clean up your resources, complete the following steps:
- Log in to EMR Studio using the URL from the `EmrServerlessStudioURL` value in the CloudFormation stack output.
- In the navigation pane, choose Applications under Serverless.
- Choose the application ID from the `EmrServerlessSparkApplicationID` value in the CloudFormation stack output.
- On the Streaming job runs tab, select the running job and cancel the job run.
- On the AWS CloudFormation console, delete the CloudFormation stack `vpc-msk-emr-serverless-studio`.
Conclusion
In this post, we showcased a serverless pipeline for streaming data with IAM authentication, empowering you to focus on deriving insights from your analytics. You can customize the EMR Serverless Spark Streaming code to apply transformations and filters, so only valid data is loaded into Amazon S3. This solution combines the power of EMR Serverless Spark Streaming with MSK Serverless, securely integrated through IAM authentication. Now you can streamline your streaming processes without the complexity of managing the Amazon MSK and Amazon EMR Spark Streaming integration.
About the Authors
Shubham Purwar is an AWS Analytics Specialist Solution Architect. He helps organizations unlock the full potential of their data by designing and implementing scalable, secure, and high-performance analytics solutions on the AWS platform. With deep expertise in AWS analytics services, he collaborates with customers to uncover their distinct business requirements and create customized solutions that deliver actionable insights and drive business growth. In his free time, Shubham loves to spend time with his family and travel around the world.
Nitin Kumar is a Cloud Engineer (ETL) at AWS, specialized in AWS Glue. With a decade of experience, he excels in aiding customers with their big data workloads, focusing on data processing and analytics. He is committed to helping customers overcome ETL challenges and develop scalable data processing and analytics pipelines on AWS. In his free time, he likes to watch movies and spend time with his family.
Prashanthi Chinthala is a Cloud Engineer (DIST) at AWS. She helps customers overcome EMR challenges and develop scalable data processing and analytics pipelines on AWS.