AWS Big Data Blog
Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0
The AWS Glue Data Catalog has expanded its Data Catalog views feature, which now supports Apache Spark environments in addition to Amazon Athena and Amazon Redshift. This enhancement, launched in March 2025, makes it possible to create, share, and query multi-engine SQL views across Amazon EMR Serverless, Amazon EMR on Amazon EKS, and AWS Glue 5.0 Spark, as well as Athena and Amazon Redshift Spectrum. Multi-dialect views empower data teams to create a SQL view one time and query it through any supported engine—whether it’s Athena for ad-hoc analytics, Amazon Redshift for data warehousing, or Spark for large-scale data processing. This cross-engine compatibility means data engineers can focus on building data products rather than managing multiple view definitions or complex permission schemes. Using AWS Lake Formation permissions, organizations can share these views within the same AWS account, across different AWS accounts, and with AWS IAM Identity Center users and groups, without granting direct access to the underlying tables. Lake Formation features such as fine-grained access control (FGAC) using Lake Formation tag-based access control (LF-TBAC) can be applied to Data Catalog views, enabling scalable sharing and access control across organizations.
In an earlier blog post, we demonstrated creating Data Catalog views using Athena, adding a SQL dialect for Amazon Redshift, and querying the views using Athena and Amazon Redshift. In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the Athena SQL dialect to the view, sharing it with another account using LF-Tags, and then querying the view in the recipient account using an EMR Serverless Workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.
Benefits of Data Catalog views
The following are key benefits of Data Catalog views for business solutions:
- Targeted data sharing and access control – Data Catalog views, combined with the sharing capabilities of Lake Formation, enable organizations to provide specific data subsets to different teams or departments without duplicating data. For example, a retail company can create views that show sales data to regional managers while restricting access to sensitive customer information. By applying LF-TBAC to these views, companies can efficiently manage data access across large, complex organizational structures, maintaining compliance with data governance policies while promoting data-driven decision-making.
- Multi-service analytics integration – The ability to create a view in one analytics service and query it across Athena, Amazon Redshift, EMR Serverless, and AWS Glue 5.0 Spark breaks down data silos and promotes a unified analytics approach. This feature allows businesses to use the strengths of different services for various analytics needs. For instance, a financial institution could create a view of transaction data and use Athena for ad-hoc queries, Amazon Redshift for complex aggregations, and EMR Serverless for large-scale data processing—all without moving or duplicating the data. This flexibility accelerates insights and improves resource utilization across the analytics stack.
- Centralized auditing and compliance – With views stored in the central Data Catalog, businesses can maintain a comprehensive audit trail of data access across connected accounts using AWS CloudTrail logs. This centralization is crucial for industries with strict regulatory requirements, such as healthcare or finance. Compliance officers can seamlessly monitor and report on data access patterns, detect unusual activities, and demonstrate adherence to data protection regulations like GDPR or HIPAA. This centralized approach simplifies compliance processes and reduces the risk of regulatory violations.
These capabilities of Data Catalog views provide powerful solutions for businesses to enhance data governance, improve analytics efficiency, and maintain robust compliance measures across their data ecosystem.
Solution overview
An example company has multiple datasets containing their customers’ purchase details mixed with personally identifiable information (PII). They categorize their datasets based on the sensitivity of the information. The data steward wants to share a subset of their preferred customers’ data for further analysis downstream by their data engineering team.
To demonstrate this use case, we use the sample Apache Iceberg tables `customer` and `customer_address`. We create a Data Catalog view from these two tables to filter by preferred customers. We then use LF-Tags to share restricted columns of this view with the downstream engineering team. The solution is represented in the following diagram.
Prerequisites
To implement this solution, you need two AWS accounts with an AWS Identity and Access Management (IAM) admin role. We use this role to run the provided AWS CloudFormation templates, and the same role is added as the Lake Formation administrator.
Set up infrastructure in the producer account
We provide a CloudFormation template that deploys the following resources and completes the data lake setup:
- Two Amazon Simple Storage Service (Amazon S3) buckets: one for scripts, logs, and query results, and one for the data lake storage.
- Lake Formation administrator and catalog settings. The IAM admin role that you provide is registered as a Lake Formation administrator. The cross-account sharing version is set to 4. Default permissions for newly created databases and tables are set to use Lake Formation permissions only.
- An IAM role with read, write, and delete permissions on the data lake bucket objects. The data lake bucket is registered with Lake Formation using this IAM role.
- An AWS Glue database for the data lake.
- Lake Formation tags. These tags are attached to the database.
- CSV and Iceberg format tables in the AWS Glue database. The CSV tables point to `s3://redshift-downloads/TPC-DS/2.13/10GB/`, and the Iceberg tables are stored in the user account’s data lake bucket.
- An Athena workgroup.
- An IAM role and an AWS Lambda function to run Athena queries. Athena queries are run in the Athena workgroup to insert data from CSV tables to Iceberg tables. Relevant Lake Formation permissions are granted to the Lambda role.
- An EMR Studio and related virtual private cloud (VPC), subnet, routing table, security groups, and EMR Studio service IAM role.
- An IAM role with policies for the EMR Studio runtime. Relevant Lake Formation permissions are granted to this role on the Iceberg tables. This role will be used as the definer role to create the Data Catalog view. A definer role is an IAM role that has the necessary permissions to access the referenced tables and runs the SQL statement that defines the view.
Complete the following steps in your producer AWS account:
- Sign in to the AWS Management Console as an IAM administrator role.
- Launch the CloudFormation stack.
Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has launched, proceed with the following instructions.
- If you’re using the producer account in Lake Formation for the first time, on the Lake Formation console, create a database named `default` and grant Describe permission on the `default` database to the runtime role `GlueViewBlog-EMRStudio-RuntimeRole`.
Create an EMR Serverless application
Complete the following steps to create an EMR Serverless application in your EMR Studio:
- On the Amazon EMR console, under EMR Studio in the navigation pane, choose Studios.
- Choose `GlueViewBlog-emrstudio` and choose the URL link of the Studio to open it.
- On the EMR Studio dashboard, choose Create application.
You will be directed to the Create application page in EMR Studio. Let’s create a Lake Formation-enabled EMR Serverless application.
- Under Application settings, provide the following information:
  - For Name, enter a name (for example, `emr-glueview-application`).
  - For Type, choose Spark.
  - For Release version, choose emr-7.8.0.
  - For Architecture, choose x86_64.
- Under Application setup options, select Use custom settings.
- Under Interactive endpoint, select Enable endpoint for EMR studio.
- Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
- Under Network connections, choose `emrs-vpc` for VPC, enter any two private subnets, and enter `emr-serverless-sg` for Security groups.
- Choose Create and start the application.
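If you prefer to script this setup, the console choices above map to a single EMR Serverless API call. The following is a boto3 sketch; the subnet and security group IDs are placeholders, and the `spark-defaults` property for Lake Formation fine-grained access control is an assumption based on the EMR Serverless documentation, so verify the field names against the current API reference.

```python
# Parameters mirroring the console choices for the application.
# Subnet and security group IDs are placeholders; the Lake Formation
# property under spark-defaults is an assumption from the EMR Serverless docs.
APP_PARAMS = {
    "name": "emr-glueview-application",
    "releaseLabel": "emr-7.8.0",
    "type": "Spark",
    "architecture": "X86_64",
    # Enables the interactive endpoint for EMR Studio
    "interactiveConfiguration": {"studioEnabled": True},
    "networkConfiguration": {
        "subnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],  # two private subnets
        "securityGroupIds": ["sg-cccc3333"],                  # emr-serverless-sg
    },
    "runtimeConfiguration": [
        {
            "classification": "spark-defaults",
            "properties": {
                "spark.emr-serverless.lakeformation.enabled": "true",
            },
        }
    ],
}

def create_emr_serverless_app(region="us-east-1"):
    """Create and return the EMR Serverless application."""
    import boto3
    client = boto3.client("emr-serverless", region_name=region)
    return client.create_application(**APP_PARAMS)
```

Calling `create_emr_serverless_app()` with credentials for the producer account would create the same application the console steps produce; the application still needs to be started before use.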
Create an EMR Workspace
Complete the following steps to create an EMR Workspace:
- On the EMR Studio console, choose Workspaces in the navigation pane and choose Create Workspace.
- Enter a Workspace name (for example, `emrs-glueviewblog-workspace`).
- Leave all other settings as default and choose Create Workspace.
- Choose Launch Workspace. Your browser might request to allow pop-up permissions for the first time launching the Workspace.
- After the Workspace is launched, in the navigation pane, choose Compute.
- For Compute type, select EMR Serverless application and enter `emr-glueview-application` for the application and `GlueViewBlog-EMRStudio-RuntimeRole` for Interactive runtime role.
- Make sure the kernel attached to the Workspace is PySpark.
Create a Data Catalog view and verify
Complete the following steps:
- Download the notebook glueviewblog_producer.ipynb. The code creates a Data Catalog view `customer_nonpii_view` from the two Iceberg tables, `customer_iceberg` and `customer_address_iceberg`, in the database `glueviewblog_<account-id>_db`.
- On your EMR Workspace `emrs-glueviewblog-workspace`, go to the File browser section and choose Upload files.
- Upload `glueviewblog_producer.ipynb`.
- Update the data lake bucket name, AWS account ID, and AWS Region to match your resources.
- Update the `database_name`, `table1_name`, and `table2_name` to match your resources.
- Save the notebook.
- Choose the double arrow icon to restart the kernel and rerun the notebook.
The Data Catalog view `customer_nonpii_view` is created and verified.
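For reference, the DDL that the notebook runs has roughly the following shape. This is a minimal sketch assuming the multi-dialect view DDL documented for the Data Catalog, with placeholder database names and a column list drawn from the TPC-DS schema of the sample tables; the notebook contains the exact statement.

```python
DATABASE = "glueviewblog_<account-id>_db"  # replace with your database name

# DDL sketch: PROTECTED marks the view as Lake Formation governed, and
# SECURITY DEFINER makes queries run with the definer role's table access.
# Column names assume the TPC-DS schema of the sample tables.
CREATE_VIEW_SQL = f"""
CREATE PROTECTED MULTI DIALECT VIEW {DATABASE}.customer_nonpii_view
SECURITY DEFINER
AS
SELECT c.c_customer_id,
       c.c_birth_country,
       ca.ca_city,
       ca.ca_state
FROM {DATABASE}.customer_iceberg c
JOIN {DATABASE}.customer_address_iceberg ca
  ON c.c_current_addr_sk = ca.ca_address_sk
WHERE c.c_preferred_cust_flag = 'Y'
"""

def create_view(spark):
    """Run the DDL in a Lake Formation enabled Spark session."""
    spark.sql(CREATE_VIEW_SQL)
    # Confirm the view is registered in the Data Catalog
    return spark.sql(f"SHOW VIEWS IN {DATABASE}")
```

Because the view is created with `SECURITY DEFINER`, downstream principals never need access to the underlying Iceberg tables, only to the view itself.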
- In the navigation pane on the Lake Formation console, under Data Catalog, choose Views.
- Choose the new view `customer_nonpii_view`.
- On the SQL definitions tab, verify EMR with Apache Spark shows up for Engine name.
- Choose the LF-Tags tab. The view should show the LF-Tag `sensitivity=pii-confidential` inherited from the database.
- Choose Edit LF-Tags.
- On the Values dropdown menu, choose `confidential` to overwrite the Data Catalog view’s `sensitivity` key value, changing it from `pii-confidential` to `confidential`.
- Choose Save.
With this, we have created a non-PII view, from datasets that contain customer PII, to share with the data engineering team.
Add Athena SQL dialect to the view
With the view `customer_nonpii_view` having been created by the EMR runtime role `GlueViewBlog-EMRStudio-RuntimeRole`, the `Admin` role, as database creator and Lake Formation administrator, has only Describe permissions on it. In this step, the `Admin` grants itself Alter permissions on the view in order to add the Athena SQL dialect.
- On the Lake Formation console, in the navigation pane, choose Data permissions.
- Choose Grant and provide the following information:
  - For Principals, enter `Admin`.
  - For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  - For Key, choose `sensitivity`.
  - For Values, choose `confidential` and `pii-confidential`.
  - Under Database permissions, select Super for Database permissions and Grantable permissions.
  - Under Table permissions, select Super for Table permissions and Grantable permissions.
  - Choose Grant.
- Verify the LF-Tag-based permissions granted to `Admin`.
- Open the Athena query editor, choose the workgroup `GlueViewBlogWorkgroup`, and choose the AWS Glue database `glueviewblog_<accountID>_db`.
- Run the following query. Replace <accountID> with your account ID.
- Verify the Athena dialect by running a preview on the view.
- On the Lake Formation console, verify the SQL dialects on the view `customer_nonpii_view`.
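The query referenced in the steps above adds a second SQL dialect to the existing view. The sketch below shows its likely shape, using the `ALTER VIEW ... ADD DIALECT` statement that Athena provides for Data Catalog views, wrapped in a small boto3 helper; the column list is a placeholder and must be semantically identical to the Spark definition of the view.

```python
# Placeholder SELECT list: it must match the Spark dialect's semantics.
# Replace <accountID> with your account ID before running.
ADD_DIALECT_SQL = """
ALTER VIEW glueviewblog_<accountID>_db.customer_nonpii_view ADD DIALECT AS
SELECT c.c_customer_id, c.c_birth_country, ca.ca_city, ca.ca_state
FROM glueviewblog_<accountID>_db.customer_iceberg c
JOIN glueviewblog_<accountID>_db.customer_address_iceberg ca
  ON c.c_current_addr_sk = ca.ca_address_sk
WHERE c.c_preferred_cust_flag = 'Y'
"""

def add_athena_dialect(workgroup="GlueViewBlogWorkgroup", region="us-east-1"):
    """Submit the ALTER VIEW statement to Athena in the blog's workgroup."""
    import boto3
    athena = boto3.client("athena", region_name=region)
    return athena.start_query_execution(
        QueryString=ADD_DIALECT_SQL,
        WorkGroup=workgroup,
    )
```

You can run the same statement directly in the Athena query editor; the helper is only a convenience for scripting the step.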
Share the view to the consumer account
Complete the following steps to share the Data Catalog view to the consumer account:
- On the Lake Formation console, in the navigation pane, choose Data permissions.
- Choose Grant and provide the following information:
  - For Principals, select External accounts and enter the consumer account ID.
  - For LF-Tags or catalog resources, select Resources matched by LF-Tags.
  - For Key, choose `sensitivity`.
  - For Values, choose `confidential`.
  - Under Database permissions, select Describe for Database permissions and Grantable permissions.
  - Under Table permissions, select Describe and Select for Table permissions and Grantable permissions.
  - Choose Grant.
- Verify granted permissions on the Data permissions page.
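The cross-account grant above can also be expressed with the Lake Formation API. The following boto3 sketch mirrors the same LF-Tag policy grant; the expression shape follows the `grant_permissions` API, but treat the exact permission sets as an assumption and align them with what you selected in the console.

```python
# LF-Tag expression and permission sets mirroring the console grant:
# Describe on databases, Describe/Select on tables (views) tagged
# sensitivity=confidential, all grantable.
TAG_EXPRESSION = [{"TagKey": "sensitivity", "TagValues": ["confidential"]}]
GRANTS = [
    ("DATABASE", ["DESCRIBE"]),
    ("TABLE", ["DESCRIBE", "SELECT"]),
]

def share_view_with_consumer(consumer_account_id, region="us-east-1"):
    """Grant LF-Tag-based permissions to the consumer account."""
    import boto3
    lf = boto3.client("lakeformation", region_name=region)
    return [
        lf.grant_permissions(
            Principal={"DataLakePrincipalIdentifier": consumer_account_id},
            Resource={"LFTagPolicy": {"ResourceType": resource_type,
                                      "Expression": TAG_EXPRESSION}},
            Permissions=permissions,
            PermissionsWithGrantOption=permissions,
        )
        for resource_type, permissions in GRANTS
    ]
```

Granting by LF-Tag policy rather than by named resource means any future view tagged `sensitivity=confidential` is shared automatically under the same grant.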
With this, the producer account data steward has created a Data Catalog view of a subset of data from two tables in their Data Catalog, using the EMR runtime role as the definer role. They have shared it to their analytics account using LF-Tags to run further processing of the data downstream.
Set up infrastructure in the consumer account
We provide a CloudFormation template to deploy the following resources and set up the data lake as follows:
- An S3 bucket for Amazon EMR and AWS Glue logs
- Lake Formation administrator and catalog settings similar to the producer account setup
- An AWS Glue database for the data lake
- An EMR Studio and related VPC, subnet, routing table, security groups, and EMR Studio service IAM role
- An IAM role with policies for the EMR Studio runtime
Complete the following steps in your consumer AWS account:
- Sign in to the console as an IAM administrator role.
- Launch the CloudFormation stack.
Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has launched, proceed with the following instructions.
- If you’re using the consumer account in Lake Formation for the first time, on the Lake Formation console, create a database named `default` and grant Describe permission on the `default` database to the runtime role `GlueViewBlog-EMRStudio-Consumer-RuntimeRole`.
Accept AWS RAM shares in the consumer account
You can now log in to the consumer AWS account and accept the AWS Resource Access Manager (AWS RAM) invitations:
- Open the AWS RAM console with the IAM role that has AWS RAM access.
- In the navigation pane, choose Resource shares under Shared with me.
You should see two pending resource shares from the producer account.
- Accept both invitations.
Create a resource link for the shared view
To access the view that was shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database, table, or view. After you create a resource link to a view, you can use the resource link name wherever you would use the view name. Furthermore, you can grant permission on the resource link to the job runtime role `GlueViewBlog-EMRStudio-Consumer-RuntimeRole` to access the view through EMR Serverless Spark.
To create a resource link, complete the following steps:
- Open the Lake Formation console as the Lake Formation data lake administrator in the consumer account.
- In the navigation pane, choose Tables.
- Choose Create, and then choose Resource link.
- For Resource link name, enter a name (for example, `customer_nonpii_view_rl`).
- For Database, choose the `glueviewblog_customer_<accountID>_db` database.
- For Shared table region, choose the Region of the shared table.
- For Shared table, choose `customer_nonpii_view`.
- Choose Create.
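Creating the resource link programmatically is a single AWS Glue `create_table` call whose `TargetTable` points at the shared view. The following is a sketch with placeholder database names; the `Region` field of `TargetTable` identifies the Region of the shared table, as in the console step.

```python
# Names created earlier in this post; database names are placeholders,
# so substitute the <accountID> parts with real values.
RESOURCE_LINK = "customer_nonpii_view_rl"
SHARED_VIEW = "customer_nonpii_view"

def create_view_resource_link(producer_account_id,
                              consumer_db="glueviewblog_customer_db",
                              producer_db="glueviewblog_db",
                              region="us-east-1"):
    """Create a Data Catalog resource link pointing at the shared view."""
    import boto3
    glue = boto3.client("glue", region_name=region)
    return glue.create_table(
        DatabaseName=consumer_db,
        TableInput={
            "Name": RESOURCE_LINK,
            # TargetTable turns this table entry into a resource link
            "TargetTable": {
                "CatalogId": producer_account_id,  # the sharing account
                "DatabaseName": producer_db,
                "Name": SHARED_VIEW,
                "Region": region,  # Region of the shared table
            },
        },
    )
```

The resource link lives entirely in the consumer catalog, so deleting it later does not affect the producer's view.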
Grant permissions on the database to the EMR job runtime role
Complete the following steps to grant permissions on the database `glueviewblog_customer_<accountID>_db` to the EMR job runtime role:
- On the Lake Formation console, in the navigation pane, choose Databases.
- Select the database `glueviewblog_customer_<accountID>_db` and on the Actions menu, choose Grant.
- In the Principals section, select IAM users and roles, and choose `GlueViewBlog-EMRStudio-Consumer-RuntimeRole`.
- In the Database permissions section, select Describe.
- Choose Grant.
Grant permissions on the resource link to the EMR job runtime role
Complete the following steps to grant permissions on the resource link `customer_nonpii_view_rl` to the EMR job runtime role:
- On the Lake Formation console, in the navigation pane, choose Tables.
- Select the resource link `customer_nonpii_view_rl` and on the Actions menu, choose Grant.
- In the Principals section, select IAM users and roles, and choose `GlueViewBlog-EMRStudio-Consumer-RuntimeRole`.
- In the Resource link permissions section, select Describe for Resource link permissions.
- Choose Grant.
This allows the EMR Serverless job runtime role to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principals.
Grant permissions on the target for the resource link to the EMR job runtime role
Complete the following steps to grant permissions on the target for the resource link `customer_nonpii_view_rl` to the EMR job runtime role:
- On the Lake Formation console, in the navigation pane, choose Tables.
- Select the resource link `customer_nonpii_view_rl` and on the Actions menu, choose Grant on target.
- In the Principals section, select IAM users and roles, and choose `GlueViewBlog-EMRStudio-Consumer-RuntimeRole`.
- Choose Grant.
Set up an EMR Serverless application and Workspace in the consumer account
Repeat the steps to create an EMR Serverless application in the consumer account.
Repeat the steps to create a Workspace in the consumer account. For Compute type, select EMR Serverless application and enter `emr-glueview-application` for the application and `GlueViewBlog-EMRStudio-Consumer-RuntimeRole` as the runtime role.
Verify access using interactive notebooks from EMR Studio
Complete the following steps to verify access in EMR Studio:
- Download the notebook glueviewblog_emr_consumer.ipynb. The code runs a select statement on the view shared from the producer.
- In your EMR Workspace `emrs-glueviewblog-workspace`, navigate to the File browser section and choose Upload files.
- Upload `glueviewblog_emr_consumer.ipynb`.
- Update the database to match your resources.
- Save the notebook.
- Choose the double arrow icon to restart the PySpark kernel and rerun the notebook.
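The consumer notebook’s verification reduces to querying the view through its resource link. The following is a minimal sketch assuming the database and resource link names created earlier in this post; adjust `glueviewblog_customer_db` to include your account ID.

```python
# Placeholder database name: substitute the one created by your stack,
# for example glueviewblog_customer_<accountID>_db.
CONSUMER_QUERY = (
    "SELECT * FROM glueviewblog_customer_db.customer_nonpii_view_rl LIMIT 10"
)

def verify_shared_view(spark):
    """Query the shared view through its resource link.

    Lake Formation resolves the link to the producer's view and runs it
    with the definer role's access, so only the non-PII columns the view
    exposes come back to the consumer.
    """
    df = spark.sql(CONSUMER_QUERY)
    df.show()
    return df.columns
```

If the runtime role lacks one of the grants from the earlier steps, this query fails with an access-denied error rather than returning partial data.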
Verify access using interactive notebooks from AWS Glue Studio
Complete the following steps to verify access using AWS Glue Studio:
- Download the notebook glueviewblog_glue_consumer.ipynb.
- Open the AWS Glue Studio console.
- Choose Notebook and then choose Upload notebook.
- Upload the notebook `glueviewblog_glue_consumer.ipynb`.
- For IAM role, choose `GlueViewBlog-EMRStudio-Consumer-RuntimeRole`, and then choose Create notebook.
- Update the data lake bucket name, AWS account ID, and Region to match your resources.
- Update the database to match your resources.
- Save the notebook.
- Run all the cells to verify fine-grained access.
Verify access using the Athena query editor
Because the view from the producer account was shared with the consumer account, the Lake Formation administrator in the consumer account has access to the view. And because the lake admin role created the resource link pointing to the view, it also has access to the resource link. Go to the Athena query editor and run a simple select query on the resource link.
The analytics team in the consumer account was able to access a subset of the data from a business data producer team, using their analytics tools of choice.
Clean up
To avoid incurring ongoing costs, clean up your resources:
- In your consumer account, delete the AWS Glue notebook, stop and delete the EMR application, and then delete the EMR Workspace.
- In your consumer account, delete the CloudFormation stack. This should remove the resources launched by the stack.
- In your producer account, log in to the Lake Formation console and revoke the LF-Tags based permissions you had granted to the consumer account.
- In your producer account, stop and delete the EMR application and then delete the EMR Workspace.
- In your producer account, delete the CloudFormation stack. This should delete the resources launched by the stack.
- Review and clean up any additional AWS Glue and Lake Formation resources and permissions.
Conclusion
In this post, we demonstrated a powerful, enterprise-grade solution for cross-account data sharing and analysis using AWS services. We walked you through the following key steps:
- Create a Data Catalog view using Spark in EMR Serverless within one AWS account
- Securely share this view with another account using LF-TBAC
- Access the shared view in the recipient account using Spark in both EMR Serverless and AWS Glue ETL
- Implement this solution with Iceberg tables (it’s also compatible with other open table formats like Apache Hudi and Delta Lake)
The multi-dialect Data Catalog views approach provided in this post is particularly valuable for enterprises looking to modernize their data infrastructure while optimizing costs, improve cross-functional collaboration while enhancing data governance, and accelerate business insights while maintaining control over sensitive information.
Refer to the following information about Data Catalog views with individual analytics services, and try out the solution. Let us know your feedback and questions in the comments section.
- EMR Serverless: Working with Glue Data Catalog views
- Athena: Use Data Catalog views in Athena
- Amazon Redshift: AWS Glue Data Catalog views
- Lake Formation: Building AWS Glue Data Catalog views
About the Authors
Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. As part of the SageMaker Lakehouse team, she works with AWS customers and partners to architect lake house solutions, enhance product features, and establish best practices for data governance.
Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.
Dhananjay Badaya is a Software Developer at AWS, specializing in distributed data processing engines including Apache Spark and Apache Hadoop. As a member of the Amazon EMR team, he focuses on designing and implementing enterprise governance features for EMR Spark.