AWS Storage Blog
How Pendulum achieves 6x faster processing and 40% cost reduction with Amazon S3 Tables
Pendulum is an AI-powered analytics platform that aggregates and analyzes real-time data from social media, news, and podcasts. Designed to help organizations stay ahead, it enables reputation monitoring, early crisis detection, and influencer activity tracking. Machine learning (ML) enables Pendulum to surface key insights from multiple channels, providing a comprehensive view of the digital landscape.
A key component of Pendulum’s solution is ingesting billions of records from more than 20 third-party platforms while preprocessing this data into a 10+ TB data lake stored in Amazon S3. Pendulum uses Apache Iceberg as their preferred table format due to its query optimization capabilities, seamless schema and indexing evolution, time travel support, and use of merge-on-read to maintain data consistency and fast access. Manually managing Iceberg tables in their S3 environment required a team of multiple engineers dedicated to maintenance and optimization. The introduction of Amazon S3 Tables empowered Pendulum to completely transform what was previously an operationally expensive process. S3 Tables delivered the first cloud object store with built-in Iceberg support, and the easiest way to store tabular data at scale. Adopting S3 Tables allowed Pendulum to achieve remarkable improvements: they eliminated more than 4 hours of table maintenance and optimization work per engineer each week, improved query performance by about 70%, reduced costs by about 40%, and achieved system processing speeds up to 6 times faster than their legacy setup.
In this post, we explore how Pendulum’s migration to S3 Tables eliminated hours of weekly engineering maintenance and streamlined their management of multiple mission-critical tables, replacing their previous manual optimization process. We start by examining the bottlenecks associated with Pendulum’s management of their Iceberg tables. Then, we look at Pendulum’s updated architecture and their move to S3 Tables. Finally, we discuss the benefits associated with storing the data in S3 Tables, focusing on improvements in speed and performance, productivity, scalability, and cost.
Legacy system
Before using S3 Tables, Pendulum manually managed their Iceberg tables in general purpose S3 buckets. This manual work involved custom AWS Glue automation scripts for compaction, a self-managed Iceberg implementation, manual toggling of Glue Optimization (On/Off) for specific tables, and maintaining compaction and snapshot cleanup routines. The automation scripts consumed substantial resources, taking about an hour to complete and consuming significant Glue data processing units (DPUs).
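To make the maintenance burden concrete, the sketch below renders the standard Iceberg Spark maintenance procedures that a self-managed, scheduled Glue job typically submits per table. This is illustrative, not Pendulum's actual script; the catalog and table names are hypothetical, and the retention settings are assumptions.

```python
# Illustrative sketch of the per-table Iceberg maintenance a self-managed
# Glue job must run on a schedule. Catalog/table names are hypothetical.

def maintenance_statements(catalog: str, table: str) -> list[str]:
    """Render the standard Iceberg Spark maintenance procedures for one table."""
    return [
        # Compact the many small files produced by frequent writes.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Drop old snapshots to bound metadata growth (retention is an assumption).
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', retain_last => 5)",
        # Delete data files no longer referenced by any snapshot.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]

for stmt in maintenance_statements("glue_catalog", "analytics.posts"):
    print(stmt)
```

With S3 Tables, this entire routine (and the script toggling around it) is handled by the service.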
S3 Tables implementation
The following figure shows the S3 Tables implementation for Pendulum.
Ingestion layer
S3 Tables provided Pendulum with the ability to transition to a fully automated and scalable data pipeline. The content ingestion platform is built entirely on AWS serverless infrastructure, designed for large-scale, resilient, and efficient web data collection. AWS Step Functions serves as the scheduled workflow coordinator calling AWS Lambda functions through Amazon EventBridge events. For on-demand web crawling, tasks are submitted directly to Amazon Simple Queue Service (Amazon SQS) queues, which immediately trigger the appropriate Lambda functions, enabling real-time ingestion. The system ingests public data from 26 different social and media platforms, processing approximately 90 million posts per day across 4 million channels. Collected content is stored in S3, partitioned by ingestion date, platform, and content creator. The serverless approach provides flexibility and scalability for handling large volumes of incoming data. The processing time of Pendulum’s ETL jobs has also significantly decreased due to the automated management of Iceberg maintenance tasks offered by S3 Tables.
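The post notes that collected content is partitioned by ingestion date, platform, and content creator. A minimal sketch of such a key builder follows; the prefix, partition ordering, and field names are assumptions for illustration, not Pendulum's actual layout.

```python
from datetime import date

def content_key(platform: str, channel_id: str, ingested: date, post_id: str) -> str:
    """Build a Hive-style partitioned S3 key for one collected post.

    Partition columns mirror the ones named in the post: ingestion date,
    platform, and content creator (channel). Prefix and ordering are assumed.
    """
    return (
        f"raw/ingestion_date={ingested.isoformat()}"
        f"/platform={platform}"
        f"/channel={channel_id}"
        f"/{post_id}.json"
    )

print(content_key("podcast", "chan-42", date(2025, 1, 15), "post-001"))
# raw/ingestion_date=2025-01-15/platform=podcast/channel=chan-42/post-001.json
```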
Preprocessing layer
Once ingested, the data flows into AWS Glue Streaming ETL jobs, where it is cleaned, transformed, and categorized by type: Channels or Posts. A Channel represents a distinct entity that serves as the source of content on digital platforms, while a Post is an individual piece of content published by a channel. Once classified, the data is written to the associated table bucket.
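The Channel/Post split can be sketched as a simple routing step. The field names below are hypothetical, and the real Glue Streaming job would also clean and transform each record before routing it.

```python
def route(record: dict) -> str:
    """Classify a cleaned record as a Channel or a Post.

    Heuristic sketch: a record carrying published content (a text body plus
    an authoring channel) is a Post; a record describing only a content
    source is a Channel. Field names are assumptions for illustration.
    """
    if "text" in record and "channel_id" in record:
        return "posts"      # individual piece of content -> post table
    if "channel_id" in record:
        return "channels"   # source entity -> channel table
    raise ValueError(f"unclassifiable record: {sorted(record)}")

assert route({"channel_id": "c1", "text": "hello"}) == "posts"
assert route({"channel_id": "c1", "display_name": "News Desk"}) == "channels"
```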
Data store layer
After preprocessing, the refined data is stored in Pendulum’s S3 Tables environment. This consists of three tables:
- channel_latest_s3: A cumulative view that tracks all social channels ever seen in the system. This table is updated by a daily Glue ETL job.
- post_latest_s3: A cumulative view that tracks all social media posts ever seen in the system. This table is updated by a daily Glue ETL job.
- enriched_snippets: A near real-time streaming Iceberg table containing split text generated from posts and enriched by NLP models through Amazon SageMaker AI real-time inference endpoints.
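A daily upsert into a cumulative Iceberg table such as post_latest_s3 is typically expressed as a MERGE, the operation that Iceberg's merge-on-read mode (mentioned earlier) is designed to make cheap. The sketch below renders such a statement; the table names, join key, and column handling are illustrative assumptions, not Pendulum's actual job.

```python
def daily_upsert_sql(target: str, staging: str) -> str:
    """Render an Iceberg MERGE that folds one day's posts into the cumulative view.

    Table names and the join key (post_id) are hypothetical. With
    merge-on-read, matched updates are recorded as deltas instead of
    rewriting whole data files, which keeps the daily job fast.
    """
    return (
        f"MERGE INTO {target} t "
        f"USING {staging} s "
        f"ON t.post_id = s.post_id "
        f"WHEN MATCHED THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT *"
    )

print(daily_upsert_sql("post_latest_s3", "posts_today"))
```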
Analytics layer
The cumulative views are maintained for downstream analytics. These cumulative datasets aggregate social media posts over time, offering both historical and real-time insights within a defined time window. Downstream analytics use the enriched dataset for tasks such as queries in Amazon Athena for real-time decision-making, embedding jobs in Amazon EMR Serverless to power Pendulum’s AI platform, indexing enriched snippets in Amazon OpenSearch Service, and visualizing insights in Pendulum Analytic Platform and Amazon QuickSight dashboards. This structured approach keeps querying efficient while optimizing storage and performance. Integrating S3 Tables with Glue ETL allows the pipeline to continuously update datasets, keeping analytics workloads optimized, up to date, and queryable.
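An Athena query over the enriched table within a trailing time window might look like the following sketch. The table name matches enriched_snippets above, but the column names and window are assumptions chosen for illustration.

```python
from datetime import date, timedelta

def recent_snippets_sql(platform: str, since: date) -> str:
    """Render an Athena SQL query for enriched snippets in a trailing window.

    Column names (channel_id, snippet_text, sentiment, ingestion_date) are
    hypothetical; only the enriched_snippets table name comes from the post.
    """
    return (
        "SELECT channel_id, snippet_text, sentiment "
        "FROM enriched_snippets "
        f"WHERE platform = '{platform}' "
        f"AND ingestion_date >= DATE '{since.isoformat()}' "
        "ORDER BY ingestion_date DESC"
    )

# Example: snippets from the trailing 7 days ending 2025-01-15.
print(recent_snippets_sql("news", date(2025, 1, 15) - timedelta(days=7)))
```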
System impacts
The introduction of S3 Tables has eliminated more than 4 hours per engineer weekly that were previously dedicated to table maintenance and optimization. Engineering teams are now freed from script maintenance and performance monitoring, allowing them to focus on higher-value data initiatives. Pendulum was able to replace their manual optimization workflow that consisted of the following:
- Custom automation scripts for compaction
- Self-managed Iceberg implementation
- Manual toggling of optimizations for specific tables
- Maintaining compaction and snapshot routines
In terms of performance, Pendulum was able to achieve a 6 times speed improvement for one of their largest cumulative tables (post_latest_s3) after moving to S3 Tables. The daily Glue ETL job execution time dropped from approximately 1 hour to just 10 minutes. This improvement was enabled by the optimization features built into S3 Tables, such as automatic file compaction, metadata and file-level indexing, and adaptive query execution (AQE), all of which reduced the need for manual tuning. Offloading table format management to S3 Tables enabled Pendulum to focus their efforts solely on metadata indexing strategies tailored to their query patterns. These foundational enhancements unlocked significant runtime gains and highlighted the power of a fully managed Iceberg table experience with S3 Tables.
Moving to S3 Tables also delivered cost savings for Pendulum: about a 40% reduction in Glue DPU consumption and other associated costs for their self-managed jobs. This is directly tied to the 6 times processing speed improvement achieved through tailored field-level indexing. Other hidden costs that S3 Tables has eliminated include the following:
- Engineering time previously spent on maintenance tasks
- Potential business impact from delayed insights due to slower processing
- Infrastructure overhead of managing optimization scripts and monitoring
Other improvements include the following:
- Enhanced data reliability through consistent automated optimization rather than manual, potentially error-prone processes
- Better scalability for growing data volumes without proportional increases in management overhead
- Improved query performance for business users across all analytical tools
- Enhanced data accessibility
Conclusion
Before using Amazon S3 Tables, Pendulum spent valuable engineering time managing Iceberg tables. With more data coming in and new engineering initiatives, it became demanding to manage the costs and manual labor overhead related to these tables. S3 Tables is a perfect fit for Pendulum’s workloads and has already solved many of Pendulum’s existing challenges while freeing their engineering teams to focus on impactful business priorities.
S3 Tables made it possible for Pendulum to reduce hours of table management time. They also realized a 6 times speed improvement of their ETL jobs, as well as a 40% reduction in their Glue DPU consumption. These improvements helped Pendulum realize major performance benefits, cost reductions, and scalability in their growing datasets.
If you’re spending unnecessary hours manually managing Iceberg tables, then we recommend trying out S3 Tables as a fully managed solution for storing your tabular data at scale.