Using Amazon S3 Express One Zone as a caching layer for S3 Standard
Data caching is a critical strategy for optimizing application performance in today’s data-intensive environments. By storing frequently accessed information in high-speed storage locations, organizations can dramatically reduce access times, optimize the use of compute resources, and improve overall system responsiveness. Effective caching strategies become particularly essential for workloads that require consistent low latency, such as financial modelling, AI/ML training, genomics research, and media processing. Without proper caching mechanisms, organizations face increased costs, reduced application performance, and potential bottlenecks that can impact critical business operations.
Amazon S3 offers powerful capabilities that align with modern caching needs. Its storage classes, particularly S3 Express One Zone, provide the infrastructure needed to implement sophisticated caching strategies at scale. With S3 Express One Zone delivering consistent single-digit millisecond data access, organizations can create a high-performance cache layer for their most critical data. While S3 Express One Zone provides the performance capabilities needed for effective caching, application owners still face challenges when migrating existing data between different storage classes to optimize their caching architecture. This data movement process, when done manually, can be time-consuming and introduce unnecessary complexity into an organization’s data management workflows.
In this post, we discuss a serverless solution that automates the movement of data between S3 general purpose buckets and S3 directory buckets, enabling users to quickly and cost-effectively take advantage of the low latency and high throughput of S3 Express One Zone. The solution uses AWS services such as AWS Step Functions, AWS Lambda, Amazon DynamoDB, and S3 Batch Operations to manage the end-to-end data movement process, including object caching and expiration. This allows users to accelerate their data-intensive workloads by providing fast data access to latency-sensitive applications.
Solution overview
The solution consists of four components: a Step Functions workflow that manages the end-to-end process and triggers the caching logic; a DynamoDB table that acts as a state store for the caching logic, tracking which objects are to be cached, which are currently cached, and their respective time-to-live (TTL) attributes; a set of Lambda functions that implement the object processing and caching logic; and S3 Batch Operations jobs, which perform the data movement between the buckets.
The following figure shows these components running a data movement operation.
- During the run, the Step Function builds a map state for each prefix to be processed. This map state, labeled as Distributed MAP list, acts as the input to build a manifest of objects to be synchronized between a general purpose bucket and a directory bucket.
- The Step Function then checks DynamoDB for existing objects in the target directory bucket and any associated TTL expiration values. Any object already existing in the directory bucket gets its TTL updated in DynamoDB if the new TTL is longer than the previous one. New objects to be synchronized to the target directory bucket have items created in DynamoDB along with TTL expiration values (a sketch of such an item follows Figure 1).
- The Step Function creates a manifest of missing objects along with an associated list of copy jobs needed and passes it to S3 Batch Operations. Then, S3 Batch Operations performs the object copying from the general purpose bucket to the directory bucket.
- Finally, when objects reach their expiration time, DynamoDB invokes a Lambda function that deletes the expired objects from the directory bucket and removes the expired references to those objects from DynamoDB.
Figure 1: Serverless solution automating data movement between Amazon S3 general purpose buckets and S3 directory buckets
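To make the DynamoDB state store more concrete, the following is a minimal sketch of the kind of item the caching logic might write, using the AWS SDK for Python (Boto3). The table name, key, and attribute names here are illustrative assumptions rather than the solution’s exact schema; the essential piece is a numeric epoch-seconds attribute that DynamoDB TTL uses to expire the item.
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("s3-caching-state")  # assumed table name

ttl_hours = 1
table.put_item(
    Item={
        "object_key": "year-2024/month-05/file-0001.parquet",  # hypothetical cached object key
        "directory_bucket": "mercury--use1-az4--x-s3",         # hypothetical directory bucket name
        "expires_at": int(time.time()) + ttl_hours * 3600,     # epoch seconds used as the DynamoDB TTL attribute
    }
)
Extending an existing object’s cache lifetime is then just a matter of updating expires_at when a new run requests a longer TTL.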
This solution is serverless, so it doesn’t consume compute resources that could otherwise be used for data processing. Sending an event to the Step Functions state machine created during AWS CloudFormation template deployment triggers the workflow.
The solution is designed to be triggered either as part of an on-demand job that needs data to be present in an S3 directory bucket for processing, or as part of a scheduled task to move data to the cache periodically for known access patterns. When the data copy operations have completed, the Step Function reports back the final result. The task isn’t considered complete until the Step Function reports its successful completion.
To experience the low latency performance benefits of directory buckets firsthand and implement the serverless caching solution in your environment, let’s walk through the step-by-step deployment process using AWS CloudFormation.
Solution prerequisites
On a supported operating system, such as Linux, Windows, or macOS, you need Python 3.9 or above, and you’ll also need:
- AWS Command Line Interface (AWS CLI, latest version)
- AWS Cloud Development Kit (AWS CDK 2.147 or above)
1. Setting up the AWS CLI and configuring it with your security credentials
Install and configure the AWS CLI with your security credentials, and specify the AWS Region where you will deploy the stack as your default region. Your security credentials should provide AWS Identity and Access Management (IAM) policy permissions, such as AdministratorAccess, which allow you to deploy infrastructure. You add this policy permission to your user in IAM.
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install
Configure the CLI with your Access Key, Secret Access Key, and the Region to which you are deploying the caching solution.
aws configure
If you choose not to add your Access Key, Secret Access Key, and AWS Region during the configure command, then you should instead add them to your session as environment variables. They are used several times during the deployment process.
2. Installing and bootstrapping the AWS CDK
AWS CDK is installed through the Node package manager (npm), which needs to be installed on your host. Then, npm is used to install the AWS CDK packages.
sudo yum install -y nodejs
sudo npm install -g aws-cdk
AWS CDK is a framework for defining cloud infrastructure in code. AWS CDK uses the security credentials that you configured for AWS CLI to perform some tasks. Before you can deploy any code as infrastructure you must prepare your local system to run deployments.
cdk bootstrap aws://<ACCOUNT-ID>/<REGION>
The ACCOUNT-ID is the 12-digit number representing your AWS account ID, and it makes sure that resources are set up in the correct account. The REGION specifies the Region where you should bootstrap the AWS CDK environment. As part of the bootstrapping process, temporary Amazon S3 general purpose buckets and IAM roles are created for the deployment process, and access controls are applied.
With the prerequisites and bootstrap processes complete, you can now look at your infrastructure as code (IaC). The code in this S3 caching solution is Python code that specifies the configuration of the data movement infrastructure. If you have git installed, then you can clone it from the repository using the git CLI. Otherwise, you can use curl.
Solution deployment
In this section, you deploy the S3 Express One Zone caching solution from an Amazon Linux 2023 Amazon Elastic Compute Cloud (Amazon EC2) instance and see it in action. You can find the code for this reference architecture and its documentation on GitHub. You download and use this code in the steps that follow. The following are the steps for deploying and cleaning up this solution.
- Downloading and preparing the caching solution code
- Creating and activating a Python virtual environment
- Building and deploying the solution
- Deploying the stack to production
- Running the solution
- Configuring the execution input
- Defining prefix patterns
- Starting and monitoring the execution
1. Downloading and preparing the caching solution code
In this example, using the Amazon Linux EC2 instance described earlier, use curl to download the main.zip file containing the code. The -O flag saves the file locally under the same name it is stored with remotely, while the -L flag follows any URL redirects the website may be using.
curl -O -L https://github.com/aws-samples/s3xz-caching-solution/archive/refs/heads/main.zip
unzip main.zip
After unzipping the file, the directory structure in the following figure shows the stack and serverless functions that are turned into a CloudFormation template.
Figure 2: Caching solution directory structure
2. Creating and activating a Python virtual environment
Using Python’s virtual environment (venv) functionality allows you to create a disposable working environment directory named .venv and activate it. This virtual environment isolates the caching solution modules from any other Python modules installed on the system.
python3 -m venv .venv
source .venv/bin/activate
Windows users can find the Windows-specific activation command in the README.md file included with the code.
With your virtual environment activated and running, install the Python requirements for the template build process inside the disposable working environment. You may need to specify the full directory path or change to the s3xz-caching-solution-main directory where you unzipped the files shown in Figure 2.
cd s3xz-caching-solution-main
pip install -r requirements.txt
With the Python requirements installed, you can build the CloudFormation template.
3. Building and deploying the solution
cdk synth
This synthesis runs a processor- and memory-intensive build process for a minute or two as the Python interpreter works with the AWS CDK to build a CloudFormation solution stack. When the stack is built, the final step is to deploy it into production.
4. Deploying the stack to production
cdk deploy
You receive a Y/N prompt asking you to accept the changes that the template applies. When accepted, the caching solution is deployed as shown in Figure 3.
Figure 3: From Python code to deployed infrastructure in production
Running the deactivate command exits the Python virtual environment and returns you to the shell prompt.
5. Running the solution
Users can perform all actions involved in running the solution through the CLI or API. However, for illustration purposes, we’ll switch to the AWS Management Console for the next steps. In the Step Functions service, you should see a new state machine deployed, as shown in Figure 4, waiting for execution.
Figure 4: Solution state machine (active but not executing a caching operation)
6. Configuring the execution input
Choosing the state machine makes the Start execution button available. Choosing that button gives you the option to add optional input. This is where you add event data to define the movement process and set the TTL expiry value of the data. The following is the format of the execution input used in this example.
{
"bucket": "s3bucketname",
"directory_bucket": "s3expressdirectorybucketname",
"prefixes": [
""
],
"ttl": 1,
"force_copy": false
}
This input contains the following information (a minimal API example for starting the execution with this input follows the list):
- bucket: the name of the source general purpose bucket containing the objects to be cached in the target S3 directory bucket.
- directory_bucket: the target S3 directory bucket to which the objects are copied.
- prefixes: a list of prefixes within the source bucket that are to be copied to the target S3 directory bucket.
- ttl: an integer value specifying how many hours to cache the objects in the S3 directory bucket. The TTL begins when the job starts, not when the job completes.
- force_copy: a boolean value indicating whether to ignore the presence of an object in the S3 directory bucket. Setting this to true forces a copy of an object that is already present in the S3 directory bucket. When set to false, only newly added data is copied from the source to the destination.
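As noted earlier, the solution can also be run through the API rather than the console. The following is a minimal sketch using the AWS SDK for Python (Boto3); the state machine ARN, account ID, and bucket names are placeholders you would replace with the values from your deployment.
import json
import boto3

sfn = boto3.client("stepfunctions")

execution_input = {
    "bucket": "s3bucketname",
    "directory_bucket": "s3expressdirectorybucketname",
    "prefixes": [""],
    "ttl": 1,
    "force_copy": False,
}

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:111122223333:stateMachine:ExampleCachingStateMachine",  # placeholder ARN
    input=json.dumps(execution_input),
)
print(response["executionArn"])
The same call can be placed behind an Amazon EventBridge schedule or invoked from a job orchestrator whenever data needs to be staged in the cache before processing.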
7. Defining prefix patterns
You can define the prefixes parameter in various ways:
- Copy the whole bucket to the cache:
[ "" ]
- Using top level prefixes:
[ "year-2023/", "year-2024/" ]
- Using deeper prefixes:
[ "year-2023/month-01/day-04/", "year-2024/month-05/" ]
As shown in the following figure, we’ll move all the data from an S3 general purpose bucket named pluto to an S3 directory bucket named mercury and set the TTL cleanup time to one hour. If the data already exists in the bucket, it won’t be copied.
Figure 5: Copying all data from the pluto general purpose bucket to the mercury directory bucket
8. Starting and monitoring the execution
Choosing the Start execution button on this screen runs the copy job between the buckets. One hour after the job starts, the Lambda function begins deleting the copied objects from the S3 directory bucket.
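For context on what happens at expiration, the following is a simplified sketch of the kind of Lambda handler that could react to DynamoDB expirations. It assumes the expirations arrive as REMOVE records on the table’s stream and that each item stores the directory bucket and object key under the hypothetical attribute names used earlier; the deployed solution’s actual function may differ.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Delete cached copies from the directory bucket when their DynamoDB items expire."""
    for record in event.get("Records", []):
        if record.get("eventName") != "REMOVE":
            continue  # only act on expired or deleted items
        old_image = record["dynamodb"].get("OldImage", {})
        bucket = old_image["directory_bucket"]["S"]  # assumed attribute name
        key = old_image["object_key"]["S"]           # assumed attribute name
        s3.delete_object(Bucket=bucket, Key=key)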
The solution in action
Having shown the installation and configuration of the solution, let’s see it in action. Using some demonstration data for an ML workload, the solution moves ~2.9 TiB of data in 280,000 objects between an Amazon S3 general purpose bucket and an S3 directory bucket, as shown in the following figure. The solution spawns multiple processes to move the data.
Figure 6: 280,000 objects copied to an S3 directory bucket in 4 minutes 25 seconds.
When the primary Step Function reports a successful execution, the task is complete. In this case, it took 4 minutes and 25 seconds to move the dataset to S3 Express One Zone, and it cost ~$20. You can find further examples of pricing in the README file included with the solution. Compute resources accessing these objects now benefit from the high throughput and low latency of S3 Express One Zone.
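Reading the cached data requires no application changes beyond pointing reads at the directory bucket. The following is a minimal sketch with the AWS SDK for Python (Boto3), assuming a recent SDK version that supports directory buckets and using placeholder bucket and key names.
import boto3

s3 = boto3.client("s3")

# Placeholder directory bucket and key; this read is served from the cache
# rather than the general purpose bucket.
response = s3.get_object(
    Bucket="mercury--use1-az4--x-s3",
    Key="year-2024/month-05/file-0001.parquet",
)
data = response["Body"].read()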
Cleaning up
With serverless technology, you only pay for what you use. When this solution isn’t running, there are no charges for the work it does. If you choose to remove the solution, you can do so by returning to the venv (re-activating it if you exited it earlier):
source .venv/bin/activate
cd s3xz-caching-solution-main/
cdk destroy
Running the AWS CDK destroy command gathers the current state of the resources created during the cdk deploy command earlier, and it makes the necessary API calls to delete them. This is a destructive command, so make sure that you examine the output of the command before you choose y to delete the stack.
Are you sure you want to delete: S3CachingSolutionStack (y/n)?
The deactivate command exits the venv and returns you to the shell prompt.
Conclusion
In this post, we introduced a serverless caching solution for S3 Express One Zone that addresses the challenge of data movement between storage tiers. We began with an overview of how the solution integrates key AWS services such as AWS Step Functions, AWS Lambda, Amazon DynamoDB, and Amazon S3 Batch Operations to automate the end-to-end data movement process. We then walked through the detailed deployment process using AWS CDK on an Amazon EC2 instance, configured the necessary execution parameters in the AWS Management Console, and demonstrated the solution’s performance by moving 2.9 TiB of data (280,000 objects) in just 4 minutes and 25 seconds.
The solution delivers several key benefits for workloads where latency is a critical factor. By automating bulk data movement to S3 Express One Zone’s low-latency storage, organizations can dramatically improve workload efficiency while maintaining cost control through automatic expiration of cached objects. The serverless architecture ensures you only pay for the resources you use during actual data movement operations, with no ongoing costs when the solution isn’t actively running.
You can find out more about this S3 caching solution by downloading the code and reading more about its design in the GitHub repository. To learn more about S3 Express One Zone, visit the S3 User Guide.