AWS for Industries

Scaling Backtesting for Algorithmic Trading with AWS and Coiled

Backtesting plays a critical role in algorithmic trading. Firms leverage XGBoost for backtesting to build predictive models that forecast market movements based on historical data. Training XGBoost models is computationally expensive, however, and can quickly become a bottleneck, especially when working with large historical datasets.

This blog post shows how quantitative trading firms distribute XGBoost model training with Dask and scale computations across hundreds of Amazon Elastic Compute Cloud (Amazon EC2) instances on AWS with Coiled. Firms use Coiled and AWS to increase backtesting throughput, so researchers can focus on building and testing trading strategies instead of managing cloud infrastructure.

Background

Quantitative trading firms develop algorithmic models to generate financial insights and execute trades efficiently. Backtesting involves running trading strategies on historical data to evaluate performance before deploying them in live markets. This requires extensive model training and simulation, making efficient computation essential.

Firms rely on XGBoost, an open-source gradient boosting library widely used for stock price prediction, portfolio optimization, and financial risk assessment. XGBoost is highly accurate, scalable, and able to handle complex feature interactions.

Many of these firms incorporate alternative data sources beyond traditional market data, such as macroeconomic indicators, demographic trends, and social impact metrics, to enhance predictive accuracy. These diverse datasets often reach terabyte scale, and working with them requires robust computational infrastructure for rapid model training.

However, training models on terabyte-sized datasets presents a challenge. These computations require substantial memory and processing power, and firms often find their existing infrastructure inadequate for the speed and scale that competitive trading strategies require.

Distributed Model Training for Backtesting on AWS with XGBoost, Dask, and Coiled

To overcome these challenges, quantitative trading firms have adopted a distributed workflow combining XGBoost, Dask, and Coiled, providing a Python-native, scalable solution for backtesting and model training. The following sections describe how these tools come together for financial simulations on large datasets.

Predictive Modeling with XGBoost

Python is the dominant language for quantitative finance because of its powerful ecosystem of data science libraries, including XGBoost. XGBoost’s efficiency comes from iteratively building decision trees, with each new tree refining the predictions of the previous one. When integrated with Dask, computations can be distributed across multiple nodes, improving speed and scalability.
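To make the boosting loop concrete, here is a minimal single-machine sketch using synthetic data as a stand-in for engineered market features and a numeric target; the feature count, objective, and number of rounds are illustrative assumptions rather than values from a production workload:

import numpy as np
import xgboost as xgb

# Synthetic stand-ins for engineered market features and a numeric target
X = np.random.random((10_000, 20))
y = np.random.random(10_000)

model = xgb.train(
    {"tree_method": "hist", "objective": "reg:squarederror"},
    xgb.DMatrix(X, label=y),
    num_boost_round=100,  # each round adds a tree that refines the previous predictions
)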

Distributed Model Training with Dask

Dask partitions the data into smaller chunks spread across workers, computes on those partitions in parallel, and aggregates the results across all machines before the next training iteration. This lets firms run large-scale backtests in a fraction of the time it would take using single-node computations.

Here’s an example of distributed model training and inference with XGBoost on a Dask cluster running locally on a single machine (random data stands in for historical market features):

from xgboost import dask as dxgb
from dask.distributed import LocalCluster
import dask.array as da

cluster = LocalCluster()                   # Run Dask locally
client = cluster.get_client()

# Example data; in practice X and y are Dask collections built from
# historical market data
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random(100_000, chunks=10_000)

dtrain = dxgb.DaskDMatrix(client, X, y)    # Wrap data for distributed training
output = dxgb.train(                       # Distributed model training
    client,
    {"tree_method": "hist"},
    dtrain,
    num_boost_round=4,
    evals=[(dtrain, "train")],
)
predictions = dxgb.predict(                # Parallel model inference
    client, output, dtrain
)

Backtesting at Scale on AWS with Coiled

As dataset sizes grow, even single-machine setups with Dask reach their limits. Financial models require more memory and compute power, making scalable cloud infrastructure essential. With Coiled, a compute platform for Python developers, quantitative trading firms can scale out to hundreds of EC2 instances on AWS with only a few lines of code. Coiled manages the AWS infrastructure, so quants focus on evaluating trading strategies rather than monitoring EC2 instances. The following architectural diagram (Figure 1) shows how Coiled sits between the user’s local environment and the AWS Cloud environment.

Figure 1 – Coiled allows users to access the computing scale of AWS from within their local environment.

By replacing the LocalCluster in the snippet above with a Coiled cluster, firms can distribute workloads across hundreds of EC2 instances:

import coiled

cluster = coiled.Cluster(
    n_workers=300,
    region="us-east-1",
    spot_policy="spot_with_fallback",
)

Coiled provides several features that make it straightforward to start a cluster on AWS:

  • Start the EC2 instances you need with n_workers.
  • Deploy in any AWS region to optimize data access with region.
  • Leverage discounted Amazon EC2 Spot Instances for cost savings with spot_with_fallback.
  • Automatically replicate local Python packages to the remote cluster with package sync. You don’t need to worry about creating Docker images.
  • Securely forward local AWS credentials to the remote cluster, so you can easily access AWS resources like S3.

Quantitative analysts typically train models on datasets that range from hundreds of gigabytes up to tens of terabytes in memory.
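As a rough end-to-end sketch, assume the historical features live in a Parquet dataset on Amazon S3 (the bucket path and target column name below are hypothetical); the training code from the local example then runs unchanged once the client points at a Coiled cluster:

import coiled
import dask.dataframe as dd
from xgboost import dask as dxgb

cluster = coiled.Cluster(
    n_workers=300,
    region="us-east-1",
    spot_policy="spot_with_fallback",
)
client = cluster.get_client()

# Hypothetical dataset: engineered features plus a forward-return label column
df = dd.read_parquet("s3://example-bucket/features/")
X = df.drop(columns=["target"])
y = df["target"]

dtrain = dxgb.DaskDMatrix(client, X, y)
output = dxgb.train(
    client,
    {"tree_method": "hist"},
    dtrain,
    num_boost_round=4,
    evals=[(dtrain, "train")],
)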

Cluster Hardware Metrics

In this section we look at hardware metrics available in the Coiled dashboard for a typical model-training workflow.

For this example, the customer ran a backtesting workflow that took approximately 6 minutes using 300 Amazon EC2 m6i.xlarge instances. The m6i.xlarge instance offers a good balance of memory and compute, with 16 GiB of RAM and 4 vCPUs. The team runs many of these model-training workloads in parallel, up to 20 at a time, drastically increasing backtesting throughput.

Model training with XGBoost relies on heavy communication between Dask workers, since they share their intermediate model-training outputs across the cluster. During the lifetime of the computation, workers sent and received 21.4 TB of data (Figure 2), and data transfer rates reached 188 GB/s (Figure 3).

Figure 2 – Cumulative data sent and received over the duration of the computation. XGBoost model training requires inter-worker communication so workers can share their intermediate results; Dask on Amazon EC2 coordinates more than 21 TB of data during the computation.

Figure 3 – Data transfer rates reach 188 GB/s across the cluster.

Workers write intermediate results to disk using Amazon Elastic Block Store (Amazon EBS); the total data stored on disk during the computation reached 1.09 TB (Figure 4).

Figure 4 – Cumulative cluster data stored to disk over the course of a typical model-training workload. Total data stored on disk on the Coiled cluster reaches 1.09 TB.

To cut down on cloud costs, it’s common for users to take advantage of Spot Instances. For this workflow, approximately one third of the instances were Spot Instances (Figure 5).

Figure 5 – Number of On-Demand and Spot m6i.xlarge instances used for the computation. Spot Instances are often 2-3x cheaper than On-Demand.

If Spot Instances become unavailable, Coiled automatically replaces them with On-Demand instances so that workloads run without interruption.

Once the computation is complete, the cluster automatically shuts down to avoid unnecessary compute costs.
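Cluster lifetime can also be controlled explicitly. The short sketch below assumes the idle_timeout parameter and shutdown() method offered by recent Coiled releases; check the Coiled documentation for the options available in your version:

import coiled

cluster = coiled.Cluster(
    n_workers=300,
    idle_timeout="20 minutes",  # assumed option: shut down after a period of inactivity
)

# ... run backtests ...

cluster.shutdown()  # assumed method: tear down the EC2 instances immediately when done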

Conclusion

By integrating XGBoost, Dask, and Coiled, quantitative trading firms efficiently scale their backtesting workflows to process terabyte-scale datasets. The combination of these tools enhances predictive modeling, accelerates training times from days to minutes, and reduces cloud costs.

Coiled’s automated cloud scaling allows firms to focus on refining trading strategies instead of infrastructure management. With simple Python environment synchronization, Spot instance support, and flexible AWS deployment, firms accelerate their teams and easily scale their computations.

This scalable approach empowers firms to enhance their algorithmic trading strategies, incorporating increasingly complex datasets and running extensive backtests without sacrificing performance. To learn more about how this solution can be replicated for your use case, check out Coiled on the AWS Marketplace.

Alket Memushaj

Alket Memushaj works as a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy for capital markets, working with partners and customers to deploy applications across the trade lifecycle to the AWS Cloud, including market connectivity, trading systems, and pre- and post-trade analytics and research platforms.

Sarah Johnson

Sarah Johnson is a Product Marketing Manager at Coiled, where she helps data teams scale Python workflows in the cloud. With a background in population health research and hands-on experience as a data practitioner, she understands the real-world challenges of working with data at scale.

Simon Panek

Simon Panek is a Partner Solutions Architect at Amazon Web Services (AWS). He focuses on solutions involving the Internet of Things (IoT) by supporting AWS Partners that build new and scalable solutions to better serve customers across many industries. With manufacturing, engineering, and business administration experience, he is always looking for novel solutions.