AWS for Industries
Scaling Backtesting for Algorithmic Trading with AWS and Coiled
Backtesting plays a critical role in algorithmic trading. Firms leverage XGBoost for backtesting to build predictive models that forecast market movements based on historical data. Training XGBoost models is computationally expensive, however, and can quickly become a bottleneck, especially when working with large historical datasets.
This blog post shows how quantitative trading firms distribute XGBoost model training with Dask and scale computations on AWS across hundreds of Amazon Elastic Compute Cloud (Amazon EC2) instances with Coiled. Firms use Coiled and AWS to increase backtesting throughput, so researchers can focus on building and testing trading strategies instead of managing cloud infrastructure.
Background
Quantitative trading firms develop algorithmic models to generate financial insights and execute trades efficiently. Backtesting involves running trading strategies on historical data to evaluate performance before deploying them in live markets. This requires extensive model training and simulation, making efficient computation essential.
Firms rely on XGBoost, an open-source gradient boosting library widely used for stock price prediction, portfolio optimization, and financial risk assessment. XGBoost is highly accurate, scalable, and able to handle complex feature interactions.
Many of these firms incorporate alternative data sources beyond traditional market data, such as macroeconomic indicators, demographic trends, and social impact metrics, to enhance predictive accuracy. Challenges arise when working with these diverse and large datasets, often reaching terabyte-scale, requiring robust computational infrastructure for rapid model training.
However, training models on terabyte-sized datasets presents a challenge. These computations require high memory and processing power, and many firms find their existing infrastructure inadequate for the speed and scale required for competitive trading strategies.
Distributed Model Training for Backtesting on AWS with XGBoost, Dask, and Coiled
To overcome these challenges, quantitative trading firms have adopted a distributed workflow combining XGBoost, Dask, and Coiled, providing a Python-native, scalable solution for backtesting and model training. The following sections describe how these tools come together for financial simulations on large datasets.
Predictive Modeling with XGBoost
Python is the dominant language for quantitative finance because of its powerful ecosystem of data science libraries, including XGBoost. XGBoost’s efficiency comes from iteratively building decision trees, with each new tree refining the predictions of the previous one. When integrated with Dask, computations can be distributed across multiple nodes, improving speed and scalability.
Distributed Model Training with Dask
Dask divides the data into small chunks, trains independent models in parallel, and then aggregates results across all machines before the next training iteration. This lets firms run large-scale backtests in a fraction of the time it would take using single-node computations.
Here’s a minimal sketch of how model training and inference might run locally on a single machine with XGBoost on a Dask cluster (the dataset path and column names below are illustrative placeholders):
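```python
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client, LocalCluster

# Start a Dask cluster on the local machine
cluster = LocalCluster()
client = Client(cluster)

# Load historical market data (file path and column names are placeholders)
df = dd.read_parquet("historical-market-data.parquet")
X = df.drop(columns=["target"])
y = df["target"]

# Wrap the data in a DMatrix that is partitioned across the Dask workers
dtrain = xgb.dask.DaskDMatrix(client, X, y)

# Train the model; each boosting round aggregates results from all workers
output = xgb.dask.train(
    client,
    {"objective": "reg:squarederror", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)

# Run distributed inference with the trained booster
predictions = xgb.dask.predict(client, output["booster"], X)
```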
Backtesting at Scale on AWS with Coiled
As dataset sizes grow, even single-machine setups with Dask reach their limits. Financial models require more memory and compute power, making scalable cloud infrastructure essential. With Coiled, a compute platform for Python developers, quantitative trading firms can scale out across hundreds of EC2 instances on AWS with only a few lines of code. Coiled manages the AWS infrastructure, so quants focus on evaluating trading strategies rather than monitoring EC2 instances. The following architectural diagram (Figure 1) shows how Coiled sits between the user’s local environment and the AWS cloud environment.
Figure 1 – Coiled allows users to access the computing scale of AWS from within their local environment.
Replacing the LocalCluster in the snippet above with a Coiled cluster, firms can distribute workloads across hundreds of AWS instances. The snippet below is a sketch: the instance count, type, region, and exact cluster options are illustrative and should be adapted to your own workload:
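```python
import coiled

# Swap the LocalCluster for a Coiled cluster running on Amazon EC2.
# The instance count, type, and region here are illustrative.
cluster = coiled.Cluster(
    n_workers=300,                     # number of EC2 instances to start
    region="us-east-1",                # run in the region where your data lives
    worker_vm_types=["m6i.xlarge"],    # 4 vCPUs and 16 GiB of RAM per worker
    spot_policy="spot_with_fallback",  # use Spot Instances, fall back to on-demand
)
client = cluster.get_client()

# The rest of the workflow is unchanged: the same XGBoost training and
# inference code now runs across hundreds of EC2 instances.
```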
Coiled provides several features that make it straightforward to start a cluster like this on AWS:
- Start the EC2 instances you need with n_workers.
- Deploy in any AWS region to optimize data access with region.
- Leverage discounted Amazon EC2 Spot Instances for cost savings with spot_with_fallback.
- Automatically replicate local Python packages to the remote cluster with package sync. You don’t need to worry about creating Docker images.
- Securely forward local AWS credentials to the remote cluster, so you can easily access AWS resources like Amazon S3.
Quantitative analysts typically train models on datasets that range from hundreds of gigabytes up to tens of terabytes in memory.
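Because Coiled forwards local AWS credentials to the workers, data at this scale is typically read directly from Amazon S3 into distributed memory. A minimal sketch, using a placeholder bucket and prefix:

```python
import dask.dataframe as dd

# Read a large Parquet dataset directly from Amazon S3
# (the bucket and prefix are placeholders for your own data)
df = dd.read_parquet("s3://example-bucket/historical-market-data/")

# Load the data into distributed memory across the workers
df = df.persist()
```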
Cluster Hardware Metrics
In this section we look at hardware metrics available in the Coiled dashboard for a typical model-training workflow.
For this example, the customer was running a backtesting workflow that took approximately 6 minutes using 300 EC2 m6i.xlarge instances. The m6i.xlarge instance offers a good balance of memory and compute, with 4 vCPUs and 16 GiB of RAM. The team runs many of these model-training workloads in parallel, up to 20 at a time, drastically increasing backtesting throughput.
Model training with XGBoost relies on heavy communication between Dask workers, since they share their intermediate model-training outputs across the cluster. Over the lifetime of the computation, workers sent and received 21.4 TB of data (Figure 2), and data transfer rates reached 188 GB/s (Figure 3).
Figure 2 – Cumulative data sent and received over the duration of the computation. XGBoost model training requires inter-worker communication so workers can share their intermediate results; Dask on Amazon EC2 coordinated the transfer of 21.4 TB of data during this computation.
Figure 3 – Data transfer rates reached 188 GB/s across the cluster.
Workers write intermediate results to disk using Amazon Elastic Block Store (Amazon EBS) volumes, with the total data stored on disk during the computation reaching 1.09 TB (Figure 4).
Figure 4 – Cumulative cluster data stored to disk over the course of a typical model-training workload. Total data stored on disk on the Coiled cluster reached 1.09 TB.
To cut down on cloud costs, it’s common for users to take advantage of Spot Instances. For this workflow, approximately one third of the instances were Spot Instances (Figure 5).
Figure 5 – On-Demand vs. Spot pricing for m6i.xlarge Amazon EC2 instances. Spot Instances are often 2-3x cheaper than On-Demand Instances.
If Spot Instances become unavailable, Coiled automatically replaces them with On-Demand Instances so that workloads run without interruption.
Once the computation is complete, the cluster automatically shuts down to avoid unnecessary compute costs.
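If you want to release the instances immediately rather than wait for the automatic shutdown, the cluster can also be stopped explicitly; a small sketch continuing the example above:

```python
# Close the Dask client and shut the Coiled cluster down explicitly,
# rather than waiting for the automatic shutdown
client.close()
cluster.shutdown()
```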
Conclusion
By integrating XGBoost, Dask, and Coiled, quantitative trading firms efficiently scale their backtesting workflows to process terabyte-scale datasets. The combination of these tools enhances predictive modeling, accelerates training times from days to minutes, and reduces cloud costs.
Coiled’s automated cloud scaling allows firms to focus on refining trading strategies instead of infrastructure management. With simple Python environment synchronization, Spot Instance support, and flexible AWS deployment, firms can get their teams up and running quickly and scale their computations easily.
This scalable approach empowers firms to enhance their algorithmic trading strategies, incorporating increasingly complex datasets and running extensive backtests without sacrificing performance. To learn more about how this solution can be replicated for your use case, check out Coiled on the AWS Marketplace.