AWS HPC Blog

How Novo Nordisk, Columbia University and AWS collaborated to create OpenFold3 

This post was contributed by Daniele Granata, Principal Modelling Scientist, Søren Moos, Architect Advisor Director, Rômulo Jales, Senior Software Engineer, Stig Bøgelund Nielsen, Senior HPC Cloud Engineer at Novo Nordisk, and Jake Mevorach, Senior Specialist, Application Modernization, Arthur Grabman, Principal Technical Account Manager, and Anamaria Todor, Principal Solutions Architect at AWS.

In this blog post, we’re excited to share how Novo Nordisk, Columbia University and AWS collaborated to create OpenFold3, a state-of-the-art protein structure prediction model, using cost-effective and scalable bioinformatics solutions on AWS.

The OpenFold Consortium, a non-profit AI research and development group, has been at the forefront of creating open-source tools for biology and drug discovery. Their latest project, OpenFold3, represents a significant advancement in protein structure prediction. This post explores how we leveraged AWS services and innovative approaches to overcome the challenges of training this complex model while maintaining cost-effectiveness and scalability.

We will first introduce protein structure prediction and discuss its importance in advancing medical research and drug discovery.

Then we’ll delve into the three main challenges we faced and how we addressed them:

  1. Creating a lean research environment: We will discuss how we developed the Research Collaboration Platform (RCP) using AWS services to provide a secure, flexible, and efficient workspace for our collaborative efforts.
  2. Generating Multiple Sequence Alignments (MSAs): We will explore how we optimized the MSA generation process using AWS Graviton processors, achieving significant improvements in both speed and cost-efficiency.
  3. AI/ML Training: We’ll detail our approach to the computationally intensive task of training the OpenFold3 model, including our use of Amazon EC2 Capacity Blocks for ML and Spot Instances to manage resources effectively.

This post offers valuable insights for researchers, data scientists, and organizations looking to leverage cloud computing for complex bioinformatics tasks. Whether you’re working on protein structure prediction or other computationally intensive projects in life sciences, the strategies and solutions we’ll discuss can help you optimize your workflows for both performance and cost-effectiveness.

Protein Structure Prediction in a Nutshell

If you’re interested in curing disease, advancing medicine and potentially even saving lives, a great place to start is understanding the structure of proteins. The human body contains many different types of proteins. Some are helpful, like the enzyme lactase, which helps us digest dairy. Others can be harmful and cause disease or even death. Even slight alterations in the structure of these microscopic molecules can mean the difference between life and death.

Protein-based therapeutics are a new and exciting paradigm for treating disease. But given how powerfully proteins can act on the human body, we need to make sure we’re creating proteins with the right structure.

The process of developing a protein-based therapeutic traditionally involved:

  1. Finding a receptor or other structure in the body (referred to in industry as a “target”) that you believe a protein could bind to in order to produce the desired therapeutic outcome.
  2. Developing a process that produces a protein that will act on this target.
  3. Verifying, using a method like X-ray crystallography or Nuclear Magnetic Resonance spectroscopy, that your process produces a protein with the desired structure.
  4. Conducting extensive testing and research, and eventually clinical trials, to make sure that the newly developed protein has the desired effect.

Historically, the biggest problem was that while we could genetically engineer proteins with specific sequences, we lacked good tools that could tell us in advance what a protein would look like based on its genetic data. This meant that therapeutic development took a long time: even after you figured out how to make a protein with a specific genetic sequence, you needed a lot of highly trained people spending a lot of time operating expensive equipment (expensive even by global pharmaceutical company standards) to verify the structure. If the structure wasn’t right, you had to start over and repeat the process until you got the right structure, and you had to begin again every time you had a new molecule to characterize.

All of this made developing protein-based therapeutics incredibly expensive and time consuming. A breakthrough came when researchers realized that AI could be applied to significantly speed up this process. Researchers already had a large amount of historical data from previous work that paired protein structures with their associated genetic data. The question posed was: “If we have data that pairs protein structures with their genetic data, can we use it to train a model that, given a new genetic sequence, will predict the structure of the resulting protein?” The research community found that the answer is “yes”, and from that point on we saw the proliferation of AI models that take a genetic sequence as input and output a predicted protein structure.

It takes pennies worth of electricity to run these models, and they provide valuable feedback on protein behaviors prior to starting expensive and time-consuming trials.

Enter OpenFold3

The OpenFold Consortium is a non-profit AI research and development consortium that develops free, open-source tools that help with biology and drug discovery. One of the tools they develop is OpenFold, an AI model that takes a genetic sequence as an input and outputs a predicted structure.

OpenFold3 represents the latest version of this software, and to accelerate its development, Novo Nordisk and AWS partnered with the OpenFold Consortium and Columbia University.

The First Challenge: a lean research environment

We started this project with limited time and a limited budget, and the first challenge the project faced was that researchers needed a lean research environment as soon as possible.

As the preferred scientific partner, Novo Nordisk needed to create an environment that would fundamentally transform how research collaborations operate. The environment had to enable seamless external collaboration between Novo Nordisk, Columbia University, and the OpenFold Consortium while maintaining strict security and compliance standards. It was crucial that researchers could use their preferred tools and devices, maintaining their established workflows without disruption. Given the tight timeline, we needed quick setup capabilities and short turnaround times for building infrastructure, coupled with a streamlined onboarding process for external collaborators. The system also needed to be flexible enough to accommodate various consortia contract terms, particularly around geographic data location and access controls. Most importantly, the environment had to accelerate early target discovery by providing robust computational resources while maintaining cost efficiency. Finally, we needed an environment that could dynamically scale up during intensive training periods and scale back down when not in use to maintain cost effectiveness.

All these requirements needed to be balanced while offering researchers the freedom to innovate – a particularly crucial factor given the intensive computational demands of training large AI models for protein structure prediction.

Figure 1 – This diagram illustrates Novo Nordisk’s Research Collaboration Platform (RCP), showing how it creates a secure and performant research environment on AWS Cloud. The key elements to notice are: (1) the secure access layer using Okta authentication, which lets researchers safely connect through web portals or SSH, (2) the flexible computing resources managed by AWS ParallelCluster, which automatically scale to match researchers’ needs, and (3) the AWS EFS and AWS FSx file systems that handle research data. The main takeaway is that this architecture enables scientists to conduct complex research securely and efficiently, particularly for demanding tasks like AI model training, while maintaining regulatory compliance. The platform is notable for combining enterprise-grade security with the flexibility researchers need and can be deployed in under 2 hours through automation.

To address these requirements, Novo Nordisk developed the Research Collaboration Platform (RCP), a comprehensive solution built on AWS that provides a secure and flexible environment for computational research. The architecture (shown in Figure 1) leverages AWS ParallelCluster as its foundation, operating within a Virtual Private Cloud (VPC) that ensures secure isolation of research workloads.

At the core of the RCP’s security model is an Okta integration, providing robust user directory services and multi-factor authentication. Researchers can access the platform through either web portals or SSH using Okta Advanced Server Access, ensuring secure and controlled access to resources. The platform’s entry point is an EC2 head node located in a public subnet, which hosts essential research tools including RStudio, Docker containers, and various scientific frameworks like Singularity.

The compute infrastructure is thoughtfully designed with a SLURM job scheduler that manages workloads across different instance types optimized for various computational needs – from general-purpose compute to memory-intensive tasks and GPU acceleration. Amazon EC2 Auto Scaling ensures that resources are dynamically allocated based on demand, helping to maintain cost efficiency while meeting computational requirements. The platform uses Amazon Aurora Serverless for database management and Amazon DynamoDB for cluster state information, providing reliable and scalable data storage solutions.

Data management is handled through a combination of Amazon EFS and FSx for Lustre, offering both general-purpose and high-performance file storage options. The entire infrastructure is monitored using Amazon CloudWatch, with AWS CloudFormation enabling infrastructure as code for consistent and repeatable deployments. Deploying an instance of RCP takes less than 2 hours.
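
For readers curious what that deployment automation can look like, here is a minimal Python sketch (not the actual RCP tooling) that drives the AWS ParallelCluster v3 CLI, which in turn provisions the cluster through CloudFormation. The cluster name, region, and configuration file path are illustrative placeholders, and the exact CLI output fields should be checked against the ParallelCluster version in use.

```python
# Minimal sketch (not the actual RCP automation): create a ParallelCluster
# deployment from an existing cluster configuration using the pcluster v3 CLI.
import json
import subprocess

CLUSTER_NAME = "rcp-openfold"            # hypothetical cluster name
CONFIG_FILE = "rcp-cluster-config.yaml"  # hypothetical ParallelCluster config
REGION = "eu-west-1"                     # hypothetical region

def create_cluster() -> None:
    """Kick off cluster creation; ParallelCluster provisions it via CloudFormation."""
    result = subprocess.run(
        ["pcluster", "create-cluster",
         "--cluster-name", CLUSTER_NAME,
         "--cluster-configuration", CONFIG_FILE,
         "--region", REGION],
        check=True, capture_output=True, text=True,
    )
    print(json.loads(result.stdout))

def cluster_status() -> str:
    """Return the current cluster status (e.g. CREATE_IN_PROGRESS)."""
    result = subprocess.run(
        ["pcluster", "describe-cluster",
         "--cluster-name", CLUSTER_NAME, "--region", REGION],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout).get("clusterStatus", "UNKNOWN")

if __name__ == "__main__":
    create_cluster()
    print("Status:", cluster_status())
```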

This architecture provides researchers with the flexibility to use their preferred tools while maintaining security and compliance requirements. The combination of containerization, automated scaling, and high-performance computing capabilities enables rapid experimentation and accelerates the research workflow, particularly for computationally intensive tasks like training the OpenFold3 model.

The Second Challenge: MSA generation

With the research environment up and running, we had to figure out how to generate millions of multiple sequence alignments (MSAs). A multiple sequence alignment, or MSA, lines up three or more related genetic sequences so that corresponding positions match, capturing how similar the sequences are and which positions are conserved (a toy example follows below). These MSAs are important for us because they are what will be used (along with other data) to train OpenFold3.
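
To make the idea concrete, here is a toy Python illustration (not real biological data): three short, already-aligned sequences, plus a check of which columns are identical across all of them, which is exactly the kind of conservation signal an MSA exposes for model training.

```python
# Toy illustration only: three short, already-aligned sequences.
# Columns where all sequences agree are "conserved" positions.
aligned = [
    "ATG-CATTTGGCA",
    "ATGACATTTGGCA",
    "ATG-CACTTGGTA",
]

for column in range(len(aligned[0])):
    residues = {seq[column] for seq in aligned}
    status = "conserved" if len(residues) == 1 else "variable"
    print(f"position {column:2d}: {sorted(residues)} -> {status}")
```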

The software we needed to scale to produce these MSAs is called HHBlits (specifically version 3, which can be downloaded for free from GitHub). When we calculated the cost and runtime for this step using our initial target compute instance type, r5.16xlarge, we found that it was going to take roughly twice as long and cost a little over twice as much as our timeline and budget allowed.

Amazon EC2 R8g instances, powered by the latest-generation AWS Graviton4 processors, had the potential to give us better price performance for this memory-intensive workload. Benchmarking on r8g.16xlarge instances revealed a 50% lower runtime and a 55% lower cost for HHBlits compared to r5.16xlarge. This, plus some other optimizations, made the whole process run much faster at a much lower cost and allowed us to generate over one million MSAs per day.

We encountered one challenge in migrating from R5 to R8g instances: the difference in CPU architecture. R8g instances use Arm-based Graviton CPUs, while R5 instances are x86-64. Code-wise, that wasn’t a problem, since all the code is C, C++, and Python-based, with support for both AArch64 and x86-64. But we had to create a completely new cluster, since ParallelCluster requires that the head node and the compute nodes use the same CPU architecture.

We solved this by defining a new AWS ParallelCluster configuration that aligned the CPU architecture between head and compute nodes. To further improve performance, we also customized an EC2 image with Amazon EBS as local storage. All generated MSAs were saved directly to an Amazon S3 bucket on the fly, as sketched below. This setup proved to be efficient.
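
As a rough illustration of that per-sequence step, here is a minimal Python sketch, assuming HH-suite 3 (the hhblits binary) is installed on the compute node and a reference database is staged on local storage. The bucket name, database path, and file names are placeholders, not the actual pipeline code.

```python
# Minimal sketch: run hhblits for one query sequence and push the MSA to S3
# so nothing accumulates on cluster storage.
import subprocess
import boto3

S3_BUCKET = "example-openfold-msas"          # hypothetical bucket
DATABASE = "/local/db/uniclust30_2018_08"    # hypothetical database prefix
s3 = boto3.client("s3")

def generate_msa(query_fasta: str, out_a3m: str, cpus: int = 8) -> None:
    """Run hhblits on one query sequence and upload the resulting MSA to S3."""
    subprocess.run(
        ["hhblits",
         "-i", query_fasta,     # input sequence
         "-d", DATABASE,        # sequence database to search
         "-oa3m", out_a3m,      # write the MSA in A3M format
         "-n", "3",             # number of search iterations
         "-cpu", str(cpus)],
        check=True,
    )
    s3.upload_file(out_a3m, S3_BUCKET, f"msas/{out_a3m}")

if __name__ == "__main__":
    generate_msa("query.fasta", "query.a3m")
```

In the real setup, Slurm array jobs fan this step out across the Graviton compute queue, which is how the throughput scales with the number of nodes.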

From the end user’s perspective, there were minimal changes. The scientists only had to use a different host name to log in and access the Graviton Slurm queues. We also managed to reuse the very same EFS and FSx for Lustre filesystems configured initially, without performance penalties.

The best part about this architecture is that it’s highly scalable. If we want to change how many MSAs we produce at a time, we can simply add or remove nodes. The one million MSAs per day mentioned above is in no way an upper limit; in fact, we don’t actually know what the upper limit is for throughput using this method.

The Third Challenge: AI/ML training

It’s well known that GPUs are the key enabler of AI/ML training workloads. But for protein structure prediction, just having access to GPUs is not enough. The infrastructure must also satisfy a set of prerequisites that includes petabyte-scale storage, high-speed networking, and a couple of hundred GPUs.

The OpenFold3 project required up to 256 GPUs for training. That translates into 32 p5en.48xlarge EC2 instances (8 NVIDIA GPUs each) operating in parallel, crunching and generating data 24 hours per day for a couple of weeks. This setup requires an infrastructure that is scalable, resilient, observable and, most importantly, flexible.

At the beginning of the project, the scientists needed to create the training logic almost from the ground up. The uncertainty was high, so the 32 GPU instances were not all needed from the very beginning. What the scientists needed was an environment in which they could develop, run some small experiments, and gradually scale up the number of nodes used for training. That is exactly the kind of flexibility that AWS ParallelCluster provides.

In training OpenFold3 on our generated data, we encountered three primary challenges:

  1. Budget constraints necessitated cost-effective solutions and left no room for costly errors.
  2. The increased complexity of the new model required distributed AI/ML training infrastructure, as it could no longer fit on a single GPU or instance.
  3. Securing substantial GPU resources for both training and preliminary experiments demanded swift, coordinated action among all involved parties.

To address these challenges, we implemented the following solutions:

GPU Resource Allocation: We utilized Amazon EC2 Capacity Blocks for ML, a consumption model that allows reservation of high-performance GPU compute capacity for short-duration machine learning workloads. This service enables users to reserve hundreds of NVIDIA GPUs co-located in Amazon EC2 UltraClusters by specifying cluster size, future start date, and duration. Capacity Blocks provide reliable, predictable access to GPU resources without long-term commitments, making them ideal for training and fine-tuning ML models, running experiments, and preparing for anticipated demand increases. Pricing is dynamic based on supply and demand, typically close to P5 On-Demand rates.
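
To give a sense of the reservation workflow, here is a minimal boto3 sketch of searching for and purchasing a Capacity Block. The instance type, count, duration, dates, and region are illustrative, and any real purchase should only happen after reviewing the returned offerings and their upfront fees.

```python
# Minimal sketch of reserving GPU capacity with EC2 Capacity Blocks for ML.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region

# 1. Find offerings matching the desired shape of the training run.
start = datetime.now(timezone.utc) + timedelta(days=7)
offerings = ec2.describe_capacity_block_offerings(
    InstanceType="p5en.48xlarge",
    InstanceCount=8,                 # e.g. a subset of the 32-instance fleet
    CapacityDurationHours=14 * 24,   # two weeks
    StartDateRange=start,
    EndDateRange=start + timedelta(days=14),
)["CapacityBlockOfferings"]

for offer in offerings:
    print(offer["CapacityBlockOfferingId"], offer["StartDate"], offer["UpfrontFee"])

# 2. Purchase the chosen offering; the reserved capacity then appears as a
#    capacity reservation that instances can be launched into.
if offerings:
    ec2.purchase_capacity_block(
        CapacityBlockOfferingId=offerings[0]["CapacityBlockOfferingId"],
        InstancePlatform="Linux/UNIX",
    )
```

Once purchased, the block surfaces as a capacity reservation that the GPU compute queue can target when launching training instances.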

Cost Optimization: We leveraged Spot Instances to significantly reduce costs. As AWS’s GPU infrastructure expanded, more GPUs became available as Spot capacity. Spot Instances let customers use spare EC2 compute capacity at substantially reduced prices; an instance runs until the user terminates it or EC2 needs the capacity back, in which case the instance receives a two-minute warning before interruption. Spot Instances can offer up to 90% savings compared to On-Demand pricing, potentially exceeding Capacity Blocks’ cost-effectiveness. However, because they are subject to interruptions, they are better suited to fault-tolerant workloads.
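
To illustrate how a training node can react to that two-minute warning, here is a minimal Python sketch that polls the EC2 instance metadata service for a Spot interruption notice and triggers a checkpoint. The checkpoint function is a placeholder; this is not the RCP’s actual interruption handling.

```python
# Minimal sketch of a Spot interruption watcher for a training node.
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    """Fetch an IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def interruption_pending() -> bool:
    """Return True if a Spot termination notice has been issued."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
        return True                      # 200 -> notice present
    except urllib.error.URLError:
        return False                     # 404 (or unreachable IMDS) -> no notice

def save_checkpoint() -> None:
    """Placeholder: persist model/optimizer state to durable storage (e.g. S3)."""
    print("checkpointing before interruption...")

if __name__ == "__main__":
    while True:
        if interruption_pending():
            save_checkpoint()
            break
        time.sleep(5)
```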

Notably, Novo Nordisk’s RCP team successfully incorporated spot instances midway through the project, achieving an impressive 85% cost reduction compared to on-demand pricing with zero interruptions.

Distributed Training Infrastructure: We drew inspiration from the AWSome Distributed Training examples, particularly those for DeepSpeed, to construct our scalable training infrastructure. This setup efficiently utilized GPUs from multiple instances simultaneously, resulting in a highly effective and scalable system. The final infrastructure allowed for Slurm job submissions to an autoscaling, efficient, distributed GPU cluster, capable of handling the complex task of training the new OpenFold3 model.
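
For a flavor of the multi-node pattern (the examples we followed use DeepSpeed; this simpler sketch uses plain PyTorch DistributedDataParallel instead), here is a minimal training script meant to be launched with torchrun from a Slurm batch script. The tiny linear model and training loop are placeholders, not the OpenFold3 code.

```python
# Minimal multi-GPU training sketch using PyTorch DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(10):                                  # placeholder loop
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()          # gradients are all-reduced across all GPUs
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A Slurm batch script would typically wrap this with something like `srun torchrun --nnodes <N> --nproc_per_node 8 train_ddp.py`, letting the scheduler place the job on the autoscaling GPU queue.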

By implementing these solutions, we successfully overcame our initial challenges, creating a cost-effective, scalable, and efficient training environment for OpenFold3.

Conclusion

The successful training of OpenFold3 represents a significant collaboration between Novo Nordisk, Columbia University, and AWS that overcame three major challenges through innovative solutions:

  1. Novo Nordisk developed the Research Collaboration Platform (RCP), a secure and flexible research environment built on AWS that enabled seamless collaboration while maintaining strict security standards. This platform can be deployed in under two hours and provides dynamic scaling of computational resources.
  2. Together, we optimized the Multiple Sequence Alignment (MSA) generation process by leveraging AWS Graviton processors, achieving a 50% reduction in runtime and 55% reduction in costs. This enabled researchers to generate over one million MSAs per day with a highly scalable architecture.
  3. Together, we tackled the AI/ML training challenges by implementing a combination of Amazon EC2 Capacity Blocks for ML and Spot instances. Our distributed training infrastructure efficiently utilized 256 GPUs across 32 p5en.48xlarge EC2 machines.

These solutions not only made the project more cost-effective and efficient but also established a blueprint for future large-scale bioinformatics collaborations between Novo Nordisk and AWS.

The successful development of OpenFold3 advances the field of protein structure prediction, contributing to faster and more efficient therapeutic development processes.

Daniele Granata

Daniele Granata is a Principal Modelling Scientist in InSilico Protein Discovery at Novo Nordisk. He loves to build people and technological connections and to work at the frontiers of protein science, AI and HPC. He has a bachelor's and master's degree in Physics, a PhD in "Physics and Chemistry of Biological Systems" and 5 years of postdoctoral experience between the US and Denmark. In his last 6 years at Novo Nordisk, Daniele has thrived pushing data science and generative AI to design new drugs for patients, with a strong dedication to open sourcing work for the scientific community.

Anamaria Todor

Anamaria Todor is a Principal Solutions Architect based in Copenhagen, Denmark. She saw her first computer when she was 4 years old and has never let go of computer science, video games, and engineering since. She has worked in various technical roles, from freelancer and full-stack developer to data engineer, technical lead, and CTO, at companies in Denmark focused on the gaming and advertising industries. She has been at AWS for over 5 years, working as a Principal Solutions Architect focusing mainly on life sciences and AI/ML. Anamaria has a bachelor’s in Applied Engineering and Computer Science, a master’s degree in Computer Science, and over 10 years of AWS experience. When she’s not working or playing video games, she’s coaching girls and female professionals in understanding and finding their path through technology.

Arthur Grabman

Arthur Grabman is a Principal Technical Account Manager with over 15 years of experience, specialising in HPC for enterprise clients. He loves translating complex technical concepts into tangible business value. Outside of his professional life, Arthur enjoys traveling with his family.

Rômulo Jales

Rômulo Jales is a Senior Software Engineer at Novo Nordisk. With a strong background of more than 15 years in software development, Rômulo loves to get things done by creating simple and elegant solutions, even in mission-critical applications. Rômulo has a bachelor's in Computer Engineering, and he is still moved by the original curiosity of understanding how and why things work the way they do. For him, there is always something new to learn and have fun with.

Jacob Mevorach

Jacob Mevorach is a senior specialist for containers for healthcare and the life sciences at AWS. Jacob has a background in bioinformatics and machine learning. Prior to joining AWS, Jacob focused on enabling and conducting large scale analysis for genomics and other scientific areas.

Stig Bøgelund Nielsen

Stig Bøgelund Nielsen is a Senior Cloud & HPC Engineer at Novo Nordisk. A lifelong engagement with IT, beginning with early work supporting local businesses, has shaped his career. Almost 8 years at IBM provided a strong foundation, in roles such as IT Service and Infrastructure Architect, Team Lead, and Transition & Transformation Architect/Project Manager, all focused on delivering business value through effective IT strategies and execution. He joined Novo Nordisk almost two years ago as a Cloud & HPC Engineer and is now leading the Research Collaboration Platform (RCP) as Tech Lead, directly supporting groundbreaking research, including the OpenFold project. He is passionate about leveraging technology to enable research and drive innovation.