AWS HPC Blog

How Novo Nordisk, Columbia University and AWS collaborated to create OpenFold3 

This post was contributed by Daniele Granata, Principal Modelling Scientist, Søren Moos, Architect Advisor Director, Rômulo Jales, Senior Software Engineer, Stig Bøgelund Nielsen, Senior HPC Cloud Engineer at Novo Nordisk, and Jake Mevorach, Senior Specialist, Application Modernization, Arthur Grabman, Principal Technical Account Manager, and Anamaria Todor, Principal Solutions Architect at AWS.

In this blog post, we’re excited to share how Novo Nordisk, Columbia University and AWS collaborated to create OpenFold3, a state-of-the-art protein structure prediction model, using cost-effective and scalable bioinformatics solutions on AWS.

The OpenFold Consortium, a non-profit AI research and development group, has been at the forefront of creating open-source tools for biology and drug discovery. Their latest project, OpenFold3, represents a significant advancement in protein structure prediction. This post explores how we leveraged AWS services and innovative approaches to overcome the challenges of training this complex model while maintaining cost-effectiveness and scalability.

We will first introduce protein structure prediction and discuss its importance in advancing medical research and drug discovery.

Then we’ll delve into the three main challenges we faced and how we addressed them:

  1. Creating a lean research environment: We will discuss how we developed the Research Collaboration Platform (RCP) using AWS services to provide a secure, flexible, and efficient workspace for our collaborative efforts.
  2. Generating Multiple Sequence Alignments (MSAs): We will explore how we optimized the MSA generation process using AWS Graviton processors, achieving significant improvements in both speed and cost-efficiency.
  3. AI/ML Training: We’ll detail our approach to the computationally intensive task of training the OpenFold3 model, including our use of Amazon EC2 Capacity Blocks for ML and Spot Instances to manage resources effectively.

This post offers valuable insights for researchers, data scientists, and organizations looking to leverage cloud computing for complex bioinformatics tasks. Whether you’re working on protein structure prediction or other computationally intensive projects in life sciences, the strategies and solutions we’ll discuss can help you optimize your workflows for both performance and cost-effectiveness.

Protein Structure Prediction in a Nutshell

If you’re interested in curing disease, advancing medicine and potentially even saving lives, a great place to start is understanding the structure of proteins. The human body contains many different types of proteins. Some are helpful, like the enzyme lactase, which helps us digest dairy. Others can be harmful and cause disease or even death. Even slight alterations in the structure of these microscopic molecules can mean the difference between life and death.

Protein-based therapeutics are a new and exciting paradigm for treating disease. But given how powerfully proteins can act on the human body, we need to make sure we’re creating proteins with the right structure.

The process of developing a protein-based therapeutic traditionally involved:

  1. Finding a receptor or other structure in the body (referred to in industry as a “target”) that you believe a protein could bind to in order to produce the desired therapeutic outcome.
  2. Developing a process that produces a protein that will act on this target.
  3. Verifying, using a method like X-ray crystallography or Nuclear Magnetic Resonance spectroscopy, that your process produces a protein with the desired structure.
  4. Conducting extensive testing and research, and eventually clinical trials, to make sure that the newly developed protein has the desired effect.

Historically, the biggest problem was that while we could genetically engineer proteins with specific sequences, we lacked good tools that could tell us in advance what a protein would look like based on its genetic data. This meant that therapeutic development took a long time: even after you figured out how to make a protein with a specific genetic sequence, you needed a lot of highly trained people spending a lot of time operating expensive equipment (expensive even by global pharmaceutical company standards) to verify the structure. If the structure wasn’t right, you had to start over and repeat the process until you got the right structure, and you had to begin again every time you had a new molecule to characterize.

All of this made developing protein-based therapeutics incredibly expensive and time consuming. A breakthrough came when researchers realized that AI could be applied to significantly speed up this process. Researchers already had a large amount of historical data from previous work that paired protein structures with their associated genetic data. The question posed was: “If we have data that pairs protein structures with their genetic data, can we use it to train a model that, given a new genetic sequence, will predict the structure of the resulting protein?” The research community found that the answer is “yes”, and from that point on we saw the proliferation of AI models that take a genetic sequence as input and output a predicted protein structure.

It takes pennies worth of electricity to run these models, and they provide valuable feedback on protein behaviors prior to starting expensive and time-consuming trials.

Enter OpenFold3

The OpenFold Consortium is a non-profit AI research and development consortium that develops free, open-source tools that help with biology and drug discovery. One of the tools they develop is OpenFold, an AI model that takes a genetic sequence as an input and outputs a predicted structure.

OpenFold3 represents the latest version of this software, and to accelerate its development, Novo Nordisk and AWS partnered with the OpenFold Consortium and Columbia University.

The First Challenge: a lean research environment

We started this project with limited time and a limited budget, and the first challenge the project faced was that researchers needed a lean research environment as soon as possible.

As the preferred scientific partner, Novo Nordisk needed to create an environment that would fundamentally transform how research collaborations operate. The environment had to enable seamless external collaboration between Novo Nordisk, Columbia University, and the OpenFold Consortium while maintaining strict security and compliance standards. It was crucial that researchers could use their preferred tools and devices, maintaining their established workflows without disruption. Given the tight timeline, we needed quick setup capabilities and short turnaround times for building infrastructure, coupled with a streamlined onboarding process for external collaborators. The system also needed to be flexible enough to accommodate various consortia contract terms, particularly around geographic data location and access controls. Most importantly, the environment had to accelerate early target discovery by providing robust computational resources while maintaining cost efficiency. Finally, we needed an environment that could dynamically scale up during intensive training periods and scale back down when not in use to maintain cost effectiveness.

All these requirements needed to be balanced while offering researchers the freedom to innovate – a particularly crucial factor given the intensive computational demands of training large AI models for protein structure prediction.

Figure 1 – This diagram illustrates Novo Nordisk’s Research Collaboration Platform (RCP), showing how it creates a secure and performant research environment on AWS Cloud. The key elements to notice are: (1) the secure access layer using Okta authentication, which lets researchers safely connect through web portals or SSH, (2) the flexible computing resources managed by AWS ParallelCluster, which automatically scale to match researchers’ needs, and (3) the AWS EFS and AWS FSx file systems that handle research data. The main takeaway is that this architecture enables scientists to conduct complex research securely and efficiently, particularly for demanding tasks like AI model training, while maintaining regulatory compliance. The platform is notable for combining enterprise-grade security with the flexibility researchers need and can be deployed in under 2 hours through automation.

To address these requirements, Novo Nordisk developed the Research Collaboration Platform (RCP), a comprehensive solution built on AWS that provides a secure and flexible environment for computational research. The architecture (shown in Figure 1) leverages AWS ParallelCluster as its foundation, operating within a Virtual Private Cloud (VPC) that ensures secure isolation of research workloads.

At the core of the RCP’s security model is an Okta integration, providing robust user directory services and multi-factor authentication. Researchers can access the platform through either web portals or SSH using Okta Advanced Server Access, ensuring secure and controlled access to resources. The platform’s entry point is an EC2 head node located in a public subnet, which hosts essential research tools including RStudio, Docker containers, and various scientific frameworks like Singularity.

The compute infrastructure is thoughtfully designed with a SLURM job scheduler that manages workloads across different instance types optimized for various computational needs – from general-purpose compute to memory-intensive tasks and GPU acceleration. Amazon EC2 Auto Scaling ensures that resources are dynamically allocated based on demand, helping to maintain cost efficiency while meeting computational requirements. The platform uses Amazon Aurora Serverless for database management and Amazon DynamoDB for cluster state information, providing reliable and scalable data storage solutions.

Data management is handled through a combination of Amazon EFS and FSx for Lustre, offering both general-purpose and high-performance file storage options. The entire infrastructure is monitored using Amazon CloudWatch, with AWS CloudFormation enabling infrastructure as code for consistent and repeatable deployments. Deploying an instance of RCP takes less than 2 hours.
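
For readers curious what that deployment automation can look like, here is a minimal Python sketch (not the actual RCP tooling) that drives the AWS ParallelCluster v3 CLI, which in turn provisions the cluster through CloudFormation. The cluster name, region, and configuration file path are illustrative placeholders, and the exact CLI output fields should be checked against the ParallelCluster version in use.

```python
# Minimal sketch (not the actual RCP automation): create a ParallelCluster
# deployment from an existing cluster configuration using the pcluster v3 CLI.
import json
import subprocess

CLUSTER_NAME = "rcp-openfold"            # hypothetical cluster name
CONFIG_FILE = "rcp-cluster-config.yaml"  # hypothetical ParallelCluster config
REGION = "eu-west-1"                     # hypothetical region

def create_cluster() -> None:
    """Kick off cluster creation; ParallelCluster provisions it via CloudFormation."""
    result = subprocess.run(
        ["pcluster", "create-cluster",
         "--cluster-name", CLUSTER_NAME,
         "--cluster-configuration", CONFIG_FILE,
         "--region", REGION],
        check=True, capture_output=True, text=True,
    )
    print(json.loads(result.stdout))

def cluster_status() -> str:
    """Return the current cluster status (e.g. CREATE_IN_PROGRESS)."""
    result = subprocess.run(
        ["pcluster", "describe-cluster",
         "--cluster-name", CLUSTER_NAME, "--region", REGION],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout).get("clusterStatus", "UNKNOWN")

if __name__ == "__main__":
    create_cluster()
    print("Status:", cluster_status())
```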

This architecture provides researchers with the flexibility to use their preferred tools while maintaining security and compliance requirements. The combination of containerization, automated scaling, and high-performance computing capabilities enables rapid experimentation and accelerates the research workflow, particularly for computationally intensive tasks like training the OpenFold3 model.

The Second Challenge: MSA generation

With the research environment up and running, we had to figure out how to generate millions of multiple sequence alignments (MSAs). A multiple sequence alignment, or MSA, lines up three or more related genetic sequences so that corresponding positions match, capturing how similar the sequences are and which positions are conserved (a toy example follows below). These MSAs are important for us because they are what will be used (along with other data) to train OpenFold3.
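
To make the idea concrete, here is a toy Python illustration (not real biological data): three short, already-aligned sequences, plus a check of which columns are identical across all of them, which is exactly the kind of conservation signal an MSA exposes for model training.

```python
# Toy illustration only: three short, already-aligned sequences.
# Columns where all sequences agree are "conserved" positions.
aligned = [
    "ATG-CATTTGGCA",
    "ATGACATTTGGCA",
    "ATG-CACTTGGTA",
]

for column in range(len(aligned[0])):
    residues = {seq[column] for seq in aligned}
    status = "conserved" if len(residues) == 1 else "variable"
    print(f"position {column:2d}: {sorted(residues)} -> {status}")
```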

The software we needed to scale to produce these MSAs is called HHBlits (specifically version 3, which can be downloaded for free from GitHub). When we calculated the cost and runtime for this step using our initial target compute instance type, r5.16xlarge, we found that it was going to take roughly twice as long and cost a little over twice as much as our timeline and budget allowed.

Amazon EC2 R8g instances, powered by the latest-generation AWS Graviton4 processors, had the potential to give us better price performance for this memory-intensive workload. Benchmarking on r8g.16xlarge instances revealed a 50% lower runtime and a 55% lower cost for HHBlits compared to r5.16xlarge. This, plus some other optimizations, made the whole process run much faster at a much lower cost and allowed us to generate over one million MSAs per day.

We encountered one challenge in migrating from R5 to R8g instances: the difference in CPU architecture. R8g instances use Arm-based Graviton CPUs, while R5 instances are x86-64. Code-wise, that wasn’t a problem, since all the code is C, C++, and Python-based, with support for both AArch64 and x86-64. But we had to create a completely new cluster, since ParallelCluster requires that the head node and the compute nodes use the same CPU architecture.

We solved this by defining a new AWS ParallelCluster configuration that aligned the CPU architecture between head and compute nodes. To further improve performance, we also customized an EC2 image with Amazon EBS as local storage. All generated MSAs were saved directly to an Amazon S3 bucket on the fly, as sketched below. This setup proved to be efficient.
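
As a rough illustration of that per-sequence step, here is a minimal Python sketch, assuming HH-suite 3 (the hhblits binary) is installed on the compute node and a reference database is staged on local storage. The bucket name, database path, and file names are placeholders, not the actual pipeline code.

```python
# Minimal sketch: run hhblits for one query sequence and push the MSA to S3
# so nothing accumulates on cluster storage.
import subprocess
import boto3

S3_BUCKET = "example-openfold-msas"          # hypothetical bucket
DATABASE = "/local/db/uniclust30_2018_08"    # hypothetical database prefix
s3 = boto3.client("s3")

def generate_msa(query_fasta: str, out_a3m: str, cpus: int = 8) -> None:
    """Run hhblits on one query sequence and upload the resulting MSA to S3."""
    subprocess.run(
        ["hhblits",
         "-i", query_fasta,     # input sequence
         "-d", DATABASE,        # sequence database to search
         "-oa3m", out_a3m,      # write the MSA in A3M format
         "-n", "3",             # number of search iterations
         "-cpu", str(cpus)],
        check=True,
    )
    s3.upload_file(out_a3m, S3_BUCKET, f"msas/{out_a3m}")

if __name__ == "__main__":
    generate_msa("query.fasta", "query.a3m")
```

In the real setup, Slurm array jobs fan this step out across the Graviton compute queue, which is how the throughput scales with the number of nodes.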

From the end user’s perspective, there were minimal changes. The scientists only had to use a different host name to log in and access the Graviton Slurm queues. We also managed to reuse the very same EFS and FSx for Lustre filesystems configured initially, without performance penalties.

The best part about this architecture is that it’s highly scalable. If we want to change how many MSAs we produce at a time, we can simply add or remove nodes. The one million MSAs per day mentioned above is in no way an upper limit; in fact, we don’t actually know what the upper limit is for throughput using this method.

The Third Challenge: AI/ML training

It’s well known that GPUs are the key enabler of AI/ML training workloads. But for protein structure prediction, just having access to GPUs is not enough. The infrastructure must also satisfy a set of prerequisites that includes petabyte-scale storage, high-speed networking, and a couple of hundred GPUs.

The OpenFold3 project required up to 256 GPUs for training. That translates into 32 p5en.48xlarge EC2 instances (8 NVIDIA GPUs each) operating in parallel, crunching and generating data 24 hours per day for a couple of weeks. This setup requires an infrastructure that is scalable, resilient, observable and, most importantly, flexible.

At the beginning of the project, the scientists needed to create the training logic almost from the ground up. The uncertainty was high, so the 32 GPU instances were not all needed from the very beginning. What the scientists needed was an environment in which they could develop, run some small experiments, and gradually scale up the number of nodes used for training. That is exactly the kind of flexibility that AWS ParallelCluster provides.

In training OpenFold3 on our generated data, we encountered three primary challenges:

  1. Budget constraints necessitated cost-effective solutions and left no room for costly errors.
  2. The increased complexity of the new model required distributed AI/ML training infrastructure, as it could no longer fit on a single GPU or instance.
  3. Securing substantial GPU resources for both training and preliminary experiments demanded swift, coordinated action among all involved parties.

To address these challenges, we implemented the following solutions:

GPU Resource Allocation: We utilized Amazon EC2 Capacity Blocks for ML, a consumption model that allows reservation of high-performance GPU compute capacity for short-duration machine learning workloads. This service enables users to reserve hundreds of NVIDIA GPUs co-located in Amazon EC2 UltraClusters by specifying cluster size, future start date, and duration. Capacity Blocks provide reliable, predictable access to GPU resources without long-term commitments, making them ideal for training and fine-tuning ML models, running experiments, and preparing for anticipated demand increases. Pricing is dynamic based on supply and demand, typically close to P5 On-Demand rates.
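
To give a sense of the reservation workflow, here is a minimal boto3 sketch of searching for and purchasing a Capacity Block. The instance type, count, duration, dates, and region are illustrative, and any real purchase should only happen after reviewing the returned offerings and their upfront fees.

```python
# Minimal sketch of reserving GPU capacity with EC2 Capacity Blocks for ML.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical region

# 1. Find offerings matching the desired shape of the training run.
start = datetime.now(timezone.utc) + timedelta(days=7)
offerings = ec2.describe_capacity_block_offerings(
    InstanceType="p5en.48xlarge",
    InstanceCount=8,                 # e.g. a subset of the 32-instance fleet
    CapacityDurationHours=14 * 24,   # two weeks
    StartDateRange=start,
    EndDateRange=start + timedelta(days=14),
)["CapacityBlockOfferings"]

for offer in offerings:
    print(offer["CapacityBlockOfferingId"], offer["StartDate"], offer["UpfrontFee"])

# 2. Purchase the chosen offering; the reserved capacity then appears as a
#    capacity reservation that instances can be launched into.
if offerings:
    ec2.purchase_capacity_block(
        CapacityBlockOfferingId=offerings[0]["CapacityBlockOfferingId"],
        InstancePlatform="Linux/UNIX",
    )
```

Once purchased, the block surfaces as a capacity reservation that the GPU compute queue can target when launching training instances.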

Cost Optimization: We leveraged Spot Instances to significantly reduce costs. As AWS’s GPU infrastructure expanded, more GPUs became available as Spot capacity. Spot Instances let customers use spare EC2 compute capacity at substantially reduced prices; an instance runs until the user terminates it or EC2 needs the capacity back, in which case the instance receives a two-minute warning before interruption. Spot Instances can offer up to 90% savings compared to On-Demand pricing, potentially exceeding Capacity Blocks’ cost-effectiveness. However, because they are subject to interruptions, they are better suited to fault-tolerant workloads.
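
To illustrate how a training node can react to that two-minute warning, here is a minimal Python sketch that polls the EC2 instance metadata service for a Spot interruption notice and triggers a checkpoint. The checkpoint function is a placeholder; this is not the RCP’s actual interruption handling.

```python
# Minimal sketch of a Spot interruption watcher for a training node.
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    """Fetch an IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def interruption_pending() -> bool:
    """Return True if a Spot termination notice has been issued."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
    )
    try:
        urllib.request.urlopen(req, timeout=2)
        return True                      # 200 -> notice present
    except urllib.error.URLError:
        return False                     # 404 (or unreachable IMDS) -> no notice

def save_checkpoint() -> None:
    """Placeholder: persist model/optimizer state to durable storage (e.g. S3)."""
    print("checkpointing before interruption...")

if __name__ == "__main__":
    while True:
        if interruption_pending():
            save_checkpoint()
            break
        time.sleep(5)
```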

Notably, Novo Nordisk’s RCP team successfully incorporated spot instances midway through the project, achieving an impressive 85% cost reduction compared to on-demand pricing with zero interruptions.

Distributed Training Infrastructure: We drew inspiration from the AWSome Distributed Training examples, particularly those for DeepSpeed, to construct our scalable training infrastructure. This setup efficiently utilized GPUs from multiple instances simultaneously, resulting in a highly effective and scalable system. The final infrastructure allowed for Slurm job submissions to an autoscaling, efficient, distributed GPU cluster, capable of handling the complex task of training the new OpenFold3 model.
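
For a flavor of the multi-node pattern (the examples we followed use DeepSpeed; this simpler sketch uses plain PyTorch DistributedDataParallel instead), here is a minimal training script meant to be launched with torchrun from a Slurm batch script. The tiny linear model and training loop are placeholders, not the OpenFold3 code.

```python
# Minimal multi-GPU training sketch using PyTorch DistributedDataParallel.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(10):                                  # placeholder loop
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        optimizer.zero_grad()
        loss.backward()          # gradients are all-reduced across all GPUs
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A Slurm batch script would typically wrap this with something like `srun torchrun --nnodes <N> --nproc_per_node 8 train_ddp.py`, letting the scheduler place the job on the autoscaling GPU queue.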

By implementing these solutions, we successfully overcame our initial challenges, creating a cost-effective, scalable, and efficient training environment for OpenFold3.

Conclusion

The successful training of OpenFold3 represents a significant collaboration between Novo Nordisk, Columbia University, and AWS that overcame three major challenges through innovative solutions:

  1. Novo Nordisk developed the Research Collaboration Platform (RCP), a secure and flexible research environment built on AWS that enabled seamless collaboration while maintaining strict security standards. This platform can be deployed in under two hours and provides dynamic scaling of computational resources.
  2. Together, we optimized the Multiple Sequence Alignment (MSA) generation process by leveraging AWS Graviton processors, achieving a 50% reduction in runtime and 55% reduction in costs. This enabled researchers to generate over one million MSAs per day with a highly scalable architecture.
  3. Together, we tackled the AI/ML training challenges by implementing a combination of Amazon EC2 Capacity Blocks for ML and Spot instances. Our distributed training infrastructure efficiently utilized 256 GPUs across 32 p5en.48xlarge EC2 machines.

These solutions not only made the project more cost-effective and efficient but also established a blueprint for future large-scale bioinformatics collaborations between Novo Nordisk and AWS.

The successful development of OpenFold3 advances the field of protein structure prediction, contributing to faster and more efficient therapeutic development processes.

Daniele Granata

Daniele Granata is a Principal Modelling Scientist in InSilico Protein Discovery at Novo Nordisk. He loves to build people and technological connections and to work at the frontiers of protein science, AI and HPC. He has a bachelor's and master's degree in Physics, a PhD in "Physics and Chemistry of Biological Systems" and 5 years of postdoctoral experience between the US and Denmark. In his last 6 years at Novo Nordisk, Daniele has thrived pushing data science and generative AI to design new drugs for patients, with a strong dedication to open sourcing work for the scientific community.

Anamaria Todor

Anamaria Todor is a Principal Solutions Architect based in Copenhagen, Denmark. She saw her first computer when she was 4 years old and has never let go of computer science, video games, and engineering since. She has worked in various technical roles, from freelancer and full-stack developer to data engineer, technical lead, and CTO, at companies in Denmark focused on the gaming and advertising industries. She has been at AWS for over 5 years, working as a Principal Solutions Architect focusing mainly on life sciences and AI/ML. Anamaria has a bachelor’s in Applied Engineering and Computer Science, a master’s degree in Computer Science, and over 10 years of AWS experience. When she’s not working or playing video games, she’s coaching girls and female professionals in understanding and finding their path through technology.

Arthur Grabman

Arthur Grabman is a Principal Technical Account Manager with over 15 years of experience, specialising in HPC for enterprise clients. He loves translating complex technical concepts into tangible business value. Outside of his professional life, Arthur enjoys traveling with his family.

Rômulo Jales

Rômulo Jales is a Senior Software Engineer at Novo Nordisk. With a strong background of more than 15 years in software development, Rômulo loves to get things done by creating simple and elegant solutions, even in mission-critical applications. Rômulo has a bachelor's in Computer Engineering, and he is still moved by the original curiosity of understanding how and why things work the way they do. For him, there is always something new to learn and have fun with.

Jacob Mevorach

Jacob Mevorach is a senior specialist for containers for healthcare and the life sciences at AWS. Jacob has a background in bioinformatics and machine learning. Prior to joining AWS, Jacob focused on enabling and conducting large scale analysis for genomics and other scientific areas.

Stig Bøgelund Nielsen

Stig Bøgelund Nielsen is a Senior Cloud & HPC Engineer at Novo Nordisk. A lifelong engagement with IT, beginning with early work supporting local businesses, has shaped his career. Almost 8 years at IBM provided a strong foundation, in roles such as IT Service and Infrastructure Architect, Team Lead, and Transition & Transformation Architect/Project Manager, all focused on delivering business value through effective IT strategies and execution. He joined Novo Nordisk almost two years ago as a Cloud & HPC Engineer and is now leading the Research Collaboration Platform (RCP) as Tech Lead, directly supporting groundbreaking research, including the OpenFold project. He is passionate about leveraging technology to enable research and drive innovation.