AWS Partner Network (APN) Blog

Running GenAI Inference with AWS Graviton and Arcee AI Models

By Nolan Chen, Partner Solutions Architect – AWS
By Julien Simon, Chief Evangelist – Arcee AI
By Jeff Underhill, Principal Specialist Graviton – AWS
By Kinnar Sen, Principal Specialist Solutions Architect Compute – AWS


The growing demand for Generative AI (GenAI) applications has led to a corresponding demand for compute resources that can run these workloads efficiently. In this post, we share how GenAI inference workloads running AI models from Arcee AI can be optimized using AWS Graviton-based instances.

Large Language Models (LLMs) are pre-trained on vast amounts of data. They extract meaning from text sequences and analyze relationships between words and phrases to perform tasks such as answering questions, summarizing documents, and translating languages. While LLMs are capable of a wide variety of tasks, they require substantial compute resources to support hundreds of billions, and sometimes trillions, of parameters. Small Language Models (SLMs), in contrast, typically have 3 to 15 billion parameters and can generate responses more efficiently.

Arcee AI, an AWS Partner, was founded to build SLMs tailored to businesses across industries. Starting from open-source SLMs such as Llama-3 and Qwen2, Arcee AI relies on its model adaptation stack to improve the cost and efficiency of SLM text generation. At AWS re:Invent 2024, Arcee AI premiered Virtuoso, a family of models optimized for lightweight tasks and inference.

Innovative models go hand in hand with optimal infrastructure to host them. For over a decade, Amazon has been innovating across the entire stack, from software to silicon, with the goal of delivering higher-performing infrastructure at lower cost. AWS has developed a diverse silicon portfolio to meet various computing needs. At its core are three specialized processor types: AWS Trainium for AI training, AWS Inferentia for inference acceleration, and AWS Graviton for general-purpose processing. This range of options allows customers to select the most efficient solution for their specific AI and machine learning workloads, providing both choice and flexibility in how they run their applications on AWS infrastructure.

Why AWS Graviton for AI/ML?

AWS Graviton, shown in Figure 1, is a family of CPUs designed to deliver the best price performance for your cloud workloads. Over the years, new use cases have influenced CPU design; examples include cryptography instructions to secure the exchange of data and parallel processing for video and scientific simulations.


Figure 1: AWS Graviton Processors

The rise of AI/ML workloads has also influenced CPU design. Graviton3 includes more capacity for parallel processing and more memory bandwidth. These hardware improvements, combined with accompanying software support, deliver up to 25% better compute performance than Graviton2 processors. AWS Graviton4 processors further extend AI/ML price-performance with 50% more cores per processor (96 vs. 64 in Graviton3) and 75% more memory bandwidth. On top of these hardware advances, Graviton supports Arm instruction set extensions, such as Neon and SVE vector instructions with bfloat16 and int8 matrix-multiplication support, that accelerate the latest AI techniques.
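Before building anything, it can be useful to confirm which of these vector extensions your instance actually exposes. Below is a minimal Python sketch that reads /proc/cpuinfo on a Linux Arm instance; the specific flags to check (such as sve, bf16, and i8mm) reflect what Graviton3 and Graviton4 instances typically report, so treat the list as illustrative.

```python
# Minimal sketch: check which Arm AI-acceleration features the CPU reports.
# On Graviton3/Graviton4 Linux instances, /proc/cpuinfo typically lists
# flags such as "asimd" (Neon), "sve", "bf16", and "i8mm".
def read_cpu_features(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("Features"):
                return set(line.split(":", 1)[1].split())
    return set()

features = read_cpu_features()
for flag in ("asimd", "sve", "bf16", "i8mm"):
    print(f"{flag}: {'present' if flag in features else 'absent'}")
```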

The design techniques shown in Figure 2 work together to achieve three key benefits: superior price-performance, lower overall costs, and improved energy efficiency.


Figure 2: Why AWS Graviton?

Running SLMs on Graviton

We now walk through, at a high level, how Graviton-based instances can be deployed to run SLMs from Arcee AI. Figure 3 shows the steps for downloading an Arcee AI SLM, applying quantization, and then deploying the model for inference on a single Graviton4-based Amazon EC2 instance.


Figure 3: Running Arcee SLM on Graviton Steps

Below are the steps to set up an Arcee SLM on a Graviton instance.

Download Arcee SLM: First, we download an Arcee AI Virtuoso Lite model, which is available on Hugging Face.
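As a sketch of this step, the snippet below downloads the weights with the huggingface_hub library. The repo id shown is an assumption based on Arcee AI's Hugging Face organization; verify the exact name on the model page.

```python
# Sketch: download Virtuoso Lite weights from Hugging Face.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

# The repo id below is assumed; confirm it on Arcee AI's Hugging Face page.
local_dir = snapshot_download(
    repo_id="arcee-ai/Virtuoso-Lite",
    local_dir="models/virtuoso-lite",
)
print(f"Model downloaded to {local_dir}")
```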

Install llama.cpp: After we download a model, we install llama.cpp on a Graviton instance; in our case we used an Amazon EC2 r8g.8xlarge instance. llama.cpp is an open-source library that allows us to optimize models and run inference on a variety of hardware, including Graviton processors.
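The sketch below clones and builds llama.cpp from source with CMake, scripted in Python for consistency with the other examples. The steps mirror the llama.cpp README's build instructions, and the default native build picks up the Arm optimizations discussed above; exact build options may vary between releases.

```python
# Sketch: clone and build llama.cpp from source.
import subprocess

def run(cmd, cwd=None):
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git"])
run(["cmake", "-B", "build"], cwd="llama.cpp")             # configure
run(["cmake", "--build", "build", "-j"], cwd="llama.cpp")  # compile in parallel
```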

Quantize the Model: After installing llama.cpp, we use it to quantize the model. Quantization reduces computational requirements by converting high-precision floating-point weights (e.g., 32-bit) into lower-precision numbers (e.g., 8-bit). In exchange for a negligible reduction in accuracy, this one-time operation lets us run inference with fewer resources and therefore improves price-performance.

The quality of text generation can be measured with the perplexity metric, which evaluates a model's ability to predict the next token; it is computed as the exponential of the average negative log-likelihood per token, so a lower score indicates higher confidence in the next-token prediction. A perplexity score below 10 is considered good for LLMs. In general, quantizing a 16-bit model to 8-bit and then to 4-bit introduces minimal additional perplexity, and SLMs experience less accuracy degradation from quantization than LLMs do. As shown in Figure 4, moving to 4-bit makes inference run noticeably faster than with an 8-bit model. Any time you change the model, it is best practice to re-evaluate it to make sure it behaves as expected from both a functional and a performance perspective.
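To make this concrete, the sketch below converts the downloaded checkpoint to GGUF, quantizes it to 8-bit and 4-bit with llama.cpp's tools, and optionally measures perplexity on a held-out text file. The script and binary names (convert_hf_to_gguf.py, llama-quantize, llama-perplexity) match recent llama.cpp checkouts but can change between releases, so adjust paths to your build.

```python
# Sketch: convert to GGUF, quantize, and (optionally) measure perplexity.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Convert the Hugging Face checkpoint to a 16-bit GGUF file.
run(["python", "llama.cpp/convert_hf_to_gguf.py",
     "models/virtuoso-lite", "--outfile", "virtuoso-lite-f16.gguf"])

# 2. Quantize to 8-bit and 4-bit. On Graviton, Q4_0 weights can be
#    repacked at load time into an Arm-optimized layout.
run(["llama.cpp/build/bin/llama-quantize",
     "virtuoso-lite-f16.gguf", "virtuoso-lite-q8_0.gguf", "Q8_0"])
run(["llama.cpp/build/bin/llama-quantize",
     "virtuoso-lite-f16.gguf", "virtuoso-lite-q4_0.gguf", "Q4_0"])

# 3. Optional: re-evaluate quality after quantization on a test file
#    (e.g., WikiText-2) -- lower perplexity is better.
run(["llama.cpp/build/bin/llama-perplexity",
     "-m", "virtuoso-lite-q4_0.gguf", "-f", "wikitext-2-raw/wiki.test.raw"])
```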

Run Inference: Once quantization is complete, we run inference on our model using the llama.cpp CLI tool, as sketched below. You can explore models quantized to different bit depths to see how they perform relative to each other; Figure 4 then shows Arcee AI's results running on an EC2 r8g.8xlarge Graviton instance.
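A minimal invocation, assuming the llama-cli binary from the build above (flag names can vary across llama.cpp releases):

```python
# Sketch: generate text with the quantized model via the llama.cpp CLI.
import subprocess

subprocess.run([
    "llama.cpp/build/bin/llama-cli",
    "-m", "virtuoso-lite-q4_0.gguf",
    "-p", "Explain the benefits of small language models in two sentences.",
    "-n", "128",   # maximum number of tokens to generate
    "-t", "32",    # thread count; r8g.8xlarge provides 32 vCPUs
], check=True)
```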

 

| Metric | Virtuoso Small FP16 | Virtuoso Lite Q8 | Virtuoso Small Q4 (automatically repacked for Graviton4) |
|---|---|---|---|
| Tokens per second | 17.2 | 28.5 | 44.5 |
| Perplexity | 6.6953 | 6.7001 | 6.7627 |

Figure 4: Arcee AI’s Results Running SLM on Graviton r8g.8xlarge Instance

The 4-bit model is 1.6 times faster than the 8-bit model, producing 44.5 tokens per second versus 28.5, with only a 1% increase in perplexity. Compared to the 16-bit model's 17.2 tokens per second, the 4-bit model is 2.6 times faster.

Beyond inference with quantized models, you can also explore different sized Graviton instances for further price-performance comparisons. Other optimizations to explore include changing the batch and prompt sizes to see how much load an instance type can handle, as sketched below. Working in the AWS cloud allows you to dynamically provision different hardware on demand for experimentation and evaluation, so you can efficiently explore the problem space specific to your use case and determine the optimal combination for your needs.
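For structured comparisons, llama.cpp ships a benchmarking tool; the sketch below is one way to sweep prompt and generation lengths across the quantized models (binary name and flags per recent llama.cpp releases, so verify against your checkout).

```python
# Sketch: compare throughput across quantization levels with llama-bench.
import subprocess

for model in ("virtuoso-lite-q8_0.gguf", "virtuoso-lite-q4_0.gguf"):
    subprocess.run([
        "llama.cpp/build/bin/llama-bench",
        "-m", model,
        "-p", "512",   # prompt tokens to process
        "-n", "128",   # tokens to generate
    ], check=True)
```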

Conclusion

In this blog, we shared how Arcee AI SLMs running on AWS Graviton instances can deliver compelling price performance for AI/ML text generation workloads. By following the steps in this post, you can evaluate the power of Graviton CPUs for GenAI inference for yourself. Book a demo to learn how you can gain better price performance by building and deploying SLMs from Arcee AI on Graviton.

To get hands-on experience with SLMs on Graviton4, you can follow our streamlined process: launch a Graviton4 instance using our provided instructions, install Arcee AI models following the GitHub guide, and remember to clean up your instances afterward to avoid ongoing charges. We encourage you to explore these capabilities while following AWS best practices for resource management.


Arcee AI – AWS Partner Spotlight

Arcee AI is an AWS Advanced Technology Partner and AWS Competency Partner whose goal is to make world-class small language models (SLMs) available to companies across all industries. With their flagship product, Arcee Orchestra, Arcee AI takes SLMs to their full potential: leveraging them to work together in an easy-to-use platform for implementing custom agentic AI workflows.

Contact Arcee AI | Partner Overview | AWS Marketplace