Seamless AI Acceleration for Developers Everywhere
To scale the AI opportunity, developers need access to the fastest methods of AI deployment, together with optimal performance that best suits their specific workload. Arm is dedicated to maximizing AI performance across the entirety of the Arm platform, helping to ensure seamless acceleration for every developer, every model, and every workload.
Unprecedented AI on CPU Performance with Arm Kleidi
At the heart of all Arm platforms is the Arm CPU. Its ubiquity offers a flexible and energy-efficient target for many AI inference workloads, including deep learning and generative AI. Arm Kleidi, inspired by the Greek word for 'key', focuses on ensuring these workloads get the most out of the underlying Arm Cortex-A or Arm Neoverse CPU.
Collaborating with Key Partners Unlocks AI Acceleration Everywhere
The mission of Arm Kleidi is to collaborate with leading AI frameworks, cloud service providers, and the ML ISV community to deliver full-stack, out-of-the-box inference performance improvements for billions of workloads, with no extra developer work or expertise required.
PyTorch
Arm works closely with the PyTorch community, helping to ensure models running on PyTorch just work on Arm—driving seamless acceleration for even the most demanding AI workloads.
BERT-Large
Arm has been working to improve PyTorch inference performance on Arm CPUs, including optimizing its two primary execution modes, Eager Mode and Graph Mode.
Integrating Kleidi improves Llama model inference by up to 18 times and Gemma 2 2B by 15 times, and boosts natural language processing (NLP) models, including a 2.2 times uplift on BERT-Large.
Llama 3.1 8B
Using Arm Neoverse V2-based Graviton4 processors, we can achieve an estimated 12 times uplift in token generation rate for a chatbot demo with KleidiAI optimizations applied to PyTorch.
This demo shows how easy it is to build AI applications using LLMs, making use of existing Arm-based compute capacity.
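As a rough illustration of what such a demo involves, the sketch below measures token-generation rate with PyTorch and the Hugging Face transformers library. The model ID and prompt are illustrative assumptions; recent PyTorch builds on Arm pick up the KleidiAI-accelerated kernels automatically, with no code changes required.

```python
# Minimal sketch: measure LLM token-generation rate with PyTorch on an
# Arm CPU. Recent PyTorch builds on Arm dispatch to KleidiAI-optimized
# kernels automatically; no code changes are needed.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("What is Arm Neoverse?", return_tensors="pt")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens per second")
```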
RoBERTa
AWS collaborated with Arm to optimize the PyTorch torch.compile feature for Neoverse V1-based Graviton3 processors with Arm Compute Library (ACL) kernels using oneDNN.
This optimization results in up to 2 times inference performance improvement for the most popular NLP models on Hugging Face.
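A minimal sketch of how a developer opts into this path, assuming the Hugging Face transformers library; the model choice is illustrative. On Graviton3, the compiled graph is lowered through oneDNN to ACL kernels, and AWS's Graviton guidance additionally documents a bfloat16 fast-math mode enabled via an environment variable.

```python
# Sketch: Graph Mode inference via torch.compile on an Arm Neoverse
# (Graviton3) instance. On Arm, oneDNN dispatches to Arm Compute
# Library (ACL) kernels under the hood.
# Optional, per AWS Graviton guidance: export DNNL_DEFAULT_FPMATH_MODE=BF16
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "roberta-base"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

compiled = torch.compile(model)  # Graph Mode; replaces Eager dispatch

batch = tokenizer(["Kleidi accelerates this model."], return_tensors="pt")
with torch.inference_mode():
    logits = compiled(**batch).logits
print(logits)
```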
FunASR Paraformer-Large
FunASR is an advanced open-source automatic speech recognition (ASR) toolkit developed by Alibaba DAMO Academy.
By integrating ACL with PyTorch via oneDNN, we have seen a 2.3 times performance improvement when running the Paraformer model on Neoverse N2-based AliCloud Yitian710 processors.
ExecuTorch
Together, Arm and ExecuTorch, a lightweight ML framework, enable efficient on-device inference capabilities at the edge.
Stable Audio Open
Stability AI and Arm have partnered to accelerate on-device generative AI, unlocking real-time audio generation capabilities without the need for an internet connection.
Through model distillation and Arm KleidiAI, Stable Audio Open now delivers text-to-audio generation on Arm-based smartphones 30 times faster than previously, letting users create high-quality sounds at the edge in seconds.
Llama 3.2 1B
Thanks to the collaborative efforts of Arm and Meta, AI developers can now run quantized Llama 3.2 models up to 20% faster than before on Arm CPUs.
By integrating KleidiAI with ExecuTorch and developing optimized quantization schemes, we have achieved speeds of over 350 tokens per second on the prefill stage for generative AI workloads on mobile.
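For orientation, the sketch below shows the generic ExecuTorch lowering flow with the XNNPACK backend, which is where KleidiAI micro-kernels are picked up on Arm. The toy model is a stand-in for a real network; the production Llama builds use dedicated export scripts in the ExecuTorch repository.

```python
# Sketch of the generic ExecuTorch export flow: export -> edge dialect
# -> delegate to XNNPACK (where KleidiAI micro-kernels are used on Arm)
# -> serialize a .pte program for the on-device runtime.
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

example_inputs = (torch.randn(1, 64),)
edge = to_edge(export(TinyModel().eval(), example_inputs))
edge = edge.to_backend(XnnpackPartitioner())  # delegate supported ops
et_program = edge.to_executorch()

with open("tiny_model.pte", "wb") as f:  # consumed by the ExecuTorch runtime
    f.write(et_program.buffer)
```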
Llama.cpp
To demonstrate the capability of Arm-based CPUs for LLM inference, Arm and our partners are optimizing the int4 and int8 kernels implemented in llama.cpp to leverage newer Arm architecture instructions, such as the Int8 matrix multiplication (i8mm) extension.
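As a simple way to try these kernels, one can load a 4-bit quantized GGUF model through the llama-cpp-python bindings; the model file below is an assumed local download.

```python
# Sketch: running a Q4_0 (int4) quantized GGUF model with the
# llama-cpp-python bindings. On Arm CPUs, llama.cpp selects optimized
# int4/int8 kernels for the hardware at runtime.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-8b-instruct.Q4_0.gguf",  # assumed local file
    n_ctx=2048,   # context window
    n_threads=8,  # match the number of performance cores
)

out = llm("Q: Why run LLMs on CPUs? A:", max_tokens=64)
print(out["choices"][0]["text"])
```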
Custom SLM
AWS and Arm have fine-tuned the TinyLlama 1.1B SLM to create a car-manual chatbot, enabling drivers to interact directly with their vehicle. Using KleidiAI, SLM inference is 10 times faster than previously on Arm Cortex-A76 CPUs, achieving response times of 3 seconds.
TinyLlama 1.1B
Using llama.cpp with KleidiAI, VicOne doubled prefill performance and improved encode by 60%. Our partnership enables fast in-vehicle cybersecurity threat detection by reducing cloud dependency, lowering costs, and keeping data secure onboard.
TinyStories
TinyStories is a dataset containing words a typical 3-year-old might understand. It can be used to train and evaluate small models below 10M parameters. When running a TinyStories-trained model on the Arm Cortex-A320 CPU, a performance uplift of over 70% has been achieved.
Llama 3.3 70B
In partnership with Meta and leveraging KleidiAI with 4-bit quantization, the smaller Llama 3.3 70B model achieved performance similar to the much larger Llama 3.1 405B, sustaining a consistent 50 tokens per second when deployed on Arm Neoverse-powered Google Axion processors.
Phi 3 3.8B
Due to our optimizations, the time-to-first-token (TTFT) for Microsoft’s Phi 3 LLM is accelerated by around 190% when running a chatbot demo on the Arm Cortex-X925 CPU, which is used in premium smartphones.
Llama 3 8B
Running a text generation demo on Graviton3 processors with our optimizations achieves a 2.5 times performance uplift in TTFT and over 35 tokens per second in the text generation phase, more than sufficient for real-time use cases.
Other Leading Frameworks
To maximize AI performance across the entire Arm compute platform, we are dedicated to optimizing inference workloads across all major AI and ML frameworks.
MNN
MNN is an open source deep learning framework developed by Alibaba. Our partnership helps improve performance and efficiency for on-device multimodal use cases.
As demonstrated with the multilingual instruction-tuned Qwen2-VL 2B model, integrating Kleidi with MNN accelerates prefill performance by 57% and decode by 28%.
OpenCV
With increasing demand for advanced, energy-efficient computer vision (CV) at the edge, KleidiCV helps ensure optimized performance for CV applications on Arm CPUs.
With KleidiCV now integrated into OpenCV 4.11, developers benefit from up to 4 times faster processing for key image processing tasks such as blur, filter, and rotation. This acceleration boosts performance for image segmentation, object detection, and recognition use cases.
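No API changes are required to benefit: the standard OpenCV calls below run on the KleidiCV-accelerated paths when using OpenCV 4.11 on Arm. The input filename and filter kernel are illustrative.

```python
# Sketch: everyday OpenCV operations that fall on KleidiCV-accelerated
# paths in OpenCV 4.11 on Arm CPUs. Input filename is illustrative.
import cv2
import numpy as np

img = cv2.imread("input.jpg")

blurred = cv2.GaussianBlur(img, (5, 5), 0)            # blur
kernel = np.ones((3, 3), np.float32) / 9.0            # simple box kernel
filtered = cv2.filter2D(img, -1, kernel)              # generic 2D filter
rotated = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)    # rotation
```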
MediaPipe
Arm’s partnership with Google AI Edge on MediaPipe and XNNPACK is accelerating AI workloads on current and future Arm CPUs. This enables developers to deliver outstanding AI performance for mobile, web, edge and IoT, using numerous LLMs, like Gemma and Falcon.
Thanks to Kleidi integration with MediaPipe via XNNPACK, a 30% acceleration in TTFT has been achieved when running a chatbot demo on the Gemma 1 2B LLM on Arm-based premium smartphones.
Angel
Tencent’s Angel ML framework supports Hunyuan LLM, available in sizes from 1B to over 300B parameters. It enables AI capabilities across a wide range of devices, including smartphones and Windows on Arm PCs.
Our partnership was announced at the 2024 Tencent Global Digital Ecosystem Summit and is having a positive impact on real-world workloads by providing users with even more powerful and efficient on-device AI services across Tencent’s many applications.
Key Developer Technologies for Accelerating CPU Performance
Arm Kleidi includes the latest developer enablement technologies designed to advance AI model capability, accuracy, and speed.
The KleidiAI and KleidiCV libraries provide lightweight, optimized kernels that make it easy for machine learning (ML) and computer vision (CV) frameworks to achieve optimal performance and leverage the latest features for enhancing AI and CV in Arm CPU-based designs.
The Arm Compute Library (ACL) is a fully comprehensive and flexible library that enables independent software vendors to source ML functions optimized for Cortex-A and Neoverse CPUs. The library is OS agnostic and is portable to Android, Linux, and bare-metal systems.

Simplifying AI Deployment
Arm is committed to maximizing the ease and speed of AI deployment for developers. Kleidi is just one of the ways we are making AI optimizations accessible to millions.

Unleashing CPU Performance at Scale
Kleidi enables easy optimization across the full range of Arm Neoverse and Arm Cortex-A CPUs. These technologies leverage advanced features of the Arm architecture, such as the Arm Scalable Vector Extension (SVE) and the Arm Scalable Matrix Extension (SME), to accelerate AI performance.
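As an aside, on Arm Linux systems a quick, unofficial way to check whether these architecture features are exposed is to read the kernel-reported CPU feature flags; the sketch below parses /proc/cpuinfo.

```python
# Sketch: detect SVE/SME support on an Arm Linux system by reading the
# kernel-reported CPU feature flags (first core's flags are used).
def arm_features() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("Features"):
                return set(line.split(":", 1)[1].split())
    return set()

feats = arm_features()
for feature in ("sve", "sve2", "sme", "i8mm"):
    print(f"{feature}: {'yes' if feature in feats else 'no'}")
```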