Seamless AI Acceleration for Developers Everywhere

To scale the AI opportunity, developers need access to the fastest methods of AI deployment, together with the performance that best suits their specific workloads. Arm is dedicated to maximizing AI performance across the entirety of the Arm platform, helping to ensure seamless acceleration for every developer, every model, and every workload.

Unprecedented AI on CPU Performance with Arm Kleidi

At the heart of all Arm platforms is the Arm CPU. Its ubiquity offers a flexible and energy-efficient target for many AI inference workloads, including deep learning and generative AI. Arm Kleidi, inspired by the Greek word for 'key', focuses on ensuring these workloads get the most out of the underlying Arm Cortex-A or Arm Neoverse CPU.

Arm Kleidi Pillars

Collaborating with Key Partners Unlocks AI Acceleration Everywhere

The mission of Arm Kleidi is to collaborate with leading AI frameworks, cloud service providers, and the ML ISV community to deliver out-of-the-box inference performance improvements across the full ML stack for billions of workloads, with no extra work or expertise required from developers.

PyTorch

Arm works closely with the PyTorch community, helping to ensure models running on PyTorch just work on Arm, driving seamless acceleration for even the most demanding AI workloads.

BERT-Large

Arm has been working to improve PyTorch inference performance on Arm CPUs, including optimizing its two primary execution modes: Eager Mode and Graph Mode.

Integrating Kleidi improves Llama model inference by up to 18 times and Gemma 2 2B inference by 15 times, and lifts performance for natural language processing (NLP) models, including a 2.2 times uplift on BERT-Large.
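
The difference between the two execution modes is easy to see in code. Below is a minimal, illustrative sketch, assuming PyTorch and the Hugging Face transformers package are installed; the checkpoint name is an example, not part of the Kleidi integration itself.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative checkpoint; any BERT-class model from the Hugging Face hub works.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased").eval()

inputs = tokenizer(
    "Arm CPUs are a flexible target for [MASK] inference.",
    return_tensors="pt",
)

# Eager Mode: operators dispatch one by one as the Python code executes.
with torch.inference_mode():
    eager_logits = model(**inputs).logits

# Graph Mode: torch.compile captures the model as a graph, letting the
# backend fuse operators and select optimized kernels for the host CPU.
compiled_model = torch.compile(model)
with torch.inference_mode():
    graph_logits = compiled_model(**inputs).logits

# Both modes compute the same result; only the execution strategy differs.
print(torch.allclose(eager_logits, graph_logits, atol=1e-4))
```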

Llama 3.1 8B

With KleidiAI optimizations applied to PyTorch, a chatbot demo running on Arm Neoverse V2-based Graviton4 processors achieves an estimated 12 times uplift in token generation rate.

This demo shows how easy it is to build AI applications using LLMs, making use of existing Arm-based compute capacity.
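
As a hedged sketch of what such an application looks like, the few lines below use the Hugging Face transformers pipeline on PyTorch's CPU backend; the model ID refers to the gated Llama 3.1 8B Instruct checkpoint and can be swapped for any causal LLM you have access to.

```python
from transformers import pipeline

# Illustrative: Llama 3.1 8B Instruct is a gated checkpoint on the Hugging
# Face hub; substitute any causal LLM you have access to. On an Arm-based
# server, generation runs on PyTorch's CPU backend, where the KleidiAI
# optimizations described above apply.
chatbot = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

messages = [{"role": "user", "content": "Why do CPUs suit LLM inference?"}]
reply = chatbot(messages, max_new_tokens=128)

# The pipeline returns the conversation with the assistant's turn appended.
print(reply[0]["generated_text"][-1]["content"])
```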

RoBERTa

AWS collaborated with Arm to optimize the PyTorch torch.compile feature for Neoverse V1-based Graviton3 processors with Arm Compute Library (ACL) kernels using oneDNN.

This optimization results in up to 2 times inference performance improvement for the most popular NLP models on Hugging Face.
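
One way to see whether the oneDNN path is being exercised is oneDNN's standard verbose trace. The sketch below is illustrative and assumes an aarch64 PyTorch build with ACL-backed oneDNN; the Linear layer simply stands in for a real model.

```python
import os

# oneDNN's standard verbose switch; set it before PyTorch initializes oneDNN
# so every executed primitive is traced to stdout.
os.environ["DNNL_VERBOSE"] = "1"

import torch

model = torch.nn.Linear(1024, 1024).eval()
compiled_model = torch.compile(model)

with torch.inference_mode():
    compiled_model(torch.randn(8, 1024))

# On an aarch64 PyTorch build with the ACL backend, the trace printed to
# stdout lists each primitive and the implementation chosen, making it
# visible when an optimized kernel (rather than a reference one) ran.
```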

FunASR Paraformer-Large

FunASR is an advanced open-source automatic speech recognition (ASR) toolkit developed by Alibaba DAMO Academy.

By integrating ACL with PyTorch via oneDNN, we have seen a 2.3 times performance improvement when running the Paraformer model on Neoverse N2-based Alibaba Cloud Yitian 710 processors.
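
Because FunASR exposes a simple Python API, the optimized path is transparent to calling code. A minimal sketch, where the "paraformer-zh" model-zoo alias and the audio file name are illustrative:

```python
from funasr import AutoModel

# Illustrative use of FunASR's Python API; "paraformer-zh" is the model-zoo
# alias for the Mandarin Paraformer model, and the audio file is a placeholder.
model = AutoModel(model="paraformer-zh")

# Transcription runs on the CPU, so the oneDNN + ACL integration described
# above applies without any change to this calling code.
result = model.generate(input="example.wav")
print(result[0]["text"])
```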

ExecuTorch

Together, Arm and ExecuTorch, Meta's lightweight ML framework, enable efficient on-device inference at the edge.

Llama 3.2 1B

Thanks to the collaborative efforts of Arm and Meta, AI developers can now run quantized Llama 3.2 models up to 20% faster than ever on Arm CPUs.

By integrating KleidiAI with ExecuTorch and developing optimized quantization schemes, we have achieved speeds of over 350 tokens per second on the prefill stage for generative AI workloads on mobile.
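
The export flow that produces an on-device program is the same for small and large models. Below is a minimal sketch of that flow with a toy module; exporting a full Llama 3.2 checkpoint follows the same capture, lower-to-edge, and serialize steps, for which the ExecuTorch repository ships dedicated scripts.

```python
import torch
from executorch.exir import to_edge

# Toy module standing in for a real network.
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Capture the model with torch.export, lower it to the edge dialect, then
# serialize it into ExecuTorch's on-device program format.
exported = torch.export.export(TinyModel().eval(), (torch.randn(1, 16),))
et_program = to_edge(exported).to_executorch()

# The resulting .pte file is what the on-device ExecuTorch runtime loads.
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```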

Llama.cpp

To demonstrate the capability of Arm-based CPUs for LLM inference, Arm and our partners have optimized the int4 and int8 quantized kernels implemented in llama.cpp to leverage newer Arm architecture instructions.

Phi 3 3.8B

Thanks to these optimizations, time-to-first-token (TTFT) for Microsoft’s Phi 3 LLM is accelerated by around 190% when running a chatbot demo on the Arm Cortex-X925 CPU, which is used in premium smartphones.

Llama 3 8B

Running a text generation demo on Graviton3 processors with our optimizations achieves a 2.5 times performance uplift in TTFT and over 35 tokens per second in the text generation phase, which is more than sufficient for real-time use cases.
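
These kernels are selected automatically when a quantized GGUF model runs through llama.cpp or its bindings. As an illustrative sketch using the llama-cpp-python bindings, where the model file name is a placeholder for any int4-quantized (for example, Q4_0) checkpoint you have locally:

```python
from llama_cpp import Llama

# The GGUF file name is a placeholder; llama.cpp picks its optimized Arm
# int4/int8 kernels automatically at run time based on the host CPU.
llm = Llama(model_path="llama-3-8b-instruct.Q4_0.gguf", n_ctx=2048)

output = llm("Explain KV caching in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```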

MediaPipe

Arm’s partnership with Google AI Edge on MediaPipe and XNNPACK is accelerating AI workloads on current and future Arm CPUs. This enables developers to deliver outstanding AI performance for mobile, web, edge and IoT.

Gemma 1 2B

Arm collaborated with Google AI Edge to integrate KleidiAI with their MediaPipe framework, which supports numerous LLMs, such as Gemma and Falcon.

Thanks to KleidiAI integration via XNNPACK, we have seen a 30% acceleration in TTFT when running a chatbot demo on the Gemma 2B LLM on Arm-based premium smartphones.

Hunyuan

Tencent’s Hunyuan AI framework supports Hunyuan LLM, a universal model that enables AI capabilities across a wide range of devices, including smartphones.

Hunyuan

Arm has been working with Tencent to integrate Kleidi technologies into Hunyuan, its LLM with over 100B parameters.

The partnership was announced at the 2024 Tencent Global Digital Ecosystem Summit and is expected to benefit real-world workloads.

Key Developer Technologies for Accelerating CPU Performance

Arm Kleidi includes the latest developer enablement technologies designed to advance AI model capability, accuracy, and speed.

The KleidiAI and KleidiCV libraries are collections of lightweight, highly optimized kernels designed to make it easy for machine learning (ML) and computer vision (CV) frameworks to target optimum performance and leverage the latest features for enhancing AI and CV in Arm CPU-based designs.

The Arm Compute Library (ACL) is a comprehensive and flexible library that enables independent software vendors to source ML functions optimized for Cortex-A and Neoverse CPUs. The library is OS agnostic and is portable to Android, Linux, and bare-metal systems.

Simplifying AI Deployment

Arm is committed to maximizing the ease and speed of AI deployment for developers. Kleidi is just one of the ways we are making AI optimizations accessible to millions.

Unleashing CPU Performance at Scale

Kleidi enables easy optimization across the full range of Arm Neoverse and Arm Cortex-A CPUs. These technologies leverage advanced features in the Arm architecture, such as the Arm Scalable Vector Extension (SVE) and the Arm Scalable Matrix Extension (SME), which target accelerated AI performance.
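
On Linux, you can check whether a given Arm CPU advertises these extensions by reading the feature flags the kernel exposes; the short sketch below assumes the standard aarch64 /proc/cpuinfo flag names.

```python
# Minimal check (Linux on Arm) for the architecture extensions named above.
# The kernel advertises supported features in /proc/cpuinfo "Features" lines.
with open("/proc/cpuinfo") as f:
    features = {
        flag
        for line in f
        if line.startswith("Features")
        for flag in line.split(":", 1)[1].split()
    }

for ext in ("sve", "sve2", "sme"):
    print(f"{ext}: {'supported' if ext in features else 'not reported'}")
```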
