ExecuTorch is PyTorch's unified solution for deploying AI models on-device, from smartphones to microcontrollers, built for privacy, performance, and portability. It powers Meta's on-device AI across Instagram, WhatsApp, Quest 3, Ray-Ban Meta Smart Glasses, and more.
Deploy LLMs, vision, speech, and multimodal models with the same PyTorch APIs you already know, accelerating the path from research to production with seamless model export, optimization, and deployment. No manual C++ rewrites. No format conversions. No vendor lock-in.
- Native PyTorch Export: Direct export from PyTorch. No .onnx, .tflite, or intermediate format conversions. Preserves model semantics.
- Production-Proven: Powers real-time on-device inference for billions of users across Meta's apps and devices.
- Tiny Runtime: 50KB base footprint. Runs on everything from microcontrollers to high-end smartphones.
- 12+ Hardware Backends: Open-source acceleration for Apple, Qualcomm, ARM, MediaTek, Vulkan, and more.
- One Export, Multiple Backends: Switch hardware targets with a single line change. Deploy the same model everywhere.
ExecuTorch uses ahead-of-time (AOT) compilation to prepare PyTorch models for edge deployment:
- Export: Capture your PyTorch model graph with `torch.export()`
- Compile: Quantize, optimize, and partition to hardware backends → `.pte`
- Execute: Load the `.pte` on-device via the lightweight C++ runtime
Models use a standardized Core ATen operator set. Partitioners delegate subgraphs to specialized hardware (NPU/GPU) with CPU fallback.
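As a minimal sketch of that flow (using a stand-in `torch.nn.Sequential` model and the XNNPACK partitioner shown in the quickstart below; any exportable module works), you can lower a model and print the resulting graph to see which subgraphs were delegated and which fall back to the portable CPU operators:

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Stand-in model; substitute your own torch.nn.Module.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 8),)

# Export to the Core ATen operator set, then partition for the XNNPACK backend.
exported = torch.export.export(model, example_inputs)
edge = to_edge_transform_and_lower(exported, partitioner=[XnnpackPartitioner()])

# Delegated subgraphs appear as call_delegate nodes; anything not claimed by
# the partitioner stays as Core ATen ops and runs on the portable CPU kernels.
print(edge.exported_program().graph_module)
```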
Learn more: How ExecuTorch Works • Architecture Guide
```bash
pip install executorch
```
For platform-specific setup (Android, iOS, embedded systems), see the Quick Start documentation.
```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# 1. Export your PyTorch model
model = MyModel().eval()
example_inputs = (torch.randn(1, 3, 224, 224),)
exported_program = torch.export.export(model, example_inputs)

# 2. Optimize for target hardware (switch backends with one line)
program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()]  # CPU | CoreMLPartitioner() for iOS | QnnPartitioner() for Qualcomm
).to_executorch()

# 3. Save for deployment
with open("model.pte", "wb") as f:
    f.write(program.buffer)

# Test locally via ExecuTorch runtime's pybind API (optional)
from executorch.runtime import Runtime

runtime = Runtime.get()
method = runtime.load_program("model.pte").load_method("forward")
outputs = method.execute([torch.randn(1, 3, 224, 224)])
```
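As an optional sanity check, here is a minimal sketch that reuses `model` and `method` from the snippet above and assumes the model returns a single tensor; it compares the runtime's output against eager PyTorch on the same input:

```python
import torch

# Run the same input through eager PyTorch and the ExecuTorch runtime,
# then compare numerically (tolerances are illustrative, not normative).
x = torch.randn(1, 3, 224, 224)
eager_out = model(x)
et_out = method.execute([x])[0]
torch.testing.assert_close(et_out, eager_out, rtol=1e-3, atol=1e-3)
```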
Run the exported model from C++, Swift (iOS), or Kotlin (Android):

C++:
```cpp
#include <executorch/extension/module/module.h>
#include <executorch/extension/tensor/tensor.h>

using namespace ::executorch::extension;

// Load the program and run a forward pass on a 2x2 input tensor.
Module module("model.pte");
auto tensor = make_tensor_ptr({2, 2}, {1.0f, 2.0f, 3.0f, 4.0f});
auto outputs = module.forward({tensor});
```
Swift (iOS):
```swift
let module = Module(filePath: "model.pte")
let input = Tensor<Float>([1.0, 2.0, 3.0, 4.0])
let outputs: [Value] = try module.forward([input])
```
Kotlin (Android):
```kotlin
val module = Module.load("model.pte")
val inputTensor = Tensor.fromBlob(floatArrayOf(1.0f, 2.0f, 3.0f, 4.0f), longArrayOf(2, 2))
val outputs = module.forward(EValue.from(inputTensor))
```
Export Llama models using the `export_llm` script or Optimum-ExecuTorch:
```bash
# Using export_llm
python -m executorch.extension.llm.export.export_llm --model llama3_2 --output llama.pte

# Using Optimum-ExecuTorch
optimum-cli export executorch \
  --model meta-llama/Llama-3.2-1B \
  --task text-generation \
  --recipe xnnpack \
  --output_dir llama_model
```
Run on-device with the LLM runner API:
```cpp
#include <executorch/extension/llm/runner/text_llm_runner.h>

auto runner = create_llama_runner("llama.pte", "tiktoken.bin");
executorch::extension::llm::GenerationConfig config{
    .seq_len = 128, .temperature = 0.8f};
runner->generate("Hello, how are you?", config);
```
Swift (iOS):
```swift
let runner = TextRunner(modelPath: "llama.pte", tokenizerPath: "tiktoken.bin")
try runner.generate("Hello, how are you?", Config {
  $0.sequenceLength = 128
}) { token in
  print(token, terminator: "")
}
```
Kotlin (Android): API Docs • Demo App
```kotlin
val llmModule = LlmModule("llama.pte", "tiktoken.bin", 0.8f)
llmModule.load()
llmModule.generate("Hello, how are you?", 128, object : LlmCallback {
    override fun onResult(result: String) { print(result) }
    override fun onStats(stats: String) { }
})
```
For multimodal models (vision, audio), use the multimodal runner API, which extends the LLM runner to handle image and audio inputs alongside text. See the Llava and Voxtral examples.
See `examples/models/llama` for the complete workflow, including quantization, mobile deployment, and advanced options.
Next Steps:
- Step-by-step tutorial: Complete walkthrough for your first model
- Colab notebook: Try ExecuTorch instantly in your browser
- Deploy Llama models: LLM workflow with quantization and mobile demos
| Platform | Supported Backends |
|---|---|
| Android | XNNPACK, Vulkan, Qualcomm, MediaTek, Samsung Exynos |
| iOS | XNNPACK, MPS, CoreML (Neural Engine) |
| Linux / Windows | XNNPACK, OpenVINO, CUDA (experimental) |
| macOS | XNNPACK, MPS, Metal (experimental) |
| Embedded / MCU | XNNPACK, ARM Ethos-U, NXP, Cadence DSP |
See Backend Documentation for detailed hardware requirements and optimization guides.
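To illustrate retargeting across the platforms in the table above, here is a rough sketch that lowers the same model once per target by swapping only the partitioner. The CoreML partitioner import path is an assumption and may differ between ExecuTorch releases; the XNNPACK import matches the quickstart.

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
# Assumed import path for the CoreML partitioner; verify against your installed release.
from executorch.backends.apple.coreml.partition import CoreMLPartitioner

model = torch.nn.Linear(4, 4).eval()  # stand-in model
targets = {"android_cpu": XnnpackPartitioner(), "ios_ane": CoreMLPartitioner()}

# Only the partitioner changes per target; the PyTorch model stays the same.
for name, partitioner in targets.items():
    exported = torch.export.export(model, (torch.randn(1, 4),))
    program = to_edge_transform_and_lower(exported, partitioner=[partitioner]).to_executorch()
    with open(f"model_{name}.pte", "wb") as f:
        f.write(program.buffer)
```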
ExecuTorch powers on-device AI at scale across Meta's family of apps, VR/AR devices, and partner deployments. View success stories →
LLMs: Llama 3.2/3.1/3, Qwen 3, Phi-4-mini, LiquidAI LFM2
Multimodal: Llava (vision-language), Voxtral (audio-language)
Vision/Speech: MobileNetV2, DeepLabV3, Whisper
Resources: the `examples/` directory • executorch-examples out-of-tree demos • Optimum-ExecuTorch for Hugging Face models
ExecuTorch provides advanced capabilities for production deployment:
- Quantization: Built-in support via torchao for 8-bit, 4-bit, and dynamic quantization
- Memory Planning: Optimize memory usage with ahead-of-time allocation strategies
- Developer Tools: ETDump profiler, ETRecord inspector, and model debugger
- Selective Build: Strip unused operators to minimize binary size
- Custom Operators: Extend with domain-specific kernels
- Dynamic Shapes: Support variable input sizes with bounded ranges (see the sketch after this list)
See Advanced Topics for quantization techniques, custom backends, and compiler passes.
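As a rough sketch of bounded dynamic shapes (a stand-in linear model; the tuple form of `dynamic_shapes` matches the example inputs positionally, and whether a given backend accepts a dynamic dimension depends on its operator coverage):

```python
import torch
from torch.export import Dim, export
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = torch.nn.Linear(16, 4).eval()  # stand-in model
example_inputs = (torch.randn(2, 16),)

# Declare a bounded symbolic batch dimension for the first (and only) input.
batch = Dim("batch", min=1, max=32)
exported = export(model, example_inputs, dynamic_shapes=({0: batch},))

# Lower as usual; the resulting .pte accepts batch sizes within the declared range.
program = to_edge_transform_and_lower(exported, partitioner=[XnnpackPartitioner()]).to_executorch()
with open("model_dynamic.pte", "wb") as f:
    f.write(program.buffer)
```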
- Documentation Home: Complete guides and tutorials
- API Reference: Python, C++, Java/Kotlin APIs
- Backend Integration: Build custom hardware backends
- Troubleshooting: Common issues and solutions
We welcome contributions from the community!
- GitHub Discussions: Ask questions and share ideas
- Discord: Chat with the team and community
- Issues: Report bugs or request features
- Contributing Guide: Guidelines and codebase structure
ExecuTorch is BSD licensed, as found in the LICENSE file.
Part of the PyTorch ecosystem
GitHub • Documentation