
Google LiteRT NeuroPilot Turns MediaTek NPUs Into an LLM Hub

Google's LiteRT NeuroPilot enhances MediaTek NPUs for on-device AI models.

What is LiteRT NeuroPilot Accelerator?

LiteRT is the successor to TensorFlow Lite: a high-performance on-device runtime that executes models in the .tflite FlatBuffer format and targets CPU, GPU, and now NPU backends through a unified hardware acceleration layer.

LiteRT NeuroPilot Accelerator is the new NPU path for MediaTek hardware. It replaces the older TFLite NeuroPilot delegate with a direct integration into the NeuroPilot compiler and runtime. Developers can now build and deploy LLMs through the Compiled Model API, which supports both Ahead-of-Time (AOT) and on-device compilation and is exposed through C++ and Kotlin APIs.

Unified Workflow for Fragmented NPUs

Historically, on-device ML stacks focused on CPU and GPU. NPU SDKs were vendor-specific, requiring separate compilation flows per SoC and leading to a combinatorial explosion of binaries. LiteRT NeuroPilot Accelerator simplifies this with a three-step workflow:

  1. Convert or load a .tflite model.
  2. Optionally run AOT compilation with LiteRT Python tools to produce an AI Pack for target SoCs.
  3. Deliver the AI Pack via Play for On-device AI (PODAI), selecting Accelerator.NPU at runtime, with fallback to GPU or CPU if needed (a runtime sketch of this step follows the list).
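
To make step 3 concrete, here is a minimal C++ sketch of requesting the NPU at runtime and falling back to GPU and then CPU. The fallback loop and the "model.tflite" path are illustrative assumptions rather than the documented workflow; only the accelerator flags and the CompiledModel::Create call mirror the LiteRT C++ API, and error handling is reduced to a HasValue() check.

// Hypothetical runtime selection for step 3: ask for the NPU first and
// retry on GPU, then CPU, if NPU compilation is unavailable on the device.
auto env = Environment::Create({});
auto model = Model::CreateFromFile("model.tflite");

auto compiled = CompiledModel::Create(*env, *model, kLiteRtHwAcceleratorNpu);
if (!compiled.HasValue()) {
  compiled = CompiledModel::Create(*env, *model, kLiteRtHwAcceleratorGpu);
}
if (!compiled.HasValue()) {
  compiled = CompiledModel::Create(*env, *model, kLiteRtHwAcceleratorCpu);
}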

Model Support and Performance

The stack targets specific open-weight models, including:

  • Qwen3 0.6B for text generation in specific markets.
  • Gemma-3-270M for sentiment analysis and entity extraction.
  • Gemma-3-1B for summarization and reasoning.
  • Gemma-3n E2B for multimodal applications like translation.

On the Dimensity 9500, the Gemma 3n E2B variant delivers more than 1,600 tokens per second in prefill and 28 tokens per second in decode, demonstrating the NPU's efficiency.

Developer Experience with Zero-Copy Buffers

The new C++ API streamlines development by integrating closely with Android's AHardwareBuffer. Input TensorBuffers can be constructed directly from OpenGL or OpenCL buffers, which is crucial for real-time processing pipelines that need to avoid intermediate copies.

Example code:

// Create the LiteRT environment; Expected<> error handling is elided for brevity
auto env = Environment::Create({});

// Load model compiled for NPU
auto model = Model::CreateFromFile("model.tflite");
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// Create compiled model
auto compiled = CompiledModel::Create(*env, *model, *options);

// Allocate buffers and run (input_span/output_span are application-provided float spans)
auto input_buffers = compiled->CreateInputBuffers();
auto output_buffers = compiled->CreateOutputBuffers();
input_buffers[0].Write<float>(input_span);
compiled->Run(input_buffers, output_buffers);
output_buffers[0].Read(output_span);
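
For the zero-copy path described above, an input TensorBuffer can wrap an existing AHardwareBuffer instead of being allocated and filled by copy. The sketch below is a hypothetical illustration: the CreateFromAhwb factory and TensorType() accessor are assumed from the LiteRT C++ headers and their exact signatures should be verified, and GetFrameBuffer() stands in for whatever the camera or GPU pipeline already produces.

// Hypothetical zero-copy input: wrap an AHardwareBuffer shared with the
// camera/GPU pipeline rather than copying into a newly allocated buffer.
// CreateFromAhwb and TensorType are assumed; check litert_tensor_buffer.h.
AHardwareBuffer* ahwb = GetFrameBuffer();  // assumed application-side helper
auto tensor_type = input_buffers[0].TensorType();
auto npu_input = TensorBuffer::CreateFromAhwb(*tensor_type, ahwb, /*ahwb_offset=*/0);

std::vector<TensorBuffer> zero_copy_inputs;
zero_copy_inputs.push_back(std::move(*npu_input));
compiled->Run(zero_copy_inputs, output_buffers);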

Key Takeaways

  1. LiteRT NeuroPilot marks a major step in NPU integration, replacing the older TFLite delegate.
  2. It focuses on open-weight models and provides a unified API for MediaTek NPUs.
  3. AOT compilation is essential for production LLM deployments, as on-device compilation can be time-consuming.
  4. On Dimensity NPUs, targeted models show significant performance improvements over CPU and GPU counterparts.
  5. The streamlined C++ and Kotlin APIs offer a cohesive development path for various hardware accelerators.