litert-community/gemma-4-E2B-it-litert-lm

Main Model Card: google/gemma-4-E2B-it

This model card provides the Gemma 4 E2B model in a way that is ready for deployment on Android, iOS, Desktop, IoT and Web.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. This Gemma 4 model is small, which makes it well suited to on-device use cases. Running the model on device gives users private access to generative AI technology without requiring an internet connection.

These models are provided in the .litertlm format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google’s high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration via XNNPACK for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology that powers the Google AI Edge Gallery showcase app.

The model file is 2.58 GB, which includes a text decoder with 0.79 GB of weights and 1.12 GB of embedding parameters. The LiteRT-LM framework always keeps the main weights in memory, while the embedding parameters are memory-mapped, which yields significant working-memory savings on some platforms, as the detailed data below shows. The vision and audio models are loaded only when needed, which further reduces memory consumption.
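
The difference between weights held in memory and memory-mapped parameters can be illustrated with a small sketch. It uses Python's standard `mmap` module on a temporary file that stands in for an embedding table. The file layout, the sizes, and the helper names `write_fake_table` and `read_embedding_row` are assumptions made for illustration; this is not the .litertlm format and not how LiteRT-LM's loader is implemented.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: VOCAB rows of DIM float32 values, stored row-major.
VOCAB, DIM = 1000, 8
ROW_BYTES = DIM * 4


def write_fake_table(path):
    """Write a stand-in embedding table where row i is filled with float(i)."""
    with open(path, "wb") as f:
        for i in range(VOCAB):
            f.write(struct.pack(f"<{DIM}f", *([float(i)] * DIM)))


def read_embedding_row(mapped, token_id):
    """Read one row from the mapping; the OS pages in only what is touched."""
    start = token_id * ROW_BYTES
    return struct.unpack(f"<{DIM}f", mapped[start:start + ROW_BYTES])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.bin")
    write_fake_table(path)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            row = read_embedding_row(mapped, 42)
            mapped_size = len(mapped)
```

Because a lookup touches only the rows for the tokens actually used, the resident memory attributable to a mapped table can stay well below its size on disk, which is the effect the paragraph above describes.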

Try Gemma 4 E2B

Build with Gemma 4 E2B and LiteRT-LM

Ready to integrate this into your product? Get started here.

Gemma 4 E2B Performance on LiteRT-LM

All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model supports context lengths of up to 32k tokens. CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized; latency and memory usage may differ during the first run. Model size is the size of the file on disk.
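
As a rough consistency check on the figures in this card, time-to-first-token can be approximated as the prefill token count divided by the prefill speed, since load time is excluded. The sketch below applies that approximation to the S26 Ultra CPU row (557 tokens/sec prefill, 46.9 tokens/sec decode). The function names are hypothetical, and the formula is an approximation rather than the way the benchmark harness measures latency.

```python
PREFILL_TOKENS = 1024  # prompt length used in the benchmarks
DECODE_TOKENS = 256    # number of generated tokens used in the benchmarks


def estimate_ttft(prefill_tps, prefill_tokens=PREFILL_TOKENS):
    """Approximate time-to-first-token in seconds as prefill work / prefill speed."""
    return prefill_tokens / prefill_tps


def estimate_total(prefill_tps, decode_tps,
                   prefill_tokens=PREFILL_TOKENS, decode_tokens=DECODE_TOKENS):
    """Approximate end-to-end time in seconds: prefill time plus decode time."""
    return estimate_ttft(prefill_tps, prefill_tokens) + decode_tokens / decode_tps


# S26 Ultra, CPU backend, values taken from the Android table.
ttft = estimate_ttft(557)
total = estimate_total(557, 46.9)
print(f"estimated TTFT: {ttft:.2f} s, estimated total: {total:.2f} s")
```

The estimate for this row rounds to the reported 1.8 seconds; small differences on other rows are expected because the approximation ignores per-request overhead.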

CPU memory was measured using rusage::ru_maxrss on Android, Linux, and Raspberry Pi; task_vm_info::phys_footprint on iOS and MacBook; and process_memory_counters::PrivateUsage on Windows.
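
For readers who want to take a comparable peak-memory reading of their own process on Linux or macOS, the sketch below reads `ru_maxrss` through Python's standard `resource` module, which wraps the same `getrusage` call. The helper name `peak_rss_mb` is hypothetical, the module is not available on Windows, and this is not the harness used to produce the numbers in this card.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# Allocate and fill 64 MB so the pages are actually resident.
buffer = b"x" * (64 * 1024 * 1024)
after = peak_rss_mb()
print(f"peak RSS before: {before:.0f} MB, after: {after:.0f} MB")
```

Because `ru_maxrss` is a high-water mark, it never decreases during the life of the process, so a reading taken after inference reflects the largest footprint reached at any point, including model load.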

Android

Note: On supported Android devices, Gemma 4 is available through Android AI Core as Gemini Nano, which is the recommended path for production applications.

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
|---|---|---|---|---|---|---|
| S26 Ultra | CPU | 557 | 46.9 | 1.8 | 2583 | 1733 |
| S26 Ultra | GPU | 3,808 | 52.1 | 0.3 | 2583 | 676 |

iOS

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU/GPU Memory (MB) |
|---|---|---|---|---|---|---|
| iPhone 17 Pro | CPU | 532 | 25.0 | 1.9 | 2583 | 607 |
| iPhone 17 Pro | GPU | 2,878 | 56.5 | 0.3 | 2583 | 1450 |

Linux

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
|---|---|---|---|---|---|---|
| Arm 2.3 & 2.8GHz | CPU | 260 | 35.0 | 4.0 | 2583 | 1628 |
| NVIDIA GeForce RTX 4090 | GPU | 11,234 | 143.4 | 0.1 | 2583 | 913 |

macOS

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU/GPU Memory (MB) |
|---|---|---|---|---|---|---|
| MacBook Pro M4 Max | CPU | 901 | 41.6 | 1.1 | 2583 | 736 |
| MacBook Pro M4 Max | GPU | 7,835 | 160.2 | 0.1 | 2583 | 1623 |

Windows

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
|---|---|---|---|---|---|---|
| Intel LunarLake | CPU | 435 | 29.8 | 2.39 | 2583 | 3505 |
| Intel LunarLake | GPU | 3,751 | 48.4 | 0.29 | 2583 | 3540 |

IoT

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
|---|---|---|---|---|---|---|
| Raspberry Pi 5 16GB | CPU | 133 | 7.6 | 7.8 | 2583 | 1546 |
| Jetson Orin Nano | CPU | 109 | 12.2 | 9.4 | 2583 | 3681 |
| Jetson Orin Nano | GPU | 1,142 | 24.2 | 0.9 | 2583 | 2739 |
| Qualcomm Dragonwing IQ8 (IQ-8275) | NPU | 3,747 | 31.7 | 0.3 | 2967 | 1869 |

  • The NPU model was benchmarked with a context length of 4096 tokens.

Gemma 4 E2B on Web

Running Gemma inference on the web is currently supported through the LLM Inference Engine and uses the gemma-4-E2B-it-web.task model file. Try it out live in your browser (Chrome with WebGPU recommended). To start developing with it, download the web model and run it with our sample web page, or follow the guide to add it to your own app.

The web model was benchmarked in Chrome on a MacBook Pro 2024 (Apple M4 Max) with 1024 prefill tokens and 256 decode tokens. The model itself can support context lengths up to 128K.

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Initialization time (sec) | Model size (MB) | CPU Memory (GB) | GPU Memory (GB) |
|---|---|---|---|---|---|---|---|
| Web | GPU | 4,676 | 73.9 | 1.1 | 2004 | 1.5 | 1.8 |

  • GPU memory was measured as the "GPU Process" memory for all of Chrome while the model was running. It was 130MB when inactive, before any model loading took place.
  • CPU memory was measured for the entire tab while the model was running. It was 55MB when inactive, before any model loading took place.