Gemma 4 E2B Powers Real-Time, On-Device AI Chat in Parlor Project
On-Device AI Gets a Real-Time Voice and Vision
A new open-source project called Parlor demonstrates a significant leap forward for local AI. It enables natural, real-time conversations with an AI using voice and a webcam, all running entirely on a developer's laptop without sending data to the cloud. The key enabling technology is Google's recently announced Gemma 4 E2B model.
Shared on Hacker News, Parlor showcases what Google's new generation of small, efficient models makes possible. The project uses the 2-billion-parameter E2B variant, designed specifically for edge devices, to process both spoken audio and webcam video frames. It then uses the Kokoro text-to-speech model to generate a vocal response.
The entire pipeline, from listening and seeing to understanding and speaking, runs locally on an Apple M3 Pro MacBook Pro. End-to-end latency is reported at 2.5 to 3 seconds, a feat that would have required high-end server GPUs just months ago. This opens doors for private, low-cost AI assistants.
Why Local, Real-Time AI Matters
The project's creator, Fikri Karim, states his motivation stems from running a free AI service for English language learners. Server costs for cloud-based AI models can be prohibitive. Running everything on-device eliminates that cost entirely, making such services sustainable.
Beyond cost, local execution offers profound benefits in privacy, latency, and accessibility. Users' conversations, camera feeds, and personal context never leave their device. There's no network lag for real-time interaction. As Karim speculates, if this runs on a Mac today, it could run on phones in a few years, enabling powerful AI companions anywhere.
This aligns perfectly with Google's vision for its Gemma 4 edge models. According to sources, the E2B and E4B models are built for "lightweight, on-device deployments" on smartphones, IoT devices, and Raspberry Pis. They feature a 128K token context window and are optimized for low latency and battery efficiency.
Inside the Parlor Tech Stack
Parlor is architected as a local web application. The browser handles audio capture via microphone and images from the webcam. A key component is Silero VAD (Voice Activity Detection) running in the browser, which allows for hands-free, "barge-in" conversations where the user can interrupt the AI.
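Parlor uses Silero's neural VAD in the browser; as a simplified illustration of the underlying idea only (not Parlor's actual code), a crude energy-based detector in Python might look like this, with the threshold value an arbitrary assumption:

```python
import array

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude energy-based voice activity check on a 16-bit PCM frame.

    A toy stand-in for Silero VAD's neural model, for illustration only.
    """
    samples = array.array("h", frame)  # interpret bytes as signed 16-bit samples
    if not samples:
        return False
    mean_abs = sum(abs(s) for s in samples) / len(samples)
    return mean_abs > threshold

def should_interrupt(frame: bytes, ai_is_speaking: bool) -> bool:
    """Barge-in: if the user starts speaking while the AI is talking,
    the current playback should be cancelled so the AI can listen again."""
    return ai_is_speaking and is_speech(frame)
```

The same decision logic is what makes hands-free interruption possible: playback is cut the moment speech energy is detected mid-response.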
The audio and video data is streamed via WebSocket to a local Python server powered by FastAPI. This server hosts the core AI models:
- Gemma 4 E2B via LiteRT-LM: The model runs on the Mac's GPU (using Apple's Metal Performance Shaders) and handles the multimodal understanding, generating text responses.
- Kokoro TTS: The text response is converted to speech. On macOS, it uses Apple's MLX framework; on Linux, it uses ONNX.
The audio response is streamed back to the browser in chunks, allowing playback to start before the full sentence is generated. This creates a more natural, responsive feel.
Gemma 4: The Engine of the Revolution
Parlor is a practical demonstration of the capabilities Google packed into Gemma 4, announced in early April 2026. The release includes four models: the edge-focused E2B and E4B, and two larger models (26B MoE and 31B Dense) for servers and high-end GPUs.
What makes the edge models special for projects like Parlor is their native multimodality. As reported by SiliconANGLE and Geeky Gadgets, all Gemma 4 models process images and video, but the E2B and E4B variants uniquely add native audio input support. This allows for direct speech understanding without a separate transcription model.
Google also highlights major improvements in multi-step reasoning and native support for function calling and structured JSON output. This makes them far more capable of powering "agentic" workflows—AIs that can autonomously use tools and execute plans. For developers, this is a game-changer for building sophisticated local apps.
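Structured output is what makes such tool use straightforward to wire up: the app can parse the model's reply as JSON and dispatch it directly. A hedged illustration of the pattern follows; the schema and tool name are invented for the example and are not part of Gemma 4's documented API:

```python
import json

# Hypothetical tool registry for an agentic workflow.
def set_timer(minutes: int) -> str:
    return f"Timer set for {minutes} minutes."

TOOLS = {"set_timer": set_timer}

def dispatch(model_output: str) -> str:
    """Parse a structured JSON tool call emitted by the model and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["tool"]]
    return fn(**call["args"])

# Example: the model emits a JSON tool call instead of free text.
result = dispatch('{"tool": "set_timer", "args": {"minutes": 5}}')
```

The value of native structured output is that the `json.loads` step can be trusted to succeed, rather than scraping a tool call out of conversational prose.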
Performance and the Road to Phones
On an Apple M3 Pro, Parlor's benchmarks are telling. The speech and vision understanding phase takes 1.8-2.2 seconds, response generation adds ~0.3s, and TTS takes 0.3-0.7s. The model decodes at about 83 tokens per second on the GPU.
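Summing the reported stage timings gives a budget consistent with the quoted 2.5-3 second end-to-end figure. The numbers below are those from the benchmarks; only the arithmetic is added:

```python
# Reported per-stage latencies in seconds (low, high).
understanding = (1.8, 2.2)  # speech + vision understanding
generation = (0.3, 0.3)     # response generation
tts = (0.3, 0.7)            # text-to-speech

low = understanding[0] + generation[0] + tts[0]
high = understanding[1] + generation[1] + tts[1]
print(f"end-to-end: {low:.1f}-{high:.1f} s")  # roughly the reported 2.5-3 s range

# At ~83 tokens/s decode, a ~0.3 s generation phase implies a short
# reply of roughly 25 tokens.
tokens = round(83 * 0.3)
```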
Google's own performance claims for the edge models are aggressive. Ars Technica reports they offer "near-zero latency" and use up to 60% less battery than Gemma 3 while being up to four times faster. The Android Developers Blog notes the E2B runs three times faster than the E4B.
Most significantly, sources confirm that Gemma 4 E2B and E4B are the foundation for Gemini Nano 4, Google's next-generation on-device model for Android. The Next Web and Ars Technica report that this model will arrive on consumer devices, like Pixel phones, later in 2026. This means prototypes built with Gemma 4 today will be forward-compatible with the AI running on billions of phones tomorrow.
Implications and the Future of Local AI
The combination of Parlor's demo and Gemma 4's specs paints a clear picture of the near future. Powerful, multimodal AI that can see, hear, and converse intelligently will move from the cloud to our pockets. This enables a new class of applications: real-time translation tutors, vision-based navigation aids, privacy-first personal assistants, and interactive educational tools.
Google's shift to the permissive Apache 2.0 license for Gemma 4, as noted by Ars Technica, further accelerates this trend. It removes commercial restrictions, encouraging broad adoption and integration. With over 400 million downloads of the Gemma family to date, the ecosystem is poised for explosive growth.
Parlor, while a "research preview" with rough edges, is a tangible prototype of this future. It proves that real-time, conversational AI with vision is no longer science fiction or confined to tech demos from large corporations. It's a downloadable project that runs on a developer's laptop today, hinting at the intelligent, always-available, and private computing experiences coming to mainstream devices very soon.