Google's Gemini 'Omni' Video Model Emerges as Distilled Tool-Calling Model Hits GitHub
5/13/2026
Artificial Intelligence · Machine Learning · Google Gemini · On-Device AI

Gemini Expands Its Arsenal While a Tiny Offshoot Emerges

Google's Gemini ecosystem is making significant strides on multiple fronts. A feature now rolling out globally lets users generate a range of file formats, including PDFs, Word documents, Excel spreadsheets, Google Docs, and plain text, directly from the Gemini chat interface. This eliminates manual copying and reformatting and positions Gemini as a more direct competitor to ChatGPT's document-generation capabilities.

Simultaneously, early demonstrations of a new "Gemini Omni" video model have surfaced online. While details remain scarce, these demos show the model generating video from textual prompts, such as a scene of two men eating spaghetti, with results described as "fairly realistic" and "quite good." This suggests Google is actively pushing the boundaries of multimodal AI beyond static images and text.

Needle: Distilling Gemini's Power for the Edge

In a parallel development, the AI research group Cactus Compute has released "Needle," an open-source project that distills the tool-calling capabilities of large models such as Gemini into a remarkably small 26-million-parameter model. Built on a "Simple Attention Network" architecture, Needle is designed to run on "incredibly small devices" such as phones, watches, and glasses, enabling efficient on-device AI.

The model was pretrained on 200 billion tokens using 16 TPU v6e chips over 27 hours, followed by post-training on a specialized 2-billion-token dataset for single-shot function calls. According to the developers, Needle outperforms larger models like FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function call tasks, a key capability for personal AI assistants.
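
For a sense of scale, the reported pretraining figures imply a throughput of roughly 130k tokens per second per chip. A quick back-of-envelope check (using only the numbers stated above):

```python
# Back-of-envelope pretraining throughput implied by the reported figures:
# 200B tokens on 16 TPU v6e chips in 27 hours.
TOKENS = 200e9
CHIPS = 16
HOURS = 27

seconds = HOURS * 3600                    # 97,200 s
tokens_per_sec_total = TOKENS / seconds   # aggregate throughput
tokens_per_sec_chip = tokens_per_sec_total / CHIPS

print(f"aggregate: {tokens_per_sec_total:,.0f} tokens/s")
print(f"per chip:  {tokens_per_sec_chip:,.0f} tokens/s")
```

That works out to about 2.06M tokens/second in aggregate, or roughly 128,600 tokens/second per chip, consistent with a very small model saturating modern accelerator hardware.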

Why It Matters: The creation of Needle represents a significant shift towards making powerful AI features, like reliable tool and API calling, accessible on resource-constrained devices without constant cloud dependency. This opens the door for more private, responsive, and cost-effective AI applications.

Technical Architecture and Performance

Needle's architecture is a streamlined encoder-decoder design: a 12-layer encoder (using grouped-query attention and rotary positional embeddings, but no feed-forward networks) paired with an 8-layer decoder. A small BPE vocabulary of 8,192 tokens and embeddings shared between the encoder and decoder help keep the parameter count down.
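
To see how these design choices could plausibly add up to 26M parameters, here is an illustrative counter. The layer counts and vocabulary size come from the description above; the hidden size, GQA ratio, and decoder FFN width are not published, so the values below are assumptions picked to land near the stated total:

```python
# Illustrative parameter count for a Needle-like encoder-decoder.
# Known from the article: 12-layer encoder (GQA + RoPE, no feed-forward
# blocks), 8-layer decoder, BPE vocab of 8192, shared embeddings.
# ASSUMED (not published): hidden size, GQA ratio, decoder FFN expansion.
V = 8192          # vocabulary size (from the article)
d = 416           # hidden size (assumed)
gqa_ratio = 4     # query heads per KV head (assumed)
ffn_mult = 4      # decoder FFN expansion factor (assumed)

embed = V * d                                    # shared token embeddings
attn = 2 * d * d + 2 * d * (d // gqa_ratio)      # Q,O full-width; K,V grouped
encoder = 12 * attn                              # attention-only layers
decoder = 8 * (2 * attn + 2 * ffn_mult * d * d)  # self-attn + cross-attn + FFN
total = embed + encoder + decoder
print(f"~{total / 1e6:.1f}M parameters")
```

Under these assumptions the total comes to roughly 26.6M, in the right neighborhood; the real model's breakdown may differ.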

The project provides a full suite of tools for developers, including:

  • A web UI playground for testing and fine-tuning on custom tools.
  • A Python API for easy integration.
  • CLI commands for training, fine-tuning, evaluation, and data generation.
  • Support for fine-tuning locally on standard Mac/PC hardware.

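The single-shot function-call pattern Needle targets can be sketched with plain Python. Note this is a mock: `fake_model` stands in for the real model, and the actual Needle Python API is not documented in this article and may look quite different. The point is the contract, where the model sees tool schemas plus a query and must emit exactly one JSON function call, which the host then dispatches:

```python
import json

# Hypothetical single-shot tool-calling loop. The model emits one JSON
# call; the host parses it and invokes the matching local function.
TOOLS = {
    "get_weather": lambda city: f"18C and clear in {city}",
    "set_timer": lambda minutes: f"timer set for {minutes} min",
}

def fake_model(query: str, schemas: list[str]) -> str:
    # A real model would generate this from the prompt; we hard-code
    # one plausible single-shot call purely for demonstration.
    return json.dumps({"name": "get_weather", "arguments": {"city": "Oslo"}})

def dispatch(raw_call: str) -> str:
    call = json.loads(raw_call)
    fn = TOOLS[call["name"]]        # look up the requested tool
    return fn(**call["arguments"])  # invoke with the model's arguments

result = dispatch(fake_model("What's the weather in Oslo?", list(TOOLS)))
print(result)  # 18C and clear in Oslo
```

Because the model only has to produce one well-formed call per turn, a tiny 26M-parameter network can stay competitive on this task despite lacking general conversational ability.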
In production deployments on Cactus Compute's infrastructure, Needle reportedly reaches 6,000 tokens/second for prefill and 1,200 tokens/second for decoding.
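
Those throughput figures translate into interactive-grade latency. A rough estimate for one assumed request shape (the prompt and output lengths below are illustrative, not from the article):

```python
# Rough end-to-end latency implied by the reported production throughput:
# 6000 tok/s prefill and 1200 tok/s decode. Request shape is assumed.
PREFILL_TPS = 6000
DECODE_TPS = 1200

prompt_tokens = 512   # assumed prompt length
output_tokens = 64    # assumed generation length

latency = prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS
print(f"~{latency * 1000:.0f} ms for a {prompt_tokens}-token prompt "
      f"and {output_tokens} generated tokens")
```

Under these assumptions a full tool-call round trip lands well under 200 ms, which is what makes on-device assistants feel instant.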


A Glimpse into Google's Internal Testing

Adding to the week's Gemini news, a hidden model selector was discovered within the Google App (v17.18.22), revealing seven previously unreported AI model options for Gemini Live voice conversations. This appears to be an internal testing tool activated ahead of Google I/O 2026.

Testing showed these models produce measurably different responses. Key findings include:

  • Four models could access the user's location for live weather data; three could not.
  • One model, codenamed "Capybara," identified itself as "Gemini 3.1 Pro" instead of the standard "Gemini 3.1 Flash Live."
  • Two models caught a deliberately false claim made during testing, while others accepted it, indicating varying levels of fact-checking.
  • Three models promised to remember personal information, while others refused.

This reveals Google is actively experimenting with a suite of specialized models for interactive voice, likely refining them for future public release.

The Underrated Power of Gemini Canvas

Beyond these core developments, Google's Gemini Canvas is gaining recognition as a powerful, underutilized tool. It acts as a persistent workspace where users can develop ideas, plan projects, and even build simple tools without constantly switching between apps.

Users report employing Canvas for trip planning, research organization, task breakdowns, and creating lightweight tools like budget trackers. Its strength lies in maintaining context and allowing natural refinement of ideas and structures over time, positioning it as a flexible AI-augmented workspace rather than just a chat interface.

The Road Ahead for Compact and Capable AI

The emergence of Needle highlights a growing trend in AI: the distillation of large, cloud-based model capabilities into efficient, specialized models that can run on-device. This addresses critical concerns around latency, cost, privacy, and offline functionality.

Meanwhile, Google's continued expansion of Gemini's features—from file creation and video generation to a potential suite of voice models—shows a company aggressively iterating to capture market share and define the future of AI-assisted productivity. The confluence of these developments points to a future where powerful AI is both broadly accessible in the cloud and efficiently specialized at the edge.