LLM Evolution: Coding Agents, Open Models, & Why Pelicans Matter

The Last Six Months: A Pivotal Era for Large Language Models

At PyCon US 2026, Simon Willison delivered a rapid-fire summary of a transformative period for large language models. His five-minute lightning talk, now available as an annotated presentation, argues that the past six months mark a fundamental inflection point, with November 2025 as the pivotal month.

Willison identifies this as a moment when the balance of power in AI shifted perceptibly. The title of "best" model, often judged by community "vibes," changed hands five times among Anthropic, OpenAI, and Google in a single month, underscoring how frenetic the competition had become.

The November Inflection Point and the Rise of Reliable Coding Agents

November 2025 was critical. It began with Claude Sonnet 4.5 as the perceived leader, only to be rapidly overtaken by GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and finally Claude Opus 4.5. However, the real breakthrough was more profound than a simple leaderboard shuffle.

Coding agents crossed a crucial reliability threshold. Thanks to extensive Reinforcement Learning from Verifiable Rewards (RLVR) training by OpenAI and Anthropic, agents like Codex and Claude Code evolved from novelties into practical daily-driver tools. Developers could now use them to accomplish real work without constantly fixing fundamental errors.
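The talk summary does not describe how OpenAI or Anthropic actually implement RLVR, so the following is only a toy illustration of what "verifiable" means here: the reward comes from a mechanical check (running test cases) rather than from a human or model judgment. The function name verifiable_reward, its parameters, and the binary pass/fail scoring are all assumptions made for this sketch.

```python
from typing import Any, List, Tuple


def verifiable_reward(
    candidate_source: str,
    func_name: str,
    cases: List[Tuple[tuple, Any]],
) -> float:
    """Return 1.0 if the candidate code passes every test case, else 0.0.

    Hypothetical sketch: a real training pipeline would run untrusted code
    in a sandbox with time and memory limits, not with a bare exec().
    """
    namespace: dict = {}
    try:
        # Define whatever the candidate source defines.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        for args, expected in cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and crashes all earn no reward.
        return 0.0
    return 1.0


cases = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

print(verifiable_reward(good, "add", cases))
print(verifiable_reward(bad, "add", cases))
```

The point of the sketch is that the check is objective and repeatable, which is what lets it serve as a training signal for code-writing behavior.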

The Emergence of "Claws" and a New AI Ecosystem

Simultaneously, a new category of software began its ascent. A project initially named Warelay, which underwent several rebrandings (CLAWDIS, CLAWDBOT, Clawdbot, Moltbot), exploded onto the scene in February 2026 as OpenClaw.

OpenClaw is a "personal AI assistant," and its success spawned a generic term: "Claws." These locally run assistants became so popular that they reportedly caused Mac Minis, humorously described as "the perfect aquarium for your Claw," to sell out in Silicon Valley. The phenomenon highlights a move toward personalized, private AI.

The Surprising Capability of Local and Open-Weight Models

A parallel trend saw open-weight models dramatically exceed expectations. Google's Gemma 4 series emerged as the most capable open-weight models from a US company. More strikingly, Chinese labs released formidable contenders.

GLM-5.1 is a 1.5TB, 754-billion-parameter behemoth licensed under the MIT License, offering state-of-the-art performance for those with the hardware. Meanwhile, Qwen's Qwen3.6-35B-A3B, a 20.9GB model, demonstrated it could run on a laptop and, in Willison's unique "pelican riding a bicycle" benchmark, even outperform Claude Opus 4.7 in some aspects.
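The sizes quoted above can be turned into a rough storage cost per parameter. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), assumes the "35B" in the Qwen model name refers to about 35 billion total parameters, and ignores any non-weight data in the files. The helper name bits_per_parameter is made up for this example.

```python
def bits_per_parameter(size_bytes: float, n_params: float) -> float:
    """Average number of bits of file size per model parameter."""
    return size_bytes * 8 / n_params


# Figures taken from the paragraph above; units are assumed to be decimal.
glm_bits = bits_per_parameter(1.5e12, 754e9)   # GLM-5.1: 1.5TB, 754B params
qwen_bits = bits_per_parameter(20.9e9, 35e9)   # Qwen3.6-35B-A3B: 20.9GB, ~35B params (assumed)

print(round(glm_bits, 1))   # 15.9
print(round(qwen_bits, 1))  # 4.8
```

Under those assumptions, the GLM-5.1 figure is consistent with roughly 16-bit weights, while the Qwen file works out to under 5 bits per parameter, which suggests a quantized build and helps explain why it fits on a laptop.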

The Pelican Benchmark: A Quirky Measure of Progress

Willison's ongoing use of the "Generate an SVG of a pelican riding a bicycle" prompt provides a whimsical yet telling visual timeline. The test is deliberately absurd: pelicans can't ride bikes, and labs are unlikely to train for it, which makes it a useful probe for general drawing and instruction-following capability.
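The source does not describe any automated scoring for this benchmark, and judging whether a drawing resembles a pelican on a bicycle still takes a human eye. As a minimal sketch of a first sanity check one might run on a model's reply, the code below only verifies that the text parses as XML with an svg root element and counts the elements used. The function name summarize_svg and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

SVG_NS = "http://www.w3.org/2000/svg"


def summarize_svg(svg_text: str) -> Optional[Dict[str, int]]:
    """Return element counts if svg_text is well-formed SVG, else None."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return None  # not well-formed XML
    # Accept both namespaced and un-namespaced <svg> roots.
    if root.tag not in ("svg", f"{{{SVG_NS}}}svg"):
        return None
    counts: Dict[str, int] = {}
    for element in root.iter():
        name = element.tag.split("}")[-1]  # strip the "{namespace}" prefix
        counts[name] = counts.get(name, 0) + 1
    return counts


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 20 10">'
    '<circle cx="5" cy="7" r="2"/><circle cx="15" cy="7" r="2"/>'
    '<path d="M5 7 L10 3 L15 7"/>'
    "</svg>"
)

print(summarize_svg(sample))
print(summarize_svg("<svg><circle></svg>"))
```

A check like this says nothing about drawing quality; it only filters out replies that are not valid SVG before anyone looks at them.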

The progression from Claude Sonnet 4.5's simple pelican in September 2025 to Gemini 3.1 Pro's detailed illustration (complete with a fish in the basket) and GLM-5.1's animated attempts shows rapid improvement in visual reasoning and SVG generation, even for open-weight models.

Growing Pains: Security, Oversight, and Societal Influence

This period of breakneck advancement hasn't been without concerns, and research separate from Willison's talk highlights several of them. A Nature article argues that LLMs require a new form of capability-based monitoring, because performance degradation is complex and context-dependent.

Models can "overfit" to intrinsic factors (like outdated knowledge) or extrinsic ones (like specific human interactions), requiring nuanced fixes rather than blunt model retraining. Meanwhile, cybersecurity reports warn that AI-powered cyberattackers are improving rapidly, and AI coding assistants are exacerbating secrets-sprawl crises.

Perhaps most alarmingly, another Nature study provides evidence that government control of media globally influences LLM outputs. The models exhibit a stronger pro-government bias in responses when their training data includes content from state-controlled media, raising profound questions about neutrality and global digital discourse.

Conclusion: A New Phase of Practical and Accessible AI

The last six months crystallize two major themes, as summarized by Willison. First, coding agents got genuinely good, moving from research demos to professional tools. Second, locally run models wildly outperformed expectations, democratizing access to powerful AI.

This shift towards practical utility and accessible power defines the current era. However, it unfolds alongside serious questions about security, oversight, and inherent bias that the industry must address as these models become further embedded in our digital and professional lives. The race is no longer just about who has the best pelican; it's about who can build the most useful, trustworthy, and responsible AI systems.