Google Antigravity 2.0 Tops Practical LLM Benchmark for CAD Code

5/23/2026
Artificial Intelligence, CAD, Generative AI, Google

Antigravity 2.0 Emerges as a Leader in AI-Generated CAD

Google's newly launched Antigravity 2.0, running the Gemini 3.5 Flash model, delivered the highest-quality result in a benchmark that asked AI coding tools to generate a complex architectural 3D model. The test, conducted by ModelRift, tasked several leading AI coding tools with creating an OpenSCAD model of the Pantheon from reference images.

The results suggest a marked improvement in AI's ability to handle spatial reasoning and constructive geometry. Speed did not correlate with quality: Antigravity's methodical, detail-oriented approach was among the slowest, yet it produced the most architecturally faithful autonomous model, including intricate features like the interior coffered ceiling.

This performance arrives alongside Antigravity 2.0's public debut at Google I/O 2026, where Google unveiled major updates to its agentic coding platform. The new version includes a redesigned desktop application, a command-line interface (CLI) tool, and an SDK for building custom workflows, positioning it as a direct competitor to tools like Cursor.

The Pantheon Benchmark: A Test of Spatial Intelligence

ModelRift's benchmark was designed to move beyond simple syntax checks. The goal was to see how well AI systems could translate visual architectural references into parametric CAD code using OpenSCAD, a text-based 3D modeling language. The Pantheon was chosen specifically because its geometry—a radial rotunda, dome, portico, and columns—plays to OpenSCAD's strengths in Boolean operations and symmetry.

All tested agents had access to the OpenSCAD CLI to render previews and iterate. The core prompt instructed them to "see two ref images and build .scad file with openscad implementation of pantheon." This setup tested not just code generation, but also an AI's capacity for visual analysis and iterative refinement.
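The render-and-iterate loop described above can be sketched in a few lines of Python. This is an illustration, not ModelRift's actual harness: the function names and file names are hypothetical, and the sketch assumes an `openscad` binary is available on the PATH. The `-o`, `--imgsize`, and `-D` options are standard OpenSCAD command-line flags.

```python
import shutil
import subprocess
from pathlib import Path


def build_preview_command(scad_file, png_file, width=800, height=600, overrides=None):
    """Return the argument list for rendering a .scad file to a PNG preview."""
    cmd = ["openscad", "-o", str(png_file), f"--imgsize={width},{height}"]
    # -D assigns a top-level variable before the model is evaluated.
    for name, value in (overrides or {}).items():
        cmd += ["-D", f"{name}={value}"]
    cmd.append(str(scad_file))
    return cmd


def render_preview(scad_file, png_file, **kwargs):
    """Run OpenSCAD if it is installed; return True when a preview was written."""
    if shutil.which("openscad") is None:
        return False  # OpenSCAD is not on PATH, so nothing is rendered
    result = subprocess.run(
        build_preview_command(scad_file, png_file, **kwargs),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and Path(png_file).exists()


# Hypothetical file names; "cutaway" is an assumed variable in the model.
print(build_preview_command("pantheon.scad", "preview.png", overrides={"cutaway": "true"}))
```

An agent would call something like `render_preview` after each edit to the .scad file, inspect the image, and revise the code before rendering again.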

Benchmark Results: Antigravity Takes the Crown

The benchmark compared six different AI systems. The results, scored for both implementation speed and output quality, revealed clear tiers of performance.

  • Google Antigravity 2.0 / Gemini 3.5 Flash High: Achieved the top autonomous quality score of 4.5/5. It was the only agent to implement the Pantheon's signature interior coffered ceiling pattern and used real architectural dimensions from research. However, it was among the slowest, taking around 12 minutes.
  • ModelRift / Gemini Flash 3.0 (Human-in-the-loop): Scored 3.8/5, representing the best non-autonomous result. Using ModelRift's visual annotation workflow, a human provided feedback on renders, guiding the AI to a more coherent model in about 10 minutes.
  • Codex 5.5 High: Scored 3.0/5. It produced a model with impressive detail density, including the engraved "M AGRIPPA" inscription on the entablature. Its score was hampered by a mismatch between its preview render and the final exported STL mesh, which had geometry issues.
  • Claude Sonnet 4.6: Scored 3.4/5, producing the cleanest and most proportionally balanced model among the original autonomous batch, but was the slowest of that group.
  • Claude Opus 4.7: Scored 3.0/5, creating a structured but overly uniform and monochrome model.
  • Cursor Composer 2.5: Was the fastest (5/5 for speed) but produced the weakest output (1.4/5 for quality), resulting in a simplistic, placeholder-like model.

The ranking demonstrates that for complex spatial tasks, raw speed is a poor predictor of quality. The most deliberate and planned approaches yielded the most architecturally sound results.
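For readers who want the quality scores in one place, the short Python sketch below tabulates the figures quoted in the list and sorts them. The `Run` record and its field names are hypothetical. Speed is left out because the article gives a speed score for only one agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    agent: str
    quality: float  # quality score out of 5, as quoted in the article
    autonomous: bool


RUNS = [
    Run("Google Antigravity 2.0 / Gemini 3.5 Flash High", 4.5, True),
    Run("ModelRift / Gemini Flash 3.0", 3.8, False),
    Run("Codex 5.5 High", 3.0, True),
    Run("Claude Sonnet 4.6", 3.4, True),
    Run("Claude Opus 4.7", 3.0, True),
    Run("Cursor Composer 2.5", 1.4, True),
]

# sorted() is stable, so agents with equal scores keep their listed order.
ranked = sorted(RUNS, key=lambda r: r.quality, reverse=True)
for place, run in enumerate(ranked, start=1):
    mode = "autonomous" if run.autonomous else "human-in-the-loop"
    print(f"{place}. {run.agent}: {run.quality}/5 ({mode})")

autonomous = [r.quality for r in RUNS if r.autonomous]
mean_autonomous = sum(autonomous) / len(autonomous)
print(f"mean autonomous quality: {mean_autonomous:.2f}/5")
```

Sorted this way, the human-guided ModelRift run sits second, above four of the five autonomous agents.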

Inside Antigravity 2.0's Winning Approach

Google's Antigravity 2.0, announced at I/O 2026, represents a significant shift from its initial VS Code-based IDE. The new version is an agent-first desktop app that allows users to orchestrate multiple AI agents and execute tasks in parallel. Google stated that the new Gemini 3.5 Flash model was co-developed using Antigravity itself.

That foundation was visible in the benchmark. Unlike other agents, which estimated proportions visually, Antigravity stated in its plan that it would search for and use real Pantheon dimensions. Its implementation was a parametric model with a cutaway toggle to showcase interior details.
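A cutaway toggle of the kind described can be illustrated with a small Python helper that emits OpenSCAD source. This is a sketch, not Antigravity's actual output: the function name, the variable names, and the wall thickness are assumptions. The radius is only an approximation of the rotunda's interior, which is roughly 43.3 m across.

```python
def dome_shell_scad(radius=21.65, wall=1.5, cutaway=False):
    """Return OpenSCAD source for a hollow dome with an optional cutaway.

    The default radius and wall thickness are illustrative values in metres.
    """
    flag = "true" if cutaway else "false"
    return (
        f"cutaway = {flag};  // true exposes the interior\n"
        f"radius = {radius};\n"
        f"wall = {wall};\n"
        "big = 4 * (radius + wall);  // cutting cube, larger than the model\n"
        "\n"
        "difference() {\n"
        "    sphere(r = radius + wall, $fn = 96);\n"
        "    sphere(r = radius, $fn = 96);  // hollow out the interior\n"
        "    // Remove everything below z = 0 so only the dome remains.\n"
        "    translate([0, 0, -big / 2]) cube(big, center = true);\n"
        "    // When enabled, also remove the y < 0 half to show a section.\n"
        "    if (cutaway) translate([0, -big / 2, 0]) cube(big, center = true);\n"
        "}\n"
    )


print(dome_shell_scad(cutaway=True))
```

Because `cutaway` is an ordinary top-level variable in the generated file, it can be flipped without editing the geometry, which is what makes the model parametric.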

The agent's output went beyond basic shapes. It accurately modeled the 5 rings of 28 coffers inside the dome—a level of detail no other autonomous agent attempted. It also correctly mixed materials (grey and red for columns) and included a readable inscription. This suggests Gemini 3.5 Flash High possesses enhanced spatial reasoning and a stronger capacity for planning and executing multi-step, detail-oriented tasks.
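The coffer layout is a regular pattern, so its geometry can be worked out directly. The Python sketch below computes center points for 5 rings of 28 coffers on the inside of a hemisphere. The elevation range of the rings and the radius are illustrative assumptions, not values taken from the benchmark output.

```python
import math

RINGS = 5
COFFERS_PER_RING = 28


def coffer_centers(radius, first_elev_deg=15.0, last_elev_deg=55.0):
    """Return (x, y, z) centers for coffers on a hemisphere's inner surface.

    The elevation range is an assumed placeholder, not a surveyed value.
    """
    step = (last_elev_deg - first_elev_deg) / (RINGS - 1)
    centers = []
    for ring in range(RINGS):
        elev = math.radians(first_elev_deg + ring * step)
        for k in range(COFFERS_PER_RING):
            azim = 2 * math.pi * k / COFFERS_PER_RING
            centers.append((
                radius * math.cos(elev) * math.cos(azim),
                radius * math.cos(elev) * math.sin(azim),
                radius * math.sin(elev),
            ))
    return centers


centers = coffer_centers(radius=21.65)
print(len(centers))                       # 140 coffers in total
print(round(360 / COFFERS_PER_RING, 2))   # 12.86 degrees between neighbours
```

Each center could then drive a subtracted recess in the dome, which is the kind of repetition OpenSCAD expresses with a nested loop.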

The Human-in-the-Loop Advantage

While Antigravity won the autonomous category, the benchmark underscored the continued value of human guidance. The ModelRift/Gemini Flash 3.0 run, which employed a visual annotation workflow, achieved a quality score of 3.8/5, the second-highest result overall.

In this workflow, a user could draw arrows and notes directly on a 3D render to point out issues—like missing column capitals or incorrect roof proportions—and feed that visual feedback back to the AI. For spatial tasks, this proved to be a faster and more precise correction method than textual descriptions alone.

The takeaway is that human steering still adds value in precision CAD tasks: the guided run outscored every autonomous agent except Antigravity. Even the best AI benefits from targeted human steering, especially when nuanced aesthetic or proportional judgments are required.

OpenSCAD: The Ideal Language for AI-Generated Geometry

The benchmark validated OpenSCAD as a highly effective target language for AI-generated 3D geometry. Its text-based, procedural nature aligns well with how large language models reason about structure. Agents could directly describe operations like "make 28 repeated columns" or "subtract an oculus from a dome" in code.
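The two operations quoted above map almost word for word onto OpenSCAD constructs. The Python sketch below emits a small OpenSCAD program that uses a `for` loop with `rotate` for the repeated columns and `difference()` for the oculus. All dimensions are illustrative placeholders, and the function name is hypothetical.

```python
def rotunda_scad(columns=28, ring_radius=18.0, column_radius=0.6,
                 column_height=10.0, dome_radius=21.65, oculus_radius=4.0):
    """Return OpenSCAD source for a ring of columns under a dome with an oculus.

    Every default dimension here is an assumed value for illustration only.
    """
    lines = [
        "// Ring of identical columns placed by rotation.",
        f"for (i = [0 : {columns - 1}])",
        f"    rotate([0, 0, i * 360 / {columns}])",
        f"        translate([{ring_radius}, 0, 0])",
        f"            cylinder(h = {column_height}, r = {column_radius}, $fn = 32);",
        "",
        "// Dome resting on the columns, with an oculus cut through its crown.",
        f"translate([0, 0, {column_height}])",
        "    difference() {",
        f"        sphere(r = {dome_radius}, $fn = 96);",
        "        // Remove the lower half of the sphere.",
        f"        translate([0, 0, -{2 * dome_radius}]) cube({4 * dome_radius}, center = true);",
        "        // Vertical cylinder that punches the oculus.",
        f"        cylinder(h = {2 * dome_radius}, r = {oculus_radius}, $fn = 64);",
        "    }",
    ]
    return "\n".join(lines) + "\n"


print(rotunda_scad())
```

The generated file contains no hidden scene state: rerunning it always produces the same geometry, which is part of why a text-based format suits an iterating agent.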

This contrasts with AI-driven workflows for traditional 3D applications like Blender, where the AI must translate intent into a sequence of UI actions and maintain a mental model of a mutable scene state. OpenSCAD's deterministic, code-as-source approach provides a more transparent and reproducible foundation for AI collaboration.

Implications and Market Context

The benchmark results arrive amidst a rapidly evolving landscape for AI coding assistants. Google's launch of Antigravity 2.0, with its focus on multi-agent orchestration and custom workflows, signals a move beyond simple code completion toward more complex, project-level automation.

Gemini 3.5 Flash's performance comes at a cost. While it delivered top-tier results, Google's published API pricing shows it is significantly more expensive than its predecessor, Gemini 3 Flash. This creates a cost/performance trade-off that developers and platforms like ModelRift must navigate.

Furthermore, the benchmark reveals that tool access is no longer the primary bottleneck. All agents successfully used the OpenSCAD CLI. The differentiating factors are now geometric judgment, architectural understanding, and the ability to plan and iterate effectively.

Conclusion: A New Benchmark for AI in CAD

Google Antigravity 2.0's victory in the OpenSCAD Pantheon benchmark is more than a single test win. It demonstrates a maturing capability in AI to handle non-trivial, spatially complex coding tasks. The integration of research, parametric design, and attention to authentic detail points to a future where AI can act as a competent junior engineer for constructive geometry.

However, the benchmark also suggests that professional work will, for the foreseeable future, still benefit from a human in the loop. AI excels at generating a strong first draft and executing detailed plans, but human oversight remains crucial for final refinement and validation, especially when the output is destined for manufacturing or simulation.

As AI coding agents become more powerful and specialized, benchmarks like this will be essential for measuring true progress beyond simple code completion, assessing their ability to contribute to real-world engineering and design workflows.