Google Antigravity User Manual 2026

Understanding the Agent Centric Paradigm of Antigravity

I view Google Antigravity as a fundamental departure from the traditional integrated development environment. It functions as a mission control center where autonomous agents handle the heavy lifting of planning and implementation. When I interact with this platform, I am not just writing code; I am directing a team of digital entities that can navigate the editor, command the terminal, and browse the web to verify their own work. The core of this experience lies in the use of artifacts such as the **Implementation Plan** and the **Task List**. These are not merely documents; they are the logical checkpoints that allow you to maintain oversight without getting lost in the syntax. The shift from manual input to agentic supervision requires a change in mindset. You become an architect who approves blueprints rather than a bricklayer who places every stone. The platform uses a specialized **Agent Manager** surface to consolidate these activities, making the entire lifecycle of a feature visible and manageable in a single view.

Switching between the Editor and Agent Manager is as fast as hitting `Cmd+E`, and opening the terminal panel within the Agent Manager takes a quick `Cmd/Ctrl+J`. I find that this approach reduces the cognitive load associated with context switching because the agent maintains the state across different tools on your behalf.

One subtlety that took me time to internalize: the agent is not a glorified autocomplete. It reads your project, understands your architecture, proposes a plan, and then executes that plan across files, terminals, and even a browser instance. Your role is supervisory. You review the Implementation Plan, approve or reject, and the agent proceeds. This feedback loop is what separates Antigravity from every other code assistant I have used.

Quick Reference

Agent Manager feels disconnected from the editor

Not sure when to intervene vs let the agent run

Context feels lost between sessions

Use Cmd+E to toggle between Editor and Agent Manager instantly

Review the Implementation Plan artifact before approving any execution

Start fresh sessions for new features to keep context clean

Adapting to Multi-Agent Workflows

One of the most profound shifts in my daily development cycle involves embracing a multi-agent workflow. Antigravity fundamentally supports multitasking, allowing you to run parallel operations without blocking your primary editor. When I assign a comprehensive, long-running task to an agent, I do not sit idly waiting for the output. I immediately spin up another agent or focus on a different part of the codebase. This capability shines when integrating third-party intelligence. Because Antigravity is built on a VSCode foundation, I frequently install the official "Codex - OpenAI's coding agent" extension. This lets me bring their most powerful flagship models, such as GPT-5.4, directly into the same environment alongside Gemini. OpenAI's flagship models are incredibly capable for heavy implementation tasks but can sometimes be slower in generating output compared to Gemini. While Codex is busy implementing a complex feature, I use the native Gemini agent to draft documentation or refactor a separate service. I essentially unite two major competitors within a single workspace. A practical pattern I often use involves delegating distinct roles for the same project. I assign the initial architectural planning and code review to Gemini, taking advantage of its speed and deep repository tracking. Once the plan is approved, I hand the heavy implementation phase over to the Codex agent. This dual-agent strategy ensures rapid planning and rigorous execution, significantly increasing my overall throughput.

Quick Reference

Waiting for long-running agent tasks interrupts development momentum

Slower models block the main workflow

Need different types of reasoning for planning versus coding

Open a new editor tab and start a second agent while the first one processes

Install the official Codex extension to run heavy implementation tasks in parallel with Gemini

Delegate fast architecture reviews to Gemini and assign deep implementation writing to Codex

Strategic Model Selection for Optimized Workflows

I have observed that the effectiveness of Antigravity depends heavily on the model you select for the specific task at hand. By April 2026, the platform provides access to a curated selection of intelligence through the **Google Vertex Model Garden**. This is not a bring-your-own-key setup; Antigravity relies on the integrated models provided within the environment. I often choose **Gemini 3.1 Pro** when I need deep reasoning for complex refactoring or architectural decisions. This model possesses the cognitive depth required to understand intricate dependencies within a large codebase. On the other hand, I prefer **Gemini 3.1 Flash** for rapid iterations and unit testing because its speed allows for a much faster feedback loop.

I also see value in utilizing third-party models available through Vertex. **Claude Opus 4.6** excels at deep, sustained reasoning for enterprise agent workflows. **GPT-OSS** (both the 120B and the lighter 20B variant) offers open-weight flexibility for projects where you need full control over the inference stack.

Antigravity exposes **High** and **Low** variants for its models in the settings. The exact parameters these variants control (thinking depth, token limits, tool invocation priority) are not published as a single formal specification, so I recommend testing both on your specific workload to find the right balance between output quality and response speed. I also recommend adjusting the **thinking level** parameters when using the Gemini API directly. Setting `thinking_level` to `MEDIUM` instead of the default can cut latency significantly on tasks that do not require exhaustive chain-of-thought reasoning, while still producing high quality output.
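To make this routing concrete, here is a minimal sketch of how I encode the model-selection heuristic in my own tooling. The task categories, model identifiers, and `thinking_level` values below are illustrative assumptions of mine, not an official Antigravity API:

```python
# Hypothetical routing table mapping a task category to a model and a
# thinking level. The names mirror the guidance above but are
# illustrative, not an official API surface.
ROUTES = {
    "architecture": ("gemini-3.1-pro", "HIGH"),
    "refactor":     ("gemini-3.1-pro", "MEDIUM"),
    "unit-tests":   ("gemini-3.1-flash", "MEDIUM"),
    "formatting":   ("gemini-3.1-flash", "LOW"),
}

def pick_model(task_category: str) -> tuple:
    """Return (model_id, thinking_level) for a task.

    Unknown categories default to the cheapest, fastest option,
    matching the advice to reserve Pro for work that needs depth."""
    return ROUTES.get(task_category, ("gemini-3.1-flash", "LOW"))
```

The point of the table is discipline: the expensive configuration is only reachable through an explicit category, never by default.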

Quick Reference

Model responds slowly on simple tasks

Output quality drops with the faster model

Not sure which model to pick for a given task

Switch to Gemini Flash or set thinking level to MEDIUM for routine work

Use Gemini 3.1 Pro (High) for architectural decisions and complex refactoring

Reserve Claude Opus 4.6 for deep analytical reviews where cost is secondary

Performance Tuning and Latency Management

I know that a slow editor can disrupt your creative flow, so I prioritize performance optimization in my Antigravity setup. Over time I have distilled this into a five-step methodology that I apply to every new project.

### Step 1: Choose the Right Mode

**Fast Mode** is designed to bypass the overhead of detailed planning for localized changes. I find it essential for quick fixes, small refactors, and single-file edits. If you are experiencing lag on minor tasks, switching from Planning to Fast mode often eliminates the delay entirely.

### Step 2: Select a Speed-Appropriate Model

Gemini Flash is purpose-built for throughput. Antigravity's own documentation highlights that Flash produces faster plan/execution cycles in agentic workflows. When speed matters more than deep reasoning, Flash is the answer.

### Step 3: Manage Conversation Length

As a dialogue grows, the amount of data the agent must process increases, which leads to a noticeable drop in responsiveness. I make it a habit to start fresh sessions for new features. When a session has accumulated too much history, I ask the agent to dump the current progress into a bullet-point summary, then start a new conversation with that summary as the only context.

### Step 4: Isolate MCP and Integration Overhead

Problematic or heavy MCP servers can create instability through large handshake/command sequences. When diagnosing lag, I first disable all MCP servers and run a baseline test (a simple "Hello" message). If the lag disappears, I re-enable servers one by one to find the culprit.

### Step 5: Reduce Local Resource Pressure

Antigravity is a desktop application that runs locally. Indexing large repositories, running background telemetry, and maintaining unused integrations all consume CPU. I limit the agent's scope to specific directories, disable telemetry in settings, and close extensions I am not actively using.
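Step 4 above can be automated. The sketch below assumes you supply a hypothetical `measure_latency` callback (run a baseline "Hello" prompt with a given set of MCP servers enabled and return the response time in milliseconds); the function then re-enables servers one by one and reports the first one that pushes latency past a threshold:

```python
def find_slow_server(servers, measure_latency, threshold_ms=500.0):
    """Re-enable MCP servers one at a time; return the first server that
    pushes latency more than `threshold_ms` above the baseline, or None
    if every server stays within bounds.

    `measure_latency` is a user-supplied callback: given a list of
    enabled servers, it runs a baseline prompt and returns milliseconds."""
    baseline = measure_latency([])          # all servers disabled
    enabled = []
    for server in servers:
        enabled.append(server)
        if measure_latency(enabled) - baseline > threshold_ms:
            return server
    return None
```

In practice I run this by hand rather than in code, but the search order is the same: baseline first, then one server at a time.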
When working with cloud models, I use **global endpoints** provided by Vertex AI to ensure the highest availability and the lowest possible latency. Vertex explicitly states that global endpoints offer higher reliability than single-region endpoints.

Quick Reference

Antigravity feels sluggish on a powerful machine

CPU usage spikes to 100% during indexing

Response times degrade as conversation grows

Disable MCP servers and test baseline, then re-enable one by one

Limit agent scope to specific directories to prevent full-repo indexing

Start fresh sessions and carry forward only a bullet-point summary

Extending Agent Capabilities with Model Context Protocol

I believe the true power of Antigravity is realized when you connect it to your external tools and data sources through the **Model Context Protocol (MCP)**. This protocol allows the agent to act as a bridge between your code and your infrastructure. I use the `mcp_config.json` file located at `~/.gemini/antigravity/mcp_config.json` to define these connections. The UI path is: **Settings > Manage MCP Servers > View raw config**. By adding specialized servers, I enable my agents to query databases, interact with internal APIs, or even manage cloud resources directly from the IDE.

Here is a practical example of a global MCP configuration that connects two servers, one for Google Developer Knowledge and one for an n8n automation workflow:

```json
{
  "mcpServers": {
    "google-developer-knowledge": {
      "serverUrl": "https://developerknowledge.googleapis.com/mcp"
    },
    "n8n-mcp": {
      "command": "node",
      "args": ["path/to/n8n-mcp/dist/mcp/index.js"],
      "env": {
        "MCP_MODE": "stdio",
        "N8N_API_URL": "http://localhost:5678",
        "N8N_API_KEY": "YOUR_SECRET"
      }
    }
  }
}
```

I suggest being precise with the environment variables and command arguments in your configuration. A minor typo in a path or a missing env variable will cause silent connection failures that are difficult to diagnose. While Antigravity does not allow you to replace the core reasoning model with local alternatives like Ollama, you can still use MCP to call local services as tools. This hybrid approach gives you cloud model reasoning with local tool access. For example, I connect a local documentation MCP server so the agent fetches the latest API specs without me providing them manually.
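Because these failures are silent, I lint the config before launching Antigravity. This is a minimal sketch assuming the `mcpServers` structure shown above; the specific checks (command on PATH, script file exists, non-empty env values) are my own heuristics, not part of the MCP specification:

```python
import json
import os
import shutil

def lint_mcp_config(raw: str) -> list:
    """Return warnings for common silent-failure causes in an
    mcp_config.json payload (structure as in the example above)."""
    warnings = []
    servers = json.loads(raw).get("mcpServers", {})
    for name, spec in servers.items():
        cmd = spec.get("command")
        if cmd and shutil.which(cmd) is None:
            warnings.append(f"{name}: command '{cmd}' not found on PATH")
        for arg in spec.get("args", []):
            # Heuristic: script arguments should point at real files.
            if arg.endswith(".js") and not os.path.exists(arg):
                warnings.append(f"{name}: script '{arg}' does not exist")
        for key, val in spec.get("env", {}).items():
            if not str(val).strip():
                warnings.append(f"{name}: env var {key} is empty")
    return warnings
```

Running this against the raw config before each session catches the typo class of failures in seconds instead of a debugging hour.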

Quick Reference

MCP server connects but never responds

Agent cannot discover tools from the MCP server

Local service integration fails silently

Verify environment variables and command paths in mcp_config.json are exact

Check MCP server logs independently before connecting to Antigravity

Use localhost URLs for local services and confirm they are running before launch

Governance and Security in Autonomous Workflows

I take security seriously when allowing autonomous agents to execute commands on my machine. Antigravity provides several layers of protection that I always configure before starting a new project.

### Terminal Security

I recommend enabling **Strict Mode** for the terminal. When Strict Mode is active, sandboxing engages automatically and network access can be restricted for any command the agent runs. This ensures that a misguided script cannot compromise your local environment or exfiltrate data to an unexpected endpoint. You can also configure the agent to ask for **human approval** before performing any action that involves the terminal or the file system. I keep this enabled for production projects where the cost of an unexpected `rm -rf` is catastrophic.

### Browser Security

The browser tool uses an **allowlist** approach. By default, the allowlist starts with only `localhost`. Any URL outside the allowlist triggers a user approval prompt. I limit my allowlist to localhost and my internal documentation servers. This is especially important when integrating local services through MCP, because it ensures the agent cannot browse to untrusted external sites during an agentic workflow.

### Project-Level Rules

I use the `AGENTS.md` file (or `GEMINI.md` in the global config directory `~/.gemini/`) to define rules that the agent must follow. These can include mandates like "never use eval()", "always include security headers in HTTP responses", or "do not install packages from unverified registries". The agent reads these files at the start of every session and treats them as hard constraints. These proactive measures allow me to enjoy the benefits of automation without sacrificing the integrity of my development environment.
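The allowlist decision reduces to a host check. A minimal sketch, where the internal hostname is a hypothetical example of mine; the real default allowlist ships with only `localhost`:

```python
from urllib.parse import urlparse

# Example allowlist: localhost plus a hypothetical internal docs host.
ALLOWLIST = {"localhost", "127.0.0.1", "docs.internal.example.com"}

def needs_approval(url: str) -> bool:
    """True if the URL's host is outside the allowlist, i.e. the browser
    tool should trigger a human-approval prompt before visiting it."""
    host = urlparse(url).hostname or ""
    return host not in ALLOWLIST
```

Note that the check is on the hostname, not the full URL, so an approved host is approved for every path under it; keep the list short for exactly this reason.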

Quick Reference

Agent runs a destructive command without asking

Browser tool visits unexpected external URLs

Security rules in AGENTS.md are being ignored

Enable Strict Mode in terminal settings to activate sandboxing automatically

Keep browser allowlist limited to localhost and trusted internal servers only

Verify AGENTS.md is in the project root or GEMINI.md is at ~/.gemini/GEMINI.md

Cost Optimization and Resource Management

I understand that managing the costs of generative AI is a priority for any professional developer. In Antigravity, I monitor my credit consumption through the **Models** section in settings. I use the **AI Credit Overages** feature to prevent my account from exceeding a specific budget. This is particularly useful when working on large-scale projects that involve extensive reasoning.

### Batch Processing

For tasks that are not time-sensitive, I take advantage of the **batch processing** capabilities offered by Vertex AI. The Gemini Developer API explicitly advertises a **50% cost reduction** for batch jobs compared to real-time inference. By queuing non-urgent analysis, documentation generation, or test creation into batch jobs, I cut my token costs in half without impacting my development velocity.

### Context Caching

Vertex AI offers both **implicit** and **explicit** context caching. If I am working on a single repository for several days, caching the core codebase allows the agent to reference it without reprocessing the entire context every time. Cache hits reduce both latency and cost. I have found this especially effective when iterating on prompts that share a large common prefix.

### Inference Tiers

As of April 2026, the Gemini API supports **Standard**, **Flex**, and **Priority** inference tiers. Standard is the default. Flex offers lower pricing for workloads that can tolerate slightly higher latency. Priority guarantees the lowest latency at a premium. I use Flex for background processing and Priority only when I need real-time responsiveness.

### Prompt Discipline

I suggest being specific with your prompts to avoid unnecessary token usage. A clear, concise instruction will always be more cost-effective than a vague request that requires multiple attempts. Combining batch processing, caching, tier selection, and prompt discipline, I maintain high productivity while keeping operational expenses within a reasonable range.
*Note: Pricing figures referenced in this guide reflect April 2026 values. Always check the official Vertex AI and Gemini API pricing pages for the most current rates.*
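The 50% batch discount is easy to quantify. A quick arithmetic sketch, using example per-million-token rates (the defaults below are illustrative; substitute the current published pricing for your model):

```python
def realtime_cost(in_tok, out_tok, in_rate=2.00, out_rate=12.00):
    """Cost in USD at per-million-token rates. The default rates are
    example figures only; check the official pricing pages."""
    return in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate

def batch_cost(in_tok, out_tok, **rates):
    """Batch API price: 50% of the real-time cost, per the
    discount described above."""
    return 0.5 * realtime_cost(in_tok, out_tok, **rates)
```

For a documentation job consuming one million input and half a million output tokens, the real-time cost at the example rates is $8.00 and the batch cost is $4.00, which is why I route everything non-blocking through batch.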

Quick Reference

Credit consumption feels unexpectedly high

Not sure how to enable batch mode for non-urgent tasks

Context caching does not seem to reduce costs

Check Models section in settings and enable AI Credit Overages to set a hard budget cap

Use the Vertex AI Batch API for documentation, analysis, and test generation jobs

Ensure your prompts share a consistent prefix so the cache can match and produce hits

Persistent Rules, Workflows, and Agent Customization

One of the most valuable features I have found in Antigravity is the ability to define persistent behavior through file-based configuration. Instead of repeating the same instructions at the start of every conversation, I encode my preferences into Markdown files that the agent reads automatically.

### Global Rules

The file at `~/.gemini/GEMINI.md` acts as a global instruction set. Anything I write here applies to every project I open in Antigravity. I use it for cross-cutting concerns: my preferred coding style, security mandates, commit message format, and testing philosophy. Since version 1.20.3, Antigravity also reads from `AGENTS.md`, giving you an additional file for organizing your rules.

### Workspace Rules

For project-specific behavior, I create Markdown files inside `.agents/rules/` at the root of my project. These rules apply only to that workspace. For example, in a Python project I might have a rule that enforces type hints on every function signature. In a React project, a different rule might mandate the use of functional components over class components.

### Workflows

Workflows live in `.agents/workflows/` and describe step-by-step procedures the agent should follow for specific tasks. I use them for deployment checklists, code review protocols, and release pipelines. The agent can reference these workflows when you invoke them, executing each step in sequence.

### The Customization Bug

I should mention that in early 2026, the "Agent Customizations" UI panel had reported issues detecting workspaces and listing rules/workflows correctly. If you experience this, bypass the UI entirely and manage your `.agents/` directory structure through the file system directly. This file-based approach is reliable regardless of UI state.
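When the Customizations panel misbehaves, I verify the file layout directly rather than trusting the UI. A minimal sketch of that audit, assuming the file locations described above; `audit_agent_files` is my own helper name, not an Antigravity command:

```python
from pathlib import Path

def _md_names(d: Path) -> list:
    """List Markdown filenames in a directory, tolerating its absence."""
    return sorted(p.name for p in d.glob("*.md")) if d.is_dir() else []

def audit_agent_files(project_root: str) -> dict:
    """Report which persistent-configuration files exist for a project,
    using the conventional locations: ~/.gemini/GEMINI.md for global
    rules, .agents/rules/ and .agents/workflows/ for workspace files."""
    root = Path(project_root)
    return {
        "global_rules_present": (Path.home() / ".gemini" / "GEMINI.md").is_file(),
        "project_rules": _md_names(root / ".agents" / "rules"),
        "workflows": _md_names(root / ".agents" / "workflows"),
    }
```

An empty `project_rules` list on a project you thought was configured usually means the `.agents/` directory is in the wrong place, which is exactly the failure mode the UI bug masks.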

Quick Reference

Agent ignores rules I defined in GEMINI.md

Customizations panel shows empty workspace

Workflow steps execute out of order

Confirm GEMINI.md is placed at ~/.gemini/GEMINI.md (not inside a project folder)

Bypass the UI and manage .agents/rules/ and .agents/workflows/ directly via file system

Number your workflow steps explicitly and include clear completion criteria for each

Planning Mode, Fast Mode, and the Artifact Lifecycle

The distinction between **Planning Mode** and **Fast Mode** is, in my experience, the single most impactful setting in Antigravity. Getting this right determines the quality, speed, and predictability of every interaction.

### Planning Mode

Planning Mode is the default for complex, multi-file operations. When activated, the agent produces structured artifacts before writing any code:

- **Implementation Plan**: A detailed proposal of what will change, organized by component. I review this before the agent touches a single file.
- **Task List**: A checklist derived from the plan. The agent marks items as in-progress and completed as it works.
- **Walkthrough**: A post-execution summary of what changed and why. I use this for code review and documentation.

These artifacts form a **Task Group** that makes the entire lifecycle of a feature visible in a single view. I find that Planning Mode is indispensable for refactoring, new feature development, and any change that spans more than two or three files.

### Fast Mode

Fast Mode strips away the planning overhead. It skips Implementation Plan generation and moves directly to execution. I use it for single-file fixes, formatting changes, quick questions, and exploratory coding where I want immediate feedback. The key insight: do not use Planning Mode for trivial tasks. The overhead of generating and reviewing artifacts is wasted on a one-line fix. Conversely, do not use Fast Mode for complex refactors. The lack of a plan leads to fragmented changes and missed edge cases.

### My Workflow

I start every significant feature in Planning Mode, review the Implementation Plan, approve it, and let the agent execute. Once the heavy structural work is done, I switch to Fast Mode for polish, formatting, and minor adjustments. This two-phase approach gives me the best of both capabilities.

Quick Reference

Planning overhead feels excessive for small changes

Agent makes fragmented changes without a clear structure

Task list does not reflect actual progress

Switch to Fast Mode for single-file edits, formatting, and quick questions

Use Planning Mode for any change spanning more than two files

Review the Implementation Plan before approving; reject incomplete proposals

Model Comparison, Pricing, and When to Use Each

Selecting the right model is a strategic decision that balances reasoning depth, speed, and cost. I have compiled a practical comparison based on my experience with each model inside and outside of Antigravity.

### Gemini 3.1 Pro Preview

**Best for:** Complex agentic workflows, large repository refactoring, deep architectural reasoning.
**Risk:** Higher cost and latency. The thinking level parameter needs tuning to avoid unnecessary over-reasoning on simple tasks.
**Pricing (Gemini Developer API, April 2026):** Input $2.00 / Output $12.00 per million tokens (up to 200k context).

### Gemini 3.1 Flash Preview

**Best for:** Speed-focused code assistance, high-throughput iteration, rapid prototyping.
**Risk:** May not match Pro's stability on deeply complex multi-step reasoning chains.
**Pricing:** Input $0.50 / Output $3.00 per million tokens.

### Claude Opus 4.6 (via Vertex)

**Best for:** Deep reasoning, enterprise agent workflows, sustained analytical coding sessions.
**Risk:** "Overthinking" on simple tasks inflates cost and latency. Anthropic recommends lowering the **effort** parameter for routine work.
**Pricing (Vertex, region-dependent):** Approximately Input $5.50 / Output $27.50 per million tokens.
**Context:** 1,000,000 input tokens, 128,000 output tokens.

### GPT-OSS 120B (Vertex MaaS)

**Best for:** Cost-effective reasoning with open-weight flexibility.
**Risk:** Tool calling schema is sensitive; the Harmony format requires careful prompt construction.
**Pricing (Vertex):** Input $0.09 / Output $0.36 per million tokens.

### GPT-OSS 20B (Edge or Vertex MaaS)

**Best for:** Edge inference, local deployment, minimal infrastructure cost.
**Risk:** Hardware constraints. Requires approximately 16GB VRAM with MXFP4 quantization.
**Pricing:** Self-hosted (hardware cost only) or Vertex MaaS rates.

*All pricing figures reflect April 2026 published rates. Check the official pricing pages for Vertex AI, Gemini API, and your specific provider for the most current numbers.*
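To compare real workloads rather than list prices, I keep the rates above in a small helper. The figures are copied from the April 2026 numbers quoted in this section and should be treated as illustrative, not authoritative:

```python
# Per-million-token (input, output) rates in USD, from the April 2026
# comparison above. Verify against current pricing pages before relying
# on these numbers.
RATES = {
    "gemini-3.1-pro":   (2.00, 12.00),
    "gemini-3.1-flash": (0.50, 3.00),
    "claude-opus-4.6":  (5.50, 27.50),
    "gpt-oss-120b":     (0.09, 0.36),
}

def job_cost(model: str, in_tok: int, out_tok: int) -> float:
    """USD cost of a single job at the tabled rates, rounded to 4 places."""
    in_rate, out_rate = RATES[model]
    return round(in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate, 4)
```

Pricing a representative day of usage this way makes the gap concrete: the same token volume that costs a few dollars on Flash costs an order of magnitude more on Opus, which is why the defaults matter.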

Quick Reference

Spending too much on AI inference

Not sure which model to default to

Claude is slower than expected on simple tasks

Default to Gemini Flash for daily coding; reserve Pro and Claude for architecture

Use GPT-OSS 120B on Vertex MaaS when cost is the primary constraint at $0.09 input

Lower Claude's effort parameter for routine tasks to reduce overthinking latency

Claude Opus 4.6 on Vertex AI, Effort Tuning and Privacy Considerations

Claude Opus 4.6 deserves its own section because its behavior within Antigravity and Vertex AI has specific characteristics that affect both performance and security.

### Effort Tuning

Anthropic's official announcement notes that Opus 4.6 thinks more deeply by default, which means it can inflate cost and latency on tasks that do not require exhaustive reasoning. The recommended solution is to adjust the **effort** level. In agentic environments like Antigravity, I lower the effort for routine code generation and increase it only for complex architectural reviews. Here is a curl example for calling Claude through Vertex AI with streaming enabled:

```bash
MODEL_ID="claude-opus-4-6"
LOCATION="us-central1"
PROJECT_ID="YOUR_PROJECT_ID"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/anthropic/models/${MODEL_ID}:streamRawPredict" \
  -d '{
    "anthropic_version": "vertex-2023-10-16",
    "messages": [{"role": "user", "content": "Review this architecture for security gaps."}],
    "max_tokens": 4096,
    "stream": true
  }'
```

### Web Search Privacy Warning

If you enable Claude's **web search** capability through Vertex, be aware of a critical data flow: Claude sends search queries derived from your prompt to a third-party search provider selected by Anthropic. Google explicitly states that it is not responsible for how that third party handles the data. I strongly recommend disabling web search for any requests containing proprietary source code, client data, or internal architecture details. Treat this as a hard security policy, not a suggestion.
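When I script these calls rather than hand-writing curl payloads, I build the request body programmatically so the effort setting is explicit per task. A minimal sketch; note the hedge in the comment: the `effort` field name and its placement at the top level are my assumption based on Anthropic's announcement, so verify against the current Vertex/Anthropic request schema before use:

```python
import json

def claude_request(prompt: str, effort: str = "low", max_tokens: int = 4096) -> str:
    """Build a streamRawPredict request body for Claude on Vertex.

    ASSUMPTION: the 'effort' field name and top-level placement are
    inferred from Anthropic's announcement, not a verified schema."""
    body = {
        "anthropic_version": "vertex-2023-10-16",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
        "effort": effort,
    }
    return json.dumps(body)
```

The payload then drops into the `-d` argument of the curl call shown above, with `effort` raised only for the architectural-review class of prompts.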

Quick Reference

Claude Opus 4.6 is expensive for routine coding

Concerned about data leaking through web search

Stream responses timeout or fail on Vertex

Reduce the effort parameter to save cost on tasks that do not need deep reasoning

Disable web search in Claude settings when working with proprietary or sensitive data

Verify your Vertex project quota and region; use us-central1 for the broadest model availability

Running GPT-OSS Locally with Open-Weight Models at the Edge

The GPT-OSS family of models from OpenAI is released as **open-weight**, meaning you can run them on your own infrastructure. This is critical context for anyone who has searched for "antigravity cant use gpu" or "how to add Ollama to Antigravity." The answer is nuanced: Antigravity does not support local models as core reasoning engines, but GPT-OSS gives you a self-hosted alternative outside the IDE.

### Quick Local Setup

OpenAI's Cookbook demonstrates a fast path using the `transformers` library:

```bash
pip install -U transformers

# Start a local server
transformers serve

# Connect and chat
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
```

### Hardware Requirements

The 20B parameter model runs within **approximately 16GB of VRAM** using MXFP4 quantization, making it accessible on a single consumer GPU like an RTX 4080 or 4090. The 120B model targets **80GB class VRAM**, which requires enterprise hardware (A100, H100) or multi-GPU setups.

### Serving at Scale with vLLM

For production workloads, I recommend **vLLM** as the serving layer. It provides an OpenAI-compatible server endpoint and advanced memory optimizations like **KV cache quantization** (FP8), which reduces the memory footprint and allows longer context windows or higher throughput on the same hardware.
### Vertex MaaS Alternative

If self-hosting is too much operational overhead, Vertex AI offers GPT-OSS through its **Model-as-a-Service (MaaS)** platform with an OpenAI Chat Completions compatible endpoint:

```bash
LOCATION="global"
PROJECT_ID="YOUR_PROJECT_ID"

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-oss-120b-maas","messages":[{"role":"user","content":"Explain KV cache quantization."}],"max_tokens":500,"stream":true}' \
  "https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/endpoints/openapi/chat/completions"
```

Streaming in this context reduces perceived latency, as tokens arrive incrementally rather than in a single bulk response.
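The VRAM figures above follow from simple arithmetic: weights at the quantized bit-width, scaled by a runtime overhead factor. A back-of-envelope sketch, where the 1.5x overhead for KV cache, activations, and serving buffers is my own rough assumption:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.5) -> float:
    """Rough VRAM estimate in GB: raw weight storage at the given
    quantization width, times an overhead factor (assumed ~1.5x) for
    KV cache, activations, and runtime buffers."""
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weight_gb * overhead, 1)
```

At 4 bits per weight, the 20B model comes out to 15.0 GB (consistent with the ~16GB figure) and the 120B model to 90.0 GB (squarely in the 80GB-class, multi-GPU range). The estimate is deliberately coarse; real usage depends on context length and batch size.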

Quick Reference

Cannot add Ollama or local models to Antigravity's model selector

GPT-OSS 20B runs out of memory on my GPU

Self-hosting feels like too much operational work

Use MCP to call local model endpoints as tools instead of trying to replace the core model

Apply MXFP4 quantization and verify you have at least 16GB VRAM for the 20B model

Use Vertex MaaS to access GPT-OSS without managing your own serving infrastructure

API-Side Throughput, Caching, and Inference Tier Selection

When I build applications on top of the Gemini or Vertex APIs (outside of Antigravity), tuning the API configuration is just as important as tuning the model prompt.

### Global Endpoints

Vertex AI states that **global endpoints** provide higher availability and reliability compared to single-region endpoints. I default to global endpoints for all production workloads and fall back to a specific region only when data residency requirements mandate it. Global routing also helps during high-demand periods when individual regions may return 429 rate-limit errors.

### Inference Tiers

The Gemini API now supports three tiers:

- **Standard**: The default. Balanced latency and availability.
- **Flex**: Lower pricing in exchange for slightly higher and more variable latency. I use this for batch-like workloads that execute through the real-time API.
- **Priority**: Guarantees the lowest latency. I reserve this for user-facing features where response time directly impacts the end-user experience.

Selecting the right tier per use-case can reduce costs by 30 to 50 percent without any change to the prompt or the model.

### Context Caching

Vertex AI distinguishes between **implicit** and **explicit** context caching. Implicit caching happens automatically when repeated prompt prefixes are detected. Explicit caching requires you to define a cache entry that the API stores and references across calls. For workflows where I send the same system prompt plus a large codebase context across dozens of requests, explicit caching saves both money and time. The cached tokens are charged at a significantly lower rate than fresh input tokens.

### Batch API

The Gemini Developer API offers a **Batch API with 50% cost reduction** compared to real-time inference. I submit documentation generation, large-scale code review, and test suite generation through batch. The trade-off is that batch jobs do not return results instantly; they are processed asynchronously and results are retrieved later. For anything that is not blocking a developer's workflow, batch is the fiscally responsible choice.

*Pricing specifics evolve. Always consult the official Gemini API and Vertex AI pricing documentation for current rates and tier availability.*
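Implicit cache hits depend on byte-identical prefixes, which is easy to sanity-check offline before blaming the API. A minimal sketch using a rule of thumb of roughly four characters per token (an approximation of mine, not a real tokenizer):

```python
import os

def shared_prefix_tokens(prompt_a: str, prompt_b: str,
                         chars_per_token: float = 4.0) -> int:
    """Rough count of tokens two prompts share as an identical prefix.

    chars_per_token=4 is a common rule of thumb, not an exact tokenizer;
    use it only to spot prompts that accidentally diverge early."""
    prefix = os.path.commonprefix([prompt_a, prompt_b])
    return int(len(prefix) / chars_per_token)
```

If two requests you expected to share a cached codebase context report a tiny shared prefix, something mutable (a timestamp, a request ID, reordered context files) has crept into the front of the prompt and is defeating the cache.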

Quick Reference

Getting 429 rate limit errors from Vertex API

Caching does not seem to reduce token costs

Batch results take too long to return

Switch to global endpoints to distribute load across regions automatically

Use explicit context caching for repeated system prompts; ensure the prefix is identical each call

Reserve batch for genuinely non-urgent tasks; do not batch anything that blocks active development

Edge, Cloud, and Hybrid Deployment Patterns for Production

When building AI-powered applications, I consider three deployment patterns. The right choice depends on data residency, latency requirements, and operational capacity.

### Edge (Self-Hosted GPT-OSS)

**When to use it:** Strict data sovereignty requirements, offline environments, or scenarios where you need sub-10ms inference latency.
**Advantages:** Full control over the model, the data, and the serving infrastructure. OpenAI explicitly states that GPT-OSS models are not served through the OpenAI API, so self-hosting is the intended deployment model.
**Trade-offs:** You own the hardware, the serving stack (vLLM, TGI, or similar), and the model update lifecycle. This is operational overhead that scales with traffic.

### Cloud (Vertex AI)

**When to use it:** Enterprise scale, global user bases, and projects that benefit from managed infrastructure.
**Advantages:** Global endpoints for high availability, built-in caching and batch processing, integrated observability dashboards, and audit logging.
**Trade-offs:** Subject to quota limits and 429 errors during peak demand. Third-party model integrations (like Claude web search) introduce data flow considerations that require security review.

### Hybrid (Cloud Models + Local Tools via MCP)

**When to use it:** When you need cloud-grade reasoning but your tools, databases, or APIs must remain on local infrastructure.
**Advantages:** Antigravity's MCP support enables this pattern natively. Cloud models handle the reasoning; local MCP servers provide the tools and data access.
**Trade-offs:** Tool security becomes your responsibility. Secrets must be managed carefully in `mcp_config.json`. Integration stability depends on both the cloud model's availability and your local server's uptime.

I find that most professional setups evolve toward the hybrid pattern. The cloud handles what it does best (reasoning at scale), and local infrastructure handles what it must (proprietary data, internal APIs, low-latency tooling).

Quick Reference

Not sure which pattern fits my project

Cloud model cannot access internal databases

Edge deployment outgrows a single GPU

Start with Cloud (Vertex) for speed to production; migrate to hybrid as needs evolve

Use MCP to bridge cloud models to local databases without exposing data to the internet

Scale edge with vLLM's multi-GPU serving or move the heavy model to Vertex MaaS

Monitoring, Logging, and Debugging with Vertex Observability

When something goes wrong in an AI workflow, the debugging experience is fundamentally different from traditional software. The model is a black box; you cannot step through its reasoning with a debugger. Instead, I rely on observability tools to understand what happened and why.

### Model Observability Dashboard

Vertex AI provides a prebuilt **model observability dashboard** that surfaces usage metrics, latency percentiles, and error rates. This is my first stop when diagnosing "antigravity lag" or unexpected slowness in API-backed workflows. The dashboard reveals whether the issue is in my code, the network, or the model serving infrastructure.

### Request and Response Logging

Vertex supports **request/response logging** that writes full request and response payloads to a BigQuery table. I enable this for production applications and use it for three purposes:

1. **Debugging**: When the model produces an unexpected output, I can replay the exact request that triggered it.
2. **Cost attribution**: By attaching **custom labels** to requests, I can break down costs by feature, team, or customer. Note that custom labels are currently supported only for Google models; partner models may produce errors if labels are included.
3. **Compliance**: For regulated industries, the complete request/response log provides an audit trail of every AI interaction.

### Audit Logs

Vertex offers **Data Access audit logs** for tracking model endpoint usage. These are disabled by default for some services because they can generate high volumes of log data. I enable them selectively for production endpoints and keep them disabled for development to avoid noise and storage costs.

### Local Debugging

Antigravity's own desktop application does not publish a single standard for its local log locations and formats, so local debugging involves checking the output panel (accessible from the bottom of the editor) and the developer console. For Vertex API debugging, the cloud-side tools are far more capable.
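The cost-attribution workflow reduces to a simple aggregation once the logs are exported. A minimal sketch, assuming rows shaped roughly like the logging table's records; the field names (`labels`, `prompt_tokens`, `output_tokens`) and the per-1K-token rates are illustrative, not the exact BigQuery schema or real pricing:

```python
from collections import defaultdict

# Illustrative rows, as if exported from the request/response
# logging table (field names are assumptions, not the real schema).
rows = [
    {"labels": {"feature": "search"}, "prompt_tokens": 1200, "output_tokens": 300},
    {"labels": {"feature": "search"}, "prompt_tokens": 800, "output_tokens": 150},
    {"labels": {"feature": "summarize"}, "prompt_tokens": 5000, "output_tokens": 900},
]

# Hypothetical per-1K-token rates; substitute your model's actual pricing.
PROMPT_RATE = 0.00125
OUTPUT_RATE = 0.005

def cost_by_label(rows, key="feature"):
    """Aggregate estimated spend per value of one custom label."""
    totals = defaultdict(float)
    for row in rows:
        label = row["labels"].get(key, "unlabeled")
        totals[label] += (row["prompt_tokens"] / 1000) * PROMPT_RATE
        totals[label] += (row["output_tokens"] / 1000) * OUTPUT_RATE
    return dict(totals)

print(cost_by_label(rows))
```

In practice I run the equivalent `GROUP BY` directly in BigQuery; the Python version is useful for ad-hoc checks and for unit-testing the rate table.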

Quick Reference

Cannot determine why the model produced an unexpected response

Cost attribution is not granular enough

Audit log volume is overwhelming

Enable request/response logging to BigQuery and replay the exact failing request

Attach custom labels to every API call to segment costs by feature or team

Enable audit logs only on production endpoints; disable for development and staging

Breaking Changes, Migration Notes, and the 2024 to 2026 Timeline

The AI tooling ecosystem moves fast. In the span of two years, several breaking changes have landed that can silently break your integrations if you are not tracking them. Here is what I have cataloged.

### Gemini API (2025 to 2026)

- `gemini-3-pro-preview` was **shut down** in March 2026. The alias now redirects to `gemini-3.1-pro-preview`. If your code hardcodes the old model ID, it may still work through the redirect, but I recommend updating explicitly to avoid silent behavior changes.
- The Interactions API v1beta renamed `total_reasoning_tokens` to `total_thought_tokens`. This is listed as a **breaking change** in the release notes. Any telemetry or billing pipeline that parses this field will need a schema update.
- **Flex and Priority** inference tiers were introduced in April 2026. If you are on the Standard tier by default, you may be overpaying for latency-insensitive workloads.

### Antigravity (Version 1.20.3, early 2026)

- **AGENTS.md support** was added, giving you an alternative to GEMINI.md for defining agent rules.
- The **Auto-continue** setting was deprecated and made default-on. If you previously relied on the agent pausing between steps, this behavior change can cause unexpected multi-step executions.
- Long conversation **loading time improvements** were reported, but in practice, very long threads can still cause UI slowness.
- Some community reports noted that **"Command support removed"** appeared in patch notes, causing confusion about inline command availability. If your workflow relies heavily on inline terminal commands, run a regression test after each update.

### Claude Opus 4.6 (February 2026)

- Anthropic introduced **effort tuning** as a first-class parameter. Default high effort in agentic contexts can produce cost and latency surprises. Set it explicitly.
- Vertex's Claude integration added a **web search** capability that routes queries through a third-party search provider. Review your data classification policies before enabling it.

### GPT-OSS (2025 to 2026)

- OpenAI confirmed that GPT-OSS models are **not served through the OpenAI API or ChatGPT**. All production deployments must use self-hosting or a third-party provider like Vertex MaaS. This affects your provider abstraction layer if you were planning to use a unified OpenAI SDK.

### Key Dates

- **November 2024**: Vertex AI batch predictions for Llama (Preview), Gemini batch (GA)
- **May 2025**: Vertex AI global endpoint GA, expanded observability
- **November 2025**: Antigravity public preview launched (free tier)
- **December 2025**: Gemini 3 Flash Preview, Interactions API breaking change
- **February 2026**: Gemini 3.1 Pro Preview, Claude Opus 4.6
- **March 2026**: Gemini 3 Pro shutdown, Antigravity credits/overages discussion
- **April 2026**: Flex and Priority inference tiers launched
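The field rename is the kind of change a small normalization shim absorbs cleanly. A minimal sketch, assuming the usage metadata arrives as a plain dict; the surrounding pipeline code is hypothetical:

```python
def normalize_usage(usage: dict) -> dict:
    """Map the old Interactions API field name to the new one.

    Accepts usage payloads produced both before and after the v1beta
    rename of total_reasoning_tokens to total_thought_tokens, and
    always emits the new field name.
    """
    out = dict(usage)
    if "total_thought_tokens" not in out and "total_reasoning_tokens" in out:
        out["total_thought_tokens"] = out.pop("total_reasoning_tokens")
    return out

# An old-style payload is rewritten to the new schema.
print(normalize_usage({"total_reasoning_tokens": 512, "total_tokens": 2048}))

# A new-style payload passes through unchanged.
print(normalize_usage({"total_thought_tokens": 512, "total_tokens": 2048}))
```

Running every payload through a shim like this at the ingestion boundary means the rest of the telemetry and billing pipeline only ever sees one schema, so a future rename costs you one line instead of a migration.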

Quick Reference

Integration broke after a model update

Billing pipeline shows incorrect token counts

Agent suddenly executes multiple steps without pausing

Replace hardcoded model IDs with the latest versions; test after every Gemini API release

Update your telemetry schema for the rename of total_reasoning_tokens to total_thought_tokens

Auto-continue is now default-on; if you need step-by-step control, check the latest settings