
    Gemini 2.5 Computer Use model: A practical guide for building browser agents


    Gemini 2.5 Computer Use is a browser-focused model that “sees” the screen via screenshots and then clicks, types, scrolls, and drags through predefined actions. It’s in public preview via AI Studio and Vertex AI. You’ll get speed and flexibility, but you must design for safety, pop-ups, and flaky UIs.

    What is the Gemini 2.5 Computer Use model?

    It’s a specialized Gemini 2.5 model that controls web interfaces like a human: opening pages, filling forms, clicking buttons, and working behind logins. It runs in a loop: observe a screenshot, decide an action, execute, repeat until the task is done or blocked for safety.

    Google is shipping this in public preview through the Gemini API on Google AI Studio and Vertex AI. It’s tuned for browser environments first; desktop OS-level control isn’t the goal yet. There’s also a public demo (Browserbase) to watch it navigate in real time.

    How it works: the agent loop in plain English

    Think of a careful intern at a computer:

    1. You describe the goal.
    2. The model looks at a screenshot of the current page.
    3. It proposes a UI action (e.g., click_at, type_text_at, navigate).
    4. A safety service may require user confirmation for risky steps (e.g., checkout).
    5. Your code executes the action, grabs a new screenshot, and sends it back.
    6. Repeat until done or you stop.

    The supported actions are predefined. Expect basics like opening a browser, going to a URL, clicking/typing at coordinates, scrolling, going back/forward, waiting, and drag-and-drop. You can also exclude certain actions (e.g., disallow drag-and-drop) to narrow behavior.
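The loop above can be sketched in Python. Note that `Action`, `model_step`, and `execute_action` here are hypothetical stand-ins for the Gemini API call and your Playwright glue, not the SDK’s actual types; a sketch under those assumptions looks like this:

```python
# Minimal agent-loop skeleton: observe -> decide -> (confirm) -> act.
# `model_step` and `execute_action` are hypothetical callables you supply.
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str                          # e.g. "click_at", "type_text_at", "navigate"
    args: dict = field(default_factory=dict)  # e.g. {"x": 120, "y": 340}
    needs_confirmation: bool = False   # set when the safety layer flags the step

def run_agent(goal, model_step, execute_action, take_screenshot, max_steps=20):
    """Loop until the model signals completion, the user declines, or the budget runs out."""
    screenshot = take_screenshot()
    for _ in range(max_steps):
        action = model_step(goal, screenshot)   # model proposes the next UI action
        if action is None:                      # model signals the task is done
            return True
        if action.needs_confirmation:           # risky step: ask the human first
            if input(f"Allow '{action.name}'? [y/N] ").lower() != "y":
                return False
        execute_action(action)                  # click/type/scroll via Playwright
        screenshot = take_screenshot()          # fresh observation for the next turn
    return False                                # step budget exhausted
```

The `max_steps` budget matters in practice: a stuck agent that keeps retrying a broken page burns tokens on every screenshot it sends back.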

    Two important notes:

    • Confirmations: If the safety layer flags a step as risky, your app must ask the user before execution. That’s by design.
    • Scope: Today, Computer Use is browser-centric. Some outlets cite 13 actions; either way, treat it as a capable browser operator, not a full desktop automation tool.

    Setup paths: AI Studio, Vertex AI, or Browserbase

    You can start three ways:

    • AI Studio (Developer API). Fast prototyping; clear docs and quickstarts. You’ll use the model id gemini-2.5-computer-use-preview-10-2025, pass screenshots, and parse function calls for actions.
    • Vertex AI (enterprise). Same model, plus enterprise controls, IAM, logging, and private networking. Preview terms apply; follow safety guidance and don’t bypass confirmations.
    • Browserbase (cloud browsers). Handy if you don’t want to maintain Playwright infra. There’s a free demo and an OSS reference showing the agent loop.

    Getting a key is straightforward in AI Studio; their quickstart covers the SDK install and first call.

    Pricing explained (with example scenarios)

    AI Studio (Gemini API) pricing for Computer Use preview (per 1M tokens): Input $1.25 (≤200k tokens) / $2.50 (>200k), Output $10 (≤200k) / $15 (>200k). Vertex AI’s page mirrors these Computer Use preview rates under Gemini 2.5 Pro Computer Use (region and platform differences may apply).

    What this means in practice:

    • A small task (5K input tokens + 2K output tokens) costs only a tiny fraction of the $1.25/$10 per-million rates, usually a few cents.
    • Costs rise with lots of screenshots or long sequences. Keep screenshots succinct (right size, crop clutter), and avoid verbose logging in the model output.

    Tip: Track tokens per action and actions per successful task. That KPI often predicts monthly cost better than “requests per day.”
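A quick back-of-envelope calculator makes the rates above concrete (this assumes the ≤200k-token tier quoted in this article; rates may change while the model is in preview):

```python
# Cost estimate for the preview tier quoted above:
# input $1.25 per 1M tokens, output $10 per 1M tokens (<=200k-token prompts).
def task_cost(input_tokens, output_tokens,
              input_rate=1.25, output_rate=10.0):
    """Return the dollar cost of one task at per-million-token rates."""
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

small_task = task_cost(5_000, 2_000)   # the "small task" example above
print(f"${small_task:.4f}")            # a few cents
```

Running this for the 5K-in / 2K-out example lands under three cents, which is why screenshot size and sequence length, not request count, dominate the bill.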

    Benchmarks and what they actually mean

    Google cites wins on Online-Mind2Web and WebVoyager, plus strong mobile control (AndroidWorld). Browserbase, which partnered with DeepMind, also reports state-of-the-art accuracy/latency from extensive runs. Treat vendor and partner results as promising but not gospel. Recreate a mini-eval on your own top 20 tasks before committing.

    A useful mental model:

    • Online-Mind2Web: multi-step web tasks on real sites (booking, shopping).
    • WebVoyager: open-web navigation benchmarks.
    • Latency vs quality: Gemini aims to deliver high accuracy at lower latency. Your mileage varies by site complexity, auth, pop-ups, and geo.

    When to use Computer Use vs classic APIs

    Use Computer Use when:

    • No API exists or access is limited (partner portals, legacy admin UIs).
    • You need UI-level verification (e.g., visual checks, feature flags).
    • You’re testing end-to-end flows (QA smoke tests, checkout, sign-ups).

    Stick to classic APIs when:

    • There’s a reliable, documented API.
    • You need high throughput or strict SLAs.
    • You must avoid UI drift and pop-ups.

    Pros (Computer Use): Flexible, works behind logins, human-like, fast to prototype.
    Cons: UI drift, pop-ups, CAPTCHAs, token cost creep, requires safety confirmations for certain actions.

    Real-world pitfalls and fixes

    • Cookie banners & pop-ups: Add a pre-step instruction (“dismiss cookie consent if present”), allow a wait_5_seconds, and retry once.
    • CAPTCHAs: Don’t automate. If the model detects a captcha, stop and route to a human. This is aligned with Google’s safety guidance.
    • Auth flows: Prefer app passwords or short-lived test accounts; keep secrets out of prompts.
    • DOM drift: Use goal-oriented prompts (“find the email field in the signup form”) and allow navigate+search fallback if a path breaks.
    • Safety confirmations: For purchases, deletions, or data changes, make the agent ask the user—and log the consent reason.
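The cookie-banner fix above (dismiss, wait, retry once) is easy to wrap as a guard. In this sketch, `step` and `dismiss_popups` are hypothetical callables you supply (for example, a Playwright click wrapped in a lambda):

```python
import time

# "Retry once" guard for cookie banners and other overlays.
def with_popup_retry(step, dismiss_popups, wait_seconds=5):
    """Run a UI step; on failure, dismiss overlays, wait, and retry exactly once."""
    try:
        return step()
    except Exception:
        dismiss_popups()          # e.g. click "Accept cookies" if visible
        time.sleep(wait_seconds)  # mirrors the wait_5_seconds action
        return step()             # one retry, then let the error surface
```

Capping the guard at a single retry keeps a genuinely broken page from looping and burning tokens.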

    Mini case studies (composite, practical)

    • QA smoke on staging: Run 12 key journeys (login, search, add-to-cart, checkout) at 7 a.m. and noon. Alert on deviation (missing button, error page). Early SLO catches saved a team 4–6 engineer hours weekly.
    • Ops data pulls: No export API? Have an agent log in every Friday, filter to “last 7 days,” export CSV, drop to a GCS bucket. Token cost is tiny; value is high.
    • Support triage: Route a tricky admin flow (reset SSO + reassign license). The agent attempts the path and stops for confirmation on sensitive steps.

    Step by step: How to start your first agent

    1. Get access. Create a Gemini API key in AI Studio.
    2. Pick the model. Use gemini-2.5-computer-use-preview-10-2025. Set environment to browser and enable the computer_use tool.
    3. Start the loop. Send a goal + initial screenshot. Parse the function_call (e.g., type_text_at) and execute with Playwright. Send back a new screenshot + URL.
    4. Add safety. If safety_response asks for confirmation, prompt the user and only proceed on explicit consent.
    5. Ship a demo. Use the reference repo and compare behavior on the Browserbase demo environment.
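The execute-with-Playwright glue in step 3 can be sketched as a small dispatcher. The payload shape here ({"name": ..., "args": {...}}) is an assumption about the parsed function call; check the current API reference for the exact fields the model returns:

```python
# Sketch: map a parsed function_call onto Playwright-style page calls.
# The dict shape is an assumption; verify against the API reference.
def dispatch(function_call, page):
    name = function_call["name"]
    args = function_call.get("args", {})
    if name == "navigate":
        page.goto(args["url"])
    elif name == "click_at":
        page.mouse.click(args["x"], args["y"])
    elif name == "type_text_at":
        page.mouse.click(args["x"], args["y"])  # focus the target first
        page.keyboard.type(args["text"])
    else:
        raise ValueError(f"unhandled action: {name}")
    return name
```

Raising on unhandled actions (rather than silently skipping) is deliberate: it surfaces new or excluded action names the moment the model emits one.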

    Comparison: Computer Use vs Classic APIs

    Scenario             | Computer Use (Gemini)    | Classic API
    No API available     | ✅ Handles UI directly   | ❌ Not possible
    Behind logins        | ✅ Works with auth flows | ⚠️ Depends on API
    Throughput           | ⚠️ Slower; UI steps      | ✅ Usually faster
    Flakiness (UI drift) | ⚠️ Needs guards          | ✅ Stable schemas
    E2E validation       | ✅ Visual/UX checks      | ❌ Limited

    Frequently Asked Questions (FAQs)

    What is Gemini Computer Use?
    A Gemini 2.5 model that controls web UIs by “seeing” screenshots and issuing actions (click/type/scroll) in a loop. It’s in public preview via API.

    Is it browser-only?
    Yes, it’s optimized for browsers. It’s not aimed at full desktop OS control right now.

    Where do I start?
    AI Studio for quick tests; Vertex AI for enterprise rollout; Browserbase for hosted cloud browsers and a public demo.

    How much does it cost?
    Computer Use preview pricing (AI Studio): $1.25/$2.50 per 1M input tokens (≤/>200k) and $10/$15 per 1M output tokens (≤/>200k). Vertex has a matching entry for Computer Use preview.

    What about benchmarks?
    Google reports strong results on Online-Mind2Web, WebVoyager, and AndroidWorld; Browserbase’s independent runs also show SOTA but verify on your own tasks.

    Does it bypass CAPTCHAs?
    No. Follow the safety rules: don’t automate captchas; require user confirmation for risky steps.

    Featured Snippet Boxes

    What is Gemini 2.5 Computer Use?

    A Gemini model that operates web UIs via screenshots and predefined actions (click, type, scroll) in a loop, now in public preview via AI Studio and Vertex AI.

    How do you start using it?

    Get a Gemini API key, select gemini-2.5-computer-use-preview-10-2025, enable the computer use tool, and implement the loop with Playwright or a hosted browser.

    Is it browser only?

    Yes. Google says it’s optimized for browsers and not desktop OS-level control yet.

    How much does it cost?

    Preview pricing is per million tokens: input from $1.25 and output from $10 (tiered by prompt size).


    Mohammad Kashif
    Senior Technology Analyst and Writer at AdwaitX, specializing in the convergence of Mobile Silicon, Generative AI, and Consumer Hardware. Moving beyond spec sheets, his reviews rigorously test "real-world" metrics analyzing sustained battery efficiency, camera sensor behavior, and long-term software support lifecycles. Kashif’s data-driven approach helps enthusiasts and professionals distinguish between genuine innovation and marketing hype, ensuring they invest in devices that offer lasting value.
