
    Gemini 2.5 Computer Use model: A practical guide for building browser agents


    Gemini 2.5 Computer Use is a browser-focused model that “sees” the screen via screenshots and then clicks, types, scrolls, and drags through predefined actions. It’s in public preview via AI Studio and Vertex AI. You’ll get speed and flexibility, but you must design for safety, pop-ups, and flaky UIs.

    What is the Gemini 2.5 Computer Use model?

    It’s a specialized Gemini 2.5 model that controls web interfaces the way a human would: opening pages, filling forms, clicking buttons, and working behind logins. It runs in a loop: observe a screenshot, decide an action, execute, repeat until the task is done or blocked for safety.

    Google is shipping this in public preview through the Gemini API on Google AI Studio and Vertex AI. It’s tuned for browser environments first; desktop OS-level control isn’t the goal yet. There’s also a public demo (Browserbase) to watch it navigate in real time.

    How it works: the agent loop in plain English

    Think of a careful intern at a computer:

    1. You describe the goal.
    2. The model looks at a screenshot of the current page.
    3. It proposes a UI action (e.g., click_at, type_text_at, navigate).
    4. A safety service may require user confirmation for risky steps (e.g., checkout).
    5. Your code executes the action, grabs a new screenshot, and sends it back.
    6. Repeat until done or you stop.

    The supported actions are predefined. Expect basics like opening a browser, going to a URL, clicking/typing at coordinates, scrolling, going back/forward, waiting, and drag-and-drop. You can also exclude certain actions (e.g., disallow drag-and-drop) to narrow behavior.
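    To make that concrete, here is a minimal sketch of the executor side of the loop using Playwright's sync API. The action names (click_at, type_text_at, navigate, scroll_document, go_back, wait_5_seconds) follow the ones mentioned in this guide, but the exact payload shape and coordinate conventions of the preview API are assumptions here, not the official mapping.

```python
# Sketch: translating model-proposed UI actions into Playwright calls.
# Assumes an `action` dict like {"name": "click_at", "args": {"x": 320, "y": 180}};
# the real function_call payload in the preview API may be shaped differently.
import time
from playwright.sync_api import Page

def execute_action(page: Page, action: dict) -> None:
    name = action["name"]
    args = action.get("args", {})

    if name == "open_web_browser":
        pass  # in this sketch the browser/page is already open
    elif name == "navigate":
        page.goto(args["url"])
    elif name == "click_at":
        page.mouse.click(args["x"], args["y"])
    elif name == "type_text_at":
        page.mouse.click(args["x"], args["y"])   # focus the field first
        page.keyboard.type(args["text"])
    elif name == "scroll_document":
        dy = 600 if args.get("direction", "down") == "down" else -600
        page.mouse.wheel(0, dy)
    elif name == "go_back":
        page.go_back()
    elif name == "wait_5_seconds":
        time.sleep(5)
    else:
        raise ValueError(f"Unsupported or excluded action: {name}")
```

    Excluding an action is then just a matter of not handling it in this dispatcher (and telling the model it is unavailable).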

    Two important notes:

    • Confirmations: If the safety layer flags a step as risky, your app must ask the user before execution. That’s by design.
    • Scope: Today, Computer Use is browser-centric. Some outlets cite 13 actions; either way, treat it as a capable browser operator, not a full desktop automation tool.

    Setup paths: AI Studio, Vertex AI, or Browserbase

    You can start three ways:

    • AI Studio (Developer API). Fast prototyping; clear docs and quickstarts. You’ll use the model id gemini-2.5-computer-use-preview-10-2025, pass screenshots, and parse function calls for actions.
    • Vertex AI (enterprise). Same model, plus enterprise controls, IAM, logging, and private networking. Preview terms apply; follow safety guidance and don’t bypass confirmations.
    • Browserbase (cloud browsers). Handy if you don’t want to maintain Playwright infra. There’s a free demo and an OSS reference showing the agent loop.

    Getting a key is straightforward in AI Studio; their quickstart covers the SDK install and first call.
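    A minimal first call might look like the sketch below, using the google-genai Python SDK from the quickstart. The Tool/ComputerUse config field names are assumptions based on the preview docs, so verify them against the current quickstart before relying on them.

```python
# Sketch: a first request to the Computer Use preview model via the
# google-genai SDK. The computer_use tool config below is an assumption
# based on the preview docs; check the quickstart for the exact schema.
from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

config = types.GenerateContentConfig(
    tools=[types.Tool(
        computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER,
        )
    )],
)

with open("screenshot.png", "rb") as f:
    screenshot_png = f.read()  # initial screenshot of the page

response = client.models.generate_content(
    model="gemini-2.5-computer-use-preview-10-2025",
    contents=[
        "Find the pricing page on example.com and report the cheapest plan.",
        types.Part.from_bytes(data=screenshot_png, mime_type="image/png"),
    ],
    config=config,
)

# The response should carry a function_call (e.g., click_at) to execute next.
print(response.candidates[0].content.parts)
```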

    Pricing explained (with example scenarios)

    AI Studio (Gemini API) pricing for Computer Use preview (per 1M tokens): Input $1.25 (≤200k tokens) / $2.50 (>200k), Output $10 (≤200k) / $15 (>200k). Vertex AI’s page mirrors these Computer Use preview rates under Gemini 2.5 Pro Computer Use (region and platform differences may apply).

    What this means in practice:

    • A small task (5K input tokens + 2K output tokens) costs a tiny fraction of the $1.25/$10 per-million rates, usually just a few cents.
    • Costs rise with lots of screenshots or long sequences. Keep screenshots succinct (right size, crop clutter), and avoid verbose logging in the model output.

    Tip: Track tokens per action and actions per successful task. That KPI often predicts monthly cost better than “requests per day.”
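    For a quick sanity check on spend, the snippet below applies the tiered preview rates quoted above to per-task token counts. The threshold and prices are the ones listed in this article, hard-coded rather than fetched from the pricing page.

```python
# Rough cost estimator using the preview rates quoted above
# ($ per 1M tokens, tiered at a 200k-token prompt size).
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    input_rate = 1.25 if input_tokens <= 200_000 else 2.50
    output_rate = 10.00 if input_tokens <= 200_000 else 15.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: the small task above, 5K input + 2K output tokens
print(f"${estimate_cost(5_000, 2_000):.4f}")  # ~$0.0263, i.e. a few cents
```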

    Benchmarks and what they actually mean

    Google cites wins on Online-Mind2Web and WebVoyager, plus strong mobile control (AndroidWorld). Browserbase, which partnered with DeepMind, also reports state-of-the-art accuracy and latency across extensive runs. Treat vendor and partner results as promising but not gospel. Recreate a mini-eval on your own top 20 tasks before committing.

    A useful mental model:

    • Online-Mind2Web: multi-step web tasks on real sites (booking, shopping).
    • WebVoyager: open-web navigation benchmarks.
    • Latency vs. quality: Gemini aims to pair high accuracy with lower latency. Your mileage varies by site complexity, auth, pop-ups, and geo.

    When to use Computer Use vs classic APIs

    Use Computer Use when:

    • No API exists or access is limited (partner portals, legacy admin UIs).
    • You need UI-level verification (e.g., visual checks, feature flags).
    • You’re testing end-to-end flows (QA smoke tests, checkout, sign-ups).

    Stick to classic APIs when:

    • There’s a reliable, documented API.
    • You need high throughput or strict SLAs.
    • You must avoid UI drift and pop-ups.

    Pros (Computer Use): Flexible, works behind logins, human-like, fast to prototype.
    Cons: UI drift, pop-ups, CAPTCHAs, token cost creep, requires safety confirmations for certain actions.

    Real-world pitfalls and fixes

    • Cookie banners & pop-ups: Add a pre-step instruction (“dismiss cookie consent if present”), allow a wait_5_seconds, and retry once.
    • CAPTCHAs: Don’t automate. If the model detects a captcha, stop and route to a human. This is aligned with Google’s safety guidance.
    • Auth flows: Prefer app passwords or short-lived test accounts; keep secrets out of prompts.
    • DOM drift: Use goal-oriented prompts (“find the email field in the signup form”) and allow navigate+search fallback if a path breaks.
    • Safety confirmations: For purchases, deletions, or data changes, make the agent ask the user and log the consent reason (see the sketch after this list).
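    A consent gate for those risky steps might look like the sketch below. The requires_confirmation and reason fields are illustrative stand-ins for whatever the preview API's safety_response actually returns.

```python
# Sketch: a consent gate for risky steps. The `safety` dict shape and field
# names below are illustrative; map them to the actual safety_response
# returned by the Computer Use API.
import json
import time

def confirm_risky_action(action: dict, safety: dict, log_path: str = "consent.log") -> bool:
    if not safety.get("requires_confirmation"):
        return True  # nothing flagged; proceed without asking

    print(f"The agent wants to run: {action['name']} {action.get('args', {})}")
    print(f"Reason flagged: {safety.get('reason', 'unspecified')}")
    answer = input("Type 'yes' to allow this step: ").strip().lower()

    # Log the decision and the reason so audits can reconstruct consent.
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "action": action,
            "reason": safety.get("reason"),
            "approved": answer == "yes",
        }) + "\n")
    return answer == "yes"
```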

    Mini case studies (composite, practical)

    • QA smoke on staging: Run 12 key journeys (login, search, add-to-cart, checkout) at 7 a.m. and noon. Alert on deviation (missing button, error page). Early catches against the SLO saved a team 4–6 engineer-hours weekly.
    • Ops data pulls: No export API? Have an agent log in every Friday, filter to “last 7 days,” export CSV, drop to a GCS bucket. Token cost is tiny; value is high.
    • Support triage: Route a tricky admin flow (reset SSO + reassign license). The agent attempts the path and stops for confirmation on sensitive steps.

    Step by step: How to start your first agent

    1. Get access. Create a Gemini API key in AI Studio.
    2. Pick the model. Use gemini-2.5-computer-use-preview-10-2025. Set environment to browser and enable the computer_use tool.
    3. Start the loop. Send a goal + initial screenshot. Parse the function_call (e.g., type_text_at) and execute with Playwright, as sketched after this list. Send back a new screenshot + URL.
    4. Add safety. If safety_response asks for confirmation, prompt the user and only proceed on explicit consent.
    5. Ship a demo. Use the reference repo and compare behavior on the Browserbase demo environment.
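    Tying it together, a compressed version of that loop could read as follows. It reuses the execute_action and confirm_risky_action sketches above, and request_next_action, extract_action, and extract_safety are hypothetical placeholders for calling the model and parsing its function_call and safety flags.

```python
# Sketch: the full observe -> decide -> execute loop with Playwright.
# request_next_action / extract_action / extract_safety are hypothetical
# helpers standing in for the model call and for parsing the real
# function_call / safety_response from the API response.
from playwright.sync_api import sync_playwright

def run_agent(goal: str, start_url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)

        for _ in range(max_steps):
            screenshot = page.screenshot()                          # observe
            response = request_next_action(goal, screenshot, page.url)  # decide

            action = extract_action(response)  # e.g. {"name": "click_at", ...}
            if action is None:
                break  # the model reports the task is done (or blocked)

            safety = extract_safety(response)
            if not confirm_risky_action(action, safety):
                break  # user declined; stop rather than bypass the check

            execute_action(page, action)                            # act, then loop

        browser.close()
```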

    Comparison: Computer Use vs Classic APIs

    Scenario             | Computer Use (Gemini)    | Classic API
    ---------------------|--------------------------|------------------
    No API available     | ✅ Handles UI directly    | ❌ Not possible
    Behind logins        | ✅ Works with auth flows  | ⚠️ Depends on API
    Throughput           | ⚠️ Slower; UI steps       | ✅ Usually faster
    Flakiness (UI drift) | ⚠️ Needs guards           | ✅ Stable schemas
    E2E validation       | ✅ Visual/UX checks       | ❌ Limited

    Frequently Asked Questions (FAQs)

    What is Gemini Computer Use?
    A Gemini 2.5 model that controls web UIs by “seeing” screenshots and issuing actions (click/type/scroll) in a loop. It’s in public preview via API.

    Is it browser-only?
    Yes, it’s optimized for browsers. It’s not aimed at full desktop OS control right now.

    Where do I start?
    AI Studio for quick tests; Vertex AI for enterprise rollout; Browserbase for hosted cloud browsers and a public demo.

    How much does it cost?
    Computer Use preview pricing (AI Studio): $1.25/$2.50 per 1M input tokens (≤/>200k) and $10/$15 per 1M output tokens (≤/>200k). Vertex has a matching entry for Computer Use preview.

    What about benchmarks?
    Google reports strong results on Online-Mind2Web, WebVoyager, and AndroidWorld; Browserbase’s independent runs also show state-of-the-art results, but verify on your own tasks.

    Does it bypass CAPTCHAs?
    No. Follow the safety rules: don’t automate captchas; require user confirmation for risky steps.

    Featured Snippet Boxes

    What is Gemini 2.5 Computer Use?

    A Gemini model that operates web UIs via screenshots and predefined actions (click, type, scroll) in a loop, now in public preview via AI Studio and Vertex AI.

    How do you start using it?

    Get a Gemini API key, select gemini-2.5-computer-use-preview-10-2025, enable the computer use tool, and implement the loop with Playwright or a hosted browser.

    Is it browser only?

    Yes. Google says it’s optimized for browsers and not desktop OS-level control yet.

    How much does it cost?

    Preview pricing is per million tokens: input from $1.25 and output from $10 (tiered by prompt size).

