Gemini 2.5 Computer Use is a browser-focused model that “sees” the screen via screenshots and then clicks, types, scrolls, and drags through predefined actions. It’s in public preview via AI Studio and Vertex AI. You get speed and flexibility, but you must design for safety, pop-ups, and flaky UIs.
What is the Gemini 2.5 Computer Use model?
It’s a specialized Gemini 2.5 model that controls web interfaces the way a human would: opening pages, filling forms, clicking buttons, and working behind logins. It runs in a loop: observe a screenshot, decide on an action, execute it, and repeat until the task is done or blocked for safety.
Google is shipping this in public preview through the Gemini API on Google AI Studio and Vertex AI. It’s tuned for browser environments first; desktop OS-level control isn’t the goal yet. There’s also a public demo (Browserbase) to watch it navigate in real time.
How it works: the agent loop in plain English
Think of a careful intern at a computer:
- You describe the goal.
- The model looks at a screenshot of the current page.
- It proposes a UI action (e.g., `click_at`, `type_text_at`, `navigate`).
- A safety service may require user confirmation for risky steps (e.g., checkout).
- Your code executes the action, grabs a new screenshot, and sends it back.
- Repeat until done or you stop.
The supported actions are predefined. Expect basics like opening a browser, going to a URL, clicking/typing at coordinates, scrolling, going back/forward, waiting, and drag-and-drop. You can also exclude certain actions (e.g., disallow drag-and-drop) to narrow behavior.
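Excluding actions can also be enforced on your side of the loop. A minimal sketch, where the action names are assumptions based on the basics listed above (verify against the API’s real action set):

```python
# Sketch: reject model-proposed actions outside an allow-list.
# Action names here are illustrative; use the API's documented set.

ALLOWED_ACTIONS = {
    "open_web_browser", "navigate", "click_at", "type_text_at",
    "scroll_document", "go_back", "go_forward", "wait_5_seconds",
    "drag_and_drop",
}
EXCLUDED_ACTIONS = {"drag_and_drop"}  # e.g., disallow drag-and-drop

def is_permitted(action_name: str) -> bool:
    """True only if the action is allow-listed and not explicitly excluded."""
    return action_name in ALLOWED_ACTIONS and action_name not in EXCLUDED_ACTIONS
```

Checking each proposed action before execution gives you a hard guarantee on top of whatever exclusions you pass to the model.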
Two important notes:
- Confirmations: If the safety layer flags a step as risky, your app must ask the user before execution. That’s by design.
- Scope: Today, Computer Use is browser-centric. Some outlets cite 13 actions; either way, treat it as a capable browser operator, not a full desktop automation tool.
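The loop and the confirmation gate described above can be sketched in a few lines. This is a stub-driven sketch, not the real API: in production the “model” is a Gemini API call returning a function call, and the “executor” drives a browser (e.g., via Playwright).

```python
# Minimal agent-loop sketch: observe -> decide -> (confirm if risky) -> execute.
# `model` and `executor` are stand-ins for the Gemini API and a browser driver.

RISKY = {"purchase", "delete"}  # illustrative: steps your safety layer flags

def run_loop(model, executor, confirm, max_steps=10):
    """Run the agent loop until done, blocked, or out of steps."""
    screenshot = executor.screenshot()
    for _ in range(max_steps):
        action = model.decide(screenshot)       # e.g. {"name": "click_at", ...}
        if action["name"] == "done":
            return "done"
        if action.get("risk") in RISKY and not confirm(action):
            return "blocked"                    # user declined a risky step
        executor.execute(action)
        screenshot = executor.screenshot()      # fresh observation each turn
    return "max_steps"
```

Note the `confirm` callback: routing risky steps through an explicit user prompt is exactly the confirmation behavior the safety layer expects your app to implement.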
Setup paths: AI Studio, Vertex AI, or Browserbase
You can start three ways:
- AI Studio (Developer API). Fast prototyping; clear docs and quickstarts. You’ll use the model id `gemini-2.5-computer-use-preview-10-2025`, pass screenshots, and parse function calls for actions.
- Vertex AI (enterprise). Same model, plus enterprise controls, IAM, logging, and private networking. Preview terms apply; follow safety guidance and don’t bypass confirmations.
- Browserbase (cloud browsers). Handy if you don’t want to maintain Playwright infra. There’s a free demo and an OSS reference showing the agent loop.
Getting a key is straightforward in AI Studio; their quickstart covers the SDK install and first call.
Pricing explained (with example scenarios)
AI Studio (Gemini API) pricing for Computer Use preview (per 1M tokens): Input $1.25 (≤200k tokens) / $2.50 (>200k), Output $10 (≤200k) / $15 (>200k). Vertex AI’s page mirrors these Computer Use preview rates under Gemini 2.5 Pro Computer Use (region and platform differences may apply).
What this means in practice:
- A small task (5K input tokens + 2K output tokens) costs only a tiny fraction of the $1.25/$10 per-million rates, usually just cents.
- Costs rise with lots of screenshots or long sequences. Keep screenshots succinct (right size, crop clutter), and avoid verbose logging in the model output.
Tip: Track tokens per action and actions per successful task. That KPI often predicts monthly cost better than “requests per day.”
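The tiered rates above are easy to turn into a per-run estimator. A minimal sketch using the AI Studio preview prices quoted earlier (rates may change; check the current pricing page):

```python
# Sketch: estimate a run's cost from the preview rates quoted above
# ($ per 1M tokens, tiered at a 200k-token prompt size).

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Pick the tier from prompt size, then price input and output tokens."""
    small_prompt = input_tokens <= 200_000
    in_rate = 1.25 if small_prompt else 2.50
    out_rate = 10.00 if small_prompt else 15.00
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

cost = estimate_cost_usd(5_000, 2_000)  # the "small task" example: a few cents
```

Feeding this your tokens-per-action numbers gives you the cost-per-successful-task KPI mentioned in the tip above.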
Benchmarks and what they actually mean
Google cites wins on Online-Mind2Web and WebVoyager, plus strong mobile control (AndroidWorld). Browserbase, which partnered with DeepMind, also reports state-of-the-art accuracy and latency across extensive runs. Treat vendor and partner results as promising but not gospel: recreate a mini-eval on your own top 20 tasks before committing.
A useful mental model:
- Online-Mind2Web: multi-step web tasks on real sites (booking, shopping).
- WebVoyager: open-web navigation benchmarks.
- Latency vs quality: Gemini aims for high accuracy at lower latency. Your mileage varies by site complexity, auth, pop-ups, and geo.
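Running that mini-eval on your own top tasks only needs a small aggregator. A minimal sketch, where the task names and run records are illustrative:

```python
# Sketch: summarize a mini-eval over your own top tasks.
# Each record: (task_name, succeeded, latency_seconds) -- illustrative data.
from statistics import median

def summarize(runs):
    """Return success rate and median latency across eval runs."""
    successes = [r for r in runs if r[1]]
    return {
        "success_rate": len(successes) / len(runs),
        "median_latency_s": median(r[2] for r in runs),
    }

runs = [("login", True, 12.0), ("search", True, 8.5), ("checkout", False, 30.0)]
summary = summarize(runs)
```

Tracking these two numbers week over week tells you whether a model or prompt change actually moved your workload, independent of vendor benchmarks.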
When to use Computer Use vs classic APIs
Use Computer Use when:
- No API exists or access is limited (partner portals, legacy admin UIs).
- You need UI-level verification (e.g., visual checks, feature flags).
- You’re testing end-to-end flows (QA smoke tests, checkout, sign-ups).
Stick to classic APIs when:
- There’s a reliable, documented API.
- You need high throughput or strict SLAs.
- You must avoid UI drift and pop-ups.
Pros (Computer Use): Flexible, works behind logins, human-like, fast to prototype.
Cons: UI drift, pop-ups, CAPTCHAs, token cost creep, requires safety confirmations for certain actions.
Real-world pitfalls and fixes
- Cookie banners & pop-ups: Add a pre-step instruction (“dismiss cookie consent if present”), allow a `wait_5_seconds`, and retry once.
- CAPTCHAs: Don’t automate them. If the model detects a CAPTCHA, stop and route to a human. This aligns with Google’s safety guidance.
- Auth flows: Prefer app passwords or short-lived test accounts; keep secrets out of prompts.
- DOM drift: Use goal-oriented prompts (“find the email field in the signup form”) and allow navigate+search fallback if a path breaks.
- Safety confirmations: For purchases, deletions, or data changes, make the agent ask the user—and log the consent reason.
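The “dismiss and retry once” guard from the pop-up fix above can be wrapped generically. A minimal sketch, where `step` and `dismiss_popups` are illustrative callables you supply:

```python
# Sketch: guard a flaky step with a one-time "dismiss pop-ups and retry".
# `step` and `dismiss_popups` are illustrative callables you supply.
import time

def with_popup_retry(step, dismiss_popups, wait_seconds=5):
    """Try a step; on failure, dismiss overlays, wait, and retry once."""
    try:
        return step()
    except Exception:
        dismiss_popups()          # e.g., instruct the agent to close banners
        time.sleep(wait_seconds)
        return step()             # a second failure propagates to the caller
```

Capping the retry at one attempt keeps a genuinely broken flow from looping and burning tokens.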
Mini case studies (composite, practical)
- QA smoke on staging: Run 12 key journeys (login, search, add-to-cart, checkout) at 7 a.m. and noon. Alert on deviation (missing button, error page). Early SLO catches saved a team 4–6 engineer hours weekly.
- Ops data pulls: No export API? Have an agent log in every Friday, filter to “last 7 days,” export CSV, drop to a GCS bucket. Token cost is tiny; value is high.
- Support triage: Route a tricky admin flow (reset SSO + reassign license). The agent attempts the path and stops for confirmation on sensitive steps.
Step by step: How to start your first agent
- Get access. Create a Gemini API key in AI Studio.
- Pick the model. Use
gemini-2.5-computer-use-preview-10-2025. Set environment to browser and enable the computer_use tool. - Start the loop. Send a goal + initial screenshot. Parse the function_call (e.g.,
type_text_at) and execute with Playwright. Send back a new screenshot + URL. - Add safety. If
safety_responseasks for confirmation, prompt the user and only proceed on explicit consent. - Ship a demo. Use the reference repo and compare behavior on the Browserbase demo environment.
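The parse-and-execute step above boils down to a dispatch table. A minimal sketch: `FakePage` stands in for a Playwright page, and the action names and argument shapes are assumptions based on the actions mentioned earlier, not the API’s exact schema.

```python
# Sketch: map a parsed function_call to browser operations.
# `FakePage` stands in for a Playwright page; names/args are assumptions.

class FakePage:
    def __init__(self):
        self.log = []  # records operations so the loop can be inspected
    def goto(self, url): self.log.append(("goto", url))
    def click(self, x, y): self.log.append(("click", x, y))
    def type_at(self, x, y, text): self.log.append(("type", x, y, text))

def execute(page, call):
    """Dispatch one function_call dict like {"name": ..., "args": {...}}."""
    name, args = call["name"], call.get("args", {})
    if name == "navigate":
        page.goto(args["url"])
    elif name == "click_at":
        page.click(args["x"], args["y"])
    elif name == "type_text_at":
        page.type_at(args["x"], args["y"], args["text"])
    else:
        raise ValueError(f"unsupported action: {name}")

page = FakePage()
execute(page, {"name": "navigate", "args": {"url": "https://example.com"}})
execute(page, {"name": "type_text_at", "args": {"x": 120, "y": 300, "text": "hi"}})
```

Swapping `FakePage` for a real Playwright page (and the real action schema) turns this into the execute half of the loop; the unknown-action branch is where the allow-list check belongs.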
Comparison: Computer Use vs Classic APIs
| Scenario | Computer Use (Gemini) | Classic API |
|---|---|---|
| No API available | ✅ Handles UI directly | ❌ Not possible |
| Behind logins | ✅ Works with auth flows | ⚠️ Depends on API |
| Throughput | ⚠️ Slower; UI steps | ✅ Usually faster |
| Flakiness (UI drift) | ⚠️ Needs guards | ✅ Stable schemas |
| E2E validation | ✅ Visual/UX checks | ❌ Limited |
Frequently Asked Questions (FAQs)
What is Gemini Computer Use?
A Gemini 2.5 model that controls web UIs by “seeing” screenshots and issuing actions (click/type/scroll) in a loop. It’s in public preview via API.
Is it browser-only?
Yes, it’s optimized for browsers. It’s not aimed at full desktop OS control right now.
Where do I start?
AI Studio for quick tests; Vertex AI for enterprise rollout; Browserbase for hosted cloud browsers and a public demo.
How much does it cost?
Computer Use preview pricing (AI Studio): $1.25/$2.50 per 1M input tokens (≤/>200k) and $10/$15 per 1M output tokens (≤/>200k). Vertex has a matching entry for Computer Use preview.
What about benchmarks?
Google reports strong results on Online-Mind2Web, WebVoyager, and AndroidWorld; Browserbase’s independent runs also show SOTA, but verify on your own tasks.
Does it bypass CAPTCHAs?
No. Follow the safety rules: don’t automate captchas; require user confirmation for risky steps.
Featured Snippet Boxes
What is Gemini 2.5 Computer Use?
A Gemini model that operates web UIs via screenshots and predefined actions (click, type, scroll) in a loop, now in public preview via AI Studio and Vertex AI.
How do you start using it?
Get a Gemini API key, select gemini-2.5-computer-use-preview-10-2025, enable the computer use tool, and implement the loop with Playwright or a hosted browser.
Is it browser only?
Yes. Google says it’s optimized for browsers and not desktop OS-level control yet.
How much does it cost?
Preview pricing is per million tokens: input from $1.25 and output from $10 (tiered by prompt size).
