Google SREs Deploy Gemini CLI to Cut Incident Response Time

Quick Brief

The Technology: Google Site Reliability Engineering teams now use Gemini CLI powered by Gemini 3 to automate incident response workflows, reducing Mean Time to Mitigation (MTTM) during production outages.
The Impact: SREs maintain a 5-minute Service Level Objective (SLO) for incident acknowledgment while AI agents handle diagnosis, mitigation selection, and postmortem generation.
The Context: This deployment represents the first production-scale implementation of agentic AI in Google’s core infrastructure operations, where outages affect multiple services simultaneously.

Google has deployed Gemini CLI, an AI-powered terminal tool backed by the Gemini 3 foundation model across its Site Reliability Engineering (SRE) teams to automate critical incident response workflows. The system integrates with Google’s internal ProdAgent framework to reduce Mean Time to Mitigation (MTTM) during production outages affecting core infrastructure.

Gemini CLI Architecture for Production Incidents

The Gemini CLI implementation connects four operational intelligence tools through the Model Context Protocol (MCP) to build incident context. The fetch_playbook function chains get_incident_details for alert metadata, causal_analysis for time series behavior correlation with mitigation labels, timeseries_correlation for anomaly detection, and log_analysis for volumetric pattern recognition.

Google SREs operate under a 5-minute SLO just to acknowledge incident pages, with extreme pressure to mitigate shortly after. The system measures impact in “Bad Customer Minutes” every minute degraded service burns through Error Budget allocations. Unlike Mean Time to Repair (MTTR), which focuses on complete fixes, MTTM prioritizes speed in stopping user-facing degradation.

The workflow follows four stages: paging (alert triggers), mitigation (stopping impact before diagnosing root cause), root cause analysis (identifying underlying bugs), and postmortem documentation with action items. Gemini CLI automates decision-making at each stage while maintaining human approval gates.

AdwaitX Analysis: Multi-Layer Safety in Agentic Operations

Google implements a five-tier safety architecture positioning Gemini CLI as a “copilot, not an autopilot”. The system restricts agents to deterministic tools via strictly typed Model Context Protocol definitions like borg_task_restart (equivalent to Kubernetes pod restarts), eliminating arbitrary bash script execution.

Every tool includes risk assessment metadata flagging actions as safe, reversible, or destructive. A policy enforcement layer blocks commands violating contextual rules such as preventing global restarts during peak traffic or requiring two-person approval for high-risk operations. Human-in-the-loop confirmation steps maintain accountability while enabling AI-speed execution.

Audit trail logging captures both AI proposals and human approvals, satisfying compliance requirements for production mutations. This architecture addresses the core operational challenge: commands safe under specific system states may cause cascading failures in different conditions, such as binary rollbacks during active configuration pushes.

Incident Response Performance Metrics

Stage	Traditional Manual Process	Gemini CLI-Assisted Process
Initial Diagnosis	Minutes to identify mitigation class	Seconds via automated symptom classification
Failed Mitigation Recovery	Context switching across dashboards, manual log analysis	Immediate error analysis with pattern recognition
Root Cause Analysis	Manual code review and log correlation	Under 2 minutes with monorepo analysis
Postmortem Generation	Tedious timestamp gathering and document creation	Automated CSV timeline and Markdown template population

During a simulated Core SRE incident (s_e1vnco7W2), Gemini CLI recommended the borg_task_restart playbook based on symptom analysis. When the restart failed, the system immediately identified a pattern only the specific job failed while others in the cluster remained healthy suggesting an application-level defect rather than infrastructure issues. This insight redirected investigation within seconds, avoiding time loss on infrastructure troubleshooting.

Code Generation and Automated Remediation

After identifying a logic error in a recent configuration push, Gemini CLI generated a Changelist (Google’s equivalent to GitHub Pull Requests) reverting the problematic configuration and applying safeguards. The system cross-referenced recent changes in Google’s monorepo with production logs to isolate the defect.

The postmortem automation developed by Google Cloud Developer Advocate Riccardo Carlesso scrapes conversation history, metrics, and logs to populate incident timelines. Custom commands generate Markdown documents based on standard SRE postmortem templates and suggest action items to prevent recurrence. MCP integration with issue trackers automatically files bugs, assigns engineering owners, and exports documentation to Google Docs.

Market Implications for DevOps Automation

Gemini CLI is publicly available at geminicli.com with MCP server support for Grafana, Prometheus, PagerDuty, and Kubernetes. Google’s deployment demonstrates production viability of agentic AI in high-stakes infrastructure operations where Core SRE incidents affect foundational services including safety, security, account management, and data backends visible across multiple products.

The “virtuous cycle” architecture feeds generated postmortems back into Gemini as training data, creating self-improving incident response capabilities where “the output of today’s investigation becomes the input for tomorrow’s solution”. This represents a fundamental shift from reactive scripting to proactive automation that triggers at optimal moments addressing the SRE principle that solving a problem once requires not just writing scripts, but building systems that execute them at the exact right time.

Roadmap for Enterprise SRE Adoption

Organizations can replicate the workflow pattern using publicly available Gemini CLI with custom MCP servers connecting to proprietary monitoring and incident management platforms. The Model Context Protocol enables integration with existing DevOps toolchains without requiring infrastructure migration.

Google’s approach aligns with the SRE mission of “Eliminate Toil” replacing repetitive manual work with engineered systems. The deployment validates that AI can safely assist operators during high-pressure outages without removing human control through confirmation gates and deterministic tool restrictions.

Frequently Asked Questions (FAQs)

What is Gemini CLI for incident response?

Gemini CLI is Google’s AI-powered terminal tool using Gemini 3 to automate SRE workflows including incident diagnosis, mitigation selection, code generation, and postmortem documentation.

How does Google ensure AI safety in production operations?

Google uses five safety layers: deterministic tool restrictions, risk metadata, policy enforcement, human-in-the-loop approvals, and audit logging to maintain accountability.

What is Mean Time to Mitigation (MTTM)?

MTTM measures speed to stop user impact during outages, unlike MTTR which tracks complete repair time. Google SREs target 5-minute acknowledgment SLOs.

Can external organizations use Gemini CLI?

Yes, Gemini CLI is publicly available with MCP server support for tools like Grafana, Prometheus, PagerDuty, and Kubernetes at geminicli.com.

Search for an article

Google SREs Deploy Gemini CLI to Automate Critical Incident Response