Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale

Are you tired of sifting through countless integration test logs to find the bug you’re looking for? You’re not alone, and Google has the data to prove it.

Google researchers have developed Auto-Diagnose, an LLM-powered tool that automatically analyzes failure logs from broken integration tests, identifies the root cause, and posts a concise diagnosis directly in the code review where the failure occurred. In a manual evaluation of 71 real-world failures across 39 teams, the tool pinpointed the root cause 90.14% of the time. It has run on over 52,000 failing tests, 224,000 executions, and 91,000 code changes authored by nearly 23,000 developers, and only 5.8% of the feedback received rated a diagnosis ‘Not helpful’.

Integration tests can be a debugging nightmare, especially when dealing with hermetic functional integration tests that involve multiple components of a distributed system communicating with each other. A survey at Google revealed that 78% of integration tests are functional, which is why Auto-Diagnose focuses on this type of test.

Diagnosing integration test failures has been a common pain point for developers at Google, with a significant portion of failures taking hours or even days to diagnose. The problem lies in how the logs are structured: the test driver log often shows only a generic symptom, while the actual error is buried in the logs of the system-under-test (SUT) components.

Auto-Diagnose works by collecting all relevant logs from the test driver and the SUT components, merging them into a single stream ordered by timestamp, and feeding that stream into the Gemini 2.5 Flash model with specific parameters. The model follows a step-by-step protocol: it scans the logs, identifies failures, and summarizes the errors before stating a conclusion. The result is formatted as Markdown and posted in Google’s internal code review system.
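To make the log-merging step concrete, here is a minimal sketch in Python. It is not Google's implementation. The log line format, the `LogLine` type, the source labels, and the prompt wording are all assumptions made for illustration, and the model call itself is left out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogLine:
    timestamp: datetime
    source: str   # hypothetical label, e.g. "test_driver" or a component name
    message: str


def parse_line(source: str, raw: str) -> LogLine:
    # Assumed format: "<ISO-8601 timestamp> <message>".
    stamp, _, message = raw.partition(" ")
    return LogLine(datetime.fromisoformat(stamp), source, message)


def merge_logs(logs_by_source: dict[str, list[str]]) -> list[LogLine]:
    # Flatten every source into one list, then order it by timestamp.
    # sorted() is stable, so lines with equal timestamps keep their input order.
    lines = [
        parse_line(source, raw)
        for source, raw_lines in logs_by_source.items()
        for raw in raw_lines
    ]
    return sorted(lines, key=lambda line: line.timestamp)


def build_prompt(lines: list[LogLine]) -> str:
    # Hypothetical instructions that mirror the protocol described in the article.
    header = (
        "Scan the logs in order, identify failures, summarize the errors, "
        "then state a conclusion. If the logs are insufficient, say so "
        "instead of guessing."
    )
    body = "\n".join(
        f"{line.timestamp.isoformat()} [{line.source}] {line.message}"
        for line in lines
    )
    return f"{header}\n\n{body}"


# Made-up sample logs: the driver sees only a timeout, the component has the error.
sample_logs = {
    "test_driver": ["2025-05-01T12:00:03 RPC to backend timed out"],
    "backend": [
        "2025-05-01T12:00:01 starting request handler",
        "2025-05-01T12:00:02 ERROR config flag missing",
    ],
}
merged = merge_logs(sample_logs)
prompt = build_prompt(merged)
print(prompt)
```

In the merged stream, the component's error line appears just before the test driver's generic timeout, which is the kind of ordering that lets a reader (or a model) connect a symptom to its cause.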

The tool delivers diagnoses quickly, with a median latency of 56 seconds and a 90th percentile latency of 346 seconds. Feedback from developers has been overwhelmingly positive, with 84.3% of feedback requesting action on the diagnosis. Auto-Diagnose also ranks highly in helpfulness among the tools in Google’s code review system.
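For readers less familiar with these latency metrics, the sketch below shows one common way to compute a median and a 90th percentile in Python. The sample values are made up for illustration and are not Auto-Diagnose data. The function name `latency_summary` and the choice of the "inclusive" interpolation method are assumptions, since the article does not say how the percentiles were calculated.

```python
import statistics


def latency_summary(latencies_s: list[float]) -> tuple[float, float]:
    """Return (median, 90th percentile) for a list of latencies in seconds."""
    median = statistics.median(latencies_s)
    # quantiles(n=10) returns the 9 decile cut points; the last one is p90.
    p90 = statistics.quantiles(latencies_s, n=10, method="inclusive")[-1]
    return median, p90


# Illustrative sample only, not measurements from the article.
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
median, p90 = latency_summary(sample)
print(f"median={median}s p90={p90}s")
```

A 90th percentile far above the median, as in the reported 346 seconds versus 56 seconds, indicates a long tail: most diagnoses arrive quickly, while a minority take several minutes.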

Auto-Diagnose has proven to be a valuable tool for developers at Google, addressing a common pain point in integration test debugging. Its success rests on careful prompt engineering and a refusal to guess, which keeps diagnoses accurate and has even surfaced infrastructure issues in Google’s logging pipeline.

Since its deployment in May 2025, Auto-Diagnose has been used extensively on failing tests and code changes, delivering fast and accurate diagnoses to engineers before they switch contexts. The tool’s effectiveness is a testament to the power of AI-driven solutions in software development.

For more information, you can check out the pre-print paper on Auto-Diagnose.