
Seer fixes Seer: How Seer pointed us toward a bug and helped fix an outage


Kush Dubey


Seer is our AI agent that takes a bug and uses all of the context Sentry has to find the root cause and suggest a fix. We use it all the time to help us improve Sentry. Seer fixes Sentry.

More recently, Seer has been helping us fix itself — Seer fixing Seer. 

An upstream outage triggered a bit of an avalanche, revealing a bug that had been hiding away for months. When it came time to fix it, Seer pointed us exactly where we needed to look. 

The alarm bells

On February 21, 2026, Seer's AI-powered issue summarization went down in our EU region.

Around 80-90% of requests to Seer’s Issue Summary API endpoint were failing, which meant broken "AI Summary" cards on every new Sentry issue. No actionability scores. No automated autofix runs. 40,000+ errors streamed in.

We hadn’t recently changed anything that should’ve caused something like this — so why now?

Turns out there was an upstream issue. Seer AI Summaries run on gemini-2.5-flash-lite through Google Cloud Platform (GCP) Vertex AI. GCP later declared an incident for gemini-2.5-flash-lite availability in several EU regions.

But that should have been a minor degradation. We had provisioned throughput in Vertex AI and were only using about 12% of it. 

What turned a manageable upstream availability issue into a total outage was our own code: a latency optimization we'd built to skip failing regions blocklisted every Gemini region in the EU, including the one where we had guaranteed capacity.

How Seer routes LLM calls in the EU

As noted, Seer runs gemini-2.5-flash-lite through GCP Vertex AI. In our EU deployment, we have provisioned throughput (PT) in europe-west1, which gives us reserved capacity even when broader Vertex AI demand spikes. 

We use Standard pay-as-you-go (Standard PayGo) for several other EU regions. Standard PayGo is best-effort capacity where Google sets a quota based on our total Vertex AI spend over a rolling 30-day period, with no guarantees during bursts in demand.

Seer's LLM client implements a region fallback with a temporary blocklist: if a region accumulates 6 eligible failures within a short window, it's temporarily removed from the rotation. This feature is important for latency-sensitive services, as a 429 or 504 response typically takes 2–4 seconds to return. During an interactive Autofix session, which makes 50–100 LLM calls, these delays add up.
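A sliding-window blocklist like the one described above can be sketched as follows. This is a hypothetical reconstruction, not Seer's actual code; the threshold of 6 comes from the incident, while the window and cool-down durations are illustrative assumptions:

```python
import time
from collections import defaultdict, deque

BLOCKLIST_THRESHOLD = 6      # failures that trip the blocklist (from the incident)
FAILURE_WINDOW_SECS = 60.0   # length of the sliding window (assumption)
BLOCKLIST_TTL_SECS = 300.0   # how long a region stays out of rotation (assumption)

class RegionBlocklist:
    """Sliding-window failure tracker that temporarily removes a
    (region, model) pair from rotation after too many recent failures."""

    def __init__(self):
        self._failures = defaultdict(deque)  # (region, model) -> failure timestamps
        self._blocked_until = {}             # (region, model) -> blocklist expiry time

    def record_failure(self, region, model, now=None):
        if now is None:
            now = time.monotonic()
        window = self._failures[(region, model)]
        window.append(now)
        # Evict failures that have aged out of the sliding window.
        while window and now - window[0] > FAILURE_WINDOW_SECS:
            window.popleft()
        if len(window) >= BLOCKLIST_THRESHOLD:
            self._blocked_until[(region, model)] = now + BLOCKLIST_TTL_SECS

    def is_blocked(self, region, model, now=None):
        if now is None:
            now = time.monotonic()
        return self._blocked_until.get((region, model), 0.0) > now
```

Note that this counts raw failures, not a failure rate — a detail that matters later in the story.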

This system has a critical invariant: never blocklist a PT region. That's your guaranteed capacity, and blocklisting it means giving up the one region you're paying to handle your load while dumping all of it onto regions that can't absorb it.

This was the bug that had been hiding away for months: we enforced this invariant in our US deployment but forgot to add it to the EU one.

The cascade

Our PT region, europe-west1, started returning 504 Deadline Exceeded errors because the model was intermittently unavailable on Google's side.

Six failures in a short window was enough to cross the blocklist threshold, and europe-west1 was frequently removed from the rotation.

With europe-west1 blocklisted, all traffic shifted to our Standard PayGo regions, which weren't provisioned to handle the full load. europe-west4 started returning 429 RESOURCE_EXHAUSTED and got blocklisted, then europe-central2 followed, and so on. Within minutes, the client had cycled through and blocklisted every EU region, and every subsequent call raised LlmNoRegionsToRunError — there were no allowed regions left to call.
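The terminal failure mode is simple to sketch: once the fallback loop finds every candidate region blocklisted, there is nothing left to try. A minimal hypothetical version (Seer's actual client is more involved):

```python
class LlmNoRegionsToRunError(Exception):
    """Raised when the blocklist has consumed every candidate region."""

def pick_region(candidates, is_blocked):
    """Return the first region still in rotation, in preference order."""
    for region in candidates:
        if not is_blocked(region):
            return region
    # Every region has been removed from rotation — the request cannot run.
    raise LlmNoRegionsToRunError(
        f"all {len(candidates)} candidate regions are blocklisted"
    )
```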

Even during the GCP incident, most calls to europe-west1 were succeeding because our provisioned throughput was absorbing the load. But the blocklist triggered on 6 failures regardless of the success rate, so a region could be handling the vast majority of requests just fine and still get banned because at least 6 failures happened to cluster.

The fix was adding europe-west1 to an allowlist that prevents PT regions from being blocklisted. Failure rates returned to baseline within minutes of deploying it.

The code problem

Simplified, the blocklist logic looked like this:

def should_blocklist(region: str, model: str, error_count: int) -> bool:
    return error_count >= BLOCKLIST_THRESHOLD

What it needed:

def should_blocklist(region: str, model: str, error_count: int) -> bool:
    if is_provisioned_throughput_region(region, model):
        return False  # Never blocklist PT regions

    return error_count >= BLOCKLIST_THRESHOLD

The US deployment had a hardcoded exception for its PT region, but when we provisioned throughput for the EU deployment—after a previous incident where a load increase triggered a wave of 429s—we didn't add the corresponding exception in the blocklist code. The configuration relied on a developer remembering to update a separate, manually maintained list—a classic gap between infrastructure provisioning and the application’s awareness of it.
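One way to close that gap is to keep the PT mapping in a single shared structure that both provisioning updates and the blocklist reads. This is a hypothetical sketch, not our configuration; the deployment keys are illustrative:

```python
# Single source of truth for provisioned throughput, keyed by
# (deployment, model). Provisioning a new PT region means updating
# this mapping, and the blocklist consults the same mapping.
PROVISIONED_THROUGHPUT = {
    ("eu", "gemini-2.5-flash-lite"): {"europe-west1"},
    # ("us", "gemini-2.5-flash-lite"): {...},  # the US PT region would live here too
}

def is_provisioned_throughput_region(region: str, model: str, deployment: str) -> bool:
    return region in PROVISIONED_THROUGHPUT.get((deployment, model), set())
```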

The secondary issue is the heuristic itself. A threshold of 6 errors regardless of total volume and success rate was hardcoded based on load from months ago. We're replacing it with an error-rate-based approach.
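A rate-based check can be sketched like this. The thresholds are illustrative assumptions, not our production values:

```python
MIN_REQUESTS = 50        # minimum sample size before judging a region (assumption)
MAX_ERROR_RATE = 0.5     # failure-rate threshold for removal (assumption)

def should_blocklist_by_rate(successes: int, failures: int) -> bool:
    """Only remove a region when failures dominate a meaningful volume
    of traffic, not when a handful of errors happen to cluster."""
    total = successes + failures
    if total < MIN_REQUESTS:
        return False  # too little traffic to judge the region
    return failures / total > MAX_ERROR_RATE
```

Under this check, a region serving 94 successes alongside 6 failures stays in rotation, where the fixed-count heuristic would have banned it.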

Seer debugging Seer

There's an obvious irony in Sentry's AI debugging tool being used to debug an outage of Sentry's AI debugging tool. But it was genuinely the fastest path to understanding the blast radius.

The initial detection came from a standard Sentry alert. But figuring out what was actually broken, which user-facing features were impacted, which regions, which model, and whether this was EU-only or global, came from Seer running an analysis on that LlmNoRegionsToRunError issue.

Within seconds, Seer identified that failed issue summaries accounted for the bulk of ~42K errors, with spam detection (~1,600) and autofix (~850) also impacted. It confirmed >99% of events were in the EU deployment and it traced the blocklisting cascade through the breadcrumb trail.

The analysis got most of the way to the root cause autonomously.

The final step, realizing that the PT region shouldn't have been blocklistable, came from human knowledge of the provisioned throughput setup. But Seer had pointed directly at the region blocklisting mechanism as the culprit, and then confirmed that calls to the PT region were mostly succeeding even during the GCP incident, which is exactly the combination of facts our engineers needed to make the fix click.

The lesson

Latency optimizations have a potential failure mode that's worse than having no optimization at all. A circuit breaker that opens too aggressively, a blocklist that doesn't respect reserved capacity, or a fallback chain that amplifies failures can each turn an upstream provider incident into a total outage of your own making.

The gap this bug exploited is mundane and common: the distance between "we provisioned capacity in GCP" and "our application code knows we provisioned capacity in GCP." If you're routing LLM requests across multiple regions (which if you're running AI features at scale, you probably are), audit your circuit breakers and make sure they know which regions are sacred. This fix was six lines of code.


To see how Seer analyzes production issues, check out the Seer documentation or try it on your own Sentry project.
