
What You Actually Need to Monitor AI Systems in Production


Rahul Chhabria

You did it. You added the latest AI agent into your product. Shipped it. Went to sleep. Woke up to find it returning a blank string, taking five seconds longer than yesterday, or confidently outputting lies in perfect JSON.

Naturally, you check your logs. You see a prompt. You see a response. And you see nothing helpful.

Surprise. Prompt in and response out is not observability. It is vibes.

There is a lot of buzz around “LLM observability.” Most of it involves charts you do not need and dashboards you will forget to check. If you are actually building and shipping LLM-powered products—chatbots, internal agents, retrieval apps—you need observability. But not the kind that stops where things get interesting.

Let’s walk through what to track, when to track it, and why you should care before you accidentally spend five thousand dollars on empty completions.

Stage One: Pre-Production

(AKA “Prompt Graveyard”)

You are building a prototype. You have a notebook, a vector store, an OpenAI key, and a dream. This is not the time for dashboards. This is the time for panic-saving broken prompts.

What to Log

At this stage, you are debugging yourself more than your users. So log:

  • The full prompt and response

  • Model name, temperature, and function schema version

  • Token usage and latency

  • Something—anything—that identifies the version of your prompt

import logging
import os

logger = logging.getLogger(__name__)

log_data = {
    "prompt": prompt,
    "response": response,
    "model": "gpt-4-turbo",
    "temperature": 0.7,
    "latency_ms": duration_ms,
    "tokens": {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    },
    "prompt_version": os.getenv("GIT_COMMIT_HASH"),
}
logger.info("llm_trace", extra=log_data)

Prompt versioning does not need a fancy system. A commit hash will do. Or a sticky note. Just write it down before you forget what changed.
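If you want that commit hash available at runtime, here is a minimal sketch. The GIT_COMMIT_HASH variable and the git fallback are assumptions about your setup, not requirements:

import os
import subprocess

def current_prompt_version() -> str:
    # Prefer a value your deploy pipeline injects; fall back to asking git directly.
    version = os.getenv("GIT_COMMIT_HASH")
    if version:
        return version
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"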

Tooling You Can Get Away With

  • A JSON file

  • A table in Postgres

  • Sentry with structured logs

  • Traceloop, if you like that sort of thing

Your only goal here is to make weird behavior reproducible. If something explodes and you can explain it in less than five minutes, you win.
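If the JSON file is your tool of choice, appending one JSON line per call is enough to make a weird run reproducible. A minimal sketch, reusing the log_data dict from above (the traces.jsonl filename is just a placeholder):

import json
from datetime import datetime, timezone

def append_trace(log_data: dict, path: str = "traces.jsonl") -> None:
    # One JSON object per line: grep-able, diff-able, and fine for a prototype.
    record = {"ts": datetime.now(timezone.utc).isoformat(), **log_data}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")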

Stage Two: Production

(Now It Is Someone Else’s Problem)

Your app is live. Someone, somewhere is typing something terrible into your input box. This is where your optimistic prototype becomes a haunted house of retry storms, stale embeddings, and LLM behavior changes you did not ask for.

What to Monitor

At this point, you are not debugging the model. You are debugging everything around it.

Layer          What Will Break
Frontend       Laggy input fields, users pasting PDFs
Backend        Prompt assembly bugs, retry loops
LLM            Latency, token burn, mysterious hallucinations
Retrieval      Missing documents, low relevance scores
External APIs  Schema changes, rate limits, surprise outages
Infra          Cold starts, memory spikes, silent container exits

You Need Tracing. Actual Tracing.

This is not “what did we send to the model.” This is “what happened from the user click to the flaming output.”

import uuid

trace_id = str(uuid.uuid4())
logger.info("start_trace", extra={"trace_id": trace_id, "input": user_input})

response = call_agent(user_input, trace_id=trace_id)

logger.info("end_trace", extra={
    "trace_id": trace_id,
    "response": response,
    "latency_ms": elapsed,
    "tools_used": tool_calls,
})

Free Stuff That Actually Helps

  • Add trace identifiers across frontend and backend

  • Use OpenTelemetry or Sentry to follow request flow (see the sketch after this list)

  • Track retries and token usage

  • Set alerts for latency spikes and error rates
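If you reach for OpenTelemetry, a minimal sketch of wrapping the model call in a span looks something like this. The span and attribute names are assumptions, call_agent is the same hypothetical entry point from the snippet above, and it assumes you have configured an exporter elsewhere:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_llm_call(user_input: str):
    # One span per model call; attributes make latency and token burn queryable later.
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", "gpt-4-turbo")
        span.set_attribute("llm.prompt_chars", len(user_input))
        response = call_agent(user_input)
        span.set_attribute("llm.completion_chars", len(response))
        return response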

If you cannot tell what the user saw, what the model saw, and what changed, you are not doing observability. You are doing archaeology.

What Breaks Without It

  • A retry loop burns tokens quietly during a vector store outage

  • The UI breaks because the model adds an unexpected prefix

  • Retrieval stops working after a model change, and you find out on Twitter

  • Latency jumps 600 milliseconds and no one knows why

You cannot fix what you cannot find. Especially when it is wrapped in base64 inside a JSON blob.
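One cheap defense against the quiet retry burn: cap attempts, back off, and log cumulative token spend so an outage shows up in your alerts instead of your invoice. A minimal sketch, where call_llm and TransientLLMError stand in for your client and whatever retryable error it raises:

import time

MAX_ATTEMPTS = 3

def call_with_backoff(prompt: str):
    tokens_spent = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response, usage = call_llm(prompt)
            tokens_spent += usage.total_tokens
            logger.info("llm_call_ok", extra={"attempt": attempt, "tokens_spent": tokens_spent})
            return response
        except TransientLLMError:
            logger.warning("llm_retry", extra={"attempt": attempt, "tokens_spent": tokens_spent})
            time.sleep(2 ** attempt)  # back off instead of hammering a downed dependency
    raise RuntimeError(f"LLM call failed after {MAX_ATTEMPTS} attempts")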

Stage Three: Product Market Fit

(Also Known As “Now We Are On Call”)

Your product is being used. People depend on it. You have new problems now. Like cost. And scale. And trying to explain to your manager why your summarizer now refuses to summarize.

Observability shifts from catching bugs to catching regressions and making tradeoffs.

What You Actually Need

  • Output drift detection for every model upgrade

  • Evaluation metrics like semantic similarity and formatting checks

  • Cost and latency breakdowns by user, endpoint, and model

  • RAG quality tracking: is your index fresh, relevant, and not broken?

  • Full stack tracing, all the way down to the weird tool call that silently failed

Example: Quick and Dirty Eval

# embed() turns a string into a vector; cosine_similarity compares two vectors.
# reference is a known-good output you trust for this prompt.
score = cosine_similarity(embed(output), embed(reference))

if score < 0.8:
    alert("Drift detected in summarizer")

You Can Still Do Some of This Yourself

  • Nightly evals using cron and SQL

  • Token usage reports grouped by endpoint (a quick sketch follows this list)

  • Manual diffs of output before and after model changes

  • Embedding freshness checks with thresholds
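For the token usage report, if you kept the JSONL traces from Stage One you do not even need a database yet. A minimal sketch that rolls them up per endpoint (it assumes you added an endpoint field to each record):

import json
from collections import defaultdict

def tokens_by_endpoint(path: str = "traces.jsonl") -> dict:
    # Sums prompt and completion tokens per endpoint from the JSONL trace file.
    totals = defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            tokens = record.get("tokens", {})
            totals[record.get("endpoint", "unknown")] += (
                tokens.get("prompt_tokens", 0) + tokens.get("completion_tokens", 0)
            )
    return dict(totals)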

When to Stop Building and Start Paying

You should pay for tooling when:

  • You spend more time reading logs than shipping features

  • It takes longer than five minutes to answer “what changed”

  • Your system breaks silently and no one knows until customers complain

Useful tools:

  • LangSmith for tracing and evals

  • Sentry for full-stack visibility

  • WhyLabs or Arize for drift detection and output monitoring

What Good Looks Like

Monitoring LLM systems is messy. These are probabilistic tools. You are not logging errors. You are logging behavior.

Stage           What You Should See
Pre-Production  Prompt logs, token usage, version tracking
Production      Tracing, retries, feedback, structured logs
Post-PMF        Drift detection, evals, cost insights, RAG health

A good observability stack should answer four questions. Fast.

  1. What did we send to the model?

  2. Why did it respond that way?

  3. What changed recently?

  4. How much is this costing us?

If you can answer those, you are probably fine. If you cannot, start small. Fix the parts that hurt. Add more when things get weird. They will.

One More Thing

If your monitoring stops at the model call, you are not monitoring. You are hoping.

Sentry gives you full request traces across your app. You can follow a user click through your toolchain, vector store, model call, and back to the output. You can even see why a tool invocation failed inside an agent plan, without spending your entire afternoon rebuilding context from logs.
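If you want to see roughly what that looks like in code, here is a minimal sketch with the Python SDK. The DSN is a placeholder, the op and tag names are assumptions, and call_agent is the same hypothetical entry point from earlier:

import sentry_sdk

sentry_sdk.init(dsn="<your-dsn>", traces_sample_rate=1.0)

with sentry_sdk.start_transaction(op="ai.agent", name="agent_request") as txn:
    # Tags and data show up on the trace, so you can filter by model and inspect tool calls.
    txn.set_tag("model", "gpt-4-turbo")
    response = call_agent(user_input)
    txn.set_data("tools_used", tool_calls)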

So stop guessing. Start knowing. And please log the prompt before you forget what broke it.

Check out the docs to learn more, join the discussion in Discord, or if you’re new to Sentry, you can get started for free.
