
What You Actually Need to Monitor AI Systems in Production


Rahul Chhabria

You did it. You added the latest AI agent into your product. Shipped it. Went to sleep. Woke up to find it returning a blank string, taking five seconds longer than yesterday, or confidently outputting lies in perfect JSON.

Naturally, you check your logs. You see a prompt. You see a response. And you see nothing helpful.

Surprise. Prompt in and response out is not observability. It is vibes.

There is a lot of buzz around “LLM observability.” Most of it involves charts you do not need and dashboards you will forget to check. If you are actually building and shipping LLM-powered products—chatbots, internal agents, retrieval apps—you need observability. But not the kind that stops where things get interesting.

Let’s walk through what to track, when to track it, and why you should care before you accidentally spend five thousand dollars on empty completions.

Stage One: Pre-Production

(AKA “Prompt Graveyard”)

You are building a prototype. You have a notebook, a vector store, an OpenAI key, and a dream. This is not the time for dashboards. This is the time for panic-saving broken prompts.

What to Log

At this stage, you are debugging yourself more than your users. So log:

  • The full prompt and response

  • Model name, temperature, and function schema version

  • Token usage and latency

  • Something—anything—that identifies the version of your prompt

import logging
import os

logger = logging.getLogger(__name__)

log_data = {
    "prompt": prompt,
    "response": response,
    "model": "gpt-4-turbo",
    "temperature": 0.7,
    "latency_ms": duration_ms,
    "tokens": {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    },
    "prompt_version": os.getenv("GIT_COMMIT_HASH"),
}
logger.info("llm_trace", extra=log_data)

Prompt versioning does not need a fancy system. A commit hash will do. Or a sticky note. Just write it down before you forget what changed.
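If you want that commit hash available at runtime, here is a minimal sketch. The GIT_COMMIT_HASH variable and the git fallback are assumptions about your setup, not requirements:

import os
import subprocess

def current_prompt_version() -> str:
    # Prefer a value your deploy pipeline injects; fall back to asking git directly.
    version = os.getenv("GIT_COMMIT_HASH")
    if version:
        return version
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"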

Tooling You Can Get Away With

  • A JSON file

  • A table in Postgres

  • Sentry with structured logs

  • Traceloop, if you like that sort of thing

Your only goal here is to make weird behavior reproducible. If something explodes and you can explain it in less than five minutes, you win.
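If the JSON file is your tool of choice, appending one JSON line per call is enough to make a weird run reproducible. A minimal sketch, reusing the log_data dict from above (the traces.jsonl filename is just a placeholder):

import json
from datetime import datetime, timezone

def append_trace(log_data: dict, path: str = "traces.jsonl") -> None:
    # One JSON object per line: grep-able, diff-able, and fine for a prototype.
    record = {"ts": datetime.now(timezone.utc).isoformat(), **log_data}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")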

Stage Two: Production

(Now It Is Someone Else’s Problem)

Your app is live. Someone, somewhere is typing something terrible into your input box. This is where your optimistic prototype becomes a haunted house of retry storms, stale embeddings, and LLM behavior changes you did not ask for.

What to Monitor

At this point, you are not debugging the model. You are debugging everything around it.

Layer          What Will Break
Frontend       Laggy input fields, users pasting PDFs
Backend        Prompt assembly bugs, retry loops
LLM            Latency, token burn, mysterious hallucinations
Retrieval      Missing documents, low relevance scores
External APIs  Schema changes, rate limits, surprise outages
Infra          Cold starts, memory spikes, silent container exits

You Need Tracing. Actual Tracing.

This is not “what did we send to the model.” This is “what happened from the user click to the flaming output.”

import uuid

trace_id = str(uuid.uuid4())
logger.info("start_trace", extra={"trace_id": trace_id, "input": user_input})

response = call_agent(user_input, trace_id=trace_id)

logger.info("end_trace", extra={
    "trace_id": trace_id,
    "response": response,
    "latency_ms": elapsed,
    "tools_used": tool_calls,
})

Free Stuff That Actually Helps

  • Add trace identifiers across frontend and backend

  • Use OpenTelemetry or Sentry to follow request flow (see the sketch after this list)

  • Track retries and token usage

  • Set alerts for latency spikes and error rates
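If you reach for OpenTelemetry, a minimal sketch of wrapping the model call in a span looks something like this. The span and attribute names are assumptions, call_agent is the same hypothetical entry point from the snippet above, and it assumes you have configured an exporter elsewhere:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def traced_llm_call(user_input: str):
    # One span per model call; attributes make latency and token burn queryable later.
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", "gpt-4-turbo")
        span.set_attribute("llm.prompt_chars", len(user_input))
        response = call_agent(user_input)
        span.set_attribute("llm.completion_chars", len(response))
        return response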

If you cannot tell what the user saw, what the model saw, and what changed, you are not doing observability. You are doing archaeology.

What Breaks Without It

  • A retry loop burns tokens quietly during a vector store outage

  • The UI breaks because the model adds an unexpected prefix

  • Retrieval stops working after a model change, and you find out on Twitter

  • Latency jumps 600 milliseconds and no one knows why

You cannot fix what you cannot find. Especially when it is wrapped in base64 inside a JSON blob.
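One cheap defense against the quiet retry burn: cap attempts, back off, and log cumulative token spend so an outage shows up in your alerts instead of your invoice. A minimal sketch, where call_llm and TransientLLMError stand in for your client and whatever retryable error it raises:

import time

MAX_ATTEMPTS = 3

def call_with_backoff(prompt: str):
    tokens_spent = 0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            response, usage = call_llm(prompt)
            tokens_spent += usage.total_tokens
            logger.info("llm_call_ok", extra={"attempt": attempt, "tokens_spent": tokens_spent})
            return response
        except TransientLLMError:
            logger.warning("llm_retry", extra={"attempt": attempt, "tokens_spent": tokens_spent})
            time.sleep(2 ** attempt)  # back off instead of hammering a downed dependency
    raise RuntimeError(f"LLM call failed after {MAX_ATTEMPTS} attempts")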

Stage Three: Product Market Fit

(Also Known As “Now We Are On Call”)

Your product is being used. People depend on it. You have new problems now. Like cost. And scale. And trying to explain to your manager why your summarizer now refuses to summarize.

Observability shifts from catching bugs to catching regressions and making tradeoffs.

What You Actually Need

  • Output drift detection for every model upgrade

  • Evaluation metrics like semantic similarity and formatting checks

  • Cost and latency breakdowns by user, endpoint, and model

  • RAG quality tracking: is your index fresh, relevant, and not broken?

  • Full stack tracing, all the way down to the weird tool call that silently failed

Example: Quick and Dirty Eval

# embed() turns a string into a vector; cosine_similarity compares two vectors.
# reference is a known-good output you trust for this prompt.
score = cosine_similarity(embed(output), embed(reference))

if score < 0.8:
    alert("Drift detected in summarizer")

You Can Still Do Some of This Yourself

  • Nightly evals using cron and SQL

  • Token usage reports grouped by endpoint (a quick sketch follows this list)

  • Manual diffs of output before and after model changes

  • Embedding freshness checks with thresholds
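For the token usage report, if you kept the JSONL traces from Stage One you do not even need a database yet. A minimal sketch that rolls them up per endpoint (it assumes you added an endpoint field to each record):

import json
from collections import defaultdict

def tokens_by_endpoint(path: str = "traces.jsonl") -> dict:
    # Sums prompt and completion tokens per endpoint from the JSONL trace file.
    totals = defaultdict(int)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            tokens = record.get("tokens", {})
            totals[record.get("endpoint", "unknown")] += (
                tokens.get("prompt_tokens", 0) + tokens.get("completion_tokens", 0)
            )
    return dict(totals)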

When to Stop Building and Start Paying

You should pay for tooling when:

  • You spend more time reading logs than shipping features

  • It takes longer than five minutes to answer “what changed”

  • Your system breaks silently and no one knows until customers complain

Useful tools:

  • LangSmith for tracing and evals

  • Sentry for full-stack visibility

  • WhyLabs or Arize for drift detection and output monitoring

What Good Looks Like

Monitoring LLM systems is messy. These are probabilistic tools. You are not logging errors. You are logging behavior.

Stage           What You Should See
Pre-Production  Prompt logs, token usage, version tracking
Production      Tracing, retries, feedback, structured logs
Post-PMF        Drift detection, evals, cost insights, RAG health

A good observability stack should answer four questions. Fast.

  1. What did we send to the model?

  2. Why did it respond that way?

  3. What changed recently?

  4. How much is this costing us?

If you can answer those, you are probably fine. If you cannot, start small. Fix the parts that hurt. Add more when things get weird. They will.

One More Thing

If your monitoring stops at the model call, you are not monitoring. You are hoping.

Sentry gives you full request traces across your app. You can follow a user click through your toolchain, vector store, model call, and back to the output. You can even see why a tool invocation failed inside an agent plan, without spending your entire afternoon rebuilding context from logs.
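If you want to see roughly what that looks like in code, here is a minimal sketch with the Python SDK. The DSN is a placeholder, the op and tag names are assumptions, and call_agent is the same hypothetical entry point from earlier:

import sentry_sdk

sentry_sdk.init(dsn="<your-dsn>", traces_sample_rate=1.0)

with sentry_sdk.start_transaction(op="ai.agent", name="agent_request") as txn:
    # Tags and data show up on the trace, so you can filter by model and inspect tool calls.
    txn.set_tag("model", "gpt-4-turbo")
    response = call_agent(user_input)
    txn.set_data("tools_used", tool_calls)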

So stop guessing. Start knowing. And please log the prompt before you forget what broke it.

Check out the docs to learn more, join the discussion in Discord, or if you’re new to Sentry, you can get started for free.
