What You Actually Need to Monitor AI Systems in Production
You did it. You added the latest AI agent to your product. Shipped it. Went to sleep. Woke up to find it returning a blank string, taking five seconds longer than yesterday, or confidently outputting lies in perfect JSON.
Naturally, you check your logs. You see a prompt. You see a response. And you see nothing helpful.
Surprise. Prompt in and response out is not observability. It is vibes.
There is a lot of buzz around “LLM observability.” Most of it involves charts you do not need and dashboards you will forget to check. If you are actually building and shipping LLM-powered products—chatbots, internal agents, retrieval apps—you need observability. But not the kind that stops where things get interesting.
Let’s walk through what to track, when to track it, and why you should care before you accidentally spend five thousand dollars on empty completions.
Pre-Production (AKA “Prompt Graveyard”)
You are building a prototype. You have a notebook, a vector store, an OpenAI key, and a dream. This is not the time for dashboards. This is the time for panic-saving broken prompts.
At this stage, you are debugging yourself more than your users. So log:
- The full prompt and response
- Model name, temperature, and function schema version
- Token usage and latency
- Something (anything) that identifies the version of your prompt
```python
import logging
import os

logger = logging.getLogger(__name__)

# One structured record per model call: enough context to reproduce it later.
log_data = {
    "prompt": prompt,
    "response": response,
    "model": "gpt-4-turbo",
    "temperature": 0.7,
    "latency_ms": duration_ms,
    "tokens": {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
    },
    "prompt_version": os.getenv("GIT_COMMIT_HASH"),
}
logger.info("llm_trace", extra=log_data)
```
Prompt versioning does not need a fancy system. A commit hash will do. Or a sticky note. Just write it down before you forget what changed.
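If GIT_COMMIT_HASH is not already set by your deploy pipeline, here is a minimal sketch that falls back to asking git directly (assuming the app runs from a checkout; the helper name is illustrative):

```python
import os
import subprocess

def current_prompt_version() -> str:
    """Best-effort prompt version: env var first, then git, then 'unknown'."""
    version = os.getenv("GIT_COMMIT_HASH")
    if version:
        return version
    try:
        # Assumes the process is running inside a git checkout.
        return (
            subprocess.check_output(["git", "rev-parse", "--short", "HEAD"])
            .decode()
            .strip()
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"
```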
Where should these logs live? Anywhere you will actually look:

- A JSON file
- A table in Postgres
- Sentry with structured logs
- Traceloop, if you like that sort of thing
Your only goal here is to make weird behavior reproducible. If something explodes and you can explain it in less than five minutes, you win.
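“Reproducible” can be as blunt as replaying the logged call. A minimal sketch, assuming the structured record above and the OpenAI Python SDK (the field names and single-message prompt shape are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def replay(log_line: str) -> str:
    """Re-run a logged call with the same model and temperature."""
    record = json.loads(log_line)
    completion = client.chat.completions.create(
        model=record["model"],
        temperature=record["temperature"],
        messages=[{"role": "user", "content": record["prompt"]}],
    )
    return completion.choices[0].message.content
```

Diff the replay against the logged response: if they disagree wildly, the change probably happened on the model side, not in your code.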
Production (Now It Is Someone Else’s Problem)
Your app is live. Someone, somewhere is typing something terrible into your input box. This is where your optimistic prototype becomes a haunted house of retry storms, stale embeddings, and LLM behavior changes you did not ask for.
At this point, you are not debugging the model. You are debugging everything around it.
| Layer | What Will Break |
| --- | --- |
| Frontend | Laggy input fields, users pasting PDFs |
| Backend | Prompt assembly bugs, retry loops |
| LLM | Latency, token burn, mysterious hallucinations |
| Retrieval | Missing documents, low relevance scores |
| External APIs | Schema changes, rate limits, surprise outages |
| Infra | Cold starts, memory spikes, silent container exits |
This is not “what did we send to the model.” This is “what happened from the user click to the flaming output.”
```python
import time
import uuid

trace_id = str(uuid.uuid4())

logger.info("start_trace", extra={"trace_id": trace_id, "input": user_input})

start = time.monotonic()
response = call_agent(user_input, trace_id=trace_id)
elapsed = int((time.monotonic() - start) * 1000)

logger.info("end_trace", extra={
    "trace_id": trace_id,
    "response": response,
    "latency_ms": elapsed,
    "tools_used": tool_calls,  # however your agent reports its tool calls
})
```
What to wire up:

- Add trace identifiers across frontend and backend
- Use OpenTelemetry or Sentry to follow request flow (see the sketch after this list)
- Track retries and token usage
- Set alerts for latency spikes and error rates
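Here is a rough idea of what the OpenTelemetry side can look like, as a minimal sketch using the SDK's console exporter (span names, attributes, and the retrieve/call_model helpers are illustrative, not a prescribed schema):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; point this at your collector in production.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app")

def handle_request(user_input: str) -> str:
    with tracer.start_as_current_span("agent_request") as span:
        span.set_attribute("input.length", len(user_input))
        with tracer.start_as_current_span("retrieval"):
            docs = retrieve(user_input)  # your vector store lookup
        with tracer.start_as_current_span("llm_call") as llm_span:
            response, usage = call_model(user_input, docs)  # your model call
            llm_span.set_attribute("tokens.prompt", usage.prompt_tokens)
            llm_span.set_attribute("tokens.completion", usage.completion_tokens)
        return response
```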
If you cannot tell what the user saw, what the model saw, and what changed, you are not doing observability. You are doing archaeology.
Things that actually happen:

- A retry loop burns tokens quietly during a vector store outage (a cheap guard is sketched below)
- The UI breaks because the model adds an unexpected prefix
- Retrieval stops working after a model change, and you find out on Twitter
- Latency jumps 600 milliseconds and no one knows why
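That first one is cheap to catch if you count tokens per trace instead of per call. A minimal sketch; the budget number and helper name are made up for illustration:

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)

TOKEN_BUDGET_PER_TRACE = 20_000  # illustrative; tune to your own traffic

_tokens_by_trace: dict[str, int] = defaultdict(int)

def record_usage(trace_id: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Accumulate token spend per trace and complain when retries blow the budget."""
    _tokens_by_trace[trace_id] += prompt_tokens + completion_tokens
    if _tokens_by_trace[trace_id] > TOKEN_BUDGET_PER_TRACE:
        logger.warning(
            "token_budget_exceeded",
            extra={"trace_id": trace_id, "tokens": _tokens_by_trace[trace_id]},
        )
```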
You cannot fix what you cannot find. Especially when it is wrapped in base64 inside a JSON blob.
Post-PMF (Also Known As “Now We Are On Call”)
Your product is being used. People depend on it. You have new problems now. Like cost. And scale. And trying to explain to your manager why your summarizer now refuses to summarize.
Observability shifts from catching bugs to catching regressions and making tradeoffs.
Now you need:

- Output drift detection for every model upgrade
- Evaluation metrics like semantic similarity and formatting checks
- Cost and latency breakdowns by user, endpoint, and model
- RAG quality tracking: is your index fresh, relevant, and not broken
- Full-stack tracing, all the way down to the weird tool call that silently failed
```python
# embed() and cosine_similarity() stand in for whatever embedding model and
# similarity helper you already use; 0.8 is a starting threshold, not gospel.
score = cosine_similarity(embed(output), embed(reference))
if score < 0.8:
    alert("Drift detected in summarizer")
```
You can get a long way with homegrown checks:

- Nightly evals using cron and SQL
- Token usage reports grouped by endpoint (sketched after this list)
- Manual diffs of output before and after model changes
- Embedding freshness checks with thresholds
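A token report does not need a data warehouse. Here is a minimal sketch that chews through a JSON-lines version of the logs from earlier; the file path, the endpoint field, and the per-1K-token prices are placeholders, so substitute your provider's current rates:

```python
import json
from collections import defaultdict

# Placeholder prices per 1K tokens; substitute your provider's actual pricing.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}

def token_report(log_path: str = "llm_traces.jsonl") -> None:
    """Sum token usage and estimated cost per endpoint from structured logs."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0})
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            endpoint = record.get("endpoint", "unknown")
            totals[endpoint]["prompt"] += record["tokens"]["prompt_tokens"]
            totals[endpoint]["completion"] += record["tokens"]["completion_tokens"]

    for endpoint, tokens in sorted(totals.items()):
        cost = (
            tokens["prompt"] / 1000 * PRICE_PER_1K["prompt"]
            + tokens["completion"] / 1000 * PRICE_PER_1K["completion"]
        )
        print(f"{endpoint}: {tokens['prompt'] + tokens['completion']} tokens, ~${cost:.2f}")
```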
You should pay for tooling when:
- You spend more time reading logs than shipping features
- It takes longer than five minutes to answer “what changed”
- Your system breaks silently and no one knows until customers complain
Useful tools:
- LangSmith for tracing and evals
- Sentry for full-stack visibility
- WhyLabs or Arize for drift detection and output monitoring
Monitoring LLM systems is messy. These are probabilistic tools. You are not logging errors. You are logging behavior.
| Stage | What You Should See |
| --- | --- |
| Pre-Production | Prompt logs, token usage, version tracking |
| Production | Tracing, retries, feedback, structured logs |
| Post-PMF | Drift detection, evals, cost insights, RAG health |
A good observability stack should answer four questions. Fast.
- What did we send the model?
- Why did it respond that way?
- What changed recently?
- How much is this costing us?
If you can answer those, you are probably fine. If you cannot, start small. Fix the parts that hurt. Add more when things get weird. They will.
If your monitoring stops at the model call, you are not monitoring. You are hoping.
Sentry gives you full request traces across your app. You can follow a user click through your toolchain, vector store, model call, and back to the output. You can even see why a tool invocation failed inside an agent plan, without spending your entire afternoon rebuilding context from logs.
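If you want to see what that looks like in code, here is a minimal sketch with the Sentry Python SDK; the DSN, span names, and the retrieve/call_model helpers are placeholders:

```python
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # your project's DSN
    traces_sample_rate=1.0,  # sample everything while you are still debugging
)

def handle_request(user_input: str) -> str:
    with sentry_sdk.start_transaction(op="agent", name="agent_request") as txn:
        with txn.start_child(op="retrieval", description="vector store lookup"):
            docs = retrieve(user_input)  # your retrieval step
        with txn.start_child(op="llm", description="model call"):
            response = call_model(user_input, docs)  # your model call
        return response
```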
So stop guessing. Start knowing. And please log the prompt before you forget what broke it.
Check out the docs to learn more, join the discussion in Discord, or if you’re new to Sentry, you can get started for free.