
Sample AI traces at 100% without sampling everything


A little while ago, back when agents were telling me “You’re absolutely right!”, I was building webvitals.com. You put in a URL, it kicks off an API request to a Next.js API route that invokes an agent with a few tools to scan the page and provide AI-generated suggestions to improve your… you guessed it… Web Vitals. Do we even care about these anymore?

I had tracesSampleRate set to 100% in development, but in production I sampled it down to 10% because… well, that’s what our instrumentation recommends. Kyle wrote a great blog post explaining that “Watching everything is watching nothing”. But AI is non-deterministic. And when I was debugging an error from a tool call, I realized I was missing very important spans emitted from the Vercel AI SDK because of that sampling strategy.

An agent run with 7 tool calls doesn’t get partially sampled. You either capture the whole span tree or you lose it entirely. This is how head-based sampling works.

I was chasing ghosts.

Agent Runs Are Span Trees, and Sampling Is All-or-Nothing

A typical agent execution looks like this in Sentry’s trace view:

POST /api/chat (http.server)
└── gen_ai.invoke_agent "Research Agent"
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← initial reasoning
    ├── gen_ai.execute_tool "search_docs"              ← tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← process results
    ├── gen_ai.execute_tool "summarize"                ← second tool call
    ├── gen_ai.request "chat claude-sonnet-4-6"        ← decides to hand off
    └── gen_ai.execute_tool "transfer_to_writer"       ← handoff via tool
        └── gen_ai.invoke_agent "Writer Agent"
            ├── gen_ai.request "chat gemini-2.5-flash"
            └── gen_ai.execute_tool "format_output"

That’s 11 spans in a single run. The sampling decision happens once, at the root: the POST /api/chat HTTP transaction. Every child span inherits that decision. If the root is dropped, all 11 spans disappear.

This is fundamentally different from sampling HTTP requests, where dropping one GET /api/users is no big deal because the next one is basically identical.

Agent runs are not identical. Each one makes different decisions, calls different tools, processes different data. An agent that hallucinated on run 67 might work perfectly on run 420. If your sample rate dropped 67, you’ll never know what went wrong.
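To put a number on it: at a 10% sample rate, the chance you captured any one specific run is 10%, and the chance you captured at least one of N occurrences of a rare bug is 1 − 0.9^N. A quick back-of-envelope sketch:

```python
# Back-of-envelope: odds of having captured a rare failure at 10% sampling.
RATE = 0.10

def p_at_least_one(n: int, rate: float = RATE) -> float:
    """Probability that at least one of n occurrences was sampled."""
    return 1 - (1 - rate) ** n

# A specific failing run (like run 67) was captured with probability RATE.
for n in (1, 5, 10, 20):
    print(f"{n:>2} occurrences -> {p_at_least_one(n):.0%} chance of having a trace")
```

Even after ten occurrences of the same bug, there is roughly a one-in-three chance you have no trace of it at all.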

How Head-Based Sampling Actually Works (and Why It Matters Here)

Both the Sentry JavaScript and Python SDKs use head-based sampling: the decision is made at the start of the trace, before any child spans exist.

In the JavaScript SDK, SentrySampler.shouldSample() is explicit about this:

// We only sample based on parameters (like tracesSampleRate or tracesSampler)
// for root spans. Non-root spans simply inherit the sampling decision
// from their parent.

Non-root spans don’t get a vote. If the root span was dropped, tracesSampler is never called for any child, including your gen_ai.request and gen_ai.execute_tool spans. They inherit the parent’s fate.

In Python, the same logic lives in Transaction._set_initial_sampling_decision(). The traces_sampler callback receives a sampling_context dict with transaction_context (containing op and name) and parent_sampled. It only fires for root transactions.

This means head-based sampling doesn’t support independently sampling gen_ai child spans at a different rate than their parent transaction. There’s no “sample 100% of LLM calls but 10% of HTTP requests.” If the HTTP request is dropped, the LLM calls inside it are dropped too.
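The all-or-nothing behavior is easy to demonstrate. This toy model (not the SDK’s actual code) makes one random decision at the root and lets every child span inherit it:

```python
import random

def head_based_sample(num_spans_per_trace: int, sample_rate: float,
                      num_traces: int, seed: int = 0) -> tuple[int, int]:
    """Toy model of head-based sampling: one decision at the root,
    inherited by every child span. Returns (kept_traces, kept_spans)."""
    rng = random.Random(seed)
    kept_traces = kept_spans = 0
    for _ in range(num_traces):
        sampled = rng.random() < sample_rate  # decided once, at the root
        if sampled:
            kept_traces += 1
            kept_spans += num_spans_per_trace  # children inherit: all kept
        # else: every child span is dropped along with the root
    return kept_traces, kept_spans

traces, spans = head_based_sample(num_spans_per_trace=11,
                                  sample_rate=0.1, num_traces=1000)
print(f"kept {traces} of 1000 traces and {spans} spans; partial trees: 0")
```

The kept-span count is always an exact multiple of the tree size: you never see 4 of the 11 spans from a run, only 0 or 11.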

Let’s walk through a few scenarios to show how the filtering approach differs depending on whether the root span comes from an agent or from the application.

Scenario 1: The gen_ai Span IS the Root

Sometimes your agent run is the root span. Maybe it’s a cron job that’s running an agent, a queue consumer processing an AI task, or a CLI script. In these cases, tracesSampler sees the gen_ai.* operation directly and you can match on it:

JavaScript:

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    // Standalone gen_ai root spans - always sample
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }

    return inheritOrSampleWith(0.2);
  },
});

Python:

def traces_sampler(sampling_context):
    op = sampling_context.get("transaction_context", {}).get("op", "")

    # Standalone gen_ai root spans - always sample
    if op.startswith("gen_ai."):
        return 1.0

    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)

    return 0.2

sentry_sdk.init(dsn="...", traces_sampler=traces_sampler)

This is the easy case. The hard case is next.

Scenario 2: The gen_ai Spans Are Children of an HTTP Transaction

This is the common case in web applications. A user hits POST /api/chat, your framework creates an http.server root span, and somewhere inside that request handler your agent runs. By the time the first gen_ai.request span is created, the sampling decision was already made for the HTTP transaction.

The fix: identify which routes trigger AI calls and sample those routes at 100%.

JavaScript:

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    // Standalone gen_ai root spans
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }

    // HTTP routes that serve AI features - always sample
    if (name?.includes('/api/chat') ||
        name?.includes('/api/agent') ||
        name?.includes('/api/generate')) {
      return 1.0;
    }

    return inheritOrSampleWith(0.2);
  },
});

Python:

def traces_sampler(sampling_context):
    tx_context = sampling_context.get("transaction_context", {})
    op = tx_context.get("op", "")
    name = tx_context.get("name", "")

    # Standalone gen_ai root spans
    if op.startswith("gen_ai."):
        return 1.0

    # HTTP routes that serve AI features - always sample
    if op == "http.server" and any(
        p in name for p in ["/api/chat", "/api/agent", "/api/generate"]
    ):
        return 1.0

    # Honour parent decision in distributed traces
    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)

    return 0.2

sentry_sdk.init(dsn="...", traces_sampler=traces_sampler)

Replace the route strings with whatever paths your AI features live on. If your entire app is AI-powered, skip the tracesSampler and just set tracesSampleRate: 1.0.

The Cost Math: AI API Bills Dwarf Observability Costs

The instinct to sample AI traces at a lower rate usually comes from cost concerns. Let’s look at the actual numbers.

What | Cost per event
Claude Sonnet 4 input (1K tokens) | ~$0.003
Claude Sonnet 4 output (1K tokens) | ~$0.015
Gemini 2.5 Flash input (1K tokens) | ~$0.00015
Gemini 2.5 Flash output (1K tokens) | ~$0.0006
A typical agent run (3 LLM calls, 2 tool calls) | $0.02-$0.15
Sentry span events for that agent run (~9 spans) | Fraction of a cent

The LLM calls themselves are 10-100x more expensive than the monitoring. You’re already paying for the AI call; dropping the observability span to save a fraction of a cent per call is like skipping the dashcam to save on gas.
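As a sanity check on the table, here is the arithmetic for one agent run making 3 Claude Sonnet 4 calls; the per-call token counts (~2K input, ~500 output) are assumptions for illustration:

```python
# Illustrative cost for one agent run: 3 Claude Sonnet 4 calls.
# Token counts per call are assumptions; prices are from the table above.
CLAUDE_INPUT_PER_1K = 0.003   # $/1K input tokens
CLAUDE_OUTPUT_PER_1K = 0.015  # $/1K output tokens

calls = 3
input_tokens = calls * 2000   # ~2K prompt tokens per call
output_tokens = calls * 500   # ~500 completion tokens per call

llm_cost = (input_tokens / 1000) * CLAUDE_INPUT_PER_1K \
         + (output_tokens / 1000) * CLAUDE_OUTPUT_PER_1K
print(f"LLM cost per run: ${llm_cost:.4f}")  # lands in the $0.02-$0.15 band
```

That single run costs a few cents in model fees, while the ~9 spans it emits cost a fraction of a cent.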

When 100% Tracing Isn’t Feasible: Metrics and Logs as a Safety Net

If you genuinely can’t sample AI routes at 100%, because of, say, massive scale or strict budget constraints, you can still capture the important signals from every AI call using Sentry Metrics and Logs. Both are independent of trace sampling.

JavaScript - emit metrics on every LLM call:

import * as Sentry from "@sentry/node";

// After every LLM call, regardless of trace sampling:
Sentry.metrics.distribution("gen_ai.token_usage", result.usage.totalTokens, {
  unit: "none",
  attributes: {
    model: "claude-sonnet-4-6",
    user_id: user.id,
    endpoint: "/api/chat",
  },
});

Sentry.metrics.distribution("gen_ai.latency", responseTimeMs, {
  unit: "millisecond",
  attributes: { model: "claude-sonnet-4-6" },
});

Sentry.metrics.count("gen_ai.calls", 1, {
  attributes: {
    model: "claude-sonnet-4-6",
    status: result.error ? "error" : "success",
  },
});

Python - emit metrics on every LLM call:

import sentry_sdk

sentry_sdk.metrics.distribution(
    "gen_ai.token_usage",
    result.usage.total_tokens,
    attributes={
        "model": "claude-sonnet-4-6",
        "user_id": str(user.id),
        "endpoint": "/api/chat",
    },
)

sentry_sdk.metrics.distribution(
    "gen_ai.latency",
    response_time_ms,
    unit="millisecond",
    attributes={"model": "claude-sonnet-4-6"},
)

sentry_sdk.metrics.count(
    "gen_ai.calls",
    1,
    attributes={
        "model": "claude-sonnet-4-6",
        "status": "error" if error else "success",
    },
)

You can also log every call with structured attributes for searchability:

JavaScript:

Sentry.logger.info("LLM call completed", {
  model: "claude-sonnet-4-6",
  user_id: user.id,
  input_tokens: result.usage.promptTokens,
  output_tokens: result.usage.completionTokens,
  latency_ms: responseTimeMs,
  status: "success",
});

Python:

sentry_sdk.logger.info(
    "LLM call completed",
    model="claude-sonnet-4-6",
    user_id=str(user.id),
    input_tokens=result.usage.prompt_tokens,
    output_tokens=result.usage.completion_tokens,
    latency_ms=response_time_ms,
    status="success",
)

Here’s what each telemetry layer gives you:

Signal | Traces (sampled) | Metrics (100%) | Logs (100%)
Full span tree with prompts/responses | Yes | No | No
Token usage distributions (p50, p99) | Partial | Yes | No
Cost attribution by model/user | Partial | Yes | Yes
Error rates by model/endpoint | Partial | Yes | Yes
Latency distributions | Partial | Yes | No
Searchable per-call records | Yes | No | Yes

The recommended approach: Use tracesSampler to capture 100% of AI-related routes. If that’s not possible, combine a lower trace rate with metrics and logs emitted on every call. Traces give you the debugging depth; metrics and logs give you the aggregate picture.

Once you’re emitting these metrics, you can build custom dashboards that go beyond what the pre-built AI Agents dashboard shows. The Sentry CLI makes this scriptable:

# Find your most expensive users - the pre-built dashboard doesn't group by user
sentry dashboard create 'AI Cost Attribution'
sentry dashboard widget add 'AI Cost Attribution' "Most Expensive Users" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" \
  --where "span.op:gen_ai.request" \
  --group-by "user.id" \
  --sort "-sum:gen_ai.usage.total_tokens" \
  --limit 20

# Cost per conversation - find runaway multi-turn sessions
sentry dashboard widget add 'AI Cost Attribution' "Cost per Conversation" \
  --display table --dataset spans \
  --query "sum:gen_ai.usage.total_tokens" "count" \
  --where "span.op:gen_ai.request" \
  --group-by "gen_ai.conversation.id" \
  --sort "-sum:gen_ai.usage.total_tokens" \
  --limit 20

The pre-built dashboard gives you per-model and per-tool aggregates. Custom dashboards answer the business questions: who’s driving cost, which features justify their AI spend, and which conversations are spiraling.

The Full Production Config

Here’s a complete setup that samples AI routes at 100%, everything else at your baseline, and emits metrics as a safety net:

JavaScript:

import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampler: ({ name, attributes, inheritOrSampleWith }) => {
    if (attributes?.['sentry.op']?.startsWith('gen_ai.') || attributes?.['gen_ai.system']) {
      return 1.0;
    }
    if (name?.includes('/api/chat') || name?.includes('/api/agent')) {
      return 1.0;
    }
    return inheritOrSampleWith(0.2);
  },
});

// Wrapper for any LLM call - emit metrics regardless of sampling
function trackLLMCall(model, usage, latencyMs, userId) {
  Sentry.metrics.distribution("gen_ai.token_usage", usage.totalTokens, {
    attributes: { model, user_id: userId },
  });
  Sentry.metrics.distribution("gen_ai.latency", latencyMs, {
    unit: "millisecond",
    attributes: { model },
  });
  Sentry.metrics.count("gen_ai.calls", 1, {
    attributes: { model, status: "success" },
  });
}

Python:

import sentry_sdk

def traces_sampler(sampling_context):
    tx = sampling_context.get("transaction_context", {})
    op, name = tx.get("op", ""), tx.get("name", "")

    if op.startswith("gen_ai."):
        return 1.0
    if op == "http.server" and any(
        p in name for p in ["/api/chat", "/api/agent"]
    ):
        return 1.0

    parent = sampling_context.get("parent_sampled")
    if parent is not None:
        return float(parent)
    return 0.2

sentry_sdk.init(
    dsn="...",
    traces_sampler=traces_sampler,
)

# Wrapper for any LLM call - emit metrics regardless of sampling
def track_llm_call(model, usage, latency_ms, user_id):
    sentry_sdk.metrics.distribution(
        "gen_ai.token_usage", usage.total_tokens,
        attributes={"model": model, "user_id": str(user_id)},
    )
    sentry_sdk.metrics.distribution(
        "gen_ai.latency", latency_ms,
        unit="millisecond",
        attributes={"model": model},
    )
    sentry_sdk.metrics.count(
        "gen_ai.calls", 1,
        attributes={"model": model, "status": "success"},
    )

Quick Reference

Situation | What to do
AI is the core product | tracesSampleRate: 1.0 - sample everything
AI is one feature in a larger app | tracesSampler with AI routes at 1.0, baseline for the rest
Can’t afford 100% on AI routes | Lower trace rate + metrics/logs on every call
Already using tracesSampler | Add AI route matching to your existing logic
Sample rate is already 1.0 | No change needed

The underlying principle: agent runs are high-value, low-volume (relative to HTTP traffic), and expensive to reproduce. Sample them accordingly.

If you’re just getting started with AI monitoring, check out our companion post on the developer’s guide to AI agent monitoring, which covers the full setup across 10+ frameworks, the pre-built dashboards, and a real debugging walkthrough.

For framework-specific setup, see our AI monitoring docs. If you’re using an AI coding assistant, install the Sentry CLI skill (npx skills add https://cli.sentry.dev) to configure your sampling, build custom dashboards, and investigate issues directly from your editor.
