FAQs

Frequently asked questions about sampling in Odigos.

Concepts and terminology

Does 'sampling' mean dropping traces or keeping them?

Both, depending on who you ask. Some teams describe sampling as what gets dropped (e.g. “drop 90%”), others as what gets kept (e.g. “keep 10%”). Odigos uses the keep framing throughout the UI and docs because it aligns with how rules are written (each rule has a Keep Percentage). The two phrasings describe the same outcome; we standardize on “keep” to avoid confusion. See Introduction to sampling.

Why is dropping health checks called 'sampling' if I'm dropping 100% of them?

“Sampling” is the name of the pipeline mechanism that makes a per-trace keep/drop decision—not a description of “always keep some, always drop some.” Dropping all health probes is just sampling at 0%. The same engine that drops 100% of probe traffic also retains 25% of /api/products traffic; both are sampling decisions, just at different keep percentages.

Why does Odigos always keep or drop the entire trace?

Odigos applies a single keep/drop decision per trace, not per span. With head sampling, the decision is made at the root span and propagated to every child span (parent-based sampling). With tail sampling, the Odigos Gateway buffers all spans for a trace ID and makes one decision after the aggregation window. In both cases, splitting a trace—keeping some spans and dropping others—would break parent-child links, end-to-end timing, and any analysis that depends on the full request path, so a kept trace is always complete from root through children. See Sampling level.
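As a rough illustration of a per-trace decision, the sketch below derives one keep/drop answer from the trace ID, so every span of a trace gets the same outcome. The hashing scheme and the keep_trace helper are assumptions made for this example; they are not taken from the Odigos implementation.

```python
import hashlib

def keep_trace(trace_id: str, keep_percentage: float) -> bool:
    """Return a single keep/drop decision for a whole trace (hypothetical scheme)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the trace ID to a stable bucket in [0, 10000), i.e. 0.01% steps.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < keep_percentage * 100

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    {"trace_id": trace_id, "name": "GET /api/products"},
    {"trace_id": trace_id, "name": "SELECT products"},
    {"trace_id": trace_id, "name": "render response"},
]

# The decision depends only on the trace ID, so a trace is never split.
kept = [s for s in spans if keep_trace(s["trace_id"], 25)]
assert len(kept) in (0, len(spans))
```

With a keep percentage of 0 the function never keeps a trace, and with 100 it always does, which is the sense in which dropping all health probes is still sampling.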

How rules are evaluated

If a trace matches rules in more than one category, which category wins?

Categories are evaluated in a fixed sequence: Noisy → Highly Relevant → Cost Reduction. A Noisy match short-circuits the rest—the trace is dropped (or sampled down) and the other categories are skipped. A Highly Relevant match does not short-circuit—Cost Reduction rules are still evaluated, but the Highly Relevant Keep Percentage wins. Cost Reduction caps only take effect on traces with no Highly Relevant match. See Order of operations between categories.
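The category order can be sketched as a small function. It takes the Keep Percentage already resolved for each category (or None when no rule in that category matched) and returns the percentage that applies. The function name and signature are hypothetical, and the 100% fallback for unmatched traces reflects the statement elsewhere in this FAQ that non-matching traces still flow through the pipeline.

```python
from typing import Optional

def effective_keep_percentage(
    noisy: Optional[float],
    highly_relevant: Optional[float],
    cost_reduction: Optional[float],
) -> float:
    """Apply the Noisy -> Highly Relevant -> Cost Reduction order (illustrative)."""
    if noisy is not None:
        # A Noisy match short-circuits: the other categories are skipped.
        return noisy
    if highly_relevant is not None:
        # Cost Reduction is still evaluated, but the Highly Relevant value wins.
        return highly_relevant
    if cost_reduction is not None:
        # Caps only take effect when there is no Highly Relevant match.
        return cost_reduction
    # Assumption: a trace that matches no rule is kept.
    return 100.0

# Health probe that also happens to carry an error: Noisy wins.
print(effective_keep_percentage(noisy=0, highly_relevant=100, cost_reduction=10))
# Error trace under a cluster-wide cap: Highly Relevant wins.
print(effective_keep_percentage(noisy=None, highly_relevant=100, cost_reduction=1))
```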

Within a category, do rules have a priority order?

No. Within a single category, rules have no hierarchy—they’re all equal. When a trace matches multiple rules in the same category, the engine satisfies all matches at once: the highest Keep Percentage wins for Highly Relevant (floors, “keep at least”), and the lowest Keep Percentage wins for Noisy Operation and Cost Reduction (caps, “keep at most”). This is a deliberate design choice: ordered rule lists (like firewall rules) become unmaintainable when you have dozens of rules and have to mentally simulate the order to predict behavior.

Why does 'Keep at most 25%' not always mean exactly 25%?

Cost Reduction percentages are caps, not exact rates, because the engine considers all matching rules together. If two Cost Reduction rules match the same trace—one capped at 25% and another at 1%—the lower cap (1%) wins. This makes it safe to add new rules without worrying that they’ll silently override existing ones. The same logic runs in reverse for Highly Relevant: percentages there are floors (“keep at least”), and the highest floor wins.
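The floor and cap behavior reduces to a max or a min over the matching rules in one category. The sketch below shows that arithmetic only; the category strings and the resolve_category helper are made-up names for this example.

```python
from typing import Optional

def resolve_category(category: str, matched: list[float]) -> Optional[float]:
    """Combine the Keep Percentages of all matching rules in one category."""
    if not matched:
        return None  # no rule in this category matched the trace
    if category == "highly_relevant":
        return max(matched)  # floors: "keep at least", highest wins
    if category in ("noisy", "cost_reduction"):
        return min(matched)  # caps: "keep at most", lowest wins
    raise ValueError(f"unknown category: {category}")

# Two Cost Reduction rules match the same trace: the 1% cap wins.
print(resolve_category("cost_reduction", [25, 1]))
# Two Highly Relevant rules match: the higher floor wins.
print(resolve_category("highly_relevant", [10, 50]))
```

Because the result does not depend on the order of the list, adding a rule can only tighten a cap or raise a floor; it never silently reorders existing behavior.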

If I only create Highly Relevant rules and no Cost Reduction rules, what gets dropped?

Almost nothing—and that’s usually not what you want. Highly Relevant rules mark important traces, but they don’t reduce volume on their own; non-matching traces still flow through the pipeline. Worse, the gateway still buffers traces for tail evaluation, which adds load without cutting volume. To control volume, pair Highly Relevant rules with at least one Cost Reduction rule (commonly the built-in Drop Traces Cluster-Wide rule). See Highly Relevant.

Can I combine multiple scope types (e.g. namespace + language) in a single rule?

Yes. Within a single scope type, multiple selections are combined with OR (e.g. namespaces prod OR staging). Across different scope types, selections are combined with AND (e.g. namespace prod AND language Java matches only Java workloads in prod—not every workload in prod and not every Java workload across the cluster). See Source scope.
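The OR-within, AND-across logic can be sketched as follows. The dictionary shapes and the scope_matches helper are hypothetical and only illustrate the boolean logic, not how Odigos stores scopes.

```python
def scope_matches(scope: dict[str, set[str]], workload: dict[str, str]) -> bool:
    """OR within one scope type, AND across scope types (illustrative)."""
    return all(
        workload.get(scope_type) in allowed  # OR: any selected value matches
        for scope_type, allowed in scope.items()  # AND: every scope type must match
    )

rule_scope = {"namespace": {"prod", "staging"}, "language": {"java"}}

print(scope_matches(rule_scope, {"namespace": "prod", "language": "java"}))  # Java in prod
print(scope_matches(rule_scope, {"namespace": "prod", "language": "go"}))    # Go in prod
print(scope_matches(rule_scope, {"namespace": "dev", "language": "java"}))   # Java in dev
```

Only the first workload matches. In this sketch an empty scope matches every workload, which is an assumption of the example rather than documented behavior.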

Operational impact

Does adding any sampling rule increase load on the gateway?

It depends on the rule. Noisy rules use head sampling and run in the instrumentation agent where supported, so they reduce gateway load. Highly Relevant rules, Cost Reduction rules, and Noisy rules on agents without head-sampling support all require tail evaluation, which means the gateway buffers each affected trace for the aggregation window. Even a single tail-mode rule activates buffering for all traffic, raising memory and CPU on the gateway.

Why do gateways sometimes crash instead of just dropping data?

Once the gateway accepts a trace for tail evaluation, it allocates memory to hold spans until the aggregation window closes. During traffic bursts, many traces can be accepted concurrently before backpressure kicks in—each one keeps allocating memory until the pod hits its limit and OOMs. This is why bursty workloads can produce dropped spans or collector restarts even when your sampling rules look correct. Sizing the gateway for peak burst capacity (not steady-state) is the most reliable fix. See Gateway load, OOM, and scaling.
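A back-of-the-envelope estimate shows why steady-state sizing falls short. The sketch below multiplies the trace arrival rate by the aggregation window and an average trace size. Every number in it is a placeholder, and it ignores collector overhead, so treat the result as a lower bound for your own measurements rather than a sizing formula.

```python
def buffered_bytes(
    traces_per_second: float,
    window_seconds: float,
    spans_per_trace: float,
    bytes_per_span: float,
) -> float:
    """Rough memory held by traces waiting for a tail decision (illustrative)."""
    return traces_per_second * window_seconds * spans_per_trace * bytes_per_span

MIB = 2**20

# Placeholder workload: 8 spans per trace, about 1 KiB per span, 30 s window.
steady = buffered_bytes(500, 30, 8, 1024)
burst = buffered_bytes(5000, 30, 8, 1024)

print(f"steady state: {steady / MIB:.0f} MiB")
print(f"10x burst:    {burst / MIB:.0f} MiB")
```

Under these assumptions a tenfold burst takes the buffered data from a little over 100 MiB to more than 1 GiB, which is how a pod sized for steady state reaches its memory limit.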

Should I scale up (bigger pods) or scale out (more replicas) for tail sampling?

On paper, both yield similar capacity, but scaling up is usually preferable for tail sampling. All spans of a single trace must land on the same gateway pod for aggregation, and each scale event changes the set of pods, which re-shuffles the trace-ID hash routing. Fewer, larger pods scale less often and produce less routing churn. There’s no one-size-fits-all answer, so validate against your own load.

What happens to in-flight traces if a gateway crashes?

They are lost. Odigos does not currently use persistent queues or full backpressure between the data collector and the gateway, so a crash drops everything the gateway was buffering. Graceful shutdowns (rolling updates, scale-downs) flush their buffers before exit; only ungraceful crashes lose data.

My sampler looks empty or inactive—why is it still affecting my pipeline?

The presence of any sampling rule that requires tail evaluation activates tail mode for matching traffic—even rules that look minimal, such as a placeholder service name or a broad / route. Once tail mode is on, the gateway buffers traces for the aggregation window, which raises memory and latency to export. If you don’t need tail behavior, either remove unused or test sampling rules from the cluster, or disable tail sampling explicitly in Odigos configuration (Helm values, or the UI Settings page). See When any sampler enables tail mode.

Tail sampling specifics

Why is the aggregation window 30 seconds, and can I change it?

There is no reliable signal for “this trace is finished,” so the engine uses a fixed timeout instead. 30 seconds is the Odigos default, matching the OpenTelemetry community default. You can change it through the Helm value sampling.tailSampling.traceAggregationWaitDuration (for example '45s' or '15s'); the same value applies to every trace in the cluster. Larger windows hold more memory and delay traces appearing in your backend; smaller windows risk evaluating partial traces and fragmenting a single trace across multiple sampling decisions. See Aggregation window.

Does the 30-second timer start at the first span or the last span of a trace?

The first span the gateway sees for a trace ID, following the OpenTelemetry tail-sampling standard. Practical consequence: requests longer than 30 seconds may be evaluated on a partial trace.
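The timer behavior can be sketched with a small in-memory buffer. The TraceBuffer class is a hypothetical model written for this FAQ, not gateway code. It records when the first span of a trace ID arrives and treats the trace as due once the window has elapsed from that moment, whether or not more spans arrive later.

```python
from dataclasses import dataclass, field

@dataclass
class TraceBuffer:
    window_seconds: float = 30.0
    first_seen: dict[str, float] = field(default_factory=dict)
    spans: dict[str, list[str]] = field(default_factory=dict)

    def add_span(self, trace_id: str, span_name: str, now: float) -> None:
        # setdefault keeps the original timestamp: later spans do not reset the timer.
        self.first_seen.setdefault(trace_id, now)
        self.spans.setdefault(trace_id, []).append(span_name)

    def due(self, now: float) -> list[str]:
        """Trace IDs whose window, counted from the first span, has elapsed."""
        return [t for t, start in self.first_seen.items()
                if now - start >= self.window_seconds]

    def evaluate(self, trace_id: str) -> list[str]:
        """Remove the trace from the buffer and return the spans seen so far."""
        self.first_seen.pop(trace_id)
        return self.spans.pop(trace_id)

buf = TraceBuffer()
buf.add_span("t1", "db.query", now=0)     # child span finishes first
buf.add_span("t1", "cache.get", now=20)   # does not extend the window
partial = buf.evaluate("t1") if "t1" in buf.due(now=30) else []
buf.add_span("t1", "GET /checkout", now=40)  # root span of a 40 s request arrives late
```

Here the decision at 30 seconds sees only the two child spans. Spans are typically exported when they end, so the root span of a long request tends to arrive last and, in this model, starts a fresh window of its own.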

What is trace-based load balancing, and why does tail sampling need it?

Tail evaluation requires every span of a single trace to land on the same gateway pod—otherwise the gateway can’t aggregate the trace before deciding. Odigos’s data collectors use a trace-based load balancing exporter that hashes the trace ID and routes all spans for that trace to the same gateway. Plain round-robin (the default without tail sampling) would scatter spans across pods and break aggregation.
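A minimal sketch of trace-ID routing follows. It uses a plain hash-modulo scheme, which is a simplification chosen for this example; the actual exporter may use a different hashing strategy. The pick_gateway function and the pod names are hypothetical.

```python
import hashlib

def pick_gateway(trace_id: str, gateways: list[str]) -> str:
    """Route every span of a trace to the same gateway pod (illustrative)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return gateways[int.from_bytes(digest[:8], "big") % len(gateways)]

three = ["gateway-0", "gateway-1", "gateway-2"]
four = three + ["gateway-3"]

# Same trace ID, same pod, no matter which collector sends the span.
assert pick_gateway("trace-42", three) == pick_gateway("trace-42", three)

# Scaling from 3 to 4 pods changes the target for many in-flight trace IDs.
trace_ids = [f"trace-{i}" for i in range(1000)]
moved = sum(pick_gateway(t, three) != pick_gateway(t, four) for t in trace_ids)
print(f"{moved} of {len(trace_ids)} trace IDs change pods after scaling 3 -> 4")
```

The second half of the sketch illustrates routing churn: when the pod set changes, spans of a trace that is still being buffered can start landing on a different pod.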

Built-in rules

What exactly does the built-in Kubernetes Health Probes rule cover?

It drops traces for the liveness, readiness, and startup probe paths discovered on each workload’s Kubernetes manifest. The rule’s Keep Percentage defaults to 0% (drop everything) and is configurable. It only matches HTTPGet probes—exec and tcpSocket probes, or probes routed through ingress rewrites or service-mesh wrappers, are not auto-handled and need an explicit Noisy rule. See Built-in rule: Kubernetes Health Probes.
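To check which of your probes fall under the rule, you can look for httpGet handlers in the container spec. The sketch below walks a container definition using the standard Kubernetes probe field names (livenessProbe, readinessProbe, startupProbe, httpGet, exec, tcpSocket). The http_probe_paths helper is hypothetical and does not reproduce how Odigos discovers probe paths.

```python
PROBE_FIELDS = ("livenessProbe", "readinessProbe", "startupProbe")

def http_probe_paths(container: dict) -> set[str]:
    """Collect the paths of HTTPGet probes; exec and tcpSocket probes are skipped."""
    paths = set()
    for probe_field in PROBE_FIELDS:
        http_get = container.get(probe_field, {}).get("httpGet")
        if http_get is not None:
            # Kubernetes documents "/" as the default when path is omitted.
            paths.add(http_get.get("path", "/"))
    return paths

container = {
    "name": "api",
    "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    "readinessProbe": {"exec": {"command": ["cat", "/tmp/ready"]}},
    "startupProbe": {"tcpSocket": {"port": 8080}},
}

print(http_probe_paths(container))
```

Only /healthz is returned for this container, so the exec and tcpSocket probes here would need an explicit Noisy rule if they generate traces.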

Can I disable or change the built-in 'Keep All Error Traces' rule?

The built-in rule itself is fixed at 100% retention and is not configurable. If you need a different keep percentage for error traces—for example because your error volume is too high to retain all of it—author your own Highly Relevant rule on errors with the Keep Percentage you want. Custom rules are how you express any intent other than the built-in 100%. See Built-in rule: Keep All Error Traces.

Why isn't there a built-in rule for high-latency traces?

Latency thresholds depend on the application: 200 ms is fast for one service and unacceptable for another, so there’s no sensible default the product can ship. Errors are different—an error span is universally a signal worth keeping—which is why the error rule is built in but a duration rule is not. Define your own Highly Relevant duration rule with the threshold and Keep Percentage that fit your service.

Are there built-in operations beyond HTTP and Kafka (e.g. SQS, gRPC, database calls)?

Today’s Operation options are HTTP Server, HTTP Client (Noisy only), and Kafka Consumer / Kafka Producer (Highly Relevant and Cost Reduction). Odigos will add support for additional protocols as user needs demand.
