Log Sampling Strategies for High-Throughput Systems

When your distributed system generates terabytes of logs daily, you face a brutal choice: either spend a fortune on storage and processing infrastructure, or lose visibility into critical system behavior. Most engineering teams discover this problem too late—after their observability costs have spiraled out of control or they've missed a critical incident because they couldn't afford to retain enough data.

Traditional approaches to logging—capturing everything at all times—break down catastrophically at scale. A microservices architecture handling millions of requests per second can generate logs faster than most teams can ingest them. The result? Dropped logs, delayed indexing, astronomical cloud storage bills, and query timeouts that make your observability platform unusable during the incidents when you need it most. I've seen teams spend six figures monthly on log infrastructure while still missing the signals that matter.

The fundamental problem isn't just volume—it's signal-to-noise ratio. Most logs contain redundant information: successful health checks, routine database queries, and repetitive error messages that don't require individual inspection. Yet these mundane entries consume the same storage and processing resources as the critical anomalies you actually need to investigate. Without intelligent sampling, you're paying premium prices to store and index noise.

Why Naive Sampling Approaches Fail

The obvious solution—randomly dropping 90% of logs—seems appealing until you realize what you've lost. Random sampling treats all logs equally, which means you'll discard critical error traces with the same probability as routine success messages. When an incident occurs, you'll find yourself staring at incomplete traces, missing the exact log entries that would explain the failure.
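To put a rough number on "incomplete traces": if each log line is kept independently with probability p, a request that emits k lines survives intact with probability p^k. The sketch below only evaluates that formula. The function name is illustrative, and it assumes line-level random sampling with no coordination by trace ID.

```typescript
// Probability that every one of `linesPerRequest` log lines survives when each
// line is kept independently with probability `keepRate`.
function completeTraceProbability(keepRate: number, linesPerRequest: number): number {
  if (keepRate < 0 || keepRate > 1) {
    throw new RangeError('keepRate must be between 0 and 1');
  }
  return Math.pow(keepRate, linesPerRequest);
}

// Keeping 10% of lines, a 5-line request stays complete roughly once per 100,000 requests.
console.log(completeTraceProbability(0.1, 5));
```

Even short requests almost never survive whole, which is why the later sections make the decision per request rather than per line.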

Head-based sampling, where you make the keep-or-drop decision when the log is generated, suffers from a different problem: you don't yet know whether this request will become interesting. A request that starts normally might encounter a downstream timeout, trigger a circuit breaker, or expose a data consistency issue—but by the time you discover this, you've already discarded the early logs that would provide context.
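To make that limitation concrete, here is a minimal sketch of a head-based decision. It hashes the trace ID so every log line of a request gets the same verdict, but notice what the function can see: only the ID and a rate. Status code and duration do not exist yet. The helper names and the choice of FNV-1a as the hash are assumptions for illustration.

```typescript
// Hypothetical helper: 32-bit FNV-1a hash of the ID, mapped into [0, 1).
function hashToUnitInterval(id: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return (hash >>> 0) / 0x100000000; // unsigned 32-bit value divided by 2^32
}

// Head-based decision: made at request start, deterministic per trace ID.
function headSample(traceId: string, rate: number): boolean {
  return hashToUnitInterval(traceId) < rate;
}

// The same trace ID always gets the same answer, whichever log line asks.
const verdict = headSample('trace-8f3a', 0.1);
console.log(verdict === headSample('trace-8f3a', 0.1)); // prints true
```

If this request later hits a downstream timeout, nothing here can revisit the decision, which is the gap tail-based sampling fills.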

Static sampling rates create another trap. A fixed 10% sampling rate works fine during normal traffic, but it cannot react to conditions: during an incident it still drops 90% of the logs from the exact time period you need to investigate, and because retained volume scales linearly with traffic, a request spike can still overwhelm ingestion. Conversely, during low-traffic periods, you might be over-sampling and wasting resources on redundant data.

Adaptive Sampling Architecture

Modern log sampling requires context-aware decision-making that considers request characteristics, system state, and business value. The architecture centers on a sampling coordinator that maintains global state and adjusts sampling rates dynamically based on multiple signals.

Here's an implementation of an adaptive sampler in TypeScript:

interface SamplingDecision {
  shouldSample: boolean;
  samplingRate: number;
  reason: string;
}

interface RequestContext {
  traceId: string;
  statusCode?: number;
  duration?: number;
  errorType?: string;
  endpoint: string;
  userId?: string;
}

class AdaptiveSampler {
  private baseRate: number = 0.01; // 1% baseline
  private errorRate: number = 1.0; // 100% for errors
  private slowRequestThreshold: number = 1000; // ms
  private recentErrors: Map<string, number> = new Map();
  private endpointRates: Map<string, number> = new Map();
  private volumeWindow: number[] = [];
  private readonly windowSize: number = 60; // 60 seconds

  constructor(
    private maxLogsPerSecond: number,
    private costBudget: number
  ) {
    this.initializeEndpointRates();
  }

  private initializeEndpointRates(): void {
    // High-value endpoints get higher sampling
    this.endpointRates.set('/api/payment', 0.5);
    this.endpointRates.set('/api/checkout', 0.3);
    this.endpointRates.set('/api/auth', 0.2);
    this.endpointRates.set('/health', 0.001);
  }

  shouldSample(ctx: RequestContext): SamplingDecision {
    // Always sample errors
    if (ctx.statusCode && ctx.statusCode >= 500) {
      return {
        shouldSample: true,
        samplingRate: 1.0,
        reason: 'server_error'
      };
    }

    // Always sample slow requests
    if (ctx.duration && ctx.duration > this.slowRequestThreshold) {
      return {
        shouldSample: true,
        samplingRate: 1.0,
        reason: 'slow_request'
      };
    }

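    // Sample based on error correlation
    if (this.isCorrelatedWithRecentErrors(ctx)) {
      const correlatedRate = 0.8;
      return {
        // Roll against the rate so the recorded samplingRate matches the
        // real keep probability (needed for correct weighting later)
        shouldSample: Math.random() < correlatedRate,
        samplingRate: correlatedRate,
        reason: 'error_correlation'
      };
    }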

    // Endpoint-specific sampling (?? keeps an explicitly configured rate of 0)
    const endpointRate = this.endpointRates.get(ctx.endpoint) ?? this.baseRate;

    // Adjust for current volume
    const volumeAdjustment = this.calculateVolumeAdjustment();
    const finalRate = Math.min(endpointRate * volumeAdjustment, 1.0);

    return {
      shouldSample: Math.random() < finalRate,
      samplingRate: finalRate,
      reason: 'adaptive_rate'
    };
  }

  private isCorrelatedWithRecentErrors(ctx: RequestContext): boolean {
    if (!ctx.userId) return false;

    const userErrorCount = this.recentErrors.get(ctx.userId) || 0;
    return userErrorCount > 0;
  }

  private calculateVolumeAdjustment(): number {
    if (this.volumeWindow.length === 0) return 1.0;

    const currentVolume = this.volumeWindow.reduce((a, b) => a + b, 0) / this.volumeWindow.length;

    // If we're exceeding capacity, reduce sampling
    if (currentVolume > this.maxLogsPerSecond) {
      return this.maxLogsPerSecond / currentVolume;
    }

    return 1.0;
  }

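  // Epoch second that the newest volumeWindow bucket belongs to
  private lastBucketSecond: number = 0;

  recordRequest(ctx: RequestContext, sampled: boolean): void {
    // Track sampled volume in one bucket per second so the window average
    // is comparable to maxLogsPerSecond (idle seconds get no bucket)
    const currentSecond = Math.floor(Date.now() / 1000);
    if (currentSecond !== this.lastBucketSecond) {
      this.lastBucketSecond = currentSecond;
      this.volumeWindow.push(0);
      if (this.volumeWindow.length > this.windowSize) {
        this.volumeWindow.shift();
      }
    }
    if (sampled) {
      this.volumeWindow[this.volumeWindow.length - 1] += 1;
    }

    // Track errors for correlation
    const userId = ctx.userId;
    if (ctx.statusCode && ctx.statusCode >= 500 && userId) {
      const count = this.recentErrors.get(userId) || 0;
      this.recentErrors.set(userId, count + 1);

      // Expire old error tracking after 5 minutes
      setTimeout(() => {
        const current = this.recentErrors.get(userId) || 0;
        if (current > 1) {
          this.recentErrors.set(userId, current - 1);
        } else {
          // Remove the key entirely so the map does not grow without bound
          this.recentErrors.delete(userId);
        }
      }, 300000);
    }
  }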
}

This sampler makes intelligent decisions based on multiple factors. It always captures errors and slow requests—the signals most likely to indicate problems. It increases sampling for requests correlated with recent errors, helping you understand cascading failures. It adjusts sampling rates dynamically based on current volume, preventing your infrastructure from being overwhelmed during traffic spikes.
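One design choice worth a second look is the per-error setTimeout used for expiry: under an error storm it schedules one timer per failed request. A possible alternative, sketched below under the assumption that "recent" simply means "within the last five minutes", stores the timestamp of the last error per user and expires entries lazily. RecentErrorTracker and its method names are hypothetical, not part of the sampler above.

```typescript
// Hypothetical replacement for the timer-based recentErrors bookkeeping.
class RecentErrorTracker {
  private lastErrorAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 300000) {
    this.ttlMs = ttlMs;
  }

  // Remember when this user last hit a server error.
  recordError(userId: string, now: number = Date.now()): void {
    this.lastErrorAt.set(userId, now);
  }

  // True if the user had an error within the TTL; stale entries are dropped on read.
  hasRecentError(userId: string, now: number = Date.now()): boolean {
    const at = this.lastErrorAt.get(userId);
    if (at === undefined) return false;
    if (now - at > this.ttlMs) {
      this.lastErrorAt.delete(userId);
      return false;
    }
    return true;
  }

  // Sweep entries for users who are never looked up again.
  prune(now: number = Date.now()): void {
    for (const [userId, at] of this.lastErrorAt) {
      if (now - at > this.ttlMs) this.lastErrorAt.delete(userId);
    }
  }

  size(): number {
    return this.lastErrorAt.size;
  }
}
```

Calling prune() from an existing periodic job keeps memory bounded without creating a timer per error.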

Tail-Based Sampling for Complete Traces

The limitation of head-based sampling is that you must decide whether to keep a log before you know the request outcome. Tail-based sampling solves this by buffering logs temporarily and making the sampling decision after the request completes.

interface TraceSpan {
  traceId: string;
  spanId: string;
  timestamp: number;
  logs: LogEntry[];
  metadata: RequestContext;
}

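interface LogEntry {
  level: string;
  message: string;
  timestamp: number;
  attributes: Record<string, any>;
}

// Minimal shape assumed for the shipping dependency used below;
// substitute your own client for the log backend
interface LogShipper {
  ship(logs: LogEntry[]): void;
}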

class TailBasedSampler {
  private traceBuffer: Map<string, TraceSpan> = new Map();
  private readonly bufferTimeout: number = 30000; // 30 seconds
  private readonly maxBufferSize: number = 100000;

  constructor(
    private sampler: AdaptiveSampler,
    private logShipper: LogShipper
  ) {
    this.startBufferCleanup();
  }

  addLog(traceId: string, log: LogEntry, ctx: RequestContext): void {
    if (!this.traceBuffer.has(traceId)) {
      this.traceBuffer.set(traceId, {
        traceId,
        spanId: this.generateSpanId(),
        timestamp: Date.now(),
        logs: [],
        metadata: ctx
      });
    }

    const trace = this.traceBuffer.get(traceId)!;
    trace.logs.push(log);

    // Check buffer size limits
    if (this.traceBuffer.size > this.maxBufferSize) {
      this.flushOldestTraces(1000);
    }
  }

  completeTrace(traceId: string, finalContext: RequestContext): void {
    const trace = this.traceBuffer.get(traceId);
    if (!trace) return;

    // Update with final context (status code, duration, etc.)
    trace.metadata = { ...trace.metadata, ...finalContext };

    // Make sampling decision based on complete trace
    const decision = this.sampler.shouldSample(trace.metadata);

    if (decision.shouldSample) {
      this.shipLogs(trace, decision);
    }

    this.traceBuffer.delete(traceId);
    this.sampler.recordRequest(trace.metadata, decision.shouldSample);
  }

  private shipLogs(trace: TraceSpan, decision: SamplingDecision): void {
    const enrichedLogs = trace.logs.map(log => ({
      ...log,
      traceId: trace.traceId,
      spanId: trace.spanId,
      samplingRate: decision.samplingRate,
      samplingReason: decision.reason
    }));

    this.logShipper.ship(enrichedLogs);
  }

  private flushOldestTraces(count: number): void {
    const sorted = Array.from(this.traceBuffer.entries())
      .sort((a, b) => a[1].timestamp - b[1].timestamp)
      .slice(0, count);

    for (const [traceId, trace] of sorted) {
      // Force decision on old traces
      const decision = this.sampler.shouldSample(trace.metadata);
      if (decision.shouldSample) {
        this.shipLogs(trace, decision);
      }
      this.traceBuffer.delete(traceId);
    }
  }

  private startBufferCleanup(): void {
    setInterval(() => {
      const now = Date.now();
      for (const [traceId, trace] of this.traceBuffer.entries()) {
        if (now - trace.timestamp > this.bufferTimeout) {
          // Timeout - make best-effort decision
          const decision = this.sampler.shouldSample(trace.metadata);
          if (decision.shouldSample) {
            this.shipLogs(trace, decision);
          }
          this.traceBuffer.delete(traceId);
        }
      }
    }, 5000);
  }

  private generateSpanId(): string {
    return Math.random().toString(36).substring(2, 15);
  }
}

Tail-based sampling provides complete traces for sampled requests. When you investigate an error, you see every log entry from that request's lifecycle—not just the fragments that survived random sampling. The trade-off is memory overhead from buffering and the complexity of managing trace state across distributed services.

Content-Aware Sampling Patterns

Beyond request-level decisions, you can sample based on log content itself. Some log messages are inherently more valuable than others, regardless of request status.

class ContentAwareSampler {
  private readonly alwaysSamplePatterns: RegExp[] = [
    /authentication failed/i,
    /database connection lost/i,
    /circuit breaker opened/i,
    /rate limit exceeded/i,
    /deadlock detected/i
  ];

  private readonly neverSamplePatterns: RegExp[] = [
    /health check succeeded/i,
    /heartbeat/i,
    /metrics collected/i
  ];

  private readonly highValueAttributes: Set<string> = new Set([
    'userId',
    'transactionId',
    'paymentId',
    'orderId'
  ]);

  shouldSampleLog(log: LogEntry, baseDecision: SamplingDecision): boolean {
    // Override base decision for critical patterns
    if (this.matchesPattern(log.message, this.alwaysSamplePatterns)) {
      return true;
    }

    if (this.matchesPattern(log.message, this.neverSamplePatterns)) {
      return false;
    }

    // Increase sampling for logs with high-value attributes
    if (this.hasHighValueAttributes(log.attributes)) {
      return Math.random() < Math.min(baseDecision.samplingRate * 2, 1.0);
    }

    return baseDecision.shouldSample;
  }

  private matchesPattern(message: string, patterns: RegExp[]): boolean {
    return patterns.some(pattern => pattern.test(message));
  }

  private hasHighValueAttributes(attributes: Record<string, any>): boolean {
    return Object.keys(attributes).some(key => 
      this.highValueAttributes.has(key)
    );
  }
}

Content-aware sampling ensures you never discard logs containing critical keywords, even if the overall request would normally be dropped. This prevents the scenario where you sample out the one log message that would have explained an incident.

Common Pitfalls and Edge Cases

Sampling introduces subtle bugs that only manifest during incidents. The most dangerous is inconsistent sampling across service boundaries. If your API gateway samples at 10% but your downstream services sample at 1%, you'll have orphaned trace spans that make debugging impossible. Always propagate sampling decisions through distributed traces using trace context headers.
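As a rough illustration of honoring an upstream decision, the sketch below reads the sampled flag from a W3C Trace Context traceparent header (version 00 layout: version, 32-hex trace ID, 16-hex parent ID, 2-hex flags, with the lowest flag bit meaning "sampled"). The function names and the fallback behavior are assumptions; a real service would usually let its tracing library do this parsing.

```typescript
interface TraceParent {
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Parses a version-00 traceparent header; returns null for anything malformed.
function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  if (version === 'ff') return null; // reserved, invalid version
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Downstream services reuse the upstream verdict and only decide for
// themselves when no valid header arrived.
function inheritSamplingDecision(header: string | undefined, decideLocally: () => boolean): boolean {
  const parsed = header ? parseTraceParent(header) : null;
  return parsed ? parsed.sampled : decideLocally();
}
```

With this in place, differing local rates only matter for requests that enter the system without a trace context.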

Memory leaks in tail-based samplers are common. If traces never complete—due to timeouts, crashes, or network partitions—they accumulate in your buffer indefinitely. Implement aggressive timeouts and buffer size limits, accepting that you'll occasionally make sampling decisions on incomplete data.

Sampling bias can skew your metrics. If you sample successful requests at 1% but errors at 100%, your error rate calculations will be wildly inaccurate unless you weight samples correctly. Always include the sampling rate as metadata and adjust aggregations accordingly:

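function calculateTrueErrorRate(samples: Array<{status: number, samplingRate: number}>): number {
  let weightedErrors = 0;
  let weightedTotal = 0;

  for (const sample of samples) {
    // A rate of 0 means "never kept", so such a sample cannot be weighted
    if (sample.samplingRate <= 0) continue;

    // Each kept sample stands in for 1 / samplingRate original requests
    const weight = 1 / sample.samplingRate;
    weightedTotal += weight;
    if (sample.status >= 500) {
      weightedErrors += weight;
    }
  }

  // Avoid NaN when there is nothing to aggregate
  return weightedTotal === 0 ? 0 : weightedErrors / weightedTotal;
}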

Clock skew between services breaks tail-based sampling. If Service A's clock is 30 seconds ahead of Service B, traces will time out before all spans arrive. Use NTP synchronization and implement clock skew detection in your sampling coordinator.

Production Best Practices

Start with conservative sampling rates and increase gradually while monitoring for gaps in observability. A good baseline is 100% sampling for errors and slow requests, 10-50% for high-value endpoints, and 1% for everything else. Adjust based on your specific traffic patterns and budget constraints.

Implement sampling rate limits per service to prevent any single service from overwhelming your log infrastructure. If a service starts logging excessively due to a bug, rate limiting prevents it from consuming your entire log budget.
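One common way to enforce a per-service cap is a token bucket keyed by service name. The sketch below is a minimal in-memory version. ServiceLogLimiter, its parameters, and the idea of calling it just before shipping a log are all assumptions for illustration, and a multi-instance deployment would need shared state.

```typescript
// Hypothetical per-service token bucket: each service may emit `ratePerSecond`
// logs on average, with bursts of up to `burst` logs.
class ServiceLogLimiter {
  private buckets = new Map<string, { tokens: number; lastRefill: number }>();
  private readonly ratePerSecond: number;
  private readonly burst: number;

  constructor(ratePerSecond: number, burst: number) {
    this.ratePerSecond = ratePerSecond;
    this.burst = burst;
  }

  // Returns true if the service may ship one more log right now.
  tryAcquire(service: string, nowMs: number = Date.now()): boolean {
    let bucket = this.buckets.get(service);
    if (!bucket) {
      bucket = { tokens: this.burst, lastRefill: nowMs };
      this.buckets.set(service, bucket);
    }
    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedSeconds = Math.max(0, nowMs - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(this.burst, bucket.tokens + elapsedSeconds * this.ratePerSecond);
    bucket.lastRefill = nowMs;

    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Counting the rejected calls per service gives you the "dropped log count" signal that tells you which service is misbehaving.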

Store sampling metadata with every log entry. Include the sampling rate, sampling reason, and sampler version. This metadata is essential for accurate metric calculations and helps you understand why certain logs were kept or dropped during incident reviews.

Use separate sampling strategies for different log levels. Debug logs can be sampled aggressively (0.1%), while error logs should be sampled at 100%. This prevents debug logging from drowning out critical errors while still providing some debug visibility.
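A per-level policy can be as small as a lookup table. In the sketch below, the debug and error rates come from the paragraph above, while the info and warn rates are placeholder assumptions you would tune yourself. The random source is injectable so the behavior can be tested.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

// debug and error rates follow the text; info and warn are assumed placeholders.
const levelRates: Record<LogLevel, number> = {
  debug: 0.001,
  info: 0.01,
  warn: 0.5,
  error: 1.0
};

// Keeps a log with the probability configured for its level.
function shouldKeepByLevel(level: LogLevel, random: () => number = Math.random): boolean {
  return random() < levelRates[level];
}

// Math.random() returns values in [0, 1), so errors are always kept.
console.log(shouldKeepByLevel('error')); // prints true
```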

Build sampling dashboards that show sampling rates, dropped log counts, and buffer utilization in real-time. These metrics help you detect sampling problems before they impact incident response. Alert when sampling rates drop below thresholds for critical endpoints.

Implement sampling overrides for incident response. When investigating an issue, you need the ability to temporarily increase sampling for specific users, endpoints, or trace IDs without redeploying services. Build this capability into your sampling coordinator from the start.
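A minimal sketch of such an override mechanism is shown below: overrides are keyed by strings like "user:42" or "endpoint:/api/checkout", carry an expiry so nobody forgets to turn them off, and can only raise the rate. The class name, the key format, and how overrides reach each service (config push, admin endpoint, and so on) are all assumptions.

```typescript
interface SamplingOverride {
  rate: number;
  expiresAt: number; // epoch milliseconds
}

// Hypothetical registry of temporary sampling overrides.
class SamplingOverrides {
  private overrides = new Map<string, SamplingOverride>();

  // Register an override that disappears after ttlMs.
  set(key: string, rate: number, ttlMs: number, now: number = Date.now()): void {
    this.overrides.set(key, { rate, expiresAt: now + ttlMs });
  }

  // Returns the highest rate among the default and any live override for the given keys.
  resolve(keys: string[], defaultRate: number, now: number = Date.now()): number {
    let rate = defaultRate;
    for (const key of keys) {
      const override = this.overrides.get(key);
      if (!override) continue;
      if (override.expiresAt <= now) {
        this.overrides.delete(key); // expired: clean up and ignore
        continue;
      }
      rate = Math.max(rate, override.rate);
    }
    return rate;
  }
}

// During an incident: capture everything for one user for 15 minutes.
const incidentOverrides = new SamplingOverrides();
incidentOverrides.set('user:42', 1.0, 15 * 60 * 1000);
console.log(incidentOverrides.resolve(['user:42', 'endpoint:/api/checkout'], 0.01)); // prints 1
```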

Test your sampling strategy under load. Synthetic traffic tests should verify that sampling decisions remain consistent under high volume and that your buffer doesn't overflow during traffic spikes.

FAQ

What is the difference between head-based and tail-based sampling? Head-based sampling makes the keep-or-drop decision when a log is generated, before knowing the request outcome. Tail-based sampling buffers logs temporarily and decides after the request completes, allowing you to sample based on the full request context including final status and duration.

How do you prevent sampling from skewing observability metrics? Include the sampling rate as metadata with every log entry and weight your metric calculations accordingly. For example, if a log was sampled at 10%, it represents 10 actual occurrences. Multiply counts by the inverse of the sampling rate (1/0.1 = 10) to get accurate totals.

What sampling rate should you use for production systems? Start with 100% for errors and slow requests, 10-50% for business-critical endpoints, and 1-5% for routine traffic. Adjust based on your log volume, storage budget, and observability requirements. Monitor for gaps in visibility and increase rates where needed.

How does adaptive sampling reduce costs compared to fixed-rate sampling? Adaptive sampling adjusts rates dynamically based on traffic volume, system state, and request characteristics. During normal operation, it keeps a smaller share of routine traffic to reduce costs. During incidents or high-error periods, it automatically raises sampling rates to maintain visibility exactly when you need it most.

When should you avoid log sampling entirely? Never sample compliance-required logs (audit trails, financial transactions, security events) or logs from systems where complete history is legally mandated. Also avoid sampling during initial deployment of new services when you need full visibility to catch bugs.

How do you handle sampling in distributed traces across multiple services? Propagate the sampling decision through trace context headers (like W3C Trace Context). When the first service decides to sample a trace, all downstream services must honor that decision to maintain complete trace continuity. Use consistent trace IDs across all services.

What are the memory requirements for tail-based sampling? Tail-based sampling requires buffering logs until traces complete, typically 10-30 seconds. For a service handling 10,000 requests per second with an average of 20 logs per request, expect to buffer roughly 2-6 million log entries, requiring 2-10GB of memory depending on log size.
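The arithmetic behind that estimate can be written out as a small helper. The request rate, logs per request, and buffer durations are the figures from the answer above. The 1,000 bytes per buffered entry is an assumption, and the quoted 2-10GB range implies roughly 1 to 1.7 KB per entry once attributes and object overhead are included.

```typescript
interface BufferEstimate {
  entries: number;
  bytes: number;
}

// Entries held in memory = request rate x logs per request x seconds buffered.
function estimateTailBuffer(
  requestsPerSecond: number,
  logsPerRequest: number,
  bufferSeconds: number,
  bytesPerLog: number
): BufferEstimate {
  const entries = requestsPerSecond * logsPerRequest * bufferSeconds;
  return { entries, bytes: entries * bytesPerLog };
}

const tenSecondBuffer = estimateTailBuffer(10000, 20, 10, 1000);    // 2,000,000 entries, 2e9 bytes
const thirtySecondBuffer = estimateTailBuffer(10000, 20, 30, 1000); // 6,000,000 entries, 6e9 bytes
console.log(tenSecondBuffer.entries, thirtySecondBuffer.entries);
```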

Conclusion

Log sampling transforms observability from a cost center into a strategic capability. By intelligently selecting which logs to retain, you maintain visibility into critical system behavior while reducing storage and processing costs by 90% or more. The key is moving beyond naive random sampling to context-aware strategies that understand request characteristics, system state, and business value.

Implement adaptive sampling first—it provides immediate cost reduction with minimal complexity. Add tail-based sampling for services where complete trace visibility is critical. Layer content-aware sampling on top to ensure you never discard logs containing critical signals.

Start by auditing your current log volume and costs. Identify your highest-volume, lowest-value log sources: health checks, successful routine operations, and redundant debug messages. Implement aggressive sampling for these sources first, then gradually expand to more nuanced sampling strategies.
