Webhook Retry Logic: Exponential Backoff

Why Simple Retry Logic Fails at Scale

Most webhook implementations start with a basic retry loop: if the HTTP request fails, wait a few seconds and try again. This approach breaks down immediately under real-world conditions.

Fixed-delay retries create synchronized retry storms. When a downstream service experiences a brief outage, hundreds or thousands of webhook deliveries fail simultaneously. If all clients retry after the same fixed interval, they hammer the recovering service with a coordinated wave of requests the moment it comes back online, often causing it to fail again. This pattern repeats, creating a cascading failure that extends a brief hiccup into a prolonged outage.

Linear backoff (increasing the delay by a constant amount: 1s, 2s, 3s, 4s) spaces retries out, but not enough at scale. The delays grow too slowly to give a genuinely failed service room to recover, and stretching the step to compensate makes the first retries too slow for time-sensitive events. The strategy provides neither fast recovery for transient failures nor appropriate spacing for sustained outages.

Modern systems face additional constraints that simple retry logic ignores entirely. Rate limiting is ubiquitous—APIs enforce per-second, per-minute, and per-hour quotas. Cloud functions have execution time limits. Compliance requirements mandate event delivery within specific timeframes. Distributed tracing and observability require correlation IDs and structured logging throughout the retry lifecycle. None of these concerns fit into a basic retry loop.

Implementing Exponential Backoff with Jitter

Exponential backoff solves the spacing problem by doubling the delay between each retry attempt. Combined with jitter—random variation in the delay—it prevents synchronized retry storms while providing fast recovery for transient failures.
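
To make the difference in spacing concrete, the sketch below computes the nominal delay schedule (before jitter and without a cap) for each strategy. The `delaySchedule` helper and the `Strategy` type are illustrative names, not part of the implementation that follows, and the 1-second base delay is an assumed value.

```typescript
type Strategy = 'fixed' | 'linear' | 'exponential';

// Nominal delay (in ms) before each retry, without jitter or a cap.
function delaySchedule(strategy: Strategy, baseMs: number, retries: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry <= retries; retry++) {
    if (strategy === 'fixed') {
      delays.push(baseMs);                    // same wait every time
    } else if (strategy === 'linear') {
      delays.push(baseMs * retry);            // grows by baseMs per retry
    } else {
      delays.push(baseMs * 2 ** (retry - 1)); // doubles per retry
    }
  }
  return delays;
}

console.log(delaySchedule('fixed', 1000, 5));       // [1000, 1000, 1000, 1000, 1000]
console.log(delaySchedule('linear', 1000, 5));      // [1000, 2000, 3000, 4000, 5000]
console.log(delaySchedule('exponential', 1000, 5)); // [1000, 2000, 4000, 8000, 16000]
```

After five retries the exponential schedule has waited 31 seconds in total, compared with 15 seconds for linear and 5 seconds for fixed, while its first retry is just as fast.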

Here's a production-grade implementation in TypeScript that handles the complexities of modern webhook delivery:

interface WebhookRetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  timeoutMs: number;
  jitterFactor: number;
}

interface WebhookDeliveryResult {
  success: boolean;
  statusCode?: number;
  attempt: number;
  totalDelayMs: number;
  error?: Error;
}

class WebhookDeliveryService {
  private config: WebhookRetryConfig;
  private circuitBreaker: CircuitBreaker;

  // Partial lets callers override only the fields they need; the rest use defaults
  constructor(config: Partial<WebhookRetryConfig> = {}) {
    this.config = {
      maxRetries: 5,
      baseDelayMs: 1000,
      maxDelayMs: 300000, // 5 minutes
      timeoutMs: 30000,
      jitterFactor: 0.3,
      ...config
    };
    this.circuitBreaker = new CircuitBreaker({
      failureThreshold: 5,
      resetTimeoutMs: 60000
    });
  }

  async deliverWithRetry(
    url: string,
    payload: unknown,
    headers: Record<string, string>,
    correlationId: string
  ): Promise<WebhookDeliveryResult> {
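    // attempt counts every delivery, so the loop makes at most
    // maxRetries + 1 requests: the initial attempt plus maxRetries retries
    let attempt = 0;
    let totalDelayMs = 0;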

    while (attempt <= this.config.maxRetries) {
      attempt++;

      // Check circuit breaker before attempting delivery
      if (this.circuitBreaker.isOpen(url)) {
        console.warn(`Circuit breaker open for ${url}`, {
          correlationId,
          attempt
        });

        if (attempt > this.config.maxRetries) {
          return {
            success: false,
            attempt,
            totalDelayMs,
            error: new Error('Circuit breaker open, max retries exceeded')
          };
        }

        const delay = this.calculateDelay(attempt);
        await this.sleep(delay);
        totalDelayMs += delay;
        continue;
      }

      // Abort the request if it takes longer than timeoutMs
      const controller = new AbortController();
      const timeoutId = setTimeout(
        () => controller.abort(),
        this.config.timeoutMs
      );

      try {
        const response = await fetch(url, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'X-Correlation-ID': correlationId,
            'X-Delivery-Attempt': attempt.toString(),
            ...headers
          },
          body: JSON.stringify(payload),
          signal: controller.signal
        }).finally(() => clearTimeout(timeoutId)); // clear the timer on success and on failure

        // Success cases
        if (response.ok) {
          this.circuitBreaker.recordSuccess(url);
          return {
            success: true,
            statusCode: response.status,
            attempt,
            totalDelayMs
          };
        }

        // Non-retryable client errors
        if (response.status >= 400 && response.status < 500 && response.status !== 429) {
          console.error(`Non-retryable error for ${url}`, {
            correlationId,
            statusCode: response.status,
            attempt
          });
          return {
            success: false,
            statusCode: response.status,
            attempt,
            totalDelayMs,
            error: new Error(`HTTP ${response.status}`)
          };
        }

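        // Rate limiting - respect Retry-After header
        if (response.status === 429) {
          const retryAfter = response.headers.get('Retry-After');
          // Retry-After is either delta-seconds or an HTTP-date. Fall back to
          // backoff when it is missing or unparseable, and cap the wait.
          let retryAfterMs = NaN;
          if (retryAfter) {
            const seconds = Number(retryAfter);
            retryAfterMs = Number.isFinite(seconds)
              ? seconds * 1000
              : Date.parse(retryAfter) - Date.now();
          }
          const delay = Number.isFinite(retryAfterMs) && retryAfterMs >= 0
            ? Math.min(retryAfterMs, this.config.maxDelayMs)
            : this.calculateDelay(attempt);

          console.warn(`Rate limited by ${url}`, {
            correlationId,
            retryAfter,
            attempt
          });

          if (attempt <= this.config.maxRetries) {
            await this.sleep(delay);
            totalDelayMs += delay;
            continue;
          }
        }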

        // Server errors - apply exponential backoff
        this.circuitBreaker.recordFailure(url);

        if (attempt <= this.config.maxRetries) {
          const delay = this.calculateDelay(attempt);
          console.info(`Retrying ${url} after ${delay}ms`, {
            correlationId,
            attempt,
            statusCode: response.status
          });
          await this.sleep(delay);
          totalDelayMs += delay;
          continue;
        }

        return {
          success: false,
          statusCode: response.status,
          attempt,
          totalDelayMs,
          error: new Error(`Max retries exceeded`)
        };

      } catch (error) {
        this.circuitBreaker.recordFailure(url);

        console.error(`Network error delivering to ${url}`, {
          correlationId,
          attempt,
          error: error instanceof Error ? error.message : 'Unknown error'
        });

        if (attempt <= this.config.maxRetries) {
          const delay = this.calculateDelay(attempt);
          await this.sleep(delay);
          totalDelayMs += delay;
          continue;
        }

        return {
          success: false,
          attempt,
          totalDelayMs,
          error: error instanceof Error ? error : new Error('Unknown error')
        };
      }
    }

    return {
      success: false,
      attempt,
      totalDelayMs,
      error: new Error('Max retries exceeded')
    };
  }

  private calculateDelay(attempt: number): number {
    // Exponential backoff: baseDelay * 2^(attempt - 1)
    const exponentialDelay = this.config.baseDelayMs * Math.pow(2, attempt - 1);

    // Cap at maximum delay
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);

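    // Add jitter: scale the delay by a random factor in
    // [1 - jitterFactor, 1 + jitterFactor). Because jitter is applied after
    // the cap, the result can exceed maxDelayMs by up to jitterFactor.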
    const jitterRange = cappedDelay * this.config.jitterFactor;
    const jitter = (Math.random() * 2 - 1) * jitterRange;

    return Math.floor(cappedDelay + jitter);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

This implementation addresses several production requirements. The backoff calculation doubles the delay with each attempt and caps it at a maximum value so that delays cannot grow without bound. Jitter randomizes each delay so that multiple clients do not retry in lockstep. The circuit breaker counts consecutive failures per endpoint URL and temporarily stops sending requests to an endpoint that keeps failing, which avoids wasted work and gives the service time to recover.
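
One way to check the delay bounds is to pull the calculation into a pure function with an injectable random source. The sketch below is a variant of `calculateDelay`, not the same code: the `backoffDelay` name and the `random` parameter are assumptions for illustration, and it clamps a second time after jitter so the result never exceeds the cap.

```typescript
// Hypothetical pure variant of calculateDelay; `random` must return a value in [0, 1).
function backoffDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  jitterFactor: number,
  random: () => number = Math.random
): number {
  // Exponential growth: base, 2x base, 4x base, ...
  const exponential = baseDelayMs * 2 ** (attempt - 1);
  const capped = Math.min(exponential, maxDelayMs);

  // Map random() to a multiplier in [1 - jitterFactor, 1 + jitterFactor).
  const multiplier = 1 + (random() * 2 - 1) * jitterFactor;

  // Clamp again so jitter cannot push the delay past the cap.
  return Math.min(Math.floor(capped * multiplier), maxDelayMs);
}

// With the midpoint of the random range, jitter is zero and the nominal delay shows through.
console.log(backoffDelay(4, 1000, 300000, 0.3, () => 0.5));  // 8000
console.log(backoffDelay(20, 1000, 300000, 0.3, () => 0.5)); // 300000 (capped)
```

Injecting the random source keeps the function deterministic in tests while production callers rely on the `Math.random` default.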

Circuit Breaker Pattern for Webhook Resilience

The circuit breaker pattern is essential for webhook retry logic at scale. It prevents your system from repeatedly attempting to deliver webhooks to endpoints that are clearly down, reducing load on both your infrastructure and the failing service.

interface CircuitBreakerConfig {
  failureThreshold: number;
  resetTimeoutMs: number;
  halfOpenMaxAttempts: number;
}

enum CircuitState {
  CLOSED = 'CLOSED',
  OPEN = 'OPEN',
  HALF_OPEN = 'HALF_OPEN'
}

class CircuitBreaker {
  private states: Map<string, {
    state: CircuitState;
    failures: number;
    lastFailureTime: number;
    halfOpenAttempts: number;
  }>;
  private config: CircuitBreakerConfig;

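  // Partial lets callers omit fields (such as halfOpenMaxAttempts) and rely on defaults
  constructor(config: Partial<CircuitBreakerConfig> = {}) {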
    this.states = new Map();
    this.config = {
      failureThreshold: 5,
      resetTimeoutMs: 60000,
      halfOpenMaxAttempts: 3,
      ...config
    };
  }

  isOpen(endpoint: string): boolean {
    const state = this.getState(endpoint);

    if (state.state === CircuitState.OPEN) {
      const timeSinceLastFailure = Date.now() - state.lastFailureTime;

      // Transition to half-open after reset timeout
      if (timeSinceLastFailure >= this.config.resetTimeoutMs) {
        state.state = CircuitState.HALF_OPEN;
        state.halfOpenAttempts = 0;
        return false;
      }

      return true;
    }

    return false;
  }

  recordSuccess(endpoint: string): void {
    const state = this.getState(endpoint);

    if (state.state === CircuitState.HALF_OPEN) {
      // Successful delivery in half-open state - close the circuit
      state.state = CircuitState.CLOSED;
      state.failures = 0;
      state.halfOpenAttempts = 0;
    } else if (state.state === CircuitState.CLOSED) {
      // Reset failure count on success
      state.failures = 0;
    }
  }

  recordFailure(endpoint: string): void {
    const state = this.getState(endpoint);
    state.failures++;
    state.lastFailureTime = Date.now();

    if (state.state === CircuitState.HALF_OPEN) {
      state.halfOpenAttempts++;

      if (state.halfOpenAttempts >= this.config.halfOpenMaxAttempts) {
        state.state = CircuitState.OPEN;
      }
    } else if (state.state === CircuitState.CLOSED) {
      if (state.failures >= this.config.failureThreshold) {
        state.state = CircuitState.OPEN;
        console.warn(`Circuit breaker opened for ${endpoint}`, {
          failures: state.failures,
          threshold: this.config.failureThreshold
        });
      }
    }
  }

  private getState(endpoint: string) {
    if (!this.states.has(endpoint)) {
      this.states.set(endpoint, {
        state: CircuitState.CLOSED,
        failures: 0,
        lastFailureTime: 0,
        halfOpenAttempts: 0
      });
    }
    return this.states.get(endpoint)!;
  }

  getMetrics(endpoint: string) {
    return this.getState(endpoint);
  }
}

The circuit breaker has three states: closed (normal operation), open (blocking requests), and half-open (testing recovery). When the failure count reaches failureThreshold, it opens the circuit and stops delivery attempts. After resetTimeoutMs has passed since the last failure, it enters the half-open state to test whether the service has recovered. A successful delivery closes the circuit; halfOpenMaxAttempts failed deliveries reopen it.
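
The transitions are easier to follow with a controllable clock. The sketch below is a deliberately simplified, single-endpoint breaker written for illustration: `MiniBreaker`, `allowRequest`, and the injected `now` function are assumed names, and unlike the class above it reopens after a single failed probe in the half-open state.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Simplified single-endpoint breaker; `now` is injected so time can be faked.
class MiniBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = Date.now
  ) {}

  getState(): BreakerState {
    return this.state;
  }

  // Returns true when a request may be sent.
  allowRequest(): boolean {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // reset timeout elapsed: let a probe through
    }
    return this.state !== 'OPEN';
  }

  recordSuccess(): void {
    this.state = 'CLOSED';
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures++;
    // In this sketch one failed probe reopens the circuit immediately.
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}

let clock = 0;
const breaker = new MiniBreaker(3, 60_000, () => clock);

breaker.recordFailure();
breaker.recordFailure();
breaker.recordFailure();
console.log(breaker.getState());     // OPEN
console.log(breaker.allowRequest()); // false

clock = 60_000;                      // pretend the reset timeout has passed
console.log(breaker.allowRequest()); // true
console.log(breaker.getState());     // HALF_OPEN

breaker.recordSuccess();
console.log(breaker.getState());     // CLOSED
```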

Idempotency and Deduplication

Webhook retry logic inevitably leads to duplicate deliveries. Network failures can occur after the receiving service processes the webhook but before it sends a response. Your retry logic will resend the webhook, causing the receiver to process it twice.

Idempotency ensures that processing the same webhook multiple times produces the same result as processing it once. Implement this through idempotency keys:

interface WebhookPayload {
  idempotencyKey: string;
  timestamp: number;
  eventType: string;
  data: unknown;
}

class IdempotentWebhookProcessor {
  private processedKeys: Map<string, {
    timestamp: number;
    result: unknown;
  }>;
  private keyExpirationMs: number;

  constructor(keyExpirationMs: number = 86400000) { // 24 hours
    this.processedKeys = new Map();
    this.keyExpirationMs = keyExpirationMs;

    // Cleanup expired keys periodically
    setInterval(() => this.cleanupExpiredKeys(), 3600000); // Every hour
  }

  async processWebhook(
    payload: WebhookPayload,
    processor: (data: unknown) => Promise<unknown>
  ): Promise<{ processed: boolean; result: unknown }> {
    const existing = this.processedKeys.get(payload.idempotencyKey);

    if (existing) {
      console.info('Duplicate webhook detected', {
        idempotencyKey: payload.idempotencyKey,
        originalTimestamp: existing.timestamp
      });

      return {
        processed: false,
        result: existing.result
      };
    }

    const result = await processor(payload.data);

    this.processedKeys.set(payload.idempotencyKey, {
      timestamp: Date.now(),
      result
    });

    return {
      processed: true,
      result
    };
  }

  private cleanupExpiredKeys(): void {
    const now = Date.now();
    const expiredKeys: string[] = [];

    for (const [key, value] of this.processedKeys.entries()) {
      if (now - value.timestamp > this.keyExpirationMs) {
        expiredKeys.push(key);
      }
    }

    expiredKeys.forEach(key => this.processedKeys.delete(key));

    if (expiredKeys.length > 0) {
      console.info(`Cleaned up ${expiredKeys.length} expired idempotency keys`);
    }
  }
}

In production systems, store idempotency keys in Redis or a similar distributed cache rather than in-memory maps. This ensures consistency across multiple service instances and survives restarts.
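
The in-memory class above also has a timing gap: it checks the map, awaits the processor, and only then stores the result, so two duplicates that arrive while the first is still processing both pass the check. A minimal sketch of one way to close that gap is to store the in-flight promise before any `await`. The `InFlightDeduplicator` class and its `run` method are hypothetical names, and expiry is left out for brevity. With a shared store, an atomic set-if-absent operation (for example Redis `SET` with the `NX` option) would play the same role.

```typescript
// Hypothetical helper: deduplicates concurrent and repeated work by idempotency key.
class InFlightDeduplicator<T> {
  // Stores the promise itself, so a duplicate arriving mid-processing
  // waits on the original work instead of starting it again.
  private readonly entries = new Map<string, Promise<T>>();

  run(key: string, work: () => Promise<T>): { duplicate: boolean; result: Promise<T> } {
    const existing = this.entries.get(key);
    if (existing) {
      return { duplicate: true, result: existing };
    }

    // Register synchronously: no await sits between the check and the set.
    const result = work().catch((error) => {
      // Forget failed work so that a later retry can run it again.
      this.entries.delete(key);
      throw error;
    });
    this.entries.set(key, result);
    return { duplicate: false, result };
  }
}

const dedup = new InFlightDeduplicator<string>();
let runs = 0;
const handler = async () => {
  runs++;
  return 'processed';
};

const first = dedup.run('evt_123', handler);
const second = dedup.run('evt_123', handler); // arrives before the first finishes
console.log(first.duplicate, second.duplicate, runs); // false true 1
```

Both callers receive the same promise, so the duplicate sees the original result once it resolves.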

Dead Letter Queues and Observability

Even with sophisticated retry logic, some webhooks will ultimately fail. Dead letter queues (DLQs) capture these permanently failed events for manual review and reprocessing.

interface DeadLetterEvent {
  originalPayload: unknown;
  targetUrl: string;
  correlationId: string;
  attempts: number;
  firstAttemptTime: number;
  lastAttemptTime: number;
  lastError: string;
  lastStatusCode?: number;
}

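// DeadLetterQueue and MetricsCollector are not defined in this article; the
// interfaces below are assumed minimal shapes that match how they are used here.
interface DeadLetterQueue {
  enqueue(event: DeadLetterEvent): Promise<void>;
}

interface MetricsCollector {
  recordSuccess(url: string, durationMs: number, attempts: number): void;
  recordFailure(url: string, durationMs: number, attempts: number): void;
  recordError(url: string): void;
}

class WebhookDeliveryOrchestrator {
  // Dependencies are injected so the fields are always initialized
  constructor(
    private readonly deliveryService: WebhookDeliveryService,
    private readonly deadLetterQueue: DeadLetterQueue,
    private readonly metrics: MetricsCollector
  ) {}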

  async deliver(
    url: string,
    payload: unknown,
    correlationId: string
  ): Promise<void> {
    const startTime = Date.now();

    try {
      const result = await this.deliveryService.deliverWithRetry(
        url,
        payload,
        {},
        correlationId
      );

      const duration = Date.now() - startTime;

      if (result.success) {
        this.metrics.recordSuccess(url, duration, result.attempt);
        console.info('Webhook delivered successfully', {
          url,
          correlationId,
          attempts: result.attempt,
          durationMs: duration
        });
      } else {
        this.metrics.recordFailure(url, duration, result.attempt);

        await this.deadLetterQueue.enqueue({
          originalPayload: payload,
          targetUrl: url,
          correlationId,
          attempts: result.attempt,
          firstAttemptTime: startTime,
          lastAttemptTime: Date.now(),
          lastError: result.error?.message || 'Unknown error',
          lastStatusCode: result.statusCode
        });

        console.error('Webhook delivery failed permanently', {
          url,
          correlationId,
          attempts: result.attempt,
          error: result.error?.message
        });
      }
    } catch (error) {
      this.metrics.recordError(url);
      throw error;
    }
  }
}
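
The orchestrator only needs an `enqueue` method from its dead letter queue. As a rough sketch for local testing, an in-memory queue that also supports pulling events back out for reprocessing might look like the following. `InMemoryDeadLetterQueue`, `DeadLetterRecord`, `size`, and `drainFor` are assumed names, the record is a trimmed-down stand-in for `DeadLetterEvent`, and the URLs are placeholders. A real system would use durable storage.

```typescript
// Trimmed-down stand-in for DeadLetterEvent, so the sketch is self-contained.
interface DeadLetterRecord {
  correlationId: string;
  targetUrl: string;
  payload: unknown;
  lastError: string;
}

class InMemoryDeadLetterQueue {
  private records: DeadLetterRecord[] = [];

  // Returns a promise to match how the orchestrator awaits enqueue().
  enqueue(record: DeadLetterRecord): Promise<void> {
    this.records.push(record);
    return Promise.resolve();
  }

  size(): number {
    return this.records.length;
  }

  // Removes and returns every record for one target, e.g. to redeliver
  // them once the receiver is healthy again.
  drainFor(targetUrl: string): DeadLetterRecord[] {
    const drained = this.records.filter((r) => r.targetUrl === targetUrl);
    this.records = this.records.filter((r) => r.targetUrl !== targetUrl);
    return drained;
  }
}

const dlq = new InMemoryDeadLetterQueue();
void dlq.enqueue({ correlationId: 'c1', targetUrl: 'https://a.example/hook', payload: { id: 1 }, lastError: 'HTTP 503' });
void dlq.enqueue({ correlationId: 'c2', targetUrl: 'https://b.example/hook', payload: { id: 2 }, lastError: 'HTTP 500' });

const toReplay = dlq.drainFor('https://a.example/hook');
console.log(toReplay.length, dlq.size()); // 1 1
```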

Comprehensive observability is critical for operating webhook systems at scale. Track metrics including delivery success rate, retry attempts per webhook, circuit breaker state transitions, average delivery latency, and DLQ size. Use structured logging with correlation IDs to trace individual webhook deliveries across retries.
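
As an illustration of how a few of these metrics can be derived from per-delivery records, here is a minimal in-memory aggregator. `DeliveryMetrics`, `record`, and `snapshot` are assumed names for this sketch. A production setup would more likely export counters and histograms to a metrics backend.

```typescript
interface MetricsSnapshot {
  successRate: number;  // fraction of deliveries that eventually succeeded
  avgAttempts: number;  // mean number of attempts per delivery
  avgLatencyMs: number; // mean end-to-end delivery time, retries included
}

class DeliveryMetrics {
  private deliveries = 0;
  private successes = 0;
  private totalAttempts = 0;
  private totalLatencyMs = 0;

  // Call once per webhook, after the final attempt.
  record(success: boolean, latencyMs: number, attempts: number): void {
    this.deliveries++;
    if (success) this.successes++;
    this.totalAttempts += attempts;
    this.totalLatencyMs += latencyMs;
  }

  snapshot(): MetricsSnapshot {
    if (this.deliveries === 0) {
      return { successRate: 0, avgAttempts: 0, avgLatencyMs: 0 };
    }
    return {
      successRate: this.successes / this.deliveries,
      avgAttempts: this.totalAttempts / this.deliveries,
      avgLatencyMs: this.totalLatencyMs / this.deliveries
    };
  }
}

const deliveryMetrics = new DeliveryMetrics();
deliveryMetrics.record(true, 100, 1);
deliveryMetrics.record(true, 100, 1);
deliveryMetrics.record(true, 200, 2);
deliveryMetrics.record(false, 400, 4);
console.log(deliveryMetrics.snapshot());
// { successRate: 0.75, avgAttempts: 2, avgLatencyMs: 200 }
```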

Common Pitfalls and Edge Cases

Unbounded retry delays: Without a maximum delay cap, exponential backoff can lead to delays of hours or days. Always set a reasonable maximum delay (typically 5-15 minutes) and consider moving to a DLQ after that point.

Ignoring HTTP status codes: Not all failures should trigger retries. 4xx errors (except 429 rate limiting) typically indicate client errors that won't resolve with retries. Only retry 5xx server errors, network failures, and timeouts.
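
The retry rule in the previous paragraph can be written as a small classifier. This is a sketch under the assumptions stated there; the `DeliveryOutcome` type and the `shouldRetry` function are illustrative names.

```typescript
type DeliveryOutcome =
  | { kind: 'response'; status: number }
  | { kind: 'network-error' }
  | { kind: 'timeout' };

// Retry on 429, 5xx, network failures, and timeouts; give up on everything else.
function shouldRetry(outcome: DeliveryOutcome): boolean {
  if (outcome.kind !== 'response') {
    return true; // network failures and timeouts may be transient
  }
  const { status } = outcome;
  return status === 429 || status >= 500;
}

console.log(shouldRetry({ kind: 'response', status: 503 })); // true
console.log(shouldRetry({ kind: 'response', status: 429 })); // true
console.log(shouldRetry({ kind: 'response', status: 404 })); // false
console.log(shouldRetry({ kind: 'timeout' }));               // true
```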

Missing timeout configuration: Without request timeouts, a single slow endpoint can tie up resources indefinitely. Set aggressive timeouts (10-30 seconds) and treat timeouts as retryable failures.

Retry storms during deployments: When you deploy a new version of a webhook receiver, existing in-flight retries can overwhelm it during startup. Implement graceful startup periods where the circuit breaker is more conservative, or use deployment strategies like canary releases.

Memory leaks in retry state: Storing retry state
