Why Traditional Rollback Approaches Fail in Modern Distributed Systems

The fundamental problem is that distributed transactions across microservices cannot rely on database-level rollback mechanisms. When Service A commits data to its PostgreSQL instance and Service B commits to its MongoDB cluster, there's no shared transaction coordinator that can atomically roll back both operations.

Two-phase commit (2PC) addresses this in theory but imposes trade-offs that are often unacceptable in cloud environments. Every participant must hold locks on its resources from the prepare phase until the coordinator's decision arrives, which reduces throughput and availability. When services span multiple AWS regions or cloud providers, network partitions are more likely, and a coordinator that times out or crashes after the prepare phase leaves participants in doubt, unable to commit or abort on their own. Many modern systems favor availability over strict consistency, which makes 2PC's blocking behavior hard to reconcile with their SLA requirements.
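To make the blocking behavior concrete, here is a minimal in-memory sketch of 2PC. The `Participant` class, the `twoPhaseCommit` function, and the `coordinatorCrashesAfterPrepare` flag are hypothetical names invented for this illustration; a real protocol also involves durable logs, timeouts, and recovery, which are omitted.

```typescript
type Decision = 'commit' | 'abort';

class Participant {
  readonly name: string;
  private readonly canCommit: boolean;
  locked = false;

  constructor(name: string, canCommit = true) {
    this.name = name;
    this.canCommit = canCommit;
  }

  // Phase 1: a "yes" vote takes locks that are held until a decision arrives.
  prepare(): boolean {
    if (!this.canCommit) return false;
    this.locked = true;
    return true;
  }

  // Phase 2: only the coordinator's decision releases the locks.
  finish(_decision: Decision): void {
    this.locked = false;
  }
}

function twoPhaseCommit(
  participants: Participant[],
  coordinatorCrashesAfterPrepare = false
): Decision | 'in_doubt' {
  const votes = participants.map((p) => p.prepare());
  if (coordinatorCrashesAfterPrepare) return 'in_doubt'; // no decision is ever sent
  const decision: Decision = votes.every(Boolean) ? 'commit' : 'abort';
  participants.forEach((p) => p.finish(decision));
  return decision;
}

const healthy = [new Participant('orders-db'), new Participant('payments-db')];
const healthyOutcome = twoPhaseCommit(healthy);

const stuck = [new Participant('orders-db'), new Participant('payments-db')];
const stuckOutcome = twoPhaseCommit(stuck, true);
const stillLocked = stuck.filter((p) => p.locked).map((p) => p.name);
```

In the second run both participants voted yes and then never heard back, so `stillLocked` lists both of them. That is the in-doubt window the paragraph above describes.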

Event sourcing alone doesn't solve compensation either. While event logs provide audit trails, they don't automatically generate compensating actions. A "PaymentProcessed" event needs explicit "PaymentRefunded" logic—the system won't infer this. Teams often assume event replay handles rollback, but replay just reapplies the same operations, potentially making inconsistencies worse.
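A small sketch of the difference between replay and compensation, assuming a hypothetical in-memory event log. The `PaymentEvent` type, `netCharged`, and `compensate` are illustrative names, not part of any event-sourcing library.

```typescript
// Hypothetical event types for illustration only.
type PaymentEvent =
  | { type: 'PaymentProcessed'; paymentId: string; amount: number }
  | { type: 'PaymentRefunded'; paymentId: string; amount: number };

// Fold the log into the net amount charged to the customer.
function netCharged(log: readonly PaymentEvent[]): number {
  return log.reduce(
    (total, e) => (e.type === 'PaymentProcessed' ? total + e.amount : total - e.amount),
    0
  );
}

// The compensating event has to be written explicitly; nothing derives it from the log.
function compensate(
  e: Extract<PaymentEvent, { type: 'PaymentProcessed' }>
): PaymentEvent {
  return { type: 'PaymentRefunded', paymentId: e.paymentId, amount: e.amount };
}

const processed = { type: 'PaymentProcessed', paymentId: 'pay_1', amount: 50 } as const;
const log: PaymentEvent[] = [processed];

const replayed = netCharged(log); // replaying the log reproduces the charge
log.push(compensate(processed));
const afterCompensation = netCharged(log); // only the appended refund undoes it
```

Replaying the log yields the same charged amount every time. The state changes only once the `PaymentRefunded` event is appended.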

The shift toward serverless architectures and edge computing in 2025-2026 has made compensation even more critical. AWS Lambda functions, Cloudflare Workers, and similar platforms have strict execution time limits and no guaranteed state persistence. A saga spanning multiple serverless functions must handle compensation explicitly because there's no long-lived transaction context to rely on.
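One way to cope with short-lived functions is to keep the saga's position in durable storage and let each invocation do a single unit of work. The sketch below assumes a hypothetical `stateStore` (a `Map` standing in for DynamoDB, Redis, or similar) and synchronous steps to keep the example small; `invokeOnce`, `SagaState`, and `Step` are illustrative names.

```typescript
type Status = 'running' | 'compensating' | 'completed' | 'compensated';

interface SagaState {
  sagaId: string;
  cursor: number; // count of steps completed and not yet compensated
  status: Status;
}

interface Step {
  name: string;
  action: () => void; // throws on failure
  compensation: () => void;
}

// Stand-in for durable storage (hypothetical).
const stateStore = new Map<string, SagaState>();

// One short-lived invocation: load state, do one unit of work, save state.
function invokeOnce(sagaId: string, steps: Step[]): Status {
  const loaded = stateStore.get(sagaId);
  const state: SagaState = loaded
    ? { ...loaded }
    : { sagaId, cursor: 0, status: 'running' };

  if (state.status === 'running') {
    if (state.cursor === steps.length) {
      state.status = 'completed';
    } else {
      try {
        steps[state.cursor].action();
        state.cursor += 1;
      } catch {
        state.status = 'compensating'; // later invocations walk backwards
      }
    }
  } else if (state.status === 'compensating') {
    if (state.cursor === 0) {
      state.status = 'compensated';
    } else {
      steps[state.cursor - 1].compensation();
      state.cursor -= 1;
    }
  }

  stateStore.set(sagaId, state);
  return state.status;
}

// Demo: the second step fails, so the first step is compensated.
const calls: string[] = [];
const demoSteps: Step[] = [
  {
    name: 'charge',
    action: () => { calls.push('charge'); },
    compensation: () => { calls.push('refund'); },
  },
  {
    name: 'reserve',
    action: () => { throw new Error('out of stock'); },
    compensation: () => { calls.push('release'); },
  },
];

let status = invokeOnce('order-1', demoSteps);
while (status === 'running' || status === 'compensating') {
  status = invokeOnce('order-1', demoSteps);
}
```

Because a function can crash between running a step and saving the state, a step may run again on the next invocation. The sketch therefore relies on actions and compensations being idempotent.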

Implementing Saga Compensation: Orchestration vs Choreography

The saga compensation pattern implements rollback through compensating transactions—explicit operations that semantically undo completed work. Two architectural approaches exist: orchestration and choreography.

Orchestration uses a central coordinator that explicitly invokes each service and tracks saga state. When a step fails, the orchestrator calls compensation handlers in reverse order. This provides clear visibility and simpler debugging but creates a single point of failure and potential bottleneck.

Choreography distributes coordination through event-driven communication. Each service listens for events, performs its work, and publishes success or failure events. Compensation triggers when services detect failure events. This scales better but makes the overall saga flow harder to understand and debug.
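A minimal sketch of choreographed compensation, assuming a hypothetical synchronous in-memory `EventBus` in place of a real broker such as Kafka or SNS. The event names and the `registerPaymentService` and `registerInventoryService` functions are invented for this example.

```typescript
type SagaEvent =
  | { type: 'OrderPlaced'; orderId: string; amount: number }
  | { type: 'PaymentCharged'; orderId: string }
  | { type: 'InventoryReserved'; orderId: string }
  | { type: 'InventoryFailed'; orderId: string }
  | { type: 'PaymentRefunded'; orderId: string; amount: number };

type Handler = (event: SagaEvent) => void;

class EventBus {
  private readonly handlers: Handler[] = [];
  readonly history: string[] = [];

  subscribe(handler: Handler): void {
    this.handlers.push(handler);
  }

  publish(event: SagaEvent): void {
    this.history.push(event.type);
    for (const handler of this.handlers) handler(event);
  }
}

// Payment reacts to new orders and compensates when inventory reports failure.
function registerPaymentService(bus: EventBus): void {
  const charges = new Map<string, number>();
  bus.subscribe((event) => {
    if (event.type === 'OrderPlaced') {
      charges.set(event.orderId, event.amount);
      bus.publish({ type: 'PaymentCharged', orderId: event.orderId });
    } else if (event.type === 'InventoryFailed') {
      const amount = charges.get(event.orderId);
      if (amount === undefined) return; // nothing charged, or already refunded
      charges.delete(event.orderId);
      bus.publish({ type: 'PaymentRefunded', orderId: event.orderId, amount });
    }
  });
}

// Inventory reacts to successful payments; inStock simulates availability.
function registerInventoryService(bus: EventBus, inStock: boolean): void {
  bus.subscribe((event) => {
    if (event.type !== 'PaymentCharged') return;
    if (inStock) {
      bus.publish({ type: 'InventoryReserved', orderId: event.orderId });
    } else {
      bus.publish({ type: 'InventoryFailed', orderId: event.orderId });
    }
  });
}

const bus = new EventBus();
registerPaymentService(bus);
registerInventoryService(bus, false);
bus.publish({ type: 'OrderPlaced', orderId: 'order-1', amount: 50 });
```

No component holds the whole flow. The sequence can only be reconstructed from `bus.history`, which is the debugging cost mentioned above.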

For many production systems, a hybrid approach works well: orchestration for critical business flows with complex compensation logic, choreography for high-throughput background processes.

Production-Grade Saga Orchestration Implementation

Here's a realistic implementation using TypeScript with explicit compensation handlers:

interface SagaStep<T, C> {
  name: string;
  action: (context: T) => Promise<C>;
  compensation: (context: T, actionResult: C) => Promise<void>;
  timeout?: number;
}

interface SagaContext {
  orderId: string;
  customerId: string;
  amount: number;
  metadata: Record<string, any>;
}

class SagaOrchestrator<T> {
  private steps: SagaStep<T, any>[] = [];
  private completedSteps: Array<{ step: SagaStep<T, any>; result: any }> = [];

  addStep<C>(step: SagaStep<T, C>): this {
    this.steps.push(step);
    return this;
  }

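  async execute(context: T): Promise<{ success: boolean; error?: Error }> {
    // Reset per-run state so a reused orchestrator never compensates steps from an earlier run
    this.completedSteps = [];
    try {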
      for (const step of this.steps) {
        const timeoutMs = step.timeout || 30000;
        const result = await this.executeWithTimeout(
          step.action(context),
          timeoutMs,
          `Step ${step.name} timed out`
        );

        this.completedSteps.push({ step, result });

        // Persist saga state after each step for crash recovery
        await this.persistSagaState(context, step.name, result);
      }

      return { success: true };
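    } catch (error) {
      // Note: a step that timed out is not in completedSteps, yet its action may
      // still have taken effect remotely; reconcile such steps separately
      await this.compensate(context);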
      return { success: false, error: error as Error };
    }
  }

  private async compensate(context: T): Promise<void> {
    // Compensate in reverse order
    const stepsToCompensate = [...this.completedSteps].reverse();

    for (const { step, result } of stepsToCompensate) {
      try {
        await this.executeWithTimeout(
          step.compensation(context, result),
          step.timeout || 30000,
          `Compensation for ${step.name} timed out`
        );

        await this.persistCompensationState(context, step.name);
      } catch (compensationError) {
        // Log compensation failure but continue with other compensations
        await this.handleCompensationFailure(
          context,
          step.name,
          compensationError as Error
        );
      }
    }
  }

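  private async executeWithTimeout<R>(
    promise: Promise<R>,
    timeoutMs: number,
    errorMessage: string
  ): Promise<R> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(errorMessage)), timeoutMs);
    });
    try {
      // The underlying operation is not cancelled on timeout; it may still complete remotely
      return await Promise.race([promise, timeout]);
    } finally {
      // Clear the timer so it cannot keep the process alive after the race settles
      clearTimeout(timer);
    }
  }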

  private async persistSagaState(
    context: T,
    stepName: string,
    result: any
  ): Promise<void> {
    // Store in DynamoDB, Redis, or similar for crash recovery
    // Implementation depends on your infrastructure
  }

  private async persistCompensationState(
    context: T,
    stepName: string
  ): Promise<void> {
    // Mark compensation as completed for idempotency
  }

  private async handleCompensationFailure(
    context: T,
    stepName: string,
    error: Error
  ): Promise<void> {
    // Send to dead letter queue for manual intervention
    // Alert on-call engineers
    // Store in compensation failure table for retry
  }
}

Now implement a realistic order processing saga:

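// Service clients (provider calls are simplified; error handling is left to the orchestrator)
class PaymentService {
  // Payment provider client (Stripe, Adyen, etc.); injection is omitted here for brevity
  private paymentProvider: any;

  async charge(
    customerId: string,
    amount: number,
    idempotencyKey?: string
  ): Promise<string> {
    // Pass a stable key (for example, one derived from the order ID) so that a retried
    // charge is deduplicated. The Date.now() fallback differs on every call and
    // therefore does not protect against double charges.
    const paymentId = await this.paymentProvider.createCharge({
      customer: customerId,
      amount,
      idempotencyKey: idempotencyKey ?? `charge-${customerId}-${Date.now()}`,
    });
    return paymentId;
  }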

  async refund(paymentId: string, amount: number): Promise<void> {
    await this.paymentProvider.createRefund({
      payment: paymentId,
      amount,
      reason: 'saga_compensation',
    });
  }
}

class InventoryService {
  async reserve(items: Array<{ sku: string; quantity: number }>): Promise<string> {
    const reservationId = await this.inventoryDb.transaction(async (trx) => {
      for (const item of items) {
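        // Lock the row so concurrent reservations cannot both pass the stock check
        const available = await trx('inventory')
          .where({ sku: item.sku })
          .forUpdate()
          .first();

        // A missing row means an unknown SKU; treat it like insufficient stock
        if (!available || available.quantity < item.quantity) {
          throw new Error(`Insufficient inventory for ${item.sku}`);
        }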

        await trx('inventory')
          .where({ sku: item.sku })
          .decrement('quantity', item.quantity);
      }

      const [reservation] = await trx('reservations').insert({
        items,
        status: 'reserved',
        expires_at: new Date(Date.now() + 15 * 60 * 1000), // 15 min
      }).returning('id');

      return reservation.id;
    });

    return reservationId;
  }

  async release(reservationId: string): Promise<void> {
    await this.inventoryDb.transaction(async (trx) => {
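      const reservation = await trx('reservations')
        .where({ id: reservationId })
        .forUpdate()
        .first();

      // Released reservations are kept with status 'released', so check the status
      // as well; otherwise a retried release would restock the items twice
      if (!reservation || reservation.status !== 'reserved') return;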

      for (const item of reservation.items) {
        await trx('inventory')
          .where({ sku: item.sku })
          .increment('quantity', item.quantity);
      }

      await trx('reservations')
        .where({ id: reservationId })
        .update({ status: 'released' });
    });
  }
}

class ShippingService {
  async createShipment(orderId: string, address: any): Promise<string> {
    // Integrate with ShipStation, EasyPost, or carrier API
    const shipment = await this.shippingProvider.createShipment({
      orderId,
      address,
      carrier: 'fedex',
    });
    return shipment.id;
  }

  async cancelShipment(shipmentId: string): Promise<void> {
    await this.shippingProvider.voidLabel(shipmentId);
  }
}

// Saga implementation
async function processOrder(orderData: {
  orderId: string;
  customerId: string;
  amount: number;
  items: Array<{ sku: string; quantity: number }>;
  shippingAddress: any;
}) {
  const paymentService = new PaymentService();
  const inventoryService = new InventoryService();
  const shippingService = new ShippingService();

  const saga = new SagaOrchestrator<SagaContext>()
    .addStep({
      name: 'charge_payment',
      action: async (ctx) => {
        return await paymentService.charge(ctx.customerId, ctx.amount);
      },
      compensation: async (ctx, paymentId) => {
        await paymentService.refund(paymentId, ctx.amount);
      },
      timeout: 10000,
    })
    .addStep({
      name: 'reserve_inventory',
      action: async (ctx) => {
        return await inventoryService.reserve(ctx.metadata.items);
      },
      compensation: async (ctx, reservationId) => {
        await inventoryService.release(reservationId);
      },
      timeout: 5000,
    })
    .addStep({
      name: 'create_shipment',
      action: async (ctx) => {
        return await shippingService.createShipment(
          ctx.orderId,
          ctx.metadata.shippingAddress
        );
      },
      compensation: async (ctx, shipmentId) => {
        await shippingService.cancelShipment(shipmentId);
      },
      timeout: 15000,
    });

  const result = await saga.execute({
    orderId: orderData.orderId,
    customerId: orderData.customerId,
    amount: orderData.amount,
    metadata: {
      items: orderData.items,
      shippingAddress: orderData.shippingAddress,
    },
  });

  return result;
}

Handling Compensation Failures and Idempotency

The hardest problem in saga compensation isn't the happy path—it's handling failures during compensation itself. What happens when the payment refund API is down? When inventory release times out? These scenarios require explicit strategies.

Idempotent compensation handlers are non-negotiable. Compensation might execute multiple times due to retries, crashes, or network issues. Every compensation operation must check current state before acting:

async function compensatePayment(
  paymentId: string,
  originalAmount: number
): Promise<void> {
  // Check whether a compensation refund already exists
  const existingRefunds = await paymentProvider.getRefunds(paymentId);
  if (existingRefunds.some((r) => r.reason === 'saga_compensation')) {
    return; // Already compensated
  }

  // The check above can race with a concurrent retry; the idempotency key
  // lets the provider reject a duplicate refund
  await paymentProvider.createRefund({
    payment: paymentId,
    amount: originalAmount,
    reason: 'saga_compensation',
    idempotencyKey: `compensation-${paymentId}`,
  });
}

Dead letter queues handle compensation failures that can't be automatically resolved. When compensation fails after multiple retries, move the saga context to a DLQ for manual intervention:

interface CompensationFailure {
  sagaId: string;
  stepName: string;
  context: SagaContext;
  error: string;
  attemptCount: number;
  lastAttemptAt: Date;
}

async function handleCompensationFailure(
  context: SagaContext,
  stepName: string,
  error: Error
): Promise<void> {
  const failure: CompensationFailure = {
    sagaId: context.orderId,
    stepName,
    context,
    error: error.message,
    attemptCount: 1,
    lastAttemptAt: new Date(),
  };

  await sqsClient.sendMessage({
    QueueUrl: process.env.COMPENSATION_DLQ_URL,
    MessageBody: JSON.stringify(failure),
  });

  await alertingService.sendAlert({
    severity: 'high',
    title: `Compensation failed for saga ${context.orderId}`,
    details: `Step ${stepName} compensation failed: ${error.message}`,
  });
}

Semantic locks prevent concurrent saga execution on the same entity. Use Redis or DynamoDB with conditional writes:

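async function acquireSagaLock(orderId: string, ttlSeconds: number = 300): Promise<boolean> {
  // With NX, SET does not throw when the key already exists; it returns null.
  // An ioredis-style client (assumed here) resolves to 'OK' only when the lock was acquired.
  // Connection errors propagate to the caller instead of being mistaken for contention.
  const result = await redisClient.set(
    `saga:lock:${orderId}`,
    Date.now().toString(),
    'EX', // Set expiry
    ttlSeconds,
    'NX' // Only set if not exists
  );
  return result === 'OK';
}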

Monitoring and Observability for Saga Compensation

Production saga implementations require comprehensive observability. Track these metrics:

Saga completion rate: Percentage of sagas completing successfully without compensation
Compensation execution time: How long rollback takes (should be faster than forward execution)
Partial compensation failures: Sagas where some but not all compensations succeeded
Compensation retry count: How many attempts before success or DLQ
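As an illustration of how these four metrics could be derived, the sketch below aggregates a batch of finished saga records. The `SagaRecord` shape and the `summarize` function are assumptions made for this example; in practice the numbers would usually come from a metrics backend instead of an in-memory array.

```typescript
interface SagaRecord {
  sagaId: string;
  outcome: 'completed' | 'compensated' | 'partially_compensated';
  compensationMs?: number; // total time spent in compensation
  compensationAttempts?: number; // attempts before success or DLQ
}

interface SagaMetrics {
  completionRate: number;
  meanCompensationMs: number;
  partialCompensationFailures: number;
  meanCompensationAttempts: number;
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

function summarize(records: SagaRecord[]): SagaMetrics {
  // Every saga that did not complete cleanly went through compensation.
  const rolledBack = records.filter((r) => r.outcome !== 'completed');
  return {
    completionRate:
      records.length === 0 ? 0 : (records.length - rolledBack.length) / records.length,
    meanCompensationMs: mean(rolledBack.map((r) => r.compensationMs ?? 0)),
    partialCompensationFailures: rolledBack.filter(
      (r) => r.outcome === 'partially_compensated'
    ).length,
    meanCompensationAttempts: mean(rolledBack.map((r) => r.compensationAttempts ?? 0)),
  };
}

const metrics = summarize([
  { sagaId: 'a', outcome: 'completed' },
  { sagaId: 'b', outcome: 'completed' },
  { sagaId: 'c', outcome: 'compensated', compensationMs: 200, compensationAttempts: 1 },
  { sagaId: 'd', outcome: 'partially_compensated', compensationMs: 400, compensationAttempts: 3 },
]);
```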

Implement structured logging with correlation IDs:

class SagaLogger {
  constructor(private sagaId: string) {}

  logStepStart(stepName: string): void {
    logger.info({
      sagaId: this.sagaId,
      event: 'step_start',
      step: stepName,
      timestamp: new Date().toISOString(),
    });
  }

  logStepComplete(stepName: string, duration: number): void {
    logger.info({
      sagaId: this.sagaId,
      event: 'step_complete',
      step: stepName,
      duration,
      timestamp: new Date().toISOString(),
    });
  }

  logCompensationStart(stepName: string): void {
    logger.warn({
      sagaId: this.sagaId,
      event: 'compensation_start',
      step: stepName,
      timestamp: new Date().toISOString(),
    });
  }

  logCompensationFailure(stepName: string, error: Error): void {
    logger.error({
      sagaId: this.sagaId,
      event: 'compensation_failure',
      step: stepName,
      error: error.message,
      stack: error.stack,
      timestamp: new Date().toISOString(),
    });
  }
}

Integrate with distributed tracing systems like OpenTelemetry to visualize saga execution across services:

import { trace, SpanStatusCode } from '@opentelemetry/api';

async function executeStepWithTracing<T, C>(
  step: SagaStep<T, C>,
  sagaContext: T
): Promise<C> {
  const tracer = trace.getTracer('saga-orchestrator');

  return await tracer.startActiveSpan(
    `saga.step.${step.name}`,
    async (span) => {
      try {
        span.setAttribute('saga.step.name', step.name);
        span.setAttribute('saga.id', (sagaContext as any).orderId);

        const result = await step.action(sagaContext);

        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: (error as Error).message,
        });
        throw error;
      } finally {
        span.end();
      }
    }
  );
}

Common Pitfalls and Edge Cases

Compensation order matters critically. Always compensate in reverse order of execution. If you charge payment, reserve inventory, then create shipment, compensate by canceling shipment, releasing inventory, then refunding payment. Reversing this order can leave the system in invalid states.

Timeout configuration requires careful tuning. Set timeouts too short and you'll trigger unnecessary compensations when services are just slow. Too long and failed sagas block resources. Use percentile-based timeouts: if 99% of payment charges complete in 3 seconds, set timeout to 10 seconds.
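A sketch of deriving a timeout from observed latencies, using the nearest-rank percentile definition. The `percentile` and `suggestTimeoutMs` helpers and the default headroom multiplier of 3 are assumptions for illustration; the right multiplier depends on how costly a needless compensation is for the step in question.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile requires at least one sample');
  if (p <= 0 || p > 100) throw new Error('p must be in the range (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Timeout = observed percentile multiplied by a headroom factor (assumed default: 3).
function suggestTimeoutMs(latenciesMs: number[], p = 99, headroom = 3): number {
  return Math.ceil(percentile(latenciesMs, p) * headroom);
}

// 100 samples: 10 ms, 20 ms, ..., 1000 ms
const observed = Array.from({ length: 100 }, (_, i) => (i + 1) * 10);
const p99 = percentile(observed, 99);
const suggested = suggestTimeoutMs(observed);
```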

Partial compensation creates the worst data inconsistencies. When the payment refund succeeds but the inventory release fails, the customer has been refunded while the stock stays reserved and cannot be sold. Always persist compensation state and implement retry logic with exponential backoff.
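A minimal sketch of retrying a compensation with exponential backoff. The `retryWithBackoff` helper and its `RetryOptions` are hypothetical names; the `sleep` function is injectable so that the schedule can be exercised without real waiting.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  // Injectable so tests can skip real waiting; defaults to a timer-based sleep.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  if (options.maxAttempts < 1) throw new Error('maxAttempts must be at least 1');
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;

  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < options.maxAttempts) {
        // Delay doubles after each failure: base, 2x base, 4x base, ...
        await sleep(options.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // All attempts failed; the caller can now route the saga to the DLQ.
  throw lastError;
}
```

Production code would usually add random jitter to each delay so that many failed sagas do not retry in lockstep. The operation passed in must be idempotent, since it may run several times.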

Network partitions during compensation are inevitable in distributed systems. A compensation request might succeed but the response gets lost, causing duplicate compensation attempts. Idempotency keys and state checks prevent double-refunds.
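The lost-response case can be sketched from the receiving side. The `RefundLedger` class below is a hypothetical in-memory stand-in for a provider that deduplicates by idempotency key; real providers persist the keys and often expire them after a retention window.

```typescript
class RefundLedger {
  private readonly applied = new Map<string, number>();

  // Returns true when the refund was applied, false when the key was already seen.
  refund(idempotencyKey: string, amount: number): boolean {
    if (this.applied.has(idempotencyKey)) return false;
    this.applied.set(idempotencyKey, amount);
    return true;
  }

  totalRefunded(): number {
    return Array.from(this.applied.values()).reduce((sum, amount) => sum + amount, 0);
  }
}

const ledger = new RefundLedger();
// First attempt succeeds, but assume the response never reaches the caller.
const firstAttempt = ledger.refund('compensation-pay_1', 50);
// The caller retries with the same key; the duplicate is ignored.
const retryAttempt = ledger.refund('compensation-pay_1', 50);
const refundedTotal = ledger.totalRefunded();
```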

Long-running sagas (hours or days) need different strategies. Don't hold locks for hours. Instead, use optimistic locking with version numbers and handle conflicts explicitly:

async function compensateWithOptimisticLocking(
  orderId: string,
  expectedVersion: number
): Promise<void> {
  const result = await db.query(
    `UPDATE orders 
     SET status = 'compensated', version = version + 1 
     WHERE id = $1 AND version = $2`,
    [orderId, expectedVersion]
  );

  if (result.rowCount === 0) {
    throw new Error('Order version mismatch - concurrent modification detected');
  }
}

Cascading compensations occur when compensating one saga triggers compensation in dependent sagas. Design saga boundaries carefully to minimize cross-saga dependencies.

Best Practices for Production Saga Compensation

Design compensations as first-class operations, not afterthoughts. For every service operation, implement the corresponding compensation operation with equal care. Document what each compensation does and its idempotency guarantees.

Use semantic compensation over technical rollback. Instead of deleting database records, mark them as canceled or compensated. This preserves audit trails and enables better debugging. A "PaymentRefunded" event is more meaningful than a deleted "PaymentProcessed" record.
