Why Traditional Integration Approaches Fail at Scale

Polling databases for changes seems straightforward: query a last_modified timestamp column every few seconds and process new rows. This approach collapses under modern requirements for three critical reasons.

First, polling creates unacceptable database load. A system with 50 microservices polling every 5 seconds generates 10 queries per second per service—500 QPS just for change detection before any actual work happens. This consumes connection pool capacity, increases read replica lag, and interferes with transactional workloads. When Black Friday traffic spikes, these polling queries compete with customer-facing transactions for IOPS.

Second, polling cannot capture deletes reliably. Soft deletes with deleted_at flags work until compliance requirements mandate hard deletes for GDPR erasure requests. Once a row disappears, polling-based systems never detect the change, leaving downstream systems with orphaned data that violates privacy regulations.

Third, timestamp-based polling misses mid-transaction states and creates race conditions. If a payment transaction updates three tables—orders, payments, and inventory—polling might capture the order update before the payment commits, triggering fulfillment for an unpaid order. Distributed transactions across microservices make this worse, as each service polls independently without coordination.

API-based synchronous updates seem cleaner: when Service A updates its database, it calls Service B's API to propagate the change. This creates tight coupling that violates microservice principles and introduces cascading failures. If Service B is down, Service A's transaction must either fail (breaking availability) or succeed without propagation (breaking consistency). Retry logic adds complexity, and circular dependencies between services create deadlock scenarios during deployments.

Modern CDC Architecture for Event-Driven Integration

Change data capture event-driven integration uses database transaction logs as the source of truth. Every database writes changes to a transaction log (PostgreSQL's WAL, MySQL's binlog, MongoDB's oplog) for durability and replication. CDC systems read these logs and publish changes as events to a streaming platform, decoupling producers from consumers.

The architecture consists of three layers: CDC connectors that read transaction logs, a streaming platform that buffers and distributes events, and consumers that process changes. This design provides at-least-once delivery guarantees, preserves change order per partition key, and enables independent scaling of producers and consumers.

Here's a production-grade implementation using Debezium with Kafka and TypeScript consumers:

// Debezium connector configuration for PostgreSQL
const debeziumConfig = {
  "name": "orders-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db.prod.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "${vault:secret/cdc/orders-db}",
    "database.dbname": "orders",
    "database.server.name": "orders-prod",
    "table.include.list": "public.orders,public.order_items",
    "plugin.name": "pgoutput",
    "publication.autocreate.mode": "filtered",
    "slot.name": "orders_cdc_slot",
    "slot.drop.on.stop": "false",
    "heartbeat.interval.ms": "10000",
    "tombstones.on.delete": "true",
    "transforms": "unwrap,addMetadata",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.addMetadata.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.addMetadata.timestamp.field": "cdc_timestamp"
  }
}

This configuration captures changes from the orders and order_items tables, using PostgreSQL's native logical replication (pgoutput) instead of deprecated plugins. The slot.drop.on.stop: false ensures the replication slot persists across connector restarts, preventing data loss during deployments. Heartbeat events every 10 seconds keep the replication slot active even during idle periods, preventing WAL accumulation that could fill disk space.

The consumer implementation handles idempotency and exactly-once processing semantics:

import { Kafka, EachMessagePayload } from 'kafkajs';
import { Pool } from 'pg';

interface OrderChangeEvent {
  before: Order | null;
  after: Order | null;
  op: 'c' | 'u' | 'd' | 'r'; // create, update, delete, read (snapshot)
  ts_ms: number;
  source: {
    db: string;
    table: string;
    lsn: number; // log sequence number for ordering
  };
}

class OrderCDCConsumer {
  private kafka: Kafka;
  private pool: Pool;
  private processedLSNs: Map<string, number> = new Map();

  constructor() {
    this.kafka = new Kafka({
      clientId: 'order-analytics-consumer',
      brokers: ['kafka-1.prod:9092', 'kafka-2.prod:9092', 'kafka-3.prod:9092'],
      ssl: true,
      sasl: {
        mechanism: 'scram-sha-512',
        username: process.env.KAFKA_USERNAME!,
        password: process.env.KAFKA_PASSWORD!
      }
    });

    this.pool = new Pool({
      host: 'analytics-db.prod.internal',
      database: 'analytics',
      max: 20,
      idleTimeoutMillis: 30000,
      connectionTimeoutMillis: 5000
    });
  }

  async start() {
    const consumer = this.kafka.consumer({ 
      groupId: 'order-analytics-v2',
      sessionTimeout: 30000,
      heartbeatInterval: 3000,
      maxWaitTimeInMs: 1000
    });

    await consumer.connect();
    await consumer.subscribe({ 
      topics: ['orders-prod.public.orders'],
      fromBeginning: false 
    });

    await consumer.run({
      eachMessage: async (payload: EachMessagePayload) => {
        await this.processMessage(payload);
      }
    });
  }

  private async processMessage(payload: EachMessagePayload) {
    const event: OrderChangeEvent = JSON.parse(payload.message.value!.toString());
    const partitionKey = `${event.source.table}-${payload.partition}`;

    // Idempotency check using LSN (log sequence number)
    const lastProcessedLSN = this.processedLSNs.get(partitionKey) || 0;
    if (event.source.lsn <= lastProcessedLSN) {
      console.log(`Skipping duplicate event with LSN ${event.source.lsn}`);
      return;
    }

    const client = await this.pool.connect();

    try {
      await client.query('BEGIN');

      switch (event.op) {
        case 'c':
        case 'u':
          await this.upsertOrder(client, event.after!);
          break;
        case 'd':
          await this.deleteOrder(client, event.before!);
          break;
        case 'r':
          // Snapshot read during initial sync
          await this.upsertOrder(client, event.after!);
          break;
      }

      // Store LSN for idempotency in same transaction
      await client.query(
        'INSERT INTO cdc_checkpoints (partition_key, lsn, processed_at) VALUES ($1, $2, NOW()) ON CONFLICT (partition_key) DO UPDATE SET lsn = $2, processed_at = NOW()',
        [partitionKey, event.source.lsn]
      );

      await client.query('COMMIT');
      this.processedLSNs.set(partitionKey, event.source.lsn);

    } catch (error) {
      await client.query('ROLLBACK');
      throw error;
    } finally {
      client.release();
    }
  }

  private async upsertOrder(client: any, order: Order) {
    await client.query(
      `INSERT INTO orders_analytics (order_id, customer_id, total_amount, status, updated_at)
       VALUES ($1, $2, $3, $4, $5)
       ON CONFLICT (order_id) DO UPDATE SET
         customer_id = $2,
         total_amount = $3,
         status = $4,
         updated_at = $5`,
      [order.id, order.customer_id, order.total_amount, order.status, new Date()]
    );
  }

  private async deleteOrder(client: any, order: Order) {
    await client.query(
      'DELETE FROM orders_analytics WHERE order_id = $1',
      [order.id]
    );
  }
}

This implementation achieves exactly-once processing by storing the log sequence number (LSN) in the same transaction as the business logic. If the consumer crashes after processing but before committing, the transaction rolls back and the event is reprocessed. If it crashes after committing, the LSN checkpoint prevents duplicate processing.

Handling Complex Integration Patterns

Real-world CDC implementations must handle schema evolution, data enrichment, and cross-service consistency. When the orders table adds a priority_shipping column, downstream consumers must handle both old and new event formats without downtime.

Schema Registry integration provides forward and backward compatibility:

import { SchemaRegistry } from '@kafkajs/confluent-schema-registry';

class SchemaAwareCDCConsumer {
  private registry: SchemaRegistry;

  constructor() {
    this.registry = new SchemaRegistry({
      host: 'https://schema-registry.prod.internal:8081',
      auth: {
        username: process.env.SCHEMA_REGISTRY_USER!,
        password: process.env.SCHEMA_REGISTRY_PASSWORD!
      }
    });
  }

  async processMessage(payload: EachMessagePayload) {
    const decodedMessage = await this.registry.decode(payload.message.value!);

    // Handle optional fields with defaults for backward compatibility
    const order = {
      ...decodedMessage,
      priority_shipping: decodedMessage.priority_shipping ?? false,
      estimated_delivery: decodedMessage.estimated_delivery ?? this.calculateDelivery(decodedMessage)
    };

    await this.processOrder(order);
  }

  private calculateDelivery(order: any): Date {
    // Fallback logic for events before estimated_delivery was added
    const baseDate = new Date(order.created_at);
    baseDate.setDate(baseDate.getDate() + 5);
    return baseDate;
  }
}

The transactional outbox pattern solves the dual-write problem when services need to update their database and publish events atomically. Instead of writing to the database and then publishing to Kafka (which can fail between steps), write the event to an outbox table in the same transaction:

async function createOrder(orderData: CreateOrderRequest) {
  const client = await pool.connect();

  try {
    await client.query('BEGIN');

    // Insert order
    const orderResult = await client.query(
      'INSERT INTO orders (customer_id, total_amount, status) VALUES ($1, $2, $3) RETURNING *',
      [orderData.customerId, orderData.totalAmount, 'pending']
    );

    const order = orderResult.rows[0];

    // Insert event into outbox table
    await client.query(
      `INSERT INTO outbox_events (aggregate_id, aggregate_type, event_type, payload)
       VALUES ($1, $2, $3, $4)`,
      [
        order.id,
        'Order',
        'OrderCreated',
        JSON.stringify({
          orderId: order.id,
          customerId: order.customer_id,
          totalAmount: order.total_amount,
          createdAt: order.created_at
        })
      ]
    );

    await client.query('COMMIT');
    return order;

  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}

A separate CDC connector monitors the outbox_events table and publishes events to Kafka. This guarantees that every database change has a corresponding event, even if the application crashes immediately after the transaction commits.

Common Pitfalls and Failure Modes

CDC implementations fail in predictable ways that teams must anticipate. Replication slot lag is the most common production issue. If consumers fall behind, the replication slot retains WAL segments, eventually filling disk space and crashing the database. Monitor pg_replication_slots view and alert when confirmed_flush_lsn falls more than 1GB behind pg_current_wal_lsn().

Schema changes without coordination cause consumer failures. Adding a NOT NULL column without a default value creates events that violate consumer expectations. Implement a three-phase schema migration: add the column as nullable, deploy consumers that handle null values, backfill data, then add the NOT NULL constraint.

Partition key selection determines ordering guarantees. Using customer_id as the partition key ensures all events for a customer are processed in order, but creates hot partitions for high-value customers. Using order_id distributes load evenly but loses cross-order consistency. Choose based on business requirements: payment processing needs customer-level ordering, while analytics can tolerate order-level parallelism.

Clock skew between database servers and Kafka brokers causes timestamp confusion. Always use the database transaction timestamp (ts_ms in Debezium events) for business logic, not the Kafka message timestamp or consumer processing time. This ensures consistent ordering even when consumers lag or replay events.

Tombstone events for deletes require special handling. Kafka's log compaction uses tombstones (null values) to eventually remove deleted records, but consumers must process them before compaction occurs. Configure delete.retention.ms to retain tombstones long enough for all consumers to process them, typically 24-48 hours.

Best Practices for Production CDC Systems

Implement comprehensive monitoring before deploying CDC to production. Track replication lag, consumer lag, event processing rate, and error rates per topic. Alert when lag exceeds 60 seconds or error rates exceed 0.1%. Use distributed tracing to correlate events across services, tagging spans with the LSN or Kafka offset for debugging.

Design for failure recovery from day one. Store consumer offsets in Kafka (not an external database) to leverage Kafka's exactly-once semantics. Implement circuit breakers that pause consumption when downstream systems are unhealthy, preventing cascading failures. Test recovery by deliberately killing consumers mid-processing and verifying no data loss or duplication.

Separate CDC infrastructure from application infrastructure. Run Debezium connectors on dedicated Kafka Connect clusters with independent scaling and monitoring. This prevents application deployments from disrupting CDC pipelines and allows CDC infrastructure to scale based on database change volume, not application traffic.

Use separate Kafka topics for different change types. Instead of one orders topic with all operations, create orders.created, orders.updated, and orders.deleted topics. This allows consumers to subscribe only to relevant changes and simplifies access control. A recommendation engine might only need orders.created, while a compliance system needs all three.

Implement data quality checks in consumers. Validate that monetary amounts are positive, foreign keys reference existing entities, and enum values are valid. Log validation failures to a dead letter queue for investigation rather than crashing the consumer. This prevents corrupt data from propagating through the system.

Version your event schemas from the start. Use semantic versioning (1.0.0, 1.1.0, 2.0.0) and include the version in event metadata. This enables gradual rollouts of schema changes and simplifies debugging when consumers process unexpected formats.

Frequently Asked Questions

What is the difference between CDC and event sourcing?

CDC captures changes from existing databases as a side effect, preserving the database as the source of truth. Event sourcing stores events as the primary data model, reconstructing state by replaying events. Use CDC when integrating existing systems; use event sourcing when building new systems that require full audit trails and time travel capabilities.

How does change data capture work with multi-region databases in 2025?

Modern CDC systems support active-active replication by capturing changes from each region's database and publishing to region-specific Kafka topics. Consumers merge events using vector clocks or last-write-wins strategies based on database timestamps. Cloud-native CDC services like AWS DMS and Google Datastream handle cross-region replication automatically, but require careful conflict resolution design.

What is the best way to handle large binary columns in CDC events?

Exclude binary columns (images, PDFs) from CDC events using connector configuration. Store binaries in object storage (S3, GCS) and include only the storage key in events. This keeps event sizes small (under 1MB) and prevents Kafka broker memory issues. For small binaries under 100KB, base64 encoding is acceptable but monitor topic size growth.

When should you avoid using change data capture?

Avoid CDC when databases don't support logical replication (some managed database services), when change volume exceeds 100,000 events per second per table (consider application-level events instead), or when regulatory requirements prohibit replication slots (some financial systems). Also avoid CDC for temporary tables or staging data that doesn't represent business state changes.

How do you scale CDC consumers to handle millions of events per second?

Increase Kafka topic partition count to enable parallel consumption (aim for 10-20 partitions per consumer instance). Use consumer groups to distribute partitions across instances. Optimize consumer processing by batching database writes, using connection pooling, and processing events asynchronously. For extreme scale, consider stream processing frameworks like Flink or Kafka Streams that provide built-in parallelism and state management.

What are the security implications of CDC in distributed systems?

CDC exposes database changes to Kafka, requiring encryption in transit (TLS) and at rest. Implement fine-grained access control using Kafka ACLs to restrict topic access by consumer group. Mask sensitive fields (PII, payment data) in CDC connectors before publishing to Kafka. Audit all CDC access and monitor for unusual consumption patterns that might indicate data exfiltration.

How do you test CDC pipelines before production deployment?

Use testcontainers to spin up PostgreSQL, Kafka, and Debezium in integration tests. Generate synthetic database changes and verify consumers receive correct events with proper ordering. Test failure scenarios by killing connectors mid-stream and verifying recovery. Load test with production-scale event volumes to identify bottlenecks. Use chaos engineering to inject network partitions and verify exactly-once semantics.

Conclusion

Change data capture event-driven integration transforms databases from isolated silos into real-time event sources that power modern distributed systems. By reading transaction logs instead of polling or synchronous APIs, CDC provides reliable, ordered change streams that enable immediate consistency across microservices while maintaining loose coupling and independent scalability.

The architecture patterns and code examples in

Change Data Capture: Event-Driven Integration

Why Traditional Integration Approaches Fail at Scale

Modern CDC Architecture for Event-Driven Integration

Handling Complex Integration Patterns

Common Pitfalls and Failure Modes

Best Practices for Production CDC Systems

Frequently Asked Questions

Conclusion

Comments

More from this blog

Embedding-First Architecture for Real-World LLM Apps

AI/ML Modern Patterns

Containers/K8s Modern Patterns

Containers/K8s Modern Patterns

Containers/K8s Modern Patterns

Command Palette

Why Traditional Integration Approaches Fail at Scale

Modern CDC Architecture for Event-Driven Integration

Handling Complex Integration Patterns

Common Pitfalls and Failure Modes

Best Practices for Production CDC Systems

Frequently Asked Questions

Conclusion

Comments

More from this blog