Data Backup: Disaster Recovery Planning

Welcome to TopperBlog! 👋

I'm a tech content creator passionate about helping developers level up their careers and master cutting-edge technologies.

Why Traditional Backup Approaches Fail Modern Systems

Traditional backup strategies were designed for monolithic applications with centralized databases and predictable change rates. These approaches typically involved scheduled full and incremental backups to tape or dedicated backup appliances, with recovery measured in hours or days.

This model breaks down completely in modern environments. Microservices architectures distribute state across dozens of services, each with different data stores—relational databases, document stores, object storage, message queues, and streaming platforms. Kubernetes-native applications treat infrastructure as ephemeral, making traditional agent-based backup tools ineffective. Real-time analytics pipelines process terabytes hourly, making daily backup windows insufficient for acceptable Recovery Point Objectives (RPO).

The shift to multi-cloud and hybrid architectures introduces additional complexity. Data residency requirements may mandate that European customer data never leaves EU regions, even for backup purposes. Cross-region replication costs can exceed primary storage costs when not architected correctly. API rate limits and egress charges make naive backup approaches economically unsustainable at scale.

Most critically, modern threat actors understand backup systems. Sophisticated ransomware now includes reconnaissance phases that identify and corrupt backup repositories before encrypting production data. Without proper isolation, immutability, and verification, backups become another attack vector rather than a recovery mechanism.

Modern Disaster Recovery Architecture Principles

A resilient backup and disaster recovery plan in 2025 must address five core requirements: continuous protection, immutable storage, automated verification, compliance-aware retention, and infrastructure-as-code deployment.

Continuous Protection and Minimal RPO

Modern applications cannot tolerate hours of data loss. Implement continuous data protection using change data capture (CDC) for databases, event streaming for application state, and incremental-forever backups for object storage. This approach captures changes as they occur rather than waiting for scheduled backup windows.
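To make the difference concrete, here is a minimal sketch that compares the worst-case data loss window of a scheduled backup with that of a continuous capture pipeline. The `ProtectionStrategy` type, the function names, and the sample numbers are assumptions chosen for illustration, not measurements.

```typescript
// Hypothetical model: how much data can be lost if the primary fails
// just before the next capture completes.
type ProtectionStrategy =
  | { kind: 'scheduled'; intervalSeconds: number }
  | { kind: 'continuous'; flushIntervalSeconds: number; replicationLagSeconds: number };

// Scheduled backups can lose up to one full interval; continuous capture
// can lose what is still buffered plus whatever has not been replicated yet.
function worstCaseDataLossSeconds(s: ProtectionStrategy): number {
  return s.kind === 'scheduled'
    ? s.intervalSeconds
    : s.flushIntervalSeconds + s.replicationLagSeconds;
}

function meetsRpo(s: ProtectionStrategy, rpoSeconds: number): boolean {
  return worstCaseDataLossSeconds(s) <= rpoSeconds;
}

// Illustrative values only
const nightly: ProtectionStrategy = { kind: 'scheduled', intervalSeconds: 24 * 60 * 60 };
const cdc: ProtectionStrategy = {
  kind: 'continuous',
  flushIntervalSeconds: 5,
  replicationLagSeconds: 2
};

const nightlyMeets15Min = meetsRpo(nightly, 15 * 60); // false
const cdcMeets15Min = meetsRpo(cdc, 15 * 60);         // true
```

Under these assumed numbers a nightly backup cannot satisfy a 15-minute RPO, while a pipeline that flushes every few seconds can.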

For PostgreSQL databases, logical replication provides near-zero RPO:

import { Client } from 'pg';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

interface ReplicationConfig {
  sourceConnection: string;
  targetBucket: string;
  slotName: string;
  publicationName: string;
}

class ContinuousBackupManager {
  private sourceClient: Client;
  private s3Client: S3Client;
  private config: ReplicationConfig;
  private lastLSN: string | null = null;

  constructor(config: ReplicationConfig) {
    this.config = config;
    this.sourceClient = new Client({ connectionString: config.sourceConnection });
    this.s3Client = new S3Client({ region: 'us-east-1' });
  }

  async initialize(): Promise<void> {
    await this.sourceClient.connect();

    // Create replication slot if not exists
    await this.sourceClient.query(`
      SELECT * FROM pg_create_logical_replication_slot(
        '${this.config.slotName}', 
        'pgoutput'
      ) WHERE NOT EXISTS (
        SELECT 1 FROM pg_replication_slots 
        WHERE slot_name = '${this.config.slotName}'
      );
    `);

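    // Create publication for all tables; ignore only "already exists" errors
    await this.sourceClient.query(`
      CREATE PUBLICATION ${this.config.publicationName} 
      FOR ALL TABLES;
    `).catch((err: { code?: string }) => {
      if (err.code !== '42710') throw err; // 42710 = duplicate_object
    });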
  }

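  async streamChanges(): Promise<void> {
    // NOTE: START_REPLICATION is only accepted on a connection opened in
    // replication mode (for example replication=database), and node-postgres'
    // Client.query() does not expose the resulting stream as an async iterable.
    // Treat this call as pseudocode for "consume the slot"; a dedicated
    // logical replication client is needed in practice.
    const stream = await this.sourceClient.query(`
      START_REPLICATION SLOT ${this.config.slotName} 
      LOGICAL ${this.lastLSN || '0/0'}
      (proto_version '1', publication_names '${this.config.publicationName}')
    `);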

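    let changeBuffer: any[] = [];
    const flushInterval = 5000; // 5 seconds
    const maxBufferSize = 10000;

    const flushChanges = async () => {
      if (changeBuffer.length === 0) return;

      // Swap the buffer before uploading so changes that arrive while the
      // request is in flight are not discarded when the buffer is reset
      const batch = changeBuffer;
      changeBuffer = [];

      const timestamp = new Date().toISOString();
      // LSNs contain '/', which would add an unintended prefix level to the key
      const lsnLabel = (this.lastLSN ?? 'unknown').replace('/', '-');
      const key = `wal-changes/${timestamp}-${lsnLabel}.json`;

      try {
        await this.s3Client.send(new PutObjectCommand({
          Bucket: this.config.targetBucket,
          Key: key,
          Body: JSON.stringify(batch),
          ServerSideEncryption: 'AES256',
          ObjectLockMode: 'GOVERNANCE',
          ObjectLockRetainUntilDate: new Date(Date.now() + 90 * 24 * 60 * 60 * 1000)
        }));
      } catch (err) {
        // Put the batch back so a failed upload does not lose changes
        changeBuffer = batch.concat(changeBuffer);
        throw err;
      }
    };

    // Periodic flush; report failures instead of leaving the rejection unhandled
    const timer = setInterval(() => {
      flushChanges().catch((err) => console.error('Periodic flush failed:', err));
    }, flushInterval);

    try {
      for await (const message of stream) {
        const change = this.parseWALMessage(message);
        changeBuffer.push(change);
        this.lastLSN = change.lsn;

        if (changeBuffer.length >= maxBufferSize) {
          await flushChanges();
        }
      }
    } finally {
      clearInterval(timer);
      await flushChanges(); // upload whatever is still buffered
    }
  }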

  private parseWALMessage(message: any): any {
    // Parse logical replication protocol messages
    return {
      lsn: message.lsn,
      timestamp: message.timestamp,
      operation: message.operation,
      table: message.table,
      data: message.data
    };
  }
}
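When these batches are replayed during a restore, they need to be applied in WAL order. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash, so comparing the strings directly gives the wrong order (for example `'9/0'` sorts after `'10/0'`). The sketch below parses LSNs into a single 64-bit value before sorting; the `WalBatch` shape and the function names are hypothetical and not part of the class above.

```typescript
// Convert a PostgreSQL LSN such as '16/B374D848' into one 64-bit value:
// the part before the slash is the high 32 bits, the part after is the low 32 bits.
function parseLsn(lsn: string): bigint {
  const match = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(lsn);
  if (!match) throw new Error(`Invalid LSN: ${lsn}`);
  const high = BigInt(`0x${match[1]}`);
  const low = BigInt(`0x${match[2]}`);
  return (high << BigInt(32)) | low;
}

// Numeric comparison suitable for Array.prototype.sort
function compareLsn(a: string, b: string): number {
  const x = parseLsn(a);
  const y = parseLsn(b);
  return x < y ? -1 : x > y ? 1 : 0;
}

// Hypothetical descriptor for one uploaded batch object
interface WalBatch {
  key: string;
  lastLsn: string;
}

// Return a copy of the batches in the order they should be replayed
function orderForReplay(batches: WalBatch[]): WalBatch[] {
  return [...batches].sort((a, b) => compareLsn(a.lastLsn, b.lastLsn));
}

const ordered = orderForReplay([
  { key: 'batch-b.json', lastLsn: '10/0' },
  { key: 'batch-a.json', lastLsn: '9/FFFFFFFF' }
]);
// ordered[0].key is 'batch-a.json'
```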

Immutable Storage and Air-Gapped Isolation

Immutability prevents attackers from modifying or deleting backups. Use S3 Object Lock, Azure Immutable Blob Storage, or Google Cloud Storage retention policies to enforce write-once-read-many (WORM) semantics. Combine this with separate AWS accounts or Azure subscriptions for backup storage, using cross-account IAM roles with time-limited access.

import { STSClient, AssumeRoleCommand } from '@aws-sdk/client-sts';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

interface ImmutableBackupConfig {
  backupAccountRole: string;
  backupBucket: string;
  retentionDays: number;
  legalHoldEnabled: boolean;
}

class ImmutableBackupWriter {
  private stsClient: STSClient;
  private config: ImmutableBackupConfig;

  constructor(config: ImmutableBackupConfig) {
    this.config = config;
    this.stsClient = new STSClient({});
  }

  async writeBackup(key: string, data: Buffer): Promise<void> {
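    // Assume role in backup account with time-limited credentials.
    // The inline session policy narrows the role's permissions further, so it
    // must still include every action the upload below needs.
    const assumeRoleResponse = await this.stsClient.send(
      new AssumeRoleCommand({
        RoleArn: this.config.backupAccountRole,
        RoleSessionName: `backup-writer-${Date.now()}`,
        DurationSeconds: 3600, // credentials expire after 1 hour
        Policy: JSON.stringify({
          Version: '2012-10-17',
          Statement: [
            {
              Effect: 'Allow',
              // Retention and legal-hold settings on upload need their own permissions
              Action: ['s3:PutObject', 's3:PutObjectRetention', 's3:PutObjectLegalHold'],
              Resource: [`arn:aws:s3:::${this.config.backupBucket}/*`]
            },
            {
              Effect: 'Allow',
              // SSE-KMS uploads need a data key; scope Resource to the key ARN where possible
              Action: ['kms:GenerateDataKey'],
              Resource: ['*']
            }
          ]
        })
      })
    );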

    const backupS3Client = new S3Client({
      credentials: {
        accessKeyId: assumeRoleResponse.Credentials!.AccessKeyId!,
        secretAccessKey: assumeRoleResponse.Credentials!.SecretAccessKey!,
        sessionToken: assumeRoleResponse.Credentials!.SessionToken!
      }
    });

    const retentionDate = new Date();
    retentionDate.setDate(retentionDate.getDate() + this.config.retentionDays);

    await backupS3Client.send(new PutObjectCommand({
      Bucket: this.config.backupBucket,
      Key: key,
      Body: data,
      ServerSideEncryption: 'aws:kms',
      SSEKMSKeyId: 'alias/backup-encryption-key',
      ObjectLockMode: 'COMPLIANCE',
      ObjectLockRetainUntilDate: retentionDate,
      ObjectLockLegalHoldStatus: this.config.legalHoldEnabled ? 'ON' : 'OFF',
      Metadata: {
        'backup-timestamp': new Date().toISOString(),
        'source-environment': process.env.ENVIRONMENT || 'unknown',
        'compliance-classification': 'regulated'
      }
    }));
  }
}
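The two listings so far use different lock modes (GOVERNANCE for WAL batches, COMPLIANCE for regulated backups), and the difference matters during an incident. The following sketch is a simplified local model of S3 Object Lock deletion rules, useful for reasoning and for unit-testing retention tooling. It does not call AWS, and the `LockState` type and `canDeleteVersion` function are hypothetical names.

```typescript
type LockMode = 'GOVERNANCE' | 'COMPLIANCE';

// Hypothetical snapshot of the lock settings on one object version
interface LockState {
  mode?: LockMode;
  retainUntil?: Date;
  legalHold: boolean;
}

// Simplified model of whether a protected object version can be deleted:
// - a legal hold blocks deletion regardless of retention dates
// - an expired or absent retention period allows deletion
// - COMPLIANCE cannot be bypassed by any caller before expiry
// - GOVERNANCE can be bypassed by callers holding s3:BypassGovernanceRetention
function canDeleteVersion(
  lock: LockState,
  now: Date,
  callerCanBypassGovernance: boolean
): boolean {
  if (lock.legalHold) return false;
  if (!lock.mode || !lock.retainUntil) return true;
  if (now.getTime() >= lock.retainUntil.getTime()) return true;
  return lock.mode === 'GOVERNANCE' && callerCanBypassGovernance;
}
```

The practical consequence is that GOVERNANCE mode only protects backups if the bypass permission is tightly controlled, whereas COMPLIANCE mode trades flexibility for a guarantee that holds even against a compromised administrator.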

Automated Verification and Recovery Testing

Backups are worthless if they cannot be restored. Implement automated verification that periodically restores backups to isolated environments and validates data integrity. This catches corruption, incomplete backups, and configuration drift before disasters occur.

import { ECSClient, RunTaskCommand } from '@aws-sdk/client-ecs';
import { RDSClient, RestoreDBInstanceFromDBSnapshotCommand } from '@aws-sdk/client-rds';

interface VerificationJob {
  backupId: string;
  backupType: 'database' | 'object-storage' | 'filesystem';
  verificationQueries: string[];
  expectedRowCounts: Record<string, number>;
}

class BackupVerificationOrchestrator {
  private ecsClient: ECSClient;
  private rdsClient: RDSClient;

  constructor() {
    this.ecsClient = new ECSClient({});
    this.rdsClient = new RDSClient({});
  }

  async verifyDatabaseBackup(job: VerificationJob): Promise<boolean> {
    const testInstanceId = `verify-${job.backupId}-${Date.now()}`;

    try {
      // Restore snapshot to isolated test instance
      await this.rdsClient.send(new RestoreDBInstanceFromDBSnapshotCommand({
        DBInstanceIdentifier: testInstanceId,
        DBSnapshotIdentifier: job.backupId,
        DBInstanceClass: 'db.t3.medium',
        PubliclyAccessible: false,
        VpcSecurityGroupIds: ['sg-verification-isolated'],
        Tags: [
          { Key: 'Purpose', Value: 'BackupVerification' },
          { Key: 'AutoDelete', Value: 'true' },
          { Key: 'MaxLifetime', Value: '2h' }
        ]
      }));

      // Wait for instance to be available
      await this.waitForInstanceAvailable(testInstanceId);

      // Run verification queries via ECS task
      const verificationResult = await this.runVerificationQueries(
        testInstanceId,
        job.verificationQueries,
        job.expectedRowCounts
      );

      return verificationResult.success;
    } finally {
      // Always cleanup test instance
      await this.cleanupTestInstance(testInstanceId);
    }
  }

  private async runVerificationQueries(
    instanceId: string,
    queries: string[],
    expectedCounts: Record<string, number>
  ): Promise<{ success: boolean; details: any }> {
    const taskResponse = await this.ecsClient.send(new RunTaskCommand({
      cluster: 'backup-verification-cluster',
      taskDefinition: 'database-verification-task',
      launchType: 'FARGATE',
      networkConfiguration: {
        awsvpcConfiguration: {
          subnets: ['subnet-verification-private'],
          securityGroups: ['sg-verification-task'],
          assignPublicIp: 'DISABLED'
        }
      },
      overrides: {
        containerOverrides: [{
          name: 'verification-container',
          environment: [
            { name: 'DB_INSTANCE', value: instanceId },
            { name: 'VERIFICATION_QUERIES', value: JSON.stringify(queries) },
            { name: 'EXPECTED_COUNTS', value: JSON.stringify(expectedCounts) }
          ]
        }]
      }
    }));

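    // A task that failed to start is a verification failure, not a success
    if (!taskResponse.tasks?.length) {
      return { success: false, details: { failures: taskResponse.failures ?? [] } };
    }

    // Simplified: a production version must poll DescribeTasks until the task
    // stops and derive success from the container exit code and logged results
    return { success: true, details: { taskArn: taskResponse.tasks[0].taskArn } };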
  }

  private async waitForInstanceAvailable(instanceId: string): Promise<void> {
    // Implementation of polling logic
  }

  private async cleanupTestInstance(instanceId: string): Promise<void> {
    // Implementation of cleanup logic
  }
}
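The orchestrator passes `expectedRowCounts` to the verification container but leaves the comparison itself out of view. Here is a minimal sketch of what that check might look like inside the container, assuming the actual counts have already been queried from the restored instance. The `CountMismatch` type, the `compareRowCounts` name, and the fractional `tolerance` parameter are illustrative assumptions.

```typescript
interface CountMismatch {
  table: string;
  expected: number;
  actual: number | null; // null when the restored database has no such table
}

// Compare expected row counts with counts read from the restored database.
// `tolerance` is a fraction of the expected count (0.02 = 2%) that absorbs
// writes made between recording the expectation and taking the backup.
function compareRowCounts(
  expected: Record<string, number>,
  actual: Record<string, number | undefined>,
  tolerance = 0
): CountMismatch[] {
  const mismatches: CountMismatch[] = [];
  for (const [table, want] of Object.entries(expected)) {
    const got = actual[table];
    if (got === undefined) {
      mismatches.push({ table, expected: want, actual: null });
      continue;
    }
    if (Math.abs(got - want) > want * tolerance) {
      mismatches.push({ table, expected: want, actual: got });
    }
  }
  return mismatches;
}

// Illustrative data: 'users' is within 2%, 'orders' is not, 'invoices' is missing
const mismatches = compareRowCounts(
  { users: 1000, orders: 5000, invoices: 10 },
  { users: 990, orders: 4000 },
  0.02
);
```

An empty result means the backup passed this check. Row counts are a coarse signal, so they work best alongside checksums or application-level queries.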

Compliance-Aware Retention and Data Residency

Different data types require different retention periods based on regulatory requirements. Financial records may require seven-year retention, while GDPR gives individuals a right to have their personal data erased on request, subject to exceptions such as legal retention obligations. Implement policy-driven retention that automatically applies appropriate rules based on data classification.
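A policy-driven approach can be sketched as a lookup from classification to rule, plus one function that decides whether a backup may be deleted. Only the seven-year figure for financial records comes from the text above; the other periods, the classification names, and the `BackupRecord` shape are hypothetical placeholders that a real deployment would replace after legal review.

```typescript
type Classification = 'financial' | 'personal' | 'operational';

interface RetentionRule {
  retainDays: number;
  erasableOnRequest: boolean; // may an erasure request shorten retention?
}

// Hypothetical rule table; real values come from compliance review
const RETENTION_RULES: Record<Classification, RetentionRule> = {
  financial: { retainDays: 7 * 365, erasableOnRequest: false },
  personal: { retainDays: 365, erasableOnRequest: true },
  operational: { retainDays: 90, erasableOnRequest: false }
};

interface BackupRecord {
  classification: Classification;
  createdAt: Date;
  legalHold: boolean;
  erasureRequested: boolean;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Decide whether an automated workflow may delete this backup now
function shouldDelete(record: BackupRecord, now: Date): boolean {
  // A legal hold always wins, even over an erasure request
  if (record.legalHold) return false;

  const rule = RETENTION_RULES[record.classification];
  if (record.erasureRequested && rule.erasableOnRequest) return true;

  const ageDays = (now.getTime() - record.createdAt.getTime()) / MS_PER_DAY;
  return ageDays >= rule.retainDays;
}
```

Keeping the rules in one table makes them easy to audit and to version alongside the rest of the backup configuration.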

Use tagging and metadata to track data lineage and classification. Implement automated deletion workflows that respect legal holds and audit requirements. For multi-region deployments, ensure backups remain in compliant regions using bucket policies and replication rules.
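Residency rules can also be checked before a replication rule is ever deployed. The sketch below validates a list of replication target regions against an allow-list keyed by residency class. The class names and the region lists are assumptions for illustration; the check would typically run in CI against the infrastructure configuration.

```typescript
// Hypothetical allow-list: which regions each residency class may use
const ALLOWED_REGIONS: Record<string, readonly string[]> = {
  'eu-only': ['eu-west-1', 'eu-central-1'],
  unrestricted: ['us-east-1', 'eu-west-1', 'ap-southeast-1']
};

// Return the replication targets that violate the residency class
function residencyViolations(residency: string, replicationTargets: string[]): string[] {
  const allowed = ALLOWED_REGIONS[residency];
  if (!allowed) {
    // Fail closed: an unknown class should block deployment, not pass silently
    throw new Error(`Unknown residency class: ${residency}`);
  }
  return replicationTargets.filter((region) => !allowed.includes(region));
}

const violations = residencyViolations('eu-only', ['eu-central-1', 'us-east-1']);
// violations contains only 'us-east-1'
```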

Infrastructure-as-Code Deployment

Deploy disaster recovery infrastructure using Terraform or Pulumi to ensure consistency and enable rapid recovery in new regions. Version control all backup configurations alongside application code.

import * as pulumi from '@pulumi/pulumi';
import * as aws from '@pulumi/aws';

class DisasterRecoveryStack {
  private config: pulumi.Config;

  constructor() {
    this.config = new pulumi.Config();
  }

  deploy(): void {
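    // Create a dedicated backup account. Note: the resources below still use the
    // default provider; pass an explicit aws.Provider to place them in this account.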
    const backupAccount = new aws.organizations.Account('backup-account', {
      name: 'disaster-recovery-backups',
      email: 'backups@company.com',
      iamUserAccessToBilling: 'DENY'
    });

    // Create immutable backup bucket with lifecycle policies
    const backupBucket = new aws.s3.Bucket('immutable-backups', {
      bucket: 'company-dr-backups',
      versioning: { enabled: true },
      objectLockConfiguration: {
        objectLockEnabled: 'Enabled',
        rule: {
          defaultRetention: {
            mode: 'COMPLIANCE',
            days: 90
          }
        }
      },
      lifecycleRules: [
        {
          enabled: true,
          transitions: [
            { days: 30, storageClass: 'STANDARD_IA' },
            { days: 90, storageClass: 'GLACIER_IR' },
            { days: 365, storageClass: 'DEEP_ARCHIVE' }
          ]
        }
      ],
      serverSideEncryptionConfiguration: {
        rule: {
          applyServerSideEncryptionByDefault: {
            sseAlgorithm: 'aws:kms',
            kmsMasterKeyId: 'alias/backup-encryption'
          }
        }
      }
    });

    // Create backup verification infrastructure
    const verificationCluster = new aws.ecs.Cluster('verification-cluster', {
      name: 'backup-verification',
      settings: [{
        name: 'containerInsights',
        value: 'enabled'
      }]
    });

    // Create EventBridge rules for automated verification
    const verificationSchedule = new aws.cloudwatch.EventRule('daily-verification', {
      scheduleExpression: 'cron(0 2 * * ? *)', // 2 AM daily
      description: 'Trigger daily backup verification'
    });

    new aws.cloudwatch.EventTarget('verification-target', {
      rule: verificationSchedule.name,
      arn: verificationCluster.arn,
      roleArn: this.createEventBridgeRole().arn,
      ecsTarget: {
        taskDefinitionArn: this.createVerificationTaskDefinition().arn,
        launchType: 'FARGATE',
        networkConfiguration: {
          subnets: this.config.requireObject<string[]>('privateSubnets'),
          securityGroups: [this.createVerificationSecurityGroup().id],
          assignPublicIp: false
        }
      }
    });
  }

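  private createEventBridgeRole(): aws.iam.Role {
    const role = new aws.iam.Role('eventbridge-ecs-role', {
      assumeRolePolicy: JSON.stringify({
        Version: '2012-10-17',
        Statement: [{
          Effect: 'Allow',
          Principal: { Service: 'events.amazonaws.com' },
          Action: 'sts:AssumeRole'
        }]
      })
    });

    // EventBridge needs permission to start the task and pass its IAM roles;
    // narrow Resource to specific ARNs in real deployments
    new aws.iam.RolePolicy('eventbridge-ecs-run-task', {
      role: role.id,
      policy: JSON.stringify({
        Version: '2012-10-17',
        Statement: [{ Effect: 'Allow', Action: ['ecs:RunTask', 'iam:PassRole'], Resource: '*' }]
      })
    });

    return role;
  }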

  private createVerificationTaskDefinition(): aws.ecs.TaskDefinition {
    return new aws.ecs.TaskDefinition('verification-task', {
      family: 'backup-verification',
      cpu: '1024',
      memory: '2048',
      networkMode: 'awsvpc',
      requiresCompatibilities: ['FARGATE'],
      containerDefinitions: JSON.stringify([{
        name: 'verification-container',
        image: 'company/backup-verification:latest',
        essential: true,
        logConfiguration: {
          logDriver: 'awslogs',
          options: {
            'awslogs-group': '/ecs/backup-verification',
            'awslogs-region': 'us-east-1',
            'awslogs-stream-prefix': 'verification'
          }
        }
      }])
    });
  }

  private createVerificationSecurityGroup(): aws.ec2.SecurityGroup {
    return new aws.ec2.SecurityGroup('verification-sg', {
      description: 'Security group for backup verification tasks',
      vpcId: this.config.require('vpcId'),
      egress: [{
        protocol: '-1',
        fromPort: 0,
        toPort: 0,
        cidrBlocks: ['0.0.0.0/0']
      }]
    });
  }
}

Common Pitfalls and Failure Modes

Insufficient Testing of Recovery Procedures

Many organizations discover their backups are incomplete or corrupted only during actual disasters. Schedule quarterly disaster recovery drills that simulate complete data center failures. Measure the actual recovery time in each drill against the Recovery Time Objective (RTO) the business requires.
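Recording drill outcomes in a structured form makes that comparison repeatable. A minimal sketch, assuming each drill logs when the failure was declared and when service was restored; the `DrillResult` fields and the sample timestamps are hypothetical.

```typescript
interface DrillResult {
  system: string;
  rtoMinutes: number;        // target agreed with the business
  failureDeclaredAt: Date;   // when the simulated outage started
  serviceRestoredAt: Date;   // when the system served traffic again
}

interface DrillEvaluation {
  system: string;
  actualMinutes: number;
  withinRto: boolean;
}

// Compare the measured recovery time of one drill with its RTO
function evaluateDrill(result: DrillResult): DrillEvaluation {
  const elapsedMs = result.serviceRestoredAt.getTime() - result.failureDeclaredAt.getTime();
  const actualMinutes = elapsedMs / 60000;
  return {
    system: result.system,
    actualMinutes,
    withinRto: actualMinutes <= result.rtoMinutes
  };
}

// Illustrative drill: 90 minutes to recover against a 60-minute target
const evaluation = evaluateDrill({
  system: 'orders-db',
  rtoMinutes: 60,
  failureDeclaredAt: new Date('2025-03-01T10:00:00Z'),
  serviceRestoredAt: new Date('2025-03-01T11:30:00Z')
});
```

Tracking `actualMinutes` across quarters shows whether recovery is getting faster or drifting away from the target.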

Inadequate Backup Isolation

Storing backups in the same AWS account or Azure subscription as production systems provides insufficient protection against credential compromise. Use separate accounts with cross-account roles that require MFA and time-limited access.
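For roles that human operators assume (for example a restore role), the MFA requirement can be expressed in the role's trust policy. The sketch below builds such a policy document using the `aws:MultiFactorAuthPresent` and `aws:MultiFactorAuthAge` condition keys. The function name, the principal ARN, and the 15-minute age limit are illustrative assumptions; automated backup writers cannot present MFA and would need a separate, narrowly scoped role.

```typescript
interface TrustPolicy {
  Version: '2012-10-17';
  Statement: Array<{
    Effect: 'Allow';
    Principal: { AWS: string };
    Action: 'sts:AssumeRole';
    Condition: {
      Bool: { 'aws:MultiFactorAuthPresent': 'true' };
      NumericLessThan: { 'aws:MultiFactorAuthAge': string };
    };
  }>;
}

// Build a trust policy that lets `principalArn` assume the role only when
// the caller authenticated with MFA within the last `maxMfaAgeSeconds`.
function buildOperatorTrustPolicy(principalArn: string, maxMfaAgeSeconds: number): TrustPolicy {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Principal: { AWS: principalArn },
        Action: 'sts:AssumeRole',
        Condition: {
          Bool: { 'aws:MultiFactorAuthPresent': 'true' },
          NumericLessThan: { 'aws:MultiFactorAuthAge': String(maxMfaAgeSeconds) }
        }
      }
    ]
  };
}

// Placeholder account ID and role name, for illustration only
const trustPolicy = buildOperatorTrustPolicy(
  'arn:aws:iam::111122223333:role/backup-operators',
  15 * 60
);
const trustPolicyJson = JSON.stringify(trustPolicy, null, 2);
```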

Ignoring Application-Level Consistency

File-level or block-level backups of databases often capture inconsistent state. Use application-aware backup methods that ensure transactional consistency—database snapshots with flushed transactions, coordinated snapshots across distributed systems, or quiesced application state.

Underestimating Restore Time at Scale

Restoring terabytes of data across regions takes hours or days, regardless of backup frequency. Implement tiered recovery strategies: critical data with sub-hour RTO, important data with 4-hour RTO, archival data with 24-hour RTO. Use read replicas and standby instances for critical systems.
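A quick back-of-the-envelope calculation shows why. The sketch below estimates transfer time from data size and sustained throughput and checks it against the three tiers named above. The function names and the 500 MiB/s throughput are assumptions for illustration, and the estimate ignores restore steps beyond raw transfer, such as index rebuilds or log replay.

```typescript
type RecoveryTier = 'critical' | 'important' | 'archival';

// RTO targets in hours, taken from the tiers described above
const TIER_RTO_HOURS: Record<RecoveryTier, number> = {
  critical: 1,
  important: 4,
  archival: 24
};

// Time to move `sizeTiB` at a sustained `throughputMiBps`, in hours
function estimateRestoreHours(sizeTiB: number, throughputMiBps: number): number {
  if (throughputMiBps <= 0) throw new Error('throughput must be positive');
  const totalMiB = sizeTiB * 1024 * 1024;
  return totalMiB / throughputMiBps / 3600;
}

function meetsTier(tier: RecoveryTier, sizeTiB: number, throughputMiBps: number): boolean {
  return estimateRestoreHours(sizeTiB, throughputMiBps) <= TIER_RTO_HOURS[tier];
}

// 10 TiB at an assumed 500 MiB/s takes roughly 5.8 hours of transfer alone
const hours = estimateRestoreHours(10, 500);
const fitsImportant = meetsTier('important', 10, 500); // false
const fitsArchival = meetsTier('archival', 10, 500);   // true
```

In this example a restore from backup cannot meet the 4-hour tier no matter how recent the backup is, which is why critical systems rely on replicas or standby instances instead.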

Neglecting Backup Encryption Key Management