Chaos Engineering: Building Resilient Systems

Controlled failure injection with Litmus and AWS Fault Injection Simulator

Distributed systems fail. Networks partition, services crash, databases become unavailable, and latency spikes occur without warning. The question isn't whether your system will experience failures. It's whether you'll discover those failure modes during a production incident or through controlled experimentation.

Traditional testing validates that systems work under expected conditions. Chaos engineering inverts this approach: it deliberately introduces failures to verify that systems behave correctly under adverse conditions. This practice has evolved from Netflix's pioneering work with Chaos Monkey into a mature discipline with standardized tools and methodologies.

The Problem with Traditional Reliability Testing

Most organizations discover their system's weaknesses during outages. A database failover that was tested once during setup fails in production because connection pools weren't configured correctly. A retry mechanism that looked reasonable in code creates cascading failures under load. Circuit breakers that should protect downstream services never trigger because timeout values were misconfigured.

These failures share a common characteristic: they emerge from the interaction between components under stress, not from bugs in individual services. Load testing validates performance under expected traffic patterns. Integration testing verifies that APIs work correctly. Neither approach reveals how your system behaves when a dependency becomes unavailable, when network latency increases by 500ms, or when a Kubernetes node terminates unexpectedly.

The gap between testing and reality grows as systems become more distributed. Microservices architectures introduce dozens of failure modes that simply don't exist in monolithic applications. Each network call represents a potential failure point. Each service dependency creates opportunities for cascading failures, retry storms, and resource exhaustion.
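To make the retry-storm risk concrete, here is a rough worked calculation. It assumes every caller in a chain retries failed calls independently and that the deepest dependency is failing outright. The function name and the example figures are illustrative, not taken from any particular system.

```typescript
// Worst-case number of calls that reach the deepest dependency for ONE
// user request. `hops` is the number of callers in the chain, and each
// caller makes up to `attemptsPerCaller` attempts (1 original + retries).
// Assumes every attempt fails, so every caller uses all of its attempts.
function worstCaseAmplification(hops: number, attemptsPerCaller: number): number {
  if (hops < 0 || attemptsPerCaller < 1) {
    throw new RangeError('hops must be >= 0 and attemptsPerCaller >= 1');
  }
  return Math.pow(attemptsPerCaller, hops);
}

// client -> gateway -> orders -> inventory is 3 hops.
// With up to 3 attempts per caller, inventory can see 3 * 3 * 3 calls.
console.log(worstCaseAmplification(3, 3)); // 27
```

Under these assumptions a single failing dependency receives 27 times its normal load exactly when it is least able to handle it, which is the kind of interaction a fault injection experiment can surface before an outage does.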

Understanding Chaos Engineering Practices

Chaos engineering applies the scientific method to distributed systems. You form a hypothesis about how your system should behave during a failure scenario, design an experiment to test that hypothesis, execute the experiment in a controlled manner, and analyze the results.
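The loop described above can be sketched as a small runner. This is a minimal illustration, not a framework: the `Experiment` interface, the `Outcome` values, and `runExperiment` are hypothetical names, and a real run would observe steady state over a time window rather than with a single check.

```typescript
// All names here are illustrative; plug in your own probes and fault injectors.
interface Experiment {
  hypothesis: string;
  // Resolves to true while the system is within its steady-state bounds.
  steadyState: () => Promise<boolean>;
  // Starts the fault (for example, submits a chaos resource).
  inject: () => Promise<void>;
  // Removes the fault; should be safe to call even if inject() failed midway.
  rollback: () => Promise<void>;
}

type Outcome = 'aborted' | 'hypothesis-held' | 'hypothesis-refuted';

async function runExperiment(exp: Experiment): Promise<Outcome> {
  // Never inject a fault into a system that is already unhealthy.
  if (!(await exp.steadyState())) return 'aborted';

  let heldDuringFault = false;
  try {
    await exp.inject();
    heldDuringFault = await exp.steadyState();
  } finally {
    // Always clean up, whether the check passed, failed, or threw.
    await exp.rollback();
  }
  return heldDuringFault ? 'hypothesis-held' : 'hypothesis-refuted';
}
```

Keeping the rollback in a `finally` block is the main design choice here: the fault is removed even when the steady-state probe itself throws.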

The practice follows several core principles:

Build a hypothesis around steady state behavior. Define what "normal" looks like using business metrics, not technical metrics. For an e-commerce system, steady state might be "99% of checkout requests complete within 2 seconds with a success rate above 99.9%."

Vary real-world events. Inject failures that actually occur in production: service crashes, network delays, resource exhaustion, DNS failures, clock skew, and disk space issues.

Run experiments in production. Staging environments don't replicate production traffic patterns, data volumes, or infrastructure configurations. The most valuable insights come from production experiments with appropriate safeguards.

Automate experiments to run continuously. Manual chaos testing provides point-in-time validation. Automated experiments catch regressions when code changes, dependencies update, or infrastructure evolves.

Minimize blast radius. Start with small experiments affecting limited traffic. Expand scope as confidence grows.
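As an illustration of the first principle, the checkout example can be turned into an executable check. This is a sketch under assumptions: the `RequestSample` and `SteadyState` shapes are hypothetical, and the samples would come from whatever metrics or tracing system you already run.

```typescript
interface RequestSample {
  latencyMs: number;
  ok: boolean;
}

interface SteadyState {
  percentile: number;     // e.g. 99
  maxLatencyMs: number;   // e.g. 2000
  minSuccessRate: number; // e.g. 0.999
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function withinSteadyState(samples: RequestSample[], target: SteadyState): boolean {
  if (samples.length === 0) return false; // no data is not evidence of health
  const successRate = samples.filter(s => s.ok).length / samples.length;
  const latency = percentile(samples.map(s => s.latencyMs), target.percentile);
  return latency <= target.maxLatencyMs && successRate >= target.minSuccessRate;
}

// "99% of checkout requests complete within 2 seconds
//  with a success rate above 99.9%"
const checkoutSteadyState: SteadyState = {
  percentile: 99,
  maxLatencyMs: 2000,
  minSuccessRate: 0.999
};
```

Evaluating the same function before, during, and after a fault gives each experiment a single pass/fail signal tied to what customers experience.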

Implementing Chaos Experiments with Litmus

Litmus is a Kubernetes-native chaos engineering framework. It defines experiments as custom resources, so they can be stored in version control and reproduced like any other manifest.

Setting Up Litmus

// litmus-setup.ts
import * as k8s from '@kubernetes/client-node';
import * as yaml from 'js-yaml';
import * as fs from 'fs';

interface LitmusConfig {
  namespace: string;
  serviceAccount: string;
  experiments: string[];
}

class LitmusManager {
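  private k8sApi: k8s.CustomObjectsApi;
  private coreApi: k8s.CoreV1Api;
  private objectApi: k8s.KubernetesObjectApi;

  constructor() {
    const kc = new k8s.KubeConfig();
    kc.loadFromDefault();
    this.k8sApi = kc.makeApiClient(k8s.CustomObjectsApi);
    this.coreApi = kc.makeApiClient(k8s.CoreV1Api);
    // Generic client that can create any kind (CRDs, RBAC, Deployments)
    this.objectApi = k8s.KubernetesObjectApi.makeApiClient(kc);
  }

  async installLitmus(config: LitmusConfig): Promise<void> {
    // Create the namespace; a 409 means it already exists
    try {
      await this.coreApi.createNamespace({
        metadata: { name: config.namespace }
      });
    } catch (error: any) {
      if (error.statusCode !== 409) throw error;
    }

    // Install CRDs and operator. Operator manifests are typically
    // multi-document YAML (CRDs, RBAC, Deployment), so load every document
    // and create each one by its own kind. Each document is expected to
    // carry its own namespace where one applies.
    const manifests = yaml.loadAll(
      fs.readFileSync('./manifests/litmus-operator.yaml', 'utf8')
    ) as k8s.KubernetesObject[];

    for (const manifest of manifests) {
      if (!manifest?.kind) continue; // skip empty documents
      await this.objectApi.create(manifest);
    }
  }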

  async createPodDeleteExperiment(
    targetNamespace: string,
    targetLabel: string,
    duration: number
  ): Promise<string> {
    const experiment = {
      apiVersion: 'litmuschaos.io/v1alpha1',
      kind: 'ChaosEngine',
      metadata: {
        name: 'pod-delete-chaos',
        namespace: targetNamespace
      },
      spec: {
        appinfo: {
          appns: targetNamespace,
          applabel: targetLabel,
          appkind: 'deployment'
        },
        engineState: 'active',
        chaosServiceAccount: 'litmus-admin',
        experiments: [{
          name: 'pod-delete',
          spec: {
            components: {
              env: [
                { name: 'TOTAL_CHAOS_DURATION', value: duration.toString() },
                { name: 'CHAOS_INTERVAL', value: '10' },
                { name: 'FORCE', value: 'false' }
              ]
            }
          }
        }]
      }
    };

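    // Delegate to the shared helper so every experiment is submitted the same way
    return this.createExperiment(experiment);
  }

  // Submits a ChaosEngine manifest to the namespace named in its metadata
  // and returns the name of the created engine.
  async createExperiment(experiment: {
    metadata: { name: string; namespace: string };
  }): Promise<string> {
    const response = await this.k8sApi.createNamespacedCustomObject(
      'litmuschaos.io',
      'v1alpha1',
      experiment.metadata.namespace,
      'chaosengines',
      experiment
    );

    return (response.body as any).metadata.name;
  }
}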
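Creating a ChaosEngine only starts the experiment; the outcome is recorded separately. The sketch below interprets that outcome, assuming the result is stored in a ChaosResult resource named `<engine>-<experiment>` with the verdict under `status.experimentStatus.verdict`. That is my understanding of the v1alpha1 layout, so check both assumptions against the Litmus version you run. `ChaosResultLike`, `chaosResultName`, and `readVerdict` are hypothetical helper names.

```typescript
// Minimal shape of the fields read here; a real ChaosResult carries more.
interface ChaosResultLike {
  status?: {
    experimentStatus?: { phase?: string; verdict?: string };
  };
}

type Verdict = 'pass' | 'fail' | 'stopped' | 'pending' | 'unknown';

// Assumed naming convention: "<engine name>-<experiment name>".
function chaosResultName(engineName: string, experimentName: string): string {
  return `${engineName}-${experimentName}`;
}

function readVerdict(result: ChaosResultLike): Verdict {
  const verdict = result.status?.experimentStatus?.verdict;
  if (verdict === undefined || verdict === 'Awaited') return 'pending';
  switch (verdict) {
    case 'Pass': return 'pass';
    case 'Fail': return 'fail';
    case 'Stopped': return 'stopped';
    default: return 'unknown'; // surface unexpected values instead of guessing
  }
}

console.log(chaosResultName('pod-delete-chaos', 'pod-delete'));
// pod-delete-chaos-pod-delete
```

The object passed to `readVerdict` could be fetched with the `CustomObjectsApi` method `getNamespacedCustomObject`, using the `chaosresults` plural in the same group and version as the engine.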

Defining Network Latency Experiments

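// network-chaos.ts
// Assumes LitmusManager is exported from litmus-setup.ts and exposes a
// createExperiment() method that submits a ChaosEngine manifest.
import { LitmusManager } from './litmus-setup';

interface NetworkChaosSpec {
  targetPods: string; // label selector in key=value form
  latencyMs: number;  // added delay, in milliseconds
  jitterMs: number;   // random variation around the delay, in milliseconds
  duration: number;   // total chaos duration, in seconds
}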

class NetworkChaosExperiment {
  private litmus: LitmusManager;

  constructor(litmus: LitmusManager) {
    this.litmus = litmus;
  }

  async injectLatency(spec: NetworkChaosSpec): Promise<void> {
    const experiment = {
      apiVersion: 'litmuschaos.io/v1alpha1',
      kind: 'ChaosEngine',
      metadata: {
        name: 'network-latency-chaos',
        namespace: 'default'
      },
      spec: {
        appinfo: {
          appns: 'default',
          applabel: spec.targetPods,
          appkind: 'deployment'
        },
        engineState: 'active',
        chaosServiceAccount: 'litmus-admin',
        experiments: [{
          name: 'pod-network-latency',
          spec: {
            components: {
              env: [
                { name: 'NETWORK_LATENCY', value: spec.latencyMs.toString() },
                { name: 'JITTER', value: spec.jitterMs.toString() },
                { name: 'TOTAL_CHAOS_DURATION', value: spec.duration.toString() },
                { name: 'DESTINATION_IPS', value: '' }, // Empty means all traffic
                { name: 'CONTAINER_RUNTIME', value: 'containerd' }
              ]
            }
          }
        }]
      }
    };

    await this.litmus.createExperiment(experiment);
  }
}
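Because the experiment values are passed to Litmus as plain strings, a typo such as a negative latency only shows up once the experiment is already running. A small pre-submit check can catch that earlier. This is a sketch: `validateNetworkChaosSpec` is a hypothetical helper, and the 300-second cap is an arbitrary safety limit chosen for illustration.

```typescript
interface NetworkChaosSpec {
  targetPods: string; // label selector in key=value form
  latencyMs: number;
  jitterMs: number;
  duration: number;   // seconds
}

// Returns a list of problems; an empty list means the spec looks sane.
function validateNetworkChaosSpec(
  spec: NetworkChaosSpec,
  maxDurationSec: number = 300 // illustrative cap, tune for your system
): string[] {
  const problems: string[] = [];

  if (!/^[^=\s]+=[^=\s]+$/.test(spec.targetPods)) {
    problems.push('targetPods must be a single key=value label');
  }
  if (!Number.isInteger(spec.latencyMs) || spec.latencyMs <= 0) {
    problems.push('latencyMs must be a positive integer');
  }
  if (!Number.isInteger(spec.jitterMs) || spec.jitterMs < 0 || spec.jitterMs > spec.latencyMs) {
    problems.push('jitterMs must be an integer between 0 and latencyMs');
  }
  if (!Number.isInteger(spec.duration) || spec.duration <= 0 || spec.duration > maxDurationSec) {
    problems.push(`duration must be between 1 and ${maxDurationSec} seconds`);
  }
  return problems;
}
```

Calling this at the top of `injectLatency` and refusing to submit when the list is non-empty keeps malformed experiments out of the cluster.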

AWS Fault Injection Simulator for Cloud Infrastructure

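AWS FIS (since renamed AWS Fault Injection Service) is a managed service for running fault injection experiments against AWS resources. It integrates with CloudWatch for alarm-based stop conditions, EventBridge for experiment state events, and IAM for scoping what an experiment is allowed to touch.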

Creating an FIS Experiment Template

// fis-experiment.ts
import { 
  FISClient, 
  CreateExperimentTemplateCommand,
  StartExperimentCommand 
} from '@aws-sdk/client-fis';

interface FISExperimentConfig {
  name: string;
  description: string;
  roleArn: string;
  stopConditions: Array<{ source: string; value?: string }>;
}

class FISManager {
  private client: FISClient;

  constructor(region: string) {
    this.client = new FISClient({ region });
  }

  async createEC2TerminationExperiment(
    config: FISExperimentConfig,
    targetTags: Record<string, string>,
    instanceCount: number
  ): Promise<string> {
    const command = new CreateExperimentTemplateCommand({
      description: config.description,
      roleArn: config.roleArn,
      stopConditions: config.stopConditions.map(sc => ({
        source: sc.source,
        value: sc.value
      })),
      actions: {
        'terminate-instances': {
          actionId: 'aws:ec2:terminate-instances',
          // This action takes no parameters; how many instances are hit
          // is decided by the target's selectionMode below.
          targets: {
            Instances: 'target-instances'
          }
        }
      },
      targets: {
        'target-instances': {
          resourceType: 'aws:ec2:instance',
          // COUNT(n) selects n of the matching instances at random
          selectionMode: `COUNT(${instanceCount})`,
          resourceTags: targetTags
        }
      },
      tags: {
        Name: config.name,
        ManagedBy: 'chaos-engineering'
      }
    });

    const response = await this.client.send(command);
    return response.experimentTemplate!.id!;
  }

  async createRDSFailoverExperiment(
    config: FISExperimentConfig,
    // Full cluster ARN: arn:aws:rds:<region>:<account-id>:cluster:<name>.
    // Passing the ARN avoids hard-coding a region and account ID here.
    dbClusterArn: string
  ): Promise<string> {
    const command = new CreateExperimentTemplateCommand({
      description: config.description,
      roleArn: config.roleArn,
      stopConditions: config.stopConditions,
      actions: {
        'failover-db': {
          actionId: 'aws:rds:failover-db-cluster',
          targets: {
            Clusters: 'target-cluster'
          }
        }
      },
      targets: {
        'target-cluster': {
          resourceType: 'aws:rds:cluster',
          selectionMode: 'ALL',
          resourceArns: [dbClusterArn]
        }
      },
      tags: {
        Name: config.name
      }
    });

    const response = await this.client.send(command);
    return response.experimentTemplate!.id!;
  }

  async startExperiment(templateId: string): Promise<string> {
    const command = new StartExperimentCommand({
      experimentTemplateId: templateId,
      tags: {
        StartedBy: 'automated-chaos',
        Timestamp: new Date().toISOString()
      }
    });

    const response = await this.client.send(command);
    return response.experiment!.id!;
  }
}
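Mistakes in a template, such as a malformed selection mode or an action pointing at a target that does not exist, are easier to catch before the template is sent to AWS. The following lint pass is a sketch under assumptions: `TemplateLike` mirrors only the fields used in the templates above, and `lintTemplate` is a hypothetical helper, not part of the AWS SDK.

```typescript
interface TemplateLike {
  stopConditions: Array<{ source: string; value?: string }>;
  actions: Record<string, { actionId: string; targets?: Record<string, string> }>;
  targets: Record<string, { resourceType: string; selectionMode: string }>;
}

// Selection modes accepted by FIS: ALL, COUNT(n), or PERCENT(n) with n in 1..100.
const SELECTION_MODE = /^(ALL|COUNT\([1-9]\d*\)|PERCENT\((100|[1-9]\d?)\))$/;

function lintTemplate(template: TemplateLike): string[] {
  const problems: string[] = [];

  // An experiment with only a 'none' stop condition cannot halt itself.
  const hasAlarm = template.stopConditions.some(
    sc => sc.source === 'aws:cloudwatch:alarm' && !!sc.value
  );
  if (!hasAlarm) problems.push('no CloudWatch alarm stop condition');

  for (const name of Object.keys(template.targets)) {
    const mode = template.targets[name].selectionMode;
    if (!SELECTION_MODE.test(mode)) {
      problems.push(`target "${name}" has invalid selectionMode "${mode}"`);
    }
  }

  // Every target an action refers to must be defined in the template.
  for (const actionName of Object.keys(template.actions)) {
    const refs = template.actions[actionName].targets || {};
    for (const key of Object.keys(refs)) {
      if (!(refs[key] in template.targets)) {
        problems.push(`action "${actionName}" references unknown target "${refs[key]}"`);
      }
    }
  }
  return problems;
}
```

Running the lint in CI against checked-in templates gives the same review point for chaos experiments that other infrastructure changes already have.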

Implementing Safety Mechanisms

// safety-controls.ts
import { CloudWatchClient, PutMetricAlarmCommand } from '@aws-sdk/client-cloudwatch';

class ChaosSafetyControls {
  private cloudwatch: CloudWatchClient;

  constructor(region: string) {
    this.cloudwatch = new CloudWatchClient({ region });
  }

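  async createStopConditionAlarm(
    alarmName: string,
    metricName: string,
    threshold: number,
    snsTopicArn: string,
    loadBalancer: string // dimension value in the form app/<name>/<id>
  ): Promise<void> {
    const command = new PutMetricAlarmCommand({
      AlarmName: alarmName,
      ComparisonOperator: 'GreaterThanThreshold',
      EvaluationPeriods: 1,
      MetricName: metricName,
      Namespace: 'AWS/ApplicationELB',
      // ALB metrics are published per load balancer, so the alarm needs
      // this dimension to match any data.
      Dimensions: [{ Name: 'LoadBalancer', Value: loadBalancer }],
      Period: 60,
      // Use 'Sum' instead for count metrics such as HTTPCode_Target_5XX_Count
      Statistic: 'Average',
      Threshold: threshold,
      ActionsEnabled: true,
      AlarmActions: [snsTopicArn],
      AlarmDescription: 'Stop chaos experiment if error rate exceeds threshold',
      // Missing data is treated as healthy; confirm the metric is actually
      // reporting before relying on this alarm as a stop condition.
      TreatMissingData: 'notBreaching'
    });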

    await this.cloudwatch.send(command);
  }

  validateBusinessHours(): boolean {
    const now = new Date();
    const hour = now.getUTCHours();
    const day = now.getUTCDay();

    // Only run experiments during business hours (9 AM - 5 PM UTC, weekdays)
    return day >= 1 && day <= 5 && hour >= 9 && hour < 17;
  }

  async checkSystemHealth(healthCheckUrl: string): Promise<boolean> {
    try {
      // Bound the probe so a hung endpoint counts as unhealthy
      const response = await fetch(healthCheckUrl, {
        signal: AbortSignal.timeout(5000)
      });
      return response.ok;
    } catch {
      return false;
    }
  }
}
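A CloudWatch alarm only halts an FIS experiment when it is listed in the template's stop conditions; an SNS notification on its own will not stop anything. The sketch below builds that list from alarm names. `alarmArn` and `stopConditionsFor` are hypothetical helpers, and the ARN format assumes the standard `aws` partition.

```typescript
interface StopCondition {
  source: string;
  value?: string;
}

// Assumes the standard "aws" partition (not aws-cn or aws-us-gov).
function alarmArn(region: string, accountId: string, alarmName: string): string {
  return `arn:aws:cloudwatch:${region}:${accountId}:alarm:${alarmName}`;
}

function stopConditionsFor(alarmArns: string[]): StopCondition[] {
  if (alarmArns.length === 0) {
    // FIS accepts a source of 'none', but that leaves the experiment with
    // no automatic brake, so this sketch refuses to build such a list.
    throw new Error('at least one alarm is required as a stop condition');
  }
  return alarmArns.map(arn => ({ source: 'aws:cloudwatch:alarm', value: arn }));
}

const conditions = stopConditionsFor([
  alarmArn('us-east-1', '123456789012', 'checkout-error-rate')
]);
console.log(conditions[0].value);
// arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-error-rate
```

The resulting array has the same shape as the `stopConditions` field of `FISExperimentConfig`, so it can be passed straight into the template builders.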

Common Pitfalls

Running experiments without proper monitoring. You can't learn from experiments if you can't observe their effects. Instrument your system with metrics, logs, and traces before starting chaos experiments.

Insufficient blast radius controls. Starting with experiments that affect 100% of traffic creates unnecessary risk. Begin with 1-5% of traffic and expand gradually.

Ignoring organizational readiness. Chaos engineering requires on-call engineers who can respond if experiments reveal problems. Running experiments when your team is unavailable defeats the purpose.

Testing only infrastructure failures. Application-level failures (memory leaks, deadlocks, cache inconsistencies) often cause more outages than infrastructure problems. Include both in your experiment portfolio.

Treating chaos engineering as a one-time activity. Systems evolve constantly. Experiments that passed last month may fail today after a deployment or configuration change.

Not documenting experiment results. Each experiment generates insights about system behavior. Without documentation, teams repeat experiments or miss patterns across multiple runs.
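The advice to start small and expand gradually can be encoded as a simple ramp policy. This is one possible sketch, not a recommendation of specific numbers: `RampPolicy`, `nextBlastRadius`, the doubling factor, and the 50% ceiling are all illustrative assumptions.

```typescript
interface RampPolicy {
  startPercent: number; // first experiment size
  maxPercent: number;   // ceiling that is never exceeded
  factor: number;       // growth multiplier after a passing run
}

const defaultPolicy: RampPolicy = { startPercent: 1, maxPercent: 50, factor: 2 };

// Returns the percentage of traffic the next run should target.
// A failed run (or no previous run) resets to the starting size.
function nextBlastRadius(
  currentPercent: number,
  lastRunPassed: boolean,
  policy: RampPolicy = defaultPolicy
): number {
  if (!lastRunPassed || currentPercent <= 0) return policy.startPercent;
  return Math.min(currentPercent * policy.factor, policy.maxPercent);
}

// Successive passing runs: 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 50 (capped)
let percent = 0;
const schedule: number[] = [];
for (let run = 0; run < 7; run++) {
  percent = nextBlastRadius(percent, true);
  schedule.push(percent);
}
console.log(schedule.join(', ')); // 1, 2, 4, 8, 16, 32, 50
```

Resetting to the starting size after a failure is a deliberately conservative choice; a team could instead halve the scope, as long as the rule is written down and applied consistently.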

Best Practices Checklist

[ ] Define steady-state metrics using business KPIs
[ ] Start experiments in non-production environments
[ ] Implement automated rollback mechanisms
[ ] Create runbooks for each experiment type
[ ] Schedule experiments during business hours with team availability
[ ] Monitor blast radius in real-time during experiments
[ ] Document hypotheses before running experiments
[ ] Review experiment results in team retrospectives
[ ] Automate experiments in CI/CD pipelines
[ ] Gradually increase experiment scope and frequency
[ ] Test both infrastructure and application-level failures
[ ] Maintain an experiment catalog with results history
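
An experiment catalog does not need special tooling to be useful. As a rough sketch, a list of run records is enough to flag experiments that used to pass and now fail. The `ExperimentRun` shape and `findRegressions` are hypothetical, and timestamps are assumed to be ISO 8601 strings in UTC so that they sort lexicographically.

```typescript
interface ExperimentRun {
  experiment: string;      // catalog name, e.g. "pod-delete-checkout"
  ranAt: string;           // ISO 8601 UTC timestamp
  hypothesisHeld: boolean;
  notes?: string;
}

// Returns experiments whose most recent run failed after at least one
// earlier run had passed.
function findRegressions(history: ExperimentRun[]): string[] {
  const byExperiment: Record<string, ExperimentRun[]> = {};
  for (const run of history) {
    (byExperiment[run.experiment] = byExperiment[run.experiment] || []).push(run);
  }

  const regressions: string[] = [];
  for (const name of Object.keys(byExperiment)) {
    const runs = byExperiment[name]
      .slice()
      .sort((a, b) => (a.ranAt < b.ranAt ? -1 : a.ranAt > b.ranAt ? 1 : 0));
    const latest = runs[runs.length - 1];
    const passedBefore = runs.slice(0, -1).some(r => r.hypothesisHeld);
    if (!latest.hypothesisHeld && passedBefore) regressions.push(name);
  }
  return regressions;
}
```

Reviewing this list in a retrospective ties the catalog back to the deployments or configuration changes that happened between the passing and failing runs.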

FAQ

How do I convince management to allow chaos experiments in production? Start with non-production environments to build confidence. Present chaos engineering as proactive risk management that reduces MTTR and prevents customer-facing incidents. Begin with minimal blast radius (1-2% of traffic) and demonstrate controlled rollback capabilities.

What's the difference between chaos engineering and traditional testing? Traditional testing validates expected behavior under normal conditions. Chaos engineering explores system behavior under failure conditions, revealing emergent properties that only appear in distributed systems under stress.

How often should we run chaos experiments? Start with weekly manual experiments. As confidence grows, automate experiments to run daily or integrate them into deployment pipelines. Critical systems benefit from continuous chaos with experiments running multiple times per day.

Should we run chaos experiments during peak traffic periods? Not at first. Start during low-traffic periods, while the team is available, with minimal blast radius. As your confidence in system resilience grows and safety mechanisms prove effective, gradually move toward testing during representative traffic patterns.

What metrics indicate a successful chaos engineering program? Track MTTR reduction, decreased incident frequency, improved on-call confidence, and faster deployment velocity. Successful programs also show increased experiment coverage and automated experiment execution rates.

How do we handle experiments that cause actual outages? Treat them as valuable learning opportunities. Conduct blameless postmortems, improve safety controls, and update runbooks. The goal is discovering weaknesses in controlled conditions rather than during customer-facing incidents.

Can chaos engineering work for small teams without dedicated SRE resources? Yes. Start with simple experiments using managed tools like AWS FIS. Focus on high-impact scenarios (database failover, service crashes) rather than comprehensive coverage. Even monthly experiments provide significant value.
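
For the program metrics mentioned above, MTTR is straightforward to compute once incidents are recorded with detection and resolution times. The sketch below assumes ISO 8601 timestamps; the `Incident` shape and function name are hypothetical.

```typescript
interface Incident {
  detectedAt: string; // ISO 8601 timestamp
  resolvedAt: string; // ISO 8601 timestamp
}

// Mean time to recovery, in minutes, across the given incidents.
function meanTimeToRecoveryMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) throw new Error('no incidents in window');
  const totalMs = incidents.reduce(
    (sum, i) => sum + (Date.parse(i.resolvedAt) - Date.parse(i.detectedAt)),
    0
  );
  return totalMs / incidents.length / 60000;
}

const lastQuarter: Incident[] = [
  { detectedAt: '2024-01-10T10:00:00Z', resolvedAt: '2024-01-10T10:30:00Z' }, // 30 min
  { detectedAt: '2024-02-05T14:00:00Z', resolvedAt: '2024-02-05T15:30:00Z' }  // 90 min
];
console.log(meanTimeToRecoveryMinutes(lastQuarter)); // 60
```

Comparing this figure across quarters, before and after the chaos program started, gives a simple trend to report alongside incident counts.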
