Infrastructure Testing: Terratest Guide




Infrastructure Testing with Terratest and Kitchen-Terraform: A Production Guide

A widely cited industry estimate puts the cost of production downtime at about $5,600 per minute. Yet many teams still deploy Terraform configurations with little or no automated testing, relying on manual review instead. That worked when infrastructure changes were infrequent. In 2025's cloud-native landscape, where teams ship infrastructure changes dozens of times a day, manual validation becomes the bottleneck that leads to outages, security vulnerabilities, and compliance violations.

The problem isn't a lack of awareness: most engineering teams know they should test infrastructure code. The challenge is that infrastructure testing requires spinning up real cloud resources, validating their configuration, and tearing them down, all while managing costs, test isolation, and execution time. This article demonstrates how to implement production-grade infrastructure testing using Terratest and Kitchen-Terraform, two complementary tools that solve different aspects of the infrastructure testing challenge.

Why Traditional Infrastructure Validation Fails

Many teams rely on terraform plan, static analysis tools like tfsec or Checkov, and manual code reviews. While these approaches catch syntax errors and policy violations, they fail to validate the most critical aspect: whether your infrastructure actually works as intended when deployed.

Consider a common scenario: you're deploying a Kubernetes cluster with specific node configurations, security groups, and IAM roles. Static analysis confirms your HCL syntax is correct and security policies are met. But it cannot tell you:

  • Whether the nodes can actually join the cluster
  • If the security group rules allow necessary traffic while blocking unauthorized access
  • Whether the IAM roles grant exactly the permissions needed, neither too many nor too few
  • If your application can successfully connect to the database through the configured networking

These runtime behaviors only emerge when infrastructure is actually provisioned. This is where infrastructure testing tools become essential.

Understanding Terratest vs Kitchen-Terraform

Before diving into implementation, it's important to understand when to use each tool.

Terratest is a Go-based testing framework that treats infrastructure testing as integration testing. You write Go test functions that deploy infrastructure, validate it, and tear it down. Terratest excels at:

  • Complex validation logic requiring API calls to cloud providers
  • Testing infrastructure alongside application deployment
  • Scenarios requiring programmatic validation of resource state
  • Integration with existing Go-based CI/CD pipelines

Kitchen-Terraform follows the Test Kitchen pattern familiar to Chef users, using a declarative YAML configuration to define test scenarios. It's ideal for:

  • Teams preferring declarative over imperative testing
  • Standardized test patterns across multiple Terraform modules
  • Organizations already using Test Kitchen for other infrastructure
  • Scenarios where InSpec provides sufficient validation capabilities

For most modern cloud-native teams in 2025, Terratest offers greater flexibility and better integration with contemporary development workflows. However, Kitchen-Terraform remains valuable for teams with existing Test Kitchen investments or those preferring declarative approaches.

Implementing Production-Grade Terratest

Let's build a realistic testing framework for a Terraform module that provisions an AWS EKS cluster with associated networking and security configurations.

Project Structure

terraform-aws-eks/
├── main.tf
├── variables.tf
├── outputs.tf
├── examples/
│   └── complete/
│       ├── main.tf
│       └── variables.tf
└── test/
    ├── go.mod
    ├── go.sum
    └── eks_test.go

Setting Up Terratest

First, initialize your Go module for testing:

// test/go.mod
module github.com/yourorg/terraform-aws-eks/test

go 1.22

require (
    github.com/gruntwork-io/terratest v0.46.11
    github.com/stretchr/testify v1.8.4
    github.com/aws/aws-sdk-go-v2 v1.24.1
    github.com/aws/aws-sdk-go-v2/config v1.26.6
    github.com/aws/aws-sdk-go-v2/service/eks v1.35.0
)

Writing Comprehensive Infrastructure Tests

// test/eks_test.go
package test

import (
    "context"
    "fmt"
    "testing"
    "time"

    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/eks"
    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestEKSClusterDeployment(t *testing.T) {
    t.Parallel()

    // Generate unique names to avoid conflicts in parallel test execution
    uniqueID := random.UniqueId()
    clusterName := fmt.Sprintf("test-eks-%s", uniqueID)

    // Configure AWS region for testing
    awsRegion := "us-west-2"

    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/complete",
        Vars: map[string]interface{}{
            "cluster_name": clusterName,
            "region":       awsRegion,
            "environment":  "test",
        },
        EnvVars: map[string]string{
            "AWS_DEFAULT_REGION": awsRegion,
        },
    })

    // Ensure cleanup happens even if test fails
    defer terraform.Destroy(t, terraformOptions)

    // Deploy infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Retrieve outputs for validation
    clusterEndpoint := terraform.Output(t, terraformOptions, "cluster_endpoint")
    clusterSecurityGroupID := terraform.Output(t, terraformOptions, "cluster_security_group_id")
    nodeGroupName := terraform.Output(t, terraformOptions, "node_group_name")

    // Validate cluster endpoint is accessible
    assert.NotEmpty(t, clusterEndpoint, "Cluster endpoint should not be empty")
    assert.Contains(t, clusterEndpoint, "eks.amazonaws.com", "Endpoint should be valid EKS URL")

    // Validate cluster status using AWS SDK
    validateClusterStatus(t, awsRegion, clusterName)

    // Validate node group is ready
    validateNodeGroup(t, awsRegion, clusterName, nodeGroupName)

    // Validate security group configuration
    validateSecurityGroup(t, awsRegion, clusterSecurityGroupID)
}

func validateClusterStatus(t *testing.T, region, clusterName string) {
    ctx := context.Background()

    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
    require.NoError(t, err, "Failed to load AWS config")

    client := eks.NewFromConfig(cfg)

    // Retry logic for eventual consistency
    maxRetries := 10
    for i := 0; i < maxRetries; i++ {
        output, err := client.DescribeCluster(ctx, &eks.DescribeClusterInput{
            Name: &clusterName,
        })

        require.NoError(t, err, "Failed to describe cluster")

        if output.Cluster.Status == "ACTIVE" {
            assert.Equal(t, "ACTIVE", string(output.Cluster.Status))
            assert.NotNil(t, output.Cluster.Endpoint)
            assert.NotNil(t, output.Cluster.CertificateAuthority)
            return
        }

        if i < maxRetries-1 {
            time.Sleep(30 * time.Second)
        }
    }

    t.Fatal("Cluster did not reach ACTIVE status within timeout")
}

func validateNodeGroup(t *testing.T, region, clusterName, nodeGroupName string) {
    ctx := context.Background()

    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
    require.NoError(t, err)

    client := eks.NewFromConfig(cfg)

    output, err := client.DescribeNodegroup(ctx, &eks.DescribeNodegroupInput{
        ClusterName:   &clusterName,
        NodegroupName: &nodeGroupName,
    })

    require.NoError(t, err, "Failed to describe node group")

    // Validate node group configuration
    assert.Equal(t, "ACTIVE", string(output.Nodegroup.Status))
    assert.NotNil(t, output.Nodegroup.ScalingConfig)
    assert.GreaterOrEqual(t, *output.Nodegroup.ScalingConfig.DesiredSize, int32(1))

    // Validate node group has proper IAM role
    assert.NotNil(t, output.Nodegroup.NodeRole)
}

func validateSecurityGroup(t *testing.T, region, securityGroupID string) {
    // Implementation would validate security group rules
    // ensuring proper ingress/egress configuration
    assert.NotEmpty(t, securityGroupID)
}

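// Test for module upgrade scenarios (outline only)
func TestEKSClusterUpgrade(t *testing.T) {
    // Skip explicitly so an empty test body is not reported as a pass
    t.Skip("outline only: upgrade test not implemented yet")

    // Deploy with version 1.28
    // Apply upgrade to 1.29
    // Validate cluster remains healthy
    // This tests blue-green upgrade patterns
}

// Test for disaster recovery (outline only)
func TestEKSClusterBackupRestore(t *testing.T) {
    // Skip explicitly so an empty test body is not reported as a pass
    t.Skip("outline only: backup/restore test not implemented yet")

    // Deploy cluster
    // Create backup of critical resources
    // Simulate failure
    // Restore from backup
    // Validate functionality
}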
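
The validateSecurityGroup helper above only checks that an ID was returned. As a rough sketch of the rule check it could perform, the example below flags ingress rules that are open to the whole internet on ports other than an allow-list. The IngressRule type and the openToWorld function are hypothetical names introduced for illustration; a real test would fill the slice from the EC2 DescribeSecurityGroups response.

```go
package main

import "fmt"

// IngressRule is a simplified, hypothetical view of a security group rule.
// A real test would populate it from the EC2 DescribeSecurityGroups response.
type IngressRule struct {
	FromPort int
	ToPort   int
	CIDR     string
}

// openToWorld returns the rules that allow traffic from any IPv4 or IPv6
// address on anything other than a single allowed port.
func openToWorld(rules []IngressRule, allowedPorts map[int]bool) []IngressRule {
	var bad []IngressRule
	for _, r := range rules {
		if r.CIDR != "0.0.0.0/0" && r.CIDR != "::/0" {
			continue // not publicly reachable, out of scope for this check
		}
		// A public rule is acceptable only if it covers exactly one allowed port.
		if r.FromPort == r.ToPort && allowedPorts[r.FromPort] {
			continue
		}
		bad = append(bad, r)
	}
	return bad
}

func main() {
	rules := []IngressRule{
		{FromPort: 443, ToPort: 443, CIDR: "0.0.0.0/0"},
		{FromPort: 22, ToPort: 22, CIDR: "0.0.0.0/0"},
		{FromPort: 0, ToPort: 65535, CIDR: "10.0.0.0/16"},
	}
	for _, r := range openToWorld(rules, map[int]bool{443: true}) {
		fmt.Printf("unexpected public ingress: ports %d-%d from %s\n", r.FromPort, r.ToPort, r.CIDR)
	}
}
```

Keeping the check as a pure function over plain structs means the policy itself can be unit tested without provisioning anything; only the code that fetches the rules needs a live account.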

Implementing Kitchen-Terraform for Declarative Testing

For teams preferring declarative approaches, Kitchen-Terraform provides a YAML-based testing framework:

# .kitchen.yml
---
driver:
  name: terraform
  root_module_directory: test/fixtures/complete

provisioner:
  name: terraform

verifier:
  name: terraform
  systems:
    - name: basic
      backend: aws
      controls:
        - cluster_exists
        - cluster_active
        - nodes_ready

platforms:
  - name: aws

suites:
  - name: default
    driver:
      variables:
        cluster_name: test-eks-kitchen
        region: us-west-2
    verifier:
      systems:
        - name: basic
          backend: aws
          profile_locations:
            - test/integration/default

InSpec control file for validation:

# test/integration/default/controls/eks_cluster.rb
cluster_name = input('cluster_name')
region = input('region')

control 'cluster_exists' do
  impact 1.0
  title 'EKS Cluster exists'

  describe aws_eks_cluster(cluster_name: cluster_name, region: region) do
    it { should exist }
    its('status') { should eq 'ACTIVE' }
    its('version') { should match /^1\.(28|29|30)$/ }
  end
end

control 'cluster_active' do
  impact 1.0
  title 'EKS Cluster is active and accessible'

  describe aws_eks_cluster(cluster_name: cluster_name, region: region) do
    its('endpoint') { should_not be_nil }
    its('certificate_authority.data') { should_not be_nil }
  end
end

control 'nodes_ready' do
  impact 1.0
  title 'Node group is ready'

  describe aws_eks_nodegroups(cluster_name: cluster_name, region: region) do
    its('entries.count') { should be >= 1 }
    its('entries.first.status') { should eq 'ACTIVE' }
  end
end

Common Pitfalls and Edge Cases

Test Isolation Failures

The most common failure mode is test interference when running parallel tests. Always use unique resource names:

// Bad: Hard-coded names cause conflicts
clusterName := "test-eks-cluster"

// Good: Unique names per test run
clusterName := fmt.Sprintf("test-eks-%s", random.UniqueId())
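
Unique IDs solve collisions, but some services also restrict the character set and length of names. S3 bucket names, for example, must be lowercase and at most 63 characters, while Terratest's random.UniqueId() returns a mixed-case ID. The sketch below shows one way to normalize a name while keeping the unique suffix intact; resourceName is a hypothetical helper, not part of Terratest, and it assumes maxLen is positive.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceName joins prefix and uniqueID with a dash, lowercases the result,
// and shortens the prefix (never the ID) so the name fits within maxLen.
// Assumes maxLen > 0.
func resourceName(prefix, uniqueID string, maxLen int) string {
	id := strings.ToLower(uniqueID)

	// Characters left for the prefix once the dash and the ID are accounted for.
	room := maxLen - len(id) - 1
	if room < 1 {
		// No room for any prefix: fall back to the (possibly truncated) ID.
		if len(id) > maxLen {
			return id[:maxLen]
		}
		return id
	}

	p := strings.ToLower(prefix)
	if len(p) > room {
		p = p[:room]
	}
	return p + "-" + id
}

func main() {
	fmt.Println(resourceName("Test-EKS", "aB3dE9", 63))
	fmt.Println(resourceName("averyveryverylongprefix", "aB3dE9", 12))
}
```

Truncating the prefix rather than the whole string matters: cutting the tail would remove the part of the name that makes it unique.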

Cost Management

Infrastructure tests provision real resources that cost money. Implement aggressive cleanup:

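// Always use defer for cleanup
defer terraform.Destroy(t, terraformOptions)

// Bound the AWS SDK calls made during validation by passing ctx to them.
// This does not bound terraform apply/destroy. Go's test runner also stops
// after 10 minutes by default, so raise it, e.g. go test -timeout 45m
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
defer cancel()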
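
A context with a timeout only helps if something observes it. The sketch below, using only the standard library, shows a polling loop that stops as soon as the deadline passes. pollWithDeadline and its check callback are hypothetical names; in a real test the callback would wrap an AWS SDK call, and the interval is assumed to be positive.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollWithDeadline calls check every interval until it reports done,
// returns an error, or the timeout elapses. interval must be positive.
func pollWithDeadline(timeout, interval time.Duration, check func() (bool, error)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		// Wait for the next tick, or stop when the deadline passes.
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	// Stand-in for a cloud API call that never reports readiness.
	neverReady := func() (bool, error) { return false, nil }

	err := pollWithDeadline(50*time.Millisecond, 10*time.Millisecond, neverReady)
	// Reports whether the deadline, rather than the check itself, ended the loop.
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```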

Eventual Consistency Issues

Cloud resources don't become available instantly. Implement proper retry logic:

// Bad: Immediate validation fails
output, _ := client.DescribeCluster(ctx, input)
assert.Equal(t, "ACTIVE", output.Cluster.Status)

// Good: Retry at a fixed interval with Terratest's retry module
// (import "github.com/gruntwork-io/terratest/modules/retry")
retry.DoWithRetry(t, "wait for EKS cluster to become ACTIVE", 10, 30*time.Second, func() (string, error) {
    output, err := client.DescribeCluster(ctx, input)
    if err != nil {
        return "", err
    }
    if output.Cluster.Status != "ACTIVE" {
        return "", fmt.Errorf("cluster not ready")
    }
    return "ready", nil
})
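
Terratest's retry module sleeps a fixed interval between attempts. Where growing delays are preferred, for example to ease pressure on a throttled API, exponential backoff can be sketched with the standard library alone. backoffDelay, retryWithBackoff, and errNotReady are hypothetical names introduced for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady stands in for "the resource exists but is not ACTIVE yet".
var errNotReady = errors.New("cluster not ready")

// backoffDelay returns the sleep after failed attempt number attempt (0-based):
// base doubled once per attempt, capped at max.
func backoffDelay(base, max time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// retryWithBackoff calls action up to maxAttempts times, sleeping with
// exponential backoff between failures, and wraps the last error on give-up.
func retryWithBackoff(maxAttempts int, base, max time.Duration, action func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = action(); err == nil {
			return nil
		}
		// No point sleeping after the final attempt.
		if attempt < maxAttempts-1 {
			time.Sleep(backoffDelay(base, max, attempt))
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated status check that succeeds on the third call.
	err := retryWithBackoff(5, 10*time.Millisecond, 80*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

With a 30 second base and a 5 minute cap, the delays would run 30s, 1m, 2m, 4m, then 5m for every later attempt.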

State Management in CI/CD

Tests must handle Terraform state properly:

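terraformOptions := &terraform.Options{
    TerraformDir: "../examples/complete",
    BackendConfig: map[string]interface{}{
        // Reuse one pre-created state bucket (the name is a placeholder) and
        // isolate runs by key. The bucket must exist before terraform init,
        // and bucket names must be lowercase, so the ID goes in the key.
        "bucket": "terraform-test-state",
        "key":    fmt.Sprintf("test/%s/terraform.tfstate", uniqueID),
        "region": "us-west-2",
    },
}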

Resource Quota Limits

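Tests can hit AWS service quotas. Detect quota errors and skip the test rather than report a failure in the module:

// InitAndApplyE returns the error instead of failing the test immediately
_, err := terraform.InitAndApplyE(t, terraformOptions)
if err != nil && strings.Contains(err.Error(), "LimitExceeded") {
    t.Skip("Skipping test due to AWS quota limits")
}
require.NoError(t, err)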
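
A single substring check is easy to outgrow, since AWS services report quota problems under several error codes. The sketch below pulls the decision into one function. The marker list is illustrative rather than exhaustive, isQuotaError is a hypothetical helper, and the error values in main are simulated.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// quotaMarkers lists substrings that commonly appear in AWS quota errors.
// Illustrative only; extend it for the services a module actually uses.
var quotaMarkers = []string{
	"LimitExceeded", // also matches VcpuLimitExceeded, AddressLimitExceeded, ResourceLimitExceededException
	"ServiceQuotaExceeded",
}

// Simulated errors used by the demo below.
var (
	errSimulatedQuota  = errors.New("VcpuLimitExceeded: simulated quota error")
	errSimulatedAccess = errors.New("AccessDenied: simulated permission error")
)

// isQuotaError reports whether err looks like a quota or limit error.
func isQuotaError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	for _, marker := range quotaMarkers {
		if strings.Contains(msg, marker) {
			return true
		}
	}
	return false
}

func main() {
	for _, err := range []error{errSimulatedQuota, errSimulatedAccess, nil} {
		fmt.Printf("%v -> quota error: %t\n", err, isQuotaError(err))
	}
}
```

Skipping on quota errors keeps the suite green, but it also hides the fact that nothing was tested. It is worth reporting skipped runs somewhere visible so a persistent quota problem does not go unnoticed.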

Best Practices for Production Infrastructure Testing

1. Implement Test Stages

Structure tests in stages matching your deployment pipeline:

  • Unit tests: Validate Terraform syntax and basic logic (terraform validate, tflint)
  • Integration tests: Deploy to isolated test accounts (Terratest/Kitchen-Terraform)
  • Smoke tests: Validate critical paths in staging
  • Chaos tests: Inject failures to validate resilience
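
One way to wire these stages into a pipeline is a small lookup from CI trigger to the stages that should run. The sketch below is a hypothetical policy, not a feature of Terratest or any CI system: the trigger names are invented labels, and the mapping (cheap stages on pull requests, everything on a schedule) is one reasonable choice among many.

```go
package main

import "fmt"

// stagesByTrigger maps a CI trigger to the test stages it should run.
// Stage names follow the list above; trigger names are hypothetical labels.
var stagesByTrigger = map[string][]string{
	"pull_request": {"unit", "smoke"},
	"main_merge":   {"unit", "integration", "smoke"},
	"schedule":     {"unit", "integration", "smoke", "chaos"},
}

// stagesFor returns the stages for a trigger. Unknown triggers fall back
// to the cheapest stage so a misconfigured pipeline never provisions resources.
func stagesFor(trigger string) []string {
	if stages, ok := stagesByTrigger[trigger]; ok {
		return stages
	}
	return []string{"unit"}
}

// shouldRun reports whether stage is enabled for trigger.
func shouldRun(trigger, stage string) bool {
	for _, s := range stagesFor(trigger) {
		if s == stage {
			return true
		}
	}
	return false
}

func main() {
	for _, trigger := range []string{"pull_request", "main_merge", "schedule"} {
		fmt.Printf("%-12s -> %v\n", trigger, stagesFor(trigger))
	}
	fmt.Println("chaos on pull_request:", shouldRun("pull_request", "chaos"))
}
```

In a Go test suite, a check like shouldRun could sit at the top of each test and call t.Skip when the stage is disabled, with the trigger read from an environment variable set by the pipeline.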

2. Use Dedicated Test Accounts

Never run infrastructure tests in production accounts. Use separate AWS accounts with:

  • Strict resource quotas to prevent runaway costs
  • Automated cleanup policies (AWS Config rules, Lambda functions)
  • Cost alerts and budget limits
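
The core of an automated cleanup policy is deciding which resources are stale. A minimal sketch of that decision follows, assuming test resources carry a ManagedBy tag and a creation time. TestResource, staleResources, and the demo inventory are hypothetical; a real sweeper would build the list from a tagging or inventory API and then delete what this function returns.

```go
package main

import (
	"fmt"
	"time"
)

// TestResource is a hypothetical record of a tagged resource in the test account.
type TestResource struct {
	ID        string
	ManagedBy string // value of the ManagedBy tag
	CreatedAt time.Time
}

// Demo data: a fixed "now" keeps the example deterministic.
var (
	demoNow       = time.Date(2025, 1, 2, 12, 0, 0, 0, time.UTC)
	demoTTL       = 6 * time.Hour
	demoInventory = []TestResource{
		{ID: "eks-old", ManagedBy: "terratest", CreatedAt: demoNow.Add(-8 * time.Hour)},
		{ID: "eks-new", ManagedBy: "terratest", CreatedAt: demoNow.Add(-1 * time.Hour)},
		{ID: "shared-vpc", ManagedBy: "platform-team", CreatedAt: demoNow.Add(-72 * time.Hour)},
	}
)

// staleResources returns the resources tagged ManagedBy=terratest that are
// older than ttl at time now. Resources owned by anyone else are never selected.
func staleResources(all []TestResource, now time.Time, ttl time.Duration) []TestResource {
	var stale []TestResource
	for _, r := range all {
		if r.ManagedBy == "terratest" && now.Sub(r.CreatedAt) > ttl {
			stale = append(stale, r)
		}
	}
	return stale
}

func main() {
	for _, r := range staleResources(demoInventory, demoNow, demoTTL) {
		fmt.Printf("would delete %s (age %s)\n", r.ID, demoNow.Sub(r.CreatedAt))
	}
}
```

The TTL should comfortably exceed the longest legitimate test run, otherwise the sweeper can delete resources out from under a test that is still executing.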

3. Optimize Test Execution Time

Infrastructure tests are slow. Optimize with:

// Run independent tests in parallel
func TestEKSCluster(t *testing.T) {
    t.Parallel() // Enable parallel execution
}

// Use smaller instance types for testing
Vars: map[string]interface{}{
    "instance_type": "t3.small", // vs production t3.xlarge
    "node_count":    1,           // vs production 3
}

4. Implement Comprehensive Logging

terraform.InitAndApply(t, &terraform.Options{
    TerraformDir: "../examples/complete",
    Logger:       logger.New(logger.Terratest),
    NoColor:      false,
})

5. Version Pin Everything

# versions.tf
terraform {
  required_version = "~> 1.7.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40.0"
    }
  }
}

6. Create Test Fixtures

Maintain example configurations specifically for testing:

examples/
├── minimal/          # Smallest viable configuration
├── complete/         # Full-featured example
└── upgrade/          # Version upgrade scenarios

7. Monitor Test Infrastructure Costs

Implement cost tracking:

// Tag all test resources
Vars: map[string]interface{}{
    "tags": map[string]string{
        "Environment": "test",
        "ManagedBy":   "terratest",
        "TestID":      uniqueID,
        "CostCenter":  "engineering",
    },
}
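
Tags only support cost tracking if every resource actually carries them. As a small sketch, the check below reports which required tag keys are missing from a resource's tag map; missingTags is a hypothetical helper, and the required keys simply mirror the tags in the snippet above.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredTags mirrors the tag keys applied to test resources above.
var requiredTags = []string{"Environment", "ManagedBy", "TestID", "CostCenter"}

// missingTags returns the required keys that are absent or empty in tags,
// sorted alphabetically so the result is stable.
func missingTags(tags map[string]string, required []string) []string {
	var missing []string
	for _, key := range required {
		if tags[key] == "" {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Tags as they might be read back from a deployed resource.
	tags := map[string]string{
		"Environment": "test",
		"ManagedBy":   "terratest",
	}
	fmt.Println(missingTags(tags, requiredTags))
}
```

A test can read the tags back from a deployed resource through the provider's SDK and assert that this function returns an empty slice.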

Frequently Asked Questions

How long should infrastructure tests take to run?

Aim for under 15 minutes for integration tests. EKS clusters take 10-15 minutes to provision, so tests involving them will be slower. Use parallel execution and smaller instance types to optimize. For CI/CD, consider running full tests only on main branch merges, with faster smoke tests on pull requests.

Should I test every Terraform module separately?

Yes, treat each reusable module as a library that needs its own test suite. Test modules in isolation first, then test compositions of modules. This approach helps identify whether issues stem from individual modules or their integration.

How do I handle secrets in infrastructure tests?

Never commit secrets to test code. Use AWS Secrets Manager or Parameter Store, and reference them in tests. For CI/CD, use OIDC federation to grant temporary credentials rather than long-lived access keys. Terratest supports AWS credential chains automatically.

What's the best way to test disaster recovery scenarios?

Create dedicated tests that deploy infrastructure, simulate failures (terminate instances, delete resources), and validate recovery mechanisms. Use Terratest's retry logic to validate that auto-scaling groups, health checks, and backup restoration work correctly. These tests are expensive to run, so execute them on a schedule rather than on every commit.

How do I test infrastructure changes without affecting production?

Use Terraform workspaces or separate state files for test environments. Deploy to isolated AWS accounts or VPCs. Implement proper tagging so test resources are clearly identified. Use AWS Organizations SCPs to prevent test accounts from accessing production resources.

Can I use Terratest with other cloud providers besides AWS?

Yes, Terratest supports Azure, GCP, Kubernetes, and other platforms. The testing pattern remains the same: deploy, validate, destroy. Use the appropriate SDK for your cloud provider (Azure SDK for Go, Google Cloud Client Libraries) for validation logic.

How do I debug failing infrastructure tests?

Enable verbose Terraform logging with TF_LOG=DEBUG. Use terraform.InitAndApplyAndIdempotent to catch configurations that are not idempotent, meaning a plan run immediately after apply still shows changes. Comment out the defer terraform.Destroy line temporarily to inspect resources after a test failure, and remember to destroy them manually afterwards. Check CloudWatch logs, VPC Flow Logs, and CloudTrail for detailed error information.

Conclusion

Infrastructure testing is no longer optional for teams deploying cloud infrastructure at scale. Terratest and Kitchen-Terraform provide complementary approaches to validating that your infrastructure code works correctly before reaching production. Terratest offers programmatic flexibility ideal for complex validation scenarios, while Kitchen-Terraform provides declarative simplicity for standardized testing patterns.

The key to successful infrastructure testing is treating it like application testing: write tests early, run them automatically, and continuously refine them as your infrastructure evolves. Start with basic deployment validation, then progressively add tests for security configurations, disaster recovery, and upgrade scenarios.

Next steps to implement infrastructure testing:

  1. Set up a dedicated AWS test account with cost controls
  2. Choose Terratest or Kitchen-Terraform based on your team's preferences
  3. Write tests for your most critical Terraform modules first
  4. Integrate tests into your CI/CD pipeline with appropriate triggers
  5. Monitor test execution time and costs, optimizing as needed
  6. Expand test coverage to include disaster recovery and chaos scenarios

Infrastructure testing requires upfront investment, but the cost of catching misconfigurations before production far outweighs the expense of running automated tests. Start small, prove value with your most critical infrastructure, then expand coverage systematically.