The Real Problem with Health Check Selection

The fundamental challenge isn't technical complexity—it's understanding what "healthy" actually means for your application. A server might accept TCP connections while its application layer is deadlocked. An HTTP endpoint might return 200 OK while the database connection pool is exhausted. These scenarios create "zombie" instances that appear healthy to simple probes but deliver degraded or failed responses to actual users.

In 2025, this problem intensifies with distributed tracing requirements, observability expectations, and the proliferation of sidecar proxies in service mesh architectures. Your health checks must integrate with circuit breakers, rate limiters, and adaptive load balancing algorithms that make real-time routing decisions based on instance health signals. The wrong probe type creates blind spots that sophisticated orchestration systems cannot compensate for.

Why Traditional Health Check Approaches Fail

Legacy health check configurations typically defaulted to TCP probes because they were "fast and simple." This made sense when applications were monolithic, failure modes were binary (up or down), and recovery meant restarting a process. Modern cloud-native applications break these assumptions completely.

Consider a typical Node.js API service in 2025. The process is running and the port accepts connections, so the TCP probe passes. Meanwhile the event loop may be stalled by garbage-collection pressure from a memory leak, authentication tokens for downstream services may have expired, or the Redis cache may be unreachable. TCP probes cannot detect these application-level failures. Traffic keeps flowing to the degraded instance until user-facing errors trigger alerts, often minutes after the problem began.

HTTP probes seemed like the obvious solution, but naive implementations create new problems. A health endpoint that simply returns 200 OK without checking dependencies provides false confidence. Conversely, health checks that perform expensive operations (database queries, external API calls) under high load can trigger cascading failures when probe traffic itself overwhelms struggling instances.

The shift toward Kubernetes, service meshes like Istio and Linkerd, and serverless platforms has introduced additional complexity. These systems often implement multiple health check types (liveness, readiness, startup probes) with different semantics. Misunderstanding the distinction between "is this container alive?" versus "can this instance handle traffic?" leads to restart loops, traffic blackholing, and unpredictable behavior during deployments.

TCP Health Probes: When Connection-Level Checks Suffice

TCP health checks verify that a target port accepts connections. The load balancer establishes a TCP handshake and immediately closes the connection. If the handshake completes successfully, the probe passes.
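To make the mechanics concrete, here is a minimal sketch of the same handshake-then-close logic using Node's built-in net module. Real load balancers implement this internally; the function name tcpProbe and the 3-second default timeout are assumptions chosen for illustration.

```typescript
import * as net from "node:net";

// Attempts a TCP handshake and closes the socket as soon as it completes.
// Resolves true on a successful connect, false on error or timeout.
function tcpProbe(host: string, port: number, timeoutMs = 3000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    let settled = false;

    const finish = (ok: boolean): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy(); // close immediately; no application data is exchanged
      resolve(ok);
    };

    // Explicit timer so a silently dropped SYN counts as a failed probe.
    const timer = setTimeout(() => finish(false), timeoutMs);
    socket.once("connect", () => finish(true));
    socket.once("error", () => finish(false));
  });
}
```

Note what the probe never does: it sends no request and reads no response, which is why it cannot see anything above the transport layer.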

Appropriate use cases for TCP probes:

Stateless TCP services: Load balancers, reverse proxies, or TCP-based protocols where connection acceptance indicates readiness
Performance-critical paths: When probe overhead must be absolutely minimized (sub-millisecond latency requirements)
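Non-HTTP protocols: Database connections, message queues, and other services where HTTP isn't the native protocol. gRPC runs over HTTP/2 but typically exposes its own health-checking service rather than a plain HTTP endpoint, so a TCP probe is a common fallback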
Infrastructure components: When monitoring the network layer itself rather than application logic

TCP probes excel in scenarios where the ability to accept connections directly correlates with service health. A properly configured Nginx reverse proxy that accepts TCP connections is genuinely ready to handle requests. The application logic is minimal, and failure modes are binary.

Here's a production-grade TCP health check configuration for an AWS Network Load Balancer target group using Terraform (Application Load Balancers accept only HTTP and HTTPS target groups):

resource "aws_lb_target_group" "tcp_backend" {
  name     = "tcp-backend-tg"
  port     = 8080
  protocol = "TCP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    protocol            = "TCP"
    port                = "traffic-port"
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    interval            = 10
  }

  deregistration_delay = 30

  connection_termination = true
}

The critical parameters here are healthy_threshold and unhealthy_threshold. Setting both to 2 means an instance must pass two consecutive probes to be marked healthy, and fail two consecutive probes to be marked unhealthy. This prevents flapping during transient network issues while maintaining fast failure detection.
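The threshold rule described above can be modeled as a small state machine. This is a sketch of the general consecutive-result logic, not AWS's actual implementation; the class and field names are hypothetical.

```typescript
type TargetState = "healthy" | "unhealthy";

interface Thresholds {
  healthyThreshold: number;   // consecutive passes needed to become healthy
  unhealthyThreshold: number; // consecutive failures needed to become unhealthy
}

class TargetHealth {
  private readonly thresholds: Thresholds;
  private streak = 0; // length of the current run of results that contradict `state`
  state: TargetState;

  constructor(thresholds: Thresholds, initial: TargetState = "unhealthy") {
    this.thresholds = thresholds;
    this.state = initial;
  }

  // Feed one probe result; returns the (possibly updated) state.
  record(passed: boolean): TargetState {
    const contradicts = (this.state === "healthy") !== passed;
    if (!contradicts) {
      this.streak = 0; // a result that agrees with the current state breaks the run
      return this.state;
    }
    this.streak += 1;
    const needed = this.state === "healthy"
      ? this.thresholds.unhealthyThreshold
      : this.thresholds.healthyThreshold;
    if (this.streak >= needed) {
      this.state = this.state === "healthy" ? "unhealthy" : "healthy";
      this.streak = 0;
    }
    return this.state;
  }
}
```

With both thresholds at 2, a healthy target that sees fail, pass, fail, fail stays healthy through the isolated failure and flips only on the second consecutive one.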

HTTP Health Probes: Application-Aware Monitoring

HTTP health checks make requests to a specific endpoint and evaluate the response. Most load balancers look only at the status code; some can also match headers or the response body. This enables application-level health verification that TCP probes cannot provide.

When HTTP probes are essential:

Stateful applications: Services maintaining in-memory state, caches, or connection pools
Dependency-aware health: Applications that must verify downstream service availability
Complex initialization: Services with lengthy startup sequences requiring readiness signals
Graceful degradation: Applications that can report partial health states

A well-designed HTTP health endpoint checks critical dependencies without creating excessive overhead. Here's a production-ready implementation in TypeScript using Express:

import express, { Request, Response } from 'express';
import { createClient } from 'redis';
import { Pool } from 'pg';

interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  checks: {
    database: boolean;
    cache: boolean;
    memory: boolean;
  };
  details?: Record<string, unknown>;
}

class HealthCheckService {
  private dbPool: Pool;
  private redisClient: ReturnType<typeof createClient>;
  private lastHealthCheck: HealthStatus | null = null;
  private cacheTimeout = 5000; // Cache health status for 5 seconds

  constructor(dbPool: Pool, redisClient: ReturnType<typeof createClient>) {
    this.dbPool = dbPool;
    this.redisClient = redisClient;
  }

  async performHealthCheck(): Promise<HealthStatus> {
    const now = Date.now();

    // Return cached result if recent enough to prevent probe storms
    if (this.lastHealthCheck && 
        (now - new Date(this.lastHealthCheck.timestamp).getTime()) < this.cacheTimeout) {
      return this.lastHealthCheck;
    }

    const checks = {
      database: await this.checkDatabase(),
      cache: await this.checkRedis(),
      memory: this.checkMemory()
    };

    const failedChecks = Object.values(checks).filter(v => !v).length;

    let status: HealthStatus['status'];
    if (failedChecks === 0) {
      status = 'healthy';
    } else if (failedChecks === 1) {
      status = 'degraded';
    } else {
      status = 'unhealthy';
    }

    this.lastHealthCheck = {
      status,
      timestamp: new Date().toISOString(),
      checks,
      details: {
        memoryUsage: process.memoryUsage(),
        uptime: process.uptime()
      }
    };

    return this.lastHealthCheck;
  }

  private async checkDatabase(): Promise<boolean> {
    try {
      await Promise.race([
        this.dbPool.query('SELECT 1'),
        new Promise((_, reject) => 
          setTimeout(() => reject(new Error('Database timeout')), 2000)
        )
      ]);
      return true;
    } catch (error) {
      console.error('Database health check failed:', error);
      return false;
    }
  }

  private async checkRedis(): Promise<boolean> {
    try {
      await Promise.race([
        this.redisClient.ping(),
        new Promise((_, reject) => 
          setTimeout(() => reject(new Error('Redis timeout')), 1000)
        )
      ]);
      return true;
    } catch (error) {
      console.error('Redis health check failed:', error);
      return false;
    }
  }

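  private checkMemory(): boolean {
    const usage = process.memoryUsage();
    // Note: heapTotal is the heap V8 has allocated so far and grows on demand,
    // so this ratio can run high on a healthy process. Treat 90% as a placeholder
    // and tune it (or compare against the configured heap limit) for your workload.
    const heapUsedPercent = (usage.heapUsed / usage.heapTotal) * 100;
    return heapUsedPercent < 90; // Fail if heap usage exceeds 90%
  }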
}

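// Express route setup
// Connection settings are assumed to come from the environment: pg reads the
// standard PG* variables, and REDIS_URL is an assumed variable name.
const dbPool = new Pool();
const redisClient = createClient({ url: process.env.REDIS_URL });
redisClient.on('error', (error) => console.error('Redis client error:', error));
redisClient.connect().catch((error) => console.error('Redis connection failed:', error));

const app = express();
const healthService = new HealthCheckService(dbPool, redisClient);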

app.get('/health', async (req: Request, res: Response) => {
  const health = await healthService.performHealthCheck();

  const statusCode = health.status === 'healthy' ? 200 : 
                     health.status === 'degraded' ? 200 : 503;

  res.status(statusCode).json(health);
});

// Separate liveness probe - always returns 200 if process is running
app.get('/healthz', (req: Request, res: Response) => {
  res.status(200).send('OK');
});

This implementation demonstrates several critical patterns:

Caching: Health check results are cached for 5 seconds to prevent probe storms during high-frequency checks
Timeouts: Each dependency check has aggressive timeouts to prevent blocking
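Graceful degradation: Returns 200 for degraded state (one failed dependency) to avoid unnecessary instance removal. The count does not distinguish between dependencies, so a lone database failure also reports as degraded; weight critical dependencies separately if that is not acceptable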
Separate endpoints: /health for readiness (can handle traffic) and /healthz for liveness (process is alive)
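The timeout pattern can also be factored into a reusable helper. In the class above, each Promise.race leaves its setTimeout pending after the dependency has already answered; the sketch below clears the timer either way. The names withTimeout and checkWithTimeout are hypothetical, not part of any library.

```typescript
// Rejects if `work` has not settled within `ms`; always clears the timer.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Converts any dependency call into the boolean the health status expects.
async function checkWithTimeout(
  label: string,
  work: () => Promise<unknown>,
  ms: number
): Promise<boolean> {
  try {
    await withTimeout(work(), ms, label);
    return true;
  } catch {
    return false;
  }
}
```

A database check would then read as `checkWithTimeout('database', () => pool.query('SELECT 1'), 2000)`, assuming a `pool` like the one in the class above.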

The corresponding load balancer configuration:

resource "aws_lb_target_group" "http_backend" {
  name     = "http-backend-tg"
  port     = 3000
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    protocol            = "HTTP"
    path                = "/health"
    port                = "traffic-port"
    healthy_threshold   = 3
    unhealthy_threshold = 2
    timeout             = 5
    interval            = 15
    matcher             = "200"
  }

  deregistration_delay = 60
}

Notice the higher healthy_threshold (3) for HTTP probes. This prevents premature traffic routing to instances that might be experiencing transient dependency issues during startup or recovery.
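It helps to translate these settings into time. The sketch below uses a rough model: probes start on a fixed interval, a failing probe takes the full timeout, and the instance breaks just after a successful probe. Actual load balancer timing varies, so treat the results as estimates; the function names are hypothetical.

```typescript
interface HealthCheckTiming {
  intervalSeconds: number;
  timeoutSeconds: number;
  healthyThreshold: number;
  unhealthyThreshold: number;
}

// Roughly the longest a broken instance keeps receiving traffic:
// N intervals until the Nth failing probe starts, plus its timeout.
function worstCaseSecondsToUnhealthy(t: HealthCheckTiming): number {
  return t.unhealthyThreshold * t.intervalSeconds + t.timeoutSeconds;
}

// Upper bound on how long a recovered instance waits before receiving
// traffic, assuming passing probes respond almost instantly.
function maxSecondsToHealthy(t: HealthCheckTiming): number {
  return t.healthyThreshold * t.intervalSeconds;
}

const tcpTiming: HealthCheckTiming = {
  intervalSeconds: 10, timeoutSeconds: 3, healthyThreshold: 2, unhealthyThreshold: 2,
};
const httpTiming: HealthCheckTiming = {
  intervalSeconds: 15, timeoutSeconds: 5, healthyThreshold: 3, unhealthyThreshold: 2,
};

console.log(worstCaseSecondsToUnhealthy(tcpTiming));  // 23
console.log(worstCaseSecondsToUnhealthy(httpTiming)); // 35
console.log(maxSecondsToHealthy(httpTiming));         // 45
```

Under this model the HTTP target group takes up to about 35 seconds to eject a failed instance and up to about 45 seconds to readmit it, which is the price of the more cautious thresholds.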

Hybrid Approaches and Advanced Patterns

Modern load balancers and service meshes support sophisticated health check strategies that combine multiple probe types. Kubernetes, for example, distinguishes between liveness, readiness, and startup probes—each serving different purposes.

Liveness probes determine if a container should be restarted. These should use simple TCP or HTTP checks that only fail when the process is truly deadlocked or corrupted. Overly aggressive liveness probes cause restart loops.

Readiness probes determine if a container should receive traffic. These should perform comprehensive dependency checks and can temporarily remove instances from rotation during transient issues.

Startup probes handle slow-starting applications by disabling liveness and readiness checks until initial startup completes. This prevents premature restarts of applications with lengthy initialization.
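The way the three probe types interact can be summarized as a decision function. This is a simplified model for reasoning about the semantics, not the kubelet's implementation; all type and field names are hypothetical, and each outcome is assumed to already account for its failureThreshold.

```typescript
type ProbeOutcome = "passing" | "failing";

interface ContainerProbes {
  startupSucceeded: boolean; // has the startup probe passed at least once?
  startupExhausted: boolean; // has it failed failureThreshold times in a row?
  liveness: ProbeOutcome;
  readiness: ProbeOutcome;
}

type Action =
  | "wait-for-startup"
  | "restart-container"
  | "remove-from-endpoints"
  | "serve-traffic";

function decide(p: ContainerProbes): Action {
  // Until startup succeeds, liveness and readiness results are not consulted.
  if (!p.startupSucceeded) {
    return p.startupExhausted ? "restart-container" : "wait-for-startup";
  }
  // Liveness failure restarts the container (subject to the pod's restartPolicy).
  if (p.liveness === "failing") return "restart-container";
  // Readiness failure only stops traffic; the container keeps running.
  if (p.readiness === "failing") return "remove-from-endpoints";
  return "serve-traffic";
}
```

The ordering is the point: a dependency outage should surface as a readiness failure and cost only traffic, while a restart is reserved for the cases a restart can actually fix.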

Here's a Kubernetes deployment manifest demonstrating this pattern:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
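spec:
  replicas: 3
  # apps/v1 Deployments require a selector that matches the pod template labels;
  # the label value here is illustrative
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers: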
      - name: api
        image: api-service:v2.1.0
        ports:
        - containerPort: 3000

        startupProbe:
          httpGet:
            path: /healthz
            port: 3000
          failureThreshold: 30
          periodSeconds: 10
          # Allows up to 5 minutes for startup

        livenessProbe:
          httpGet:
            path: /healthz
            port: 3000
          initialDelaySeconds: 0
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 2

This configuration ensures that:

Startup probe prevents premature liveness failures during initialization
Liveness probe uses the simple /healthz endpoint to detect process-level failures
Readiness probe uses the comprehensive /health endpoint to verify application-level health

Common Pitfalls and Edge Cases

Probe storms during incidents: When multiple instances fail simultaneously, health check traffic can overwhelm recovering instances. Cache results in the health endpoint, and apply exponential backoff to any retries the endpoint makes against its own dependencies.

Circular dependencies: Health checks that verify downstream service availability can create circular dependencies in microservices architectures. Service A checks Service B, which checks Service C, which checks Service A. This causes cascading failures during partial outages.

Insufficient timeout configuration: Health check timeouts must be shorter than the probe interval to prevent overlapping checks. A 10-second timeout with a 10-second interval creates unpredictable behavior.

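Ignoring deregistration delay: When an instance is deregistered, the load balancer stops sending it new requests but keeps existing connections open for the deregistration delay so in-flight requests can finish. Set this value based on your application's longest request duration (typically 30-60 seconds for APIs); too short a delay cuts off active requests, and too long a delay slows deployments.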

False positives from cold starts: Serverless functions and auto-scaled containers may fail initial health checks due to cold start latency. Use startup probes with generous failure thresholds.

Health check endpoint security: Exposing detailed health information can leak infrastructure details. Implement authentication for detailed health endpoints or return minimal information on public endpoints.

Database connection pool exhaustion: Health checks that create new database connections on every probe can exhaust connection pools. Reuse connections or implement connection pooling specifically for health checks.
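The circular-dependency pitfall above is easy to check for if you keep a map of which service's health endpoint probes which other services. The following is a minimal depth-first search sketch; the graph shape and function name are assumptions, and building the map from real configuration is left out.

```typescript
// Maps each service to the services its health endpoint checks.
type DependencyGraph = Record<string, string[]>;

// Returns one cycle as a path (first node repeated at the end), or null.
function findHealthCheckCycle(graph: DependencyGraph): string[] | null {
  const visiting = new Set<string>(); // nodes on the current DFS path
  const done = new Set<string>();     // nodes fully explored without a cycle
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    if (done.has(node)) return null;
    if (visiting.has(node)) {
      return [...path.slice(path.indexOf(node)), node];
    }
    visiting.add(node);
    path.push(node);
    for (const next of graph[node] ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

const cyclic: DependencyGraph = { A: ["B"], B: ["C"], C: ["A"] };
console.log(findHealthCheckCycle(cyclic)); // [ 'A', 'B', 'C', 'A' ]
```

Running a check like this in CI against declared dependencies is one way to catch the A-checks-B-checks-C-checks-A loop before a partial outage does.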

Best Practices for Production Health Checks

1. Match probe type to failure mode: Use TCP probes for infrastructure components where connection acceptance equals readiness. Use HTTP probes for application services with complex state.

2. Implement tiered health checks: Separate liveness (process alive) from readiness (can handle traffic) from startup (initialization complete).

3. Cache health check results: Prevent probe storms by caching results for 5-10 seconds, especially under high probe frequency.

4. Set aggressive timeouts: Health check timeouts should be 2-3 seconds maximum. Slow responses indicate degraded performance that warrants traffic removal.

5. Monitor health check metrics: Track probe success rates, response times, and failure patterns. Sudden changes indicate infrastructure or application issues.

6. Test failure scenarios: Regularly simulate dependency failures to verify health checks correctly remove instances from rotation.

7. Document expected behavior: Clearly document what each health endpoint checks and under what conditions it should fail.

8. Implement graceful shutdown: On SIGTERM, start failing the readiness check so the load balancer stops sending new requests, then complete in-flight requests before the process exits.

9. Use consistent response formats: Standardize health check response schemas across services for easier monitoring and debugging.

10. Consider probe source IP allowlisting: Restrict health check endpoints to load balancer IP ranges to prevent abuse.
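As an illustration of practice 8, the shutdown sequence can be reduced to a small piece of shared state that the readiness handler and the request pipeline both consult. This is a sketch under assumptions: the class name ShutdownGate and its methods are hypothetical, and wiring it to a real server and signal handler is omitted.

```typescript
class ShutdownGate {
  private shuttingDown = false;
  private inFlight = 0;

  // The readiness handler would return 503 once this is false.
  isReady(): boolean {
    return !this.shuttingDown;
  }

  // Called from the SIGTERM handler.
  beginShutdown(): void {
    this.shuttingDown = true;
  }

  // Called by request middleware at the start and end of each request.
  requestStarted(): void {
    this.inFlight += 1;
  }

  requestFinished(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }

  // Safe to exit only after shutdown has begun and in-flight work has drained.
  canExit(): boolean {
    return this.shuttingDown && this.inFlight === 0;
  }
}
```

In a Node.js service, a SIGTERM handler would call beginShutdown() and then poll canExit(), with an upper time limit, before exiting the process.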

Frequently Asked Questions

What is the main difference between TCP and HTTP health checks in load balancers?

TCP health checks verify that a port accepts connections by completing a TCP handshake, while HTTP health checks make actual HTTP requests to specific endpoints and evaluate response codes. TCP probes only confirm network-level connectivity, whereas HTTP probes can verify application-level health including dependency status and resource availability.

When should you use TCP probes instead of HTTP health checks?

Use TCP probes for stateless infrastructure components like reverse proxies, non-HTTP protocols such as databases or message queues, and performance-critical scenarios where sub-millisecond probe latency is required. TCP probes are appropriate when connection acceptance directly indicates service readiness without complex application state.

How do health check intervals affect application reliability in 2025?

Health check intervals balance failure detection speed against probe overhead. Shorter intervals (5-10 seconds) enable faster failure detection but increase load on instances. Modern best practice uses 10-15 second intervals with low healthy/unhealthy thresholds (2-3 consecutive results) to achieve sub-minute failure detection without excessive overhead.

What happens when health check endpoints are too comprehensive?

Overly comprehensive health checks that query multiple dependencies or perform expensive operations can create cascading failures. During incidents when systems are already stressed, health check traffic itself can overwhelm instances, causing healthy servers to fail probes and be removed from rotation, amplifying the outage.

How should health checks work with Kubernetes liveness and readiness probes?

Kubernetes liveness probes should use simple checks (TCP or basic HTTP) that only fail when the container must be restarted. Readiness probes should perform comprehensive dependency checks to determine traffic eligibility. Startup probes should have generous failure thresholds to accommodate slow initialization without triggering premature liveness failures.

What is the best way to implement health checks for microservices with many dependencies?

Implement tiered health checks where critical dependencies (database, authentication) cause hard failures (503 response), while non-critical dependencies (caching, analytics) cause degraded status (200 response with warning). Cache health check results for 5-10 seconds and use circuit breakers to prevent cascading failures when dependencies are unavailable.
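One way to express that tiering is to tag each dependency as critical or not and derive the response from the tags rather than from a raw failure count. This is a minimal sketch; the critical flag and the function name deriveStatus are assumptions, and which dependencies count as critical is a per-service decision.

```typescript
type Status = "healthy" | "degraded" | "unhealthy";

interface DependencyResult {
  name: string;
  ok: boolean;
  critical: boolean; // hypothetical flag: true if the service cannot work without it
}

function deriveStatus(results: DependencyResult[]): { status: Status; httpCode: number } {
  const failed = results.filter((r) => !r.ok);
  // Any failed critical dependency is a hard failure.
  if (failed.some((r) => r.critical)) return { status: "unhealthy", httpCode: 503 };
  // Only non-critical failures: keep serving, but report the degradation.
  if (failed.length > 0) return { status: "degraded", httpCode: 200 };
  return { status: "healthy", httpCode: 200 };
}
```

With this rule a failed cache yields a 200 with a degraded status, while a failed database yields a 503 regardless of how many other checks pass.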

When should you avoid using HTTP health checks?

Avoid HTTP health checks for pure TCP services, when probe overhead significantly impacts performance, or when the HTTP layer adds unnecessary complexity. Also avoid them when health endpoints cannot be implemented without creating circular dependencies or when the application cannot reliably determine its own health state.

Conclusion

Choosing between TCP and HTTP health probes fundamentally depends on what "healthy" means for your specific application. TCP probes provide fast, low-overhead connection verification suitable for infrastructure components and stateless services. HTTP probes enable application-aware health monitoring essential for stateful services with complex dependencies.

Modern production systems in 2025 typically require both: TCP probes for infrastructure layers and HTTP probes with tiered health checks for application services. Implement separate liveness and readiness endpoints, cache health check results to prevent probe storms, and set aggressive timeouts to quickly detect degraded instances.

Start by auditing your current health check configuration. Identify services using TCP probes that should verify application-level health. Implement comprehensive HTTP health endpoints with dependency checking, result caching, and graceful degradation. Test failure scenarios to verify probes correctly remove unhealthy instances without causing cascading failures. Monitor health check metrics alongside application performance to continuously refine your configuration as your architecture evolves.

