Why Default HPA Metrics Fail Modern Workloads

The standard Kubernetes HPA defaults to CPU utilization as its primary metric: when no metrics are specified it targets 80% average CPU utilization, and the controller re-evaluates every 15 seconds using data from the metrics-server. This configuration assumes workloads scale linearly with CPU consumption and that CPU pressure appears before user-facing degradation occurs. Neither assumption holds for contemporary application architectures.

Consider a gRPC service handling streaming responses. CPU utilization might hover at 50% while the service maintains 10,000 concurrent streams per pod. When traffic doubles, the service doesn't consume proportionally more CPU—it exhausts file descriptors and connection limits first. The HPA sees stable CPU metrics and refuses to scale, leading to connection rejections and cascading failures across dependent services.
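The effect is easy to reproduce with the replica calculation from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), together with the controller's default 10% tolerance. The sketch below is a simplified model rather than the controller's code; the replica count of 4 and the 5,000-stream target are assumptions chosen for illustration.

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_value / target_value
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# CPU view of the streaming service: 50% utilization against an 80% target.
cpu_based = desired_replicas(4, 50, 80)

# Stream view: 10,000 streams per pod against a hypothetical 5,000-stream target.
stream_based = desired_replicas(4, 10_000, 5_000)

print(cpu_based, stream_based)
```

On CPU alone the model asks for fewer replicas than are running, while the stream-based signal asks for twice as many.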

Memory-based scaling presents similar challenges. Garbage-collected languages like Go and Java exhibit sawtooth memory patterns where utilization climbs gradually until GC triggers, then drops sharply. An HPA configured with memory thresholds will trigger unnecessary scaling events during normal GC cycles, creating pod churn that actually degrades performance through constant rescheduling and warm-up periods.

The metrics-server itself introduces architectural limitations. It provides only current resource utilization snapshots with no historical context, making it impossible to implement predictive scaling or detect gradual degradation trends. Its scrape interval is set by the --metric-resolution flag (15 seconds by default in recent releases, 60 seconds in older ones), so the HPA operates on stale data during rapid traffic changes, and the pipeline from kubelet to metrics-server to HPA controller adds further delay on top of that, 30-45 seconds in typical clusters.

Implementing Custom Metrics for Production HPA

Production HPA configuration requires custom and external metrics that correlate directly with user experience and business outcomes. The Kubernetes custom metrics API and external metrics API provide standardized interfaces for exposing application-specific metrics to the HPA controller.

The architecture involves three components: a metrics source (Prometheus, Datadog, or cloud provider monitoring), a metrics adapter that implements the custom metrics API, and HPA resources configured to query these metrics. Here's a production-grade implementation using Prometheus and the prometheus-adapter:

apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_pending{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_pending$"
        as: "${1}_pending_per_pod"
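      # http_requests_pending is a gauge, so average it over the window instead of applying rate()
      metricsQuery: 'sum(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'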

    - seriesQuery: 'inference_queue_depth{namespace!="",service!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          service: {resource: "service"}
      name:
        as: "inference_queue_depth"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

    - seriesQuery: 'grpc_server_stream_count{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        as: "active_streams_per_pod"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

This configuration exposes three metrics: pending HTTP requests per pod, inference queue depth at the service level, and active gRPC streams per pod. The metricsQuery field is a PromQL template that aggregates the raw series into values the HPA can consume; the pending-requests rule applies a 2-minute window to smooth out transient spikes.
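To make the renaming and templating concrete, the sketch below mimics the two transformations in plain Python. It is an approximation of what prometheus-adapter does, not its implementation; the helper names, the namespace, and the pod names are hypothetical.

```python
import re

def exposed_name(series: str, matches: str, as_template: str) -> str:
    """Mimic the adapter's rename: regex match on the series, ${N} groups in the template."""
    # Translate the adapter's ${1} group syntax into Python's \g<1>.
    py_template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", as_template)
    return re.sub(matches, py_template, series)

def expand_query(template: str, series: str, label_matchers: str, group_by: str) -> str:
    """Fill the three template fields used by the rules above."""
    return (template
            .replace("<<.Series>>", series)
            .replace("<<.LabelMatchers>>", label_matchers)
            .replace("<<.GroupBy>>", group_by))

name = exposed_name("http_requests_pending", r"^(.*)_pending$", "${1}_pending_per_pod")

query = expand_query(
    "sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)",
    series="grpc_server_stream_count",
    # Hypothetical namespace and pod names, for illustration only.
    label_matchers='namespace="production",pod=~"pod-a|pod-b"',
    group_by="pod",
)

print(name)
print(query)
```

Printing the expanded query and running it against Prometheus by hand is a quick way to check a rule before the HPA depends on it.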

The corresponding HPA configuration demonstrates multi-metric scaling with different target values:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: active_streams_per_pod
      target:
        type: AverageValue
        averageValue: "800"
  - type: Object
    object:
      metric:
        name: inference_queue_depth
      describedObject:
        apiVersion: v1
        kind: Service
        name: ml-inference-service
      target:
        type: Value
        value: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 5
        periodSeconds: 30
      selectPolicy: Max

The behavior section, introduced with autoscaling/v2beta2 in Kubernetes 1.18 and stable in autoscaling/v2 since 1.23, allows precise control over scaling velocity and stability. This configuration scales up aggressively (doubling capacity or adding 5 pods every 30 seconds, whichever is greater) while scaling down conservatively (removing at most 50% of pods or 2 pods per minute, whichever is smaller, with a 5-minute stabilization window).
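The per-period limits that these policies impose can be worked out by hand. The sketch below is a simplified model: it treats the current replica count as the count at the start of the period and ignores the stabilization window. The function names are hypothetical; the defaults mirror the manifest above.

```python
import math

def scale_up_ceiling(current: int, max_replicas: int = 50,
                     percent: int = 100, pods: int = 5) -> int:
    """Highest replica count reachable in one 30-second period (selectPolicy: Max)."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    return min(max_replicas, max(by_percent, by_pods))

def scale_down_floor(current: int, min_replicas: int = 3,
                     percent: int = 50, pods: int = 2) -> int:
    """Lowest replica count reachable in one 60-second period (selectPolicy: Min)."""
    by_percent = int(current * (1 - percent / 100))
    by_pods = current - pods
    # Min selects the smallest change, which is the higher resulting replica count.
    return max(min_replicas, by_percent, by_pods)

for replicas in (3, 20, 30):
    print(replicas, scale_up_ceiling(replicas), scale_down_floor(replicas))
```

At small replica counts the fixed pod policy dominates scale-up, while at larger counts the percentage policy takes over; scale-down is bounded by the 2-pod policy almost everywhere.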

External Metrics for Cloud-Native Architectures

External metrics enable scaling based on infrastructure signals outside the Kubernetes cluster. This proves essential for event-driven architectures where workload demand originates from message queues, cloud storage events, or third-party APIs.

For a Kafka consumer deployment, scaling based on consumer lag provides more accurate capacity planning than any pod-level metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-consumer-hpa
  namespace: streaming
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: event-processor
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric:
        name: kafka_consumer_lag
        selector:
          matchLabels:
            topic: "user-events"
            consumer_group: "event-processor"
      target:
        type: AverageValue
        averageValue: "1000"

This configuration requires an external metrics provider that queries Kafka's consumer group metadata. The KEDA (Kubernetes Event-Driven Autoscaling) project provides production-ready scalers for Kafka, RabbitMQ, Azure Service Bus, AWS SQS, and dozens of other event sources. KEDA serves the external metrics API and can coexist with prometheus-adapter serving the custom metrics API for comprehensive scaling coverage.
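With an AverageValue target on an external metric, the replica count follows from dividing the total metric value by the per-pod target. The sketch below is a simplified model of that calculation using the limits from the manifest above; the lag figures in the loop are made up for illustration.

```python
import math

def replicas_for_lag(total_lag: int, current: int, target_per_pod: int = 1000,
                     min_replicas: int = 2, max_replicas: int = 30,
                     tolerance: float = 0.1) -> int:
    """Desired replicas for an external metric with an AverageValue target."""
    usage_ratio = total_lag / (target_per_pod * current)
    if abs(1.0 - usage_ratio) <= tolerance:
        desired = current  # close enough to target: no change
    else:
        desired = math.ceil(total_lag / target_per_pod)
    # The HPA never leaves the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))

# Hypothetical total lag values for a deployment currently at 4 replicas.
for lag in (500, 4_200, 12_500, 100_000):
    print(lag, replicas_for_lag(lag, current=4))
```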

For applications on AWS, KEDA can scale on SQS queue depth:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-processor-scaler
  namespace: processing
spec:
  scaleTargetRef:
    name: sqs-processor
  minReplicaCount: 1
  maxReplicaCount: 40
  pollingInterval: 10
  cooldownPeriod: 120
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/processing-queue
      queueLength: "50"
      awsRegion: "us-east-1"
      identityOwner: "operator"
      scaleOnInFlight: "false"

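KEDA's ScaledObject provides a higher-level abstraction over HPA with built-in support for scaling to zero, which standard HPA cannot achieve. The scaleOnInFlight parameter controls whether in-flight messages (received but not yet deleted) count toward the queue length. At its default of "true" they are counted, which prevents premature scale-down while messages are still being processed; the example above sets it to "false", so only visible messages drive scaling.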
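The difference the flag makes shows up when most of the backlog is already being processed. The sketch below is a rough model of the target replica count, not KEDA's code; the message counts are hypothetical, and the defaults mirror queueLength and the replica bounds in the manifest above.

```python
import math

def effective_queue_length(visible: int, in_flight: int, scale_on_in_flight: bool) -> int:
    """Messages counted toward scaling, with or without in-flight messages."""
    return visible + in_flight if scale_on_in_flight else visible

def target_replicas(queue_len: int, per_pod: int = 50,
                    min_replicas: int = 1, max_replicas: int = 40) -> int:
    """Replicas needed so that each pod holds at most per_pod messages."""
    return max(min_replicas, min(max_replicas, math.ceil(queue_len / per_pod)))

# Hypothetical queue state: few waiting messages, many still being processed.
visible, in_flight = 20, 480

ignoring_in_flight = target_replicas(effective_queue_length(visible, in_flight, False))
counting_in_flight = target_replicas(effective_queue_length(visible, in_flight, True))

print(ignoring_in_flight, counting_in_flight)
```

Ignoring in-flight messages lets the deployment shrink while hundreds of messages are still mid-processing, which is the premature scale-down the flag exists to avoid.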

Advanced Scaling Patterns and Composite Metrics

Production environments often require composite scaling logic that considers multiple metrics simultaneously. The HPA controller evaluates all configured metrics independently and selects the highest desired replica count, but this can lead to over-provisioning when metrics spike independently.

For more sophisticated control, implement custom metrics that combine multiple signals:

# Prometheus recording rules for a composite scaling metric
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: composite-scaling-metrics
  namespace: monitoring
spec:
  groups:
  - name: hpa_composite_metrics
    interval: 15s
    rules:
    # Each term is aggregated to a single series so the three can be added together.
    - record: workload:scaling_pressure:ratio
      expr: |
        (sum(rate(http_requests_total[2m])) / 100) * 0.4
        + (sum(http_requests_pending) / 50) * 0.3
        + (avg(avg_over_time(http_request_duration_seconds{quantile="0.95"}[2m])) / 0.5) * 0.3
    # Divide by the number of healthy pods to get a per-pod value.
    - record: workload:scaling_pressure:per_pod
      expr: |
        workload:scaling_pressure:ratio
        / scalar(count(up{job="workload-pods"} == 1))

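This composite metric weighs request rate (40%), pending requests (30%), and P95 latency (30%) to create a unified scaling signal. Each signal is first divided by a reference value (100 requests per second, 50 pending requests, 0.5 seconds), so a score of 1.0 means the workload sits exactly at its reference load. The per-pod calculation ensures the HPA receives a normalized value regardless of current replica count.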
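The same arithmetic can be checked offline before the rule is deployed. The sketch below reuses the weights and reference values from the recording rule; the function names and the sample inputs are hypothetical.

```python
import math

# Reference values and weights taken from the recording rule above.
REFERENCE = {"rate": 100.0, "pending": 50.0, "p95": 0.5}
WEIGHTS = {"rate": 0.4, "pending": 0.3, "p95": 0.3}

def scaling_pressure(rate: float, pending: float, p95: float) -> float:
    """Weighted sum of signals, each normalized by its reference value."""
    signals = {"rate": rate, "pending": pending, "p95": p95}
    return sum(WEIGHTS[k] * signals[k] / REFERENCE[k] for k in signals)

def pressure_per_pod(pressure: float, ready_pods: int) -> float:
    """Normalize the cluster-wide score by the number of ready pods."""
    return pressure / ready_pods

at_reference = scaling_pressure(rate=100, pending=50, p95=0.5)
double_traffic = scaling_pressure(rate=200, pending=50, p95=0.5)

print(round(at_reference, 3), round(double_traffic, 3))
print(round(pressure_per_pod(double_traffic, ready_pods=4), 3))
```

Because request rate carries a 40% weight, doubling traffic alone raises the score by 0.4 rather than doubling it, which is worth keeping in mind when choosing the HPA target.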

Common Pitfalls and Edge Cases

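Missing metrics cause the most frequent HPA failures in production. When a metrics adapter becomes unavailable or a Prometheus query times out, the HPA controller cannot compute a recommendation for the affected metric, and the HPA status reports a ScalingActive condition of False. If any configured metric cannot be fetched, the controller skips scale-down entirely and scales up only when the remaining metrics call for it. A workload whose only useful signal has vanished therefore stays frozen at its current size, which blocks scale-up during an outage. The 5-minute default of horizontal-pod-autoscaler-downscale-stabilization then delays any scale-down further once metrics return.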
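The rule described in the Kubernetes HPA documentation for combining per-metric recommendations when some metrics are unavailable can be modelled in a few lines. This is a sketch of that documented rule, not the controller's code; the function name is hypothetical and None stands for a metric that could not be fetched.

```python
from typing import Optional, Sequence

def combine_recommendations(current: int, proposals: Sequence[Optional[int]]) -> int:
    """Pick the replica count from per-metric proposals; None marks a missing metric."""
    valid = [p for p in proposals if p is not None]
    if not valid:
        return current  # nothing to act on, keep the current size
    desired = max(valid)  # the highest proposal wins
    if len(valid) < len(proposals) and desired < current:
        return current  # scale-down is skipped while any metric is missing
    return desired

print(combine_recommendations(10, [12, None]))    # scale-up still happens
print(combine_recommendations(10, [6, None]))     # scale-down is blocked
print(combine_recommendations(10, [6, 8]))        # all metrics present
print(combine_recommendations(10, [None, None]))  # no data at all
```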

Implement health checks for your metrics pipeline:

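apiVersion: v1
kind: Service
metadata:
  name: metrics-adapter-health
  namespace: monitoring
  labels:
    app: prometheus-adapter  # the ServiceMonitor below selects Services by this label
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "6443"
spec:
  selector:
    app: prometheus-adapter
  ports:
  - name: metrics
    port: 6443
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: adapter-health
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus-adapter
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics  # /healthz reports liveness only and is not a scrape target
    scheme: https   # assumption: the adapter serves TLS on 6443; add tlsConfig and auth to match your deployment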

Metric resolution mismatches create scaling oscillations. If your application emits metrics every 10 seconds but the HPA evaluates every 15 seconds with a 30-second Prometheus query window, you'll see delayed reactions and potential flapping. Align all timing parameters: application metric emission interval, Prometheus scrape interval, query window in the adapter configuration, and HPA sync period.
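A small check can catch the most common mismatches before they reach a cluster. The sketch below is illustrative only: the function is hypothetical, the 15-second scrape interval is an assumption (the example in the text does not state one), and the minimum of four scrapes per query window is a rule of thumb rather than a Kubernetes requirement.

```python
def timing_issues(emit_s: int, scrape_s: int, window_s: int,
                  hpa_sync_s: int = 15, min_samples: int = 4) -> list:
    """Return human-readable warnings for a set of timing parameters (seconds)."""
    issues = []
    if scrape_s < emit_s:
        issues.append("scrape interval is shorter than the emission interval, "
                      "so some scrapes repeat a value")
    if window_s < min_samples * scrape_s:
        issues.append(f"query window holds fewer than {min_samples} scrapes")
    if window_s < hpa_sync_s:
        issues.append("query window is shorter than the HPA sync period, "
                      "so some samples never reach a scaling decision")
    return issues

# The example from the text: 10 s emission, 30 s window, 15 s HPA sync.
for issue in timing_issues(emit_s=10, scrape_s=15, window_s=30):
    print("warning:", issue)

# Widening the window to 2 minutes clears the warning.
print(timing_issues(emit_s=10, scrape_s=15, window_s=120))
```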

Resource quota exhaustion silently prevents scaling. When an HPA scales up but the namespace has reached its resource quota, the new pods are rejected at creation and the HPA status gives no clear indication of why: desired replicas simply stay above current replicas. The rejected creations are retried repeatedly, creating event spam in the cluster. Define quotas explicitly and monitor usage against them:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "200"
    requests.memory: "400Gi"
    limits.cpu: "400"
    limits.memory: "800Gi"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high-priority", "medium-priority"]
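
Quota headroom can be estimated ahead of time by comparing remaining quota with per-pod requests. The sketch below uses the hard limits from the manifest above; the usage figures and per-pod requests are hypothetical, and CPU is expressed in cores and memory in Gi.

```python
def pods_that_fit(hard: dict, used: dict, per_pod: dict) -> int:
    """Number of additional pods the quota can admit, limited by the scarcest resource."""
    return min(int((hard[r] - used[r]) // per_pod[r]) for r in per_pod)

hard = {"requests.cpu": 200, "requests.memory": 400}     # from the ResourceQuota above
used = {"requests.cpu": 180, "requests.memory": 380}     # hypothetical current usage
per_pod = {"requests.cpu": 2, "requests.memory": 4}      # hypothetical pod requests

headroom = pods_that_fit(hard, used, per_pod)
wanted = 8  # pods the HPA would like to add
blocked = max(0, wanted - headroom)

print(headroom, blocked)
```

Alerting when headroom drops below the number of pods a single scale-up step can add gives operators time to raise the quota before scaling stalls.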

Cold start latency undermines reactive scaling. When the HPA creates new pods, they require 30-90 seconds to pass readiness checks and begin receiving traffic. During this window, existing pods remain overloaded. For known traffic patterns, raise the replica floor ahead of time with scheduled scaling. Native HPA has no time-based rules, so this needs an additional controller; KEDA's cron scaler is one option:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scheduled-baseline-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: ml-inference-service
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
  # Hold at least 10 replicas during business hours (weekdays 08:00-18:00)
  - type: cron
    metadata:
      timezone: Etc/UTC  # assumption: replace with your own time zone
      start: 0 8 * * 1-5
      end: 0 18 * * 1-5
      desiredReplicas: "10"
  # Reactive scaling still applies on top of the scheduled baseline
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"

A CronJob that patches minReplicas on an existing HPA, or a custom operator, achieves the same result. KEDA creates and manages its own HPA for the target, so do not attach a separate hand-written HPA to the same Deployment.

Best Practices for Production HPA Configuration

Establish metric baselines through load testing before deploying HPA configurations. Run sustained load tests at 50%, 100%, and 150% of expected peak traffic while recording all candidate metrics. Identify which metrics correlate most strongly with user-facing latency and error rates. This empirical approach prevents guessing at appropriate threshold values.

Implement gradual rollout for HPA changes using progressive delivery. Deploy new HPA configurations to a canary deployment first, monitor scaling behavior for 24-48 hours, then promote to production. Use GitOps workflows to version control HPA manifests alongside application code.

Configure comprehensive observability for autoscaling decisions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hpa-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "HPA Scaling Decisions",
        "panels": [
          {
            "title": "Desired vs Current Replicas",
            "targets": [{
              "expr": "kube_horizontalpodautoscaler_status_desired_replicas{namespace='production'}"
            }, {
              "expr": "kube_horizontalpodautoscaler_status_current_replicas{namespace='production'}"
            }]
          },
          {
            "title": "Metric Values vs Targets",
            "targets": [{
              "expr": "kube_horizontalpodautoscaler_status_target_metric"
            }, {
              "expr": "kube_horizontalpodautoscaler_spec_target_metric"
            }]
          },
          {
            "title": "Scaling Limited",
            "targets": [{
              "expr": "kube_horizontalpodautoscaler_status_condition{condition='ScalingLimited',status='true'}"
            }]
          }
        ]
      }
    }

Set appropriate resource requests and limits on pods. The HPA calculates desired replicas based on actual resource usage divided by requested resources. If requests are too low, the HPA will scale prematurely; if too high, it will under-provision capacity. Use Vertical Pod Autoscaler (VPA) in recommendation mode to identify optimal request values, then configure HPA based on those baselines.
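The sensitivity to requests is easy to quantify. The sketch below holds actual usage fixed and varies only the CPU request; it is a simplified model of the utilization calculation with a 70% target and the default 10% tolerance. The 350m usage figure, the three request sizes, and the replica count of 4 are hypothetical.

```python
import math

def cpu_utilization_pct(usage_m: float, request_m: float) -> float:
    """Utilization as the HPA sees it: actual usage as a percentage of the request."""
    return 100.0 * usage_m / request_m

def desired_replicas(current: int, utilization_pct: float,
                     target_pct: float = 70.0, tolerance: float = 0.1) -> int:
    ratio = utilization_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)

USAGE_M = 350  # hypothetical: every pod really uses 350 millicores

results = {
    request_m: desired_replicas(4, cpu_utilization_pct(USAGE_M, request_m))
    for request_m in (250, 500, 1000)
}

for request_m, replicas in results.items():
    print(f"request={request_m}m -> desired replicas={replicas}")
```

Identical pods under identical load end up with anywhere from half to double the replica count depending solely on the request value.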

Implement circuit breakers and rate limiting at the application level to prevent cascading failures during scaling delays. Even with optimally configured HPAs, a 30-60 second gap exists between detecting load and new pods becoming ready. Applications must gracefully degrade during this window rather than failing completely.

Test failure scenarios regularly. Simulate metrics adapter failures, Prometheus outages, and API server slowdowns to verify HPA behavior degrades gracefully. Document expected behavior during each failure mode and establish runbooks for operators.

FAQ

What is the difference between custom metrics and external metrics in HPA configuration?

Custom metrics are associated with Kubernetes objects (pods, services, ingress) and typically represent application-level telemetry like request rates or queue depths per pod. External metrics come from sources outside the cluster, such as cloud provider monitoring services, SaaS APIs, or external message queues. Use custom metrics for application-specific scaling signals and external metrics for infrastructure or third-party service signals that drive workload demand.

How does HPA metrics configuration work with GPU-accelerated workloads in 2025?

GPU workloads require specialized metrics beyond standard resource utilization. Configure custom metrics for GPU memory usage, CUDA kernel execution time, and inference throughput using NVIDIA DCGM exporter with Prometheus. Scale based on GPU utilization percentage and inference queue depth rather than CPU metrics, as GPU workloads often show minimal CPU usage while being compute-bound on accelerators. Set longer stabilization windows (5-10 minutes) for scale-down to avoid expensive GPU pod churn.

What is the best way to configure HPA for event-driven microservices?

Event-driven architectures should scale based on message queue depth or lag rather than pod resource metrics. Use KEDA with queue-specific scalers (Kafka, RabbitMQ, SQS, Azure Service Bus) to scale based on unconsumed messages. Configure the queueLength parameter to represent the number of messages a single pod can process within your target latency SLA. Enable scale-to-zero for cost optimization during idle periods, but set appropriate cooldown periods to prevent rapid scaling oscillations.
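As a starting point for sizing queueLength, divide the latency budget by the time one message takes to process. The sketch below is a back-of-the-envelope calculation with hypothetical numbers; the function name and the concurrency parameter are assumptions, and real values should come from load testing.

```python
def queue_length_per_pod(sla_ms: int, per_message_ms: int, concurrency: int = 1) -> int:
    """Messages one pod can drain within the latency budget."""
    return max(1, (sla_ms // per_message_ms) * concurrency)

# Hypothetical: 30 s latency budget, 600 ms per message, one worker per pod.
single_worker = queue_length_per_pod(sla_ms=30_000, per_message_ms=600)

# Hypothetical: the same pod with four concurrent workers.
four_workers = queue_length_per_pod(sla_ms=30_000, per_message_ms=600, concurrency=4)

print(single_worker, four_workers)
```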

When should you avoid using Horizontal Pod Autoscaler?

Avoid HPA for stateful workloads that require coordination during scaling, such as databases, caches with local state, or applications using leader election. These workloads need StatefulSet with manual scaling or specialized operators. Also avoid HPA for batch jobs that should run to completion regardless of resource usage—use Job or CronJob resources instead. Finally, don't use HPA for workloads with startup times exceeding 2-3 minutes, as scaling latency will prevent effective autoscaling.

How do you scale HPA configuration across multiple clusters?

Implement a centralized configuration management approach using GitOps tools like ArgoCD or Flux. Store HPA manifests in a Git repository with environment-specific overlays using Kustomize or Helm. Use cluster-specific values for minReplicas, maxReplicas, and metric
