Skip to main content

Command Palette

Search for a command to run...

API Rate Limiting: Token Bucket vs Sliding Window

Distributed rate limiting with Redis and consistent hashing

Published
8 min read
T

Welcome to TopperBlog! 👋

I'm a tech content creator passionate about helping developers level up their careers and master cutting-edge technologies.

🎯 What I Write About: • AI/ML Engineering & LLMs • Web3 & Blockchain Development
• System Design & Architecture • Interview Preparation (FAANG) • Freelancing & Remote Work • Modern Tech Stacks (Next.js, React, Rust, TypeScript) • Performance Optimization & Best Practices

💼 Mission: Sharing practical, actionable insights that accelerate your tech career and maximize your earning potential.

📚 15+ In-Depth Guides covering everything from earning $10k/month as a freelancer to cracking FAANG interviews.

🌐 Let's connect and grow together in this amazing tech journey!

#TechBlogger #SoftwareEngineering #CareerGrowth #WebDevelopment #AIEngineering

Content Role: pillar

API Rate Limiting: Token Bucket vs Sliding Window

Distributed rate limiting with Redis and consistent hashing

Rate limiting protects your API infrastructure from abuse, ensures fair resource allocation, and maintains service quality under load. Without proper rate limiting, a single misbehaving client can degrade performance for all users, or worse, take down your entire service.

This article examines two production-grade API rate limiting algorithms—token bucket and sliding window—with practical implementations using Redis for distributed systems. We'll explore their trade-offs, implementation patterns, and how to handle edge cases in high-throughput environments.

The Problem: Why Simple Rate Limiting Fails at Scale

In-memory rate limiters work fine for single-server deployments, but modern applications run across multiple instances behind load balancers. Consider these failure scenarios:

Inconsistent limits across instances: User A hits server 1 five times and server 2 five times. Each server allows the request because they track limits independently, effectively doubling the rate limit.

Race conditions: Two requests arrive simultaneously at different servers. Both check Redis, see the user hasn't exceeded limits, and both increment the counter—potentially allowing bursts beyond your threshold.

Clock skew: Time-based algorithms depend on synchronized clocks. Even small discrepancies between servers can cause incorrect rate limit calculations.

Hot key problems: Popular users or API keys create hotspots in your distributed cache, causing performance degradation.

These issues require algorithmic solutions combined with distributed system patterns.

Token Bucket Algorithm

The token bucket algorithm maintains a bucket that fills with tokens at a constant rate. Each request consumes one token. If the bucket is empty, the request is rejected.

How It Works

  1. Initialize a bucket with a maximum capacity (burst size)
  2. Add tokens at a fixed rate (refill rate)
  3. Each request attempts to remove a token
  4. If tokens are available, allow the request and decrement
  5. If bucket is empty, reject the request

Advantages

  • Allows bursts: Clients can consume accumulated tokens for legitimate traffic spikes
  • Smooth traffic: Naturally smooths out request patterns over time
  • Simple state: Only requires tracking token count and last refill time

Redis Implementation

interface TokenBucketConfig {
  maxTokens: number;
  refillRate: number; // tokens per second
  keyPrefix: string;
}

class TokenBucketRateLimiter {
  constructor(
    private redis: Redis,
    private config: TokenBucketConfig
  ) {}

  async allowRequest(userId: string): Promise<boolean> {
    const key = `${this.config.keyPrefix}:${userId}`;
    const now = Date.now();

    const script = `
      local key = KEYS[1]
      local max_tokens = tonumber(ARGV[1])
      local refill_rate = tonumber(ARGV[2])
      local now = tonumber(ARGV[3])

      local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(bucket[1]) or max_tokens
      local last_refill = tonumber(bucket[2]) or now

      -- Calculate tokens to add based on elapsed time
      local elapsed = (now - last_refill) / 1000
      local tokens_to_add = elapsed * refill_rate
      tokens = math.min(max_tokens, tokens + tokens_to_add)

      if tokens >= 1 then
        tokens = tokens - 1
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return 1
      else
        return 0
      end
    `;

    const result = await this.redis.eval(
      script,
      1,
      key,
      this.config.maxTokens,
      this.config.refillRate,
      now
    );

    return result === 1;
  }
}

Token Bucket Pitfalls

Thundering herd: If many clients accumulate full buckets during quiet periods, they can all burst simultaneously when traffic resumes.

Precision loss: Floating-point arithmetic in token calculations can cause drift over time. Use integer math with millisecond precision instead.

Memory overhead: Each user requires persistent state. For millions of users, this becomes significant.

Sliding Window Algorithm

The sliding window algorithm tracks requests within a moving time window. It provides more predictable rate limiting without allowing large bursts.

How It Works

  1. Define a time window (e.g., 60 seconds)
  2. Track timestamps of all requests within the window
  3. For each new request, remove expired timestamps
  4. Count remaining requests
  5. Allow if count is below threshold

Advantages

  • Predictable limits: Enforces exact request counts per time period
  • No burst allowance: Prevents sudden traffic spikes
  • Accurate accounting: Tracks actual request distribution

Redis Implementation with Sorted Sets

interface SlidingWindowConfig {
  windowSize: number; // seconds
  maxRequests: number;
  keyPrefix: string;
}

class SlidingWindowRateLimiter {
  constructor(
    private redis: Redis,
    private config: SlidingWindowConfig
  ) {}

  async allowRequest(userId: string): Promise<boolean> {
    const key = `${this.config.keyPrefix}:${userId}`;
    const now = Date.now();
    const windowStart = now - (this.config.windowSize * 1000);

    const script = `
      local key = KEYS[1]
      local window_start = tonumber(ARGV[1])
      local now = tonumber(ARGV[2])
      local max_requests = tonumber(ARGV[3])
      local window_size = tonumber(ARGV[4])

      -- Remove expired entries
      redis.call('ZREMRANGEBYSCORE', key, 0, window_start)

      -- Count current requests in window
      local current_count = redis.call('ZCARD', key)

      if current_count < max_requests then
        -- Add current request with timestamp as score
        redis.call('ZADD', key, now, now)
        redis.call('EXPIRE', key, window_size * 2)
        return 1
      else
        return 0
      end
    `;

    const result = await this.redis.eval(
      script,
      1,
      key,
      windowStart,
      now,
      this.config.maxRequests,
      this.config.windowSize
    );

    return result === 1;
  }

  async getRemainingRequests(userId: string): Promise<number> {
    const key = `${this.config.keyPrefix}:${userId}`;
    const now = Date.now();
    const windowStart = now - (this.config.windowSize * 1000);

    await this.redis.zremrangebyscore(key, 0, windowStart);
    const count = await this.redis.zcard(key);

    return Math.max(0, this.config.maxRequests - count);
  }
}

Sliding Window Pitfalls

Memory growth: Storing individual timestamps for high-traffic users consumes significant memory. A user making 1000 requests per minute requires storing 1000 sorted set entries.

Boundary issues: Requests at window boundaries can cause unexpected behavior. A user could make max requests at 00:59 and again at 01:00, effectively doubling their rate.

Cleanup overhead: Removing expired entries adds latency to every request.

Distributed System Considerations

Consistent Hashing for Hot Key Distribution

When certain API keys or users generate disproportionate traffic, they create hot keys in Redis. Consistent hashing distributes load across multiple Redis instances:

import { createHash } from 'crypto';

class ConsistentHashRateLimiter {
  private nodes: string[];
  private virtualNodes: Map<number, string>;

  constructor(
    private redisClients: Map<string, Redis>,
    private limiter: TokenBucketRateLimiter,
    virtualNodeCount: number = 150
  ) {
    this.nodes = Array.from(redisClients.keys());
    this.virtualNodes = this.buildHashRing(virtualNodeCount);
  }

  private buildHashRing(virtualNodeCount: number): Map<number, string> {
    const ring = new Map<number, string>();

    for (const node of this.nodes) {
      for (let i = 0; i < virtualNodeCount; i++) {
        const hash = this.hash(`${node}:${i}`);
        ring.set(hash, node);
      }
    }

    return new Map([...ring.entries()].sort((a, b) => a[0] - b[0]));
  }

  private hash(key: string): number {
    const hash = createHash('md5').update(key).digest();
    return hash.readUInt32BE(0);
  }

  private getNode(key: string): string {
    const keyHash = this.hash(key);

    for (const [nodeHash, node] of this.virtualNodes) {
      if (keyHash <= nodeHash) {
        return node;
      }
    }

    return this.nodes[0];
  }

  async allowRequest(userId: string): Promise<boolean> {
    const node = this.getNode(userId);
    const redis = this.redisClients.get(node)!;

    // Use the selected Redis instance
    const limiterWithNode = new TokenBucketRateLimiter(
      redis,
      this.limiter['config']
    );

    return limiterWithNode.allowRequest(userId);
  }
}

Handling Redis Failures

Implement circuit breakers and fallback strategies:

class ResilientRateLimiter {
  private circuitOpen = false;
  private failureCount = 0;
  private readonly failureThreshold = 5;

  async allowRequest(userId: string): Promise<boolean> {
    if (this.circuitOpen) {
      // Fail open: allow requests when Redis is down
      return true;
    }

    try {
      const allowed = await this.limiter.allowRequest(userId);
      this.failureCount = 0;
      return allowed;
    } catch (error) {
      this.failureCount++;

      if (this.failureCount >= this.failureThreshold) {
        this.circuitOpen = true;
        setTimeout(() => {
          this.circuitOpen = false;
          this.failureCount = 0;
        }, 30000); // Reset after 30 seconds
      }

      // Fail open during outages
      return true;
    }
  }
}

Common Pitfalls

Using fixed windows: Fixed windows (e.g., requests per minute starting at :00) allow users to make 2x requests by splitting across boundaries. Always use sliding windows or token buckets.

Ignoring clock drift: In distributed systems, use Redis TIME command or a centralized time source rather than application server clocks.

Not setting TTLs: Rate limit keys without expiration accumulate indefinitely. Always set appropriate TTLs.

Synchronous Redis calls: Blocking on Redis for every request adds latency. Use pipelining or async patterns.

Missing rate limit headers: Return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients can implement backoff strategies.

Inadequate monitoring: Track rate limit hit rates, Redis latency, and false positive rates to tune your configuration.

Best Practices Checklist

  • [ ] Use Lua scripts for atomic Redis operations
  • [ ] Implement consistent hashing for hot key distribution
  • [ ] Set appropriate TTLs on all rate limit keys
  • [ ] Return standard rate limit headers in responses
  • [ ] Monitor rate limit effectiveness and Redis performance
  • [ ] Implement circuit breakers for Redis failures
  • [ ] Use integer math to avoid floating-point precision issues
  • [ ] Test behavior at window boundaries
  • [ ] Document rate limits in API documentation
  • [ ] Implement tiered rate limits for different user classes
  • [ ] Use Redis pipelining for batch operations
  • [ ] Configure Redis maxmemory-policy to evict old keys

Algorithm Selection Guide

Choose Token Bucket when:

  • You want to allow legitimate traffic bursts
  • User experience benefits from burst allowance
  • You need simple state management
  • Memory efficiency is important

Choose Sliding Window when:

  • You need strict, predictable rate limits
  • Preventing bursts is critical for stability
  • You can afford higher memory usage
  • Accurate request distribution matters

Hybrid approach: Use token bucket for most endpoints and sliding window for critical resources that cannot tolerate bursts.

FAQ

Q: How do I handle rate limiting for anonymous users?

A: Use IP address as the identifier, but implement IP range detection for proxies and NAT gateways. Consider more lenient limits for anonymous users and stricter limits for authenticated users.

Q: What happens when Redis goes down?

A: Implement a fail-open strategy where requests are allowed during outages, or use a local in-memory fallback with reduced accuracy. Monitor circuit breaker metrics to detect issues.

Q: How do I rate limit WebSocket connections?

A: Rate limit both connection establishment and message frequency. Track message timestamps in a sliding window and disconnect clients exceeding limits.

Q: Should I rate limit by user, API key, or IP address?

A: Use multiple dimensions: strict limits per API key, moderate limits per user, and lenient limits per IP. Apply the most restrictive limit that matches.

Q: How do I test rate limiting in development?

A: Use Redis in Docker for local testing. Create integration tests that make rapid requests and verify rejection behavior. Test boundary conditions and clock skew scenarios.

Q: What's the performance impact of Lua scripts in Redis?

A: Lua scripts execute atomically and efficiently in Redis. The overhead is minimal compared to multiple round trips. Always benchmark with production-like load.

Q: How do I implement rate limiting for GraphQL APIs?

A: Rate limit based on query complexity scores rather than request count. Parse queries, calculate complexity (depth, breadth, field count), and consume tokens proportionally.