Rate Limiting in Distributed Systems: Token Bucket vs Sliding Window

TL;DR

Token bucket allows bursts and is better for APIs; sliding window gives exact enforcement but costs more memory. Implement both atomically with a Redis Lua script to avoid race conditions.

A deep dive into the two most common rate limiting algorithms — when to use each, how they behave under burst traffic, and how to implement them with Redis.In this post, I'll walk through the key concepts with code examples drawn from real production implementations.

Why Rate Limiting Matters

Without rate limiting, a single misbehaving client can degrade service for everyone. In multi-tenant systems, this is catastrophic — one tenant's burst shouldn't affect another's latency. Rate limiting is also a first line of defense against API abuse and credential stuffing attacks.

Token Bucket Algorithm

Token Bucket allows bursts up to the bucket capacity, then enforces a steady refill rate. A bucket holds at most N tokens; each request consumes one token; tokens refill at rate R per second. If the bucket is empty, the request is rejected or queued.

The key advantage: it smooths out traffic while allowing short bursts. A client can fire 10 requests in 1 second if the bucket is full, then slow to the refill rate thereafter.

Sliding Window Algorithm

Sliding Window tracks the number of requests in a rolling time window. Unlike Fixed Window, it doesn't have a boundary problem where a client can double their allowed requests by straddling a window boundary. The tradeoff: it's more expensive to compute (requires sorted sets or approximate counters).

token_bucket.lua

-- Atomic token bucket check using Redis Lualocal key = KEYS[1]local capacity = tonumber(ARGV[1])local refill_rate = tonumber(ARGV[2])  -- tokens/secondlocal now = tonumber(ARGV[3])  -- unix timestamp mslocal data = redis.call("HMGET", key, "tokens", "last_refill")local tokens = tonumber(data[1]) or capacitylocal last_refill = tonumber(data[2]) or now-- Calculate tokens to addlocal elapsed = (now - last_refill) / 1000local new_tokens = math.min(capacity, tokens + elapsed * refill_rate)if new_tokens < 1 then  return 0  -- rate limitedendredis.call("HMSET", key, "tokens", new_tokens - 1, "last_refill", now)redis.call("PEXPIRE", key, math.ceil(capacity / refill_rate) * 1000)return 1  -- allowed

Comparison Table

Algorithm	Burst	Precision	Memory
Token Bucket	Allowed up to N	Approximate	O(1) per key
Sliding Window	Not allowed	Exact	O(W) per key
Fixed Window	Boundary burst	Approximate	O(1) per key

When to Use Each

Use token bucket for APIs where short bursts are legitimate (e.g., a user opening your app fires several concurrent requests). Use sliding window when you need hard enforcement — billing APIs, vote submission, financial transactions. Use fixed window only when simplicity matters more than correctness.

Production Note

Always implement the rate limit check atomically. A non-atomic GET-then-SET allows two concurrent requests to both see a non-empty bucket and both get approved — defeating the purpose entirely. Redis Lua scripts run atomically server-side with no inter-request interleaving.

Distributed Environments

In a multi-replica gateway, each replica must share the same rate limit state — otherwise a client can route around per-replica limits. Redis is the natural shared store. The Lua script approach works identically whether you have 1 or 100 gateway replicas.

If Redis is unavailable, decide how to fail: fail-open (allow requests) preserves availability at the cost of briefly unenforced limits; fail-closed (reject all) preserves limit correctness at the cost of a service outage. For most APIs, fail-open is the right default.

How I Built an Exactly-Once Job Queue Without a Coordinator