Summary
A production rate limiter is not just a Redis counter with an expiry. Start by defining the policy, choose an algorithm that matches the traffic shape, make the decision atomic, design for Redis and regional failures, and separate approximate infrastructure protection from billing-grade quota enforcement.
A practical guide to rate-limiting algorithms, atomic Redis enforcement, failure modes, multi-region trade-offs, and the edge cases that shape production API design. In this post, I'll walk through the key concepts with code examples drawn from real production implementations.
Start With the Problem, Not Redis
A rate limiter controls how quickly a client can consume a shared resource. It protects an API from abusive traffic, accidental retry loops, credential-stuffing attempts, noisy tenants, and sudden bursts that would otherwise overload downstream services. It can also enforce product limits such as a free plan allowing 1,000 invoice exports per month.
The first design question is not which Redis command to use. It is what should be limited. A public endpoint may need an IP-based limit before authentication. An authenticated API may use a user ID, API key, or tenant ID. A login route often needs several policies at once: account, IP address, and subnet. An expensive report endpoint may need weighted costs or a separate concurrency limit.
| Policy key | Good fit | Important caveat |
|---|---|---|
| user_id | Authenticated user fairness | One user may legitimately use several devices |
| api_key | Developer and partner APIs | A leaked key becomes a hot abusive key |
| tenant_id | Multi-tenant SaaS protection | One noisy user can consume a tenant's shared quota |
| ip_address | Unauthenticated endpoints | NAT can place many real users behind one IP |
| endpoint | Route-specific protection | Different endpoints may need different token costs |
| global | Protecting finite shared capacity | One global key can become a bottleneck |
Design question
Clarify whether the limiter protects infrastructure, prevents abuse, or enforces a paid quota. Those goals look similar at first, but they require different accuracy, latency, availability, and auditability trade-offs.
Estimate the Load
For an API serving 50 million requests per day, the first calculation is straightforward: 50,000,000 / 86,400 ≈ 579 RPS. That is only the daily average. Real traffic is rarely flat. A product launch, scheduled job, regional morning peak, or retry storm can push a system far above the average for a short period.
Average RPS tells us the baseline load on the limiter. Peak RPS tells us whether a Redis node, network hop, or global counter becomes a bottleneck. Active-key cardinality tells us how much state we retain. Endpoint cost tells us whether every request should consume the same amount of capacity. A useful estimate should cover all four.
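The four estimates above can be put side by side with a few lines of arithmetic. This is a minimal sketch: only the 50 million requests per day figure comes from the text, while the peak factor, active-key count, bytes per key, and mean request cost are placeholder assumptions to be replaced with measured values.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(requests_per_day, peak_factor, active_keys, bytes_per_key, mean_cost):
    """Return the four planning numbers: average, peak, state size, token demand."""
    average_rps = requests_per_day / SECONDS_PER_DAY
    peak_rps = average_rps * peak_factor
    return {
        "average_rps": average_rps,
        "peak_rps": peak_rps,
        # Rough retained state: one small record per active policy key.
        "state_bytes": active_keys * bytes_per_key,
        # Capacity demand when requests spend more than one token on average.
        "peak_tokens_per_second": peak_rps * mean_cost,
    }


# Only requests_per_day comes from the text; the other inputs are assumptions.
estimate = estimate_load(
    requests_per_day=50_000_000,
    peak_factor=10,
    active_keys=2_000_000,
    bytes_per_key=100,
    mean_cost=1.5,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

The point of the exercise is less the exact numbers than seeing which one dominates: a high peak factor stresses Redis throughput, while high key cardinality stresses memory.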
Why a Redis Counter With Expiry Is Incomplete
A counter with an expiry is a reasonable starting point. It is the fixed window counter: increment a key for the current minute, set a TTL, and reject requests after the limit. It is fast, simple, and often good enough for internal tools or low-risk controls.
    count = INCR(key)
    if count == 1:
        EXPIRE(key, 60)
    if count > limit:
        reject
    else:
        allow

The weakness appears at the window boundary. With a limit of 100 requests per minute, a client can send 100 requests at 10:00:59 and another 100 requests at 10:01:00. Both windows are technically valid, but the service receives 200 requests almost immediately. The Redis counter is not wrong; the policy is simply too coarse for that traffic shape.
Failure mode
The obvious design is useful when its failure mode is acceptable. Engineering judgment starts with naming the failure mode before choosing the implementation.
Core Rate-Limiting Algorithms
Fixed Window Counter
Count requests inside a fixed interval such as a second, minute, or day. Each interval gets a distinct key with a TTL. The algorithm needs constant memory per active key and works well when simplicity matters more than precise smoothing. Its defining weakness is the boundary burst.
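The boundary burst is easy to reproduce in memory. The sketch below is an illustration rather than a production limiter: the class name is hypothetical, time is passed in explicitly so the behavior is deterministic, and old windows are never evicted (in Redis the TTL would handle that).

```python
class FixedWindowLimiter:
    """In-memory fixed window counter keyed by (policy key, window index)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # (key, window index) -> accepted requests

    def allow(self, key, now):
        window = int(now // self.window_seconds)
        bucket = (key, window)
        used = self.counts.get(bucket, 0)
        if used >= self.limit:
            return False
        self.counts[bucket] = used + 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)

# 100 requests in the last second of one window, 100 in the first second of the next.
late = sum(limiter.allow("user:123", now=59.0) for _ in range(100))
early = sum(limiter.allow("user:123", now=60.0) for _ in range(100))
print(late + early)  # accepted within roughly one second
```

Both batches are accepted because they fall into different windows, even though they arrive one second apart.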
Sliding Window Log
Store the timestamp of every accepted request, commonly in a Redis sorted set. On each decision, remove timestamps older than the rolling window, count what remains, and accept only when the count is below the limit. This closely models 'no more than N requests in any 60-second period,' but its memory use grows with request volume.
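The same idea can be sketched in memory with a deque standing in for the sorted set. The class name is hypothetical, time is injected for determinism, and treating a timestamp exactly one window old as expired is a choice made for this sketch.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request, per policy key."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = {}  # key -> deque of accepted timestamps, oldest first

    def allow(self, key, now):
        log = self.logs.setdefault(key, deque())
        cutoff = now - self.window_seconds
        # Drop timestamps that have left the rolling window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


limiter = SlidingWindowLog(limit=100, window_seconds=60)
burst = sum(limiter.allow("login:ip:203.0.113.10", now=59.0) for _ in range(100))
after_boundary = limiter.allow("login:ip:203.0.113.10", now=60.0)
print(burst, after_boundary)
```

Unlike the fixed window, the request at second 60 is rejected because the 100 earlier timestamps are still inside the rolling window. The cost is visible too: the deque holds one entry per accepted request.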
Sliding Window Counter
Store counts for the current and previous fixed windows, then weight the previous window by how much of it still overlaps the current rolling interval. If 25% of the current minute has elapsed, 75% of the previous minute still contributes. This reduces boundary spikes while keeping constant per-key state, but the result is approximate.
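A small in-memory sketch of the weighting, under stated assumptions: the class name is hypothetical, time is injected, and a request is admitted only if the estimate plus the new request stays within the limit (other admission rules are possible).

```python
class SlidingWindowCounter:
    """Approximates a rolling window from two fixed-window counts."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # (key, window index) -> accepted requests

    def estimate(self, key, now):
        window = int(now // self.window_seconds)
        elapsed_ratio = (now % self.window_seconds) / self.window_seconds
        current = self.counts.get((key, window), 0)
        previous = self.counts.get((key, window - 1), 0)
        # The previous window contributes only the part that still overlaps.
        return current + previous * (1.0 - elapsed_ratio)

    def allow(self, key, now):
        if self.estimate(key, now) + 1 > self.limit:
            return False
        window = int(now // self.window_seconds)
        self.counts[(key, window)] = self.counts.get((key, window), 0) + 1
        return True


limiter = SlidingWindowCounter(limit=100, window_seconds=60)
# Seed the state directly: 80 requests in the previous minute, 40 in the current one.
limiter.counts[("user:123", 0)] = 80
limiter.counts[("user:123", 1)] = 40
print(limiter.estimate("user:123", now=75.0))  # 25% of the current minute elapsed
```

The estimate assumes the previous window's requests were spread evenly, which is where the approximation comes from.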
    estimated_usage = current_count + previous_count * remaining_window_ratio

    # Example: 25% of current minute has elapsed
    estimated_usage = 40 + 80 * 0.75
    estimated_usage = 100

Token Bucket
Give each policy key a bucket with a maximum capacity and a steady refill rate. Each request spends one or more tokens. A full bucket permits a controlled burst, while sustained traffic settles at the refill rate. A bucket with capacity 100 and refill rate 10 tokens per second can accept a burst of 100 requests, then continue at roughly 10 RPS.
Token bucket is a strong default for general APIs because it stores only two values per key: the remaining token balance and the last refill time. It supports weighted costs as well. A product lookup might cost one token, an invoice PDF five, and an AI report twenty.
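A single-process sketch of the two-value state and weighted costs. The class and the cost table are illustrative (the costs mirror the examples in the text), and time is passed in explicitly rather than read from a clock.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_rate: float  # tokens per second
    tokens: float       # remaining balance
    last_refill: float  # timestamp of the last refill calculation

    def allow(self, now, cost=1.0):
        # Refill lazily based on elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


# Hypothetical cost table matching the examples in the text.
COSTS = {"product_lookup": 1, "invoice_pdf": 5, "ai_report": 20}

bucket = TokenBucket(capacity=100, refill_rate=10, tokens=100, last_refill=0.0)
burst = sum(bucket.allow(now=0.0) for _ in range(101))      # full bucket absorbs 100
steady = sum(bucket.allow(now=1.0) for _ in range(11))      # one second refills 10
report = bucket.allow(now=3.0, cost=COSTS["ai_report"])     # two seconds refill 20
print(burst, steady, report)
```

Capacity and refill rate are independent knobs: capacity sets how large a burst is tolerated, and refill rate sets the sustained throughput.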
Leaky Bucket
Place incoming work into a bounded queue and drain it at a steady rate. When the queue is full, drop or reject new work. The output becomes smooth, which is valuable for background processing and downstream services with predictable capacity. The trade-off is queueing latency. A leaky bucket is often a traffic-shaping mechanism, while a token bucket is often an admission-control mechanism.
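The queueing behavior can be sketched as a bounded deque with a drain step. Names are hypothetical, time is injected, and the drain is driven by explicit calls here, whereas a real system would run it from a worker loop or timer.

```python
from collections import deque


class LeakyBucketQueue:
    """Bounded queue that releases work at a fixed rate."""

    def __init__(self, capacity, drain_rate, now):
        self.capacity = capacity
        self.drain_rate = drain_rate  # items released per second
        self.queue = deque()
        self.last_drain = now

    def offer(self, item):
        # Reject new work when the queue is full.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(item)
        return True

    def drain(self, now):
        # Number of whole items the elapsed time allows us to release.
        due = int((now - self.last_drain) * self.drain_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due > 0:
            # Unused release slots are discarded, so output never bursts.
            self.last_drain += due / self.drain_rate
        return released


q = LeakyBucketQueue(capacity=3, drain_rate=2, now=0.0)
accepted = [q.offer(job) for job in ("a", "b", "c", "d")]
first = q.drain(now=0.4)
second = q.drain(now=1.0)
print(accepted, first, second)
```

Job "d" is rejected because the queue is full, and jobs that were accepted wait for their drain slot, which is the queueing latency the trade-off refers to.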
Concurrency Limiter
A concurrency limiter answers a different question: how many expensive operations may run at the same time? A user might be allowed 100 report requests per hour but only five active report generations. Rate limits protect capacity over time; concurrency limits protect scarce workers, connections, and downstream dependencies in the moment. Many systems need both.
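A minimal in-process sketch of an in-flight cap, using the five-active-reports example from the text. The class is hypothetical. A distributed version would also need leases or timeouts so that a crashed worker does not hold a slot forever, which this sketch does not attempt.

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Caps the number of in-flight operations per policy key."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = {}  # key -> currently running operations
        self._lock = threading.Lock()

    def try_acquire(self, key):
        with self._lock:
            running = self._in_flight.get(key, 0)
            if running >= self._max:
                return False
            self._in_flight[key] = running + 1
            return True

    def release(self, key):
        with self._lock:
            running = self._in_flight.get(key, 0)
            if running <= 1:
                self._in_flight.pop(key, None)
            else:
                self._in_flight[key] = running - 1

    @contextmanager
    def slot(self, key):
        """Yield True if a slot was acquired; always release what was acquired."""
        acquired = self.try_acquire(key)
        try:
            yield acquired
        finally:
            if acquired:
                self.release(key)


reports = ConcurrencyLimiter(max_in_flight=5)
started = [reports.try_acquire("user:123") for _ in range(6)]
print(started)  # the sixth concurrent report is refused
```

The slot is tied to the lifetime of the operation rather than to a time window, so the release path matters as much as the acquire path.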
Choosing the Algorithm
| Algorithm | State per key | Burst behavior | Accuracy | Good fit |
|---|---|---|---|---|
| Fixed window | O(1) | Boundary burst | Approximate | Simple low-risk limits |
| Sliding window log | O(requests in window) | Strict rolling window | High | Security-sensitive actions |
| Sliding window counter | O(1) | Smoothed boundary | Approximate | Large public APIs |
| Token bucket | O(1) | Controlled by capacity | Policy-driven | General API protection |
| Leaky bucket | O(queue size) | Smooth output | Queue-driven | Background work and shaping |
| Concurrency limiter | O(active operations) | Caps in-flight work | Exact per store | Expensive endpoints |
For broad API protection, start with token bucket and tune capacity separately from refill rate. Use a sliding window log when strict rolling-window behavior is worth the memory cost, such as suspicious login attempts. Use a sliding window counter when the fixed-window boundary problem matters but an approximate result is acceptable. Keep concurrency controls beside rate limits for expensive work.
From One Process to Distributed Enforcement
On one application server, an in-memory map is often enough. The limiter avoids a network call and can make decisions extremely quickly. This is useful for a local service, a single worker, or an early layer that rejects obviously abusive traffic.
    Client
      |
      v
    API server
      |
      v
    In-memory bucket map
      |
      +-- token available --> handler
      |
      +-- bucket empty -----> 429
Once the API runs on several replicas, isolated in-memory state stops being a reliable source of truth. If three servers each allow 100 requests, a client routed across all three can receive 300 accepted requests. The distributed design needs shared state or an explicitly approximate allocation strategy.
    Client
      |
      v
    Load balancer
      |
      +--> API server A --+
      |                   |
      +--> API server B --+--> Redis rate-limit state
      |                   |
      +--> API server C --+
                          |
                          +-- allow --> API handler
                          +-- deny ---> 429

Shared storage alone does not make the decision correct. A token bucket requires a read, refill calculation, decision, decrement, and write. If the application executes those as separate network operations, concurrent requests can both observe the same token and both pass.
    Request A reads tokens = 1
    Request B reads tokens = 1
    Request A allows and writes tokens = 0
    Request B allows and writes tokens = 0

The result is two accepted requests even though only one token existed. The check and update must be one atomic operation.
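An in-process analogy makes the requirement concrete. In the sketch below a lock makes the check and the decrement one indivisible step, which is the role atomic server-side execution plays for shared state. The class and helper names are hypothetical, and threads stand in for API replicas.

```python
import threading


class AtomicBucket:
    """Check and decrement happen under one lock, so a token is spent once."""

    def __init__(self, tokens):
        self.tokens = tokens
        self._lock = threading.Lock()

    def try_spend(self, cost=1):
        with self._lock:
            if self.tokens < cost:
                return False
            self.tokens -= cost
            return True


def race(bucket, workers):
    """Release all workers at once and collect their decisions."""
    results = []
    results_lock = threading.Lock()
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()  # line everyone up to maximize contention
        allowed = bucket.try_spend()
        with results_lock:
            results.append(allowed)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


bucket = AtomicBucket(tokens=1)
results = race(bucket, workers=8)
print(sum(results))  # number of requests that were allowed
```

With one token and eight contenders, exactly one request is allowed. Splitting `try_spend` into a separate read and write would reintroduce the interleaving shown above.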
Production Patterns
Token Bucket With Redis Lua
A compact Redis Lua script keeps the hot-path decision atomic. It reads the bucket, computes refilled tokens, rejects or deducts the requested cost, stores the new state, and applies a TTL before another request can interleave. The TTL removes inactive policy keys after enough idle time for a bucket to refill.
    -- KEYS[1]: bucket key
    -- ARGV[1]: capacity, ARGV[2]: refill rate (tokens/second), ARGV[3]: cost (default 1)
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local requested = tonumber(ARGV[3]) or 1

    -- Redis server time as {seconds, microseconds}
    local now_parts = redis.call("TIME")
    local now = tonumber(now_parts[1]) + tonumber(now_parts[2]) / 1000000

    -- A missing key is treated as a full bucket
    local data = redis.call("HMGET", key, "tokens", "last_refill")
    local tokens = tonumber(data[1]) or capacity
    local last_refill = tonumber(data[2]) or now

    local elapsed = math.max(0, now - last_refill)
    local refilled = math.min(capacity, tokens + elapsed * refill_rate)

    -- Reject without writing: the stored state yields the same refill next time.
    -- Redis truncates Lua numbers in replies to integers, so the returned balance is rounded down.
    if refilled < requested then
      return {0, refilled}
    end

    local remaining = refilled - requested
    redis.call("HSET", key, "tokens", remaining, "last_refill", now)
    -- Expire idle buckets after twice the time a full refill takes
    redis.call("EXPIRE", key, math.ceil(capacity / refill_rate) * 2)
    return {1, remaining}

Using Redis TIME avoids trusting a client timestamp or the wall clock of whichever API replica handled the request. Another valid design is to pass time from a trusted application clock when that behavior is intentional and measured. Either way, scripts should remain short: atomic server-side execution is valuable precisely because it serializes the work, and a long script blocks every other command while it runs.
Production note
The old advice still belongs here: do not implement token bucket as GET, calculate, then SET from application code. Lua keeps refill, decrement, decision, and TTL update together as one atomic operation.
The key shape should make the policy visible. Examples include rate_limit:user:123, rate_limit:api_key:partner_42, rate_limit:tenant:pharmacy_7:endpoint:invoice_pdf, and rate_limit:login:ip:203.0.113.10. Explicit keys make debugging and observability much easier.
Scaling Redis Without Creating a New Bottleneck
At roughly 579 average RPS, the scale estimated earlier, a carefully implemented Redis-backed limiter is not exotic. The design becomes interesting when peaks grow, policy keys multiply, or a single global key receives every request. Redis Cluster distributes the keyspace across 16,384 hash slots, so per-user, per-key, and per-tenant state can spread across shards.
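To see how key names map to slots, the sketch below reimplements the slot calculation described in the Redis Cluster specification: CRC16 (the XMODEM variant) of the key modulo 16,384, hashing only the hash-tag portion between the first `{` and the next `}` when it is non-empty. The function names are my own, and the braces in the example keys are an assumption about how one might co-locate a tenant's keys, not something the key examples in this post use.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 with polynomial 0x1021 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: str) -> int:
    """Return the cluster slot for a key, honoring a non-empty {hash tag}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc16_xmodem(data) % 16384


# Keys sharing a hash tag land in the same slot; untagged keys spread out.
tagged_a = key_hash_slot("rate_limit:{tenant:pharmacy_7}:endpoint:invoice_pdf")
tagged_b = key_hash_slot("rate_limit:{tenant:pharmacy_7}:endpoint:search")
plain = key_hash_slot("rate_limit:user:123")
print(tagged_a, tagged_b, plain)
```

Hash tags are a double-edged tool for rate limiting: they let a multi-key script touch related keys on one shard, but they also concentrate a busy tenant's traffic on that shard.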
| Pressure point | Why it happens | Practical response |
|---|---|---|
| Hot global key | Every request updates one bucket | Prefer distributed admission layers or accept approximation when splitting capacity |
| Hot user or API key | One caller floods a single shard key | Reject early with a local limiter, WAF rule, or temporary block |
| High cardinality | Random IPs create many retained keys | Use TTLs, prefix policies, and bounded local protection |
| Large sliding logs | A timestamp is stored for each request | Clean aggressively or choose a constant-state algorithm |
| Redis round trips | Every accepted request checks shared state | Use local pre-limiters and keep scripts small |
| Uneven shards | Key distribution or hot tenants concentrate load | Observe per-shard load and revisit partitioning |
A local in-memory limiter in front of Redis is a useful first layer. It is not the global source of truth. Its job is to reject obvious abuse cheaply, reduce pressure during attacks, and provide a degraded fallback when shared state is unavailable.
    Request
      |
      v
    Local in-memory pre-limiter
      |
      +-- obvious abuse --> 429
      |
      v
    Redis distributed limiter
      |
      +-- policy exceeded --> 429
      |
      v
    API handler
Failure Behavior
A production design needs an explicit answer for Redis timeouts and outages. There is no universal default. The correct behavior depends on what the limiter protects and what failure is more expensive.
| Mode | Behavior when shared limiter fails | Good fit | Risk |
|---|---|---|---|
| Fail open | Temporarily allow requests | Low-risk endpoints where availability dominates | Abuse may pass during the outage |
| Fail closed | Reject requests | Sensitive operations and infrastructure at risk of overload | Limiter failure becomes an API outage |
| Degraded local mode | Use stricter per-process limits with alerts | General APIs needing a balanced fallback | Limits become approximate across replicas |
The fallback should be observable and time-bounded. A hidden fail-open path can quietly remove protection. A hidden fail-closed path can quietly block good users. Record the transition, alert on it, and make the recovery behavior explicit.
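One way to keep the failure policy explicit is to make it a parameter of the decision function. The sketch below is hypothetical throughout: `shared_check` stands in for the Redis call, `local_check` for a stricter per-process limiter, and `on_fallback` for whatever metric or alert hook the system uses. Time-bounding the fallback, for example with a circuit breaker, is left out.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"
    DEGRADED_LOCAL = "degraded_local"


def decide(key, shared_check, local_check, mode, on_fallback):
    """Return True to allow the request, applying `mode` if shared state fails."""
    try:
        return shared_check(key)
    except (TimeoutError, ConnectionError) as exc:
        # Never fall back silently: record the transition first.
        on_fallback(mode, exc)
        if mode is FailureMode.FAIL_OPEN:
            return True
        if mode is FailureMode.FAIL_CLOSED:
            return False
        return local_check(key)


def broken_shared_check(key):
    raise TimeoutError("shared limiter timed out")


events = []
record = lambda mode, exc: events.append(mode)

open_result = decide("user:123", broken_shared_check, lambda k: False,
                     FailureMode.FAIL_OPEN, record)
closed_result = decide("user:123", broken_shared_check, lambda k: True,
                       FailureMode.FAIL_CLOSED, record)
local_result = decide("user:123", broken_shared_check, lambda k: False,
                      FailureMode.DEGRADED_LOCAL, record)
print(open_result, closed_result, local_result, len(events))
```

Because the mode is passed per call, a login route can fail closed while a read-only catalog endpoint fails open, using the same code path.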
Multi-Region Design
A single central Redis deployment adds cross-region latency when users are served from Asia, Europe, and North America. Regional limiters keep the request path fast, but they make a strict global quota harder. If each region independently believes capacity is available, the combined system may allow a small overage.
| Approach | Latency | Global accuracy | Trade-off |
|---|---|---|---|
| Regional buckets with async reconciliation | Low | Approximate | Small overage is accepted for speed and resilience |
| Quota reservation per region | Low on normal requests | Stronger | A busy region can exhaust its allocation while another leaves quota unused |
| Centralized global check | Higher across distant regions | Strongest | Correctness is purchased with latency and a tighter dependency |
Quota reservation is a useful middle ground. A global coordinator divides a quota across regions, such as 400 units for Singapore and 300 each for Europe and the United States. Regions spend local allocations quickly and request more when needed. This reduces global coordination on every request, but it introduces allocation imbalance and refill complexity.
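The allocation imbalance is easy to see in a toy model. In this sketch both classes are hypothetical, the coordinator is an in-process object rather than a remote service, and there is no locking, expiry of unused allocations, or rebalancing.

```python
class QuotaCoordinator:
    """Holds the part of the global quota not yet handed to any region."""

    def __init__(self, total_units):
        self.unallocated = total_units

    def reserve(self, requested_units):
        granted = min(requested_units, self.unallocated)
        self.unallocated -= granted
        return granted


class RegionalQuota:
    """Spends a local allocation and asks the coordinator for more when empty."""

    def __init__(self, coordinator, initial_units, refill_chunk):
        self.coordinator = coordinator
        self.refill_chunk = refill_chunk
        self.local_units = coordinator.reserve(initial_units)

    def try_spend(self, cost=1):
        if self.local_units < cost:
            # Only this path needs cross-region coordination.
            self.local_units += self.coordinator.reserve(max(self.refill_chunk, cost))
        if self.local_units < cost:
            return False
        self.local_units -= cost
        return True


coordinator = QuotaCoordinator(total_units=1000)
singapore = RegionalQuota(coordinator, initial_units=400, refill_chunk=50)
europe = RegionalQuota(coordinator, initial_units=300, refill_chunk=50)
united_states = RegionalQuota(coordinator, initial_units=300, refill_chunk=50)

accepted = sum(singapore.try_spend() for _ in range(450))
print(accepted, europe.local_units, coordinator.unallocated)
```

Singapore is cut off at 400 even though Europe still holds 300 unused units. Handing out smaller chunks on demand reduces that imbalance at the cost of more coordinator traffic.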
Infrastructure Protection Is Not Billing
Infrastructure protection asks whether the system stays healthy under load. A small amount of over-allowance is often acceptable if the alternative is a slow or fragile central dependency. Token buckets, regional state, local fallback, and eventual reconciliation work well here.
Billing-grade quota enforcement asks whether a customer consumed more than the product contract allows. That requires durable records, auditability, reconciliation, idempotent accounting, and stronger consistency. A monthly allowance of 1,000 paid exports should not quietly become 2,000 because several regions accepted requests independently.
Rule of thumb
Use the hot-path limiter to protect service capacity. Use durable accounting to settle business quota. They can cooperate, but treating them as the same system creates avoidable correctness problems.
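To illustrate what idempotent, durable accounting can look like next to the hot-path limiter, here is a sketch using SQLite from the standard library. The table layout, function names, and period format are assumptions for illustration. A real billing system would run on a server database, with a transaction isolation level or locking strategy that makes the check and insert safe across concurrent writers.

```python
import sqlite3


def open_ledger(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS usage_events (
               idempotency_key TEXT PRIMARY KEY,
               tenant_id TEXT NOT NULL,
               period TEXT NOT NULL,
               units INTEGER NOT NULL
           )"""
    )
    return conn


def total_used(conn, tenant_id, period):
    row = conn.execute(
        "SELECT COALESCE(SUM(units), 0) FROM usage_events "
        "WHERE tenant_id = ? AND period = ?",
        (tenant_id, period),
    ).fetchone()
    return row[0]


def record_usage(conn, tenant_id, period, idempotency_key, units, quota):
    """Record usage once per idempotency key; return False if it exceeds quota."""
    with conn:  # commits on success, rolls back on exception
        seen = conn.execute(
            "SELECT 1 FROM usage_events WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if seen:
            return True  # a retry of an already-recorded event is not charged again
        if total_used(conn, tenant_id, period) + units > quota:
            return False
        conn.execute(
            "INSERT INTO usage_events VALUES (?, ?, ?, ?)",
            (idempotency_key, tenant_id, period, units),
        )
        return True


ledger = open_ledger()
first = record_usage(ledger, "pharmacy_7", "2026-05", "export-001", 600, quota=1000)
retry = record_usage(ledger, "pharmacy_7", "2026-05", "export-001", 600, quota=1000)
over = record_usage(ledger, "pharmacy_7", "2026-05", "export-002", 500, quota=1000)
print(first, retry, over, total_used(ledger, "pharmacy_7", "2026-05"))
```

The ledger answers an auditing question the hot-path bucket cannot: which specific events consumed the allowance, and whether any of them were counted twice.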
API Contract and Client Behavior
When a request is throttled, respond with HTTP 429 Too Many Requests. RFC 6585 defines the status code and says the response may include Retry-After. Clients should respect that signal and use exponential backoff with jitter instead of retrying in lockstep.
    HTTP/1.1 429 Too Many Requests
    Retry-After: 30
    Content-Type: application/json

    {
      "code": "RATE_LIMIT_EXCEEDED",
      "message": "Rate limit exceeded. Try again later."
    }

Many APIs also expose practical quota information so clients can reduce traffic before receiving a 429. The exact contract belongs in the API documentation, and it should not reveal sensitive internal capacity details.
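On the client side, respecting Retry-After and backing off with jitter might look like the sketch below. The function name and default values are assumptions. It uses "full jitter" (a random delay between zero and the exponential cap), handles only the delta-seconds form of Retry-After rather than the HTTP-date form, and adds the jitter on top of the server's hint so that clients told to wait 30 seconds do not all return at the same instant.

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential growth, bounded so delays do not grow without limit.
    backoff = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, backoff) to spread clients out.
    jittered = rng() * backoff
    if retry_after is not None:
        # Never retry earlier than the server asked; jitter breaks lockstep.
        return retry_after + jittered
    return jittered


# A fixed "random" source makes the schedule easy to inspect.
midpoint = lambda: 0.5
schedule = [retry_delay(n, rng=midpoint) for n in range(4)]
with_hint = retry_delay(0, retry_after=30, rng=midpoint)
print(schedule, with_hint)
```

Retries should also be limited in number and reserved for idempotent requests, otherwise backoff only delays the amplification.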
Standards note - May 31, 2026
RateLimit and RateLimit-Policy are defined by draft-ietf-httpapi-ratelimit-headers-11, an active IETF Internet-Draft updated May 23, 2026. Treat these fields as work in progress rather than a finalized RFC. Retry-After and HTTP 429 remain the stable baseline.
Edge Cases Worth Designing For
Edge cases are where a limiter stops being a data-structure exercise and becomes a systems problem. The right response is often a layered policy rather than one increasingly complicated bucket.
| Edge case | Failure mode | Design response |
|---|---|---|
| Window boundary burst | A fixed window admits roughly 2x traffic near a boundary | Use token bucket or a sliding-window approach |
| Non-atomic update | Concurrent requests spend the same token | Use atomic INCR where sufficient or a short Redis Lua script |
| Clock skew | Replicas calculate different refill amounts | Use a trusted time source such as Redis TIME; never trust client time |
| Redis outage | The shared decision path disappears | Choose fail open, fail closed, or degraded local mode explicitly |
| Hot user or API key | One key overloads a shard | Pre-limit locally, block abusive keys, and observe shard load |
| NAT or shared IP | Many legitimate users share one address | Prefer identity-based limits after authentication |
| IPv6 rotation | An attacker cycles addresses | Consider prefix-based controls and combine signals carefully |
| Login abuse | Credential stuffing bypasses a single policy | Layer account, IP, subnet, and failure-count limits |
| Expensive endpoint | Cheap and costly requests consume equal allowance | Spend weighted tokens and add concurrency control |
| Unauthenticated bot burst | No user ID exists yet | Use CDN or WAF protection, IP limits, and progressive challenges |
| Plan change | Cached limits remain stale after upgrade or downgrade | Version policies and invalidate cached configuration |
| Multi-region overage | Regions independently accept beyond a global quota | Reserve quota or reconcile with an accepted tolerance |
| Retry storm | Clients retry together and amplify load | Return Retry-After and require exponential backoff with jitter |
| Memory growth | Random identities leave endless state | Set TTLs, clean logs, and monitor key cardinality |
| Multi-account abuse | An attacker spreads traffic across accounts | Combine tenant, device, network, and fraud signals |
End-to-End Reference Architecture
    Client
      |
      v
    CDN / WAF
      |
      v
    Load balancer
      |
      v
    API gateway
      |
      +--> local pre-limiter
      |      |
      |      +-- abusive burst --> 429
      |
      +--> Redis Cluster distributed limiter
      |      |
      |      +-- policy exceeded --> 429 + Retry-After
      |
      +--> auth, validation, and routing
      |
      v
    API service
      |
      v
    downstream dependencies

    PostgreSQL: plans, subscriptions, audit records, billing reconciliation
    Metrics: allowed, blocked, errors, fallback state, latency, hot keys

Redis belongs on the fast admission path. PostgreSQL belongs behind the durable product rules: subscription plans, tenant configuration, audit history, and billing reconciliation. CDN and WAF controls absorb obvious unauthenticated abuse before it reaches application infrastructure. Each layer has a narrower job.
Observability
A limiter can fail in two directions: it can allow traffic that should be blocked, or block users who should be allowed. Both failures can hide behind an apparently healthy API. Instrument the decision path as a first-class production component.
- rate_limiter_allowed_total and rate_limiter_blocked_total, partitioned by policy and endpoint
- http_429_total by endpoint, tenant, API key, and region
- rate_limiter_redis_latency_ms and rate_limiter_redis_errors_total
- rate_limiter_fallback_mode_total with alerts for entry and duration
- rate_limiter_hot_keys_detected_total and Redis per-shard CPU or memory pressure
- Policy-cardinality dashboards for IP, user, tenant, and API-key keys
Use sampled decision logs for investigation: policy key type, endpoint, requested cost, allow or reject result, region, fallback mode, and a safe reason code. Avoid leaking raw secrets such as API keys into logs.
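A sampled decision record with the fields listed above could be built as follows. Everything here is a sketch under assumptions: the function names, field names, reason code, and sample rate are hypothetical, and the keyed HMAC fingerprint is one possible way to correlate log lines for the same caller without writing the raw identifier.

```python
import hashlib
import hmac
import json
import random


def fingerprint(identifier, secret):
    """Stable, non-reversible tag for an identifier such as an API key."""
    digest = hmac.new(secret, identifier.encode(), hashlib.sha256).hexdigest()
    return digest[:16]


def build_decision_record(key_type, identifier, endpoint, cost, allowed,
                          region, fallback_mode, reason, secret):
    return {
        "policy_key_type": key_type,
        "policy_key_fingerprint": fingerprint(identifier, secret),
        "endpoint": endpoint,
        "requested_cost": cost,
        "allowed": allowed,
        "region": region,
        "fallback_mode": fallback_mode,
        "reason": reason,
    }


def should_sample(sample_rate, rng=random.random):
    """Log roughly `sample_rate` of decisions to keep volume manageable."""
    return rng() < sample_rate


SECRET = b"log-fingerprint-secret"  # placeholder; load from configuration in practice
record = build_decision_record(
    key_type="api_key",
    identifier="partner_42_raw_secret_value",
    endpoint="invoice_pdf",
    cost=5,
    allowed=False,
    region="ap-southeast",
    fallback_mode=None,
    reason="BUCKET_EMPTY",
    secret=SECRET,
)
if should_sample(sample_rate=1.0):
    print(json.dumps(record))
```

Blocked decisions are often sampled at a higher rate than allowed ones, since they are rarer and more likely to be investigated.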
A Strong Design Answer
A strong answer is structured. Clarify the policy key and whether the goal is infrastructure safety or business quota. Estimate the average and peak load. Choose an algorithm based on burst behavior and accuracy. Make the update atomic. Explain shard keys, hot keys, TTLs, failure behavior, regional consistency, client backoff, and observability.
Interview-ready summary
For 50 million requests per day, the average is about 579 RPS, but I would design for peak traffic. For general API protection I would start with token bucket because it allows controlled bursts with constant per-key state. Across replicas I would keep shared state in Redis and execute refill, decision, decrement, and TTL update atomically with a short Lua script. I would shard by user, API key, or tenant, add local pre-limiters for abuse and degraded operation, and choose regional approximation or stricter quota reservation based on whether I am protecting infrastructure or enforcing billing-grade quota.
Where to Go Deeper
The most useful next step is to go one level deeper into the decisions that change production behavior. These references and topics extend the design without turning it into a single oversized limiter.
- HTTP 429 and Retry-After: RFC 6585, Section 4
- RateLimit and RateLimit-Policy Internet-Draft
- Redis Lua Scripting and Atomic Execution
- Redis Cluster Hash Slots and Shard Behavior
- Token bucket tuning: capacity, refill rate, weighted cost, and the burst budget
- Multi-region quota allocation: reservation, reconciliation, and acceptable overage
- Client resilience: Retry-After, exponential backoff, jitter, and idempotency
The durable mental model is simple: the algorithm decides how capacity is counted, storage decides where state lives, atomicity prevents double-spending, scaling distributes pressure, failure policy defines degraded behavior, and the product requirement decides which trade-offs are acceptable.