Engineering Notes
Engineering Patterns · May 31, 2026 · 22 min read

Designing API Rate Limiters: Algorithms, Distributed Systems, and Production Trade-offs

A practical guide to rate-limiting algorithms, atomic Redis enforcement, failure modes, multi-region trade-offs, and the edge cases that shape production API design.

Rate Limiting · Redis · Distributed Systems

Summary

A production rate limiter is not just a Redis counter with an expiry. Start by defining the policy, choose an algorithm that matches the traffic shape, make the decision atomic, design for Redis and regional failures, and separate approximate infrastructure protection from billing-grade quota enforcement.

In this post, I'll walk through the key concepts with code examples drawn from real production implementations.

Start With the Problem, Not Redis

A rate limiter controls how quickly a client can consume a shared resource. It protects an API from abusive traffic, accidental retry loops, credential-stuffing attempts, noisy tenants, and sudden bursts that would otherwise overload downstream services. It can also enforce product limits such as a free plan allowing 1,000 invoice exports per month.

The first design question is not which Redis command to use. It is what should be limited. A public endpoint may need an IP-based limit before authentication. An authenticated API may use a user ID, API key, or tenant ID. A login route often needs several policies at once: account, IP address, and subnet. An expensive report endpoint may need weighted costs or a separate concurrency limit.

Policy key | Good fit | Important caveat
user_id | Authenticated user fairness | One user may legitimately use several devices
api_key | Developer and partner APIs | A leaked key becomes a hot abusive key
tenant_id | Multi-tenant SaaS protection | One noisy user can consume a tenant's shared quota
ip_address | Unauthenticated endpoints | NAT can place many real users behind one IP
endpoint | Route-specific protection | Different endpoints may need different token costs
global | Protecting finite shared capacity | One global key can become a bottleneck

Design question

Clarify whether the limiter protects infrastructure, prevents abuse, or enforces a paid quota. Those goals look similar at first, but they require different accuracy, latency, availability, and auditability trade-offs.

Estimate the Load

For an API serving 50 million requests per day, the first calculation is straightforward: 50,000,000 / 86,400 ≈ 579 RPS. That is only the daily average. Real traffic is rarely flat. A product launch, scheduled job, regional morning peak, or retry storm can push a system far above the average for a short period.

50M requests per day · ~579 average RPS · Peak > avg (design assumption)

Average RPS tells us the baseline load on the limiter. Peak RPS tells us whether a Redis node, network hop, or global counter becomes a bottleneck. Active-key cardinality tells us how much state we retain. Endpoint cost tells us whether every request should consume the same amount of capacity. A useful estimate should cover all four.
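The arithmetic above can be sketched in a few lines. The function names and the 10x peak-to-average ratio are illustrative assumptions rather than figures from this post; a real ratio should come from measured traffic.

```python
SECONDS_PER_DAY = 86_400


def average_rps(requests_per_day: int) -> float:
    """Average requests per second implied by a daily request volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: int, peak_to_average: float) -> float:
    """Peak estimate from an assumed peak-to-average ratio."""
    return average_rps(requests_per_day) * peak_to_average


avg = average_rps(50_000_000)
print(f"average: {avg:.0f} RPS")  # 50,000,000 / 86,400 is about 579

# The 10x ratio is purely illustrative; measure the real one.
print(f"peak at 10x: {peak_rps(50_000_000, 10.0):.0f} RPS")
```

The same habit applies to the other two inputs: count active keys per policy to estimate retained state, and list per-endpoint costs before assuming every request is equal.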

Why a Redis Counter With Expiry Is Incomplete

A counter with an expiry is a reasonable starting point. It is the fixed window counter: increment a key for the current minute, set a TTL, and reject requests after the limit. It is fast, simple, and often good enough for internal tools or low-risk controls.

fixed_window.txt
count = INCR(key)
if count == 1:
    EXPIRE(key, 60)
if count > limit:
    reject
else:
    allow

The weakness appears at the window boundary. With a limit of 100 requests per minute, a client can send 100 requests at 10:00:59 and another 100 requests at 10:01:00. Both windows are technically valid, but the service receives 200 requests almost immediately. The Redis counter is not wrong; the policy is simply too coarse for that traffic shape.
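The boundary burst is easy to reproduce with a small in-memory model. This is a minimal sketch, not a Redis client: the class name is hypothetical, timestamps are passed in explicitly so the example is deterministic, and old windows are never cleaned up (a real store would rely on the TTL for that).

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed window counter; timestamps are passed in explicitly."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (key, window index) -> accepted count; never expired in this sketch.
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)  # index of the current window
        if self.counts[(key, window)] >= self.limit:
            return False
        self.counts[(key, window)] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)

# 100 requests in the last second of one window, 100 in the first second of the next.
late = sum(limiter.allow("client", now=59.0) for _ in range(100))
early = sum(limiter.allow("client", now=60.0) for _ in range(100))
print(late + early)  # 200 accepted within about one second
```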

Failure mode

The obvious design is useful when its failure mode is acceptable. Engineering judgment starts with naming the failure mode before choosing the implementation.

Core Rate-Limiting Algorithms

Fixed Window Counter

Count requests inside a fixed interval such as a second, minute, or day. Each interval gets a distinct key with a TTL. The algorithm needs constant memory per active key and works well when simplicity matters more than precise smoothing. Its defining weakness is the boundary burst.

Sliding Window Log

Store the timestamp of every accepted request, commonly in a Redis sorted set. On each decision, remove timestamps older than the rolling window, count what remains, and accept only when the count is below the limit. This closely models 'no more than N requests in any 60-second period,' but its memory use grows with request volume.
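A minimal in-memory sketch of the log approach, assuming a single process: a deque stands in for the Redis sorted set, and the class name and explicit `now` parameter are illustrative choices, not part of any library.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """In-memory sliding window log; a deque stands in for the sorted set."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        log = self.logs[key]
        # Drop timestamps that have left the rolling window.
        while log and log[0] <= now - self.window_seconds:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)  # only accepted requests are recorded
        return True


limiter = SlidingWindowLog(limit=3, window_seconds=60.0)
print([limiter.allow("login:alice", t) for t in (0.0, 10.0, 20.0, 30.0)])
# [True, True, True, False]
print(limiter.allow("login:alice", 60.0))  # True: the request at t=0 has aged out
```

Each key holds up to `limit` timestamps, which is the memory cost the paragraph above describes.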

Sliding Window Counter

Store counts for the current and previous fixed windows, then weight the previous window by how much of it still overlaps the current rolling interval. If 25% of the current minute has elapsed, 75% of the previous minute still contributes. This reduces boundary spikes while keeping constant per-key state, but the result is approximate.

sliding_window_counter.txt
estimated_usage = current_count + previous_count * remaining_window_ratio

# Example: 25% of current minute has elapsed
estimated_usage = 40 + 80 * 0.75
estimated_usage = 100
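The weighting can be written as a small pure function. This sketch assumes the caller already holds the two window counts and knows how far into the current window it is; the function names and the strict "below the limit" comparison are illustrative choices.

```python
def estimated_usage(current_count: int, previous_count: int,
                    elapsed_seconds: float, window_seconds: float) -> float:
    """Weight the previous window by the share of it still inside the rolling window."""
    remaining_window_ratio = 1.0 - (elapsed_seconds / window_seconds)
    return current_count + previous_count * remaining_window_ratio


def allow(current_count: int, previous_count: int, elapsed_seconds: float,
          window_seconds: float, limit: int) -> bool:
    """Admit the request only while the estimate is still below the limit."""
    usage = estimated_usage(current_count, previous_count,
                            elapsed_seconds, window_seconds)
    return usage < limit


usage = estimated_usage(current_count=40, previous_count=80,
                        elapsed_seconds=15, window_seconds=60)
print(usage)  # 100.0, matching the worked example
```

The estimate assumes the previous window's requests were spread evenly, which is why the result is approximate.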

Token Bucket

Give each policy key a bucket with a maximum capacity and a steady refill rate. Each request spends one or more tokens. A full bucket permits a controlled burst, while sustained traffic settles at the refill rate. A bucket with capacity 100 and refill rate 10 tokens per second can accept a burst of 100 requests, then continue at roughly 10 RPS.

Token bucket is a strong default for general APIs because it stores only two values per key: the remaining token balance and the last refill time. It supports weighted costs as well. A product lookup might cost one token, an invoice PDF five, and an AI report twenty.
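A single-process sketch of the two-value state described above. The class and field names are assumptions for illustration, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_rate: float   # tokens added per second
    tokens: float        # remaining balance
    last_refill: float   # timestamp of the last update, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily, based on the time elapsed since the last update.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)  # never move time backwards
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = TokenBucket(capacity=100, refill_rate=10, tokens=100, last_refill=0.0)
burst = sum(bucket.allow(now=0.0) for _ in range(150))
print(burst)                          # 100: the full bucket absorbs the burst
print(bucket.allow(now=0.5, cost=5))  # True: half a second refills 5 tokens
print(bucket.allow(now=0.5, cost=1))  # False: the balance is back to 0
```

The `cost` argument is where weighted pricing plugs in: the same bucket can charge one token for a lookup and twenty for a heavy report.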

Leaky Bucket

Place incoming work into a bounded queue and drain it at a steady rate. When the queue is full, drop or reject new work. The output becomes smooth, which is valuable for background processing and downstream services with predictable capacity. The trade-off is queueing latency. A leaky bucket is often a traffic-shaping mechanism, while a token bucket is often an admission-control mechanism.
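A minimal sketch of the queue-and-drain behavior, assuming a single process and explicit timestamps. The class name, method names, and the choice to release the first item immediately on arrival are illustrative, not a reference implementation.

```python
from collections import deque


class LeakyBucket:
    """Bounded FIFO queue drained at a fixed rate; time is passed in explicitly."""

    def __init__(self, queue_size: int, drain_rate: float) -> None:
        self.queue_size = queue_size
        self.drain_interval = 1.0 / drain_rate  # seconds between drained items
        self.queue: deque[str] = deque()
        self.next_drain = 0.0

    def offer(self, item: str, now: float) -> bool:
        """Enqueue work, or reject it when the queue is full."""
        if not self.queue:
            # Idle time must not accumulate into a later burst.
            self.next_drain = max(self.next_drain, now)
        if len(self.queue) >= self.queue_size:
            return False
        self.queue.append(item)
        return True

    def drain(self, now: float) -> list[str]:
        """Release the items whose drain slot has arrived by `now`."""
        released = []
        while self.queue and self.next_drain <= now:
            released.append(self.queue.popleft())
            self.next_drain += self.drain_interval
        return released


bucket = LeakyBucket(queue_size=3, drain_rate=2.0)  # two items per second
print([bucket.offer(name, now=0.0) for name in "abcd"])  # [True, True, True, False]
print(bucket.drain(now=0.0))  # ['a']
print(bucket.drain(now=1.0))  # ['b', 'c']
```

The gap between `offer` and `drain` is the queueing latency the paragraph above calls out.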

Concurrency Limiter

A concurrency limiter answers a different question: how many expensive operations may run at the same time? A user might be allowed 100 report requests per hour but only five active report generations. Rate limits protect capacity over time; concurrency limits protect scarce workers, connections, and downstream dependencies in the moment. Many systems need both.
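A per-process sketch of an in-flight cap. The names are hypothetical, and a distributed version would need shared state plus a way to reclaim slots from crashed workers, which this example does not attempt.

```python
import threading
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Caps in-flight operations per key inside one process."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight: dict[str, int] = {}
        self.lock = threading.Lock()

    def try_acquire(self, key: str) -> bool:
        with self.lock:
            active = self.in_flight.get(key, 0)
            if active >= self.max_in_flight:
                return False
            self.in_flight[key] = active + 1
            return True

    def release(self, key: str) -> None:
        with self.lock:
            active = self.in_flight.get(key, 0) - 1
            if active > 0:
                self.in_flight[key] = active
            else:
                self.in_flight.pop(key, None)  # keep state bounded

    @contextmanager
    def slot(self, key: str):
        """Yield True if a slot was acquired; release only what was acquired."""
        acquired = self.try_acquire(key)
        try:
            yield acquired
        finally:
            if acquired:
                self.release(key)


limiter = ConcurrencyLimiter(max_in_flight=5)
held = [limiter.try_acquire("user:123") for _ in range(6)]
print(held)  # five True values, then False

limiter.release("user:123")
with limiter.slot("user:123") as acquired:
    print(acquired)  # True: one slot was freed
```

Releasing in a `finally` block matters more here than in a rate limiter: a leaked slot never refills on its own.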

Choosing the Algorithm

Algorithm | State per key | Burst behavior | Accuracy | Good fit
Fixed window | O(1) | Boundary burst | Approximate | Simple low-risk limits
Sliding window log | O(requests in window) | Strict rolling window | High | Security-sensitive actions
Sliding window counter | O(1) | Smoothed boundary | Approximate | Large public APIs
Token bucket | O(1) | Controlled by capacity | Policy-driven | General API protection
Leaky bucket | O(queue size) | Smooth output | Queue-driven | Background work and shaping
Concurrency limiter | O(active operations) | Caps in-flight work | Exact per store | Expensive endpoints

For broad API protection, start with token bucket and tune capacity separately from refill rate. Use a sliding window log when strict rolling-window behavior is worth the memory cost, such as suspicious login attempts. Use a sliding window counter when the fixed-window boundary problem matters but an approximate result is acceptable. Keep concurrency controls beside rate limits for expensive work.

From One Process to Distributed Enforcement

On one application server, an in-memory map is often enough. The limiter avoids a network call and can make decisions extremely quickly. This is useful for a local service, a single worker, or an early layer that rejects obviously abusive traffic.

Flow · Single-node request path
Client
  |
  v
API server
  |
  v
In-memory bucket map
  |
  +-- token available --> handler
  |
  +-- bucket empty -----> 429

Once the API runs on several replicas, isolated in-memory state stops being a reliable source of truth. If three servers each allow 100 requests, a client routed across all three can receive 300 accepted requests. The distributed design needs shared state or an explicitly approximate allocation strategy.

Flow · Shared distributed state
Client
  |
  v
Load balancer
  |
  +--> API server A --+
  |
  +--> API server B --+--> Redis rate-limit state
  |
  +--> API server C --+
                         |
                         +-- allow --> API handler
                         +-- deny ---> 429

Shared storage alone does not make the decision correct. A token bucket requires a read, refill calculation, decision, decrement, and write. If the application executes those as separate network operations, concurrent requests can both observe the same token and both pass.

race_condition.txt
Request A reads tokens = 1
Request B reads tokens = 1
Request A allows and writes tokens = 0
Request B allows and writes tokens = 0

The result is two accepted requests even though only one token existed. The check and update must be one atomic operation.
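The interleaving can be replayed deterministically in one process. In this sketch the `Bucket` class is hypothetical, the separate `read` and `write` calls stand in for separate network round trips, and a lock stands in for whatever makes the shared store's check-and-update atomic.

```python
import threading


class Bucket:
    def __init__(self, tokens: int) -> None:
        self.tokens = tokens
        self._lock = threading.Lock()

    def read(self) -> int:
        return self.tokens

    def write(self, value: int) -> None:
        self.tokens = value

    def take_atomic(self) -> bool:
        """Check and decrement under one lock, so no request can interleave."""
        with self._lock:
            if self.tokens < 1:
                return False
            self.tokens -= 1
            return True


# Non-atomic: replay the interleaving step by step.
racy = Bucket(tokens=1)
seen_by_a = racy.read()      # A reads 1
seen_by_b = racy.read()      # B reads 1
allowed_a = seen_by_a >= 1
racy.write(seen_by_a - 1)    # A writes 0
allowed_b = seen_by_b >= 1
racy.write(seen_by_b - 1)    # B writes 0
print(allowed_a, allowed_b)  # True True: one token, two accepted requests

# Atomic: the same two requests against a fresh bucket.
safe = Bucket(tokens=1)
print(safe.take_atomic(), safe.take_atomic())  # True False
```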

Production Patterns

Token Bucket With Redis Lua

A compact Redis Lua script keeps the hot-path decision atomic. It reads the bucket, computes refilled tokens, rejects or deducts the requested cost, stores the new state, and applies a TTL before another request can interleave. The TTL removes inactive policy keys after enough idle time for a bucket to refill.

token_bucket.lua
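-- KEYS[1]: bucket key. ARGV: capacity, refill rate (tokens/second), request cost.
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local requested = tonumber(ARGV[3]) or 1

-- Use the Redis server clock (seconds + microseconds) as the single time source.
local now_parts = redis.call("TIME")
local now = tonumber(now_parts[1]) + tonumber(now_parts[2]) / 1000000

-- A missing key is treated as a full bucket that was last refilled now.
local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = math.max(0, now - last_refill)
local refilled = math.min(capacity, tokens + elapsed * refill_rate)

-- Reject without writing: the stored balance and timestamp remain valid as they are.
if refilled < requested then
  return {0, refilled}
end

local remaining = refilled - requested
redis.call("HSET", key, "tokens", remaining, "last_refill", now)
-- Expire idle keys after twice the time a bucket needs to refill from empty.
redis.call("EXPIRE", key, math.ceil(capacity / refill_rate) * 2)
-- Redis converts Lua numbers in replies to integers, so the returned balance is truncated.
return {1, remaining}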

Using Redis TIME avoids trusting a client timestamp or the wall clock of whichever API replica handled the request. Another valid design is to pass time from a trusted application clock when that behavior is intentional and measured. Either way, scripts should remain short: atomic server-side execution is valuable precisely because it serializes the work, and for the same reason a slow script holds up every other command on that Redis instance while it runs.

Production note

The old advice still belongs here: do not implement token bucket as GET, calculate, then SET from application code. Lua keeps refill, decrement, decision, and TTL update together as one atomic operation.

The key shape should make the policy visible. Examples include rate_limit:user:123, rate_limit:api_key:partner_42, rate_limit:tenant:pharmacy_7:endpoint:invoice_pdf, and rate_limit:login:ip:203.0.113.10. Explicit keys make debugging and observability much easier.

Scaling Redis Without Creating a New Bottleneck

At the scale estimated earlier, roughly 579 average RPS, a carefully implemented Redis-backed limiter does not require an exotic design. The design becomes interesting when peaks grow, policy keys multiply, or a single global key receives every request. Redis Cluster distributes keyspace across 16,384 hash slots, so per-user, per-key, and per-tenant state can spread across shards.

Pressure point | Why it happens | Practical response
Hot global key | Every request updates one bucket | Prefer distributed admission layers or accept approximation when splitting capacity
Hot user or API key | One caller floods a single shard key | Reject early with a local limiter, WAF rule, or temporary block
High cardinality | Random IPs create many retained keys | Use TTLs, prefix policies, and bounded local protection
Large sliding logs | A timestamp is stored for each request | Clean aggressively or choose a constant-state algorithm
Redis round trips | Every accepted request checks shared state | Use local pre-limiters and keep scripts small
Uneven shards | Key distribution or hot tenants concentrate load | Observe per-shard load and revisit partitioning

A local in-memory limiter in front of Redis is a useful first layer. It is not the global source of truth. Its job is to reject obvious abuse cheaply, reduce pressure during attacks, and provide a degraded fallback when shared state is unavailable.

Flow · Layered admission control
Request
  |
  v
Local in-memory pre-limiter
  |
  +-- obvious abuse --> 429
  |
  v
Redis distributed limiter
  |
  +-- policy exceeded --> 429
  |
  v
API handler

Failure Behavior

A production design needs an explicit answer for Redis timeouts and outages. There is no universal default. The correct behavior depends on what the limiter protects and what failure is more expensive.

Mode | Behavior when shared limiter fails | Good fit | Risk
Fail open | Temporarily allow requests | Low-risk endpoints where availability dominates | Abuse may pass during the outage
Fail closed | Reject requests | Sensitive operations and infrastructure at risk of overload | Limiter failure becomes an API outage
Degraded local mode | Use stricter per-process limits with alerts | General APIs needing a balanced fallback | Limits become approximate across replicas

The fallback should be observable and time-bounded. A hidden fail-open path can quietly remove protection. A hidden fail-closed path can quietly block good users. Record the transition, alert on it, and make the recovery behavior explicit.
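One way to keep the fallback explicit is to route every decision through a single function that names the mode. This is a sketch under assumptions: the function and enum names are hypothetical, the shared and local checks are passed in as plain callables, and the built-in `ConnectionError` and `TimeoutError` stand in for whatever exceptions a real Redis client raises.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"
    DEGRADED_LOCAL = "degraded_local"


def decide(shared_check: Callable[[], bool],
           local_check: Callable[[], bool],
           mode: FailureMode,
           on_fallback: Callable[[FailureMode, Exception], None]) -> bool:
    """Return the shared decision, or apply the configured mode when it fails."""
    try:
        return shared_check()
    except (ConnectionError, TimeoutError) as exc:
        on_fallback(mode, exc)  # record the transition so it is never hidden
        if mode is FailureMode.FAIL_OPEN:
            return True
        if mode is FailureMode.FAIL_CLOSED:
            return False
        return local_check()  # stricter per-process limit


def broken_shared_check() -> bool:
    raise TimeoutError("shared limiter timed out")


events: list[FailureMode] = []
allowed = decide(broken_shared_check, lambda: False,
                 FailureMode.FAIL_OPEN, lambda mode, exc: events.append(mode))
print(allowed, events)  # True, with one recorded fallback event
```

The `on_fallback` hook is where a metric increment and an alert would attach; bounding how long the fallback may stay active is left out of this sketch.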

Multi-Region Design

A single central Redis deployment adds cross-region latency when users are served from Asia, Europe, and North America. Regional limiters keep the request path fast, but they make a strict global quota harder. If each region independently believes capacity is available, the combined system may allow a small overage.

Approach | Latency | Global accuracy | Trade-off
Regional buckets with async reconciliation | Low | Approximate | Small overage is accepted for speed and resilience
Quota reservation per region | Low on normal requests | Stronger | A busy region can exhaust its allocation while another leaves quota unused
Centralized global check | Higher across distant regions | Strongest | Correctness is purchased with latency and a tighter dependency

Quota reservation is a useful middle ground. A global coordinator divides a quota across regions, such as 400 units for Singapore and 300 each for Europe and the United States. Regions spend local allocations quickly and request more when needed. This reduces global coordination on every request, but it introduces allocation imbalance and refill complexity.
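The reservation idea can be modeled with two small classes. Everything here is an in-process sketch with hypothetical names: the coordinator stands in for a central service, the chunk sizes echo the 400/300 split above, and returning unused allocation is deliberately left out.

```python
class QuotaCoordinator:
    """Hands out chunks of a global quota; stands in for a central service."""

    def __init__(self, total: int) -> None:
        self.unallocated = total

    def reserve(self, requested: int) -> int:
        granted = min(requested, self.unallocated)
        self.unallocated -= granted
        return granted


class RegionalLimiter:
    """Spends a local allocation and asks for more only when it runs out."""

    def __init__(self, coordinator: QuotaCoordinator, chunk: int) -> None:
        self.coordinator = coordinator
        self.chunk = chunk
        self.local = 0

    def allow(self, cost: int = 1) -> bool:
        if self.local < cost:
            # The only step that needs cross-region coordination.
            self.local += self.coordinator.reserve(self.chunk)
        if self.local < cost:
            return False
        self.local -= cost
        return True


coordinator = QuotaCoordinator(total=1000)
singapore = RegionalLimiter(coordinator, chunk=400)
europe = RegionalLimiter(coordinator, chunk=300)

sg_accepted = sum(singapore.allow() for _ in range(700))
eu_accepted = sum(europe.allow() for _ in range(500))
print(sg_accepted, eu_accepted)  # 700 200
print(singapore.local)           # 100 units reserved in Singapore but unused
```

The run never exceeds the global total, but Europe is rejected while Singapore still holds unused units, which is the allocation imbalance described above.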

Infrastructure Protection Is Not Billing

Infrastructure protection asks whether the system stays healthy under load. A small amount of over-allowance is often acceptable if the alternative is a slow or fragile central dependency. Token buckets, regional state, local fallback, and eventual reconciliation work well here.

Billing-grade quota enforcement asks whether a customer consumed more than the product contract allows. That requires durable records, auditability, reconciliation, idempotent accounting, and stronger consistency. A monthly allowance of 1,000 paid exports should not quietly become 2,000 because several regions accepted requests independently.
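Idempotent accounting can be illustrated with a ledger keyed by an idempotency key. This is a sketch only: the class is hypothetical, it tracks a single customer, and the in-memory dict stands in for a durable, transactional store.

```python
class QuotaLedger:
    """Usage ledger keyed by idempotency key; a dict stands in for a durable table."""

    def __init__(self, monthly_allowance: int) -> None:
        self.monthly_allowance = monthly_allowance
        self.entries: dict[str, int] = {}  # idempotency key -> units charged

    def used(self) -> int:
        return sum(self.entries.values())

    def charge(self, idempotency_key: str, units: int = 1) -> bool:
        if idempotency_key in self.entries:
            return True  # a retry of a recorded operation is not charged twice
        if self.used() + units > self.monthly_allowance:
            return False
        self.entries[idempotency_key] = units
        return True


ledger = QuotaLedger(monthly_allowance=2)
print(ledger.charge("export-001"))  # True
print(ledger.charge("export-001"))  # True: client retry, still one unit used
print(ledger.charge("export-002"))  # True
print(ledger.charge("export-003"))  # False: the allowance is exhausted
print(ledger.used())                # 2
```

Because every charge is a named record, the ledger can be audited and reconciled later, which a hot-path counter cannot offer.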

Rule of thumb

Use the hot-path limiter to protect service capacity. Use durable accounting to settle business quota. They can cooperate, but treating them as the same system creates avoidable correctness problems.

API Contract and Client Behavior

When a request is throttled, respond with HTTP 429 Too Many Requests. RFC 6585 defines the status code and says the response may include Retry-After. Clients should respect that signal and use exponential backoff with jitter instead of retrying in lockstep.

rate_limit_response.http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "code": "RATE_LIMIT_EXCEEDED",
  "message": "Rate limit exceeded. Try again later."
}
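On the client side, the wait before a retry can combine the server's Retry-After value with jittered exponential backoff. The function name, defaults, and the choice to add jitter on top of Retry-After (rather than replace it) are assumptions of this sketch, not a standard.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base: float = 0.5, cap: float = 60.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The jitter is a uniform draw between 0 and a capped exponential bound.
    A server-provided Retry-After is always respected and added on top,
    so clients that received the same value do not retry in lockstep.
    """
    rng = rng or random.Random()
    bound = min(cap, base * (2 ** attempt))
    jitter = rng.uniform(0.0, bound)
    return (retry_after or 0.0) + jitter


# Fourth retry after a 429 with "Retry-After: 30": between 30 and 34 seconds.
delay = retry_delay(attempt=3, retry_after=30.0)
print(f"{delay:.2f}s")
```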

Many APIs also expose practical quota information so clients can reduce traffic before receiving a 429. The exact contract belongs in the API documentation, and it should not reveal sensitive internal capacity details.

Standards note - May 31, 2026

RateLimit and RateLimit-Policy are defined by draft-ietf-httpapi-ratelimit-headers-11, an active IETF Internet-Draft updated May 23, 2026. Treat these fields as work in progress rather than a finalized RFC. Retry-After and HTTP 429 remain the stable baseline.

Edge Cases Worth Designing For

Edge cases are where a limiter stops being a data-structure exercise and becomes a systems problem. The right response is often a layered policy rather than one increasingly complicated bucket.

Edge case | Failure mode | Design response
Window boundary burst | A fixed window admits roughly 2x traffic near a boundary | Use token bucket or a sliding-window approach
Non-atomic update | Concurrent requests spend the same token | Use atomic INCR where sufficient or a short Redis Lua script
Clock skew | Replicas calculate different refill amounts | Use a trusted time source such as Redis TIME; never trust client time
Redis outage | The shared decision path disappears | Choose fail open, fail closed, or degraded local mode explicitly
Hot user or API key | One key overloads a shard | Pre-limit locally, block abusive keys, and observe shard load
NAT or shared IP | Many legitimate users share one address | Prefer identity-based limits after authentication
IPv6 rotation | An attacker cycles addresses | Consider prefix-based controls and combine signals carefully
Login abuse | Credential stuffing bypasses a single policy | Layer account, IP, subnet, and failure-count limits
Expensive endpoint | Cheap and costly requests consume equal allowance | Spend weighted tokens and add concurrency control
Unauthenticated bot burst | No user ID exists yet | Use CDN or WAF protection, IP limits, and progressive challenges
Plan change | Cached limits remain stale after upgrade or downgrade | Version policies and invalidate cached configuration
Multi-region overage | Regions independently accept beyond a global quota | Reserve quota or reconcile with an accepted tolerance
Retry storm | Clients retry together and amplify load | Return Retry-After and require exponential backoff with jitter
Memory growth | Random identities leave endless state | Set TTLs, clean logs, and monitor key cardinality
Multi-account abuse | An attacker spreads traffic across accounts | Combine tenant, device, network, and fraud signals

End-to-End Reference Architecture

Flow · Production request path
Client
  |
  v
CDN / WAF
  |
  v
Load balancer
  |
  v
API gateway
  |
  +--> local pre-limiter
  |      |
  |      +-- abusive burst --> 429
  |
  +--> Redis Cluster distributed limiter
  |      |
  |      +-- policy exceeded --> 429 + Retry-After
  |
  +--> auth, validation, and routing
         |
         v
     API service
         |
         v
  downstream dependencies

PostgreSQL: plans, subscriptions, audit records, billing reconciliation
Metrics: allowed, blocked, errors, fallback state, latency, hot keys

Redis belongs on the fast admission path. PostgreSQL belongs behind the durable product rules: subscription plans, tenant configuration, audit history, and billing reconciliation. CDN and WAF controls absorb obvious unauthenticated abuse before it reaches application infrastructure. Each layer has a narrower job.

Observability

A limiter can fail in two directions: it can allow traffic that should be blocked, or block users who should be allowed. Both failures can hide behind an apparently healthy API. Instrument the decision path as a first-class production component.

  • rate_limiter_allowed_total and rate_limiter_blocked_total, partitioned by policy and endpoint
  • http_429_total by endpoint, tenant, API key, and region
  • rate_limiter_redis_latency_ms and rate_limiter_redis_errors_total
  • rate_limiter_fallback_mode_total with alerts for entry and duration
  • rate_limiter_hot_keys_detected_total and Redis per-shard CPU or memory pressure
  • Policy-cardinality dashboards for IP, user, tenant, and API-key keys

Use sampled decision logs for investigation: policy key type, endpoint, requested cost, allow or reject result, region, fallback mode, and a safe reason code. Avoid leaking raw secrets such as API keys into logs.

A Strong Design Answer

A strong answer is structured. Clarify the policy key and whether the goal is infrastructure safety or business quota. Estimate the average and peak load. Choose an algorithm based on burst behavior and accuracy. Make the update atomic. Explain shard keys, hot keys, TTLs, failure behavior, regional consistency, client backoff, and observability.

Interview-ready summary

For 50 million requests per day, the average is about 579 RPS, but I would design for peak traffic. For general API protection I would start with token bucket because it allows controlled bursts with constant per-key state. Across replicas I would keep shared state in Redis and execute refill, decision, decrement, and TTL update atomically with a short Lua script. I would shard by user, API key, or tenant, add local pre-limiters for abuse and degraded operation, and choose regional approximation or stricter quota reservation based on whether I am protecting infrastructure or enforcing billing-grade quota.

Where to Go Deeper

The most useful next step is to go one level deeper into the decisions that change production behavior. The topics below extend the design without turning it into a single oversized limiter.

  • Token bucket tuning: capacity, refill rate, weighted cost, and the burst budget
  • Multi-region quota allocation: reservation, reconciliation, and acceptable overage
  • Client resilience: Retry-After, exponential backoff, jitter, and idempotency

The durable mental model is simple: the algorithm decides how capacity is counted, storage decides where state lives, atomicity prevents double-spending, scaling distributes pressure, failure policy defines degraded behavior, and the product requirement decides which trade-offs are acceptable.
