Designing API Rate Limiters: Algorithms, Distributed Systems, and Production Trade-offs

Summary

A production rate limiter is not just a Redis counter with an expiry. Start by defining the policy, choose an algorithm that matches the traffic shape, make the decision atomic, design for Redis and regional failures, and separate approximate infrastructure protection from billing-grade quota enforcement.

A practical guide to rate-limiting algorithms, atomic Redis enforcement, failure modes, multi-region trade-offs, and the edge cases that shape production API design. In this post, I'll walk through the key concepts with code examples drawn from real production implementations.

Start With the Problem, Not Redis

A rate limiter controls how quickly a client can consume a shared resource. It protects an API from abusive traffic, accidental retry loops, credential-stuffing attempts, noisy tenants, and sudden bursts that would otherwise overload downstream services. It can also enforce product limits such as a free plan allowing 1,000 invoice exports per month.

The first design question is not which Redis command to use. It is what should be limited. A public endpoint may need an IP-based limit before authentication. An authenticated API may use a user ID, API key, or tenant ID. A login route often needs several policies at once: account, IP address, and subnet. An expensive report endpoint may need weighted costs or a separate concurrency limit.

Policy key	Good fit	Important caveat
user_id	Authenticated user fairness	One user may legitimately use several devices
api_key	Developer and partner APIs	A leaked key becomes a hot abusive key
tenant_id	Multi-tenant SaaS protection	One noisy user can consume a tenant's shared quota
ip_address	Unauthenticated endpoints	NAT can place many real users behind one IP
endpoint	Route-specific protection	Different endpoints may need different token costs
global	Protecting finite shared capacity	One global key can become a bottleneck

Design question

Clarify whether the limiter protects infrastructure, prevents abuse, or enforces a paid quota. Those goals look similar at first, but they require different accuracy, latency, availability, and auditability trade-offs.

Estimate the Load

For an API serving 50 million requests per day, the first calculation is straightforward: 50,000,000 / 86,400 ≈ 579 RPS. That is only the daily average. Real traffic is rarely flat. A product launch, scheduled job, regional morning peak, or retry storm can push a system far above the average for a short period.

50M requests per day · ~579 average RPS · Peak > avg as the design assumption

Average RPS tells us the baseline load on the limiter. Peak RPS tells us whether a Redis node, network hop, or global counter becomes a bottleneck. Active-key cardinality tells us how much state we retain. Endpoint cost tells us whether every request should consume the same amount of capacity. A useful estimate should cover all four.
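The arithmetic above can be kept as a small sizing helper. This is a rough sketch: the 10x peak factor, the two million active keys, and the 200 bytes per key are placeholder assumptions of mine, not measurements, and the function name is hypothetical. Endpoint cost is left out here because it depends on the policy rather than on raw volume.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(requests_per_day, peak_factor, active_keys, bytes_per_key):
    """Return rough sizing numbers for a rate limiter."""
    average_rps = requests_per_day / SECONDS_PER_DAY
    return {
        "average_rps": average_rps,
        "peak_rps": average_rps * peak_factor,
        # State retained by the limiter, in MiB.
        "state_mib": active_keys * bytes_per_key / (1024 * 1024),
    }


estimate = estimate_load(
    requests_per_day=50_000_000,
    peak_factor=10,         # assumption: peak is 10x the daily average
    active_keys=2_000_000,  # assumption: distinct policy keys alive at once
    bytes_per_key=200,      # assumption: key name, small hash, and overhead
)
```

Replace the assumptions with observed values as soon as they exist; the shape of the estimate matters more than these particular numbers.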

Why a Redis Counter With Expiry Is Incomplete

A counter with an expiry is a reasonable starting point. It is the fixed window counter: increment a key for the current minute, set a TTL, and reject requests after the limit. It is fast, simple, and often good enough for internal tools or low-risk controls.

fixed_window.txt

count = INCR(key)
if count == 1:
    EXPIRE(key, 60)
if count > limit:
    reject
else:
    allow

The weakness appears at the window boundary. With a limit of 100 requests per minute, a client can send 100 requests at 10:00:59 and another 100 requests at 10:01:00. Both windows are technically valid, but the service receives 200 requests almost immediately. The Redis counter is not wrong; the policy is simply too coarse for that traffic shape.
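The boundary burst is easy to reproduce with an in-memory stand-in for the Redis counter. This is a minimal sketch, not the Redis implementation: the class name is hypothetical, time is passed in explicitly so the example is deterministic, and old windows are never evicted because there is no TTL here.

```python
from collections import defaultdict


class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (key, window index) -> accepted count

    def allow(self, key, now):
        # Every timestamp inside the same interval maps to the same window.
        window = int(now // self.window_seconds)
        bucket = (key, window)
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)

# 100 requests in the last second of one window, 100 at the start of the next.
late = sum(limiter.allow("client", now=59.0) for _ in range(100))
early = sum(limiter.allow("client", now=60.0) for _ in range(100))
extra = limiter.allow("client", now=60.5)
```

Both batches are accepted in full, so 200 requests arrive about one second apart, and only the 201st is rejected.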

Failure mode

The obvious design is useful when its failure mode is acceptable. Engineering judgment starts with naming the failure mode before choosing the implementation.

Core Rate-Limiting Algorithms

Fixed Window Counter

Count requests inside a fixed interval such as a second, minute, or day. Each interval gets a distinct key with a TTL. The algorithm needs constant memory per active key and works well when simplicity matters more than precise smoothing. Its defining weakness is the boundary burst.

Sliding Window Log

Store the timestamp of every accepted request, commonly in a Redis sorted set. On each decision, remove timestamps older than the rolling window, count what remains, and accept only when the count is below the limit. This closely models 'no more than N requests in any 60-second period,' but its memory use grows with request volume.
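Here is the same idea as a minimal in-memory sketch for a single policy key, using a deque where the Redis version would use a sorted set. The class name is hypothetical and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.accepted = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now):
        # Drop timestamps that have fallen out of the rolling window.
        cutoff = now - self.window_seconds
        while self.accepted and self.accepted[0] <= cutoff:
            self.accepted.popleft()
        if len(self.accepted) >= self.limit:
            return False
        self.accepted.append(now)
        return True


log = SlidingWindowLog(limit=100, window_seconds=60)
late = sum(log.allow(59.0) for _ in range(100))   # fills the window
early = sum(log.allow(60.0) for _ in range(100))  # all rejected
```

Unlike the fixed window, the second batch is rejected because the first 100 timestamps are still inside the rolling 60 seconds. The cost is visible in the data structure: one stored timestamp per accepted request.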

Sliding Window Counter

Store counts for the current and previous fixed windows, then weight the previous window by how much of it still overlaps the current rolling interval. If 25% of the current minute has elapsed, 75% of the previous minute still contributes. This reduces boundary spikes while keeping constant per-key state, but the result is approximate.

sliding_window_counter.txt

estimated_usage = current_count + previous_count * remaining_window_ratio

# Example: 25% of current minute has elapsed
estimated_usage = 40 + 80 * 0.75
estimated_usage = 100
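The same estimate as a runnable function. This is a sketch under one assumption of mine: a request is admitted only while the estimate is strictly below the limit. The function names are hypothetical.

```python
def estimated_usage(current_count, previous_count, elapsed_fraction):
    """Weight the previous window by the share of it still inside the rolling interval."""
    if not 0.0 <= elapsed_fraction <= 1.0:
        raise ValueError("elapsed_fraction must be between 0 and 1")
    remaining_window_ratio = 1.0 - elapsed_fraction
    return current_count + previous_count * remaining_window_ratio


def allow(current_count, previous_count, elapsed_fraction, limit):
    # Assumption: admit only while the estimate is strictly below the limit.
    return estimated_usage(current_count, previous_count, elapsed_fraction) < limit


usage = estimated_usage(current_count=40, previous_count=80, elapsed_fraction=0.25)
```

The estimate assumes the previous window's requests were spread evenly across it, which is where the approximation comes from.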

Token Bucket

Give each policy key a bucket with a maximum capacity and a steady refill rate. Each request spends one or more tokens. A full bucket permits a controlled burst, while sustained traffic settles at the refill rate. A bucket with capacity 100 and refill rate 10 tokens per second can accept a burst of 100 requests, then continue at roughly 10 RPS.

Token bucket is a strong default for general APIs because it stores only two values per key: the remaining token balance and the last refill time. It supports weighted costs as well. A product lookup might cost one token, an invoice PDF five, and an AI report twenty.
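A minimal single-process sketch of that two-value state, using the capacity, refill rate, and costs from the example above. The class and the COSTS table are hypothetical names, and time is passed in explicitly rather than read from a clock.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_rate: float  # tokens per second
    tokens: float       # remaining balance
    last_refill: float  # time of the last refill calculation

    def try_spend(self, cost, now):
        # Refill lazily based on elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


COSTS = {"product_lookup": 1, "invoice_pdf": 5, "ai_report": 20}

bucket = TokenBucket(capacity=100, refill_rate=10, tokens=100, last_refill=0.0)
burst = sum(bucket.try_spend(COSTS["product_lookup"], now=0.0) for _ in range(150))
sustained = sum(bucket.try_spend(COSTS["product_lookup"], now=1.0) for _ in range(150))
```

The full bucket absorbs a burst of 100, and one second later only the 10 refilled tokens are available. A 20-token AI report would have to wait two more seconds of refill.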

Leaky Bucket

Place incoming work into a bounded queue and drain it at a steady rate. When the queue is full, drop or reject new work. The output becomes smooth, which is valuable for background processing and downstream services with predictable capacity. The trade-off is queueing latency. A leaky bucket is often a traffic-shaping mechanism, while a token bucket is often an admission-control mechanism.
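As a rough sketch, a leaky bucket is a bounded queue plus a drain step that releases work at a fixed rate. The class below is hypothetical and single-threaded; a real worker would call drain on a timer, and here the caller passes the time in.

```python
from collections import deque


class LeakyBucket:
    def __init__(self, queue_size, drain_rate):
        self.queue_size = queue_size
        self.drain_rate = drain_rate  # items released per second
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, item):
        """Enqueue work, or reject it when the queue is full."""
        if len(self.queue) >= self.queue_size:
            return False
        self.queue.append(item)
        return True

    def drain(self, now):
        """Release the items that are due at `now`, oldest first."""
        due = int((now - self.last_drain) * self.drain_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due > 0:
            # Advance by whole items only, so idle time does not become burst credit.
            self.last_drain += due / self.drain_rate
        return released


bucket = LeakyBucket(queue_size=5, drain_rate=2)
accepted = [bucket.offer(i) for i in range(8)]
first = bucket.drain(now=1.0)
```

Eight offers fill the five-slot queue and the rest are rejected. After one second, two items are released, and the remaining three wait in the queue, which is the latency cost mentioned above.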

Concurrency Limiter

A concurrency limiter answers a different question: how many expensive operations may run at the same time? A user might be allowed 100 report requests per hour but only five active report generations. Rate limits protect capacity over time; concurrency limits protect scarce workers, connections, and downstream dependencies in the moment. Many systems need both.
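Within one process, a non-blocking semaphore is enough to sketch the idea. The names below are hypothetical, and the limit applies only to the process holding the semaphore; a limit shared across replicas would need shared state.

```python
import threading
from contextlib import contextmanager


class TooManyInFlight(Exception):
    """Raised when no concurrency slot is free."""


class ConcurrencyLimiter:
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self):
        # Reject immediately instead of queueing behind running work.
        if not self._slots.acquire(blocking=False):
            raise TooManyInFlight("no free slot")
        try:
            yield
        finally:
            # Always return the slot, even if the operation fails.
            self._slots.release()


limiter = ConcurrencyLimiter(max_in_flight=2)
rejected = False
with limiter.slot():
    with limiter.slot():
        try:
            with limiter.slot():
                pass
        except TooManyInFlight:
            rejected = True
```

The third concurrent operation is rejected while two are running, and slots become available again as soon as work finishes, regardless of how many requests arrived in the last hour.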

Choosing the Algorithm

Algorithm	State per key	Burst behavior	Accuracy	Good fit
Fixed window	O(1)	Boundary burst	Approximate	Simple low-risk limits
Sliding window log	O(requests in window)	Strict rolling window	High	Security-sensitive actions
Sliding window counter	O(1)	Smoothed boundary	Approximate	Large public APIs
Token bucket	O(1)	Controlled by capacity	Policy-driven	General API protection
Leaky bucket	O(queue size)	Smooth output	Queue-driven	Background work and shaping
Concurrency limiter	O(active operations)	Caps in-flight work	Exact per store	Expensive endpoints

For broad API protection, start with token bucket and tune capacity separately from refill rate. Use a sliding window log when strict rolling-window behavior is worth the memory cost, such as suspicious login attempts. Use a sliding window counter when the fixed-window boundary problem matters but an approximate result is acceptable. Keep concurrency controls beside rate limits for expensive work.

From One Process to Distributed Enforcement

On one application server, an in-memory map is often enough. The limiter avoids a network call and can make decisions extremely quickly. This is useful for a local service, a single worker, or an early layer that rejects obviously abusive traffic.

Flow · Single-node request path

Client
  |
  v
API server
  |
  v
In-memory bucket map
  |
  +-- token available --> handler
  |
  +-- bucket empty -----> 429

Once the API runs on several replicas, isolated in-memory state stops being a reliable source of truth. If three servers each allow 100 requests, a client routed across all three can receive 300 accepted requests. The distributed design needs shared state or an explicitly approximate allocation strategy.

Flow · Shared distributed state

Client
  |
  v
Load balancer
  |
  +--> API server A --+
  |
  +--> API server B --+--> Redis rate-limit state
  |
  +--> API server C --+
                         |
                         +-- allow --> API handler
                         +-- deny ---> 429

Shared storage alone does not make the decision correct. A token bucket requires a read, refill calculation, decision, decrement, and write. If the application executes those as separate network operations, concurrent requests can both observe the same token and both pass.

race_condition.txt

Request A reads tokens = 1
Request B reads tokens = 1
Request A allows and writes tokens = 0
Request B allows and writes tokens = 0

The result is two accepted requests even though only one token existed. The check and update must be one atomic operation.
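The interleaving can be replayed deterministically in one process. In this sketch a lock stands in for the atomicity that a Redis script would provide; the Bucket class and its method names are hypothetical.

```python
import threading


class Bucket:
    def __init__(self, tokens):
        self.tokens = tokens
        self._lock = threading.Lock()

    def read(self):
        return self.tokens

    def write(self, value):
        self.tokens = value

    def spend_atomically(self):
        # Check and decrement happen as one step; nothing can interleave.
        with self._lock:
            if self.tokens < 1:
                return False
            self.tokens -= 1
            return True


# Replay the race: both requests read before either one writes.
racy = Bucket(tokens=1)
seen_a = racy.read()
seen_b = racy.read()
accepted_racy = 0
for seen in (seen_a, seen_b):
    if seen >= 1:
        racy.write(seen - 1)
        accepted_racy += 1

# The atomic version admits exactly one of twenty concurrent requests.
safe = Bucket(tokens=1)
results = []
threads = [
    threading.Thread(target=lambda: results.append(safe.spend_atomically()))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
accepted_safe = sum(results)
```

The racy path accepts two requests against one token; the atomic path accepts one.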

Production Patterns

Token Bucket With Redis Lua

A compact Redis Lua script keeps the hot-path decision atomic. It reads the bucket, computes refilled tokens, rejects or deducts the requested cost, stores the new state, and applies a TTL before another request can interleave. The TTL removes inactive policy keys after enough idle time for a bucket to refill.

token_bucket.lua

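-- KEYS[1]: bucket key. ARGV: capacity, refill rate (tokens per second), cost.
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local requested = tonumber(ARGV[3]) or 1

-- Redis server time is returned as {seconds, microseconds}.
local now_parts = redis.call("TIME")
local now = tonumber(now_parts[1]) + tonumber(now_parts[2]) / 1000000

-- A missing key behaves like a full bucket.
local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

local elapsed = math.max(0, now - last_refill)
local refilled = math.min(capacity, tokens + elapsed * refill_rate)

-- Reject without writing; the stored state is still a valid refill baseline.
if refilled < requested then
  return {0, refilled}
end

local remaining = refilled - requested
redis.call("HSET", key, "tokens", remaining, "last_refill", now)
-- Expire idle keys after twice the time a full refill takes.
redis.call("EXPIRE", key, math.ceil(capacity / refill_rate) * 2)
-- Redis truncates Lua numbers to integers in the reply.
return {1, remaining}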

Using Redis TIME avoids trusting a client timestamp or the wall clock of whichever API replica handled the request. Another valid design is to pass time from a trusted application clock when that behavior is intentional and measured. Either way, scripts should remain short: atomic server-side execution is valuable precisely because it serializes the work.

Production note

The old advice still belongs here: do not implement token bucket as GET, calculate, then SET from application code. Lua keeps refill, decrement, decision, and TTL update together as one atomic operation.

The key shape should make the policy visible. Examples include rate_limit:user:123, rate_limit:api_key:partner_42, rate_limit:tenant:pharmacy_7:endpoint:invoice_pdf, and rate_limit:login:ip:203.0.113.10. Explicit keys make debugging and observability much easier.
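One way to keep key shapes consistent is to build them in a single place. The helper below is a hypothetical sketch that reproduces the examples above; it does no escaping, so values that contain colons, such as IPv6 addresses, would need a separate convention.

```python
def policy_key(*pairs, scope=None, prefix="rate_limit"):
    """Build a limiter key from (kind, value) pairs, optionally under a scope."""
    parts = [prefix]
    if scope is not None:
        parts.append(scope)
    for kind, value in pairs:
        parts.extend([kind, str(value)])
    return ":".join(parts)


user_key = policy_key(("user", 123))
tenant_key = policy_key(("tenant", "pharmacy_7"), ("endpoint", "invoice_pdf"))
login_key = policy_key(("ip", "203.0.113.10"), scope="login")
```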

Scaling Redis Without Creating a New Bottleneck

At the interview scale of roughly 579 average RPS, a carefully implemented Redis-backed limiter is not inherently exotic. The design becomes interesting when peaks grow, policy keys multiply, or a single global key receives every request. Redis Cluster distributes keyspace across 16,384 hash slots, so per-user, per-key, and per-tenant state can spread across shards.

Pressure point	Why it happens	Practical response
Hot global key	Every request updates one bucket	Prefer distributed admission layers or accept approximation when splitting capacity
Hot user or API key	One caller floods a single shard key	Reject early with a local limiter, WAF rule, or temporary block
High cardinality	Random IPs create many retained keys	Use TTLs, prefix policies, and bounded local protection
Large sliding logs	A timestamp is stored for each request	Clean aggressively or choose a constant-state algorithm
Redis round trips	Every accepted request checks shared state	Use local pre-limiters and keep scripts small
Uneven shards	Key distribution or hot tenants concentrate load	Observe per-shard load and revisit partitioning

A local in-memory limiter in front of Redis is a useful first layer. It is not the global source of truth. Its job is to reject obvious abuse cheaply, reduce pressure during attacks, and provide a degraded fallback when shared state is unavailable.

Flow · Layered admission control

Request
  |
  v
Local in-memory pre-limiter
  |
  +-- obvious abuse --> 429
  |
  v
Redis distributed limiter
  |
  +-- policy exceeded --> 429
  |
  v
API handler
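The layering in the diagram can be sketched as a short-circuiting check. Both limiters are passed in as plain callables here, which is an assumption made to keep the example self-contained; the function name and reason codes are hypothetical.

```python
def admit(key, cost, local_allow, shared_allow):
    """Return (allowed, reason). The shared limiter is consulted only if the local one passes."""
    if not local_allow(key, cost):
        return (False, "local_pre_limiter")
    if not shared_allow(key, cost):
        return (False, "shared_limiter")
    return (True, "allowed")


shared_calls = []


def shared_stub(key, cost):
    # Stand-in for the Redis round trip; records that it was reached.
    shared_calls.append(key)
    return True


blocked = admit("ip:203.0.113.10", 1, lambda key, cost: False, shared_stub)
passed = admit("user:123", 1, lambda key, cost: True, shared_stub)
```

The request rejected by the local layer never reaches the shared limiter, which is how the pre-limiter reduces pressure on Redis during an attack.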

Failure Behavior

A production design needs an explicit answer for Redis timeouts and outages. There is no universal default. The correct behavior depends on what the limiter protects and what failure is more expensive.

Mode	Behavior when shared limiter fails	Good fit	Risk
Fail open	Temporarily allow requests	Low-risk endpoints where availability dominates	Abuse may pass during the outage
Fail closed	Reject requests	Sensitive operations and infrastructure at risk of overload	Limiter failure becomes an API outage
Degraded local mode	Use stricter per-process limits with alerts	General APIs needing a balanced fallback	Limits become approximate across replicas

The fallback should be observable and time-bounded. A hidden fail-open path can quietly remove protection. A hidden fail-closed path can quietly block good users. Record the transition, alert on it, and make the recovery behavior explicit.
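A minimal sketch of making that choice explicit in code. The exception type, the callables, and the on_fallback hook are all hypothetical; the hook is where the transition would be recorded and alerted on.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"
    DEGRADED_LOCAL = "degraded_local"


class SharedLimiterUnavailable(Exception):
    """Raised by the shared check on timeout or outage (hypothetical)."""


def decide(check_shared, check_local, mode, on_fallback):
    try:
        return check_shared()
    except SharedLimiterUnavailable:
        # Make the degraded path visible before applying it.
        on_fallback(mode)
        if mode is FailureMode.FAIL_OPEN:
            return True
        if mode is FailureMode.FAIL_CLOSED:
            return False
        # Degraded local mode: fall back to a stricter per-process limiter.
        return check_local()


def shared_down():
    raise SharedLimiterUnavailable("redis timeout")


transitions = []
open_result = decide(shared_down, lambda: False, FailureMode.FAIL_OPEN, transitions.append)
closed_result = decide(shared_down, lambda: True, FailureMode.FAIL_CLOSED, transitions.append)
local_result = decide(shared_down, lambda: False, FailureMode.DEGRADED_LOCAL, transitions.append)
```

The mode is a per-policy setting rather than a global constant: a login route and a product lookup can reasonably fail in different directions.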

Multi-Region Design

A single central Redis deployment adds cross-region latency when users are served from Asia, Europe, and North America. Regional limiters keep the request path fast, but they make a strict global quota harder. If each region independently believes capacity is available, the combined system can exceed the global limit by however much the regions admit before they reconcile.

Approach	Latency	Global accuracy	Trade-off
Regional buckets with async reconciliation	Low	Approximate	Small overage is accepted for speed and resilience
Quota reservation per region	Low on normal requests	Stronger	A busy region can exhaust its allocation while another leaves quota unused
Centralized global check	Higher across distant regions	Strongest	Correctness is purchased with latency and a tighter dependency

Quota reservation is a useful middle ground. A global coordinator divides a quota across regions, such as 400 units for Singapore and 300 each for Europe and the United States. Regions spend local allocations quickly and request more when needed. This reduces global coordination on every request, but it introduces allocation imbalance and refill complexity.
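The allocation imbalance is easy to see in a small in-memory sketch. Both classes are hypothetical, the coordinator is a plain object rather than a networked service, and there is no mechanism for returning unused quota.

```python
class QuotaCoordinator:
    def __init__(self, total):
        self.remaining = total

    def lease(self, requested):
        """Grant up to `requested` units from the global quota."""
        granted = min(requested, self.remaining)
        self.remaining -= granted
        return granted


class RegionalAllowance:
    def __init__(self, coordinator, chunk):
        self.coordinator = coordinator
        self.chunk = chunk  # units requested per lease
        self.local = 0      # units this region may spend without coordination

    def try_spend(self, cost=1):
        # Contact the coordinator only when the local allocation runs out.
        if self.local < cost:
            self.local += self.coordinator.lease(max(self.chunk, cost))
        if self.local < cost:
            return False
        self.local -= cost
        return True


coordinator = QuotaCoordinator(total=1000)
singapore = RegionalAllowance(coordinator, chunk=400)
europe = RegionalAllowance(coordinator, chunk=300)
united_states = RegionalAllowance(coordinator, chunk=300)

singapore_accepted = sum(singapore.try_spend() for _ in range(400))
europe_first = europe.try_spend()
us_first = united_states.try_spend()
singapore_next = singapore.try_spend()
```

Singapore's 401st request is rejected even though Europe and the United States still hold 299 unused units each. The global total is never exceeded, but unused quota is stranded until some rebalancing step returns it.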

Infrastructure Protection Is Not Billing

Infrastructure protection asks whether the system stays healthy under load. A small amount of over-allowance is often acceptable if the alternative is a slow or fragile central dependency. Token buckets, regional state, local fallback, and eventual reconciliation work well here.

Billing-grade quota enforcement asks whether a customer consumed more than the product contract allows. That requires durable records, auditability, reconciliation, idempotent accounting, and stronger consistency. A monthly allowance of 1,000 paid exports should not quietly become 2,000 because several regions accepted requests independently.

Rule of thumb

Use the hot-path limiter to protect service capacity. Use durable accounting to settle business quota. They can cooperate, but treating them as the same system creates avoidable correctness problems.
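To illustrate what idempotent accounting means here, the sketch below records usage under an idempotency key so that a retried request is counted once. The ledger is an in-memory stand-in of my own; a durable version would need a database transaction and a uniqueness constraint on the key.

```python
class UsageLedger:
    def __init__(self, allowance):
        self.allowance = allowance
        self.entries = {}  # idempotency key -> units charged

    def used(self):
        return sum(self.entries.values())

    def record(self, idempotency_key, units=1):
        """Charge `units` once per idempotency key; return whether the charge stands."""
        if idempotency_key in self.entries:
            return True  # a retry of an already-counted request
        if self.used() + units > self.allowance:
            return False
        self.entries[idempotency_key] = units
        return True


ledger = UsageLedger(allowance=3)
outcomes = [ledger.record(key) for key in ["export-a", "export-b", "export-a", "export-c"]]
over_quota = ledger.record("export-d")
```

The retried export-a does not consume a second unit, and the fourth distinct export is refused once the allowance is spent.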

API Contract and Client Behavior

When a request is throttled, respond with HTTP 429 Too Many Requests. RFC 6585 defines the status code and says the response may include Retry-After. Clients should respect that signal and use exponential backoff with jitter instead of retrying in lockstep.

rate_limit_response.http

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "code": "RATE_LIMIT_EXCEEDED",
  "message": "Rate limit exceeded. Try again later."
}
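On the client side, the delay calculation might look like the sketch below. It assumes Retry-After arrives in its delay-seconds form, uses full jitter over an exponentially growing ceiling, and adds the jitter on top of Retry-After so that clients do not all return at the same instant. The function name and default values are my own choices.

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential ceiling, bounded so late attempts do not wait forever.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: a uniform draw between 0 and the ceiling.
    jitter = rng() * ceiling
    if retry_after is not None:
        # Wait at least as long as the server asked, plus jitter.
        return float(retry_after) + jitter
    return jitter


# A fixed rng makes the example deterministic.
delay_first = retry_delay(0, rng=lambda: 0.5)
delay_with_header = retry_delay(0, retry_after=30, rng=lambda: 0.5)
```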

Many APIs also expose practical quota information so clients can reduce traffic before receiving a 429. The exact contract belongs in the API documentation, and it should not reveal sensitive internal capacity details.

Standards note - May 31, 2026

RateLimit and RateLimit-Policy are defined by draft-ietf-httpapi-ratelimit-headers-11, an active IETF Internet-Draft updated May 23, 2026. Treat these fields as work in progress rather than a finalized RFC. Retry-After and HTTP 429 remain the stable baseline.

Edge Cases Worth Designing For

Edge cases are where a limiter stops being a data-structure exercise and becomes a systems problem. The right response is often a layered policy rather than one increasingly complicated bucket.

Edge case	Failure mode	Design response
Window boundary burst	A fixed window admits roughly 2x traffic near a boundary	Use token bucket or a sliding-window approach
Non-atomic update	Concurrent requests spend the same token	Use atomic INCR where sufficient or a short Redis Lua script
Clock skew	Replicas calculate different refill amounts	Use a trusted time source such as Redis TIME; never trust client time
Redis outage	The shared decision path disappears	Choose fail open, fail closed, or degraded local mode explicitly
Hot user or API key	One key overloads a shard	Pre-limit locally, block abusive keys, and observe shard load
NAT or shared IP	Many legitimate users share one address	Prefer identity-based limits after authentication
IPv6 rotation	An attacker cycles addresses	Consider prefix-based controls and combine signals carefully
Login abuse	Credential stuffing bypasses a single policy	Layer account, IP, subnet, and failure-count limits
Expensive endpoint	Cheap and costly requests consume equal allowance	Spend weighted tokens and add concurrency control
Unauthenticated bot burst	No user ID exists yet	Use CDN or WAF protection, IP limits, and progressive challenges
Plan change	Cached limits remain stale after upgrade or downgrade	Version policies and invalidate cached configuration
Multi-region overage	Regions independently accept beyond a global quota	Reserve quota or reconcile with an accepted tolerance
Retry storm	Clients retry together and amplify load	Return Retry-After and require exponential backoff with jitter
Memory growth	Random identities leave endless state	Set TTLs, clean logs, and monitor key cardinality
Multi-account abuse	An attacker spreads traffic across accounts	Combine tenant, device, network, and fraud signals

End-to-End Reference Architecture

Flow · Production request path

Client
  |
  v
CDN / WAF
  |
  v
Load balancer
  |
  v
API gateway
  |
  +--> local pre-limiter
  |      |
  |      +-- abusive burst --> 429
  |
  +--> Redis Cluster distributed limiter
  |      |
  |      +-- policy exceeded --> 429 + Retry-After
  |
  +--> auth, validation, and routing
         |
         v
     API service
         |
         v
  downstream dependencies

PostgreSQL: plans, subscriptions, audit records, billing reconciliation
Metrics: allowed, blocked, errors, fallback state, latency, hot keys

Redis belongs on the fast admission path. PostgreSQL belongs behind the durable product rules: subscription plans, tenant configuration, audit history, and billing reconciliation. CDN and WAF controls absorb obvious unauthenticated abuse before it reaches application infrastructure. Each layer has a narrower job.

Observability

A limiter can fail in two directions: it can allow traffic that should be blocked, or block users who should be allowed. Both failures can hide behind an apparently healthy API. Instrument the decision path as a first-class production component.

rate_limiter_allowed_total and rate_limiter_blocked_total, partitioned by policy and endpoint
http_429_total by endpoint, tenant, API key, and region
rate_limiter_redis_latency_ms and rate_limiter_redis_errors_total
rate_limiter_fallback_mode_total with alerts for entry and duration
rate_limiter_hot_keys_detected_total and Redis per-shard CPU or memory pressure
Policy-cardinality dashboards for IP, user, tenant, and API-key keys

Use sampled decision logs for investigation: policy key type, endpoint, requested cost, allow or reject result, region, fallback mode, and a safe reason code. Avoid leaking raw secrets such as API keys into logs.
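As a rough sketch of that instrumentation, the counters below reuse the metric names listed above, and the log record names the key type without carrying the key value. A Counter stands in for a real metrics client, and both function names are hypothetical.

```python
from collections import Counter

metrics = Counter()


def record_decision(policy, endpoint, allowed, fallback_mode=None):
    """Increment allowed/blocked counters, partitioned by policy and endpoint."""
    outcome = "allowed" if allowed else "blocked"
    metrics[(f"rate_limiter_{outcome}_total", policy, endpoint)] += 1
    if fallback_mode is not None:
        metrics[("rate_limiter_fallback_mode_total", fallback_mode)] += 1


def decision_log(key_type, endpoint, cost, allowed, region, reason, fallback_mode=None):
    """Build a log record with the key type only, never the raw key."""
    return {
        "key_type": key_type,
        "endpoint": endpoint,
        "requested_cost": cost,
        "result": "allow" if allowed else "reject",
        "region": region,
        "fallback_mode": fallback_mode,
        "reason": reason,
    }


record_decision("api_key", "/invoices/export", allowed=True)
record_decision("api_key", "/invoices/export", allowed=False)
record_decision("ip", "/login", allowed=False, fallback_mode="degraded_local")
entry = decision_log("api_key", "/invoices/export", 5, False, "eu-west", "BUCKET_EMPTY")
```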

A Strong Design Answer

A strong answer is structured. Clarify the policy key and whether the goal is infrastructure safety or business quota. Estimate the average and peak load. Choose an algorithm based on burst behavior and accuracy. Make the update atomic. Explain shard keys, hot keys, TTLs, failure behavior, regional consistency, client backoff, and observability.

Interview-ready summary

For 50 million requests per day, the average is about 579 RPS, but I would design for peak traffic. For general API protection I would start with token bucket because it allows controlled bursts with constant per-key state. Across replicas I would keep shared state in Redis and execute refill, decision, decrement, and TTL update atomically with a short Lua script. I would shard by user, API key, or tenant, add local pre-limiters for abuse and degraded operation, and choose regional approximation or stricter quota reservation based on whether I am protecting infrastructure or enforcing billing-grade quota.

Where to Go Deeper

The most useful next step is to go one level deeper into the decisions that change production behavior. These references and topics extend the design without turning it into a single oversized limiter.

Token bucket tuning: capacity, refill rate, weighted cost, and the burst budget
Multi-region quota allocation: reservation, reconciliation, and acceptable overage
Client resilience: Retry-After, exponential backoff, jitter, and idempotency

The durable mental model is simple: the algorithm decides how capacity is counted, storage decides where state lives, atomicity prevents double-spending, scaling distributes pressure, failure policy defines degraded behavior, and the product requirement decides which trade-offs are acceptable.
