How Rate Limiting Works Internally...

1. The Hook: Why Rate Limiting Is a Survival Mechanism
Most engineers first encounter rate limiting as a consumer — your API call comes back with a 429 Too Many Requests and you feel mildly inconvenienced. But flip to the other side of that wall, and the story changes completely. Rate limiting is not a feature you add when you have spare time. It is the difference between a system that stays alive under pressure and one that silently crumbles.
The "Noisy Neighbor" Problem
Imagine a shared apartment building where one tenant decides to run a hot tub, a sauna, and a server rack simultaneously. Everyone else's power flickers. That is precisely what happens in a shared API infrastructure when one aggressive client — or a runaway script with a bug in its retry loop — starts firing thousands of requests per second. Because most backend resources (database connection pools, CPU, memory, downstream API quotas) are shared across all tenants, a single bad actor can starve every other user of service. No malicious intent required. A developer accidentally deploying a client that forgot its exponential backoff can cause a production outage for thousands of unrelated users.
Cost Control in the Cloud Era
Pre-cloud, a server had a physical ceiling. It would slow down, queue requests, and eventually drop them — but your bill was fixed. In the cloud era, auto-scaling removes that ceiling. Your Kubernetes cluster happily spins up 200 pods to handle the flood of traffic from a DDoS attack or a viral Reddit thread, and your AWS invoice at the end of the month reflects every one of those compute-hours. Rate limiting caps the blast radius. It says: regardless of what is hitting us, we will not process more than X requests per second, which means our infrastructure cost is bounded and our legitimate users are protected.
The API Contract: 429 and Retry-After
The HTTP specification gave us a clean, standardised way to communicate rate limiting to clients. When a request is rejected for exceeding a limit, the server returns:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1719835200
The Retry-After header is the polite handshake of the internet. It tells the client: "I am not broken, I am not hostile — come back in 30 seconds." Well-designed clients respect this header and back off gracefully. Poorly designed ones ignore it and hammer the server harder, which is exactly why the server-side enforcement needs to be robust regardless of client behaviour.
2. The Core Algorithms
This is where the interesting engineering lives. There are four primary algorithms in widespread use, and each one makes a different trade-off between precision, memory, burst tolerance, and implementation complexity.
Token Bucket
The token bucket is probably the most widely deployed algorithm in production systems. Stripe uses it. Amazon's AWS API Gateway uses it. The mental model is intuitive: imagine a bucket with a maximum capacity of N tokens. Tokens are added to the bucket at a constant rate — say, 10 tokens per second. Every incoming request must consume one token to proceed. If a request arrives and the bucket is empty, it is dropped (or queued, depending on your implementation). If the bucket is full and no requests are coming in, tokens accumulate up to the maximum capacity but no further.
The key insight is that the bucket's capacity is what allows bursts. A client that has been quiet for 10 seconds has accumulated 100 tokens (at 10/s rate) and can fire 100 requests in rapid succession before being throttled. This is extremely desirable for legitimate use cases — a dashboard that refreshes data in bulk, or a CI pipeline that makes a flurry of API calls at startup.
Trade-off summary:
Allows controlled bursts up to bucket capacity
Smooths sustained traffic to the refill rate
Simple state: just the current token count and last refill timestamp
Used by: Stripe, AWS API Gateway, many CDNs
Leaky Bucket
The leaky bucket flips the mental model. Instead of tokens, think of a bucket with a hole in the bottom. Requests pour in at any rate, but they drip out at a constant, fixed rate. If the bucket overflows (too many requests queued), new arrivals are dropped.
The practical implementation is a FIFO queue with a fixed-rate processor. Regardless of how many requests arrive in a spike, the output is a perfectly smooth, metered stream. This makes the leaky bucket ideal for situations where downstream systems cannot handle bursty input — a legacy database, a third-party API with strict rate limits of its own, or a hardware device with fixed processing capacity.
The downside is that legitimate bursts are penalised. If a user fires 50 valid requests at once, 40 of them sit in a queue (or get dropped), even if the system has headroom. NGINX's limit_req module uses a leaky bucket variant.
Trade-off summary:
Produces perfectly smooth output traffic
No burst tolerance — all spikes are clipped
Predictable queue depth and latency
Used by: NGINX, network traffic shaping
Fixed Window Counter
This is the simplest algorithm to understand and implement. Divide time into fixed intervals — say, one-minute windows. Keep a counter for each window. Each request increments the counter. When the counter hits the limit, reject requests until the window resets.
Window: 13:00:00 — 13:01:00 | Counter: 0 → 1 → 2 → ... → 100 | REJECT
Window: 13:01:00 — 13:02:00 | Counter resets to 0
The implementation is almost trivially simple, but it has a well-known structural flaw: the boundary problem.
Consider a limit of 100 requests per minute. A clever (or just unlucky) user sends 100 requests at 13:00:58 — 13:00:59, exhausting their window. The window resets at 13:01:00, and they immediately send another 100 requests at 13:01:00 — 13:01:01. From the algorithm's perspective, both batches are perfectly valid. But from a real-time perspective, 200 requests arrived in a two-second span — double the intended rate. For most use cases this edge case is tolerable. For billing systems or security-sensitive endpoints, it is a genuine vulnerability.
Trade-off summary:
Extremely simple to implement (a single Redis
INCR+EXPIRE)Low memory footprint
Boundary problem allows ~2x traffic bursts at window edges
Good enough for most non-critical rate limiting scenarios
Sliding Window Log
The sliding window log is the accurate answer to the fixed window's boundary problem. Instead of a counter, maintain a log (typically a sorted set by timestamp) of every request. When a new request arrives, remove all entries older than the window size, then count the remaining entries. If the count is below the limit, allow the request and add its timestamp to the log.
Limit: 5 requests per 60 seconds
Current time: 13:05:45
Log: [13:04:55, 13:05:10, 13:05:30, 13:05:40]
Remove entries before 13:04:45 → none removed
Count: 4 → below limit → ALLOW, add 13:05:45
Log: [13:04:55, 13:05:10, 13:05:30, 13:05:40, 13:05:45]
This is extremely precise — there is no boundary effect because the window slides continuously with time. The cost is memory: you are storing a timestamp for every single request within the window, which becomes expensive at high traffic volumes.
Sliding Window Counter
A practical hybrid of the fixed window and sliding window log. Keep two fixed window counters — the current window and the previous window — and use a weighted average to approximate the sliding window count.
Approximate current rate = (previous_window_count × overlap_fraction) + current_window_count
For example, if you are 30% into the current window, the overlap fraction from the previous window is 70%, so:
Rate ≈ (previous_count × 0.7) + current_count
This gives you near-sliding-window accuracy at the memory cost of just two integers per user. Cloudflare uses this approach at their edge nodes.
Trade-off summary:
Near-accurate, eliminates most of the boundary problem
Very low memory footprint (two counters per user)
Slight approximation error (~0.003% under uniform traffic per Cloudflare's own analysis)
Used by: Cloudflare
3. Taking It Distributed: The Real-World Challenge
Every algorithm described above is easy to implement on a single server. In reality, any serious backend runs multiple application servers behind a load balancer, which means your rate limiting state is fragmented. Server A has seen 80 of a user's requests this minute; Server B has seen 60. Neither knows about the other. The user has effectively been granted 140 requests when the limit is 100.
This is the distributed rate limiting problem, and it is genuinely hard.
The Redis Approach
The canonical solution is to move the rate limiting state out of the individual application servers and into a shared, fast, in-memory data store. Redis is almost universally chosen for this role because it offers sub-millisecond latency, atomic operations, and built-in key expiry.
The architecture looks like this:
Request → Load Balancer
↓
┌─────────────────────┐
│ App Server A │──── Redis INCR(user:123) ────┐
│ App Server B │──── Redis INCR(user:123) ────┤──→ Redis Cluster
│ App Server C │──── Redis INCR(user:123) ────┘
└─────────────────────┘
Every server checks the same Redis key for a given user. The rate limiting state is now consistent across the entire fleet.
For a fixed window counter, the implementation is two Redis commands:
INCR user:123:2024:minute:845
EXPIRE user:123:2024:minute:845 60
INCR atomically increments the counter and returns the new value. If the returned value exceeds the limit, reject the request. EXPIRE ensures the key is automatically cleaned up after the window passes.
Race Conditions
Here is where it gets subtle. Consider two application servers processing requests from the same user at the exact same millisecond:
Server A: GET user:123 → 99 (below limit of 100)
Server B: GET user:123 → 99 (below limit of 100)
Server A: SET user:123 = 100 (allow request)
Server B: SET user:123 = 100 (allow request — OOPS, this is the 101st)
Both servers read a value below the limit, both decided to allow the request, and now you have let through more requests than permitted. This is a classic check-and-set race condition, and it is not hypothetical — at high request rates it happens constantly.
The naive fix is a Redis distributed lock. Acquire a lock, read, increment, release. But this serialises access to the counter, adds significant latency, and creates a new failure mode: what happens if a server dies while holding the lock?
The Lua Script Solution
The elegant, production-grade solution is to use Redis's scripting capability. Redis executes Lua scripts atomically — the entire script runs as a single, uninterruptible operation from Redis's perspective. No other command can execute in between.
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call('INCR', key)
if current == 1 then
redis.call('EXPIRE', key, window)
end
if current > limit then
return 0 -- rejected
else
return 1 -- allowed
end
This script does the read, the increment, and the decision in one atomic step. The race condition is structurally eliminated — not by locking, but by ensuring the check and the mutation are a single indivisible operation. This is the approach used by most serious rate limiting libraries (e.g., redis-py's rate limiter, redis-cell module, and implementations at Stripe and GitHub).
4. Performance and Low-Latency Considerations
For the vast majority of applications, a round-trip to Redis adds 1–5 milliseconds of latency per request. That is entirely acceptable. But for latency-sensitive systems — high-frequency trading platforms, real-time bidding engines, ultra-low-latency API gateways — even a single network hop per request is too expensive.
Local Caching with Periodic Sync
One approach is a hybrid model: each application server maintains a local, in-memory counter and only syncs with central Redis periodically (every 100ms, for example). The local counter is the hot path — no network, no serialisation, pure memory access. The sync flushes the local delta to Redis and pulls the global count.
App Server A: local_count = 0
→ 50 requests arrive → local_count = 50
→ 100ms sync: Redis INCRBY(key, 50) → global = 50
→ pull global = 50, reset local_count = 0
App Server B: local_count = 0
→ 60 requests arrive → local_count = 60
→ 100ms sync: Redis INCRBY(key, 60) → global = 110
→ pull global = 110
The trade-off is precision. During the sync interval, each server has a stale view of the global count. You may allow slightly more traffic than the strict limit during that window. For most use cases this is acceptable — the overrun is bounded by (num_servers × local_batch_size) and averages out over time. For hard security limits, it is not.
Rate Limiting at the Edge
Perhaps the most important architectural principle: rate limiting should happen as early as possible in the request path. Every layer inward that a request travels before being rejected wastes resources — TCP connection establishment, TLS handshake, load balancer processing, application server thread allocation.
The ideal order is:
Network edge / CDN (Cloudflare, Fastly, Akamai): Drop obviously malicious traffic based on IP reputation, ASN, or crude request-rate heuristics before it ever reaches your infrastructure.
API Gateway / Load Balancer (AWS API Gateway, Kong, Nginx): Apply per-user, per-key, or per-endpoint rate limits using centralised state (Redis). This is where most business logic rate limiting should live.
Application layer: Fine-grained, context-aware limits that require business logic (e.g., "this user is on the free tier") as a last line.
Pushing rate limiting to the API Gateway is particularly powerful because it is the natural choke point — all traffic passes through it, it is horizontally scalable, and it is typically managed infrastructure with Redis already wired in.
5. Summary and Decision Matrix
Rate limiting is a layered problem. No single algorithm or deployment pattern is universally correct. The right choice depends on your traffic shape, consistency requirements, and latency budget.
Quick decision guide:
| Requirement | Algorithm | Storage |
|---|---|---|
| Smooth, metered output to a downstream system | Leaky Bucket | Single server or Redis queue |
| Allow legitimate traffic bursts | Token Bucket | Redis (atomic Lua) |
| Simplest possible implementation | Fixed Window Counter | Redis INCR + EXPIRE |
| High accuracy without the memory cost of full logs | Sliding Window Counter | Redis (two counters) |
| Distributed, accurate, production-grade | Sliding Window Counter + Redis + Lua | Redis Cluster |
| Ultra-low latency, can tolerate slight over-counting | Token Bucket + Local Cache | Local memory + async Redis sync |
| Drop attacks before they reach infrastructure | Any algorithm | Edge node (CDN/Gateway) |
The common thread across all serious production implementations is this: rate limiting state should be centralised (Redis), mutations should be atomic (Lua scripts), and enforcement should happen as early in the request path as possible (API Gateway or edge). Everything else is a tuning decision.
The 429 Too Many Requests response is not a wall — it is a signal. A well-designed rate limiting system is one that sends that signal cleanly, consistently, and at the right layer, so that the rest of your infrastructure never has to deal with the traffic that triggered it.
Further reading: Stripe's engineering blog post on their rate limiting implementation is the gold standard for the Token Bucket approach in distributed systems. Cloudflare's blog covers sliding window counters at edge scale. The Redis documentation on the INCR/EXPIRE pattern and the redis-cell module covers the atomic operations described here.


