Rate Limiting in Distributed Systems: A Complete Guide
Rate limiting is a critical technique for controlling the rate of requests that clients can make to a service. It protects your infrastructure from abuse, prevents resource starvation, and ensures fair usage across all consumers. In this comprehensive guide, we explore the major rate limiting algorithms, their trade-offs, and how to implement them at scale.
Why Rate Limiting Matters
Without rate limiting, a single misbehaving client can overwhelm your entire system. Consider these real-world scenarios:
- API Abuse: A buggy client sending thousands of requests per second can exhaust your database connections.
- DDoS Mitigation: Rate limiting serves as a first line of defense against distributed denial-of-service attacks.
- Cost Control: Cloud services charge per request. Uncontrolled traffic leads to unexpected bills.
- Fair Usage: Ensuring no single tenant monopolizes shared resources in a multi-tenant system.
Rate limiting is closely related to other resilience patterns like circuit breakers and partial failure handling. Together, they form a robust defense against cascading failures.
Rate Limiting Algorithms
1. Token Bucket Algorithm
The token bucket is the most widely used algorithm. A bucket holds tokens that are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.
Key Properties:
- Allows short bursts of traffic up to the bucket capacity
- Smooths out traffic over time
- Simple to implement and understand
- Used by AWS API Gateway and Stripe
```python
import time
import threading

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max tokens
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage: 10 requests/sec, burst of 20
limiter = TokenBucket(rate=10, capacity=20)
if limiter.allow_request():
    process_request()
else:
    return_429_too_many_requests()
```
2. Leaky Bucket Algorithm
The leaky bucket processes requests at a constant rate, like water leaking from a bucket. Incoming requests are added to a queue (the bucket). If the queue is full, new requests are dropped.
Key Difference from Token Bucket: The leaky bucket enforces a strict output rate, while the token bucket allows bursts. This makes leaky bucket ideal for scenarios requiring smooth, consistent throughput.
```python
import time
import threading
from collections import deque

class LeakyBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # requests leaked (processed) per second
        self.capacity = capacity  # maximum queue size
        self.queue = deque()
        self.last_leak = time.monotonic()
        self.lock = threading.Lock()

    def allow_request(self):
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_leak
            leak_count = int(elapsed * self.rate)
            for _ in range(min(leak_count, len(self.queue))):
                self.queue.popleft()
            if leak_count > 0:
                # Advance by the whole leaks consumed rather than jumping
                # to `now`, so fractional leak time is not lost between calls
                self.last_leak += leak_count / self.rate
            if len(self.queue) < self.capacity:
                self.queue.append(now)
                return True
            return False
```
3. Fixed Window Counter
The fixed window algorithm divides time into fixed windows (e.g., 1-minute intervals) and counts requests in each window. Once the counter exceeds the limit, subsequent requests are rejected until the next window.
Pros: Simple, memory-efficient, easy to implement with Redis.
Cons: Suffers from boundary burst problem — a client can send 2x the limit at the window boundary.
```python
import redis
import time

r = redis.Redis()

def fixed_window_rate_limit(client_id, limit, window_seconds):
    window = int(time.time() / window_seconds)
    key = f"rate:{client_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    result = pipe.execute()
    current_count = result[0]
    return current_count <= limit
```
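The boundary burst weakness is easy to reproduce with an in-memory version of the same counter (a self-contained sketch for illustration; no Redis required):

```python
import time

class FixedWindowCounter:
    """In-memory fixed window counter, mirroring the Redis version above."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = {}  # window index -> request count

    def allow(self, now=None):
        now = time.time() if now is None else now
        window = int(now / self.window_seconds)
        self.counts[window] = self.counts.get(window, 0) + 1
        return self.counts[window] <= self.limit

# Limit is 100 requests/min, but a client that sends 100 requests just
# before a window boundary and 100 just after gets 200 accepted in ~1 second.
limiter = FixedWindowCounter(limit=100, window_seconds=60)
accepted = sum(limiter.allow(now=59.5) for _ in range(100))   # end of window 0
accepted += sum(limiter.allow(now=60.5) for _ in range(100))  # start of window 1
print(accepted)  # 200
```

Both bursts land in different windows, so each is under the per-window limit even though the combined rate is double what was intended.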
4. Sliding Window Log
The sliding window log maintains a sorted set of timestamps, one per request. To check the limit, count all timestamps within the sliding window. This eliminates the boundary burst problem but uses more memory, since every request leaves an entry behind.
```python
def sliding_window_log(client_id, limit, window_seconds):
    now = time.time()
    key = f"sliding:{client_id}"
    window_start = now - window_seconds
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, window_start)  # drop timestamps outside the window
    pipe.zadd(key, {str(now): now})              # record this request
    pipe.zcard(key)                              # count requests in the window
    pipe.expire(key, window_seconds)
    result = pipe.execute()
    return result[2] <= limit
```
5. Sliding Window Counter
A hybrid approach combining fixed window and sliding window. It estimates the count in the current sliding window by weighting the previous and current fixed window counts.
```python
def sliding_window_counter(client_id, limit, window_seconds):
    now = time.time()
    current_window = int(now / window_seconds)
    previous_window = current_window - 1
    elapsed_in_window = (now % window_seconds) / window_seconds
    current_key = f"rate:{client_id}:{current_window}"
    previous_key = f"rate:{client_id}:{previous_window}"
    current_count = int(r.get(current_key) or 0)
    previous_count = int(r.get(previous_key) or 0)
    # Weight the previous window by how much of it still overlaps the sliding window
    weighted_count = previous_count * (1 - elapsed_in_window) + current_count
    if weighted_count < limit:
        # Record this request; without the increment the counters never advance
        pipe = r.pipeline()
        pipe.incr(current_key)
        pipe.expire(current_key, window_seconds * 2)
        pipe.execute()
        return True
    return False
```
Algorithm Comparison
| Algorithm | Memory | Burst Handling | Accuracy | Complexity |
|---|---|---|---|---|
| Token Bucket | Low | Allows bursts | Good | Low |
| Leaky Bucket | Low | Strict smoothing | Good | Low |
| Fixed Window | Very Low | Boundary burst issue | Moderate | Very Low |
| Sliding Window Log | High | No boundary issues | Exact | Medium |
| Sliding Window Counter | Low | Near-accurate | High | Low |
Distributed Rate Limiting with Redis
In a distributed system with multiple API servers, you need a shared rate limiter so that all servers enforce one combined limit. Redis is the de facto choice because of its atomic operations and low latency. Below is an implementation of the sliding window counter as a Redis Lua script; running the read, check, and increment inside one script keeps the sequence atomic under concurrent requests:
```lua
-- Redis Lua script for an atomic sliding window counter
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local current_window = math.floor(now / window)
local previous_window = current_window - 1
local elapsed = (now % window) / window

local current_key = key .. ":" .. current_window
local previous_key = key .. ":" .. previous_window

-- GET returns false for a missing key, so `or "0"` defaults it to zero
local current_count = tonumber(redis.call("GET", current_key) or "0")
local previous_count = tonumber(redis.call("GET", previous_key) or "0")

local weighted = previous_count * (1 - elapsed) + current_count
if weighted < limit then
    redis.call("INCR", current_key)
    redis.call("EXPIRE", current_key, window * 2)
    return 1
else
    return 0
end
```
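The script's admission decision can be sanity-checked with a pure-Python mirror of the same arithmetic (illustrative only; in production the counters live in Redis and the decision runs server-side):

```python
def sliding_window_decision(previous_count, current_count, now, window, limit):
    """Mirror of the Lua script's weighted-count check."""
    elapsed = (now % window) / window  # fraction of the current window elapsed
    weighted = previous_count * (1 - elapsed) + current_count
    return weighted < limit  # same condition the script tests before INCR

# 15 seconds into a 60-second window, 75% of the previous window still
# overlaps the sliding window: weighted = 80 * 0.75 + 30 = 90
print(sliding_window_decision(80, 30, now=75, window=60, limit=100))  # True
print(sliding_window_decision(80, 30, now=75, window=60, limit=90))   # False
```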
API Rate Limiting Best Practices
When implementing rate limiting for your API, follow these best practices:
- Return proper headers: Include `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` in responses.
- Use the 429 status code: Return HTTP 429 Too Many Requests with a `Retry-After` header.
- Rate limit by multiple dimensions: Per user, per IP, per API key, and per endpoint.
- Implement graceful degradation: Consider returning cached or partial results instead of outright rejection.
- Use tiered limits: Different plans get different rate limits (free: 100/hr, pro: 10,000/hr).
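The header conventions from the list above can be centralized in a small helper (a sketch; the function name and signature are illustrative, not a standard API):

```python
import time

def rate_limit_headers(limit, remaining, reset_epoch, now=None):
    """Build the conventional X-RateLimit-* headers for any response."""
    now = int(time.time()) if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        # Only rejected responses need to tell the client when to retry
        headers["Retry-After"] = str(max(0, reset_epoch - now))
    return headers

print(rate_limit_headers(100, 42, reset_epoch=1_700_000_060, now=1_700_000_000))
```

Attaching these headers to successful responses too, not just 429s, lets well-behaved clients pace themselves before they ever hit the limit.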
```python
import time
import functools

from flask import Flask, request, jsonify

app = Flask(__name__)

# check_rate_limit, get_remaining, and get_reset_time are placeholders for
# your limiter backend (for example, the Redis-based functions shown earlier)
def rate_limit(limit_per_minute):
    def decorator(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            client_id = request.headers.get("X-API-Key", request.remote_addr)
            if not check_rate_limit(client_id, limit_per_minute, 60):
                remaining = get_remaining(client_id)
                reset_time = get_reset_time(client_id)
                response = jsonify({"error": "Rate limit exceeded"})
                response.status_code = 429
                response.headers["X-RateLimit-Limit"] = str(limit_per_minute)
                response.headers["X-RateLimit-Remaining"] = str(remaining)
                response.headers["X-RateLimit-Reset"] = str(reset_time)
                response.headers["Retry-After"] = str(reset_time - int(time.time()))
                return response
            return f(*args, **kwargs)
        return wrapper
    return decorator

@app.route("/api/data")
@rate_limit(limit_per_minute=100)
def get_data():
    return jsonify({"data": "value"})
```
Real-World Rate Limiting Architectures
Major tech companies implement rate limiting at multiple layers:
| Company | Algorithm | Implementation |
|---|---|---|
| Stripe | Token Bucket | Per-API-key, 100 req/sec default |
| GitHub | Sliding Window | 5,000 req/hr per authenticated user |
| Twitter/X | Fixed Window | Per-endpoint, 15-min windows |
| Cloudflare | Sliding Window Counter | Edge-level, configurable per route |
For more on protecting distributed systems, see our guides on circuit breakers and idempotency. You can also use the System Design Calculator to estimate the rate limiting requirements for your system.
Frequently Asked Questions
Q: What is the difference between rate limiting and throttling?
Rate limiting rejects requests that exceed a defined threshold, typically returning HTTP 429. Throttling slows down request processing without outright rejection, often using queuing or delayed responses. In practice, the terms are frequently used interchangeably, but throttling implies a softer approach where requests are eventually processed.
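The distinction can be sketched in a few lines: a limiter rejects over-limit requests, while a throttler delays them until capacity frees up (a toy illustration, not production code):

```python
import time

def rate_limited(allow):
    """Rate limiting: over-limit requests are rejected outright."""
    if not allow():
        return ("rejected", 429)
    return ("processed", 200)

def throttled(allow, delay_seconds=0.05):
    """Throttling: over-limit requests wait and are eventually processed."""
    while not allow():
        time.sleep(delay_seconds)  # back off, then retry
    return ("processed", 200)

# With a limiter that refuses the first attempt and accepts the second:
calls = iter([False, True])
print(rate_limited(lambda: next(calls)))      # ('rejected', 429)
calls = iter([False, True])
print(throttled(lambda: next(calls), 0.001))  # ('processed', 200)
```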
Q: Which rate limiting algorithm should I use?
For most API use cases, the sliding window counter offers the best balance of accuracy and memory efficiency. Use the token bucket if you need to allow controlled bursts. Use the leaky bucket if you need a strictly smooth output rate, such as for write-heavy database operations.
Q: How do I rate limit in a microservices architecture?
Use a centralized rate limiter with Redis or a dedicated rate limiting service. Place rate limiting at the API gateway level (e.g., Kong, Envoy, AWS API Gateway) for global limits, and at the service level for service-specific limits. This layered approach is connected to service discovery patterns for routing.
Q: How do I handle rate limiting across multiple data centers?
You can use local rate limiting per data center (simpler, but allows total traffic to exceed the global limit) or global rate limiting with a shared Redis cluster (accurate, but adds cross-region latency). A common compromise is to use local counters synced periodically to a global store. Learn more about this in our geo-distribution guide.
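The "local counters synced periodically" compromise can be sketched as follows (in-memory stand-ins for illustration; a real deployment would sync to a shared store such as a global Redis cluster, and the class and method names here are hypothetical):

```python
import threading

class LocalSyncedLimiter:
    """Per-datacenter counter that periodically reports into a global view.

    Each DC admits requests against an equal share of the global limit and
    pushes its usage at sync time, trading some accuracy for low latency.
    """

    def __init__(self, global_limit, num_datacenters):
        self.local_limit = global_limit // num_datacenters
        self.local_count = 0
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            if self.local_count < self.local_limit:
                self.local_count += 1
                return True
            return False

    def sync(self, global_store):
        """Periodically push the local count and reset (e.g. every second)."""
        with self.lock:
            count, self.local_count = self.local_count, 0
        global_store["total"] = global_store.get("total", 0) + count
        return global_store["total"]

# Two DCs sharing a global limit of 10: each admits at most 5 between syncs
store = {}
dc1 = LocalSyncedLimiter(global_limit=10, num_datacenters=2)
print(sum(dc1.allow() for _ in range(7)))  # 5 admitted locally
dc1.sync(store)
print(store["total"])  # 5
```

A fancier variant would feed the synced global total back into each DC's local budget, but even this static split bounds worst-case overshoot to the sum of the per-DC shares.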