Rate Limiting APIs Without Breaking User Experience


How to implement token bucket rate limiting with Redis, handle bursts gracefully, and communicate limits to clients — with FastAPI examples.


Overview

Rate limiting protects your API from abuse. Bad rate limiting punishes legitimate users. This post shows how to implement rate limiting that stops bots and scrapers while keeping real users unaware it even exists.

Problem

Your API is public. You launch. Within a week:

  • A scraper is hitting your blog endpoint 10,000 times per hour
  • A misconfigured webhook is retrying every second
  • A user's bugged SPA is firing duplicate requests on every keystroke

Without rate limiting, these consume your database connections, inflate your hosting bill, and degrade performance for everyone.

But naive rate limiting ('100 requests per minute, hard cutoff') blocks legitimate users during normal bursts — like rapidly paginating through search results or submitting a form after a validation error.

Solution

Use the token bucket algorithm. It allows bursts while enforcing a long-term average rate.

How Token Bucket Works

  • Each client has a 'bucket' of tokens (e.g., 20)
  • Each request consumes one token
  • Tokens refill at a steady rate (e.g., 10 per minute)
  • If the bucket is empty, the request is rejected

This means a user can fire 20 rapid requests (the burst), then sustain 10 per minute after that. No artificial cliff.
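
Before reaching for Redis, the refill math is easier to see in a minimal in-memory sketch (the class name TokenBucket and its defaults are illustrative; the Redis-backed version below is what you would actually run):

import time

class TokenBucket:
    def __init__(self, max_tokens: float = 20, refill_per_min: float = 10):
        self.max_tokens = max_tokens
        self.refill_per_sec = refill_per_min / 60
        self.tokens = max_tokens          # start full: this is the burst allowance
        self.last_refill = time.time()

    def allow(self) -> bool:
        now = time.time()
        # Refill based on elapsed time, capped at the bucket size
        self.tokens = min(
            self.max_tokens,
            self.tokens + (now - self.last_refill) * self.refill_per_sec,
        )
        self.last_refill = now
        if self.tokens < 1:
            return False  # bucket empty: reject
        self.tokens -= 1
        return True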

Implementation with Redis

import time
import redis.asyncio as redis
from fastapi import Request

redis_client = redis.from_url("redis://localhost:6379")

async def check_rate_limit(
    key: str,
    max_tokens: int = 20,
    refill_rate: float = 10,  # tokens per minute
) -> dict:
    now = time.time()

    # Get current state (a single HMGET is enough; no pipeline needed)
    bucket_key = f"ratelimit:{key}"
    tokens_str, last_refill_str = await redis_client.hmget(
        bucket_key, "tokens", "last_refill"
    )
    
    if tokens_str is None:
        # First request — initialize bucket
        tokens = max_tokens - 1
        # hmset is deprecated in redis-py; hset with mapping= does the same thing
        await redis_client.hset(bucket_key, mapping={
            "tokens": tokens,
            "last_refill": now,
        })
        await redis_client.expire(bucket_key, 300)
        return {"allowed": True, "remaining": tokens}
    
    tokens = float(tokens_str)
    last_refill = float(last_refill_str)
    
    # Refill tokens based on elapsed time
    elapsed = now - last_refill
    tokens = min(max_tokens, tokens + elapsed * (refill_rate / 60))
    
    if tokens < 1:
        retry_after = (1 - tokens) / (refill_rate / 60)
        return {"allowed": False, "retry_after": retry_after}
    
    tokens -= 1
    await redis_client.hset(bucket_key, mapping={
        "tokens": tokens,
        "last_refill": now,
    })
    # Keep the bucket alive while the client stays active
    await redis_client.expire(bucket_key, 300)
    
    return {"allowed": True, "remaining": int(tokens)}

FastAPI Middleware

import math

from fastapi import FastAPI
from starlette.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # Skip rate limiting for static assets
    if request.url.path.startswith(("/assets", "/images")):
        return await call_next(request)
    
    # Use IP + path as the rate limit key
    client_ip = request.headers.get(
        "x-forwarded-for", request.client.host
    ).split(",")[0].strip()
    key = f"{client_ip}:{request.url.path}"
    
    result = await check_rate_limit(key)
    
    if not result["allowed"]:
        return JSONResponse(
            status_code=429,
            content={"error": "Too many requests"},
            headers={
                # Round up so clients never retry a moment too early
                "Retry-After": str(math.ceil(result["retry_after"])),
                "X-RateLimit-Limit": "20",
                "X-RateLimit-Remaining": "0",
            }
        )
    
    response = await call_next(request)
    response.headers["X-RateLimit-Remaining"] = str(result["remaining"])
    return response
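
To sanity-check the middleware against a running dev server, a quick loop is enough to watch the remaining-token count drop and the 429 appear. This sketch uses httpx, and /api/posts is a placeholder for whatever endpoint you actually expose:

import httpx

with httpx.Client(base_url="http://localhost:8000") as client:
    for i in range(25):
        r = client.get("/api/posts")
        # Remaining tokens while allowed; Retry-After once the bucket is empty
        print(i, r.status_code,
              r.headers.get("x-ratelimit-remaining"),
              r.headers.get("retry-after"))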

Implementation

Tiered Rate Limits

Different endpoints need different limits:

Endpoint                Burst    Rate      Why
GET /api/posts          30       15/min    Read-heavy, cacheable
POST /api/contact       3        2/min     Prevents spam
POST /api/auth/login    5        5/min     Brute force protection
GET /api/search         10       10/min    Expensive queries

RATE_LIMITS = {
    "POST:/api/contact": {"max_tokens": 3, "refill_rate": 2},
    "POST:/api/auth/login": {"max_tokens": 5, "refill_rate": 5},
    "GET:/api/search": {"max_tokens": 10, "refill_rate": 10},
    "default": {"max_tokens": 20, "refill_rate": 15},
}
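
The middleware can then resolve limits per method and path before calling check_rate_limit. A minimal sketch (the helper name resolve_limits is illustrative):

def resolve_limits(method: str, path: str) -> dict:
    # Fall back to the default bucket for any route not listed explicitly
    return RATE_LIMITS.get(f"{method}:{path}", RATE_LIMITS["default"])

# Inside rate_limit_middleware, replace the fixed defaults with:
#     limits = resolve_limits(request.method, request.url.path)
#     result = await check_rate_limit(key, **limits)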

Client-Side Handling

Do not just throw errors at the user. Handle 429 responses gracefully:

async function fetchWithRetry(url: string, retries = 2): Promise<Response> {
  const response = await fetch(url);
  
  if (response.status === 429 && retries > 0) {
    const retryAfter = parseInt(
      response.headers.get("Retry-After") || "5",
      10
    );
    await new Promise(r => setTimeout(r, retryAfter * 1000));
    return fetchWithRetry(url, retries - 1);
  }
  
  return response;
}

Upstash for Serverless

If you deploy to a serverless platform like Vercel, there is no long-lived process to hold a Redis connection open. Use a hosted, HTTP-based option like Upstash:

import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, "60 s"),
});

export default async function handler(req) {
  const ip = req.headers.get("x-forwarded-for") ?? "127.0.0.1";
  const { success, limit, remaining } = await ratelimit.limit(ip);
  
  if (!success) {
    return new Response("Too many requests", { status: 429 });
  }
  // ... handle request
}

Challenges

Distributed rate limiting: If your API runs on multiple servers, each server's in-memory counter is independent. Redis solves this — it is the single source of truth.

IP-based limits behind proxies: Users behind corporate NATs share an IP. Consider adding authenticated user ID as a secondary rate limit key for logged-in users.
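
A minimal sketch of that keying strategy, assuming your auth layer attaches the current user to request.state (the attribute name is illustrative):

def rate_limit_key(request: Request) -> str:
    # Prefer the authenticated user ID; fall back to client IP for anonymous traffic
    user = getattr(request.state, "user", None)
    if user is not None:
        return f"user:{user.id}:{request.url.path}"
    client_ip = request.headers.get(
        "x-forwarded-for", request.client.host
    ).split(",")[0].strip()
    return f"ip:{client_ip}:{request.url.path}"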

Conclusion

Token bucket is the right default for API rate limiting. It handles bursts naturally without punishing legitimate users. Always communicate limits via headers (X-RateLimit-Remaining, Retry-After). Use Upstash when deploying serverless. And test your limits with real traffic patterns before tightening them.