When you start exposing APIs or microservices to the outside world, one of the first real-world problems you hit is traffic that doesn’t behave nicely. Sometimes it is honest heavy usage from one customer, sometimes an integration bug that loops requests, and sometimes outright malicious scanning or credential stuffing. In all of these cases, you need a way to say “enough for now” without breaking the rest of your platform. That is exactly what rate limiting does.
In this article, we will walk through how we design and implement rate limiting for API and microservice architectures using three battle‑tested building blocks: Nginx as the gateway, Cloudflare at the edge, and Redis as a fast shared counter store. I will focus on practical patterns that we actually use in production‑style environments on VPS and dedicated servers at dchost.com: what to limit, where to enforce it, sample Nginx configs, Redis scripts, and how to tune thresholds without constantly fighting false positives.
We will also look at how rate limiting sits next to other pieces you might already be running: WAF rules, DDoS mitigation, caching, and firewalls. The goal is to help you build a layered but understandable traffic control strategy that protects your APIs and microservices while still feeling fast and predictable for legitimate users.
Table of Contents
- 1 Why Rate Limiting Matters for APIs and Microservices
- 2 Core Rate Limiting Concepts and Algorithms
- 3 Designing a Rate Limiting Strategy for Your Architecture
- 4 Implementing Rate Limiting with Nginx
- 5 Cloudflare Rate Limiting and WAF Rules at the Edge
- 6 Redis-Based Distributed Rate Limiting for APIs and Microservices
- 7 Observability, Testing and Tuning Your Limits
- 8 Network-Level Rate Limiting with Firewalls
- 9 Putting It All Together on dchost.com Infrastructure
Why Rate Limiting Matters for APIs and Microservices
Rate limiting is simply controlling how many requests a given actor (IP, user, API key, tenant, etc.) can make over a specific period of time. Done right, it acts like a safety valve for your infrastructure and your business logic.
Key goals of rate limiting
- Protect backend capacity: Databases, queues and external providers all have limits. Rate limiting keeps noisy neighbors from exhausting them.
- Fairness between tenants: In multi‑tenant SaaS, you do not want one large customer to slow down everyone else.
- Abuse and fraud prevention: Limiting login attempts, password reset flows, search endpoints and pricing APIs slows down attackers and scrapers.
- Cost control: If you pay per external API call or per CPU second, rate limits cap the financial blast radius of bugs and misuse.
- SLO and SLA protection: Keeping tail latencies under control is much easier if you cap the worst offenders before they cause a chain reaction.
How it fits with WAF, DDoS and firewalls
Rate limiting is one layer in a broader defense‑in‑depth strategy:
- Network‑level protection (firewalls, basic DDoS filtering) blocks obvious floods or port scans.
- WAF rules block malicious patterns in payloads (SQL injection, XSS signatures, etc.). For a practical overview, our Cloudflare security settings guide on WAF, rate limiting and bot protection shows how we layer these at the edge.
- Rate limiting focuses on volume and frequency of otherwise valid‑looking requests.
In microservice environments, rate limiting is often what keeps a noisy client from pushing a shared dependency (like a database or a payment gateway) over the edge and taking other services down with it.
Core Rate Limiting Concepts and Algorithms
Before we touch Nginx, Cloudflare or Redis, it helps to have a clear vocabulary and mental model.
Basic terminology
- Limit: Maximum number of requests allowed within a time window (for example, 100 requests per minute).
- Window: The time period for the limit (per second, per minute, per hour, per day).
- Key: The identifier you are limiting: IP, user ID, API key, tenant ID, or a combination.
- Burst: Extra requests temporarily allowed above the limit to smooth out short spikes.
- Quota vs rate: Some APIs control total requests per day/month (quota) as well as requests per second/minute (rate).
Main rate limiting algorithms
Most implementations are variants of a few well‑known algorithms:
- Fixed window: Count requests in discrete windows (e.g. 00:00–00:59, 01:00–01:59). Simple but can have edge‑of‑window bursts.
- Sliding window: Keep timestamps of recent requests and count those within the last N seconds. Smoother but needs more storage.
- Token bucket: Tokens accumulate over time; each request consumes one. Allows bursts up to bucket size while enforcing a long‑term average rate.
- Leaky bucket: Like a bucket that leaks at a constant rate; if more water (requests) enters than can leak out, extra is dropped.
- Concurrency limiting: Limit how many in‑flight requests or jobs a key can have at the same time, protecting slow dependencies.
Nginx’s built‑in limit_req is based on the leaky bucket algorithm, with burst defining the size of the queue. Cloudflare offers rules based on request rate and concurrency. Redis lets you implement any of the above by combining counters, TTLs and Lua scripts.
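To make one of these algorithms concrete, here is a minimal in‑memory sliding window limiter in Python. It is purely illustrative (single process, no persistence); a production version would keep this state in Redis, as shown later in this article.

import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = {}  # key -> deque of request timestamps

    def allow(self, key):
        now = time.monotonic()
        q = self.hits.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

# Example: at most 5 requests per 10 seconds per client
limiter = SlidingWindowLimiter(limit=5, window=10)
print(limiter.allow("client-123"))  # True for the first 5 calls within 10 seconds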
Designing a Rate Limiting Strategy for Your Architecture
Good rate limiting is more than just picking a number and copying a config snippet. You need to think about who you’re protecting, what you’re protecting, and where you enforce limits.
Choosing the right key: IP, user or tenant?
Each key type has trade‑offs:
- IP address: Easy to implement at the edge (Nginx, Cloudflare). Works well for anonymous endpoints and brute‑force protection. Weak for mobile users behind NAT or large corporate proxies.
- API key / client ID: Best for authenticated APIs. Fair per‑client limits; not affected by NAT. Requires that authentication happens before rate limiting.
- User ID: Useful for consumer‑facing apps where many users may share an IP. Often combined with a coarser IP‑based limit as a backup.
- Tenant / organization ID: Critical in B2B SaaS to enforce plan‑based quotas (for example, 10k requests/day for the Basic plan, 1M for Enterprise).
In practice, we often layer them. For example:
- Edge rate limiting at Cloudflare by IP to block obvious floods.
- Gateway rate limiting at Nginx by $api_client_id or $jwt_sub.
- Business‑level quotas enforced inside the application using Redis counters.
Where to enforce limits in a microservices stack
A typical dchost.com‑style API stack might look like this:
- Client → Cloudflare: Global edge network, WAF and edge rate limiting.
- Cloudflare → Nginx gateway on your VPS/dedicated server: TLS termination, routing, Nginx rate limiting and microcaching.
- Nginx → backend microservices: RPC/HTTP calls to internal services, each of which may also have local or centralized rate limits.
Rate limits can live in all three layers:
- At the edge (Cloudflare): Protects origin bandwidth and CPU; cheapest place to drop junk traffic.
- At the gateway (Nginx): More context‑aware (paths, headers, auth) and closer to your app logic.
- Inside services (Redis‑backed): Fine‑grained quotas per user, per feature, per tenant; shared across instances.
How to respond: 429 and Retry‑After
For HTTP APIs, the standard way to signal rate limiting is:
- Status code: 429 Too Many Requests
- Header: Retry-After: <seconds> or an HTTP date
Many SDKs and API clients know how to react to 429 automatically (backoff and retry). If you return 500 or 503 instead, clients cannot distinguish rate limiting from real errors, and they may retry too aggressively.
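On the client side, honouring these signals is simple. Here is a minimal sketch using Python's requests library (the URL is a placeholder, and we assume Retry-After is sent in seconds rather than as an HTTP date):

import time
import requests

def get_with_backoff(url, max_retries=5):
    """Retry on 429, honouring Retry-After when the server provides it."""
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Fall back to exponential backoff if Retry-After is missing
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return resp

resp = get_with_backoff("https://api.example.com/v1/products")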
Implementing Rate Limiting with Nginx
Nginx is our go‑to gateway and reverse proxy for APIs and microservices. Its native rate limiting is efficient and, because counters live in a shared memory zone, it applies consistently across all worker processes on a single VPS.
Basic per‑IP rate limiting
Here is a minimal example that limits each IP to 10 requests per second with a small burst:
http {
    # Define a shared memory zone "api_limit" with 10MB of storage
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        listen 80;
        server_name api.example.com;

        location /v1/ {
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://backend_api;
        }
    }
}
Notes:
- $binary_remote_addr is a compact representation of the client IP.
- rate=10r/s sets a base rate of 10 requests per second.
- burst=20 lets a client briefly exceed the rate, which is often necessary for legitimate short spikes.
- nodelay tells Nginx to serve requests within the burst immediately instead of queuing them; anything beyond the burst is rejected.
By default, Nginx will return 503 Service Unavailable when the limit is exceeded. For APIs, you should explicitly configure a 429:
http {
    limit_req_status 429;
}
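Nginx does not add a Retry-After header on its own. If you want the gateway to send one together with a JSON body, one approach (a sketch; adapt the zone and upstream names to your setup) is to route the internally generated 429 through a named location:

http {
    limit_req_status 429;

    server {
        location /v1/ {
            limit_req zone=api_limit burst=20 nodelay;
            # Hand rejected requests to the named location below
            error_page 429 = @rate_limited;
            proxy_pass http://backend_api;
        }

        location @rate_limited {
            default_type application/json;
            # "always" is required so the header is attached to an error response
            add_header Retry-After 1 always;
            return 429 '{"error": "rate_limited"}';
        }
    }
}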
Rate limiting by API key or user ID
Per‑IP limits are crude. For authenticated APIs, it is far better to limit per client. You can expose a header such as X-API-Client-ID from your auth layer (for example, after validating a JWT) and build the zone key from that:
map $http_x_api_client_id $api_client_id {
    ""      $binary_remote_addr;   # fall back to the client IP if the header is missing
    default $http_x_api_client_id;
}
limit_req_zone $api_client_id zone=client_limit:20m rate=100r/m;
server {
    listen 80;
    server_name api.example.com;

    location /v1/ {
        # Per-client limit: 100 requests per minute
        limit_req zone=client_limit burst=50 nodelay;
        proxy_pass http://backend_api;
    }
}
This way, customers on high‑latency mobile connections or behind shared IPs are not penalized for each other’s traffic.
Different limits per endpoint
Some endpoints are much more sensitive than others. For example:
- POST /v1/auth/login and POST /v1/password-reset should have very low limits.
- GET /v1/products can tolerate higher rates.
- GET /v1/reports/export might be limited per hour or per day.
With Nginx you can define multiple zones and apply them selectively:
limit_req_zone $binary_remote_addr zone=login_limit:5m rate=5r/m;
limit_req_zone $api_client_id zone=read_limit:10m rate=50r/s;

server {
    listen 80;

    location = /v1/auth/login {
        limit_req zone=login_limit burst=5 nodelay;
        proxy_pass http://auth_service;
    }

    location /v1/ {
        limit_req zone=read_limit burst=100 nodelay;
        proxy_pass http://backend_api;
    }
}
Protecting login and XML‑RPC style endpoints
For web applications like WordPress, we have seen great results combining Nginx rate limiting with Fail2ban to tame brute‑force attacks against login and XML‑RPC endpoints. The same pattern works for custom APIs: rate limit at Nginx, log offenders, and optionally ban repeat abusers at the firewall level. We describe this in detail in our article the calm way to stop wp-login.php and XML-RPC brute force with Nginx rate limiting + Fail2ban.
Moving beyond a single node: Nginx + Redis
Classic Nginx limit_req uses shared memory local to one server. In a horizontally scaled microservice architecture with multiple API gateways, you may want limits to be global across all nodes.
For that, you can:
- Introduce a dedicated rate limiting service that speaks HTTP/gRPC and uses Redis under the hood.
- Or use Nginx modules (such as lua-resty-limit-traffic with OpenResty) to store counters in Redis directly from Nginx.
The pattern is the same: Nginx extracts a key (client ID, user, tenant), calls out to Redis to check/update a counter, and based on the response either proxies to upstream or returns 429. Redis gives you atomic increments and expirations that work reliably across multiple API gateway instances.
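As a sketch of the first option, the dedicated service can be as small as one HTTP endpoint that the gateway or other services call before forwarding a request. The example below uses Flask and redis‑py; the route, key naming and limits are illustrative:

import time
from flask import Flask, jsonify, request
import redis

app = Flask(__name__)
r = redis.Redis(decode_responses=True)

LIMIT = 100    # requests per window
WINDOW = 60    # window length in seconds

@app.get("/check")
def check():
    key = request.args.get("key", "anonymous")
    counter = f"rl:{key}:{int(time.time()) // WINDOW}"
    count = r.incr(counter)
    if count == 1:
        r.expire(counter, WINDOW)
    if count > LIMIT:
        return jsonify(allowed=False), 429, {"Retry-After": str(WINDOW)}
    return jsonify(allowed=True)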
Cloudflare Rate Limiting and WAF Rules at the Edge
If you already run your domains through Cloudflare, it’s often the most cost‑effective place to start rate limiting. It reduces load on your servers, saves bandwidth, and can be tuned per path and per HTTP method.
Layering edge rules in front of your origin
Cloudflare offers two main tools useful for APIs and microservices:
- Rate limiting rules: Define thresholds such as “if a single IP makes more than 100 requests to /api/ in 10 seconds, then block or challenge for 1 minute”.
- WAF rules: Block obvious attack patterns, bad bots, or known vulnerability exploits before they reach your origin.
For a step‑by‑step walk‑through, including small business use cases and WordPress protection scenarios, see our article Cloudflare security settings guide for WAF, rate limiting and bot protection.
Typical Cloudflare rules for APIs
For API traffic, we commonly configure rules like:
- Global per‑IP rate limit on /api/ paths to absorb obvious floods.
- Stricter limits on /api/auth/* endpoints to slow down credential stuffing.
- Soft limits (JS challenge) on expensive search or listing endpoints to throttle scrapers.
Unlike Nginx’s native limiter, Cloudflare can also combine several dimensions: country, ASN, user‑agent, bot score, and more. It is excellent at filtering out low‑quality or obviously automated traffic before it reaches more expensive layers.
Working with 429s and client behavior
Cloudflare lets you choose the action for each rule: block, challenge, or serve a custom response. For machine‑to‑machine APIs, we usually prefer a straightforward 429 at the edge that clients can understand. For browser‑based traffic, a CAPTCHA or JavaScript challenge may be more appropriate for suspected bots.
If you pair Cloudflare with Nginx and Redis behind it, think of Cloudflare as the outer guard: it handles the worst spikes, while your origin‑side rate limiting is more application‑aware and focused on fairness between legitimate clients.
Redis-Based Distributed Rate Limiting for APIs and Microservices
For truly flexible and global rate limits, especially in a microservice architecture, Redis is the workhorse of choice. It combines low latency, atomic operations and rich data structures, which makes implementing various algorithms straightforward.
Why Redis works so well for rate limiting
- Atomic counters: The INCR and INCRBY commands let you update counters safely even under heavy concurrency.
- Key expiration (TTL): With EXPIRE or SETEX, you can automatically reset counters after a window ends.
- Lua scripting: Run small scripts server‑side for token bucket or sliding window logic in a single atomic step.
- Data structures: Sorted sets, hashes and bitmaps open the door to complex per‑feature, per‑user tracking.
For production use, you will want Redis to be highly available so that your rate limiting is not a single point of failure. Our guide Never Lose Your Cache Again: High‑Availability Redis for WordPress describes how we use Sentinel, AOF/RDB and failover, and the same principles apply to a rate limiting Redis cluster.
Simple fixed-window limit with INCR + EXPIRE
A straightforward per‑minute limit, sketched here in Python with redis‑py rather than pseudocode, could look like this:

import time
import redis

r = redis.Redis(decode_responses=True)

LIMIT = 1000   # requests allowed per one-minute window
WINDOW = 60    # window length in seconds

def check_rate_limit(client_id):
    window = int(time.time()) // WINDOW          # current minute
    key = f"rate:{client_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, WINDOW)                    # counter disappears after one minute
    if count > LIMIT:
        return {"allowed": False, "retry_after": WINDOW}
    return {"allowed": True}
Here:
- Each client has a fresh key every minute (for example, rate:client123:29123456).
- The first request sets the TTL; subsequent ones simply increment.
- Once count exceeds 1000, your service returns 429 to the caller.
Token bucket with Lua scripting
To allow brief bursts while maintaining a long‑term limit, a token bucket usually feels more natural. A minimal Redis Lua script might:
- Read the last refill timestamp and token count.
- Calculate how many tokens to add based on elapsed time and refill rate.
- Deduct one token for the current request if available.
- Return whether the request is allowed plus the current token count.
Example (simplified Lua):
-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill_rate (tokens per second)
-- ARGV[3] = now (current timestamp in seconds)
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call("HMGET", KEYS[1], "tokens", "timestamp")
local tokens = tonumber(data[1]) or capacity
local timestamp = tonumber(data[2]) or now
-- Refill tokens based on time elapsed
local delta = math.max(0, now - timestamp)
local refill = delta * refillRate
tokens = math.min(capacity, tokens + refill)
local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end
redis.call("HMSET", KEYS[1], "tokens", tokens, "timestamp", now)
redis.call("EXPIRE", KEYS[1], 3600)
return { allowed, tokens }
Your application calls this script via EVALSHA for each request. A response of allowed = 1 means proceed; 0 means return a 429 with an appropriate Retry-After.
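With redis‑py, for example, you can register the script once and call it per request; the wrapper handles EVALSHA (and reloading the script if needed) for you. The capacity, refill rate and file name below are illustrative:

import time
import redis

r = redis.Redis()

# token_bucket.lua contains the script shown above
with open("token_bucket.lua") as f:
    token_bucket = r.register_script(f.read())

def allow_request(client_id, capacity=20, refill_rate=5):
    allowed, tokens = token_bucket(
        keys=[f"bucket:{client_id}"],
        args=[capacity, refill_rate, int(time.time())],
    )
    return allowed == 1

if not allow_request("client-123"):
    pass  # return 429 with a Retry-After header to the caller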
Concurrency limits to protect slow dependencies
Sometimes the problem isn’t the total number of requests per minute, but how many slow operations are happening at the same moment. A classic example is generating a large report or calling a slow external payment API.
A simple pattern:
- On request start, try to INCR a key like concurrent:client123:report.
- If the result is more than the allowed maximum (for example, 3), immediately DECR again and return 429.
- On request completion (success or failure), DECR the key.
To protect against stuck counts (for example, if a process crashes), you can combine INCR with a TTL (or use a Lua script that sets a TTL when the counter becomes non‑zero) so that orphaned counters eventually disappear.
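Here is a small Python sketch of this pattern, wrapped in a context manager so the DECR runs even when the operation fails (the key name, limit and TTL safety net are illustrative):

from contextlib import contextmanager
import redis

r = redis.Redis()

class TooManyConcurrent(Exception):
    pass

@contextmanager
def concurrency_slot(key, max_concurrent=3, ttl=600):
    """Reserve one concurrency slot for `key`, or raise if all slots are busy."""
    current = r.incr(key)
    if current == 1:
        r.expire(key, ttl)   # safety net against orphaned counters
    if current > max_concurrent:
        r.decr(key)
        raise TooManyConcurrent(key)
    try:
        yield
    finally:
        r.decr(key)

# In a request handler, translate TooManyConcurrent into an HTTP 429
with concurrency_slot("concurrent:client123:report"):
    pass  # generate_report()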
Observability, Testing and Tuning Your Limits
Rate limiting that you cannot see or measure will eventually hurt you. Observability is just as important as the rules themselves.
Logging and metrics for rate limits
At minimum, you should:
- Log every 429 response from Nginx and your application, including the key that triggered it (IP, user, client ID).
- Expose metrics such as “requests allowed vs limited” per endpoint and per plan.
- Alert when the percentage of 429s spikes unexpectedly; it might indicate an attack or a misconfigured limit.
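With the prometheus_client library, the allowed-versus-limited metric mentioned above boils down to two counters incremented right next to the rate limit decision (the metric and label names are illustrative):

from prometheus_client import Counter

REQUESTS_ALLOWED = Counter(
    "api_requests_allowed_total", "Requests that passed rate limiting",
    ["endpoint", "plan"],
)
REQUESTS_LIMITED = Counter(
    "api_requests_limited_total", "Requests rejected with a 429",
    ["endpoint", "plan"],
)

def record_decision(endpoint, plan, allowed):
    if allowed:
        REQUESTS_ALLOWED.labels(endpoint=endpoint, plan=plan).inc()
    else:
        REQUESTS_LIMITED.labels(endpoint=endpoint, plan=plan).inc()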
For centralizing logs across multiple VPS instances and keeping them queryable, we like stacks such as Loki + Promtail + Grafana. Our article VPS log management without the drama: centralized logging with Loki and Promtail shows how we set this up in hosting environments very similar to API clusters.
Dry runs and soft enforcement
When rolling out new limits, a useful approach is:
- Implement the counters and logging, but still allow all requests (soft mode).
- Observe who would have been rate‑limited for a few days: which endpoints, which keys, at what times.
- Adjust thresholds and key selection based on real observed traffic.
- Switch to hard enforcement with 429s once you are confident.
This is especially important for B2B APIs where you may have clients doing legitimate high‑volume batch imports or overnight sync jobs.
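In code, soft mode is usually nothing more than a flag around the enforcement decision, so counters and logs behave exactly as they will later. A sketch, reusing the check_rate_limit function from the Redis section below:

import logging

SOFT_MODE = True   # flip to False once the thresholds look right

log = logging.getLogger("ratelimit")

def enforce(client_id, endpoint):
    result = check_rate_limit(client_id)   # counts and checks as usual
    if result["allowed"]:
        return True
    log.warning("would rate limit client=%s endpoint=%s", client_id, endpoint)
    if SOFT_MODE:
        return True    # observe only: let the request through
    return False       # hard mode: the caller returns 429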
Choosing sensible numbers
Capacity planning is a whole topic on its own, but a quick rule of thumb:
- Estimate your backend’s safe sustained throughput (requests per second) without breaching CPU, database or external API limits.
- Reserve a margin (for example, 30–40%) for unexpected spikes.
- Distribute the remaining capacity across tenants according to their plans.
Our guide on how to estimate traffic and bandwidth needs on shared hosting and VPS walks through similar calculations from a hosting perspective; the same thinking applies when you translate those numbers into per‑client and per‑endpoint API limits.
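As a quick worked example of that arithmetic (all numbers are made up):

backend_capacity = 500     # safe sustained requests/second for the backend
headroom = 0.35            # keep 35% in reserve for spikes

available = backend_capacity * (1 - headroom)   # 325 r/s left to distribute

# Split across plans, e.g. 200 Basic tenants and 10 Enterprise tenants
basic_limit = available * 0.4 / 200        # ≈ 0.65 r/s per Basic tenant
enterprise_limit = available * 0.6 / 10    # ≈ 19.5 r/s per Enterprise tenant

print(round(basic_limit, 2), round(enterprise_limit, 2))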
Network-Level Rate Limiting with Firewalls
While Nginx, Cloudflare and Redis handle application‑layer limits, it is sometimes useful to add a coarse network‑layer limit for obvious abuse (for example, blocking an IP that is opening thousands of TCP connections per second).
On Linux VPS and dedicated servers, modern setups increasingly use nftables for this purpose. You can define rules that:
- Limit new connections per IP per second to your API ports.
- Temporarily drop or rate limit offenders at the packet level.
- Combine with port knocking or connection tracking for sensitive internal services.
If you want practical examples, including rate limiting and IPv6‑aware rules, our article The nftables firewall cookbook for VPS: rate limiting, port knocking and IPv6 rules goes step‑by‑step through real configurations.
Network‑level rate limiting will never be as precise as Redis‑backed application‑layer limits, but it is an excellent coarse filter and backstop if higher layers misbehave or are overwhelmed.
Putting It All Together on dchost.com Infrastructure
Let’s put all of this into a realistic architecture that we frequently see from our customers running APIs and microservices on VPS and dedicated servers at dchost.com.
A reference architecture for API rate limiting
- DNS and edge: Your API domain points to Cloudflare. At the edge, you configure WAF rules and a few broad rate limiting rules (per‑IP caps on /api/ and tighter caps on /api/auth/*).
- Gateway server: On a high‑performance VPS with NVMe SSD (we discuss why NVMe matters in our article NVMe VPS hosting guide), you run Nginx as the API gateway. Here you enforce per‑client limits with limit_req and return 429 with Retry-After when clients exceed their plan.
- Redis tier: On the same server (for smaller projects) or on a separate VPS/dedicated node (for larger ones), you run an HA Redis deployment. This stores global per‑tenant quotas, token buckets for critical operations, and concurrency counters for slow jobs.
- Backend microservices: Each service trusts the gateway for coarse limits but also performs its own fine‑grained checks via Redis for business‑specific constraints (for example, a maximum of 10 weekly exports per team).
- Observability: Logs and metrics are shipped from all nodes into a centralized stack (for example, Loki + Grafana) so you can see 429 rates, top offenders and path‑level statistics.
Scaling up: VPS, dedicated and colocation options
Because we run our own data centers and infrastructure at dchost.com, this layered approach to rate limiting fits naturally into our hosting portfolio:
- Small and medium APIs: One or two well‑sized NVMe VPS servers are usually enough to host Nginx, Redis and microservices together, with Cloudflare at the edge.
- High‑traffic APIs and SaaS platforms: Separate gateway, Redis and backend tiers across multiple VPS or dedicated servers, often with dedicated database nodes and object storage in the mix.
- Very large or regulated workloads: Customers bring their own hardware via colocation and still apply the same Nginx + Cloudflare + Redis patterns, just on physically dedicated racks and network segments.
Next steps
If you’re planning a new API or microservice platform—or if you’re already hitting limits and noisy neighbors on your current environment—taking a day to design proper rate limiting often saves weeks of firefighting later. Start with a simple edge rule at Cloudflare, add Nginx limits for the riskiest endpoints, and then introduce Redis‑backed quotas where you need cross‑node and per‑tenant control.
At dchost.com we implement these patterns daily on our VPS, dedicated and colocation platforms. If you’re not sure how to size the servers or where to draw the boundaries between gateway, Redis and your services, our team can help you choose a realistic architecture and grow it over time without surprises.
