Rate Limiting Strategies for APIs and Microservices with Nginx, Cloudflare and Redis

When you start exposing APIs or microservices to the outside world, one of the first real-world problems you hit is traffic that doesn’t behave nicely. Sometimes it is honest heavy usage from one customer, sometimes an integration bug that loops requests, and sometimes outright malicious scanning or credential stuffing. In all of these cases, you need a way to say “enough for now” without breaking the rest of your platform. That is exactly what rate limiting does.

In this article, we will walk through how we design and implement rate limiting for API and microservice architectures using three battle‑tested building blocks: Nginx as the gateway, Cloudflare at the edge, and Redis as a fast shared counter store. I will focus on practical patterns that we actually use in production‑style environments on VPS and dedicated servers at dchost.com: what to limit, where to enforce it, sample Nginx configs, Redis scripts, and how to tune thresholds without constantly fighting false positives.

We will also look at how rate limiting sits next to other pieces you might already be running: WAF rules, DDoS mitigation, caching, and firewalls. The goal is to help you build a layered but understandable traffic control strategy that protects your APIs and microservices while still feeling fast and predictable for legitimate users.

Why Rate Limiting Matters for APIs and Microservices

Rate limiting is simply controlling how many requests a given actor (IP, user, API key, tenant, etc.) can make over a specific period of time. Done right, it acts like a safety valve for your infrastructure and your business logic.

Key goals of rate limiting

  • Protect backend capacity: Databases, queues and external providers all have limits. Rate limiting keeps noisy neighbors from exhausting them.
  • Fairness between tenants: In multi‑tenant SaaS, you do not want one large customer to slow down everyone else.
  • Abuse and fraud prevention: Limiting login attempts, password reset flows, search endpoints and pricing APIs slows down attackers and scrapers.
  • Cost control: If you pay per external API call or per CPU second, rate limits cap the financial blast radius of bugs and misuse.
  • SLO and SLA protection: Keeping tail latencies under control is much easier if you cap the worst offenders before they cause a chain reaction.

How it fits with WAF, DDoS and firewalls

Rate limiting is one layer in a broader defense‑in‑depth strategy:

  • Network‑level protection (firewalls, basic DDoS filtering) blocks obvious floods or port scans.
  • WAF rules block malicious patterns in payloads (SQL injection, XSS signatures, etc.). For a practical overview, our Cloudflare security settings guide on WAF, rate limiting and bot protection shows how we layer these at the edge.
  • Rate limiting focuses on volume and frequency of otherwise valid‑looking requests.

In microservice environments, rate limiting is often what keeps a noisy client from pushing a shared dependency (like a database or a payment gateway) over the edge and taking other services down with it.

Core Rate Limiting Concepts and Algorithms

Before we touch Nginx, Cloudflare or Redis, it helps to have a clear vocabulary and mental model.

Basic terminology

  • Limit: Maximum number of requests allowed within a time window (for example, 100 requests per minute).
  • Window: The time period for the limit (per second, per minute, per hour, per day).
  • Key: The identifier you are limiting: IP, user ID, API key, tenant ID, or a combination.
  • Burst: Extra requests temporarily allowed above the limit to smooth out short spikes.
  • Quota vs rate: Some APIs control total requests per day/month (quota) as well as requests per second/minute (rate).

Main rate limiting algorithms

Most implementations are variants of a few well‑known algorithms:

  • Fixed window: Count requests in discrete windows (e.g. 00:00–00:59, 01:00–01:59). Simple but can have edge‑of‑window bursts.
  • Sliding window: Keep timestamps of recent requests and count those within the last N seconds. Smoother but needs more storage.
  • Token bucket: Tokens accumulate over time; each request consumes one. Allows bursts up to bucket size while enforcing a long‑term average rate.
  • Leaky bucket: Like a bucket that leaks at a constant rate; if more water (requests) enters than can leak out, extra is dropped.
  • Concurrency limiting: Limit how many in‑flight requests or jobs a key can have at the same time, protecting slow dependencies.

Nginx’s built‑in limit_req is essentially a leaky bucket: requests drain at the configured rate and the burst parameter sets the bucket size. Cloudflare’s rate limiting rules count requests per IP (or other characteristics) over a configurable period. Redis lets you implement any of the above by combining counters, TTLs and Lua scripts.
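
To make the sliding-window idea concrete, here is a minimal single-process sketch in Python (in-memory only and purely illustrative; once you run more than one instance you need a shared store such as Redis, which we cover later):

import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Keep a timestamp per request and count those inside the window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> recent request timestamps

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        q = self.events[key]
        # Drop timestamps that have fallen out of the window
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

limiter = SlidingWindowLimiter(limit=100, window_seconds=60)
print(limiter.allow("client123"))  # True until the client exceeds 100 requests/minute

The memory cost (one timestamp per recent request) is exactly the "needs more storage" trade-off mentioned above.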

Designing a Rate Limiting Strategy for Your Architecture

Good rate limiting is more than just picking a number and copying a config snippet. You need to think about who you’re protecting, what you’re protecting, and where you enforce limits.

Choosing the right key: IP, user or tenant?

Each key type has trade‑offs:

  • IP address: Easy to implement at the edge (Nginx, Cloudflare). Works well for anonymous endpoints and brute‑force protection. Weak for mobile users behind NAT or large corporate proxies.
  • API key / client ID: Best for authenticated APIs. Fair per‑client limits; not affected by NAT. Requires that authentication happens before rate limiting.
  • User ID: Useful for consumer‑facing apps where many users may share an IP. Often combined with a coarser IP‑based limit as a backup.
  • Tenant / organization ID: Critical in B2B SaaS to enforce plan‑based quotas (for example, 10k requests/day for the Basic plan, 1M for Enterprise).

In practice, we often layer them. For example:

  • Edge rate limiting at Cloudflare by IP to block obvious floods.
  • Gateway rate limiting at Nginx by $api_client_id or $jwt_sub.
  • Business‑level quotas enforced inside the application using Redis counters.

Where to enforce limits in a microservices stack

A typical dchost.com‑style API stack might look like this:

  1. Client → Cloudflare: Global edge network, WAF and edge rate limiting.
  2. Cloudflare → Nginx gateway on your VPS/dedicated server: TLS termination, routing, Nginx rate limiting and microcaching.
  3. Nginx → backend microservices: RPC/HTTP calls to internal services, each of which may also have local or centralized rate limits.

Rate limits can live in all three layers:

  • At the edge (Cloudflare): Protects origin bandwidth and CPU; cheapest place to drop junk traffic.
  • At the gateway (Nginx): More context‑aware (paths, headers, auth) and closer to your app logic.
  • Inside services (Redis‑backed): Fine‑grained quotas per user, per feature, per tenant; shared across instances.

How to respond: 429 and Retry‑After

For HTTP APIs, the standard way to signal rate limiting is:

  • Status code: 429 Too Many Requests
  • Header: Retry-After: <seconds> or an HTTP date

Many SDKs and API clients know how to react to 429 automatically (backoff and retry). If you return 500 or 503 instead, clients cannot distinguish rate limiting from real errors, and they may retry too aggressively.
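
On the client side, a well-behaved integration reads Retry-After and backs off. Here is a rough sketch in Python using the common requests library (the exact HTTP client does not matter):

import time
import requests

def get_with_backoff(url, max_attempts=5):
    delay = 1.0
    resp = None
    for _ in range(max_attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After when present; otherwise fall back to exponential backoff
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after and retry_after.isdigit() else delay
        time.sleep(wait)
        delay = min(delay * 2, 60)
    return resp  # still 429 after all attempts; surface it to the caller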

Implementing Rate Limiting with Nginx

Nginx is our go‑to gateway and reverse proxy for APIs and microservices. Its native rate limiting is efficient, and because the counters live in a shared memory zone they are enforced consistently across all worker processes on a single VPS (though not across separate servers; we cover that case below).

Basic per‑IP rate limiting

Here is a minimal example that limits each IP to 10 requests per second with a small burst:

http {
    # Define a shared memory zone "api_limit" with 10MB of storage
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

    server {
        listen 80;
        server_name api.example.com;

        location /v1/ {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://backend_api;
        }
    }
}

Notes:

  • $binary_remote_addr is a compact representation of the client IP.
  • rate=10r/s sets a base rate of 10 requests per second.
  • burst=20 lets a client briefly exceed the rate, which is often necessary for legitimate short spikes.
  • nodelay serves requests within the burst immediately instead of spacing them out at the configured rate; anything beyond the burst is rejected right away.

By default, Nginx will return 503 Service Unavailable when the limit is exceeded. For APIs, you should explicitly configure a 429:

http {
    limit_req_status 429;
}

Rate limiting by API key or user ID

Per‑IP limits are crude. For authenticated APIs, it is far better to limit per client. You can have your auth layer set a header such as X-API-Client-ID (for example, after validating a JWT) and build the zone key from that. Make sure this header is only ever set by your own auth layer and is stripped from incoming requests, otherwise clients could spoof it to dodge their limit:

# Key on the authenticated client ID; fall back to the client IP when the
# header is missing so unauthenticated requests are still limited
map $http_x_api_client_id $api_client_id {
    ""      $binary_remote_addr;
    default $http_x_api_client_id;
}

limit_req_zone $api_client_id zone=client_limit:20m rate=100r/m;

server {
    listen 80;
    server_name api.example.com;

    location /v1/ {
        # Per-client limit: 100 requests per minute
        limit_req zone=client_limit burst=50 nodelay;

        proxy_pass http://backend_api;
    }
}

This way, customers on high‑latency mobile connections or behind shared IPs are not penalized for each other’s traffic.

Different limits per endpoint

Some endpoints are much more sensitive than others. For example:

  • POST /v1/auth/login and POST /v1/password-reset should have very low limits.
  • GET /v1/products can tolerate higher rates.
  • GET /v1/reports/export might be limited per hour or per day.

With Nginx you can define multiple zones and apply them selectively:

limit_req_zone $binary_remote_addr zone=login_limit:5m rate=5r/m;
limit_req_zone $api_client_id      zone=read_limit:10m rate=50r/s;

server {
    listen 80;

    location = /v1/auth/login {
        limit_req zone=login_limit burst=5 nodelay;
        proxy_pass http://auth_service;
    }

    location /v1/ {
        limit_req zone=read_limit burst=100 nodelay;
        proxy_pass http://backend_api;
    }
}

Protecting login and XML‑RPC style endpoints

For web applications like WordPress, we have seen great results combining Nginx rate limiting with Fail2ban to tame brute‑force attacks against login and XML‑RPC endpoints. The same pattern works for custom APIs: rate limit at Nginx, log offenders, and optionally ban repeat abusers at the firewall level. We describe this in detail in our article on the calm way to stop wp-login.php and XML-RPC brute force with Nginx rate limiting + Fail2ban.
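
As a rough illustration of that pattern, a Fail2ban filter can match Nginx’s limit_req error-log lines and ban repeat offenders at the firewall. This is a sketch; adjust the zone names, log path and thresholds to your setup:

# /etc/fail2ban/filter.d/nginx-req-limit.conf
[Definition]
failregex = limiting requests, excess: [\d.]+ by zone "(login_limit|api_limit)", client: <HOST>
ignoreregex =

# /etc/fail2ban/jail.local
[nginx-req-limit]
enabled  = true
filter   = nginx-req-limit
logpath  = /var/log/nginx/error.log
findtime = 600
maxretry = 10
bantime  = 3600

With this in place, Nginx handles the first line of defense and Fail2ban quietly removes clients that keep hammering the limit.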

Moving beyond a single node: Nginx + Redis

Classic Nginx limit_req uses shared memory local to one server. In a horizontally scaled microservice architecture with multiple API gateways, you may want limits to be global across all nodes.

For that, you can:

  • Introduce a dedicated rate limiting service that speaks HTTP/gRPC and uses Redis under the hood.
  • Or use Nginx modules (such as lua-resty-limit-traffic with OpenResty) to store counters in Redis directly from Nginx.

The pattern is the same: Nginx extracts a key (client ID, user, tenant), calls out to Redis to check/update a counter, and based on the response either proxies to upstream or returns 429. Redis gives you atomic increments and expirations that work reliably across multiple API gateway instances.
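
As a sketch of the second option, an OpenResty location can check a Redis counter before proxying. This assumes the lua-resty-redis library, a local Redis instance and a per-minute limit of 100; treat it as a starting point rather than a drop-in config:

location /v1/ {
    access_by_lua_block {
        local redis = require "resty.redis"
        local red = redis:new()
        red:set_timeout(100)  -- milliseconds

        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.log(ngx.ERR, "rate limiter: redis unavailable: ", err)
            return  -- fail-open: let the request through
        end

        -- Key on the authenticated client if present, otherwise the IP
        local client = ngx.var.http_x_api_client_id or ngx.var.remote_addr
        local key = "gw:" .. client .. ":" .. math.floor(ngx.now() / 60)

        local count, cerr = red:incr(key)
        if not count then
            ngx.log(ngx.ERR, "rate limiter: incr failed: ", cerr)
            return
        end
        if count == 1 then
            red:expire(key, 60)
        end
        red:set_keepalive(10000, 100)

        if count > 100 then
            ngx.header["Retry-After"] = "60"
            return ngx.exit(429)
        end
    }

    proxy_pass http://backend_api;
}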

Cloudflare Rate Limiting and WAF Rules at the Edge

If you already run your domains through Cloudflare, it’s often the most cost‑effective place to start rate limiting. It reduces load on your servers, saves bandwidth, and can be tuned per path and per HTTP method.

Layering edge rules in front of your origin

Cloudflare offers two main tools useful for APIs and microservices:

  • Rate limiting rules: Define thresholds such as “if a single IP makes more than 100 requests to /api/ in 10 seconds, then block or challenge for 1 minute”.
  • WAF rules: Block obvious attack patterns, bad bots, or known vulnerability exploits before they reach your origin.

For a step‑by‑step walk‑through, including small business use cases and WordPress protection scenarios, see our article Cloudflare security settings guide for WAF, rate limiting and bot protection.

Typical Cloudflare rules for APIs

For API traffic, we commonly configure rules like:

  • Global per‑IP rate limit on /api/ paths to absorb obvious floods.
  • Stricter limits on /api/auth/* endpoints to slow down credential stuffing.
  • Soft limits (JS challenge) on expensive search or listing endpoints to throttle scrapers.

Unlike Nginx’s native limiter, Cloudflare can also combine several dimensions: country, ASN, user‑agent, bot score, and more. It is excellent at filtering out low‑quality or obviously automated traffic before it reaches more expensive layers.
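
As a concrete sketch, the stricter auth-endpoint rule above might use a matching filter expression like the following in Cloudflare’s rule builder (the field and function names come from Cloudflare’s Rules language; the actual thresholds, such as 20 requests per 10 seconds per IP with a 60-second block, are placeholders you would set in the rule’s rate settings):

(http.host eq "api.example.com" and starts_with(http.request.uri.path, "/api/auth/"))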

Working with 429s and client behavior

Cloudflare lets you choose the action for each rule: block, challenge, or serve a custom response. For machine‑to‑machine APIs, we usually prefer a straightforward 429 at the edge that clients can understand. For browser‑based traffic, a CAPTCHA or JavaScript challenge may be more appropriate for suspected bots.

If you pair Cloudflare with Nginx and Redis behind it, think of Cloudflare as the outer guard: it handles the worst spikes, while your origin‑side rate limiting is more application‑aware and focused on fairness between legitimate clients.

Redis-Based Distributed Rate Limiting for APIs and Microservices

For truly flexible and global rate limits, especially in a microservice architecture, Redis is the workhorse of choice. It combines low latency, atomic operations and rich data structures, which makes implementing various algorithms straightforward.

Why Redis works so well for rate limiting

  • Atomic counters: The INCR and INCRBY commands let you update counters safely even under heavy concurrency.
  • Key expiration (TTL): With EXPIRE or SETEX, you can automatically reset counters after a window ends.
  • Lua scripting: Run small scripts server‑side for token bucket or sliding window logic in a single atomic step.
  • Data structures: Sorted sets, hashes and bitmaps open the door to complex per‑feature, per‑user tracking.

For production use, you will want Redis to be highly available so that your rate limiting is not a single point of failure. Our guide Never Lose Your Cache Again: High‑Availability Redis for WordPress describes how we use Sentinel, AOF/RDB and failover, and the same principles apply to a rate limiting Redis cluster.

Simple fixed-window limit with INCR + EXPIRE

A straightforward per‑minute limit could look like this in pseudocode:

function checkRateLimit(clientId):
    window = currentUnixTime() // 60          # current minute
    key = "rate:" + clientId + ":" + window

    count = redis.INCR(key)
    if count == 1:
        redis.EXPIRE(key, 60)  # expire after one minute

    if count > 1000:
        return { allowed: false, retryAfter: 60 }

    return { allowed: true }

Here:

  • Each client has a fresh key every minute (for example, rate:client123:29123456).
  • The first request sets the TTL; subsequent ones simply increment.
  • Once count exceeds 1000, your service returns 429 to the caller.
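
The same logic in a real service, sketched here with Python and the redis-py client (the 1000-per-minute limit and key names mirror the pseudocode above):

import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

LIMIT = 1000       # requests per client per minute
WINDOW = 60        # seconds

def check_rate_limit(client_id):
    window = int(time.time()) // WINDOW
    key = f"rate:{client_id}:{window}"

    # INCR + EXPIRE in one round trip; because the window number is part of
    # the key, refreshing the TTL on every request is harmless.
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, WINDOW)
    count, _ = pipe.execute()

    if count > LIMIT:
        retry_after = WINDOW - int(time.time()) % WINDOW
        return {"allowed": False, "retry_after": retry_after}
    return {"allowed": True, "remaining": LIMIT - count}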

Token bucket with Lua scripting

To allow brief bursts while maintaining a long‑term limit, a token bucket usually feels more natural. A minimal Redis Lua script might:

  1. Read the last refill timestamp and token count.
  2. Calculate how many tokens to add based on elapsed time and refill rate.
  3. Deduct one token for the current request if available.
  4. Return whether the request is allowed plus the current token count.

Example (simplified Lua):

-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill_rate (tokens per second)
-- ARGV[3] = now (current timestamp in seconds)

local capacity   = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local now        = tonumber(ARGV[3])

local data = redis.call("HMGET", KEYS[1], "tokens", "timestamp")
local tokens    = tonumber(data[1]) or capacity
local timestamp = tonumber(data[2]) or now

-- Refill tokens based on time elapsed
local delta = math.max(0, now - timestamp)
local refill = delta * refillRate

tokens = math.min(capacity, tokens + refill)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call("HMSET", KEYS[1], "tokens", tokens, "timestamp", now)
redis.call("EXPIRE", KEYS[1], 3600)

return { allowed, tokens }

Your application calls this script via EVALSHA for each request. A response of allowed = 1 means proceed; 0 means return a 429 with an appropriate Retry-After.
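
In Python, for example, redis-py can register the script once and reuse its SHA automatically. A sketch, assuming token_bucket_lua holds the script above as a string and the capacity/refill values are placeholders tied to your plans:

import time
import redis

r = redis.Redis()

token_bucket_lua = """ ...the Lua script above... """   # load from a file in practice
token_bucket = r.register_script(token_bucket_lua)      # uses EVALSHA with EVAL fallback

def allow_request(client_id, capacity=20, refill_rate=5.0):
    allowed, tokens = token_bucket(
        keys=[f"bucket:{client_id}"],
        args=[capacity, refill_rate, int(time.time())],
    )
    return allowed == 1   # False -> respond with 429 and a Retry-After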

Concurrency limits to protect slow dependencies

Sometimes the problem isn’t the total number of requests per minute, but how many slow operations are happening at the same moment. A classic example is generating a large report or calling a slow external payment API.

A simple pattern:

  • On request start, try to INCR a key like concurrent:client123:report.
  • If the result is more than the allowed maximum (for example, 3), immediately DECR again and return 429.
  • On request completion (success or failure), DECR the key.

To protect against stuck counts (for example, if a process crashes), you can combine INCR with a TTL (or use a Lua script that sets a TTL when the counter becomes non‑zero) so that orphaned counters eventually disappear.
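
Here is a sketch of that pattern with redis-py, using a tiny Lua script so the increment and the TTL are applied atomically (the 3-job limit and 15-minute TTL are assumptions for a slow report endpoint):

import redis

r = redis.Redis()

# Atomically increment, refresh the TTL, and roll back if over the limit
ACQUIRE = r.register_script("""
local current = redis.call("INCR", KEYS[1])
redis.call("EXPIRE", KEYS[1], tonumber(ARGV[2]))
if current > tonumber(ARGV[1]) then
  redis.call("DECR", KEYS[1])
  return 0
end
return 1
""")

def start_report(client_id, max_concurrent=3, ttl=900):
    key = f"concurrent:{client_id}:report"
    return ACQUIRE(keys=[key], args=[max_concurrent, ttl]) == 1  # False -> return 429

def finish_report(client_id):
    key = f"concurrent:{client_id}:report"
    if r.decr(key) < 0:        # guard against double-release
        r.delete(key)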

Observability, Testing and Tuning Your Limits

Rate limiting that you cannot see or measure will eventually hurt you. Observability is just as important as the rules themselves.

Logging and metrics for rate limits

At minimum, you should:

  • Log every 429 response from Nginx and your application, including the key that triggered it (IP, user, client ID).
  • Expose metrics such as “requests allowed vs limited” per endpoint and per plan.
  • Alert when the percentage of 429s spikes unexpectedly; it might indicate an attack or a misconfigured limit.

For centralizing logs across multiple VPS instances and keeping them queryable, we like stacks such as Loki + Promtail + Grafana. Our article VPS log management without the drama: centralized logging with Loki and Promtail shows how we set this up in hosting environments very similar to API clusters.

Dry runs and soft enforcement

When rolling out new limits, a useful approach is:

  1. Implement the counters and logging, but still allow all requests (soft mode).
  2. Observe who would have been rate‑limited for a few days: which endpoints, which keys, at what times.
  3. Adjust thresholds and key selection based on real observed traffic.
  4. Switch to hard enforcement with 429s once you are confident.

This is especially important for B2B APIs where you may have clients doing legitimate high‑volume batch imports or overnight sync jobs.
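
If your gateway runs Nginx 1.17.1 or newer, step 1 does not even require custom code: the limit_req_dry_run directive counts and logs violations without rejecting anything, so you can watch the error log before flipping to hard enforcement. A minimal example:

location /v1/ {
    limit_req zone=client_limit burst=50 nodelay;
    limit_req_dry_run on;        # log would-be rejections, but let traffic through
    limit_req_log_level warn;    # make them easy to grep in the error log

    proxy_pass http://backend_api;
}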

Choosing sensible numbers

Capacity planning is a whole topic on its own, but a quick rule of thumb:

  • Estimate your backend’s safe sustained throughput (requests per second) without breaching CPU, database or external API limits.
  • Reserve a margin (for example, 30–40%) for unexpected spikes.
  • Distribute the remaining capacity across tenants according to their plans.

Our guide on how to estimate traffic and bandwidth needs on shared hosting and VPS walks through similar calculations from a hosting perspective; the same thinking applies when you translate those numbers into per‑client and per‑endpoint API limits.

Network-Level Rate Limiting with Firewalls

While Nginx, Cloudflare and Redis handle application‑layer limits, it is sometimes useful to add a coarse network‑layer limit for obvious abuse (for example, blocking an IP that is opening thousands of TCP connections per second).

On Linux VPS and dedicated servers, modern setups increasingly use nftables for this purpose. You can define rules that:

  • Limit new connections per IP per second to your API ports.
  • Temporarily drop or rate limit offenders at the packet level.
  • Combine with port knocking or connection tracking for sensitive internal services.

If you want practical examples, including rate limiting and IPv6‑aware rules, our article The nftables firewall cookbook for VPS: rate limiting, port knocking and IPv6 rules goes step‑by‑step through real configurations.
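
As a rough sketch (the exact syntax depends on your nftables version and existing table/chain layout), a per-source-IP cap on new connections to the API ports can look like this:

# Drop sources opening more than 30 new connections/second to the API ports
nft add rule inet filter input tcp dport { 80, 443 } ct state new \
    meter api_flood { ip saddr limit rate over 30/second } drop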

Network‑level rate limiting will never be as precise as Redis‑backed application‑layer limits, but it is an excellent coarse filter and backstop if higher layers misbehave or are overwhelmed.

Putting It All Together on dchost.com Infrastructure

Let’s put all of this into a realistic architecture that we frequently see from our customers running APIs and microservices on VPS and dedicated servers at dchost.com.

A reference architecture for API rate limiting

  1. DNS and edge: Your API domain points to Cloudflare. At the edge, you configure WAF rules and a few broad rate limiting rules (per‑IP caps on /api/ and tighter caps on /api/auth/*).
  2. Gateway server: On a high‑performance VPS with NVMe SSD (we discuss why NVMe matters in our article NVMe VPS hosting guide), you run Nginx as the API gateway. Here you enforce per‑client limits with limit_req and return 429 with Retry-After when clients exceed their plan.
  3. Redis tier: On the same server (for smaller projects) or on a separate VPS/dedicated node (for larger ones), you run a HA Redis deployment. This stores global per‑tenant quotas, token buckets for critical operations, and concurrency counters for slow jobs.
  4. Backend microservices: Each service trusts the gateway for coarse limits but also performs its own fine‑grained checks via Redis for business‑specific constraints (for example, a maximum of 10 weekly exports per team).
  5. Observability: Logs and metrics are shipped from all nodes into a centralized stack (for example, Loki + Grafana) so you can see 429 rates, top offenders and path‑level statistics.

Scaling up: VPS, dedicated and colocation options

Because we run our own data centers and infrastructure at dchost.com, this layered approach to rate limiting fits naturally into our hosting portfolio:

  • Small and medium APIs: One or two well‑sized NVMe VPS servers are usually enough to host Nginx, Redis and microservices together, with Cloudflare at the edge.
  • High‑traffic APIs and SaaS platforms: Separate gateway, Redis and backend tiers across multiple VPS or dedicated servers, often with dedicated database nodes and object storage in the mix.
  • Very large or regulated workloads: Customers bring their own hardware via colocation and still apply the same Nginx + Cloudflare + Redis patterns, just on physically dedicated racks and network segments.

Next steps

If you’re planning a new API or microservice platform—or if you’re already hitting limits and noisy neighbors on your current environment—taking a day to design proper rate limiting often saves weeks of firefighting later. Start with a simple edge rule at Cloudflare, add Nginx limits for the riskiest endpoints, and then introduce Redis‑backed quotas where you need cross‑node and per‑tenant control.

At dchost.com we implement these patterns daily on our VPS, dedicated and colocation platforms. If you’re not sure how to size the servers or where to draw the boundaries between gateway, Redis and your services, our team can help you choose a realistic architecture and grow it over time without surprises.

Frequently Asked Questions

Which key should I rate limit on: IP address, user, API key or tenant?

There is no single best key; it depends on how your API is used. For anonymous or public endpoints, per-IP limits are simple and effective at the edge (for example, on Cloudflare or Nginx). For authenticated APIs, per-client or per-API-key limits are usually fairer, because they are not affected by NAT or shared proxies. In multi-tenant SaaS, you often also need per-tenant or per-organization quotas so one customer cannot starve others. In practice, we usually combine several: a coarse per-IP limit at Cloudflare, a per-client limit at the Nginx gateway, and business-level per-tenant quotas backed by Redis in the application.

How do I choose sensible rate limit thresholds?

Start from your backend capacity, not from arbitrary numbers. Estimate how many requests per second your stack can handle comfortably without saturating CPU, databases or external APIs, then reserve a margin for spikes. From there, allocate client and endpoint limits according to customer plans and endpoint cost (logins and password resets get low limits; product listings get higher limits). Before enforcing, run the limits in ‘dry-run’ mode: compute counters, log would-be 429s, but still allow traffic. After a few days, review who would have been limited and adjust thresholds. This incremental approach avoids surprises when you switch to hard enforcement.

Should I enforce rate limits at Cloudflare, at Nginx or in Redis?

Each layer solves a different problem and they work best together. Cloudflare or a similar edge layer is great for coarse per-IP limits and blocking obvious floods close to the client. Nginx as your gateway can apply smarter limits based on headers, paths and authentication state (for example, per-client IDs) and return clean 429 responses. Redis is ideal for global, cross-node limits and business-specific quotas shared between multiple microservices. Inside your services, you can check these Redis-backed counters to enforce per-tenant or per-feature limits. For most projects, a combination of edge + Nginx + Redis offers the best balance of protection, visibility and flexibility.

How do rate limiting and caching work together?

Caching and rate limiting complement each other. Microcaching on Nginx (for example, caching GET responses for 1–5 seconds) drastically reduces backend load for hot endpoints so you can safely set more generous rate limits without overloading your application. Rate limiting aims at client behaviour, while caching optimizes response reuse. A good pattern is to cache idempotent GET endpoints as much as possible, then apply stricter rate limits on expensive POST/PUT/DELETE or report-generation endpoints. If you are new to Nginx microcaching, our article on how Nginx microcaching makes PHP applications feel instantly faster is a good reference to understand how it affects capacity planning and rate limit choices.

You should design for graceful degradation. For simple, local Nginx limits that use in-memory zones, Redis is not involved, so those continue to work as long as Nginx is healthy. For Redis-backed global limits or a separate rate limiting microservice, decide your failure mode explicitly: fail-open (temporarily allow all traffic if the limiter is unavailable) or fail-closed (block or severely limit traffic). For most public APIs, fail-open with strong edge and Nginx protections is more practical; otherwise you risk self-inflicted downtime. High-availability Redis setups with Sentinel or clustering, plus monitoring and alerts, drastically reduce the chance that rate limiting becomes a single point of failure.