How Fast Should Your API Actually Respond to the Same IP Twice?

By IPThreat Team April 25, 2026

When No Limit Becomes the Liability

In late 2024, a mid-sized SaaS company offering document processing services discovered something uncomfortable: their internal API, exposed only through a subdomain they considered "obscure," had been scraped continuously for 11 days. An automated client had made over 4.2 million requests, extracting customer-facing pricing logic, model inference endpoints, and internal error messages that leaked database schema details. No authentication tokens were stolen. No firewall rules were triggered. The only thing missing was a rate limit.

This isn't an isolated story. With ransomware attacks rising sharply and threat actors growing more sophisticated — as seen in a recent Russian campaign that hijacked routers to steal Microsoft Office tokens, essentially harvesting authentication credentials at the infrastructure layer — APIs have become a primary attack surface. They are the connective tissue of modern infrastructure, and frequently the path of least resistance.

Rate limiting is one of the most underestimated controls in API security. It doesn't make headlines the way zero-day patches do, but implemented correctly, it can neutralize credential stuffing, slow enumeration attacks, throttle data exfiltration, and reduce the blast radius of compromised accounts. Done poorly, it creates friction for legitimate users and a false sense of security for everyone else.

This article walks through how rate limiting actually works in production environments, where common implementations fail, and how to build strategies that hold up under real adversarial pressure.

Understanding What You're Actually Limiting

Before choosing an algorithm, you need to define what you're measuring. Most teams default to "requests per minute per IP," which is a reasonable starting point but leaves significant gaps.

The Four Axes of Rate Limiting

  • IP address: The most common dimension. Effective against unsophisticated bots and scanners, but trivially bypassed using residential proxy pools, which are now widely available to threat actors for a few dollars per gigabyte of traffic.
  • User account or API key: More meaningful for authenticated endpoints. A single compromised credential shouldn't be able to drain your infrastructure or exfiltrate records at machine speed.
  • Endpoint or resource type: A password reset endpoint has a fundamentally different risk profile than a product catalog endpoint. They should not share a rate limit policy.
  • Behavioral fingerprint: Combining headers, TLS fingerprints, device identifiers, and request patterns to build a more durable identity that isn't reset by an IP change.

Most production APIs need to layer at least two of these axes. A policy that limits only by IP is trivially defeated by any botnet operator or anyone routing through a datacenter proxy. A policy that limits only by account misses unauthenticated abuse entirely. The Scattered Spider threat group — whose member Tyler Buchanan recently pleaded guilty — was known for targeting authenticated sessions through social engineering. Rate limiting on account activity post-authentication is exactly the kind of control that can surface anomalous session behavior before damage is done.

The Core Algorithms and Their Real Tradeoffs

Fixed Window Counting

The simplest approach: count requests in discrete time windows (e.g., 100 requests per minute, resetting at the top of each minute). Implementation is straightforward — a Redis INCR with an expiry is often enough.
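A minimal sketch of that pattern with redis-py, assuming a local Redis instance; the limit and window values are placeholders:

```python
import time
import redis

r = redis.Redis()

def allow_request(client_ip: str, limit: int = 100, window: int = 60) -> bool:
    # Key embeds the window index, so counts reset at each clock-aligned boundary.
    key = f"ratelimit:{client_ip}:{int(time.time()) // window}"
    count = r.incr(key)        # atomic: creates the key at 1 if absent
    if count == 1:
        r.expire(key, window)  # first request in the window sets the expiry
    return count <= limit
```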

The failure mode is the boundary attack. An adversary who knows your window resets at :00 can send 100 requests at :59 and 100 more at :01, delivering 200 requests in two seconds without triggering any limit. For low-sensitivity endpoints this may be acceptable. For authentication endpoints, it's a meaningful gap.

When to use it: Coarse-grained throttling on high-volume, low-sensitivity endpoints where implementation simplicity and low overhead matter more than precision.

Sliding Window Log

Store a timestamp for every request in the current window. When a new request arrives, discard timestamps outside the window, count what remains, and allow or deny. This eliminates the boundary attack entirely — the window always reflects the last N seconds of actual traffic, not a clock-aligned bucket.
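A minimal sketch using a Redis sorted set, with the request timestamp serving as both member and score; in production the member would need a unique suffix so that simultaneous requests don't collide:

```python
import time
import redis

r = redis.Redis()

def allow_request(user_id: str, limit: int = 20, window: int = 60) -> bool:
    key = f"ratelog:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window)  # discard timestamps outside the window
    pipe.zadd(key, {str(now): now})              # record this request
    pipe.zcard(key)                              # count what remains
    pipe.expire(key, window)                     # garbage-collect idle keys
    _, _, count, _ = pipe.execute()
    return count <= limit
```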

The cost is memory. If you're storing per-request timestamps for millions of users, your Redis footprint grows proportionally. For APIs with high request volumes and large user bases, this can become prohibitively expensive.

When to use it: High-sensitivity endpoints (authentication, password reset, account modification) with lower absolute request volumes where precision matters more than memory efficiency.

Sliding Window Counter

A practical middle ground. Maintain counts for the current and previous fixed windows, then estimate the request count for the rolling window using a weighted average based on how far into the current window you are. The formula is roughly: current_count + previous_count × (1 - elapsed_fraction).

This approximates a sliding window with O(1) storage per user. The approximation introduces a small margin of error (typically under 1% at the boundary), which is acceptable for most use cases. Redis-based implementations are common and performant at scale.
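The estimate itself is a one-liner; a sketch with illustrative numbers:

```python
import time

def estimated_count(current: int, previous: int, window: float = 60.0) -> float:
    # How far into the current fixed window we are, as a fraction of the window.
    elapsed_fraction = (time.time() % window) / window
    return current + previous * (1 - elapsed_fraction)

# Example: 40 requests so far this minute, 90 in the previous minute,
# 15 seconds into the window: 40 + 90 * (1 - 0.25) = 107.5 estimated
# requests in the rolling 60-second window.
```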

When to use it: General-purpose rate limiting at scale. This is the algorithm behind most production CDN and API gateway implementations.

Token Bucket

Each client has a bucket with a maximum capacity. Tokens are added to the bucket at a fixed rate. Each request consumes one token. If the bucket is empty, the request is denied. Clients can accumulate tokens during quiet periods and spend them in bursts.
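A minimal in-process sketch; a distributed deployment would keep the token count and refill timestamp in shared storage such as Redis:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float          # maximum tokens the bucket can hold
    refill_rate: float       # tokens added per second
    tokens: float = 0.0      # new buckets start empty; set to capacity to allow an initial burst
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```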

This is well-suited for APIs where occasional bursts are legitimate — a user who hasn't made requests for an hour should arguably be able to make several in quick succession. The tradeoff is that a client who waits long enough can always accumulate a full bucket and deliver a burst, which may matter for endpoints where even short bursts are dangerous.

When to use it: APIs consumed by other applications or services where bursty-but-legitimate traffic patterns are expected and you want to avoid punishing normal usage spikes.

Leaky Bucket

Requests enter a queue (the bucket) and are processed at a constant rate, regardless of arrival rate. Excess requests that overflow the queue are dropped. This smooths traffic into a consistent output rate.
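A sketch of the "bucket as meter" variant, which tracks a virtual queue depth rather than holding an actual queue of requests:

```python
import time

class LeakyBucket:
    def __init__(self, capacity: float, leak_rate: float):
        self.capacity = capacity       # maximum virtual queue depth
        self.leak_rate = leak_rate     # requests drained per second
        self.level = 0.0
        self.last_drain = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # The bucket drains at a constant rate regardless of arrival rate.
        self.level = max(0.0, self.level - (now - self.last_drain) * self.leak_rate)
        self.last_drain = now
        if self.level + 1.0 <= self.capacity:
            self.level += 1.0
            return True
        return False                   # overflow: request is dropped
```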

Leaky bucket is primarily a traffic shaping mechanism rather than an abuse prevention tool. It doesn't distinguish between a legitimate burst and an attack — it just enforces a consistent output rate. It's more useful for protecting downstream services from overload than for preventing adversarial abuse.

When to use it: Protecting internal microservices or downstream dependencies from upstream traffic spikes, not as a primary security control on public-facing endpoints.

Implementation Scenarios: Where Theory Meets Production

Scenario 1: Authentication Endpoints Under Credential Stuffing

Credential stuffing attacks — automated login attempts using leaked username/password combinations — have surged alongside the proliferation of data breaches. A standard login endpoint without rate limiting can absorb tens of thousands of attempts per minute from a distributed botnet.

The naive fix is IP-based rate limiting: 5 failed attempts per minute per IP. The problem is that sophisticated attackers use residential proxy pools with thousands of distinct IPs, each making only one or two attempts. Your per-IP limit never triggers.

A more effective layered approach:

  1. Per-account lockout with progressive backoff: After 5 failed attempts on a specific account, regardless of source IP, introduce an exponential delay before the next attempt is accepted. After 10 failures, require email verification to unlock. This directly targets the account enumeration pattern without caring about IP diversity (a sketch of the backoff logic follows this list).
  2. Global failure rate monitoring: Track the ratio of failed to successful authentications across your entire endpoint in a rolling window. If the failure rate spikes above a threshold (say, 40% of attempts failing in a 5-minute window), trigger elevated scrutiny — CAPTCHA challenges, additional logging, alerts to the security team.
  3. IP reputation scoring as a signal, not a binary: Known datacenter ranges and flagged proxy IPs should trigger lower thresholds, not automatic blocks. Blocking too aggressively on IP alone causes collateral damage to legitimate users behind shared NAT or corporate proxies.
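A minimal sketch of the backoff logic from step 1, with an in-memory dictionary standing in for the shared Redis state a real deployment would need; the thresholds are illustrative:

```python
import time

# account -> (failure_count, timestamp_of_last_failure); in production this
# lives in Redis so every instance sees the same state.
_failures: dict[str, tuple[int, float]] = {}

def record_failure(account: str) -> None:
    count, _ = _failures.get(account, (0, 0.0))
    _failures[account] = (count + 1, time.time())

def seconds_until_next_attempt(account: str) -> float:
    count, last = _failures.get(account, (0, 0.0))
    if count < 5:
        return 0.0                     # below threshold: no delay
    # Exponential backoff after the 5th failure: 2s, 4s, 8s, ...
    # (after the 10th, a real implementation would require email verification)
    required_delay = 2.0 ** (count - 4)
    return max(0.0, (last + required_delay) - time.time())
```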

Scenario 2: Data Exfiltration Through Authenticated Endpoints

The document-scraping incident described at the start of this article involved an authenticated session — the attacker had valid credentials. Standard unauthenticated rate limits did nothing.

ClipBanker's marathon infection chain, recently analyzed by threat researchers, illustrates how attackers invest heavily in persistence and slow, patient exfiltration to avoid detection thresholds. The same logic applies to API abuse: sophisticated actors deliberately stay under rate limits they can observe or infer.

Controls for authenticated data exfiltration:

  • Volume limits on data-returning endpoints: A user querying your customer records endpoint 500 times in an hour may be legitimate. 50,000 times is almost certainly not. Set absolute volume caps on endpoints that return sensitive or monetizable data, separate from your general request rate limits.
  • Response size accounting: Count bytes returned, not just requests made. A single request that returns 10MB of records is meaningfully different from a request that returns 200 bytes. Some API gateways support this natively; others require custom middleware (a byte-accounting sketch follows this list).
  • Anomaly detection on access patterns: A user who normally hits 10 endpoints and suddenly starts paginating through every record in your database should trigger a behavioral alert. This isn't strictly rate limiting — it's behavioral analytics — but it fills the gap that rate limits alone can't cover.
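A minimal sketch of byte accounting against an hourly budget in Redis; the budget value and key naming are placeholders:

```python
import time
import redis

r = redis.Redis()

BYTE_BUDGET_PER_HOUR = 100 * 1024 * 1024   # hypothetical 100 MB/hour cap

def within_byte_budget(user_id: str, response_bytes: int) -> bool:
    # Hour-aligned bucket; counts bytes returned, not requests made.
    key = f"bytes:{user_id}:{int(time.time()) // 3600}"
    total = r.incrby(key, response_bytes)   # atomic running total
    if total == response_bytes:
        r.expire(key, 3600)                 # first write sets the expiry
    return total <= BYTE_BUDGET_PER_HOUR
```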

Scenario 3: API Keys and Service-to-Service Communication

Internal and partner APIs are frequently protected by API keys rather than OAuth tokens. When a key is compromised — through a code repository exposure, a misconfigured environment variable, or a supply chain attack — rate limiting on the key itself can contain the damage.

The industrial automation sector has seen this problem acutely. Recent threat landscape reporting for Q4 2025 showed sustained targeting of OT/IT boundary systems, including API interfaces between industrial control systems and cloud management platforms. A compromised API key with no rate limit means an attacker can query every sensor, every configuration endpoint, and every historical record as fast as the network allows.

Key-level rate limiting recommendations:

  • Apply separate limits to read versus write operations. An automated monitoring service legitimately makes thousands of read requests but should rarely make more than a handful of writes per hour (a sample policy sketch follows this list).
  • Implement quota windows that scale with business context — daily limits for batch processing keys, per-minute limits for real-time integration keys.
  • Alert immediately when a key hits its limit. Limit breaches for service accounts are almost never legitimate and should trigger investigation, not silent denial.
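A sketch of what a per-key policy table and breach alert hook might look like; the key names and numbers are purely illustrative:

```python
import logging

POLICIES = {
    "monitoring-service": {"reads_per_min": 5_000, "writes_per_hour": 5},
    "batch-export-key":   {"reads_per_day": 200_000, "writes_per_hour": 0},
    "realtime-partner":   {"reads_per_min": 600, "writes_per_min": 60},
}

def on_limit_breach(api_key_id: str, operation: str) -> None:
    # Service-account breaches are security events, not operational noise:
    # alert and investigate rather than silently denying.
    logging.warning("rate limit breach: key=%s op=%s", api_key_id, operation)
```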

Infrastructure Considerations: Where to Enforce

Edge vs. Application Layer

Rate limiting can be enforced at several layers, each with different tradeoffs:

CDN/Edge (Cloudflare, Fastly, Akamai): Earliest possible enforcement, before traffic reaches your infrastructure. Excellent for absorbing volumetric abuse and simple IP-based limits. Limited visibility into application-layer context — you can't easily rate limit by user account or apply business logic at this layer without significant configuration complexity.

API Gateway (Kong, AWS API Gateway, Apigee): The most common enforcement point for API-specific limits. Has access to authentication context, can enforce per-key and per-user limits, and centralizes policy management. The risk is that a misconfigured or bypassed gateway can leave your backend unprotected — defense in depth requires application-layer validation as well.

Application middleware: Full access to business context, user roles, and behavioral history. The right place for complex, context-aware limits. The cost is that every request hits your application servers before being denied, which matters under heavy attack load. Application-layer rate limiting should be a secondary control, not the only one.

The practical answer for most organizations: Enforce coarse IP-based limits at the edge, context-aware limits at the API gateway, and behavioral controls in the application layer. Each layer catches what the layer above it misses.

Distributed Rate Limiting with Redis

In a horizontally scaled environment, rate limit state must be shared across instances. A limit enforced only in local memory on each application server means an attacker can simply spread requests across servers to defeat per-instance limits.

Redis is the standard solution. A basic sliding window counter implementation using Redis MULTI/EXEC transactions or Lua scripts ensures atomic operations across distributed nodes. Redis Cluster handles high-availability requirements. The operational overhead is worth it — a rate limiter that doesn't share state across your fleet is not actually a rate limiter.
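As an illustration, here is the sliding window counter from earlier expressed as a single atomic Lua script registered through redis-py, so concurrent application servers can't race between reading the counts and incrementing:

```python
import time
import redis

r = redis.Redis()

# KEYS[1] = current window key, KEYS[2] = previous window key
# ARGV[1] = limit, ARGV[2] = weight of previous window, ARGV[3] = window seconds
SLIDING_COUNTER = r.register_script("""
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local previous = tonumber(redis.call('GET', KEYS[2]) or '0')
if current + previous * tonumber(ARGV[2]) >= tonumber(ARGV[1]) then
    return 0
end
redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], 2 * tonumber(ARGV[3]))
return 1
""")

def allow(user_id: str, limit: int = 100, window: int = 60) -> bool:
    now = time.time()
    bucket = int(now) // window
    weight = 1 - ((now % window) / window)
    keys = [f"rl:{user_id}:{bucket}", f"rl:{user_id}:{bucket - 1}"]
    return SLIDING_COUNTER(keys=keys, args=[limit, weight, window]) == 1
```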

One caveat: Redis introduces a network hop and can become a bottleneck under extreme load. For very high-traffic environments, consider local caching with a short TTL (accepting slight over-admission at window boundaries) combined with a Redis sync, or evaluate dedicated rate limiting services designed for this workload.

Response Strategy: What You Do After Limiting

How your API responds to rate-limited requests matters more than most teams realize.

Always return HTTP 429 with Retry-After. This is specified in RFC 6585 and widely expected by well-behaved clients. A client that receives a proper 429 with a Retry-After header can back off and retry. A client that receives a 500 error or a connection timeout may retry immediately, making the problem worse.
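A minimal sketch using Flask purely for illustration; `allow_request` stands in for any of the limiters sketched earlier:

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

@app.errorhandler(429)
def rate_limited(error):
    resp = jsonify(error="rate limit exceeded")
    resp.status_code = 429
    # RFC 6585 allows a Retry-After header so well-behaved clients back off.
    resp.headers["Retry-After"] = "30"   # seconds; placeholder value
    return resp

@app.route("/api/resource")
def resource():
    if not allow_request(request.remote_addr):   # any limiter sketched above
        abort(429)
    return jsonify(data="ok")
```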

Don't reveal your limits to potential attackers. The Retry-After header and standard rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) are useful for legitimate API consumers. Consider whether publishing these values on unauthenticated or high-risk endpoints helps attackers calibrate their request rate to stay just under your threshold. Some teams suppress these headers on authentication endpoints specifically.

Log everything you block. Rate limit denials are security events, not just operational noise. A sudden spike in 429 responses from a specific IP range, user account, or geographic region is an early warning signal worth investigating. Feed these events into your SIEM and set alerts on anomalous volumes.

Consider soft limits before hard limits. Rather than immediately denying requests at the threshold, some teams implement a soft limit zone — between 80% and 100% of the limit — where requests are still served but flagged for review, or where CAPTCHA challenges are introduced. This reduces false positives on legitimate users experiencing unusual traffic spikes while still surfacing suspicious patterns.
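A sketch of that decision logic, with the 80% soft threshold as the illustrative boundary:

```python
def decide(count: int, limit: int) -> str:
    if count >= limit:
        return "deny"          # hard limit: return 429
    if count >= 0.8 * limit:
        return "challenge"     # soft zone: serve, but flag or require CAPTCHA
    return "allow"
```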

Testing Your Rate Limits Before Attackers Do

A rate limit that hasn't been tested is a rate limit you can't trust. Regular testing should be part of your API security practice:

  • Boundary testing: Verify that your limits trigger exactly where configured, not earlier (causing false positives) or later (leaving you exposed). Test the boundary attack on any fixed window implementations.
  • Distributed bypass testing: Simulate requests from multiple source IPs to verify that per-user and per-account limits hold regardless of IP diversity.
  • Header manipulation testing: Some implementations incorrectly trust X-Forwarded-For or X-Real-IP headers to identify client IPs. Test whether spoofing these headers bypasses your limits. If your application is behind a trusted reverse proxy, only accept these headers from the proxy, not from arbitrary clients (see the sketch after this list).
  • State persistence testing: Verify that limits persist correctly across application server restarts, deployments, and Redis failover events.
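A sketch of client-IP extraction that honors X-Forwarded-For only when the direct peer is a known reverse proxy; the proxy address is a placeholder:

```python
import ipaddress

TRUSTED_PROXIES = {ipaddress.ip_address("10.0.0.5")}   # your reverse proxy

def client_ip(peer_addr: str, xff_header: str | None) -> str:
    peer = ipaddress.ip_address(peer_addr)
    if peer in TRUSTED_PROXIES and xff_header:
        # The rightmost entry was appended by our own proxy; entries to
        # its left are client-supplied and can be forged.
        return xff_header.split(",")[-1].strip()
    # Direct connection or untrusted peer: ignore the header entirely.
    return peer_addr
```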

Include rate limit testing in your regular penetration testing scope and in pre-release security reviews for new API endpoints. A new endpoint that goes live without a rate limit policy is a vulnerability, even if it doesn't look like one in a traditional vulnerability scan.

The Trust Equation in Rate Limiting Design

A theme emerging across the security industry is that trust is becoming a core operational asset — not just a soft value, but a technical and business requirement. Rate limiting is fundamentally about calibrating trust: extending enough latitude for legitimate use while constraining the behaviors that indicate abuse.

This means your rate limits should be differentiated by trust level. An unauthenticated request from an unknown IP should face tighter constraints than an authenticated request from a long-standing enterprise customer with consistent behavioral history. A request carrying a client certificate from a known integration partner should have different limits than an anonymous API call. Microsoft's rollout of Entra passkeys reflects the industry's move toward stronger authentication signals — stronger identity signals should translate directly into more permissive but monitored rate limit policies.

The goal is not to make APIs painful to use. Overly aggressive rate limits drive legitimate users toward workarounds, create support burden, and ultimately damage the developer experience that makes your API worth using. The goal is to make abuse expensive and visible — slow enough that it's not worth doing, noisy enough that you catch it when it happens anyway.

Key Takeaways for Security and Operations Teams

  • Rate limiting by IP alone is insufficient. Layer IP, account, endpoint, and behavioral dimensions for meaningful protection.
  • Choose your algorithm based on your threat model: sliding window for high-sensitivity endpoints, token bucket for service-to-service APIs with bursty patterns, sliding window counter for general-purpose high-scale use.
  • Enforce at multiple layers: edge for volumetric abuse, API gateway for context-aware limits, application layer for behavioral controls.
  • Shared state (Redis or equivalent) is non-negotiable in distributed environments. Per-instance rate limiting is not rate limiting.
  • Rate limit denials are security signals. Log them, alert on spikes, and integrate them into your threat detection workflow.
  • Test your rate limits regularly and include bypass scenarios (IP spoofing, distributed requests, boundary attacks) in your test cases.
  • Calibrate limits to trust level. Stronger authentication and longer positive behavioral history should earn more latitude, not just the same flat limit as anonymous traffic.