When the Botnet Found the API Before the Rate Limiter Was Ready

By IPThreat Team, June 3, 2026

A Real Scenario Worth Learning From

In late 2024, a mid-sized SaaS company offering B2B analytics discovered that a subset of their REST API endpoints had been hammered by what initially looked like a traffic spike. Their monitoring flagged elevated response times, and the on-call engineer assumed a product launch had gone viral. By the time the security team investigated, roughly 2.3 million requests had been sent to a single authentication endpoint over six hours, originating from thousands of distinct IP addresses distributed across residential ISPs in Southeast Asia, Eastern Europe, and the United States.

The attack pattern was consistent with infrastructure inherited from or inspired by botnets like 911 S5, a residential proxy network that at its peak controlled over 19 million IP addresses before its 2024 disruption. 911 S5's dismantling did not eliminate the tradecraft. Threat actors continued using similar residential proxy pools to distribute credential stuffing and API abuse campaigns in ways that made traditional IP-based blocking largely ineffective. The SaaS company had no meaningful rate limiting on its authentication endpoint. Within the six-hour window, attackers validated approximately 140,000 credential pairs.

That failure had a straightforward root cause: the team had implemented rate limiting on their public documentation pages as a box-checking exercise but had never applied it to the endpoints that actually mattered. This article covers what rate limiting needs to look like to defend against the threats currently hitting web APIs, including botnet-distributed attacks, credential stuffing, and automated scraping campaigns.

Why Rate Limiting Is a Security Control, Not a Performance Feature

Rate limiting is frequently treated as an infrastructure concern rather than a security control. Engineering teams implement it to prevent server overload, but the threat model that actually justifies it is adversarial. Attackers use automation at scale. Whether that automation is a botnet conducting credential stuffing, a scraper harvesting pricing data, or a scanner probing endpoints for unpatched vulnerabilities, the common thread is volume over time at a pace humans cannot match.

The April 2026 CVE landscape has continued to surface vulnerabilities in API frameworks, authentication libraries, and content management systems. The DriveSurge campaign, which hijacked thousands of websites for ClickFix and FakeUpdate delivery, relied in part on automated probing to identify vulnerable targets at scale. The WeedHack malware campaign that infected over 116,000 Minecraft systems used automated distribution mechanisms that depend on the absence of effective rate controls on upstream services.

When rate limiting is treated as a performance feature, it gets configured generously to avoid irritating legitimate users. When it is treated as a security control, it gets configured with the adversary's capabilities in mind: thousands of requests per second from distributed sources, with randomized timing, rotating user agents, and residential IP pools that defeat blocklist-based defenses.

The Four Core Rate Limiting Models

Fixed Window Counting

Fixed window counting tracks how many requests arrive within a defined time window, for example 100 requests per minute. When the counter exceeds the threshold, the server returns a 429 Too Many Requests response until the window resets. This model is simple to implement and inexpensive computationally.

The practical weakness is the boundary burst problem. An attacker can send 100 requests in the last second of one window and 100 requests in the first second of the next window, effectively delivering 200 requests in two seconds without triggering the rate limit. Against authentication endpoints or endpoints that invoke expensive backend operations, this burst exposure is significant.
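The boundary burst can be reproduced in a few lines. The following is a minimal in-memory sketch, not a production limiter: the class name, the injectable clock, and the demo function are illustrative assumptions, and the counter dictionary is never pruned.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per window per key."""

    def __init__(self, limit=100, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the boundary case can be simulated
        self.counts = defaultdict(int)  # (key, window index) -> count; never pruned here

    def allow(self, key):
        window_index = int(self.clock() // self.window_seconds)
        bucket = (key, window_index)
        if self.counts[bucket] >= self.limit:
            return False  # the caller would answer 429 Too Many Requests
        self.counts[bucket] += 1
        return True


def boundary_burst_demo():
    """Send 100 requests at t=59.5s and 100 more at t=60.5s; return how many pass."""
    now = [59.5]
    limiter = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: now[0])
    accepted = sum(limiter.allow("client-a") for _ in range(100))
    now[0] = 60.5  # one second later, but already in the next window
    accepted += sum(limiter.allow("client-a") for _ in range(100))
    return accepted
```

Under these assumptions `boundary_burst_demo()` returns 200: every request is accepted even though all of them arrive within about one second of each other.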

Sliding Window Counting

Sliding window counting resolves the boundary burst problem by tracking request timestamps within a rolling window rather than resetting at fixed intervals. A 100-requests-per-minute limit under a sliding window means the system evaluates the last 60 seconds at the moment each new request arrives. If 100 requests have occurred in that interval, the new request is rejected regardless of where the minute boundary falls.

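This model requires either storing individual request timestamps, for example in a Redis sorted set that uses the timestamp as the score and is trimmed by score range on each request, or using an approximate algorithm such as a weighted counter across the current and previous windows. The overhead is modest for most deployments, and the improved accuracy against burst-timing attacks makes it the more appropriate default for security-sensitive endpoints.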
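As an illustration, a timestamp-log version of the sliding window can be sketched in memory as follows. The class and parameter names are assumptions made for this example; a shared deployment would keep the log in an external store rather than a per-process dictionary.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Sliding window log: at most `limit` requests in any trailing window per key."""

    def __init__(self, limit=100, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable for deterministic testing
        self.events = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        now = self.clock()
        log = self.events[key]
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the trailing window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


def sliding_boundary_demo():
    """Same burst as the fixed window case: 100 at t=59.5s, 100 at t=60.5s."""
    now = [59.5]
    limiter = SlidingWindowLimiter(limit=100, window_seconds=60, clock=lambda: now[0])
    accepted = sum(limiter.allow("client-a") for _ in range(100))
    now[0] = 60.5
    accepted += sum(limiter.allow("client-a") for _ in range(100))
    return accepted
```

Here `sliding_boundary_demo()` returns 100, because the second batch is evaluated against the previous 60 seconds rather than against a freshly reset counter.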

Token Bucket

The token bucket model allocates tokens to each client at a defined refill rate. Each request consumes a token. When the bucket empties, requests are rejected until tokens accumulate again. The bucket has a maximum capacity, which defines the allowable burst above the steady-state rate.

Token bucket is well-suited for APIs that serve legitimate clients with bursty traffic patterns, such as mobile applications that synchronize data on reconnection after a period offline. The security configuration decision is how large to set the bucket. A bucket that allows 1,000 burst requests gives an attacker far more room per key than one that allows 50. For authentication endpoints, small buckets are appropriate. For read-heavy data endpoints serving legitimate high-volume clients, larger buckets may be justified with compensating controls.
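A token bucket for a single client can be sketched as below. This is a minimal single-process example; the class name, the `cost` parameter, and the lazy refill-on-request approach are choices made for illustration rather than a reference implementation.

```python
import time


class TokenBucket:
    """Token bucket for one client: `capacity` bounds the burst, refill sets the steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.clock = clock  # injectable for deterministic testing
        self.tokens = float(capacity)  # start full, so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill lazily based on elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(capacity=5, refill_per_second=1)`, five immediate requests pass, the sixth is rejected, and two more become available after two seconds. Shrinking `capacity` is the knob that tightens burst exposure on sensitive endpoints.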

Leaky Bucket

The leaky bucket model processes requests at a fixed output rate regardless of input rate. Requests that arrive faster than the output rate queue up to a maximum size, after which excess requests are dropped. This model enforces strict rate smoothing and is particularly useful for protecting backend services that cannot tolerate burst traffic, such as databases or third-party APIs with their own rate limits.

The tradeoff is that leaky bucket introduces latency for bursting clients and can create queue buildup that degrades response times before the rejection threshold is reached. For security purposes, the queuing behavior means attackers are not immediately informed of the rate limit, which can complicate their automation logic but also means the API continues processing a queue of potentially malicious requests at the allowed rate.
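The queue-and-drain behavior can be sketched as follows. This is a simplified single-process model; the `offer` and `drain` method names are assumptions for the example, and a real service would call `drain` from a worker loop or scheduler.

```python
import time
from collections import deque


class LeakyBucketQueue:
    """Leaky bucket as a bounded queue drained at a fixed rate."""

    def __init__(self, leak_per_second, max_queue, clock=time.monotonic):
        self.leak_interval = 1.0 / leak_per_second
        self.max_queue = max_queue
        self.clock = clock  # injectable for deterministic testing
        self.queue = deque()
        self.next_leak_at = clock()

    def offer(self, request):
        """Queue a request; return False (drop it) when the queue is full."""
        if len(self.queue) >= self.max_queue:
            return False
        if not self.queue:
            # Idle time must not accumulate into a later burst of releases.
            self.next_leak_at = max(self.next_leak_at, self.clock())
        self.queue.append(request)
        return True

    def drain(self):
        """Return the requests that are due now, at most one per leak interval."""
        now = self.clock()
        due = []
        while self.queue and self.next_leak_at <= now:
            due.append(self.queue.popleft())
            self.next_leak_at += self.leak_interval
        return due
```

Note that a request accepted by `offer` is still processed later, which is the behavior described above: a full queue of malicious requests continues to be served at the allowed rate.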

What to Rate Limit By

The choice of rate limiting key determines the effectiveness of the control. Most introductory implementations rate limit by IP address. Against distributed botnet attacks sourced from residential proxy networks, this is insufficient on its own.

IP Address

IP-based rate limiting stops unsophisticated single-source attacks and reduces the effectiveness of small-scale scraping. Against campaigns using residential proxies with pools of tens of thousands of addresses, IP-based limits may allow an attacker to send hundreds of thousands of requests before any single IP hits the threshold. IP limiting should be combined with other keys, not relied upon exclusively.
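A quick calculation shows why. The sketch below reuses the 2.3 million requests over six hours from the opening scenario; the pool size of 10,000 addresses and the 5-per-minute limit are assumed values for illustration, since the scenario only says "thousands" of IPs.

```python
def per_ip_rate_per_minute(total_requests, hours, pool_size):
    """Average request rate each address contributes when load is spread evenly."""
    return total_requests / pool_size / (hours * 60)


def max_requests_under_limit(pool_size, per_ip_limit_per_minute, hours):
    """Upper bound on requests a pool can send while every IP stays at its limit."""
    return pool_size * per_ip_limit_per_minute * hours * 60


# Assumed pool of 10,000 residential IPs (hypothetical figure).
rate = per_ip_rate_per_minute(2_300_000, 6, 10_000)
ceiling = max_requests_under_limit(10_000, 5, 6)
```

Under these assumptions each address sends roughly 0.64 requests per minute, far below a 5-per-minute per-IP threshold, and the same pool could send up to 18 million requests in six hours without any single address tripping that limit.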

One legitimate use of IP-based limiting is aggressive limiting at low thresholds for known bad IP ranges. ESET's APT Activity Report covering Q4 2025 through Q1 2026 documented continued use of compromised hosting infrastructure by multiple advanced threat actors. Applying stricter rate limits or immediate blocking to ASNs associated with commercial hosting and VPS providers that appear in your request logs but should not produce legitimate user traffic is a reasonable supplemental control.

User Account

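Rate limiting by authenticated user account is effective once authentication has occurred. For login endpoints themselves, account-based limiting constrains password guessing against known usernames: after five failed attempts within a window, subsequent attempts against that account from any IP address are rejected or delayed. This blunts distributed credential stuffing and brute-force attempts that retry the same account, even when the attacker rotates IPs freely. It does less against password spraying, where each account sees only one or two guesses, which is one reason to combine account-based limits with the other keys described here.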

The implementation detail that matters is whether the account-based counter increments on failed attempts only or on all attempts including successes. Incrementing only on failures avoids penalizing legitimate users who authenticate successfully after a few attempts, while still blocking sustained automated attacks.
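A failure-only counter keyed on the account might look like the sketch below. The five-failure threshold comes from the text above; the 15-minute window, the class name, and the method names are assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class FailedLoginLimiter:
    """Counts failed logins per account, regardless of source IP."""

    def __init__(self, max_failures=5, window_seconds=900, clock=time.time):
        self.max_failures = max_failures
        self.window_seconds = window_seconds  # assumed 15-minute window
        self.clock = clock
        self.failures = defaultdict(deque)  # account -> timestamps of failures

    def _prune(self, account):
        cutoff = self.clock() - self.window_seconds
        log = self.failures[account]
        while log and log[0] <= cutoff:
            log.popleft()
        return log

    def is_blocked(self, account):
        """Check before verifying the password."""
        return len(self._prune(account)) >= self.max_failures

    def record(self, account, success):
        """Call after each attempt; successful logins never increment the counter."""
        if not success:
            self.failures[account].append(self.clock())
```

A login handler would call `is_blocked` first, then `record` with the outcome. Because the key is the account name, rotating source IPs does not reset the count.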

API Key

For machine-to-machine APIs, rate limiting by API key is the most reliable primary key. Each key has its own quota, and quota exhaustion triggers rejection regardless of which IP the request originates from. This model works well for B2B API products where clients are identifiable and have agreed usage tiers.

The security risk with API key limiting is key compromise. If an attacker obtains a legitimate API key, they can operate within that key's quota without triggering limits. API key limits need to be complemented by anomaly detection on usage patterns: a key that suddenly shifts from requests for financial data to requests for user account enumeration should trigger an alert regardless of whether it has hit its rate limit.

Endpoint Class

Applying different rate limits to different endpoint classes based on their risk profile and computational cost is more effective than applying uniform limits across the API surface. Authentication endpoints, password reset endpoints, and account enumeration endpoints warrant the most aggressive limits. Read-only data endpoints serving cacheable content can tolerate higher limits without meaningful security degradation.

A reasonable tiered structure for a typical web API might apply limits of 5 requests per minute per IP and 10 per minute per account to authentication endpoints, 60 requests per minute per IP to search endpoints, and 600 requests per minute per API key to bulk data endpoints. These numbers require calibration against actual traffic patterns, but the principle of differentiating by risk is sound.
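One way to express that tiered structure is as data that the limiter consults per request. The sketch below uses the example numbers from the paragraph above; the class names ("auth", "search", "bulk"), the `Limit` type, and the counter naming scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    key: str             # which identifier the counter is keyed on
    max_requests: int
    window_seconds: int


# Example tiers from the text; real values need calibration against traffic.
LIMITS_BY_ENDPOINT_CLASS = {
    "auth": (Limit("ip", 5, 60), Limit("account", 10, 60)),
    "search": (Limit("ip", 60, 60),),
    "bulk": (Limit("api_key", 600, 60),),
}


def counters_for(endpoint_class, identifiers):
    """Return (counter name, limit) pairs; every pair must pass for the request to proceed."""
    pairs = []
    for limit in LIMITS_BY_ENDPOINT_CLASS[endpoint_class]:
        value = identifiers.get(limit.key)
        if value is not None:
            pairs.append((f"{endpoint_class}:{limit.key}:{value}", limit))
    return pairs
```

For a login attempt, `counters_for("auth", {"ip": "203.0.113.7", "account": "alice"})` yields two counters, so the request is checked against the per-IP and per-account limits at the same time.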

Distributed Rate Limiting Architecture

A single-node rate limiter breaks down in horizontally scaled deployments. If five API server instances each maintain independent counters, a client can send five times the intended limit by distributing requests across nodes. Effective rate limiting in distributed environments requires a shared state store.

Redis is the most common choice for shared rate limiting state. It provides atomic increment operations, key expiration, and the performance characteristics needed for high-throughput APIs. A practical implementation uses the INCR command with EXPIRE to implement fixed window counting, or a sorted set per key for sliding windows: ZADD records each request with its timestamp as the score, ZREMRANGEBYSCORE trims entries older than the window, and ZCARD counts what remains.
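The INCR-plus-EXPIRE pattern can be sketched as below. The function accepts any client object exposing `incr(name)` and `expire(name, seconds)`, which is how the redis-py client names those commands; the in-memory stand-in exists only so the example runs without a server and does not actually expire keys. The key format is a hypothetical convention.

```python
def fixed_window_allow(client, key, limit=100, window_seconds=60):
    """Return True if this request fits within the current fixed window.

    The window starts at the first request for `key` and ends when the key expires.
    """
    count = client.incr(key)  # atomic on the Redis side
    if count == 1:
        # First hit in this window: start the expiry timer.
        # If the process dies between INCR and EXPIRE the key would never expire;
        # production code usually wraps both steps in a script or transaction.
        client.expire(key, window_seconds)
    return count <= limit


class InMemoryStandIn:
    """Test double that mimics only incr/expire; it records TTLs but never applies them."""

    def __init__(self):
        self.values = {}
        self.ttls = {}

    def incr(self, name):
        self.values[name] = self.values.get(name, 0) + 1
        return self.values[name]

    def expire(self, name, seconds):
        self.ttls[name] = seconds
        return True
```

Because every API node increments the same key, the limit holds across the fleet rather than per instance.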

The failure mode to plan for is Redis unavailability. When the state store goes down, the rate limiter must either fail open (allow all requests, abandoning the control) or fail closed (reject all requests, causing service disruption). Neither option is acceptable for a production API. The right design uses local in-memory fallback counters with conservative thresholds during state store unavailability, combined with alerting that treats state store failure as a high-priority incident.
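A fallback wrapper along those lines might look like this sketch. The names are hypothetical, the broad `except` is a simplification (real code would catch the store client's specific connection and timeout errors), and the local threshold is assumed to be set well below the shared one, for example the intended limit divided by the number of nodes.

```python
import time
from collections import defaultdict


def make_local_allow(limit, window_seconds=60, clock=time.time):
    """Per-process fixed window counter used only while the shared store is down."""
    counts = defaultdict(int)

    def allow(key):
        bucket = (key, int(clock() // window_seconds))
        counts[bucket] += 1
        return counts[bucket] <= limit

    return allow


class FallbackLimiter:
    """Use the shared limiter; on store failure, degrade to a conservative local one."""

    def __init__(self, shared_allow, local_allow, on_degraded=None):
        self.shared_allow = shared_allow  # callable(key) -> bool, may raise
        self.local_allow = local_allow    # callable(key) -> bool
        self.on_degraded = on_degraded    # alert hook, called on each fallback

    def allow(self, key):
        try:
            return self.shared_allow(key)
        except Exception:
            if self.on_degraded is not None:
                self.on_degraded(key)  # treat as a high-priority signal
            return self.local_allow(key)
```

This avoids both extremes: requests are still served during the outage, but each node enforces a tight local ceiling and every fallback is visible to alerting.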

For APIs deployed across multiple geographic regions, a globally consistent rate limiter requires either a data store that replicates counters across regions (a standard Redis Cluster is generally deployed within a single region, so this usually means additional cross-region replication) or accepting that per-region limits will be looser than intended by a factor of up to the number of regions. The latter approach is often acceptable in practice: if an attacker is distributing requests across regions to defeat regional rate limits, they are typically also using a larger number of source IPs, which means other detection mechanisms are more likely to fire.

Behavioral Detection as a Rate Limiting Complement

Rate limiting by static thresholds operates on volume. Sophisticated adversaries operate at volumes just below those thresholds, or briefly exceed them before backing off, or use enough source diversity that no single key trips the limit. Behavioral detection addresses this gap.

The behavioral patterns worth monitoring include:

  • The ratio of failed to successful requests on authentication endpoints. High failure ratios indicate credential stuffing even when per-IP limits are not exceeded.
  • The distribution of request timing. Human traffic is irregular; automated traffic tends toward regular intervals or machine-speed randomization.
  • The correlation between request patterns and known threat indicators, such as IP ranges appearing in threat intelligence feeds or user agents associated with automation frameworks.
  • Sequences of requests that suggest enumeration rather than normal application use.

The BTMOB RAT campaign targeting Android devices, documented in mid-2025, communicated with command and control infrastructure through patterns that included regular polling intervals. Similar regularity appears in API-based botnets. A client that queries an API endpoint at precisely 30-second intervals for six hours is exhibiting automation behavior regardless of whether it has exceeded a volume threshold.
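Timing regularity of this kind can be approximated with the coefficient of variation of the gaps between requests. The sketch below is one simple heuristic, not a complete detector: the threshold of 0.1 and the minimum of 20 requests are assumed values, and it will not catch automation that deliberately randomizes its timing.

```python
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of inter-request gaps; None with fewer than 3 timestamps."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None
    average_gap = mean(gaps)
    if average_gap == 0:
        return 0.0  # all requests at the same instant
    return pstdev(gaps) / average_gap


def looks_automated(timestamps, cv_threshold=0.1, min_requests=20):
    """Flag clients whose request spacing is suspiciously uniform."""
    if len(timestamps) < min_requests:
        return False
    cv = interval_cv(timestamps)
    return cv is not None and cv < cv_threshold
```

A client polling every 30 seconds for six hours produces 720 evenly spaced timestamps and a coefficient of variation of zero, so it is flagged even though its volume is low.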

Communicating Rate Limits to Legitimate Clients

Rate limiting controls that are properly documented and communicated cause less friction for legitimate clients. The standard mechanism is HTTP response headers. The RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers, defined in earlier revisions of the IETF rate limit headers Internet-Draft (later revisions consolidate them into RateLimit and RateLimit-Policy fields), allow clients to self-throttle before receiving a 429 response.

Including Retry-After in 429 responses tells clients exactly how long to wait before retrying. Well-written API clients will respect this header and back off automatically. Attackers may ignore it, but the behavior difference between a legitimate client and an attacker can then be operationalized: a client that repeatedly ignores Retry-After is exhibiting adversarial behavior and can be subjected to increasingly aggressive blocking.
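Both sides of that exchange can be sketched briefly. The helper names below are hypothetical, the reset value is expressed as seconds remaining, and the client helper handles only the delay-in-seconds form of Retry-After (the header may also carry an HTTP date, which this sketch ignores).

```python
import math


def rate_limit_headers(limit, remaining, reset_seconds, rejected=False):
    """Server side: headers to attach to a response; Retry-After only on a 429."""
    reset = max(0, math.ceil(reset_seconds))
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(reset),
    }
    if rejected:
        headers["Retry-After"] = str(reset)
    return headers


def client_wait_seconds(status, headers, default=1.0):
    """Client side: how long to pause before retrying after a response."""
    if status != 429:
        return 0.0
    value = headers.get("Retry-After", "")
    # Only the integer delay form is parsed here; anything else uses the default.
    return float(value) if value.isdigit() else default
```

A server that also records whether the next request from the same client arrived before the advertised delay elapsed has the raw signal needed to escalate blocking for clients that ignore the header.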

Developer documentation should specify rate limits per endpoint class, describe the quota model (fixed window versus sliding window versus token bucket), and provide guidance on handling 429 responses. Teams that document rate limits see fewer legitimate clients unknowingly hammering their APIs, which reduces false-positive rate limiting events and makes anomalous traffic easier to distinguish.

Handling Legitimate High-Volume Clients

Rate limiting creates operational friction when legitimate clients have genuine high-volume needs. A data pipeline that processes millions of records by calling an enrichment API thousands of times per hour is a real use case that needs accommodation. Blanket rate limiting without exception handling drives clients to build workarounds that may introduce security issues of their own.

The right approach is explicit quota tiers tied to API keys, with higher tiers requiring business justification and review. A free tier might allow 1,000 requests per hour, a standard tier 10,000, and an enterprise tier negotiated individually. Each tier has its own rate limiting configuration. The security review for elevating a client to a higher tier should include verifying their identity, understanding their use case, and confirming the use case is consistent with observed traffic patterns.

Burst allowances can be granted to specific API keys without raising the sustained rate limit. A key with a 1,000 requests per hour sustained limit might be allowed a 200-request burst over 60 seconds to accommodate initialization sequences. This requires the token bucket model and careful configuration of bucket capacity relative to refill rate.
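The relationship between bucket capacity and refill rate for that example works out as follows. The function is a small illustrative calculation using the numbers from the paragraph above, not a configuration format used by any particular product.

```python
def bucket_parameters(sustained_per_hour, burst_size):
    """Derive token bucket settings from a sustained hourly limit and a burst allowance."""
    refill_per_second = sustained_per_hour / 3600
    seconds_to_refill_burst = burst_size / refill_per_second
    # Starting from a full bucket, one hour admits the burst plus one hour of refill.
    max_in_first_hour = burst_size + sustained_per_hour
    return refill_per_second, seconds_to_refill_burst, max_in_first_hour


refill, recovery_seconds, first_hour_max = bucket_parameters(1000, 200)
```

For a 1,000-per-hour limit with a 200-request burst, the refill rate is about 0.28 tokens per second, a fully spent burst takes roughly 12 minutes to recover, and the client could send up to 1,200 requests in the first hour. That last figure is worth checking against the tier's intended quota before granting the allowance.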

Testing Rate Limiting Controls Before Attackers Do

Rate limiting implementations contain bugs. The boundary conditions of fixed window implementations, the persistence of counters across deployments, the behavior under Redis failover, and the interaction between multiple rate limiting layers all create potential gaps. Testing these conditions before deployment is more productive than discovering them during an incident.

Red team exercises that specifically target rate limiting controls should verify that:

  • The stated limit is actually enforced at the stated threshold.
  • Distributed requests across multiple source IPs are handled as intended.
  • Limits apply consistently across all API nodes.
  • The system degrades gracefully when the shared state store is unavailable.
  • Authentication endpoint limits cannot be bypassed through HTTP method variation, parameter encoding differences, or path normalization issues.
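The first of those checks, that the limit bites exactly at the stated threshold, can be automated with a small harness. In this sketch `send_request` is a hypothetical callable that performs one request and returns the HTTP status code; the stub stands in for a real HTTP call so the example runs on its own.

```python
def first_rejection_index(send_request, attempts):
    """Call send_request() up to `attempts` times; return the 1-based index of the first 429."""
    for index in range(1, attempts + 1):
        if send_request() == 429:
            return index
    return None


def limit_enforced_at(send_request, stated_limit, margin=5):
    """True when the first 429 arrives on exactly the request after the stated limit."""
    return first_rejection_index(send_request, stated_limit + margin) == stated_limit + 1


def make_stub_endpoint(actual_limit):
    """Stand-in for a real endpoint: answers 200 until `actual_limit` requests, then 429."""
    sent = [0]

    def send_request():
        sent[0] += 1
        return 200 if sent[0] <= actual_limit else 429

    return send_request
```

Running the same harness against each variant of a path, method, or parameter encoding, with a fresh limiter window each time, is one way to probe for the bypasses listed above: any variant that yields more accepted requests than the stated limit deserves investigation.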

Load testing tools like k6, Locust, or Artillery can generate the request volumes needed to verify rate limiting behavior. These tests should be run against staging environments that mirror production configuration precisely, including any CDN or API gateway layers that may apply their own rate limiting upstream of the application.

CDN and API Gateway Rate Limiting

Many organizations deploy CDNs or API gateways that include rate limiting capabilities. Cloudflare, AWS API Gateway, Kong, and similar products provide rate limiting that operates before requests reach application servers. This upstream layer provides several advantages: it operates at network scale, it can absorb volumetric attacks without consuming application server resources, and it typically includes IP reputation and bot detection capabilities that complement volume-based limiting.

The risk with relying exclusively on gateway-level rate limiting is that it may not have the application context needed for fine-grained controls. A gateway can enforce 100 requests per minute per IP across all endpoints, but distinguishing between an authentication endpoint and a data retrieval endpoint requires either gateway configuration that mirrors application routing logic or application-level rate limiting that receives only the traffic that passes the gateway layer.

Recent container security research, including work on supply chain attacks and container escape techniques, has highlighted that infrastructure components such as API gateways can themselves be attack targets. Gateway configuration should be treated as security-sensitive code: stored in version control, reviewed before deployment, and monitored for unauthorized changes.

Integrating Rate Limiting With Incident Response

Rate limiting generates signals that are valuable beyond their immediate function of blocking excess requests. A spike in 429 responses on an authentication endpoint is an indicator of a credential stuffing attack in progress. Logging 429 events with source IP, endpoint, and timestamp provides the data needed to investigate, escalate, and respond to active attacks.

The logging configuration matters. Many teams log 429 responses at a low priority level or aggregate them into counters without preserving individual request metadata. When an incident occurs, the relevant question is not how many requests were blocked but which IPs were attacking, what timing pattern they used, whether any requests before the limit was reached succeeded, and whether the attack correlated with other indicators. That requires per-request logging of blocked attempts, not just aggregate counters.
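A per-request record for blocked attempts might be structured as in the sketch below. The field names and the logger name are assumptions; the point is that each 429 produces one structured event rather than incrementing an anonymous counter.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("ratelimit")  # hypothetical logger name


def blocked_request_record(source_ip, endpoint, limit_key, retry_after, when=None):
    """Build one structured event describing a single blocked request."""
    when = when or datetime.now(timezone.utc)
    return {
        "event": "rate_limit_block",
        "timestamp": when.isoformat(),
        "source_ip": source_ip,
        "endpoint": endpoint,
        "limit_key": limit_key,  # which counter tripped, e.g. "auth:ip:203.0.113.7"
        "retry_after_seconds": retry_after,
    }


def log_blocked_request(**fields):
    """Emit the event as one JSON line at WARNING level and return it."""
    record = blocked_request_record(**fields)
    logger.warning(json.dumps(record, sort_keys=True))
    return record
```

With events in this shape, the investigation questions above become queries: group by `source_ip` to see which addresses were attacking, sort by `timestamp` to recover the timing pattern, and join against authentication logs to find requests that succeeded before the limit was reached.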

Automated escalation based on rate limiting signal thresholds improves response time. If authentication endpoint 429 responses exceed a threshold over a short window, that event should trigger an alert to the security operations team, not just appear in a dashboard. The goal is to move from detection to investigation before the attack window closes.
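Threshold-based escalation can be sketched as a rolling count of 429 events that fires once when the threshold is crossed. The threshold of 500 blocks per 60 seconds is an assumed placeholder, and the class name and return convention are illustrative; the caller would route a `True` result to whatever paging or ticketing hook the security operations team uses.

```python
import time
from collections import deque


class BlockRateAlert:
    """Fires once when 429s on an endpoint exceed `threshold` within the window."""

    def __init__(self, threshold=500, window_seconds=60, clock=time.time):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable for deterministic testing
        self.events = deque()  # timestamps of recent 429 responses
        self.alerting = False

    def record_block(self):
        """Record one 429; return True only at the moment the threshold is first crossed."""
        now = self.clock()
        self.events.append(now)
        cutoff = now - self.window_seconds
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        over = len(self.events) > self.threshold
        should_fire = over and not self.alerting
        self.alerting = over  # re-arms once the rate drops back under the threshold
        return should_fire
```

Firing on the crossing rather than on every event keeps the alert actionable during a sustained attack, while the re-arm step ensures a second wave later in the day pages again.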

Key Takeaways for Implementation

  • Apply the most aggressive rate limits to authentication, password reset, and account enumeration endpoints, where abuse directly enables account takeover.
  • Use sliding window or token bucket models rather than fixed window counting for security-sensitive endpoints to close the boundary burst gap.
  • Rate limit by multiple keys simultaneously: IP address for baseline protection, user account for preventing per-account abuse regardless of IP rotation, and API key for machine-to-machine traffic.
  • Deploy rate limiting state in a shared store like Redis when running multiple API nodes, and define a tested failover behavior for state store unavailability.
  • Log rate limiting events at a detail level that supports incident investigation, including source IP, endpoint, timestamp, and whether the client respected Retry-After headers.
  • Integrate rate limiting signal into security monitoring and alerting so that attack patterns generate operational alerts rather than sitting in dashboards.
  • Test rate limiting controls explicitly before deployment, including boundary conditions, distributed request patterns, and behavior under infrastructure failure.
  • Document rate limits in API documentation and include appropriate response headers so legitimate clients can self-throttle and development teams understand the controls in place.

Rate limiting is a foundational control, but it is not a complete defense. It works best as one layer in a stack that includes behavioral detection, IP reputation filtering, strong authentication, and monitoring. Treating it as a solved problem after initial implementation leaves gaps that adversaries with distributed infrastructure and patient automation will find.
