Best strategy for rate-limiting agent API calls in production Kubernetes?
Context
We have ~200 autonomous agents hitting our internal API gateway. During peak hours we see:
- 429s from OpenAI (RPM limit exceeded)
- Cascading retries amplifying the load
- P99 latency spikes to 30+ seconds
Current setup
- Kubernetes with KEDA autoscaling
- Redis for rate limit counters
- Exponential backoff in each agent
What I want
A strategy that's fair across agents, doesn't cause thundering-herd on retry, and degrades gracefully when upstream limits are hit. Token bucket? Leaky bucket? Something else?
Asked by @bob-martinez
2 Answers
Token bucket with a central Redis lease is the right call here, but the thundering-herd problem needs a separate fix.
The pattern
- Shared token bucket in Redis (one per upstream API key) — all agent pods consume from the same bucket
- Jittered retry — `sleep(base * 2^attempt + random(0, base))` instead of pure exponential
- Circuit breaker per upstream — if the error rate exceeds a threshold, agents fast-fail instead of retrying
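The jittered retry above can be sketched as follows. The `BASE_MS` value, the delay cap, and the helper names are illustrative assumptions, not part of the original formula:

```javascript
// Jittered exponential backoff: base * 2^attempt plus up to one extra base
// interval of random jitter, capped so late retries don't grow unbounded.
// BASE_MS and MAX_DELAY_MS are assumed values for illustration.
const BASE_MS = 250
const MAX_DELAY_MS = 30_000

function jitteredDelay(attempt, rand = Math.random) {
  const exp = BASE_MS * 2 ** attempt
  return Math.min(exp + rand() * BASE_MS, MAX_DELAY_MS)
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Retry wrapper: sleeps a jittered delay between attempts, rethrows on exhaustion.
async function retryWithJitter(fn, maxAttempts = 5) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err
      await sleep(jitteredDelay(attempt))
    }
  }
}
```

The random term is what breaks up the thundering herd: agents that failed at the same instant no longer wake up at the same instant.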
The part people miss
Don't let individual agents retry against the shared bucket — add a local queue per pod. Agents enqueue requests; the pod drains at a rate proportional to remaining bucket tokens. This smooths traffic and keeps agents non-blocking.
```js
// Simplified pod-level queue drain: on every tick, pull as many queued
// requests as the shared bucket currently allows and fire them concurrently.
setInterval(async () => {
  const tokens = await redisTokenBucket.available()           // remaining shared budget
  const batch = localQueue.drain(Math.min(tokens, MAX_BATCH)) // never exceed the bucket
  await Promise.allSettled(batch.map(req => req.execute()))   // one failure doesn't abort the batch
}, DRAIN_INTERVAL_MS)
```
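For reference, here is a minimal sketch of the token bucket the drain loop consults. In production this state would live in Redis (one bucket per upstream API key, updated atomically, e.g. in a Lua script) so concurrent pods can't double-spend; the in-memory version below just shows the refill arithmetic. All names are illustrative:

```javascript
// In-memory token bucket with lazy refill. now() is injectable for testing;
// a Redis implementation would run the same arithmetic server-side so that
// the read-and-decrement is atomic across pods.
class TokenBucket {
  constructor({ capacity, refillPerSec, now = () => Date.now() }) {
    this.capacity = capacity
    this.refillPerSec = refillPerSec
    this.now = now
    this.tokens = capacity
    this.lastRefill = now()
  }

  refill() {
    const elapsedSec = (this.now() - this.lastRefill) / 1000
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec)
    this.lastRefill = this.now()
  }

  // How many whole tokens are available right now.
  available() {
    this.refill()
    return Math.floor(this.tokens)
  }

  // Try to consume n tokens; on false the caller should queue, not spin-retry.
  take(n) {
    this.refill()
    if (this.tokens < n) return false
    this.tokens -= n
    return true
  }
}
```

Refilling lazily (on read) rather than on a timer means idle buckets cost nothing, which matters when you have one bucket per upstream key.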
Answered by @alice-chen · 4d ago
One thing I'd add: look at KEDA's external scaler with your Redis rate-limit counters as the scale metric. Instead of scaling on CPU/memory, scale your agent pods in proportion to remaining API quota. When you're near the limit, KEDA can scale down agents to reduce the request rate automatically, with no code changes needed in the agents.
Answered by @dave-park · 4d ago