


The root cause is a classic amnesia loop: `Restart=always` allows unlimited restarts, and each one resets all in-memory counters to zero. The proxy then sees "full budget available" while the upstream API's sliding window is still exhausted, so it hammers the API, generating more 429s, which cause more restarts.

## The fix has five independent layers

1. Persist rate-limit state across restarts

Write token usage, the window start timestamp, the consecutive 429 count, and cooldown state to a file (`state.json`) on every mutation. On startup, restore that state, and reset the window counters if the saved window start is older than the window length. Without this, every restart is a full reset.
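Here is a minimal TypeScript sketch of this layer. It assumes a fixed-window counter. The `RateLimitState` fields, the `StateStore` interface, and the in-memory store are hypothetical names chosen for illustration. A real deployment would back `StateStore` with `state.json`, ideally by writing a temporary file and renaming it over the original so that a crash mid-write cannot leave a torn file.

```typescript
// Hypothetical shape of the persisted rate-limit state; field names are illustrative.
interface RateLimitState {
  tokensUsedInWindow: number;
  windowStartMs: number;   // epoch ms when the current window opened
  consecutive429s: number;
  cooldownUntilMs: number; // epoch ms; 0 means no cooldown
}

// Hypothetical storage abstraction; a file-backed version would read/write state.json.
interface StateStore {
  read(): string | null;
  write(data: string): void;
}

class InMemoryStore implements StateStore {
  private data: string | null = null;
  read(): string | null { return this.data; }
  write(data: string): void { this.data = data; }
}

function freshState(nowMs: number): RateLimitState {
  return { tokensUsedInWindow: 0, windowStartMs: nowMs, consecutive429s: 0, cooldownUntilMs: 0 };
}

// Call after every mutation so a restart never loses counters.
function saveState(store: StateStore, state: RateLimitState): void {
  store.write(JSON.stringify(state));
}

// Call on startup. A stale window resets only the window counters;
// the 429 count and any still-active cooldown carry over.
function loadState(store: StateStore, nowMs: number, windowMs: number): RateLimitState {
  const raw = store.read();
  if (raw === null) return freshState(nowMs);
  let saved: RateLimitState;
  try {
    saved = JSON.parse(raw) as RateLimitState;
  } catch {
    // Corrupt file: start clean (a stricter policy could start in cooldown instead).
    return freshState(nowMs);
  }
  if (nowMs - saved.windowStartMs >= windowMs) {
    return { ...saved, tokensUsedInWindow: 0, windowStartMs: nowMs };
  }
  return saved;
}
```

A stale restore resets only the window counters. The 429 count and any unexpired cooldown carry over, so the upstream never sees a restart as a clean slate.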

2. Fix the systemd restart policy

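```ini
[Unit]
# Refuse further starts after 5 starts within 300s (StartLimit* keys go in [Unit])
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
# Restart on non-zero exit, crash signal, or timeout; a clean exit stays down
Restart=on-failure
RestartSec=10
```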

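`Restart=always` with `RestartSec=3` allows roughly 20 restarts per minute. systemd's default start limit (5 starts per 10s) never trips because the restarts are spaced 3s apart, so nothing at the systemd level breaks the loop. `on-failure` plus an explicit burst limit lets systemd stop the cascade.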
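The restart arithmetic can be checked with a simplified model of systemd's start rate limiting. The model uses a fixed window that opens at the first start and refuses any start beyond the burst. It assumes the worst case: a process that crashes immediately after each start, so starts land exactly `RestartSec` apart. The 10s/5 values are assumed to be systemd's stock `DefaultStartLimitIntervalSec`/`DefaultStartLimitBurst`. The function name is hypothetical.

```typescript
// Simplified model of systemd start rate limiting: a window opens at the first
// start; a start is refused once more than `burst` starts fall inside `intervalSec`.
// Worst case assumed: each run crashes instantly, so starts are `restartSec` apart.
function startRefusedAtSec(
  restartSec: number,
  intervalSec: number,
  burst: number,
  horizonSec: number,
): number | null {
  let windowBegin = -Infinity;
  let count = 0;
  for (let t = 0; t <= horizonSec; t += restartSec) {
    if (t - windowBegin > intervalSec) {
      windowBegin = t; // window expired: open a new one with this start
      count = 1;
      continue;
    }
    count++;
    if (count > burst) return t; // this start is refused; the unit stays failed
  }
  return null; // limit never tripped within the horizon
}

// Restart=always, RestartSec=3 against the defaults (5 starts per 10s):
console.log(startRefusedAtSec(3, 10, 5, 3600));   // null: never trips
// Restart=on-failure, RestartSec=10, StartLimitBurst=5, StartLimitIntervalSec=300:
console.log(startRefusedAtSec(10, 300, 5, 3600)); // 50: the 6th start is refused
```

Real starts take longer than zero seconds. That only spreads them out further, so the first configuration is even less likely to trip.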

3. Fix the integral controller's anti-windup guards

The budget nudge-up fired on `tokensUsedInWindow > 0 && consecutive429s === 0`, which a freshly restarted proxy satisfies after a single request. Gate budget increases on all of the following:

- Minimum traffic duration (e.g. 1 hour of history)
- Minimum budget consumption (e.g. 5%+ used)
- No 429s in the recent window
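Here is a sketch of the gated nudge-up. It assumes the controller tracks the four hypothetical fields below. `historyMs` is assumed to be persisted with the rest of the state so that it also survives restarts. The thresholds are the example values from the list, not recommendations.

```typescript
// Hypothetical controller statistics; names are illustrative.
interface ControllerStats {
  historyMs: number;          // how long traffic has been observed (persisted)
  tokensUsedInWindow: number;
  budgetTokens: number;       // current budget for the window
  recent429s: number;         // 429s seen in the recent window
}

// Example thresholds from the text: 1 hour of history, 5% of budget used.
const MIN_HISTORY_MS = 60 * 60 * 1000;
const MIN_USAGE_FRACTION = 0.05;

// Every gate must hold before the integral term may push the budget up.
function canIncreaseBudget(s: ControllerStats): boolean {
  const enoughHistory = s.historyMs >= MIN_HISTORY_MS;
  const enoughUsage =
    s.budgetTokens > 0 && s.tokensUsedInWindow / s.budgetTokens >= MIN_USAGE_FRACTION;
  const no429s = s.recent429s === 0;
  return enoughHistory && enoughUsage && no429s;
}
```

A proxy that has just restarted and served one request fails the history and usage gates, whereas the old guard would have let it nudge the budget up.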

4. Exponential backoff + hard rejection during cooldown

A fixed 30s retry after a 429 keeps hammering an upstream whose window is still exhausted. Use exponential backoff capped at 300s (30s → 60s → 120s → 240s → 300s). While a cooldown is active, return a local 429 with a `Retry-After` header immediately and don't forward the request upstream at all.
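Here is a sketch of the backoff schedule and the local short-circuit. It assumes `consecutive429s` counts the 429s seen so far (1 after the first) and resets to 0 on the next upstream success. `nextBackoffMs`, `onUpstream429`, `onUpstreamSuccess`, `rejectIfCoolingDown`, and the `LocalResponse` shape are hypothetical names, not any framework's API. `Retry-After` is sent in its delay-seconds form and rounded up so clients never retry early.

```typescript
const BASE_BACKOFF_MS = 30_000;
const MAX_BACKOFF_MS = 300_000;

interface CooldownState {
  consecutive429s: number;
  cooldownUntilMs: number; // epoch ms; cooldown is active while now < this
}

// Hypothetical response shape returned to the client without contacting upstream.
interface LocalResponse {
  status: number;
  headers: Record<string, string>;
}

// 1st 429 -> 30s, then doubling, capped at 300s: 30, 60, 120, 240, 300, 300, ...
function nextBackoffMs(consecutive429s: number): number {
  const exponent = Math.max(0, consecutive429s - 1);
  return Math.min(MAX_BACKOFF_MS, BASE_BACKOFF_MS * 2 ** exponent);
}

// Upstream answered 429: count it and open (or extend) the cooldown.
function onUpstream429(state: CooldownState, nowMs: number): void {
  state.consecutive429s += 1;
  state.cooldownUntilMs = nowMs + nextBackoffMs(state.consecutive429s);
}

// Upstream succeeded: the backoff ladder starts over.
function onUpstreamSuccess(state: CooldownState): void {
  state.consecutive429s = 0;
}

// Checked before forwarding: during cooldown, answer locally and skip upstream.
function rejectIfCoolingDown(state: CooldownState, nowMs: number): LocalResponse | null {
  if (nowMs >= state.cooldownUntilMs) return null; // not cooling down: forward upstream
  const retryAfterSec = Math.ceil((state.cooldownUntilMs - nowMs) / 1000);
  return { status: 429, headers: { "Retry-After": String(retryAfterSec) } };
}
```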

5. Filter poisoned calibration points

Discard any calibration sample where `tokens === 0`; those were recorded right after a restart and encode no real signal.
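The filter itself is a one-liner. The sketch below assumes a hypothetical `CalibrationSample` type whose `tokens` field matches the name used above.

```typescript
// Hypothetical sample shape; only `tokens` comes from the text.
interface CalibrationSample {
  timestampMs: number;
  tokens: number;
}

// Zero-token samples were captured before real traffic reached the counters
// (typically just after a restart), so they say nothing about the limit.
function usableSamples(samples: readonly CalibrationSample[]): CalibrationSample[] {
  return samples.filter((s) => s.tokens !== 0);
}
```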

## Key principle

Any stateful middleware tracking rate limits, budgets, or sliding windows must treat its state as durable data, not ephemeral memory. The combination of aggressive restart policies and volatile state creates a positive-feedback loop where the failure mode (restart) amplifies the problem (counter reset → over-budget requests → more 429s → more restarts).