How CLIProxyAPI Load-Balances Your AI Subscriptions — and What Its Scheduler Reveals About Quota Limits

CLIProxyAPI is marketed as an OpenAI/Claude/Gemini-compatible proxy that round-robins across your subscription accounts. In the code (v7.2.47) it’s a reactive per-account cooldown state machine. It doesn’t predict your quota; it waits for a 429, then benches the account — and how long it benches depends entirely on which provider said no.

That last part is the interesting bit, so let me pull the thread all the way through. I read this at v7.2.47, commit 00114be, and every claim below points at a line.

What is it, and where does it sit?

CLIProxyAPI is a self-hosted proxy. You log into your Claude, ChatGPT/Codex, and Gemini accounts through it via OAuth, and it exposes them all as one OpenAI-compatible endpoint. Point your tools at it, and it fans requests out across whichever of your subscription logins are currently usable.

The eye-catching feature is the account scheduler — the thing that picks which of your logged-in accounts serves the next request and decides when to stop using one. That’s what I went looking at, because “round-robin load balancing across your subscriptions” is doing a lot of marketing work, and I wanted to see what it actually means in the code.

Short version: there’s no quota meter. Nothing counts your requests to keep you under a cap. The pool is gated purely by reaction to failures: an account is used until a provider returns a 429 (or another benching error), then it sits in cooldown until a timer expires.

How does the scheduler actually work?

The loop is: request → select an account → execute → MarkResult → maybe cool it down → repeat. Two halves matter — how selection picks, and how the cooldown machine benches accounts.

Selection: three strategies over the same available pool

All three selectors first ask the same question — which auths are not currently blocked — and then differ only in how they break the tie.

Strategy	Tie-break rule	Source
Round-robin (default)	Rotates a cursor across the available auths	`selector.go:257-283`
Fill-first	Returns `available[0]` after an ID-sort	`selector.go:294-303`
Session-affinity	Wraps a base selector; pins a session to one auth by TTL	`selector.go:417-470`
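
To make the tie-break column concrete, here is a minimal Go sketch of the first two strategies. It is not the project's code: `Auth`, `RoundRobin`, and `FillFirst` are names I made up for illustration, and the sketch assumes the `available` slice has already been filtered down to unblocked auths.

```go
package main

import (
	"fmt"
	"sort"
)

// Auth is a hypothetical stand-in for one logged-in account; only the ID matters here.
type Auth struct{ ID string }

// RoundRobin rotates a cursor across whatever auths are currently available.
type RoundRobin struct{ cursor int }

// Pick returns the next auth in rotation, or false when the pool is empty.
func (r *RoundRobin) Pick(available []Auth) (Auth, bool) {
	if len(available) == 0 {
		return Auth{}, false
	}
	a := available[r.cursor%len(available)]
	r.cursor++
	return a, true
}

// FillFirst always returns the lowest-ID available auth.
func FillFirst(available []Auth) (Auth, bool) {
	if len(available) == 0 {
		return Auth{}, false
	}
	sorted := append([]Auth(nil), available...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })
	return sorted[0], true
}

func main() {
	pool := []Auth{{ID: "b"}, {ID: "a"}, {ID: "c"}}

	rr := &RoundRobin{}
	for i := 0; i < 4; i++ {
		a, _ := rr.Pick(pool)
		fmt.Print(a.ID, " ")
	}
	fmt.Println()

	ff, _ := FillFirst(pool)
	fmt.Println(ff.ID)
}
```

Neither function looks at usage. Both only see which auths survived the availability filter, which is the point the rest of this section builds on.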

The word “available” is where the whole design lives. An auth is blocked for a given model when it’s marked unavailable and its retry timer is still in the future — isAuthBlockedForModel at selector.go:305-362 checks exactly Unavailable && NextRetryAfter.After(now), and it does this per-(auth, model) pair. So a Claude account benched for claude-sonnet can still serve a different model.
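
A small sketch of that per-(auth, model) check, under assumptions: the `ModelState` and `Auth` layouts below are hypothetical, and only the boolean expression itself comes from the source.

```go
package main

import (
	"fmt"
	"time"
)

// Fixed values so the demo is deterministic (illustrative only).
var epoch = time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)

const cooldown = 5 * time.Minute

// ModelState is a hypothetical per-(auth, model) record.
type ModelState struct {
	Unavailable    bool
	NextRetryAfter time.Time
}

// Auth keeps one state per model name (assumed layout, not the real struct).
type Auth struct {
	ID     string
	Models map[string]ModelState
}

// blockedForModel applies the condition quoted above:
// Unavailable && NextRetryAfter.After(now).
func blockedForModel(a Auth, model string, now time.Time) bool {
	st, ok := a.Models[model]
	if !ok {
		return false // no recorded failure for this model
	}
	return st.Unavailable && st.NextRetryAfter.After(now)
}

func main() {
	a := Auth{ID: "claude-1", Models: map[string]ModelState{
		"claude-sonnet": {Unavailable: true, NextRetryAfter: epoch.Add(cooldown)},
	}}
	fmt.Println(blockedForModel(a, "claude-sonnet", epoch))                 // benched model
	fmt.Println(blockedForModel(a, "claude-opus", epoch))                   // different model, same account
	fmt.Println(blockedForModel(a, "claude-sonnet", epoch.Add(2*cooldown))) // timer expired
}
```

Note that an expired timer unblocks the auth even if the `Unavailable` flag was never cleared, because both halves of the condition must hold.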

And what’s not there is as telling as what is. There’s a recentRequestRing that records outcomes (types.go:146-159), but it’s written in MarkResult and read only by the usage/management endpoints — it never gates selection. There’s no max_request, no message_limit, no daily_limit config anywhere. Selection is blind to how much you’ve used an account. It only knows whether that account has recently failed.

The cooldown machine: one central sink

Every executor outcome funnels through a single function — Manager.MarkResult at conductor.go:3496. When the status code is 429 (conductor.go:3468), it computes the next-retry time like this:

next = now + *RetryAfter   // if the executor surfaced a hint
next = now + nextQuotaCooldown(level)  // otherwise

That hint — whether one exists at all — is the crux of the whole article, and I’ll come back to it. When there’s no hint, nextQuotaCooldown (conductor.go:4153-4169) is a plain exponential backoff: 1s * 2^level, capped at 30m. It sets a Quota{Exceeded, Reason:"quota", NextRecoverAt, BackoffLevel} on the state, a success later clears it, and the whole thing is persisted to per-auth .cds files and restored (if not yet expired) at boot (conductor.go:557-611).
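
As a rough sketch of that arithmetic, assuming the curve is exactly `1s * 2^level` with a hard 30-minute cap: `nextRetry` and the constant names are mine, and the real function may differ in details such as how the level is incremented.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseCooldown = time.Second
	maxCooldown  = 30 * time.Minute
)

// Fixed start time so the demo is deterministic (illustrative only).
var start = time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)

// nextQuotaCooldown returns baseCooldown * 2^level, capped at maxCooldown.
func nextQuotaCooldown(level int) time.Duration {
	d := baseCooldown
	// Stop doubling once the cap is reached so large levels cannot overflow.
	for i := 0; i < level && d < maxCooldown; i++ {
		d *= 2
	}
	if d > maxCooldown {
		d = maxCooldown
	}
	return d
}

// nextRetry prefers the provider hint when one exists, else the backoff curve.
func nextRetry(now time.Time, retryAfter *time.Duration, level int) time.Time {
	if retryAfter != nil {
		return now.Add(*retryAfter)
	}
	return now.Add(nextQuotaCooldown(level))
}

func main() {
	for _, lvl := range []int{0, 1, 2, 10, 11, 20} {
		fmt.Printf("level %2d -> %v\n", lvl, nextQuotaCooldown(lvl))
	}
	hint := 90 * time.Second
	fmt.Println("with hint:   ", nextRetry(start, &hint, 5).Sub(start))
	fmt.Println("without hint:", nextRetry(start, nil, 5).Sub(start))
}
```

On this curve the cap kicks in at level 11, since 2^10 seconds is about 17 minutes and 2^11 seconds is already past 30.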

Status → cooldown duration, at a glance

Here’s the state machine as a table. Different HTTP outcomes bench an account for very different durations:

Outcome	Cooldown applied	Reason flag
`429` with provider hint	`now + hint` (exact reset time)	Cooldown (quota)
`429` without hint	`1s → 2s → 4s → … → 30m` exponential backoff	Cooldown (quota)
`401` (auth)	`+30m`	Other
`402` / `403` (payment / forbidden)	`+30m`	Other
`404`	`+12h`	Other
`400`/`422` model-not-supported	`+12h`	Other
Transient (`408`/`500`/`502`/`503`/`504`)	`+1m` (configurable)	Other (not quota)
Cloudflare challenge	quota curve, floored at `10s`	Cooldown (quota)

Note that the transient path (408 plus the listed 5xx codes) sets only a retry timer, not the Quota flag (conductor.go:3629-3634), so it reads back as a generic “other” block, not a quota cooldown. The distinction matters mostly for the management endpoints that report why an account is benched.
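
The status-code rows of that table can be sketched as a single switch. This is an illustration, not the source: `Block`, `cooldownFor`, and `transientCooldown` are hypothetical names, the model-not-supported and Cloudflare rows are left out because they depend on inspecting the response body, and I am assuming that a negative transient setting means "no cooldown at all".

```go
package main

import (
	"fmt"
	"time"
)

// Block describes how long an auth is benched and whether it counts as quota.
type Block struct {
	Duration time.Duration
	Quota    bool // true only for the quota-cooldown rows
}

// transientCooldown interprets the setting: 0 = 60s default, <0 = disabled.
func transientCooldown(seconds int) time.Duration {
	switch {
	case seconds < 0:
		return 0 // assumption: "off" means no bench at all
	case seconds == 0:
		return time.Minute
	default:
		return time.Duration(seconds) * time.Second
	}
}

// cooldownFor maps an HTTP status to a Block, following the table.
// backoff is whatever the quota curve yields when there is no hint.
func cooldownFor(status int, hint *time.Duration, backoff time.Duration, transientSeconds int) Block {
	switch status {
	case 429:
		if hint != nil {
			return Block{Duration: *hint, Quota: true}
		}
		return Block{Duration: backoff, Quota: true}
	case 401, 402, 403:
		return Block{Duration: 30 * time.Minute}
	case 404:
		return Block{Duration: 12 * time.Hour}
	case 408, 500, 502, 503, 504:
		return Block{Duration: transientCooldown(transientSeconds)}
	}
	return Block{} // success or an unlisted status: no bench in this sketch
}

func main() {
	backoff := 4 * time.Second // pretend the quota curve is at level 2
	for _, s := range []int{429, 401, 404, 503, 200} {
		b := cooldownFor(s, nil, backoff, 0)
		fmt.Printf("%d -> %v quota=%v\n", s, b.Duration, b.Quota)
	}
}
```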

Design decisions: two gaps worth showing

The architecture is coherent. But two spots reveal a gap between the pitch and the mechanism, and I’d rather show you the code than editorialize.

Gap 1 — “fill-first staggers your subscription caps” is a comment, not logic

Fill-first’s comment (selector.go:32-35, from blame b078be46) says it “burns one account … to stagger rolling-window subscription caps (e.g. chat message limits).” That reads like there’s cap-aware logic underneath.

There isn’t. FillFirstSelector.Pick returns available[0] after sorting by ID (selector.go:294-303). No rolling-window math, no cap tracking. The “staggering” is an emergent effect of deterministic ordering — because the pool always drains account #1 first, the accounts hit their windows at different wall-clock times. The honest framing: fill-first hammers account #1 until it 429s, then moves to #2. That may well be the behavior you want. It’s just not enforced by cap logic; it falls out of the ordering.

Gap 2 — recovery timing is provider-specific, and in this path nobody reads `Retry-After`

Here’s the one I’d want to know before pooling accounts. The central machine only gets a real reset time if the executor surfaced a hint via error.RetryAfter() (conductor.go:3908-3925). So the question becomes: which executors actually populate that hint on a request 429? The answer splits sharply by provider.

Provider	Where the 429 recovery time comes from	Knows your real reset?
Codex / OpenAI	JSON `error.resets_at` / `error.resets_in_seconds`, but only when `error.type == "usage_limit_reached"` (`codex_executor.go:1834-1852`)	Yes
Claude	Nothing on a request 429 → exponential backoff (`claude_executor.go:324`)	No (guesses)
Gemini (+Vertex)	Nothing → exponential backoff (`gemini_executor.go:197`)	No (guesses)
Antigravity	Google `RetryInfo.retryDelay`/`quotaResetDelay` + regex, plus a short-cooldown KV and sub-3s instant retry (`json_retry_helpers.go:27-80`, `antigravity_executor.go:2621-2648`)	Yes

The money line: of the three headline subscription providers, only Codex actually knows when your subscription resets, because ChatGPT returns resets_at in the error body and the executor parses it. (Codex’s per-minute rate_limit_* fields are deliberately excluded from that parse at codex_executor.go, so those still fall to backoff. Antigravity also gets a real delay, from Google’s RetryInfo.) For Claude and Gemini, the proxy has no idea when your window reopens. It just backs off blindly: 1s, 2s, 4s, and so on up to 30 minutes. If you pool Claude accounts expecting the proxy to respect your real reset window, it doesn’t: it’s guessing, and it re-probes on a schedule that has nothing to do with when Anthropic actually lifts the limit.¹
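
For a feel of what the Codex path does, here is a hedged sketch of extracting a reset hint from an error body. The field names and the `usage_limit_reached` gate come from the table above; everything else is assumed, including that `resets_at` is Unix seconds, that it takes precedence over `resets_in_seconds`, and the helper names `codexError` and `retryHint`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Fixed "now" so the demo is deterministic (illustrative only).
var demoNow = time.Unix(1_700_000_000, 0)

// codexError models only the fields named in the article; the real payload may carry more.
type codexError struct {
	Error struct {
		Type            string `json:"type"`
		ResetsAt        *int64 `json:"resets_at"` // assumed: Unix seconds
		ResetsInSeconds *int64 `json:"resets_in_seconds"`
	} `json:"error"`
}

// retryHint returns a cooldown only for usage_limit_reached errors.
// Any other error type yields no hint, so the caller falls back to backoff.
func retryHint(body []byte, now time.Time) (time.Duration, bool) {
	var e codexError
	if err := json.Unmarshal(body, &e); err != nil {
		return 0, false
	}
	if e.Error.Type != "usage_limit_reached" {
		return 0, false
	}
	if e.Error.ResetsAt != nil {
		if d := time.Unix(*e.Error.ResetsAt, 0).Sub(now); d > 0 {
			return d, true
		}
	}
	if e.Error.ResetsInSeconds != nil && *e.Error.ResetsInSeconds > 0 {
		return time.Duration(*e.Error.ResetsInSeconds) * time.Second, true
	}
	return 0, false
}

func main() {
	body := []byte(`{"error":{"type":"usage_limit_reached","resets_in_seconds":3600}}`)
	if d, ok := retryHint(body, demoNow); ok {
		fmt.Println("bench for", d)
	} else {
		fmt.Println("no hint, fall back to backoff")
	}
}
```

The Claude and Gemini rows correspond to the case where no such parse exists, so the `ok == false` branch is the only one those providers ever reach.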

Is reactive-not-predictive a bad design? No. Providers don’t hand out a clean “you have N requests left” signal, so waiting for the 429 is a reasonable stance. My only nit is with the word “load balancing.” A load balancer distributes to keep everyone under capacity; this distributes and then reacts to overflow. It’s a failover cooldown scheduler wearing a load-balancer label — and for two of the three headline providers, the cooldown timing is a blind guess.

Where does this sit on the gateway map?

This is the axis I’d use to place CLIProxyAPI among the LLM gateways: what are you routing?

Generic gateways — LiteLLM, OpenRouter, one-api — route API keys, chosen by price, latency, or a fallback order. The unit is a billable key with a metered balance. CLIProxyAPI routes OAuth subscription accounts bound by opaque quota windows. The unit is a login you already pay a flat monthly fee for, and the constraint isn’t dollars-per-token, it’s “how many messages before the rolling window slaps you.”

That one difference reshapes everything downstream. Key routers can price-optimize because they can see the price. A subscription proxy can’t see the quota, so its only lever is reaction — which is exactly why CLIProxyAPI is a cooldown state machine and not a cost optimizer. When the llm-gateways comparison hub goes up, this is the camp CLIProxyAPI lands in: subscription-account proxies, distinct from API-key routers.

When would you reach for it?

If you have several ChatGPT/Codex subscriptions and want to pool them behind one OpenAI-compatible endpoint, this is a genuinely good fit: Codex is the one headline provider where the scheduler knows your real reset time, so the cooldowns line up with reality. For Claude and Gemini it still works and still spreads load across accounts; just go in knowing the cooldown is a blind exponential backoff, not a reservation against your actual window.

The levers you’d actually turn (config):

routing.strategy — round-robin (default) or fill-first; hot-reloadable (service.go:1275-1317).
routing.session-affinity — off by default, TTL 1h. When a client sends no session ID, it reconstructs one from a content hash (FNV-64a of system + first-user message, truncated to 100 chars) at selector.go:585-750 — worth knowing if you expect strict pinning.
Per-auth priority buckets — highest-priority-only selection (selector.go:115-128).
disable-cooling (global / per-auth / per-provider) and transient-error-cooldown-seconds (0 = 60s, <0 = off) (conductor.go:94-146).
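
To show why the content-hash fallback matters for pinning, here is a sketch of deriving a session key with FNV-64a. It is an assumption-heavy illustration: I am reading "truncated to 100 chars" as truncating each input before hashing, the zero-byte separator is my own addition, and `fallbackSessionID` and `truncate` are made-up names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// truncate keeps at most n characters (runes) of s.
func truncate(s string, n int) string {
	r := []rune(s)
	if len(r) > n {
		return string(r[:n])
	}
	return s
}

// fallbackSessionID hashes the system prompt and first user message with
// FNV-64a. Truncating the inputs first is an assumption about the real code.
func fallbackSessionID(system, firstUser string) string {
	h := fnv.New64a()
	h.Write([]byte(truncate(system, 100)))
	h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(truncate(firstUser, 100)))
	return fmt.Sprintf("%016x", h.Sum64())
}

func main() {
	fmt.Println(fallbackSessionID("You are a helpful assistant.", "Refactor this function"))
}
```

Under that reading, two conversations that share a system prompt and the first 100 characters of the opening message would land on the same derived session, and therefore the same pinned auth.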

Methodology & scope

I read CLIProxyAPI v7.2.47 at commit 00114be, focusing on the translator, auth, runtime/executor, and sdk/cliproxy/auth packages. The two headline claims — the provider-split on 429 recovery timing, and fill-first having no cap logic — were each run through an adversarial claim-verifier and confirmed. Line numbers are exact at that commit; I’ve kept code quotes to a line or two.

Scope caveats, stated plainly:

The xAI (grok) and Kimi executors I did not read line-by-line. They share the same statusErr type as the others, so I’d expect them to behave like the “no hint → backoff” group — but I’m scoping the provider claims to Codex, Claude, Gemini, and Antigravity, which I did read.
The “nobody reads Retry-After” claim is scoped to the request / quota-cooldown path, and that scoping is deliberate (see the footnote).

¹ One precise exception, so I don’t overstate it: Claude does read Retry-After / Retry-After-Ms, but only in the OAuth token-refresh path (parseClaudeRetryAfter → setClaudeRefreshBlockedUntil, a refresh throttle) at anthropic_auth.go:89, not in the request/quota-cooldown path this article is about. So it’s accurate to say that in the request/quota-cooldown path no provider reads the HTTP Retry-After header, not that Claude never reads it at all.