# Rate limits & scaling
Real-time inference is bounded by per-organization rate limits. They scale with your account tier and usage history. For offline / large-volume workloads, Batch runs on a separate quota with higher concurrency limits.
## How limits work
Two counters are applied independently — whichever you hit first triggers throttling:
- RPM — requests per minute, regardless of size.
- TPM — tokens per minute (input + output combined).
Limits are tracked per (organization, model, region). Enforcement uses a rolling 60-second window rather than fixed buckets, so burst traffic is smoothed over the window.
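A rolling window can be mirrored client-side to predict throttling before the server does. The sketch below is a hypothetical helper (the class name and `allow` method are illustrative, not part of any SDK): it keeps timestamps of recent requests and drops those older than 60 seconds.

```python
import time
from collections import deque


class RollingWindowLimiter:
    """Client-side mirror of a rolling-window RPM limit.

    Illustrative only -- the server enforces the real limit; this just
    predicts whether the next request would exceed `max_requests`.
    """

    def __init__(self, max_requests, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.sent = deque()  # timestamps of recently sent requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False
```

Because the window rolls continuously, a burst that filled the window frees up gradually as its oldest requests age past 60 seconds, rather than all at once at a bucket boundary.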
## Tiers
Your tier upgrades automatically as spend and account history grow. You can also request a bump at any time in the dashboard.
| Tier | Requirement | Default RPM | Default TPM |
|---|---|---|---|
| Free | New account, unpaid | 60 | 60,000 |
| Tier 1 | First paid top-up | 500 | 500,000 |
| Tier 2 | ¥300 spent & 7+ days old | 2,500 | 2,000,000 |
| Tier 3 | ¥3,000 spent & 30+ days old | 10,000 | 10,000,000 |
| Tier 4 | ¥30,000 spent | 30,000 | 50,000,000 |
| Enterprise | Contract | Custom | Custom |
## Response headers
Every response includes live counters so your client can back off pre-emptively.
```
x-ratelimit-limit-requests: 2500
x-ratelimit-remaining-requests: 2487
x-ratelimit-reset-requests: 42ms
x-ratelimit-limit-tokens: 2000000
x-ratelimit-remaining-tokens: 1884221
x-ratelimit-reset-tokens: 1.2s
```
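Reading these headers lets a client pause before tripping either counter. A minimal sketch, assuming the reset values are formatted like the samples above (`42ms`, `1.2s`); `should_pause` and its `min_remaining` threshold are illustrative names, not part of any SDK:

```python
def parse_reset(value: str) -> float:
    """Convert a reset header like '42ms' or '1.2s' into seconds."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    return float(value)  # assume bare seconds


def should_pause(headers, min_remaining=5):
    """Return seconds to wait before the next call, or 0.0 to proceed.

    Checks both the request and token counters, since whichever is
    exhausted first triggers throttling.
    """
    for kind in ("requests", "tokens"):
        remaining = int(headers.get(f"x-ratelimit-remaining-{kind}", "1000000"))
        if remaining <= min_remaining:
            return parse_reset(headers.get(f"x-ratelimit-reset-{kind}", "1s"))
    return 0.0
```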
## The 429 response
When you exceed a limit you get HTTP 429 Too Many Requests. The body includes which counter tripped and a suggested delay.
```json
{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Rate limit reached for moonshotai/kimi-k2.5 on tier 2.",
    "limit_type": "tokens",
    "retry_after": 3.4
  }
}
```
## Recommended client behaviour
- Respect `Retry-After`. It's the server's best guess at when capacity frees up.
- Exponential backoff with jitter. On repeated 429s, double the delay (cap at 30 s) and add ±20% jitter.
- Watch `x-ratelimit-remaining-*`. Throttle client-side before you hit the wall.
- Batch where possible. Fewer, larger requests save RPM headroom.
A minimal retry helper putting the backoff advice into practice:

```python
import os
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.wylon.cn/v1",
    api_key=os.environ["WYLON_API_KEY"],
)


def call_with_retry(messages, attempts=6):
    for i in range(attempts):
        try:
            return client.chat.completions.create(
                model="moonshotai/kimi-k2.5",
                messages=messages,
            )
        except RateLimitError:
            # Exponential backoff capped at 30 s, with ±20% jitter.
            delay = min(30, 2 ** i) * (0.8 + random.random() * 0.4)
            time.sleep(delay)
    raise RuntimeError("exhausted retries")
```
## Requesting a higher limit
Most teams never need to ask — tier upgrades happen automatically. For one-off events (product launches, marketing campaigns) file a request from Dashboard → Limits → Request increase. Include expected peak RPM/TPM and a window. Approvals typically arrive within one business day.
## Pair with Batch
For workloads that tolerate async processing (offline evaluation, document batch processing, synthetic data generation), move them to Batch: real-time quota is consumed only by online requests, while batch jobs use an independent quota and benefit from discounted pricing.
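A batch job starts from a JSONL input file with one request per line. The sketch below assumes the OpenAI-style batch input format (`custom_id`, `method`, `url`, `body` fields); whether Wylon's Batch endpoint uses exactly this schema is an assumption, not confirmed by this page.

```python
import json


def build_batch_input(prompts, model="moonshotai/kimi-k2.5"):
    """Build JSONL lines for a batch job, one chat request per prompt.

    Follows the OpenAI-style batch input schema; compatibility with
    Wylon's Batch endpoint is assumed here, not documented above.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",   # lets you match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

In an OpenAI-compatible SDK the resulting file would then be uploaded and submitted (e.g. `client.files.create(...)` followed by `client.batches.create(...)`), consuming the batch quota rather than your real-time RPM/TPM.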