# Rate limits & scaling
Real-time inference is bounded by per-organization rate limits. They scale with your account tier and usage history. For offline / large-volume workloads, Batch runs on a separate quota with higher concurrency limits.
## How limits work
Two counters are applied independently — whichever you hit first triggers throttling:
- RPM — requests per minute, regardless of size.
- TPM — tokens per minute (input + output combined).
Limits are tracked per (organization, model, region). Enforcement uses a rolling 60-second window rather than fixed buckets, so burst traffic is smoothed over the window.
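A rolling window can be mirrored client-side to predict throttling before the server does. The sketch below is a hypothetical helper (the class name and `allow` method are illustrative, not part of any SDK): it keeps timestamps of recent requests and drops those older than 60 seconds.

```python
import time
from collections import deque


class RollingWindowLimiter:
    """Client-side mirror of a rolling-window RPM limit.

    Illustrative only -- the server enforces the real limit; this just
    predicts whether the next request would exceed `max_requests`.
    """

    def __init__(self, max_requests, window=60.0):
        self.max_requests = max_requests
        self.window = window
        self.sent = deque()  # timestamps of recently sent requests

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Evict timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False
```

Because the window rolls continuously, a burst that filled the window frees up gradually as its oldest requests age past 60 seconds, rather than all at once at a bucket boundary.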
## Tiers
Your tier upgrades automatically as spend and account history grow. You can also request a bump at any time in the dashboard.
| Tier | Requirement | Default RPM | Default TPM |
|---|---|---|---|
| Free | New account, unpaid | 60 | 60,000 |
| Tier 1 | First paid top-up | 500 | 500,000 |
| Tier 2 | ¥300 spent & 7+ days old | 2,500 | 2,000,000 |
| Tier 3 | ¥3,000 spent & 30+ days old | 10,000 | 10,000,000 |
| Tier 4 | ¥30,000 spent | 30,000 | 50,000,000 |
| Enterprise | Contract | Custom | Custom |
## Response headers
Every response includes live counters so your client can back off pre-emptively.
```
x-ratelimit-limit-requests: 2500
x-ratelimit-remaining-requests: 2487
x-ratelimit-reset-requests: 42ms
x-ratelimit-limit-tokens: 2000000
x-ratelimit-remaining-tokens: 1884221
x-ratelimit-reset-tokens: 1.2s
```
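Reading these headers lets a client pause before tripping either counter. A minimal sketch, assuming the reset values are formatted like the samples above (`42ms`, `1.2s`); `should_pause` and its `min_remaining` threshold are illustrative names, not part of any SDK:

```python
def parse_reset(value: str) -> float:
    """Convert a reset header like '42ms' or '1.2s' into seconds."""
    value = value.strip()
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    return float(value)  # assume bare seconds


def should_pause(headers, min_remaining=5):
    """Return seconds to wait before the next call, or 0.0 to proceed.

    Checks both the request and token counters, since whichever is
    exhausted first triggers throttling.
    """
    for kind in ("requests", "tokens"):
        remaining = int(headers.get(f"x-ratelimit-remaining-{kind}", "1000000"))
        if remaining <= min_remaining:
            return parse_reset(headers.get(f"x-ratelimit-reset-{kind}", "1s"))
    return 0.0
```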
## The 429 response
When you exceed a limit you get HTTP 429 Too Many Requests. The body includes which counter tripped and a suggested delay.
```json
{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Rate limit reached for moonshotai/kimi-k2.5 on tier 2.",
    "limit_type": "tokens",
    "retry_after": 3.4
  }
}
```
## Recommended client behaviour
- Respect `Retry-After`. It's the server's best guess at when capacity frees up.
- Exponential backoff with jitter. On repeated 429s, double the delay (cap at 30 s) and add ±20% jitter.
- Watch `x-ratelimit-remaining-*`. Throttle client-side before you hit the wall.
- Batch where possible. Fewer, larger requests save RPM headroom.
A minimal retry helper putting the backoff advice into practice:

```python
import os
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.wylon.cn/v1",
    api_key=os.environ["WYLON_API_KEY"],
)


def call_with_retry(messages, attempts=6):
    for i in range(attempts):
        try:
            return client.chat.completions.create(
                model="moonshotai/kimi-k2.5",
                messages=messages,
            )
        except RateLimitError:
            # Exponential backoff capped at 30 s, with ±20% jitter.
            delay = min(30, 2 ** i) * (0.8 + random.random() * 0.4)
            time.sleep(delay)
    raise RuntimeError("exhausted retries")
```
## Requesting a higher limit
Most teams never need to ask — tier upgrades happen automatically. For one-off events (product launches, marketing campaigns) file a request from Dashboard → Limits → Request increase. Include expected peak RPM/TPM and a window. Approvals typically arrive within one business day.
## Pair with Batch
For workloads that tolerate async processing (offline evaluation, document batch processing, synthetic data generation), move them to Batch: real-time quota is consumed only by online requests, while batch jobs use an independent quota and benefit from discounted pricing.
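A batch job starts from a JSONL input file with one request per line. The sketch below assumes the OpenAI-style batch input format (`custom_id`, `method`, `url`, `body` fields); whether Wylon's Batch endpoint uses exactly this schema is an assumption, not confirmed by this page.

```python
import json


def build_batch_input(prompts, model="moonshotai/kimi-k2.5"):
    """Build JSONL lines for a batch job, one chat request per prompt.

    Follows the OpenAI-style batch input schema; compatibility with
    Wylon's Batch endpoint is assumed here, not documented above.
    """
    lines = []
    for i, prompt in enumerate(prompts):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",   # lets you match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)
```

In an OpenAI-compatible SDK the resulting file would then be uploaded and submitted (e.g. `client.files.create(...)` followed by `client.batches.create(...)`), consuming the batch quota rather than your real-time RPM/TPM.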