
Rate limits & scaling

Real-time inference is bounded by per-organization rate limits. They scale with your account tier and usage history. For offline / large-volume workloads, Batch runs on a separate quota with higher concurrency limits.

How limits work

Two counters are applied independently; whichever you hit first triggers throttling:

- Requests per minute (RPM)
- Tokens per minute (TPM)

Limits are tracked per (organization, model, region). Enforcement uses a rolling 60-second window rather than fixed buckets, so burst traffic is smoothed.
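The rolling-window behaviour can be illustrated with a minimal sliding-window counter. This is a sketch of the idea, not the server's implementation; the class and parameter names are illustrative:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` events in any rolling `window`-second span."""

    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.events = deque()  # timestamps of accepted events

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the rolling window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=3, window=60.0)
print([limiter.allow(now=t) for t in (0, 1, 2, 3)])  # fourth call is throttled
print(limiter.allow(now=61))                         # the t=0 event has aged out
```

Unlike a fixed bucket that resets on the minute boundary, the window slides continuously, so a burst at the end of one minute cannot be immediately followed by a full burst at the start of the next.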

Tiers

Your tier upgrades automatically as spend and account history grow. You can also request a bump at any time in the dashboard.

Tier        Requirement                    Default RPM   Default TPM
Free        New account, unpaid                     60        60,000
Tier 1      First paid top-up                      500       500,000
Tier 2      ¥300 spent & 7+ days old             2,500     2,000,000
Tier 3      ¥3,000 spent & 30+ days old         10,000    10,000,000
Tier 4      ¥30,000 spent                       30,000    50,000,000
Enterprise  Contract                            Custom        Custom
Info: Per-model limits differ. Large frontier models may have tighter defaults than small ones. Check the Limits page in the dashboard for your live numbers.
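The defaults above can be captured in a small lookup table. A sketch, using the numbers from the tier table; your live, per-model limits from the dashboard take precedence:

```python
# Default org-wide limits per tier, taken from the tier table above.
# Per-model limits may be tighter; the dashboard's Limits page is authoritative.
TIER_DEFAULTS = {
    "free":  {"rpm": 60,     "tpm": 60_000},
    "tier1": {"rpm": 500,    "tpm": 500_000},
    "tier2": {"rpm": 2_500,  "tpm": 2_000_000},
    "tier3": {"rpm": 10_000, "tpm": 10_000_000},
    "tier4": {"rpm": 30_000, "tpm": 50_000_000},
}

def limits_for(tier: str) -> dict:
    """Return the default RPM/TPM for a tier; Enterprise has no defaults."""
    if tier == "enterprise":
        raise ValueError("Enterprise limits are contractual; check your agreement.")
    return TIER_DEFAULTS[tier]

print(limits_for("tier2"))  # {'rpm': 2500, 'tpm': 2000000}
```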

Response headers

Every response includes live counters so your client can back off pre-emptively.

x-ratelimit-limit-requests:      2500
x-ratelimit-remaining-requests:  2487
x-ratelimit-reset-requests:      42ms

x-ratelimit-limit-tokens:        2000000
x-ratelimit-remaining-tokens:    1884221
x-ratelimit-reset-tokens:        1.2s
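A client can read these headers and pause before the limit is actually hit. A sketch, assuming the header names shown above; the `min_remaining` threshold is an arbitrary choice, not a documented value:

```python
import time

def preemptive_backoff(headers, min_remaining=5):
    """Sleep until the request counter resets when headroom is nearly gone.

    `headers` is the response-header mapping; reset values arrive as
    strings like "42ms" or "1.2s" and are parsed to seconds.
    Returns the number of seconds slept (0.0 if no backoff was needed).
    """
    remaining = int(headers.get("x-ratelimit-remaining-requests", min_remaining + 1))
    if remaining > min_remaining:
        return 0.0
    reset = headers.get("x-ratelimit-reset-requests", "0s")
    if reset.endswith("ms"):
        wait = float(reset[:-2]) / 1000.0
    else:
        wait = float(reset.rstrip("s"))
    time.sleep(wait)
    return wait

print(preemptive_backoff({
    "x-ratelimit-remaining-requests": "3",
    "x-ratelimit-reset-requests": "42ms",
}))  # 0.042
```

The same pattern applies to the token counters; in practice you would check both and wait for whichever reset is longer.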

The 429 response

When you exceed a limit you get HTTP 429 Too Many Requests. The body includes which counter tripped and a suggested delay.

{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "Rate limit reached for moonshotai/kimi-k2.5 on tier 2.",
    "limit_type": "tokens",
    "retry_after": 3.4
  }
}
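Honouring `retry_after` from the error body keeps retries aligned with the server's own estimate. A minimal sketch; the one-second fallback for a missing field is an assumption, not a documented default:

```python
def delay_from_429(body: dict, fallback: float = 1.0) -> float:
    """Pick the wait time (seconds) from a 429 error body."""
    error = body.get("error", {})
    return float(error.get("retry_after", fallback))

body = {
    "error": {
        "type": "rate_limit_exceeded",
        "limit_type": "tokens",
        "retry_after": 3.4,
    }
}
print(delay_from_429(body))  # 3.4
```

In a real client you would `time.sleep()` for this long before retrying, rather than retrying immediately.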

Recommended client behaviour

import os
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.wylon.cn/v1",
    api_key=os.environ["WYLON_API_KEY"],
)

def call_with_retry(messages, attempts=6):
    for i in range(attempts):
        try:
            return client.chat.completions.create(
                model="moonshotai/kimi-k2.5",
                messages=messages,
            )
        except RateLimitError:
            # Exponential backoff capped at 30 s, with ±20 % jitter so that
            # many workers do not retry in lockstep.
            delay = min(30, 2 ** i) * (0.8 + random.random() * 0.4)
            time.sleep(delay)
    raise RuntimeError("exhausted retries")

Requesting a higher limit

Most teams never need to ask — tier upgrades happen automatically. For one-off events (product launches, marketing campaigns) file a request from Dashboard → Limits → Request increase. Include expected peak RPM/TPM and a window. Approvals typically arrive within one business day.

Pair with Batch

For workloads that tolerate async processing (offline evaluation, document batch processing, synthetic data generation), move them to Batch: real-time quota is consumed only by online requests, while batch jobs use an independent quota and benefit from discounted pricing.
