Inference overview
wylon's inference engine runs natively on multi-vendor GPU silicon (Biren, Cambricon, MetaX, Sunrise), orchestrated by a super-node architecture and served through an OpenAI- and Anthropic-compatible API. Serverless by default, with Batch for high-volume async workloads.
Available model categories
Three model categories are currently supported, all served through the same OpenAI-compatible API surface; the full list lives in the model catalog.
| Category | Example models | Typical uses |
|---|---|---|
| Text-to-text | MiniMax M2, Kimi K2, GLM-4.6, Qwen3, DeepSeek V3.2 | Chat, reasoning, code, tool use |
| Vision-language | Kimi-VL, Qwen3-VL, GLM-4V | Image understanding, document parsing |
| Embedding | BGE, Qwen3-Embedding | Semantic search, RAG, clustering |
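The embedding row above is served from the embeddings endpoint rather than chat completions. A minimal sketch, assuming the standard OpenAI embeddings surface is exposed at the same base URL; the model ID is illustrative, so check the model catalog for the exact IDs.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# Embed a small batch of strings; the model ID below is illustrative.
emb = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input=["what is RLHF?", "reward model training"],
)
print(len(emb.data), len(emb.data[0].embedding))  # number of vectors, vector size
```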
Infrastructure
Models run on wylon's purpose-built compute backbone. The scheduler routes each request to the most cost-effective node that meets your latency target; the underlying silicon is transparent to your application.
- GPU silicon: Biren BR, Cambricon MLU, MetaX, Sunrise — multi-vendor, mixed scheduling.
- Availability zones: multiple AZs across East, North, and South China with automatic nearest-zone routing.
- Scaling: serverless autoscaling from idle to millions of tokens per minute.
- Bulk workloads: offload to Batch for async processing at a discounted rate (sketched below).
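Batch itself is not documented on this page; if its surface mirrors the OpenAI Batches API (an assumption, so the calls below are a sketch rather than confirmed endpoints), offloading a file of requests looks roughly like this:

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# Hypothetical input file: one chat completions request per JSONL line.
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# Submit the batch; results are written to an output file when it completes.
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)
```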
Inference optimizations
wylon applies a stack of open-source and proprietary optimizations end-to-end. The combined effect preserves ~99% of the original model’s output quality while delivering 2–5× higher throughput than vanilla serving.
| Technique | What it does |
|---|---|
| KV cache | Reuses key-value tensors from prior tokens to skip redundant compute. |
| Paged attention | Stores the KV cache in fixed-size pages; eliminates memory fragmentation. |
| Flash attention | Fused attention kernel with tiled softmax — faster, lower memory. |
| Quantization | FP8 / INT4 weight compression to reduce memory and boost speed. |
| Continuous batching | Merges in-flight requests at token granularity to keep GPU utilization high. |
| Context caching | Reuses the KV cache computed for a repeated prefix (system prompt, long document) across requests. |
| Speculative decoding | A small draft model predicts several tokens the large model verifies in one pass. |
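To make the last row concrete, here is a toy sketch of the draft-and-verify loop behind speculative decoding. The two "models" are random stand-ins and nothing here is wylon's implementation or API; the point is that the target model checks several drafted tokens in a single pass, so accepted tokens cost roughly one verification step each.

```python
import random

VOCAB = list(range(1000))

def draft_propose(context, k):
    """Stand-in for a small draft model: cheaply guess k next tokens and their probabilities."""
    tokens = [random.choice(VOCAB) for _ in range(k)]
    probs = [random.uniform(0.1, 1.0) for _ in range(k)]
    return tokens, probs

def target_verify(context, draft_tokens):
    """Stand-in for the large target model: score all drafted positions in one forward pass."""
    return [random.uniform(0.1, 1.0) for _ in draft_tokens]

def speculative_step(context, k=4):
    draft_tokens, p_draft = draft_propose(context, k)
    p_target = target_verify(context, draft_tokens)

    accepted = []
    for tok, pd, pt in zip(draft_tokens, p_draft, p_target):
        # Accept each drafted token with probability min(1, p_target / p_draft);
        # the first rejection ends the run.
        if random.random() < min(1.0, pt / pd):
            accepted.append(tok)
        else:
            break
    # The full algorithm also has the target model sample one corrective token
    # on rejection; that step is omitted here for brevity.
    return accepted

print(speculative_step([1, 2, 3]))
```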
A minimal request
The same endpoint handles all text-to-text models. Swap the model ID to compare.
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Summarize RLHF in two sentences."}],
)
print(resp.choices[0].message.content)
```

The same request with curl:

```bash
curl https://api.wylon.cn/v1/chat/completions \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"moonshotai/kimi-k2.5","messages":[{"role":"user","content":"Summarize RLHF in two sentences."}]}'
```
Tunable parameters
The API accepts the standard OpenAI-compatible sampling parameters, including `temperature`, `top_p`, `max_tokens`, `presence_penalty`, `frequency_penalty`, `stop`, and `seed`.
The full list with defaults lives in Chat completions.
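For example, reusing the client from the minimal request above (the values here are illustrative, not recommended defaults):

```python
resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Summarize RLHF in two sentences."}],
    temperature=0.3,         # lower = more deterministic sampling
    top_p=0.9,               # nucleus sampling cutoff
    max_tokens=256,          # cap on generated tokens
    presence_penalty=0.0,
    frequency_penalty=0.0,
    stop=["\n\n"],           # stop generation at a blank line
    seed=42,                 # best-effort reproducibility
)
print(resp.choices[0].message.content)
```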