Inference overview
wylon's inference engine runs natively on multi-vendor GPU silicon (Biren, Cambricon, MetaX, Sunrise), orchestrated by a super-node architecture and served through an OpenAI- and Anthropic-compatible API. Serverless by default, with Batch for high-volume async workloads.
Available model categories
Three model categories are currently supported, all served through the same OpenAI-compatible API surface; the full list lives in the model catalog.
| Category | Example models | Typical uses |
|---|---|---|
| Text-to-text | MiniMax M2, Kimi K2, GLM-4.6, Qwen3, DeepSeek V3.2 | Chat, reasoning, code, tool use |
| Vision-language | Kimi-VL, Qwen3-VL, GLM-4V | Image understanding, document parsing |
| Embedding | BGE, Qwen3-Embedding | Semantic search, RAG, clustering |
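The embedding row above is served from the embeddings endpoint rather than chat completions. A minimal sketch, assuming the standard OpenAI embeddings surface is exposed at the same base URL; the model ID is illustrative, so check the model catalog for the exact IDs.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# Embed a small batch of strings; the model ID below is illustrative.
emb = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-0.6B",
    input=["what is RLHF?", "reward model training"],
)
print(len(emb.data), len(emb.data[0].embedding))  # number of vectors, vector size
```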
Infrastructure
Models run on wylon's purpose-built compute backbone. The scheduler routes each request to the most cost-effective node that meets your latency target; the underlying silicon is transparent to your application.
- GPU silicon: Biren BR, Cambricon MLU, MetaX, Sunrise — multi-vendor, mixed scheduling.
- Availability zones: multiple AZs across East, North, and South China with automatic nearest-zone routing.
- Scaling: serverless autoscaling from idle to millions of tokens per minute.
- Bulk workloads: offload to Batch for async processing at a discounted rate (sketched below).
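Batch itself is not documented on this page; if its surface mirrors the OpenAI Batches API (an assumption, so the calls below are a sketch rather than confirmed endpoints), offloading a file of requests looks roughly like this:

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# Hypothetical input file: one chat completions request per JSONL line.
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# Submit the batch; results are written to an output file when it completes.
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)
```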
Inference optimizations
wylon applies a stack of open-source and proprietary optimizations end-to-end. The combined effect preserves ~99% of the original model’s output quality while delivering 2–5× higher throughput than vanilla serving.
| Technique | What it does |
|---|---|
| KV cache | Reuses key-value tensors from prior tokens to skip redundant compute. |
| Paged attention | Stores the KV cache in fixed-size pages; eliminates memory fragmentation. |
| Flash attention | Fused attention kernel with tiled softmax — faster, lower memory. |
| Quantization | FP8 / INT4 weight compression to reduce memory and boost speed. |
| Continuous batching | Merges in-flight requests at token granularity to keep GPU utilization high. |
| Context caching | Reuses the KV cache computed for a repeated prefix (system prompt, long document) across requests. |
| Speculative decoding | A small draft model predicts several tokens the large model verifies in one pass. |
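To make the last row concrete, here is a toy sketch of the draft-and-verify loop behind speculative decoding. The two "models" are random stand-ins and nothing here is wylon's implementation or API; the point is that the target model checks several drafted tokens in a single pass, so accepted tokens cost roughly one verification step each.

```python
import random

VOCAB = list(range(1000))

def draft_propose(context, k):
    """Stand-in for a small draft model: cheaply guess k next tokens and their probabilities."""
    tokens = [random.choice(VOCAB) for _ in range(k)]
    probs = [random.uniform(0.1, 1.0) for _ in range(k)]
    return tokens, probs

def target_verify(context, draft_tokens):
    """Stand-in for the large target model: score all drafted positions in one forward pass."""
    return [random.uniform(0.1, 1.0) for _ in draft_tokens]

def speculative_step(context, k=4):
    draft_tokens, p_draft = draft_propose(context, k)
    p_target = target_verify(context, draft_tokens)

    accepted = []
    for tok, pd, pt in zip(draft_tokens, p_draft, p_target):
        # Accept each drafted token with probability min(1, p_target / p_draft);
        # the first rejection ends the run.
        if random.random() < min(1.0, pt / pd):
            accepted.append(tok)
        else:
            break
    # The full algorithm also has the target model sample one corrective token
    # on rejection; that step is omitted here for brevity.
    return accepted

print(speculative_step([1, 2, 3]))
```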
A minimal request
The same endpoint handles all text-to-text models. Swap the model ID to compare.
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Summarize RLHF in two sentences."}],
)
print(resp.choices[0].message.content)
```

The same request with curl:

```bash
curl https://api.wylon.cn/v1/chat/completions \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"moonshotai/kimi-k2.5","messages":[{"role":"user","content":"Summarize RLHF in two sentences."}]}'
```
Tunable parameters
The API accepts the standard OpenAI-compatible sampling parameters, including `temperature`, `top_p`, `max_tokens`, `presence_penalty`, `frequency_penalty`, `stop`, and `seed`.
The full list with defaults lives in Chat completions.
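For example, reusing the client from the minimal request above (the values here are illustrative, not recommended defaults):

```python
resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Summarize RLHF in two sentences."}],
    temperature=0.3,         # lower = more deterministic sampling
    top_p=0.9,               # nucleus sampling cutoff
    max_tokens=256,          # cap on generated tokens
    presence_penalty=0.0,
    frequency_penalty=0.0,
    stop=["\n\n"],           # stop generation at a blank line
    seed=42,                 # best-effort reproducibility
)
print(resp.choices[0].message.content)
```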