
Inference overview

wylon's inference engine runs natively on multi-vendor GPU silicon (Biren, Cambricon, MetaX, Sunrise), orchestrated by a super-node architecture and served through an OpenAI- and Anthropic-compatible API. Serverless by default, with Batch for high-volume async workloads.

No infrastructure to manage. Send a request, get a response. wylon handles continuous batching, autoscaling, and routing across heterogeneous GPUs so you can focus on your application.
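
If the Batch surface mirrors the OpenAI batch API, which the OpenAI-compatible claim suggests but does not confirm, submitting an async job could look like the sketch below. The JSONL request format, the "batch" file purpose, and the completion window value are assumptions carried over from OpenAI's API, not confirmed wylon parameters.

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["WYLON_API_KEY"], base_url="https://api.wylon.cn/v1")

# requests.jsonl: one chat-completions request per line (OpenAI batch format; assumed).
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # assumed to mirror OpenAI's window values
)
print(job.id, job.status)  # poll until the job completes, then fetch the output file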

Available model categories

Three model categories are currently supported. Text-to-text and vision-language models share the chat completions surface; embedding models use the embeddings endpoint. The full list lives in the model catalog.

Category        | Example models                                     | Typical uses
--------------- | -------------------------------------------------- | --------------------------------------
Text-to-text    | MiniMax M2, Kimi K2, GLM-4.6, Qwen3, DeepSeek V3.2 | Chat, reasoning, code, tool use
Vision-language | Kimi-VL, Qwen3-VL, GLM-4V                          | Image understanding, document parsing
Embedding       | BGE, Qwen3-Embedding                               | Semantic search, RAG, clustering
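
Vision-language models take images through the same chat completions surface, using the OpenAI-style image_url content part. A minimal sketch; the model ID is illustrative, so check the model catalog for the exact name.

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["WYLON_API_KEY"], base_url="https://api.wylon.cn/v1")

resp = client.chat.completions.create(
    model="moonshotai/kimi-vl",  # illustrative ID; see the model catalog
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the total from this receipt."},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)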

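Embedding models are served from the embeddings endpoint rather than chat completions. A minimal sketch; the model ID is illustrative.

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["WYLON_API_KEY"], base_url="https://api.wylon.cn/v1")

emb = client.embeddings.create(
    model="BAAI/bge-m3",  # illustrative ID; see the model catalog
    input=["how do I rotate an API key?", "Key rotation is done from the console."],
)

# One vector per input string; cosine similarity between vectors drives search, RAG, and clustering.
print(len(emb.data), len(emb.data[0].embedding))
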
Infrastructure

Models run on wylon's purpose-built compute backbone. The scheduler routes each request to the most cost-effective node that meets your latency target; the underlying silicon is transparent to your application.

Inference optimizations

wylon applies a stack of open-source and proprietary optimizations end-to-end. The combined effect preserves ~99% of the original model’s output quality while delivering 2–5× higher throughput than vanilla serving.

Technique            | What it does
-------------------- | ----------------------------------------------------------------------------------
KV cache             | Reuses key-value tensors from prior tokens to skip redundant compute.
Paged attention      | Chunks long sequences into pages; eliminates memory fragmentation.
Flash attention      | Fused attention kernel with tiled softmax; faster and lower memory use.
Quantization         | FP8 / INT4 weight compression to reduce memory and boost speed.
Continuous batching  | Merges in-flight requests at token granularity to keep GPUs fully utilized.
Context caching      | Stores intermediate layer outputs for repeated prefixes (system prompts, docs).
Speculative decoding | A small draft model predicts several tokens that the large model verifies in one pass.
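
Context caching pays off when many requests share a long prefix. The sketch below assumes the cache keys on exact leading content, which the table implies but does not spell out: keep the large shared text in a fixed first message and vary only the tail.

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["WYLON_API_KEY"], base_url="https://api.wylon.cn/v1")

SYSTEM_PROMPT = open("support_playbook.txt").read()  # large text, identical across requests

for question in ["How do I reset a password?", "What is the refund window?"]:
    resp = client.chat.completions.create(
        model="moonshotai/kimi-k2.5",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # cacheable shared prefix
            {"role": "user", "content": question},         # only this part varies
        ],
    )
    print(resp.choices[0].message.content)

Speculative decoding itself runs server-side, but the accept/verify loop is easy to see in miniature. A toy sketch with greedy acceptance and stand-in models (plain callables mapping a token sequence to the next token ID; nothing here is wylon API):

def speculative_decode(draft, target, prompt, n_new, k=4):
    """Greedy speculative decoding over token IDs.

    Each round the draft proposes k tokens; the target keeps the longest
    agreeing prefix and contributes one token of its own at the first miss.
    A real engine scores all k positions in a single target forward pass.
    """
    out = list(prompt)
    while len(out) < len(prompt) + n_new:
        # The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            ctx.append(draft(ctx))
            proposal.append(ctx[-1])
        # The large target model verifies; the first mismatch ends the round.
        for t in proposal:
            expected = target(out)
            if expected != t:
                out.append(expected)  # the target's own token replaces the miss
                break
            out.append(t)
    return out[:len(prompt) + n_new]

# Toy models that always agree, so every proposal is accepted.
toy = lambda seq: (seq[-1] * 31 + 7) % 1000
print(speculative_decode(toy, toy, [1, 2, 3], n_new=8))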

A minimal request

The same endpoint handles all text-to-text models. Swap the model ID to compare.

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Summarize RLHF in two sentences."}],
)

print(resp.choices[0].message.content)

The same request with curl:

curl https://api.wylon.cn/v1/chat/completions \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"moonshotai/kimi-k2.5","messages":[{"role":"user","content":"Summarize RLHF in two sentences."}]}'

Tunable parameters

The API accepts the standard OpenAI-compatible sampling parameters including temperature, top_p, max_tokens, presence_penalty, frequency_penalty, stop, and seed. The full list with defaults lives in Chat completions.
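
For example, a low-temperature, bounded completion with best-effort reproducibility:

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["WYLON_API_KEY"], base_url="https://api.wylon.cn/v1")

resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Name three uses for a paperclip."}],
    temperature=0.2,   # lower values make sampling more deterministic
    top_p=0.9,         # nucleus sampling: keep the smallest token set with 90% probability mass
    max_tokens=128,    # hard cap on generated tokens
    stop=["\n\n"],     # cut generation at the first blank line
    seed=42,           # best-effort reproducibility across identical requests
)
print(resp.choices[0].message.content)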
