Production AI inference
on wylon

A high-performance LLM inference platform built for developers and enterprises.

Core products · Token Factory

Token Factory(public sign-up coming soon)

Pay-per-token inference for leading open-source LLMs. OpenAI- and Anthropic-compatible APIs, ready to use with your existing clients.

Supported model families: MiniMax · Kimi · GLM · Qwen · DeepSeek

MiniMax

Long-context, productivity-grade workloads — document processing, summarization, multi-turn business conversations.

FlagshipMiniMax M2.5

Best forLong-form / structured

Use casesEnterprise knowledge / long documents

Moonshot

Kimi

Solid on multimodal, ultra-long context, and code — a popular foundation for agents and developer tooling.

FlagshipKimi K2.5

Best forLong dialogue / repos

Use casesAgent / IDE / research

Zhipu · Z.ai

GLM

Strong on Chinese corpora — covers general chat, tool use, and agentic workflows.

FlagshipGLM 5

Best forGeneral chat / tool calls

Use casesCustomer support / Chinese-language workloads

Qwen

Full lineup from on-device tiny models to flagship MoEs, balancing performance and cost.

FlagshipQwen3 / Qwen2.5-VL

Best forGeneral / multilingual / multimodal

Use casesContent / RAG / vision

DeepSeek

Reputation for reasoning, code, and math — its MoE line offers excellent price-performance.

FlagshipDeepSeek-V3 / R1

Best forReasoning / code / chain-of-thought

Use casesCode / agent planning

And more

See the full list, context lengths, and pricing in the model catalog and pricing pages.

Browse model catalog→ Read the docs→

Core products · GPU Service

GPU Service(coming soon)

GPU compute across instances, bare-metal servers, and dedicated clusters. Now accepting early-access requests.

Instance

GPU Instance

On-demand single- or multi-GPU instance containers — provisioned in minutes, billed by the hour.

Bare-metal

Bare-Metal

Dedicated GPU servers with full hardware control and resource isolation.

Cluster

Multi-node networked compute pool, sized for large-scale inference.

Request early access

Core technology

An end-to-end inference system

A super-node GPU architecture with system-level hardware-software co-design.

01 wylon Token Factory

02 Cloud inference engine

03 wylon super-node platform

A vertically integrated Token Factory

Drop-in compatible API

OpenAI- and Anthropic-compatible API surface — set the base URL and you're done.

Full-stack observability

Track TTFT, TPS, cache-hit ratio, and per-token usage in real time.

System-level KV-cache acceleration

A dedicated KV-cache engine with hybrid tiering accelerates repeated-prefix and long-context inference.

Topology-aware MoE acceleration

The super-node architecture enables topology-aware scheduling for tensor, pipeline, and expert parallelism (TP/PP/EP).

Multi-node high availability

LLM-aware scheduling across an elastic GPU topology for higher availability and predictable scale.

Multi-vendor GPU infrastructure

Broad support across leading GPU vendors — Cambricon, Biren, Sunrise, MetaX, and more.

High-performance GPU interconnect topology

GPU communication topology co-designed with MoE parallelism for large-scale inference.

Energy-efficient inference design

The wylon super-node platform jointly optimizes scheduling, batching, and thermals to deliver high energy efficiency at scale.

Why wylon

Built for LLM inference workloads

Full-stack, hardware-software co-designed

End-to-end cloud infrastructure across GPU compatibility, super-node architecture, inference runtime, and API services — tightly integrated to reduce performance overhead.

99.9%

Availability SLA

Token Factory delivers up to 99.9% availability, sized for production-grade traffic.

Ops overhead

Drop-in inference API — no infrastructure to manage.

10×

Speedup

Backed by system-level caching, prefill is up to 10× faster than baseline implementations.

Chip families supported

Native compatibility across Biren, Cambricon, MetaX, Sunrise, and more.

Learn more

Explore the platform, APIs, and deployment options.

Read the docs→

Silicon partners

Get started

Try the wylon Token Factory

Token Factory is open for sign-ups and can be integrated in minutes. GPU Service is accepting early-access applications.

Get started with Token Factory Contact sales

Production AI inference on wylon