wylon Token Factory
Token-metered inference for leading open-source LLMs. OpenAI- and Anthropic-compatible APIs integrate in a few lines, with system-level caching enabled by default for long-context and repeated-prefix workloads.
Inference optimized across GPU architectures
A production inference cloud built on multi-vendor GPUs and a super-node architecture.
Multi-vendor GPU support
Runtime kernels, scheduling, and parallelism are tuned for Biren, Cambricon, MetaX, Sunrise, and more, improving throughput and reliability across mainstream models.
A new super-node architecture
wylon runs LLM inference at scale on a super-node architecture. Each super-node combines high-bandwidth GPU interconnects with topology-aware scheduling, improving throughput and tail latency for MoE, long-context, and high-concurrency workloads.
System-level cache management
The cache engine is integrated with the request scheduler, reusing system prompts, retrieved snippets, tool definitions, and multi-turn context across requests and sessions for up to 10× faster repeated-prefix inference.
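The idea behind repeated-prefix reuse can be illustrated with a toy sketch: hash candidate prompt prefixes and reuse whatever precomputed state is already stored for the longest match. This is only an illustration of the caching concept, not wylon's actual engine; the class, keys, and "KV segment" placeholder are all hypothetical.

```python
import hashlib

class PrefixCache:
    """Toy illustration of cross-request prefix reuse (not the real engine).

    Keys are hashes of prompt prefixes; values stand in for precomputed
    KV-cache segments that later requests sharing the prefix can reuse.
    """

    def __init__(self):
        self._store = {}

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def insert(self, prefix: str, kv_segment) -> None:
        self._store[self._key(prefix)] = kv_segment

    def lookup(self, prompt: str, boundaries: list[int]):
        """Return the longest cached prefix of `prompt`, trying the
        longest candidate boundary first."""
        for end in sorted(boundaries, reverse=True):
            key = self._key(prompt[:end])
            if key in self._store:
                return prompt[:end], self._store[key]
        return "", None


cache = PrefixCache()
system = "You are a helpful assistant."
cache.insert(system, kv_segment="<kv for system prompt>")

# A later request sharing the same system prompt hits the cache,
# so only the new user turn needs fresh prefill.
prompt = system + " User: summarize this document."
hit, kv = cache.lookup(prompt, boundaries=[len(system), len(prompt)])
```

In a real serving stack the cached value is GPU-resident attention state rather than a string, and eviction and scheduling decide what stays warm; the lookup-longest-prefix pattern is the part this sketch shows.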
Fast integration with production visibility
Drop-in compatible API
One endpoint for multiple open-source model families, so you can switch providers without changing your integration code.
Batch API
High-throughput batch inference API for latency-insensitive analytical workloads.
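Since the API is OpenAI-compatible, a batch job can presumably be expressed in the familiar JSONL shape: one request object per line, each tagged with a `custom_id` so results can be matched back to inputs when the asynchronous run completes. The model ID below is a hypothetical placeholder; consult the model catalog for real identifiers.

```python
import json

MODEL = "deepseek-v3"  # hypothetical ID; see the model catalog

prompts = [
    "Classify the sentiment of: 'Great latency, fair pricing.'",
    "Classify the sentiment of: 'The batch job took too long.'",
]

# One JSON object per line; custom_id links each result to its input.
lines = []
for i, prompt in enumerate(prompts):
    lines.append(json.dumps({
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
        },
    }))

batch_jsonl = "\n".join(lines)
```

The resulting `batch_jsonl` string is what you would upload as the batch input file.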
System-level cache
Cross-request, cross-session context caching — increases efficiency and reduces cost simultaneously.
Full-stack observability
Track TTFT, TPS, cache-hit ratio, and per-token usage in real time.
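For readers unfamiliar with these metrics, TTFT and TPS can be derived client-side from per-token arrival times in a streaming response. This sketch uses one common definition (TPS as total tokens over total elapsed time); the real dashboard aggregates these server-side and may define them differently.

```python
def stream_metrics(token_timestamps, request_start):
    """Compute TTFT and TPS from per-token arrival times, in seconds.

    TTFT: delay from request start to the first token.
    TPS:  tokens per second over the whole generation (one common definition).
    """
    ttft = token_timestamps[0] - request_start
    elapsed = token_timestamps[-1] - request_start
    tps = len(token_timestamps) / elapsed if elapsed > 0 else 0.0
    return ttft, tps


# Example: 20 tokens arriving every 50 ms after a 200 ms first-token delay.
stamps = [0.2 + 0.05 * i for i in range(20)]
ttft, tps = stream_metrics(stamps, request_start=0.0)
```

A high cache-hit ratio shows up directly in this picture: prefill work is skipped, so TTFT drops while TPS for the decode phase stays roughly constant.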
Drop-in compatible, ready to run
wylon Token Factory is compatible with the OpenAI and Anthropic APIs, so you can keep your existing clients, agents, and middleware.
Three-line setup
Point your client's base URL at api.wylon.cn, add your wylon API key, choose a Token Factory model ID, and keep the rest of your code unchanged.
- openai → chat completions, tool calls, JSON mode
- anthropic → messages, streaming
- curl → native HTTP + SSE
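Matching the curl option above, the setup can be sketched with only the standard library: build an OpenAI-style chat-completions request against api.wylon.cn with a bearer token. The model ID is a hypothetical placeholder and the `/v1/chat/completions` path is assumed from the stated OpenAI compatibility; check the docs for exact values.

```python
import json
import urllib.request

API_KEY = "YOUR_WYLON_API_KEY"   # from your wylon dashboard
MODEL = "deepseek-v3"            # hypothetical ID; see the model catalog

# OpenAI-compatible chat-completions payload.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Say hello."}],
}
req = urllib.request.Request(
    "https://api.wylon.cn/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Uncomment to send the request (requires a valid API key):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

With an SDK client instead, the same three changes apply: set the base URL, set the API key, and pick a model ID; everything else stays as it was.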
Scale, reliability, and model coverage
FAQ
Which models are supported?
We cover leading open-source families: MiniMax, Kimi, GLM, Qwen, and DeepSeek. See the full list and per-model details in the model catalog.
Do you offer enterprise services?
Yes. We offer dedicated plans for enterprise customers — please reach out via Contact us to talk to our solutions team.
How does wylon handle my data?
wylon handles your data in accordance with applicable regulations. We do not use your data to train models or for unrelated commercial purposes. You can request deletion at any time. Details: Privacy Policy.
I don't see the model I want — what do I do?
We continuously onboard new models based on industry developments and customer demand. Enterprise customers can request specific model onboarding or dedicated deployments through our solutions team.
Sign up and run your first inference request.
Sign up and verify to get an API key plus starter credits — or get in touch with us directly.