wylon Token Factory

Token-metered inference for leading open-source LLMs. OpenAI- and Anthropic-compatible APIs integrate in a few lines, with system-level caching enabled by default for long-context and repeated-prefix workloads.

Features

Inference optimized across GPU architectures

A production inference cloud built on multi-vendor GPUs and a super-node architecture.

01

Multi-vendor GPU support

Runtime kernels, scheduling, and parallelism are tuned for Biren, Cambricon, MetaX, Sunrise, and more, improving throughput and reliability across mainstream models.

02

A new super-node architecture

wylon runs LLM inference at scale on a super-node architecture. Each super-node combines high-bandwidth GPU interconnects with topology-aware scheduling, improving throughput and tail latency for MoE, long-context, and high-concurrency workloads.

03

System-level cache management

The cache engine is integrated with the request scheduler, reusing system prompts, retrieved snippets, tool definitions, and multi-turn context for up to 10× faster repeated-prefix inference.
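Prefix reuse of this kind generally rewards request construction that keeps stable content first. A minimal client-side sketch (the model ID, system prompt, and tool schema are illustrative, not from the catalog): put the system prompt and tool definitions ahead of the varying user turn, so repeated calls share an identical token prefix.

```python
# Sketch: prefix-friendly request construction. Stable content
# (system prompt, tool definitions) comes first; only the final
# user turn varies between requests, maximizing prefix overlap.
SYSTEM = "You are a support assistant for ACME Corp."  # stable across calls

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order by id",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]  # stable across calls

def build_request(user_turn: str) -> dict:
    """Build a chat-completions payload with a stable prefix."""
    return {
        "model": "kimi-k2.5",
        "tools": TOOLS,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_turn},
        ],
    }

a = build_request("Where is order 123?")
b = build_request("Cancel order 456.")
# Both payloads share model, tools, and system message verbatim.
assert a["messages"][0] == b["messages"][0] and a["tools"] == b["tools"]
```

Only the final user message differs between the two payloads, so the second request can reuse the cached prefix of the first.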

Capabilities

Fast integration with production visibility

01

Drop-in compatible API

One endpoint for multiple open-source model families, so you can switch models or providers without changing your client code.

02

Batch API

High-throughput asynchronous inference for latency-insensitive analytical workloads.
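If the Batch API follows the OpenAI-compatible convention (one request per JSONL line with a `custom_id`, `method`, `url`, and `body`), a submission file can be assembled like this. This is a hedged sketch under that assumption; check the wylon docs for the actual batch endpoint names and limits.

```python
# Sketch: building a JSONL batch file, assuming an OpenAI-style
# Batch API format. Each line is one independent chat request,
# keyed by custom_id so results can be matched back to inputs.
import json

def to_batch_line(custom_id: str, prompt: str, model: str = "kimi-k2.5") -> str:
    """Serialize one batch request as a single JSONL line."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

prompts = ["Summarize document A.", "Summarize document B."]
jsonl = "\n".join(to_batch_line(f"req-{i}", p) for i, p in enumerate(prompts))
# Upload `jsonl` as a file, then create the batch job against it.
```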

03

System-level cache

Cross-request, cross-session context caching that cuts latency and per-token cost at the same time.

04

Full-stack observability

Track TTFT, TPS, cache-hit ratio, and per-token usage in real time.
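The same metrics can be cross-checked on the client. A minimal sketch (the whitespace token count is a rough approximation, not the billed token count): wrap any streaming response and record time-to-first-token and tokens-per-second.

```python
# Sketch: client-side TTFT/TPS measurement over any stream of
# text chunks. Token count is approximated by whitespace splitting,
# which will not match billed tokens exactly.
import time

def measure_stream(chunks):
    """Consume an iterator of text chunks; return timing metrics."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for piece in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first chunk
        parts.append(piece)
    elapsed = time.perf_counter() - start
    tokens = len("".join(parts).split())
    return {
        "ttft_s": ttft,
        "tokens": tokens,
        "tps": tokens / elapsed if elapsed > 0 else 0.0,
    }
```

To use it with the OpenAI client, pass in the delta text, e.g. `measure_stream(c.choices[0].delta.content or "" for c in stream)`.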

A familiar API

Drop-in compatible, ready to run

wylon Token Factory is compatible with the OpenAI and Anthropic APIs, so you can keep your existing clients, agents, and middleware.

Three-line setup

Point your client's base URL at api.wylon.cn, add your wylon API key, choose a Token Factory model ID, and keep the rest of your code unchanged.

  • openai  → chat completions, tool calls, JSON mode
  • anthropic  → messages, streaming
  • curl  → native HTTP + SSE
```python
# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.wylon.cn/v1",
    api_key="wyl_...",
)

resp = client.chat.completions.create(
    model="kimi-k2.5",
    messages=[{"role": "user", "content": "Hello, wylon."}],
    stream=True,
)

for chunk in resp:
    # delta.content can be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="")
```
```python
# pip install anthropic
from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.wylon.cn/anthropic",
    api_key="wyl_...",
)

with client.messages.stream(
    model="kimi-k2.5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello, wylon."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="")
```
```typescript
// npm i openai
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.wylon.cn/v1",
  apiKey: process.env.WYLON_API_KEY,
});

const stream = await client.chat.completions.create({
  model: "kimi-k2.5",
  messages: [{ role: "user", content: "Hello, wylon." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```
```shell
# Streaming chat completion via SSE
curl https://api.wylon.cn/v1/chat/completions \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2.5",
    "stream": true,
    "messages": [
      { "role": "user", "content": "Hello, wylon." }
    ]
  }'
```
Production readiness

Scale, reliability, and model coverage

99.9%
Availability SLA
10×
First-token speedup (high cache-hit)
1M+
Tokens-per-minute (premium tier)
6+
GPU vendors supported — Biren, Cambricon, MetaX, Sunrise, and more.

FAQ

Which models are supported?

We cover leading open-source families: MiniMax, Kimi, GLM, Qwen, and DeepSeek. See the full list and per-model details in the model catalog.

Do you offer enterprise services?

Yes. We offer dedicated plans for enterprise customers. Reach out via the Contact us page to talk to our solutions team.

How does wylon handle my data?

wylon handles your data in accordance with applicable regulations. We do not use your data to train models or for unrelated commercial purposes. You can request deletion at any time. Details: Privacy Policy.

I don't see the model I want — what do I do?

We continuously onboard new models based on industry developments and customer demand. Enterprise customers can request specific model onboarding or dedicated deployments through our solutions team.

Get started

Sign up and run your first inference request.

Sign up and verify to get an API key plus starter credits — or get in touch with us directly.