# Quickstart
Token Factory is wylon's developer-facing inference service — a single API for the leading open-source LLMs (MiniMax, Kimi, GLM, Qwen, DeepSeek), running on a purpose-built GPU compute backbone. Drop in any OpenAI- or Anthropic-compatible client and you can ship your first request in minutes.
## Overview
Browse the model catalog to pick a model, then call the OpenAI- or Anthropic-compatible API from your application. It works out of the box with common frameworks: LangChain, LlamaIndex, LiteLLM, the OpenAI SDK, and more.
On wylon you can:
- Send requests — prompts, chats, images — and receive streamed responses.
- Build inference into applications and agents through supported third-party integrations (see the sketch below).
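As a sketch of one such integration, the snippet below points LangChain at wylon. It assumes the `langchain-openai` package; like any OpenAI-compatible client, it only needs the base URL and API key overridden:

```python
# A minimal LangChain sketch (assumed package: langchain-openai).
# Any OpenAI-compatible client works once base_url and api_key point at wylon.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="moonshotai/kimi-k2.5",
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# invoke() returns an AIMessage; .content holds the model's reply.
print(llm.invoke("Explain KV cache in one paragraph.").content)
```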
## Start building now
Follow the three steps below to send your first request to the wylon inference API.
### 1. Create an account
Sign up for a free wylon account in the dashboard. After completing identity verification you can use every available model; see Billing & consumption for plan details and free credit.
> **Info:** Already using OpenAI or Anthropic? wylon is drop-in compatible with both SDKs, so you can migrate in minutes. See Switch to wylon.
### 2. Generate an API key
Navigate to Account Settings → API keys in the dashboard and click Create new key. Copy the generated API token. Detailed key management and scope rules live in API Keys.
Export it to your shell environment so the code samples below can authenticate:
**bash** (add to `~/.bashrc` or `~/.profile`):

```bash
export WYLON_API_KEY="wl-••••••••••••••••••••••••••••••••"
export WYLON_BASE_URL="https://api.wylon.cn/v1"
```

**zsh** (add to `~/.zshrc`):

```bash
export WYLON_API_KEY="wl-••••••••••••••••••••••••••••••••"
export WYLON_BASE_URL="https://api.wylon.cn/v1"
```

**PowerShell** (persistent for the current user):

```powershell
[Environment]::SetEnvironmentVariable("WYLON_API_KEY", "wl-••••••••••••••••••••••••••••••••", "User")
[Environment]::SetEnvironmentVariable("WYLON_BASE_URL", "https://api.wylon.cn/v1", "User")
```

> **Keep your key secret.** Never commit API keys to source control or ship them in client-side bundles. Use a secret manager or a server-side proxy for production workloads.
### 3. Send your first request
The examples below use the OpenAI-compatible interface.
Point any OpenAI-compliant client at `https://api.wylon.cn/v1` and use a supported model ID. The example below calls Kimi K2.5 with a simple chat prompt.

**Python**

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV cache in one paragraph."},
    ],
    temperature=0.6,
    max_tokens=512,
)

print(response.choices[0].message.content)
```

**TypeScript**

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.WYLON_API_KEY,
  baseURL: "https://api.wylon.cn/v1",
});

const response = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.5",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain KV cache in one paragraph." },
  ],
  temperature: 0.6,
  max_tokens: 512,
});

console.log(response.choices[0].message.content);
```

**cURL**

```bash
curl https://api.wylon.cn/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -d '{
    "model": "moonshotai/kimi-k2.5",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain KV cache in one paragraph."}
    ],
    "temperature": 0.6,
    "max_tokens": 512
  }'
```

**Go**

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/sashabaranov/go-openai"
)

func main() {
	cfg := openai.DefaultConfig(os.Getenv("WYLON_API_KEY"))
	cfg.BaseURL = "https://api.wylon.cn/v1"
	client := openai.NewClientWithConfig(cfg)

	resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
		Model: "moonshotai/kimi-k2.5",
		Messages: []openai.ChatCompletionMessage{
			{Role: "user", Content: "Explain KV cache in one paragraph."},
		},
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```

A successful response returns the assistant's completion, the tokens consumed, and the cache-hit ratio served from wylon's system-level KV cache.
{ "id": "cmpl-9f1c7b2e8a41", "object": "chat.completion", "model": "moonshotai/kimi-k2.5", "created": 1744828800, "choices": [{ "index": 0, "message": { "role": "assistant", "content": "KV cache stores the key and value tensors …" }, "finish_reason": "stop" }], "usage": { "prompt_tokens": 24, "completion_tokens": 128, "total_tokens": 152, "cache_hit_ratio": 0.71 // wylon extension: context-cache hit ratio } }The
cache_hit_ratioinusageis a wylon extension to the OpenAI schema: it reports the share of input tokens served from the system-level context cache (the more repeated prefix, the higher the hit ratio and the lower the cost). All other fields match the OpenAI protocol.
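If you call the API through the OpenAI Python SDK rather than raw HTTP, the extension field is still reachable. A minimal sketch, assuming a recent 1.x SDK, which keeps fields it doesn't model in pydantic's `model_extra` dict:

```python
# A minimal sketch: read wylon's cache_hit_ratio extension off the usage
# object. Assumption: the OpenAI Python SDK preserves unmodeled response
# fields and exposes them via pydantic's model_extra.
extra = response.usage.model_extra or {}
hit_ratio = extra.get("cache_hit_ratio")
if hit_ratio is not None:
    print(f"Context-cache hit ratio: {hit_ratio:.0%}")  # e.g. "71%"
```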
## API endpoints
The wylon inference API follows the OpenAI protocol. The table below lists the endpoints you’ll use most often.
| Method & path | Purpose | Notes |
|---|---|---|
| POST /chat/completions | Conversational generation | Supports streaming, function calling, structured output. |
| POST /completions | Legacy text completion | For models without chat templates. |
| GET /models | List available models | Returns the model IDs available to your account. |
| POST /batches | Submit a batch job | Process large async workloads at discounted pricing. See Batch. |
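Any of these endpoints can be exercised with plain HTTP or an OpenAI SDK. As a quick smoke test for a new key, the sketch below lists the model IDs visible to your account via `GET /models`, using the OpenAI Python SDK:

```python
# A minimal sketch: list the model IDs available to your account.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# GET /models returns a paginated list; iterating yields Model objects.
for model in client.models.list():
    print(model.id)
```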
## Common parameters
Every chat completion request accepts the parameters below. Defaults are tuned for balanced quality and latency.
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID, e.g. moonshotai/kimi-k2.5. See all models. |
| messages | array | Ordered list of {role, content} turns. Roles: system, user, assistant, tool. |
| temperature | number | Sampling temperature between 0 and 2. Defaults to 0.7. |
| max_tokens | integer | Maximum tokens to generate. Bounded by the model’s context window. |
| stream | boolean | Return a server-sent event stream of token deltas. |
| tools | array | Function definitions the model may call. See Function calling. |
| response_format | object | Force JSON or a named schema. See Structured output. |
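As one example of using these parameters together, setting `stream` to `true` switches the response to a server-sent event stream of token deltas. A minimal sketch of consuming the stream with the OpenAI Python SDK (the guard on `chunk.choices` is defensive, since some chunks may arrive without choices):

```python
# A minimal streaming sketch: stream=True makes the SDK yield
# ChatCompletionChunk objects whose deltas carry incremental tokens.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

stream = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Explain KV cache in one paragraph."}],
    stream=True,
)

for chunk in stream:
    # Print each token delta as it arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```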
## Explore
You’re set up. Dive deeper into the capabilities you’ll need next.
## Need help?
For support, reach out via Contact us; live service health is published on the status page.