# Quickstart
Token Factory is wylon's developer-facing inference service — a single API for the leading open-source LLMs (MiniMax, Kimi, GLM, Qwen, DeepSeek), running on a purpose-built GPU compute backbone. Drop in any OpenAI- or Anthropic-compatible client and you can ship your first request in minutes.
## Overview
Browse the model catalog to pick a model, then call the OpenAI- or Anthropic-compatible API from your application. It works out of the box with common frameworks: LangChain, LlamaIndex, LiteLLM, the OpenAI SDK, and more.
On wylon you can:
- Send requests — prompts, chats, images — and receive streamed responses.
- Build inference into applications and agents through supported third-party integrations (see the sketch below).
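As a sketch of one such integration, the snippet below points LangChain at wylon. It assumes the `langchain-openai` package; like any OpenAI-compatible client, it only needs the base URL and API key overridden:

```python
# A minimal LangChain sketch (assumed package: langchain-openai).
# Any OpenAI-compatible client works once base_url and api_key point at wylon.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="moonshotai/kimi-k2.5",
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# invoke() returns an AIMessage; .content holds the model's reply.
print(llm.invoke("Explain KV cache in one paragraph.").content)
```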
## Start building now
Follow the three steps below to send your first request to the wylon inference API.
### 1. Create an account
Sign up for a free wylon account in the dashboard. After completing identity verification you can use every available model; see Billing & consumption for plan details and free credit.
> **Info:** Already using OpenAI or Anthropic? wylon is drop-in compatible with both SDKs, so you can migrate in minutes. See Switch to wylon.
### 2. Generate an API key
Navigate to Account Settings → API keys in the dashboard and click Create new key. Copy the generated API token. Detailed key management and scope rules live in API Keys.
Export it to your shell environment so the code samples below can authenticate:
**bash** (add to `~/.bashrc` or `~/.profile`):

```bash
export WYLON_API_KEY="wl-••••••••••••••••••••••••••••••••"
export WYLON_BASE_URL="https://api.wylon.cn/v1"
```

**zsh** (add to `~/.zshrc`):

```bash
export WYLON_API_KEY="wl-••••••••••••••••••••••••••••••••"
export WYLON_BASE_URL="https://api.wylon.cn/v1"
```

**PowerShell** (persistent for the current user):

```powershell
[Environment]::SetEnvironmentVariable("WYLON_API_KEY", "wl-••••••••••••••••••••••••••••••••", "User")
[Environment]::SetEnvironmentVariable("WYLON_BASE_URL", "https://api.wylon.cn/v1", "User")
```

> **Keep your key secret.** Never commit API keys to source control or ship them in client-side bundles. Use a secret manager or a server-side proxy for production workloads.
### 3. Send your first request
The examples below use the OpenAI-compatible interface.
Point any OpenAI-compliant client at `https://api.wylon.cn/v1` and use a supported model ID. The example below calls Kimi K2.5 with a simple chat prompt.

**Python**

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain KV cache in one paragraph."},
    ],
    temperature=0.6,
    max_tokens=512,
)

print(response.choices[0].message.content)
```

**TypeScript**

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.WYLON_API_KEY,
  baseURL: "https://api.wylon.cn/v1",
});

const response = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.5",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain KV cache in one paragraph." },
  ],
  temperature: 0.6,
  max_tokens: 512,
});

console.log(response.choices[0].message.content);
```

**cURL**

```bash
curl https://api.wylon.cn/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -d '{
    "model": "moonshotai/kimi-k2.5",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain KV cache in one paragraph."}
    ],
    "temperature": 0.6,
    "max_tokens": 512
  }'
```

**Go**

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/sashabaranov/go-openai"
)

func main() {
	cfg := openai.DefaultConfig(os.Getenv("WYLON_API_KEY"))
	cfg.BaseURL = "https://api.wylon.cn/v1"
	client := openai.NewClientWithConfig(cfg)

	resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
		Model: "moonshotai/kimi-k2.5",
		Messages: []openai.ChatCompletionMessage{
			{Role: "user", Content: "Explain KV cache in one paragraph."},
		},
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(resp.Choices[0].Message.Content)
}
```

A successful response returns the assistant's completion, the tokens consumed, and the cache-hit ratio served from wylon's system-level KV cache.
{ "id": "cmpl-9f1c7b2e8a41", "object": "chat.completion", "model": "moonshotai/kimi-k2.5", "created": 1744828800, "choices": [{ "index": 0, "message": { "role": "assistant", "content": "KV cache stores the key and value tensors …" }, "finish_reason": "stop" }], "usage": { "prompt_tokens": 24, "completion_tokens": 128, "total_tokens": 152, "cache_hit_ratio": 0.71 // wylon extension: context-cache hit ratio } }The
cache_hit_ratioinusageis a wylon extension to the OpenAI schema: it reports the share of input tokens served from the system-level context cache (the more repeated prefix, the higher the hit ratio and the lower the cost). All other fields match the OpenAI protocol.
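If you call the API through the OpenAI Python SDK rather than raw HTTP, the extension field is still reachable. A minimal sketch, assuming a recent 1.x SDK, which keeps fields it doesn't model in pydantic's `model_extra` dict:

```python
# A minimal sketch: read wylon's cache_hit_ratio extension off the usage
# object. Assumption: the OpenAI Python SDK preserves unmodeled response
# fields and exposes them via pydantic's model_extra.
extra = response.usage.model_extra or {}
hit_ratio = extra.get("cache_hit_ratio")
if hit_ratio is not None:
    print(f"Context-cache hit ratio: {hit_ratio:.0%}")  # e.g. "71%"
```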
## API endpoints
The wylon inference API follows the OpenAI protocol. The table below lists the endpoints you’ll use most often.
| Method & path | Purpose | Notes |
|---|---|---|
| POST /chat/completions | Conversational generation | Supports streaming, function calling, structured output. |
| POST /completions | Legacy text completion | For models without chat templates. |
| GET /models | List available models | Returns the model IDs available to your account. |
| POST /batches | Submit a batch job | Process large async workloads at discounted pricing. See Batch. |
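Any of these endpoints can be exercised with plain HTTP or an OpenAI SDK. As a quick smoke test for a new key, the sketch below lists the model IDs visible to your account via `GET /models`, using the OpenAI Python SDK:

```python
# A minimal sketch: list the model IDs available to your account.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# GET /models returns a paginated list; iterating yields Model objects.
for model in client.models.list():
    print(model.id)
```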
## Common parameters
Every chat completion request accepts the parameters below. Defaults are tuned for balanced quality and latency.
| Parameter | Type | Description |
|---|---|---|
| model | string | Model ID, e.g. moonshotai/kimi-k2.5. See all models. |
| messages | array | Ordered list of {role, content} turns. Roles: system, user, assistant, tool. |
| temperature | number | Sampling temperature between 0 and 2. Defaults to 0.7. |
| max_tokens | integer | Maximum tokens to generate. Bounded by the model’s context window. |
| stream | boolean | Return a server-sent event stream of token deltas. |
| tools | array | Function definitions the model may call. See Function calling. |
| response_format | object | Force JSON or a named schema. See Structured output. |
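As one example of using these parameters together, setting `stream` to `true` switches the response to a server-sent event stream of token deltas. A minimal sketch of consuming the stream with the OpenAI Python SDK (the guard on `chunk.choices` is defensive, since some chunks may arrive without choices):

```python
# A minimal streaming sketch: stream=True makes the SDK yield
# ChatCompletionChunk objects whose deltas carry incremental tokens.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

stream = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Explain KV cache in one paragraph."}],
    stream=True,
)

for chunk in stream:
    # Print each token delta as it arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```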
## Explore
You’re set up. Dive deeper into the capabilities you’ll need next.
## Need help?
For support, reach out via Contact us; live service health is published on the status page.