wylon

Chat completions

/v1/chat/completions is wylon’s primary inference surface and is fully OpenAI-compatible. It accepts a list of messages and returns a generated assistant turn — streaming, function calling, and structured output are all controlled through request parameters. Read Inference overview for the high-level picture.

Endpoint

POST https://api.wylon.cn/v1/chat/completions

Request parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `model` (required) | string | Model ID, e.g. `moonshotai/Kimi-K2`. The list of available IDs lives in the model catalog or via `GET /v1/models`. |
| `messages` (required) | array | Ordered conversation turns. Roles supported: `system`, `user`, `assistant`, `tool`. |
| `temperature` | number | Sampling temperature, range 0–2. Higher means more random. |
| `top_p` | number | Nucleus sampling threshold, range 0–1. Tune either this or `temperature`, not both. |
| `top_k` | integer | Sample from the top-K most probable tokens at each step; 0 disables. |
| `max_tokens` | integer | Maximum tokens to generate; cannot exceed the model’s context window. |
| `n` | integer | Number of completion candidates to return for the same input. Default 1. |
| `stream` | boolean | Return a server-sent event stream of deltas. Default `false`. See Streaming below. |
| `stop` | string / array | Up to 4 stop sequences; generation halts as soon as one is matched. |
| `presence_penalty` | number | −2.0 to 2.0. Penalize tokens already present, encouraging novelty. |
| `frequency_penalty` | number | −2.0 to 2.0. Penalize tokens by frequency to discourage repetition. |
| `seed` | integer | Best-effort deterministic sampling seed; same input ⇒ same output. |
| `tools` | array | Function definitions the model may call. See Function calling. |
| `tool_choice` | string / object | Control tool invocation: `"auto"` (default) / `"none"` / `"required"` / a named function. |
| `response_format` | object | Force JSON or a named schema. See Structured output. |
| `logprobs` | boolean | Whether to return log probabilities for each sampled token. |
| `top_logprobs` | integer | When `logprobs` is true, return per-step top-N candidate probabilities, 0–20. |
| `user` | string | Stable end-user identifier used for abuse monitoring and organization-level audit. |
> **Info:** The defaults for `temperature`, `top_p`, etc. vary per model. Check the model’s entry in the model catalog before relying on a default value.
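To make the table concrete, here is a sketch of a request body combining the required fields with a few of the sampling controls. The model ID, prompt, and parameter values are illustrative, not recommendations:

```python
import json

# Example request body for POST /v1/chat/completions.
# Only `model` and `messages` are required; the sampling knobs are optional.
payload = {
    "model": "moonshotai/Kimi-K2",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize HTTP caching in one sentence."},
    ],
    "temperature": 0.7,   # 0-2; higher = more random
    "max_tokens": 256,    # must fit within the model's context window
    "stop": ["\n\n"],     # up to 4 stop sequences
}

body = json.dumps(payload)  # serialized JSON, ready for any HTTP client
```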

Message schema

Each entry in messages is an object with a role and content. Content can be a string or, for vision-language models, a list of content parts.

```json
{
  "role": "user",
  "content": "What is retrieval-augmented generation?"
}
```

```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe what is in this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
  ]
}
```
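For vision inputs, the content-part list is easy to assemble programmatically. A minimal helper sketch — the function name is ours, not part of the API:

```python
def vision_message(text: str, image_urls: list) -> dict:
    """Build a user message mixing one text part with image_url parts."""
    parts = [{"type": "text", "text": text}]
    parts += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": parts}

msg = vision_message("Describe what is in this image.",
                     ["https://example.com/cat.jpg"])
```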

Multi-turn conversation

To continue a conversation, append the previous assistant turn and the next user turn. The model is stateless — you are responsible for managing history.

```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.wylon.cn/v1", api_key=os.environ["WYLON_API_KEY"])

history = [
    {"role": "system", "content": "You are a terse senior engineer."},
    {"role": "user",   "content": "Pick between REST and gRPC for internal services."},
]

while True:
    resp = client.chat.completions.create(model="moonshotai/kimi-k2.5", messages=history)
    reply = resp.choices[0].message.content
    print("assistant:", reply)
    history.append({"role": "assistant", "content": reply})
    user = input("you: ")
    if not user:
        break
    history.append({"role": "user", "content": user})
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "https://api.wylon.cn/v1", apiKey: process.env.WYLON_API_KEY });

const history = [
  { role: "system", content: "You are a terse senior engineer." },
  { role: "user",   content: "Pick between REST and gRPC for internal services." },
];

const resp = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.5",
  messages: history,
});
history.push(resp.choices[0].message);
```
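Because the model is stateless, a long-running conversation will eventually exceed the context window. A common pattern is to keep the system prompt and drop the oldest turns first. The sketch below uses a character budget as a crude stand-in for real token counting — swap in a proper tokenizer in production:

```python
def trim_history(history: list, max_chars: int = 8000) -> list:
    """Keep system messages plus the most recent turns that fit the budget.
    Character count is a rough proxy for token count."""
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    kept, used = [], sum(len(m["content"]) for m in system)
    for m in reversed(turns):                 # walk newest-first
        if used + len(m["content"]) > max_chars:
            break
        kept.append(m)
        used += len(m["content"])
    return system + list(reversed(kept))      # restore chronological order
```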

Streaming

Set `stream: true` to receive token-level deltas as server-sent events. The stream ends with a final `data: [DONE]` sentinel line.

```python
stream = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
```shell
curl -N https://api.wylon.cn/v1/chat/completions \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"moonshotai/kimi-k2.5","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"stream":true}'
```
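If you are not using an SDK, the raw SSE stream is straightforward to parse by hand: each event is a `data: ` line carrying a JSON chunk, and the literal `[DONE]` marks the end. A minimal parser sketch over already-received lines — the sample chunks are illustrative, not captured server output:

```python
import json

def extract_deltas(sse_lines):
    """Yield content deltas from `data:` lines, stopping at the [DONE] sentinel."""
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                      # skip blank keep-alive / comment lines
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

lines = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
text = "".join(extract_deltas(lines))  # "Hello"
```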

Response schema

A non-streamed response follows the OpenAI schema, with one wylon extension field in usage (described below).

```json
{
  "id": "cmpl-9f1c7b2e8a41",
  "object": "chat.completion",
  "created": 1744828800,
  "model": "moonshotai/Kimi-K2",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "…" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152,
    "cache_hit_ratio": 0.71
  }
}
```

`cache_hit_ratio` is a wylon extension to the OpenAI schema: it reports the share of input tokens served from the system-level context cache. The more repeated prefix there is (system prompts, long documents), the higher the ratio and the lower the cost.
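To turn the ratio into concrete numbers, multiply it against `prompt_tokens`. How the cached share maps to billing depends on wylon's pricing, which this sketch deliberately leaves out:

```python
def cached_token_split(usage: dict) -> tuple:
    """Split prompt tokens into (cached, uncached) using cache_hit_ratio."""
    cached = round(usage["prompt_tokens"] * usage.get("cache_hit_ratio", 0.0))
    return cached, usage["prompt_tokens"] - cached

cached, uncached = cached_token_split(
    {"prompt_tokens": 24, "completion_tokens": 128,
     "total_tokens": 152, "cache_hit_ratio": 0.71}
)
# cached = 17, uncached = 7
```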

Finish reasons

| Value | Meaning |
| --- | --- |
| `stop` | Model completed naturally or matched a stop sequence. |
| `length` | Hit `max_tokens` or the model’s context limit. |
| `tool_calls` | Model requested one or more tool invocations. |
| `content_filter` | Output was blocked by content safety policy. |
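In client code it is worth branching on `finish_reason` explicitly, since `length` usually means the answer was silently truncated. A sketch of one possible dispatch:

```python
def handle_choice(choice: dict) -> str:
    """Map each documented finish_reason to an application-level outcome."""
    reason = choice["finish_reason"]
    if reason == "stop":
        return choice["message"]["content"]
    if reason == "length":
        # Truncated: raise max_tokens or shorten the prompt and retry.
        return choice["message"]["content"] + " [truncated]"
    if reason == "tool_calls":
        return f'{len(choice["message"]["tool_calls"])} tool call(s) requested'
    raise RuntimeError(f"blocked or unexpected finish_reason: {reason}")
```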

Error codes

On failure the response is an OpenAI-compatible error envelope: {"error": {"type": "...", "message": "..."}}. Common HTTP statuses:

| Status | Meaning |
| --- | --- |
| 400 | Bad request: missing required field, value out of range, unknown model ID, etc. |
| 401 | Authentication failed: invalid or revoked API key, or missing `Authorization` header. |
| 403 | Forbidden: account does not have access to this model or region. |
| 429 | Hit a rate limit; the response includes a `Retry-After` header. |
| 500 / 503 | Transient server error; retry with exponential backoff. |
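For 429/500/503, a common client-side policy is exponential backoff with jitter. A transport-agnostic sketch — the `call` argument stands in for your actual HTTP request, and the retry limits shown are a conventional default, not a documented wylon requirement:

```python
import random
import time

RETRYABLE = {429, 500, 503}

def with_backoff(call, max_attempts: int = 5, base: float = 0.5):
    """Invoke `call` (returning (status, body)); retry retryable statuses
    with exponentially growing sleeps plus a little jitter."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
    return status, body
```

When a 429 response carries a `Retry-After` header, honoring that value directly is usually better than the computed backoff.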