Chat completions
/v1/chat/completions is wylon’s primary inference surface and is fully OpenAI-compatible.
It accepts a list of messages and returns a generated assistant turn — streaming,
function calling and
structured output are all controlled through
request parameters. Read Inference overview
for the high-level picture.
Endpoint
POST https://api.wylon.cn/v1/chat/completions
Request parameters
| Parameter | Type | Description |
|---|---|---|
| model (required) | string | Model ID, e.g. moonshotai/Kimi-K2. The list of available IDs lives in the model catalog or via GET /v1/models. |
| messages (required) | array | Ordered conversation turns. Roles supported: system, user, assistant, tool. |
| temperature | number | Sampling temperature, range 0–2. Higher means more random. |
| top_p | number | Nucleus sampling threshold, range 0–1. Tune either this or temperature, not both. |
| top_k | integer | Sample from the top-K most probable tokens at each step; 0 disables. |
| max_tokens | integer | Maximum tokens to generate; cannot exceed the model’s context window. |
| n | integer | Number of completion candidates to return for the same input. Default 1. |
| stream | boolean | Return a server-sent event stream of deltas. Default false. See Streaming below. |
| stop | string / array | Up to 4 stop sequences; generation halts as soon as one is matched. |
| presence_penalty | number | -2.0 to 2.0. Penalize tokens already present, encouraging novelty. |
| frequency_penalty | number | -2.0 to 2.0. Penalize tokens by frequency to discourage repetition. |
| seed | integer | Best-effort deterministic sampling seed; same input ⇒ same output. |
| tools | array | Function definitions the model may call. See Function calling. |
| tool_choice | string / object | Control tool invocation: "auto" (default) / "none" / "required" / a named function. |
| response_format | object | Force JSON or a named schema. See Structured output. |
| logprobs | boolean | Whether to return log probabilities for each sampled token. |
| top_logprobs | integer | When logprobs is true, return per-step top-N candidate probabilities, 0–20. |
| user | string | Stable end-user identifier used for abuse monitoring and organization-level audit. |
Defaults for temperature, top_p, and the other sampling parameters vary per model. Check the model’s entry in the model catalog before relying on a default value.
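To illustrate how these parameters combine into a request body, here is a small sketch. `build_request` is a hypothetical helper (not part of any SDK), and the parameter values are arbitrary examples, not recommended defaults.

```python
def build_request(model: str, messages: list, **params) -> dict:
    """Assemble a /v1/chat/completions body, dropping unset optional params."""
    body = {"model": model, "messages": messages}
    body.update({k: v for k, v in params.items() if v is not None})
    return body

payload = build_request(
    "moonshotai/Kimi-K2",
    [{"role": "user", "content": "Summarize SSE in one sentence."}],
    temperature=0.3,   # lower randomness for a factual answer
    max_tokens=256,    # cap the completion length
    stop=["\n\n"],     # halt at the first blank line
)
```

Because unset parameters are omitted rather than sent as null, each model’s own defaults apply.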
Message schema
Each entry in messages is an object with a role and content. Content can be a string or, for vision-language models, a list of content parts.
```json
{
  "role": "user",
  "content": "What is retrieval-augmented generation?"
}
```

```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe what is in this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
  ]
}
```
Multi-turn conversation
To continue a conversation, append the previous assistant turn and the next user turn. The model is stateless — you are responsible for managing history.
```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.wylon.cn/v1", api_key=os.environ["WYLON_API_KEY"])

history = [
    {"role": "system", "content": "You are a terse senior engineer."},
    {"role": "user", "content": "Pick between REST and gRPC for internal services."},
]

while True:
    resp = client.chat.completions.create(model="moonshotai/kimi-k2.5", messages=history)
    reply = resp.choices[0].message.content
    print("assistant:", reply)
    history.append({"role": "assistant", "content": reply})
    user = input("you: ")
    if not user:
        break
    history.append({"role": "user", "content": user})
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "https://api.wylon.cn/v1", apiKey: process.env.WYLON_API_KEY });

const history = [
  { role: "system", content: "You are a terse senior engineer." },
  { role: "user", content: "Pick between REST and gRPC for internal services." },
];

const resp = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.5",
  messages: history,
});

history.push(resp.choices[0].message);
```
Streaming
Set stream: true to receive token-level deltas as server-sent events. The stream terminates with a final data: [DONE] sentinel.
```python
stream = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
```shell
curl -N https://api.wylon.cn/v1/chat/completions \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"moonshotai/kimi-k2.5","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"stream":true}'
```
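If you consume the stream without an SDK, each event arrives as a `data: <json>` line. A minimal parser for those lines, assuming the standard OpenAI chunk layout (`choices[0].delta.content`), could look like this sketch:

```python
import json

def iter_deltas(lines):
    """Yield content deltas from the raw SSE lines of a chat-completions stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":              # stream terminator sentinel
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:                         # first chunk may carry only the role
            yield delta
```

Feed it the decoded response lines and join the yielded fragments to reconstruct the full completion.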
Response schema
A non-streamed response matches the OpenAI protocol one-for-one.
```json
{
  "id": "cmpl-9f1c7b2e8a41",
  "object": "chat.completion",
  "created": 1744828800,
  "model": "moonshotai/Kimi-K2",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "…" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152,
    "cache_hit_ratio": 0.71 // wylon extension field
  }
}
```
cache_hit_ratio is a wylon extension to the OpenAI schema: it reports the share of input tokens served from the system-level context cache — the more repeated prefix (system prompts, long documents), the higher the ratio and the lower the cost.
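To turn the ratio into an absolute token count (for example, when estimating cost savings), a small hypothetical helper might read the usage object like this:

```python
def cached_prompt_tokens(usage: dict) -> int:
    """Estimate how many prompt tokens were served from the context cache,
    using the wylon cache_hit_ratio extension field (rounded down).
    Treats a missing field as a 0% cache hit."""
    ratio = usage.get("cache_hit_ratio", 0.0)
    return int(usage["prompt_tokens"] * ratio)

usage = {"prompt_tokens": 24, "completion_tokens": 128,
         "total_tokens": 152, "cache_hit_ratio": 0.71}
cached_prompt_tokens(usage)  # 17 of the 24 prompt tokens hit the cache
```

The `.get` fallback keeps the helper compatible with strictly OpenAI-shaped responses that omit the extension field.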
Finish reasons
| Value | Meaning |
|---|---|
| stop | Model completed naturally or matched a stop sequence. |
| length | Hit max_tokens or the model’s context limit. |
| tool_calls | Model requested one or more tool invocations. |
| content_filter | Output was blocked by content safety policy. |
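Client code typically branches on finish_reason. The sketch below shows one plausible dispatch; `handle_choice` is a hypothetical helper operating on the parsed choice object, not part of any SDK:

```python
def handle_choice(choice: dict) -> str:
    """Branch on finish_reason: one plausible client-side dispatch."""
    reason = choice["finish_reason"]
    if reason == "stop":
        return choice["message"]["content"]
    if reason == "length":
        # Truncated: raise max_tokens or trim/summarize history and retry.
        return choice["message"]["content"] + " [truncated]"
    if reason == "tool_calls":
        # Execute the requested tools, append their results, call the API again.
        raise NotImplementedError("dispatch tool calls here")
    if reason == "content_filter":
        raise RuntimeError("completion blocked by content safety policy")
    raise ValueError(f"unexpected finish_reason: {reason}")
```

Raising on unexpected values keeps the client loud if a new finish reason is ever introduced.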
Error codes
On failure the response is an OpenAI-compatible error envelope: {"error": {"type": "...", "message": "..."}}. Common HTTP statuses:
| Status | Meaning |
|---|---|
| 400 | Bad request: missing required field, value out of range, unknown model ID, etc. |
| 401 | Authentication failed: invalid or revoked API key, or missing Authorization header. |
| 403 | Forbidden: account does not have access to this model or region. |
| 429 | Hit a rate limit; the response includes a Retry-After header. |
| 500 / 503 | Transient server error; retry with exponential backoff. |
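For 429/500/503, a common retry policy is exponential backoff with full jitter, honoring the Retry-After header when the server sends one. A sketch (`retry_delay` is a hypothetical helper; the base and cap values are arbitrary examples):

```python
import random

RETRYABLE = {429, 500, 503}  # statuses worth retrying

def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to sleep before retry number `attempt` (0-based).
    Honors an explicit Retry-After value; otherwise uses exponential
    backoff with full jitter, capped at `cap` seconds."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A client loop would sleep for `retry_delay(attempt, retry_after)` after each response whose status is in RETRYABLE, and give up after a fixed number of attempts.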