Chat completions
/v1/chat/completions is wylon’s primary inference surface and is fully OpenAI-compatible.
It accepts a list of messages and returns a generated assistant turn — streaming,
function calling and
structured output are all controlled through
request parameters. Read Inference overview
for the high-level picture.
Endpoint
POST https://api.wylon.cn/v1/chat/completions
Request parameters
| Parameter | Type | Description |
|---|---|---|
| model (required) | string | Model ID, e.g. moonshotai/Kimi-K2. The list of available IDs lives in the model catalog or via GET /v1/models. |
| messages (required) | array | Ordered conversation turns. Roles supported: system, user, assistant, tool. |
| temperature | number | Sampling temperature, range 0–2. Higher means more random. |
| top_p | number | Nucleus sampling threshold, range 0–1. Tune either this or temperature, not both. |
| top_k | integer | Sample from the top-K most probable tokens at each step; 0 disables. |
| max_tokens | integer | Maximum tokens to generate; cannot exceed the model’s context window. |
| n | integer | Number of completion candidates to return for the same input. Default 1. |
| stream | boolean | Return a server-sent event stream of deltas. Default false. See Streaming below. |
| stop | string / array | Up to 4 stop sequences; generation halts as soon as one is matched. |
| presence_penalty | number | -2.0 to 2.0. Penalize tokens already present, encouraging novelty. |
| frequency_penalty | number | -2.0 to 2.0. Penalize tokens by frequency to discourage repetition. |
| seed | integer | Best-effort deterministic sampling seed; same input ⇒ same output. |
| tools | array | Function definitions the model may call. See Function calling. |
| tool_choice | string / object | Control tool invocation: "auto" (default) / "none" / "required" / a named function. |
| response_format | object | Force JSON or a named schema. See Structured output. |
| logprobs | boolean | Whether to return log probabilities for each sampled token. |
| top_logprobs | integer | When logprobs is true, return per-step top-N candidate probabilities, 0–20. |
| user | string | Stable end-user identifier used for abuse monitoring and organization-level audit. |
Defaults for temperature, top_p, and the other sampling parameters vary per model. Check the model’s entry in the model catalog before relying on a default value.
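To illustrate how these parameters combine into a request body, here is a small sketch. `build_request` is a hypothetical helper (not part of any SDK), and the parameter values are arbitrary examples, not recommended defaults.

```python
def build_request(model: str, messages: list, **params) -> dict:
    """Assemble a /v1/chat/completions body, dropping unset optional params."""
    body = {"model": model, "messages": messages}
    body.update({k: v for k, v in params.items() if v is not None})
    return body

payload = build_request(
    "moonshotai/Kimi-K2",
    [{"role": "user", "content": "Summarize SSE in one sentence."}],
    temperature=0.3,   # lower randomness for a factual answer
    max_tokens=256,    # cap the completion length
    stop=["\n\n"],     # halt at the first blank line
)
```

Because unset parameters are omitted rather than sent as null, each model’s own defaults apply.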
Message schema
Each entry in messages is an object with a role and content. Content can be a string or, for vision-language models, a list of content parts.
```json
{
  "role": "user",
  "content": "What is retrieval-augmented generation?"
}
```

```json
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe what is in this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
  ]
}
```
Multi-turn conversation
To continue a conversation, append the previous assistant turn and the next user turn. The model is stateless — you are responsible for managing history.
```python
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.wylon.cn/v1", api_key=os.environ["WYLON_API_KEY"])

history = [
    {"role": "system", "content": "You are a terse senior engineer."},
    {"role": "user", "content": "Pick between REST and gRPC for internal services."},
]

while True:
    resp = client.chat.completions.create(model="moonshotai/kimi-k2.5", messages=history)
    reply = resp.choices[0].message.content
    print("assistant:", reply)
    history.append({"role": "assistant", "content": reply})
    user = input("you: ")
    if not user:
        break
    history.append({"role": "user", "content": user})
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({ baseURL: "https://api.wylon.cn/v1", apiKey: process.env.WYLON_API_KEY });

const history = [
  { role: "system", content: "You are a terse senior engineer." },
  { role: "user", content: "Pick between REST and gRPC for internal services." },
];

const resp = await client.chat.completions.create({
  model: "moonshotai/kimi-k2.5",
  messages: history,
});

history.push(resp.choices[0].message);
```
Streaming
Set stream: true to receive token-level deltas as server-sent events. The stream terminates with a final data: [DONE] sentinel.
```python
stream = client.chat.completions.create(
    model="moonshotai/kimi-k2.5",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
```shell
curl -N https://api.wylon.cn/v1/chat/completions \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"moonshotai/kimi-k2.5","messages":[{"role":"user","content":"Write a haiku about GPUs."}],"stream":true}'
```
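If you consume the stream without an SDK, each event arrives as a `data: <json>` line. A minimal parser for those lines, assuming the standard OpenAI chunk layout (`choices[0].delta.content`), could look like this sketch:

```python
import json

def iter_deltas(lines):
    """Yield content deltas from the raw SSE lines of a chat-completions stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":              # stream terminator sentinel
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:                         # first chunk may carry only the role
            yield delta
```

Feed it the decoded response lines and join the yielded fragments to reconstruct the full completion.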
Response schema
A non-streamed response matches the OpenAI protocol one-for-one.
```json
{
  "id": "cmpl-9f1c7b2e8a41",
  "object": "chat.completion",
  "created": 1744828800,
  "model": "moonshotai/Kimi-K2",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "…" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152,
    "cache_hit_ratio": 0.71 // wylon extension field
  }
}
```
cache_hit_ratio is a wylon extension to the OpenAI schema: it reports the share of input tokens served from the system-level context cache — the more repeated prefix (system prompts, long documents), the higher the ratio and the lower the cost.
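To turn the ratio into an absolute token count (for example, when estimating cost savings), a small hypothetical helper might read the usage object like this:

```python
def cached_prompt_tokens(usage: dict) -> int:
    """Estimate how many prompt tokens were served from the context cache,
    using the wylon cache_hit_ratio extension field (rounded down).
    Treats a missing field as a 0% cache hit."""
    ratio = usage.get("cache_hit_ratio", 0.0)
    return int(usage["prompt_tokens"] * ratio)

usage = {"prompt_tokens": 24, "completion_tokens": 128,
         "total_tokens": 152, "cache_hit_ratio": 0.71}
cached_prompt_tokens(usage)  # 17 of the 24 prompt tokens hit the cache
```

The `.get` fallback keeps the helper compatible with strictly OpenAI-shaped responses that omit the extension field.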
Finish reasons
| Value | Meaning |
|---|---|
| stop | Model completed naturally or matched a stop sequence. |
| length | Hit max_tokens or the model’s context limit. |
| tool_calls | Model requested one or more tool invocations. |
| content_filter | Output was blocked by content safety policy. |
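Client code typically branches on finish_reason. The sketch below shows one plausible dispatch; `handle_choice` is a hypothetical helper operating on the parsed choice object, not part of any SDK:

```python
def handle_choice(choice: dict) -> str:
    """Branch on finish_reason: one plausible client-side dispatch."""
    reason = choice["finish_reason"]
    if reason == "stop":
        return choice["message"]["content"]
    if reason == "length":
        # Truncated: raise max_tokens or trim/summarize history and retry.
        return choice["message"]["content"] + " [truncated]"
    if reason == "tool_calls":
        # Execute the requested tools, append their results, call the API again.
        raise NotImplementedError("dispatch tool calls here")
    if reason == "content_filter":
        raise RuntimeError("completion blocked by content safety policy")
    raise ValueError(f"unexpected finish_reason: {reason}")
```

Raising on unexpected values keeps the client loud if a new finish reason is ever introduced.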
Error codes
On failure the response is an OpenAI-compatible error envelope: {"error": {"type": "...", "message": "..."}}. Common HTTP statuses:
| Status | Meaning |
|---|---|
| 400 | Bad request: missing required field, value out of range, unknown model ID, etc. |
| 401 | Authentication failed: invalid or revoked API key, or missing Authorization header. |
| 403 | Forbidden: account does not have access to this model or region. |
| 429 | Hit a rate limit; the response includes a Retry-After header. |
| 500 / 503 | Transient server error; retry with exponential backoff. |
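For 429/500/503, a common retry policy is exponential backoff with full jitter, honoring the Retry-After header when the server sends one. A sketch (`retry_delay` is a hypothetical helper; the base and cap values are arbitrary examples):

```python
import random

RETRYABLE = {429, 500, 503}  # statuses worth retrying

def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to sleep before retry number `attempt` (0-based).
    Honors an explicit Retry-After value; otherwise uses exponential
    backoff with full jitter, capped at `cap` seconds."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A client loop would sleep for `retry_delay(attempt, retry_after)` after each response whose status is in RETRYABLE, and give up after a fixed number of attempts.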