POST
https://api.wylon.cn/v1/chat/completions
Chat completions
Generate model responses from a list of messages. Fully OpenAI-compatible; streaming, function calling and structured output are controlled through request parameters.
Authorization
Authorization
string · header
required
Bearer token, e.g.
Bearer wl-xxxxxxxx. Create API keys in the dashboard.
Request body
Content-Type: application/json
model
string
required
Model ID to call, e.g.
moonshotai/Kimi-K2. The available list is in the model catalog or via the list models endpoint.
messages
array
required
Ordered conversation turns; each item has a
role and content.
role
enum
required
One of
system, user, assistant or tool.
content
string · array
required
Text content; vision-language models also accept multimodal content blocks as an array of
{type, text|image_url}.
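For vision-language models, a user turn can mix text and image blocks. A minimal sketch of such a message, assuming the OpenAI convention where an `image_url` block wraps an object with a `url` field (the URL here is a placeholder; check Structured output and the model catalog for which models accept images):

```python
# Hypothetical multimodal user message: one text block plus one image block.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is shown in this image?"},
        # Placeholder URL; the image_url object shape follows the OpenAI convention.
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
    ],
}
```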
tool_call_id
string
optional
Only used when
role="tool"; correlates to the tool-call ID from the previous assistant turn.
stream
boolean
false
When
true, deltas are streamed as SSE (text/event-stream); the stream terminates with data: [DONE].
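A client consuming the SSE stream only needs to read `data:` lines, stop at `data: [DONE]`, and concatenate the per-chunk deltas. A minimal parser, demonstrated offline on a captured two-chunk stream (the JSON bodies are illustrative):

```python
import json

def parse_sse_chunks(raw: str):
    """Yield parsed chat.completion.chunk objects from an SSE body.

    Each event line looks like `data: {...}`; the stream terminates
    with `data: [DONE]`.
    """
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank lines and SSE comments/keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Offline demo with a captured two-chunk stream:
sample = (
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"}}]}\n'
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"}}]}\n'
    "data: [DONE]\n"
)
text = "".join(c["choices"][0]["delta"].get("content", "")
               for c in parse_sse_chunks(sample))
print(text)  # → Hello
```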
temperature
number
optional
Sampling temperature, range
0 – 2. Higher means more random; tune either this or top_p, not both.
top_p
number
optional
Nucleus-sampling threshold, range
0 – 1. Samples from the smallest set whose cumulative probability reaches top_p.
top_k
integer
optional
Sample from the top-K most probable tokens at each step;
0 disables.
max_tokens
integer
optional
Maximum tokens to generate; cannot exceed the model context window. If unset, the model's default cap is used.
n
integer
1
Number of completion candidates to return. Note: billing is
n × completion_tokens.
stop
string · array
optional
Up to 4 stop sequences; generation halts as soon as one is matched.
presence_penalty
number
0
Range
-2.0 – 2.0. Positive values reduce topic repetition and encourage novelty.
frequency_penalty
number
0
Range
-2.0 – 2.0. Positive values penalize tokens by frequency, suppressing verbatim repetition.
seed
integer
optional
Best-effort deterministic sampling seed; the same seed and input should yield the same output, though determinism is not guaranteed.
tools
array
optional
Function definitions the model may call. Each is
{type: "function", function: {name, description, parameters}}. See Function calling.
tool_choice
string · object
"auto"
Control tool invocation:
"auto" lets the model decide; "none" disables tools; "required" forces at least one call; or pass {type:"function", function:{name:"..."}} to force a specific function.
response_format
object
optional
Force a response format:
{type:"json_object"} or {type:"json_schema", json_schema:{...}}. See Structured output.
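A `json_schema` request could be sketched as below. The schema and its `name`/`schema`/`strict` wrapper follow the OpenAI structured-output convention; consult Structured output for the exact fields this API expects:

```python
# Hypothetical extraction schema; field names are illustrative.
request = {
    "model": "moonshotai/Kimi-K2",
    "messages": [{"role": "user", "content": "Extract the city and date."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "extraction",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["city", "date"],
            },
            "strict": True,
        },
    },
}
```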
logprobs
boolean
false
Whether to return per-token log probabilities.
top_logprobs
integer
optional
When
logprobs is true, return per-step top-N candidate probabilities, range 0 – 20.
user
string
optional
Stable end-user identifier used for abuse monitoring and organization-level audit.
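A minimal non-streaming request combining the required fields with a few common sampling parameters. The API key is a placeholder, and the sketch only builds the headers and payload; send it with any HTTP client:

```python
API_KEY = "wl-xxxxxxxx"  # placeholder; create real keys in the dashboard
URL = "https://api.wylon.cn/v1/chat/completions"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
payload = {
    "model": "moonshotai/Kimi-K2",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What does a KV cache store?"},
    ],
    "temperature": 0.6,
    "max_tokens": 256,
}
# Send with any HTTP client, e.g.:
#   requests.post(URL, headers=headers, json=payload)
```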
Response
Non-streaming: returns a chat.completion object.
Streaming (stream=true): chunks of chat.completion.chunk are returned as SSE, terminated by data: [DONE].
id
string
Unique identifier for this request.
object
string
Either
chat.completion or chat.completion.chunk (streaming).
created
integer
UNIX timestamp (seconds).
model
string
The model ID that actually served this request.
choices
array
List of completion candidates.
index
integer
Index of the candidate, starting at 0.
message
object
Returned in non-streaming mode: complete
{role, content, tool_calls?}.
delta
object
Incremental content chunk returned in streaming mode.
finish_reason
enum
stop / length / tool_calls / content_filter.
usage
object
Token usage stats.
prompt_tokens
integer
Input token count.
completion_tokens
integer
Output token count.
total_tokens
integer
Total token count.
cache_hit_ratio
number · wylon extension
Share of input tokens served by the system-level context cache (0 – 1). The more repeated prefix, the higher the ratio and the lower the cost.
Example response
{
"id": "cmpl-9f1c7b2e8a41",
"object": "chat.completion",
"created": 1744828800,
"model": "moonshotai/Kimi-K2",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "KV cache stores …" },
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 128,
"total_tokens": 152,
"cache_hit_ratio": 0.71
}
}
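Sanity-checking the usage block from the example above (JSON reproduced inline): multiplying `cache_hit_ratio` by `prompt_tokens` gives a rough count of input tokens served from the cache.

```python
import json

example = """
{
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152,
    "cache_hit_ratio": 0.71
  }
}
"""
usage = json.loads(example)["usage"]
# Input plus output should equal the total:
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
# Approximate number of input tokens served from the context cache:
cached_tokens = round(usage["cache_hit_ratio"] * usage["prompt_tokens"])
print(cached_tokens)  # → 17
```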
Bad request, authentication failure, or an exceeded rate limit. Errors use the OpenAI-compatible envelope.
error.type
string
Error category, e.g.
invalid_request_error, authentication_error, rate_limit_exceeded.
error.message
string
Human-readable error description.
error.code
string
Optional fine-grained error code for programmatic handling.
Example — 429 rate-limited
{
"error": {
"type": "rate_limit_exceeded",
"message": "Rate limit reached for moonshotai/Kimi-K2 on tier 2.",
"retry_after": 3.4
}
}
Transient server error. Retry with exponential backoff and jitter.
Example — 503
{
"error": {
"type": "server_overloaded",
"message": "Upstream model is temporarily unavailable, please retry."
}
}
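One way to implement the recommended retry policy is exponential backoff with full jitter, preferring the server's `retry_after` hint when a 429 body includes one. The `call_api` function in the usage comment is hypothetical:

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5):
    """Exponential backoff with full jitter for retrying 429/5xx responses.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    for attempt in range(attempts):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))

# Usage sketch (call_api is a hypothetical wrapper around the HTTP request):
# import time
# for delay in backoff_delays():
#     resp = call_api()
#     if resp.status_code not in (429, 500, 502, 503):
#         break
#     # Prefer the server's retry_after hint when present:
#     hint = resp.json().get("error", {}).get("retry_after")
#     time.sleep(hint if hint is not None else delay)

delays = list(backoff_delays())
print(len(delays))  # → 5
```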