wylon

Batch

Batch lets you submit a large number of requests as a single JSONL file. wylon processes them asynchronously and returns the results within the completion window. Compared with real-time chat completions, batch jobs run on a separate quota with higher concurrency limits and discounted pricing, which suits offline evaluation, data generation, document processing, and similar non-real-time workloads.

When to use Batch

If your requests must return immediately (live chat, agents, real-time RAG), use chat completions instead.

Real-time vs. Batch

                    | Chat completions (real-time)        | Batch
Call style          | Synchronous HTTP/SSE                | Asynchronous job (poll or callback)
Latency             | Milliseconds                        | Minutes to hours, within the completion window
Pricing             | Per-token, standard rate            | Per-token with batch discount
Rate limits         | Subject to RPM/TPM, see rate limits | Independent quota with higher concurrency limits
Supported endpoints | chat completions / completions      | chat completions / completions

Workflow

  1. Prepare a JSONL file

    One request per line. Each line wraps a normal /v1/chat/completions body with a custom_id so you can correlate results.

  2. Upload the file

    Call POST /v1/files to upload the JSONL as the input file and obtain a file_id.

  3. Create the batch job

    Call POST /v1/batches with the file_id, the target endpoint, and a completion window.

  4. Poll status and download results

    Use GET /v1/batches/{id} to track progress. When the job completes, download the file referenced by output_file_id. Each line of that file is the response for one request.

Input file format

Each line is a complete request object. method and url declare the target endpoint; body matches the chat completions request body.

{"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "moonshotai/Kimi-K2", "messages": [{"role": "user", "content": "Explain KV cache in one line."}]}}
{"custom_id": "req-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "moonshotai/Kimi-K2", "messages": [{"role": "user", "content": "Explain speculative decoding in one line."}]}}
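
A minimal sketch of generating such a file in Python, assuming the prompts are held in a list. The helper names build_request and write_batch_file and the req-NNN ID scheme are illustrative choices, not part of the wylon API.

```python
import json


def build_request(custom_id, prompt, model="moonshotai/Kimi-K2"):
    """Wrap one chat completions body in the batch request envelope."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }


def write_batch_file(prompts, path="requests.jsonl"):
    """Write one JSON object per line and return the custom_ids in order."""
    ids = []
    with open(path, "w", encoding="utf-8") as f:
        for i, prompt in enumerate(prompts, start=1):
            custom_id = f"req-{i:03d}"  # hypothetical ID scheme; any unique string works
            line = json.dumps(build_request(custom_id, prompt), ensure_ascii=False)
            f.write(line + "\n")
            ids.append(custom_id)
    return ids
```

Keeping the returned IDs alongside the original prompts makes it straightforward to match results back to their inputs later.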

API calls

The full flow takes four calls: upload the input file, create the job, retrieve its status, and download the output file. All require an account-level API key.

# 1. Upload the input file
curl https://api.wylon.cn/v1/files \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -F purpose=batch \
  -F file=@requests.jsonl

# Response: { "id": "file-abc...", ... }

# 2. Create the batch job
curl https://api.wylon.cn/v1/batches \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc...",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }'

# 3. Poll status
curl https://api.wylon.cn/v1/batches/batch_xxx \
  -H "Authorization: Bearer $WYLON_API_KEY"

# 4. When status is completed, download the output file
curl https://api.wylon.cn/v1/files/file-out.../content \
  -H "Authorization: Bearer $WYLON_API_KEY" -o results.jsonl
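
# Python, using the OpenAI SDK pointed at the wylon base URL
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# 1. Upload the input file (the context manager closes the handle afterwards)
with open("requests.jsonl", "rb") as f:
    file = client.files.create(file=f, purpose="batch")

# 2. Create the batch job
batch = client.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the job leaves its transient states
while batch.status in {"validating", "in_progress", "finalizing", "cancelling"}:
    time.sleep(30)
    batch = client.batches.retrieve(batch.id)

# 4. Download results; output_file_id may be unset, e.g. when the job failed
if batch.output_file_id:
    result = client.files.content(batch.output_file_id)
    with open("results.jsonl", "wb") as out:
        out.write(result.read())
else:
    print(f"Batch ended with status {batch.status}; no output file.")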

Job status

Status                 | Meaning
validating             | Validating input file format and quota.
in_progress            | Scheduled or currently executing.
finalizing             | Processing complete; writing the output file.
completed              | Finished. Download via output_file_id.
failed                 | Job failed entirely. See errors for details.
expired                | Did not finish within completion_window; partial results are still available.
cancelling / cancelled | Cancelled by user.
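
The statuses above can drive a polling helper. The following is a sketch, not an official client: the status names come from the table, but grouping them into pending and terminal sets, the function name wait_for_batch, and its parameters are assumptions made for illustration. The retrieve argument is any callable that returns an object with a status attribute, such as client.batches.retrieve from the OpenAI SDK.

```python
import time

# Status names from the table; the pending/terminal grouping is this sketch's choice.
PENDING = {"validating", "in_progress", "finalizing", "cancelling"}
TERMINAL = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, interval=30.0, timeout=None, sleep=time.sleep):
    """Poll retrieve(batch_id) until the job reaches a terminal status."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        batch = retrieve(batch_id)
        if batch.status in TERMINAL:
            return batch
        if batch.status not in PENDING:
            # Fail loudly on a status this sketch does not know about.
            raise ValueError(f"unexpected batch status: {batch.status!r}")
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} is still {batch.status}")
        sleep(interval)
```

The caller still has to inspect the returned status: completed and expired jobs may both have output to download, while failed jobs should be checked via errors.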

Output file

The output file contains one response per line, correlated with the input by custom_id. Failed requests are written to a separate error file, referenced by error_file_id, so you can retry only those requests.

{"custom_id": "req-001", "response": {"status_code": 200, "body": {"id": "cmpl-...", "choices": [{"message": {"role": "assistant", "content": "…"}, "finish_reason": "stop"}], "usage": {"total_tokens": 128}}}}
{"custom_id": "req-002", "response": {"status_code": 200, "body": {"id": "cmpl-...", "choices": [{"message": {"role": "assistant", "content": "…"}}], "usage": {"total_tokens": 96}}}}
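
A short sketch of reading the output file back in Python and collecting the custom_ids that need a retry. The function name split_results is hypothetical, and treating any line without a 200 status_code as failed is an assumption; the exact schema of error-file lines is not shown here.

```python
import json


def split_results(lines):
    """Return (answers, failed_ids) from an iterable of output-file lines."""
    answers, failed_ids = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        response = record.get("response") or {}
        if response.get("status_code") == 200:
            message = response["body"]["choices"][0]["message"]
            answers[record["custom_id"]] = message["content"]
        else:
            # Assumption: anything other than a 200 response counts as failed.
            failed_ids.append(record["custom_id"])
    return answers, failed_ids
```

For example, `with open("results.jsonl", encoding="utf-8") as f: answers, failed = split_results(f)` yields a dictionary keyed by custom_id plus a list of IDs to resubmit in a follow-up batch.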

Quotas and limits
