Batch
Batch lets you submit a large number of requests as a single JSONL file; Wylon processes them asynchronously and returns the results within the completion window. Compared with real-time chat completions, batch jobs run on a separate quota with higher concurrency limits and discounted pricing, making them well suited to offline evaluation, data generation, document processing, and similar non-real-time workloads.
When to use Batch
- Offline data processing. Summarize, classify, label or translate large historical datasets.
- Evaluation & benchmarking. Run a fixed set of inputs across multiple models or parameter combinations for comparison.
- Synthetic data generation. Produce training or test samples from seed prompts in bulk.
- Overnight jobs. Workloads that are not latency-sensitive and can avoid daytime peak quotas.
If your requests must return immediately (live chat, agents, real-time RAG), use chat completions instead.
Real-time vs. Batch
| | Chat completions (real-time) | Batch |
|---|---|---|
| Call style | Synchronous HTTP/SSE | Asynchronous job (poll or callback) |
| Latency | Milliseconds | Minutes to hours, within the completion window |
| Pricing | Per-token, standard rate | Per-token with batch discount |
| Rate limits | Subject to RPM/TPM, see rate limits | Independent quota with higher concurrency limits |
| Supported endpoints | chat completions / completions | chat completions / completions |
Workflow
1. **Prepare a JSONL file.** One request per line. Each line wraps a normal `/v1/chat/completions` body with a `custom_id` so you can correlate results.
2. **Upload the file.** Call `POST /v1/files` to upload the JSONL as the input file and obtain a `file_id`.
3. **Create the batch job.** Call `POST /v1/batches` with the `file_id`, the target endpoint, and a completion window.
4. **Poll status and download results.** Use `GET /v1/batches/{id}` to track progress. When complete, download the `output_file_id`; each line is the response for one request.
Input file format
Each line is a complete request object. `method` and `url` declare the target endpoint; `body` matches the chat completions request body.
{"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "moonshotai/Kimi-K2", "messages": [{"role": "user", "content": "Explain KV cache in one line."}]}}
{"custom_id": "req-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "moonshotai/Kimi-K2", "messages": [{"role": "user", "content": "Explain speculative decoding in one line."}]}}
API calls
The full flow takes four calls: upload the input file, create the job, poll status, and download the output. All require an account-level API key.
```shell
# 1. Upload the input file
curl https://api.wylon.cn/v1/files \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -F purpose=batch \
  -F file=@requests.jsonl
# Response: { "id": "file-abc...", ... }

# 2. Create the batch job
curl https://api.wylon.cn/v1/batches \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc...",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }'

# 3. Poll status
curl https://api.wylon.cn/v1/batches/batch_xxx \
  -H "Authorization: Bearer $WYLON_API_KEY"

# 4. When status is completed, download the output file
curl https://api.wylon.cn/v1/files/file-out.../content \
  -H "Authorization: Bearer $WYLON_API_KEY" -o results.jsonl
```
```python
from openai import OpenAI
import os, time

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# 1. Upload
file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2. Create job
batch = client.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the job leaves a non-terminal status
while batch.status in {"validating", "in_progress", "finalizing"}:
    time.sleep(30)
    batch = client.batches.retrieve(batch.id)

# 4. Download results
result = client.files.content(batch.output_file_id)
open("results.jsonl", "wb").write(result.read())
```
Job status
| Status | Meaning |
|---|---|
| `validating` | Validating input file format and quota. |
| `in_progress` | Scheduled or currently executing. |
| `finalizing` | Processing complete; writing the output file. |
| `completed` | Finished. Download via `output_file_id`. |
| `failed` | Job failed entirely. See `errors` for details. |
| `expired` | Did not finish within `completion_window`; partial results are still available. |
| `cancelling` / `cancelled` | Cancelled by user. |
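A polling loop should treat `completed`, `failed`, `expired`, and `cancelled` all as terminal, since `failed` and `expired` also end the job. A minimal sketch; `wait_for_batch` is a hypothetical helper that accepts any retrieve callable (for example `client.batches.retrieve` from the Python example above):

```python
import time

TERMINAL = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(retrieve, batch_id, interval=30, sleep=time.sleep):
    """Poll retrieve(batch_id) until the job reaches a terminal status.

    retrieve: callable returning an object with a .status attribute.
    interval: seconds between polls; injectable sleep eases testing.
    """
    batch = retrieve(batch_id)
    while batch.status not in TERMINAL:
        sleep(interval)
        batch = retrieve(batch_id)
    return batch
```

The caller then branches on the final status: download `output_file_id` for `completed` (and, per the table above, for `expired` partial results), or inspect errors for `failed`.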
Output file
One response per line, correlated with the input `custom_id`. Failed requests are written to a separate `error_file_id` for selective retry.
{"custom_id": "req-001", "response": {"status_code": 200, "body": {"id": "cmpl-...", "choices": [{"message": {"role": "assistant", "content": "…"}, "finish_reason": "stop"}], "usage": {"total_tokens": 128}}}
{"custom_id": "req-002", "response": {"status_code": 200, "body": {"id": "cmpl-...", "choices": [{"message": {"role": "assistant", "content": "…"}}], "usage": {"total_tokens": 96}}}
Quotas and limits
- File size: up to 100 MB per file; ≤ 50,000 lines per file is recommended.
- Concurrent jobs: each organization can queue up to 10 batch jobs by default; contact sales for higher limits.
- Completion window: `24h` is supported; lines that don't finish in time are marked `expired`.
- Pricing: billed per processed token at a discounted rate; see the pricing page for the published discount.
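Datasets larger than the per-file limits must be split across multiple batch jobs. A sketch that respects both caps stated above (`split_batch_file` is a hypothetical helper; the limits are parameters so they can be tuned):

```python
def split_batch_file(lines, prefix="requests",
                     max_lines=50_000,                # recommended line cap
                     max_bytes=100 * 1024 * 1024):    # 100 MB size limit
    """Split an iterable of JSONL lines into files within both limits.

    Returns the list of file paths written, e.g. requests-000.jsonl, ...
    """
    paths, buf, size = [], [], 0

    def flush():
        path = f"{prefix}-{len(paths):03d}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(buf)
        paths.append(path)

    for line in lines:
        encoded = line.rstrip("\n") + "\n"
        nbytes = len(encoded.encode("utf-8"))
        if buf and (len(buf) >= max_lines or size + nbytes > max_bytes):
            flush()
            buf, size = [], 0
        buf.append(encoded)
        size += nbytes
    if buf:
        flush()
    return paths
```

Each resulting file is uploaded and submitted as its own batch job; with the default concurrency quota of 10 queued jobs, very large datasets may need to be staged in waves.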