Batch
Batch lets you submit a large number of requests as a single JSONL file; Wylon processes them asynchronously and returns the results within the completion window. Compared with real-time chat completions, batch jobs run on a separate quota with higher concurrency limits and discounted pricing, making them well suited to offline evaluation, data generation, document processing, and similar non-real-time workloads.
When to use Batch
- Offline data processing. Summarize, classify, label or translate large historical datasets.
- Evaluation & benchmarking. Run a fixed set of inputs across multiple models or parameter combinations for comparison.
- Synthetic data generation. Produce training or test samples from seed prompts in bulk.
- Overnight jobs. Workloads that are not latency-sensitive and can avoid daytime peak quotas.
If your requests must return immediately (live chat, agents, real-time RAG), use chat completions instead.
Real-time vs. Batch
| | Chat completions (real-time) | Batch |
|---|---|---|
| Call style | Synchronous HTTP/SSE | Asynchronous job (poll or callback) |
| Latency | Milliseconds | Minutes to hours, within the completion window |
| Pricing | Per-token, standard rate | Per-token with batch discount |
| Rate limits | Subject to RPM/TPM, see rate limits | Independent quota with higher concurrency limits |
| Supported endpoints | chat completions / completions | chat completions / completions |
Workflow
1. **Prepare a JSONL file.** One request per line. Each line wraps a normal `/v1/chat/completions` body with a `custom_id` so you can correlate results.
2. **Upload the file.** Call `POST /v1/files` to upload the JSONL as the input file and obtain a `file_id`.
3. **Create the batch job.** Call `POST /v1/batches` with the `file_id`, the target endpoint, and a completion window.
4. **Poll status and download results.** Use `GET /v1/batches/{id}` to track progress. When complete, download the `output_file_id`; each line is the response for one request.
Input file format
Each line is a complete request object. `method` and `url` declare the target endpoint; `body` matches the chat completions request body.
{"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "moonshotai/Kimi-K2", "messages": [{"role": "user", "content": "Explain KV cache in one line."}]}}
{"custom_id": "req-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "moonshotai/Kimi-K2", "messages": [{"role": "user", "content": "Explain speculative decoding in one line."}]}}
API calls
The full flow takes four calls: upload the input file, create the job, poll status, and download the output. All require an account-level API key.
```shell
# 1. Upload the input file
curl https://api.wylon.cn/v1/files \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -F purpose=batch \
  -F file=@requests.jsonl
# Response: { "id": "file-abc...", ... }

# 2. Create the batch job
curl https://api.wylon.cn/v1/batches \
  -H "Authorization: Bearer $WYLON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc...",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }'

# 3. Poll status
curl https://api.wylon.cn/v1/batches/batch_xxx \
  -H "Authorization: Bearer $WYLON_API_KEY"

# 4. When status is completed, download the output file
curl https://api.wylon.cn/v1/files/file-out.../content \
  -H "Authorization: Bearer $WYLON_API_KEY" -o results.jsonl
```
```python
from openai import OpenAI
import os, time

client = OpenAI(
    api_key=os.environ["WYLON_API_KEY"],
    base_url="https://api.wylon.cn/v1",
)

# 1. Upload
file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2. Create job
batch = client.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the job leaves a non-terminal status
while batch.status in {"validating", "in_progress", "finalizing"}:
    time.sleep(30)
    batch = client.batches.retrieve(batch.id)

# 4. Download results
result = client.files.content(batch.output_file_id)
open("results.jsonl", "wb").write(result.read())
```
Job status
| Status | Meaning |
|---|---|
| `validating` | Validating input file format and quota. |
| `in_progress` | Scheduled or currently executing. |
| `finalizing` | Processing complete; writing the output file. |
| `completed` | Finished. Download via `output_file_id`. |
| `failed` | Job failed entirely. See `errors` for details. |
| `expired` | Did not finish within `completion_window`; partial results are still available. |
| `cancelling` / `cancelled` | Cancelled by user. |
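A polling loop should treat `completed`, `failed`, `expired`, and `cancelled` all as terminal, since `failed` and `expired` also end the job. A minimal sketch; `wait_for_batch` is a hypothetical helper that accepts any retrieve callable (for example `client.batches.retrieve` from the Python example above):

```python
import time

TERMINAL = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(retrieve, batch_id, interval=30, sleep=time.sleep):
    """Poll retrieve(batch_id) until the job reaches a terminal status.

    retrieve: callable returning an object with a .status attribute.
    interval: seconds between polls; injectable sleep eases testing.
    """
    batch = retrieve(batch_id)
    while batch.status not in TERMINAL:
        sleep(interval)
        batch = retrieve(batch_id)
    return batch
```

The caller then branches on the final status: download `output_file_id` for `completed` (and, per the table above, for `expired` partial results), or inspect errors for `failed`.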
Output file
One response per line, correlated with the input `custom_id`. Failed requests are written to a separate `error_file_id` for selective retry.
{"custom_id": "req-001", "response": {"status_code": 200, "body": {"id": "cmpl-...", "choices": [{"message": {"role": "assistant", "content": "…"}, "finish_reason": "stop"}], "usage": {"total_tokens": 128}}}
{"custom_id": "req-002", "response": {"status_code": 200, "body": {"id": "cmpl-...", "choices": [{"message": {"role": "assistant", "content": "…"}}], "usage": {"total_tokens": 96}}}
Quotas and limits
- File size: up to 100 MB per file; ≤ 50,000 lines per file is recommended.
- Concurrent jobs: each organization can queue up to 10 batch jobs by default; contact sales for higher limits.
- Completion window: `24h` is supported; lines that don't finish in time are marked `expired`.
- Pricing: billed per processed token at a discounted rate; see the pricing page for the published discount.
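Datasets larger than the per-file limits must be split across multiple batch jobs. A sketch that respects both caps stated above (`split_batch_file` is a hypothetical helper; the limits are parameters so they can be tuned):

```python
def split_batch_file(lines, prefix="requests",
                     max_lines=50_000,                # recommended line cap
                     max_bytes=100 * 1024 * 1024):    # 100 MB size limit
    """Split an iterable of JSONL lines into files within both limits.

    Returns the list of file paths written, e.g. requests-000.jsonl, ...
    """
    paths, buf, size = [], [], 0

    def flush():
        path = f"{prefix}-{len(paths):03d}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(buf)
        paths.append(path)

    for line in lines:
        encoded = line.rstrip("\n") + "\n"
        nbytes = len(encoded.encode("utf-8"))
        if buf and (len(buf) >= max_lines or size + nbytes > max_bytes):
            flush()
            buf, size = [], 0
        buf.append(encoded)
        size += nbytes
    if buf:
        flush()
    return paths
```

Each resulting file is uploaded and submitted as its own batch job; with the default concurrency quota of 10 queued jobs, very large datasets may need to be staged in waves.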