Inference observability
Every inference call on wylon is measured end-to-end. Pre-built dashboards give you traffic, latency, error, and capacity telemetry out of the box, and a Prometheus-compatible metrics API lets you ship everything into your own stack.
What you get
- Preconfigured dashboards in the web console for every model and endpoint.
- Near-real-time updates — metrics are surfaced within tens of seconds.
- Percentile latency statistics — p50 / p90 / p99, not just averages.
- Prometheus & Grafana integrations for your own dashboards and alerts.
- Request logs with opt-in payload capture for debugging.
Available metrics
| Category | Metric | Description |
|---|---|---|
| Traffic | wylon_requests_per_minute | Total requests per minute. |
| Traffic | wylon_input_tokens_per_minute | Input (prompt) tokens per minute. |
| Traffic | wylon_output_tokens_per_minute | Generated output tokens per minute. |
| Latency | wylon_request_duration_seconds | End-to-end time from request sent to full response received. |
| Latency | wylon_time_to_first_token_seconds | TTFT — first streamed token latency. |
| Latency | wylon_output_tokens_per_second | Output speed after the first token. |
| Capacity | wylon_batch_jobs_in_progress | Number of Batch jobs currently executing. |
| Capacity | wylon_queue_depth | Pending requests waiting for a GPU slot. |
| Errors | wylon_error_rate | Percentage of failed requests, grouped by HTTP status (4xx, 429, 5xx). |
| Errors | wylon_success_rate | Percentage of 2xx responses. |
Filters and dimensions
Every metric can be sliced by any combination of the dimensions below.
- Time range (5m, 1h, 24h, 7d, custom)
- Model (e.g. moonshotai/Kimi-K2)
- Call type (real-time chat completions / Batch)
- Project / API key
- Error code (HTTP status)
- Prompt length bucket / latency range
Metrics API
The Prometheus-format endpoint returns all metrics with their live values.

```shell
curl https://api.wylon.cn/v1/metrics \
  -H "Authorization: Bearer $WYLON_API_KEY"
```

```promql
# p99 TTFT for kimi-k2.5, last 15 minutes
histogram_quantile(0.99,
  sum(rate(wylon_time_to_first_token_seconds_bucket{model="moonshotai/kimi-k2.5"}[15m])) by (le)
)

# error rate by status class, last 1 hour
sum(rate(wylon_error_rate[1h])) by (status_class)
```
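If you want to consume the endpoint directly rather than through Prometheus, the text exposition format is easy to parse by hand. Below is a minimal sketch; the sample payload and its values are made up for illustration (in practice you would fetch the body from `/v1/metrics` with the bearer header shown above), and the parser handles only simple gauge/counter lines, not full exposition-format edge cases.

```python
import re

# Made-up sample of Prometheus text exposition output, for illustration only.
SAMPLE = """\
# HELP wylon_requests_per_minute Total requests per minute.
# TYPE wylon_requests_per_minute gauge
wylon_requests_per_minute{model="moonshotai/kimi-k2.5"} 1832
wylon_queue_depth{model="moonshotai/kimi-k2.5"} 4
"""

# metric_name{label="value",...} <number>
LINE_RE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        m = LINE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = {}
        if raw_labels:
            for kv in raw_labels.split(","):
                k, v = kv.split("=", 1)
                labels[k] = v.strip('"')
        out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("model"), value)
```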
Exporters
| Target | How |
|---|---|
| Prometheus | Scrape /v1/metrics with bearer-token auth. |
| Grafana® | Point a Prometheus data source at the same URL; a starter dashboard JSON is published in the cookbook. |
| OpenTelemetry | Push spans + metrics to any OTLP-compatible sink via the collector sidecar. |
| Datadog / New Relic | Use their OTLP intake endpoints with the OTel collector. |
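For the Prometheus row above, a scrape job might look like the following. This is a sketch, not a published config: the job name is arbitrary, and the scheme/target are inferred from the API URL shown earlier.

```yaml
scrape_configs:
  - job_name: "wylon"            # arbitrary label for this scrape job
    scheme: https
    metrics_path: /v1/metrics
    authorization:               # bearer-token auth, as the table notes
      type: Bearer
      credentials: "<WYLON_API_KEY>"   # or use credentials_file
    static_configs:
      - targets: ["api.wylon.cn"]
```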
Request logs
Structured JSON logs for each request are retained for 30 days. Payload capture (prompts and completions) is off by default. Enable it per-project only where required — it affects billing and has clear privacy implications.
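A single log record might look like the sketch below. The field names here are hypothetical, chosen to mirror the metrics above; the platform's actual log schema may differ.

```python
# Hypothetical shape of one structured request-log record.
# Field names are illustrative, not wylon's actual schema.
record = {
    "request_id": "req_abc123",
    "timestamp": "2025-01-15T09:30:00Z",
    "model": "moonshotai/kimi-k2.5",
    "status": 200,
    "time_to_first_token_ms": 412,
    "duration_ms": 2310,
    "input_tokens": 1024,
    "output_tokens": 256,
    "prompt": None,        # stays empty unless payload capture is enabled
    "completion": None,
}
print(record["request_id"], record["status"])
```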
Access control
Metric and log visibility follows your organization role:
- Organization admin — full visibility across all projects.
- Project admin — full visibility for owned projects.
- Project member — metrics only (not raw payloads) for assigned projects.
FAQ
How fresh are the metrics?
Near-real-time — typically < 30 seconds from request completion to dashboard visibility.
Is there a cost to scrape /v1/metrics?
No. Metrics are included with every plan. High-volume log shipping may incur egress costs.
Can I correlate logs with client-side traces?
Yes — include a traceparent header (W3C Trace Context) and it will be propagated through wylon’s spans.
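The traceparent value follows the W3C Trace Context format: `version-traceid-parentid-flags`, all lowercase hex. A minimal sketch of generating one client-side (the header dict is an assumption about how you attach it to your HTTP client, not a wylon-specific API):

```python
import secrets

def make_traceparent():
    """Build a W3C Trace Context traceparent value:
    00-<32-hex trace id>-<16-hex parent id>-<flags>."""
    trace_id = secrets.token_hex(16)   # 32 hex chars; must not be all zeros
    parent_id = secrets.token_hex(8)   # 16 hex chars
    return f"00-{trace_id}-{parent_id}-01"  # flags 01 = sampled

headers = {
    "Authorization": "Bearer <WYLON_API_KEY>",
    "traceparent": make_traceparent(),
}
print(headers["traceparent"])
```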