Inference observability
Every inference call on wylon is measured end-to-end. Pre-built dashboards give you traffic, latency, error, and capacity telemetry out of the box, and a Prometheus-compatible metrics API lets you ship everything into your own stack.
What you get
- Preconfigured dashboards in the web console for every model and endpoint.
- Near-real-time updates — metrics are surfaced within tens of seconds.
- Percentile latency statistics — p50 / p90 / p99, not just averages.
- Prometheus & Grafana integrations for your own dashboards and alerts.
- Request logs with opt-in payload capture for debugging.
Available metrics
| Category | Metric | Description |
|---|---|---|
| Traffic | wylon_requests_per_minute | Total requests per minute. |
| Traffic | wylon_input_tokens_per_minute | Input (prompt) tokens per minute. |
| Traffic | wylon_output_tokens_per_minute | Generated output tokens per minute. |
| Latency | wylon_request_duration_seconds | End-to-end time from request sent to full response received. |
| Latency | wylon_time_to_first_token_seconds | TTFT — first streamed token latency. |
| Latency | wylon_output_tokens_per_second | Output speed after the first token. |
| Capacity | wylon_batch_jobs_in_progress | Number of Batch jobs currently executing. |
| Capacity | wylon_queue_depth | Pending requests waiting for a GPU slot. |
| Errors | wylon_error_rate | Percentage of failed requests, grouped by HTTP status (4xx, 429, 5xx). |
| Errors | wylon_success_rate | Percentage of 2xx responses. |
Filters and dimensions
Every metric can be sliced by any combination of the dimensions below.
- Time range (5m, 1h, 24h, 7d, custom)
- Model (e.g. moonshotai/Kimi-K2)
- Call type (real-time chat completions / Batch)
- Project / API key
- Error code (HTTP status)
- Prompt length bucket / latency range
Metrics API
The Prometheus-format endpoint returns all metrics with their live values.

```shell
curl https://api.wylon.cn/v1/metrics \
  -H "Authorization: Bearer $WYLON_API_KEY"
```

```promql
# p99 TTFT for kimi-k2.5, last 15 minutes
histogram_quantile(0.99,
  sum(rate(wylon_time_to_first_token_seconds_bucket{model="moonshotai/kimi-k2.5"}[15m])) by (le)
)

# error rate by status class, last 1 hour
sum(rate(wylon_error_rate[1h])) by (status_class)
```
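If you want to consume the endpoint directly rather than through Prometheus, the text exposition format is easy to parse by hand. Below is a minimal sketch; the sample payload and its values are made up for illustration (in practice you would fetch the body from `/v1/metrics` with the bearer header shown above), and the parser handles only simple gauge/counter lines, not full exposition-format edge cases.

```python
import re

# Made-up sample of Prometheus text exposition output, for illustration only.
SAMPLE = """\
# HELP wylon_requests_per_minute Total requests per minute.
# TYPE wylon_requests_per_minute gauge
wylon_requests_per_minute{model="moonshotai/kimi-k2.5"} 1832
wylon_queue_depth{model="moonshotai/kimi-k2.5"} 4
"""

# metric_name{label="value",...} <number>
LINE_RE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    out = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        m = LINE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = {}
        if raw_labels:
            for kv in raw_labels.split(","):
                k, v = kv.split("=", 1)
                labels[k] = v.strip('"')
        out.append((name, labels, float(value)))
    return out

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels.get("model"), value)
```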
Exporters
| Target | How |
|---|---|
| Prometheus | Scrape /v1/metrics with bearer-token auth. |
| Grafana® | Point a Prometheus data source at the same URL; a starter dashboard JSON is published in the cookbook. |
| OpenTelemetry | Push spans + metrics to any OTLP-compatible sink via the collector sidecar. |
| Datadog / New Relic | Use their OTLP intake endpoints with the OTel collector. |
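For the Prometheus row above, a scrape job might look like the following. This is a sketch, not a published config: the job name is arbitrary, and the scheme/target are inferred from the API URL shown earlier.

```yaml
scrape_configs:
  - job_name: "wylon"            # arbitrary label for this scrape job
    scheme: https
    metrics_path: /v1/metrics
    authorization:               # bearer-token auth, as the table notes
      type: Bearer
      credentials: "<WYLON_API_KEY>"   # or use credentials_file
    static_configs:
      - targets: ["api.wylon.cn"]
```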
Request logs
Structured JSON logs for each request are retained for 30 days. Payload capture (prompts and completions) is off by default. Enable it per-project only where required — it affects billing and has clear privacy implications.
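A single log record might look like the sketch below. The field names here are hypothetical, chosen to mirror the metrics above; the platform's actual log schema may differ.

```python
# Hypothetical shape of one structured request-log record.
# Field names are illustrative, not wylon's actual schema.
record = {
    "request_id": "req_abc123",
    "timestamp": "2025-01-15T09:30:00Z",
    "model": "moonshotai/kimi-k2.5",
    "status": 200,
    "time_to_first_token_ms": 412,
    "duration_ms": 2310,
    "input_tokens": 1024,
    "output_tokens": 256,
    "prompt": None,        # stays empty unless payload capture is enabled
    "completion": None,
}
print(record["request_id"], record["status"])
```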
Access control
Metric and log visibility follows your organization role:
- Organization admin — full visibility across all projects.
- Project admin — full visibility for owned projects.
- Project member — metrics only (not raw payloads) for assigned projects.
FAQ
How fresh are the metrics?
Near-real-time — typically < 30 seconds from request completion to dashboard visibility.
Is there a cost to scrape /v1/metrics?
No. Metrics are included with every plan. High-volume log shipping may incur egress costs.
Can I correlate logs with client-side traces?
Yes — include a traceparent header (W3C Trace Context) and it will be propagated through wylon’s spans.
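The traceparent value follows the W3C Trace Context format: `version-traceid-parentid-flags`, all lowercase hex. A minimal sketch of generating one client-side (the header dict is an assumption about how you attach it to your HTTP client, not a wylon-specific API):

```python
import secrets

def make_traceparent():
    """Build a W3C Trace Context traceparent value:
    00-<32-hex trace id>-<16-hex parent id>-<flags>."""
    trace_id = secrets.token_hex(16)   # 32 hex chars; must not be all zeros
    parent_id = secrets.token_hex(8)   # 16 hex chars
    return f"00-{trace_id}-{parent_id}-01"  # flags 01 = sampled

headers = {
    "Authorization": "Bearer <WYLON_API_KEY>",
    "traceparent": make_traceparent(),
}
print(headers["traceparent"])
```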