Metrics Reference

Angos exposes Prometheus metrics at the /metrics endpoint.


HTTP Metrics

http_requests_total

Total number of HTTP requests.

Type       Labels
Counter    method, route, status

Labels:

  • method: HTTP method (GET, POST, PUT, DELETE, etc.)
  • route: Route action (e.g., get-manifest, put-blob, list-tags)
  • status: HTTP status code (200, 404, 500, etc.)

Example:

# Request rate over 5 minutes
rate(http_requests_total[5m])

# Error rate (5xx responses)
rate(http_requests_total{status=~"5.."}[5m])

# Requests by route
sum by (route) (rate(http_requests_total[5m]))

# GET requests for manifests
rate(http_requests_total{method="GET", route="get-manifest"}[5m])

http_request_duration_ms

HTTP request latency in milliseconds.

Type       Labels
Histogram  method, route

Example:

# 95th percentile latency (one result per method and route)
histogram_quantile(0.95, rate(http_request_duration_ms_bucket[5m]))

# Average latency
rate(http_request_duration_ms_sum[5m]) / rate(http_request_duration_ms_count[5m])

# 99th percentile latency by route
histogram_quantile(0.99, sum by (route, le) (rate(http_request_duration_ms_bucket[5m])))

# Manifest pull latency
histogram_quantile(0.95, rate(http_request_duration_ms_bucket{route="get-manifest"}[5m]))

http_requests_in_flight

Current number of HTTP requests being processed.

Type       Labels
Gauge      none

Example:

# Current in-flight requests
http_requests_in_flight

# Max in-flight over time
max_over_time(http_requests_in_flight[1h])

Route Values

The route label uses action names from the OCI Distribution API:

Route               Description
healthz             Health check
metrics             Prometheus metrics
get-api-version     API version check
get-blob            Download blob
delete-blob         Delete blob
mount-blob          Cross-repo blob mount
start-upload        Start blob upload
update-upload       Chunk upload
complete-upload     Complete upload
get-upload          Upload status
cancel-upload       Cancel upload
get-manifest        Pull manifest
put-manifest        Push manifest
delete-manifest     Delete manifest
list-tags           List tags
list-catalog        List repositories
get-referrers       Get referrers
ui-asset            UI static files
ui-config           UI configuration
list-repositories   Extension API
list-namespaces     Extension API
list-revisions      Extension API
list-uploads        Extension API
list-jobs           List pending jobs
list-failed-jobs    List dead-letter jobs
retry-job           Requeue dead-letter job
delete-job          Delete queued job
unknown             Unrecognized route

Authentication Metrics

auth_attempts_total

Total number of authentication attempts.

Type       Labels
Counter    method, result

Labels:

  • method: basic, mtls, oidc
  • result: success, failed

Example:

# Authentication success rate
sum(rate(auth_attempts_total{result="success"}[5m])) /
sum(rate(auth_attempts_total[5m]))

# Failed auth attempts by method
sum by (method) (rate(auth_attempts_total{result="failed"}[5m]))

Webhook Metrics

webhook_authorization_requests_total

Total webhook authorization requests.

Type       Labels
Counter    webhook, result

Labels:

  • webhook: Name of the webhook
  • result: allow, deny, cached_allow, cached_deny

Example:

# Authorization request rate by webhook
sum by (webhook) (rate(webhook_authorization_requests_total[5m]))

# Cache effectiveness
sum(rate(webhook_authorization_requests_total{result=~"cached_.*"}[5m])) /
sum(rate(webhook_authorization_requests_total[5m]))

# Denial rate by webhook
sum by (webhook) (rate(webhook_authorization_requests_total{result=~".*deny"}[5m]))

webhook_authorization_duration_seconds

Webhook authorization request duration.

Type       Labels
Histogram  webhook

Example:

# 95th percentile webhook latency
histogram_quantile(0.95, rate(webhook_authorization_duration_seconds_bucket[5m]))

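# Slow webhook detection: rate of requests taking longer than 1s
# (total request rate minus the rate of requests that finished within 1s)
sum by (webhook) (rate(webhook_authorization_duration_seconds_count[5m])) -
sum by (webhook) (rate(webhook_authorization_duration_seconds_bucket{le="1"}[5m]))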

Event Webhook Metrics

event_webhook_deliveries_total

Total event webhook delivery attempts.

Type       Labels
Counter    webhook, event, result

Labels:

  • webhook: Webhook name from configuration
  • event: Event type (e.g., manifest.push)
  • result: success or error

Example:

# Delivery rate by webhook and result
sum by (webhook, result) (rate(event_webhook_deliveries_total[5m]))

# Error rate for a specific webhook
rate(event_webhook_deliveries_total{webhook="audit", result="error"}[5m])

event_webhook_delivery_duration_seconds

Event webhook delivery duration.

Type       Labels
Histogram  webhook, event

Example:

# P95 delivery latency
histogram_quantile(0.95, rate(event_webhook_delivery_duration_seconds_bucket[5m]))

# Delivery latency by webhook
histogram_quantile(0.95, sum by (webhook, le) (rate(event_webhook_delivery_duration_seconds_bucket[5m])))

Job Queue Metrics

angos_job_queue_pending and angos_job_queue_failed are published only when [global.job_queue] is configured. The two counters, angos_job_queue_enqueued_total and angos_job_queue_enqueue_failures_total, are updated on every enqueue (the in-process queue included) and are always exposed on the server's /metrics. See Enable Durable Cache Jobs.

angos_job_queue_pending

Pending jobs that are ready for handling within the readiness horizon (not_before ≤ now + pending_ready_horizon_secs, default 600 seconds). Suitable for KEDA-style autoscaling of angos worker pods. Refreshed by a background ticker on every server replica.

Envelopes backed off further into the future are deliberately excluded so the gauge tracks actionable work: spinning up workers for jobs that won't be claimable for an hour wastes capacity. Tune pending_ready_horizon_secs (under [global.job_queue]) to give your autoscaler enough lead time to spin up replicas before the work becomes ready.

The gauge saturates at 10 000: any value at the cap should be read as "≥ 10 000". Combined with the readiness-horizon filter, this bounds the S3 LIST cost per refresh tick to ~10 paginated calls regardless of queue depth. KEDA's ScaledObject only needs ordinal granularity above its scale-to-max threshold, which is normally well below this cap.

Type       Labels
Gauge      queue
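
As a minimal sketch, the gauge might be queried as follows. This assumes each server replica exports its own copy of the gauge (the description says every replica refreshes it), which is why both queries collapse replicas with max. The 10000 literal is the saturation cap described above.

```promql
# Actionable backlog per queue, collapsing duplicate series from server replicas
max by (queue) (angos_job_queue_pending)

# Queues whose gauge has reached the cap; read the true depth as ">= 10 000"
max by (queue) (angos_job_queue_pending) >= 10000
```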

angos_job_queue_failed

Dead-lettered jobs currently held in the queue (jobs that exhausted their retry budget). Refreshed by the same server-side background ticker as angos_job_queue_pending, so it stays scrapeable even when angos worker drains the queue. Saturates at 10 000 like the pending gauge. Alert on angos_job_queue_failed{queue="replication"} > 0 to catch stuck replication.

Type       Labels
Gauge      queue

angos_job_queue_enqueued_total

Total jobs submitted to the queue.

Type       Labels
Counter    queue, dedup

Labels:

  • queue: queue name (e.g. cache, replication)
  • dedup: hit when a duplicate lock_key was suppressed, otherwise miss
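
A sketch of how the dedup label might be used, assuming its values are exactly hit and miss as listed above:

```promql
# Enqueue rate per queue
sum by (queue) (rate(angos_job_queue_enqueued_total[5m]))

# Fraction of enqueues suppressed as duplicates, per queue
sum by (queue) (rate(angos_job_queue_enqueued_total{dedup="hit"}[5m])) /
sum by (queue) (rate(angos_job_queue_enqueued_total[5m]))
```

A persistently high duplicate fraction would suggest the same work is being requested repeatedly while an earlier job for it is still queued.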

angos_job_queue_enqueue_failures_total

Total enqueue attempts that did not land on the queue (envelope build or storage error).

Type       Labels
Counter    queue

Labels:

  • queue: queue name (e.g. cache, replication)
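
The following queries are one possible way to watch this counter. The 5m and 15m windows are illustrative choices, not values recommended by Angos.

```promql
# Enqueue failures per second, by queue
sum by (queue) (rate(angos_job_queue_enqueue_failures_total[5m]))

# Queues with any enqueue failure in the last 15 minutes
sum by (queue) (increase(angos_job_queue_enqueue_failures_total[15m])) > 0
```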

Replication Metrics

The angos_job_queue_pending{queue="replication"} gauge (above) reports replication backlog depth; the metrics below cover push outcomes, staleness, and scrub reconciliation. See Bi-Directional Replication.

angos_replication_push_total and angos_replication_last_success_timestamp_seconds are updated in the process that drains the replication queue. Without [global.job_queue], the server drains the queue in-process, so both appear on the server's /metrics. With [global.job_queue], angos worker drains the queue; it exposes no HTTP endpoint, so these two metrics are not scrapeable in that mode. Use the server-published angos_job_queue_failed{queue="replication"} gauge to alert on stuck replication and angos_job_queue_pending{queue="replication"} for backlog.

angos_replication_push_total

Total replication pushes to a downstream, by outcome.

Type       Labels
Counter    downstream, outcome

Labels:

  • downstream: the configured downstream name
  • outcome: one of the following
      • pushed: manifest/blobs transferred, or a delete applied
      • converged: the downstream already matched (a push whose digest was already present, or a delete whose target was already absent), so nothing transferred
      • superseded: the downstream already held a newer copy (last-writer-wins); counted as success
      • unsupported: the downstream rejected the delete method with 405 (e.g. it does not support tag deletion); the job completes without converging rather than dead-lettering
      • failed: the push errored and the job will retry

Example:

# Push rate by downstream and outcome
sum by (downstream, outcome) (rate(angos_replication_push_total[5m]))

# Replication failure rate
sum by (downstream) (rate(angos_replication_push_total{outcome="failed"}[5m]))

angos_replication_last_success_timestamp_seconds

Unix timestamp (seconds) of the last pushed, converged, or superseded replication push per downstream (the convergent outcomes set it; unsupported and failed do not). Use it to detect a stalled downstream.

Type       Labels
Gauge      downstream

Example:

# Seconds since the last successful push (staleness) per downstream
time() - angos_replication_last_success_timestamp_seconds

angos_replication_reconcile_total

Replication reconcile enqueues emitted by angos scrub --replicate, by outcome.

Type       Labels
Counter    outcome

Labels:

  • outcome: one of the following
      • enqueued: a divergence was enqueued (a push, or a prune delete for a prune = true downstream)
      • failed: the envelope build or enqueue errored
      • skipped: a downstream HEAD probe failed (e.g. auth rejection, 5xx, or timeout), so the tag stays unreconciled this pass; a persistently non-zero skipped with zero enqueued typically means bad downstream credentials

This counter lives in the angos scrub process, which serves no /metrics endpoint and exits when the run completes, so Prometheus cannot scrape it. The warn-level log lines emitted for failed and skipped tags are the operational signal: watch the scrub run's logs (or its exit status) rather than this counter.

Replication Backlog

Replication shares the durable job queue, so backlog depth is reported by angos_job_queue_pending{queue="replication"} (see Job Queue Metrics).

# Pending replication pushes
angos_job_queue_pending{queue="replication"}

A deep replication queue is the normal state during a downstream outage and does not affect /readyz. Alert on sustained backlog instead. When the server drains the queue in-process (no [global.job_queue]), the staleness gauge is also on the server's /metrics and supports a staleness alert:

# Replication stale for over 10 minutes (in-process drain only)
(time() - angos_replication_last_success_timestamp_seconds) > 600

With a separate angos worker, the gauge is not scrapeable; alert on the server-published angos_job_queue_failed{queue="replication"} dead-letter gauge (stuck pushes) and the angos_job_queue_pending{queue="replication"} backlog instead.

# Stuck replication pushes (dead-lettered)
angos_job_queue_failed{queue="replication"} > 0

Example Prometheus Configuration

scrape_configs:
  - job_name: 'angos'
    static_configs:
      - targets: ['registry:8000']
    metrics_path: /metrics
    scheme: http # or https

Example Grafana Dashboard Queries

Overview

# Request rate
sum(rate(http_requests_total[5m]))

# Error rate percentage
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))

# Request rate by route
sum by (route) (rate(http_requests_total[5m]))

# Manifest operations latency
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket{route=~".*-manifest"}[5m])) by (le))

Authentication

# Auth success rate
100 * sum(rate(auth_attempts_total{result="success"}[5m])) /
sum(rate(auth_attempts_total[5m]))

# Auth method distribution
sum by (method) (rate(auth_attempts_total[5m]))

Authorization Webhooks

# Webhook cache hit rate
100 * sum(rate(webhook_authorization_requests_total{result=~"cached_.*"}[5m])) /
sum(rate(webhook_authorization_requests_total[5m]))

# Webhook denial rate percentage (includes cached denials)
100 * sum(rate(webhook_authorization_requests_total{result=~".*deny"}[5m])) /
sum(rate(webhook_authorization_requests_total[5m]))

Event Webhooks

# Event webhook delivery rate
sum by (webhook, result) (rate(event_webhook_deliveries_total[5m]))

# Event webhook error rate
100 * sum(rate(event_webhook_deliveries_total{result="error"}[5m])) /
sum(rate(event_webhook_deliveries_total[5m]))

# Event webhook P95 latency
histogram_quantile(0.95, sum by (webhook, le) (rate(event_webhook_delivery_duration_seconds_bucket[5m])))

Alerting Examples

groups:
  - name: angos
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on Angos"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le)) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on Angos"

      - alert: AuthFailures
        expr: |
          sum(rate(auth_attempts_total{result="failed"}[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High authentication failure rate"