Configure Replication

Mirror a repository's content to one or more downstream registries as it changes. Two Angos instances configured as each other's downstreams form an active-active pair. See Bi-Directional Replication for the concepts.

Prerequisites

  • Two or more reachable Angos instances (or any OCI-compliant downstream registry).
  • A credential on each downstream that is allowed to push and delete (put-manifest, delete-manifest, blob uploads). Replication mirrors deletes, and prune enqueues them, so with a push-only credential (no delete-manifest) every replicated delete is rejected and dead-lettered. See Set Up Access Control.

Declare a Downstream

Replication is configured per repository, alongside upstream. Add one [[repository."<name>".downstream]] table per downstream:

[repository."nginx"]

[[repository."nginx".downstream]]
name = "eu-region" # local identifier (appears in logs and metrics)
url = "https://angos-eu.example.com"
username = "replicator"
password = "..."
mode = "event+reconcile" # "event+reconcile" | "event-only" | "reconcile-only"
namespace_filter = ["^nginx/.*"] # optional regex list; empty matches all namespaces
max_concurrent_pushes = 4 # optional; per-manifest blob fan-out (positive integer, default 4)

Downstream Options

| Option | Type | Default | Description |
|---|---|---|---|
| name | string | required | Local identifier for this downstream (used in logs and the downstream metric label) |
| url | string | required | Downstream registry URL. A bare host mirrors the namespace verbatim; a path (http://host:8000/team) becomes the namespace prefix the content lands under, replacing the source repository prefix (<repo>/x → team/x). Angos serves the OCI API at /v2/ root, so the path is mapped into the namespace, not the HTTP path. See Fan out into sibling repositories. |
| mode | string | "event+reconcile" | "event+reconcile", "event-only", or "reconcile-only" |
| namespace_filter | [string] | [] (all) | Regex patterns; a namespace replicates to this downstream only if it matches one |
| max_concurrent_pushes | usize | 4 | Concurrent blob pushes per manifest for this downstream (positive integer, >= 1) |
| prune | bool | false | When true, reconciliation also deletes tags present on this downstream but absent locally (authoritative one-way mirror). Leave false for active-active peers; see Reconcile on Demand. |
| username / password | string | - | Basic auth for the downstream |
| max_redirect | u8 | 5 | Maximum redirects to follow |
| server_ca_bundle | string | - | CA bundle to verify the downstream's TLS certificate |
| client_certificate / client_private_key | string | - | mTLS to the downstream (both required together) |

Modes

| Mode | Live pushes on mutation | Included in scrub --replicate |
|---|---|---|
| event+reconcile | Yes | Yes |
| event-only | Yes | No |
| reconcile-only | No | Yes |

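
The interaction between mode and namespace_filter can be modeled in a few lines. This is an illustrative sketch, not Angos's implementation: the Downstream class and the helper names are hypothetical, and the use of unanchored matching (re.search) is an assumption about how the regex patterns are applied.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Downstream:
    # Hypothetical stand-in for one [[repository."<name>".downstream]] table.
    name: str
    mode: str = "event+reconcile"
    namespace_filter: List[str] = field(default_factory=list)


def in_scope(ds: Downstream, namespace: str) -> bool:
    # An empty filter matches every namespace; otherwise one match is enough.
    # Assumption: patterns are applied unanchored, as re.search does.
    if not ds.namespace_filter:
        return True
    return any(re.search(p, namespace) for p in ds.namespace_filter)


def live_push_targets(downstreams: List[Downstream], namespace: str) -> List[str]:
    # Downstreams that receive a push as soon as the namespace mutates.
    live_modes = ("event+reconcile", "event-only")
    return [d.name for d in downstreams
            if d.mode in live_modes and in_scope(d, namespace)]


def reconcile_targets(downstreams: List[Downstream], namespace: str) -> List[str]:
    # Downstreams that scrub --replicate considers for this namespace.
    reconcile_modes = ("event+reconcile", "reconcile-only")
    return [d.name for d in downstreams
            if d.mode in reconcile_modes and in_scope(d, namespace)]
```

A downstream with an "event-only" mode never appears in reconcile_targets, which is why drift on such a downstream is not repaired by a reconcile run.
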
Removing or renaming a downstream

Pending jobs for a removed or renamed downstream fail loudly and dead-letter after their retry budget. Clear them with angos scrub --replication-orphans (use --dry-run to preview), or inspect them via the jobs admin UI or the _jobs API (?queue=replication).

Reclaiming stranded blobs on a receiver

When a replicated manifest push uploads a blob but its manifest then loses last-writer-wins or dead-letters, the receiver keeps the blob's per-namespace ownership grant with no manifest referencing it, pinning the bytes. angos scrub --orphan-grants <age> (e.g. 24h) revokes such grants once the blob is older than the given age and reclaims the bytes; the age gate avoids racing an in-flight push.
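
The age gate can be pictured as a simple filter over ownership grants. The sketch below is a toy model under stated assumptions: the Grant record, the revocable_grants helper, and the idea of passing in the set of still-referenced (namespace, digest) pairs are all hypothetical, chosen only to show why a recently uploaded blob is left alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Grant:
    # Hypothetical record: one namespace's ownership grant on one blob.
    namespace: str
    digest: str
    blob_created: datetime


def revocable_grants(grants: List[Grant],
                     referenced: Set[Tuple[str, str]],
                     now: datetime,
                     min_age: timedelta) -> List[Grant]:
    # A grant is revocable only if no manifest in its namespace references
    # the blob AND the blob is older than the age gate. The second condition
    # keeps an in-flight push (blob uploaded, manifest not yet written) safe.
    return [g for g in grants
            if (g.namespace, g.digest) not in referenced
            and now - g.blob_created > min_age]
```

With min_age set to timedelta(hours=24), a blob uploaded an hour ago is skipped even if nothing references it yet.
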

Fan out into sibling repositories

A bare-host url mirrors the namespace verbatim, which suits a separate registry. To mirror a repository into a different repository on the same instance (or under a different prefix on another instance), put that prefix on the url path: it replaces the source repository prefix, so team/app lands as team-a/app. Angos serves the OCI API at the root, so the path is mapped into the namespace rather than the HTTP path.

[[repository."team".downstream]]
name = "team-a"
url = "https://angos.example.com/team-a" # path -> namespace prefix

[[repository."team".downstream]]
name = "team-b"
url = "https://angos.example.com/team-b"

[repository."team-a"] # the receiving repositories
[repository."team-b"]

A push to team/<image> then fans out to team-a/<image> and team-b/<image>. With a bare-host url, replicating to the same instance would re-target the source repository itself (a self-loop, suppressed as a no-op) rather than landing in a sibling.
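
The mapping from url path to target namespace can be sketched as follows. This is a minimal model of the rule described above, not Angos's code; target_namespace is a hypothetical helper, and rejecting a namespace outside the source repository is an assumption made to keep the example strict.

```python
from urllib.parse import urlsplit


def target_namespace(repository: str, namespace: str, downstream_url: str) -> str:
    # The URL path (if any) is the namespace prefix on the downstream.
    prefix = urlsplit(downstream_url).path.strip("/")
    if not prefix:
        # Bare host: the namespace is mirrored verbatim.
        return namespace
    # Otherwise the path replaces the source repository prefix.
    if namespace == repository:
        suffix = ""
    elif namespace.startswith(repository + "/"):
        suffix = namespace[len(repository) + 1:]
    else:
        raise ValueError(f"{namespace!r} is not under repository {repository!r}")
    return f"{prefix}/{suffix}" if suffix else prefix
```

For the configuration above, target_namespace("team", "team/app", "https://angos.example.com/team-a") yields "team-a/app", while the same call with "https://angos.example.com" returns "team/app" unchanged, which is the self-loop case on a single instance.
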

Global Knobs

One [global] field tunes replication across all repositories:

[global]
max_concurrent_replication_jobs = 4 # worker concurrency for replication jobs (must be > 0)

  • max_concurrent_replication_jobs bounds how many replication jobs are handled in parallel by each angos worker, the server's in-process drain, and the scrub --replicate end-of-run drain. Default 4; must be greater than zero.

Restrict who may push to replicated repositories

A replication write is an ordinary manifest push carrying the X-Angos-Source-Timestamp header, and the receiver persists that timestamp as the tag's creation time. That value decides last-writer-wins races and age-based retention. Future-dating is clamped, but any identity allowed to push can backdate a tag. On every instance that receives replication, gate the write actions (put-manifest, delete-manifest, uploads) to the replicator identity through the CEL access_policy; see Restrict replication writes.
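
To see why the timestamp is sensitive, consider a toy model of the receiver's decision. Everything here is an assumption for illustration: the function names are hypothetical, clamping to the receiver's own clock is one plausible reading of "clamped", and the tie-breaking rule (strictly newer wins) is not confirmed by this page.

```python
from typing import Optional


def clamp_source_ts(claimed_ts: float, receiver_now: float) -> float:
    # Future-dated timestamps are pulled back to the receiver's clock.
    # Backdated timestamps pass through unchanged, which is the risk.
    return min(claimed_ts, receiver_now)


def incoming_wins(claimed_ts: float,
                  existing_ts: Optional[float],
                  receiver_now: float) -> bool:
    # Last-writer-wins: the write is applied when the tag does not exist
    # yet or the (clamped) incoming timestamp is newer than the stored one.
    # Assumption: a tie keeps the existing tag.
    ts = clamp_source_ts(claimed_ts, receiver_now)
    return existing_ts is None or ts > existing_ts
```

In this model a new tag pushed with a timestamp far in the past is accepted and stored with that old creation time, so age-based retention may treat it as expired immediately. That is the behavior the access policy is meant to restrict to the replicator identity.
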

Worker vs In-Process Drain

How replication work is drained depends on [global.job_queue]:

In-Process (server self-drains)

With no [global.job_queue] section, the server drains the replication queue itself, in-process. This is the simplest setup (run only angos server on each instance) and is ideal for a single-instance or demo deployment. Jobs still persist to the configured fs/S3 store (under _jobs/) and resume after a restart; what you give up versus a separate worker is cross-replica coordination and the queue-depth gauge, not durability.

Separate Worker

With [global.job_queue] configured, the server only enqueues jobs; you must run a worker to drain them. A bare angos worker drains both the replication and cache queues (each on its own pool); pass --queue replication to drain replication alone, for example to scale it independently:

angos -c config.toml worker # drains both cache and replication
angos -c config.toml worker --queue replication # replication only

This is the multi-replica, horizontally-scalable configuration: draining is decoupled from serving and can be scaled independently. (Pending pushes persist under _jobs/pending/replication/ and resume after a restart in both modes.) See Enable Durable Cache Jobs for the job-queue setup, KEDA autoscaling, and angos worker details.

Because the queue is drained by separate processes, [global.job_queue] requires a shared metadata-store lock strategy ([metadata_store.s3.lock_strategy.s3], [metadata_store.s3.lock_strategy.redis], or [metadata_store.fs.lock_strategy.redis]) so workers serialize on the same jobs; the default in-process memory lock is rejected at startup with this section. The in-process mode above (no [global.job_queue]) runs in a single process and works with any lock strategy.

Two-Instance Active-Active Example

Configure each instance with the other as a downstream for the same repository. With no [global.job_queue], each server self-drains in-process; no separate worker is needed.

Instance A (config-a.toml):

[server]
bind_address = "0.0.0.0"
port = 8000

[global]
max_concurrent_replication_jobs = 4

[blob_store.fs]
root_dir = "/data"

[repository."nginx"]

[repository."nginx".access_policy]
default = "allow"

[[repository."nginx".downstream]]
name = "instance-b"
url = "http://angos-b:8000"
mode = "event+reconcile"

Instance B (config-b.toml) is the mirror image, pointing its downstream at instance A.

Push to A and the tag appears on B within a few seconds:

docker push localhost:8000/nginx/app:v1
# ... shortly after ...
docker pull localhost:8001/nginx/app:v1 # served from B

Reconcile on Demand

When the event path misses a change (an instance was down, or two instances drifted after a partition), reconcile explicitly:

# Preview the pushes that would be enqueued; enqueues nothing
angos -c config.toml scrub --replicate --dry-run

# Enqueue the diverging tags (a standalone scrub drains them end-of-run)
angos -c config.toml scrub --replicate

By default reconciliation is additive: it pushes diverging or downstream-missing tags and never deletes. With --dry-run it previews the work without enqueuing anything: it lists an EnqueueReplicationPush for each diverging or downstream-missing tag and, for any downstream marked prune = true, an EnqueueReplicationDelete for each downstream-only tag.

A downstream marked prune = true is treated as an authoritative one-way mirror: reconciliation also enumerates its tags (via the OCI list-tags endpoint) and deletes any that are absent locally, so it converges exactly to the local tag set. Pruning is one-way-mirror-only: enabling it on an active-active peer would delete a tag the peer authored that has not yet replicated back.

Leave prune = false for active-active peers. The delete does carry a source_ts, so the receiver applies last-writer-wins rather than deleting unconditionally. But that only protects a downstream tag dated in the future relative to the reconcile decision. A peer's legitimately-newer tag whose created_at predates the reconcile run is still removed.
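
The difference between additive and pruning reconciliation comes down to a set comparison. The sketch below is a simplified model, not the actual reconciler: plan_reconcile is a hypothetical helper, and treating "diverging" as a digest mismatch for the same tag is an assumption.

```python
from typing import Dict, List, Tuple


def plan_reconcile(local: Dict[str, str],
                   remote: Dict[str, str],
                   prune: bool = False) -> Tuple[List[str], List[str]]:
    # Both arguments map tag -> manifest digest.
    # Pushes: tags missing downstream or pointing at a different digest.
    pushes = sorted(tag for tag, digest in local.items()
                    if remote.get(tag) != digest)
    # Deletes: downstream-only tags, and only when prune is enabled.
    deletes = sorted(tag for tag in remote if tag not in local) if prune else []
    return pushes, deletes
```

With prune left at false the delete list is always empty, so a tag that exists only on the peer survives the run. With prune enabled the same tag is scheduled for deletion, which is the hazard for active-active pairs described above.
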

Re-running is a no-op once converged (coalesced by the queue). Schedule it like any other maintenance task:

# Cron: reconcile every replicated repository nightly at 4 AM
0 4 * * * /usr/bin/angos -c /etc/registry/config.toml scrub --replicate

Observability

Replication exposes Prometheus metrics:

# Push rate by downstream and outcome (pushed, converged, superseded, unsupported, failed)
sum by (downstream, outcome) (rate(angos_replication_push_total[5m]))

# Seconds since the last successful push per downstream (staleness)
time() - angos_replication_last_success_timestamp_seconds

# Pending replication backlog
angos_job_queue_pending{queue="replication"}

The push and staleness metrics are scrapeable on the server's /metrics only when it drains the queue in-process (no [global.job_queue]); with a separate angos worker they are not scrapeable, so monitor angos_job_queue_pending{queue="replication"} instead. See Replication Metrics for the scrapeability details and the full list.

Troubleshooting

Pushes never reach the downstream:

  • Confirm the downstream url is reachable from the source instance.
  • Verify the credential is authorized to push on the downstream (put-manifest, delete-manifest, blob uploads).
  • If [global.job_queue] is configured, ensure an angos worker is running (it drains the replication queue by default); the server only enqueues.

Replicated deletes fail and dead-letter (403):

  • The downstream credential is missing delete-manifest. Replication issues deletes as ordinary DELETE manifest calls, so a push-only credential has every replicated or pruned delete rejected with 403 and the job dead-letters after its retry budget. Grant delete-manifest to the replicator identity (see Set Up Access Control) and retry the dead-lettered jobs via the _jobs API (?queue=replication).

A tag does not overwrite on the downstream:

  • The downstream copy may be newer (last-writer-wins): a 409 REPLICATION_SUPERSEDED is convergence, not failure.
  • The downstream tag may be immutable: a 409 CONFLICT surfaces and the job retries; relax immutability or pick a different tag.

Reconciliation reports nothing to do but instances differ:

  • Confirm the downstream's mode includes reconciliation (event+reconcile or reconcile-only).
  • Confirm the namespace matches the downstream's namespace_filter.

Debug logging:

RUST_LOG=angos::replication=debug ./angos server

Reference