lfg2025/archy

archipelago 0a8db9044f fix(orchestrator): recreate zombie "Up" containers whose process is dead

podman trusts its own state DB: when a container's conmon dies without
podman observing it (for example a cgroup-cascade SIGKILL on an
archipelago.service restart, or a crash), `podman ps` keeps reporting it
"Up" long after the process is gone. The reconciler treated such a zombie
as a no-op forever, so a dead dependency with no published host port
never recovered.

Observed live on .228 (2026-06-25): netbird-dashboard reported "Up" with
a dead State.Pid → its nginx proxy 502'd → NetBird login broke
("Unauthenticated"). The dashboard publishes no host port, so the
Running branch had nothing to probe and never recreated it.

Add a zombie guard to the Running branch: verify the recorded State.Pid
is alive (its /proc entry exists) before trusting "running"; on a
concrete dead PID, stop, remove, and install_fresh from the manifest.
Conservative by design: any uncertainty (inspect failed, PID
unparseable) assumes alive, so a transient podman hiccup never destroys
a healthy container. A unit test covers live, dead, and out-of-range PIDs.
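
The guard described above could be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not the code in this commit:
the names `PidStatus`, `check_pid`, and `should_recreate` are hypothetical,
the `proc_root` parameter exists only so the check can be exercised against
a fake directory instead of the real /proc, and treating PID 0 as
"uncertain" is an assumption rather than something the commit states.

```rust
use std::path::Path;

/// Result of the liveness check (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum PidStatus {
    /// A /proc entry exists for the PID.
    Alive,
    /// The PID parsed cleanly and has no /proc entry.
    Dead,
    /// The PID could not be parsed or the filesystem check failed.
    Unknown,
}

/// Classify the PID string reported by the container runtime.
/// `proc_root` would be `/proc` in production; it is a parameter here
/// so the function can be tested against a temporary directory.
pub fn check_pid(raw_pid: &str, proc_root: &Path) -> PidStatus {
    // Unparseable, negative, or out-of-range values are "uncertain".
    // Assumption: PID 0 is also treated as uncertain, not as dead.
    let pid: u32 = match raw_pid.trim().parse() {
        Ok(p) if p > 0 => p,
        _ => return PidStatus::Unknown,
    };

    // try_exists distinguishes "definitely absent" from "could not tell".
    match proc_root.join(pid.to_string()).try_exists() {
        Ok(true) => PidStatus::Alive,
        Ok(false) => PidStatus::Dead,
        Err(_) => PidStatus::Unknown,
    }
}

/// Recreate only on a concrete dead PID; any uncertainty assumes alive.
pub fn should_recreate(raw_pid: &str, proc_root: &Path) -> bool {
    check_pid(raw_pid, proc_root) == PidStatus::Dead
}
```

Keeping `Unknown` separate from `Dead` is what makes the guard
conservative: only a PID that parsed cleanly and is verifiably absent
triggers a recreate.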

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-26 02:25:52 -04:00

src

fix(orchestrator): recreate zombie "Up" containers whose process is dead

2026-06-26 02:25:52 -04:00

tests

chore(ci): rustfmt + clippy clean-up to unblock the Rust CI job

2026-04-18 17:23:46 -04:00

Cargo.toml

chore: release v1.7.99-alpha

2026-06-18 01:00:24 -04:00