- prod_orchestrator.rs: the boot reconciler's zombie-guard and start-failed
recreate paths (Created/Stopped/Exited states) had no attempt cap, unlike
health_monitor's independent restart tracker. A container whose entrypoint
fatally crashes right after `podman start` succeeds got stop+remove+
install_fresh'd every ~30s reconcile tick forever (portainer on .198,
2026-07-01: a DB schema newer than the pinned binary could read -- no
amount of recreating fixes that). Added a 5-attempts/30-minute circuit
breaker; once exhausted the container is left alone with an error! log
instead of looping, and an explicit install/start clears the counter.
- content_server.rs: serve_content now prunes a catalog entry whose backing
file is missing on disk, instead of leaving it advertised to every peer
forever with no way to distinguish "gone" from "transient failure."
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>