Container recovery: - Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s - Dependency-aware restarts: won't restart services before their deps - Reset dependent counters when a dependency recovers - Handle "created" state containers (were invisible to health monitor) - Added IndeedHub, mempool-api, mysql to tier system - Crash recovery: podman start timeout 30s→120s with retry - Podman client: socket timeout 5s→30s, added restart policy UI state representation: - Exit code 0 shows "stopped" (gray), not "crashed" (red) - Exit code 137 shows "killed (OOM)" - Non-zero exit shows "crashed" (red) - Added exit_code field to PackageDataEntry Install/uninstall fixes: - Install returns error when container doesn't start (was silent success) - Post-install hooks awaited instead of fire-and-forget tokio::spawn - Uninstall: graceful rm before force, volume prune, network cleanup - Uninstall returns error on partial failure (was 200 OK) Config consistency: - DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded) - Bitcoin: added ZMQ ports 28332/28333 for LND block notifications - IndeedHub port 7777→8190 (was conflicting with strfry) - Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0 Performance: - Metrics collector interval 60s→300s (was duplicating health monitor) - Podman client: proper error propagation instead of unwrap_or_default Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1.6 KiB
1.6 KiB
name, description, type
| name | description | type |
|---|---|---|
| Container Orchestration Hardening | Container orchestration overhaul — stop grace periods, pull retry, persistent restart tracking, scheduled remediation, failsafe install, boot reconciliation | project |
Container orchestration hardening implemented on dev-iso branch (2026-03-28).
Why: Gitea issue requesting true orchestration. Containers were unreliable — 10s stop timeout risked Bitcoin Core UTXO corruption, image pulls failed silently, restart counters reset on process restart enabling infinite loops, doctor/reconcile scripts only ran manually.
What was done (7 changes):
- Per-container stop grace periods (600s bitcoin, 330s lnd, 300s electrs, 120s databases, 60s btcpay, 30s default) + systemd TimeoutStopSec=660
- Image pull retry with exponential backoff (3 attempts: 5s/15s/45s) + post-pull verification + stacks.rs error propagation instead of silent swallow
- Resolved container/health_monitor.rs TODO (documented as orchestrator-level responsibility)
- Persistent restart tracking to restart-tracker.json (survives process restarts, seeded on startup)
- Scheduled systemd timers: container-doctor every 30min, reconcile-containers every 6h
- Failsafe install: post-pull image verify, rollback on start failure, 30s post-start health check with crash diagnosis
- Boot reconciliation: runs reconcile-containers.sh after crash recovery completes
How to apply: These changes affect beta reliability. The other programmer is working on custom base ISO on the same branch — coordinate on build-auto-installer-iso.sh changes.