[Bug] On reboot many containers stay 'stopped' and don't reliably return to running #47
Repro (.116): after a reboot many containers show as stopped, and it is unclear whether all of them recover. At least one came back on its own, but recovery of the rest is not confirmed.
Need: on reboot, ALL containers must auto-start and reach running (level-triggered reconcile should drive them up, not leave them stopped).
Note: reboot startup can stagger ~5min on heavy nodes (.228), so distinguish 'slow staggered start' from 'stuck stopped'. Overlaps the IndeeHub stack needing 1-2 restarts to recover.
Investigate: crash_recovery.rs, prod_orchestrator.rs reconcile loop, Quadlet default.target.wants + linger, whether reconcile re-asserts running for every installed app after boot.
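One way to separate 'slow staggered start' from 'stuck stopped' is a boot grace window. The sketch below is illustrative only: `Observed`, `BootStatus`, `BOOT_GRACE` and `classify` are hypothetical names, not code from this repo, and the 5-minute window is an assumption taken from the stagger seen on heavy nodes.

```rust
use std::time::Duration;

/// Hypothetical observed container state (not the repo's real type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Observed {
    Running,
    Stopped,
}

/// Hypothetical classification of a container after a reboot.
#[derive(Debug, PartialEq)]
pub enum BootStatus {
    Up,
    StillStarting,
    StuckStopped,
}

/// Assumed grace window, based on the ~5 min stagger noted for heavy nodes.
pub const BOOT_GRACE: Duration = Duration::from_secs(5 * 60);

/// A stopped container only counts as stuck once the grace window has passed.
pub fn classify(observed: Observed, since_boot: Duration) -> BootStatus {
    match observed {
        Observed::Running => BootStatus::Up,
        Observed::Stopped if since_boot < BOOT_GRACE => BootStatus::StillStarting,
        Observed::Stopped => BootStatus::StuckStopped,
    }
}
```

With this split, only `StuckStopped` would need to be reported as a failure; `StillStarting` is just the reconciler not having reached that app yet.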
Live update 2026-06-17: reboot start/restart now works reliably (nodes come back up well), but containers still STOP during normal runtime and don't get restarted back to running. Concrete case: LND keeps stopping on .70.
So the core gap is runtime crash-recovery, not boot ordering: when a container exits during operation, the reconciler must detect it and drive it back to running (with backoff), for ALL apps. Overlaps the IndeeHub stack needing 1-2 restarts (#41).
Investigate why LND in particular exits repeatedly on .70 (logs/OOM/wallet-lock; cf. the LND wallet-password issue), plus the general 'stopped and not restarted' reconcile gap.
Diagnosis: the health monitor (health_monitor.rs) DOES restart exited/stopped/created containers with backoff (max 10 attempts, 1h stability reset), skipping user-stopped ones. So generic 'stopped' containers should recover. 'LND keeps stopping' specifically is most likely the known LND wallet-password lock (see the LND wallet-password work) — restarts won't help while the wallet is locked, and after max attempts LND stays down. Needs per-container live diagnosis to confirm which containers stay stopped and why; not a blind health-monitor change.
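The restart policy described above can be sketched roughly as follows. This is not the actual health_monitor.rs code. `RestartState`, `Decision` and `on_exit` are hypothetical names, and the base and maximum delays are assumptions; only the 10-attempt cap, the 1h stability reset and the user-stopped skip come from the diagnosis.

```rust
use std::time::Duration;

/// Hypothetical per-container bookkeeping (not the real health_monitor.rs type).
#[derive(Debug, Default)]
pub struct RestartState {
    pub attempts: u32,
    /// How long the container ran continuously before this exit.
    pub stable_for: Duration,
    pub user_stopped: bool,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    Skip,
    RestartAfter(Duration),
    GiveUp,
}

pub const MAX_ATTEMPTS: u32 = 10; // from the diagnosis
pub const STABILITY_RESET: Duration = Duration::from_secs(3600); // from the diagnosis
const BASE_DELAY: Duration = Duration::from_secs(5); // assumption
const MAX_DELAY: Duration = Duration::from_secs(300); // assumption

/// Decide what to do when a container is found exited/stopped/created.
pub fn on_exit(state: &mut RestartState) -> Decision {
    if state.user_stopped {
        return Decision::Skip;
    }
    // A long stable run earns a fresh attempt budget.
    if state.stable_for >= STABILITY_RESET {
        state.attempts = 0;
    }
    if state.attempts >= MAX_ATTEMPTS {
        return Decision::GiveUp;
    }
    // Exponential backoff: BASE_DELAY * 2^attempts, capped at MAX_DELAY.
    let shift = state.attempts.min(16);
    let delay = BASE_DELAY.saturating_mul(1u32 << shift).min(MAX_DELAY);
    state.attempts += 1;
    state.stable_for = Duration::ZERO;
    Decision::RestartAfter(delay)
}
```

Under a policy like this, a container that exits immediately on every start (for example LND with a locked wallet) burns through its attempts without ever reaching the stability reset and ends in `GiveUp`, which matches the 'stays down' symptom.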
Verified on the live .116 node: reboot recovery works. 35 of 36 containers are Up and stable (boot reconciler active, Quadlet WantedBy=default.target). The lone non-running one (bitcoin-core, Created) was the real defect: bitcoin-core and bitcoin-knots share port 8332, so with knots running the reconciler kept trying to start core and logging 'reconcile failed' (rootlessport ... 8332: address already in use). The health monitor already skipped it; now the reconciler does too: prod_orchestrator skips starting a Bitcoin variant when the other variant is running (returns NoOp). cargo check is green. This stops the churn; the parked variant staying Created is now expected, not a failure.
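For illustration, the variant guard might look roughly like the sketch below. It is not the actual prod_orchestrator code. `Action`, `conflicting_variant` and `plan_start` are hypothetical names; only the two app names, the shared port 8332 and the NoOp result come from this thread.

```rust
use std::collections::HashSet;

/// Hypothetical reconcile outcome (the real orchestrator type may differ).
#[derive(Debug, PartialEq)]
pub enum Action {
    Start,
    NoOp,
}

/// Returns the other Bitcoin variant that shares port 8332, if any.
fn conflicting_variant(app: &str) -> Option<&'static str> {
    match app {
        "bitcoin-core" => Some("bitcoin-knots"),
        "bitcoin-knots" => Some("bitcoin-core"),
        _ => None,
    }
}

/// Decide whether the reconciler should try to start `app`.
pub fn plan_start(app: &str, running: &HashSet<&str>) -> Action {
    // Already running: nothing to do.
    if running.contains(app) {
        return Action::NoOp;
    }
    match conflicting_variant(app) {
        // The sibling variant holds the port, so starting this one would
        // fail with "address already in use". Leave it parked.
        Some(other) if running.contains(other) => Action::NoOp,
        _ => Action::Start,
    }
}
```

Returning `NoOp` rather than an error keeps the parked variant out of the failure logs while still letting it start if the other variant is later stopped.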