[Bug] On reboot many containers stay 'stopped' and don't reliably return to running #47

lfg2025 · 2026-06-17T09:00:53Z

lfg2025 commented

2026-06-17 09:00:53 +00:00

Repro (.116): after a reboot a lot of containers show as stopped; unclear if all recover. One did come back, but not confidently all of them.

Need: on reboot, ALL containers must auto-start and reach running (level-triggered reconcile should drive them up, not leave them stopped).

Note: reboot startup can stagger ~5min on heavy nodes (.228), so distinguish 'slow staggered start' from 'stuck stopped'. Overlaps the IndeeHub stack needing 1-2 restarts to recover.
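One way to separate the two cases is a grace window after boot. The sketch below is illustrative only: `BootState`, `classify`, and `BOOT_GRACE` are hypothetical names, not existing code in this repo, and the 5-minute window is simply the stagger figure quoted above.

```rust
use std::time::Duration;

/// Hypothetical post-boot classification for one container.
#[derive(Debug, PartialEq, Eq)]
enum BootState {
    Running,
    /// Not running yet, but still inside the staggered-start window.
    SlowStart,
    /// Not running and past the window: treat as a real failure.
    StuckStopped,
}

/// Assumed grace window, based on the ~5min stagger seen on heavy nodes.
const BOOT_GRACE: Duration = Duration::from_secs(5 * 60);

/// Classify a container from its running flag and the time elapsed since boot.
fn classify(is_running: bool, since_boot: Duration) -> BootState {
    if is_running {
        BootState::Running
    } else if since_boot <= BOOT_GRACE {
        BootState::SlowStart
    } else {
        BootState::StuckStopped
    }
}
```

With this shape, only `StuckStopped` would be reported as a defect; `SlowStart` containers are left for the reconciler to bring up.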

Investigate: crash_recovery.rs, prod_orchestrator.rs reconcile loop, Quadlet default.target.wants + linger, whether reconcile re-asserts running for every installed app after boot.

lfg2025 commented

2026-06-17 09:30:04 +00:00

Live update 2026-06-17: reboot start/restart now works reliably (nodes come back up well), but containers still STOP during normal runtime and don't get restarted back to running. Concrete case: LND keeps stopping on .70.

So the core gap is runtime crash-recovery, not boot ordering: when a container exits during operation, the reconciler must detect it and drive it back to running (with backoff), for ALL apps. Overlaps the IndeeHub stack needing 1-2 restarts (#41).

Investigate why LND in particular exits repeatedly on .70 (logs/OOM/wallet-lock — cf. the LND wallet-password issue), plus the general 'stopped and not restarted' reconcile gap.

lfg2025 commented

2026-06-17 10:35:52 +00:00

Diagnosis: the health monitor (health_monitor.rs) DOES restart exited/stopped/created containers with backoff (max 10 attempts, 1h stability reset), skipping user-stopped ones. So generic 'stopped' containers should recover. 'LND keeps stopping' specifically is most likely the known LND wallet-password lock (see the LND wallet-password work) — restarts won't help while the wallet is locked, and after max attempts LND stays down. Needs per-container live diagnosis to confirm which containers stay stopped and why; not a blind health-monitor change.
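For reference, the policy described above can be sketched roughly as follows. This is not the code in health_monitor.rs: `RestartTracker`, `Decision`, and the delay constants are hypothetical, and the doubling backoff with a 5s base and 5min cap is an assumption. Only the 10-attempt limit, the 1h stability reset, and the user-stopped skip come from the diagnosis.

```rust
use std::time::Duration;

const MAX_ATTEMPTS: u32 = 10; // from the diagnosis above
const STABILITY_RESET: Duration = Duration::from_secs(60 * 60); // 1h, from the diagnosis above
const BASE_DELAY: Duration = Duration::from_secs(5); // assumption
const MAX_DELAY: Duration = Duration::from_secs(300); // assumption

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    /// The user stopped the container on purpose; leave it alone.
    Skip,
    /// Restart after waiting this long.
    RestartAfter(Duration),
    /// Attempt budget exhausted; the container stays down.
    GiveUp,
}

/// Hypothetical per-container restart bookkeeping.
#[derive(Default)]
struct RestartTracker {
    attempts: u32,
}

impl RestartTracker {
    /// Reset the attempt counter once the container has been up long enough.
    fn observe_running(&mut self, uptime: Duration) {
        if uptime >= STABILITY_RESET {
            self.attempts = 0;
        }
    }

    /// Decide what to do when the container is found exited/stopped/created.
    fn on_stopped(&mut self, user_stopped: bool) -> Decision {
        if user_stopped {
            return Decision::Skip;
        }
        if self.attempts >= MAX_ATTEMPTS {
            return Decision::GiveUp;
        }
        // Delay doubles with each attempt and is capped at MAX_DELAY.
        let factor = 1u32 << self.attempts;
        let delay = BASE_DELAY.saturating_mul(factor).min(MAX_DELAY);
        self.attempts += 1;
        Decision::RestartAfter(delay)
    }
}
```

The `GiveUp` branch is the one that matters for LND: if every restart hits a locked wallet, the budget runs out and the container stays down until something resets the counter.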

lfg2025 commented

2026-06-17 17:56:46 +00:00

Verified on the live .116 node: reboot recovery works. 35 of 36 containers are Up and stable (boot reconciler active, Quadlet WantedBy=default.target). The lone non-running one (`bitcoin-core`, Created) was the real defect: bitcoin-core and bitcoin-knots share port 8332, so with knots running the reconciler kept trying to start core and logging `reconcile failed` (`rootlessport ... 8332: address already in use`). The health monitor already skipped it; now the reconciler does too: `prod_orchestrator` skips starting a Bitcoin variant when the other variant is running (returns NoOp). `cargo check` green. Stops the churn; the parked variant staying Created is now expected, not a failure.
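The skip rule can be pictured roughly like this. It is a simplified sketch, not the actual `prod_orchestrator` change: `Action`, `conflicting_variant`, and `plan_start` are hypothetical names, and treating an already-running app as NoOp is an assumption added to keep the example level-triggered.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
enum Action {
    Start,
    NoOp,
}

/// Return the mutually exclusive sibling of a Bitcoin variant, if any.
/// Both variants bind port 8332, so only one can run at a time.
fn conflicting_variant(app: &str) -> Option<&'static str> {
    match app {
        "bitcoin-core" => Some("bitcoin-knots"),
        "bitcoin-knots" => Some("bitcoin-core"),
        _ => None,
    }
}

/// Decide whether the reconciler should try to start `app`.
fn plan_start(app: &str, running: &HashSet<&str>) -> Action {
    // Already running: nothing to do (assumption, see lead-in).
    if running.contains(app) {
        return Action::NoOp;
    }
    match conflicting_variant(app) {
        // The other variant holds the port; starting this one would only fail.
        Some(other) if running.contains(other) => Action::NoOp,
        _ => Action::Start,
    }
}
```

With knots running, `plan_start("bitcoin-core", ...)` yields `NoOp`, so the parked variant stays Created without generating further `reconcile failed` entries.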

lfg2025 closed this issue

2026-06-17 17:56:47 +00:00

Reference: lfg2025/archy#47