---
name: Container Orchestration Hardening
description: Container orchestration overhaul — stop grace periods, pull retry, persistent restart tracking, scheduled remediation, failsafe install, boot reconciliation
type: project
---

Container orchestration hardening implemented on the dev-iso branch (2026-03-28).

**Why:** Gitea issue requesting true orchestration. Containers were unreliable: the 10s stop timeout risked Bitcoin Core UTXO corruption, image pulls failed silently, restart counters reset on process restart (enabling infinite restart loops), and the doctor/reconcile scripts only ran manually.

**What was done (7 changes):**

1. Per-container stop grace periods (600s bitcoin, 330s lnd, 300s electrs, 120s databases, 60s btcpay, 30s default) plus systemd `TimeoutStopSec=660`
2. Image pull retry with exponential backoff (3 attempts: 5s/15s/45s), post-pull verification, and error propagation from stacks.rs instead of silently swallowing failures
3. Resolved the container/health_monitor.rs TODO (documented as an orchestrator-level responsibility)
4. Persistent restart tracking in restart-tracker.json (survives process restarts; seeded on startup)
5. Scheduled systemd timers: container-doctor every 30min, reconcile-containers every 6h
6. Failsafe install: post-pull image verification, rollback on start failure, 30s post-start health check with crash diagnosis
7. Boot reconciliation: runs reconcile-containers.sh after crash recovery completes

**How to apply:** These changes affect beta reliability. The other programmer is working on the custom base ISO on the same branch; coordinate on build-auto-installer-iso.sh changes.
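The pull-retry behaviour from change 2 (backoff, post-pull verification, and propagating the error instead of swallowing it), as an added illustration that is not the project's stacks.rs code. It assumes the container runtime is wrapped by injected `pull`/`verify` closures, that `sleep` is injected for testability, and that the note's "3 attempts: 5s/15s/45s" means one initial pull plus three backed-off retries.

```rust
use std::time::Duration;

/// Delay before each retry, in seconds. Reads the note's "3 attempts: 5s/15s/45s" as
/// one initial pull plus three backed-off retries (an assumption, not the project's code).
pub const RETRY_DELAYS_SECS: [u64; 3] = [5, 15, 45];

#[derive(Debug, PartialEq)]
pub enum PullError {
    /// Every attempt failed; the last runtime error is carried up, not dropped.
    Exhausted { image: String, attempts: usize, last: String },
    /// The runtime reported success but no usable image is present afterwards.
    VerifyFailed { image: String },
}

/// Pull `image` with retry and post-pull verification. `pull` and `verify` wrap the
/// container runtime (hypothetical, e.g. `docker pull` / `docker image inspect`); `sleep`
/// is injected so the schedule is testable without waiting. Returns the attempt number
/// that succeeded; every failure is propagated to the caller rather than logged and swallowed.
pub fn pull_with_retry<P, V, S>(image: &str, mut pull: P, mut verify: V, mut sleep: S)
    -> Result<usize, PullError>
where
    P: FnMut(&str) -> Result<(), String>,
    V: FnMut(&str) -> bool,
    S: FnMut(Duration),
{
    let max_attempts = RETRY_DELAYS_SECS.len() + 1;
    let mut last = String::new();
    for attempt in 1..=max_attempts {
        match pull(image) {
            Ok(()) => {
                // A pull that returns success but leaves no usable image is an error, not a success.
                return if verify(image) {
                    Ok(attempt)
                } else {
                    Err(PullError::VerifyFailed { image: image.to_string() })
                };
            }
            Err(e) => {
                last = e;
                eprintln!("pull {image}: attempt {attempt}/{max_attempts} failed: {last}");
                if attempt < max_attempts {
                    // Exponential backoff: 5s, 15s, then 45s before retries 1, 2, 3.
                    sleep(Duration::from_secs(RETRY_DELAYS_SECS[attempt - 1]));
                }
            }
        }
    }
    Err(PullError::Exhausted { image: image.to_string(), attempts: max_attempts, last })
}
```

The caller (the install or reconcile path) must handle the `Err` and roll back or diagnose, which is what turns a silent pull failure into a visible one.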