---
name: Container Orchestration Hardening
description: Container orchestration overhaul — stop grace periods, pull retry, persistent restart tracking, scheduled remediation, failsafe install, boot reconciliation
type: project
---

Container orchestration hardening was implemented on the dev-iso branch (2026-03-28).

**Why:** A Gitea issue requested true orchestration. Containers were unreliable: the 10s stop timeout risked Bitcoin Core UTXO corruption, image pulls failed silently, restart counters reset on process restart (enabling infinite restart loops), and the doctor/reconcile scripts only ran manually.

**What was done (7 changes):**
1. Per-container stop grace periods (600s bitcoin, 330s lnd, 300s electrs, 120s databases, 60s btcpay, 30s default) + systemd TimeoutStopSec=660
2. Image pull retry with exponential backoff (3 attempts: 5s/15s/45s) + post-pull verification + stacks.rs error propagation instead of silent swallow
3. Resolved container/health_monitor.rs TODO (documented as orchestrator-level responsibility)
4. Persistent restart tracking to restart-tracker.json (survives process restarts, seeded on startup)
5. Scheduled systemd timers: container-doctor every 30min, reconcile-containers every 6h
6. Failsafe install: post-pull image verify, rollback on start failure, 30s post-start health check with crash diagnosis
7. Boot reconciliation: runs reconcile-containers.sh after crash recovery completes
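
Items 1 and 2 above could be sketched in Rust roughly as follows. This is a minimal illustration, not the code on the branch: the function names, the substring matching on container names, and the `postgres`/`mariadb` stand-ins for "databases" are all assumptions; only the second values come from the list. It also assumes the delays apply between attempts, so three attempts use only the 5s and 15s waits.

```rust
use std::time::Duration;

/// Stop grace period for a container, using the values from item 1.
/// Matching on a name substring is an assumption made for this sketch.
fn stop_grace_period(container: &str) -> Duration {
    let secs = match container {
        n if n.contains("bitcoin") => 600,
        n if n.contains("lnd") => 330,
        n if n.contains("electrs") => 300,
        // Hypothetical stand-ins for "databases".
        n if n.contains("postgres") || n.contains("mariadb") => 120,
        n if n.contains("btcpay") => 60,
        _ => 30,
    };
    Duration::from_secs(secs)
}

/// Delay after a failed attempt (0-based): 5s, 15s, 45s.
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_secs(5 * 3u64.pow(attempt))
}

/// Run `op` at least once and at most `max_attempts` times, calling
/// `sleep` between failures. The last error is returned to the caller
/// rather than being swallowed.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: propagate the final error.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```

Passing `std::thread::sleep` as the `sleep` argument gives real waits; injecting it as a parameter keeps the retry logic testable without delays.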

**How to apply:** These changes affect beta reliability. The other programmer is working on a custom base ISO on the same branch, so coordinate on any build-auto-installer-iso.sh changes.