lfg2025/archy

Dorian 7bfe4d7608 docs: complete overnight container resilience plan — all cycles pass

All 6 cycles completed successfully:
- C1: Full baseline diagnosis of all Bitcoin stack containers
- C2: Fixed DAC_OVERRIDE caps, health checks, container specs
- C3: Resilience testing — kill/recover for all containers + cascade
- C4: Complete test suite pass — all health checks green
- C5: 5-minute soak test passes with zero state changes
- C6: Code quality gate — all checks pass

Critical bugs found and fixed:
- Rootless volume permission denied (missing DAC_OVERRIDE capability)
- LND health check requiring macaroon auth
- Electrumx health check using missing curl binary
- Container-doctor killing active conmon processes (root/rootless mismatch)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-30 23:33:32 +01:00

20 KiB

Raw Blame History

Overnight Plan — Container Resilience: Zero Failures

Deploy → pull apps → read logs → find failures → fix code → redeploy → retest → repeat until ZERO failures. Target: .228 (ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228). DO NOT PUSH — CI build in progress. Commit locally only. Follow CLAUDE.md strictly. Production-quality code. No unwrap(), no TODO, no hacks, no garbage. Every code change must be clean, well-structured, properly typed, and follow existing patterns.

Cycle 1: Baseline — Deploy and Discover Every Failure

C1-DEPLOY — Deploy current codebase to .228: Run ./scripts/deploy-to-target.sh --target 192.168.1.228 from macOS. If deploy script fails, read the error, fix the script, retry. After deploy succeeds, SSH to .228 and verify backend is alive: sudo systemctl status archipelago and curl -s http://127.0.0.1:5678/health. If backend is not running, check journalctl -u archipelago --no-pager -n 100 and fix whatever is wrong. Do not mark done until: deploy succeeds AND backend returns health JSON.
C1-CONTAINERS — Check every single container: SSH to .228. Run podman ps -a --format 'table {{.Names}}\t{{.State}}\t{{.Status}}\t{{.Ports}}' to see ALL containers. For EVERY container that is not running: run podman logs <name> --tail 100 and record the error. For every container showing (unhealthy): run podman logs <name> --tail 100 and record why. For containers that don't exist yet but should (bitcoin-knots, lnd, electrumx, archy-bitcoin-ui, archy-lnd-ui, archy-electrs-ui): note them as missing. Write a summary of ALL issues found as a comment at the bottom of this plan file under ## Issue Log. Do not fix anything yet — just diagnose. Mark done when you have a complete picture of every container's state.
C1-APPS — Pull and start every Bitcoin stack app: SSH to .228. For each app in the Bitcoin stack, ensure it exists and is running. Check: (1) podman ps -a --filter name=bitcoin-knots — if missing or stopped, check if the image exists (podman images | grep bitcoin-knots), if not pull it. Start or create the container using the spec from scripts/container-specs.sh. (2) Same for lnd. (3) Same for electrumx. (4) Same for archy-bitcoin-ui, archy-lnd-ui, archy-electrs-ui. After starting each container, immediately read its logs: podman logs <name> --tail 50. Record every error. If a container won't start, record the exact error. If it starts but crashes within 30 seconds, record the crash log. Do not mark done until you have attempted to start ALL 6 containers and recorded the outcome of each.
C1-HEALTH — Deep health check of every running container: SSH to .228. For each running Bitcoin stack container: (1) bitcoin-knots: podman exec bitcoin-knots bitcoin-cli getblockchaininfo 2>&1 — record if RPC works or fails. Check podman logs bitcoin-knots --tail 50 for any warnings/errors. (2) lnd: Check if it connects to Bitcoin backend — podman logs lnd --tail 50 | grep -i 'error\|fail\|disconnect\|unable'. (3) electrumx: Check if it connects to Bitcoin — podman logs electrumx --tail 50 | grep -i 'error\|fail\|disconnect\|unable'. (4) archy-bitcoin-ui: curl -sf http://localhost:8334/ > /dev/null && echo OK || echo FAIL. (5) archy-lnd-ui: curl -sf http://localhost:8081/ > /dev/null && echo OK || echo FAIL. (6) archy-electrs-ui: Find its port (podman port archy-electrs-ui 2>/dev/null || echo 'not running') and curl it. Record EVERY failure. Do not mark done until every container has been health-checked and all results recorded in the Issue Log below.

Cycle 2: Fix Every Issue Found — Redeploy — Retest

C2-FIX — Fix every issue from Cycle 1: Read the Issue Log at the bottom of this file. For EACH issue listed: (1) Read the relevant source code. (2) Understand the root cause. (3) Write a proper, production-quality fix — clean code, proper error handling, no hacks. (4) Commit with fix: description. Address ALL issues — do not cherry-pick. If a fix requires changing Rust code, make the change locally (it will be compiled on .228 during deploy). If a fix requires changing container specs, update scripts/container-specs.sh. If a fix requires changing a Dockerfile, update the relevant docker/*/Dockerfile. If a fix requires changing image versions, update scripts/image-versions.sh. If a fix requires changing nginx configs, update the relevant config file. Do not mark done until every issue from the log has a fix committed.
C2-DEPLOY — Redeploy with all fixes: Run ./scripts/deploy-to-target.sh --target 192.168.1.228. If deploy fails, fix the deploy error and retry. After deploy, SSH to .228 and rebuild any UI containers that changed: cd ~/archy/docker/bitcoin-ui && podman build -t bitcoin-ui:local . && podman stop archy-bitcoin-ui 2>/dev/null; podman rm archy-bitcoin-ui 2>/dev/null — then recreate from spec. Same for lnd-ui and electrs-ui if their Dockerfiles changed. Do not mark done until deploy succeeds and backend health check passes.
C2-RETEST — Test everything again: SSH to .228. Run the EXACT same checks as C1-CONTAINERS, C1-APPS, and C1-HEALTH. For EVERY container: podman ps -a --format 'table {{.Names}}\t{{.State}}\t{{.Status}}'. For every running container, read logs: podman logs <name> --tail 50 | grep -i 'error\|fail\|panic\|crash\|unable\|refused\|timeout'. Curl every UI. Check every RPC endpoint. If ANY new issues are found: fix them right here — edit code, commit, redeploy to .228, and retest. Keep looping (fix → deploy → test) within this single task until ALL containers are running, ALL health checks pass, ALL UIs respond, ALL logs are clean. Do not mark done until: podman ps -a --format '{{.Names}} {{.State}}' | grep -v running returns ZERO non-running containers in the Bitcoin stack, and every curl returns 200, and every log tail has no errors.

Cycle 3: Resilience — Kill Every Container and Verify Recovery

C3-RESTART-BITCOIN — Kill Bitcoin Knots, verify auto-restart: SSH to .228. Run podman stop bitcoin-knots. Wait 15 seconds. Check podman ps --filter name=bitcoin-knots --format '{{.Names}} {{.State}}'. It MUST be running (restarted by restart policy). If not running: (1) Check podman inspect bitcoin-knots --format '{{.HostConfig.RestartPolicy.Name}}' — must be unless-stopped or always. (2) If restart policy is wrong, fix scripts/container-specs.sh, recreate the container with correct policy. (3) Retest until bitcoin-knots auto-restarts after stop. After it restarts, verify RPC works: podman exec bitcoin-knots bitcoin-cli getblockchaininfo. Check logs for crash messages. Loop fix → recreate → kill → verify until it works. Do not mark done until bitcoin-knots survives a stop and auto-restarts within 30 seconds.
C3-RESTART-LND — Kill LND, verify auto-restart: Same process. podman stop lnd. Wait 15 seconds. Verify it auto-restarts. Verify it reconnects to bitcoin-knots (check logs: podman logs lnd --tail 20). If it doesn't restart or can't reconnect: fix, recreate, retest. Loop until it works. Do not mark done until lnd auto-restarts and reconnects to Bitcoin.
C3-RESTART-ELECTRUMX — Kill ElectrumX, verify auto-restart: Same. podman stop electrumx. Wait 15 seconds. Verify auto-restart. Verify it reconnects to bitcoin-knots. Fix → recreate → retest loop. Do not mark done until electrumx auto-restarts and reconnects.
C3-RESTART-UIS — Kill all UI containers, verify auto-restart: podman stop archy-bitcoin-ui archy-lnd-ui archy-electrs-ui. Wait 15 seconds. Run podman ps --format '{{.Names}} {{.State}}' | grep -E 'bitcoin-ui|lnd-ui|electrs-ui' — all three must be running. Curl each UI endpoint — all must return 200. If any doesn't restart: fix restart policy, recreate, retest. Loop until all three survive kill and auto-restart.
C3-CASCADE — Kill Bitcoin, watch everything, restart, verify full recovery: This is the critical test. podman stop bitcoin-knots. Wait 60 seconds. Check LND and ElectrumX: they should either stay running (waiting for Bitcoin) or enter unhealthy/restarting state — NOT crash permanently. Run podman ps -a --format '{{.Names}} {{.State}} {{.Status}}' | grep -E 'bitcoin|lnd|electrumx'. Now start Bitcoin: podman start bitcoin-knots. Wait 120 seconds for Bitcoin RPC to come up. Check ALL containers: podman ps --format '{{.Names}} {{.State}} {{.Status}}' | grep -E 'bitcoin|lnd|electrumx'. ALL must be running. Read logs of each: podman logs lnd --tail 30 and podman logs electrumx --tail 30 — should show reconnection, not permanent failure. If ANY container is stuck in a crash loop or permanently dead: read logs, diagnose root cause, fix the code/config, redeploy, retest the entire cascade. Loop until the full cascade works: stop Bitcoin → dependents survive → restart Bitcoin → everything recovers. Do not mark done until this passes cleanly.
C3-BACKEND-CRASH — Kill Archipelago backend, verify containers survive: sudo systemctl kill -s SIGKILL archipelago. Wait 10 seconds. (1) Check backend restarted: sudo systemctl status archipelago — must be active. (2) Check containers: podman ps --format '{{.Names}} {{.State}}' | grep -E 'bitcoin|lnd|electrumx' — ALL must still be running (containers are independent of backend). (3) Check crash recovery: journalctl -u archipelago --no-pager -n 50 | grep -i crash — should show crash detected. (4) Check health endpoint: curl -s http://127.0.0.1:5678/health — should return JSON. If any of these fail: read full journal logs, find the error, fix the backend code, redeploy, retest. Loop until backend crash recovery works cleanly.

Cycle 4: Full Retest — Deploy Clean, Test Everything, Zero Failures

C4-CLEAN-DEPLOY — Fresh deploy with all accumulated fixes: Run ./scripts/deploy-to-target.sh --target 192.168.1.228. Rebuild UI containers on .228 if any Dockerfiles changed. Restart backend: sudo systemctl restart archipelago. Wait 30 seconds. This is the "clean slate" deploy with everything fixed from previous cycles.

C4-FULL-TEST — Complete test suite, fix anything that fails, loop until perfect: SSH to .228. Run EVERY check below. If ANY fails, fix → redeploy → rerun ALL checks. Repeat until every single line passes:

Container state (all must show running):

podman ps -a --format '{{.Names}} {{.State}}' | grep -E 'bitcoin-knots|lnd|electrumx|bitcoin-ui|lnd-ui|electrs-ui'

Container health (none should show unhealthy):

podman ps --format '{{.Names}} {{.Status}}' | grep -E 'bitcoin-knots|lnd|electrumx'

Bitcoin RPC (must return JSON with blockheight):

podman exec bitcoin-knots bitcoin-cli getblockchaininfo 2>&1 | head -5

LND connection (must show no errors):

podman logs lnd --tail 30 2>&1 | grep -i 'error\|fail\|unable\|refused' | head -10

ElectrumX connection (must show no errors):

podman logs electrumx --tail 30 2>&1 | grep -i 'error\|fail\|unable\|refused' | head -10

UI endpoints (all must return HTTP 200):

curl -sf http://localhost:8334/ > /dev/null && echo "bitcoin-ui OK" || echo "bitcoin-ui FAIL"
curl -sf http://localhost:8081/ > /dev/null && echo "lnd-ui OK" || echo "lnd-ui FAIL"

For electrs-ui, find port: podman port archy-electrs-ui 2>/dev/null

Backend health (must return JSON):

curl -s http://127.0.0.1:5678/health

Restart policies (all must be unless-stopped or always):

for c in bitcoin-knots lnd electrumx archy-bitcoin-ui archy-lnd-ui archy-electrs-ui; do
  echo "$c: $(podman inspect $c --format '{{.HostConfig.RestartPolicy.Name}}' 2>/dev/null || echo 'NOT FOUND')"
done

Memory limits (all must show non-zero):

for c in bitcoin-knots lnd electrumx archy-bitcoin-ui archy-lnd-ui archy-electrs-ui; do
  echo "$c: $(podman inspect $c --format '{{.HostConfig.Memory}}' 2>/dev/null || echo 'NOT FOUND')"
done

Clean logs (zero errors in last 30 lines of each):

for c in bitcoin-knots lnd electrumx; do
  echo "=== $c ==="
  podman logs $c --tail 30 2>&1 | grep -i 'error\|panic\|fatal\|crash' | head -5
done

Kill-restart test (all must auto-restart):

podman stop bitcoin-knots && sleep 20 && podman ps --filter name=bitcoin-knots --format '{{.State}}'
podman stop lnd && sleep 20 && podman ps --filter name=lnd --format '{{.State}}'
podman stop electrumx && sleep 20 && podman ps --filter name=electrumx --format '{{.State}}'

IF ANY CHECK FAILS: Read the logs, find the root cause, fix the code properly (clean, well-structured, typed, following CLAUDE.md), commit with fix: prefix, redeploy to .228, and run ALL checks again from the top. Keep looping. Do not mark done until EVERY SINGLE CHECK above passes in a single clean run with zero failures.

Cycle 5: Soak — Let It Run, Watch for Drift

C5-SOAK — Wait 5 minutes, recheck everything: SSH to .228. Wait 5 minutes (sleep 300). Then rerun every check from C4-FULL-TEST. Containers that pass immediately but fail after 5 minutes have stability issues (memory leaks, connection timeouts, health check flaps). If ANYTHING changed state or went unhealthy during the 5-minute window: read logs (podman logs <name> --since 5m), find the issue, fix it, redeploy, wait 5 minutes again, recheck. Loop until everything stays healthy for a full 5-minute soak. Do not mark done until a clean 5-minute soak passes with zero state changes.
C5-FINAL — Record final state: SSH to .228. Run and paste output of: (1) podman ps -a --format 'table {{.Names}}\t{{.State}}\t{{.Status}}' (2) curl -s http://127.0.0.1:5678/health (3) for c in bitcoin-knots lnd electrumx; do echo "=== $c ==="; podman logs $c --tail 5 2>&1; done. Record this as the final passing state in the Issue Log at the bottom of this file. Mark the overall result: PASS or note any accepted limitations. Do not mark done until the final state is recorded.

Cycle 6: Code Quality Gate

C6-QUALITY — Verify all code changes meet production standards: Review every commit made during this overnight run. For each changed file: (1) Rust files: grep -n 'unwrap()\|expect(' <file> | grep -v test | grep -v 'unwrap_or\|unwrap_err' — zero results. grep -n 'TODO\|FIXME\|HACK' <file> — zero results. (2) TypeScript/Vue files: cd neode-ui && npx vue-tsc -b --noEmit — zero errors. (3) Shell scripts: bash -n <file> — syntax OK for every changed script. (4) No hardcoded credentials, no :latest tags, no sudo podman. If ANY quality issue is found: fix it properly, commit, redeploy, and rerun the relevant tests from C4-FULL-TEST to confirm the quality fix didn't break anything. Do not mark done until all code is production-quality AND all tests still pass.

Issue Log

Cycle 1 Findings (2026-03-30 21:03 UTC)

Bitcoin Stack Issues:

electrumx — EXITED (0), unhealthy
- Error: plyvel._plyvel.IOError: b'IO error: utxo/LOCK: Permission denied'
- Volume /var/lib/archipelago/electrumx → /data owned by 100000:100000 (correct for container root)
- Container runs as root, --read-only=false, restart policy unless-stopped
- Root cause: Stale LOCK file from prior crash OR container user mismatch. Need to investigate further.
lnd — RUNNING but UNHEALTHY
- Health check: curl -sf --insecure https://localhost:8080/v1/getinfo — fails with "expected 1 macaroon, got 0"
- LND itself is functioning: gossip syncing, peer connections active, no critical errors
- Root cause: Health check needs macaroon auth. The health check command is wrong.
- Also: Some Tor SOCKS connection refused errors (transient, non-critical)
bitcoin-knots — RUNNING, HEALTHY ✅
- Uses rpcauth (not rpcuser/rpcpassword). bitcoin-cli exec needs cookie or rpcuser auth.
- Port 8332-8333 mapped correctly.
archy-bitcoin-ui — RUNNING ✅
- Host network mode, nginx proxies on port 8334. Curl OK.
archy-lnd-ui — RUNNING ✅
- Port 8081->80. Curl OK.
archy-electrs-ui — RUNNING ✅
- Host network mode, no direct port mapping visible. Served via nginx.

Non-Bitcoin Stack Issues (lower priority):

grafana — EXITED (1), unhealthy
- Error: unable to open database file: permission denied / GF_PATHS_DATA is not writable
- Container has --read-only rootfs. Volume perms correct (100472:100472).
- Likely needs tmpfs mounts for /tmp and /var/log/grafana.
nextcloud — EXITED (1)
- Data version 29.0.16.1 > image version 28.0.14.1. Cannot downgrade. Image needs upgrade.
homeassistant — RUNNING, UNHEALTHY (not in Bitcoin stack scope)
searxng — RUNNING, UNHEALTHY (not in Bitcoin stack scope)
onlyoffice — RUNNING, UNHEALTHY (not in Bitcoin stack scope)
fedimint — CREATED (never started, not in scope)

All restart policies: unless-stopped ✅ All memory limits: Set for all 6 Bitcoin stack containers ✅

Health Check Results (C1-HEALTH)

Container	Status	Health	Details
bitcoin-knots	running	healthy	RPC OK, blocks=942975, fully synced
lnd	running	unhealthy	Health check needs macaroon. LND itself works (gossip syncing, peers connected). Only gossip noise errors.
electrumx	crash-loop	unhealthy	130+ restarts, `utxo/LOCK: Permission denied` — `--cap-drop=ALL` with empty `SPEC_CAPS` removes `DAC_OVERRIDE` needed for rootless volume writes
archy-bitcoin-ui	running	n/a	Curl OK via nginx :8334
archy-lnd-ui	running	n/a	Curl OK on :8081
archy-electrs-ui	running	n/a	Host network, no direct port (served via nginx)

Root causes fixed in Cycle 2:

✅ electrumx SPEC_CAPS="" → added DAC_OVERRIDE
✅ lnd health check → replaced curl with lncli using readonly macaroon
✅ grafana SPEC_CAPS → added DAC_OVERRIDE
✅ electrumx health check → replaced missing curl with python3 socket check
✅ container-doctor conmon cleanup → fixed root/rootless podman mismatch (was killing active conmon)
✅ container-doctor restart → added stopped core container recovery for rootless restart policy workaround

Final State (2026-03-30 22:33 UTC) — PASS

Container	State	Health	Notes
bitcoin-knots	running	healthy	Block 942982, 13 peers
lnd	running	healthy	Gossip syncing, peer connections active
electrumx	running	healthy	Caught up to daemon, accepting connections
archy-bitcoin-ui	running	n/a	Curl OK on :8334
archy-lnd-ui	running	n/a	Curl OK on :8081
archy-electrs-ui	running	n/a	Curl OK on :50002
grafana	running	healthy

Backend: {"status":"ok","crash_recovery_complete":true,"version":"1.2.0-alpha","uptime_seconds":1063}

Resilience tests passed:

Kill bitcoin-knots → LND/ElectrumX survive, Bitcoin auto-restarts, dependents reconnect
Kill LND → auto-restarts, reconnects to Bitcoin
Kill ElectrumX → auto-restarts, reconnects to Bitcoin
Kill all UI containers → all auto-restart within 30s
Kill backend (SIGKILL) → systemd restarts, crash recovery runs, all containers unaffected
5-minute soak → zero state changes, zero critical errors

Fixed this session:

UI container specs: added CHOWN/SETUID/SETGID caps (nginx chown failure), NET_BIND_SERVICE for lnd-ui (port 80 bind)

Known limitation: Rootless Podman unless-stopped restart policy does not auto-restart containers after podman stop. Recovery relies on the backend health monitor + reconcile-containers.sh (runs on boot and periodically).

20 KiB Raw Blame History