From 8355453a7ec6545f9261bae8f90988c04ff2001a Mon Sep 17 00:00:00 2001
From: archipelago
Date: Mon, 22 Jun 2026 17:22:29 -0400
Subject: [PATCH] docs: exact cutoff-proof resume in master-plan SS8b (resume from any device)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Captures: .228 1x-GREEN (110/110); hardened 5x DETACHED on .228 (/tmp/gate-5x2.log, nohup — survives terminal close) with the exact check-from-any-machine command; all shipped code fixes (commits) + deploy state (.228 + .198); node-state fixes NOT in repo (lnd nginx proxy 8081->18083, home-assistant orphan unit removed, electrumx re-registered); the run-ON-the-node lesson; and remaining work.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/PRODUCTION-MASTER-PLAN.md | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index 08021b2f..44d17c91 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -5,7 +5,7 @@
 > supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
 > the priority banner and demote this doc.
 >
-> Last updated: 2026-06-22 · Binary: v1.7.99-alpha · See §8b for the live resume.
+> Last updated: 2026-06-22 (evening) · .228 gate 1×-GREEN; hardened 5× running on .228 (see §8b CURRENT STATE — resume from any device).
 
 ---
 
@@ -164,7 +164,55 @@ hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.
 Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash phases 2–6 (`dual-ecash-design.md`).
 
-## 8b. SESSION STATE + RESUME (updated 2026-06-22) — READ THIS FIRST ON RESUME
+## 8b. SESSION STATE + RESUME (updated 2026-06-22 evening) — READ §8b "CURRENT STATE + RESUME" FIRST
+
+### ▶ CURRENT STATE + RESUME (2026-06-22 evening) — RESUME FROM HERE (works from any device)
+
+**Headline:** the production gate's `package.stop` blocker is **FIXED**; **`.228` is 1×-GREEN
+(110/110)**; a **hardened 5× run is IN PROGRESS on `.228`** (the single-node exit criterion). The
+gate is now single-node (.228); multinode is split out (`docs/multinode-testing-plan.md`).
+
+**THE 5× RUN IS DETACHED ON .228 — survives terminal/session close. Check it from any machine:**
+```
+sshpass -p archipelago ssh archipelago@192.168.1.228 \
+  'grep -E "iteration [0-9]+: (PASS|FAIL)|RESULTS|passed:|failed:" /tmp/gate-5x2.log; \
+   echo "running pid: $(pgrep -f run-20x.sh$ || echo DONE)"; grep "^not ok" /tmp/gate-5x2.log | sort -u'
+```
+- Log: `/tmp/gate-5x2.log` on .228 · launched `nohup` (pid was 4042141) · `ARCHY_ITERATIONS=5
+  ARCHY_ALLOW_DESTRUCTIVE=1`, run **ON the node** from `/tmp/lifecycle-run/tests/lifecycle`
+  (ARCHY_HOST=127.0.0.1). `bats` 1.11.1 + static `jq` 1.7.1 are installed on .228 for this.
+- **If all 5 iterations PASS → .228 has met the single-node criterion → demote the banner.**
+- If it flakes again: it'll be readiness-under-churn (lnd/mempool); the hardening (commit `98f4fa44`:
+  inter-iteration `settle_stack()` + 180–240s readiness windows) targets exactly that. Re-copy the
+  repo `tests/lifecycle` to /tmp/lifecycle-run and re-launch.
+
+**Code fixes shipped this session (all on `main`, built + DEPLOYED to .228 AND .198):**
+- `2dad64b2` stop honours per-app grace (was `-t 30` deadline racing SIGKILL).
+- `760a32bc` reconciler stops resurrecting user-stopped apps (dep-override + host-port watchdog).
+- `6e49ce6f` container-list reports user-stopped apps as `stopped` despite a live UI companion.
+- `452f05d8` companion self-heal on its own ~30s loop (was gated behind the slow per-app pass).
+- Test-harness hardening: `88930558` `53b8e47f` `892ff083` `98f4fa44` (readiness retries, immich/
+  fedimint/NPM/lnd windows, inter-iteration settle). Binary built on .116 at
+  `core/target/release/archipelago` (4-fix); deploy = stop archipelago, cp to /usr/local/bin, start.
+
+**NODE-STATE fixes on .228 NOT in the repo (re-apply if .228 is reset/reimaged):**
+- nginx `/app/lnd/` proxy target was stale `8081` → fixed to `18083` (sed in
+  /etc/nginx/sites-{available,enabled}/archipelago + snippets, then `nginx -s reload`). Repo code is
+  correct (18083); old node config was stale.
+- Removed a stale orphan `~/.config/containers/systemd/home-assistant.container` (ContainerName
+  `home-assistant` ≠ the real `homeassistant` container; it was stuck "activating"). Real app fine.
+- electrumx was re-installed (`package.install` w/ image `146.59.87.168:3000/lfg2025/electrumx:v1.18.0`)
+  to re-register it as a tracked manifest app (it had been adopted as a plain-podman container).
+
+**KEY LESSON:** run the lifecycle gate **ON the node**, not via RPC from .116 — its bitcoin/companion/
+orphan/endpoint tests use local `podman`/`systemctl`/`bitcoin-cli`/`curl`, so a remote run silently
+tests the *runner* (this is why earlier runs from .116 falsely showed "bitcoin in IBD" etc.).
+
+**Remaining (after 5× green):** netbird migration (#20 ph4 — the one real migration left) + btcpay/
+mempool stack polish; Phase-3 `use_quadlet_backends`; B flip-on (EMBED_MANIFESTS+sign); per-app test
+coverage (~30 apps unwritten); the mobile app-launch UX (§8 Roadmap P1). Multinode → its own plan.
+
+---
 
 ### Where we are — Task #20 (manifest lifecycle hooks) + indeedhub migration: DONE & 2-node verified