From f4727bfdb3f76279b75bd92395d0877796405685 Mon Sep 17 00:00:00 2001 From: archipelago Date: Mon, 22 Jun 2026 13:44:57 -0400 Subject: [PATCH] docs(gate): companion self-heal fix validated (10s) + test-31 harness caveat MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Independent companion loop (452f05d8) validated on .228: deleted archy-electrs-ui recreates in ~10s (was stuck 100s+). Also: companion-survives bats does LOCAL rm/systemctl --user, so running it from .116 via RPC tests .116's companions with .116's binary, NOT the remote target — must run ON the target node. Explains the 'failed on both nodes' runs (both silently tested .116). Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/PRODUCTION-MASTER-PLAN.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md index 8c5b905f..54362091 100644 --- a/docs/PRODUCTION-MASTER-PLAN.md +++ b/docs/PRODUCTION-MASTER-PLAN.md @@ -290,11 +290,15 @@ recreate (both nodes — likely the 90s window vs reconcile tick + image step; i **EVERY gate failure is now FIXED or explained — NO lifecycle code bugs remain.** Final read: - ✅ `package.stop` (the blocker): 3 bugs fixed (`2dad64b2`/`760a32bc`/`6e49ce6f`), green both nodes. - **bitcoin-IBD cascade** (most of .198's red): environmental — bitcoin syncing (test 83 precondition). -- **test 31** companion-recreate: NOT a product bug — (a) contamination (electrumx not in manifest_ids, - fixed by reinstall) + (b) on loaded .228 the reconcile cadence is ~108s (orchestrator pass over 45 - apps is slow) which exceeds the test's 90s window. Companion self-heal IS functional, just slower - than the window on a pathologically-loaded node. *Optional product improvement:* run the companion - reconcile stage on its own ~30s timer instead of gated behind the slow orchestrator pass. +- **test 31** companion-recreate: NOT a product bug. Two things: (a) **FIXED** — the companion + reconcile stage was gated behind the slow per-app pass; now it runs on its own ~30s loop + (`452f05d8`). Validated on .228 with the new binary: a deleted `archy-electrs-ui` unit self-heals + in **~10s** (was stuck 100s+), journal: `companion not active, repairing → wrote quadlet unit → + companion started`. (b) **HARNESS CAVEAT** — the companion-survives bats does LOCAL `rm`/`systemctl + --user` (no ssh), so running the gate from .116 against a remote node actually tests **.116's** + companions with **.116's** (old) binary, not the RPC target. ⇒ the companion-survives suite must be + run ON the target node (or with the new binary on .116) to be meaningful. This explains the + "failed on both nodes" runs — both were silently testing .116. - **test 55** immich restart: NOT a bug — the heavy 3-container stack (postgres+redis+server) restarts in >120s under load; immich DOES return to running. *Optional:* bump the immich restart wait. - **test 44** fedimint orphan: my probe pollution; a teardown clears it.