From 3515344800cb5a2d8b0333165e0165013317d9f1 Mon Sep 17 00:00:00 2001
From: archipelago
Date: Fri, 26 Jun 2026 03:41:59 -0400
Subject: [PATCH] =?UTF-8?q?docs(master-plan):=20session=20h=20=E2=80=94=20?=
 =?UTF-8?q?zombie=20guard=20+=20gitea=20launch-port=20fix?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Banner + §8b: zombie-container guard (0a8db904, live-proven on .228) and
gitea launch-port fix (670ebb06) shipped in binary 040df5ce, rolled to the
fleet. Logs the mempool env-drift recreate-loop and nostr-rs-relay follow-ups.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/PRODUCTION-MASTER-PLAN.md | 99 ++++++++++++++++++++++++++++++++--
 1 file changed, 96 insertions(+), 3 deletions(-)

diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index 60b919dd..e4aa3ae0 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -7,7 +7,7 @@
 > in §6 / §8b. Next exit-criteria: multinode (`docs/multinode-testing-plan.md`) +
 > workstreams B/C/D.
 >
-> Last updated: 2026-06-23 · **.228 gate 5×-GREEN (110/110 ×5, 0 not-ok)** — exit criterion met (see §8b).
+> Last updated: 2026-06-26 · zombie-container guard + gitea launch-port fix shipped, binary `040df5ce` rolled to the fleet (see §8b SESSION h). Prior: orchestrator Fix A+B (`a721532f`/`e0343137`) deployed + proven.
 
 ---
 
@@ -243,9 +243,102 @@ hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.
 Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash phases
 2–6 (`dual-ecash-design.md`).
 
-## 8b. SESSION STATE + RESUME (updated 2026-06-23) — READ §8b "CURRENT STATE + RESUME" FIRST
+## 8b. SESSION STATE + RESUME (updated 2026-06-26) — READ §8b "CURRENT STATE + RESUME" FIRST
 
-### ▶ SESSION b (2026-06-23 PM) — LATEST, RESUME FROM HERE
+### ▶ SESSION h (2026-06-26) — LATEST, RESUME FROM HERE
+
+**Canonical resume detail: memory `project_session_resume_2026_06_23b` (▶️ top of MEMORY.md).**
+Local main = `670ebb06` (3 commits past the previously-pushed `43e70049`: `0a8db904` zombie
+guard + `670ebb06` gitea launch-port fix; `43e70049` webview was already pushed). **Combined
+release binary `040df5ce2551d17b` rolled to the fleet.** Binary+FE not in git — rebuild on a
+fresh machine (`cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`).
+
+**DONE this session:**
+1. ✅ **Zombie-container guard** (`0a8db904`) — the reconciler's Running branch now verifies a
+   container's `State.Pid` is alive (`/proc/` exists) before trusting podman's "Up"; on a
+   concrete dead PID it stop+remove+`install_fresh` from the manifest. Conservative: any
+   uncertainty (inspect fail / unparseable PID) assumes alive, so a transient hiccup never
+   destroys a healthy container. Fixes the class that broke NetBird login on .228 (dashboard
+   "Up" w/ dead PID → proxy 502, no host port → reconciler never recovered it). Unit test +
+   **live-proven on .228**: synthetic zombie on `jellyfin` (killed conmon+PID → podman still
+   "Up") → guard logged `…process is dead (zombie) — recreating app_id=jellyfin` → recreated →
+   settled to NoOp. **Zero false-positives across the other 33 healthy containers.**
+2. ✅ **Gitea launch-port fix** (`670ebb06`) — gitea launched at **:2222 (SSH)** instead of
+   **:3001 (web)** on nodes without the gitea manifest on disk (`manifest_lan_address_for`
+   returns None → fell through to `extract_lan_address`, which returns podman's first-listed
+   port; podman lists `2222->22` before `3001->3000`). Added `"gitea" => http://localhost:3001`
+   to the static `lan_address_for` map (`core/container/src/podman_client.rs`) like every other
+   core app. Reported on tailscale node **100.82.34.38** — that node still needs the new binary
+   (or a refreshed gitea manifest) to pick it up.
+3. ✅ **Rolled `040df5ce`** to .228/.116/.198/.89 (verified sha+active); .88/.5/.120 rolling.
+
+**OPEN follow-ups (logged, NOT regressions):**
+- **mempool env-drift recreate-loop on .228** — reconciler logs `container env drift detected —
+  recreating app_id=mempool` every ~30-90s, never converges (pre-existing; the known mempool
+  nginx stale-IP class, [[project_mempool_nginx_stale_ip_fix]]). mempool stays running but churns.
+- **nostr-rs-relay** stuck "Stopping" + ~2s create-loop on .228 (from session g).
+
+**NEXT:** finish .88/.5/.120 roll → push main to gitea-vps2 → Phase-3 quadlet / Workstream F /
+multinode. SSH/sudo pw `ThisIsWeb54321@` (**.88 = `ThisIsWeb54321!`**); UI/RPC .228/.198 =
+`ThisIsWeb54321@`. Reusable tooling in scratchpad: `deploy-bin.sh`/`remote-apply.sh` (EXPECT_SHA
+= `040df5ce…`), `rpc.sh`.
+
+---
+
+### ▶ SESSION g (2026-06-25) — earlier, historical
+
+**Canonical resume detail: memory `project_session_resume_2026_06_23b` + `project_netbird_ph4_legacy_deletion_map` + `project_workstream_f_lifecycle_perfection`.**
+`gitea-vps2/main = a721532f` (pushed). **Local main = `89d397bb`** (2 new commits this session, NOT pushed/deployed: `41e7f500` harness tolerance + `89d397bb` netbird ph4 legacy delete). Binary+FE are NOT in git — rebuild on a fresh machine.
+
+**TL;DR (SESSION g, 2026-06-25) — everything below DONE this session:**
+1. ✅ **Rolled** `e0343137` + fresh FE (`index-a75rd6Hy.js`) to **7 nodes** (.116/.198/.228/.89/.88/.5/.120), all verified. **.15 SKIPPED** (auth rejected — creds don't match).
+2. ✅ **Harness tolerance fixes COMMITTED** `41e7f500` (run-gate settle/immich + immich.bats 90s + mempool.bats poll).
+3. ✅ **mempool RESOLVED** fleet-wide — see mempool note below.
+4. ✅ **netbird #20 ph4 DONE** — legacy Rust installer DELETED, committed `89d397bb` (492 lines gone, manifest-driven only, `cargo check` clean). Release binary BUILDING for the .228 live-verify (build left running — check after).
+
+**NEXT (resume here):** (a) check the release build, deploy the `89d397bb` binary to .228, live-verify netbird adopts via manifest (https:8087→200, no `bail!`); (b) roll `89d397bb` to the rest of the fleet (behavior-neutral — manifest path already executed); (c) **push local main → gitea-vps2** (2 commits ahead); then **Phase-3 `use_quadlet_backends` → Workstream F → multinode**.
+
+**ROLL RESULTS (2026-06-25, binary `e0343137b99bf066` + fresh FE bundled):**
+| Node | Result |
+|------|--------|
+| .228 | ✅ already on `e0343137` (prior session, binary-only) |
+| .116 (local) | ✅ binary + fresh FE; 36 containers survived restart; UI 200; `index-a75rd6Hy.js` live |
+| .198 (LAN) | ✅ binary + fresh FE; 38 containers up; UI 200 |
+| .89 (100.89.209.89) | ✅ binary + fresh FE; service active |
+| .88 (100.70.96.88, pw `ThisIsWeb54321!`) | ✅ binary + fresh FE; service active |
+| .5 (100.72.136.5) | ⏳ attempted — see resume note (cellular x250) |
+| .120 (100.66.157.120) | ⏳ attempted — see resume note (cellular x250) |
+| .15 (100.64.83.15, archy-dev-pa) | ❌ SKIPPED — `archipelago@` + `ThisIsWeb54321@` rejected (`Permission denied (publickey,password)`); node creds unknown |
+
+Deploy tooling (reusable): scratchpad `deploy-bin.sh