From d6fa262d69b0a59abddb673e95a1e22ee16dc52a Mon Sep 17 00:00:00 2001 From: archipelago Date: Mon, 22 Jun 2026 04:23:52 -0400 Subject: [PATCH] =?UTF-8?q?docs(#20):=20consolidate=20master-plan=20resume?= =?UTF-8?q?=20=E2=80=94=20indeedhub=20migration=202-node=20verified=20(.22?= =?UTF-8?q?8+.198);=20cutoff-proof=20next-steps=20+=20deploy=20facts?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/PRODUCTION-MASTER-PLAN.md | 201 ++++++++++++++------------------- 1 file changed, 85 insertions(+), 116 deletions(-) diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md index dbab3344..4dbe136d 100644 --- a/docs/PRODUCTION-MASTER-PLAN.md +++ b/docs/PRODUCTION-MASTER-PLAN.md @@ -5,7 +5,7 @@ > supersedes all prior roadmap/handoff/status docs. When the gate passes, remove > the priority banner and demote this doc. > -> Last updated: 2026-06-21 · Binary: v1.7.99-alpha +> Last updated: 2026-06-22 · Binary: v1.7.99-alpha · See §8b for the live resume. --- @@ -148,126 +148,95 @@ hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan. Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash phases 2–6 (`dual-ecash-design.md`). -## 8b. SESSION STATE + RESUME (2026-06-21, live) +## 8b. SESSION STATE + RESUME (updated 2026-06-22) — READ THIS FIRST ON RESUME -**Landed + committed on main this session (newest first):** -- **#20 phase 3 — ADOPTION PATH LIVE-VERIFIED on .228 (2026-06-21).** Built - v1.7.99-alpha, sideloaded binary + 7 manifests, restarted (stop/replace/start — - containers survived via --restart unless-stopped + podman-restart.service). RPC - `package.install indeedhub` → `complete`, orchestrator-first path adopted all 7 - members (`reconcile action app_id=indeedhub-* action=NoOp`), containers stayed - **Up 4 days (NOT recreated)** — zero data/credential disruption. UI green: - frontend :7778 → 200, nostr-provider.js → 200, **/api/ → 200 (proves - network_aliases: frontend nginx `http://api:4000` resolved on indeedhub-net)**. - Fleet healthy (36 containers, none down). - **FRESH-CREATE PATH = FIXED + VERIFIED (2026-06-21).** Deleted the legacy - indeedhub orchestrator special-cases (`b73084db`, −382 lines: reconcile_indeedhub_stack, - start_indeedhub_backends, the 120s dependency-DNS gate, patch_indeedhub_nostr_provider, - etc.) so "indeedhub" flows through the generic install_fresh path. Then two live fixes - on .228: (1) frontend nginx needs `capabilities: [CHOWN,DAC_OVERRIDE,SETGID,SETUID]` - under the orchestrator's --cap-drop=ALL (workers died "setgid(101) failed"); manifest - fix `ff8f11b8`. (2) NOTE: manifest reload needs an archipelago restart (manifests - cached at startup) — a disk manifest edit alone won't take. RESULT: frontend - fresh-creates via install_fresh, caps applied, post_install hook FIRES - (copy_from_host nostr-provider.js ✅), UI 200 (/, /nostr-provider.js, /api/). - **HOOK EXEC GAP = FIXED + VERIFIED (`ff78b312`).** The post_install `exec` steps - used to fail via `podman exec` from the archipelago.service systemd cgroup - (`crun: write cgroup.procs: Permission denied`). Fixed by wrapping the hook - executor's `exec` in `systemd-run --user --scope --quiet --collect podman exec …` - (its own delegated cgroup; copy_from_host stays a direct `cp`). Verified on .228: - all 4 post_install steps now log `ok` (sed X-Frame, copy nostr-provider.js, inject - script, nginx reload), frontend serves, UI 200. The #20 hook capability is now fully - functional (exec + copy_from_host) on orchestrator-created containers. +### Where we are — Task #20 (manifest lifecycle hooks) + indeedhub migration: DONE & 2-node verified - PRIOR (now resolved) — was: **FRESH-CREATE PATH = BLOCKED (found live 2026-06-21).** Removed the stateless - frontend + reinstalled to exercise install_fresh → it FAILED: - `orchestrator stack install indeedhub failed at app indeedhub: IndeedHub - dependencies were not ready within 120s (indeedhub-api dependency DNS not ready)`, - and the frontend was left down. Recovered manually on .228 (podman run w/ alias - indeedhub on indeedhub-net; UI 200). ROOT CAUSE = hardcoded indeedhub orchestrator - special-cases that predate + conflict with the manifest path: - - prod_orchestrator `ensure_running` ~L1377: `app_id=="indeedhub"` → - `reconcile_indeedhub_stack`, which REFUSES manifest creation when the frontend - is absent (returns Left("stack-managed")). - - `run_pre_start_hooks("indeedhub")` ~L2324 → `start_indeedhub_backends` → - `wait_for_indeedhub_dependencies_ready(120)` — the gate that blocked install_fresh - (`indeedhub_api_dependency_dns_ready` returns false while the frontend's own alias - is absent + a getent transiently fails). - - also `repair_indeedhub_network_aliases`, `patch_indeedhub_nostr_provider`, the - "frontend did not stay reachable; restart" path (~L2474), `INDEEDHUB_BACKEND_*` - consts, and a crash_recovery.rs indeedhub special-case. - **FIX (next, its own build/deploy/test cycle):** delete these special-cases now - that the manifest carries dependencies/network_aliases/post_install — route - "indeedhub" through the GENERIC install_fresh + reconcile path so the frontend - fresh-creates normally (hook fires). Then re-run the destructive lifecycle on .228 - (frontend recreate must succeed + run the hook), then .198, then the gate. - NOTE: .228 currently runs v1.7.99-alpha (these special-cases still present) — the - running stack is fine (adoption NoOp); only a frontend-absent event re-triggers the - bug, and the frontend is up. -- `b1eea8c0` indeedhub (#20) **phase 3 — CODE COMPLETE, unit-tested.** 7 manifests (apps/indeedhub-{postgres,redis,minio,relay,api, - ffmpeg} + apps/indeedhub frontend) + install_indeedhub_stack orchestrator-first - (immich pattern). Data-preserving by construction = ADOPTION on .228: exact live - hyphen container names, named volumes indeedhub-*-data, dedicated indeedhub-net + - network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse live - /var/lib/archipelago/secrets values (ensure_one no-ops on existing). Frontend - carries the post_install nginx hook (replaces patch_indeedhub_nostr_provider; - defensive since indeedhub:1.0.0 already bakes it). .228 GROUND TRUTH captured: - 7 containers Up, volumes indeedhub-{postgres,redis,minio,relay}-data, network - indeedhub-net; frontend nginx upstreams api:4000/minio:9000/relay:8080; image - already bakes X-Frame strip + nostr-provider.js (6347B) + sub_filter. - **NEXT = live verify on .228:** build+sideload binary, restart, package.install - indeedhub → expect adoption (NoOp, no data touch), then full lifecycle. Risk: - service restart SIGKILL-cascade if Quadlet not fully shipped on .228. -- `b94b61f6` `network_aliases` manifest field (ContainerConfig) + podman_client & - quadlet rendering + DNS-label validation; also fixed 4 pre-existing from_manifest - test failures (network_policy: archy-net invalid; bind sources outside - /var/lib/archipelago). Enables indeedhub's short aliases on indeedhub-net. -- `955c54b7` hook capability (#20) **phase 2** — `container::hooks::run_post_install` - executor (podman exec + copy_from_host w/ allowlist canonicalise + symlink-escape - prefix check; best-effort/idempotent) wired into `install_fresh` after container - is up (fresh-container-only). 5 unit tests; `cargo test -p archipelago` green. -- `4c1a4e59` hook capability (#20) **phase 1** — `LifecycleHooks`/`HookStep`/`HostCopy` - schema + validate() + re-exports + 3 schema tests; also fixed 3 pre-existing - `ContainerConfig` test literals missing `generated_secrets` (container crate now - compiles; `cargo test -p archipelago-container` green, 53 pass). -- `f0c6b79d` immich containers named underscore (immich_server/_postgres/_redis) to - match runtime lifecycle code — fixes package.stop/start/restart. **immich fully - migrated + verified on .228** (manifest-driven stack via orchestrator). -- `b0b54a96` immich lifecycle bats suite (tests/lifecycle/bats/immich.bats). -- `d5ef4573`/`9e6c5370`/`011081d1` immich migration (rename→immich, orchestrator-first). -- `f160e0c4` podman-restart.service enabled at startup (reboot-survival). -- `0860dfac` Services-tab UI (backends→Services, parent icons, categories sub-nav, swipe). -- `220666d3`/`7bfbe8fe` registry-manifest infra phases 1+2 (consume + EMBED_MANIFESTS publish). -- `192238cb` docs consolidation 56→28 + CLAUDE.md. -- `03a4ee1b` generated-secrets system + companion/quadlet fixes. +Manifest-driven lifecycle hooks + the IndeedHub stack migration are **complete and +live-verified on BOTH .228 and .198** (adoption + fresh-create + post_install hook +exec, stable under load). 15 commits this session: `4c1a4e59`..`e2a012d0`. Working +tree clean. The release lifecycle gate is temporarily **5×** (was 20×; `ARCHY_ITERATIONS=5`). -**DONE — hook capability (#20), phases 1+2 (schema + executor + wiring):** -controlled post-install hooks so indeedhub/netbird can migrate. Design: -`docs/manifest-hooks-design.md`. Schema, validate(), executor, and install-path -wiring all landed + green (commits `4c1a4e59`/`955c54b7` above). Remaining #20 -phases: 3 = indeedhub migration (NEXT, below); 4 = netbird; 5 = `pre_start` hooks -(type exists, NOT yet executed — wire into `prepare_for_start` if/when needed). +**Shipped (all on `main`, newest first):** +- `e2a012d0` indeedhub frontend health → `tcp:7777` (was http GET `/`; the http check + false-failed under load and the reconciler churned the frontend — fixed). +- `ff78b312` hook `exec` runs in a transient user scope + (`systemd-run --user --scope --quiet --collect podman exec …`) — fixes + "crun: write cgroup.procs: Permission denied" when exec'ing from archipelago.service. +- `ff8f11b8` indeedhub frontend caps `[CHOWN,DAC_OVERRIDE,SETGID,SETUID]` — nginx + workers died "setgid(101) failed" under the orchestrator's `--cap-drop=ALL`. +- `b73084db` DELETED the legacy indeedhub orchestrator special-cases (−382 lines: + reconcile_indeedhub_stack, start_indeedhub_backends, the 120s dependency-DNS gate, + patch_indeedhub_nostr_provider, repair_indeedhub_network_aliases, INDEEDHUB_* consts) + → "indeedhub" now uses the GENERIC install_fresh/reconcile path. +- `b1eea8c0` 7 indeedhub manifests (apps/indeedhub{,-postgres,-redis,-minio,-relay,-api, + -ffmpeg}) + `install_indeedhub_stack` orchestrator-first (immich pattern). +- `b94b61f6` `network_aliases` ContainerConfig field (podman_client + quadlet rendering, + DNS-label validated) — lets the frontend nginx reach `api:4000`/`minio:9000`/`relay:8080` + on the dedicated `indeedhub-net`. +- `955c54b7`/`4c1a4e59` #20 hooks phases 1-2: schema (LifecycleHooks/HookStep/HostCopy in + archipelago-container::manifest) + executor `container::hooks::run_post_install` + (allowlist-canonicalised copy_from_host + scoped exec), wired into `install_fresh`. +- `84031e62` gate 20×→5× (docs only: CLAUDE.md, this file, tests/lifecycle/TESTING.md). -**NEXT — #20 phase 3, indeedhub migration:** author 7 member manifests -(postgres/redis/minio/relay/api/ffmpeg + frontend) on archy-net with container-name -hostnames; frontend carries the `post_install` hook (strip X-Frame-Options, copy -nostr-provider.js, inject script, nginx reload — see `patch_indeedhub_nostr_provider` -in install.rs:68 for exact ops); wire `install_indeedhub_stack` orchestrator-first; -generated_secrets: indeedhub-db-password/indeedhub-jwt/indeedhub-minio-password -(reuse live values); preserve hardcoded AES_MASTER_SECRET literal + minio user -"indeeadmin". Then netbird (assess its setup steps). Then single-container legacy -apps (add to `uses_orchestrator_install_flow` allowlist in install.rs + verify each). -Then the lifecycle gate (#6) — needs harness hardening (#18) + .228 bitcoin synced. +**Design = adoption-safe + manifest-driven.** Manifests reproduce the live install exactly +so existing nodes ADOPT (NoOp) instead of recreate: hyphen container_names the runtime +already references, named volumes `indeedhub-{postgres,redis,minio,relay}-data`, +`indeedhub-net` + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse +the live /var/lib/archipelago/secrets values (ensure_one no-ops on existing; postgres pw is +fixed at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept. The +frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + nostr-provider.js ++ sub_filter), so the post_install hook (sed X-Frame / copy nostr-provider.js / inject / +nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering +guard is KEPT on purpose (beneficial; not a blocker). -**Test/deploy facts:** .228 = archi resilience node, UI/RPC pw `password123` (https), -SSH pw `archipelago`. Lifecycle harness runs from .116: `cd tests/lifecycle && -ARCHY_HOST=192.168.1.228 ARCHY_SCHEME=https ARCHY_PASSWORD=password123 -ARCHY_ALLOW_DESTRUCTIVE=1 ./run.sh `. RPC trigger: auth.login (sets session -+ csrf cookies) → send csrf cookie value as `X-CSRF-Token` header. package.install -needs `{"id":"","dockerImage":""}` (dockerImage required even -for stacks). Rust workspace root = `core/`. Linker `undefined hidden symbol` → -rebuild with `CARGO_INCREMENTAL=0`. immich on .228: app_id `immich`, containers -immich_server/immich_postgres/immich_redis, data dir owner 100998:100998. +### NEXT STEPS (in order) +1. **Sync .228 to the tcp-health manifest.** .228 still runs the OLD http-health frontend + manifest on disk (stable there at low load, but inconsistent). Deploy `apps/indeedhub/manifest.yml` + → /opt/archipelago/apps/indeedhub/manifest.yml on .228, restart archipelago, reinstall + the frontend (it caches manifests at startup). Verify no churn. +2. **Run the 5× lifecycle gate** (`ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`) on .228 + then .198 (ARCHY_ALLOW_DESTRUCTIVE=1). Fix until green. This is the production exit criterion. +3. **netbird migration (#20 phase 4)** — same pattern, but assess its setup steps first + (TLS cert gen, config files, resolver IP — may need host-file-write hooks the current + exec/copy_from_host hooks don't cover; legacy is install_netbird_stack in stacks.rs). +4. Then single-container legacy apps onto the orchestrator install flow; then demote the banner. + +### KNOWN ISSUES / WATCH-OUTS +- **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates + containers it deems unhealthy; under load, false-failing health checks → churn. The + tcp-health fix (`e2a012d0`) mitigated the frontend case. If the lifecycle gate churns on + .198, look for other apps whose http health checks false-fail under load → prefer tcp. +- **Many concurrent SSH sessions to .198 wedge its sshd** (MaxStartups) — it pings but SSH + hangs for minutes. Use ONE ssh at a time to .198; `pkill -f 192.168.1.198` to clear strays. +- Hook `exec` only works in the scoped form (committed). `copy_from_host` is direct `cp`. + +### DEPLOY / VERIFY FACTS (both nodes, ISO Debian, glibc 2.41 — binary built on .116 runs on both) +- **Build:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago` + (~12 min, opt-level=3). Binary at `core/target/release/archipelago`. Linker + "undefined hidden symbol" → rebuild with CARGO_INCREMENTAL=0. `archipelago` is a + bin-only crate (no lib). Filtered tests: `cargo test -p archipelago --bin archipelago -- hooks quadlet`. +- **Sideload:** `scp binary $H:/tmp/archipelago-new` → `sudo systemctl stop archipelago; + sudo cp /tmp/archipelago-new /usr/local/bin/archipelago; sudo chmod +x …; sudo systemctl + start archipelago`. Containers SURVIVE the restart (--restart unless-stopped + + podman-restart.service). Binary path is /usr/local/bin/archipelago. +- **Manifests** live at /opt/archipelago/apps//manifest.yml (root-owned ok). The + orchestrator CACHES them at startup → **edit on disk then RESTART archipelago to reload**. + Bulk deploy: `tar czf t.tgz -C apps indeedhub indeedhub-postgres indeedhub-redis + indeedhub-minio indeedhub-relay indeedhub-api indeedhub-ffmpeg`; scp; `sudo tar xzf t.tgz + -C /opt/archipelago/apps`. +- **Nodes:** .228 = 192.168.1.228, SSH pw `archipelago`, RPC/UI pw `password123` (https). + .198 = 192.168.1.198, SSH pw `archipelago`, **RPC/UI pw `ThisIsWeb54321@`** (https). Both + have the 7-container indeedhub stack + secrets + named volumes pre-existing. +- **Trigger install via RPC:** `auth.login` (sets session+csrf cookies) → send the csrf + cookie value as `X-CSRF-Token` header → `package.install` with params + `{"id":"indeedhub","dockerImage":""}` (dockerImage required even for stacks; install + is async → returns `{"status":"installing"}`). install logs go to + /var/log/archipelago/container-installs.log (best-effort) AND journalctl -u archipelago. +- **Fresh-create test recipe:** `podman rm -f indeedhub` (stateless frontend) → package.install + indeedhub → expect install_fresh + post_install hook (all 4 steps `ok`) + UI 200 on :7778 + (/ , /nostr-provider.js, /api/). On adoption the frontend is NoOp (hook does NOT run — + install_fresh is the only hook trigger). ## 9. Documentation map (what survives)