docs(#20): consolidate master-plan resume — indeedhub migration 2-node verified (.228+.198); cutoff-proof next-steps + deploy facts
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent
e2a012d086
commit
d6fa262d69
@@ -5,7 +5,7 @@
|
|||||||
> supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
|
> supersedes all prior roadmap/handoff/status docs. When the gate passes, remove
|
||||||
> the priority banner and demote this doc.
|
> the priority banner and demote this doc.
|
||||||
>
|
>
|
||||||
> Last updated: 2026-06-21 · Binary: v1.7.99-alpha
|
> Last updated: 2026-06-22 · Binary: v1.7.99-alpha · See §8b for the live resume.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -148,126 +148,95 @@ hardening; paid swarm streaming + IndeeHub source (`phase4-streaming-ecash-plan.
|
|||||||
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
|
Meshroller Rust-native mesh AI (`meshroller-integration-design.md`); dual-ecash
|
||||||
phases 2–6 (`dual-ecash-design.md`).
|
phases 2–6 (`dual-ecash-design.md`).
|
||||||
|
|
||||||
## 8b. SESSION STATE + RESUME (2026-06-21, live)
|
## 8b. SESSION STATE + RESUME (updated 2026-06-22) — READ THIS FIRST ON RESUME
|
||||||
|
|
||||||
**Landed + committed on main this session (newest first):**
|
### Where we are — Task #20 (manifest lifecycle hooks) + indeedhub migration: DONE & 2-node verified
|
||||||
- **#20 phase 3 — ADOPTION PATH LIVE-VERIFIED on .228 (2026-06-21).** Built
|
|
||||||
v1.7.99-alpha, sideloaded binary + 7 manifests, restarted (stop/replace/start —
|
|
||||||
containers survived via --restart unless-stopped + podman-restart.service). RPC
|
|
||||||
`package.install indeedhub` → `complete`, orchestrator-first path adopted all 7
|
|
||||||
members (`reconcile action app_id=indeedhub-* action=NoOp`), containers stayed
|
|
||||||
**Up 4 days (NOT recreated)** — zero data/credential disruption. UI green:
|
|
||||||
frontend :7778 → 200, nostr-provider.js → 200, **/api/ → 200 (proves
|
|
||||||
network_aliases: frontend nginx `http://api:4000` resolved on indeedhub-net)**.
|
|
||||||
Fleet healthy (36 containers, none down).
|
|
||||||
**FRESH-CREATE PATH = FIXED + VERIFIED (2026-06-21).** Deleted the legacy
|
|
||||||
indeedhub orchestrator special-cases (`b73084db`, −382 lines: reconcile_indeedhub_stack,
|
|
||||||
start_indeedhub_backends, the 120s dependency-DNS gate, patch_indeedhub_nostr_provider,
|
|
||||||
etc.) so "indeedhub" flows through the generic install_fresh path. Then two live fixes
|
|
||||||
on .228: (1) frontend nginx needs `capabilities: [CHOWN,DAC_OVERRIDE,SETGID,SETUID]`
|
|
||||||
under the orchestrator's --cap-drop=ALL (workers died "setgid(101) failed"); manifest
|
|
||||||
fix `ff8f11b8`. (2) NOTE: manifest reload needs an archipelago restart (manifests
|
|
||||||
cached at startup) — a disk manifest edit alone won't take. RESULT: frontend
|
|
||||||
fresh-creates via install_fresh, caps applied, post_install hook FIRES
|
|
||||||
(copy_from_host nostr-provider.js ✅), UI 200 (/, /nostr-provider.js, /api/).
|
|
||||||
**HOOK EXEC GAP = FIXED + VERIFIED (`ff78b312`).** The post_install `exec` steps
|
|
||||||
used to fail via `podman exec` from the archipelago.service systemd cgroup
|
|
||||||
(`crun: write cgroup.procs: Permission denied`). Fixed by wrapping the hook
|
|
||||||
executor's `exec` in `systemd-run --user --scope --quiet --collect podman exec …`
|
|
||||||
(its own delegated cgroup; copy_from_host stays a direct `cp`). Verified on .228:
|
|
||||||
all 4 post_install steps now log `ok` (sed X-Frame, copy nostr-provider.js, inject
|
|
||||||
script, nginx reload), frontend serves, UI 200. The #20 hook capability is now fully
|
|
||||||
functional (exec + copy_from_host) on orchestrator-created containers.
|
|
||||||
|
|
||||||
PRIOR (now resolved) — was: **FRESH-CREATE PATH = BLOCKED (found live 2026-06-21).** Removed the stateless
|
Manifest-driven lifecycle hooks + the IndeedHub stack migration are **complete and
|
||||||
frontend + reinstalled to exercise install_fresh → it FAILED:
|
live-verified on BOTH .228 and .198** (adoption + fresh-create + post_install hook
|
||||||
`orchestrator stack install indeedhub failed at app indeedhub: IndeedHub
|
exec, stable under load). 15 commits this session: `4c1a4e59`..`e2a012d0`. Working
|
||||||
dependencies were not ready within 120s (indeedhub-api dependency DNS not ready)`,
|
tree clean. The release lifecycle gate is temporarily **5×** (was 20×; `ARCHY_ITERATIONS=5`).
|
||||||
and the frontend was left down. Recovered manually on .228 (podman run w/ alias
|
|
||||||
indeedhub on indeedhub-net; UI 200). ROOT CAUSE = hardcoded indeedhub orchestrator
|
|
||||||
special-cases that predate + conflict with the manifest path:
|
|
||||||
- prod_orchestrator `ensure_running` ~L1377: `app_id=="indeedhub"` →
|
|
||||||
`reconcile_indeedhub_stack`, which REFUSES manifest creation when the frontend
|
|
||||||
is absent (returns Left("stack-managed")).
|
|
||||||
- `run_pre_start_hooks("indeedhub")` ~L2324 → `start_indeedhub_backends` →
|
|
||||||
`wait_for_indeedhub_dependencies_ready(120)` — the gate that blocked install_fresh
|
|
||||||
(`indeedhub_api_dependency_dns_ready` returns false while the frontend's own alias
|
|
||||||
is absent + a getent transiently fails).
|
|
||||||
- also `repair_indeedhub_network_aliases`, `patch_indeedhub_nostr_provider`, the
|
|
||||||
"frontend did not stay reachable; restart" path (~L2474), `INDEEDHUB_BACKEND_*`
|
|
||||||
consts, and a crash_recovery.rs indeedhub special-case.
|
|
||||||
**FIX (next, its own build/deploy/test cycle):** delete these special-cases now
|
|
||||||
that the manifest carries dependencies/network_aliases/post_install — route
|
|
||||||
"indeedhub" through the GENERIC install_fresh + reconcile path so the frontend
|
|
||||||
fresh-creates normally (hook fires). Then re-run the destructive lifecycle on .228
|
|
||||||
(frontend recreate must succeed + run the hook), then .198, then the gate.
|
|
||||||
NOTE: .228 currently runs v1.7.99-alpha (these special-cases still present) — the
|
|
||||||
running stack is fine (adoption NoOp); only a frontend-absent event re-triggers the
|
|
||||||
bug, and the frontend is up.
|
|
||||||
- `b1eea8c0` indeedhub (#20) **phase 3 — CODE COMPLETE, unit-tested.** 7 manifests (apps/indeedhub-{postgres,redis,minio,relay,api,
|
|
||||||
ffmpeg} + apps/indeedhub frontend) + install_indeedhub_stack orchestrator-first
|
|
||||||
(immich pattern). Data-preserving by construction = ADOPTION on .228: exact live
|
|
||||||
hyphen container names, named volumes indeedhub-*-data, dedicated indeedhub-net +
|
|
||||||
network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse live
|
|
||||||
/var/lib/archipelago/secrets values (ensure_one no-ops on existing). Frontend
|
|
||||||
carries the post_install nginx hook (replaces patch_indeedhub_nostr_provider;
|
|
||||||
defensive since indeedhub:1.0.0 already bakes it). .228 GROUND TRUTH captured:
|
|
||||||
7 containers Up, volumes indeedhub-{postgres,redis,minio,relay}-data, network
|
|
||||||
indeedhub-net; frontend nginx upstreams api:4000/minio:9000/relay:8080; image
|
|
||||||
already bakes X-Frame strip + nostr-provider.js (6347B) + sub_filter.
|
|
||||||
**NEXT = live verify on .228:** build+sideload binary, restart, package.install
|
|
||||||
indeedhub → expect adoption (NoOp, no data touch), then full lifecycle. Risk:
|
|
||||||
service restart SIGKILL-cascade if Quadlet not fully shipped on .228.
|
|
||||||
- `b94b61f6` `network_aliases` manifest field (ContainerConfig) + podman_client &
|
|
||||||
quadlet rendering + DNS-label validation; also fixed 4 pre-existing from_manifest
|
|
||||||
test failures (network_policy: archy-net invalid; bind sources outside
|
|
||||||
/var/lib/archipelago). Enables indeedhub's short aliases on indeedhub-net.
|
|
||||||
- `955c54b7` hook capability (#20) **phase 2** — `container::hooks::run_post_install`
|
|
||||||
executor (podman exec + copy_from_host w/ allowlist canonicalise + symlink-escape
|
|
||||||
prefix check; best-effort/idempotent) wired into `install_fresh` after container
|
|
||||||
is up (fresh-container-only). 5 unit tests; `cargo test -p archipelago` green.
|
|
||||||
- `4c1a4e59` hook capability (#20) **phase 1** — `LifecycleHooks`/`HookStep`/`HostCopy`
|
|
||||||
schema + validate() + re-exports + 3 schema tests; also fixed 3 pre-existing
|
|
||||||
`ContainerConfig` test literals missing `generated_secrets` (container crate now
|
|
||||||
compiles; `cargo test -p archipelago-container` green, 53 pass).
|
|
||||||
- `f0c6b79d` immich containers named underscore (immich_server/_postgres/_redis) to
|
|
||||||
match runtime lifecycle code — fixes package.stop/start/restart. **immich fully
|
|
||||||
migrated + verified on .228** (manifest-driven stack via orchestrator).
|
|
||||||
- `b0b54a96` immich lifecycle bats suite (tests/lifecycle/bats/immich.bats).
|
|
||||||
- `d5ef4573`/`9e6c5370`/`011081d1` immich migration (rename→immich, orchestrator-first).
|
|
||||||
- `f160e0c4` podman-restart.service enabled at startup (reboot-survival).
|
|
||||||
- `0860dfac` Services-tab UI (backends→Services, parent icons, categories sub-nav, swipe).
|
|
||||||
- `220666d3`/`7bfbe8fe` registry-manifest infra phases 1+2 (consume + EMBED_MANIFESTS publish).
|
|
||||||
- `192238cb` docs consolidation 56→28 + CLAUDE.md.
|
|
||||||
- `03a4ee1b` generated-secrets system + companion/quadlet fixes.
|
|
||||||
|
|
||||||
**DONE — hook capability (#20), phases 1+2 (schema + executor + wiring):**
|
**Shipped (all on `main`, newest first):**
|
||||||
controlled post-install hooks so indeedhub/netbird can migrate. Design:
|
- `e2a012d0` indeedhub frontend health → `tcp:7777` (was http GET `/`; the http check
|
||||||
`docs/manifest-hooks-design.md`. Schema, validate(), executor, and install-path
|
false-failed under load and the reconciler churned the frontend — fixed).
|
||||||
wiring all landed + green (commits `4c1a4e59`/`955c54b7` above). Remaining #20
|
- `ff78b312` hook `exec` runs in a transient user scope
|
||||||
phases: 3 = indeedhub migration (NEXT, below); 4 = netbird; 5 = `pre_start` hooks
|
(`systemd-run --user --scope --quiet --collect podman exec …`) — fixes
|
||||||
(type exists, NOT yet executed — wire into `prepare_for_start` if/when needed).
|
"crun: write cgroup.procs: Permission denied" when exec'ing from archipelago.service.
|
||||||
|
- `ff8f11b8` indeedhub frontend caps `[CHOWN,DAC_OVERRIDE,SETGID,SETUID]` — nginx
|
||||||
|
workers died "setgid(101) failed" under the orchestrator's `--cap-drop=ALL`.
|
||||||
|
- `b73084db` DELETED the legacy indeedhub orchestrator special-cases (−382 lines:
|
||||||
|
reconcile_indeedhub_stack, start_indeedhub_backends, the 120s dependency-DNS gate,
|
||||||
|
patch_indeedhub_nostr_provider, repair_indeedhub_network_aliases, INDEEDHUB_* consts)
|
||||||
|
→ "indeedhub" now uses the GENERIC install_fresh/reconcile path.
|
||||||
|
- `b1eea8c0` 7 indeedhub manifests (apps/indeedhub{,-postgres,-redis,-minio,-relay,-api,
|
||||||
|
-ffmpeg}) + `install_indeedhub_stack` orchestrator-first (immich pattern).
|
||||||
|
- `b94b61f6` `network_aliases` ContainerConfig field (podman_client + quadlet rendering,
|
||||||
|
DNS-label validated) — lets the frontend nginx reach `api:4000`/`minio:9000`/`relay:8080`
|
||||||
|
on the dedicated `indeedhub-net`.
|
||||||
|
- `955c54b7`/`4c1a4e59` #20 hooks phases 1-2: schema (LifecycleHooks/HookStep/HostCopy in
|
||||||
|
archipelago-container::manifest) + executor `container::hooks::run_post_install`
|
||||||
|
(allowlist-canonicalised copy_from_host + scoped exec), wired into `install_fresh`.
|
||||||
|
- `84031e62` gate 20×→5× (docs only: CLAUDE.md, this file, tests/lifecycle/TESTING.md).
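
The scoped `exec` from `ff78b312` amounts to an argv wrapper around `podman exec`. A minimal sketch in Python, for illustration only: the real executor is Rust, `scoped_exec_argv` and its parameters are hypothetical names, and the `nginx -s reload` step is an assumed example. Only the `systemd-run` flags are taken from this doc.

```python
from typing import List, Sequence

# Prefix quoted in this doc for commit ff78b312; everything else is illustrative.
SCOPE_PREFIX = ["systemd-run", "--user", "--scope", "--quiet", "--collect"]


def scoped_exec_argv(container: str, command: Sequence[str]) -> List[str]:
    """Build the argv that runs `command` in `container` from a transient user scope."""
    if not container:
        raise ValueError("container name must not be empty")
    if not command:
        raise ValueError("command must not be empty")
    # The scope gives podman exec its own delegated cgroup instead of the service's.
    return [*SCOPE_PREFIX, "podman", "exec", container, *command]


if __name__ == "__main__":
    # ASSUMPTION: the nginx reload hook step is shown as `nginx -s reload`.
    print(" ".join(scoped_exec_argv("indeedhub", ["nginx", "-s", "reload"])))
```

Returning a list rather than a shell string keeps the hook command free of quoting problems when it is handed to a process spawner.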
|
||||||
|
|
||||||
**NEXT — #20 phase 3, indeedhub migration:** author 7 member manifests
|
**Design = adoption-safe + manifest-driven.** Manifests reproduce the live install exactly
|
||||||
(postgres/redis/minio/relay/api/ffmpeg + frontend) on archy-net with container-name
|
so existing nodes ADOPT (NoOp) instead of recreate: hyphen container_names the runtime
|
||||||
hostnames; frontend carries the `post_install` hook (strip X-Frame-Options, copy
|
already references, named volumes `indeedhub-{postgres,redis,minio,relay}-data`,
|
||||||
nostr-provider.js, inject script, nginx reload — see `patch_indeedhub_nostr_provider`
|
`indeedhub-net` + network_aliases [postgres|redis|minio|relay|api], generated_secrets reuse
|
||||||
in install.rs:68 for exact ops); wire `install_indeedhub_stack` orchestrator-first;
|
the live /var/lib/archipelago/secrets values (ensure_one no-ops on existing; postgres pw is
|
||||||
generated_secrets: indeedhub-db-password/indeedhub-jwt/indeedhub-minio-password
|
fixed at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept. The
|
||||||
(reuse live values); preserve hardcoded AES_MASTER_SECRET literal + minio user
|
frontend image indeedhub:1.0.0 already bakes the iframe nginx (X-Frame omit + nostr-provider.js
|
||||||
"indeeadmin". Then netbird (assess its setup steps). Then single-container legacy
|
+ sub_filter), so the post_install hook (sed X-Frame / copy nostr-provider.js / inject /
|
||||||
apps (add to `uses_orchestrator_install_flow` allowlist in install.rs + verify each).
|
nginx reload) is defensive/idempotent. crash_recovery.rs's frontend-after-deps ordering
|
||||||
Then the lifecycle gate (#6) — needs harness hardening (#18) + .228 bitcoin synced.
|
guard is KEPT on purpose (beneficial; not a blocker).
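
The `network_aliases` values are DNS-label validated. A sketch of what such a check typically looks like, assuming the RFC 1123 label rules (1 to 63 characters, letters, digits and hyphens, no leading or trailing hyphen). The actual Rust validation may be stricter or looser, and `is_dns_label` is a hypothetical name.

```python
import re

# ASSUMPTION: RFC 1123 label rules; the project's real validator is not shown in this doc.
_LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_dns_label(alias: str) -> bool:
    """Return True if `alias` is a single valid DNS label (no dots, 1 to 63 chars)."""
    return _LABEL_RE.fullmatch(alias) is not None


if __name__ == "__main__":
    # The short aliases listed for indeedhub-net in this doc.
    for alias in ["postgres", "redis", "minio", "relay", "api"]:
        print(alias, is_dns_label(alias))
```

Under these rules an alias such as `api` passes, while `api.local` or `api_4000` would be rejected before any container is created.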
|
||||||
|
|
||||||
**Test/deploy facts:** .228 = archi resilience node, UI/RPC pw `password123` (https),
|
### NEXT STEPS (in order)
|
||||||
SSH pw `archipelago`. Lifecycle harness runs from .116: `cd tests/lifecycle &&
|
1. **Sync .228 to the tcp-health manifest.** .228 still runs the OLD http-health frontend
|
||||||
ARCHY_HOST=192.168.1.228 ARCHY_SCHEME=https ARCHY_PASSWORD=password123
|
manifest on disk (stable there at low load, but inconsistent). Deploy `apps/indeedhub/manifest.yml`
|
||||||
ARCHY_ALLOW_DESTRUCTIVE=1 ./run.sh <suite>`. RPC trigger: auth.login (sets session
|
→ /opt/archipelago/apps/indeedhub/manifest.yml on .228, restart archipelago, reinstall
|
||||||
+ csrf cookies) → send csrf cookie value as `X-CSRF-Token` header. package.install
|
the frontend (the orchestrator caches manifests at startup). Verify no churn.
|
||||||
needs `{"id":"<app>","dockerImage":"<any-valid-image>"}` (dockerImage required even
|
2. **Run the 5× lifecycle gate** (`ARCHY_ITERATIONS=5 tests/lifecycle/run-20x.sh`) on .228
|
||||||
for stacks). Rust workspace root = `core/`. Linker `undefined hidden symbol` →
|
then .198 (ARCHY_ALLOW_DESTRUCTIVE=1). Fix until green. This is the production exit criterion.
|
||||||
rebuild with `CARGO_INCREMENTAL=0`. immich on .228: app_id `immich`, containers
|
3. **netbird migration (#20 phase 4)** — same pattern, but assess its setup steps first
|
||||||
immich_server/immich_postgres/immich_redis, data dir owner 100998:100998.
|
(TLS cert gen, config files, resolver IP — may need host-file-write hooks the current
|
||||||
|
exec/copy_from_host hooks don't cover; legacy is install_netbird_stack in stacks.rs).
|
||||||
|
4. Then single-container legacy apps onto the orchestrator install flow; then demote the banner.
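
Step 2 can be pictured as two destructive runs in a fixed order. A sketch that only builds the per-node environments, assuming `run-20x.sh` reads the same `ARCHY_*` variables as the lifecycle harness. `gate_env` and `gate_plan` are hypothetical helpers, and passwords are passed in by the caller, not hardcoded.

```python
from typing import Dict, List

# Node order from step 2: .228 first, then .198.
GATE_ORDER = ["192.168.1.228", "192.168.1.198"]


def gate_env(host: str, password: str, iterations: int = 5) -> Dict[str, str]:
    """Environment for one destructive gate run against `host`."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return {
        "ARCHY_HOST": host,
        "ARCHY_SCHEME": "https",
        "ARCHY_PASSWORD": password,
        "ARCHY_ALLOW_DESTRUCTIVE": "1",
        "ARCHY_ITERATIONS": str(iterations),
    }


def gate_plan(passwords: Dict[str, str], iterations: int = 5) -> List[Dict[str, str]]:
    """Return one environment per node, in gate order."""
    return [gate_env(host, passwords[host], iterations) for host in GATE_ORDER]
```

Each returned dict can be merged into `os.environ` for a `subprocess.run` call; stopping at the first failing node matches "fix until green".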
|
||||||
|
|
||||||
|
### KNOWN ISSUES / WATCH-OUTS
|
||||||
|
- **.198 is a weak/loaded node** (load avg ~3–5). The generic reconcile recreates
|
||||||
|
containers it deems unhealthy; under load, false-failing health checks → churn. The
|
||||||
|
tcp-health fix (`e2a012d0`) mitigated the frontend case. If the lifecycle gate churns on
|
||||||
|
.198, look for other apps whose http health checks false-fail under load → prefer tcp.
|
||||||
|
- **Many concurrent SSH sessions to .198 wedge its sshd** (MaxStartups) — it pings but SSH
|
||||||
|
hangs for minutes. Use ONE ssh at a time to .198; `pkill -f 192.168.1.198` to clear strays.
|
||||||
|
- Hook `exec` only works in the scoped form (committed). `copy_from_host` is direct `cp`.
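
Why tcp beats http on a loaded node: a TCP probe only needs the listener to accept a connection, while an http probe also waits for the app to produce a response. A minimal connect-only probe, as an illustration of the idea rather than the orchestrator's actual health-check code (`tcp_healthy` is a hypothetical name):

```python
import socket


def tcp_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


if __name__ == "__main__":
    # Port 7777 is the frontend health target named in e2a012d0.
    print(tcp_healthy("127.0.0.1", 7777))
```

The trade-off is that a connect-only probe cannot detect an app that accepts connections but serves errors, so it suits checks whose job is to prevent reconciler churn, not to verify correctness.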
|
||||||
|
|
||||||
|
### DEPLOY / VERIFY FACTS (both nodes, ISO Debian, glibc 2.41 — binary built on .116 runs on both)
|
||||||
|
- **Build:** `cd core && CARGO_INCREMENTAL=0 cargo build --release -p archipelago`
|
||||||
|
(~12 min, opt-level=3). Binary at `core/target/release/archipelago`. Linker
|
||||||
|
"undefined hidden symbol" → rebuild with CARGO_INCREMENTAL=0. `archipelago` is a
|
||||||
|
bin-only crate (no lib). Filtered tests: `cargo test -p archipelago --bin archipelago -- hooks quadlet`.
|
||||||
|
- **Sideload:** `scp binary $H:/tmp/archipelago-new` → `sudo systemctl stop archipelago;
|
||||||
|
sudo cp /tmp/archipelago-new /usr/local/bin/archipelago; sudo chmod +x …; sudo systemctl
|
||||||
|
start archipelago`. Containers SURVIVE the restart (--restart unless-stopped +
|
||||||
|
podman-restart.service). Binary path is /usr/local/bin/archipelago.
|
||||||
|
- **Manifests** live at /opt/archipelago/apps/<app_id>/manifest.yml (root-owned ok). The
|
||||||
|
orchestrator CACHES them at startup → **edit on disk then RESTART archipelago to reload**.
|
||||||
|
Bulk deploy: `tar czf t.tgz -C apps indeedhub indeedhub-postgres indeedhub-redis
|
||||||
|
indeedhub-minio indeedhub-relay indeedhub-api indeedhub-ffmpeg`; scp; `sudo tar xzf t.tgz
|
||||||
|
-C /opt/archipelago/apps`.
|
||||||
|
- **Nodes:** .228 = 192.168.1.228, SSH pw `archipelago`, RPC/UI pw `password123` (https).
|
||||||
|
.198 = 192.168.1.198, SSH pw `archipelago`, **RPC/UI pw `ThisIsWeb54321@`** (https). Both
|
||||||
|
have the 7-container indeedhub stack + secrets + named volumes pre-existing.
|
||||||
|
- **Trigger install via RPC:** `auth.login` (sets session+csrf cookies) → send the csrf
|
||||||
|
cookie value as `X-CSRF-Token` header → `package.install` with params
|
||||||
|
`{"id":"indeedhub","dockerImage":"<any>"}` (dockerImage required even for stacks; install
|
||||||
|
is async → returns `{"status":"installing"}`). Install logs go to
|
||||||
|
/var/log/archipelago/container-installs.log (best-effort) AND journalctl -u archipelago.
|
||||||
|
- **Fresh-create test recipe:** `podman rm -f indeedhub` (stateless frontend) → package.install
|
||||||
|
indeedhub → expect install_fresh + post_install hook (all 4 steps `ok`) + UI 200 on :7778
|
||||||
|
(/, /nostr-provider.js, /api/). On adoption the frontend is NoOp (hook does NOT run;
|
||||||
|
install_fresh is the only hook trigger).
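
The RPC trigger can be sketched as a pure request builder. Assumptions are marked in the code: the csrf cookie name and the `{"method", "params"}` envelope are guesses, and `install_request` is a hypothetical helper. Only the `X-CSRF-Token` header and the `params` shape come from this doc.

```python
import json
from typing import Dict, Tuple

CSRF_COOKIE = "csrf"  # ASSUMPTION: the real cookie name is not stated in this doc.


def install_request(
    cookies: Dict[str, str], app_id: str, docker_image: str
) -> Tuple[Dict[str, str], str]:
    """Return (headers, body) for package.install, given the cookies set by auth.login."""
    token = cookies.get(CSRF_COOKIE)
    if not token:
        raise ValueError("no csrf cookie present; call auth.login first")
    headers = {
        "Content-Type": "application/json",
        "X-CSRF-Token": token,  # csrf cookie value echoed back as a header
        "Cookie": "; ".join(f"{k}={v}" for k, v in cookies.items()),
    }
    # ASSUMPTION: envelope shape. dockerImage is required even for stacks.
    body = json.dumps(
        {"method": "package.install", "params": {"id": app_id, "dockerImage": docker_image}}
    )
    return headers, body
```

Because the install is async, a caller would treat the immediate response as an acknowledgement only and follow the install logs to see the outcome.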
|
||||||
|
|
||||||
## 9. Documentation map (what survives)
|
## 9. Documentation map (what survives)
|
||||||
|
|
||||||
|
|||||||