lfg2025/archy

Author	SHA1	Message	Date
archipelago	8355453a7e	docs: exact cutoff-proof resume in master-plan SS8b (resume from any device) Captures: .228 1x-GREEN (110/110); hardened 5x DETACHED on .228 (/tmp/gate-5x2.log, nohup — survives terminal close) with the exact check-from-any-machine command; all shipped code fixes (commits) + deploy state (.228 + .198); node-state fixes NOT in repo (lnd nginx proxy 8081->18083, home-assistant orphan unit removed, electrumx re-registered); the run-ON-the-node lesson; and remaining work. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 17:22:29 -04:00
archipelago	22b05de6d9	docs(roadmap): P1 mobile app-launch UX — drop 'opens in a tab' interstitial Companion app: open every app in the in-app WebView (not just non-iframeable), carrying the mobile-iframe footer controls into the WebView. Mobile web (PWA): open tab-apps directly in a new tab. No interstitial on either surface. Touch points + prior commits (b5a9deb8, d1fbcd9b) noted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 16:57:44 -04:00
archipelago	27299ea687	docs: make the production test gate a SINGLE-NODE (.228) criterion; split out multinode Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC). Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites. Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 16:47:34 -04:00
archipelago	f4727bfdb3	docs(gate): companion self-heal fix validated (10s) + test-31 harness caveat Independent companion loop (452f05d8) validated on .228: deleted archy-electrs-ui recreates in ~10s (was stuck 100s+). Also: companion-survives bats does LOCAL rm/systemctl --user, so running it from .116 via RPC tests .116's companions with .116's binary, NOT the remote target — must run ON the target node. Explains the 'failed on both nodes' runs (both silently tested .116). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 13:44:57 -04:00
archipelago	de7d3d83dc	docs(gate): final read — every failure fixed/explained, no lifecycle bugs remain Last 2 .228 stragglers confirmed load/timing, not bugs: test 31 (companion recreate) = contamination + ~108s reconcile cadence > 90s window; test 55 (immich restart) = heavy stack restarts >120s under load but DOES return. Path to literally-green gate is infra (bitcoin sync, re-quadletize .228) + minor test-window tuning. Optional product improvement noted: independent ~30s companion-reconcile cadence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 12:36:03 -04:00
archipelago	76b23adcc0	docs(gate): test 31 root-caused = .228 contamination (not a product bug) companion::reconcile only recreates a deleted companion unit when its parent backend is in manifest_ids. On contaminated .228, electrumx ran as plain podman and was NOT a tracked manifest install (manifest on disk but unloaded), so the reconciler never iterated it -> archy-electrs-ui companion orphaned. Proven: package.install electrumx re-registered it + restored the companion. Self-heal logic is sound; test 31 clears on re-quadletize. electrumx on .228 de-contaminated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 11:34:55 -04:00
archipelago	47a5148865	docs(gate): two-node result — stop blocker FIXED; residual red is bitcoin-IBD + node prep .228 104/110, .198 94/110 with the 3-fix binary. Every package.stop test passes on healthy apps. .198's 14/16 failures trace to bitcoin in IBD (test 83: ~137k blocks behind) cascading to lnd/btcpay/electrumx/mempool. 2 node-independent: companion recreate (31, both nodes), fedimint orphan pollution (44). Path to green 5x gate is now infra (sync bitcoin, re-quadletize .228) + minor (test 31), not lifecycle bugs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 11:09:12 -04:00
archipelago	b090235b04	docs(gate): 3 stop bugs FIXED, electrumx suite GREEN on .228 Stop failure was 3 real product bugs (grace / reconcile-resurrection / container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) + deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was probe-induced churn (stable when left alone). Validating breadth next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 09:49:45 -04:00
archipelago	29cd167894	docs(gate): stop-grace fix shipped+validated; gate is multi-caused (5 issues) Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on both nodes can't be stopped; (3) host-listener repair watchdog restarts port-unreachable containers fighting stop; (4) gate waits for 'stopped' but apps end 'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced NEXT STEPS (fedimint health is the new top blocker). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 08:07:43 -04:00
archipelago	470e3c649a	docs(gate): ROOT-CAUSE the stop blocker — orchestrator ignores per-app stop grace Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30 timed out after 30s' -> stop fails -> state reverts to running. Real fleet-wide bug (NOT .228 contamination). stop_timeout_secs() per-app grace (bitcoin 600/lnd 330/electrumx 300/fedimint 60) is used by legacy stop paths but NOT the orchestrator path: ContainerRuntime::stop_container hardcodes API ?t=10 / CLI -t 30, and PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so the await fires as podman SIGKILLs. Fix = thread per-app grace + widen wrapper deadline; owner picks table-based vs manifest-driven stop_grace_secs. Re-escalated to blocker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:17:23 -04:00
archipelago	a111d79a05	docs(gate): downgrade stop-blocker ⛔→⚠️ — .198 has quadlet units, .228 state was my contamination .198 ground truth: backend apps ARE quadlet (.container files present) -> quadlet is the intended runtime. .228's plain-podman state traced to my cascade-gate uninstall + package.start restore (no quadlet regen). Two real robustness sub-bugs remain (start should regen quadlet; stop podman-fallback gap). Next: canonical gate on CLEAN .198 first to tell real-bug from contamination. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:00:42 -04:00
archipelago	47026fae30	docs(gate): document package.stop blocker + quadlet-vs-podman finding (.228) 5x gate run surfaced a real blocker: package.stop does not stop electrumx/ bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait times out). Root cause chain: these backend apps run as plain podman --restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI companions + home-assistant have .container files; bitcoin-core.container is .disabled). orchestrator.stop() podman-fallback fires for filebrowser but not electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state reporting itself is correct (filebrowser proof, user_stopped guard). Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE); restored .228 after my cascade-gate left apps stranded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 05:47:11 -04:00
archipelago	d6fa262d69	docs(#20): consolidate master-plan resume — indeedhub migration 2-node verified (.228+.198); cutoff-proof next-steps + deploy facts Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 04:23:52 -04:00
archipelago	e4d3f94913	docs(#20): hook exec cgroup gap FIXED + verified on .228 (scoped exec) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:57:17 -04:00
archipelago	fdb465f8ac	docs(#20): indeedhub fresh-create FIXED + verified on .228 (special-cases deleted + nginx caps); hook exec cgroup gap noted Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:26:23 -04:00
archipelago	84031e6209	docs: temporarily reduce release lifecycle gate from 20x to 5x Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on .228 AND .198 for now, down from 20x. Restore to 20x before the final ship. Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:11:00 -04:00
archipelago	9c45f718a2	docs(#20): fresh-create path blocked by legacy indeedhub orchestrator special-cases; fix plan + .228 recovered Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 16:36:22 -04:00
archipelago	8bdc857911	docs(#20): indeedhub phase 3 adoption path live-verified on .228 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 16:23:09 -04:00
archipelago	d2f7c4abf3	docs(#20): phase 3 code-complete (indeedhub manifests + orchestrator-first); next = .228 live verify Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 15:48:18 -04:00
archipelago	ccb5b7ca39	docs(#20): mark hook phases 1+2 done; resume notes point to phase 3 (indeedhub) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 11:49:05 -04:00
archipelago	4c1a4e5976	feat(hooks): manifest lifecycle-hooks schema (#20 phase 1) + fix container test literals Add controlled post_install/pre_start hook schema to AppDefinition: LifecycleHooks/HookStep (Exec | CopyFromHost)/HostCopy with allowlist validation (relative src, no '..', absolute container dest, non-empty exec). Re-exported from the crate root. Design: docs/manifest-hooks-design.md. Also add the missing generated_secrets: vec![] field to three pre-existing ContainerConfig test literals (the field was added to the struct in 03a4ee1b but the container crate's own tests were never rerun, so -p archipelago-container failed to compile). cargo test green: 53 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 11:07:00 -04:00
archipelago	c548705147	docs: master plan — mark registry-manifest phases 1-3 + immich + reboot-survival done Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 08:25:40 -04:00
archipelago	9e6c5370fc	feat(immich): manifest-driven stack via orchestrator — live-migrated on .228 Completes the immich migration off the legacy hardcoded install_immich_stack (podman run + sudo chown) to the registry-manifest + orchestrator path. Validated live on .228 (clean single set, healthy v2.7.4, data dir ownership correct). - install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids) first; legacy remains only as the no-manifests fallback. - immich-{postgres,redis,server} manifests corrected from live findings: * named by app_id (dropped container_name override) — using container_name spawned DUPLICATE containers (app_id-named install vs name-override reconcile) on the same PGDATA, which corrupted a postgres cluster. Server reaches its siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis). * immich-postgres data_uid 100998:100998 (postgres drops to container 999 → host 100998 under rootless; verified the fresh dir is chowned correctly). * immich-server version "release"→"2.7.4" (manifest validation requires a digit; the bad version made the manifest silently skip → partial orchestrator install → legacy fallback → the duplicate corruption above). - HARDEN install_stack_via_orchestrator: only fall back to the legacy installer when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now errors instead of double-creating containers on shared data (the corruption root cause). - Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped manifest — this gap let the bad immich-server version through. Known follow-up (pre-existing, platform-wide): orchestrator-installed backends (immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service is disabled on .228 → reboot-survival gap independent of this migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 07:08:45 -04:00
archipelago	192238cbb8	docs: consolidate into PRODUCTION-MASTER-PLAN, add CLAUDE.md, prune 25 stale docs Single authoritative hub (docs/PRODUCTION-MASTER-PLAN.md) for the app-platform north star: every app manifest-driven (zero OS-level reliance), manifests via the signed registry, developer-ready external marketplace; rootless/secure/robust/ 100%-uptime. Repo CLAUDE.md (auto-loaded each session) points agents at it until the 20x lifecycle gate is green. New design doc registry-manifest-design.md. Consolidated docs 56 -> 28: deleted dated handoffs/resumes/transcripts and superseded trackers (content folded into the master plan or already in memory). Kept all evergreen design/reference docs + ADRs (the master links them). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:11:32 -04:00

24 Commits
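
Commit 470e3c649a describes the stop blocker as two coupled problems: the orchestrator path hardcodes the stop grace instead of using the per-app value, and the wrapper deadline equals the grace, so the await expires at the same moment podman sends SIGKILL. The sketch below illustrates the table-based variant of the fix under stated assumptions. It is not the repository's code: the function names, the 30s default, and the 15s margin are all hypothetical, and only the four grace values come from the commit message.

```rust
use std::time::Duration;

/// Hypothetical per-app stop grace table. The four values are the ones
/// quoted in commit 470e3c649a; the fallback of 30s is an assumption.
fn stop_grace_secs(app_id: &str) -> u64 {
    match app_id {
        "bitcoin" => 600,
        "lnd" => 330,
        "electrumx" => 300,
        "fedimint" => 60,
        _ => 30,
    }
}

/// Assumed safety margin so the wrapper outlives podman's own grace period.
const WRAPPER_MARGIN_SECS: u64 = 15;

/// Deadline for the code awaiting `podman stop`. It must be strictly longer
/// than the grace, otherwise the await fires just as podman escalates to SIGKILL.
fn stop_deadline(app_id: &str) -> Duration {
    Duration::from_secs(stop_grace_secs(app_id) + WRAPPER_MARGIN_SECS)
}

/// Arguments for `podman stop`, threading the per-app grace into `-t`.
/// Using the app id as the container name is an assumption for illustration.
fn podman_stop_args(app_id: &str) -> Vec<String> {
    vec![
        "stop".to_string(),
        "-t".to_string(),
        stop_grace_secs(app_id).to_string(),
        app_id.to_string(),
    ]
}

fn main() {
    for app in ["bitcoin", "lnd", "electrumx", "fedimint", "vaultwarden"] {
        println!(
            "{app}: podman {} (wrapper deadline {:?})",
            podman_stop_args(app).join(" "),
            stop_deadline(app)
        );
    }
}
```

The commit leaves the choice between this table and a manifest-driven `stop_grace_secs` field to the owner. Either way the same invariant applies: the wrapper deadline stays above the grace passed to podman.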