Captures: .228 1x-GREEN (110/110); hardened 5x DETACHED on .228 (/tmp/gate-5x2.log,
nohup — survives terminal close) with the exact check-from-any-machine command; all
shipped code fixes (commits) + deploy state (.228 + .198); node-state fixes NOT in
repo (lnd nginx proxy 8081->18083, home-assistant orphan unit removed, electrumx
re-registered); the run-ON-the-node lesson; and remaining work.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Companion app: open every app in the in-app WebView (not just non-iframeable),
carrying the mobile-iframe footer controls into the WebView. Mobile web (PWA):
open tab-apps directly in a new tab. No interstitial on either surface. Touch
points + prior commits (b5a9deb8, d1fbcd9b) noted.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per direction: the gate is now 5x green ON .228 only (run on the node, not via RPC).
Fleet/multinode verification (.198 + others) moved to a new docs/multinode-testing-plan.md
with the bootstrap recipe, per-node preconditions (synced archival bitcoin, no stale
nginx proxy targets, no orphan quadlet units), node roster, and cross-node suites.
Updated CLAUDE.md, master-plan SS5/SS6/SS8b/WS-E, and TESTING.md release gates.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent companion loop (452f05d8) validated on .228: deleted archy-electrs-ui
recreates in ~10s (was stuck 100s+). Also: companion-survives bats does LOCAL
rm/systemctl --user, so running it from .116 via RPC tests .116's companions with
.116's binary, NOT the remote target — must run ON the target node. Explains the
'failed on both nodes' runs (both silently tested .116).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Last 2 .228 stragglers confirmed load/timing, not bugs: test 31 (companion recreate)
= contamination + ~108s reconcile cadence > 90s window; test 55 (immich restart) =
heavy stack restarts >120s under load but DOES return. Path to literally-green gate
is infra (bitcoin sync, re-quadletize .228) + minor test-window tuning. Optional
product improvement noted: independent ~30s companion-reconcile cadence.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
companion::reconcile only recreates a deleted companion unit when its parent
backend is in manifest_ids. On contaminated .228, electrumx ran as plain podman
and was NOT a tracked manifest install (manifest on disk but unloaded), so the
reconciler never iterated it -> archy-electrs-ui companion orphaned. Proven:
package.install electrumx re-registered it + restored the companion. Self-heal
logic is sound; test 31 clears on re-quadletize. electrumx on .228 de-contaminated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
.228 104/110, .198 94/110 with the 3-fix binary. Every package.stop test passes on
healthy apps. .198's 14/16 failures trace to bitcoin in IBD (test 83: ~137k blocks
behind) cascading to lnd/btcpay/electrumx/mempool. 2 node-independent: companion
recreate (31, both nodes), fedimint orphan pollution (44). Path to green 5x gate is
now infra (sync bitcoin, re-quadletize .228) + minor (test 31), not lifecycle bugs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stop failure was 3 real product bugs (grace / reconcile-resurrection /
container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) +
deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was
probe-induced churn (stable when left alone). Validating breadth next.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation
showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on
both nodes can't be stopped; (3) host-listener repair watchdog restarts
port-unreachable containers fighting stop; (4) gate waits for 'stopped' but apps end
'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s
gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced
NEXT STEPS (fedimint health is the new top blocker).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30
timed out after 30s' -> stop fails -> state reverts to running. Real fleet-wide
bug (NOT .228 contamination). stop_timeout_secs() per-app grace (bitcoin 600/lnd
330/electrumx 300/fedimint 60) is used by legacy stop paths but NOT the
orchestrator path: ContainerRuntime::stop_container hardcodes API ?t=10 / CLI
-t 30, and PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so the await fires as
podman SIGKILLs. Fix = thread per-app grace + widen wrapper deadline; owner picks
table-based vs manifest-driven stop_grace_secs. Re-escalated to blocker.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
.198 ground truth: backend apps ARE quadlet (.container files present) -> quadlet
is the intended runtime. .228's plain-podman state traced to my cascade-gate
uninstall + package.start restore (no quadlet regen). Two real robustness sub-bugs
remain (start should regen quadlet; stop podman-fallback gap). Next: canonical
gate on CLEAN .198 first to tell real-bug from contamination.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5x gate run surfaced a real blocker: package.stop does not stop electrumx/
bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait
times out). Root cause chain: these backend apps run as plain podman
--restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI
companions + home-assistant have .container files; bitcoin-core.container is
.disabled). orchestrator.stop() podman-fallback fires for filebrowser but not
electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state
reporting itself is correct (filebrowser proof, user_stopped guard).
Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE);
restored .228 after my cascade-gate left apps stranded.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on
.228 AND .198 for now, down from 20x. Restore to 20x before the final ship.
Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add controlled post_install/pre_start hook schema to AppDefinition:
LifecycleHooks/HookStep (Exec | CopyFromHost)/HostCopy with allowlist
validation (relative src, no '..', absolute container dest, non-empty
exec). Re-exported from the crate root. Design: docs/manifest-hooks-design.md.
Also add the missing generated_secrets: vec![] field to three
pre-existing ContainerConfig test literals (the field was added to the
struct in 03a4ee1b but the container crate's own tests were never rerun,
so -p archipelago-container failed to compile). cargo test green: 53 pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the immich migration off the legacy hardcoded install_immich_stack
(podman run + sudo chown) to the registry-manifest + orchestrator path. Validated
live on .228 (clean single set, healthy v2.7.4, data dir ownership correct).
- install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids)
first; legacy remains only as the no-manifests fallback.
- immich-{postgres,redis,server} manifests corrected from live findings:
* named by app_id (dropped container_name override) — using container_name
spawned DUPLICATE containers (app_id-named install vs name-override reconcile)
on the same PGDATA, which corrupted a postgres cluster. Server reaches its
siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis).
* immich-postgres data_uid 100998:100998 (postgres drops to container 999 →
host 100998 under rootless; verified the fresh dir is chowned correctly).
* immich-server version "release"→"2.7.4" (manifest validation requires a digit;
the bad version made the manifest silently skip → partial orchestrator install
→ legacy fallback → the duplicate corruption above).
- HARDEN install_stack_via_orchestrator: only fall back to the legacy installer
when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now
errors instead of double-creating containers on shared data (the corruption
root cause).
- Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped
manifest — this gap let the bad immich-server version through.
Known follow-up (pre-existing, platform-wide): orchestrator-installed backends
(immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service
is disabled on .228 → reboot-survival gap independent of this migration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Single authoritative hub (docs/PRODUCTION-MASTER-PLAN.md) for the app-platform
north star: every app manifest-driven (zero OS-level reliance), manifests via the
signed registry, developer-ready external marketplace; rootless/secure/robust/
100%-uptime. Repo CLAUDE.md (auto-loaded each session) points agents at it until
the 20x lifecycle gate is green. New design doc registry-manifest-design.md.
Consolidated docs 56 -> 28: deleted dated handoffs/resumes/transcripts and
superseded trackers (content folded into the master plan or already in memory).
Kept all evergreen design/reference docs + ADRs (the master links them).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>