lfg2025/archy

Author	SHA1	Message	Date
archipelago	b090235b04	docs(gate): 3 stop bugs FIXED, electrumx suite GREEN on .228 Stop failure was 3 real product bugs (grace / reconcile-resurrection / container-list user-stopped state), all fixed (2dad64b2, 760a32bc, 6e49ce6f) + deployed. electrumx lifecycle suite 10/10 green (66s). fedimint 'crash loop' was probe-induced churn (stable when left alone). Validating breadth next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 09:49:45 -04:00
archipelago	6e49ce6f88	fix(container-list): report user-stopped apps as stopped despite live UI companion A user-stopped backend (electrumx, bitcoin, lnd, fedimint) kept reading 'running' in container-list because its UI companion (electrs-ui, …) still serves the launch port, and the state-refresh upgrades any reachable launch port to 'running'. The gate's wait_for_container_status <app> stopped therefore never saw 'stopped'. Fix: load the user_stopped marker in handle_container_list and force 'stopped' for those apps before the launch-port refresh. The reconcile guard keeps the backend down, so the marker is authoritative. package.start clears it first, so a started app reports 'running' normally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 09:26:30 -04:00
archipelago	760a32bccf	fix(reconcile): keep user-stopped apps stopped (reconciler was resurrecting them) package.stop a dependency (e.g. electrumx, a mempool dep) and the reconciler restarts it within ~8s: the reconcile filter's dependency_required override re-includes a user-stopped app that an active app depends on, and the in-memory disabled set is wiped on manifest reload — so ensure_running runs, the stopped app's unreachable ports look like a fault, the host-port repair restarts it, and package.stop never sticks (gate 'transitions to stopped' times out). Fix: guard ensure_running_with_mode on the on-disk user_stopped marker (the single choke point every reconcile flows through) → Left('user-stopped'). Explicit install/start clear the marker first (added clear_user_stopped to orchestrator install/start, symmetric with disabled.remove; start/restart RPC already cleared it) so user actions are unaffected. The container itself already stopped correctly — this stops the resurrection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 09:04:02 -04:00
archipelago	29cd167894	docs(gate): stop-grace fix shipped+validated; gate is multi-caused (5 issues) Fix deployed to .198+.228, vaultwarden stops clean (no regression). But validation showed the gate failures are multi-caused: (2) fedimint crash-looping/unhealthy on both nodes can't be stopped; (3) host-listener repair watchdog restarts port-unreachable containers fighting stop; (4) gate waits for 'stopped' but apps end 'exited'/'absent' (Exited->Stopped conversion key mismatch); (5) grace vs 60s gate-timeout (electrumx 300s); (6) .228 contamination. Documented + re-sequenced NEXT STEPS (fedimint health is the new top blocker). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 08:07:43 -04:00
archipelago	2dad64b2ee	fix(stop): honour per-app graceful-stop grace in orchestrator stop path package.stop left slow-to-SIGTERM apps (fedimint/electrumx/bitcoin/btcpay/immich) running: the orchestrator path hardcoded podman API ?t=10 / CLI -t 30 and the CLI wrapper deadline (30s) equalled the -t grace, so the await fired exactly as podman SIGKILLed -> stop reported failed -> state reverted to running. Reproduced live on clean .198 (fedimint). - container/runtime.rs: add ContainerRuntime::stop_container_with_grace (defaulted so mock/dev impls are unchanged); PodmanRuntime honours grace for API + CLI with deadline = grace + 15s buffer; AutoRuntime delegates. New canonical per-app table stop_grace_secs_for() + DEFAULT_STOP_GRACE_SECS / STOP_GRACE_DEADLINE_BUFFER_SECS. - podman_client.rs: stop_container_with_grace uses ?t=<grace> + longer HTTP deadline. - prod_orchestrator::stop: resolve grace = manifest stop_grace_secs (north-star) else the table; pass to quadlet::stop_service_with_timeout AND stop_container_with_grace. - quadlet.rs: stop_service_with_timeout so slow apps aren't SIGKILLed at 45s. - rpc/package/runtime.rs: doc-note its &str stop_timeout_secs mirrors the canonical table. - tests: resolve_stop_grace_secs (manifest field wins / table fallback / default 30). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:59:40 -04:00
archipelago	470e3c649a	docs(gate): ROOT-CAUSE the stop blocker — orchestrator ignores per-app stop grace Reproduced live on CLEAN .198: package.stop fedimint -> 'podman stop -t 30 timed out after 30s' -> stop fails -> state reverts to running. Real fleet-wide bug (NOT .228 contamination). stop_timeout_secs() per-app grace (bitcoin 600/lnd 330/electrumx 300/fedimint 60) is used by legacy stop paths but NOT the orchestrator path: ContainerRuntime::stop_container hardcodes API ?t=10 / CLI -t 30, and PODMAN_CLI_DEFAULT_TIMEOUT=30s == the -t grace so the await fires as podman SIGKILLs. Fix = thread per-app grace + widen wrapper deadline; owner picks table-based vs manifest-driven stop_grace_secs. Re-escalated to blocker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:17:23 -04:00
archipelago	a111d79a05	docs(gate): downgrade stop-blocker ⛔→⚠️ — .198 has quadlet units, .228 state was my contamination .198 ground truth: backend apps ARE quadlet (.container files present) -> quadlet is the intended runtime. .228's plain-podman state traced to my cascade-gate uninstall + package.start restore (no quadlet regen). Two real robustness sub-bugs remain (start should regen quadlet; stop podman-fallback gap). Next: canonical gate on CLEAN .198 first to tell real-bug from contamination. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 06:00:42 -04:00
archipelago	47026fae30	docs(gate): document package.stop blocker + quadlet-vs-podman finding (.228) 5x gate run surfaced a real blocker: package.stop does not stop electrumx/ bitcoin-knots/btcpay/fedimint/immich (container stays running; gate stop-wait times out). Root cause chain: these backend apps run as plain podman --restart=unless-stopped, NOT quadlet units (PODMAN_SYSTEMD_UNIT empty; only UI companions + home-assistant have .container files; bitcoin-core.container is .disabled). orchestrator.stop() podman-fallback fires for filebrowser but not electrumx -> suspect loaded()/is_unknown_app_id_error gap. stop->stopped state reporting itself is correct (filebrowser proof, user_stopped guard). Also: corrected the canonical gate invocation (DESTRUCTIVE only, not CASCADE); restored .228 after my cascade-gate left apps stranded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 05:47:11 -04:00
archipelago	d6fa262d69	docs(#20): consolidate master-plan resume — indeedhub migration 2-node verified (.228+.198); cutoff-proof next-steps + deploy facts Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 04:23:52 -04:00
archipelago	e2a012d086	fix(indeedhub): frontend health = tcp:7777 not http GET / (stops reconcile churn) On the loaded .198 the frontend churned (created → "unhealthy" → reconciler recreates → loop). The http health check fetched / through nginx (SPA + sub_filter) and false-failed under node load; the reconciler then treated the frontend as wedged and recreated it. nginx binds 7777 at startup, so a tcp liveness check passes immediately and stays green under load while still catching a real "nginx not listening" failure. Generous retries/start_period. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-22 03:39:26 -04:00
archipelago	e4d3f94913	docs(#20): hook exec cgroup gap FIXED + verified on .228 (scoped exec) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:57:17 -04:00
archipelago	ff78b31212	fix(hooks): run post_install `exec` in a transient user scope (fixes cgroup denial) Live on .228 the post_install `exec` steps failed with "crun: write cgroup.procs: Permission denied / OCI permission denied": a `podman exec` launched from archipelago.service can't place its child in the container's cgroup (under the service's own slice). Wrap `exec` in `systemd-run --user --scope --quiet --collect podman exec …` so it gets its own delegated cgroup — same trick as `podman_user_scope` for pasta starts. `copy_from_host` (a host-side `cp`, no in-container process) stays direct. Without this only copy_from_host worked; indeedhub happened to be unaffected (its image pre-bakes the nginx config so the exec steps were no-ops), but the hook capability is only generally useful with exec working. hooks unit tests pass; live verify on .228 next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:38:23 -04:00
archipelago	fdb465f8ac	docs(#20): indeedhub fresh-create FIXED + verified on .228 (special-cases deleted + nginx caps); hook exec cgroup gap noted Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:26:23 -04:00
archipelago	ff8f11b87e	fix(indeedhub): frontend nginx needs SET{UID,GID}+CHOWN+DAC_OVERRIDE under cap-drop-ALL Live fresh-create on .228 (post special-case removal) had nginx workers die with "setgid(101) failed (Operation not permitted)" → workers exited code 2, port published but nothing served (HTTP 000). The orchestrator does --cap-drop=ALL, so unlike the legacy `podman run` (default caps) nginx's master couldn't drop workers to the nginx user. Declare CHOWN/DAC_OVERRIDE/SETGID/SETUID (SET* to drop the worker user, CHOWN+DAC_OVERRIDE for the tmpfs proxy cache). Verified on .228: frontend fresh-creates, caps applied, nginx serves, UI 200 incl. /api/ and /nostr-provider.js. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:24:34 -04:00
archipelago	b73084dbb0	refactor(indeedhub): delete orchestrator special-cases; use generic path (#20 phase 3) The fresh-create path was blocked by hardcoded indeedhub orchestrator logic that predated and conflicted with the manifest migration: - ensure_running routed app_id=="indeedhub" → reconcile_indeedhub_stack, which REFUSED to create the frontend from its manifest (returned Left("stack-managed")). - run_pre_start_hooks("indeedhub") → start_indeedhub_backends → wait_for_indeedhub_dependencies_ready(120) — a DNS gate with a chicken-and-egg bug (required the frontend's own alias present before the frontend could be created), which failed install_fresh with "dependencies were not ready within 120s" and left the frontend down (caught live on .228). Delete all of it (−382 lines): reconcile_indeedhub_stack, start_indeedhub_backends, wait_for_indeedhub_dependencies_ready, indeedhub_api_dependency_dns_ready, indeedhub_required_aliases_present, repair_indeedhub_network_aliases, indeedhub_alias_present, patch_indeedhub_nostr_provider, and the INDEEDHUB_* consts. The manifests now carry everything these did: network_aliases (short hostnames), generated_secrets, dependencies, and the post_install nginx hook. So "indeedhub" + every member flows through the generic install_fresh/reconcile path — the frontend fresh-creates normally and runs its hook. (crash_recovery.rs's frontend-after-deps ordering guard is kept — it's beneficial startup ordering, not a blocker.) cargo check + release build green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:11:33 -04:00
archipelago	84031e6209	docs: temporarily reduce release lifecycle gate from 20x to 5x Per user direction: the production test gate is 5x (ARCHY_ITERATIONS=5) on .228 AND .198 for now, down from 20x. Restore to 20x before the final ship. Updated CLAUDE.md, PRODUCTION-MASTER-PLAN.md, and tests/lifecycle/TESTING.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:11:00 -04:00
archipelago	9c45f718a2	docs(#20): fresh-create path blocked by legacy indeedhub orchestrator special-cases; fix plan + .228 recovered Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 16:36:22 -04:00
archipelago	8bdc857911	docs(#20): indeedhub phase 3 adoption path live-verified on .228 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 16:23:09 -04:00
archipelago	d2f7c4abf3	docs(#20): phase 3 code-complete (indeedhub manifests + orchestrator-first); next = .228 live verify Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 15:48:18 -04:00
archipelago	b1eea8c053	feat(indeedhub): manifest-driven 7-member stack, orchestrator-first (#20 phase 3) Author the IndeedHub stack as 7 manifests (postgres/redis/minio/relay/api/ ffmpeg + frontend) and route install_indeedhub_stack through the orchestrator first (immich pattern), falling back to the legacy installer only when the manifests aren't deployed. Data-preserving by construction — the manifests reproduce the live install exactly so an existing node ADOPTS rather than recreates: - container_name = the live hyphenated names the runtime already references (health_monitor tiers/deps, crash_recovery). - named volumes indeedhub-{postgres,redis,minio,relay}-data (not bind mounts). - dedicated indeedhub-net + network_aliases [postgres\|redis\|minio\|relay\|api] so the api/ffmpeg env hostnames and the frontend nginx upstreams resolve unchanged. - generated_secrets (indeedhub-db-password/-minio-password owned by their backends, indeedhub-jwt by the api) reuse the live /var/lib/archipelago/ secrets values (ensure_one no-ops on existing files; postgres pw is fixed at PGDATA init). minio user "indeeadmin" + AES_MASTER_SECRET literal kept. The frontend carries the post_install hook (#20) that replaces the hardcoded patch_indeedhub_nostr_provider: strip X-Frame-Options, refresh nostr-provider.js from /opt/archipelago/web-ui, inject the <script> if absent, reload nginx — defensive/idempotent since indeedhub:1.0.0 already bakes these. Frontend manifest also corrected off its dead Next.js shape (health check now nginx :7777, tmpfs /run + /var/cache/nginx). Builds + unit-tested; live adoption/lifecycle verification on .228 next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 15:46:26 -04:00
archipelago	b94b61f640	feat(manifest): network_aliases — extra DNS aliases on a container's network Add `container.network_aliases: Vec<String>` (serde default, DNS-label validated) so a stack member can answer to short hostnames its peers bake in, beyond its own container name. Rendered in both runtime paths: - podman_client: merged (deduped) into the custom-network aliases array. - quadlet from_manifest: appended after the container name; emitted only for Bridge networks (slirp/pasta reject aliases). Needed for the indeedhub migration: its frontend nginx proxies to `api:4000` / `minio:9000` / `relay:8080`, so those members declare `network_aliases: [api\|minio\|relay]` to keep the short names resolvable on the dedicated indeedhub-net (vs. colliding generic aliases on archy-net). Also fixes 4 pre-existing from_manifest test failures (unrelated to this change, surfaced now that the quadlet suite runs green): test manifests used the long-invalid `network_policy: archy-net` (allowlist is isolated/bridge/host → moved to network_policy: isolated + container.network) and bind sources outside /var/lib/archipelago. Tests: container crate 53 pass; archipelago quadlet+alias 47 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 15:45:11 -04:00
archipelago	ccb5b7ca39	docs(#20): mark hook phases 1+2 done; resume notes point to phase 3 (indeedhub) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 11:49:05 -04:00
archipelago	955c54b713	feat(hooks): post_install executor + install-path wiring (#20 phase 2) Add container::hooks::run_post_install — runs an app's declarative post_install hooks against its own running container: - Exec -> podman exec <container> <args…> (60s timeout-bounded) - CopyFromHost -> resolve src against allowlist roots (<data_dir>/<app> and /opt/archipelago), canonicalise + prefix-check (defeats symlink escape), then podman cp <abs-src> <container>:<dest> Best-effort + idempotent: a failed step is warned and skipped, never fails the install — matching the legacy patch_indeedhub_nostr_provider behaviour this replaces. Wired into install_fresh after the container is up, so it runs only on a freshly created container (not plain start), and re-applies on recreate-after-drift. 5 unit tests on resolve_copy_src (accept in-data-dir, reject absolute / traversal / missing / symlink-escape). cargo test -p archipelago green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 11:45:28 -04:00
archipelago	4c1a4e5976	feat(hooks): manifest lifecycle-hooks schema (#20 phase 1) + fix container test literals Add controlled post_install/pre_start hook schema to AppDefinition: LifecycleHooks/HookStep (Exec \| CopyFromHost)/HostCopy with allowlist validation (relative src, no '..', absolute container dest, non-empty exec). Re-exported from the crate root. Design: docs/manifest-hooks-design.md. Also add the missing generated_secrets: vec![] field to three pre-existing ContainerConfig test literals (the field was added to the struct in 03a4ee1b but the container crate's own tests were never rerun, so -p archipelago-container failed to compile). cargo test green: 53 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 11:07:00 -04:00
archipelago	b0b54a96fa	test(lifecycle): immich suite — package-level checks, wait-based destructive tier container-list reports stack apps package-level (.name="immich"), so the suite checks the "immich" package (presence, valid state, :2283 lan-address) rather than individual container names. Destructive tier fires async stop/start/restart and asserts on the end state via wait_for_container_status. KNOWN: the destructive tier is flaky for slow multi-container stacks — bats runs ops back-to-back with no settling while immich's async stack ops take 30s+, and stopped reports as "exited" not "stopped". The immich migration itself is verified working (manual stop/start/restart succeed; all 3 containers healthy). Hardening the harness for stack apps (inter-op settling + stopped\|exited acceptance) is a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 09:52:33 -04:00
archipelago	f0c6b79d1a	fix(immich): name containers underscore to match runtime lifecycle code package.stop/start/restart broke ("no containers found" / "no such object immich_postgres") because the runtime hardcodes the immich stack's container names as immich_server/immich_postgres/immich_redis (underscore) across 8 files (lifecycle, health, crash-recovery, ports, config). The migration had named the containers by app_id (hyphen), mismatching all of it. Root cause of the earlier failed attempt: container_name was nested under an `extensions:` block, but `app.extensions` is serde(flatten) — container_name must be a TOP-LEVEL app key to be read by compute_container_name. Fixed: set container_name: immich_server / immich_postgres / immich_redis at top level, and point DB_HOSTNAME/REDIS_HOSTNAME at the underscore aliases. App ids stay hyphen (immich/immich-postgres/immich-redis) so the catalog identity (title+icon) holds. Manifest-only change — container names now match existing runtime references, no code edits to the 8 files. (Deriving stack containers from manifests instead of hardcoded lists remains a north-star follow-up.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 09:20:38 -04:00
archipelago	b1f175b927	test(lifecycle): add immich stack lifecycle suite RPC-based (host-agnostic) lifecycle coverage for the manifest-driven immich stack (immich + immich-postgres + immich-redis): presence + valid state of all 3 members, a guard that no legacy underscore containers exist (catches botched migration / legacy-installer fallback), destructive stop/start/restart of the server with postgres+redis staying up, and cascade uninstall/reinstall (preserve_data). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 09:01:19 -04:00
archipelago	c548705147	docs: master plan — mark registry-manifest phases 1-3 + immich + reboot-survival done Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 08:25:40 -04:00
archipelago	f160e0c404	fix(reboot): enable podman-restart.service at startup (--restart reboot-survival) Orchestrator-installed backends (immich, btcpay-db, …) run as plain podman `--restart=unless-stopped` containers until the Phase-3 Quadlet rollout flips use_quadlet_backends on. Nothing in the codebase enabled the user's podman-restart.service, so those containers had NO reboot-survival mechanism. Enable it (idempotent, best-effort) at orchestrator startup so unless-stopped containers come back after a reboot. Already applied manually on .228 (covers 31 containers incl. immich + btcpay); this codifies it fleet-wide. The deeper fix (render Quadlet for all orchestrator installs) remains the gated Phase-3 Quadlet-everywhere rollout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 08:23:19 -04:00
archipelago	d5ef45731a	fix(immich): restore canonical app_id "immich" (title + icon) After the manifest migration the launcher installed as "immich-server" (app_id), which has no catalog entry → showed the raw id and no icon. Rename the server manifest app_id immich-server→immich so it matches the catalog/curated "immich" entry (title "Immich", icon immich.png) and is recognised as a known launcher app (APP_CATEGORY_MAP) → stays in My Apps. immich_stack_app_ids now installs [immich-postgres, immich-redis, immich]; orchestrator.install bypasses package routing so there's no recursion with the "immich"→stack-installer mapping. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 08:07:08 -04:00
archipelago	0860dfacc7	feat(ui): Services tab — backend classification, parent icons, categories sub-nav - Classify databases/APIs/backends into Services (#10): add immich-postgres/redis to SERVICE_NAMES; isServiceContainer matches -postgres/-redis/-valkey/-cache/-db suffixes; isWebsitePackage final fallback now routes any no-UI, non-known package to Services ("anything that isn't the frontend UI launcher"). - Services show their parent app's icon (#14): backends reuse the app logo (immich-* → immich, archy-btcpay-db → btcpay, indeedhub-* → indeedhub, etc.) via explicit APP_ICON_FALLBACKS + prefix map, instead of 404 → 📦. - Categories sub-nav for Services (#12): getServiceCategory + buildServiceCategories + useServiceCategories; Services tab gets the same desktop/mobile category strips (Databases/Caches/APIs/Backends), shown only for categories with items. Shared selectedCategory resets to 'all' on tab switch. - Mobile swipe (#11): the tab-swipe gesture is suppressed over .mobile-category-strip so swiping the category chips scrolls them instead of changing tabs (covers both My Apps and the new Services strip). vue-tsc build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 07:42:48 -04:00
archipelago	9e6c5370fc	feat(immich): manifest-driven stack via orchestrator — live-migrated on .228 Completes the immich migration off the legacy hardcoded install_immich_stack (podman run + sudo chown) to the registry-manifest + orchestrator path. Validated live on .228 (clean single set, healthy v2.7.4, data dir ownership correct). - install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids) first; legacy remains only as the no-manifests fallback. - immich-{postgres,redis,server} manifests corrected from live findings: * named by app_id (dropped container_name override) — using container_name spawned DUPLICATE containers (app_id-named install vs name-override reconcile) on the same PGDATA, which corrupted a postgres cluster. Server reaches its siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis). * immich-postgres data_uid 100998:100998 (postgres drops to container 999 → host 100998 under rootless; verified the fresh dir is chowned correctly). * immich-server version "release"→"2.7.4" (manifest validation requires a digit; the bad version made the manifest silently skip → partial orchestrator install → legacy fallback → the duplicate corruption above). - HARDEN install_stack_via_orchestrator: only fall back to the legacy installer when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now errors instead of double-creating containers on shared data (the corruption root cause). - Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped manifest — this gap let the bad immich-server version through. Known follow-up (pre-existing, platform-wide): orchestrator-installed backends (immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service is disabled on .228 → reboot-survival gap independent of this migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 07:08:45 -04:00
archipelago	011081d180	feat(immich): scaffold registry manifests for postgres/redis/server (not yet live) immich becomes a manifest-driven stack (the legacy install_immich_stack — hardcoded podman run + sudo chown — is the anti-pattern being retired). Three image-only manifests modelled on the btcpay stack + the live .228 container config: - immich-postgres / immich-redis / immich-server on archy-net; container_name set to the underscore form (immich_postgres/_redis/_server) so the server's DB_HOSTNAME/REDIS_HOSTNAME aliases resolve. - generated_secrets: [immich-db-password] (idempotent — reuses the live secret on existing nodes; postgres is already initialised with it). - server depends on postgres+redis (install ordering); upload bind preserved. Inert for now: not added to the UI catalog and install_immich_stack still the default, so nothing installs these until the orchestrator wiring + on-node ownership (data_uid) validation lands. Schema validated by the all-manifests round-trip test. See docs/PRODUCTION-MASTER-PLAN.md §6. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:53:58 -04:00
archipelago	7bfbe8fe40	feat(registry-manifest): phase 2 — publisher embeds manifests into signed catalog generate-app-catalog.sh gains opt-in EMBED_MANIFESTS=1: embeds each apps/<id>/manifest.yml into its catalog entry's `manifest` field (whole document, top-level app: preserved — exactly what the Rust side deserializes). Default off so routine catalog regen is unchanged during the migration window; turn on deliberately, then sign via the existing release-root ceremony. Verified: default embeds 0; EMBED_MANIFESTS=1 embeds 40 manifests (generated_secrets preserved). Adds a round-trip guard test: every shipped apps/*/manifest.yml must deserialize + validate through catalog_manifest_to_overlay (image apps accepted, build apps defer to disk) — catches schema drift between disk manifests and the catalog path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:46:17 -04:00
archipelago	220666d3a9	feat(registry-manifest): phase 1 — orchestrator consumes manifests from signed catalog Workstream B phase 1 (node-side consume). The signed app-catalog can now carry a full manifest per entry; the orchestrator overlays it over the disk manifest (origin-wins) with disk as the migration fallback. Moves apps toward registry-distributed manifests with no OTA-shipped disk file. - app_catalog: `manifest: Option<Value>` on AppCatalogEntry (forward-compatible, covered by the existing release-root signature over the raw JSON); `catalog_manifest_values()` accessor. - prod_orchestrator: `load_manifests` overlays catalog manifests after the disk walk; `catalog_manifest_to_overlay()` returns None (→ disk fallback) on unparseable value / app-id mismatch / failed validate() / build source (build contexts aren't registry-distributed yet — phase 1 is image-only). - manifest_dir stays PathBuf (build-only field); image-only apps never read it. - 6 unit tests; compiles clean. No-op until a catalog embeds a manifest, so existing nodes are unaffected. See docs/registry-manifest-design.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:30:38 -04:00
archipelago	192238cbb8	docs: consolidate into PRODUCTION-MASTER-PLAN, add CLAUDE.md, prune 25 stale docs Single authoritative hub (docs/PRODUCTION-MASTER-PLAN.md) for the app-platform north star: every app manifest-driven (zero OS-level reliance), manifests via the signed registry, developer-ready external marketplace; rootless/secure/robust/ 100%-uptime. Repo CLAUDE.md (auto-loaded each session) points agents at it until the 20x lifecycle gate is green. New design doc registry-manifest-design.md. Consolidated docs 56 -> 28: deleted dated handoffs/resumes/transcripts and superseded trackers (content folded into the master plan or already in memory). Kept all evergreen design/reference docs + ADRs (the master links them). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:11:32 -04:00
archipelago	03a4ee1b30	feat(container): manifest-declared generated secrets + companion/quadlet hardening Generated-secrets system: apps declare `generated_secrets` in their manifest (kinds hex16/hex32/bcrypt); `container::secrets::ensure_generated_secrets` materialises them 0600/rootless in resolve_dynamic_env — idempotent and self-healing (recovers wrongly root-owned secrets with no privilege). Replaces per-app Rust (deletes ensure_fmcd_password). fedimint-clientd/gateway manifests now declare fmcd-password / fedimint-gateway-hash. companion.rs: rebuild the auto-built :latest image when its build context changes (staleness check) so baked-in fixes (e.g. guardian-UI CSS) actually reach nodes. quadlet.rs: skip PublishPort under Network=host (podman rejects the combo, exit 125) + regression tests. UI: "Fedimint Guardian" rename, fedimint-clientd/nostr-rs-relay/meshtastic tagged as Services (headless backends), gateway icon fallback. Deployed + verified on .228 (generated-secrets fixed fedimint-gateway start; grafana/strfry orphan crash-loop units removed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 05:11:07 -04:00
archipelago	db7d424bff	feat(content): owned-content persistence + Fedimint paid downloads, fmcd caps fix, FIPS warm-path perf Buyer-side paid downloads now persist: purchases are cached on disk (content_owned.rs) keyed by (seller onion, content_id), the gallery shows an "Owned" badge unblurred, and items view/play in-app from the local cache with no re-payment or reliance on a browser download (which silently failed on the mobile companion). New RPCs content.owned-list / content.owned-get. Validated e2e .116<-.198 (paid 100 sats via Fedimint, 166KB jpeg returns, survives restart). fedimint-clientd manifest: restore the standard container capability set (CHOWN/DAC_OVERRIDE/FOWNER/SETUID/SETGID) so fmcd's startup chown of an existing-federation /data succeeds instead of dying EPERM (#7). Confirmed the orchestrator applies these to the running container. FIPS perf: tighten the supervisor warm-path keepalive 45s -> 25s so peer paths stay inside the ~30-60s NAT cold window. Dials now reliably land on FIPS instead of re-punching and falling back to Tor. Measured to the same peer: cloud browse 18-22s -> 0.4s; full Fedimint paid download 29s -> 11s (residual is the seller-side guardian reissue round-trip). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 18:58:52 -04:00
archipelago	b0c9bd2a0c	docs: #7 exhaustive isolation — seccomp ruled out; fmcd runs standalone, orchestrator-managed fails (open) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 14:39:33 -04:00
archipelago	63b98599e8	Revert "fix(fedimint): run fmcd with seccomp=unconfined so its DHT can start (#7)" This reverts commit 409543c41e78025354acbdde5ffc6445895d4508.	2026-06-20 14:37:24 -04:00
archipelago	409543c41e	fix(fedimint): run fmcd with seccomp=unconfined so its DHT can start (#7) fmcd crash-looped "Operation not permitted (os error 1)" on .116 (kernel 6.12.74): the default rootless seccomp profile blocks a syscall its Mainline-DHT / iroh transport needs, so the REST API never came up (:8178 → HTTP 000) and federations couldn't be joined. Verified: with seccomp=unconfined fmcd boots and answers /v2/* (HTTP 401 instead of dead). fmcd works on other nodes, so this is kernel/seccomp-specific, but the relaxation is safe for an outbound-networking daemon and harmless where not needed. - new `security.seccomp_unconfined` manifest flag (SecurityPolicy); - libpod backend sets `seccomp_profile_path: "unconfined"` (== --security-opt seccomp=unconfined); quadlet backend emits `SeccompProfile=unconfined`; - enabled in apps/fedimint-clientd/manifest.yml. NOTE: manifests live on-disk at /opt/archipelago/apps/<id>/manifest.yml, so the node needs the updated manifest deployed + the fmcd container recreated to apply. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 13:08:13 -04:00
archipelago	d59cf6d299	docs: session 3 — ecash confirm+refund, #5 confirmed, #7 fmcd-on-.116 EPERM Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 12:28:24 -04:00
archipelago	12f54e390d	feat(wallet): ecash pay confirmation screen + auto-refund on failed sale (#3) - PeerFiles: new confirmation step after "pay from ecash": shows the amount and which wallet will be spent (Cashu/Fedimint) with balances, lets the user switch backends, and a styled Confirm button. The chosen backend is passed to the payment so it spends exactly what was confirmed. - content.download-peer-paid: accept `method` (cashu|fedimint) to honor the confirmed choice; log the backend + outcome; backend-specific rejection errors ("not in the same Fedimint federation" / "doesn't accept your Cashu mint"). - AUTO-REFUND: a minted token whose sale fails (peer unreachable, rejected, or error) is now reclaimed (fedimint reissue / cashu receive) so the buyer no longer loses the spent ecash; fixes the stuck-Fedimint-notes report. - wallet.ecash-balance already reports cashu_sats/fedimint_sats/total_sats which the confirm screen uses to pick/show the covering wallet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 12:16:02 -04:00
archipelago	242baf5deb	fix(ui): on-screen error overlay so companion crashes are visible without a console chrome://inspect isn't always reachable on the Android companion WebView, so the real error stayed invisible. Add a plain-DOM, screenshot-able overlay (built without Vue so it survives a crash in Vue itself) that shows the captured error message + stack and a Copy button for the full window.__archyErrors buffer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 10:23:59 -04:00
archipelago	0ab160b5c3	docs: deploy state — all 6 nodes on 4a8f2198 build (#12/#2/#3/#10) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 10:15:59 -04:00
archipelago	a6957a48f7	fix(netbird): wait for OIDC discovery before reporting install done (#10) Right after install the dashboard SPA opens and, if it loads before NetBird's embedded OIDC provider is serving, caches a bad auth state: the user appears logged-in but can't log out until it self-corrects. Container "running" != OIDC ready, so gate the install's Done phase on the management server's /oauth2/.well-known/openid-configuration answering (best-effort, 60s cap, never fails the install since the stack is already up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 08:57:37 -04:00
archipelago	2761f0d70f	docs: handoff — session 2 progress (#12/#2/#3 code-complete, deploy held) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 08:52:07 -04:00
archipelago	a8c668ee0a	fix(ui): stop mobile tab bar covering last row of content (#2) On Cloud/files (and any scrolling view), the bottom of the list could sit behind the fixed mobile tab bar. Cause: DashboardMobileNav measured the bar's offsetHeight and wrote it to --mobile-tab-bar-height, but when the bar was hidden or not yet laid out the measurement was 0, and writing "0px" defeats the ", 88px" fallback in the .mobile-scroll-pad clearance calc (an explicit 0 is still a set value), so the clearance collapsed and the ~88px bar overlapped the last row. - never write 0px: only set a real measured height, else remove the var so the 88px fallback applies. - re-measure after first paint (rAF) and after the WebView safe-area injection, so the clearance reflects the bar's final laid-out height. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 08:50:44 -04:00
archipelago	8f06d88fbf	feat(wallet): pay for peer files from BOTH Cashu and Fedimint ecash (#3) Paying for a peer file minted a Cashu-only token, so a node whose ecash balance lived in Fedimint couldn't pay even with funds. Now both backends are tried: - payer (content.download-peer-paid): mint a Cashu token first; on failure fall back to spending Fedimint notes. Only error if BOTH backends can't cover it. - seller (verify_and_receive_payment): accept Fedimint notes as well as Cashu: anything not starting with "cashu" is redeemed via reissue_into_any. - new fedimint_client::spend_from_any(): spend from whichever joined federation has the balance, returning the notes + federation id (mirrors reissue_into_any). - wallet.ecash-balance now also reports fedimint_sats + combined total_sats; the pay-for-file pre-check uses the combined total so a Fedimint-funded node isn't wrongly blocked. Compiles (cargo check + vue-tsc). Live cross-node federation validation pending (dual-ecash phase 6); needs two nodes sharing a federation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 08:13:23 -04:00
archipelago	b3633ec525	fix(ui): surface real error instead of generic toast + catch async errors The global Vue errorHandler swallowed every crash into "Something went wrong. Please refresh the page." — which hides exactly what we need to diagnose the companion-app (Android WebView) post-login crash. Now: - the toast shows the real (truncated) error message; - a 25-entry ring buffer is kept on window.__archyErrors for retrieval where there's no console (companion WebView via chrome://inspect, or a debug view); - window 'error' and 'unhandledrejection' listeners catch async/non-Vue errors that Vue's errorHandler misses (e.g. a JS API absent in an older WebView). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 08:05:51 -04:00

1380 Commits