118 Commits

Author SHA1 Message Date
archipelago
c548705147 docs: master plan — mark registry-manifest phases 1-3 + immich + reboot-survival done
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 08:25:40 -04:00
archipelago
9e6c5370fc feat(immich): manifest-driven stack via orchestrator — live-migrated on .228
Completes the immich migration off the legacy hardcoded install_immich_stack
(podman run + sudo chown) to the registry-manifest + orchestrator path. Validated
live on .228 (clean single set, healthy v2.7.4, data dir ownership correct).

- install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids)
  first; legacy remains only as the no-manifests fallback.
- immich-{postgres,redis,server} manifests corrected from live findings:
  * named by app_id (dropped container_name override) — using container_name
    spawned DUPLICATE containers (app_id-named install vs name-override reconcile)
    on the same PGDATA, which corrupted a postgres cluster. Server reaches its
    siblings via app_id aliases (DB_HOSTNAME=immich-postgres, REDIS=immich-redis).
  * immich-postgres data_uid 100998:100998 (postgres drops to container 999 →
    host 100998 under rootless; verified the fresh dir is chowned correctly).
  * immich-server version "release"→"2.7.4" (manifest validation requires a digit;
    the bad version made the manifest silently skip → partial orchestrator install
    → legacy fallback → the duplicate corruption above).
- HARDEN install_stack_via_orchestrator: only fall back to the legacy installer
  when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now
  errors instead of double-creating containers on shared data (the corruption
  root cause).
- Strict the all-manifests round-trip test: fail (not skip) on any invalid shipped
  manifest — this gap let the bad immich-server version through.

Known follow-up (pre-existing, platform-wide): orchestrator-installed backends
(immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service
is disabled on .228 → reboot-survival gap independent of this migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 07:08:45 -04:00
archipelago
220666d3a9 feat(registry-manifest): phase 1 — orchestrator consumes manifests from signed catalog
Workstream B phase 1 (node-side consume). The signed app-catalog can now carry a
full manifest per entry; the orchestrator overlays it over the disk manifest
(origin-wins) with disk as the migration fallback. Moves apps toward
registry-distributed manifests with no OTA-shipped disk file.

- app_catalog: `manifest: Option<Value>` on AppCatalogEntry (forward-compatible,
  covered by the existing release-root signature over the raw JSON);
  `catalog_manifest_values()` accessor.
- prod_orchestrator: `load_manifests` overlays catalog manifests after the disk
  walk; `catalog_manifest_to_overlay()` returns None (→ disk fallback) on
  unparseable value / app-id mismatch / failed validate() / build source
  (build contexts aren't registry-distributed yet — phase 1 is image-only).
- manifest_dir stays PathBuf (build-only field); image-only apps never read it.
- 6 unit tests; compiles clean. No-op until a catalog embeds a manifest, so
  existing nodes are unaffected.

See docs/registry-manifest-design.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:30:38 -04:00
archipelago
192238cbb8 docs: consolidate into PRODUCTION-MASTER-PLAN, add CLAUDE.md, prune 25 stale docs
Single authoritative hub (docs/PRODUCTION-MASTER-PLAN.md) for the app-platform
north star: every app manifest-driven (zero OS-level reliance), manifests via the
signed registry, developer-ready external marketplace; rootless/secure/robust/
100%-uptime. Repo CLAUDE.md (auto-loaded each session) points agents at it until
the 20x lifecycle gate is green. New design doc registry-manifest-design.md.

Consolidated docs 56 -> 28: deleted dated handoffs/resumes/transcripts and
superseded trackers (content folded into the master plan or already in memory).
Kept all evergreen design/reference docs + ADRs (the master links them).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 05:11:32 -04:00
archipelago
b0c9bd2a0c docs: #7 exhaustive isolation — seccomp ruled out; fmcd runs standalone, orchestrator-managed fails (open)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 14:39:33 -04:00
archipelago
d59cf6d299 docs: session 3 — ecash confirm+refund, #5 confirmed, #7 fmcd-on-.116 EPERM
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 12:28:24 -04:00
archipelago
0ab160b5c3 docs: deploy state — all 6 nodes on 4a8f2198 build (#12/#2/#3/#10)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:15:59 -04:00
archipelago
2761f0d70f docs: handoff — session 2 progress (#12/#2/#3 code-complete, deploy held)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 08:52:07 -04:00
archipelago
5f7e8dca80 docs: handoff — mesh rename done, .120->.89 dup-contact diagnosis, netbird TODO
Resume notes for the 1.8.0 bug-bash mesh work: Meshtastic rename shipped +
verified; .120->.89 'non-delivery' diagnosed to a duplicate-contact surfacing
bug (messages inject fine, split across federation/radio twin contact_ids);
design for the dedup fix (#12) and the netbird logout-race map (#10).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 06:06:03 -04:00
archipelago
1bce694ebb feat(ui): mobile mesh tabs, AIUI-style audio player, cloud grid + map fixes
UI (this session):
- Global audio player now scales the whole interface into the space above it
  on desktop (sidebar + main) and docks directly above the tab bar on mobile;
  it stays visible while navigating.
- Mesh mobile redesign: floating Chat / BTC / Dead Man / AI / Map tab strip
  with a single fixed, internally-scrolling pane (page no longer scrolls);
  tabs hide while a conversation is open; floating back button; collapsible
  Device panel (starts collapsed); keyboard-aware conversation sizing via
  VisualViewport so the chat sits just above the keyboard.
- Cloud file grid: uniform 4/3 card heights (folders + images match).
- Swipe left/right switches tabs on the Apps and Web5 screens.
- Map tool fills its pane (no bottom gap); fix skewed Share Location toggle
  on mobile (global min-height rule was deforming the switch).
- Trim redundant helper copy from the mesh AI tab.

Also bundles pre-existing in-progress work that was already in the tree:
mesh listener/session + wallet + container + bitcoin-status backend changes,
docker UI updates, and assorted other UI tweaks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:52:26 -04:00
archipelago
705e2436ba chore(ops,docs): first-boot containers, image versions, design docs, android remote-input
- first-boot-containers + image-versions for fmcd/fedimint
- dual-ecash, meshroller-integration, and remaining-issues design docs
- Android remote-input two-finger scroll + external-open handling

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 19:22:02 -04:00
archipelago
27a6199939 feat(dht): Phase 4 — paid swarm streaming (cross-mint ecash + Shape-A ALPN)
Fetch-side auto-pay decision layer (payment.rs), Shape-A paid-blobs
negotiation ALPN (paid_alpn.rs), cross-mint ecash swap + payer auto-swap
builder + idempotent resume/liquidity cache (ecash.rs), and the
streaming.prepare-payment RPC. All gated behind the iroh-swarm feature
(off by default). 91/91 tests pass, both build configs clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 07:36:31 -04:00
archipelago
1f3b03bc6d docs(dht): Phase 4 plan (paid streaming/relay/IndeeHub + cross-mint) + RESUME update
phase4-streaming-ecash-plan.md: design for ecash-paid swarm transport, paying
across different mints (§2a, Lightning-bridged swaps), networking-through-nodes
relay, and an IndeeHub "Archipelago" content source. Records the resolved
iroh-blobs paid-serving spike. dht-RESUME.md: task #12 + step F marked done.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 04:48:18 -04:00
archipelago
f14829542b docs(dht): RESUME checkpoint — state, next steps, build/worktree rules
Single source of truth for picking the DHT work back up after a restart:
worktree/branch rules, all phase commits, the exact next task (#12 Phase 3
glue), build-time facts, and the Phase 0 go-live ceremony.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 15:18:00 -04:00
archipelago
4c4cf6d8b4 docs(dht): peer-distributed content design (iroh swarm + signed manifests)
Captures the verified 2026-06-16 design: swarm-assist/origin-always-wins,
iroh-blobs as the swarm engine, BLAKE3 addressing, signed Nostr/release-root
authenticity, and the Phase 0-4 plan. Foundation doc for the dht branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 11:15:47 -04:00
archipelago
aa9e0f02b7 fix(cloud): pin peer file-card filename + action buttons to the bottom (#11)
Make each peer file card a flex column filling its grid cell (flex flex-col
h-full) and pin the body row (filename + Play/Download) with mt-auto, so cards
with a media preview and cards without line their footers up across the row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 09:27:29 -04:00
archipelago
459046b21c docs: resume notes for LND wallet fix (in-progress, branch lnd-wallet-password-fix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:26:10 -04:00
archipelago
9d3347463a docs: record v1.7.91 + v1.7.92 published; What's New gate; .116 nginx fix
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:20:35 -04:00
archipelago
329e7811eb test(lifecycle): add os-audit OS-wide health gate; docs: v1.7.91 resume notes
os-audit.sh: one non-destructive scorecard tying backend/RPC health, the
all-apps lifecycle audit (delegates to remote-lifecycle.sh), and the FM-guards
(port-drift, secret-completeness, orphan-container sweep, OTA-wedge). The
per-boot building block for the reboot-survival loop. FM12 check uses jq has()
not // (// treats a legit false as empty). Section A validated all-PASS on .116.

docs: v1.7.91 release-pass resume notes + the bitcoinReceive blocker writeup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 04:36:06 -04:00
archipelago
bea745047d docs: record F1 live validation on .116 (green)
Before/after on the live node confirms the launch_url_port fix:
jellyfin/btcpay/fedimint/gitea/portainer/botfights all went from
lan_address=None to a resolved http://localhost:PORT/ URL; harness
focused audit passed, exit 0. Also documents that archipelago.service
restarts are safe on .116 (containers run in the user-1000 slice, a
different cgroup, and survived the restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 03:55:58 -04:00
archipelago
a483fe4baa fix: derive launch port from URL authority, not naive rsplit
reachable_lan_address() parsed the launch port with url.rsplit(':')
which yields "8096/" for manifest interfaces.main URLs that carry a
path (http://localhost:8096/). That fails to parse and silently drops
a perfectly reachable launch URL, so apps like jellyfin, btcpay-server,
fedimint, gitea, nextcloud and portainer showed running with no launch
link in the UI. New launch_url_port() reads digits after the final
colon (mirroring port_from_url in the RPC layer) and tolerates a
trailing path. Adds regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 03:35:19 -04:00
archipelago
d6f108d818 chore: snapshot release workspace 2026-06-12 03:00:15 -04:00
archipelago
f818f1dcc1 app-platform: remove unsupported saleor release surface 2026-06-11 01:16:21 -04:00
archipelago
881478a873 app-platform: type manifest launch interfaces 2026-06-11 00:52:16 -04:00
archipelago
755ba5562d app-platform: derive launch URLs from manifests 2026-06-11 00:33:24 -04:00
archipelago
182f18ecf3 docs: capture 1.8 app migration release plan 2026-06-11 00:24:54 -04:00
archipelago
d736364ad7 fix(apps): stabilize btcpay and public proxy launch flows 2026-05-19 09:26:43 -04:00
archipelago
daad50325b chore(release): require curated release notes 2026-05-17 18:59:12 -04:00
Dorian
835c525218 chore(release): stage v1.7.55-alpha 2026-05-13 15:09:22 -04:00
archipelago
c0751e2551 chore(release): stage v1.7.54-alpha 2026-05-06 09:23:57 -04:00
archipelago
8f83b37d51 feat(orchestrator): complete container migration and release hardening 2026-04-28 15:00:58 -04:00
archipelago
ed73e4709b chore(release): archive ISO build recipes, tarball-only releases
Releases no longer ship as bootable ISOs. Archipelago updates are
distributed as the backend binary plus a frontend tarball referenced by
releases/manifest.json. Nodes OTA-update via scripts/self-update.sh.

Filebrowser and AIUI remain bundled inside the frontend tarball and
deployed atomically, verified present in v1.7.43-alpha release artifact
(189 AIUI files, filebrowser-client bundle).

Archived under image-recipe/_archived/ (resurrectable if ISO distribution
is reintroduced):
  - build-auto-installer-iso.sh
  - build-unbundled-iso.sh
  - test-iso-qemu.sh
  - scripts/convert-iso-to-disk.sh
  - BUILD-ISO-STATUS.md, ISO-BUILD-CHECKLIST.md
  - branding/isohdpfx.bin
  - .gitea/workflows/build-iso-dev.yml

Updated release process docs to drop ISO references:
  - scripts/create-release.sh (next-steps text)
  - docs/BETA-RELEASE-CHECKLIST.md
  - docs/hotfix-process.md
  - README.md
2026-04-23 15:36:00 -04:00
archipelago
c4efb30382 docs(release-notes): v1.7.43-alpha bullet for install-log fix; prune stale RESUME note 2026-04-23 12:04:20 -04:00
archipelago
353825b66c docs: release-note image-versions fix, add marketplace QA tracker, update RESUME
- AccountInfoSection.vue: append 5th bullet to v1.7.43-alpha entry
  explaining that update-available badges and version comparisons
  work again now that the pinned-image catalog is found at the
  correct deployed path.

- docs/MARKETPLACE-QA.md: new tracker for the upcoming app-by-app
  install walk on .228. Documents the per-app fix workflow, the
  four layers we might need to fix at (app recipe, registry image,
  backend orchestrator, frontend), status-key table for tracking
  each catalog entry, and the release-notes policy for the walk.

- docs/RESUME.md: refresh with a9908597 commit, updated binary md5
  on .228, and split Immediate Next Step into Phase 1 (browser
  verification) and Phase 2 (marketplace walk) with a pointer to
  the new tracker.
2026-04-23 09:32:41 -04:00
archipelago
4faac9cb74 docs(resume): add RESUME.md for context-restart recovery
Consolidated single-file snapshot of plan + progress for a fresh
OpenCode session to pick up the install UX polish work:

- Where we are: v1.7.43-alpha shipped, 5 commits on main, deployed
  to .228, browser verification in progress.
- Immediate next step: await user's verification results from
  https://192.168.1.228/ browser checklist.
- Working layout: SSHFS mount, ssh archy / archy228, deploy recipes.
- Architecture patterns: async-spawn lifecycle, phase-based install
  progress, scanner kick, .23 auto-purge migration.
- Backlog: Vaultwarden exit-on-start, install log perms, 22 stale
  cargo test failures, historical changelog entries left intact.
- User preferences: "best long-term first", one-by-one, no push,
  Bitcoin-only, conventional commits.

Complements STATUS.md (which remains the engineering log) with a
tighter resume-the-work narrative focused on the current round.
2026-04-23 09:14:36 -04:00
archipelago
b62b731db0 docs(status): record rounds 3-5 + config migration + changelog as shipped
Adds a new top section to STATUS.md covering v1.7.43-alpha:

- Round 3: phase-based install progress bar
- Round 4: post-install scanner kick for instant Launch button
- Round 5: .23 VPS retirement, .168 promoted to Server 1
- Config migration: auto-purge .23 from saved registry/mirror JSONs
- Changelog: new v1.7.43-alpha entry in AccountInfoSection

All 5 commits, deployment md5, verification notes, and git remote
cleanup captured. Round 2 rollback command still valid for the full
stack since backups predate every round in this session.
2026-04-23 09:09:02 -04:00
archipelago
576ff1a6de docs(status): mark install/uninstall/update async-spawn as shipped 2026-04-23 06:58:45 -04:00
archipelago
0ea4f96de9 docs(status): mark async-spawn lifecycle fix as shipped
Records the four landed commits, the .228 deploy (binary + frontend
paths, backups, md5), the manual LND Stop verification, and the
rollback incantation. Leaves the older "NEXT SESSION" design block
in place as historical reference with a note that it's stale.

Adds a follow-ups list: chaos matrix is now unblocked, bundled-app
RPCs are still sync (deprecate or mirror-async?), transitional_since
is in-memory only, and there are 22 pre-existing test failures in
unrelated modules that should get their own cleanup pass.
2026-04-23 05:30:45 -04:00
archipelago
cad63bdd76 docs: STATUS.md — FUSE/SSHFS development loop section
Dedicated section covering the file-ops-via-mount + git/cargo-via-ssh
split that makes this dev setup work. Includes:

- Exact running mount command (pulled from ps)
- macFUSE + sshfs-mac brew install path
- Health check + recovery sequence for when mount hangs (it will)
- Full which-path-for-which-operation table
- Don't-do list (cargo from mount, rsync without AppleDouble exclude, etc)
- Cache caveat and inode-sharing note between mount and SSH views

No code change.
2026-04-23 04:51:53 -04:00
archipelago
bb2e3fab42 docs: STATUS.md — complete SSH/key/sudo/deploy reference for next session
Expands NEXT SESSION header with fully verified access info so a fresh
agent has zero ambiguity:

- SSH key inventory across laptop, .116, .228 (every file, purpose noted)
- Actual SSH config aliases (archy, archy228) with IdentitiesOnly
- Verified connectivity matrix (laptop -> both; .116 -> .228; .228 has no outbound key)
- Corrected sudo state: .228 sudoers file is /etc/sudoers.d/archipelago
  (not archipelago-ci); .116 has archipelago-ci + archipelago-wg scope-limited drop-ins
- SSHFS mount source command + AppleDouble gotcha
- Cargo over SSH PATH gotcha + detached build pattern for >2min timeout
- End-to-end deploy-to-.228 recipe (build, SCP, atomic swap, verify)
- Git workflow rules (no push, no amend, no force, conventional commits)

Removes duplicate host-reference block that the prior edit left trailing.
No code change.
2026-04-23 04:49:45 -04:00
archipelago
6a5fab709a docs: STATUS.md — dashboard Stop UX bug diagnosis + async-spawn fix plan
Captures full design for the next session:
- Full bug sequence (5.5min blocking RPC + 30s scan clobbering transitional state)
- 4-commit implementation order with exact file:line targets
- Single-button UI spec with full label table
- Verification gates including manual LND stop test on .228
- Architectural decision: spawn lives in RPC layer, orchestrator trait stays sync

No code change yet; next session implements.
2026-04-23 04:45:12 -04:00
archipelago
2a2f10608b docs: STATUS.md — .228 dashboard bugs fixed (macaroon + ExtraHost) 2026-04-23 04:17:56 -04:00
archipelago
28819d1197 docs: STATUS.md through Step 9 (.228 hot-swap verified)
Logs Step 9 acceptance evidence, the two bugs caught and fixed during
the hot-swap (parse_memory_limit IEC suffix bug in 732df1b8 and
cgroup Delegate in ba83f9bc), and outlines the Step 10 plan for .116.
2026-04-23 03:46:23 -04:00
archipelago
c396be8068 feat(iso): Step 8a — retire archipelago-reconcile systemd timer
BootReconciler (in-process, 30s interval, spawned from main.rs as of
Step 6 commit 48f08aa3) fully replaces the timer-driven bash
reconciliation path. Delete the systemd unit + timer and their
ISO-builder touchpoints.

Removed:
- image-recipe/configs/archipelago-reconcile.service
- image-recipe/configs/archipelago-reconcile.timer
- image-recipe/build-auto-installer-iso.sh L412-413 (COPY unit+timer)
- image-recipe/build-auto-installer-iso.sh L449 (systemctl enable)
- image-recipe/build-auto-installer-iso.sh L542-543 (cp to WORK_DIR)

Kept (intentionally):
- scripts/reconcile-containers.sh
- scripts/container-specs.sh

Reason: core/archipelago/src/api/rpc/package/update.rs still invokes
reconcile-containers.sh at two sites (OTA update + rollback paths).
Porting those call sites to ContainerOrchestrator::upgrade() requires
manifests for every container update.rs might touch — that scope
belongs in Step 8b. Until then the script stays on disk, just no
longer runs on a periodic timer.

No Rust code changes. cargo check -p archipelago clean, 6 pre-existing
warnings. Skipped full ISO rebuild validation per user decision —
edits are 5 textual deletions with zero behavioral ambiguity; Step 9
live hot-swap on .228 will catch any regression.
2026-04-23 03:04:58 -04:00
archipelago
236a2dee85 docs: split Step 8 into 8a/8b/8c
Discovered during Step 8 execution that first-boot-containers.sh
creates 30+ containers with per-container logic (wallet loads, DB
init, rpcauth derivations, post-create health waits) and does
substantial non-container setup (secret gen, rootless-podman subuid
chowns, Tor hostnames, WireGuard, firewall, nostr-relay). Only 3 of
the 30+ containers have manifests today (the UIs from Step 7).

Deleting the bash in a single step bricks first-boot on fresh
installs. Split into:

- 8a: delete reconcile-containers.sh + container-specs.sh + reconcile
  systemd unit + timer. BootReconciler fully covers these. Safe,
  atomic, no manifest porting required.
- 8b: port remaining ~25 containers into apps/<id>/manifest.yml. One
  manifest per commit, validated against current bash behavior.
  Multi-day scope.
- 8c: rename first-boot-containers.sh -> first-boot-setup.sh, strip
  container ops, keep secret/dir/Tor/WG/firewall setup. Final
  one-way door, requires 8b complete.
2026-04-23 02:34:43 -04:00
archipelago
758d3e47d8 docs: STATUS.md through Step 7 2026-04-23 02:21:01 -04:00
archipelago
ba8bd0bb86 docs: STATUS.md through Step 6 2026-04-22 19:20:17 -04:00
archipelago
89199bb03b docs: update STATUS.md — Step 4 done, Step 5 next
Records acceptance evidence for Steps 1-4 (container tests 21/21 pass, build
clean with expected unused-method warnings) and queues the BootReconciler
implementation for Step 5.
2026-04-22 18:57:43 -04:00
archipelago
919055f3f1 feat(container): add build source to manifest schema
ContainerConfig.image is now Option<String>, mutually exclusive with a new
optional ContainerConfig.build: Option<BuildConfig>. Exactly one of image
or build must be present, enforced in AppManifest::validate.

Adds ResolvedSource enum (Pull | Build) and ContainerConfig::resolve +
::image_ref helpers so the orchestrator can treat pull and build uniformly.
All 26 existing pull-only manifests continue to parse unchanged
(covered by existing_pull_only_manifests_still_parse test).

Call sites updated: podman_client, runtime::DockerRuntime, dev_orchestrator.
Dev orchestrator errors out cleanly on Build sources until Step 2 lands
build_image support on the runtime trait.

Step 1 of docs/rust-orchestrator-migration.md. 10 new unit tests, all pass.

Also includes: docs/rust-orchestrator-migration.md (design spec) and
docs/STATUS.md resume section for the next session.
2026-04-22 17:46:36 -04:00
archipelago
d1bcf271f9 release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet
Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.

What changed:
- apply_update() writes a pending-verify marker with old+new version and
  a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
  present and within its freshness window, the new binary waits 15s for
  nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
  up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
  nothing else happens.
- On window-exhaust, the new binary:
    1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
       (quarantined, not deleted, so we can post-mortem).
    2. Restores web-ui.bak on top of web-ui.
    3. Calls rollback_update() to restore the previous binary.
    4. Updates state.current_version to reflect the rollback.
    5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
  probing, so a crashed-during-startup marker from weeks ago cannot
  spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
  tokio::fs::copy, so it escapes the service's ProtectSystem=strict
  mount namespace. Without this, the rollback silently failed with
  EROFS on /usr/local/bin and orphaned the rollback - the exact
  opposite of what auto-rollback is for.

Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.

Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from tar tvzf pipe head pipe awk) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
2026-04-22 16:14:35 -04:00