archy/docs/HANDOVER-2026-07-02-iso-feedback.md
archipelago c375ecc441 fix: fresh-ISO feedback bug-bash — onboarding, status truthfulness, recovery, kiosk, logs
Fixes from real fresh-install feedback (Framework node .81) + its log bundle:

Backend:
- websocket: subscribe before initial snapshot — broadcasts in the gap were
  silently lost, stranding clients on stale state until a hard refresh
  (the "everything needs ctrl-r" bug: My Apps stuck Loading, App Store
  stuck Checking, containers-scanned never arriving)
- crash recovery: check the crash marker BEFORE writing our own PID —
  recovery had never run on any node (always saw its own PID and skipped);
  PID-reuse guard via /proc cmdline
- boot status: pending-boot-starts registry (recovery, stack recovery,
  reconciler, adoption) — scanner overlays queued-but-down apps as
  Restarting instead of Stopped after a reboot; scanner-authored
  Restarting resolves immediately on a settled scan (no transitional wedge)
- install deps: bounded wait (36x5s) when a dependency is installed but
  still starting ("Waiting for Bitcoin to start…") instead of instant
  rejection; dependency-gate rejections remove the optimistic entry (no
  phantom Stopped tile) and surface as a notification
- seed backup: auth.setup persists the onboarding mnemonic as the
  encrypted seed backup (reveal previously failed on EVERY node — nothing
  ever wrote master_seed.enc); seed.restore stashes too; error sanitizer
  lets seed/2FA errors through instead of "Check server logs"
- lnd: bitcoind.rpchost resolved from the running Bitcoin variant
  (hardcoded bitcoin-knots broke Core nodes); manifest uses derived_env
- bitcoin status: clean human message for connection-reset/startup; raw
  URLs + os-error chains no longer reach the app card
- fedimint-clientd: chown /var/lib/archipelago/fmcd to 1000:1000 (root-
  created dir crash-looped the rootless container, EACCES) — first-boot
  script + pre-start self-heal
- log volume (>1GB/day on a day-old node): journald caps drop-in (ISO +
  bootstrap self-heal), bitcoind -printtoconsole=0 everywhere (90% of the
  journal was IBD UpdateTip spam), tracing default debug→info

Frontend:
- Login: Enter advances to confirm field then submits; submit always
  clickable with inline errors (was silently disabled on mismatch);
  Restart Onboarding needs a confirming second click (the mismatch →
  "onboarding restarted" trap)
- sync store: 30s state reconciliation + refetch on re-entrant connect;
  20s containers-scanned escape hatch so Checking can never show forever;
  fresh empty node reaches the real "no apps yet" state
- intro video: CRF20 re-encode (SSIM 0.988) + faststart — moov was at EOF
  so playback needed the full 15MB first (the intro lag)
- backgrounds: 10 heaviest JPEGs → WebP q90 (9.4MB→6.6MB); 7 stayed JPEG
  (WebP larger on noisy sources)
- Web5ConnectedNodes: drop unused template ref that failed vue-tsc -b

ISO/kiosk:
- nginx: /assets/ 404s no longer cached immutable for a year; HTTPS block
  gained the missing /assets/ location (served index.html as images)
- kiosk: launcher/service spliced from configs/ at ISO build (stale
  heredoc force-disabled GPU); MemoryHigh/Max 1200/1500→2200/2800M (kiosk
  rode the reclaim throttle = the lag); firmware-intel-graphics +
  firmware-amd-graphics (trixie split DMC blobs out of misc-nonfree)

Verified: cargo test 898/898 green, npm run build green with dist
contents confirmed (webp refs, lnd.png, faststart video, new strings).
Handover for ISO build + deploy: docs/HANDOVER-2026-07-02-iso-feedback.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 08:00:39 -04:00

8.8 KiB
Raw Blame History

Handover — fresh-ISO feedback bug-bash (2026-07-02)

For: the agent building the next ISO + fleet deploy. All fixes below are uncommitted in this working tree (per the user's flow: you audit, build the ISO, deploy). Source feedback: user's fresh ISO install on a Framework (11th-gen Tiger Lake) machine, node 192.168.1.81 (SSH archipelago / archipelago). Diagnostic bundle: /home/archipelago/incoming-logs/node-logs-192.168.1.81/.

⚠️ Outstanding user request for the deploy

  • Change .81's web-UI password to ThisIsWeb54321@ — the user forgot the current one. Node was unreachable from .116 during this session (flaky WiFi AP, IP flapped .68↔.81). Do this during deploy (SSH works from the user's machine; archipelago/archipelago).

What changed (by file)

Backend (core/archipelago/src) — builds clean, targeted tests pass

  • api/handler/websocket.rssubscribe BEFORE initial snapshot (the "everything needs ctrl-r" root cause: broadcasts in the snapshot→subscribe gap were silently lost; a stale client never learned containers-scanned).
  • main.rs — crash check now runs BEFORE writing the PID marker (crash recovery had never run on any node — it always saw its own PID and skipped); tracing default demoted debug→info (journal volume).
  • crash_recovery.rs — PID-reuse guard (process_is_archipelago); new pending-boot-starts registry (names queued for recovery/reconcile) with writers in recover_containers + stack recovery.
  • server.rs — scanner overlays Stopped/Exited → Restarting for pending-boot-start ids (user ask: "status should be restarting if they are being restarted"); SCANNER_RESTARTING ownership set so scanner-authored Restarting resolves immediately instead of wedging in the 20-min transitional-preserve.
  • container/prod_orchestrator.rs — reconcile pass + adopt_existing register/deregister pending boot-starts; LND pre-start hook passes detected bitcoin_host() (Knots vs Core) into lnd::ensure_config; new fedimint-clientd pre-start hook (mkdir + chown 1000:1000 of /var/lib/archipelago/fmcd — self-heals the crash-loop).
  • container/lnd.rsensure_config(paths, rpc_pass, bitcoin_host); bitcoind.rpchost no longer hardcoded bitcoin-knots; drift check rewrites host changes; +unit test ensure_config_repairs_bitcoin_host_drift.
  • api/rpc/package/dependencies.rs — bounded dependency wait (wait_for_install_deps, 36×5s): installed-but-starting deps wait with "Waiting for Bitcoin to start…" on the card; not-installed deps fail fast with DependencyGateError marker; +5 unit tests.
  • api/rpc/package/install.rs, stacks.rs — call sites wired to gate_install_deps (lnd/electrumx/mempool/btcpay).
  • api/rpc/package/async_lifecycle.rsDependencyGateError removes the optimistic entry (no more phantom "Stopped" LND tile) + pushes an Error notification with the reason.
  • api/rpc/package/progress.rsset_install_message helper.
  • api/rpc/seed_rpc.rssave_pending_seed_encrypted; seed.restore also stashes the mnemonic; auth.rsauth.setup persists the encrypted seed backup (recovery-phrase reveal previously failed on EVERY node because nothing ever wrote master_seed.enc).
  • api/rpc/middleware.rs — sanitizer allowlist extended (seed/2FA/auth errors reach the user instead of "Check server logs"); +2 tests.
  • bitcoin_status.rs — friendly status for "connection reset" (bitcoind starting); raw URL/os-error chains no longer shown; +3 tests.
  • bootstrap.rs — journald drop-in self-heal (OTA nodes get log caps); bitcoin.conf printtoconsole heal. (Log-spam agent's work; verified.)
  • api/rpc/package/config.rs — bitcoin args -printtoconsole=0.

Manifests / scripts / configs

  • apps/lnd/manifest.yml — BITCOIND_HOST now derived_env {{BITCOIN_HOST}}.
  • apps/bitcoin-knots/manifest.yml, apps/bitcoin-core/manifest.yml-printtoconsole=0 (90.6% of the journal was IBD UpdateTip spam; debug.log in the datadir keeps full logs).
  • scripts/first-boot-containers.sh — chown 1000:1000 of /var/lib/archipelago/fmcd in BOTH fmcd blocks (root-owned dir was the fedimint-clientd "Permission denied os error 13" crash-loop); printtoconsole=0.
  • scripts/container-doctor.sh, scripts/reconcile-containers.sh — printtoconsole=0.
  • image-recipe/configs/journald-archipelago.conf (NEW) — SystemMaxUse=500M, rate limits; baked by ISO builder + bootstrap self-heal.
  • image-recipe/configs/nginx-archipelago.conf/assets/ 404s no longer cacheable (the always immutable header could pin a missing background for a YEAR); HTTPS block gained the missing /assets/ location (was silently serving index.html as images).
  • image-recipe/configs/archipelago-kiosk.service — MemoryMax 1500→2800M, MemoryHigh 1200→2200M (kiosk was riding reclaim-throttle = the lag).
  • image-recipe/_archived/build-auto-installer-iso.sh — kiosk launcher/service now spliced from image-recipe/configs/ at build time (was a stale inline heredoc that force-disabled GPU); + firmware-intel-graphics + firmware-amd-graphics (Debian trixie split the i915 DMC blobs out of firmware-misc-nonfree; the .81 kernel logged tgl_dmc missing).

Frontend (neode-ui) — vue-tsc clean, vitest green

  • views/Login.vue — Enter in field 1 → focus confirm; Enter in confirm → submit; submit button always clickable (shows inline mismatch/length error instead of being silently disabled); errors clear on input; Restart Onboarding needs a confirming second click (5s window) — this button is the likely cause of the "onboarding restarted after mismatch" report. +login.restartConfirm key in en/es locales.
  • stores/sync.ts — 30s staleness reconciliation (server.get-state) while connected; already-connected fast path now refetches too.
  • composables/useContainersScanTimeout.ts (NEW, +tests) — 20s escape hatch; wired into Apps.vue / Discover.vue / Marketplace.vue; fresh empty node reaches the real "no apps yet" empty state; "Checking…" can never persist.
  • Backgrounds: 10 heaviest bg JPEGs → WebP q90 (9.4MB→6.6MB; refs updated in OnboardingWrapper/Dashboard/useRouteTransitions); 7 remaining images stayed JPEG (WebP came out LARGER on those — noisy sources; deliberate).
  • public/assets/video/video-intro.mp4 — re-encoded CRF20 (SSIM 0.988) with +faststart (moov was at EOF → browser had to download all 15MB before playing = the intro lag). 12.7MB now, streams immediately.
  • LND icon: stale dist artifact; any fresh npm run build ships app-icons/lnd.png correctly.

Verification done here

  • cargo build -p archipelago + cargo check clean; targeted tests (bitcoin_status, middleware sanitize, dep_wait, lnd, crash_recovery, boot_reconciler, bitcoin_host, prod_orchestrator lnd hooks): 52 passed, 0 failed. Full suite: 898 passed, 0 failed, 1 ignored (22s).
  • npm run build green; dist verified: 10 bg-*.webp present, lnd.png icon present, restartConfirm string in bundle, optimized faststart video (12,740,782 bytes) in place. Note: main had a latent build breaker (unused template ref in Web5ConnectedNodes.vue from commit 8256fde1, vue-tsc TS6133) — fixed here by removing the dead ref/binding; without this fix npm run build fails on current main.
  • vitest: new composable tests + related suites pass.
  • bash -n clean on all touched scripts; nginx conf live-verified by agent (200/404/cache headers on both HTTP+HTTPS blocks).
  • ISO kiosk splice byte-verified against configs/ by agent simulation.

NOT done / left for you

  1. Full test-suite run + gate: run the complete cargo test and (after deploy) tests/lifecycle/run-gate.sh ON .228 per CLAUDE.md before any tag.
  2. Frontend bundle grep before shipping (per memory/feedback): verify new strings (e.g. restartConfirm, bg-home.webp) in the built tarball.
  3. Diagnostics collector (data-dir-listing.txt = 15MB of podman overlay internals; dmidecode empty) — collector script wasn't found in this repo (likely lives on-node or in the user's collection script); fix when found.
  4. podman healthcheck cgroup EPERM spam (1,250 journal errors, healthchecks unreliable fleet-wide) — real open bug, Quadlet-phase territory, NOT fixed.
  5. DP link-training failures on .81 (display corruption) — likely cable/dock/port hardware; firmware fix may help; tell user to try another cable/port if corruption recurs.
  6. LoRa/RNode onboarding surface — never scoped; user may want it as a feature (mesh device-found modal exists only on Mesh page post-login).
  7. The concurrent audit agent's files (docs/1.8.0-RELEASE-HARDENING-PLAN.md, core/.../trust/*, parts of bootstrap.rs) are ALSO uncommitted here — coordinate before committing; don't mix attribution.