archy/docs/HANDOVER-2026-07-02-iso-feedback.md
archipelago c375ecc441 fix: fresh-ISO feedback bug-bash — onboarding, status truthfulness, recovery, kiosk, logs
Fixes from real fresh-install feedback (Framework node .81) + its log bundle:

Backend:
- websocket: subscribe before initial snapshot — broadcasts in the gap were
  silently lost, stranding clients on stale state until a hard refresh
  (the "everything needs ctrl-r" bug: My Apps stuck Loading, App Store
  stuck Checking, containers-scanned never arriving)
- crash recovery: check the crash marker BEFORE writing our own PID —
  recovery had never run on any node (always saw its own PID and skipped);
  PID-reuse guard via /proc cmdline
- boot status: pending-boot-starts registry (recovery, stack recovery,
  reconciler, adoption) — scanner overlays queued-but-down apps as
  Restarting instead of Stopped after a reboot; scanner-authored
  Restarting resolves immediately on a settled scan (no transitional wedge)
- install deps: bounded wait (36x5s) when a dependency is installed but
  still starting ("Waiting for Bitcoin to start…") instead of instant
  rejection; dependency-gate rejections remove the optimistic entry (no
  phantom Stopped tile) and surface as a notification
- seed backup: auth.setup persists the onboarding mnemonic as the
  encrypted seed backup (reveal previously failed on EVERY node — nothing
  ever wrote master_seed.enc); seed.restore stashes too; error sanitizer
  lets seed/2FA errors through instead of "Check server logs"
- lnd: bitcoind.rpchost resolved from the running Bitcoin variant
  (hardcoded bitcoin-knots broke Core nodes); manifest uses derived_env
- bitcoin status: clean human message for connection-reset/startup; raw
  URLs + os-error chains no longer reach the app card
- fedimint-clientd: chown /var/lib/archipelago/fmcd to 1000:1000 (root-
  created dir crash-looped the rootless container, EACCES) — first-boot
  script + pre-start self-heal
- log volume (>1GB/day on a day-old node): journald caps drop-in (ISO +
  bootstrap self-heal), bitcoind -printtoconsole=0 everywhere (90% of the
  journal was IBD UpdateTip spam), tracing default debug→info

Frontend:
- Login: Enter advances to confirm field then submits; submit always
  clickable with inline errors (was silently disabled on mismatch);
  Restart Onboarding needs a confirming second click (the mismatch →
  "onboarding restarted" trap)
- sync store: 30s state reconciliation + refetch on re-entrant connect;
  20s containers-scanned escape hatch so Checking can never show forever;
  fresh empty node reaches the real "no apps yet" state
- intro video: CRF20 re-encode (SSIM 0.988) + faststart — moov was at EOF
  so playback needed the full 15MB first (the intro lag)
- backgrounds: 10 heaviest JPEGs → WebP q90 (9.4MB→6.6MB); 7 stayed JPEG
  (WebP larger on noisy sources)
- Web5ConnectedNodes: drop unused template ref that failed vue-tsc -b

ISO/kiosk:
- nginx: /assets/ 404s no longer cached immutable for a year; HTTPS block
  gained the missing /assets/ location (served index.html as images)
- kiosk: launcher/service spliced from configs/ at ISO build (stale
  heredoc force-disabled GPU); MemoryHigh/Max 1200/1500→2200/2800M (kiosk
  rode the reclaim throttle = the lag); firmware-intel-graphics +
  firmware-amd-graphics (trixie split DMC blobs out of misc-nonfree)

Verified: cargo test 898/898 green, npm run build green with dist
contents confirmed (webp refs, lnd.png, faststart video, new strings).
Handover for ISO build + deploy: docs/HANDOVER-2026-07-02-iso-feedback.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 08:00:39 -04:00

143 lines
8.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Handover — fresh-ISO feedback bug-bash (2026-07-02)
**For: the agent building the next ISO + fleet deploy.** All fixes below are
**uncommitted in this working tree** (per the user's flow: you audit, build the
ISO, deploy). Source feedback: user's fresh ISO install on a Framework
(11th-gen Tiger Lake) machine, node `192.168.1.81` (SSH `archipelago` /
`archipelago`). Diagnostic bundle: `/home/archipelago/incoming-logs/node-logs-192.168.1.81/`.
## ⚠️ Outstanding user request for the deploy
- **Change .81's web-UI password to `ThisIsWeb54321@`** — the user forgot the
current one. Node was unreachable from .116 during this session (flaky WiFi
AP, IP flapped .68↔.81). Do this during deploy (SSH works from the user's
machine; `archipelago`/`archipelago`).
## What changed (by file)
### Backend (core/archipelago/src) — builds clean, targeted tests pass
- `api/handler/websocket.rs`**subscribe BEFORE initial snapshot** (the
"everything needs ctrl-r" root cause: broadcasts in the snapshot→subscribe
gap were silently lost; a stale client never learned containers-scanned).
- `main.rs` — crash check now runs BEFORE writing the PID marker (**crash
recovery had never run on any node** — it always saw its own PID and
skipped); tracing default demoted debug→info (journal volume).
- `crash_recovery.rs` — PID-reuse guard (`process_is_archipelago`); new
**pending-boot-starts registry** (names queued for recovery/reconcile) with
writers in `recover_containers` + stack recovery.
- `server.rs` — scanner overlays Stopped/Exited → **Restarting** for
pending-boot-start ids (user ask: "status should be restarting if they are
being restarted"); `SCANNER_RESTARTING` ownership set so scanner-authored
Restarting resolves immediately instead of wedging in the 20-min
transitional-preserve.
- `container/prod_orchestrator.rs` — reconcile pass + `adopt_existing`
register/deregister pending boot-starts; LND pre-start hook passes detected
`bitcoin_host()` (Knots vs Core) into `lnd::ensure_config`; new
`fedimint-clientd` pre-start hook (mkdir + chown 1000:1000 of
`/var/lib/archipelago/fmcd` — self-heals the crash-loop).
- `container/lnd.rs``ensure_config(paths, rpc_pass, bitcoin_host)`;
bitcoind.rpchost no longer hardcoded `bitcoin-knots`; drift check rewrites
host changes; +unit test `ensure_config_repairs_bitcoin_host_drift`.
- `api/rpc/package/dependencies.rs` — bounded **dependency wait**
(`wait_for_install_deps`, 36×5s): installed-but-starting deps wait with
"Waiting for Bitcoin to start…" on the card; not-installed deps fail fast
with `DependencyGateError` marker; +5 unit tests.
- `api/rpc/package/install.rs`, `stacks.rs` — call sites wired to
`gate_install_deps` (lnd/electrumx/mempool/btcpay).
- `api/rpc/package/async_lifecycle.rs``DependencyGateError` removes the
optimistic entry (**no more phantom "Stopped" LND tile**) + pushes an Error
notification with the reason.
- `api/rpc/package/progress.rs``set_install_message` helper.
- `api/rpc/seed_rpc.rs``save_pending_seed_encrypted`; seed.restore also
stashes the mnemonic; `auth.rs` — **auth.setup persists the encrypted seed
backup** (recovery-phrase reveal previously failed on EVERY node because
nothing ever wrote `master_seed.enc`).
- `api/rpc/middleware.rs` — sanitizer allowlist extended (seed/2FA/auth
errors reach the user instead of "Check server logs"); +2 tests.
- `bitcoin_status.rs` — friendly status for "connection reset" (bitcoind
starting); raw URL/os-error chains no longer shown; +3 tests.
- `bootstrap.rs` — journald drop-in self-heal (OTA nodes get log caps);
bitcoin.conf printtoconsole heal. (Log-spam agent's work; verified.)
- `api/rpc/package/config.rs` — bitcoin args `-printtoconsole=0`.
### Manifests / scripts / configs
- `apps/lnd/manifest.yml` — BITCOIND_HOST now `derived_env {{BITCOIN_HOST}}`.
- `apps/bitcoin-knots/manifest.yml`, `apps/bitcoin-core/manifest.yml`
`-printtoconsole=0` (90.6% of the journal was IBD UpdateTip spam;
debug.log in the datadir keeps full logs).
- `scripts/first-boot-containers.sh` — chown 1000:1000 of
`/var/lib/archipelago/fmcd` in BOTH fmcd blocks (root-owned dir was the
fedimint-clientd "Permission denied os error 13" crash-loop);
printtoconsole=0.
- `scripts/container-doctor.sh`, `scripts/reconcile-containers.sh`
printtoconsole=0.
- `image-recipe/configs/journald-archipelago.conf` (NEW) — SystemMaxUse=500M,
rate limits; baked by ISO builder + bootstrap self-heal.
- `image-recipe/configs/nginx-archipelago.conf``/assets/` 404s no longer
cacheable (the `always` immutable header could pin a missing background for
a YEAR); HTTPS block gained the missing `/assets/` location (was silently
serving index.html as images).
- `image-recipe/configs/archipelago-kiosk.service` — MemoryMax 1500→2800M,
MemoryHigh 1200→2200M (kiosk was riding reclaim-throttle = the lag).
- `image-recipe/_archived/build-auto-installer-iso.sh` — kiosk launcher/service
now spliced from `image-recipe/configs/` at build time (was a stale inline
heredoc that force-disabled GPU); **+ `firmware-intel-graphics` +
`firmware-amd-graphics`** (Debian trixie split the i915 DMC blobs out of
firmware-misc-nonfree; the .81 kernel logged tgl_dmc missing).
### Frontend (neode-ui) — vue-tsc clean, vitest green
- `views/Login.vue` — Enter in field 1 → focus confirm; Enter in confirm →
submit; submit button always clickable (shows inline mismatch/length error
instead of being silently disabled); errors clear on input; **Restart
Onboarding needs a confirming second click** (5s window) — this button is
the likely cause of the "onboarding restarted after mismatch" report.
+`login.restartConfirm` key in en/es locales.
- `stores/sync.ts` — 30s staleness reconciliation (server.get-state) while
connected; already-connected fast path now refetches too.
- `composables/useContainersScanTimeout.ts` (NEW, +tests) — 20s escape hatch;
wired into `Apps.vue` / `Discover.vue` / `Marketplace.vue`; fresh empty node
reaches the real "no apps yet" empty state; "Checking…" can never persist.
- Backgrounds: 10 heaviest bg JPEGs → **WebP q90** (9.4MB→6.6MB; refs updated
in OnboardingWrapper/Dashboard/useRouteTransitions); 7 remaining images
stayed JPEG (WebP came out LARGER on those — noisy sources; deliberate).
- `public/assets/video/video-intro.mp4` — re-encoded CRF20 (SSIM 0.988) with
**+faststart** (moov was at EOF → browser had to download all 15MB before
playing = the intro lag). 12.7MB now, streams immediately.
- LND icon: stale dist artifact; any fresh `npm run build` ships
`app-icons/lnd.png` correctly.
## Verification done here
- `cargo build -p archipelago` + `cargo check` clean; targeted tests
(bitcoin_status, middleware sanitize, dep_wait, lnd, crash_recovery,
boot_reconciler, bitcoin_host, prod_orchestrator lnd hooks): **52 passed,
0 failed**. Full suite: **898 passed, 0 failed, 1 ignored** (22s).
- `npm run build` green; dist verified: 10 bg-*.webp present, `lnd.png`
icon present, `restartConfirm` string in bundle, optimized faststart
video (12,740,782 bytes) in place. Note: main had a latent build breaker
(unused template ref in `Web5ConnectedNodes.vue` from commit 8256fde1,
vue-tsc TS6133) — fixed here by removing the dead ref/binding; without
this fix `npm run build` fails on current main.
- vitest: new composable tests + related suites pass.
- `bash -n` clean on all touched scripts; nginx conf live-verified by agent
(200/404/cache headers on both HTTP+HTTPS blocks).
- ISO kiosk splice byte-verified against configs/ by agent simulation.
## NOT done / left for you
1. **Full test-suite run + gate**: run the complete `cargo test` and (after
deploy) `tests/lifecycle/run-gate.sh` ON .228 per CLAUDE.md before any tag.
2. **Frontend bundle grep before shipping** (per memory/feedback): verify new
strings (e.g. `restartConfirm`, `bg-home.webp`) in the built tarball.
3. **Diagnostics collector** (`data-dir-listing.txt` = 15MB of podman overlay
internals; dmidecode empty) — collector script wasn't found in this repo
(likely lives on-node or in the user's collection script); fix when found.
4. **podman healthcheck cgroup EPERM spam** (1,250 journal errors, healthchecks
unreliable fleet-wide) — real open bug, Quadlet-phase territory, NOT fixed.
5. **DP link-training failures on .81** (display corruption) — likely
cable/dock/port hardware; firmware fix may help; tell user to try another
cable/port if corruption recurs.
6. **LoRa/RNode onboarding surface** — never scoped; user may want it as a
feature (mesh device-found modal exists only on Mesh page post-login).
7. The concurrent audit agent's files (`docs/1.8.0-RELEASE-HARDENING-PLAN.md`,
`core/.../trust/*`, parts of `bootstrap.rs`) are ALSO uncommitted here —
coordinate before committing; don't mix attribution.