archy/docs/HANDOVER-2026-07-02-iso-feedback.md

152 lines
9.3 KiB
Markdown
Raw Normal View History

fix: fresh-ISO feedback bug-bash — onboarding, status truthfulness, recovery, kiosk, logs Fixes from real fresh-install feedback (Framework node .81) + its log bundle: Backend: - websocket: subscribe before initial snapshot — broadcasts in the gap were silently lost, stranding clients on stale state until a hard refresh (the "everything needs ctrl-r" bug: My Apps stuck Loading, App Store stuck Checking, containers-scanned never arriving) - crash recovery: check the crash marker BEFORE writing our own PID — recovery had never run on any node (always saw its own PID and skipped); PID-reuse guard via /proc cmdline - boot status: pending-boot-starts registry (recovery, stack recovery, reconciler, adoption) — scanner overlays queued-but-down apps as Restarting instead of Stopped after a reboot; scanner-authored Restarting resolves immediately on a settled scan (no transitional wedge) - install deps: bounded wait (36x5s) when a dependency is installed but still starting ("Waiting for Bitcoin to start…") instead of instant rejection; dependency-gate rejections remove the optimistic entry (no phantom Stopped tile) and surface as a notification - seed backup: auth.setup persists the onboarding mnemonic as the encrypted seed backup (reveal previously failed on EVERY node — nothing ever wrote master_seed.enc); seed.restore stashes too; error sanitizer lets seed/2FA errors through instead of "Check server logs" - lnd: bitcoind.rpchost resolved from the running Bitcoin variant (hardcoded bitcoin-knots broke Core nodes); manifest uses derived_env - bitcoin status: clean human message for connection-reset/startup; raw URLs + os-error chains no longer reach the app card - fedimint-clientd: chown /var/lib/archipelago/fmcd to 1000:1000 (root- created dir crash-looped the rootless container, EACCES) — first-boot script + pre-start self-heal - log volume (>1GB/day on a day-old node): journald caps drop-in (ISO + bootstrap self-heal), bitcoind -printtoconsole=0 everywhere (90% of the journal was IBD UpdateTip spam), tracing default debug→info Frontend: - Login: Enter advances to confirm field then submits; submit always clickable with inline errors (was silently disabled on mismatch); Restart Onboarding needs a confirming second click (the mismatch → "onboarding restarted" trap) - sync store: 30s state reconciliation + refetch on re-entrant connect; 20s containers-scanned escape hatch so Checking can never show forever; fresh empty node reaches the real "no apps yet" state - intro video: CRF20 re-encode (SSIM 0.988) + faststart — moov was at EOF so playback needed the full 15MB first (the intro lag) - backgrounds: 10 heaviest JPEGs → WebP q90 (9.4MB→6.6MB); 7 stayed JPEG (WebP larger on noisy sources) - Web5ConnectedNodes: drop unused template ref that failed vue-tsc -b ISO/kiosk: - nginx: /assets/ 404s no longer cached immutable for a year; HTTPS block gained the missing /assets/ location (served index.html as images) - kiosk: launcher/service spliced from configs/ at ISO build (stale heredoc force-disabled GPU); MemoryHigh/Max 1200/1500→2200/2800M (kiosk rode the reclaim throttle = the lag); firmware-intel-graphics + firmware-amd-graphics (trixie split DMC blobs out of misc-nonfree) Verified: cargo test 898/898 green, npm run build green with dist contents confirmed (webp refs, lnd.png, faststart video, new strings). Handover for ISO build + deploy: docs/HANDOVER-2026-07-02-iso-feedback.md Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 08:00:39 -04:00
# Handover — fresh-ISO feedback bug-bash (2026-07-02)
**For: the agent building the next ISO + fleet deploy.** All fixes below are
**merged and pushed: gitea-ai main = `f5d24796`** (merge of `c375ecc4`,
65 files; branch `iso-feedback-fixes-2026-07-02` also pushed). Source
feedback: user's fresh ISO install on a Framework (11th-gen Tiger Lake)
machine, node `192.168.1.81` (SSH `archipelago` / `archipelago`).
Diagnostic bundle: `/home/archipelago/incoming-logs/node-logs-192.168.1.81/`.
**⚠️ Known-red tests on main (NOT from this work):** `trust::anchor::
unset_constant_is_none` + 2 `trust::signed_doc` tests fail because a prior
commit pinned `RELEASE_ROOT_PUBKEY_HEX` without updating them. The signing/
audit agent's uncommitted changes in the shared tree fix exactly these —
coordinate with them; don't "fix" it independently or you'll collide. This
bug-bash branch alone was 898/898 green; merged with main it's 894/898 with
only those three.
fix: fresh-ISO feedback bug-bash — onboarding, status truthfulness, recovery, kiosk, logs Fixes from real fresh-install feedback (Framework node .81) + its log bundle: Backend: - websocket: subscribe before initial snapshot — broadcasts in the gap were silently lost, stranding clients on stale state until a hard refresh (the "everything needs ctrl-r" bug: My Apps stuck Loading, App Store stuck Checking, containers-scanned never arriving) - crash recovery: check the crash marker BEFORE writing our own PID — recovery had never run on any node (always saw its own PID and skipped); PID-reuse guard via /proc cmdline - boot status: pending-boot-starts registry (recovery, stack recovery, reconciler, adoption) — scanner overlays queued-but-down apps as Restarting instead of Stopped after a reboot; scanner-authored Restarting resolves immediately on a settled scan (no transitional wedge) - install deps: bounded wait (36x5s) when a dependency is installed but still starting ("Waiting for Bitcoin to start…") instead of instant rejection; dependency-gate rejections remove the optimistic entry (no phantom Stopped tile) and surface as a notification - seed backup: auth.setup persists the onboarding mnemonic as the encrypted seed backup (reveal previously failed on EVERY node — nothing ever wrote master_seed.enc); seed.restore stashes too; error sanitizer lets seed/2FA errors through instead of "Check server logs" - lnd: bitcoind.rpchost resolved from the running Bitcoin variant (hardcoded bitcoin-knots broke Core nodes); manifest uses derived_env - bitcoin status: clean human message for connection-reset/startup; raw URLs + os-error chains no longer reach the app card - fedimint-clientd: chown /var/lib/archipelago/fmcd to 1000:1000 (root- created dir crash-looped the rootless container, EACCES) — first-boot script + pre-start self-heal - log volume (>1GB/day on a day-old node): journald caps drop-in (ISO + bootstrap self-heal), bitcoind -printtoconsole=0 everywhere (90% of the journal was IBD UpdateTip spam), tracing default debug→info Frontend: - Login: Enter advances to confirm field then submits; submit always clickable with inline errors (was silently disabled on mismatch); Restart Onboarding needs a confirming second click (the mismatch → "onboarding restarted" trap) - sync store: 30s state reconciliation + refetch on re-entrant connect; 20s containers-scanned escape hatch so Checking can never show forever; fresh empty node reaches the real "no apps yet" state - intro video: CRF20 re-encode (SSIM 0.988) + faststart — moov was at EOF so playback needed the full 15MB first (the intro lag) - backgrounds: 10 heaviest JPEGs → WebP q90 (9.4MB→6.6MB); 7 stayed JPEG (WebP larger on noisy sources) - Web5ConnectedNodes: drop unused template ref that failed vue-tsc -b ISO/kiosk: - nginx: /assets/ 404s no longer cached immutable for a year; HTTPS block gained the missing /assets/ location (served index.html as images) - kiosk: launcher/service spliced from configs/ at ISO build (stale heredoc force-disabled GPU); MemoryHigh/Max 1200/1500→2200/2800M (kiosk rode the reclaim throttle = the lag); firmware-intel-graphics + firmware-amd-graphics (trixie split DMC blobs out of misc-nonfree) Verified: cargo test 898/898 green, npm run build green with dist contents confirmed (webp refs, lnd.png, faststart video, new strings). Handover for ISO build + deploy: docs/HANDOVER-2026-07-02-iso-feedback.md Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 08:00:39 -04:00
## ⚠️ Outstanding user request for the deploy
- **Change .81's web-UI password to `ThisIsWeb54321@`** — the user forgot the
current one. Node was unreachable from .116 during this session (flaky WiFi
AP, IP flapped .68↔.81). Do this during deploy (SSH works from the user's
machine; `archipelago`/`archipelago`).
## What changed (by file)
### Backend (core/archipelago/src) — builds clean, targeted tests pass
- `api/handler/websocket.rs`**subscribe BEFORE initial snapshot** (the
"everything needs ctrl-r" root cause: broadcasts in the snapshot→subscribe
gap were silently lost; a stale client never learned containers-scanned).
- `main.rs` — crash check now runs BEFORE writing the PID marker (**crash
recovery had never run on any node** — it always saw its own PID and
skipped); tracing default demoted debug→info (journal volume).
- `crash_recovery.rs` — PID-reuse guard (`process_is_archipelago`); new
**pending-boot-starts registry** (names queued for recovery/reconcile) with
writers in `recover_containers` + stack recovery.
- `server.rs` — scanner overlays Stopped/Exited → **Restarting** for
pending-boot-start ids (user ask: "status should be restarting if they are
being restarted"); `SCANNER_RESTARTING` ownership set so scanner-authored
Restarting resolves immediately instead of wedging in the 20-min
transitional-preserve.
- `container/prod_orchestrator.rs` — reconcile pass + `adopt_existing`
register/deregister pending boot-starts; LND pre-start hook passes detected
`bitcoin_host()` (Knots vs Core) into `lnd::ensure_config`; new
`fedimint-clientd` pre-start hook (mkdir + chown 1000:1000 of
`/var/lib/archipelago/fmcd` — self-heals the crash-loop).
- `container/lnd.rs``ensure_config(paths, rpc_pass, bitcoin_host)`;
bitcoind.rpchost no longer hardcoded `bitcoin-knots`; drift check rewrites
host changes; +unit test `ensure_config_repairs_bitcoin_host_drift`.
- `api/rpc/package/dependencies.rs` — bounded **dependency wait**
(`wait_for_install_deps`, 36×5s): installed-but-starting deps wait with
"Waiting for Bitcoin to start…" on the card; not-installed deps fail fast
with `DependencyGateError` marker; +5 unit tests.
- `api/rpc/package/install.rs`, `stacks.rs` — call sites wired to
`gate_install_deps` (lnd/electrumx/mempool/btcpay).
- `api/rpc/package/async_lifecycle.rs``DependencyGateError` removes the
optimistic entry (**no more phantom "Stopped" LND tile**) + pushes an Error
notification with the reason.
- `api/rpc/package/progress.rs``set_install_message` helper.
- `api/rpc/seed_rpc.rs``save_pending_seed_encrypted`; seed.restore also
stashes the mnemonic; `auth.rs` — **auth.setup persists the encrypted seed
backup** (recovery-phrase reveal previously failed on EVERY node because
nothing ever wrote `master_seed.enc`).
- `api/rpc/middleware.rs` — sanitizer allowlist extended (seed/2FA/auth
errors reach the user instead of "Check server logs"); +2 tests.
- `bitcoin_status.rs` — friendly status for "connection reset" (bitcoind
starting); raw URL/os-error chains no longer shown; +3 tests.
- `bootstrap.rs` — journald drop-in self-heal (OTA nodes get log caps);
bitcoin.conf printtoconsole heal. (Log-spam agent's work; verified.)
- `api/rpc/package/config.rs` — bitcoin args `-printtoconsole=0`.
### Manifests / scripts / configs
- `apps/lnd/manifest.yml` — BITCOIND_HOST now `derived_env {{BITCOIN_HOST}}`.
- `apps/bitcoin-knots/manifest.yml`, `apps/bitcoin-core/manifest.yml`
`-printtoconsole=0` (90.6% of the journal was IBD UpdateTip spam;
debug.log in the datadir keeps full logs).
- `scripts/first-boot-containers.sh` — chown 1000:1000 of
`/var/lib/archipelago/fmcd` in BOTH fmcd blocks (root-owned dir was the
fedimint-clientd "Permission denied os error 13" crash-loop);
printtoconsole=0.
- `scripts/container-doctor.sh`, `scripts/reconcile-containers.sh`
printtoconsole=0.
- `image-recipe/configs/journald-archipelago.conf` (NEW) — SystemMaxUse=500M,
rate limits; baked by ISO builder + bootstrap self-heal.
- `image-recipe/configs/nginx-archipelago.conf``/assets/` 404s no longer
cacheable (the `always` immutable header could pin a missing background for
a YEAR); HTTPS block gained the missing `/assets/` location (was silently
serving index.html as images).
- `image-recipe/configs/archipelago-kiosk.service` — MemoryMax 1500→2800M,
MemoryHigh 1200→2200M (kiosk was riding reclaim-throttle = the lag).
- `image-recipe/_archived/build-auto-installer-iso.sh` — kiosk launcher/service
now spliced from `image-recipe/configs/` at build time (was a stale inline
heredoc that force-disabled GPU); **+ `firmware-intel-graphics` +
`firmware-amd-graphics`** (Debian trixie split the i915 DMC blobs out of
firmware-misc-nonfree; the .81 kernel logged tgl_dmc missing).
### Frontend (neode-ui) — vue-tsc clean, vitest green
- `views/Login.vue` — Enter in field 1 → focus confirm; Enter in confirm →
submit; submit button always clickable (shows inline mismatch/length error
instead of being silently disabled); errors clear on input; **Restart
Onboarding needs a confirming second click** (5s window) — this button is
the likely cause of the "onboarding restarted after mismatch" report.
+`login.restartConfirm` key in en/es locales.
- `stores/sync.ts` — 30s staleness reconciliation (server.get-state) while
connected; already-connected fast path now refetches too.
- `composables/useContainersScanTimeout.ts` (NEW, +tests) — 20s escape hatch;
wired into `Apps.vue` / `Discover.vue` / `Marketplace.vue`; fresh empty node
reaches the real "no apps yet" empty state; "Checking…" can never persist.
- Backgrounds: 10 heaviest bg JPEGs → **WebP q90** (9.4MB→6.6MB; refs updated
in OnboardingWrapper/Dashboard/useRouteTransitions); 7 remaining images
stayed JPEG (WebP came out LARGER on those — noisy sources; deliberate).
- `public/assets/video/video-intro.mp4` — re-encoded CRF20 (SSIM 0.988) with
**+faststart** (moov was at EOF → browser had to download all 15MB before
playing = the intro lag). 12.7MB now, streams immediately.
- LND icon: stale dist artifact; any fresh `npm run build` ships
`app-icons/lnd.png` correctly.
## Verification done here
- `cargo build -p archipelago` + `cargo check` clean; targeted tests
(bitcoin_status, middleware sanitize, dep_wait, lnd, crash_recovery,
boot_reconciler, bitcoin_host, prod_orchestrator lnd hooks): **52 passed,
0 failed**. Full suite: **898 passed, 0 failed, 1 ignored** (22s).
- `npm run build` green; dist verified: 10 bg-*.webp present, `lnd.png`
icon present, `restartConfirm` string in bundle, optimized faststart
video (12,740,782 bytes) in place. Note: main had a latent build breaker
(unused template ref in `Web5ConnectedNodes.vue` from commit 8256fde1,
vue-tsc TS6133) — fixed here by removing the dead ref/binding; without
this fix `npm run build` fails on current main.
- vitest: new composable tests + related suites pass.
- `bash -n` clean on all touched scripts; nginx conf live-verified by agent
(200/404/cache headers on both HTTP+HTTPS blocks).
- ISO kiosk splice byte-verified against configs/ by agent simulation.
## NOT done / left for you
1. **Full test-suite run + gate**: run the complete `cargo test` and (after
deploy) `tests/lifecycle/run-gate.sh` ON .228 per CLAUDE.md before any tag.
2. **Frontend bundle grep before shipping** (per memory/feedback): verify new
strings (e.g. `restartConfirm`, `bg-home.webp`) in the built tarball.
3. **Diagnostics collector** (`data-dir-listing.txt` = 15MB of podman overlay
internals; dmidecode empty) — collector script wasn't found in this repo
(likely lives on-node or in the user's collection script); fix when found.
4. **podman healthcheck cgroup EPERM spam** (1,250 journal errors, healthchecks
unreliable fleet-wide) — real open bug, Quadlet-phase territory, NOT fixed.
5. **DP link-training failures on .81** (display corruption) — likely
cable/dock/port hardware; firmware fix may help; tell user to try another
cable/port if corruption recurs.
6. **LoRa/RNode onboarding surface** — never scoped; user may want it as a
feature (mesh device-found modal exists only on Mesh page post-login).
7. The concurrent audit agent's files (`docs/1.8.0-RELEASE-HARDENING-PLAN.md`,
`core/.../trust/*`, parts of `bootstrap.rs`) are ALSO uncommitted here —
coordinate before committing; don't mix attribution.