archy/docs/HANDOVER-2026-07-02-iso-feedback.md
archipelago 8b6485078a docs(handover): pushed-to-main state + pre-existing trust test failures caveat
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 08:36:12 -04:00

# Handover — fresh-ISO feedback bug-bash (2026-07-02)
**For: the agent building the next ISO + fleet deploy.** All fixes below are
**merged and pushed: gitea-ai main = `f5d24796`** (merge of `c375ecc4`,
65 files; branch `iso-feedback-fixes-2026-07-02` also pushed). Source
feedback: user's fresh ISO install on a Framework (11th-gen Tiger Lake)
machine, node `192.168.1.81` (SSH `archipelago` / `archipelago`).
Diagnostic bundle: `/home/archipelago/incoming-logs/node-logs-192.168.1.81/`.

**⚠️ Known-red tests on main (NOT from this work):**
`trust::anchor::unset_constant_is_none` + 2 `trust::signed_doc` tests fail
because a prior commit pinned `RELEASE_ROOT_PUBKEY_HEX` without updating
them. The signing/audit agent's uncommitted changes in the shared tree fix
exactly these; coordinate with them and don't "fix" it independently or
you'll collide. This bug-bash branch alone was 898/898 green; merged with
main it's 894/898 with only those three.
## ⚠️ Outstanding user request for the deploy
- **Change .81's web-UI password to `ThisIsWeb54321@`** — the user forgot the
current one. Node was unreachable from .116 during this session (flaky WiFi
AP, IP flapped .68↔.81). Do this during deploy (SSH works from the user's
machine; `archipelago`/`archipelago`).
## What changed (by file)
### Backend (core/archipelago/src) — builds clean, targeted tests pass
- `api/handler/websocket.rs` — **subscribe BEFORE initial snapshot** (the
"everything needs ctrl-r" root cause: broadcasts in the snapshot→subscribe
gap were silently lost; a stale client never learned containers-scanned).
- `main.rs` — crash check now runs BEFORE writing the PID marker (**crash
recovery had never run on any node** — it always saw its own PID and
skipped); tracing default demoted debug→info (journal volume).
- `crash_recovery.rs` — PID-reuse guard (`process_is_archipelago`); new
**pending-boot-starts registry** (names queued for recovery/reconcile) with
writers in `recover_containers` + stack recovery.
- `server.rs` — scanner overlays Stopped/Exited → **Restarting** for
pending-boot-start ids (user ask: "status should be restarting if they are
being restarted"); `SCANNER_RESTARTING` ownership set so scanner-authored
Restarting resolves immediately instead of wedging in the 20-min
transitional-preserve.
- `container/prod_orchestrator.rs` — reconcile pass + `adopt_existing`
register/deregister pending boot-starts; LND pre-start hook passes detected
`bitcoin_host()` (Knots vs Core) into `lnd::ensure_config`; new
`fedimint-clientd` pre-start hook (mkdir + chown 1000:1000 of
`/var/lib/archipelago/fmcd` — self-heals the crash-loop).
- `container/lnd.rs` — `ensure_config(paths, rpc_pass, bitcoin_host)`;
bitcoind.rpchost no longer hardcoded `bitcoin-knots`; drift check rewrites
host changes; +unit test `ensure_config_repairs_bitcoin_host_drift`.
- `api/rpc/package/dependencies.rs` — bounded **dependency wait**
(`wait_for_install_deps`, 36×5s): installed-but-starting deps wait with
"Waiting for Bitcoin to start…" on the card; not-installed deps fail fast
with `DependencyGateError` marker; +5 unit tests.
- `api/rpc/package/install.rs`, `stacks.rs` — call sites wired to
`gate_install_deps` (lnd/electrumx/mempool/btcpay).
- `api/rpc/package/async_lifecycle.rs` — `DependencyGateError` removes the
optimistic entry (**no more phantom "Stopped" LND tile**) + pushes an Error
notification with the reason.
- `api/rpc/package/progress.rs` — `set_install_message` helper.
- `api/rpc/seed_rpc.rs` — `save_pending_seed_encrypted`; seed.restore also
stashes the mnemonic; `auth.rs` — **auth.setup persists the encrypted seed
backup** (recovery-phrase reveal previously failed on EVERY node because
nothing ever wrote `master_seed.enc`).
- `api/rpc/middleware.rs` — sanitizer allowlist extended (seed/2FA/auth
errors reach the user instead of "Check server logs"); +2 tests.
- `bitcoin_status.rs` — friendly status for "connection reset" (bitcoind
starting); raw URL/os-error chains no longer shown; +3 tests.
- `bootstrap.rs` — journald drop-in self-heal (OTA nodes get log caps);
bitcoin.conf printtoconsole heal. (Log-spam agent's work; verified.)
- `api/rpc/package/config.rs` — bitcoin args `-printtoconsole=0`.
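
The `websocket.rs` ordering fix above is easier to see in miniature. The following is a minimal sketch, not the actual handler: `Hub`, `connect`, and the integer state revision are hypothetical stand-ins, and a std `mpsc` channel per subscriber replaces whatever broadcast channel the server really uses. It shows why an update that lands between the two steps is lost with snapshot-then-subscribe and survives with subscribe-then-snapshot.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Hypothetical stand-in for server state plus its broadcast fan-out.
struct Inner {
    version: u64,                  // state revision, bumped on every change
    subscribers: Vec<Sender<u64>>, // one channel per connected client
}

struct Hub {
    inner: Mutex<Inner>,
}

impl Hub {
    fn new() -> Self {
        Hub { inner: Mutex::new(Inner { version: 0, subscribers: Vec::new() }) }
    }

    /// Register a listener; every later `publish` is queued for it.
    fn subscribe(&self) -> Receiver<u64> {
        let (tx, rx) = channel();
        self.inner.lock().unwrap().subscribers.push(tx);
        rx
    }

    /// Read the current state revision.
    fn snapshot(&self) -> u64 {
        self.inner.lock().unwrap().version
    }

    /// Change the state and notify whoever is subscribed right now.
    fn publish(&self) {
        let mut inner = self.inner.lock().unwrap();
        inner.version += 1;
        let v = inner.version;
        // Drop subscribers whose receiving end has gone away.
        inner.subscribers.retain(|tx| tx.send(v).is_ok());
    }
}

/// Connect a client and return the revision it ends up believing.
/// `racing_update` models a broadcast that fires between the two steps.
fn connect(hub: &Hub, subscribe_first: bool, racing_update: impl FnOnce()) -> u64 {
    let (snapshot, rx) = if subscribe_first {
        let rx = hub.subscribe();
        racing_update(); // queued on rx and also visible in the snapshot
        (hub.snapshot(), rx)
    } else {
        let snapshot = hub.snapshot();
        racing_update(); // nobody is subscribed yet, so this is lost
        (snapshot, hub.subscribe())
    };
    // Apply the snapshot, then any queued updates that are newer than it.
    let mut seen = snapshot;
    for v in rx.try_iter() {
        if v > seen {
            seen = v;
        }
    }
    seen
}

fn main() {
    let hub = Hub::new();
    let old = connect(&hub, false, || hub.publish());
    println!("snapshot-then-subscribe: client at {old}, server at {}", hub.snapshot());
    let new = connect(&hub, true, || hub.publish());
    println!("subscribe-then-snapshot: client at {new}, server at {}", hub.snapshot());
}
```

With subscribe-first, the racing update can reach the client twice (once inside the snapshot, once on the channel), so updates have to be applied idempotently; the sketch does that by keeping only the highest revision.
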
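
The bounded dependency wait in `dependencies.rs` can be pictured as a poll loop with two early exits. This is a sketch under assumptions: only the 36×5s budget, the fail-fast rule for not-installed deps, and the card message come from the notes above. `DepState`, `GateError`, `wait_for_dep`, and the injected `probe`/`on_wait`/`sleep` closures are illustrative names, not the real `wait_for_install_deps` signature, and skipping the sleep after the final attempt is a choice made here.

```rust
use std::time::Duration;

/// Hypothetical view of a dependency's lifecycle state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DepState {
    NotInstalled,
    Starting,
    Running,
}

/// Hypothetical error type; the real code uses a `DependencyGateError` marker.
#[derive(Debug, PartialEq)]
enum GateError {
    NotInstalled(String),
    TimedOut(String),
}

/// Poll `probe` up to `attempts` times, `interval` apart.
/// Not-installed fails immediately; starting keeps waiting until the budget runs out.
fn wait_for_dep(
    name: &str,
    attempts: u32,
    interval: Duration,
    mut probe: impl FnMut() -> DepState,
    mut on_wait: impl FnMut(&str),
    mut sleep: impl FnMut(Duration),
) -> Result<(), GateError> {
    for attempt in 0..attempts {
        match probe() {
            DepState::Running => return Ok(()),
            // Waiting cannot help if the dependency was never installed.
            DepState::NotInstalled => return Err(GateError::NotInstalled(name.to_string())),
            DepState::Starting => {
                on_wait(&format!("Waiting for {name} to start…"));
                // No point sleeping after the last probe.
                if attempt + 1 < attempts {
                    sleep(interval);
                }
            }
        }
    }
    Err(GateError::TimedOut(name.to_string()))
}

fn main() {
    // Demo: the dependency reports Running on the third poll.
    let mut polls = 0;
    let result = wait_for_dep(
        "Bitcoin",
        36,
        Duration::from_secs(5),
        || {
            polls += 1;
            if polls < 3 { DepState::Starting } else { DepState::Running }
        },
        |msg| println!("{msg}"),
        |_| {}, // no real sleeping in the demo
    );
    println!("{result:?} after {polls} polls");
}
```

Injecting the probe and the sleep keeps the loop testable without a container runtime or real delays, which is presumably how unit tests for this kind of gate stay fast.
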
### Manifests / scripts / configs
- `apps/lnd/manifest.yml` — BITCOIND_HOST now `derived_env {{BITCOIN_HOST}}`.
- `apps/bitcoin-knots/manifest.yml`, `apps/bitcoin-core/manifest.yml` —
`-printtoconsole=0` (90.6% of the journal was IBD UpdateTip spam;
debug.log in the datadir keeps full logs).
- `scripts/first-boot-containers.sh` — chown 1000:1000 of
`/var/lib/archipelago/fmcd` in BOTH fmcd blocks (root-owned dir was the
fedimint-clientd "Permission denied os error 13" crash-loop);
printtoconsole=0.
- `scripts/container-doctor.sh`, `scripts/reconcile-containers.sh` —
printtoconsole=0.
- `image-recipe/configs/journald-archipelago.conf` (NEW) — SystemMaxUse=500M,
rate limits; baked by ISO builder + bootstrap self-heal.
- `image-recipe/configs/nginx-archipelago.conf` — `/assets/` 404s no longer
cacheable (the `always` immutable header could pin a missing background for
a YEAR); HTTPS block gained the missing `/assets/` location (was silently
serving index.html as images).
- `image-recipe/configs/archipelago-kiosk.service` — MemoryMax 1500→2800M,
MemoryHigh 1200→2200M (kiosk was riding reclaim-throttle = the lag).
- `image-recipe/_archived/build-auto-installer-iso.sh` — kiosk launcher/service
now spliced from `image-recipe/configs/` at build time (was a stale inline
heredoc that force-disabled GPU); **+ `firmware-intel-graphics` +
`firmware-amd-graphics`** (Debian trixie split the i915 DMC blobs out of
firmware-misc-nonfree; the .81 kernel logged tgl_dmc missing).
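
The fmcd ownership fix exists twice: in `first-boot-containers.sh` and as the backend's `fedimint-clientd` pre-start hook. A minimal Rust sketch of the self-healing step follows, assuming the hook only needs "create if missing, then make sure the owner is 1000:1000". `ensure_owned_dir` is a hypothetical helper name; the path and IDs are taken from the notes above.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::{chown, MetadataExt};
use std::path::Path;

/// Create `dir` (and parents) if missing and make sure it is owned by `uid:gid`.
/// Returns `Ok(true)` when ownership had to be changed, `Ok(false)` when it was already right.
fn ensure_owned_dir(dir: &Path, uid: u32, gid: u32) -> io::Result<bool> {
    fs::create_dir_all(dir)?;
    let meta = fs::metadata(dir)?;
    if meta.uid() == uid && meta.gid() == gid {
        return Ok(false);
    }
    // Changing the owner to a different user needs root (or CAP_CHOWN).
    chown(dir, Some(uid), Some(gid))?;
    Ok(true)
}

fn main() {
    // Path and 1000:1000 come from the handover; run as root for this to succeed.
    match ensure_owned_dir(Path::new("/var/lib/archipelago/fmcd"), 1000, 1000) {
        Ok(changed) => println!("fmcd dir ready (ownership changed: {changed})"),
        Err(e) => eprintln!("pre-start hook failed: {e}"),
    }
}
```

Checking ownership before calling `chown` makes the hook a no-op on healthy nodes, so it is safe to run on every start.
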
### Frontend (neode-ui) — vue-tsc clean, vitest green
- `views/Login.vue` — Enter in field 1 → focus confirm; Enter in confirm →
submit; submit button always clickable (shows inline mismatch/length error
instead of being silently disabled); errors clear on input; **Restart
Onboarding needs a confirming second click** (5s window) — this button is
the likely cause of the "onboarding restarted after mismatch" report.
+`login.restartConfirm` key in en/es locales.
- `stores/sync.ts` — 30s staleness reconciliation (server.get-state) while
connected; already-connected fast path now refetches too.
- `composables/useContainersScanTimeout.ts` (NEW, +tests) — 20s escape hatch;
wired into `Apps.vue` / `Discover.vue` / `Marketplace.vue`; fresh empty node
reaches the real "no apps yet" empty state; "Checking…" can never persist.
- Backgrounds: 10 heaviest bg JPEGs → **WebP q90** (9.4MB→6.6MB; refs updated
in OnboardingWrapper/Dashboard/useRouteTransitions); 7 remaining images
stayed JPEG (WebP came out LARGER on those — noisy sources; deliberate).
- `public/assets/video/video-intro.mp4` — re-encoded CRF20 (SSIM 0.988) with
**+faststart** (moov was at EOF → browser had to download all 15MB before
playing = the intro lag). 12.7MB now, streams immediately.
- LND icon: stale dist artifact; any fresh `npm run build` ships
`app-icons/lnd.png` correctly.
## Verification done here
- `cargo build -p archipelago` + `cargo check` clean; targeted tests
(bitcoin_status, middleware sanitize, dep_wait, lnd, crash_recovery,
boot_reconciler, bitcoin_host, prod_orchestrator lnd hooks): **52 passed,
0 failed**. Full suite: **898 passed, 0 failed, 1 ignored** (22s).
- `npm run build` green; dist verified: 10 bg-*.webp present, `lnd.png`
icon present, `restartConfirm` string in bundle, optimized faststart
video (12,740,782 bytes) in place. Note: main had a latent build breaker
(unused template ref in `Web5ConnectedNodes.vue` from commit 8256fde1,
vue-tsc TS6133) — fixed here by removing the dead ref/binding; without
this fix `npm run build` fails on current main.
- vitest: new composable tests + related suites pass.
- `bash -n` clean on all touched scripts; nginx conf live-verified by agent
(200/404/cache headers on both HTTP+HTTPS blocks).
- ISO kiosk splice byte-verified against configs/ by agent simulation.
## NOT done / left for you
1. **Full test-suite run + gate**: run the complete `cargo test` and (after
deploy) `tests/lifecycle/run-gate.sh` ON .228 per CLAUDE.md before any tag.
2. **Frontend bundle grep before shipping** (per memory/feedback): verify new
strings (e.g. `restartConfirm`, `bg-home.webp`) in the built tarball.
3. **Diagnostics collector** (`data-dir-listing.txt` = 15MB of podman overlay
internals; dmidecode empty) — collector script wasn't found in this repo
(likely lives on-node or in the user's collection script); fix when found.
4. **podman healthcheck cgroup EPERM spam** (1,250 journal errors, healthchecks
unreliable fleet-wide) — real open bug, Quadlet-phase territory, NOT fixed.
5. **DP link-training failures on .81** (display corruption) — likely
cable/dock/port hardware; firmware fix may help; tell user to try another
cable/port if corruption recurs.
6. **LoRa/RNode onboarding surface** — never scoped; user may want it as a
feature (mesh device-found modal exists only on Mesh page post-login).
7. The concurrent audit agent's files (`docs/1.8.0-RELEASE-HARDENING-PLAN.md`,
`core/.../trust/*`, parts of `bootstrap.rs`) are ALSO uncommitted here —
coordinate before committing; don't mix attribution.