Handover — fresh-ISO feedback bug-bash (2026-07-02)
For: the agent building the next ISO + fleet deploy. All fixes below are
merged and pushed: gitea-ai main = f5d24796 (merge of c375ecc4,
65 files; branch iso-feedback-fixes-2026-07-02 also pushed). Source
feedback: user's fresh ISO install on a Framework (11th-gen Tiger Lake)
machine, node 192.168.1.81 (SSH archipelago / archipelago).
Diagnostic bundle: /home/archipelago/incoming-logs/node-logs-192.168.1.81/.
⚠️ Known-red tests on main (NOT from this work): trust::anchor::unset_constant_is_none + 2 trust::signed_doc tests fail because a prior
commit pinned RELEASE_ROOT_PUBKEY_HEX without updating them. The signing/
audit agent's uncommitted changes in the shared tree fix exactly these —
coordinate with them; don't "fix" it independently or you'll collide. This
bug-bash branch alone was 898/898 green; merged with main it's 894/898 with
only those three.
⚠️ Outstanding user request for the deploy
- Change .81's web-UI password to ThisIsWeb54321@ (the user forgot the current one). Node was unreachable from .116 during this session (flaky WiFi AP, IP flapped .68↔.81). Do this during deploy (SSH works from the user's machine; archipelago / archipelago).
What changed (by file)
Backend (core/archipelago/src) — builds clean, targeted tests pass
- api/handler/websocket.rs: subscribe BEFORE initial snapshot (the "everything needs ctrl-r" root cause: broadcasts in the snapshot→subscribe gap were silently lost; a stale client never learned containers-scanned).
- main.rs: crash check now runs BEFORE writing the PID marker (crash recovery had never run on any node; it always saw its own PID and skipped); tracing default demoted debug→info (journal volume).
- crash_recovery.rs: PID-reuse guard (process_is_archipelago); new pending-boot-starts registry (names queued for recovery/reconcile) with writers in recover_containers + stack recovery.
- server.rs: scanner overlays Stopped/Exited → Restarting for pending-boot-start ids (user ask: "status should be restarting if they are being restarted"); SCANNER_RESTARTING ownership set so scanner-authored Restarting resolves immediately instead of wedging in the 20-min transitional-preserve.
- container/prod_orchestrator.rs: reconcile pass + adopt_existing register/deregister pending boot-starts; LND pre-start hook passes detected bitcoin_host() (Knots vs Core) into lnd::ensure_config; new fedimint-clientd pre-start hook (mkdir + chown 1000:1000 of /var/lib/archipelago/fmcd, which self-heals the crash-loop).
- container/lnd.rs: ensure_config(paths, rpc_pass, bitcoin_host); bitcoind.rpchost no longer hardcoded to bitcoin-knots; drift check rewrites host changes; + unit test ensure_config_repairs_bitcoin_host_drift.
- api/rpc/package/dependencies.rs: bounded dependency wait (wait_for_install_deps, 36×5s): installed-but-starting deps wait with "Waiting for Bitcoin to start…" on the card; not-installed deps fail fast with DependencyGateError marker; +5 unit tests.
- api/rpc/package/install.rs, stacks.rs: call sites wired to gate_install_deps (lnd/electrumx/mempool/btcpay).
- api/rpc/package/async_lifecycle.rs: DependencyGateError removes the optimistic entry (no more phantom "Stopped" LND tile) + pushes an Error notification with the reason.
- api/rpc/package/progress.rs: set_install_message helper.
- api/rpc/seed_rpc.rs: save_pending_seed_encrypted; seed.restore also stashes the mnemonic; auth.rs: auth.setup persists the encrypted seed backup (recovery-phrase reveal previously failed on EVERY node because nothing ever wrote master_seed.enc).
- api/rpc/middleware.rs: sanitizer allowlist extended (seed/2FA/auth errors reach the user instead of "Check server logs"); +2 tests.
- bitcoin_status.rs: friendly status for "connection reset" (bitcoind starting); raw URL/os-error chains no longer shown; +3 tests.
- bootstrap.rs: journald drop-in self-heal (OTA nodes get log caps); bitcoin.conf printtoconsole heal. (Log-spam agent's work; verified.)
- api/rpc/package/config.rs: bitcoin args -printtoconsole=0.
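For orientation, the bounded dependency wait described for dependencies.rs can be sketched roughly as below. This is a minimal illustration, not the merged code: DepState, GateError, wait_for_dep and the injected probe/sleep/on_waiting closures are hypothetical names chosen here, and only the 36×5s budget, the fail-fast rule for not-installed deps, and the card message come from the notes above.

```rust
use std::time::Duration;

/// Hypothetical dependency states; the real types in dependencies.rs may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DepState {
    NotInstalled,
    Starting,
    Running,
}

/// Hypothetical error type standing in for the DependencyGateError marker.
#[derive(Debug, PartialEq)]
enum GateError {
    /// Dependency is not installed: fail fast, no waiting.
    NotInstalled(String),
    /// Dependency was still starting when the wait budget ran out.
    TimedOut(String),
}

const MAX_ATTEMPTS: u32 = 36; // 36 polls...
const POLL_INTERVAL: Duration = Duration::from_secs(5); // ...5s apart = 180s budget

/// Polls `probe` up to MAX_ATTEMPTS times. `sleep` and `on_waiting` are
/// injected so the sketch can be exercised without real delays or a real UI.
/// Returns the attempt number on which the dependency was seen running.
fn wait_for_dep(
    name: &str,
    mut probe: impl FnMut() -> DepState,
    mut sleep: impl FnMut(Duration),
    mut on_waiting: impl FnMut(&str),
) -> Result<u32, GateError> {
    for attempt in 1..=MAX_ATTEMPTS {
        match probe() {
            DepState::Running => return Ok(attempt),
            DepState::NotInstalled => {
                return Err(GateError::NotInstalled(name.to_string()));
            }
            DepState::Starting => {
                // Surface progress on the install card, then back off.
                on_waiting(&format!("Waiting for {name} to start…"));
                sleep(POLL_INTERVAL);
            }
        }
    }
    Err(GateError::TimedOut(name.to_string()))
}
```

In this sketch a caller would pass a probe that inspects container state, std::thread::sleep (or an async equivalent) as the sleeper, and a closure that updates the install card. Injecting all three keeps the 180s worst case out of unit tests.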
Manifests / scripts / configs
- apps/lnd/manifest.yml: BITCOIND_HOST now derived_env {{BITCOIN_HOST}}.
- apps/bitcoin-knots/manifest.yml, apps/bitcoin-core/manifest.yml: -printtoconsole=0 (90.6% of the journal was IBD UpdateTip spam; debug.log in the datadir keeps full logs).
- scripts/first-boot-containers.sh: chown 1000:1000 of /var/lib/archipelago/fmcd in BOTH fmcd blocks (root-owned dir was the fedimint-clientd "Permission denied os error 13" crash-loop); printtoconsole=0.
- scripts/container-doctor.sh, scripts/reconcile-containers.sh: printtoconsole=0.
- image-recipe/configs/journald-archipelago.conf (NEW): SystemMaxUse=500M, rate limits; baked by ISO builder + bootstrap self-heal.
- image-recipe/configs/nginx-archipelago.conf: /assets/ 404s no longer cacheable (the "always" immutable header could pin a missing background for a YEAR); HTTPS block gained the missing /assets/ location (was silently serving index.html as images).
- image-recipe/configs/archipelago-kiosk.service: MemoryMax 1500→2800M, MemoryHigh 1200→2200M (kiosk was riding reclaim-throttle = the lag).
- image-recipe/_archived/build-auto-installer-iso.sh: kiosk launcher/service now spliced from image-recipe/configs/ at build time (was a stale inline heredoc that force-disabled GPU); + firmware-intel-graphics + firmware-amd-graphics (Debian trixie split the i915 DMC blobs out of firmware-misc-nonfree; the .81 kernel logged tgl_dmc missing).
Frontend (neode-ui) — vue-tsc clean, vitest green
- views/Login.vue: Enter in field 1 → focus confirm; Enter in confirm → submit; submit button always clickable (shows inline mismatch/length error instead of being silently disabled); errors clear on input; Restart Onboarding needs a confirming second click (5s window). This button is the likely cause of the "onboarding restarted after mismatch" report. + login.restartConfirm key in en/es locales.
- stores/sync.ts: 30s staleness reconciliation (server.get-state) while connected; already-connected fast path now refetches too.
- composables/useContainersScanTimeout.ts (NEW, +tests): 20s escape hatch; wired into Apps.vue/Discover.vue/Marketplace.vue; fresh empty node reaches the real "no apps yet" empty state; "Checking…" can never persist.
- Backgrounds: 10 heaviest bg JPEGs → WebP q90 (9.4MB→6.6MB; refs updated in OnboardingWrapper/Dashboard/useRouteTransitions); 7 remaining images stayed JPEG (WebP came out LARGER on those, which have noisy sources; deliberate).
- public/assets/video/video-intro.mp4: re-encoded CRF20 (SSIM 0.988) with +faststart (moov was at EOF → browser had to download all 15MB before playing = the intro lag). 12.7MB now, streams immediately.
- LND icon: stale dist artifact; any fresh npm run build ships app-icons/lnd.png correctly.
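If you want to re-check the +faststart property on the shipped video without extra tooling, a rough sketch follows. It walks the top-level boxes of an MP4 buffer and reports whether moov comes before mdat. The function names top_level_boxes and is_faststart are made up for this note; the box layout (4-byte big-endian size, 4-byte type, size 1 meaning a 64-bit size follows, size 0 meaning "to end of file") is the standard ISO base media format, and the sketch assumes the whole file fits in memory.

```rust
/// Returns the top-level box types of an MP4 byte buffer, in file order.
/// Stops early on truncated or malformed input instead of panicking.
fn top_level_boxes(data: &[u8]) -> Vec<[u8; 4]> {
    let mut boxes = Vec::new();
    let mut pos: usize = 0;
    // Each box header needs at least 8 bytes: size (4) + type (4).
    while data.len().saturating_sub(pos) >= 8 {
        let size32 =
            u32::from_be_bytes([data[pos], data[pos + 1], data[pos + 2], data[pos + 3]]) as u64;
        let kind = [data[pos + 4], data[pos + 5], data[pos + 6], data[pos + 7]];
        let size = match size32 {
            // Size 0: the box extends to the end of the file.
            0 => (data.len() - pos) as u64,
            // Size 1: a 64-bit "largesize" follows the type field.
            1 => {
                if data.len() - pos < 16 {
                    break;
                }
                let mut raw = [0u8; 8];
                raw.copy_from_slice(&data[pos + 8..pos + 16]);
                u64::from_be_bytes(raw)
            }
            n => n,
        };
        boxes.push(kind);
        if size < 8 {
            break; // malformed size; bail out rather than loop forever
        }
        pos = pos.saturating_add(size as usize);
    }
    boxes
}

/// Some(true) if moov precedes mdat (streamable), Some(false) if it follows,
/// None if either box is missing at the top level.
fn is_faststart(data: &[u8]) -> Option<bool> {
    let boxes = top_level_boxes(data);
    let moov = boxes.iter().position(|b| b == b"moov")?;
    let mdat = boxes.iter().position(|b| b == b"mdat")?;
    Some(moov < mdat)
}
```

Usage would be along the lines of reading the built file with std::fs::read and calling is_faststart on the bytes; a result of Some(false) would mean the moov-at-EOF layout is back.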
Verification done here
- cargo build -p archipelago + cargo check clean; targeted tests (bitcoin_status, middleware sanitize, dep_wait, lnd, crash_recovery, boot_reconciler, bitcoin_host, prod_orchestrator lnd hooks): 52 passed, 0 failed. Full suite: 898 passed, 0 failed, 1 ignored (22s).
- npm run build green; dist verified: 10 bg-*.webp present, lnd.png icon present, restartConfirm string in bundle, optimized faststart video (12,740,782 bytes) in place. Note: main had a latent build breaker (unused template ref in Web5ConnectedNodes.vue from commit 8256fde1, vue-tsc TS6133), fixed here by removing the dead ref/binding; without this fix npm run build fails on current main.
- vitest: new composable tests + related suites pass.
- bash -n clean on all touched scripts; nginx conf live-verified by agent (200/404/cache headers on both HTTP+HTTPS blocks).
- ISO kiosk splice byte-verified against configs/ by agent simulation.
NOT done / left for you
- Full test-suite run + gate: run the complete cargo test and (after deploy) tests/lifecycle/run-gate.sh ON .228 per CLAUDE.md before any tag.
- Frontend bundle grep before shipping (per memory/feedback): verify new strings (e.g. restartConfirm, bg-home.webp) in the built tarball.
- Diagnostics collector (data-dir-listing.txt = 15MB of podman overlay internals; dmidecode empty): collector script wasn't found in this repo (likely lives on-node or in the user's collection script); fix when found.
- podman healthcheck cgroup EPERM spam (1,250 journal errors, healthchecks unreliable fleet-wide): real open bug, Quadlet-phase territory, NOT fixed.
- DP link-training failures on .81 (display corruption): likely cable/dock/port hardware; firmware fix may help; tell user to try another cable/port if corruption recurs.
- LoRa/RNode onboarding surface: never scoped; user may want it as a feature (mesh device-found modal exists only on Mesh page post-login).
- The concurrent audit agent's files (docs/1.8.0-RELEASE-HARDENING-PLAN.md, core/.../trust/*, parts of bootstrap.rs) are ALSO uncommitted here. Coordinate before committing; don't mix attribution.