archy/docs/WEEKLY_RELEASE_TRACKER.md
archipelago 459046b21c docs: resume notes for LND wallet fix (in-progress, branch lnd-wallet-password-fix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:26:10 -04:00

# Weekly Release Tracker
Last updated: 2026-06-14 (session on node .116 / archi-thinkpad)

---
# ▶ IN PROGRESS — LND wallet auto-unlock fix (2026-06-14)
## RESUME PROMPT (paste into a fresh session, on .116 / archi-thinkpad, tree at /home/archipelago/Projects/archy)
> Resume the LND wallet-password fix. Read memory `project_lnd_wallet_password.md` FIRST (full
> root-cause + design + validated facts). Work is on branch `lnd-wallet-password-fix` (pushed to
> gitea-vps2, commit 91adc281, NOT merged to main, NOT shipped). Bug: hardcoded
> `WALLET_PASSWORD="hellohello"` left LND wallets LOCKED fleet-wide after OTA → Bitcoin-receive
> shows "wallet is locked" on every updated node. DONE + cargo-checked: per-node random secret
> (secrets/lnd-wallet-password), both init paths unified, candidate-unlock with fail-fast,
> login-time candidate-migration (ChangePassword). DETECTION GATE already shipped on main
> (commit 8c8e4d7a). DECISION: alpha, NO funds on nodes → destructive wipe+recreate is OK and
> wanted UNATTENDED for ALL nodes in the next update. A wallet locked with an unknown password is
> already inaccessible, so wiping loses nothing reachable.
## EXACT NEXT STEPS — LND fix (in order)
1. **Finish seed/fresh recovery** (REMAINING piece): in `container/lnd.rs ensure_wallet_initialized`,
when wallet.db exists but ALL unlock candidates fail → wipe wallet.db (+ macaroons + graph/chain
mainnet state, as root via host_sudo) and re-init fresh (random genseed + per-node secret) so the
node self-heals unattended at boot. (Login-time candidate-migration already handles nodes whose
pw matches.) Validate the wipe→reinit mechanic on the scratch LND first (see below).
2. **Scratch validation** (was in progress, .249 unreachable from .116's subnet → use a throwaway
`lnd-scratch` podman container on .116, regtest/neutrino, REST :18099 — already proven for
init/unlock/ChangePassword). Test: init(passA) → restart→LOCKED → delete wallet.db while locked →
confirm /v1/state→NON_EXISTING (may need container restart) → genseed+initwallet fresh → unlock.
NOTE: scratch wallet.db lives at the container's LND data dir (regtest), `podman exec lnd-scratch
find / -name wallet.db`. CLEAN UP: `podman rm -f lnd-scratch` when done.
3. `cargo check -p archipelago` (on .116 ~15-30s incremental; full test compile ~9min).
4. **End-to-end on .228** (reachable 192.168.1.x, SSH pw `archipelago`, UI pw unknown, NO funds —
has a locked unknown-pw wallet = perfect auto-recreate test): build binary
(`ARCHIPELAGO_TARGET=archipelago@192.168.1.228 scripts/deploy-to-target.sh` or per
reference_deploy_to_nodes), deploy, restart, confirm wallet auto-recreates+unlocks, lncli state
RPC_ACTIVE, lnd.newaddress returns an address. Run os-audit against .228 → lnd check PASS.
5. Merge `lnd-wallet-password-fix` → main, then **cut + publish v1.7.93-alpha** (carries the LND
fix). Ship ritual: create-release.sh 1.7.93-alpha → add CHANGELOG (≥3 layman bullets) → run
sync-whats-new.py (the new What's-New gate will require it) → publish-release-assets.sh gitea-vps2
→ push origin/gitea-vps2 + tags → verify live manifest==1.7.93-alpha. Heads-up: create-release
leaves core/Cargo.lock version-bump uncommitted (commit it as a chore, both .91 and .92 hit this).
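
The recovery decision in step 1 can be sketched as a small decision function. This is a minimal Python sketch, not the Rust code in `container/lnd.rs`: `ensure_wallet`, `Outcome`, and the three injected callables are hypothetical stand-ins for the real unlock, wipe, and init calls.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    INITIALIZED_FRESH = "initialized_fresh"      # no wallet.db existed
    UNLOCKED = "unlocked"                        # a candidate password worked
    WIPED_AND_RECREATED = "wiped_and_recreated"  # every candidate failed


def ensure_wallet(
    wallet_exists: bool,
    candidates: Iterable[str],
    try_unlock: Callable[[str], bool],
    wipe_wallet: Callable[[], None],
    init_fresh: Callable[[], None],
) -> Outcome:
    """Hypothetical decision flow; the callables stand in for the real LND calls."""
    if not wallet_exists:
        init_fresh()
        return Outcome.INITIALIZED_FRESH
    for password in candidates:
        if try_unlock(password):
            return Outcome.UNLOCKED
    # All candidates failed: the wallet is already unreachable, so wiping it
    # loses nothing, and a fresh init lets the node self-heal unattended.
    wipe_wallet()
    init_fresh()
    return Outcome.WIPED_AND_RECREATED
```

Keeping the side effects behind callables means the ordering (wipe strictly before re-init, and only after every candidate has failed) can be unit-tested without a running LND.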
## Context: how we got here (this session, all on node .116)
- Shipped **v1.7.91-alpha** (bitcoinReceive TS2538 build fix) and **v1.7.92-alpha** (ElectrumX
overlay-during-sync fix; L3 reboot os-audit gate; What's-New sync gate + 8-version backfill) —
both LIVE on vps2. Restored .116-local nginx `/lnd-connect-info` route (was dropped 2026-06-10).
- Triaged user symptoms: ElectrumX "can't connect" = electrs syncing / Bitcoin verifying (not a
regression); .228 "5/14 apps after reboot" = normal ~5min staggered startup (all 14 came up).
- LND lock bug found + detection gate shipped + forward fix & migration implemented (this section).
---
# ✔ DONE PASS — v1.7.91-alpha + v1.7.92-alpha (2026-06-14)
## Outcome (both releases PUBLISHED + LIVE on vps2)
- **v1.7.91-alpha** — bitcoinReceive.ts TS2538 build-blocker fixed; cut, published, verified
live (`manifest.version==1.7.91-alpha`), tag `v1.7.91-alpha` on vps2. The fleet OTA'd to it
(confirmed on .116 + .198).
- **v1.7.92-alpha** — cut, published, verified live (`manifest.version==1.7.92-alpha`), tag on
vps2, main@d462e444. Carries:
- `fix(ui)` ElectrumX **overlay-during-sync** bug — the "App not reachable / retry" overlay
no longer paints over the ElectrumX sync screen (AppSessionFrame.vue gated on `!electrsSync`).
- `test(resilience)` **L3 per-boot health gate**: `batch_host_reboot` now runs os-audit.sh
after reboot (RPC/OTA/all-apps/FM-guards), not just container-set equality. os-audit validated
11/0/0 green on .116.
- `feat(release)` **What's New sync gate**: `scripts/sync-whats-new.py` + `whats-new-sync`
stage in tests/release/run.sh. Backfilled the 8 missing modal blocks (v1.7.85→.92); the gate
fails any release whose CHANGELOG version isn't in the Settings modal.
- **.116 node fix (not shipped — local config)**: restored the `/lnd-connect-info` nginx proxy
route that a 2026-06-10 "before-116-routing" change had dropped (fell through to SPA). Backup at
`/etc/nginx/conf.d/rpc.tx1138.com.conf.bak-lndconnect-*`. Shipped template already has the route.
- **User symptoms triaged (none were .91/.92 regressions)**: receive-generate "unchanged" = .91's
receive change was a behavior-preserving build guard; ElectrumX "can't connect" on .198 = Bitcoin
node mid-"Verifying blocks…" (-28) so electrs was "waiting for Bitcoin node"; on .116 electrs was
~59% mid-sync. The overlay UX bug is fixed regardless.
## Known follow-ups (not blockers)
- **gitea-local mirror push fails** (`localhost:3000` → redirect to `/login`, token auth). vps2 is
the OTA source and is fine; gitea-local secondary mirror is stale. Diagnose the local Gitea token.
- `sync-whats-new.py` only **inserts missing** versions; it does not rewrite a block when CHANGELOG
bullets for an already-present version change (had to delete+resync the .92 block by hand to pick
up its 3rd bullet). Fine for the forward case; enhance to idempotently re-render if needed.
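
The difference between insert-only syncing and an idempotent re-render can be sketched as follows. This is an illustration only: the real `sync-whats-new.py` data model is not shown here, so the version-to-bullets dictionaries and both function names are assumptions.

```python
from typing import Dict, List

Blocks = Dict[str, List[str]]  # assumed shape: version -> list of bullets


def insert_missing(modal: Blocks, changelog: Blocks) -> Blocks:
    """Behavior as described above: add absent versions, never touch existing ones."""
    merged = {version: list(bullets) for version, bullets in modal.items()}
    for version, bullets in changelog.items():
        merged.setdefault(version, list(bullets))
    return merged


def rerender(modal: Blocks, changelog: Blocks) -> Blocks:
    """Possible enhancement: CHANGELOG wins for every version it contains."""
    merged = {version: list(bullets) for version, bullets in modal.items()}
    for version, bullets in changelog.items():
        merged[version] = list(bullets)
    return merged
```

With `rerender`, running the sync twice yields the same result as running it once, and a late third bullet is picked up without deleting the block by hand.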
## What happened this session
- `scripts/create-release.sh 1.7.91-alpha` was running; its release gate PASSED all 7 checks,
backend built clean (7m22s), then it **FAILED at step [4/8] frontend build** with:
`src/utils/bitcoinReceive.ts(23,24): error TS2538: Type 'undefined' cannot be used as an index type.`
Cause: under `noUncheckedIndexedAccess`, `codeMatch[1]` is `string | undefined` and was used directly
to index `RECEIVE_CODE_MESSAGES`. **FIXED**: `const code = message.match(/\[([A-Z_]+)\]/)?.[1]`
then `if (code && RECEIVE_CODE_MESSAGES[code])`. `npx vue-tsc --noEmit` is now clean (exit 0).
The failed run aborted BEFORE bumping the manifest (still 1.7.90) or tagging (no v1.7.91 tag),
but it HAD already partial-bumped Cargo.toml/package.json/locks to 1.7.91 — those partial bumps
are reverted (create-release.sh re-owns the bump); only the genuine TS fix + harness are committed.
- Built a new OS-wide health harness `tests/lifecycle/os-audit.sh` (non-destructive, one scorecard):
Section A backend/RPC health, Section B all-apps lifecycle audit (delegates to remote-lifecycle.sh),
Section C FM-guards (port-drift + secret-completeness bats, orphan-container sweep). Section A
validated all-PASS on .116. Fixed a jq bug in the FM12 OTA-wedge check: `//` treats a legit
`false` as empty and fell through to "unknown" — now uses `has()`. Section B is slow (~3 min) and
opaque while running because output is captured (`out=$(...)`) not streamed — minor wart, TODO.
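
The FM12 jq bug belongs to a general class: a falsy-fallback operator cannot tell "key is `false`" from "key is missing". jq's `//` yields its right-hand side when the left side is `false` or `null`. The Python analogue below is a sketch of the same pitfall and of the presence-check fix; the function names are hypothetical, and Python's `or` is broader than jq's `//` (it also falls through on `0` and `""`).

```python
import json


def flag_with_fallback(state: dict) -> object:
    # Analogue of jq `.update_in_progress // "unknown"`:
    # a legitimate false falls through to the default.
    return state.get("update_in_progress") or "unknown"


def flag_with_presence_check(state: dict) -> object:
    # Analogue of the has()-based fix: only a missing key yields "unknown".
    if "update_in_progress" in state:
        return state["update_in_progress"]
    return "unknown"


state = json.loads('{"update_in_progress": false}')
print(flag_with_fallback(state))        # the buggy reading
print(flag_with_presence_check(state))  # the corrected reading
```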
## EXACT NEXT STEPS — v1.7.91 (in order)
1. Confirm clean tree + on main (`git status`; create-release.sh requires `git diff --quiet HEAD`).
The TS fix + os-audit.sh are committed & pushed; version-bump artifacts reverted to 1.7.90.
2. Re-run the release: `scripts/create-release.sh 1.7.91-alpha`. Backend is cached (only a .ts
changed) so it's fast; the frontend build now passes. It bumps versions, builds, writes
releases/manifest.json (→1.7.91-alpha), commits, and tags v1.7.91-alpha.
- Memory guards: grep the staged frontend tarball for "1.7.91-alpha" before shipping (silent
vue-tsc failures); tarball must be flat (`tar -C web/dist/neode-ui .`).
3. Publish: `scripts/publish-release-assets.sh 1.7.91-alpha gitea-vps2`, then
`git push origin main && git push origin --tags` (origin pushes to BOTH gitea-local + vps2).
4. Verify manifest LIVE (this is "published"):
`curl -fsS http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
must show `1.7.91-alpha`. **Then notify the user — they asked to be told when 1.7.91 publishes.**
5. os-audit harness: run a full green pass on .116
(`ARCHY_HOST=127.0.0.1 ARCHY_SCHEME=http ARCHY_PASSWORD='ThisIsWeb54321@' tests/lifecycle/os-audit.sh`),
confirm Section A FM12 now reads `update_in_progress=false` (PASS not WARN), review B + C findings,
then wire os-audit.sh into the reboot-survival (L3) loop as the per-boot gate.
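
The "manifest LIVE" check in step 4 can also be scripted. A minimal sketch, assuming only what the `jq .version` command implies (the manifest is JSON with a top-level `version` field); the function names are hypothetical.

```python
import json
import urllib.request

MANIFEST_URL = (
    "http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json"
)


def manifest_version(manifest_text: str) -> str:
    """Extract the top-level version field from the manifest JSON."""
    return json.loads(manifest_text)["version"]


def is_published(manifest_text: str, expected: str) -> bool:
    """True only when the live manifest advertises exactly the expected version."""
    return manifest_version(manifest_text) == expected


def fetch_manifest(url: str = MANIFEST_URL, timeout: float = 10.0) -> str:
    """Download the manifest; raises on HTTP or network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Usage would be along the lines of `is_published(fetch_manifest(), "1.7.91-alpha")`; splitting the fetch from the comparison keeps the comparison testable offline.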
---
# ─ HISTORY — v1.7.89-alpha pass (2026-06-12), superseded ─
Last updated: 2026-06-12 ~17:45 EDT (session on node .116)
## RESUME PROMPT (paste into a fresh session)
> Continue the v1.7.89-alpha release pass from /home/archipelago/Projects/archy on node .116.
> Read docs/WEEKLY_RELEASE_TRACKER.md fully first — it has root causes, fixes already made,
> and exact next steps. Do NOT redo: AIUI revert (done, validated), updater fixes in
> core/archipelago/src/update.rs (done, uncommitted), .116 OTA unwedge (done). Resume at
> "EXACT NEXT STEPS" below.
## EXACT NEXT STEPS (in order)
1. Backend focused tests were running in background:
`cd core && timeout 1500 cargo test -p archipelago -- update:: lnd container::image_versions scanner`
(log: /tmp/claude-.../tasks/bds4jk19e.output — if lost, just rerun the command; first
attempt died at 400s timeout during test compile, 1500s is the right budget).
Need: all green.
2. RESOLVED before session end: vitest recheck passed clean — EXIT=0, 79 files / 645 tests,
even while cargo test was compiling. The earlier harness ui-unit-tests FAIL was load/flake
(machine saturated by the parallel cargo test compile), not a real failure. On resume just
rerun `tests/release/run.sh --quick` WITHOUT a parallel cargo build to confirm green;
if it ever fails again, the failing test name is in the stage output (drop `--silent`).
3. Run full harness: `tests/release/run.sh` (static+frontend+backend). Then commit ALL
working-tree changes (one commit, e.g. "fix: harden OTA updates, AIUI desktop gap, LND
no-proxy" — CHANGELOG v1.7.89 section is already curated).
4. Cut release: `scripts/create-release.sh 1.7.89-alpha` (needs clean tree, on main,
validates CHANGELOG section exists — it does). Then
`tests/release/run.sh --manifest` should pass, and grep the staged frontend tarball
for 1.7.89-alpha (memory: silent build failures).
5. Publish: `scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2`, then
`git push origin main && git push origin --tags` and push gitea-local + tags too.
Verify manifest live on http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json
6. Verify OTA on THIS node (.116): schedule is auto_apply; either wait for the scheduler
or trigger via UI. Confirm /var/lib/archipelago/update_state.json current_version
becomes 1.7.89-alpha, `update_in_progress` returns to false, web-ui + binary versions
MATCH (this node currently has web-ui 1.7.84 / binary 1.7.85 mismatch — the OTA heals it),
and journalctl shows "Post-OTA verification succeeded" (the new probe falls back to
http://127.0.0.1/ which is what .116 serves).
7. Update this tracker + docs/PROGRESS_MEMORY.md, mark tasks done.
Purpose: live tracker for this pass — test everything shipped this week (v1.7.83→v1.7.89),
build the release test harness, fix OTA updates on .116, make updates bulletproof, cut v1.7.89-alpha.
If the session is cut off, resume from here.
## Task status
| # | Task | Status |
|---|------|--------|
| 1 | AIUI revert (mobile back/close gone, desktop gap fixed) | DONE — validated |
| 2 | Dev server on :8100 with embedded AIUI | DONE — see below |
| 3 | Inventory this week's release-log items | DONE — see checklist |
| 4 | Test harness covering this week + seed of system-wide harness | IN PROGRESS |
| 5 | Fix OTA updates on .116 + bulletproof updates | IN PROGRESS — diagnosis below |
| 6 | Cut v1.7.89-alpha release | PENDING (gates: 4, 5) |
## State of the working tree
- HEAD = 495b9078 (v1.7.89 changelog + AIUI mobile restore committed).
- Uncommitted, intended for v1.7.89-alpha:
- `neode-ui/src/views/Dashboard.vue` — chat route back to plain `h-full` (desktop bottom-gap fix). Validated.
- `core/.../rpc/lnd/*` + `container/lnd.rs` — LND REST no-proxy + wallet readiness/unlock fixes.
- Version bumps to 1.7.89-alpha (Cargo.toml, package.json, locks), CHANGELOG entry.
- `neode-ui/vite.config.ts` — added `/aiui` dev proxy (keep; dev-only convenience).
## AIUI validation (task 1) — DONE
- HEAD already removed the mobile back button and restored `hideClose=true` (495b9078).
- Working-tree Dashboard.vue removes `dashboard-scroll-panel mobile-scroll-pad` from the chat
route (that padding caused the desktop bottom gap); mesh keeps its styling.
- Chat CSS verified byte-identical to last-good 34c4e87d (May 20).
- Playwright check (desktop 1440x900, mobile 390x844): chat fills full viewport, no bottom gap,
no mobile back/close. `npm run type-check` + focused route tests + full vitest (645/645) pass.
## Dev server on :8100 (task 2) — DONE
- Running: `BACKEND_URL=http://127.0.0.1:5678 VITE_AIUI_URL=/aiui/ npx vite --host 0.0.0.0 --port 8100`
from `neode-ui/` (real local backend on 5678).
- AIUI now embeds in /dashboard/chat via new vite proxy `/aiui` → `http://127.0.0.1:80`
(the node's deployed AIUI), same-origin like production.
- Secondary throwaway instance for automated checks: :8101 against mock backend
(`node mock-backend.js` on 5959, password `password123`).
## This week's shipped items (v1.7.83 → v1.7.89) — test checklist
### Frontend (vitest/type-check/build cover most; full suite 645/645 green 2026-06-12)
- [x] AIUI fast launch, no availability probe (v1.7.88) — covered by visual check + Chat.vue tests
- [x] AIUI mobile layout restore (v1.7.89) — playwright visual check
- [x] App-session launch metadata from manifests / typed interfaces (v1.7.83) — appSessionConfig tests
- [x] OnlyOffice + Saleor removal (v1.7.83) — catalog tests
- [ ] Bitcoin receive UI flow end-to-end (v1.7.87/88) — needs live LND node check
- [ ] Fleet tab keeps node list/alerts during refresh, names not hashes (v1.7.85/86) — store tests?
- [ ] Credential interstitial full-screen overlay (v1.7.87) — visual
- [ ] Mobile federation/system-update buttons full width (v1.7.86) — visual
### Backend (cargo)
- [ ] LND REST no-proxy client + GET newaddress p2wkh (v1.7.88/89) — unit tests + live check
- [ ] LND wallet readiness/unlock after restart (v1.7.89) — unit + live
- [ ] Bitcoin trusted-node relay rpcauth/txrelay (v1.7.84) — unit tests exist? check
- [ ] Container scanner RAII in-flight guard (v1.7.84) — cargo test
- [ ] ElectrumX health-check startup window + cache tuning (v1.7.85/86)
- [ ] Portainer pin 2.19.4 / bitcoin-ui image pin (v1.7.84/85) — image-versions tests
- [ ] Fleet telemetry name/hostname/URL fields (v1.7.85)
- [ ] Federation no self-import (v1.7.85)
- [ ] Kiosk safe-area + self-update refreshes kiosk files (v1.7.84)
- [ ] Wi-Fi scan error/retry/escaped SSID/open networks (v1.7.84)
### OTA / updates (task 5)
- [ ] .116 stuck: current 1.7.85-alpha, `update_in_progress: true` since 1.7.88 attempt — diagnose+fix
- [ ] Updater hardening: stuck-in-progress recovery, resumable/atomic apply, verify post-restart version
## OTA diagnosis on .116 — ROOT CAUSES FOUND + FIXED (code staged for v1.7.89)
Four bugs, all reproduced from the journal (Jun 12 03:45–04:33):
1. Post-OTA probe only tries `https://127.0.0.1/`; .116's nginx binds only :80 (443 is
tailscale's) → connection refused × 18 → a GOOD 1.7.85 update was "rolled back".
FIX: probe falls back to `http://127.0.0.1/` on connect error (update.rs probe_frontend_once).
2. That rollback's binary restore did `host_sudo cp` onto the RUNNING binary → ETXTBSY exit 1
→ binary stayed 1.7.85 while web-ui rolled back to 1.7.84 (mismatch confirmed live).
FIX: rollback now cp→tmp→atomic mv, same pattern as apply (update.rs rollback_update).
3. The rollback chown'd `update-backup/archipelago` root:root IN PLACE → next apply's
fs::copy (as service user) hit EACCES → "Failed to backup current binary" × 3 → 1.7.86/88
never applied. FIX: apply unlinks stale backup first; rollback chowns only its temp copy.
4. Failed apply left `update_in_progress: true` wedged (staging still populated so the
stale-flag guard never fires). Unwedged operationally; fixed structurally by fixes 1–3.
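
The file-handling pattern behind fixes 2 and 3 can be sketched in Python. This is not the code in `update.rs`; `atomic_replace` and `refresh_backup` are hypothetical names. The relevant facts: on Linux, writing into a running executable in place fails with ETXTBSY, whereas renaming a new file over its path only swaps the directory entry; and unlinking a stale file needs write permission on the directory, not on the file itself.

```python
import os
import shutil
import tempfile
from pathlib import Path


def atomic_replace(src: Path, dest: Path) -> None:
    """Copy src to a temp file beside dest, then rename it over dest."""
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".")
    os.close(fd)
    try:
        shutil.copy2(src, tmp_name)  # copies contents and permission bits
        os.replace(tmp_name, dest)   # atomic on POSIX within one filesystem
    finally:
        # After a successful replace the temp name is gone; this only
        # cleans up when the copy or rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)


def refresh_backup(current: Path, backup: Path) -> None:
    """Unlink any stale backup first so its old owner/mode cannot block the copy."""
    backup.unlink(missing_ok=True)
    shutil.copy2(current, backup)
```

The temp file is created in the destination directory on purpose: a rename is only atomic when source and target live on the same filesystem.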
Operational cleanup DONE on .116 (2026-06-12 17:15): removed root-owned
`update-backup/archipelago`, stale `update-staging/` (1.7.86), and the stale
`update-pending-verify.json`. Next state load clears `update_in_progress`.
NOTE: live web-ui is 1.7.84 / binary 1.7.85 (mismatch from bug 2). Not hand-patched —
the v1.7.89 OTA will resync both. Good 1.7.85 frontend is quarantined at
`/opt/archipelago/web-ui.failed.1781250438247`.
Verification plan: after v1.7.89 release, watch .116 auto-apply (schedule auto_apply),
confirm `update_state.json.current_version == 1.7.89-alpha` and web-ui version matches.
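
The probe fallback from bug 1 can be sketched as a loop over candidate URLs that moves on only when the connection itself fails. A minimal Python sketch with a hypothetical `probe_with_fallback` and an injected `fetch` callable (assumed to return an HTTP status code); the success range used here is also an assumption, not taken from `update.rs`.

```python
from typing import Callable, Sequence


def probe_with_fallback(
    fetch: Callable[[str], int],
    urls: Sequence[str] = ("https://127.0.0.1/", "http://127.0.0.1/"),
) -> bool:
    """Return True if the first reachable URL answers with a non-error status."""
    for url in urls:
        try:
            status = fetch(url)
        except ConnectionError:
            # Nothing is listening for this scheme/port (e.g. nginx bound
            # only to :80), so try the next candidate instead of failing.
            continue
        return 200 <= status < 400
    return False
```

An HTTP error status from a reachable server is still treated as a failed probe; only a refused or reset connection triggers the fallback.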
## Test harness (task 4) — CREATED at tests/release/run.sh
- Stages: static (git diff --check, cargo fmt, catalog drift, optional --manifest),
frontend (type-check, full vitest), optional --with-build (build + grep dist for version),
backend (cargo check + focused cargo test: update:: lnd container::image_versions scanner,
all wrapped in `timeout`), optional --live URL smoke (/, /aiui/, /rpc/v1).
- Results so far (2026-06-12): type-check PASS, full vitest 645/645 PASS, cargo fmt PASS,
cargo check PASS, catalog drift PASS (3 pre-existing MISSING_CATALOG warnings, exit 0,
identical on HEAD). Focused backend cargo tests running (first run hit the known slow
test-compile on .116 at 400s timeout; rerunning with 1500s).
- AIUI embed verified end-to-end via playwright on :8101 (mock backend): iframe loads,
`ready` handshake clears the loading overlay, hideClose honored.
- Release flow confirmed: commit all → `scripts/create-release.sh 1.7.89-alpha` (validates
curated CHANGELOG section, builds, manifests, commits, tags) →
`scripts/publish-release-assets.sh 1.7.89-alpha gitea-vps2` → push origin main + tags.
Tarball layout/perms safety is already inside create-release-manifest.sh.
- CHANGELOG v1.7.89 section rewritten layman-readable (updater fixes added).
## Release gates for v1.7.89-alpha (task 6)
1. All harness stages green locally.
2. OTA fix for stuck `update_in_progress` included + .116 updates successfully to the new release.
3. Frontend build: grep packaged tarball for "1.7.89-alpha" before shipping (memory: silent vue-tsc failures).
4. Flat tarball layout (`tar -C web/dist/neode-ui .`).
5. Commit, tag `v1.7.89-alpha`, push origin + gitea-local + tags, publish release assets, verify
manifest + node OTA picks it up.
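
Gates 3 and 4 could be checked together before shipping. A minimal sketch, assuming "flat" means `index.html` sits at the archive root (optionally under a leading `./`, as `tar -C web/dist/neode-ui .` produces); `check_frontend_tarball` is a hypothetical helper, not part of the release scripts.

```python
import io
import tarfile
from typing import List


def check_frontend_tarball(data: bytes, version: str) -> List[str]:
    """Return a list of problems; an empty list means both gates pass."""
    needle = version.encode("utf-8")
    has_root_index = False
    has_version = False
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
        for member in tar.getmembers():
            name = member.name
            if name.startswith("./"):
                name = name[2:]
            if name == "index.html":
                has_root_index = True
            if member.isfile() and not has_version:
                extracted = tar.extractfile(member)
                if extracted is not None and needle in extracted.read():
                    has_version = True
    problems = []
    if not has_root_index:
        problems.append("index.html is not at the archive root (tarball is not flat)")
    if not has_version:
        problems.append(f"version string {version!r} not found in any file")
    return problems
```

Searching the extracted file contents rather than the compressed bytes matters: the version string will not appear verbatim inside a gzip stream.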