# ▶▶ SESSION SAVE / RESUME (2026-06-16) — v1.7.97-alpha CUT, mid-rollout
**v1.7.97-alpha is BUILT + TAGGED LOCALLY but NOT yet published to the fleet.**
- Release commit `47c16971` ("chore: release v1.7.97-alpha") + tag `v1.7.97-alpha` exist on LOCAL main only. NOT pushed to gitea-vps2. Fleet still sees 1.7.96-alpha.
- Contents (14 fixes + image-opt): B5,B1,B2,B4,B14,B21,B3,B15,B7,B13,B12,B16,**B17**, B6-pruned-gate + lossless background-image optimization (bg-mesh PNG→JPEG).
- Release artifacts staged: `releases/v1.7.97-alpha/{archipelago, archipelago-frontend-1.7.97-alpha.tar.gz}` + `/tmp/archipelago-frontend-1.7.97-alpha.tar.gz` (177MB, flat layout verified, optimized images baked in, no APK).
- **Deployed (sideload, NOT fleet OTA):** .116 = on 1.7.97-alpha, healthy, B17 self-heal CONFIRMED (unit now has RequiresMountsFor, 36 containers survived restart). .198 = deploying (sideload binary+frontend).
- **Backup binaries for rollback:** `/usr/local/bin/archipelago.1.7.96-alpha.bak` on .116 and .198.
**REMAINING (this session, user wants to do WITH them):**
1. Finish .198 sideload; then **UI-confirm fixes together on .116/.198** + close passing Gitea issues (#8,#9,#10,#11,#12,#14,#19(code-only),#20,#21,#22,#23,#24,#29). Issue map below.
2. **Publish to fleet:** `scripts/publish-release-assets.sh 1.7.97-alpha gitea-vps2`, then `git push gitea-vps2 main` plus the `v1.7.97-alpha` tag (AFTER joint confirm; user's call).
3. **Cut a fresh ISO** (bakes B13 nginx + B17 unit + all frontend). ISO builds run on a server (deploy-to-target / .228). Then test the ISO together.
⚠️ LESSON: never run the release binary to "check --version". It has no such flag and BOOTS A FULL NODE (adopts containers, grabs mesh radio). Use `strings <bin> | grep version` instead. (Did this on .116; the instance exited on the :5678 port conflict, no harm.)

---
# ▶▶ SESSION SAVE / RESUME (2026-06-15)
**State:** v1.7.96-alpha SHIPPED. v1.7.97-alpha NOT cut yet — 10 fixes committed on **vps2 main** (`git remote: gitea-vps2`), nothing on the fleet yet. Validate on .116/.198 + UI-confirm BEFORE cutting .97.
**Resume command (run elsewhere):**
```
cd ~/Projects/archy && git fetch gitea-vps2 && git checkout main && git reset --hard gitea-vps2/main && cat tests/production-quality/TRACKER.md
```
Then continue from "IN PROGRESS" below.
**Committed & ready for .97 (vps2 main):** B5 (LND CORS, verified .116/.198/.103), B1, B2, B4, B14, B21, B3 (incl. /api/peer-content nginx via bootstrap), B15, B7, **B13 (fedimint CSS self-heal — main conf + HTTPS snippet, verified .198 both paths app-icon 404→200)**, **B12 (mempool bitcoin-host detect across 3 render paths — unit-tested; live bitcoin-core validation pending)**, **B16 (bitcoin sync tile retain/Updating… — unit-tested 6/6, commit 83dbd25c)**. B6 pruned-gate already live. = 13 fixes. PLUS **image-optimization** (commit 386d4bfc — all bg images losslessly optimized, bg-mesh PNG→JPEG; user asked to include it in the .97 release).
**IN PROGRESS — B16 DONE (commit 83dbd25c). Pick up at B6 no-node-present half.** B13 + B12 + B16 DONE (committed; see entries below). REMAINING:
1. **B6** no-node-present half, **B12b** (sibling bitcoin-host hardcodes: LND/BTCPay/electrumx/fedimint + mempool dep declaration — reuse `{{BITCOIN_HOST}}`; needs validation, esp. LND/fedimint), **B14b** (FIPS reachability depth), **B22/B23** (peer download + group chat — need live repro), B9/B10/B11/B17/B18/B19, B8 (low), B20 (mesh-headers feature).
2. **Loose end:** 4 pre-existing prod_orchestrator test failures (generated-files/data_uid fixtures use disallowed tempdir volume sources); see B12 NOTE. Separate small fix.
Note: .198 is running a sideloaded B13-era .97-dev binary (md5 4c83803d). The B12 binary was built (`core/target/release/archipelago`) but NOT sideloaded (mempool isn't on .198; .198 is Knots so B12 is a no-op there). Reflashing/OTA replaces the dev binary.
**Ship .97 when ready:** `./scripts/create-release.sh 1.7.97-alpha` (first curate CHANGELOG with ≥3 layman bullets and run `scripts/sync-whats-new.py`; use `SKIP_RELEASE_TESTS=1` only for the 2 known-flaky vitest timing tests) → `scripts/publish-release-assets.sh 1.7.97-alpha gitea-vps2` → `git push gitea-vps2 main` + tag. (The gitea-local push fails with a rejected token; non-blocking.)

---
# Production-Quality Bug Tracker
Living tracker for the post-v1.7.96 "no new features until production quality" push.
Updated continuously as we investigate → fix → test → pass. Kept in-repo so progress
survives a session cutoff.
## Rules (from user, 2026-06-15)
- **No new features** until the OS is production / no-bugs quality.
- **Test-harness-first**: build/extend a harness for each bug before fixing.
- **Validate every fix on `.116` + `.198`** (both 192.168.1.x, pw ThisIsWeb54321@) **+ the harness** BEFORE it goes into any release. (.198 still carries the LND CORS nginx duplicate → good for fix-(a) validation; .116 does not.)
- **Priority order**: cloud/federated-nodes + mesh FIRST, then app-specific, then low-pri.
## Status legend
`TODO` · `INVESTIGATING` · `ROOT-CAUSED` · `FIXING` · `TESTING` (on .116+harness) · `PASSED` · `SHIPPED`
## Release status
- **v1.7.96-alpha — SHIPPED** (2026-06-15). Live on vps2 (primary OTA): manifest v1.7.96-alpha, assets HTTP 200, `main@8c3c7954` + tag present. Contents: kiosk grid removal + FIPS TCP/UDP anchor selector. NOTE: gitea-local (localhost) mirror push failed (token rejected → /login); non-blocking, needs refreshed token.
- **v1.7.97-alpha — IN PROGRESS** (this push). Will bundle the verified fixes below.
---
## 🔴🔴 TOP PRIORITY
### B5 — LND "connect your wallet" details/QR broken fleet-wide — ROOT-CAUSED
Origin: user escalation. Symptom: LND connect screen (served on app port :18083) can't load details/QR.
Two distinct root causes (confirmed live):
- **(a) Duplicate ACAO** on `/lnd-connect-info` (seen on .103): backend sets `Access-Control-Allow-Origin` (proxy.rs:108) AND nginx `add_header` adds a second → browser rejects "multiple values". nginx config drift. Fix: bootstrap.rs nginx patch must strip the redundant `add_header` from the `/lnd-connect-info` location (backend owns CORS).
- **(b) No ACAO on `/proxy/lnd/v1/*` 401** (fleet-wide): the unauth/auth-layer 401 is produced before the CORS-adding proxy handler (proxy.rs:135 `handle_lnd_proxy`). Browser → "No 'Access-Control-Allow-Origin' header". Fix: ensure auth-layer/early-return responses for `/proxy/lnd` + `/lnd-connect-info` carry CORS headers.
- `.116` `/lnd-connect-info` returns a single correct ACAO → symptom varies by node's nginx state.
- Backend CORS helper: handler/mod.rs `app_cors_origin()` (:270) — reflects Origin when its host == request host.
- Backend change → ships in .97. **Status: ✅ PASSED — verified on .116, .198, .103 (harness 4/4 each). Ready to bundle into .97.**
- Caveat: bootstrap's nginx dup-strip runs a few seconds AFTER /health goes green (async patch+reload) — converges within ~1 min of restart; not instant. Acceptable.
- **CODE CHANGES MADE (uncommitted):**
- `core/archipelago/src/bootstrap.rs`: added `NGINX_LND_DUP_CORS` const + strip in `patch_nginx_conf()` (removes the duplicate nginx `add_header` ACAO from `/lnd-connect-info` so the backend's single header wins). Idempotent; runs on startup nginx bootstrap. → fixes (a)
- `core/archipelago/src/api/handler/mod.rs`: new `unauthorized_cors(origin)` helper (:~205) + `/proxy/lnd/` route (:~505) computes origin first and returns `unauthorized_cors` so the 401 carries ACAO. → fixes (b)
- Test on **.116** for (b); test on **.103** for (a) [.116 has no dup to strip].
- **2026-06-15 RESULT — .116 (fix b): harness 4/4 PASS** (sideloaded built binary, restarted). `/proxy/lnd/v1/*` now returns CORS on the 401. ✅
- (Correction: an earlier "LND container MISSING" reading was a FALSE alarm — `docker` isn't in the non-interactive PATH; runtime is **podman**. Verified `lnd Up 9h` — containers SURVIVED the restart cleanly.)
- Next: deploy to .103 + run harness to confirm fix (a) (nginx dup strip).
- **Harness:** `tests/production-quality/lnd-cors-test.sh <node>` — asserts single correct ACAO on /lnd-connect-info + ACAO present on /proxy/lnd/v1/{getinfo,channels}. Baseline (2026-06-15): .116 = 2 pass/2 fail (proxy missing ACAO); .103 = 1 pass/3 fail (connect-info dup + proxy missing).
- **FIX PLAN (precise):**
1. (b) handler/mod.rs:504-508 `/proxy/lnd/` returns `Self::unauthorized()` (401, NO CORS) when session check fails → browser CORS wall. Add CORS (app_cors_origin) to that 401. Same pattern for any other app-origin early-return.
2. (a) nginx `/lnd-connect-info` location double-adds ACAO (backend + nginx `add_header`). Strip the nginx `add_header Access-Control-Allow-Origin` there; backend owns CORS. Update bootstrap.rs nginx patch to remove it on existing nodes (idempotent).
- Verify: rebuild backend, deploy to .116, run harness → expect 3/3 (or 4 assertions) PASS on .116 AND .103.
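The host check and the CORS-carrying 401 from fix (b) can be sketched as follows. This is a minimal illustration, not the code in handler/mod.rs: `host_of`, `reflect_origin`, and `unauthorized_headers` are hypothetical names, and comparing hosts while ignoring the port is an assumption (the connect screen is served on :18083, so its Origin presumably carries a different port than the API).

```rust
/// Extract the bare hostname from an Origin ("https://host:port") or a Host header ("host:port").
/// Simplification: bracketed IPv6 literals are not handled.
fn host_of(value: &str) -> &str {
    let without_scheme = value.split("://").last().unwrap_or(value);
    let authority = without_scheme.split('/').next().unwrap_or(without_scheme);
    authority.split(':').next().unwrap_or(authority)
}

/// Reflect the Origin only when its host matches the host the request was sent to.
/// Assumption: ports are ignored in the comparison.
fn reflect_origin(origin: Option<&str>, request_host: &str) -> Option<String> {
    let origin = origin?;
    let origin_host = host_of(origin);
    if !origin_host.is_empty() && origin_host.eq_ignore_ascii_case(host_of(request_host)) {
        Some(origin.to_string())
    } else {
        None
    }
}

/// Headers for an early-return 401 so the browser sees the status instead of a CORS wall.
fn unauthorized_headers(origin: Option<&str>, request_host: &str) -> Vec<(String, String)> {
    let mut headers = vec![("Content-Type".to_string(), "application/json".to_string())];
    if let Some(allowed) = reflect_origin(origin, request_host) {
        // Exactly one ACAO value: the backend owns CORS, nginx must not add a second.
        headers.push(("Access-Control-Allow-Origin".to_string(), allowed));
        headers.push(("Vary".to_string(), "Origin".to_string()));
    }
    headers
}
```

The point of computing the origin before the session check is that every exit path, including the 401, goes through the same header builder.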
---
## 🔴 PRIORITY — cloud / federation / mesh
### B1 — Trusted-node list not clean — PASSED (onion-dedup; unit test 2/2; live .198 15→13 distinct, healthy). UI visual-confirm recommended.
Dupes, erroneous names, and non-convergent group membership across nodes. Expected: trusted nodes form a transitive group (every node connects to any newly-added trusted node; all nodes show the same set). `.103` has a long/dirty list.
### B2 — Duplicate chat contact for one node — PASSED (resolved by load-dedup feeding mesh seed; unit-tested). UI visual-confirm recommended.
Federated peer "sapien" shows TWO chats: one "sapien" WITHOUT archy logo (looks non-federated) + one named by raw DID `did:key:z6MkoSbN5CM7fBaQg2nWbCymEkFXsHnuXvec9Mjo5RtJf9dQ`. Same node keyed by both federated identity and raw DID → merge to one. Code: core/archipelago/src/mesh + mesh/typed_messages.rs (note :233 — meshcore adverts don't carry archy pubkey).
### B3 — Cloud peer media won't preview/play — FIXING (code done: /api/peer-content streaming proxy + playMedia streams free content)
Music/video preview files on peer nodes' cloud don't play (streaming/range/content-type over mesh+Tor peer fetch).
### B4 — Cloud "my folders" fails (JSON parse / 502) — PASSED (content-type guard; built, guard in bundle, deployed .198). UI visual-confirm recommended.
`Unexpected token '<', "<!doctype"` when FileBrowser absent (`/app/filebrowser/api/resources` → SPA index.html), and **502** when FileBrowser is down (seen on .103). filebrowser-client.ts:102/:106. Fix: detect FileBrowser unavailable, friendly prompt; consider nginx returning JSON 404/502 for missing `/app/<app>/` instead of SPA shell. Handle BOTH absent + down.
### B14 — cloud browse transport not recorded — FIXED (record_peer_transport in 4 content handlers; build OK). NOTE: live data shows FIPS reaches only ~4/15 peers, 6 fall back to Tor genuinely → see B14b.
Browsing trusted/peer nodes in the Cloud tab connects over Tor instead of FIPS (should prefer FIPS like the rest of mesh; same for peer browsing). cf project_fips_integration, project_tor_node_to_node_works (last_transport should be fips/mesh).

---
## 🟠 APP-SPECIFIC
### B6 — ElectrumX install gate — PARTIAL (pruned-node gate already works; "no node present" half DEFERRED: false-positive risk without UI test, needs package-presence check)
Show the yellow requirement badge when no full node / only a pruned node is present (reuse existing yellow badge pattern).
### B7 — ElectrumX UI stuck loader on top — FIXED (overlay hides + iframe shows when status stale; type-check green). UI-confirm.
UI renders but a loader sits on top; possibly stale pre-sync screen not clearing.
### B9 — IndeedHub keeps stopping on nodes — TODO
Container won't stay running (crash-loop / reconcile stop). Check logs + restart policy + health.
### B10 — Immich still crashes — TODO
Recurring crash ("still" → prior attempts). Check container logs + resource limits + DB/ML deps.
### B11 — Companion app: "open in external browser" apps don't work — TODO
Apps meant to open in a new/external browser don't launch from the companion app; need the phone-default-browser request-modal pattern mobile apps use. Relates to v1.7.90 "open in new tab from companion app".
### B12 — Mempool not connecting — FIXED (mempool host detect, 3 paths; unit-tested). Live bitcoin-core validation PENDING (no core node available).
**Bigger than the original "stacks.rs:1278" framing.** `CORE_RPC_HOST=bitcoin-knots` was hardcoded in THREE env-render paths; on a bitcoin-core node the container is named `bitcoin-core`, so mempool-api can't resolve RPC. Both Knots and Core are reachable on `archy-net` by container name — only the name differs.
- **Path 1 — legacy direct-podman** (`stacks.rs::install_mempool_stack`, used when no orchestrator): now `format!("CORE_RPC_HOST={}", detect_bitcoin_rpc_host())`. FIXED.
- **Path 2 — `config.rs::get_app_config`** (install.rs legacy path): same. FIXED.
- **Path 3 — Quadlet/manifest (THE MODERN FLEET PATH, e.g. .198)**: `prod_orchestrator` renders env from `apps/mempool-api/manifest.yml` static YAML. FIXED via a new `{{BITCOIN_HOST}}` derived-env placeholder: `HostFacts.bitcoin_host` (container/manifest.rs) + `resolve_derived_env` renders it; `prod_orchestrator::bitcoin_host()` detects Knots/Core via `podman ps` (test-injectable `set_bitcoin_host_for_test`); resolved on-demand only for manifests using the placeholder (perf). mempool-api manifest moved `CORE_RPC_HOST` from static env → `derived_env: {{BITCOIN_HOST}}`.
- New helper `dependencies::detect_bitcoin_rpc_host()` + pure `pick_bitcoin_host()`.
- **TESTS (all green):** `pick_bitcoin_host` 5 cases (knots/core/plain/none/substring-safety); container-crate `resolve_derived_env` renders `{{BITCOIN_HOST}}`; orchestrator `mempool_core_rpc_host_follows_bitcoin_node` (core→bitcoin-core, knots→bitcoin-knots). No-regression verified: picker returns `bitcoin-knots` live on .198 (so Knots nodes unchanged; existing mempool installs see no env drift).
- **VALIDATION GAP:** cannot exercise on a live bitcoin-core node (none available; .198 is Knots where the fix is a no-op). Need a Core node to confirm end-to-end.
- **FOLLOW-UP (B12b, NOT done):** same hardcode exists for siblings on bitcoin-core nodes — `config.rs` lnd(:724)/btcpay(:739)/electrumx(:782), and `prod_orchestrator::resolve_dynamic_env` fedimint `FM_BITCOIND_URL=...bitcoin-knots` (~:2425). Plus mempool-api manifest `dependencies: bitcoin-knots` (line 18) is Knots-specific bookkeeping (install-time check already accepts Core via BITCOIN_NAMES, so non-blocking). All can reuse `{{BITCOIN_HOST}}`. Deferred per user (mempool-only scope) — each needs its own validation, esp. LND/fedimint.
- **NOTE (unrelated pre-existing failures):** 4 prod_orchestrator tests fail on clean HEAD too — `install_applies_data_uid_chown_before_create`, `install_writes_manifest_generated_files_before_create`, `manifest_generated_files_{do_not_overwrite_by_default,can_overwrite_when_declared}` — their fixtures pass tempdir volume sources that `validate_bind_source` rejects (only `/var/lib/archipelago/*` + 2 sockets allowed). NOT caused by B12; worth a separate fix.
- **ORIGINAL REPORT:** mempool can't reach the Bitcoin backend on some nodes. Investigate on .116. Check mempool→electrs→bitcoind wiring + deps.
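A minimal sketch of the pure picker plus the `{{BITCOIN_HOST}}` rendering described above. The candidate list, its order, and the `bitcoin-knots` fallback when nothing is running are assumptions for illustration; the real `pick_bitcoin_host()` and `resolve_derived_env` may differ, and `render_env` is a hypothetical stand-in.

```rust
/// Candidate container names in preference order (assumed; the real BITCOIN_NAMES may differ).
const BITCOIN_CONTAINERS: [&str; 3] = ["bitcoin-knots", "bitcoin-core", "bitcoin"];

/// Assumed fallback when no Bitcoin container is running, so Knots installs see no env drift.
const DEFAULT_BITCOIN_HOST: &str = "bitcoin-knots";

/// Pick the RPC host from the names of running containers.
/// Exact-name matching gives substring safety: "bitcoin-core-exporter" is not a Bitcoin node.
fn pick_bitcoin_host(running: &[&str]) -> &'static str {
    BITCOIN_CONTAINERS
        .iter()
        .copied()
        .find(|candidate| running.iter().any(|name| name == candidate))
        .unwrap_or(DEFAULT_BITCOIN_HOST)
}

/// Render one env line, resolving the host only when the placeholder is actually used.
fn render_env(template: &str, running: &[&str]) -> String {
    if template.contains("{{BITCOIN_HOST}}") {
        template.replace("{{BITCOIN_HOST}}", pick_bitcoin_host(running))
    } else {
        template.to_string()
    }
}
```

Keeping the picker pure (it takes the container list as input rather than calling `podman ps` itself) is what makes the five unit cases cheap to write.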
### B13 — Fedimint UI not applying CSS — FIXED + VERIFIED on .198 (both HTTP + HTTPS)
Root cause confirmed: the Fedimint Guardian page (served by :8175) is a server-rendered status page with ~7.8KB INLINE CSS plus image assets referenced root-rooted (`src="/assets/img/app-icons/fedimint.jpg"`, `url("/assets/img/bg-network.jpg")`). Without an asset rewrite those `/assets/...` URLs resolve against the archipelago SPA root: `bg-network.jpg` happens to exist there (shared design asset → loaded by luck) but `app-icons/fedimint.jpg` does NOT → **404** (the broken/visibly-missing icon). The `location /assets/` block uses `try_files $uri =404`, so missing fedimint assets 404 rather than fall through.
Fix = nginx sub_filter set that reroots every root-rooted asset URL (`href="/`, `src="/`, `url("/`, and single-quote variants) under `/app/fedimint/`, plus `proxy_set_header Accept-Encoding ""` so the upstream doesn't gzip (sub_filter can't rewrite gzipped bodies). Shipped two ways:
- **Fresh ISOs** (committed a50b6df2): templates `image-recipe/configs/nginx-archipelago.conf` (HTTP) + `image-recipe/configs/snippets/archipelago-https-app-proxies.conf` (HTTPS).
- **Already-deployed nodes** (bootstrap self-heal, this commit): `core/archipelago/src/bootstrap.rs::patch_nginx_conf` now heals BOTH the main conf (Style A — swaps the old single nostr-provider sub_filter tail for the full reroot set, byte-matches the shipped template) AND the HTTPS app-proxy snippet (Style B — anchors on the unique `:8175` proxy_pass and inserts the reroot set; robust to the snippet's varying trailing directive). `missing_*` flags now gated on their splice anchors so the healed snippet early-returns cleanly (no per-boot warn-skips). Idempotent via the `'href="/' 'href="/app/fedimint/'` marker.
VERIFIED on .198 (sideloaded built binary, restart, async self-heal converged ~15s):
- HTTP `/app/fedimint/`: live conf healed byte-identical to template; app-icon **404→200 image/jpeg (41944b)**.
- HTTPS `/app/fedimint/` (snippet): healed; same app-icon **404→200**; bg-network 200; root `/assets/img/app-icons/fedimint.jpg` returns 200 **text/html** (SPA shell) — proving the reroot is necessary.
- `nginx -t` OK both times; containers survived restart (Quadlet); both files carry the marker exactly once (idempotent steady state); no warn spam in logs.
NOTE: self-healed snippet is functionally correct but NOT byte-identical to the fresh-ISO snippet template (insert-after-proxy_pass vs full block) — acceptable; nginx ignores directive order/whitespace.
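The "Style B" heal (anchor on the `:8175` proxy_pass, insert the reroot set, stay idempotent via the marker) can be sketched like this. It is an illustration under assumptions, not `patch_nginx_conf` itself: `reroot_directives` and `heal_fedimint_block` are hypothetical names, and the exact directive set and ordering shipped in the templates may differ. The nginx directives used (`sub_filter`, `sub_filter_once`, `proxy_set_header`) are standard.

```rust
/// Marker identifying an already-healed block (the one named in the tracker entry).
const MARKER: &str = r#"'href="/' 'href="/app/fedimint/'"#;

/// Build the reroot directives with the given indentation.
fn reroot_directives(indent: &str) -> String {
    let prefixes = [r#"href="/"#, r#"src="/"#, r#"url("/"#, "href='/", "src='/", "url('/"];
    // Ask the upstream for an uncompressed body; sub_filter cannot rewrite gzipped responses.
    let mut out = format!("{indent}proxy_set_header Accept-Encoding \"\";\n");
    for p in prefixes {
        // Quote each nginx string with whichever quote the pattern does not contain.
        let q = if p.contains('\'') { '"' } else { '\'' };
        out.push_str(&format!("{indent}sub_filter {q}{p}{q} {q}{p}app/fedimint/{q};\n"));
    }
    out.push_str(&format!("{indent}sub_filter_once off;\n"));
    out
}

/// Insert the reroot set after the first proxy_pass to :8175.
/// Returns the input unchanged when it is already healed or has no anchor.
fn heal_fedimint_block(conf: &str) -> String {
    if conf.contains(MARKER) {
        return conf.to_string(); // steady state: nothing to do
    }
    let mut out = String::with_capacity(conf.len() + 512);
    let mut patched = false;
    for line in conf.lines() {
        out.push_str(line);
        out.push('\n');
        if !patched && line.contains("proxy_pass") && line.contains(":8175") {
            let indent: String = line.chars().take_while(|c| c.is_whitespace()).collect();
            out.push_str(&reroot_directives(&indent));
            patched = true;
        }
    }
    if patched { out } else { conf.to_string() }
}
```

Checking the marker before touching anything is what keeps repeated boots from stacking duplicate `sub_filter` lines.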
### B15 — Bitcoin UI sync progress lags — FIXED (Home.vue poll 30s→10s). UI-confirm.
Bitcoin UI doesn't update its sync progress fast enough even though the console clearly already has the block-height data. Likely a polling-interval / reactive-update gap between the status source and the UI.
### B16 — Bitcoin sync status vanishes — FIXED + UNIT-TESTED (commit 83dbd25c). UI-confirm.
The bitcoin sync status in the Home > System container disappears when it should persist/cache and show an "updating" state. Related to B15 (Bitcoin UI sync lag). Root cause: the tile is gated `v-if="stats.bitcoinAvailable===true"` (HomeSystemCard.vue:60); a transient `bitcoin.getinfo` failure (RPC busy during heavy IBD, or a route-change/scan where the packages map is momentarily empty) could blank it.
FIX (commit 83dbd25c): added a `bitcoinStale` flag to homeStatus.ts —
- getinfo fails while the bitcoin container is **Running**, OR package data is momentarily **absent** → retain last-known value + `bitcoinStale=true` (tile stays, renders **"Updating…"** instead of a frozen figure shown as live).
- container authoritatively **Stopped/Exited** → `bitcoinAvailable=false`, `stale=false` (no stale-as-live — genuinely down is reflected).
- first-ever poll times out but container Running (syncing node) → show the tile as updating rather than staying hidden.
Wired `bitcoinStale` through Home.vue `systemStats` → HomeSystemCard prop; card shows "Updating…" (dimmed) when stale.
**Harness:** `neode-ui/src/stores/__tests__/homeStatus.test.ts` (6 cases) — RED before fix (5/6 fail), GREEN after (6/6). `vue-tsc --noEmit` exit 0. Full vitest suite: only pre-existing AppIconGrid cross-test teardown flake (passes 7/7 standalone; not my change). UI-confirm on .116/.198 still recommended (hard to trigger transient failure on demand — unit test is the authoritative harness here).
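The retain/stale rules above reduce to a small transition function. The real store is TypeScript (homeStatus.ts); the sketch below models the same decisions in Rust, and every name in it (`ContainerState`, `BitcoinTile`, `next_tile`) is hypothetical. Keeping the last height after an authoritative stop, and treating "package data absent" as `Unknown`, are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ContainerState {
    Running,
    Stopped, // authoritatively Stopped/Exited
    Unknown, // package data momentarily absent
}

#[derive(Debug, Clone, PartialEq, Default)]
struct BitcoinTile {
    available: bool,
    stale: bool,
    last_height: Option<u64>,
}

/// Compute the next tile state from the previous one and the latest poll.
/// `getinfo` is Some(height) when the RPC answered, None when it failed or timed out.
fn next_tile(prev: &BitcoinTile, container: ContainerState, getinfo: Option<u64>) -> BitcoinTile {
    match (getinfo, container) {
        // Fresh data always wins.
        (Some(height), _) => BitcoinTile { available: true, stale: false, last_height: Some(height) },
        // Genuinely down: hide the tile, never show stale-as-live.
        (None, ContainerState::Stopped) => {
            BitcoinTile { available: false, stale: false, last_height: prev.last_height }
        }
        // RPC busy (or first poll timed out) while the container runs: show "Updating…".
        (None, ContainerState::Running) => {
            BitcoinTile { available: true, stale: true, last_height: prev.last_height }
        }
        // No package data this tick: keep whatever was shown, marked stale.
        (None, ContainerState::Unknown) => {
            BitcoinTile { available: prev.available, stale: prev.available, last_height: prev.last_height }
        }
    }
}
```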
### B17 — archipelago.service flaps on boot before starting — FIXED + VERIFIED on .198 (commit 34b1fdc1)
On some boots, `[FAILED] Failed to start archipelago.service` printed ~20× over ~5 min before starting. ROOT CAUSE (proven live on .198): on production nodes `/var/lib/archipelago` is a **separate `/dev/mapper/archipelago-data` ext4 volume** (systemd unit `var-lib-archipelago.mount`), and podman's **graphroot=`/var/lib/archipelago/containers/storage`** lives on it too. The unit ordered only `After=network-online.target` — NO mount dependency — so on cold boots the service (and its `ExecStartPre`) could start BEFORE the volume mounted, write to the bare mountpoint on rootfs, fail every podman call, exit, and be restarted every 5s (`Restart=on-failure RestartSec=5`) until the mount appeared. Smoking gun in .198's journal: `var-lib-archipelago.mount: Directory /var/lib/archipelago to mount over is not empty, mounting anyway` — the service had written there pre-mount. Dev laptop .116 has the data dir on rootfs → never flaps (explains "on some boots"). Diagnostic: every node showed `banners == "Server listening"` (process always succeeds once it runs) ⇒ failure is systemd-level, not a Rust crash.
FIX (commit 34b1fdc1): `RequiresMountsFor=/var/lib/archipelago` (adds `Requires=` + `After=` on the mount unit).
- `image-recipe/configs/archipelago.service`: ships the directive on fresh ISOs.
- `bootstrap::ensure_archipelago_mount_ordering()`: self-heals already-deployed nodes' installed `/etc/systemd/system/archipelago.service` + `daemon-reload` (boot-ordering only — effective next reboot; never restarts the running service). Idempotent; harmless on rootfs installs.
VERIFIED on .198: applied directive → `systemctl show -p After` includes `var-lib-archipelago.mount`, `systemd-analyze verify` clean → rebooted: mount@07:35:22, archipelago banner@07:35:35 (13s AFTER mount), `banners=1 listening=1 failed_to_start=0` (zero flap), directive persisted. `cargo check` EXIT 0. NOTE: self-heal CODE (auto-patch on deployed nodes) still to be exercised with the built binary on .228 (directive was applied manually on .198); residual rootfs shadow files under the mountpoint are benign.
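The self-heal of an installed unit file can be sketched as a pure text transform. This is an assumption-laden illustration, not `ensure_archipelago_mount_ordering()` itself: `ensure_mount_ordering` is a hypothetical name, and the caller would still have to write the file and run `daemon-reload`.

```rust
const DIRECTIVE: &str = "RequiresMountsFor=/var/lib/archipelago";

/// Return the patched unit text, or None when no write is needed
/// (directive already present, or no [Unit] section to anchor on).
fn ensure_mount_ordering(unit: &str) -> Option<String> {
    if unit.lines().any(|line| line.trim() == DIRECTIVE) {
        return None; // idempotent: already healed
    }
    let mut out = String::with_capacity(unit.len() + DIRECTIVE.len() + 1);
    let mut inserted = false;
    for line in unit.lines() {
        out.push_str(line);
        out.push('\n');
        if !inserted && line.trim() == "[Unit]" {
            // Adds Requires= and After= on the mount unit covering this path.
            out.push_str(DIRECTIVE);
            out.push('\n');
            inserted = true;
        }
    }
    if inserted { Some(out) } else { None }
}
```

Returning `None` for "nothing to do" lets the caller skip both the write and the `daemon-reload`, which keeps steady-state boots quiet.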
### B18 — Apps stop right after install (or become unstartable) — TODO
Many apps install but immediately stop, requiring a manual Start — or become unstartable entirely. Likely the install→start handoff / reconciler doesn't bring them up (or starts then they exit). Related to B9 (IndeedHub stopping), B10 (Immich). Possibly linked to the cgroup-SIGKILL-on-archipelago.service-restart issue (feedback_no_systemctl_deploy_until_quadlet) — but NOTE: on .116 (Quadlet) containers survived a service restart cleanly, so the reconciler may be fine there; reproduce on the affected nodes. Check post-install start sequencing + boot_reconciler + container restart policy + cgroup placement.
### B19 — Failed download-update lands on Install button (should be Download) — TODO
When an update download fails, the UI sometimes shows the Install button instead of returning to the Download button — a big UX issue (user can't retry the download cleanly). Check the SystemUpdate state machine's error/failure transition.
### B20 — Surface bitcoin-headers-over-mesh broadcast (send/receive toggles) — TODO (feature-adjacent, surfacing existing work)
We previously broadcast bitcoin block headers over mesh to archipelago nodes but never fully surfaced it. Want two switches: "send headers" (you broadcast) and "receive headers" (you accept). NOTE: this is feature-adjacent — surfacing existing functionality; the user added it during the no-new-features push, so treat as low-priority polish until the bug list is clear. Code: mesh block-headers (mesh.block-headers RPC seen in logs; core/archipelago/src/mesh).
### B14b — FIPS reachability: many peers fall back to Tor — INVESTIGATED (needs FIPS-network depth)
Live (2026-06-15) federation sync last_transport on .116/.198: ~4 peers fips, ~6 tor, ~5 none. So beyond the recording fix (B14), FIPS genuinely doesn't reach many federated peers (they use Tor). Investigate WHY: is fips_npub known for those peers? are they FIPS-online? is the shared anchor connecting them? (cf project_fips_integration, project_tor_node_to_node_works). This is the real "Tor not FIPS" depth.
FINDINGS (.198, 2026-06-15): archipelago-fips ACTIVE; ALL 13 peers HAVE fips_npub; last_transport = 5 fips / 5 tor / 3 none. So it's NOT a missing-npub or service-down bug — FIPS genuinely reaches some peers and not others = DIAL-TIME reachability: the 'tor' peers aren't FIPS-reachable at dial time (offline, NAT, their FIPS not registered with the shared anchor), and 'none' = fully offline (X250 roam/beta/cellular). NEXT (deeper, needs FIPS-network debugging): verify a known-online peer (e.g. .228/.116) is reachable over FIPS from .198 right now; if an online FIPS peer still falls back to Tor → real anchor/registration bug; check fips daemon peer table + anchor connectivity. Likely partly peer-availability (not fully fixable in code).
### B21 — Show Tor/FIPS transport pill on cloud browse — FIXED (build+type-check green; deploy+UI-confirm on .116/.198)
Tag whether the peer connection is Tor or FIPS and surface it as a small pill on the cloud browse screens / connection loader. Data source: federation node last_transport (now recorded by B14) exposed via federation.list-nodes; frontend renders a pill (FIPS=fast/green, Tor=slower) on PeerFiles.vue / Cloud peer view + the connection loader. Frontend-only-ish.
- FINDINGS: PeerFiles.vue:46 loader HARDCODES 'Connecting via Tor...' even when FIPS is used (bug). Frontend types already have last_transport ('fips'|'tor'|'mesh'|'lan') in federation/types.ts:31; NodeList.vue:167 already renders a transport indicator.
- PLAN: have content.browse-peer RETURN the transport used (B14 already computes it) → frontend shows a pill (FIPS green / Tor amber) on the PeerFiles header + fix the loader text to reflect the actual/attempted transport. Small backend change (add transport to browse response) + frontend pill.
### B22 — Peer cloud download/audio errors (.228→.198) — TODO (pairs with B3)
Observed 2026-06-15 browsing .228's cloud from .198: (a) downloading a peer cloud file → "Operation failed. Check server logs for details." (b) playing a peer AUDIO file → "Could not play audio. File Browser may not be running." (misleading — it's a peer file, not File Browser; that's the OLD base64/blob path B3 replaces). ACTION: (a) check content.download-peer backend error on .198 logs while downloading (likely the same Range/transport/timeout path as B3, or a peer-side 4xx); (b) verify B3 streaming fixes peer audio once deployed, and fix the misleading audioPlayer error string. Get server logs: ssh .198, journalctl -u archipelago | grep -i 'content\|peer\|download'.
### B23 — Archipelago group chat (all nodes) broken/slow over Tor — TODO (PRIORITY, mesh)
The all-nodes "Archipelago group" chat (over Tor) doesn't seem to work. Facets:
- (a) Group delivery unreliable / "doesn't work" over Tor.
- (b) Messages may just be VERY SLOW (latency — likely Tor-only path; should use FIPS+Tor per the new transport method like B14, preferring FIPS).
- (c) Add the SENDER CONTACT NAME to each message so you can differentiate who sent what (group messages lack attribution).
- (d) Messages sometimes DUPLICATED (dedup by message id / sender_seq — cf mesh.ts:73 cross-transport identity (sender_pubkey, sender_seq); duplicate likely from receiving same msg over both transports or re-broadcast).
Code: core/archipelago/src/mesh (typed_messages, listener), frontend Mesh.vue/stores/mesh.ts. Relates to B2 (identity), B14/B14b (transport). Test on .116/.198 (+ a Tor-only peer like .228).
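Facets (c) and (d) are still TODO; one possible shape for them is sketched below, assuming the cross-transport identity really is `(sender_pubkey, sender_seq)` as noted at mesh.ts:73. `GroupMessage`, `GroupInbox`, and `display_line` are hypothetical, and falling back to the pubkey when no contact name is known is an assumption.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone)]
struct GroupMessage {
    sender_pubkey: String,
    sender_seq: u64,
    sender_name: Option<String>, // resolved contact name, if any
    body: String,
}

#[derive(Default)]
struct GroupInbox {
    seen: HashSet<(String, u64)>,
    messages: Vec<GroupMessage>,
}

impl GroupInbox {
    fn new() -> Self {
        Self::default()
    }

    /// Accept a message once, no matter how many transports or re-broadcasts deliver it.
    /// Returns false when the (sender_pubkey, sender_seq) pair was already seen.
    fn accept(&mut self, msg: GroupMessage) -> bool {
        let key = (msg.sender_pubkey.clone(), msg.sender_seq);
        if !self.seen.insert(key) {
            return false;
        }
        self.messages.push(msg);
        true
    }
}

/// Render a message with attribution so group members can tell senders apart.
fn display_line(msg: &GroupMessage) -> String {
    let who = msg.sender_name.as_deref().unwrap_or(msg.sender_pubkey.as_str());
    format!("{who}: {}", msg.body)
}
```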
### B8 — netbird app doesn't work — TODO (LOW / much later)
(RETRACTED: CryptPad placeholder-icon. User says cryptpad is fine.)

---
## 📋 vps2 Gitea issues (lfg2025/archy) — imported 2026-06-15
- G#1 [Bug] Strange peer request behaviour — TODO (likely related to B1/federation)
- G#2 [Bug] Fix flashing USB from kiosk — TODO
- G#3 [Feature] VPN Configuration — DEFERRED (feature; no new features until production quality)
- G#4 [Bug] Bitcoind is slow — TODO
- G#5 [Feature] OpenWRT and TollGate integration — DEFERRED (feature)
- G#6 [Feature] Move dashboard/monitoring link to home screen — DEFERRED (feature)
- G#7 [Bug] Scrolling with Companion app — TODO
---
## Gitea issue mapping (vps2 lfg2025/archy)
All backlog bugs now mirrored as Gitea issues: B1→#8, B2→#9, B3→#10, B4→#11, B5→#12, B6→#13, B7→#14, B8→#15, B9→#16, B10→#17, B11→#18, B12→#19, B13→#20, B14→#21, B15→#22, B16→#23, B17→#24, B18→#25, B19→#26. (Pre-existing G#1–7 remain; some overlap, e.g. G#1 strange-peer ≈ B1.) Close the Gitea issue when a bug is verified+shipped.
## INVESTIGATION FINDINGS 2026-06-15 (B1/B2/B3/B4/B14) — cutoff insurance
**B1 trusted-node divergence** — ROOT-CAUSED. `federation/sync.rs` `merge_transitive_peers()` (~:140) dedupes ONLY by DID; the SAME physical node appears under multiple DIDs (same `onion` + `fips_npub`) → duplicate entries ("Arch Dev" ×2, "Sapien" ×2). No background convergence → lists diverge (.103=16 nodes, .116/.198=15). Model: `federation/types.rs:24` FederatedNode (PK=did); storage `federation/storage.rs` nodes.json; add_node dedupes by DID only (:125). FIX: in merge_transitive_peers add a SECOND match arm — if no DID match, match by normalized `onion` (trim .onion); if found, treat as same node (merge fips_npub/name, don't add). Same dedup on add_node. Plus a one-time cleanup of existing dup DIDs (remove-node the stale one). TEST: after sync, all 3 nodes have identical node set, no two entries share an onion.
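The proposed second match arm (fall back to the normalized onion when no DID matches) might look like the sketch below. The struct is a trimmed, hypothetical version of `FederatedNode`, and the merge policy (keep the existing DID, fill in a missing `fips_npub` or name) is an assumption about what "treat as same node" should mean.

```rust
#[derive(Debug, Clone, PartialEq)]
struct FederatedNode {
    did: String,
    onion: String,
    fips_npub: Option<String>,
    name: String,
}

/// Normalize an onion address for comparison: trim, lowercase, drop a trailing ".onion".
fn normalize_onion(onion: &str) -> String {
    let lower = onion.trim().to_ascii_lowercase();
    lower.strip_suffix(".onion").unwrap_or(lower.as_str()).to_string()
}

/// Merge peers learned from a sync into the local list.
/// Match by DID first, then by normalized onion, so one physical node
/// reachable under two DIDs does not become two entries.
fn merge_transitive_peers(known: &mut Vec<FederatedNode>, incoming: Vec<FederatedNode>) {
    for peer in incoming {
        let peer_onion = normalize_onion(&peer.onion);
        let existing = known.iter_mut().find(|node| {
            node.did == peer.did
                || (!peer_onion.is_empty() && normalize_onion(&node.onion) == peer_onion)
        });
        match existing {
            Some(node) => {
                // Same node: merge what we are missing, do not add a new row.
                if node.fips_npub.is_none() {
                    node.fips_npub = peer.fips_npub;
                }
                if node.name.is_empty() {
                    node.name = peer.name;
                }
            }
            None => known.push(peer),
        }
    }
}
```

The same lookup would need to guard `add_node`, otherwise a manual add can reintroduce the duplicate that the sync path just avoided.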
**B2 duplicate chat contact** — ROOT-CAUSED (same root as B1). Two federation DIDs (same onion/fips_npub, e.g. "Sapien" dids z6MkoSbN… + z6MkeYMU…) get seeded as TWO mesh contacts: `mesh/mod.rs` `seed_federation_peers_into_mesh()` (~:94) upserts per-pubkey contact_id; frontend `Mesh.vue` `mergeKeyForPeer()` (~:492) keys by DID so two DIDs = two rows. FIX: (backend) in seed, skip a node whose onion was already seeded (HashSet of onions); (frontend) Mesh.vue merge by onion when DIDs differ but onion matches. Fixing B1's onion-dedup largely resolves this too. TEST: one "Sapien" row; `mesh.peers` has one contact for the shared onion.
**B3 peer media won't play** — ROOT-CAUSED. `PeerFiles.vue` `playMedia()`/`loadPreview()` (~:358,:508) fetch the WHOLE file via RPC `content.preview-peer`/`content.download-peer` (`api/rpc/content.rs` :393,:213) which base64-encodes the entire file; frontend makes a Blob URL → browser can't Range-seek → video/large-audio won't play (+ 30/120s timeouts truncate big files). The peer's HTTP `/content/<id>` handler (`api/handler/content.rs` :49) ALREADY supports Range/206 + Accept-Ranges. FIX (bigger): add a local streaming proxy endpoint `/api/peer-content/{onion}/{id}` in `api/handler/mod.rs` that forwards the browser's Range header to the peer's `/content/<id>` (via fips::dial PeerRequest) and streams back 206 + Content-Range + Content-Type; frontend sets `<video>/<audio>` src to that URL (not a blob). TEST: curl Range on the new endpoint → 206 + Content-Range; video seeks/plays.
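For reference, this is roughly what a Range-aware endpoint has to compute and what the streaming proxy must pass through untouched. It is a self-contained sketch of single-range handling under HTTP range semantics, not the code in `api/handler/content.rs`; `parse_range` and `partial_content_headers` are hypothetical helpers, and multi-range requests are deliberately rejected.

```rust
/// Parse a single-range header such as "bytes=0-1023", "bytes=500-" or "bytes=-200".
/// Returns inclusive (start, end) offsets, or None when unsatisfiable or unsupported.
fn parse_range(header: &str, total_len: u64) -> Option<(u64, u64)> {
    let spec = header.trim().strip_prefix("bytes=")?;
    if spec.contains(',') || total_len == 0 {
        return None; // multi-range and empty bodies are out of scope here
    }
    let (start_s, end_s) = spec.split_once('-')?;
    let last = total_len - 1;
    match (start_s.trim(), end_s.trim()) {
        ("", "") => None,
        // Suffix form: the final N bytes.
        ("", suffix) => {
            let n: u64 = suffix.parse().ok()?;
            if n == 0 { None } else { Some((total_len.saturating_sub(n), last)) }
        }
        // Open-ended form: from start to the end of the file.
        (start, "") => {
            let s: u64 = start.parse().ok()?;
            if s > last { None } else { Some((s, last)) }
        }
        (start, end) => {
            let s: u64 = start.parse().ok()?;
            let e: u64 = end.parse().ok()?;
            if s > e || s > last { None } else { Some((s, e.min(last))) }
        }
    }
}

/// Headers for the 206 response that lets <video>/<audio> seek.
fn partial_content_headers(range: (u64, u64), total_len: u64, content_type: &str) -> Vec<(String, String)> {
    let (start, end) = range;
    vec![
        ("Content-Range".to_string(), format!("bytes {start}-{end}/{total_len}")),
        ("Content-Length".to_string(), (end - start + 1).to_string()),
        ("Accept-Ranges".to_string(), "bytes".to_string()),
        ("Content-Type".to_string(), content_type.to_string()),
    ]
}
```

Since the peer already does this work, the local proxy should only forward the browser's `Range` header and relay the peer's status and headers, not recompute them.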
**B4 cloud my-folders <!doctype/502** — ROOT-CAUSED. `filebrowser-client.ts` `listDirectory()` (:99) does `res.json()` (:106) after only an `res.ok` check; when FileBrowser is ABSENT nginx serves SPA index.html (200, '<!doctype') → JSON crash; when DOWN → 502. FIX (frontend, low-risk): guard res content-type !== application/json → throw typed "FileBrowser unavailable" handled by Cloud.vue/CloudFolder.vue empty-state; same guard in login() (:71) + getUsage() (:215). OPTIONAL nginx: add `error_page 502 503 = @filebrowser_unavailable` returning JSON in the /app/filebrowser/ block (image-recipe/configs/nginx-archipelago.conf ~:411). TEST: stop filebrowser on .116/.198 → Cloud shows friendly state, no doctype crash.
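The guard itself lives in TypeScript (filebrowser-client.ts); the decision it has to make is modeled below in Rust. All names (`Cause`, `FileBrowserReply`, `classify_reply`) are hypothetical, and treating 503 the same as 502 is an assumption.

```rust
#[derive(Debug, PartialEq)]
enum Cause {
    Absent, // nginx served the SPA shell (or some other non-JSON page)
    Down,   // FileBrowser installed but the upstream is not answering
}

#[derive(Debug, PartialEq)]
enum FileBrowserReply {
    Json,               // safe to parse the body as JSON
    Unavailable(Cause), // show the friendly empty state
    HttpError(u16),     // a real API error carrying a JSON body
}

/// True when the Content-Type (ignoring parameters such as charset) is application/json.
fn is_json(content_type: Option<&str>) -> bool {
    content_type
        .and_then(|ct| ct.split(';').next())
        .map(|mime| mime.trim().eq_ignore_ascii_case("application/json"))
        .unwrap_or(false)
}

/// Decide what to do with a response BEFORE attempting to parse it.
fn classify_reply(status: u16, content_type: Option<&str>) -> FileBrowserReply {
    if matches!(status, 502 | 503) {
        FileBrowserReply::Unavailable(Cause::Down)
    } else if !is_json(content_type) {
        // A 200 with text/html is the "<!doctype" case: never hand it to the JSON parser.
        FileBrowserReply::Unavailable(Cause::Absent)
    } else if (200..300).contains(&status) {
        FileBrowserReply::Json
    } else {
        FileBrowserReply::HttpError(status)
    }
}
```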
**B14 cloud browse Tor-not-FIPS** — ROOT-CAUSED (nuance). FIPS-first logic WORKS (`fips/dial.rs` send_get :331 tries FIPS, falls back to Tor on 404/5xx; v1.7.94 fix). BUT the 4 content handlers in `api/rpc/content.rs` (browse :297, download :237, download_paid :356, preview :421) capture `_transport` and NEVER call `record_peer_transport()` → UI badge shows Tor/null even when FIPS used. FIX: add `record_peer_transport(data_dir, None, Some(onion), &transport.to_string())` after each successful send_get (storage.rs:84 has the fn). ⚠️ VERIFY on nodes whether FIPS is ACTUALLY used or genuinely falling back to Tor (if genuinely Tor, deeper FIPS-reachability issue beyond recording). TEST: after browse, last_transport = fips (when peer FIPS-reachable).
## INVESTIGATION FINDINGS 2026-06-15 (B6/B7/B12/B13/B15/B16) — cutoff insurance
**B13 Fedimint CSS** — app HTML (docker/fedimint-ui/index.html) uses absolute /assets/* paths; under /app/fedimint/ the browser requests /assets/* which hit the main SPA, not :8175 → unstyled. FIX: nginx sub_filter rewrite (same proven pattern as indeedhub/botfights blocks) in image-recipe/configs/nginx-archipelago.conf (/app/fedimint/ ~:641) + snippets/archipelago-https-app-proxies.conf (~:164) + bootstrap patch for existing nodes. Rewrites href/src/url '/' → '/app/fedimint/'. TEST: curl .../app/fedimint/assets/...css → 200 real CSS.
**B6 ElectrumX archival gate** — electrs needs a NON-pruned full node; install card doesn't warn at a glance. /bitcoin-status returns blockchain_info.pruned. Yellow badge pattern exists (MarketplaceAppCard.vue). FIX (frontend, simple): show a yellow "Requires a full archive Bitcoin node (not pruned)" note on the electrumx card (MarketplaceAppCard.vue ~:53). The electrumx entry in catalog.json already has a `requires` field.
**B7 ElectrumX stuck loader** — sync overlay gated by electrsSync (useElectrsSync.ts syncing = status!=='synced'); if status never flips to 'synced' (stale/crash) the overlay blocks the UI forever. AppSessionFrame.vue:44 iframe gate `!electrsSync`. FIX (frontend): fail-open — allow iframe when electrsSync?.stale (and add a timeout in useElectrsSync.ts so a slow/stale status stops blocking after ~5min).
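A sketch of the fail-open gate from the B7 fix, written as a pure function so it can be unit-tested without timers. The `ElectrsSyncState` shape and the `startedAtMs` field are assumptions for illustration; the real state in `useElectrsSync.ts` may differ. The 5-minute cap follows the "~5min" figure in the note.

```typescript
// Hypothetical sync state; field names are illustrative.
interface ElectrsSyncState {
  status: string;      // e.g. "syncing" or "synced"
  stale: boolean;      // true when the status has stopped updating
  startedAtMs: number; // when the overlay first started blocking
}

const MAX_BLOCK_MS = 5 * 60 * 1000; // stop blocking after roughly 5 minutes

// Decide whether the sync overlay should still block the iframe.
function shouldBlockIframe(sync: ElectrsSyncState | null, nowMs: number): boolean {
  if (sync === null) return false;            // no sync in progress
  if (sync.status === "synced") return false; // done
  if (sync.stale) return false;               // fail open on stale status
  return nowMs - sync.startedAtMs < MAX_BLOCK_MS; // fail open on timeout
}
```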
**B15 bitcoin sync UI lag** — Home.vue:485 polls every 30s. FIX: faster bitcoin refresh (~5-10s) (separate interval for bitcoin vs system stats).
**B16 bitcoin status vanishes** — homeStatus.ts refreshBitcoin clears/leaves bitcoinAvailable null on a failed/transitional poll → HomeSystemCard.vue:60 v-if hides the card. FIX: retain last-known bitcoinAvailable on transient failure + show an "Updating…" badge instead of disappearing.
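One way to express the "retain last-known" rule from the B16 fix is a small reducer. The `BitcoinStatus` and `PollResult` types are hypothetical; the `updating` flag is what would drive the "Updating…" badge, so a retained value is never presented as live.

```typescript
// Hypothetical store slice; names are illustrative.
interface BitcoinStatus {
  available: boolean | null; // null only before the first successful poll
  updating: boolean;         // true while showing a retained (possibly stale) value
}

type PollResult =
  | { kind: "ok"; available: boolean }
  | { kind: "failed" };

// On success take the fresh value; on failure keep the last-known value
// and flag it as updating instead of resetting it to null.
function nextBitcoinStatus(prev: BitcoinStatus, poll: PollResult): BitcoinStatus {
  if (poll.kind === "ok") {
    return { available: poll.available, updating: false };
  }
  return { available: prev.available, updating: true };
}
```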
**B12 mempool not connecting** — stacks.rs:1278 + apps/mempool-api/manifest.yml:50 hardcode CORE_RPC_HOST=bitcoin-knots; on nodes running bitcoin-core (not knots) mempool-api gets getaddrinfo ENOTFOUND bitcoin-knots. Also ELECTRUM_HOST=electrumx absent on pruned nodes (docs/CONTAINER_LIFECYCLE_HANDOFF.md:654). FIX: detect which bitcoin container runs (knots vs core) + set CORE_RPC_HOST dynamically; qualify the mempool stack so it doesn't half-start without electrumx. Backend (stacks.rs) — medium risk, test on .116.
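The decision logic for the B12 fix, sketched in TypeScript purely to show the rule; the real change would live in `stacks.rs` (Rust). The container name `bitcoin-core` and both function names are assumptions; `bitcoin-knots` and `electrumx` come from the note above.

```typescript
// Pick the RPC host from the set of running container names.
// "bitcoin-core" is an assumed container name; adjust to the real one.
function pickCoreRpcHost(running: string[]): string | null {
  if (running.indexOf("bitcoin-knots") !== -1) return "bitcoin-knots";
  if (running.indexOf("bitcoin-core") !== -1) return "bitcoin-core";
  return null; // no bitcoin node running
}

// Only start the mempool stack when both dependencies are present,
// so it does not half-start without electrumx.
function canStartMempool(running: string[]): boolean {
  return pickCoreRpcHost(running) !== null && running.indexOf("electrumx") !== -1;
}
```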
- 2026-06-15 (cont. 2): **B15 ✅** (poll 30s→10s) + **B7 ✅** (ElectrumX loader fail-open on stale) — committed `c0d41cf8`, type-check green. **B6 PARTIAL** (pruned gate already works; no-node-present half deferred). Fanned out investigations for B6/B7/B12/B13/B15/B16 — all root-caused with fix plans in FINDINGS above.
- **DEFERRED with ready plans (need a backend build + careful patch, or UI test, or live repro):** B13 (fedimint CSS — nginx sub_filter asset rewrite; bootstrap exact-match patch is fragile, do carefully), B12 (mempool host — dynamic bitcoin-knots/core detect in stacks.rs), B16 (bitcoin status retain — UI-test to avoid stale-as-live), B6 no-node-present half, B14b (FIPS net depth), B22/B23 (need live repro).
- **NEXT options:** (a) continue backend batch B13+B12 (one build); (b) do UI confirms on .116/.198 + cut v1.7.97-alpha with the ~10 committed fixes (LND incident + cloud/federation/mesh).
- **Committed fixes awaiting .97:** B5, B1, B2, B4, B14, B21, B3, B15, B7 (+ B6 pruned-gate already live). All on vps2 main; NOT on fleet yet.
## Progress log
- 2026-06-15: tracker created. v1.7.96-alpha shipped. All 19 bugs filed as Gitea issues #8–#26. vps2 feature issues (G#3/5/6) deferred (no new features).
- 2026-06-15: **B5 (LND CORS) ✅ DONE** — root-caused, both fixes implemented, verified on .116/.198/.103 (harness 4/4 each), committed `1db720af`, pushed to vps2 main. Will bundle into .97 (Gitea #12 to close on .97 ship).
- Validation nodes: .116 + .198 (pw ThisIsWeb54321@). Runtime is podman (docker not in non-interactive PATH). Sideload binary → /usr/local/bin/archipelago + restart (containers survive on these nodes).
- 2026-06-15 (cont.): **B1,B2,B4 ✅** dedup+guard — committed `ed493106`, unit-tested 2/2, live .198 healthy. **B14 ✅** transport recording — committed `1c6dc153` (after build-repair: used private `crate::federation::storage::` path → E0603; fixed to re-exported `crate::federation::`). **B21 ✅** Tor/FIPS pill — committed `0801dd66`. All pushed to vps2 main; builds verified EXIT 0.
- **Discovered B14b** (FIPS reaches only ~4/15 peers; rest genuinely Tor) and **B21** (pill) during the block.
- ⚠️ LESSON: a backgrounded build "completed" notification does NOT mean success — grep the EXIT code before committing (a broken commit reached main once; repaired by 1c6dc153; no release cut from it → fleet unaffected).
- **NEXT: B3 (peer media streaming — big), then B14b (FIPS reachability), then app-specific (B6,B7,B9–B13,B15–B19).** None deployed to fleet yet — all on vps2 main awaiting the .97 release after full .116/.198 + UI verification.
## New backlog issues filed 2026-06-16 (this session)
- #32 Tor chat: message stuck on spinner though peers received it (task #8)
- #33 Message toast: click-to-open chat + close icon (task #9)
- #34 Local UI images never rebuild on source change — orchestrator gap (task #7); blocks OTA of bitcoin-ui relay + fedimint CSS to existing fleet
- #35 Paid 10% video previews unplayable — truncated MP4 (task #6)
NOTE: bitcoin RPC relay UI + fedimint guardian CSS now LIVE on .116 (image rebuilds); .198 deploy in progress. Bitcoin app launches host-net UI at <node>:8334 (not /app/bitcoin-ui/ proxy).