archipelago 4cac6bc835 docs(tracker): record B1/B2/B4/B14/B21 done + B14b; next B3
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 13:27:51 -04:00

154 lines
19 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Production-Quality Bug Tracker
Living tracker for the post-v1.7.96 "no new features until production quality" push.
Updated continuously as we investigate → fix → test → pass. Kept in-repo so progress
survives a session cutoff.
## Rules (from user, 2026-06-15)
- **No new features** until the OS is production / no-bugs quality.
- **Test-harness-first**: build/extend a harness for each bug before fixing.
- **Validate every fix on `.116` + `.198`** (both 192.168.1.x, pw ThisIsWeb54321@) **+ the harness** BEFORE it goes into any release. (.198 still carries the LND CORS nginx duplicate → good for fix-(a) validation; .116 does not.)
- **Priority order**: cloud/federated-nodes + mesh FIRST, then app-specific, then low-pri.
## Status legend
`TODO` · `INVESTIGATING` · `ROOT-CAUSED` · `FIXING` · `TESTING` (on .116+harness) · `PASSED` · `SHIPPED`
## Release status
- **v1.7.96-alpha — SHIPPED** (2026-06-15). Live on vps2 (primary OTA): manifest v1.7.96-alpha, assets HTTP 200, `main@8c3c7954` + tag present. Contents: kiosk grid removal + FIPS TCP/UDP anchor selector. NOTE: gitea-local (localhost) mirror push failed (token rejected → /login); non-blocking, needs refreshed token.
- **v1.7.97-alpha — IN PROGRESS** (this push). Will bundle the verified fixes below.
---
## 🔴🔴 TOP PRIORITY
### B5 — LND "connect your wallet" details/QR broken fleet-wide — ROOT-CAUSED
Origin: user escalation. Symptom: LND connect screen (served on app port :18083) can't load details/QR.
Two distinct root causes (confirmed live):
- **(a) Duplicate ACAO** on `/lnd-connect-info` (seen on .103): backend sets `Access-Control-Allow-Origin` (proxy.rs:108) AND nginx `add_header` adds a second → browser rejects "multiple values". nginx config drift. Fix: bootstrap.rs nginx patch must strip the redundant `add_header` from the `/lnd-connect-info` location (backend owns CORS).
- **(b) No ACAO on `/proxy/lnd/v1/*` 401** (fleet-wide): the unauth/auth-layer 401 is produced before the CORS-adding proxy handler (proxy.rs:135 `handle_lnd_proxy`). Browser → "No 'Access-Control-Allow-Origin' header". Fix: ensure auth-layer/early-return responses for `/proxy/lnd` + `/lnd-connect-info` carry CORS headers.
- `.116` `/lnd-connect-info` returns a single correct ACAO → symptom varies by node's nginx state.
- Backend CORS helper: handler/mod.rs `app_cors_origin()` (:270) — reflects Origin when its host == request host.
- Backend change → ships in .97. **Status: ✅ PASSED — verified on .116, .198, .103 (harness 4/4 each). Ready to bundle into .97.**
- Caveat: bootstrap's nginx dup-strip runs a few seconds AFTER /health goes green (async patch+reload) — converges within ~1 min of restart; not instant. Acceptable.
- **CODE CHANGES MADE (uncommitted):**
- `core/archipelago/src/bootstrap.rs`: added `NGINX_LND_DUP_CORS` const + strip in `patch_nginx_conf()` (removes the duplicate nginx `add_header` ACAO from `/lnd-connect-info` so the backend's single header wins). Idempotent; runs on startup nginx bootstrap. → fixes (a)
- `core/archipelago/src/api/handler/mod.rs`: new `unauthorized_cors(origin)` helper (:~205) + `/proxy/lnd/` route (:~505) computes origin first and returns `unauthorized_cors` so the 401 carries ACAO. → fixes (b)
- Test on **.116** for (b); test on **.103** for (a) [.116 has no dup to strip].
- **2026-06-15 RESULT — .116 (fix b): harness 4/4 PASS** (sideloaded built binary, restarted). `/proxy/lnd/v1/*` now returns CORS on the 401. ✅
- (Correction: an earlier "LND container MISSING" reading was a FALSE alarm — `docker` isn't in the non-interactive PATH; runtime is **podman**. Verified `lnd Up 9h` — containers SURVIVED the restart cleanly.)
- Next: deploy to .103 + run harness to confirm fix (a) (nginx dup strip).
- **Harness:** `tests/production-quality/lnd-cors-test.sh <node>` — asserts single correct ACAO on /lnd-connect-info + ACAO present on /proxy/lnd/v1/{getinfo,channels}. Baseline (2026-06-15): .116 = 2 pass/2 fail (proxy missing ACAO); .103 = 1 pass/3 fail (connect-info dup + proxy missing).
- **FIX PLAN (precise):**
1. (b) handler/mod.rs:504-508 `/proxy/lnd/` returns `Self::unauthorized()` (401, NO CORS) when session check fails → browser CORS wall. Add CORS (app_cors_origin) to that 401. Same pattern for any other app-origin early-return.
2. (a) nginx `/lnd-connect-info` location double-adds ACAO (backend + nginx `add_header`). Strip the nginx `add_header Access-Control-Allow-Origin` there; backend owns CORS. Update bootstrap.rs nginx patch to remove it on existing nodes (idempotent).
- Verify: rebuild backend, deploy to .116, run harness → expect 3/3 (or 4 assertions) PASS on .116 AND .103.
---
## 🔴 PRIORITY — cloud / federation / mesh
### B1 — Trusted-node list not clean — PASSED (onion-dedup; unit test 2/2; live .198 15→13 distinct, healthy). UI visual-confirm recommended.
Dupes, erroneous names, and non-convergent group membership across nodes. Expected: trusted nodes form a transitive group (every node connects to any newly-added trusted node; all nodes show the same set). `.103` has a long/dirty list.
### B2 — Duplicate chat contact for one node — PASSED (resolved by load-dedup feeding mesh seed; unit-tested). UI visual-confirm recommended.
Federated peer "sapien" shows TWO chats: one "sapien" WITHOUT archy logo (looks non-federated) + one named by raw DID `did:key:z6MkoSbN5CM7fBaQg2nWbCymEkFXsHnuXvec9Mjo5RtJf9dQ`. Same node keyed by both federated identity and raw DID → merge to one. Code: core/archipelago/src/mesh + mesh/typed_messages.rs (note :233 — meshcore adverts don't carry archy pubkey).
### B3 — Cloud peer media won't preview/play — ROOT-CAUSED (plan ready: streaming proxy endpoint)
Music/video preview files on peer nodes' cloud don't play (streaming/range/content-type over mesh+Tor peer fetch).
### B4 — Cloud "my folders" fails (JSON parse / 502) — PASSED (content-type guard; built, guard in bundle, deployed .198). UI visual-confirm recommended.
`Unexpected token '<', "<!doctype"` when FileBrowser absent (`/app/filebrowser/api/resources` → SPA index.html), and **502** when FileBrowser is down (seen on .103). filebrowser-client.ts:102/:106. Fix: detect FileBrowser unavailable, friendly prompt; consider nginx returning JSON 404/502 for missing `/app/<app>/` instead of SPA shell. Handle BOTH absent + down.
### B14 — cloud browse transport not recorded — FIXED (record_peer_transport in 4 content handlers; build OK). NOTE: live data shows FIPS reaches only ~4/15 peers, 6 fall back to Tor genuinely → see B14b.
Browsing trusted/peer nodes in the Cloud tab connects over Tor instead of FIPS (should prefer FIPS like the rest of mesh; same for peer browsing). cf project_fips_integration, project_tor_node_to_node_works (last_transport should be fips/mesh).
---
## 🟠 APP-SPECIFIC
### B6 — ElectrumX install button missing "Requires Archival Node" gate — TODO
Show the yellow requirement badge when no full node / only a pruned node is present (reuse existing yellow badge pattern).
### B7 — ElectrumX UI stuck loader on top — TODO
UI renders but a loader sits on top; possibly stale pre-sync screen not clearing.
### B9 — IndeedHub keeps stopping on nodes — TODO
Container won't stay running (crash-loop / reconcile stop). Check logs + restart policy + health.
### B10 — Immich still crashes — TODO
Recurring crash ("still" → prior attempts). Check container logs + resource limits + DB/ML deps.
### B11 — Companion app: "open in external browser" apps don't work — TODO
Apps meant to open in a new/external browser don't launch from the companion app; need the phone-default-browser request-modal pattern mobile apps use. Relates to v1.7.90 "open in new tab from companion app".
### B12 — Mempool not connecting to Bitcoin on some nodes — TODO
mempool can't reach the Bitcoin backend on some nodes. Investigate on .116. Check mempool→electrs→bitcoind wiring + deps.
### B13 — Fedimint UI not applying CSS — TODO
Actual Fedimint UI (not pre-sync) renders unstyled. Likely asset path / proxy base-href (assets rooted at `/` vs `/app/fedimint/`).
### B15 — Bitcoin UI sync progress lags — TODO
Bitcoin UI doesn't update its sync progress fast enough even though the console clearly already has the block-height data. Likely a polling-interval / reactive-update gap between the status source and the UI.
### B16 — Bitcoin sync status on Home > System container vanishes — TODO
The bitcoin sync status in the Home > System container disappears when it should persist/cache and show an "updating" state. Related to B15 (Bitcoin UI sync lag). Likely the status component clears on empty/transitional poll instead of retaining last-known + showing updating.
### B17 — archipelago.service flaps on boot before starting — TODO
On some boots, `[FAILED] Failed to start archipelago.service - Archipelago Backend` prints ~20 times over ~5 min before it finally starts properly. Likely a startup dependency/timing race (DB lock, port bind, crash-recovery, or a dependency not ready) causing systemd restart loop until a precondition is met. Check service Restart=/RestartSec, ExecStartPre gates, and what the early failures log. May tie to B16/crash-recovery.
### B18 — Apps stop right after install (or become unstartable) — TODO
Many apps install but immediately stop, requiring a manual Start — or become unstartable entirely. Likely the install→start handoff / reconciler doesn't bring them up (or starts then they exit). Related to B9 (IndeedHub stopping), B10 (Immich). Possibly linked to the cgroup-SIGKILL-on-archipelago.service-restart issue (feedback_no_systemctl_deploy_until_quadlet) — but NOTE: on .116 (Quadlet) containers survived a service restart cleanly, so the reconciler may be fine there; reproduce on the affected nodes. Check post-install start sequencing + boot_reconciler + container restart policy + cgroup placement.
### B19 — Failed download-update lands on Install button (should be Download) — TODO
When an update download fails, the UI sometimes shows the Install button instead of returning to the Download button — a big UX issue (user can't retry the download cleanly). Check the SystemUpdate state machine's error/failure transition.
### B20 — Surface bitcoin-headers-over-mesh broadcast (send/receive toggles) — TODO (feature-adjacent, surfacing existing work)
We previously broadcast bitcoin block headers over mesh to archipelago nodes but never fully surfaced it. Want two switches: "send headers" (you broadcast) and "receive headers" (you accept). NOTE: this is feature-adjacent — surfacing existing functionality; the user added it during the no-new-features push, so treat as low-priority polish until the bug list is clear. Code: mesh block-headers (mesh.block-headers RPC seen in logs; core/archipelago/src/mesh).
### B14b — FIPS reachability: many peers fall back to Tor — TODO (priority, deeper)
Live (2026-06-15) federation sync last_transport on .116/.198: ~4 peers fips, ~6 tor, ~5 none. So beyond the recording fix (B14), FIPS genuinely doesn't reach many federated peers (they use Tor). Investigate WHY: is fips_npub known for those peers? are they FIPS-online? is the shared anchor connecting them? (cf project_fips_integration, project_tor_node_to_node_works). This is the real "Tor not FIPS" depth.
### B21 — Show Tor/FIPS transport pill on cloud browse — FIXED (build+type-check green; deploy+UI-confirm on .116/.198)
Tag whether the peer connection is Tor or FIPS and surface it as a small pill on the cloud browse screens / connection loader. Data source: federation node last_transport (now recorded by B14) exposed via federation.list-nodes; frontend renders a pill (FIPS=fast/green, Tor=slower) on PeerFiles.vue / Cloud peer view + the connection loader. Frontend-only-ish. FINDINGS: PeerFiles.vue:46 loader HARDCODES 'Connecting via Tor...' even when FIPS used (bug). Frontend types already have last_transport ('fips'|'tor'|'mesh'|'lan') federation/types.ts:31; NodeList.vue:167 already renders a transport indicator. PLAN: have content.browse-peer RETURN the transport used (B14 already computes it) → frontend shows a pill (FIPS green / Tor amber) on PeerFiles header + fix the loader text to reflect actual/attempted transport. Small backend (add transport to browse response) + frontend pill.
### B8 — netbird app doesn't work — TODO (LOW / much later)
(RETRACTED: CryptPad placeholder-icon — user says cryptpad is fine.)
---
## 📋 vps2 Gitea issues (lfg2025/archy) — imported 2026-06-15
- G#1 [Bug] Strange peer request behaviour — TODO (likely related to B1/federation)
- G#2 [Bug] Fix flashing USB from kiosk — TODO
- G#3 [Feature] VPN Configuration — DEFERRED (feature; no new features until production quality)
- G#4 [Bug] Bitcoind is slow — TODO
- G#5 [Feature] OpenWRT and TollGate integration — DEFERRED (feature)
- G#6 [Feature] Move dashboard/monitoring link to home screen — DEFERRED (feature)
- G#7 [Bug] Scrolling with Companion app — TODO
---
## Gitea issue mapping (vps2 lfg2025/archy)
All backlog bugs now mirrored as Gitea issues: B1→#8, B2→#9, B3→#10, B4→#11, B5→#12, B6→#13, B7→#14, B8→#15, B9→#16, B10→#17, B11→#18, B12→#19, B13→#20, B14→#21, B15→#22, B16→#23, B17→#24, B18→#25, B19→#26. (Pre-existing G#17 remain; some overlap, e.g. G#1 strange-peer ≈ B1.) Close the Gitea issue when a bug is verified+shipped.
## INVESTIGATION FINDINGS 2026-06-15 (B1/B2/B3/B4/B14) — cutoff insurance
**B1 trusted-node divergence** — ROOT-CAUSED. `federation/sync.rs` `merge_transitive_peers()` (~:140) dedupes ONLY by DID; the SAME physical node appears under multiple DIDs (same `onion` + `fips_npub`) → duplicate entries ("Arch Dev" ×2, "Sapien" ×2). No background convergence → lists diverge (.103=16 nodes, .116/.198=15). Model: `federation/types.rs:24` FederatedNode (PK=did); storage `federation/storage.rs` nodes.json; add_node dedupes by DID only (:125). FIX: in merge_transitive_peers add a SECOND match arm — if no DID match, match by normalized `onion` (trim .onion); if found, treat as same node (merge fips_npub/name, don't add). Same dedup on add_node. Plus a one-time cleanup of existing dup DIDs (remove-node the stale one). TEST: after sync, all 3 nodes have identical node set, no two entries share an onion.
**B2 duplicate chat contact** — ROOT-CAUSED (same root as B1). Two federation DIDs (same onion/fips_npub, e.g. "Sapien" dids z6MkoSbN… + z6MkeYMU…) get seeded as TWO mesh contacts: `mesh/mod.rs` `seed_federation_peers_into_mesh()` (~:94) upserts per-pubkey contact_id; frontend `Mesh.vue` `mergeKeyForPeer()` (~:492) keys by DID so two DIDs = two rows. FIX: (backend) in seed, skip a node whose onion was already seeded (HashSet of onions); (frontend) Mesh.vue merge by onion when DIDs differ but onion matches. Fixing B1's onion-dedup largely resolves this too. TEST: one "Sapien" row; `mesh.peers` has one contact for the shared onion.
**B3 peer media won't play** — ROOT-CAUSED. `PeerFiles.vue` `playMedia()`/`loadPreview()` (~:358,:508) fetch the WHOLE file via RPC `content.preview-peer`/`content.download-peer` (`api/rpc/content.rs` :393,:213) which base64-encodes the entire file; frontend makes a Blob URL → browser can't Range-seek → video/large-audio won't play (+ 30/120s timeouts truncate big files). The peer's HTTP `/content/<id>` handler (`api/handler/content.rs` :49) ALREADY supports Range/206 + Accept-Ranges. FIX (bigger): add a local streaming proxy endpoint `/api/peer-content/{onion}/{id}` in `api/handler/mod.rs` that forwards the browser's Range header to the peer's `/content/<id>` (via fips::dial PeerRequest) and streams back 206 + Content-Range + Content-Type; frontend sets `<video>/<audio>` src to that URL (not a blob). TEST: curl Range on the new endpoint → 206 + Content-Range; video seeks/plays.
**B4 cloud my-folders <!doctype/502** — ROOT-CAUSED. `filebrowser-client.ts` `listDirectory()` (:99) does `res.json()` (:106) after only an `res.ok` check; when FileBrowser is ABSENT nginx serves SPA index.html (200, '<!doctype') → JSON crash; when DOWN → 502. FIX (frontend, low-risk): guard res content-type !== application/json → throw typed "FileBrowser unavailable" handled by Cloud.vue/CloudFolder.vue empty-state; same guard in login() (:71) + getUsage() (:215). OPTIONAL nginx: add `error_page 502 503 = @filebrowser_unavailable` returning JSON in the /app/filebrowser/ block (image-recipe/configs/nginx-archipelago.conf ~:411). TEST: stop filebrowser on .116/.198 → Cloud shows friendly state, no doctype crash.
**B14 cloud browse Tor-not-FIPS** — ROOT-CAUSED (nuance). FIPS-first logic WORKS (`fips/dial.rs` send_get :331 tries FIPS, falls back to Tor on 404/5xx; v1.7.94 fix). BUT the 4 content handlers in `api/rpc/content.rs` (browse :297, download :237, download_paid :356, preview :421) capture `_transport` and NEVER call `record_peer_transport()` → UI badge shows Tor/null even when FIPS used. FIX: add `record_peer_transport(data_dir, None, Some(onion), &transport.to_string())` after each successful send_get (storage.rs:84 has the fn). ⚠️ VERIFY on nodes whether FIPS is ACTUALLY used or genuinely falling back to Tor (if genuinely Tor, deeper FIPS-reachability issue beyond recording). TEST: after browse, last_transport = fips (when peer FIPS-reachable).
## Progress log
- 2026-06-15: tracker created. v1.7.96-alpha shipped. All 19 bugs filed as Gitea issues #8#26. vps2 feature issues (G#3/5/6) deferred (no new features).
- 2026-06-15: **B5 (LND CORS) ✅ DONE** — root-caused, both fixes implemented, verified on .116/.198/.103 (harness 4/4 each), committed `1db720af`, pushed to vps2 main. Will bundle into .97 (Gitea #12 to close on .97 ship).
- Validation nodes: .116 + .198 (pw ThisIsWeb54321@). Runtime is podman (docker not in non-interactive PATH). Sideload binary → /usr/local/bin/archipelago + restart (containers survive on these nodes).
- 2026-06-15 (cont.): **B1,B2,B4 ✅** dedup+guard — committed `ed493106`, unit-tested 2/2, live .198 healthy. **B14 ✅** transport recording — committed `1c6dc153` (after build-repair: used private `crate::federation::storage::` path → E0603; fixed to re-exported `crate::federation::`). **B21 ✅** Tor/FIPS pill — committed `0801dd66`. All pushed to vps2 main; builds verified EXIT 0.
- **Discovered B14b** (FIPS reaches only ~4/15 peers; rest genuinely Tor) and **B21** (pill) during the block.
- ⚠️ LESSON: a backgrounded build "completed" notification does NOT mean success — grep the EXIT code before committing (a broken commit reached main once; repaired by 1c6dc153; no release cut from it → fleet unaffected).
- **NEXT: B3 (peer media streaming — big), then B14b (FIPS reachability), then app-specific (B6,B7,B9B13,B15B19).** None deployed to fleet yet — all on vps2 main awaiting the .97 release after full .116/.198 + UI verification.