154 lines
19 KiB
Markdown
154 lines
19 KiB
Markdown
# Production-Quality Bug Tracker
|
||
|
||
Living tracker for the post-v1.7.96 "no new features until production quality" push.
|
||
Updated continuously as we investigate → fix → test → pass. Kept in-repo so progress
|
||
survives a session cutoff.
|
||
|
||
## Rules (from user, 2026-06-15)
|
||
- **No new features** until the OS is production / no-bugs quality.
|
||
- **Test-harness-first**: build/extend a harness for each bug before fixing.
|
||
- **Validate every fix on `.116` + `.198`** (both 192.168.1.x, pw ThisIsWeb54321@) **+ the harness** BEFORE it goes into any release. (.198 still carries the LND CORS nginx duplicate → good for fix-(a) validation; .116 does not.)
|
||
- **Priority order**: cloud/federated-nodes + mesh FIRST, then app-specific, then low-pri.
|
||
|
||
## Status legend
|
||
`TODO` · `INVESTIGATING` · `ROOT-CAUSED` · `FIXING` · `TESTING` (on .116+harness) · `PASSED` · `SHIPPED`
|
||
|
||
## Release status
|
||
- **v1.7.96-alpha — SHIPPED** (2026-06-15). Live on vps2 (primary OTA): manifest v1.7.96-alpha, assets HTTP 200, `main@8c3c7954` + tag present. Contents: kiosk grid removal + FIPS TCP/UDP anchor selector. NOTE: gitea-local (localhost) mirror push failed (token rejected → /login); non-blocking, needs refreshed token.
|
||
- **v1.7.97-alpha — IN PROGRESS** (this push). Will bundle the verified fixes below.
|
||
|
||
---
|
||
|
||
## 🔴🔴 TOP PRIORITY
|
||
|
||
### B5 — LND "connect your wallet" details/QR broken fleet-wide — ROOT-CAUSED
|
||
Origin: user escalation. Symptom: LND connect screen (served on app port :18083) can't load details/QR.
|
||
Two distinct root causes (confirmed live):
|
||
- **(a) Duplicate ACAO** on `/lnd-connect-info` (seen on .103): backend sets `Access-Control-Allow-Origin` (proxy.rs:108) AND nginx `add_header` adds a second → browser rejects "multiple values". nginx config drift. Fix: bootstrap.rs nginx patch must strip the redundant `add_header` from the `/lnd-connect-info` location (backend owns CORS).
|
||
- **(b) No ACAO on `/proxy/lnd/v1/*` 401** (fleet-wide): the unauth/auth-layer 401 is produced before the CORS-adding proxy handler (proxy.rs:135 `handle_lnd_proxy`). Browser → "No 'Access-Control-Allow-Origin' header". Fix: ensure auth-layer/early-return responses for `/proxy/lnd` + `/lnd-connect-info` carry CORS headers.
|
||
- `.116` `/lnd-connect-info` returns a single correct ACAO → symptom varies by node's nginx state.
|
||
- Backend CORS helper: handler/mod.rs `app_cors_origin()` (:270) — reflects Origin when its host == request host.
|
||
- Backend change → ships in .97. **Status: ✅ PASSED — verified on .116, .198, .103 (harness 4/4 each). Ready to bundle into .97.**
|
||
- Caveat: bootstrap's nginx dup-strip runs a few seconds AFTER /health goes green (async patch+reload) — converges within ~1 min of restart; not instant. Acceptable.
|
||
- **CODE CHANGES MADE (uncommitted):**
|
||
- `core/archipelago/src/bootstrap.rs`: added `NGINX_LND_DUP_CORS` const + strip in `patch_nginx_conf()` (removes the duplicate nginx `add_header` ACAO from `/lnd-connect-info` so the backend's single header wins). Idempotent; runs on startup nginx bootstrap. → fixes (a)
|
||
- `core/archipelago/src/api/handler/mod.rs`: new `unauthorized_cors(origin)` helper (:~205) + `/proxy/lnd/` route (:~505) computes origin first and returns `unauthorized_cors` so the 401 carries ACAO. → fixes (b)
|
||
- Test on **.116** for (b); test on **.103** for (a) [.116 has no dup to strip].
|
||
- **2026-06-15 RESULT — .116 (fix b): harness 4/4 PASS** (sideloaded built binary, restarted). `/proxy/lnd/v1/*` now returns CORS on the 401. ✅
|
||
- (Correction: an earlier "LND container MISSING" reading was a FALSE alarm — `docker` isn't in the non-interactive PATH; runtime is **podman**. Verified `lnd Up 9h` — containers SURVIVED the restart cleanly.)
|
||
- Next: deploy to .103 + run harness to confirm fix (a) (nginx dup strip).
|
||
- **Harness:** `tests/production-quality/lnd-cors-test.sh <node>` — asserts single correct ACAO on /lnd-connect-info + ACAO present on /proxy/lnd/v1/{getinfo,channels}. Baseline (2026-06-15): .116 = 2 pass/2 fail (proxy missing ACAO); .103 = 1 pass/3 fail (connect-info dup + proxy missing).
|
||
- **FIX PLAN (precise):**
|
||
1. (b) handler/mod.rs:504-508 `/proxy/lnd/` returns `Self::unauthorized()` (401, NO CORS) when session check fails → browser CORS wall. Add CORS (app_cors_origin) to that 401. Same pattern for any other app-origin early-return.
|
||
2. (a) nginx `/lnd-connect-info` location double-adds ACAO (backend + nginx `add_header`). Strip the nginx `add_header Access-Control-Allow-Origin` there; backend owns CORS. Update bootstrap.rs nginx patch to remove it on existing nodes (idempotent).
|
||
- Verify: rebuild backend, deploy to .116, run harness → expect 3/3 (or 4 assertions) PASS on .116 AND .103.
|
||
|
||
---
|
||
|
||
## 🔴 PRIORITY — cloud / federation / mesh
|
||
|
||
### B1 — Trusted-node list not clean — PASSED (onion-dedup; unit test 2/2; live .198 15→13 distinct, healthy). UI visual-confirm recommended.
|
||
Dupes, erroneous names, and non-convergent group membership across nodes. Expected: trusted nodes form a transitive group (every node connects to any newly-added trusted node; all nodes show the same set). `.103` has a long/dirty list.
|
||
|
||
### B2 — Duplicate chat contact for one node — PASSED (resolved by load-dedup feeding mesh seed; unit-tested). UI visual-confirm recommended.
|
||
Federated peer "sapien" shows TWO chats: one "sapien" WITHOUT archy logo (looks non-federated) + one named by raw DID `did:key:z6MkoSbN5CM7fBaQg2nWbCymEkFXsHnuXvec9Mjo5RtJf9dQ`. Same node keyed by both federated identity and raw DID → merge to one. Code: core/archipelago/src/mesh + mesh/typed_messages.rs (note :233 — meshcore adverts don't carry archy pubkey).
|
||
|
||
### B3 — Cloud peer media won't preview/play — ROOT-CAUSED (plan ready: streaming proxy endpoint)
|
||
Music/video preview files on peer nodes' cloud don't play (streaming/range/content-type over mesh+Tor peer fetch).
|
||
|
||
### B4 — Cloud "my folders" fails (JSON parse / 502) — PASSED (content-type guard; built, guard in bundle, deployed .198). UI visual-confirm recommended.
|
||
`Unexpected token '<', "<!doctype"` when FileBrowser absent (`/app/filebrowser/api/resources` → SPA index.html), and **502** when FileBrowser is down (seen on .103). filebrowser-client.ts:102/:106. Fix: detect FileBrowser unavailable, friendly prompt; consider nginx returning JSON 404/502 for missing `/app/<app>/` instead of SPA shell. Handle BOTH absent + down.
|
||
|
||
### B14 — cloud browse transport not recorded — FIXED (record_peer_transport in 4 content handlers; build OK). NOTE: live data shows FIPS reaches only ~4/15 peers, 6 fall back to Tor genuinely → see B14b.
|
||
Browsing trusted/peer nodes in the Cloud tab connects over Tor instead of FIPS (should prefer FIPS like the rest of mesh; same for peer browsing). cf project_fips_integration, project_tor_node_to_node_works (last_transport should be fips/mesh).
|
||
|
||
---
|
||
|
||
## 🟠 APP-SPECIFIC
|
||
|
||
### B6 — ElectrumX install button missing "Requires Archival Node" gate — TODO
|
||
Show the yellow requirement badge when no full node / only a pruned node is present (reuse existing yellow badge pattern).
|
||
|
||
### B7 — ElectrumX UI stuck loader on top — TODO
|
||
UI renders but a loader sits on top; possibly stale pre-sync screen not clearing.
|
||
|
||
### B9 — IndeedHub keeps stopping on nodes — TODO
|
||
Container won't stay running (crash-loop / reconcile stop). Check logs + restart policy + health.
|
||
|
||
### B10 — Immich still crashes — TODO
|
||
Recurring crash ("still" → prior attempts). Check container logs + resource limits + DB/ML deps.
|
||
|
||
### B11 — Companion app: "open in external browser" apps don't work — TODO
|
||
Apps meant to open in a new/external browser don't launch from the companion app; need the phone-default-browser request-modal pattern mobile apps use. Relates to v1.7.90 "open in new tab from companion app".
|
||
|
||
### B12 — Mempool not connecting to Bitcoin on some nodes — TODO
|
||
mempool can't reach the Bitcoin backend on some nodes. Investigate on .116. Check mempool→electrs→bitcoind wiring + deps.
|
||
|
||
### B13 — Fedimint UI not applying CSS — TODO
|
||
Actual Fedimint UI (not pre-sync) renders unstyled. Likely asset path / proxy base-href (assets rooted at `/` vs `/app/fedimint/`).
|
||
|
||
### B15 — Bitcoin UI sync progress lags — TODO
|
||
Bitcoin UI doesn't update its sync progress fast enough even though the console clearly already has the block-height data. Likely a polling-interval / reactive-update gap between the status source and the UI.
|
||
|
||
### B16 — Bitcoin sync status on Home > System container vanishes — TODO
|
||
The bitcoin sync status in the Home > System container disappears when it should persist/cache and show an "updating" state. Related to B15 (Bitcoin UI sync lag). Likely the status component clears on empty/transitional poll instead of retaining last-known + showing updating.
|
||
|
||
### B17 — archipelago.service flaps on boot before starting — TODO
|
||
On some boots, `[FAILED] Failed to start archipelago.service - Archipelago Backend` prints ~20 times over ~5 min before it finally starts properly. Likely a startup dependency/timing race (DB lock, port bind, crash-recovery, or a dependency not ready) causing systemd restart loop until a precondition is met. Check service Restart=/RestartSec, ExecStartPre gates, and what the early failures log. May tie to B16/crash-recovery.
|
||
|
||
### B18 — Apps stop right after install (or become unstartable) — TODO
|
||
Many apps install but immediately stop, requiring a manual Start — or become unstartable entirely. Likely the install→start handoff / reconciler doesn't bring them up (or starts then they exit). Related to B9 (IndeedHub stopping), B10 (Immich). Possibly linked to the cgroup-SIGKILL-on-archipelago.service-restart issue (feedback_no_systemctl_deploy_until_quadlet) — but NOTE: on .116 (Quadlet) containers survived a service restart cleanly, so the reconciler may be fine there; reproduce on the affected nodes. Check post-install start sequencing + boot_reconciler + container restart policy + cgroup placement.
|
||
|
||
### B19 — Failed download-update lands on Install button (should be Download) — TODO
|
||
When an update download fails, the UI sometimes shows the Install button instead of returning to the Download button — a big UX issue (user can't retry the download cleanly). Check the SystemUpdate state machine's error/failure transition.
|
||
|
||
### B20 — Surface bitcoin-headers-over-mesh broadcast (send/receive toggles) — TODO (feature-adjacent, surfacing existing work)
|
||
We previously broadcast bitcoin block headers over mesh to archipelago nodes but never fully surfaced it. Want two switches: "send headers" (you broadcast) and "receive headers" (you accept). NOTE: this is feature-adjacent — surfacing existing functionality; the user added it during the no-new-features push, so treat as low-priority polish until the bug list is clear. Code: mesh block-headers (mesh.block-headers RPC seen in logs; core/archipelago/src/mesh).
|
||
|
||
### B14b — FIPS reachability: many peers fall back to Tor — TODO (priority, deeper)
|
||
Live (2026-06-15) federation sync last_transport on .116/.198: ~4 peers fips, ~6 tor, ~5 none. So beyond the recording fix (B14), FIPS genuinely doesn't reach many federated peers (they use Tor). Investigate WHY: is fips_npub known for those peers? are they FIPS-online? is the shared anchor connecting them? (cf project_fips_integration, project_tor_node_to_node_works). This is the real "Tor not FIPS" depth.
|
||
|
||
### B21 — Show Tor/FIPS transport pill on cloud browse — FIXED (build+type-check green; deploy+UI-confirm on .116/.198)
|
||
Tag whether the peer connection is Tor or FIPS and surface it as a small pill on the cloud browse screens / connection loader. Data source: federation node last_transport (now recorded by B14) exposed via federation.list-nodes; frontend renders a pill (FIPS=fast/green, Tor=slower) on PeerFiles.vue / Cloud peer view + the connection loader. Frontend-only-ish. FINDINGS: PeerFiles.vue:46 loader HARDCODES 'Connecting via Tor...' even when FIPS used (bug). Frontend types already have last_transport ('fips'|'tor'|'mesh'|'lan') federation/types.ts:31; NodeList.vue:167 already renders a transport indicator. PLAN: have content.browse-peer RETURN the transport used (B14 already computes it) → frontend shows a pill (FIPS green / Tor amber) on PeerFiles header + fix the loader text to reflect actual/attempted transport. Small backend (add transport to browse response) + frontend pill.
|
||
|
||
### B8 — netbird app doesn't work — TODO (LOW / much later)
|
||
|
||
(RETRACTED: CryptPad placeholder-icon — user says cryptpad is fine.)
|
||
|
||
---
|
||
|
||
## 📋 vps2 Gitea issues (lfg2025/archy) — imported 2026-06-15
|
||
- G#1 [Bug] Strange peer request behaviour — TODO (likely related to B1/federation)
|
||
- G#2 [Bug] Fix flashing USB from kiosk — TODO
|
||
- G#3 [Feature] VPN Configuration — DEFERRED (feature; no new features until production quality)
|
||
- G#4 [Bug] Bitcoind is slow — TODO
|
||
- G#5 [Feature] OpenWRT and TollGate integration — DEFERRED (feature)
|
||
- G#6 [Feature] Move dashboard/monitoring link to home screen — DEFERRED (feature)
|
||
- G#7 [Bug] Scrolling with Companion app — TODO
|
||
|
||
---
|
||
|
||
## Gitea issue mapping (vps2 lfg2025/archy)
|
||
All backlog bugs now mirrored as Gitea issues: B1→#8, B2→#9, B3→#10, B4→#11, B5→#12, B6→#13, B7→#14, B8→#15, B9→#16, B10→#17, B11→#18, B12→#19, B13→#20, B14→#21, B15→#22, B16→#23, B17→#24, B18→#25, B19→#26. (Pre-existing G#1–7 remain; some overlap, e.g. G#1 strange-peer ≈ B1.) Close the Gitea issue when a bug is verified+shipped.
|
||
|
||
## INVESTIGATION FINDINGS 2026-06-15 (B1/B2/B3/B4/B14) — cutoff insurance
|
||
|
||
**B1 trusted-node divergence** — ROOT-CAUSED. `federation/sync.rs` `merge_transitive_peers()` (~:140) dedupes ONLY by DID; the SAME physical node appears under multiple DIDs (same `onion` + `fips_npub`) → duplicate entries ("Arch Dev" ×2, "Sapien" ×2). No background convergence → lists diverge (.103=16 nodes, .116/.198=15). Model: `federation/types.rs:24` FederatedNode (PK=did); storage `federation/storage.rs` nodes.json; add_node dedupes by DID only (:125). FIX: in merge_transitive_peers add a SECOND match arm — if no DID match, match by normalized `onion` (trim .onion); if found, treat as same node (merge fips_npub/name, don't add). Same dedup on add_node. Plus a one-time cleanup of existing dup DIDs (remove-node the stale one). TEST: after sync, all 3 nodes have identical node set, no two entries share an onion.
|
||
|
||
**B2 duplicate chat contact** — ROOT-CAUSED (same root as B1). Two federation DIDs (same onion/fips_npub, e.g. "Sapien" dids z6MkoSbN… + z6MkeYMU…) get seeded as TWO mesh contacts: `mesh/mod.rs` `seed_federation_peers_into_mesh()` (~:94) upserts per-pubkey contact_id; frontend `Mesh.vue` `mergeKeyForPeer()` (~:492) keys by DID so two DIDs = two rows. FIX: (backend) in seed, skip a node whose onion was already seeded (HashSet of onions); (frontend) Mesh.vue merge by onion when DIDs differ but onion matches. Fixing B1's onion-dedup largely resolves this too. TEST: one "Sapien" row; `mesh.peers` has one contact for the shared onion.
|
||
|
||
**B3 peer media won't play** — ROOT-CAUSED. `PeerFiles.vue` `playMedia()`/`loadPreview()` (~:358,:508) fetch the WHOLE file via RPC `content.preview-peer`/`content.download-peer` (`api/rpc/content.rs` :393,:213) which base64-encodes the entire file; frontend makes a Blob URL → browser can't Range-seek → video/large-audio won't play (+ 30/120s timeouts truncate big files). The peer's HTTP `/content/<id>` handler (`api/handler/content.rs` :49) ALREADY supports Range/206 + Accept-Ranges. FIX (bigger): add a local streaming proxy endpoint `/api/peer-content/{onion}/{id}` in `api/handler/mod.rs` that forwards the browser's Range header to the peer's `/content/<id>` (via fips::dial PeerRequest) and streams back 206 + Content-Range + Content-Type; frontend sets `<video>/<audio>` src to that URL (not a blob). TEST: curl Range on the new endpoint → 206 + Content-Range; video seeks/plays.
|
||
|
||
**B4 cloud my-folders <!doctype/502** — ROOT-CAUSED. `filebrowser-client.ts` `listDirectory()` (:99) does `res.json()` (:106) after only an `res.ok` check; when FileBrowser is ABSENT nginx serves SPA index.html (200, '<!doctype') → JSON crash; when DOWN → 502. FIX (frontend, low-risk): guard res content-type !== application/json → throw typed "FileBrowser unavailable" handled by Cloud.vue/CloudFolder.vue empty-state; same guard in login() (:71) + getUsage() (:215). OPTIONAL nginx: add `error_page 502 503 = @filebrowser_unavailable` returning JSON in the /app/filebrowser/ block (image-recipe/configs/nginx-archipelago.conf ~:411). TEST: stop filebrowser on .116/.198 → Cloud shows friendly state, no doctype crash.
|
||
|
||
**B14 cloud browse Tor-not-FIPS** — ROOT-CAUSED (nuance). FIPS-first logic WORKS (`fips/dial.rs` send_get :331 tries FIPS, falls back to Tor on 404/5xx; v1.7.94 fix). BUT the 4 content handlers in `api/rpc/content.rs` (browse :297, download :237, download_paid :356, preview :421) capture `_transport` and NEVER call `record_peer_transport()` → UI badge shows Tor/null even when FIPS used. FIX: add `record_peer_transport(data_dir, None, Some(onion), &transport.to_string())` after each successful send_get (storage.rs:84 has the fn). ⚠️ VERIFY on nodes whether FIPS is ACTUALLY used or genuinely falling back to Tor (if genuinely Tor, deeper FIPS-reachability issue beyond recording). TEST: after browse, last_transport = fips (when peer FIPS-reachable).
|
||
|
||
## Progress log
|
||
- 2026-06-15: tracker created. v1.7.96-alpha shipped. All 19 bugs filed as Gitea issues #8–#26. vps2 feature issues (G#3/5/6) deferred (no new features).
|
||
- 2026-06-15: **B5 (LND CORS) ✅ DONE** — root-caused, both fixes implemented, verified on .116/.198/.103 (harness 4/4 each), committed `1db720af`, pushed to vps2 main. Will bundle into .97 (Gitea #12 to close on .97 ship).
|
||
- Validation nodes: .116 + .198 (pw ThisIsWeb54321@). Runtime is podman (docker not in non-interactive PATH). Sideload binary → /usr/local/bin/archipelago + restart (containers survive on these nodes).
|
||
- 2026-06-15 (cont.): **B1,B2,B4 ✅** dedup+guard — committed `ed493106`, unit-tested 2/2, live .198 healthy. **B14 ✅** transport recording — committed `1c6dc153` (after build-repair: used private `crate::federation::storage::` path → E0603; fixed to re-exported `crate::federation::`). **B21 ✅** Tor/FIPS pill — committed `0801dd66`. All pushed to vps2 main; builds verified EXIT 0.
|
||
- **Discovered B14b** (FIPS reaches only ~4/15 peers; rest genuinely Tor) and **B21** (pill) during the block.
|
||
- ⚠️ LESSON: a backgrounded build "completed" notification does NOT mean success — grep the EXIT code before committing (a broken commit reached main once; repaired by 1c6dc153; no release cut from it → fleet unaffected).
|
||
- **NEXT: B3 (peer media streaming — big), then B14b (FIPS reachability), then app-specific (B6,B7,B9–B13,B15–B19).** None deployed to fleet yet — all on vps2 main awaiting the .97 release after full .116/.198 + UI verification.
|