diff --git a/docs/UNIFIED-TASK-TRACKER.md b/docs/UNIFIED-TASK-TRACKER.md
index 2654dda9..1d974621 100644
--- a/docs/UNIFIED-TASK-TRACKER.md
+++ b/docs/UNIFIED-TASK-TRACKER.md
@@ -89,15 +89,59 @@ those are marked ✅ below with the commit that did it, so we stop re-litigating
 "next exit criterion" called out in `CLAUDE.md`.
 - [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
   `ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
-  2026-07-01). Flip needs: re-test on a healthy idle legacy node, then flip the
-  default, then multinode gate re-run.
-- [ ] **Per-app test coverage for the ~30 apps with zero automated coverage** —
-  framework exists (bats + reusable helpers), just needs per-app suites written.
-- [ ] **Convert remaining multi-container legacy stacks to the manifest-owned model**
-  (workstream A tail) — netbird's legacy installer is already deleted (`89d397bb`);
-  immich (see Tier 1) and any other multi-container stacks are what's left.
+  2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
+  reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt
+  left an uncommitted flip sitting around and that caused confusion; it's a 2-line
+  change, faster to just do it fresh once confirmed).
+- [x] ~~Per-app test coverage for the ~30 apps with zero automated coverage~~ —
+  **reframed 2026-07-01, mostly a non-issue.** `all-apps-matrix.bats` +
+  `all-apps-lifecycle.bats` already give EVERY installed app generic baseline
+  coverage (no stuck state, no error state, stop/start/restart survives, UI
+  reachable). The real gap is narrower: **34 apps lack app-specific assertions**
+  (health endpoints, API queryability, data integrity) beyond that baseline —
+  aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd,
+  fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5
+  sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird
+  (+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router,
+  searxng, strfry, uptime-kuma, vaultwarden. Not urgent — baseline coverage is
+  real safety net; treat as a backlog "nice to harden further," not a gate item.
+- [x] ~~Convert remaining multi-container legacy stacks to the manifest-owned model~~ —
+  **investigated 2026-07-01, DONE, nothing left.** All 5 real multi-container
+  stacks (btcpay, mempool, immich, netbird, indeedhub) are on the
+  `install_stack_via_orchestrator` pattern (`stacks.rs`). saleor was removed from
+  the codebase; portainer/home-assistant/grafana are single-container
+  manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
+  are 3 separate single-container apps with manifest dependency edges, not a
+  coordinated stack. Workstream A's stack-migration tail is fully closed.
 - [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
   APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
+- [~] **Cross-node federation/mesh/transport suites** — **big find 2026-07-01: these
+  already exist**, just aren't wired into the gate or documented as existing:
+  `tests/multinode/smoke.sh` (federation pairing/sync, FIPS anchor, peer content
+  browse, tombstone-removal regression tests), `tests/multinode/meshtastic.sh`
+  (8-stage on-air mesh test), harness in `tests/multinode/lib/multinode.bash`.
+  **Actually ran `smoke.sh` live against .116↔.228 2026-07-01: 14 passed, 1
+  failed, 1 skipped.** Confirms federation pairing (both directions), FIPS
+  anchor connectivity (both nodes), and peer-content-browse-over-mesh (the
+  v1.7.95 fix) all genuinely work node-to-node right now.
+  - ⚠️ **Real robustness gap found**: `node_rpc()` in `tests/multinode/lib/multinode.bash`
+    has no `--max-time` on its curl calls — a slow server-side RPC hangs the whole
+    suite with zero feedback (this is what looked like a hang before it eventually
+    completed on its own). Cheap fix, not yet applied.
+  - 🐛 **Real regression found and root-caused**: removing a federation node
+    (`federation.remove-node`) doesn't reliably stick — B reappeared in A's peer
+    list after removal in the live test. Root cause: `remove_node()`
+    (`core/archipelago/src/federation/storage.rs:187`) does
+    `let _ = tombstone_did(data_dir, did).await` — **silently swallows the
+    tombstone write's errors.** If that write fails (disk I/O, permission,
+    transient issue), the peer is removed from `nodes.json` but never actually
+    tombstoned, so the next background sync/notify-join re-adds it — the
+    tombstone check at `handlers.rs:592-599` passes because the DID was never
+    recorded as removed. Diagnosed as a **pre-existing logic gap**, not a fresh
+    regression from the v1.7.95 fix. **Not fixed yet** — this is federation/trust
+    code, deliberately not touching it blind; needs a careful fix (surface the
+    tombstone-write failure instead of swallowing it, and/or retry) plus
+    re-verification with `smoke.sh` before considering it closed.
 
 ## Tier 3 — Blocked on a decision or resource only you can supply
 