archy/docs/UNIFIED-TASK-TRACKER.md
archipelago b9e4fbe9f7 docs: PR#67 + back-button fix merged/pushed but NOT deployed — resume note
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-02 03:33:52 -04:00


Unified Task Tracker — OTA 1.8.0 + Master Plan

Single working list for everything left before 1.8.0 ships and the next master-plan exit criteria (multinode + workstreams B/C/D) are met. Supersedes the open-task sections of docs/SESSION-1.8.0-OTA-PROGRESS.md and docs/PRODUCTION-MASTER-PLAN.md as the day-to-day tracker — those docs remain the historical record / detailed narrative and are still linked from here where useful. Ordered fastest/simplest first so we work top-down instead of hunting across docs.

Verified against actual code state on 2026-07-01 (not just doc text — several items the source docs still listed as "open" turned out to already be shipped; those are marked below with the commit that did it, so we stop re-litigating them).


Tier 0 — Quick / mechanical, no blockers

  • Update tests/lifecycle/TESTING.md's stale Release Gates checklist (lines 289-296), where several boxes are unchecked but actually true now:
    • #1 bitcoin-stops: covered by tests/lifecycle/bats/bitcoin-knots.bats stop/restart tier, included in the 5/5 green gate run.
    • #2 ARCHY_ITERATIONS=5 on .228: GREEN 2026-06-23 per CLAUDE.md — check the box.
    • #5 cargo 0 warnings: confirmed 0 warnings on cargo build --release (2026-07-01).
    • #7 layman changelog: CHANGELOG.md is backfilled with layman-readable entries through v1.8.00-alpha — check the box.
    • Leave #3 (multinode), #4 (backend-survives-restart / Phase-3 default-on), #6 (LoC decision), #8 (tag pushed) unchecked — genuinely still open, see Tier 2/3.
  • Finish the archival/full-node manifest generalization — investigated 2026-07-01: the hardcoded fallback names in dependencies.rs:48-52 (electrs, mempool-electrs, mempool-web) are legacy alias ids for electrumx/mempool, resolved via id-mapping in a dozen other places (install.rs, runtime.rs, config.rs, etc.), not separate un-migrated apps with their own manifests. electrumx and mempool themselves already declare bitcoin:archival. The fallback is correct as-is — not tech debt, closing this item rather than risk breaking alias resolution.
  • Confirm/close the Portainer image-pin item — confirmed 2026-07-01: 146.59.87.168:3000/lfg2025/portainer:2.19.4 is present in podman images on all 3 LAN nodes (.116/.198/.228), i.e. actually resolvable/pulled from the mirror. Not a live bug.
  • grafana Quadlet "stuck activating" — checked live on .116 (2026-07-01): grafana.service is active (running), container Up 2 hours (healthy). The 2026-06-21 report is stale for grafana. strfry still unconfirmed — not installed on any of .116/.198/.228 to check directly; low priority until someone actually needs it installed.

Tier 1 — Medium effort, unblocked

  • immich → Quadlet migration — investigated 2026-07-01, turned out already done: immich uses the same install_stack_via_orchestrator primitive as netbird/btcpay (immich_stack_app_ids() in stacks.rs:690), and is confirmed running as real Quadlet units live on .228 (immich_server.container, immich_postgres.container, immich_redis.container, all active). Not a legacy in-cgroup app — the only remaining piece is the fleet-wide Phase-3 default-flip, already tracked in Tier 2.
  • Netbird reinstall adoption path — investigated 2026-07-01, not a bug, by design. adopt_stack_if_exists() (stacks.rs:140-198) is only used as a fallback when the orchestrator has no manifest for the app — there's nothing to render certs/config from in that case, so skipping rendering is correct. When the orchestrator does have the manifest (the normal path), the reconcile loop already re-renders certs even for adopted-running containers, fixed in 4519dbf0 (prod_orchestrator.rs:1707-1708).
  • TanStack Query (or equivalent) investigation — spike complete 2026-07-01, recommendation: don't adopt / close as not needed. Only 3 stores actually fetch data, WebSocket push already handles hot data (server-info/package-data), no cache-invalidation or stale-data bugs found, migration would touch 62 RPC call sites for no concrete payoff. If boilerplate ever bothers us, extract a usePolling() composable instead — much cheaper than a query-cache migration.

Tier 2 — High effort, mostly unblocked (the actual next exit criteria)

  • [~] Multinode test pass (docs/multinode-testing-plan.md) — worked the preconditions on .198 2026-07-01:
    • cleared 2 stale failed-unit records with systemctl --user reset-failed (archy-mempool-db.service and meshtastic.service, not-found/dead for 6 and 5 days respectively; harmless bookkeeping).
    • nginx /app/lnd/ proxy target confirmed correct (→ 18083, matches the running archy-lnd-ui port) — the plan's "stale proxy target" concern doesn't apply here.
    • .198 disk (448GB) is below the 1TB archival threshold + was only 21% through IBD — user chose to swap in a different node rather than wait/add storage. .116 ruled out (no bitcoin container installed at all, just the UI companion). .120 ruled out (reserved for another developer). .5 (archy-x250-beta, Tailscale 100.72.136.5) chosen: also sub-1TB (472GB, so still pruned — that ceiling is shared by every non-.228 node), but fully synced (ibd:false, blocks==headers 956,240). Bootstrapped bats 1.11.1 + jq 1.7.1 onto it 2026-07-01 and launched the 5× destructive gate (ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1) — running now, log at /tmp/gate.log on .5, background poller watching for the RESULTS banner.
    • Once .5's gate reports: bring the rest of the fleet to precondition, then the cross-node federation/mesh/transport suites. This is the literal "next exit criterion" called out in CLAUDE.md.
  • Phase-3 Quadlet default-flip — code is validated + opt-in via ARCHIPELAGO_USE_QUADLET_BACKENDS=true on .228/.198 already (confirmed live 2026-07-01). Ready to flip (config.rs:256 + its test) the moment the .5 gate reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt left an uncommitted flip sitting around and that caused confusion; it's a 2-line change, faster to just do it fresh once confirmed).
  • Per-app test coverage for the 30 apps with zero automated coverage: reframed 2026-07-01, mostly a non-issue. all-apps-matrix.bats + all-apps-lifecycle.bats already give EVERY installed app generic baseline coverage (no stuck state, no error state, stop/start/restart survives, UI reachable). The real gap is narrower: 34 apps lack app-specific assertions (health endpoints, API queryability, data integrity) beyond that baseline: aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd, fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5 sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird (+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router, searxng, strfry, uptime-kuma, vaultwarden. Not urgent: baseline coverage is a real safety net; treat this as a backlog "nice to harden further" item, not a gate item.
  • Convert remaining multi-container legacy stacks to the manifest-owned model: investigated 2026-07-01, DONE, nothing left. All 5 real multi-container stacks (btcpay, mempool, immich, netbird, indeedhub) are on the install_stack_via_orchestrator pattern (stacks.rs). saleor was removed from the codebase; portainer/home-assistant/grafana are single-container manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd are 3 separate single-container apps with manifest dependency edges, not a coordinated stack. Workstream A's stack-migration tail is fully closed.
  • Developer tooling CLI suite (validate/render/local-install/lifecycle-test) — APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
  • Consolidated deploy 2026-07-01: merged PR #67 (reticulum daemon process-group fix, 469b0203), the UI/UX work (8256fde1 — mesh/web5/apps layout, modal, search UX), and archy-openwrt (TollGate/OpenWrt gateway integration — new core/openwrt crate, RPC surface, OpenWrtGateway.vue) into main, alongside the indeedhub self-heal fix — all merged clean, no conflicts. Found + fixed 2 real build-breaking issues during verification, not caught by whoever authored them: a vestigial unused ref in Web5ConnectedNodes.vue that broke vue-tsc, and a stale MeshMap.test.ts mock missing federatedPositions (predated this session's Mesh Map feature) that crashed on mount. Full test suite green (667 passed) after fixes. Deployed fleet-wide 2026-07-01, all 5 nodes sha256-verified: .116, .198, .228, .5 (recovered cleanly from one truncated-transfer hiccup, caught via checksum before it hit the live service), 100.82.34.38 (non-Quadlet node — all containers survived the restart intact, unlike the worst-case risk flagged beforehand). Also built an unbundled installer ISO from this same merged source (archipelago-installer-1.7.99-alpha-unbundled-x86_64.iso, 2.4GB) — the ISO pipeline was archived from the release process at v1.7.43-alpha (OTA tarballs are now primary) but the wrapper script still works.
  • ⚠️ NOT YET DEPLOYED — start here next session. After the fleet deploy above, found that PR #67 ("kill whole daemon process group on drop", branch fix/reticulum-daemon-process-group, head be50c886) is a different, separate reticulum-daemon fix from the one already deployed (469b0203 on fix/reticulum-daemon-pdeathsig) — I'd conflated the two by topic similarity and only merged/deployed the Python-level pdeathsig fix, missing PR #67's Rust-level kill-whole-process-group-on-Drop fix entirely. Merged PR #67 into main (7a7fec21, clean, cargo check green, complementary not conflicting with the already-deployed fix) and separately fixed a real bug found live: OpenWrtGateway.vue's back button had no @click handler at all (7d7ba573, vue-tsc clean). Both committed + pushed to main but genuinely NOT deployed to any node — user asked to hold off deploying to restart their computer. Also spot-checked openwrt.scan live on .116: RPC plumbing works, but no physical OpenWrt router was available to confirm true-positive detection, and detect::scan_subnet does blocking TCP/SSH calls inside an async fn with no .await — untested at scale, worth hardening. Next steps: build release binary + frontend from current main, deploy to all 5 fleet nodes (.116/.198/.228/.5/100.82.34.38) the same way as the earlier consolidated deploy, then verify the back button + (if a real OpenWrt router is available) router detection live.
  • [~] Cross-node federation/mesh/transport suites: big find 2026-07-01, these already exist but aren't wired into the gate or documented as existing. They are tests/multinode/smoke.sh (federation pairing/sync, FIPS anchor, peer content browse, tombstone-removal regression tests), tests/multinode/meshtastic.sh (8-stage on-air mesh test), and the harness in tests/multinode/lib/multinode.bash. Actually ran smoke.sh live against .116↔.228 2026-07-01: 14 passed, 1 failed, 1 skipped. That confirms federation pairing (both directions), FIPS anchor connectivity (both nodes), and peer-content-browse-over-mesh (the v1.7.95 fix) all genuinely work node-to-node right now.
    • ⚠️ Real robustness gap found: node_rpc() in tests/multinode/lib/multinode.bash has no --max-time on its curl calls — a slow server-side RPC hangs the whole suite with zero feedback (this is what looked like a hang before it eventually completed on its own). Cheap fix, not yet applied.
    • 🐛 Real regression found and root-caused: removing a federation node (federation.remove-node) doesn't reliably stick. B reappeared in A's peer list after removal in the live test. Root cause: remove_node() (core/archipelago/src/federation/storage.rs:187) does let _ = tombstone_did(data_dir, did).await, which silently swallows the tombstone write's errors. If that write fails (disk I/O, permission, transient issue), the peer is removed from nodes.json but never actually tombstoned, so the next background sync/notify-join re-adds it; the tombstone check at handlers.rs:592-599 passes because the DID was never recorded as removed. Diagnosed as a pre-existing logic gap, not a fresh regression from the v1.7.95 fix. Not fixed yet: this is federation/trust code, so deliberately not touching it blind. It needs a careful fix (surface the tombstone-write failure instead of swallowing it, and/or retry) plus re-verification with smoke.sh before considering it closed.
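
A minimal sketch of the "surface the tombstone-write failure" direction for the remove_node() bug above, not the actual fix. It assumes a simplified synchronous, std-only model: the real code is async and stores peers in nodes.json, while the file names, the one-DID-per-line layout, and the helpers read_set, write_set, is_tombstoned and add_node_from_sync are hypothetical and exist only for illustration. The point it shows is ordering plus error propagation: write the tombstone first and return its error with `?`, so a peer is never dropped from the list without also being recorded as removed.

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical on-disk layout for this sketch: one DID per line.
const NODES_FILE: &str = "nodes.txt";
const TOMBSTONES_FILE: &str = "tombstones.txt";

/// Read a set of DIDs; a missing file is treated as an empty set.
fn read_set(path: &Path) -> io::Result<BTreeSet<String>> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text
            .lines()
            .filter(|line| !line.is_empty())
            .map(str::to_owned)
            .collect()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(BTreeSet::new()),
        Err(e) => Err(e),
    }
}

/// Write a set of DIDs, one per line. Errors are returned, not swallowed.
fn write_set(path: &Path, set: &BTreeSet<String>) -> io::Result<()> {
    let mut body = String::new();
    for item in set {
        body.push_str(item);
        body.push('\n');
    }
    fs::write(path, body)
}

/// Record a DID as removed.
fn tombstone_did(data_dir: &Path, did: &str) -> io::Result<()> {
    let path = data_dir.join(TOMBSTONES_FILE);
    let mut set = read_set(&path)?;
    set.insert(did.to_owned());
    write_set(&path, &set)
}

fn is_tombstoned(data_dir: &Path, did: &str) -> io::Result<bool> {
    Ok(read_set(&data_dir.join(TOMBSTONES_FILE))?.contains(did))
}

/// Tombstone first and propagate failure with `?` (instead of `let _ = ...`),
/// then drop the peer. If the tombstone write fails, the peer list is untouched.
fn remove_node(data_dir: &Path, did: &str) -> io::Result<()> {
    tombstone_did(data_dir, did)?;
    let path = data_dir.join(NODES_FILE);
    let mut nodes = read_set(&path)?;
    nodes.remove(did);
    write_set(&path, &nodes)
}

/// Stand-in for the sync/notify-join path: refuse to re-add a tombstoned DID.
/// Returns Ok(true) if the peer was added, Ok(false) if it was rejected.
fn add_node_from_sync(data_dir: &Path, did: &str) -> io::Result<bool> {
    if is_tombstoned(data_dir, did)? {
        return Ok(false);
    }
    let path = data_dir.join(NODES_FILE);
    let mut nodes = read_set(&path)?;
    nodes.insert(did.to_owned());
    write_set(&path, &nodes)?;
    Ok(true)
}
```

With this ordering, a failed tombstone write makes the removal RPC fail visibly and leaves the peer in place, which is safer than a removal that looks successful and is then undone by the next sync. Whether the real fix should also retry, or write both files atomically, is a separate design decision for whoever touches storage.rs.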

Tier 3 — Blocked on a decision or resource only you can supply

  • Version naming decision (1.7.99-alpha → 1.8.0 vs 1.8.00-alpha) — code is otherwise ready to tag; this is a one-line decision, then a mechanical bump + tag + push. Needs your call, not more engineering.
  • Workstream B signing ceremony: core/archipelago/src/trust/anchor.rs:21 still has RELEASE_ROOT_PUBKEY_HEX = None. Needs the offline RELEASE_MASTER_MNEMONIC to run docs/workstream-b-signing-runbook.md's 4-step ceremony; can't be automated by me.
  • Bitcoin multi-version fleet-wide OTA: .228 fully working on branch; per your prior gating, this rollout is explicitly held for your decision on timing (docs/bitcoin-version-bulletproof-rollout.md).
  • 3ccc stock-Meshtastic RF validation — needs a live send/receive test with physical radios in your hands; code fix is in place, just unverified live.

Backlog — deferred, no scope decided, low priority

  • Marketplace protocol (workstream C) — design-only (docs/marketplace-protocol.md), no tooling/trust UX built. Future work, not urgent.
  • DHT distribution (workstream D) — confirmed design-only, no code (docs/dht-distribution-design.md explicitly says "Status: Design (no code yet)"); an experimental iroh provider skeleton exists behind a feature flag for future PoC measurement, nothing fleet-facing.
  • Custom live voice-call protocol — deprioritized 2026-07-01 per user request; scope not yet decided. Revisit after the tiers above are worked down.

Historical narrative and detailed per-session logs remain in docs/SESSION-1.8.0-OTA-PROGRESS.md and docs/PRODUCTION-MASTER-PLAN.md §6/§8b — this doc is the live "what's left, in priority order" list. Update it (don't just append to the old docs) as items close or new ones surface.