archy/docs/UNIFIED-TASK-TRACKER.md
archipelago 61bfde3200 docs: consolidated deploy done — all 5 fleet nodes verified + unbundled ISO built
Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-01 19:54:07 -04:00

# Unified Task Tracker — OTA 1.8.0 + Master Plan
Single working list for everything left before 1.8.0 ships and the next master-plan
exit criteria (multinode + workstreams B/C/D) are met. Supersedes the open-task
sections of `docs/SESSION-1.8.0-OTA-PROGRESS.md` and `docs/PRODUCTION-MASTER-PLAN.md`
as the day-to-day tracker — those docs remain the historical record / detailed
narrative and are still linked from here where useful. **Ordered fastest/simplest
first** so we work top-down instead of hunting across docs.

Verified against actual code state on 2026-07-01 (not just doc text — several
items the source docs still listed as "open" turned out to already be shipped;
those are marked ✅ below with the commit that did it, so we stop re-litigating them).

---
## Tier 0 — Quick / mechanical, no blockers
- [ ] **Update `tests/lifecycle/TESTING.md`'s stale Release Gates checklist** (lines
  289-296) — several boxes are unchecked but actually true now:
  - #1 bitcoin-stops: covered by `tests/lifecycle/bats/bitcoin-knots.bats` stop/restart
    tier, included in the 5/5 green gate run.
  - #2 `ARCHY_ITERATIONS=5` on .228: **GREEN 2026-06-23 per CLAUDE.md** — check the box.
  - #5 cargo 0 warnings: confirmed 0 warnings on `cargo build --release` (2026-07-01).
  - #7 layman changelog: `CHANGELOG.md` is backfilled with layman-readable entries
    through v1.8.00-alpha — check the box.
  - Leave #3 (multinode), #4 (backend-survives-restart / Phase-3 default-on), #6
    (LoC decision), #8 (tag pushed) unchecked — genuinely still open, see Tier 2/3.
- [x] ~~Finish the archival/full-node manifest generalization~~ — investigated 2026-07-01:
the hardcoded fallback names in `dependencies.rs:48-52` (`electrs`, `mempool-electrs`,
`mempool-web`) are legacy **alias** ids for `electrumx`/`mempool`, resolved via
id-mapping in a dozen other places (`install.rs`, `runtime.rs`, `config.rs`, etc.),
not separate un-migrated apps with their own manifests. `electrumx` and `mempool`
themselves already declare `bitcoin:archival`. The fallback is correct as-is —
not tech debt, closing this item rather than risk breaking alias resolution.
- [x] ~~Confirm/close the Portainer image-pin item~~ — confirmed 2026-07-01:
`146.59.87.168:3000/lfg2025/portainer:2.19.4` is present in `podman images` on
all 3 LAN nodes (.116/.198/.228), i.e. actually resolvable/pulled from the mirror.
Not a live bug.
- [x] ~~grafana Quadlet "stuck activating"~~ — checked live on .116 (2026-07-01):
`grafana.service` is `active (running)`, container `Up 2 hours (healthy)`. The
2026-06-21 report is stale for grafana. **strfry still unconfirmed** — not
installed on any of .116/.198/.228 to check directly; low priority until someone
actually needs it installed.
## Tier 1 — Medium effort, unblocked
- [x] ~~immich → Quadlet migration~~ — investigated 2026-07-01, turned out already done:
immich uses the same `install_stack_via_orchestrator` primitive as netbird/btcpay
(`immich_stack_app_ids()` in `stacks.rs:690`), and is confirmed running as real
Quadlet units live on .228 (`immich_server.container`, `immich_postgres.container`,
`immich_redis.container`, all active). Not a legacy in-cgroup app — the only
remaining piece is the fleet-wide Phase-3 default-flip, already tracked in Tier 2.
- [x] ~~Netbird reinstall adoption path~~ — investigated 2026-07-01, **not a bug, by
design.** `adopt_stack_if_exists()` (`stacks.rs:140-198`) is only used as a
fallback when the orchestrator has no manifest for the app — there's nothing to
render certs/config from in that case, so skipping rendering is correct. When
the orchestrator *does* have the manifest (the normal path), the reconcile loop
already re-renders certs even for adopted-running containers, fixed in
`4519dbf0` (`prod_orchestrator.rs:1707-1708`).
- [x] ~~TanStack Query (or equivalent) investigation~~ — spike complete 2026-07-01,
**recommendation: don't adopt / close as not needed.** Only 3 stores actually fetch
data, WebSocket push already handles hot data (server-info/package-data), no
cache-invalidation or stale-data bugs found, migration would touch 62 RPC call
sites for no concrete payoff. If boilerplate ever bothers us, extract a
`usePolling()` composable instead — much cheaper than a query-cache migration.
## Tier 2 — High effort, mostly unblocked (the actual next exit criteria)
- [~] **Multinode test pass** (`docs/multinode-testing-plan.md`) — worked the
preconditions on .198 2026-07-01:
  - ✅ cleared 2 stale failed-unit records (`archy-mempool-db.service`,
    `meshtastic.service` — both `not-found`/dead since 6 and 5 days ago, harmless
    bookkeeping, `systemctl --user reset-failed`).
  - ✅ nginx `/app/lnd/` proxy target confirmed correct (→ `18083`, matches the
    running `archy-lnd-ui` port) — the plan's "stale proxy target" concern doesn't
    apply here.
  - ⛔ .198 disk (448GB) is below the 1TB archival threshold + was only 21%
    through IBD — user chose to **swap in a different node** rather than wait/add
    storage. **.116 ruled out** (no bitcoin container installed at all, just the
    UI companion). **.120 ruled out** (reserved for another developer). **.5**
    (archy-x250-beta, Tailscale `100.72.136.5`) chosen: also sub-1TB (472GB, so
    still pruned — that ceiling is shared by every non-.228 node), but **fully
    synced** (`ibd:false`, blocks==headers 956,240). Bootstrapped bats 1.11.1 +
    jq 1.7.1 onto it 2026-07-01 and **launched the 5× destructive gate
    (`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`) — running now**, log at
    `/tmp/gate.log` on .5, background poller watching for the `RESULTS` banner.
  - Once .5's gate reports: bring the rest of the fleet to precondition, then the
    cross-node federation/mesh/transport suites. This is the literal
    "next exit criterion" called out in `CLAUDE.md`.
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
`ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt
left an uncommitted flip sitting around and that caused confusion; it's a 2-line
change, faster to just do it fresh once confirmed).
- [x] ~~Per-app test coverage for the ~30 apps with zero automated coverage~~ —
**reframed 2026-07-01, mostly a non-issue.** `all-apps-matrix.bats` +
`all-apps-lifecycle.bats` already give EVERY installed app generic baseline
coverage (no stuck state, no error state, stop/start/restart survives, UI
reachable). The real gap is narrower: **34 apps lack app-specific assertions**
(health endpoints, API queryability, data integrity) beyond that baseline —
aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd,
fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5
sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird
(+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router,
searxng, strfry, uptime-kuma, vaultwarden. Not urgent — baseline coverage is
a real safety net; treat as a backlog "nice to harden further," not a gate item.
- [x] ~~Convert remaining multi-container legacy stacks to the manifest-owned model~~:
**investigated 2026-07-01, DONE, nothing left.** All 5 real multi-container
stacks (btcpay, mempool, immich, netbird, indeedhub) are on the
`install_stack_via_orchestrator` pattern (`stacks.rs`). saleor was removed from
the codebase; portainer/home-assistant/grafana are single-container
manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
are 3 separate single-container apps with manifest dependency edges, not a
coordinated stack. Workstream A's stack-migration tail is fully closed.
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
- [x] ~~**Consolidated deploy 2026-07-01**: merged PR #67 (reticulum daemon
process-group fix, `469b0203`), the UI/UX work (`8256fde1` — mesh/web5/apps
layout, modal, search UX), and `archy-openwrt` (TollGate/OpenWrt gateway
integration — new `core/openwrt` crate, RPC surface, `OpenWrtGateway.vue`)
into `main`, alongside the indeedhub self-heal fix~~ — all merged clean, no
conflicts. **Found + fixed 2 real build-breaking issues during
verification, not caught by whoever authored them**: a vestigial unused
`ref` in `Web5ConnectedNodes.vue` that broke `vue-tsc`, and a stale
`MeshMap.test.ts` mock missing `federatedPositions` (predated this
session's Mesh Map feature) that crashed on mount. Full test suite green
(667 passed) after fixes. **Deployed fleet-wide 2026-07-01, all 5 nodes
sha256-verified**: .116, .198, .228, .5 (recovered cleanly from one
truncated-transfer hiccup, caught via checksum before it hit the live
service), 100.82.34.38 (non-Quadlet node — all containers survived the
restart intact, unlike the worst-case risk flagged beforehand). Also
built an unbundled installer ISO from this same merged source
(`archipelago-installer-1.7.99-alpha-unbundled-x86_64.iso`, 2.4GB) —
the ISO pipeline was archived from the release process at v1.7.43-alpha
(OTA tarballs are now primary) but the wrapper script still works.
- [~] **Cross-node federation/mesh/transport suites** — **big find 2026-07-01: these
already exist**, just aren't wired into the gate or documented as existing:
`tests/multinode/smoke.sh` (federation pairing/sync, FIPS anchor, peer content
browse, tombstone-removal regression tests), `tests/multinode/meshtastic.sh`
(8-stage on-air mesh test), harness in `tests/multinode/lib/multinode.bash`.
**Actually ran `smoke.sh` live against .116↔.228 2026-07-01: 14 passed, 1
failed, 1 skipped.** Confirms federation pairing (both directions), FIPS
anchor connectivity (both nodes), and peer-content-browse-over-mesh (the
v1.7.95 fix) all genuinely work node-to-node right now.
  - ⚠️ **Real robustness gap found**: `node_rpc()` in `tests/multinode/lib/multinode.bash`
    has no `--max-time` on its curl calls — a slow server-side RPC hangs the whole
    suite with zero feedback (this is what looked like a hang before it eventually
    completed on its own). Cheap fix, not yet applied.
  - 🐛 **Real regression found and root-caused**: removing a federation node
    (`federation.remove-node`) doesn't reliably stick — B reappeared in A's peer
    list after removal in the live test. Root cause: `remove_node()`
    (`core/archipelago/src/federation/storage.rs:187`) does
    `let _ = tombstone_did(data_dir, did).await` — **silently swallows the
    tombstone write's errors.** If that write fails (disk I/O, permission,
    transient issue), the peer is removed from `nodes.json` but never actually
    tombstoned, so the next background sync/notify-join re-adds it — the
    tombstone check at `handlers.rs:592-599` passes because the DID was never
    recorded as removed. Diagnosed as a **pre-existing logic gap**, not a fresh
    regression from the v1.7.95 fix. **Not fixed yet** — this is federation/trust
    code, deliberately not touching it blind; needs a careful fix (surface the
    tombstone-write failure instead of swallowing it, and/or retry) plus
    re-verification with `smoke.sh` before considering it closed.
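
The "surface the tombstone-write failure" option for the `remove_node()` bug above can be sketched roughly as follows. This is a minimal, synchronous, in-memory model and not the code in `storage.rs`: `FederationStore`, `fail_tombstone_writes`, and `try_readd` are hypothetical names invented for illustration, and the real `remove_node()`/`tombstone_did()` are async and work against `data_dir` on disk. The only idea it shows is ordering plus propagation: write the tombstone first, return its error with `?`, and drop the peer only after the tombstone is recorded.

```rust
use std::collections::BTreeSet;
use std::io;

/// Hypothetical in-memory stand-in for the federation store. The real code
/// persists the peer list (`nodes.json`) and tombstones on disk.
struct FederationStore {
    nodes: BTreeSet<String>,
    tombstones: BTreeSet<String>,
    /// Illustration-only switch that simulates a failing tombstone write.
    fail_tombstone_writes: bool,
}

impl FederationStore {
    fn new(dids: &[&str]) -> Self {
        FederationStore {
            nodes: dids.iter().map(|d| d.to_string()).collect(),
            tombstones: BTreeSet::new(),
            fail_tombstone_writes: false,
        }
    }

    /// Records the DID as removed and reports a write failure to the caller.
    fn tombstone_did(&mut self, did: &str) -> io::Result<()> {
        if self.fail_tombstone_writes {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                "simulated tombstone write failure",
            ));
        }
        self.tombstones.insert(did.to_string());
        Ok(())
    }

    /// Tombstone first and propagate any error with `?` (instead of `let _ =`).
    /// The peer is dropped only after the tombstone has been recorded.
    fn remove_node(&mut self, did: &str) -> io::Result<()> {
        self.tombstone_did(did)?;
        self.nodes.remove(did);
        Ok(())
    }

    /// Models the background sync / notify-join path: a tombstoned DID is refused.
    fn try_readd(&mut self, did: &str) -> bool {
        if self.tombstones.contains(did) {
            return false;
        }
        self.nodes.insert(did.to_string());
        true
    }
}

fn main() {
    let mut store = FederationStore::new(&["did:example:b"]);

    // A failed tombstone write is visible to the caller and the peer list is unchanged.
    store.fail_tombstone_writes = true;
    assert!(store.remove_node("did:example:b").is_err());
    assert!(store.nodes.contains("did:example:b"));

    // Once the write succeeds, a later sync cannot re-add the removed peer.
    store.fail_tombstone_writes = false;
    store.remove_node("did:example:b").expect("tombstone write succeeds");
    assert!(!store.try_readd("did:example:b"));
    assert!(store.nodes.is_empty());
}
```

Whether the real fix should fail the RPC outright, retry the write, or both is still the open design question noted above; the sketch only demonstrates that a removal which cannot be tombstoned should not be reported as done.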
## Tier 3 — Blocked on a decision or resource only you can supply
- [ ] **Version naming decision (1.7.99-alpha → 1.8.0 vs 1.8.00-alpha)** — code is
otherwise ready to tag; this is a one-line decision, then a mechanical bump +
tag + push. **Needs your call**, not more engineering.
- [ ] **Workstream B signing ceremony**: `core/archipelago/src/trust/anchor.rs:21`
still has `RELEASE_ROOT_PUBKEY_HEX = None`. Needs the offline
`RELEASE_MASTER_MNEMONIC` to run `docs/workstream-b-signing-runbook.md`'s
4-step ceremony — can't be automated by me.
- [ ] **Bitcoin multi-version fleet-wide OTA**: `.228` fully working on branch,
per your prior gating this rollout is explicitly held for your decision on
timing (`docs/bitcoin-version-bulletproof-rollout.md`).
- [ ] **3ccc stock-Meshtastic RF validation** — needs a live send/receive test with
physical radios in your hands; code fix is in place, just unverified live.
## Backlog — deferred, no scope decided, low priority
- [ ] **Marketplace protocol (workstream C)** — design-only (`docs/marketplace-protocol.md`),
no tooling/trust UX built. Future work, not urgent.
- [ ] **DHT distribution (workstream D)** — confirmed design-only, no code
(`docs/dht-distribution-design.md` explicitly says "Status: Design (no code yet)");
an experimental iroh provider skeleton exists behind a feature flag for future
PoC measurement, nothing fleet-facing.
- [ ] **Custom live voice-call protocol** — deprioritized 2026-07-01 per user request;
scope not yet decided. Revisit after the tiers above are worked down.
---
*Historical narrative and detailed per-session logs remain in
`docs/SESSION-1.8.0-OTA-PROGRESS.md` and `docs/PRODUCTION-MASTER-PLAN.md` §6/§8b —
this doc is the live "what's left, in priority order" list. Update it (don't just
append to the old docs) as items close or new ones surface.*