# Unified Task Tracker — OTA 1.8.0 + Master Plan

Single working list for everything left before 1.8.0 ships and the next master-plan
exit criteria (multinode + workstreams B/C/D) are met. Supersedes the open-task
sections of `docs/SESSION-1.8.0-OTA-PROGRESS.md` and `docs/PRODUCTION-MASTER-PLAN.md`
as the day-to-day tracker — those docs remain the historical record / detailed
narrative and are still linked from here where useful. **Ordered fastest/simplest
first** so we work top-down instead of hunting across docs.

Verified against actual code state on 2026-07-01 (not just doc text — several
items the source docs still listed as "open" turned out to already be shipped;
those are marked ✅ below with the commit that did it, so we stop re-litigating them).

---

## Tier 0 — Quick / mechanical, no blockers

- [ ] **Update `tests/lifecycle/TESTING.md`'s stale Release Gates checklist** (lines
  289–296) — several boxes are unchecked but actually true now:
  - #1 bitcoin-stops: covered by `tests/lifecycle/bats/bitcoin-knots.bats` stop/restart
    tier, included in the 5/5 green gate run.
  - #2 `ARCHY_ITERATIONS=5` on .228: **GREEN 2026-06-23 per CLAUDE.md** — check the box.
  - #5 cargo 0 warnings: confirmed 0 warnings on `cargo build --release` (2026-07-01).
  - #7 layman changelog: `CHANGELOG.md` is backfilled with layman-readable entries
    through v1.8.00-alpha — check the box.
  - Leave #3 (multinode), #4 (backend-survives-restart / Phase-3 default-on), #6
    (LoC decision), #8 (tag pushed) unchecked — genuinely still open, see Tier 2/3.
- [x] ~~Finish the archival/full-node manifest generalization~~ — investigated 2026-07-01:
  the hardcoded fallback names in `dependencies.rs:48-52` (`electrs`, `mempool-electrs`,
  `mempool-web`) are legacy **alias** ids for `electrumx`/`mempool`, resolved via
  id-mapping in a dozen other places (`install.rs`, `runtime.rs`, `config.rs`, etc.),
  not separate un-migrated apps with their own manifests. `electrumx` and `mempool`
  themselves already declare `bitcoin:archival`. The fallback is correct as-is and
  is not tech debt; closing this item rather than risking a break in alias resolution.
- [x] ~~Confirm/close the Portainer image-pin item~~ — confirmed 2026-07-01:
  `146.59.87.168:3000/lfg2025/portainer:2.19.4` is present in `podman images` on
  all 3 LAN nodes (.116/.198/.228), so the image is resolvable and was pulled from
  the mirror. Not a live bug.
- [x] ~~grafana Quadlet "stuck activating"~~ — checked live on .116 (2026-07-01):
  `grafana.service` is `active (running)`, container `Up 2 hours (healthy)`. The
  2026-06-21 report is stale for grafana. **strfry still unconfirmed**: it is not
  installed on any of .116/.198/.228, so it can't be checked directly; low priority
  until someone actually needs it installed.
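The alias resolution described in the archival/full-node item above can be pictured with a minimal Rust sketch: a legacy id is first mapped to its canonical app, and only then is the `bitcoin:archival` declaration consulted. The function names and the exact alias-to-app pairing are assumptions made for illustration. The tracker only states that `electrs`, `mempool-electrs`, and `mempool-web` alias `electrumx`/`mempool`; the real mapping lives in `dependencies.rs` and the other files listed there.

```rust
/// Hypothetical alias table. ASSUMPTION: the pairing below is a guess; the
/// tracker only says these three legacy ids alias `electrumx`/`mempool`.
fn canonical_app_id(id: &str) -> &str {
    match id {
        "electrs" => "electrumx",
        "mempool-electrs" | "mempool-web" => "mempool",
        // Anything that is not a known legacy alias is already canonical.
        other => other,
    }
}

/// Hypothetical stand-in for a manifest lookup: canonical apps whose
/// manifest declares the `bitcoin:archival` requirement.
fn declares_archival(canonical_id: &str) -> bool {
    matches!(canonical_id, "electrumx" | "mempool")
}

/// Resolve the alias first, then ask the canonical app's manifest.
fn needs_archival_node(id: &str) -> bool {
    declares_archival(canonical_app_id(id))
}
```

Under these assumptions, `needs_archival_node("electrs")` is true only because the alias resolves to `electrumx` first, which is consistent with the decision above to leave the fallback names alone.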

## Tier 1 — Medium effort, unblocked

- [x] ~~immich → Quadlet migration~~ — investigated 2026-07-01, turned out already done:
  immich uses the same `install_stack_via_orchestrator` primitive as netbird/btcpay
  (`immich_stack_app_ids()` in `stacks.rs:690`), and is confirmed running as real
  Quadlet units live on .228 (`immich_server.container`, `immich_postgres.container`,
  `immich_redis.container`, all active). Not a legacy in-cgroup app — the only
  remaining piece is the fleet-wide Phase-3 default-flip, already tracked in Tier 2.
- [x] ~~Netbird reinstall adoption path~~ — investigated 2026-07-01, **not a bug, by
  design.** `adopt_stack_if_exists()` (`stacks.rs:140-198`) is only used as a
  fallback when the orchestrator has no manifest for the app — there's nothing to
  render certs/config from in that case, so skipping rendering is correct. When
  the orchestrator *does* have the manifest (the normal path), the reconcile loop
  already re-renders certs even for adopted-running containers (fixed in
  `4519dbf0`, `prod_orchestrator.rs:1707-1708`).
- [x] ~~TanStack Query (or equivalent) investigation~~ — spike complete 2026-07-01,
  **recommendation: don't adopt / close as not needed.** Only 3 stores actually fetch
  data; WebSocket push already handles hot data (server-info/package-data); no
  cache-invalidation or stale-data bugs were found; and migration would touch 62 RPC
  call sites for no concrete payoff. If boilerplate ever bothers us, extract a
  `usePolling()` composable instead — much cheaper than a query-cache migration.

## Tier 2 — High effort, mostly unblocked (the actual next exit criteria)

- [~] **Multinode test pass** (`docs/multinode-testing-plan.md`) — worked the
  preconditions on .198 2026-07-01:
  - ✅ cleared 2 stale failed-unit records with `systemctl --user reset-failed`
    (`archy-mempool-db.service`, `meshtastic.service` — both `not-found`/dead for
    6 and 5 days respectively, harmless bookkeeping).
  - ✅ nginx `/app/lnd/` proxy target confirmed correct (→ `18083`, matches the
    running `archy-lnd-ui` port) — the plan's "stale proxy target" concern doesn't
    apply here.
  - ⛔ .198 disk (448GB) is below the 1TB archival threshold and was only 21%
    through IBD — user chose to **swap in a different node** rather than wait/add
    storage. **.116 ruled out** (no bitcoin container installed at all, just the
    UI companion). **.120 ruled out** (reserved for another developer). **.5**
    (archy-x250-beta, Tailscale `100.72.136.5`) chosen: also sub-1TB (472GB, so
    still pruned — that ceiling is shared by every non-.228 node), but **fully
    synced** (`ibd:false`, blocks==headers 956,240). Bootstrapped bats 1.11.1 +
    jq 1.7.1 onto it 2026-07-01 and **launched the 5× destructive gate
    (`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`) — running now**, log at
    `/tmp/gate.log` on .5, background poller watching for the `RESULTS` banner.
  - Once .5's gate reports: bring the rest of the fleet to precondition, then run the
    cross-node federation/mesh/transport suites. This is the literal
    "next exit criterion" called out in `CLAUDE.md`.
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
  `ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
  2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
  reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt
  left an uncommitted flip sitting around and that caused confusion; it's a 2-line
  change, faster to just do it fresh once confirmed).
- [x] ~~Per-app test coverage for the ~30 apps with zero automated coverage~~ —
  **reframed 2026-07-01, mostly a non-issue.** `all-apps-matrix.bats` +
  `all-apps-lifecycle.bats` already give EVERY installed app generic baseline
  coverage (no stuck state, no error state, stop/start/restart survives, UI
  reachable). The real gap is narrower: **34 apps lack app-specific assertions**
  (health endpoints, API queryability, data integrity) beyond that baseline —
  aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd,
  fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5
  sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird
  (+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router,
  searxng, strfry, uptime-kuma, vaultwarden. Not urgent — baseline coverage is a
  real safety net; treat as a backlog "nice to harden further," not a gate item.
- [x] ~~Convert remaining multi-container legacy stacks to the manifest-owned model~~ —
  **investigated 2026-07-01, DONE, nothing left.** All 5 real multi-container
  stacks (btcpay, mempool, immich, netbird, indeedhub) are on the
  `install_stack_via_orchestrator` pattern (`stacks.rs`). saleor was removed from
  the codebase; portainer/home-assistant/grafana are single-container
  manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
  are 3 separate single-container apps with manifest dependency edges, not a
  coordinated stack. Workstream A's stack-migration tail is fully closed.
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
  `APP-PACKAGING-MIGRATION-PLAN.md` step 5, needed before external devs can publish.
- [x] ~~**Consolidated deploy 2026-07-01**: merged PR #67 (reticulum daemon
  process-group fix, `469b0203`), the UI/UX work (`8256fde1` — mesh/web5/apps
  layout, modal, search UX), and `archy-openwrt` (TollGate/OpenWrt gateway
  integration — new `core/openwrt` crate, RPC surface, `OpenWrtGateway.vue`)
  into `main`, alongside the indeedhub self-heal fix~~ — all merged clean, no
  conflicts. (Correction: `469b0203` is the Python-level `pdeathsig` fix, not
  PR #67; see the next item.) **Found + fixed 2 real build-breaking issues during
  verification, not caught by whoever authored them**: a vestigial unused
  `ref` in `Web5ConnectedNodes.vue` that broke `vue-tsc`, and a stale
  `MeshMap.test.ts` mock missing `federatedPositions` (predated this
  session's Mesh Map feature) that crashed on mount. Full test suite green
  (667 passed) after fixes. **Deployed fleet-wide 2026-07-01, all 5 nodes
  sha256-verified**: .116, .198, .228, .5 (recovered cleanly from one
  truncated-transfer hiccup, caught via checksum before it hit the live
  service), 100.82.34.38 (non-Quadlet node — all containers survived the
  restart intact, despite the worst-case risk flagged beforehand). Also
  built an unbundled installer ISO from this same merged source
  (`archipelago-installer-1.7.99-alpha-unbundled-x86_64.iso`, 2.4GB) —
  the ISO pipeline was archived from the release process at v1.7.43-alpha
  (OTA tarballs are now primary) but the wrapper script still works.
- [ ] **⚠️ NOT YET DEPLOYED — start here next session.** After the fleet deploy
  above, found that PR #67 ("kill whole daemon process group on drop",
  branch `fix/reticulum-daemon-process-group`, head `be50c886`) is a
  **different, separate** reticulum-daemon fix from the one already
  deployed (`469b0203` on `fix/reticulum-daemon-pdeathsig`) — I'd
  conflated the two by topic similarity and only merged/deployed the
  Python-level `pdeathsig` fix, missing PR #67's Rust-level
  kill-whole-process-group-on-`Drop` fix entirely. Merged PR #67 into
  `main` (`7a7fec21`, clean, `cargo check` green, complementary to rather than
  conflicting with the already-deployed fix) and separately fixed a real
  bug found live: `OpenWrtGateway.vue`'s back button had no `@click`
  handler at all (`7d7ba573`, `vue-tsc` clean). **Both committed + pushed
  to `main` but genuinely NOT deployed to any node** — the user asked to hold
  off on deploying so they could restart their computer. Also spot-checked
  `openwrt.scan` live on .116: RPC plumbing works, but no physical
  OpenWrt router was available to confirm true-positive detection, and
  `detect::scan_subnet` does blocking TCP/SSH calls inside an `async fn`
  with no `.await` (each call ties up the executor thread it runs on) —
  untested at scale, worth hardening. **Next steps**:
  build release binary + frontend from current `main`, deploy to all 5
  fleet nodes (.116/.198/.228/.5/100.82.34.38) the same way as the
  earlier consolidated deploy, then verify the back button + (if a real
  OpenWrt router is available) router detection live.
- [~] **Cross-node federation/mesh/transport suites** — **big find 2026-07-01: these
  already exist**, just aren't wired into the gate or documented as existing:
  `tests/multinode/smoke.sh` (federation pairing/sync, FIPS anchor, peer content
  browse, tombstone-removal regression tests), `tests/multinode/meshtastic.sh`
  (8-stage on-air mesh test), harness in `tests/multinode/lib/multinode.bash`.
  **Actually ran `smoke.sh` live against .116↔.228 2026-07-01: 14 passed, 1
  failed, 1 skipped.** Confirms federation pairing (both directions), FIPS
  anchor connectivity (both nodes), and peer-content-browse-over-mesh (the
  v1.7.95 fix) all genuinely work node-to-node right now.
  - ⚠️ **Real robustness gap found**: `node_rpc()` in `tests/multinode/lib/multinode.bash`
    has no `--max-time` on its curl calls, so a slow server-side RPC stalls the whole
    suite with zero feedback (this is what looked like a hang before the run eventually
    completed on its own). Cheap fix, not yet applied.
  - 🐛 **Real bug found and root-caused**: removing a federation node
    (`federation.remove-node`) doesn't reliably stick — B reappeared in A's peer
    list after removal in the live test. Root cause: `remove_node()`
    (`core/archipelago/src/federation/storage.rs:187`) calls
    `let _ = tombstone_did(data_dir, did).await`, which **silently swallows the
    tombstone write's errors.** If that write fails (disk I/O, permission,
    transient issue), the peer is removed from `nodes.json` but never actually
    tombstoned, so the next background sync/notify-join re-adds it — the
    tombstone check at `handlers.rs:592-599` passes because the DID was never
    recorded as removed. Diagnosed as a **pre-existing logic gap**, not a fresh
    regression from the v1.7.95 fix. **Not fixed yet** — this is federation/trust
    code, deliberately not touching it blind; needs a careful fix (surface the
    tombstone-write failure instead of swallowing it, and/or retry) plus
    re-verification with `smoke.sh` before considering it closed.
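One possible shape for the fix described in the last item, sketched in synchronous, std-only Rust so it stays self-contained (the real `remove_node()` is async and persists to `nodes.json`). The file name `tombstones.txt`, the in-memory `Vec<String>` peer list, and the tombstone-before-removal ordering are all assumptions for illustration, not the project's actual code:

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical stand-in for `tombstone_did`: append the DID to a
/// tombstone file and flush it to disk before reporting success.
fn tombstone_did(data_dir: &Path, did: &str) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(data_dir.join("tombstones.txt"))?; // ASSUMED file name
    writeln!(file, "{did}")?;
    file.sync_all()
}

/// Remove a peer only after its tombstone has been recorded.
/// `nodes` is an in-memory stand-in for the persisted peer list.
fn remove_node(data_dir: &Path, nodes: &mut Vec<String>, did: &str) -> io::Result<()> {
    // Propagate the error with `?` instead of discarding it via `let _ = ...`.
    // If the tombstone cannot be written, the peer list is left untouched.
    tombstone_did(data_dir, did)?;
    nodes.retain(|n| n.as_str() != did);
    Ok(())
}
```

With this shape a failed tombstone write returns the error to the caller and leaves the peer in place, so a later sync cannot re-add a peer that was only half removed. Whether to retry inside `remove_node` or at the RPC layer is left open, as in the item above.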

## Tier 3 — Blocked on a decision or resource only you can supply

- [ ] **Version naming decision (1.7.99-alpha → 1.8.0 vs 1.8.00-alpha)** — code is
  otherwise ready to tag; this is a one-line decision, then a mechanical bump +
  tag + push. **Needs your call**, not more engineering.
- [ ] **Workstream B signing ceremony** — `core/archipelago/src/trust/anchor.rs:21`
  still has `RELEASE_ROOT_PUBKEY_HEX = None`. Needs the offline
  `RELEASE_MASTER_MNEMONIC` to run `docs/workstream-b-signing-runbook.md`'s
  4-step ceremony — can't be automated by me.
- [ ] **Bitcoin multi-version fleet-wide OTA** — `.228` fully working on branch;
  per your earlier gating, this rollout is explicitly held for your decision on
  timing (`docs/bitcoin-version-bulletproof-rollout.md`).
- [ ] **3ccc stock-Meshtastic RF validation** — needs a live send/receive test with
  physical radios in your hands; code fix is in place, just unverified live.

## Backlog — deferred, no scope decided, low priority

- [ ] **Marketplace protocol (workstream C)** — design-only (`docs/marketplace-protocol.md`),
  no tooling/trust UX built. Future work, not urgent.
- [ ] **DHT distribution (workstream D)** — confirmed design-only, no code
  (`docs/dht-distribution-design.md` explicitly says "Status: Design (no code yet)");
  an experimental iroh provider skeleton exists behind a feature flag for future
  PoC measurement, nothing fleet-facing.
- [ ] **Custom live voice-call protocol** — deprioritized 2026-07-01 per user request;
  scope not yet decided. Revisit after the tiers above are worked down.

---

*Historical narrative and detailed per-session logs remain in
`docs/SESSION-1.8.0-OTA-PROGRESS.md` and `docs/PRODUCTION-MASTER-PLAN.md` §6/§8b —
this doc is the live "what's left, in priority order" list. Update it (don't just
append to the old docs) as items close or new ones surface.*