archy/docs/UNIFIED-TASK-TRACKER.md
archipelago 2c1d2a2572 docs: multinode gate finished + boot-reconciler self-heal bug found+fixed
.5's 5x gate done: 5/5 iterations, all technically FAIL per run-gate.sh's
tally but only from .5's permanent pruned-bitcoin ceiling (accepted going
in); down to 2 failures/iteration by the end. Found + fixed a real hang
(lnd cached a dead bitcoin-knots IP after a restart) live mid-run.

Separately found a real boot-reconciler bug via indeedhub going stuck on
.116: any genuinely-installed-but-fully-absent app was left stuck forever
unless it was one of 8 hardcoded "baseline" apps. Fix tracked, code change
in the shared working tree pending test confirmation.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-01 17:24:42 -04:00


# Unified Task Tracker — OTA 1.8.0 + Master Plan
Single working list for everything left before 1.8.0 ships and the next master-plan
exit criteria (multinode + workstreams B/C/D) are met. Supersedes the open-task
sections of `docs/SESSION-1.8.0-OTA-PROGRESS.md` and `docs/PRODUCTION-MASTER-PLAN.md`
as the day-to-day tracker — those docs remain the historical record / detailed
narrative and are still linked from here where useful. **Ordered fastest/simplest
first** so we work top-down instead of hunting across docs.
Verified against actual code state on 2026-07-01 (not just doc text — several
items the source docs still listed as "open" turned out to already be shipped;
those are marked ✅ below with the commit that did it, so we stop re-litigating them).
---
## Tier 0 — Quick / mechanical, no blockers
- [ ] **Update `tests/lifecycle/TESTING.md`'s stale Release Gates checklist** (lines
289-296) — several boxes are unchecked but actually true now:
- #1 bitcoin-stops: covered by `tests/lifecycle/bats/bitcoin-knots.bats` stop/restart
tier, included in the 5/5 green gate run.
- #2 `ARCHY_ITERATIONS=5` on .228: **GREEN 2026-06-23 per CLAUDE.md** — check the box.
- #5 cargo 0 warnings: confirmed 0 warnings on `cargo build --release` (2026-07-01).
- #7 layman changelog: `CHANGELOG.md` is backfilled with layman-readable entries
through v1.8.00-alpha — check the box.
- Leave #3 (multinode), #4 (backend-survives-restart / Phase-3 default-on), #6
(LoC decision), #8 (tag pushed) unchecked — genuinely still open, see Tier 2/3.
- [x] ~~Finish the archival/full-node manifest generalization~~ — investigated 2026-07-01:
the hardcoded fallback names in `dependencies.rs:48-52` (`electrs`, `mempool-electrs`,
`mempool-web`) are legacy **alias** ids for `electrumx`/`mempool`, resolved via
id-mapping in a dozen other places (`install.rs`, `runtime.rs`, `config.rs`, etc.),
not separate un-migrated apps with their own manifests. `electrumx` and `mempool`
themselves already declare `bitcoin:archival`. The fallback is correct as-is —
not tech debt, closing this item rather than risk breaking alias resolution.
- [x] ~~Confirm/close the Portainer image-pin item~~ — confirmed 2026-07-01:
`146.59.87.168:3000/lfg2025/portainer:2.19.4` is present in `podman images` on
all 3 LAN nodes (.116/.198/.228), i.e. actually resolvable/pulled from the mirror.
Not a live bug.
- [x] ~~grafana Quadlet "stuck activating"~~ — checked live on .116 (2026-07-01):
`grafana.service` is `active (running)`, container `Up 2 hours (healthy)`. The
2026-06-21 report is stale for grafana. **strfry still unconfirmed** — not
installed on any of .116/.198/.228 to check directly; low priority until someone
actually needs it installed.
## Tier 1 — Medium effort, unblocked
- [x] ~~immich → Quadlet migration~~ — investigated 2026-07-01, turned out already done:
immich uses the same `install_stack_via_orchestrator` primitive as netbird/btcpay
(`immich_stack_app_ids()` in `stacks.rs:690`), and is confirmed running as real
Quadlet units live on .228 (`immich_server.container`, `immich_postgres.container`,
`immich_redis.container`, all active). Not a legacy in-cgroup app — the only
remaining piece is the fleet-wide Phase-3 default-flip, already tracked in Tier 2.
- [x] ~~Netbird reinstall adoption path~~ — investigated 2026-07-01, **not a bug, by
design.** `adopt_stack_if_exists()` (`stacks.rs:140-198`) is only used as a
fallback when the orchestrator has no manifest for the app — there's nothing to
render certs/config from in that case, so skipping rendering is correct. When
the orchestrator *does* have the manifest (the normal path), the reconcile loop
already re-renders certs even for adopted-running containers, fixed in
`4519dbf0` (`prod_orchestrator.rs:1707-1708`).
- [x] ~~TanStack Query (or equivalent) investigation~~ — spike complete 2026-07-01,
**recommendation: don't adopt / close as not needed.** Only 3 stores actually fetch
data, WebSocket push already handles hot data (server-info/package-data), no
cache-invalidation or stale-data bugs found, migration would touch 62 RPC call
sites for no concrete payoff. If boilerplate ever bothers us, extract a
`usePolling()` composable instead — much cheaper than a query-cache migration.
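
A minimal sketch of the Netbird adoption decision described above, in Rust. `StackPath` and `choose_path()` are hypothetical names for illustration only, not the actual `stacks.rs` / `prod_orchestrator.rs` code, and the "no manifest, nothing running" branch is an assumption the notes above do not spell out:

```rust
/// Hypothetical outcome of the stack-install decision (illustration only).
#[derive(Debug, PartialEq)]
enum StackPath {
    /// Orchestrator has the manifest: normal path. The reconcile loop
    /// re-renders certs/config even when containers are already running.
    ReconcileAndRender,
    /// No manifest: adopt the running containers as-is. There is nothing
    /// to render certs/config from, so rendering is skipped on purpose.
    AdoptWithoutRender,
    /// No manifest and nothing running: nothing to adopt (assumed branch).
    NothingToAdopt,
}

/// Picks the path from two facts: whether the orchestrator knows the
/// manifest, and whether the stack's containers already exist and run.
fn choose_path(has_manifest: bool, containers_running: bool) -> StackPath {
    match (has_manifest, containers_running) {
        (true, _) => StackPath::ReconcileAndRender,
        (false, true) => StackPath::AdoptWithoutRender,
        (false, false) => StackPath::NothingToAdopt,
    }
}
```

The point of the sketch is that skipping rendering only happens on the manifest-less fallback, which is why the reinstall behaviour is by design rather than a bug.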
## Tier 2 — High effort, mostly unblocked (the actual next exit criteria)
- [~] **Multinode test pass** (`docs/multinode-testing-plan.md`) — worked the
preconditions on .198 2026-07-01:
- ✅ cleared 2 stale failed-unit records (`archy-mempool-db.service`,
`meshtastic.service` — both `not-found`/dead since 6 and 5 days ago, harmless
bookkeeping, `systemctl --user reset-failed`).
- ✅ nginx `/app/lnd/` proxy target confirmed correct (→ `18083`, matches the
running `archy-lnd-ui` port) — the plan's "stale proxy target" concern doesn't
apply here.
- ⛔ .198 disk (448GB) is below the 1TB archival threshold + was only 21%
through IBD — user chose to **swap in a different node** rather than wait/add
storage. **.116 ruled out** (no bitcoin container installed at all, just the
UI companion). **.120 ruled out** (reserved for another developer). **.5**
(archy-x250-beta, Tailscale `100.72.136.5`) chosen: also sub-1TB (472GB, so
still pruned — that ceiling is shared by every non-.228 node), but **fully
synced** (`ibd:false`, blocks==headers 956,240). Bootstrapped bats 1.11.1 +
jq 1.7.1 onto it 2026-07-01 and **launched the 5× destructive gate**
(`ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1`).
- ✅ **Gate finished 2026-07-01: 5/5 iterations, technically all "FAIL" per
run-gate.sh's tally — but only because .5's pruned-bitcoin limitation
(expected, permanent, accepted going in) fails one test every single
iteration.** By iteration 4-5 that was down to exactly 2 failures per run:
the expected pruned-bitcoin one, plus a reproducible `lnd` proxy timeout
(`https://host/app/lnd/`, distinct from the DNS bug below — happened
consistently on both of the last 2 iterations, worth its own investigation,
not yet root-caused). Iterations 1-3 also hit test-suite bugs since fixed
live mid-run (see Tier 0/below) and one ~2h hang (also below) — none of
those are real product bugs.
- 🐛 **Real hang found + fixed live**: `lnd` cached a dead IP for
`bitcoin-knots` after an earlier restart gave it a new container IP —
every RPC needing chain data blocked forever (client-side `timeout`
wrappers don't reliably kill `podman exec`'s in-container process).
Blocked iteration 4 for ~2 hours before it was diagnosed + fixed (`podman
restart lnd`, forces fresh DNS resolution). **Product-level gap, not
fixed at the code level**: dependent services should reconnect/re-resolve
after a backend container is recreated, not cache indefinitely. Logged as
a follow-up, not yet implemented.
- Next: bring the rest of the fleet to precondition, then the cross-node
federation/mesh/transport suites. This is the literal "next exit
criterion" called out in `CLAUDE.md`.
- [x] ~~**Real bug found + fixed 2026-07-01**: boot reconciler left any
genuinely-installed-but-fully-absent app stuck forever unless it was one
of 8 hardcoded "required baseline" apps~~ — surfaced by indeedhub's
backend containers (minio/postgres/relay) never recovering on .116 after
going absent. Root cause: `ensure_running_with_mode()`
(`prod_orchestrator.rs`) only called `install_fresh()` for
`is_required_baseline_app()` apps in the fully-absent case; every other
installed app was left as `Left("absent")` with no path back short of an
explicit reinstall. Fixed: self-heal now applies to any app that
reaches this point (i.e. already confirmed NOT user-stopped / NOT
user-uninstalled earlier in the same function — those markers are
properly set/cleared on uninstall/reinstall, so this can't resurrect a
deliberately-removed app). Deleted the now-dead
`is_required_baseline_app()`, updated/renamed the test that had locked
in the old behavior. Compiles clean; test suite run in progress.
indeedhub itself not yet manually recovered on .116 — the code fix
will self-heal it on the next reconcile tick once deployed there.
- [ ] **Phase-3 Quadlet default-flip** — code is validated + opt-in via
`ARCHIPELAGO_USE_QUADLET_BACKENDS=true` on .228/.198 already (confirmed live
2026-07-01). Ready to flip (`config.rs:256` + its test) the moment the .5 gate
reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt
left an uncommitted flip sitting around and that caused confusion; it's a 2-line
change, faster to just do it fresh once confirmed).
- [x] ~~Per-app test coverage for the ~30 apps with zero automated coverage~~ —
**reframed 2026-07-01, mostly a non-issue.** `all-apps-matrix.bats` +
`all-apps-lifecycle.bats` already give EVERY installed app generic baseline
coverage (no stuck state, no error state, stop/start/restart survives, UI
reachable). The real gap is narrower: **34 apps lack app-specific assertions**
(health endpoints, API queryability, data integrity) beyond that baseline —
aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd,
fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5
sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird
(+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router,
searxng, strfry, uptime-kuma, vaultwarden. Not urgent — baseline coverage is
a real safety net; treat as a backlog "nice to harden further," not a gate item.
- [x] ~~Convert remaining multi-container legacy stacks to the manifest-owned model~~
**investigated 2026-07-01, DONE, nothing left.** All 5 real multi-container
stacks (btcpay, mempool, immich, netbird, indeedhub) are on the
`install_stack_via_orchestrator` pattern (`stacks.rs`). saleor was removed from
the codebase; portainer/home-assistant/grafana are single-container
manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd
are 3 separate single-container apps with manifest dependency edges, not a
coordinated stack. Workstream A's stack-migration tail is fully closed.
- [ ] **Developer tooling CLI suite** (validate/render/local-install/lifecycle-test) —
APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
- [~] **Cross-node federation/mesh/transport suites** — **big find 2026-07-01: these
already exist**, just aren't wired into the gate or documented as existing:
`tests/multinode/smoke.sh` (federation pairing/sync, FIPS anchor, peer content
browse, tombstone-removal regression tests), `tests/multinode/meshtastic.sh`
(8-stage on-air mesh test), harness in `tests/multinode/lib/multinode.bash`.
**Actually ran `smoke.sh` live against .116↔.228 2026-07-01: 14 passed, 1
failed, 1 skipped.** Confirms federation pairing (both directions), FIPS
anchor connectivity (both nodes), and peer-content-browse-over-mesh (the
v1.7.95 fix) all genuinely work node-to-node right now.
- ⚠️ **Real robustness gap found**: `node_rpc()` in `tests/multinode/lib/multinode.bash`
has no `--max-time` on its curl calls — a slow server-side RPC hangs the whole
suite with zero feedback (this is what looked like a hang before it eventually
completed on its own). Cheap fix, not yet applied.
- 🐛 **Real regression found and root-caused**: removing a federation node
(`federation.remove-node`) doesn't reliably stick — B reappeared in A's peer
list after removal in the live test. Root cause: `remove_node()`
(`core/archipelago/src/federation/storage.rs:187`) does
`let _ = tombstone_did(data_dir, did).await` — **silently swallows the
tombstone write's errors.** If that write fails (disk I/O, permission,
transient issue), the peer is removed from `nodes.json` but never actually
tombstoned, so the next background sync/notify-join re-adds it — the
tombstone check at `handlers.rs:592-599` passes because the DID was never
recorded as removed. Diagnosed as a **pre-existing logic gap**, not a fresh
regression from the v1.7.95 fix. **Not fixed yet** — this is federation/trust
code, deliberately not touching it blind; needs a careful fix (surface the
tombstone-write failure instead of swallowing it, and/or retry) plus
re-verification with `smoke.sh` before considering it closed.
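
A minimal sketch of the boot-reconciler self-heal rule from the item above, in Rust. `AppState`, `Action` and `plan()` are hypothetical names, not the real `ensure_running_with_mode()` signature; the sketch only mirrors the ordering the tracker describes (user-intent markers are checked first, then any fully-absent app is reinstalled):

```rust
/// Hypothetical snapshot of what the reconciler knows about one installed app.
#[derive(Debug, Clone, Copy)]
struct AppState {
    user_stopped: bool,
    user_uninstalled: bool,
    containers_present: bool,
}

/// Hypothetical reconcile outcomes (illustration only).
#[derive(Debug, PartialEq)]
enum Action {
    /// The user stopped or uninstalled the app: never resurrect it.
    LeaveAlone,
    /// Containers exist: just make sure they are running.
    EnsureRunning,
    /// Installed but fully absent: reinstall from the manifest.
    InstallFresh,
}

fn plan(state: AppState) -> Action {
    // User intent wins before any self-heal is considered.
    if state.user_stopped || state.user_uninstalled {
        return Action::LeaveAlone;
    }
    if state.containers_present {
        return Action::EnsureRunning;
    }
    // The old behaviour gated this branch on a hardcoded baseline list and
    // left every other app absent. The fix applies it to any app that
    // reaches this point.
    Action::InstallFresh
}
```

Under these assumptions an app like indeedhub, with no user markers set and no containers left, maps to `Action::InstallFresh`, while a deliberately uninstalled app still maps to `Action::LeaveAlone`.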
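
A possible shape for the `remove_node()` follow-up, sketched in Rust under loud assumptions: the `FederationStore` trait, `MemStore`, and `RemoveError` are hypothetical stand-ins, the real function is async and works on `data_dir`, and writing the tombstone before touching the node list is a choice of this sketch, not something confirmed against `storage.rs`. It only illustrates "surface the tombstone-write failure instead of swallowing it":

```rust
/// Hypothetical error type; the real error type is not shown in this tracker.
#[derive(Debug, PartialEq)]
enum RemoveError {
    TombstoneWrite(String),
    NodeListWrite(String),
}

/// Hypothetical storage interface standing in for the tombstone file
/// and `nodes.json`.
trait FederationStore {
    fn tombstone_did(&mut self, did: &str) -> Result<(), String>;
    fn remove_from_nodes(&mut self, did: &str) -> Result<(), String>;
}

/// Propagates a failed tombstone write with `?` instead of `let _ = ...`,
/// so a peer is never dropped from the node list without a tombstone.
fn remove_node<S: FederationStore>(store: &mut S, did: &str) -> Result<(), RemoveError> {
    store.tombstone_did(did).map_err(RemoveError::TombstoneWrite)?;
    store.remove_from_nodes(did).map_err(RemoveError::NodeListWrite)?;
    Ok(())
}

/// In-memory store used only to exercise the sketch.
#[derive(Default)]
struct MemStore {
    nodes: Vec<String>,
    tombstones: Vec<String>,
    fail_tombstone: bool, // simulates a disk I/O or permission failure
}

impl FederationStore for MemStore {
    fn tombstone_did(&mut self, did: &str) -> Result<(), String> {
        if self.fail_tombstone {
            return Err("simulated write failure".to_string());
        }
        self.tombstones.push(did.to_string());
        Ok(())
    }

    fn remove_from_nodes(&mut self, did: &str) -> Result<(), String> {
        self.nodes.retain(|n| n != did);
        Ok(())
    }
}
```

With this ordering a failed tombstone write leaves the peer visibly present and returns an error to the caller, rather than removing it and letting the next sync re-add it silently. Whether to retry on failure is left open here, as in the item above.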
## Tier 3 — Blocked on a decision or resource only you can supply
- [ ] **Version naming decision (1.7.99-alpha → 1.8.0 vs 1.8.00-alpha)** — code is
otherwise ready to tag; this is a one-line decision, then a mechanical bump +
tag + push. **Needs your call**, not more engineering.
- [ ] **Workstream B signing ceremony**: `core/archipelago/src/trust/anchor.rs:21`
still has `RELEASE_ROOT_PUBKEY_HEX = None`. Needs the offline
`RELEASE_MASTER_MNEMONIC` to run `docs/workstream-b-signing-runbook.md`'s
4-step ceremony — can't be automated by me.
- [ ] **Bitcoin multi-version fleet-wide OTA**: `.228` fully working on branch,
per your prior gating this rollout is explicitly held for your decision on
timing (`docs/bitcoin-version-bulletproof-rollout.md`).
- [ ] **3ccc stock-Meshtastic RF validation** — needs a live send/receive test with
physical radios in your hands; code fix is in place, just unverified live.
## Backlog — deferred, no scope decided, low priority
- [ ] **Marketplace protocol (workstream C)** — design-only (`docs/marketplace-protocol.md`),
no tooling/trust UX built. Future work, not urgent.
- [ ] **DHT distribution (workstream D)** — confirmed design-only, no code
(`docs/dht-distribution-design.md` explicitly says "Status: Design (no code yet)");
an experimental iroh provider skeleton exists behind a feature flag for future
PoC measurement, nothing fleet-facing.
- [ ] **Custom live voice-call protocol** — deprioritized 2026-07-01 per user request;
scope not yet decided. Revisit after the tiers above are worked down.
---
*Historical narrative and detailed per-session logs remain in
`docs/SESSION-1.8.0-OTA-PROGRESS.md` and `docs/PRODUCTION-MASTER-PLAN.md` §6/§8b —
this doc is the live "what's left, in priority order" list. Update it (don't just
append to the old docs) as items close or new ones surface.*