lfg2025/archy

archipelago 2c1d2a2572 docs: multinode gate finished + boot-reconciler self-heal bug found+fixed

.5's 5x gate done: 5/5 iterations, all technically FAIL per run-gate.sh's
tally but only from .5's permanent pruned-bitcoin ceiling (accepted going
in); down to 2 failures/iteration by the end. Found + fixed a real hang
(lnd cached a dead bitcoin-knots IP after a restart) live mid-run.

Separately found a real boot-reconciler bug via indeedhub going stuck on
.116: any genuinely-installed-but-fully-absent app was left stuck forever
unless it was one of 8 hardcoded "baseline" apps. Fix tracked, code change
in the shared working tree pending test confirmation.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

2026-07-01 17:24:42 -04:00

14 KiB

Raw Blame History

Unified Task Tracker — OTA 1.8.0 + Master Plan

Single working list for everything left before 1.8.0 ships and the next master-plan exit criteria (multinode + workstreams B/C/D) are met. Supersedes the open-task sections of docs/SESSION-1.8.0-OTA-PROGRESS.md and docs/PRODUCTION-MASTER-PLAN.md as the day-to-day tracker — those docs remain the historical record / detailed narrative and are still linked from here where useful. Ordered fastest/simplest first so we work top-down instead of hunting across docs.

Verified against actual code state on 2026-07-01 (not just doc text — several items the source docs still listed as "open" turned out to already be shipped; those are marked ✅ below with the commit that did it, so we stop re-litigating them).

Tier 0 — Quick / mechanical, no blockers

Update tests/lifecycle/TESTING.md's stale Release Gates checklist (lines 289–296) — several boxes are unchecked but actually true now:
- #1 bitcoin-stops: covered by tests/lifecycle/bats/bitcoin-knots.bats stop/restart tier, included in the 5/5 green gate run.
- #2 ARCHY_ITERATIONS=5 on .228: GREEN 2026-06-23 per CLAUDE.md — check the box.
- #5 cargo 0 warnings: confirmed 0 warnings on cargo build --release (2026-07-01).
- #7 layman changelog: CHANGELOG.md is backfilled with layman-readable entries through v1.8.00-alpha — check the box.
- Leave #3 (multinode), #4 (backend-survives-restart / Phase-3 default-on), #6 (LoC decision), #8 (tag pushed) unchecked — genuinely still open, see Tier 2/3.
~~Finish the archival/full-node manifest generalization~~ — investigated 2026-07-01: the hardcoded fallback names in dependencies.rs:48-52 (electrs, mempool-electrs, mempool-web) are legacy alias ids for electrumx/mempool, resolved via id-mapping in a dozen other places (install.rs, runtime.rs, config.rs, etc.), not separate un-migrated apps with their own manifests. electrumx and mempool themselves already declare bitcoin:archival. The fallback is correct as-is — not tech debt, closing this item rather than risk breaking alias resolution.
~~Confirm/close the Portainer image-pin item~~ — confirmed 2026-07-01: 146.59.87.168:3000/lfg2025/portainer:2.19.4 is present in podman images on all 3 LAN nodes (.116/.198/.228), i.e. actually resolvable/pulled from the mirror. Not a live bug.
~~grafana Quadlet "stuck activating"~~ — checked live on .116 (2026-07-01): grafana.service is active (running), container Up 2 hours (healthy). The 2026-06-21 report is stale for grafana. strfry still unconfirmed — not installed on any of .116/.198/.228 to check directly; low priority until someone actually needs it installed.

Tier 1 — Medium effort, unblocked

~~immich → Quadlet migration~~ — investigated 2026-07-01, turned out already done: immich uses the same install_stack_via_orchestrator primitive as netbird/btcpay (immich_stack_app_ids() in stacks.rs:690), and is confirmed running as real Quadlet units live on .228 (immich_server.container, immich_postgres.container, immich_redis.container, all active). Not a legacy in-cgroup app — the only remaining piece is the fleet-wide Phase-3 default-flip, already tracked in Tier 2.
~~Netbird reinstall adoption path~~ — investigated 2026-07-01, not a bug, by design. adopt_stack_if_exists() (stacks.rs:140-198) is only used as a fallback when the orchestrator has no manifest for the app — there's nothing to render certs/config from in that case, so skipping rendering is correct. When the orchestrator does have the manifest (the normal path), the reconcile loop already re-renders certs even for adopted-running containers, fixed in 4519dbf0 (prod_orchestrator.rs:1707-1708).
~~TanStack Query (or equivalent) investigation~~ — spike complete 2026-07-01, recommendation: don't adopt / close as not needed. Only 3 stores actually fetch data, WebSocket push already handles hot data (server-info/package-data), no cache-invalidation or stale-data bugs found, migration would touch 62 RPC call sites for no concrete payoff. If boilerplate ever bothers us, extract a usePolling() composable instead — much cheaper than a query-cache migration.

Tier 2 — High effort, mostly unblocked (the actual next exit criteria)

[~] Multinode test pass (docs/multinode-testing-plan.md) — worked the preconditions on .198 2026-07-01:
- ✅ cleared 2 stale failed-unit records (archy-mempool-db.service, meshtastic.service — both not-found/dead since 6 and 5 days ago, harmless bookkeeping, systemctl --user reset-failed).
- ✅ nginx /app/lnd/ proxy target confirmed correct (→ 18083, matches the running archy-lnd-ui port) — the plan's "stale proxy target" concern doesn't apply here.
- ⛔ .198 disk (448GB) is below the 1TB archival threshold + was only 21% through IBD — user chose to swap in a different node rather than wait/add storage. .116 ruled out (no bitcoin container installed at all, just the UI companion). .120 ruled out (reserved for another developer). .5 (archy-x250-beta, Tailscale 100.72.136.5) chosen: also sub-1TB (472GB, so still pruned — that ceiling is shared by every non-.228 node), but fully synced (ibd:false, blocks==headers 956,240). Bootstrapped bats 1.11.1 + jq 1.7.1 onto it 2026-07-01 and **launched the 5× destructive gate (ARCHY_ITERATIONS=5 ARCHY_ALLOW_DESTRUCTIVE=1).
- ✅ Gate finished 2026-07-01: 5/5 iterations, technically all "FAIL" per run-gate.sh's tally — but only because .5's pruned-bitcoin limitation (expected, permanent, accepted going in) fails one test every single iteration. By iteration 4-5 that was down to exactly 2 failures per run: the expected pruned-bitcoin one, plus a reproducible lnd proxy timeout (https://host/app/lnd/, distinct from the DNS bug below — happened consistently on both of the last 2 iterations, worth its own investigation, not yet root-caused). Iterations 1-3 also hit test-suite bugs since fixed live mid-run (see Tier 0/below) and one ~2h hang (also below) — none of those are real product bugs.
- 🐛 Real hang found + fixed live: lnd cached a dead IP for bitcoin-knots after an earlier restart gave it a new container IP — every RPC needing chain data blocked forever (client-side timeout wrappers don't reliably kill podman exec's in-container process). Blocked iteration 4 for ~2 hours before diagnosed + fixed (podman restart lnd, forces fresh DNS resolution). Product-level gap, not fixed at the code level: dependent services should reconnect/re-resolve after a backend container is recreated, not cache indefinitely. Logged as a follow-up, not yet implemented.
- Next: bring the rest of the fleet to precondition, then the cross-node federation/mesh/transport suites. This is the literal "next exit criterion" called out in CLAUDE.md.
Real bug found + fixed 2026-07-01: boot reconciler left any genuinely-installed-but-fully-absent app stuck forever unless it was one of 8 hardcoded "required baseline" apps — surfaced by indeedhub's backend containers (minio/postgres/relay) never recovering on .116 after going absent. Root cause: ensure_running_with_mode() (prod_orchestrator.rs) only called install_fresh() for is_required_baseline_app() apps in the fully-absent case; every other installed app was left as Left("absent") with no path back short of an explicit reinstall. Fixed: self-heal now applies to any app that reaches this point (i.e. already confirmed NOT user-stopped / NOT user-uninstalled earlier in the same function — those markers are properly set/cleared on uninstall/reinstall, so this can't resurrect a deliberately-removed app). Deleted the now-dead is_required_baseline_app(), updated/renamed the test that had locked in the old behavior. Compiles clean; test suite run in progress. indeedhub itself not yet manually recovered on .116 — the code fix will self-heal it on the next reconcile tick once deployed there.
Phase-3 Quadlet default-flip — code is validated + opt-in via ARCHIPELAGO_USE_QUADLET_BACKENDS=true on .228/.198 already (confirmed live 2026-07-01). Ready to flip (config.rs:256 + its test) the moment the .5 gate reports clean — deliberately NOT staged uncommitted in the tree (a prior attempt left an uncommitted flip sitting around and that caused confusion; it's a 2-line change, faster to just do it fresh once confirmed).
~~~Per-app test coverage for the ~~30 apps with zero automated coverage~~~~ — reframed 2026-07-01, mostly a non-issue. all-apps-matrix.bats + all-apps-lifecycle.bats already give EVERY installed app generic baseline coverage (no stuck state, no error state, stop/start/restart survives, UI reachable). The real gap is narrower: 34 apps lack app-specific assertions (health endpoints, API queryability, data integrity) beyond that baseline — aiui, bitcoin-core, botfights, core-lightning, did-wallet, fedimint-clientd, fedimint-gateway, fips-ui, gitea, grafana, home-assistant, indeedhub (+5 sub-containers), jellyfin, lightning-stack, lnd-ui, morphos-server, netbird (+2 sub-containers), nextcloud, nostr-rs-relay, photoprism, portainer, router, searxng, strfry, uptime-kuma, vaultwarden. Not urgent — baseline coverage is real safety net; treat as a backlog "nice to harden further," not a gate item.
~~Convert remaining multi-container legacy stacks to the manifest-owned model~~ — investigated 2026-07-01, DONE, nothing left. All 5 real multi-container stacks (btcpay, mempool, immich, netbird, indeedhub) are on the install_stack_via_orchestrator pattern (stacks.rs). saleor was removed from the codebase; portainer/home-assistant/grafana are single-container manifest-driven apps, never stacks; fedimint/fedimint-gateway/fedimint-clientd are 3 separate single-container apps with manifest dependency edges, not a coordinated stack. Workstream A's stack-migration tail is fully closed.
Developer tooling CLI suite (validate/render/local-install/lifecycle-test) — APP-PACKAGING-MIGRATION-PLAN.md step 5, needed before external devs can publish.
[~] Cross-node federation/mesh/transport suites — big find 2026-07-01: these already exist, just aren't wired into the gate or documented as existing: tests/multinode/smoke.sh (federation pairing/sync, FIPS anchor, peer content browse, tombstone-removal regression tests), tests/multinode/meshtastic.sh (8-stage on-air mesh test), harness in tests/multinode/lib/multinode.bash. Actually ran smoke.sh live against .116↔.228 2026-07-01: 14 passed, 1 failed, 1 skipped. Confirms federation pairing (both directions), FIPS anchor connectivity (both nodes), and peer-content-browse-over-mesh (the v1.7.95 fix) all genuinely work node-to-node right now.
- ⚠️ Real robustness gap found: node_rpc() in tests/multinode/lib/multinode.bash has no --max-time on its curl calls — a slow server-side RPC hangs the whole suite with zero feedback (this is what looked like a hang before it eventually completed on its own). Cheap fix, not yet applied.
- 🐛 Real regression found and root-caused: removing a federation node (federation.remove-node) doesn't reliably stick — B reappeared in A's peer list after removal in the live test. Root cause: remove_node() (core/archipelago/src/federation/storage.rs:187) does let _ = tombstone_did(data_dir, did).await — silently swallows the tombstone write's errors. If that write fails (disk I/O, permission, transient issue), the peer is removed from nodes.json but never actually tombstoned, so the next background sync/notify-join re-adds it — the tombstone check at handlers.rs:592-599 passes because the DID was never recorded as removed. Diagnosed as a pre-existing logic gap, not a fresh regression from the v1.7.95 fix. Not fixed yet — this is federation/trust code, deliberately not touching it blind; needs a careful fix (surface the tombstone-write failure instead of swallowing it, and/or retry) plus re-verification with smoke.sh before considering it closed.

Tier 3 — Blocked on a decision or resource only you can supply

Version naming decision (1.7.99-alpha → 1.8.0 vs 1.8.00-alpha) — code is otherwise ready to tag; this is a one-line decision, then a mechanical bump + tag + push. Needs your call, not more engineering.
Workstream B signing ceremony — core/archipelago/src/trust/anchor.rs:21 still has RELEASE_ROOT_PUBKEY_HEX = None. Needs the offline RELEASE_MASTER_MNEMONIC to run docs/workstream-b-signing-runbook.md's 4-step ceremony — can't be automated by me.
Bitcoin multi-version fleet-wide OTA — .228 fully working on branch, per your prior gating this rollout is explicitly held for your decision on timing (docs/bitcoin-version-bulletproof-rollout.md).
3ccc stock-Meshtastic RF validation — needs a live send/receive test with physical radios in your hands; code fix is in place, just unverified live.

Backlog — deferred, no scope decided, low priority

Marketplace protocol (workstream C) — design-only (docs/marketplace-protocol.md), no tooling/trust UX built. Future work, not urgent.
DHT distribution (workstream D) — confirmed design-only, no code (docs/dht-distribution-design.md explicitly says "Status: Design (no code yet)"); an experimental iroh provider skeleton exists behind a feature flag for future PoC measurement, nothing fleet-facing.
Custom live voice-call protocol — deprioritized 2026-07-01 per user request; scope not yet decided. Revisit after the tiers above are worked down.

Historical narrative and detailed per-session logs remain in docs/SESSION-1.8.0-OTA-PROGRESS.md and docs/PRODUCTION-MASTER-PLAN.md §6/§8b — this doc is the live "what's left, in priority order" list. Update it (don't just append to the old docs) as items close or new ones surface.

14 KiB Raw Blame History Unescape Escape