archy/docs/bitcoin-version-bulletproof-rollout.md
archipelago ed1352d3a3 docs+catalog: bitcoin multi-version rollout handoff + reproducible generator
- generate-app-catalog.sh: VERSIONS map now lists the full Knots set
  (29.3.knots20260508/20260507/20260210 + 29.2.knots20251110) and Core
  (adds 29.2 + a `latest` entry → newest); generator forces top-level
  `version` == the default entry's version (the 169ff2e2 invariant) so
  regeneration is reproducible. releases/app-catalog.json regenerated.
- docs/bitcoin-version-bulletproof-rollout.md: full handoff — root causes,
  fixes, current .228 state, the coordinated fleet-rollout steps (incl.
  :latest repoint sequencing / fleet-safety), reindex finish procedure, and
  the switch-matrix test plan.
- PRODUCTION-MASTER-PLAN.md: link the rollout doc (§6b-bis).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 06:02:24 -04:00


Bitcoin Multi-Version — Bulletproofing & Rollout (handoff)

Status 2026-06-29: code + images + catalog + frontend DONE on branch bitcoin-version-bulletproof (base commit 095a76cd, plus the catalog-generator + handoff follow-ups). .228 is the test node: binary + frontend + catalog are live there; its Knots chainstate is mid-reindex recovery (see §5). The fleet rollout (OTA binary+frontend, mirror catalog publish, :latest repoint) is the coordinated step the other agent owns; see §4. Pairs with docs/bitcoin-multi-version-design.md (the original design).

1. What was broken (root causes)

User report: "switched Knots to v29.3.knots20260508, version didn't update in the UI." Three stacked bugs, plus a data-corruption hazard:

  1. Reconciler reverted the pin. prod_orchestrator::sync_quadlet_unit re-rendered the quadlet every reconcile tick using the manifest's :latest, ignoring the per-app pinned version → any switch silently reverted within one tick.
  2. Entrypoint render bug. The renderer folded the manifest entrypoint: ["sh","-lc"] into Exec=. That only works when the image ENTRYPOINT is a passthrough shell wrapper. The versioned images use ENTRYPOINT ["bitcoind"], so Exec=sh -lc … became bitcoind sh -lc … → unexpected token 'sh' → crash loop.
  3. Image USER divergence. The versioned images were built USER bitcoin (uid 1000); the legacy :latest ran as root. Chain data is owned by the data_uid (host 100101 / container uid 102). Root reads it via CAP_DAC_OVERRIDE (granted in the manifest); uid-1000 cannot → Error initializing block database.
  4. Data hazard (already hit on .228). Repeated failed starts under mixed UIDs left bitcoind's two LevelDBs (blocks/index/ + chainstate/) truncated to KB stubs while the raw blocks/blk*.dat (797 GB) stayed intact. Recovery = bitcoind -reindex from local blocks (no re-download). The uniform-root image fix (below) removes the mixed-UID cause going forward; the proper switch flow was already data-safe (600s stop grace, clean stop→rm→recreate, conflict-stops the other impl — they share port 8332 + datadir /var/lib/archipelago/bitcoin).
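
The UID numbers in item 3 are consistent with podman's default rootless mapping, where container uid 0 is the host service user and container uid N ≥ 1 lands at subuid_start + N - 1. The sketch below works that arithmetic through. The subordinate range start (100000) and the service user's uid (1001) are assumptions chosen for illustration, not values read from .228, and host_uid_for is a hypothetical helper name.

```shell
# Sketch of podman's default rootless UID mapping
# (assumes no custom UIDMap / keep-id settings).
# host_uid_for <container_uid> <service_uid> <subuid_start>
host_uid_for() {
  local cuid="$1" service_uid="$2" subuid_start="$3"
  if [ "$cuid" -eq 0 ]; then
    # container-root is the unprivileged host service user
    echo "$service_uid"
  else
    # container uids 1..N map into the subordinate uid range
    echo $(( subuid_start + cuid - 1 ))
  fi
}

# Assumed values: service uid 1001, subuid range starting at 100000.
host_uid_for 102  1001 100000   # data_uid       -> 100101
host_uid_for 1000 1001 100000   # USER bitcoin   -> 100999
host_uid_for 0    1001 100000   # container-root -> 1001
```

Under these assumptions the chain data (host 100101) is foreign to both container-root and uid 1000. Only root carries CAP_DAC_OVERRIDE, which is why a root image can open the block database and a USER bitcoin image cannot.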

2. What was fixed (all on the branch)

  • Renderer (core/archipelago/src/container/):
    • prod_orchestrator.rs: factored resolve_catalog_image() (catalog/pinned-version → image) and call it in BOTH install_fresh and sync_quadlet_unit — the pin now survives reconcile.
    • quadlet.rs: emit a real Entrypoint=<first> + Exec=<rest+cmd> instead of folding; exec_changed now also diffs Entrypoint= so the recreate fires. Validated against the live podman 5.4.2 quadlet generator.
  • Images (scripts/build-bitcoin-image.sh, apps/bitcoin-{knots,core}/Dockerfile): removed USER bitcoin → run as container-root like legacy (still 100% rootless: container-root maps to the unprivileged host service user; CAP_DAC_OVERRIDE from the manifest lets bitcoind read the data_uid-owned datadir). All images rebuilt root + pushed to the mirror (146.59.87.168:3000/lfg2025):
    • Knots: 29.3.knots20260508, 29.3.knots20260507, 29.3.knots20260210, 29.2.knots20251110
    • Core: 25.2 26.2 27.2 28.4 29.2 29.3 30.2 31.0 + latest (→31.0)
  • Catalog (scripts/generate-app-catalog.sh VERSIONS map + regenerated releases/app-catalog.json): Knots & Core versions[] populated; the generator now forces top-level version == the default entry's version (the 169ff2e2 invariant) regardless of the manifest version. Knots latest entry points at the newest dated image (29.3.knots20260508) so "Always use latest" = newest on fixed-binary nodes.
  • Frontend (neode-ui/):
    • AppSidebar.vue: rename the latest option to "Always use the latest version" (no v prefix), fix right padding, and pickSelection() guarantees the bound value is a real option (fixes the blank dropdown).
    • New components/InstallVersionModal.vue: full-screen version chooser shown from the App Store / Discover card install button for multi-version apps — app icon + "Install ", latest pre-selected. Wired in Discover.vue handleInstall.
    • i18n keys: appDetails.alwaysUseLatestVersion, marketplace.installModalTitle/Hint.
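
To make the quadlet.rs change concrete, here is a minimal shell model of the two rendering strategies. It is not the Rust renderer: the function names are hypothetical, the quoting is deliberately naive, and `bitcoind -printtoconsole` stands in for whatever command string the manifest really carries.

```shell
# quote_word: naive quoting, wraps a word in double quotes if it contains whitespace.
# (Simplification: does not escape embedded quotes, backslashes or specifiers.)
quote_word() {
  case "$1" in
    *[[:space:]]*) printf '"%s"' "$1" ;;
    *)             printf '%s' "$1" ;;
  esac
}

# join_words: quote each argument, then join them with single spaces.
join_words() {
  local out="" sep="" w
  for w in "$@"; do
    out="$out$sep$(quote_word "$w")"
    sep=" "
  done
  printf '%s' "$out"
}

# Old behaviour: every word is folded into Exec=, so the image ENTRYPOINT
# still runs first and receives "sh -lc ..." as its arguments.
render_folded() {
  printf 'Exec=%s\n' "$(join_words "$@")"
}

# New behaviour: the first entrypoint word overrides the image ENTRYPOINT,
# the remaining entrypoint words plus the command go to Exec=.
render_split() {
  local first="$1"; shift
  printf 'Entrypoint=%s\n' "$first"
  printf 'Exec=%s\n' "$(join_words "$@")"
}

render_folded sh -lc 'bitcoind -printtoconsole'
render_split  sh -lc 'bitcoind -printtoconsole'
```

With the split form the container runs `sh -lc "bitcoind -printtoconsole"` regardless of what ENTRYPOINT the image declares, which is the property the versioned images need.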

3. Current live state on .228 (test node)

  • Binary with both renderer fixes: deployed (/usr/local/bin/archipelago).
  • New frontend bundle: deployed to /opt/archipelago/web-ui (hard-refresh to see it).
  • Updated catalog: placed at /var/lib/archipelago/app-catalog.json (local override — will refresh from the mirror's OLDER copy at the next hourly fetch until §4 publishes it).
  • Knots: bitcoin-knots service held stopped (package.stop, user_stopped); a detached bitcoin-knots-reindex container is rebuilding the index+UTXO (§5).

4. Remaining — coordinated fleet rollout (OTHER AGENT)

Do this together with the other workstream's release, AFTER both are ready:

  1. Merge branch bitcoin-version-bulletproof into the release line.
  2. Build + OTA the binary + frontend (these carry the renderer fix + UI). The renderer fix is a hard prerequisite for the new images everywhere — see fleet-safety below.
  3. Publish the catalog to the mirror (push releases/app-catalog.json to gitea-vps2 main, the raw URL nodes fetch hourly). The current catalog is fleet-safe even before the binary lands: unpinned/auto-update nodes resolve via the manifest's floating :latest (still the legacy image); only explicit version selection (needs the new UI) uses the new root images.
  4. Only AFTER the binary is fleet-wide: optionally repoint the bitcoin-knots:latest tag → 29.3.knots20260508 (root) and simplify the catalog latest entry back to the :latest tag. Do NOT repoint :latest before then — old-binary nodes fold Exec=sh -lc … and would crash on an ENTRYPOINT ["bitcoind"] image. (Core never worked on old binaries — it always shipped ENTRYPOINT ["bitcoind"] — so Core has no such constraint.)
  5. Verify the full switch matrix on a healthy node (§6).
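
The ordering constraint in step 4 can be expressed as a simple gate. The sketch below assumes a hypothetical inventory format, one `<node> <yes|no>` line per node saying whether it runs a binary with the renderer fix; how that inventory is produced is outside this doc, and safe_to_repoint is an illustrative name.

```shell
# safe_to_repoint: reads "<node> <has_renderer_fix: yes|no>" lines on stdin.
# Succeeds only if at least one node is listed and every node reports "yes".
safe_to_repoint() {
  local node fix seen=0 blocked=0
  while read -r node fix; do
    if [ -z "$node" ]; then
      continue                      # skip blank lines
    fi
    seen=1
    if [ "$fix" != "yes" ]; then
      echo "blocked: $node is still on an old binary" >&2
      blocked=1
    fi
  done
  [ "$seen" -eq 1 ] && [ "$blocked" -eq 0 ]
}
```

A possible use, with `fleet-status.txt` as a placeholder file name: `safe_to_repoint < fleet-status.txt && echo "OK to repoint bitcoin-knots:latest"`. An empty inventory is treated as not safe, so a failed status collection cannot green-light the repoint.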

5. Finishing .228's reindex (the remaining test-node task)

The detached bitcoin-knots-reindex container runs the new root 29.3.knots20260508 image with -reindex -server=0 against /var/lib/archipelago/bitcoin. It holds the datadir lock, so the managed service (held stopped) can't collide. When it has connected blocks up to ~the prior tip (height ≥ ~955800) it's done; then:

# on .228 (SSH/sudo/UI pw all: ThisIsWeb54321@)
podman stop -t 600 bitcoin-knots-reindex && podman rm bitcoin-knots-reindex
# start the managed service via RPC (sets desired=running, clears user_stopped):
#   package.start {id: bitcoin-knots}  (POST https://127.0.0.1/rpc/v1, CSRF: echo csrf_token cookie as X-CSRF-Token)
# verify:
podman exec bitcoin-knots sh -lc '$(command -v bitcoind) --version | head -1'   # → v29.3.knots20260508
# RPC up → the Bitcoin UI populates; it syncs the gap to tip.

The "Bitcoin RPC connection refused (127.0.0.1:8332)" the UI shows is EXPECTED until this swap (reindex runs with RPC off).
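
Because the reindex runs with -server=0, progress cannot be polled over RPC. One option is to read the last connected height from bitcoind's debug.log. The sketch below assumes the log is at its default location inside the datadir (/var/lib/archipelago/bitcoin/debug.log) and uses the standard `UpdateTip: ... height=N` line format; both helper names are hypothetical.

```shell
# last_height <debug.log>: print the height from the most recent UpdateTip line,
# or nothing if no such line exists yet.
last_height() {
  awk '/UpdateTip:/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^height=[0-9]+$/) h = substr($i, 8)   # strip "height="
       }
       END { if (h != "") print h }' "$1"
}

# reindex_done <debug.log> <target_height>: succeeds once the last connected
# height has reached the target.
reindex_done() {
  local h
  h="$(last_height "$1")"
  [ -n "$h" ] && [ "$h" -ge "$2" ]
}
```

For example, `reindex_done /var/lib/archipelago/bitcoin/debug.log 955800 && echo "ready to swap"` would gate the stop/rm/start sequence on the approximate prior tip quoted above. If the container only logs to the console, the same parsing could be applied to a file captured from its container logs instead.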

6. Switch-matrix test plan (what "bulletproof" must prove)

On a healthy node, each step must end with bitcoind running + RPC answering + syncing, with NO Error initializing block database and NO data loss:

  • Knots: switch latest → 29.3.knots20260507 → 29.3.knots20260210 → back to latest.
  • Core: install latest; switch 31.0 → 28.4.0.
  • Knots ↔ Core (shared datadir/port): Knots→Core upgrade path (Core ≥ data version) and the reverse. Cross-major DOWNGRADES (e.g. 29.x data → Core 28.4) legitimately need a reindex — the UI already surfaces a downgrade warning; confirm it does and that confirming reindexes cleanly rather than crash-looping.
  • Reboot survival after each switch.
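
Part of each matrix step can be checked mechanically. The sketch below reuses the version probe from §5 and adds a scan for the fatal database error. The helper names are hypothetical, the log path passed in is an assumption, and RPC health and reboot survival still have to be verified separately.

```shell
# version_matches <expected> <version line>: substring match on "v<expected>",
# so it also accepts longer version strings that begin with the expected one.
version_matches() {
  case "$2" in
    *"v$1"*) return 0 ;;
    *)       return 1 ;;
  esac
}

# log_is_clean <logfile>: succeeds only if the log is readable and the fatal
# block-database error never appears in it.
log_is_clean() {
  [ -r "$1" ] || return 1
  ! grep -q 'Error initializing block database' "$1"
}

# check_step <container> <expected version> <logfile>
# Runs the same version probe as in section 5, then both checks.
check_step() {
  local line
  line="$(podman exec "$1" sh -lc '$(command -v bitcoind) --version | head -1')" || return 1
  version_matches "$2" "$line" && log_is_clean "$3"
}
```

After switching Knots to 29.3.knots20260507, a call might look like `check_step bitcoin-knots 29.3.knots20260507 /var/lib/archipelago/bitcoin/debug.log`, repeated with the next expected version after each further switch.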

7. Notes / assumptions

  • "29.2" in the request doesn't exist as a Knots build (404 upstream); added as Bitcoin Core 29.2 (exists). Revisit if a Knots 29.2 was meant.
  • Reindex is unavoidable ONLY because .228's index was already corrupted by the pre-fix crash loop; a normal switch on the fixed binary does NOT reindex.
  • Creds for .228: SSH/sudo + UI/RPC all ThisIsWeb54321@.