# Bitcoin Multi-Version — Bulletproofing & Rollout (handoff)
> **Status 2026-06-29:** code + images + catalog + frontend DONE on branch
> `bitcoin-version-bulletproof` (base commit `095a76cd`, plus the catalog-generator
> + handoff follow-ups). **.228 is the test node**: binary + frontend + catalog are
> live there; its Knots chainstate is mid-**reindex recovery** (see §5). The fleet
> rollout (OTA binary+frontend, mirror catalog publish, `:latest` repoint) is the
> **coordinated step the other agent owns** — see §4. Pairs with
> `docs/bitcoin-multi-version-design.md` (the original design).
## 1. What was broken (root causes)
User report: "switched Knots to `v29.3.knots20260508`, version didn't update in the UI."
Three **stacked** bugs, plus a data-corruption hazard:
1. **Reconciler reverted the pin.** `prod_orchestrator::sync_quadlet_unit` re-rendered the
quadlet every reconcile tick using the manifest's `:latest`, ignoring the per-app
pinned version → any switch silently reverted within one tick.
2. **Entrypoint render bug.** The renderer folded the manifest `entrypoint: ["sh","-lc"]`
into `Exec=`. That only works when the image ENTRYPOINT is a passthrough shell wrapper.
The versioned images use `ENTRYPOINT ["bitcoind"]`, so `Exec=sh -lc …` became
`bitcoind sh -lc …` → `unexpected token 'sh'` → crash loop.
3. **Image USER divergence.** The versioned images were built `USER bitcoin` (uid 1000);
the legacy `:latest` ran as **root**. Chain data is owned by the `data_uid`
(host 100101 / container uid 102). Root reads it via `CAP_DAC_OVERRIDE` (granted in the
manifest); uid-1000 cannot → `Error initializing block database`.
4. **Data hazard (already hit on .228).** Repeated failed starts under mixed UIDs left
bitcoind's two LevelDBs (`blocks/index/` + `chainstate/`) truncated to KB stubs while
the raw `blocks/blk*.dat` (797 GB) stayed intact. Recovery = `bitcoind -reindex` from
local blocks (no re-download). The uniform-root image fix (below) removes the mixed-UID
cause going forward; the proper switch flow was already data-safe (600s stop grace,
clean stop→rm→recreate, conflict-stops the other impl — they share port 8332 + datadir
`/var/lib/archipelago/bitcoin`).
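
The corruption signature in item 4 (tiny `blocks/index/` and `chainstate/` next to intact `blk*.dat` files) can be checked before deciding whether a reindex is needed. The following is a minimal sketch, not part of the branch: `leveldb_stub_check` is a hypothetical helper, and the 1024 KiB default threshold is an assumption chosen only to separate "KB stubs" from a real index.

```sh
# Hypothetical helper: flag LevelDB directories that look truncated.
# Usage: leveldb_stub_check DATADIR [MIN_KIB]
# Returns 0 when both databases are at least MIN_KIB in size, 1 otherwise.
leveldb_stub_check() {
    stub_datadir=$1
    stub_min_kb=${2:-1024}   # assumed threshold, not taken from the branch
    stub_bad=0
    for stub_db in blocks/index chainstate; do
        if [ ! -d "$stub_datadir/$stub_db" ]; then
            echo "MISSING $stub_db"
            stub_bad=1
            continue
        fi
        # du -sk prints "<KiB><TAB><path>"; keep only the size column.
        stub_kb=$(du -sk "$stub_datadir/$stub_db" | cut -f1)
        if [ "$stub_kb" -lt "$stub_min_kb" ]; then
            echo "STUB $stub_db (${stub_kb} KiB)"
            stub_bad=1
        else
            echo "OK $stub_db (${stub_kb} KiB)"
        fi
    done
    return "$stub_bad"
}
```

On a node this would be run as `leveldb_stub_check /var/lib/archipelago/bitcoin`, probably under sudo since the datadir is owned by the `data_uid`. A `STUB` result only suggests a reindex; it does not prove corruption.
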
## 2. What was fixed (all on the branch)
- **Renderer** (`core/archipelago/src/container/`):
  - `prod_orchestrator.rs`: factored out `resolve_catalog_image()` (catalog/pinned-version →
    image) and now call it in BOTH `install_fresh` and `sync_quadlet_unit`, so the pin
    survives reconcile.
  - `quadlet.rs`: emit a real `Entrypoint=<first>` + `Exec=<rest+cmd>` instead of folding;
    `exec_changed` now also diffs `Entrypoint=` so the recreate fires. Validated against
    the live podman 5.4.2 quadlet generator.
- **Images** (`scripts/build-bitcoin-image.sh`, `apps/bitcoin-{knots,core}/Dockerfile`):
  removed `USER bitcoin` → run as **container-root** like legacy (still 100% rootless:
  container-root maps to the unprivileged host service user; `CAP_DAC_OVERRIDE` from the
  manifest lets bitcoind read the `data_uid`-owned datadir). **All** images rebuilt as root
  and pushed to the mirror (`146.59.87.168:3000/lfg2025`):
  - Knots: `29.3.knots20260508`, `29.3.knots20260507`, `29.3.knots20260210`, `29.2.knots20251110`
  - Core: `25.2`, `26.2`, `27.2`, `28.4`, `29.2`, `29.3`, `30.2`, `31.0` + `latest` (→ `31.0`)
- **Catalog** (`scripts/generate-app-catalog.sh` VERSIONS map + regenerated
`releases/app-catalog.json`): Knots & Core `versions[]` populated; the generator now
forces top-level `version` == the `default` entry's version (the `169ff2e2` invariant)
regardless of the manifest version. Knots `latest` entry points at the newest **dated**
image (`29.3.knots20260508`) so "Always use latest" = newest on fixed-binary nodes.
- **Frontend** (`neode-ui/`):
  - `AppSidebar.vue`: rename the latest option to **"Always use the latest version"**
    (no `v` prefix), fix right padding, and make `pickSelection()` guarantee that the bound
    value is a real option (fixes the blank dropdown).
  - New `components/InstallVersionModal.vue`: full-screen version chooser shown from the
    App Store / Discover **card** install button for multi-version apps. It shows the app
    icon and `Install <name>`, with latest pre-selected. Wired into `handleInstall` in
    `Discover.vue`.
  - i18n keys: `appDetails.alwaysUseLatestVersion`, `marketplace.installModalTitle/Hint`.
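
To illustrate the `quadlet.rs` change, here is a rough shell sketch of the split between `Entrypoint=` and `Exec=`. It is not the Rust renderer: `render_entrypoint_exec` and `quadlet_quote` are hypothetical names, and the quoting rule (double-quote any argument containing a space) is a simplifying assumption rather than the renderer's actual escaping.

```sh
# Hypothetical helper: wrap an argument in double quotes if it contains a space.
quadlet_quote() {
    case $1 in
        *" "*) printf '"%s"' "$1" ;;
        *)     printf '%s' "$1" ;;
    esac
}

# Hypothetical sketch of the split described above.
# Usage: render_entrypoint_exec N ARG...
#   N    = number of leading ARGs that belong to the manifest entrypoint
#   ARGs = manifest entrypoint elements followed by the command
render_entrypoint_exec() {
    ree_n=$1
    shift
    if [ "$ree_n" -gt 0 ]; then
        # First entrypoint element overrides the image ENTRYPOINT.
        printf 'Entrypoint=%s\n' "$(quadlet_quote "$1")"
        shift
    fi
    # Everything left (rest of the entrypoint, then the command) goes to Exec=.
    ree_rest=
    for ree_arg in "$@"; do
        ree_rest="${ree_rest:+$ree_rest }$(quadlet_quote "$ree_arg")"
    done
    if [ -n "$ree_rest" ]; then
        printf 'Exec=%s\n' "$ree_rest"
    fi
}
```

With the manifest entrypoint `["sh","-lc"]`, a call such as `render_entrypoint_exec 2 sh -lc 'exec bitcoind -server=1'` prints `Entrypoint=sh` followed by `Exec=-lc "exec bitcoind -server=1"`, so the image's own `ENTRYPOINT ["bitcoind"]` is replaced instead of receiving `sh -lc …` as arguments.
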
## 3. Current live state on .228 (test node)
- Binary with both renderer fixes: **deployed** (`/usr/local/bin/archipelago`).
- New frontend bundle: **deployed** to `/opt/archipelago/web-ui` (hard-refresh to see it).
- Updated catalog: placed at `/var/lib/archipelago/app-catalog.json` as a local override.
The next hourly fetch will overwrite it with the mirror's OLDER copy until §4 publishes the new one.
- Knots: `bitcoin-knots` service held **stopped** (`package.stop`, user_stopped);
a detached `bitcoin-knots-reindex` container is rebuilding the index+UTXO (§5).
## 4. Remaining — coordinated fleet rollout (OTHER AGENT)
Do this together with the other workstream's release, AFTER both are ready:
1. **Merge** branch `bitcoin-version-bulletproof` into the release line.
2. **Build + OTA** the binary + frontend (these carry the renderer fix + UI). The renderer
fix is a **hard prerequisite** for the new images everywhere — see fleet-safety below.
3. **Publish the catalog** to the mirror (push `releases/app-catalog.json` to gitea-vps2
`main`, the raw URL nodes fetch hourly). The current catalog is **fleet-safe even before
the binary lands**: unpinned/auto-update nodes resolve via the manifest's floating
`:latest` (still the legacy image); only explicit version selection (needs the new UI)
uses the new root images.
4. **Only AFTER the binary is fleet-wide:** optionally repoint the `bitcoin-knots:latest`
tag → `29.3.knots20260508` (root) and simplify the catalog `latest` entry back to the
`:latest` tag. **Do NOT repoint `:latest` before then** — old-binary nodes fold
`Exec=sh -lc …` and would crash on an `ENTRYPOINT ["bitcoind"]` image. (Core never
worked on old binaries — it always shipped `ENTRYPOINT ["bitcoind"]` — so Core has no
such constraint.)
5. **Verify the full switch matrix** on a healthy node (§6).
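
The ordering constraint in step 4 (no `:latest` repoint until every node runs the fixed binary) can be expressed as a simple gate. This is only a sketch: how node versions are collected is not covered here, so the `node version` input lines, the helper names `version_ge` and `safe_to_repoint`, and the example version numbers are all hypothetical. It also assumes a `sort` that supports `-V` (GNU coreutils does).

```sh
# True when dotted version A >= B (relies on sort -V for version ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical gate. Usage: safe_to_repoint MIN_VERSION < nodes.txt
# stdin: one "node version" pair per line.
# Returns 0 only when every listed node runs at least MIN_VERSION.
safe_to_repoint() {
    gate_min=$1
    gate_rc=0
    while read -r gate_node gate_ver; do
        [ -n "$gate_node" ] || continue   # skip blank lines
        if ! version_ge "$gate_ver" "$gate_min"; then
            echo "BLOCKED: $gate_node runs $gate_ver (< $gate_min)"
            gate_rc=1
        fi
    done
    return "$gate_rc"
}
```

A single `BLOCKED` line means the `bitcoin-knots:latest` tag should stay on the legacy image for now.
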
## 5. Finishing .228's reindex (the remaining test-node task)
The detached `bitcoin-knots-reindex` container runs the new **root** `29.3.knots20260508`
image with `-reindex -server=0` against `/var/lib/archipelago/bitcoin`. It holds the datadir
lock, so the managed service (held stopped) can't collide. When it has connected blocks up
to ~the prior tip (height ≥ ~955800) it's done; then:
```sh
# on .228 (SSH/sudo/UI pw all: ThisIsWeb54321@)
podman stop -t 600 bitcoin-knots-reindex && podman rm bitcoin-knots-reindex
# start the managed service via RPC (sets desired=running, clears user_stopped):
# package.start {id: bitcoin-knots} (POST https://127.0.0.1/rpc/v1, CSRF: echo csrf_token cookie as X-CSRF-Token)
# verify:
podman exec bitcoin-knots sh -lc '$(command -v bitcoind) --version | head -1' # → v29.3.knots20260508
# RPC up → the Bitcoin UI populates; it syncs the gap to tip.
```
The "Bitcoin RPC connection refused (127.0.0.1:8332)" the UI shows is EXPECTED until this
swap (reindex runs with RPC off).
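
Because RPC is off during the reindex, progress has to be read from the log. A minimal sketch, assuming the reindex container writes the usual bitcoind `UpdateTip: new best=… height=N …` lines to `debug.log` in the datadir (or to the container log); `latest_connected_height` and `reindex_reached` are hypothetical helper names.

```sh
# Print the height from the last UpdateTip line read on stdin (empty if none).
latest_connected_height() {
    sed -n 's/.*UpdateTip: .* height=\([0-9][0-9]*\) .*/\1/p' | tail -n 1
}

# Usage: reindex_reached TARGET_HEIGHT < log
# Succeeds once the last connected height is at least TARGET_HEIGHT.
reindex_reached() {
    rr_target=$1
    rr_height=$(latest_connected_height)
    [ -n "$rr_height" ] && [ "$rr_height" -ge "$rr_target" ]
}
```

For example, `tail -n 2000 /var/lib/archipelago/bitcoin/debug.log | reindex_reached 955800` succeeds once blocks up to the prior tip have been connected. During the earlier block-index phase of the reindex there are no `UpdateTip` lines yet, so the check simply fails until block connection starts.
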
## 6. Switch-matrix test plan (what "bulletproof" must prove)
On a healthy node, each step must end with bitcoind running + RPC answering + syncing, with
NO `Error initializing block database` and NO data loss:
- Knots: switch `latest` → `29.3.knots20260507` → `29.3.knots20260210` → back to `latest`.
- Core: install `latest`; switch `31.0` → `28.4.0`.
- **Knots ↔ Core** (shared datadir/port): Knots→Core upgrade path (Core ≥ data version) and
the reverse. **Cross-major DOWNGRADES** (e.g. 29.x data → Core 28.4) legitimately need a
reindex — the UI already surfaces a downgrade warning; confirm it does and that confirming
reindexes cleanly rather than crash-looping.
- Reboot survival after each switch.
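
Two of the pass criteria above (the expected version is running, and the log contains no `Error initializing block database`) are easy to script per step. The sketch below covers only those two; `switch_step_ok` is a hypothetical helper, and RPC and sync health still need a separate check.

```sh
# Usage: switch_step_ok EXPECTED_VERSION VERSION_LINE < recent-log
# VERSION_LINE is the first line of `bitcoind --version` from the running container.
switch_step_ok() {
    sso_expected=$1
    sso_version_line=$2
    sso_log=$(cat)   # slurp the log from stdin
    case $sso_version_line in
        *"$sso_expected"*) ;;
        *)
            echo "FAIL: expected $sso_expected, got: $sso_version_line"
            return 1
            ;;
    esac
    if printf '%s\n' "$sso_log" | grep -q 'Error initializing block database'; then
        echo "FAIL: block database error in log"
        return 1
    fi
    echo "PASS: $sso_expected"
}
```

After a switch to `29.3.knots20260507`, one way to feed it would be `podman logs --since 10m bitcoin-knots 2>&1 | switch_step_ok 29.3.knots20260507 "$(podman exec bitcoin-knots bitcoind --version | head -1)"`.
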
## 7. Notes / assumptions
- **"29.2"** in the request doesn't exist as a Knots build (404 upstream); added as **Bitcoin
Core 29.2** (exists). Revisit if a Knots 29.2 was meant.
- Reindex is unavoidable ONLY because .228's index was already corrupted by the pre-fix
crash loop; a normal switch on the fixed binary does NOT reindex.
- Creds for .228: SSH/sudo + UI/RPC all `ThisIsWeb54321@`.