# RESUME HERE — Rust orchestrator migration

Updated: 2026-04-23 (Step 7 committed, moving to Step 8)

**To resume this work, SSH into the ThinkPad and run `opencode` from `~/Projects/archy/`. Or work from the laptop via the SSHFS mount at `~/mnt/archy-thinkpad/`.**

## Where we are

Working through the 11-step plan in [`rust-orchestrator-migration.md`](./rust-orchestrator-migration.md).

- [x] **Step 1** — `3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
- [x] **Step 2** — `34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
- [x] **Step 3** — `b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
- [x] **Step 4** — `e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore ._*)
- [x] **Step 5** — `fc39b04b` BootReconciler with Arc shutdown, 4 paused-time tests pass
- [x] **Step 6** — main.rs wire-up: construct orchestrator once, load_manifests + adopt_existing + spawn BootReconciler, thread through Server::new / ApiHandler::new / RpcHandler::new, wire shutdown Notify to SIGTERM/SIGINT. Clean `cargo check -p archipelago` (6 pre-existing warnings), container tests 43/44 pass (the one failing `test_parse_image_versions` is pre-existing and unrelated — asserts `!contains_key("NOT_AN_IMAGE")` but the retain on line 106 keeps anything ending in `_IMAGE`).
- [x] **Step 7** — `069bc4a5` bitcoin-ui pre-start hook renders nginx.conf from embedded template. New `container::bitcoin_ui` module (render fn, atomic tmp+rename, idempotent byte-compare, 8 unit tests). `ProdContainerOrchestrator::run_pre_start_hooks` fires in `install_fresh` before `create_container` and in `ensure_running` (Running+Rewritten → restart; Stopped → re-render+start). bitcoin-ui Dockerfile no longer COPYs nginx conf; arrives via runtime bind-mount (safe-failure → 404 if missing, never stale auth).
  `apps/{bitcoin,electrs,lnd}-ui/manifest.yml` land. Integration test asserts `install("bitcoin-ui")` writes substituted config to disk. 38/39 container:: tests pass (same 1 pre-existing failure).
- [ ] **Step 8a** — Delete `archipelago-reconcile.{service,timer}` + ISO builder touchpoints. Keep `reconcile-containers.sh` + `container-specs.sh` for `update.rs` OTA path. Next up.
- [ ] **Step 8b** — Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/*/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- [ ] **Step 8c** — Rename `first-boot-containers.sh` → `first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
- [ ] **Step 9** — Hot-swap + verify on .228
- [ ] **Step 10** — Hot-swap + verify on .116
- [ ] **Step 11** — Chaos matrix on both nodes

## Acceptance evidence (Steps 1–7)

`cargo test -p archipelago-container --lib` → 25/25 pass.

`cargo test -p archipelago container::` → 38/39 pass (all container:: tests; the 1 failure is pre-existing `test_parse_image_versions` — assert bug against `_IMAGE` suffix filter).

`cargo check -p archipelago` → clean, 6 warnings (dead-code on trait methods not yet exercised — expected until Step 9 hot-swap).

Unrelated test failures (identity_manager / session / wallet / mesh / credentials): 24 pre-existing on baseline `b6a04d31`, fluctuating to 25 on Step 4 — confirmed unrelated (the diff only shifted 3 fs-state tests that are independently flaky).

## Uncommitted state

Clean — only leftover is `tests/` (bats harness from prior session, not in scope for this migration).

## Answered design questions (no need to re-ask)

1. UI container naming → `archy-` prefix for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
3. Reconciler interval → 30 seconds
4. Concurrency → per-app `Mutex<()>` in a `DashMap`
5. Bash scripts → delete immediately (first-boot-containers.sh, reconcile-containers.sh, container-specs.sh, + their systemd units). Execution is now staged across Steps 8a–8c; see "Why Step 8 got split" below.
6. Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
7. Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.

## Context: which host is what

| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| `archy` (this one) | 192.168.1.116 | **Dev ThinkPad** (Lenovo X250, Debian 13, archi-thinkpad), also runs v1.7.42-alpha | archipelago | ThisIsWeb54321@ |
| `archy228` | 192.168.1.228 | Kiosk HP ProDesk, runs v1.7.41-alpha, missing bitcoin-ui + lnd-ui | password123 | archipelago |

Both are development alpha nodes — **full destructive latitude**, no need to ask before stop/start/rebuild.

## Next action

**Step 8a — Delete the reconcile systemd timer path.** Safe, isolated, atomic.

Files to delete:

1. `image-recipe/configs/archipelago-reconcile.service` (14 LOC — replaced by BootReconciler)
2. `image-recipe/configs/archipelago-reconcile.timer` (14 LOC — replaced by BootReconciler)

ISO builder edits in `image-recipe/build-auto-installer-iso.sh`:

- L412-413: drop `COPY archipelago-reconcile.{service,timer}`
- L449: drop `systemctl enable archipelago-reconcile.timer`
- L542-543: drop the `cp archipelago-reconcile.{service,timer}` block

**Keep** `scripts/reconcile-containers.sh` + `scripts/container-specs.sh` because `core/archipelago/src/api/rpc/package/update.rs` still shells out to reconcile-containers.sh during OTA updates. Porting update.rs to `ContainerOrchestrator::upgrade()` requires manifests for every container it touches — that's Step 8b's scope.

No Rust changes.
Atomic single commit. Full ISO build test on .116 before commit per user ask.

**Step 8b/8c come later** — they require porting 25+ container creations from `first-boot-containers.sh` into `apps/*/manifest.yml`, which is a multi-day scope. Not tonight.

---

### Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:

- `first-boot-containers.sh` creates **30+ containers** with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.

---

# Archipelago — Current State, Plan, and Releases

Updated: 2026-04-22

This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in [`bulletproof-containers.md`](./bulletproof-containers.md).


---

## Current state

### Fleet status

All four Gitea mirrors are synced to v1.7.40-alpha:

| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via `podman system renumber` — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |

Fleet test nodes:

| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui` after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | unreachable today | |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |

### Known open issues (drives the plan below)

1. **UI companion containers disappear** on .228 after daemon restarts — no auto-recreate (to be fixed by v1.7.45 Quadlet migration)
2. **bitcoin.conf rpcauth drifts** from canonical secret → ElectrumX "Daemon connection problem" (to be fixed by v1.7.43 reconcile::derived)
3. **`host.containers.internal`** resolves to LAN gateway inside containers on some versions (to be fixed by v1.7.42 containers.conf)
4. **Podman state DB loss** requires manual recovery (to be fixed by v1.7.44 startup self-heal)
5. **LND "Connect Wallet" info** vanishing after crashes — symptom of the same drift class as #2
6. **ElectrumX not syncing** on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled

### Recent field incident (2026-04-22)

- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in [`bulletproof-containers.md`](./bulletproof-containers.md).

---

## Plan

We're shipping a level-triggered **reconciler + Quadlet** architecture over six incremental releases. Each release closes one failure mode. See [`bulletproof-containers.md`](./bulletproof-containers.md) for the full design, code layout, test harness, chaos matrix, sources.

### Release roadmap

| Release | Closes | What lands | Status |
|---|---|---|---|
| **v1.7.41** | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts | **in flight — deploying to .228 for test** |
| **v1.7.42** | FM4 (`host.containers.internal` wrong) | `/etc/containers/containers.conf` w/ `host_containers_internal_ip = 10.89.0.1`; every container gets `--add-host=host.archipelago:10.89.0.1` | pending |
| **v1.7.43** | FM2 (config drift) | `reconcile::derived::render_bitcoin_conf` — pure fn over canonical secret, rewrites on drift. Same for `lnd.conf` | pending |
| **v1.7.44** | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via `/run/user/$UID/*` clear + `system renumber` | pending |
| **v1.7.45** | FM1 + FM3 (companion orphans) | `archy-bitcoin-ui` → Quadlet `.container` unit in `/etc/containers/systemd/`. systemd (not archipelago) owns it | pending |
| **v1.7.46** | — | `archy-lnd-ui` → Quadlet | pending |
| **v1.7.47** | — | `archy-electrs-ui` → Quadlet | pending |
| **v1.7.48+** | all (full daemon refactor) | `core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too | pending |

Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.

---

## Release history

### [v1.7.41-alpha](/releases/v1.7.41-alpha/) — IN FLIGHT — 2026-04-22

**Post-OTA auto-rollback.** After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.

Changes:

- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot — probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server

### [v1.7.40-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.40-alpha/) — 2026-04-22

**Proper fix for the 500 error.** Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly `chmod 755` before tar; `--mode=u=rwX,go=rX` normalizes archive perms; pre-ship assertion aborts release if `tar tvzf | head -1` isn't `drwxr-xr-x`.

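The pre-ship assertion described above comes down to reading the permission column of the first archive entry. The following is a minimal Rust sketch of that comparison, assuming the verbose tar listing has already been captured as text. The real gate lives in the shell script `create-release-manifest.sh`; the function names and any sample listing passed in are hypothetical illustrations.

```rust
/// Mode string the release notes expect on the archive's root directory entry.
const EXPECTED_ROOT_MODE: &str = "drwxr-xr-x";

/// Returns the permission column of the first entry in a verbose tar listing,
/// or `None` when the listing is empty or its first line is blank.
fn first_entry_mode(listing: &str) -> Option<&str> {
    listing.lines().next()?.split_whitespace().next()
}

/// Hypothetical pre-ship gate: `Ok` only when the first entry is `drwxr-xr-x`.
fn check_root_mode(listing: &str) -> Result<(), String> {
    match first_entry_mode(listing) {
        Some(mode) if mode == EXPECTED_ROOT_MODE => Ok(()),
        Some(mode) => Err(format!(
            "root entry is {}, expected {}",
            mode, EXPECTED_ROOT_MODE
        )),
        None => Err("empty tar listing".to_string()),
    }
}
```

A listing whose first line starts with `drwx------` (the v1.7.38/39 failure) would yield an `Err`, which a release tool could turn into a non-zero exit before anything ships.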

Changes:

- `scripts/create-release-manifest.sh`: pre-tar chmod + tar --mode flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)

### [v1.7.39-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.39-alpha/) — 2026-04-22

**Hotfix attempt** for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in `main.rs` and post-extract chmod in `update.rs` OTA applier.

### [v1.7.38-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.38-alpha/) — 2026-04-22

**Onboarding auto-heal + silent logins + App Store trim.**

Changes:

- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
- `useOnboarding`: tri-state — backend-unreachable no longer defaults to `/onboarding/intro`
- Login sounds gated by `isFirstInstallPhase()` — silent after onboarding, typing sounds unaffected
- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
- Deleted 15 image versions from tx1138, .168, gitea-local registries
- AIUI baked into release tarball via `demo/aiui/`
- `prebuild` hook syncs `app-catalog/catalog.json` → `public/catalog.json`

(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)

### [v1.7.37-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.37-alpha/) — 2026-04-22

**Bitcoin Core install fixes + dynamic node UI + full-archive default.**

- Bitcoin Core passes explicit `-rpcbind/-rpcallowip/etc.` CLI args so vanilla image exposes RPC
- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
- Storage (Full Archive · X GB / Pruned) indicator on dashboard
- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
- Pull fallback to `docker.io` when no mirror carries the image
- Removed `prune=550` hardcode — full archive default

---

## Key docs

- [`bulletproof-containers.md`](./bulletproof-containers.md) — full reconcile architecture, code layout, test matrix, chaos scenarios, sources
- [`BETA-RELEASE-CHECKLIST.md`](./BETA-RELEASE-CHECKLIST.md) — existing beta checklist
- [`BETA-ISSUES-20260328.md`](./BETA-ISSUES-20260328.md) — prior beta-blocker tracking
- [`hotfix-process.md`](./hotfix-process.md) — release workflow
- [`architecture.md`](./architecture.md) — system architecture overview

---

## How to resume

1. Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
2. Read [`bulletproof-containers.md`](./bulletproof-containers.md) for the current plan
3. Check task list (`/list` or via Claude Code) for the in-flight release
4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified
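
For orientation on the in-flight v1.7.41 work: the post-OTA check (probe the local web UI, roll back unless it answers 200 within 90 seconds) can be pictured as a bounded polling loop. The sketch below is an illustration under stated assumptions, not the code in `update.rs`. `Verdict`, `verify_with_budget`, the injected `probe` and `wait` closures, and any polling interval are hypothetical; only the 200 status and the 90-second window come from the release notes.

```rust
use std::time::Duration;

/// Outcome of the post-update check (hypothetical type, not from `update.rs`).
#[derive(Debug, PartialEq)]
enum Verdict {
    Keep,
    RollBack,
}

/// Polls `probe` until it reports HTTP 200 or the time budget is spent.
/// `probe` returns the status code, or `None` when the request itself failed.
/// `wait` is injected so the loop can be exercised without real sleeping.
fn verify_with_budget(
    mut probe: impl FnMut() -> Option<u16>,
    mut wait: impl FnMut(Duration),
    budget: Duration,
    interval: Duration,
) -> Verdict {
    // Guard against a zero interval, which would never exhaust the budget.
    let interval = interval.max(Duration::from_millis(1));
    let mut elapsed = Duration::ZERO;
    loop {
        if probe() == Some(200) {
            return Verdict::Keep;
        }
        // Stop once another wait would overshoot the budget.
        if elapsed + interval > budget {
            return Verdict::RollBack;
        }
        wait(interval);
        elapsed += interval;
    }
}
```

In a real caller, `wait` could simply be `std::thread::sleep`, `probe` would issue the request to `https://127.0.0.1/`, and a `RollBack` verdict would trigger the restore of `web-ui.bak` and the `rollback_update()` call listed in the release notes.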