RESUME HERE — Rust orchestrator migration
Updated: 2026-04-23 (Dashboard Stop UX bug diagnosed; async-spawn fix fully designed, ready to implement)
To resume this work, SSH into the ThinkPad and run opencode from ~/Projects/archy/. Or work from the laptop via the SSHFS mount at ~/mnt/archy-thinkpad/.
⚡ NEXT SESSION — START HERE
Goal: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: "best server containers in the world". Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live Stopping… label.
Bug being fixed
Dashboard sequence when user clicks Stop LND:
- UI collapses Start/Stop buttons to a single spinner-button ("Stopping…") via `loadingApps.add('lnd')`.
- Frontend calls the `container-stop` RPC. Server runs `podman stop -t 330 lnd` synchronously inside the RPC handler (via `orchestrator.stop()`). The RPC blocks up to 5.5 min for LND (330s timeout + overhead).
- Meanwhile the 30-second package-scan loop in `server.rs:scan_and_update_packages` keeps running. It rebuilds `PackageDataEntry` from podman inspect, which still reports `running` because the stop hasn't completed, and blindly overwrites the store entry at `server.rs:854`.
- `container-list` RPC reads the `state_manager` snapshot → returns `state = "running"`.
- Frontend polling sees `running` → `getAppState()` returns `'running'` → the two-button (Start | Stop) block re-renders → the transitional button disappears → UI looks like the stop silently failed.
- Eventually `podman stop` finishes → next scan → state flips to `Stopped` → buttons change again.
Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".
Decisions already locked in (do not re-ask)
- Full scope fix (not minimal hotfix). User chose "Go full scope, do it right".
- Async-spawn lives in the RPC layer, not in the `ContainerOrchestrator` trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
- `PackageState` already has `Stopping`/`Starting`/`Restarting`/`Installing`/`Updating`/`Removing` variants (enum at `core/archipelago/src/data_model.rs:107-124`). No schema change needed.
- UI collapses to one full-width button with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when `not-installed`).
- Helper API shape: `RpcHandler::spawn_transitional(op: Op, app_id: String)` where `Op` is an enum `{Stop, Start, Restart}`. Helper dispatches to `orchestrator.stop/start/restart` internally, knows each op's transitional+final states, handles error → revert + `install_log()`.
- `mark_user_stopped` must run BEFORE the spawn (preserves ordering the crash recovery layer depends on; see `runtime.rs:145-148`).
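A minimal sketch of the `Op` enum described in the helper API decision, showing how each operation could carry its own transitional and final states. The `PackageState` here is a cut-down stand-in for the real enum in `data_model.rs`; the log prefix strings and the assumption that a successful Start or Restart ends in `Running` are illustrative and not confirmed by the codebase.

```rust
/// Stand-in for the real `PackageState` (subset only, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Starting,
    Stopping,
    Restarting,
}

/// Lifecycle operations the helper would accept.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Stop,
    Start,
    Restart,
}

impl Op {
    /// State shown while the operation is in flight.
    pub fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }

    /// State written once the orchestrator call succeeds
    /// (Start/Restart ending in Running is an assumption).
    pub fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }

    /// Prefix for log lines; the exact wording is hypothetical.
    pub fn log_prefix(self) -> &'static str {
        match self {
            Op::Stop => "STOP",
            Op::Start => "START",
            Op::Restart => "RESTART",
        }
    }
}
```

Keeping the state knowledge on the enum means the spawn helper itself needs no per-operation branching beyond the dispatch call.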
Implementation order (4 commits, local only)
Commit 1 — feat(rpc): spawn_transitional helper for async lifecycle ops
- New file: `core/archipelago/src/api/rpc/transitional.rs` (or extend `container.rs`; prefer new file for cohesion with future stacks/package variants)
- `enum Op { Stop, Start, Restart }` with `transitional_state()`, `final_state_on_success()`, `log_prefix()`, and async `dispatch(&orch, &app_id)` method
- `impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }`
  - Capture `Arc<dyn ContainerOrchestrator>` + `Arc<StateManager>` clones
  - Set transitional state via `state_manager.update_data()` (if entry exists; skip if not, since Start on never-installed shouldn't create an entry)
  - `tokio::spawn(async move { ... })`
  - Inside spawn: `install_log("{LOG_PREFIX}: {app_id}")`, `op.dispatch(&orch, &app_id).await`, on success set final state, on error log + `install_log("{LOG_PREFIX} FAIL: …")` + revert state to previous (cache pre-transition state in a local)
  - Return `Ok(())` immediately after spawn
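The control flow of the helper (set transitional state, spawn, write final state or revert) can be sketched with the standard library alone. This is not the planned implementation: it swaps `tokio::spawn` for `std::thread::spawn` so the example runs without dependencies, covers only the Stop case, and uses a hypothetical `Store` map in place of `StateManager`. The names `spawn_stop`, `Store`, and `State` are invented for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Running,
    Stopped,
    Stopping,
}

/// Hypothetical stand-in for `Arc<StateManager>`: app_id -> state.
pub type Store = Arc<Mutex<HashMap<String, State>>>;

/// Marks `app_id` as `Stopping`, runs `stop` on a background thread, then
/// writes `Stopped` on success or reverts to the cached state on error.
/// Returns `None` without spawning when the app has no store entry.
pub fn spawn_stop<F>(store: &Store, app_id: &str, stop: F) -> Option<JoinHandle<()>>
where
    F: FnOnce(&str) -> Result<(), String> + Send + 'static,
{
    // Cache the pre-transition state so a failure can revert to it.
    let previous = {
        let mut map = store.lock().unwrap();
        let entry = map.get_mut(app_id)?; // no entry: do not create one
        let prev = *entry;
        *entry = State::Stopping;
        prev
    };
    let store = Arc::clone(store);
    let app_id = app_id.to_string();
    // The caller gets the handle back immediately; the slow work runs here.
    Some(thread::spawn(move || {
        let next = match stop(&app_id) {
            Ok(()) => State::Stopped,
            Err(err) => {
                eprintln!("STOP FAIL: {app_id}: {err}");
                previous
            }
        };
        store.lock().unwrap().insert(app_id, next);
    }))
}
```

The lock is released before the thread is spawned, so readers such as a list RPC would see `Stopping` for the whole duration of the slow call.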
Commit 2 — fix(rpc): async container stop/start/restart; widen state mapping
- `api/rpc/container.rs:85-107`: rewrite `handle_container_stop` body: `validate_app_id`, `mark_user_stopped`, `spawn_transitional(Op::Stop, app_id.to_string()).await?`, return `Ok(json!({ "status": "stopping" }))`
- `api/rpc/container.rs:61-83`: rewrite `handle_container_start`: `clear_user_stopped`, `spawn_transitional(Op::Start, …)`, return `{ "status": "starting" }`
- Add `handle_container_restart` (currently missing in `container.rs`; only exists as `package.restart` at `runtime.rs:176-242`). Register RPC route name `container-restart`. Add matching frontend client method in `container-client.ts`.
- `api/rpc/container.rs:148-154`: widen the `container-list` state mapping: add arms for `Stopping → "stopping"`, `Starting → "starting"`, `Restarting → "restarting"`, `Installing → "installing"`, `Updating → "updating"`, `Removing → "removing"`, `Installed → "installed"`, `CreatingBackup`/`RestoringBackup`/`BackingUp` → their kebab-case strings. No more `"unknown"` fallback unless the variant is genuinely unknown.
- Mirror same spawn treatment in `api/rpc/package/runtime.rs`: `handle_package_start` (L28-119), `handle_package_stop` (L122-173), `handle_package_restart` (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) inside the spawned future, not in the RPC body.
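The widened state mapping might look roughly like the following. The enum is a stand-in listing only the variants named in this document (the real `PackageState` may have more), and `state_str` is a hypothetical name. The match has no wildcard arm on purpose, so the compiler would flag any variant added later instead of silently mapping it to `"unknown"`.

```rust
/// Stand-in for `PackageState`; only variants mentioned in this plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Installing,
    Installed,
    Starting,
    Running,
    Stopping,
    Stopped,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

/// Maps every variant to the kebab-case string the frontend would receive.
pub fn state_str(state: PackageState) -> &'static str {
    match state {
        PackageState::Installing => "installing",
        PackageState::Installed => "installed",
        PackageState::Starting => "starting",
        PackageState::Running => "running",
        PackageState::Stopping => "stopping",
        PackageState::Stopped => "stopped",
        PackageState::Restarting => "restarting",
        PackageState::Updating => "updating",
        PackageState::Removing => "removing",
        PackageState::CreatingBackup => "creating-backup",
        PackageState::RestoringBackup => "restoring-backup",
        PackageState::BackingUp => "backing-up",
    }
}
```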
Commit 3 — fix(state): preserve transitional state across container scans
- `server.rs:847-857`: in the merge loop, before the `merged.insert(id.clone(), pkg.clone())` overwrite, check `merged.get(id).state` and skip overwrite if it's transitional: `matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)`
- Still allow non-state fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep `existing.state` but merge updated fields from `pkg`. Write a tiny helper `merge_preserving_transitional(existing, fresh) -> PackageDataEntry`.
- Unit test: construct `existing.state = Stopping`, `fresh.state = Running`, assert `merged.state` stays `Stopping`.
- Also check: is there a timeout escape hatch? If `Stopping` is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck `Stopping` forever. Mitigation: track a `transitional_since: Instant` in the entry (not persisted, just an in-memory side table on StateManager), and if more than 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up; lean toward including it, because fleet reliability matters.
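A possible shape for the merge helper, including the 2× timeout escape hatch, is sketched below. `PackageDataEntry` is reduced to two fields and the extra `transitional_for` / `stop_timeout` parameters are assumptions added for the example (the plan's signature takes only `existing` and `fresh`); the age would come from the proposed in-memory `transitional_since` side table.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Stopping,
    Starting,
}

/// Reduced stand-in for the real entry: one state field, one non-state field.
#[derive(Debug, Clone, PartialEq)]
pub struct PackageDataEntry {
    pub state: PackageState,
    pub lan_address: Option<String>,
}

fn is_transitional(state: PackageState) -> bool {
    matches!(state, PackageState::Stopping | PackageState::Starting)
}

/// Takes every field from `fresh`, but keeps `existing.state` while it is
/// transitional and has not exceeded twice the stop timeout.
pub fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: &PackageDataEntry,
    transitional_for: Option<Duration>, // age of the transition, if tracked
    stop_timeout: Duration,
) -> PackageDataEntry {
    let mut merged = fresh.clone();
    // Escape hatch: a transition older than 2x the timeout is treated as stuck.
    let stale = transitional_for.map_or(false, |age| age > stop_timeout * 2);
    if is_transitional(existing.state) && !stale {
        merged.state = existing.state;
    }
    merged
}
```

With LND's 330s timeout, a `Stopping` entry would survive scans for up to 660s before podman's view is allowed to win.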
Commit 4 — fix(ui): single-button lifecycle control with transitional labels
- `neode-ui/src/api/container-client.ts`: extend `ContainerStatus.state` union to: `'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'`. Add `restartContainer(appId)` method calling `container-restart`.
- `neode-ui/src/stores/container.ts`: add computed `getAppVisualState(appId)` that returns one of: `'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'`. Maps `exited→stopped`, `created→stopped`, `paused→stopped`, `installed→stopped`. Add `restartContainer(appId)` action (sets `loadingApps` for request dedup, calls client, does NOT `fetchContainers` immediately because server will broadcast state; a final `fetchContainers` after a short delay can backstop if WebSocket push is absent).
- `neode-ui/src/views/ContainerApps.vue:85-136`: replace the two-button conditional with a single full-width button bound to `getAppVisualState(app.id)`. Table:

  | visual state | click action | label | spinner | disabled |
  |---|---|---|---|---|
  | `not-installed` | installApp | Install | no | no |
  | `running` | stopContainer | Stop | no | no |
  | `stopped` | startContainer | Start | no | no |
  | `starting` | - | Starting… | yes | yes |
  | `stopping` | - | Stopping… | yes | yes |
  | `restarting` | - | Restarting… | yes | yes |
  | `installing` | - | Installing… | yes | yes |
  | `updating` | - | Updating… | yes | yes |
  | `removing` | - | Removing… | yes | yes |

  - Add a separate Restart button next to the primary one when state is `running`, calling new `restartContainer` action. Restart button hides while transitional.
- `neode-ui/src/views/ContainerAppDetails.vue:83` (and full stop/start button blocks around L220, L232): mirror the same single-button pattern.
- Also audit line 239 of `ContainerApps.vue` (`some((app) => store.getAppState(app.id) === 'created')`) and the logic around lines 276, 295, 309, 312; make sure they use `getAppVisualState` where appropriate.
Verification gates (do not skip)
- `~/.cargo/bin/cargo check -p archipelago` on .116 via SSH
- `~/.cargo/bin/cargo test -p archipelago` on .116 via SSH; at least the new merge helper test must pass
- Build release binary on .116: `nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown`. Poll until done.
- SCP binary to .228 `/usr/local/bin/archipelago`, back up prior to `/usr/local/bin/archipelago.bak-pre-async-stop`. `sudo systemctl restart archipelago` on .228.
- Manual LND stop test on .228:
  - Open dashboard, confirm LND is Running (first: `ssh archipelago@192.168.1.228 'podman start lnd'`, since LND is currently Exited(0) from the demo)
  - Click Stop
  - Expected: button immediately becomes "Stopping…" with spinner (RPC returns <1s)
  - Dashboard should stay on "Stopping…" for ~5 min
  - Then flip to "Start" button with label "Start"
  - At no point should it revert to "Running" mid-stop
- Same test with Bitcoin Core stop (longest timeout, 600s)
- Frontend build: `cd ~/Projects/archy/neode-ui && npm run type-check && npm run build`. Rsync `dist/` to `archipelago@192.168.1.228:/var/lib/archipelago/web-ui/` (or wherever the active web root is; check `/etc/nginx` on .228 first).
- Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
Key files (exact lines of interest)
- `core/archipelago/src/api/rpc/container.rs:85-107`: `handle_container_stop` (blocking; target of fix)
- `core/archipelago/src/api/rpc/container.rs:61-83`: `handle_container_start`
- `core/archipelago/src/api/rpc/container.rs:148-154`: narrow state mapping (drops transitional → "unknown")
- `core/archipelago/src/api/rpc/package/runtime.rs:11-24`: `stop_timeout_secs` table (reference, unchanged)
- `core/archipelago/src/api/rpc/package/runtime.rs:122-173`: `handle_package_stop` (also blocking, mirror treatment)
- `core/archipelago/src/api/rpc/package/runtime.rs:28-119`: `handle_package_start`
- `core/archipelago/src/api/rpc/package/runtime.rs:176-242`: `handle_package_restart`
- `core/archipelago/src/api/rpc/package/progress.rs`: existing broadcast pattern to mirror (`set_install_progress`, `set_uninstall_stage`)
- `core/archipelago/src/api/rpc/mod.rs:62-100`: `RpcHandler` struct (already holds `Arc<dyn ContainerOrchestrator>` + state_manager)
- `core/archipelago/src/server.rs:812-857`: `scan_and_update_packages` (merge loop at L850-857 is where transitional-state clobber happens)
- `core/archipelago/src/container/docker_packages.rs:636-663`: `convert_state` + `package_state_str` (read-only reference, no change)
- `core/archipelago/src/container/traits.rs`: `ContainerOrchestrator` trait (stays synchronous, do not change)
- `core/archipelago/src/crash_recovery.rs`: `mark_user_stopped` / `clear_user_stopped` (call order preserved)
- `core/archipelago/src/data_model.rs:107-124`: `PackageState` enum (no change; all variants exist)
- `neode-ui/src/api/container-client.ts`: `ContainerStatus` type + RPC methods (extend)
- `neode-ui/src/stores/container.ts:93-312`: Pinia store (add `getAppVisualState`, add `restartContainer` action)
- `neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383`: two-button block + state reads
- `neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232`: details page Stop/Start
Chaos harness (not in repo — lives on .116)
- `archipelago@192.168.1.116:~/ui-chaos/`: deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
- `/tmp/chaos/` on laptop: canonical source for rsync to .116.
- Run: `cd ~/ui-chaos && npx playwright test tests/<spec>`
- Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
- Uses SSH+Playwright hybrid per design; includes the `bash -lc '<escaped>'` single-quote fix for ssh argv flattening and JSON-parsed `podman inspect` instead of Go templates.
Pre-existing bugs still deferred (do not fix until Stop UX lands)
- `archipelago --version` spawns server (should be a pure CLI query)
- RPC unknown-method returns generic error (should return method-not-found with the bad method name)
- `docker_packages.rs` filters out UI containers (`archy-lnd-ui`, `archy-electrs-ui`); some views need them visible
- `lnd.lan_address` stale on .228
- first-boot silent failure on some hardware
- `web-ui.failed.*` scar on .228 (benign systemd unit state)
- `test_parse_image_versions` pre-existing broken assertion; fix or `#[ignore]` when touching that area
Host reference
| Host | IP | Role | Dashboard pw | Sudo pw | SSH |
|---|---|---|---|---|---|
| archy (ThinkPad X250) | 192.168.1.116 | dev host, Debian 13, repo at `~/Projects/archy/` | archipelago | ThisIsWeb54321@ | key installed |
| archy228 (HP ProDesk) | 192.168.1.228 | prod kiosk, new Rust orchestrator binary | password123 | archipelago (NOPASSWD:ALL via `/etc/sudoers.d/archipelago-ci`) | key installed |

- Laptop SSHFS mount: `~/mnt/archy-thinkpad/` (edits OK, git/cargo via SSH)
- Cargo path over SSH: `~/.cargo/bin/cargo` (non-interactive login has no cargo in PATH)
- Release model: local commit + tag only; user pushes to 4 Gitea mirrors personally
- Full destructive latitude on both nodes. Announce multi-hour ops. Don't ask for routine stop/start/rebuild permission.
Where we are
Working through the 11-step plan in rust-orchestrator-migration.md.
- Step 1: `3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
- Step 2: `34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
- Step 3: `b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
- Step 4: `e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore `._*`)
- Step 5: `fc39b04b` BootReconciler with Arc shutdown, 4 paused-time tests pass
- Step 6: `48f08aa3` main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
- Step 7: `069bc4a5` bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
- Step 8a: `a0707f4d` retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
- Step 9: Hot-swap on .228 verified. All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
- .228 dashboard bugs: ExtraHost `192.168.1.254` bug (`3ee192ba`) + LND macaroon permission bug (`be960023`). See "Post-Step 9 bug hunt" below.
- Step 8b: Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- Step 8c: Rename `first-boot-containers.sh` → `first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
- Step 10: Hot-swap + verify on .116 (adoption-heavy test; .116 already has all containers running)
- Step 11: Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)
Post-Step 9 bug hunt (.228, 2026-04-23)
User reported three visible dashboard bugs after Step 9 verification:
- LND — "no connect details or QR"
- ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
- bitcoin-core — in scope for chaos testing
Root cause #1 (ExtraHost, commit 3ee192ba): scripts/first-boot-containers.sh computed HOST_GATEWAY from ip route show default, which returns the LAN router (e.g. 192.168.1.254), not the gateway to the host. Every container configured with --add-host=host.containers.internal:$HOST_GATEWAY was dialing the WiFi router instead of the host. LND crash-looped with dial tcp 192.168.1.254:8332: connection refused; ElectrumX's DAEMON_URL hit the same dead end; any archy-net bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic host-gateway literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected --add-host; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).
Root cause #2 (macaroon permissions, commit be960023): LND's admin.macaroon lives at /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (getinfo, connect-info, export-channel-backup) plus the shared lnd_client() helper failed with "Failed to read LND admin macaroon". Confirmed pre-existing on .116 too (long-standing bug unrelated to Step 9). Fix: centralised the path as LND_ADMIN_MACAROON_PATH, added a read_lnd_admin_macaroon() helper in api/rpc/lnd/mod.rs that tries direct read first then falls back to sudo -n cat (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — curl -k https://<host>/lnd-connect-info now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
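The direct-read-then-`sudo -n cat` pattern described for `read_lnd_admin_macaroon()` could be sketched as follows. The function name `read_with_sudo_fallback` is hypothetical, and falling back only on `PermissionDenied` is an assumption of this sketch; the real helper may fall back on any read error.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Reads `path` directly; if the process lacks permission, retries through
/// non-interactive sudo (`sudo -n cat <path>`), which fails fast instead of
/// prompting when no passwordless rule applies.
pub fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        Err(err) if err.kind() == io::ErrorKind::PermissionDenied => {
            let output = Command::new("sudo")
                .arg("-n")
                .arg("cat")
                .arg(path)
                .output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!("sudo -n cat failed for {}", path.display()),
                ))
            }
        }
        // Anything else (e.g. file missing) is reported unchanged.
        Err(err) => Err(err),
    }
}
```

Routing every call site through one function like this keeps the path and the fallback policy in a single place, which matches the centralisation described above.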
Step 9 evidence (.228, 2026-04-23)
- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (`be960023`); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
- DEV_MODE override disabled (`override.conf` → `override.conf.disabled-pre-step9`).
- `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` populated.
- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
- Post-start snapshot:
  - `🔗 Adopted 1 existing container(s): ["electrs-ui"]`: adoption of 13h-running container worked without recreation
  - `🔄 Boot reconciler started (interval: 30s)`: every 30s, all three app_ids reach `NoOp` after the initial install pass
  - `bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18`: pre-start hook fires in `install_fresh`
  - `curl localhost:8334` → HTTP 200 (bitcoin-ui), `:8081` → 200 (lnd-ui), `:50002` → 200 (electrs-ui)
  - OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)
Bugs fixed this session
- `parse_memory_limit` truncation bug (`732df1b8`): lowercased "128Mi" → "128mi" → `trim_end_matches('m')` → "128i" → f64 parse fails → `None.unwrap_or(0)` → OCI `memory.limit:0` → systemd rejects MemoryMax=0. 6 regression tests; `create_container` now omits instead of emitting 0.
- `archipelago.service` cgroup delegation missing (`ba83f9bc`): belt-and-braces `Delegate=memory pids cpu io`.
- ExtraHost `192.168.1.254` (`3ee192ba`): see Post-Step 9 bug hunt above.
- LND admin.macaroon unreadable (`be960023`): see Post-Step 9 bug hunt above.
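For reference, a parser that handles the IEC suffixes without the lowercase-and-trim step might look like the sketch below. It is not the committed fix: it parses integers only (the notes above mention an f64 parse in the real code), and treating a bare number as bytes is an assumption. Returning `Option` lets the caller omit the limit entirely rather than emit 0.

```rust
/// Parses strings such as "128Mi" into a byte count.
/// Returns `None` for anything it does not recognise, so the caller can
/// leave the memory limit unset instead of writing 0.
pub fn parse_memory_limit(input: &str) -> Option<u64> {
    let s = input.trim();
    // Match the two-character suffix as a unit, case preserved.
    let (digits, multiplier) = if let Some(n) = s.strip_suffix("Ki") {
        (n, 1u64 << 10)
    } else if let Some(n) = s.strip_suffix("Mi") {
        (n, 1u64 << 20)
    } else if let Some(n) = s.strip_suffix("Gi") {
        (n, 1u64 << 30)
    } else {
        (s, 1) // assumption: a bare number means bytes
    };
    let value: u64 = digits.trim().parse().ok()?;
    value.checked_mul(multiplier)
}
```

For the limits listed in the Step 9 evidence, `"128Mi"` yields 134217728 bytes and `"64Mi"` yields 67108864 bytes.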
Commits made this session
3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer} (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)
Branch is 19 commits ahead of tx1138/main (local only — user pushes to mirrors personally).
Uncommitted state
Clean. Only untracked: tests/ (bats harness from prior session, not in scope), tmp-dump-spec.py (scratch).
Answered design questions (no need to re-ask)
- UI container naming → `archy-<app_id>` for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
- BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
- Reconciler interval → 30 seconds
- Concurrency → per-app `Mutex<()>` in a `DashMap`
- Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
- Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
- Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.
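The "atomic tmp+rename" step for the rendered nginx.conf can be illustrated with a small standard-library sketch. The function name `write_atomic` and the `.tmp` suffix are hypothetical; the sketch relies on `rename` replacing the target atomically when both paths are on the same filesystem (the POSIX behaviour), and it does not fsync, so it is about readers never seeing a half-written file rather than crash durability.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Writes `contents` to `<target>.tmp`, then renames it over `target`.
/// A reader (or a bind-mounted container) sees either the old file or the
/// new one, never a partially written one.
pub fn write_atomic(target: &Path, contents: &str) -> io::Result<()> {
    // Keep the temp file next to the target so both share a filesystem.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    fs::write(&tmp, contents)?;
    fs::rename(&tmp, target)
}
```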
Context: which host is what
| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| archy | 192.168.1.116 | Dev ThinkPad (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| archy228 | 192.168.1.228 | Kiosk HP ProDesk. Step 9 landing zone, now running Rust-orchestrator binary in prod mode. | password123 | archipelago |
Both are development alpha nodes — full destructive latitude, no need to ask before stop/start/rebuild.
Next action
Step 10 — Hot-swap on .116.
Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.
Steps:
- Disable DEV_MODE on .116 (check if override.conf exists in `/etc/systemd/system/archipelago.service.d/`)
- Stage the already-built binary at `~/Projects/archy/core/target/release/archipelago` → `/usr/local/bin/archipelago.new`
- Ensure `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` present (copy from repo)
- Ensure `/opt/archipelago/docker/bitcoin-ui/` matches the Step-7 layout (no baked nginx.conf)
- Snapshot: `podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}"` → save to `/tmp/pre-step10-containers.txt`
- `systemctl stop archipelago` → install binary → `systemctl start archipelago`
- Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
- If broken → restore `.bak` binary, re-enable DEV_MODE override.
- Commit STATUS.md update.
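One way to check the "no container was recreated" condition is to diff the pre-swap snapshot against a fresh one taken with the same `podman ps` format string. The sketch below assumes each line is `name<TAB>status<TAB>created_at` as produced by that format, and that an unchanged `CreatedAt` means the container was adopted in place; the function names are invented for illustration, and a plain restart (same container, new uptime) would not be caught by this comparison.

```rust
use std::collections::HashMap;

/// Parses `name\tstatus\tcreated_at` lines into name -> created_at.
/// Lines with fewer than three tab-separated columns are skipped.
fn created_by_name(snapshot: &str) -> HashMap<&str, &str> {
    snapshot
        .lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let name = cols.next()?;
            let _status = cols.next()?;
            let created = cols.next()?;
            Some((name, created))
        })
        .collect()
}

/// Returns the sorted names of containers from `pre` that are missing in
/// `post` or whose creation timestamp changed (i.e. were recreated).
pub fn recreated_or_missing(pre: &str, post: &str) -> Vec<String> {
    let before = created_by_name(pre);
    let after = created_by_name(post);
    let mut changed: Vec<String> = before
        .iter()
        .filter(|(name, created)| after.get(*name) != Some(*created))
        .map(|(name, _)| name.to_string())
        .collect();
    changed.sort();
    changed
}
```

An empty result would support the adoption claim; any name in the list is a container to inspect before committing the STATUS.md update.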
Risk on .116: If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago.
After Step 10 we are blocked on Step 8b (multi-day manifest ports) before Step 11 (chaos matrix).
Why Step 8 got split (discovered 2026-04-23)
Original plan was one commit "delete bash + edit ISO builder". But on investigation:
- `first-boot-containers.sh` creates 30+ containers with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.
Archipelago — Current State, Plan, and Releases
Updated: 2026-04-22
This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in bulletproof-containers.md.
Current state
Fleet status
All four Gitea mirrors are synced to v1.7.40-alpha:
| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via podman system renumber — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |
Fleet test nodes:
| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via systemd-run chmod 755 /opt/archipelago/web-ui after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | unreachable today | |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |
Known open issues (drives the plan below)
1. UI companion containers disappear on .228 after daemon restarts; no auto-recreate (fixed by v1.7.45 Quadlet migration)
2. bitcoin.conf rpcauth drifts from canonical secret → ElectrumX "Daemon connection problem" (fixed by v1.7.43 reconcile::derived)
3. `host.containers.internal` resolves to LAN gateway inside containers on some versions (fixed by v1.7.42 containers.conf)
4. Podman state DB loss requires manual recovery (fixed by v1.7.44 startup self-heal)
5. LND "Connect Wallet" info vanishing after crashes; symptom of the same drift class as #2
6. ElectrumX not syncing on .228; downstream of #2, will resolve when bitcoin.conf is reconciled
Recent field incident (2026-04-22)
- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in `bulletproof-containers.md`.
Plan
We're shipping a level-triggered reconciler + Quadlet architecture over six incremental releases. Each release closes one failure mode. See bulletproof-containers.md for the full design, code layout, test harness, chaos matrix, sources.
Release roadmap
| Release | Closes | What lands | Status |
|---|---|---|---|
| v1.7.41 | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts | in flight (deploying to .228 for test) |
| v1.7.42 | FM4 (`host.containers.internal` wrong) | `/etc/containers/containers.conf` w/ `host_containers_internal_ip = 10.89.0.1`; every container gets `--add-host=host.archipelago:10.89.0.1` | pending |
| v1.7.43 | FM2 (config drift) | `reconcile::derived::render_bitcoin_conf`: pure fn over canonical secret, rewrites on drift. Same for `lnd.conf` | pending |
| v1.7.44 | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via `/run/user/$UID/*` clear + `system renumber` | pending |
| v1.7.45 | FM1 + FM3 (companion orphans) | `archy-bitcoin-ui` → Quadlet `.container` unit in `/etc/containers/systemd/`. systemd (not archipelago) owns it | pending |
| v1.7.46 | - | `archy-lnd-ui` → Quadlet | pending |
| v1.7.47 | - | `archy-electrs-ui` → Quadlet | pending |
| v1.7.48+ | all (full daemon refactor) | `core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too | pending |
Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.
Release history
v1.7.41-alpha — IN FLIGHT — 2026-04-22
Post-OTA auto-rollback. After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.
Changes:
- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot; probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server
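The probe-until-deadline logic behind the auto-rollback can be sketched independently of HTTP by injecting the probe as a closure. Everything here is illustrative: the `Verdict` type, the signature, and the polling interval are assumptions, and the real `verify_pending_update()` would perform the HTTPS request and the restore/rollback itself.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Healthy,
    RollBack,
}

/// Calls `probe` repeatedly until it reports HTTP 200 or `window` elapses.
/// `probe` returns the status code, or `None` if the request itself failed.
pub fn verify_pending_update<P>(mut probe: P, window: Duration, interval: Duration) -> Verdict
where
    P: FnMut() -> Option<u16>,
{
    let deadline = Instant::now() + window;
    loop {
        if probe() == Some(200) {
            return Verdict::Healthy;
        }
        if Instant::now() >= deadline {
            // Caller would restore web-ui.bak and roll back here.
            return Verdict::RollBack;
        }
        thread::sleep(interval);
    }
}
```

With the documented 90-second window, a call might pass `Duration::from_secs(90)` and a short interval of a few seconds; the probe is always attempted at least once, even with a zero window.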
v1.7.40-alpha — 2026-04-22
Proper fix for the 500 error. Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly chmod 755 before tar; --mode=u=rwX,go=rX normalizes archive perms; pre-ship assertion aborts release if tar tvzf | head -1 isn't drwxr-xr-x.
Changes:
- `scripts/create-release-manifest.sh`: pre-tar chmod + tar `--mode` flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
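The pre-ship assertion lives in a shell script, but its check is simple enough to express as a small function for illustration: given the verbose tar listing, the first entry (the archive's root directory) must start with `drwxr-xr-x`. The function name is hypothetical and the sample listing lines in the usage are made up.

```rust
/// Returns true when the first line of a `tar tvzf` listing shows a
/// directory with mode 755, which is what nginx needs to traverse it.
pub fn root_dir_is_world_readable(tar_listing: &str) -> bool {
    tar_listing
        .lines()
        .next()
        .map_or(false, |first| first.starts_with("drwxr-xr-x"))
}
```

A listing whose first line begins with `drwx------`, as in the v1.7.38/39 tarballs, would fail this check and abort the release.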
v1.7.39-alpha — 2026-04-22
Hotfix attempt for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in main.rs and post-extract chmod in update.rs OTA applier.
v1.7.38-alpha — 2026-04-22
Onboarding auto-heal + silent logins + App Store trim.
Changes:
- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
- `useOnboarding`: tri-state; backend-unreachable no longer defaults to `/onboarding/intro`
- Login sounds gated by `isFirstInstallPhase()`: silent after onboarding, typing sounds unaffected
- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
- Deleted 15 image versions from tx1138, .168, gitea-local registries
- AIUI baked into release tarball via `demo/aiui/`
- `prebuild` hook syncs `app-catalog/catalog.json` → `public/catalog.json`
(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)
v1.7.37-alpha — 2026-04-22
Bitcoin Core install fixes + dynamic node UI + full-archive default.
- Bitcoin Core passes explicit `-rpcbind`/`-rpcallowip`/etc. CLI args so vanilla image exposes RPC
- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
- Storage (Full Archive · X GB / Pruned) indicator on dashboard
- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
- Pull fallback to `docker.io` when no mirror carries the image
- Removed `prune=550` hardcode; full archive default
Key docs
- `bulletproof-containers.md`: full reconcile architecture, code layout, test matrix, chaos scenarios, sources
- `BETA-RELEASE-CHECKLIST.md`: existing beta checklist
- `BETA-ISSUES-20260328.md`: prior beta-blocker tracking
- `hotfix-process.md`: release workflow
- `architecture.md`: system architecture overview
How to resume
- Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
- Read `bulletproof-containers.md` for the current plan
- Check task list (`/list` or via Claude Code) for the in-flight release
- Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified