archy/docs/STATUS.md
archipelago e557e0156f docs: STATUS.md — dashboard Stop UX bug diagnosis + async-spawn fix plan
Captures full design for the next session:
- Full bug sequence (5.5min blocking RPC + 30s scan clobbering transitional state)
- 4-commit implementation order with exact file:line targets
- Single-button UI spec with full label table
- Verification gates including manual LND stop test on .228
- Architectural decision: spawn lives in RPC layer, orchestrator trait stays sync

No code change yet; next session implements.
2026-04-23 04:45:12 -04:00


RESUME HERE — Rust orchestrator migration

Updated: 2026-04-23 (Dashboard Stop UX bug diagnosed; async-spawn fix fully designed, ready to implement)

To resume this work, SSH into the ThinkPad and run opencode from ~/Projects/archy/. Or work from the laptop via the SSHFS mount at ~/mnt/archy-thinkpad/.


NEXT SESSION — START HERE

Goal: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: "best server containers in the world". Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live Stopping… label.

Bug being fixed

Dashboard sequence when user clicks Stop LND:

  1. UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via loadingApps.add('lnd').
  2. Frontend calls container-stop RPC. Server runs podman stop -t 330 lnd synchronously inside the RPC handler (via orchestrator.stop()). RPC blocks up to 5.5 min for LND (330s timeout + overhead).
  3. Meanwhile the 30-second package-scan loop in server.rs:scan_and_update_packages keeps running. It rebuilds PackageDataEntry from podman inspect — podman still reports running (stop hasn't completed) — and blindly overwrites the store entry at server.rs:854.
  4. container-list RPC reads state_manager snapshot → returns state = "running".
  5. Frontend polling sees running → getAppState() returns 'running' → the two-button (Start | Stop) block re-renders → the transitional button disappears → UI looks like the stop silently failed.
  6. Eventually podman stop finishes → next scan → state flips to Stopped → buttons change again.

Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".

Decisions already locked in (do not re-ask)

  • Full scope fix (not minimal hotfix). User chose "Go full scope, do it right".
  • Async-spawn lives in the RPC layer, not in the ContainerOrchestrator trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
  • PackageState already has Stopping/Starting/Restarting/Installing/Updating/Removing variants — enum at core/archipelago/src/data_model.rs:107-124. No schema change needed.
  • UI collapses to one full-width button with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when not-installed).
  • Helper API shape: RpcHandler::spawn_transitional(op: Op, app_id: String) where Op is an enum {Stop, Start, Restart}. Helper dispatches to orchestrator.stop/start/restart internally, knows each op's transitional+final states, handles error → revert + install_log().
  • mark_user_stopped must run BEFORE the spawn (preserves ordering the crash recovery layer depends on — see runtime.rs:145-148).

Implementation order (4 commits, local only)

Commit 1 — feat(rpc): spawn_transitional helper for async lifecycle ops

  • New file: core/archipelago/src/api/rpc/transitional.rs (or extend container.rs; prefer new file for cohesion with future stacks/package variants)
  • enum Op { Stop, Start, Restart } with transitional_state(), final_state_on_success(), log_prefix(), and async dispatch(&orch, &app_id) method
  • impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }
    • Capture Arc<dyn ContainerOrchestrator> + Arc<StateManager> clones
    • Set transitional state via state_manager.update_data() (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
    • tokio::spawn(async move { ... })
    • Inside spawn: install_log("{LOG_PREFIX}: {app_id}"), op.dispatch(&orch, &app_id).await, on success set final state, on error log + install_log("{LOG_PREFIX} FAIL: …") + revert state to previous (cache pre-transition state in a local)
    • Return Ok(()) immediately after spawn
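A minimal sketch of the state bookkeeping described for Commit 1, with the tokio::spawn and the orchestrator call left out so it compiles against std only. The PackageState subset, the Store alias, the begin/finish split, and the log prefix strings are illustrative assumptions, not the real StateManager API. In the real helper, begin would run in the RPC body and finish inside the spawned future.

```rust
use std::collections::HashMap;

/// Subset of the PackageState variants named in this plan (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Starting,
    Stopping,
    Restarting,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Stop,
    Start,
    Restart,
}

impl Op {
    /// State shown while the operation is in flight.
    pub fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }

    /// State written once the orchestrator call succeeds.
    pub fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }

    /// Placeholder prefixes; the real install_log wording is not specified here.
    pub fn log_prefix(self) -> &'static str {
        match self {
            Op::Stop => "STOP",
            Op::Start => "START",
            Op::Restart => "RESTART",
        }
    }
}

/// Hypothetical stand-in for the state store: app_id -> state.
pub type Store = HashMap<String, PackageState>;

/// Set the transitional state and return the previous one.
/// Returns None (and creates nothing) when the app has no entry.
pub fn begin(store: &mut Store, op: Op, app_id: &str) -> Option<PackageState> {
    let entry = store.get_mut(app_id)?;
    let previous = *entry;
    *entry = op.transitional_state();
    Some(previous)
}

/// Apply the outcome: final state on success, revert to `previous` on error.
pub fn finish(
    store: &mut Store,
    op: Op,
    app_id: &str,
    previous: Option<PackageState>,
    outcome: Result<(), String>,
) {
    let previous = match previous {
        Some(p) => p,
        None => return, // nothing was marked transitional, so nothing to settle
    };
    let next = match outcome {
        Ok(()) => op.final_state_on_success(),
        Err(_) => previous,
    };
    store.insert(app_id.to_string(), next);
}
```

Caching the pre-transition state in the value returned by begin is what makes the revert path possible without a second read of the store.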

Commit 2 — fix(rpc): async container stop/start/restart; widen state mapping

  • api/rpc/container.rs:85-107 — rewrite handle_container_stop body: validate_app_id, mark_user_stopped, spawn_transitional(Op::Stop, app_id.to_string()).await?, return Ok(json!({ "status": "stopping" }))
  • api/rpc/container.rs:61-83 — rewrite handle_container_start: clear_user_stopped, spawn_transitional(Op::Start, …), return { "status": "starting" }
  • Add handle_container_restart (currently missing in container.rs — only exists as package.restart at runtime.rs:176-242). Register RPC route name container-restart. Add matching frontend client method in container-client.ts.
  • api/rpc/container.rs:148-154 — widen the container-list state mapping: add arms for Stopping → "stopping", Starting → "starting", Restarting → "restarting", Installing → "installing", Updating → "updating", Removing → "removing", Installed → "installed", CreatingBackup/RestoringBackup/BackingUp → their kebab-case strings. No more "unknown" fallback unless the variant is genuinely unknown.
  • Mirror same spawn treatment in api/rpc/package/runtime.rs: handle_package_start (L28-119), handle_package_stop (L122-173), handle_package_restart (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) inside the spawned future, not in the RPC body.
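The widened container-list mapping could look roughly like the sketch below. The enum here lists only the variants this plan mentions, so it is an assumption about data_model.rs rather than a copy of it, and state_str is a hypothetical name.

```rust
/// Variants named in this plan; the real enum in data_model.rs may have more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Installed,
    Stopping,
    Starting,
    Restarting,
    Installing,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

/// Map every variant to its kebab-case wire string.
/// No wildcard arm: adding a variant becomes a compile error here
/// instead of silently falling through to "unknown".
pub fn state_str(state: PackageState) -> &'static str {
    match state {
        PackageState::Running => "running",
        PackageState::Stopped => "stopped",
        PackageState::Installed => "installed",
        PackageState::Stopping => "stopping",
        PackageState::Starting => "starting",
        PackageState::Restarting => "restarting",
        PackageState::Installing => "installing",
        PackageState::Updating => "updating",
        PackageState::Removing => "removing",
        PackageState::CreatingBackup => "creating-backup",
        PackageState::RestoringBackup => "restoring-backup",
        PackageState::BackingUp => "backing-up",
    }
}
```

Leaving out the catch-all arm is a deliberate choice in this sketch: it trades a little verbosity for the compiler catching any future variant that lacks a mapping.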

Commit 3 — fix(state): preserve transitional state across container scans

  • server.rs:847-857 — in the merge loop, before the merged.insert(id.clone(), pkg.clone()) overwrite, check merged.get(id).state and skip overwrite if it's transitional: matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)
  • Still allow non-state fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep existing.state but merge updated fields from pkg. Write a tiny helper merge_preserving_transitional(existing, fresh) -> PackageDataEntry.
  • Unit test: construct existing.state = Stopping, fresh.state = Running, assert merged.state stays Stopping.
  • Also check: Is there a timeout escape hatch? If Stopping is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck Stopping forever. Mitigation: track a transitional_since: Instant in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up — lean toward: include it, because fleet reliability matters.
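One possible shape for the merge helper plus the timeout escape hatch, as a std-only sketch. The entry struct is trimmed to three fields, and the two extra parameters (how long the entry has been transitional, and the app's stop timeout) are assumptions added so the staleness rule can be tested without a clock; the plan itself only fixes the (existing, fresh) pair.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Installing,
    Stopping,
    Starting,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

impl PackageState {
    pub fn is_transitional(self) -> bool {
        use PackageState::*;
        matches!(
            self,
            Installing | Stopping | Starting | Restarting | Updating | Removing
                | CreatingBackup | RestoringBackup | BackingUp
        )
    }
}

/// Trimmed stand-in for PackageDataEntry (field set is illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct PackageDataEntry {
    pub state: PackageState,
    pub lan_address: Option<String>,
    pub health: Option<String>,
}

/// Keep a transitional `existing.state`, but take every other field from
/// the fresh scan. If the entry has been transitional for more than twice
/// the stop timeout, assume the spawned task died and let the scan win.
pub fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: PackageDataEntry,
    transitional_for: Option<Duration>,
    stop_timeout: Duration,
) -> PackageDataEntry {
    let stale = match transitional_for {
        Some(age) => age > stop_timeout * 2,
        None => false,
    };
    if existing.state.is_transitional() && !stale {
        PackageDataEntry { state: existing.state, ..fresh }
    } else {
        fresh
    }
}
```

Passing the elapsed time in as a Duration keeps the helper pure; the caller would compute it from the in-memory transitional_since side table.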

Commit 4 — fix(ui): single-button lifecycle control with transitional labels

  • neode-ui/src/api/container-client.ts — extend ContainerStatus.state union to: 'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'. Add restartContainer(appId) method calling container-restart.
  • neode-ui/src/stores/container.ts — add computed getAppVisualState(appId) that returns one of: 'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'. Maps exited → stopped, created → stopped, paused → stopped, installed → stopped. Add restartContainer(appId) action (sets loadingApps for request dedup, calls client, does NOT fetchContainers immediately because server will broadcast state; a final fetchContainers after a short delay can backstop if WebSocket push is absent).
  • neode-ui/src/views/ContainerApps.vue:85-136 — replace the two-button conditional with a single full-width button bound to getAppVisualState(app.id). Table:
    | visual state | click action | label | spinner | disabled |
    |---|---|---|---|---|
    | not-installed | installApp | Install | no | no |
    | running | stopContainer | Stop | no | no |
    | stopped | startContainer | Start | no | no |
    | starting | | Starting… | yes | yes |
    | stopping | | Stopping… | yes | yes |
    | restarting | | Restarting… | yes | yes |
    | installing | | Installing… | yes | yes |
    | updating | | Updating… | yes | yes |
    | removing | | Removing… | yes | yes |
    • Add a separate Restart button next to the primary one when state is running, calling new restartContainer action. Restart button hides while transitional.
  • neode-ui/src/views/ContainerAppDetails.vue:83 (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
  • Also audit line 239 of ContainerApps.vue (some((app) => store.getAppState(app.id) === 'created')) and the logic around lines 276, 295, 309, 312 — make sure they use getAppVisualState where appropriate.

Verification gates (do not skip)

  1. ~/.cargo/bin/cargo check -p archipelago on .116 via SSH
  2. ~/.cargo/bin/cargo test -p archipelago on .116 via SSH — at least the new merge helper test must pass
  3. Build release binary on .116: nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown. Poll until done.
  4. SCP binary to .228 /usr/local/bin/archipelago, back up prior to /usr/local/bin/archipelago.bak-pre-async-stop. sudo systemctl restart archipelago on .228.
  5. Manual LND stop test on .228:
    • Open dashboard, confirm LND is Running (first: ssh archipelago@192.168.1.228 'podman start lnd' — LND is currently Exited(0) from the demo)
    • Click Stop
    • Expected: button immediately becomes "Stopping…" with spinner (RPC returns <1s)
    • Dashboard should stay on "Stopping…" for ~5 min
    • Then the button should flip to "Start"
    • At no point should it revert to "Running" mid-stop
  6. Same test with Bitcoin Core stop (longest timeout, 600s)
  7. Frontend build: cd ~/Projects/archy/neode-ui && npm run type-check && npm run build. Rsync dist/ to archipelago@192.168.1.228:/var/lib/archipelago/web-ui/ (or wherever the active web root is — check /etc/nginx on .228 first).
  8. Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.

Key files (exact lines of interest)

  • core/archipelago/src/api/rpc/container.rs:85-107 → handle_container_stop (blocking — target of fix)
  • core/archipelago/src/api/rpc/container.rs:61-83 → handle_container_start
  • core/archipelago/src/api/rpc/container.rs:148-154 — narrow state mapping (drops transitional → "unknown")
  • core/archipelago/src/api/rpc/package/runtime.rs:11-24 → stop_timeout_secs table (reference, unchanged)
  • core/archipelago/src/api/rpc/package/runtime.rs:122-173 → handle_package_stop (also blocking, mirror treatment)
  • core/archipelago/src/api/rpc/package/runtime.rs:28-119 → handle_package_start
  • core/archipelago/src/api/rpc/package/runtime.rs:176-242 → handle_package_restart
  • core/archipelago/src/api/rpc/package/progress.rs — existing broadcast pattern to mirror (set_install_progress, set_uninstall_stage)
  • core/archipelago/src/api/rpc/mod.rs:62-100 → RpcHandler struct (already holds Arc<dyn ContainerOrchestrator> + state_manager)
  • core/archipelago/src/server.rs:812-857 → scan_and_update_packages (merge loop at L850-857 is where transitional-state clobber happens)
  • core/archipelago/src/container/docker_packages.rs:636-663 → convert_state + package_state_str (read-only reference, no change)
  • core/archipelago/src/container/traits.rs → ContainerOrchestrator trait (stays synchronous, do not change)
  • core/archipelago/src/crash_recovery.rs → mark_user_stopped / clear_user_stopped (call order preserved)
  • core/archipelago/src/data_model.rs:107-124 → PackageState enum (no change — all variants exist)
  • neode-ui/src/api/container-client.ts → ContainerStatus type + RPC methods (extend)
  • neode-ui/src/stores/container.ts:93-312 — Pinia store (add getAppVisualState, add restartContainer action)
  • neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383 — two-button block + state reads
  • neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232 — details page Stop/Start

Chaos harness (not in repo — lives on .116)

  • archipelago@192.168.1.116:~/ui-chaos/ — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
  • /tmp/chaos/ on laptop — canonical source for rsync to .116.
  • Run: cd ~/ui-chaos && npx playwright test tests/<spec>
  • Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
  • Uses SSH+Playwright hybrid per design; includes the bash -lc '<escaped>' single-quote fix for ssh argv flattening and JSON-parsed podman inspect instead of Go templates.

Pre-existing bugs still deferred (do not fix until Stop UX lands)

  1. archipelago --version spawns server (should be a pure CLI query)
  2. RPC unknown-method returns generic error (should return method-not-found with the bad method name)
  3. docker_packages.rs filters out UI containers (archy-lnd-ui, archy-electrs-ui) — some views need them visible
  4. lnd.lan_address stale on .228
  5. first-boot silent failure on some hardware
  6. web-ui.failed.* scar on .228 (benign systemd unit state)
  7. test_parse_image_versions pre-existing broken assertion — fix or #[ignore] when touching that area

Host reference

| Host | IP | Role | Dashboard pw | Sudo pw | SSH |
|---|---|---|---|---|---|
| archy (ThinkPad X250) | 192.168.1.116 | dev host, Debian 13, repo at ~/Projects/archy/ | archipelago | ThisIsWeb54321@ | key installed |
| archy228 (HP ProDesk) | 192.168.1.228 | prod kiosk, new Rust orchestrator binary | password123 | archipelago (NOPASSWD:ALL via /etc/sudoers.d/archipelago-ci) | key installed |

  • Laptop SSHFS mount: ~/mnt/archy-thinkpad/ (edits OK, git/cargo via SSH)
  • Cargo path over SSH: ~/.cargo/bin/cargo (non-interactive login has no cargo in PATH)
  • Release model: local commit + tag only; user pushes to 4 Gitea mirrors personally
  • Full destructive latitude on both nodes. Announce multi-hour ops. Don't ask for routine stop/start/rebuild permission.

Where we are

Working through the 11-step plan in rust-orchestrator-migration.md.

  • Step 1 (3767c267): ContainerConfig schema with build:, ResolvedSource enum, resolve(), 10 tests
  • Step 2 (34af4d9d): ContainerRuntime trait gained image_exists + build_image, 4 argv tests, 25/25 pass
  • Step 3 (b6a04d31): ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
  • Step 4 (e8a59c93): ContainerOrchestrator trait, RpcHandler uses it in prod (+ 13858842 chore gitignore ._*)
  • Step 5 (fc39b04b): BootReconciler with Arc shutdown, 4 paused-time tests pass
  • Step 6 (48f08aa3): main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
  • Step 7 (069bc4a5): bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
  • Step 8a (a0707f4d): retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
  • Step 9: Hot-swap on .228 verified. All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
  • .228 dashboard bugs — ExtraHost 192.168.1.254 bug (3ee192ba) + LND macaroon permission bug (be960023). See "Post-Step 9 bug hunt" below.
  • Step 8b — Port remaining ~25 container creations from first-boot-containers.sh into apps/<id>/manifest.yml, then port update.rs to orchestrator (deferred, multi-day work)
  • Step 8c — Rename first-boot-containers.sh → first-boot-setup.sh, strip container ops, keep setup. Delete reconcile-containers.sh + container-specs.sh. Add ISO lines to copy apps/ (final one-way door, requires 8b complete)
  • Step 10 — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
  • Step 11 — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)

Post-Step 9 bug hunt (.228, 2026-04-23)

User reported three visible dashboard bugs after Step 9 verification:

  1. LND — "no connect details or QR"
  2. ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
  3. bitcoin-core — in scope for chaos testing

Root cause #1 (ExtraHost, commit 3ee192ba): scripts/first-boot-containers.sh computed HOST_GATEWAY from ip route show default, which returns the LAN router (e.g. 192.168.1.254), not the gateway to the host. Every container configured with --add-host=host.containers.internal:$HOST_GATEWAY was dialing the WiFi router instead of the host. LND crash-looped with dial tcp 192.168.1.254:8332: connection refused; ElectrumX's DAEMON_URL hit the same dead end; any archy-net bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic host-gateway literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected --add-host; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).

Root cause #2 (macaroon permissions, commit be960023): LND's admin.macaroon lives at /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (getinfo, connect-info, export-channel-backup) plus the shared lnd_client() helper failed with "Failed to read LND admin macaroon". Confirmed pre-existing on .116 too (long-standing bug unrelated to Step 9). Fix: centralised the path as LND_ADMIN_MACAROON_PATH, added a read_lnd_admin_macaroon() helper in api/rpc/lnd/mod.rs that tries direct read first then falls back to sudo -n cat (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — curl -k https://<host>/lnd-connect-info now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
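The direct-read-then-sudo pattern described above can be sketched with std alone. The function name and the choice to fall back only on PermissionDenied are assumptions of this sketch; the real read_lnd_admin_macaroon() may fall back on any read error.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Try a plain read first; if the file exists but is not readable by this
/// user, retry through `sudo -n cat`. `-n` makes sudo fail immediately
/// instead of prompting, so a missing NOPASSWD rule surfaces as an error.
pub fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        Err(err) if err.kind() == io::ErrorKind::PermissionDenied => {
            let output = Command::new("sudo")
                .arg("-n")
                .arg("cat")
                .arg(path)
                .output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!("sudo -n cat failed for {}", path.display()),
                ))
            }
        }
        // Not found, is-a-directory, etc.: sudo would not help, so report as-is.
        Err(other) => Err(other),
    }
}
```

Returning raw bytes rather than a String matters here because a macaroon is binary data, not UTF-8 text.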

Step 9 evidence (.228, 2026-04-23)

  • Binary: Step 9 build with 732df1b8 + ba83f9bc, scp'd to .228 as /usr/local/bin/archipelago. Old binary backed up at /usr/local/bin/archipelago.bak-pre-step9. Later replaced with macaroon-fix build (be960023); previous backed up at /usr/local/bin/archipelago.bak-pre-macaroon.
  • DEV_MODE override disabled (override.conf → override.conf.disabled-pre-step9).
  • /opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml populated.
  • /opt/archipelago/docker/bitcoin-ui/Dockerfile replaced with the Step 7 version (no COPY nginx.conf). Old dir backed up as bitcoin-ui.bak-pre-step9.
  • Post-start snapshot:
    • 🔗 Adopted 1 existing container(s): ["electrs-ui"] — adoption of 13h-running container worked without recreation
    • 🔄 Boot reconciler started (interval: 30s) — every 30s, all three app_ids reach NoOp after the initial install pass
    • bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18 — pre-start hook fires in install_fresh
    • curl localhost:8334 → HTTP 200 (bitcoin-ui), :8081 → 200 (lnd-ui), :50002 → 200 (electrs-ui)
    • OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)

Bugs fixed this session

  1. parse_memory_limit truncation bug (732df1b8): lowercased "128Mi" → "128mi" → trim_end_matches('m') → "128i" → f64 parse fails → None.unwrap_or(0) → OCI memory.limit:0 → systemd rejects MemoryMax=0. 6 regression tests; create_container now omits instead of emitting 0.
  2. archipelago.service cgroup delegation missing (ba83f9bc): belt-and-braces Delegate=memory pids cpu io.
  3. ExtraHost 192.168.1.254 (3ee192ba): see Post-Step 9 bug hunt above.
  4. LND admin.macaroon unreadable (be960023): see Post-Step 9 bug hunt above.
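For bug 1, a suffix-aware parser along the following lines avoids the trim_end_matches('m') trap by stripping the whole two-letter suffix at once. This is a sketch, not the code from 732df1b8: it handles only Ki/Mi/Gi and bare byte counts, parses integers rather than f64, and how the real function treats other suffixes is not covered here.

```rust
/// Parse "128Mi", "64ki", "1Gi" or a bare byte count into bytes.
/// Returns None for anything unparseable or for zero, so the caller can
/// omit the limit instead of emitting memory.limit = 0.
pub fn parse_memory_limit(input: &str) -> Option<u64> {
    let s = input.trim().to_ascii_lowercase();
    // Strip the complete IEC suffix in one step; never peel single letters.
    let (digits, multiplier): (&str, u64) = if let Some(n) = s.strip_suffix("ki") {
        (n, 1 << 10)
    } else if let Some(n) = s.strip_suffix("mi") {
        (n, 1 << 20)
    } else if let Some(n) = s.strip_suffix("gi") {
        (n, 1 << 30)
    } else {
        (s.as_str(), 1)
    };
    let value: u64 = digits.trim().parse().ok()?;
    let bytes = value.checked_mul(multiplier)?;
    if bytes == 0 {
        None
    } else {
        Some(bytes)
    }
}
```

With this shape, "128Mi" yields 134217728 bytes, and the caller distinguishes "no limit" (None) from a real value without a sentinel 0.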

Commits made this session

3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer}  (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)

Branch is 19 commits ahead of tx1138/main (local only — user pushes to mirrors personally).

Uncommitted state

Clean. Only untracked: tests/ (bats harness from prior session, not in scope), tmp-dump-spec.py (scratch).

Answered design questions (no need to re-ask)

  1. UI container naming → archy-<app_id> for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
  2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
  3. Reconciler interval → 30 seconds
  4. Concurrency → per-app Mutex<()> in a DashMap
  5. Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
  6. Step 4 extension → ContainerOrchestrator trait includes install(app_id); the manifest_path-based install RPC stays dev-only
  7. Step 7 bitcoin-ui template → embed via include_str!, render on install + every reconcile, atomic tmp+rename to /var/lib/archipelago/bitcoin-ui/nginx.conf, bind-mount into container. RPC user hardcoded archipelago, password from /var/lib/archipelago/secrets/bitcoin-rpc-password.
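The "atomic tmp+rename" write in item 7 can be illustrated as below. write_atomic and the ".tmp" suffix are hypothetical names for this sketch; the point is that the temp file sits in the same directory as the target, so the rename stays on one filesystem and readers see either the old file or the new one, never a partial write.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to `<target>.tmp` in the same directory, flush it to
/// disk, then rename over `target`.
pub fn write_atomic(target: &Path, contents: &[u8]) -> io::Result<()> {
    let file_name = target.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "target has no file name")
    })?;
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(".tmp");
    let tmp_path = target.with_file_name(tmp_name);

    let mut tmp = fs::File::create(&tmp_path)?;
    tmp.write_all(contents)?;
    tmp.sync_all()?; // make sure the bytes are durable before the swap
    drop(tmp);

    fs::rename(&tmp_path, target)
}
```

This matters for a bind-mounted nginx.conf because the reconciler re-renders it every pass; a plain truncate-and-write could expose an empty file to nginx mid-write.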

Context: which host is what

| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| archy | 192.168.1.116 | Dev ThinkPad (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| archy228 | 192.168.1.228 | Kiosk HP ProDesk. Step 9 landing zone — now running Rust-orchestrator binary in prod mode. | password123 | archipelago |

Both are development alpha nodes — full destructive latitude, no need to ask before stop/start/rebuild.

Next action

Step 10 — Hot-swap on .116.

Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.

Steps:

  1. Disable DEV_MODE on .116 (check if override.conf exists — /etc/systemd/system/archipelago.service.d/)
  2. Stage the already-built binary at ~/Projects/archy/core/target/release/archipelago → /usr/local/bin/archipelago.new
  3. Ensure /opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml present (copy from repo)
  4. Ensure /opt/archipelago/docker/bitcoin-ui/ matches the Step-7 layout (no baked nginx.conf)
  5. Snapshot: podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}" → save to /tmp/pre-step10-containers.txt
  6. systemctl stop archipelago → install binary → systemctl start archipelago
  7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
  8. If broken → restore .bak binary, re-enable DEV_MODE override.
  9. Commit STATUS.md update.

Risk on .116: If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago.

After Step 10 we are blocked on Step 8b (multi-day manifest ports) before Step 11 (chaos matrix).


Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:

  • first-boot-containers.sh creates 30+ containers with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
  • Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
  • update.rs (OTA update RPC) invokes reconcile-containers.sh at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
  • Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.

Archipelago — Current State, Plan, and Releases

Updated: 2026-04-22

This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in bulletproof-containers.md.


Current state

Fleet status

All four Gitea mirrors are synced to v1.7.40-alpha:

| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | v1.7.40-alpha live (Gitea recovered via podman system renumber — see below) |
| .168 | http://146.59.87.168:3000 | v1.7.40-alpha live |

Fleet test nodes:

| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via systemd-run chmod 755 /opt/archipelago/web-ui after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | | unreachable today |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |

Known open issues (drives the plan below)

  1. UI companion containers disappear on .228 after daemon restarts — no auto-recreate (fixed by v1.7.45 Quadlet migration)
  2. bitcoin.conf rpcauth drifts from canonical secret → ElectrumX "Daemon connection problem" (fixed by v1.7.43 reconcile::derived)
  3. host.containers.internal resolves to LAN gateway inside containers on some versions (fixed by v1.7.42 containers.conf)
  4. Podman state DB loss requires manual recovery (fixed by v1.7.44 startup self-heal)
  5. LND "Connect Wallet" info vanishing after crashes — symptom of the same drift class as #2
  6. ElectrumX not syncing on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled

Recent field incident (2026-04-22)

  • Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was drwx------ (700). Every node that OTA'd got 500 errors on every page.
  • Root-cause fix shipped in v1.7.40 (create-release-manifest.sh chmod + pre-ship assertion that tar tvzf | head -1 shows drwxr-xr-x).
  • .160 Gitea was down all day (502) because its rootless podman's libpod/bolt_state.db had vanished. Recovered via clearing /run/user/$UID/{containers,libpod,podman} + podman system renumber.
  • Full failure-mode audit is in bulletproof-containers.md.

Plan

We're shipping a level-triggered reconciler + Quadlet architecture over six incremental releases. Each release closes one failure mode. See bulletproof-containers.md for the full design, code layout, test harness, chaos matrix, sources.

Release roadmap

| Release | Closes | What lands | Status |
|---|---|---|---|
| v1.7.41 | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes https://127.0.0.1/ on boot; if non-200 within 90s, restores web-ui.bak + calls rollback_update() + restarts | in flight — deploying to .228 for test |
| v1.7.42 | FM4 (host.containers.internal wrong) | /etc/containers/containers.conf w/ host_containers_internal_ip = 10.89.0.1; every container gets --add-host=host.archipelago:10.89.0.1 | pending |
| v1.7.43 | FM2 (config drift) | reconcile::derived::render_bitcoin_conf — pure fn over canonical secret, rewrites on drift. Same for lnd.conf | pending |
| v1.7.44 | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via /run/user/$UID/* clear + system renumber | pending |
| v1.7.45 | FM1 + FM3 (companion orphans) | archy-bitcoin-ui → Quadlet .container unit in /etc/containers/systemd/. systemd (not archipelago) owns it | pending |
| v1.7.46 | | archy-lnd-ui → Quadlet | pending |
| v1.7.47 | | archy-electrs-ui → Quadlet | pending |
| v1.7.48+ | all (full daemon refactor) | core/archipelago/src/reconcile/ module replaces imperative install.rs container management. Main app containers become Quadlet too | pending |

Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.


Release history

v1.7.41-alpha — IN FLIGHT — 2026-04-22

Post-OTA auto-rollback. After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.

Changes:

  • core/archipelago/src/update.rs: PendingVerification struct, write marker before service restart, verify_pending_update() on new binary boot — probes https://127.0.0.1/, on fail restores web-ui.bak + calls rollback_update() + systemctl restart archipelago
  • core/archipelago/src/main.rs: startup task invokes verifier concurrently with server
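The verify-or-roll-back decision reduces to a deadline-bounded probe loop. The sketch below assumes the verifier polls repeatedly until the 90-second window closes (the notes do not say whether it is one probe or many) and injects the probe as a closure so no HTTP client is needed; Verdict and verify_within are hypothetical names.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Healthy,
    RollBack,
}

/// Call `probe` until it reports HTTP 200 or `deadline` has elapsed.
/// `probe` returns Some(status) for an HTTP answer, None when the
/// connection itself failed.
pub fn verify_within<F>(deadline: Duration, interval: Duration, mut probe: F) -> Verdict
where
    F: FnMut() -> Option<u16>,
{
    let start = Instant::now();
    loop {
        if probe() == Some(200) {
            return Verdict::Healthy;
        }
        if start.elapsed() >= deadline {
            return Verdict::RollBack;
        }
        thread::sleep(interval);
    }
}
```

In the real flow the caller would pass a 90-second deadline and, on Verdict::RollBack, restore web-ui.bak, call rollback_update(), and restart the service.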

v1.7.40-alpha — 2026-04-22

Proper fix for the 500 error. Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly chmod 755 before tar; --mode=u=rwX,go=rX normalizes archive perms; pre-ship assertion aborts release if tar tvzf | head -1 isn't drwxr-xr-x.

Changes:

  • scripts/create-release-manifest.sh: pre-tar chmod + tar --mode flag + post-tar verify
  • Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)

v1.7.39-alpha — 2026-04-22

Hotfix attempt for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in main.rs and post-extract chmod in update.rs OTA applier.

v1.7.38-alpha — 2026-04-22

Onboarding auto-heal + silent logins + App Store trim.

Changes:

  • auth.rs: is_onboarding_complete() auto-heals from setup_complete + password_hash (prevents clear-cache → onboarding wizard bug)
  • useOnboarding: tri-state — backend-unreachable no longer defaults to /onboarding/intro
  • Login sounds gated by isFirstInstallPhase() — silent after onboarding, typing sounds unaffected
  • Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
  • Deleted 15 image versions from tx1138, .168, gitea-local registries
  • AIUI baked into release tarball via demo/aiui/
  • prebuild hook syncs app-catalog/catalog.json → public/catalog.json

(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)

v1.7.37-alpha — 2026-04-22

Bitcoin Core install fixes + dynamic node UI + full-archive default.

  • Bitcoin Core passes explicit -rpcbind/-rpcallowip/etc. CLI args so vanilla image exposes RPC
  • Split bitcoin-core from bitcoin-knots in backend AppMetadata
  • bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
  • Storage (Full Archive · X GB / Pruned) indicator on dashboard
  • Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
  • Pull fallback to docker.io when no mirror carries the image
  • Removed prune=550 hardcode — full archive default

Key docs


How to resume

  1. Check fleet mirrors are all live: curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version
  2. Read bulletproof-containers.md for the current plan
  3. Check task list (/list or via Claude Code) for the in-flight release
  4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified