**To resume this work, SSH into the ThinkPad and run `opencode` from `~/Projects/archy/`. Or work from the laptop via the SSHFS mount at `~/mnt/archy-thinkpad/`.**
**Goal**: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: _"best server containers in the world"_. Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live `Stopping…` label.
### Bug being fixed
Dashboard sequence when user clicks **Stop LND**:
1. UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via `loadingApps.add('lnd')`.
2. Frontend calls `container-stop` RPC. Server runs `podman stop -t 330 lnd` **synchronously inside the RPC handler** (via `orchestrator.stop()`). RPC blocks up to **5.5 min** for LND (330s timeout + overhead).
3. Meanwhile the 30-second package-scan loop in `server.rs:scan_and_update_packages` keeps running. It rebuilds `PackageDataEntry` from podman inspect — podman still reports `running` (stop hasn't completed) — and **blindly overwrites** the store entry at `server.rs:854`.
5. Frontend polling sees `running` → `getAppState()` returns `'running'` → the two-button (Start | Stop) block re-renders → the transitional button disappears → **UI looks like the stop silently failed**.
6. Eventually `podman stop` finishes → next scan → state flips to `Stopped` → buttons change _again_.
Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".
### Decisions already locked in (do not re-ask)
- **Full scope fix** (not minimal hotfix). User chose "Go full scope, do it right".
- **Async-spawn lives in the RPC layer**, not in the `ContainerOrchestrator` trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
- **`PackageState` already has `Stopping`/`Starting`/`Restarting`/`Installing`/`Updating`/`Removing`** variants — enum at `core/archipelago/src/data_model.rs:107-124`. No schema change needed.
- **UI collapses to one full-width button** with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when `not-installed`).
- **Helper API shape**: `RpcHandler::spawn_transitional(op: Op, app_id: String)` where `Op` is an enum `{Stop, Start, Restart}`. Helper dispatches to `orchestrator.stop/start/restart` internally, knows each op's transitional+final states, handles error → revert + `install_log()`.
- **`mark_user_stopped` must run BEFORE the spawn** (preserves ordering the crash recovery layer depends on — see `runtime.rs:145-148`).
### Implementation order (4 commits, local only)
**Commit 1 — `feat(rpc): spawn_transitional helper for async lifecycle ops`**
- New file: `core/archipelago/src/api/rpc/transitional.rs` (or extend `container.rs`; prefer new file for cohesion with future stacks/package variants)
- `enum Op { Stop, Start, Restart }` with `transitional_state()`, `final_state_on_success()`, `log_prefix()`, and async `dispatch(&orch, &app_id)` method
- Set transitional state via `state_manager.update_data()` (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
- `tokio::spawn(async move { ... })`
- Inside spawn: `install_log("{LOG_PREFIX}: {app_id}")`, `op.dispatch(&orch, &app_id).await`, on success set final state, on error log + `install_log("{LOG_PREFIX} FAIL: …")` + revert state to previous (cache pre-transition state in a local)
- Return `Ok(())` immediately after spawn
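A minimal sketch of the Commit 1 helper shape, under stated assumptions. It uses `std::thread::spawn` as a stand-in for `tokio::spawn` so it runs without dependencies. The `Store` type, the `dispatch` closure, and the returned `JoinHandle` are hypothetical; the real helper would live on `RpcHandler`, go through `state_manager.update_data()`, and return `Ok(())` right after spawning.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

// Trimmed-down stand-in for the real `PackageState` enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PackageState { Running, Stopped, Starting, Stopping, Restarting }

#[derive(Clone, Copy, Debug)]
enum Op { Stop, Start, Restart }

impl Op {
    fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }

    fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }

    fn log_prefix(self) -> &'static str {
        match self {
            Op::Stop => "STOP",
            Op::Start => "START",
            Op::Restart => "RESTART",
        }
    }
}

// Hypothetical in-memory store keyed by app id.
type Store = Arc<Mutex<HashMap<String, PackageState>>>;

/// Sets the transitional state (only if an entry exists), then runs the
/// blocking `dispatch` call on a background thread. On success the final
/// state is written; on error the pre-transition state is restored.
fn spawn_transitional<F>(store: &Store, op: Op, app_id: String, dispatch: F) -> JoinHandle<()>
where
    F: FnOnce(Op, &str) -> Result<(), String> + Send + 'static,
{
    // Cache the pre-transition state; `None` means there is no entry to touch.
    let previous = {
        let mut guard = store.lock().unwrap();
        guard
            .get_mut(&app_id)
            .map(|entry| std::mem::replace(entry, op.transitional_state()))
    };

    let store = Arc::clone(store);
    thread::spawn(move || {
        let result = dispatch(op, &app_id);
        if let Err(err) = &result {
            eprintln!("{} FAIL: {}: {}", op.log_prefix(), app_id, err);
        }
        if let Some(previous) = previous {
            let next = if result.is_ok() { op.final_state_on_success() } else { previous };
            store.lock().unwrap().insert(app_id, next);
        }
    })
}
```

The caller sees the transitional state as soon as `spawn_transitional` returns, which is the property the dashboard needs. Returning the handle is only there so a test can join; the RPC handler itself would drop it.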
**Commit 2 — `fix(rpc): async container stop/start/restart; widen state mapping`**
- **Add** `handle_container_restart` (currently missing in `container.rs` — only exists as `package.restart` at `runtime.rs:176-242`). Register RPC route name `container-restart`. Add matching frontend client method in `container-client.ts`.
- `api/rpc/container.rs:148-154` — widen the `container-list` state mapping: add arms for `Stopping → "stopping"`, `Starting → "starting"`, `Restarting → "restarting"`, `Installing → "installing"`, `Updating → "updating"`, `Removing → "removing"`, `Installed → "installed"`, `CreatingBackup`/`RestoringBackup`/`BackingUp` → their kebab-case strings. No more `"unknown"` fallback unless the variant is genuinely unknown.
- Mirror same spawn treatment in `api/rpc/package/runtime.rs`: `handle_package_start` (L28-119), `handle_package_stop` (L122-173), `handle_package_restart` (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) _inside_ the spawned future, not in the RPC body.
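The widened `container-list` mapping could look roughly like this. The variant list is an assumption assembled from the names mentioned in these notes, not a copy of `data_model.rs`, and `state_label` is a hypothetical function name.

```rust
// Assumed subset of the real `PackageState` variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PackageState {
    Installing,
    Installed,
    Starting,
    Running,
    Stopping,
    Stopped,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

/// Maps every known variant to its kebab-case wire string.
/// The match has no wildcard arm, so adding a variant later is a compile
/// error here instead of a silent "unknown" on the wire.
fn state_label(state: PackageState) -> &'static str {
    match state {
        PackageState::Installing => "installing",
        PackageState::Installed => "installed",
        PackageState::Starting => "starting",
        PackageState::Running => "running",
        PackageState::Stopping => "stopping",
        PackageState::Stopped => "stopped",
        PackageState::Restarting => "restarting",
        PackageState::Updating => "updating",
        PackageState::Removing => "removing",
        PackageState::CreatingBackup => "creating-backup",
        PackageState::RestoringBackup => "restoring-backup",
        PackageState::BackingUp => "backing-up",
    }
}
```

Leaving out the `_ => "unknown"` arm is a design choice in this sketch: it trades a fallback for compiler-enforced coverage.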
**Commit 3 — `fix(state): preserve transitional state across container scans`**
- `server.rs:847-857` — in the merge loop, before the `merged.insert(id.clone(), pkg.clone())` overwrite, check `merged.get(id).state` and skip overwrite if it's transitional: `matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)`
- Still allow _non-state_ fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep `existing.state` but merge updated fields from `pkg`. Write a tiny helper `merge_preserving_transitional(existing, fresh) -> PackageDataEntry`.
- **Also check**: Is there a timeout escape hatch? If `Stopping` is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck `Stopping` forever. Mitigation: track a `transitional_since: Instant` in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up — lean toward: include it, because fleet reliability matters.
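A sketch of the merge helper together with the timeout escape hatch, assuming a reduced `PackageDataEntry` with only the fields named above. The field types, the extra `transitional_since` / `now` / `stop_timeout` parameters, and `is_transitional` are assumptions made for illustration; the notes only fix the two-argument shape `merge_preserving_transitional(existing, fresh)`.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PackageState {
    Running, Stopped, Installed,
    Installing, Stopping, Starting, Restarting, Updating, Removing,
    CreatingBackup, RestoringBackup, BackingUp,
}

impl PackageState {
    fn is_transitional(self) -> bool {
        use PackageState::*;
        matches!(
            self,
            Installing | Stopping | Starting | Restarting | Updating | Removing
                | CreatingBackup | RestoringBackup | BackingUp
        )
    }
}

// Reduced stand-in for the real entry; field types are guesses.
#[derive(Clone, Debug, PartialEq)]
struct PackageDataEntry {
    state: PackageState,
    lan_address: Option<String>,
    health: Option<String>,
    ports: Vec<u16>,
}

/// Takes every non-state field from the fresh scan result, but keeps the
/// existing transitional state unless it has been held for more than
/// twice the stop timeout (the spawned task probably died).
fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: PackageDataEntry,
    transitional_since: Option<Instant>,
    now: Instant,
    stop_timeout: Duration,
) -> PackageDataEntry {
    let stale = transitional_since
        .map(|since| now.saturating_duration_since(since) > stop_timeout * 2)
        .unwrap_or(false);

    if existing.state.is_transitional() && !stale {
        PackageDataEntry { state: existing.state, ..fresh }
    } else {
        fresh
    }
}
```

Passing `now` in explicitly keeps the function pure, which makes the "new merge helper test" in the verification gates straightforward to write without sleeping.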
**Commit 4 — `fix(ui): single-button lifecycle control with transitional labels`**
- `neode-ui/src/stores/container.ts` — add computed `getAppVisualState(appId)` that returns one of: `'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'`. Maps `exited`→`stopped`, `created`→`stopped`, `paused`→`stopped`, `installed`→`stopped`. Add `restartContainer(appId)` action (sets `loadingApps` for request dedup, calls client, does NOT `fetchContainers` immediately because server will broadcast state; a final `fetchContainers` after a short delay can backstop if WebSocket push is absent).
- `neode-ui/src/views/ContainerApps.vue:85-136` — replace the two-button conditional with a single full-width button bound to `getAppVisualState(app.id)`. Table:

| State | Click action | Label | Spinner | Disabled |
|---|---|---|---|---|
| `not-installed` | installApp | Install | no | no |
| `running` | stopContainer | Stop | no | no |
| `stopped` | startContainer | Start | no | no |
| `starting` | — | Starting… | yes | yes |
| `stopping` | — | Stopping… | yes | yes |
| `restarting` | — | Restarting… | yes | yes |
| `installing` | — | Installing… | yes | yes |
| `updating` | — | Updating… | yes | yes |
| `removing` | — | Removing… | yes | yes |
- Add a separate Restart button next to the primary one when state is `running`, calling new `restartContainer` action. Restart button hides while transitional.
- `neode-ui/src/views/ContainerAppDetails.vue:83` (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
- Also audit line 239 of `ContainerApps.vue` (`some((app) => store.getAppState(app.id) === 'created')`) and the logic around lines 276, 295, 309, 312 — make sure they use `getAppVisualState` where appropriate.
### Verification gates (do not skip)
1. `~/.cargo/bin/cargo check -p archipelago` on .116 via SSH
2. `~/.cargo/bin/cargo test -p archipelago` on .116 via SSH — at least the new merge helper test must pass
4. SCP binary to .228 `/usr/local/bin/archipelago`, back up prior to `/usr/local/bin/archipelago.bak-pre-async-stop`. `sudo systemctl restart archipelago` on .228.
5. **Manual LND stop test on .228**:
   - Open dashboard, confirm LND is Running (first: `ssh archipelago@192.168.1.228 'podman start lnd'` — LND is currently Exited(0) from the demo)
   - At no point should it revert to "Running" mid-stop
6. Same test with Bitcoin Core stop (longest timeout, 600s)
7. Frontend build: `cd ~/Projects/archy/neode-ui && npm run type-check && npm run build`. Rsync `dist/` to `archipelago@192.168.1.228:/var/lib/archipelago/web-ui/` (or wherever the active web root is — check `/etc/nginx` on .228 first).
8. Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
### Key files (exact lines of interest)
- `core/archipelago/src/api/rpc/container.rs:85-107` — `handle_container_stop` (blocking — target of fix)
- `archipelago@192.168.1.116:~/ui-chaos/` — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
- `/tmp/chaos/` on laptop — canonical source for rsync to .116.
- Run: `cd ~/ui-chaos && npx playwright test tests/<spec>`
- Uses SSH+Playwright hybrid per design; includes the `bash -lc '<escaped>'` single-quote fix for ssh argv flattening and JSON-parsed `podman inspect` instead of Go templates.
### Pre-existing bugs still deferred (do not fix until Stop UX lands)
1. `archipelago --version` spawns server (should be a pure CLI query)
2. RPC unknown-method returns generic error (should return method-not-found with the bad method name)
3. `docker_packages.rs` filters out UI containers (`archy-lnd-ui`, `archy-electrs-ui`) — some views need them visible
4. `lnd.lan_address` stale on .228
5. first-boot silent failure on some hardware
6. `web-ui.failed.*` scar on .228 (benign systemd unit state)
7. `test_parse_image_versions` pre-existing broken assertion — fix or `#[ignore]` when touching that area
### Host reference
| Host | IP | Role | Dashboard pw | Sudo pw | SSH |
- [ ] **Step 8b** — Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- [ ] **Step 11** — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)
## Post-Step 9 bug hunt (.228, 2026-04-23)
User reported three visible dashboard bugs after Step 9 verification:
1. LND — "no connect details or QR"
2. ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
3. bitcoin-core — in scope for chaos testing
**Root cause #1 (ExtraHost, commit `3ee192ba`)**: `scripts/first-boot-containers.sh` computed `HOST_GATEWAY` from `ip route show default`, which returns the **LAN router** (e.g. 192.168.1.254), not the gateway to the host. Every container configured with `--add-host=host.containers.internal:$HOST_GATEWAY` was dialing the WiFi router instead of the host. LND crash-looped with `dial tcp 192.168.1.254:8332: connection refused`; ElectrumX's DAEMON_URL hit the same dead end; any `archy-net` bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic `host-gateway` literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected `--add-host`; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).
**Root cause #2 (macaroon permissions, commit `be960023`)**: LND's `admin.macaroon` lives at `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon`, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (`getinfo`, `connect-info`, `export-channel-backup`) plus the shared `lnd_client()` helper failed with "Failed to read LND admin macaroon". **Confirmed pre-existing on .116 too** (long-standing bug unrelated to Step 9). Fix: centralised the path as `LND_ADMIN_MACAROON_PATH`, added a `read_lnd_admin_macaroon()` helper in `api/rpc/lnd/mod.rs` that tries direct read first then falls back to `sudo -n cat` (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — `curl -k https://<host>/lnd-connect-info` now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
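The read-then-fallback pattern described for `read_lnd_admin_macaroon()` can be sketched as below. This is an illustration only: `read_with_sudo_fallback` is a hypothetical name, the error shaping is an assumption, and the real helper in `api/rpc/lnd/mod.rs` may differ. The only external command used is `sudo -n cat <path>`, as stated above.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Tries a direct read first; if that fails (for example because the file
/// is owned by a rootless-podman subordinate UID), retries through
/// `sudo -n cat`, which never prompts for a password.
fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        Err(direct_err) => {
            // `output()` itself errors if `sudo` is not installed.
            let output = Command::new("sudo").arg("-n").arg("cat").arg(path).output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!(
                        "direct read failed ({}); sudo -n cat exited with {}",
                        direct_err, output.status
                    ),
                ))
            }
        }
    }
}
```

Routing every call site through one function like this is what keeps the fallback consistent across `getinfo`, `connect-info`, and the backup export.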
- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (`be960023`); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
- Post-start snapshot:
  - `🔗 Adopted 1 existing container(s): ["electrs-ui"]` — adoption of 13h-running container worked without recreation
  - `🔄 Boot reconciler started (interval: 30s)` — every 30s, all three app_ids reach `NoOp` after the initial install pass
7. Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.
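The render plus atomic tmp+rename step can be sketched as follows. The template text and the `{{RPC_USER}}` / `{{RPC_PASSWORD}}` placeholders are invented for illustration (the real template is embedded with `include_str!` and is not reproduced here), and `render_template` / `write_atomic` are hypothetical names.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

// Illustrative stand-in for the embedded template; not the real nginx.conf.
const NGINX_TEMPLATE: &str =
    "# rendered by archipelago\nset $rpc_user \"{{RPC_USER}}\";\nset $rpc_password \"{{RPC_PASSWORD}}\";\n";

/// Pure render step: same inputs always produce the same config text,
/// so it is safe to re-run on every reconcile.
fn render_template(template: &str, rpc_user: &str, rpc_password: &str) -> String {
    template
        .replace("{{RPC_USER}}", rpc_user)
        .replace("{{RPC_PASSWORD}}", rpc_password)
}

/// Writes `contents` to `<dest>.tmp` in the same directory, flushes it to
/// disk, then renames it over `dest`. Readers see either the old file or
/// the complete new one, never a partial write.
fn write_atomic(dest: &Path, contents: &str) -> io::Result<()> {
    let mut tmp_name = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(contents.as_bytes())?;
        file.sync_all()?;
    }
    fs::rename(&tmp, dest)
}
```

Keeping the temporary file next to the destination matters: a rename is only atomic when both paths are on the same filesystem.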
Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.
7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
8. If broken → restore `.bak` binary, re-enable DEV_MODE override.
**Risk on .116:** If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is `install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago`.
Original plan was one commit "delete bash + edit ISO builder". But on investigation:
- `first-boot-containers.sh` creates **30+ containers** with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.
This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in [`bulletproof-containers.md`](./bulletproof-containers.md).
---
## Current state
### Fleet status
All four Gitea mirrors are synced to v1.7.40-alpha:
| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via `podman system renumber` — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |
Fleet test nodes:
| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui` after v1.7.38/39 bug |
1. **UI companion containers disappear** on .228 after daemon restarts — no auto-recreate (fixed by v1.7.45 Quadlet migration)
2. **bitcoin.conf rpcauth drifts** from canonical secret → ElectrumX "Daemon connection problem" (fixed by v1.7.43 reconcile::derived)
3. **`host.containers.internal`** resolves to LAN gateway inside containers on some versions (fixed by v1.7.42 containers.conf)
4. **Podman state DB loss** requires manual recovery (fixed by v1.7.44 startup self-heal)
5. **LND "Connect Wallet" info** vanishing after crashes — symptom of the same drift class as #2
6. **ElectrumX not syncing** on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled
### Recent field incident (2026-04-22)
- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in [`bulletproof-containers.md`](./bulletproof-containers.md).
---
## Plan
We're shipping a level-triggered **reconciler + Quadlet** architecture over six incremental releases. Each release closes one failure mode. See [`bulletproof-containers.md`](./bulletproof-containers.md) for the full design, code layout, test harness, chaos matrix, sources.
### Release roadmap
| Release | Closes | What lands | Status |
|---|---|---|---|
| **v1.7.41** | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts | **in flight — deploying to .228 for test** |
| **v1.7.43** | FM2 (config drift) | `reconcile::derived::render_bitcoin_conf` — pure fn over canonical secret, rewrites on drift. Same for `lnd.conf` | pending |
| **v1.7.48+** | all (full daemon refactor) | `core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too | pending |
Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.
---
## Release history
### [v1.7.41-alpha](/releases/v1.7.41-alpha/) — IN FLIGHT — 2026-04-22
**Post-OTA auto-rollback.** After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.
Changes:
- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot — probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server
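The probe-until-deadline decision at the heart of the verifier can be sketched like this. The HTTP probe is injected as a closure so the example stays dependency-free; `probe_until_deadline`, `Verdict`, and the polling interval are hypothetical and not taken from `update.rs`.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Healthy,
    RollBack,
}

/// Calls `probe` repeatedly until it reports success or `deadline` has
/// elapsed. In the real verifier the probe would be a GET against
/// `https://127.0.0.1/` and the deadline would be 90 seconds.
fn probe_until_deadline<P>(mut probe: P, deadline: Duration, interval: Duration) -> Verdict
where
    P: FnMut() -> bool,
{
    let started = Instant::now();
    loop {
        if probe() {
            return Verdict::Healthy;
        }
        if started.elapsed() >= deadline {
            return Verdict::RollBack;
        }
        thread::sleep(interval);
    }
}
```

A caller would act on `Verdict::RollBack` by restoring `web-ui.bak`, calling `rollback_update()`, and restarting the service, as listed above.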
**Proper fix for the 500 error.** Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly `chmod 755` before tar; `--mode=u=rwX,go=rX` normalizes archive perms; pre-ship assertion aborts release if `tar tvzf | head -1` isn't `drwxr-xr-x`.
Changes:
- `scripts/create-release-manifest.sh`: pre-tar chmod + tar --mode flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
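The pre-ship assertion itself lives in a shell script, but its check is simple enough to state as a function. The sketch below assumes the first line of `tar tvzf` output is passed in as a string; `root_dir_mode_ok` is a hypothetical name and the sample lines in the usage note are made up.

```rust
/// Returns true when the first listing line describes a directory with
/// mode 755, i.e. its permission column is exactly `drwxr-xr-x`.
fn root_dir_mode_ok(first_listing_line: &str) -> bool {
    first_listing_line.split_whitespace().next() == Some("drwxr-xr-x")
}
```

For example, a line beginning `drwxr-xr-x root/root` passes, while one beginning `drwx------ root/root` (the v1.7.38/39 failure) does not.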
**Hotfix attempt** for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in `main.rs` and post-extract chmod in `update.rs` OTA applier.