From e557e0156f3593f85ecb251d0ca4ced42c979d9f Mon Sep 17 00:00:00 2001
From: archipelago
Date: Thu, 23 Apr 2026 04:45:12 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20STATUS.md=20=E2=80=94=20dashboard=20Sto?=
 =?UTF-8?q?p=20UX=20bug=20diagnosis=20+=20async-spawn=20fix=20plan?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Captures full design for the next session:
- Full bug sequence (5.5min blocking RPC + 30s scan clobbering transitional state)
- 4-commit implementation order with exact file:line targets
- Single-button UI spec with full label table
- Verification gates including manual LND stop test on .228
- Architectural decision: spawn lives in RPC layer, orchestrator trait stays sync

No code change yet; next session implements.
---
 docs/STATUS.md | 143 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 142 insertions(+), 1 deletion(-)

diff --git a/docs/STATUS.md b/docs/STATUS.md
index 084c58dd..ca612272 100644
--- a/docs/STATUS.md
+++ b/docs/STATUS.md
@@ -1,9 +1,150 @@
 # RESUME HERE — Rust orchestrator migration
 
-Updated: 2026-04-23 (Step 9 + .228 dashboard bug fixes complete, Step 10 / chaos matrix next)
+Updated: 2026-04-23 (Dashboard Stop UX bug diagnosed; async-spawn fix fully designed, ready to implement)
 
 **To resume this work, SSH into the ThinkPad and run `opencode` from `~/Projects/archy/`. Or work from the laptop via the SSHFS mount at `~/mnt/archy-thinkpad/`.**
 
+---
+
+## ⚡ NEXT SESSION — START HERE
+
+**Goal**: implement async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: _"best server containers in the world"_. Do not ship the chaos matrix (Step 11) until this lands and manual LND stop verifies instant RPC + live `Stopping…` label.
+
+### Bug being fixed
+
+Dashboard sequence when user clicks **Stop LND**:
+1. UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via `loadingApps.add('lnd')`.
+2. Frontend calls `container-stop` RPC. Server runs `podman stop -t 330 lnd` **synchronously inside the RPC handler** (via `orchestrator.stop()`). RPC blocks up to **5.5 min** for LND (330s timeout + overhead).
+3. Meanwhile the 30-second package-scan loop in `server.rs:scan_and_update_packages` keeps running. It rebuilds `PackageDataEntry` from podman inspect — podman still reports `running` (stop hasn't completed) — and **blindly overwrites** the store entry at `server.rs:854`.
+4. `container-list` RPC reads `state_manager` snapshot → returns `state = "running"`.
+5. Frontend polling sees `running` → `getAppState()` returns `'running'` → the two-button (Start | Stop) block re-renders → the transitional button disappears → **UI looks like the stop silently failed**.
+6. Eventually `podman stop` finishes → next scan → state flips to `Stopped` → buttons change _again_.
+
+Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".
+
+### Decisions already locked in (do not re-ask)
+
+- **Full scope fix** (not minimal hotfix). User chose "Go full scope, do it right".
+- **Async-spawn lives in the RPC layer**, not in the `ContainerOrchestrator` trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
+- **`PackageState` already has `Stopping`/`Starting`/`Restarting`/`Installing`/`Updating`/`Removing`** variants — enum at `core/archipelago/src/data_model.rs:107-124`. No schema change needed.
+- **UI collapses to one full-width button** with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when `not-installed`).
+- **Helper API shape**: `RpcHandler::spawn_transitional(op: Op, app_id: String)` where `Op` is an enum `{Stop, Start, Restart}`. Helper dispatches to `orchestrator.stop/start/restart` internally, knows each op's transitional+final states, handles error → revert + `install_log()`.
+- **`mark_user_stopped` must run BEFORE the spawn** (preserves ordering the crash recovery layer depends on — see `runtime.rs:145-148`).
+
+### Implementation order (4 commits, local only)
+
+**Commit 1 — `feat(rpc): spawn_transitional helper for async lifecycle ops`**
+- New file: `core/archipelago/src/api/rpc/transitional.rs` (or extend `container.rs`; prefer new file for cohesion with future stacks/package variants)
+- `enum Op { Stop, Start, Restart }` with `transitional_state()`, `final_state_on_success()`, `log_prefix()`, and async `dispatch(&orch, &app_id)` method
+- `impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }`
+  - Capture `Arc` + `Arc` clones
+  - Set transitional state via `state_manager.update_data()` (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
+  - `tokio::spawn(async move { ... })`
+  - Inside spawn: `install_log("{LOG_PREFIX}: {app_id}")`, `op.dispatch(&orch, &app_id).await`, on success set final state, on error log + `install_log("{LOG_PREFIX} FAIL: …")` + revert state to previous (cache pre-transition state in a local)
+  - Return `Ok(())` immediately after spawn
+
+**Commit 2 — `fix(rpc): async container stop/start/restart; widen state mapping`**
+- `api/rpc/container.rs:85-107` — rewrite `handle_container_stop` body: `validate_app_id`, `mark_user_stopped`, `spawn_transitional(Op::Stop, app_id.to_string()).await?`, return `Ok(json!({ "status": "stopping" }))`
+- `api/rpc/container.rs:61-83` — rewrite `handle_container_start`: `clear_user_stopped`, `spawn_transitional(Op::Start, …)`, return `{ "status": "starting" }`
+- **Add** `handle_container_restart` (currently missing in `container.rs` — only exists as `package.restart` at `runtime.rs:176-242`). Register RPC route name `container-restart`. Add matching frontend client method in `container-client.ts`.
+- `api/rpc/container.rs:148-154` — widen the `container-list` state mapping: add arms for `Stopping → "stopping"`, `Starting → "starting"`, `Restarting → "restarting"`, `Installing → "installing"`, `Updating → "updating"`, `Removing → "removing"`, `Installed → "installed"`, `CreatingBackup`/`RestoringBackup`/`BackingUp` → their kebab-case strings. No more `"unknown"` fallback unless the variant is genuinely unknown.
+- Mirror same spawn treatment in `api/rpc/package/runtime.rs`: `handle_package_start` (L28-119), `handle_package_stop` (L122-173), `handle_package_restart` (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) _inside_ the spawned future, not in the RPC body.
+
+**Commit 3 — `fix(state): preserve transitional state across container scans`**
+- `server.rs:847-857` — in the merge loop, before the `merged.insert(id.clone(), pkg.clone())` overwrite, check `merged.get(id).state` and skip overwrite if it's transitional: `matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)`
+- Still allow _non-state_ fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep `existing.state` but merge updated fields from `pkg`. Write a tiny helper `merge_preserving_transitional(existing, fresh) -> PackageDataEntry`.
+- Unit test: construct `existing.state = Stopping`, `fresh.state = Running`, assert merged.state stays `Stopping`.
+- **Also check**: Is there a timeout escape hatch? If `Stopping` is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck `Stopping` forever. Mitigation: track a `transitional_since: Instant` in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up — lean toward: include it, because fleet reliability matters.
+
+**Commit 4 — `fix(ui): single-button lifecycle control with transitional labels`**
+- `neode-ui/src/api/container-client.ts` — extend `ContainerStatus.state` union to: `'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'`. Add `restartContainer(appId)` method calling `container-restart`.
+- `neode-ui/src/stores/container.ts` — add computed `getAppVisualState(appId)` that returns one of: `'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'`. Maps `exited`→`stopped`, `created`→`stopped`, `paused`→`stopped`, `installed`→`stopped`. Add `restartContainer(appId)` action (sets `loadingApps` for request dedup, calls client, does NOT `fetchContainers` immediately because server will broadcast state; a final `fetchContainers` after a short delay can backstop if WebSocket push is absent).
+- `neode-ui/src/views/ContainerApps.vue:85-136` — replace the two-button conditional with a single full-width button bound to `getAppVisualState(app.id)`. Table:
+  | visual state    | click action   | label          | spinner | disabled |
+  |-----------------|----------------|----------------|---------|----------|
+  | `not-installed` | installApp     | Install        | no      | no       |
+  | `running`       | stopContainer  | Stop           | no      | no       |
+  | `stopped`       | startContainer | Start          | no      | no       |
+  | `starting`      | —              | Starting…      | yes     | yes      |
+  | `stopping`      | —              | Stopping…      | yes     | yes      |
+  | `restarting`    | —              | Restarting…    | yes     | yes      |
+  | `installing`    | —              | Installing…    | yes     | yes      |
+  | `updating`      | —              | Updating…      | yes     | yes      |
+  | `removing`      | —              | Removing…      | yes     | yes      |
+  - Add a separate Restart button next to the primary one when state is `running`, calling new `restartContainer` action. Restart button hides while transitional.
+- `neode-ui/src/views/ContainerAppDetails.vue:83` (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
+- Also audit line 239 of `ContainerApps.vue` (`some((app) => store.getAppState(app.id) === 'created')`) and the logic around lines 276, 295, 309, 312 — make sure they use `getAppVisualState` where appropriate.
+
+### Verification gates (do not skip)
+
+1. `~/.cargo/bin/cargo check -p archipelago` on .116 via SSH
+2. `~/.cargo/bin/cargo test -p archipelago` on .116 via SSH — at least the new merge helper test must pass
+3. Build release binary on .116: `nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown`. Poll until done.
+4. SCP binary to .228 `/usr/local/bin/archipelago`, back up prior to `/usr/local/bin/archipelago.bak-pre-async-stop`. `sudo systemctl restart archipelago` on .228.
+5. **Manual LND stop test on .228**:
+   - Open dashboard, confirm LND is Running (first: `ssh archipelago@192.168.1.228 'podman start lnd'` — LND is currently Exited(0) from the demo)
+   - Click Stop
+   - Expected: button _immediately_ becomes "Stopping…" with spinner (RPC returns <1s)
+   - Dashboard should stay on "Stopping…" for ~5 min
+   - Then flip to "Start" button with label "Start"
+   - At no point should it revert to "Running" mid-stop
+6. Same test with Bitcoin Core stop (longest timeout, 600s)
+7. Frontend build: `cd ~/Projects/archy/neode-ui && npm run type-check && npm run build`. Rsync `dist/` to `archipelago@192.168.1.228:/var/lib/archipelago/web-ui/` (or wherever the active web root is — check `/etc/nginx` on .228 first).
+8. Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
+
+### Key files (exact lines of interest)
+
+- `core/archipelago/src/api/rpc/container.rs:85-107` — `handle_container_stop` (blocking — target of fix)
+- `core/archipelago/src/api/rpc/container.rs:61-83` — `handle_container_start`
+- `core/archipelago/src/api/rpc/container.rs:148-154` — narrow state mapping (drops transitional → "unknown")
+- `core/archipelago/src/api/rpc/package/runtime.rs:11-24` — `stop_timeout_secs` table (reference, unchanged)
+- `core/archipelago/src/api/rpc/package/runtime.rs:122-173` — `handle_package_stop` (also blocking, mirror treatment)
+- `core/archipelago/src/api/rpc/package/runtime.rs:28-119` — `handle_package_start`
+- `core/archipelago/src/api/rpc/package/runtime.rs:176-242` — `handle_package_restart`
+- `core/archipelago/src/api/rpc/package/progress.rs` — existing broadcast pattern to mirror (`set_install_progress`, `set_uninstall_stage`)
+- `core/archipelago/src/api/rpc/mod.rs:62-100` — `RpcHandler` struct (already holds `Arc` + state_manager)
+- `core/archipelago/src/server.rs:812-857` — `scan_and_update_packages` (merge loop at L850-857 is where transitional-state clobber happens)
+- `core/archipelago/src/container/docker_packages.rs:636-663` — `convert_state` + `package_state_str` (read-only reference, no change)
+- `core/archipelago/src/container/traits.rs` — `ContainerOrchestrator` trait (stays synchronous, do not change)
+- `core/archipelago/src/crash_recovery.rs` — `mark_user_stopped` / `clear_user_stopped` (call order preserved)
+- `core/archipelago/src/data_model.rs:107-124` — `PackageState` enum (no change — all variants exist)
+- `neode-ui/src/api/container-client.ts` — `ContainerStatus` type + RPC methods (extend)
+- `neode-ui/src/stores/container.ts:93-312` — Pinia store (add `getAppVisualState`, add `restartContainer` action)
+- `neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383` — two-button block + state reads
+- `neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232` — details page Stop/Start
+
+### Chaos harness (not in repo — lives on .116)
+
+- `archipelago@192.168.1.116:~/ui-chaos/` — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
+- `/tmp/chaos/` on laptop — canonical source for rsync to .116.
+- Run: `cd ~/ui-chaos && npx playwright test tests/`
+- Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
+- Uses SSH+Playwright hybrid per design; includes the `bash -lc ''` single-quote fix for ssh argv flattening and JSON-parsed `podman inspect` instead of Go templates.
+
+### Pre-existing bugs still deferred (do not fix until Stop UX lands)
+
+1. `archipelago --version` spawns server (should be a pure CLI query)
+2. RPC unknown-method returns generic error (should return method-not-found with the bad method name)
+3. `docker_packages.rs` filters out UI containers (`archy-lnd-ui`, `archy-electrs-ui`) — some views need them visible
+4. `lnd.lan_address` stale on .228
+5. first-boot silent failure on some hardware
+6. `web-ui.failed.*` scar on .228 (benign systemd unit state)
+7. `test_parse_image_versions` pre-existing broken assertion — fix or `#[ignore]` when touching that area
+
+### Host reference
+
+| Host | IP | Role | Dashboard pw | Sudo pw | SSH |
+|---|---|---|---|---|---|
+| `archy` (ThinkPad X250) | 192.168.1.116 | dev host, Debian 13, repo at `~/Projects/archy/` | archipelago | `ThisIsWeb54321@` | key installed |
+| `archy228` (HP ProDesk) | 192.168.1.228 | prod kiosk, new Rust orchestrator binary | password123 | archipelago (NOPASSWD:ALL via /etc/sudoers.d/archipelago-ci) | key installed |
+
+- Laptop SSHFS mount: `~/mnt/archy-thinkpad/` (edits OK, git/cargo via SSH)
+- Cargo path over SSH: `~/.cargo/bin/cargo` (non-interactive login has no cargo in PATH)
+- Release model: local commit + tag only; user pushes to 4 Gitea mirrors personally
+- Full destructive latitude on both nodes. Announce multi-hour ops. Don't ask for routine stop/start/rebuild permission.
+
+---
+
 ## Where we are
 
 Working through the 11-step plan in [`rust-orchestrator-migration.md`](./rust-orchestrator-migration.md).
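
Reviewer sketch for Commit 3 (not part of the patch): one possible shape for the `merge_preserving_transitional` helper and its unit test. The `PackageState` and `PackageDataEntry` types below are simplified, hypothetical stand-ins; the real definitions live in `data_model.rs` and almost certainly carry more fields and variants. The references-in, owned-value-out signature is also an assumption.

```rust
// Hypothetical, trimmed-down stand-in for the real `PackageState` enum.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PackageState {
    Running,
    Stopped,
    Installed,
    Installing,
    Stopping,
    Starting,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

// Hypothetical stand-in for the real `PackageDataEntry`; only the fields
// needed to illustrate the merge are included.
#[derive(Debug, Clone, PartialEq)]
struct PackageDataEntry {
    state: PackageState,
    lan_address: Option<String>,
    health: Option<String>,
}

/// True for states owned by an in-flight lifecycle operation. The variant
/// list mirrors the `matches!` expression quoted in the Commit 3 notes.
fn is_transitional(state: PackageState) -> bool {
    use PackageState::*;
    matches!(
        state,
        Installing
            | Stopping
            | Starting
            | Restarting
            | Updating
            | Removing
            | CreatingBackup
            | RestoringBackup
            | BackingUp
    )
}

/// Takes every field from the fresh scan result, but keeps the existing
/// state when that state is transitional, so a scan cannot clobber it.
fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: &PackageDataEntry,
) -> PackageDataEntry {
    let mut merged = fresh.clone();
    if is_transitional(existing.state) {
        merged.state = existing.state;
    }
    merged
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn scan_does_not_clobber_stopping() {
        let existing = PackageDataEntry {
            state: PackageState::Stopping,
            lan_address: None,
            health: None,
        };
        let fresh = PackageDataEntry {
            state: PackageState::Running,
            lan_address: Some("lnd.local".to_string()),
            health: Some("healthy".to_string()),
        };
        let merged = merge_preserving_transitional(&existing, &fresh);
        // State is preserved; non-state fields come from the fresh scan.
        assert_eq!(merged.state, PackageState::Stopping);
        assert_eq!(merged.lan_address.as_deref(), Some("lnd.local"));
    }
}
```

This sketch omits the `transitional_since` escape hatch; if that lands in the same commit, the staleness check would presumably sit next to the `is_transitional` branch so a stuck `Stopping` entry can be overridden by the scan.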