archy/docs/STATUS.md

# RESUME HERE — Rust orchestrator migration

Updated: 2026-04-23 (Dashboard Stop UX bug diagnosed; async-spawn fix fully designed, ready to implement)

**To resume this work, SSH into the ThinkPad and run `opencode` from `~/Projects/archy/`. Or work from the laptop via the SSHFS mount at `~/mnt/archy-thinkpad/`.**

---

## ⚡ NEXT SESSION — START HERE

**Goal**: implement the async-spawn lifecycle fix so the dashboard never shows a frozen spinner again. User mandate: _"best server containers in the world"_. Do not ship the chaos matrix (Step 11) until this lands and a manual LND stop confirms an instant RPC return and a live `Stopping…` label.

### How to work on this repo (SSH + SSHFS setup)

You are likely running on the **laptop** (macOS). The repo lives on the **ThinkPad** (.116). There are two access paths; use both in parallel:

1. **SSHFS mount at `~/mnt/archy-thinkpad/`** — for all file ops (`read`/`edit`/`write`/`glob`/`grep`).
2. **Direct SSH** — for everything that isn't file ops: `git`, `cargo`, `npm`, `systemctl`, running the server, tailing logs.

See the "FUSE / SSHFS development loop" section below for the full mount lifecycle — that's _the_ thing that makes this dev setup work, and it will break periodically.

### FUSE / SSHFS development loop

**Why this exists**: editing the repo directly on the ThinkPad over raw SSH means no IDE, no tool-native file reads, no glob/grep speed. SSHFS mounts the remote filesystem as a local directory so OpenCode's file tools work transparently. But SSHFS is a leaky abstraction — know the gotchas or you'll waste hours.

**Stack** (macOS laptop):
- **macFUSE** — kernel extension providing FUSE on macOS. Install via `brew install --cask macfuse` (requires reboot + security approval in System Settings the first time).
- **sshfs** — userspace mount tool. Install via `brew install gromgit/fuse/sshfs-mac` (the homebrew core `sshfs` was removed; use this tap).
- Verify: `which sshfs` → `/opt/homebrew/bin/sshfs`, `sshfs --version` → `SSHFS version 2.10 / FUSE library version 2.9.9`.

**Actual mount command currently running** (verified from `ps`):
```
sshfs archy:Projects/archy /Users/dorian/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad
```

Breakdown:
- `archy:Projects/archy` — remote path via the `archy` SSH alias (uses `~/.ssh/archy_opencode`, no password prompt).
- `~/mnt/archy-thinkpad` — local mount point. Create once: `mkdir -p ~/mnt/archy-thinkpad`.
- `reconnect` — sshfs auto-reconnects if the TCP session drops (WiFi flap, laptop sleep). Without this, the mount turns into a zombie immediately.
- `ServerAliveInterval=15` — sends a keepalive every 15s.
- `ServerAliveCountMax=3` — disconnect after 3 missed keepalives (45s). Tune up if your network is flaky.
- `volname=archy-thinkpad` — Finder display name.

**Check mount health**:
```
mount | grep archy-thinkpad
# should print: archy:Projects/archy on /Users/dorian/mnt/archy-thinkpad (macfuse, nodev, nosuid, synchronous, mounted by dorian)

ls ~/mnt/archy-thinkpad/ | head
# should list repo contents fast (<1s). If it hangs, mount is stale.
```
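
A stale FUSE mount tends to hang rather than fail, so an automated version of the `ls` check needs a deadline. The sketch below shows one way to do that in Rust using only the standard library. `probe_mount` and `MountHealth` are hypothetical names for illustration, not part of the repo.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of probing a directory that may sit on a stale FUSE mount.
#[derive(Debug, PartialEq)]
enum MountHealth {
    /// The directory listing came back in time; carries the entry count.
    Healthy(usize),
    /// The listing returned an error (missing path, "Device not configured", ...).
    Failed(String),
    /// No answer before the deadline (or the helper thread died): treat as stale.
    TimedOut,
}

/// Lists `dir` on a helper thread and waits at most `deadline` for the result.
/// A hung FUSE call blocks only the helper thread, never the caller.
fn probe_mount(dir: &Path, deadline: Duration) -> MountHealth {
    let (tx, rx) = mpsc::channel();
    let dir: PathBuf = dir.to_path_buf();
    thread::spawn(move || {
        let result = fs::read_dir(&dir).map(|entries| entries.count());
        // The receiver may already have given up; ignore a failed send.
        let _ = tx.send(result);
    });
    match rx.recv_timeout(deadline) {
        Ok(Ok(count)) => MountHealth::Healthy(count),
        Ok(Err(err)) => MountHealth::Failed(err.to_string()),
        Err(_) => MountHealth::TimedOut,
    }
}
```

Calling `probe_mount(Path::new("/Users/dorian/mnt/archy-thinkpad"), Duration::from_secs(2))` and treating `TimedOut` as the cue to run the recovery sequence would mirror the manual check. The helper thread stays blocked until the hung call returns, which is acceptable for a one-shot probe.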

**Recovery when the mount hangs / goes stale** (this WILL happen — laptop sleeps, WiFi drops, ThinkPad reboots):
```
# 1. Force-unmount (macOS — `umount` alone often fails on a hung FUSE mount)
sudo diskutil unmount force ~/mnt/archy-thinkpad
# fallback if diskutil can't see it:
sudo umount -f ~/mnt/archy-thinkpad

# 2. Kill any zombie sshfs process
pkill -f "sshfs archy:Projects/archy"

# 3. Remount
sshfs archy:Projects/archy ~/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad

# 4. Verify
ls ~/mnt/archy-thinkpad/ | head
```

If the mount point itself got wedged (`ls: /Users/dorian/mnt/archy-thinkpad: Device not configured`), the sequence above still works — macFUSE garbage-collects the inode after the force-unmount.

**When to use which path** (rules, not suggestions):
| Operation | Use | Why |
|---|---|---|
| `read` / `edit` / `write` | SSHFS mount | OpenCode tools want local paths |
| `glob` / `grep` | SSHFS mount | Local FS traversal is fine; remote would need rg over SSH |
| Reading many files | SSHFS mount | Each read is a round-trip but parallelizable |
| `git status` / `git diff` / `git log` | SSH | Git over FUSE is painfully slow (lots of stat calls) |
| `git add` / `git commit` | SSH | Same — commit times grow linearly with tree size on FUSE |
| `cargo check` / `cargo test` / `cargo build` | SSH | Compiling over FUSE would take hours; cargo's incremental stat pattern destroys FUSE performance |
| `npm install` / `npm run build` | SSH | Same reason — massive file churn |
| Running the server / tailing journal | SSH | Service lives on .116 |
| Deploying to .228 | SSH from .116 | SCP from ThinkPad; laptop isn't in the critical path |

**Don't do this** (will bite you):
- `cargo build` from the mount — will try to write target/ over FUSE, gets orders of magnitude slower, may hang.
- `rsync` without `--exclude="._*"` — macOS writes AppleDouble metadata files, which leak to the remote as `._*` siblings of every real file. `.gitignore` already excludes them (commit `13858842`), but they still clutter the tree.
- Writing big binary files via the mount — use `scp` over SSH instead.
- Relying on file-change-watcher tools (watchman, chokidar) — they get confused by FUSE event semantics.

**Editing workflow in a typical session**:
1. Laptop: OpenCode `read`s a file via `/Users/dorian/mnt/archy-thinkpad/...`. FUSE fetches it over SSH, caches briefly.
2. Laptop: OpenCode `edit`s the file — FUSE writes the new bytes back to .116 immediately (synchronous mount).
3. Laptop: `ssh archy "cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago"` — runs on the real filesystem on .116, sees the edit.
4. Laptop: `ssh archy "cd ~/Projects/archy && git diff path/to/file"` — confirms the edit landed.
5. Laptop: `ssh archy "cd ~/Projects/archy && git add path/to/file && git commit -m '...'"` — commit from .116.

The SSHFS mount and the SSH shell are pointing at **the same inodes** — edits via the mount are instantly visible to `cargo`/`git` over SSH. There's no "sync" step.

**Cache caveat**: macFUSE caches attributes briefly (default ~1s). If you write via SSH and read via the mount within that window, you may see stale metadata. The mount's `synchronous` flag (visible in `mount` output) minimizes but doesn't eliminate this. If SSH and the mount disagree about a file, wait a second or two for the attribute cache to expire, then re-read it (for example with `stat ~/mnt/archy-thinkpad/<file>`; macOS ships BSD `stat`, which has no `--file-system` flag).

**Direct SSH** access (use when FUSE isn't the right tool):
- `ssh archy` → `archipelago@192.168.1.116` using `~/.ssh/archy_opencode`
- `ssh archy228` → `archipelago@192.168.1.228` using `~/.ssh/archy_opencode`
- Full host form also works: `ssh archipelago@192.168.1.116` / `ssh archipelago@192.168.1.228` (same key resolves via IdentitiesOnly).

### SSH keys — what's where

**Laptop `~/.ssh/` (macOS, user `dorian`)**:
| File | Purpose |
|---|---|
| `archy_opencode` / `.pub` | **Primary key for this project.** Unlocks both `archy` (.116) and `archy228` (.228). Created 2026-04-22 specifically for OpenCode work. |
| `archipelago-deploy` / `.pub` | Older archipelago deploy key. Not needed for current work. |
| `id_ed25519` / `.pub` | Personal default key. Not used by archy/archy228 configs (`IdentitiesOnly yes` forces `archy_opencode`). |
| `id_ed25519_angor` / `.pub` | Angor project. Unrelated. |
| `id_ed25519_start9` / `.pub` | Start9 project. Unrelated. |
| `vps-ci-setup` / `.pub` | VPS CI. Unrelated. |
| `config` | Host aliases (shown above) |

**.116 `/home/archipelago/.ssh/`**:
| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's `archy_opencode.pub` + 3 other keys (4 lines total). |
| `id_ed25519` / `.pub` | .116's OWN identity key. This is what lets `.116 → .228` work passwordless. |
| `archipelago-deploy` | Symlink → `id_ed25519` (legacy alias). |
| `id_ed25519_vps168` / `.pub` | For SSH to `146.59.87.168` (VPS). Unrelated to this work. |
| `config` | Host entry for the VPS only. |

**.228 `/home/archipelago/.ssh/`**:
| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's `archy_opencode.pub` + .116's `id_ed25519.pub` + 2 others (4 lines total). |
| _(no `id_ed25519`)_ | .228 has no outbound key — it's a terminal node. Don't try to `ssh` _from_ .228 _to_ anywhere. |

**Connectivity matrix (all verified 2026-04-23)**:
| From → To | Works passwordless | Via |
|---|---|---|
| Laptop → .116 | ✅ | `archy_opencode` |
| Laptop → .228 | ✅ | `archy_opencode` |
| .116 → .228 | ✅ | .116's `id_ed25519` |
| .228 → anywhere | ❌ | no outbound key (by design) |

### Sudo — verified state

**.116** (dev ThinkPad):
- User `archipelago` is in `sudo` group.
- Sudo password required: **`ThisIsWeb54321@`**
- Sudoers drop-ins present: `/etc/sudoers.d/archipelago-ci`, `/etc/sudoers.d/archipelago-wg` (scope-limited NOPASSWD for specific CI/wg commands — not full NOPASSWD).
- For most dev work you don't need sudo on .116.

**.228** (prod kiosk):
- User `archipelago` has **full passwordless sudo** via `/etc/sudoers.d/archipelago` containing `archipelago ALL=(ALL) NOPASSWD:ALL`.
- User is also in `sudo` group.
- Sudo password (if ever prompted, shouldn't be): **`archipelago`**
- Dashboard password: **`password123`**

### Cargo / npm / paths

- **Cargo PATH gotcha**: non-interactive SSH login has no cargo in PATH. Always use `~/.cargo/bin/cargo` over SSH.
  - `ssh` has no `--workdir` option, so `cd` inside the remote command: `ssh archy 'cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago'`
  - To run from the workspace directory instead: `ssh archy 'cd ~/Projects/archy/core && ~/.cargo/bin/cargo check -p archipelago'`
- **Long cargo builds** (>2 min Bash tool timeout): launch detached and poll the log:
  ```
  ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
  ssh archy 'tail -30 /tmp/cargo-build.log'
  ssh archy 'pgrep -a cargo'   # to check if still running
  ```
- **npm / frontend** lives at `~/Projects/archy/neode-ui/` on .116 (also accessible via laptop mount at `~/mnt/archy-thinkpad/neode-ui/`). Node is only on the interactive PATH; for scripted SSH, run `source ~/.nvm/nvm.sh && nvm use` first if nvm manages Node, or call the binary by its absolute path.
- Repo on .116: `~/Projects/archy/` (Cargo workspace at `core/Cargo.toml`).
- Web root on .228: check `/etc/nginx/sites-enabled/` for the live path; historically `/var/lib/archipelago/web-ui/` or `/opt/archipelago/web-ui/`.

### Deploying new server binary to .228

```
# 1. Build on .116 (detached — takes ~3-5 min for release)
ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
# wait / tail log until "Finished `release` profile"

# 2. SCP .116 → .228 (uses .116's id_ed25519 → .228's authorized_keys, passwordless)
ssh archy 'scp ~/Projects/archy/core/target/release/archipelago archipelago@192.168.1.228:/tmp/archipelago.new'

# 3. Atomic swap on .228 with backup
ssh archy228 'sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-pre-async-stop && sudo mv /tmp/archipelago.new /usr/local/bin/archipelago && sudo chmod +x /usr/local/bin/archipelago && sudo systemctl restart archipelago'

# 4. Verify
ssh archy228 'systemctl status archipelago --no-pager | head -20 && sudo journalctl -u archipelago -n 50 --no-pager'
```

### Git workflow

- Branch: `main` on .116, currently **22 commits ahead of `tx1138/main`**.
- Remote `tx1138` exists but **do NOT push** — user mirrors to 4 Gitea remotes personally after reviewing.
- Atomic commits, one logical change per commit. Conventional Commits format (`feat:`, `fix:`, `docs:`, `refactor:`, `chore:`, `test:`, `perf:`).
- Never `--amend` unless the commit you're amending was created in this session AND has not been pushed. Safer: new commit.
- Never `--force` push. Never modify git config.
- If pre-commit hooks fail, create a NEW commit with the fix — don't `--amend` after a failed commit.

### Other

- Full destructive latitude on both nodes. Announce multi-hour ops (OTA, full rebuild, apt upgrade). Don't ask for routine stop/start/rebuild permission.
- No ship pressure. Do it properly.
- Use `question` tool for ambiguous decisions (don't guess user intent on design choices).
- Keep `docs/STATUS.md` fresh between sessions — it IS the session handoff.

### Hosts reference (quick)

| Host | IP | SSH alias | Role | Dashboard | Sudo |
|---|---|---|---|---|---|
| `archy` (ThinkPad X250) | 192.168.1.116 | `ssh archy` | dev host, Debian 13 | `archipelago` | `ThisIsWeb54321@` |
| `archy228` (HP ProDesk) | 192.168.1.228 | `ssh archy228` | prod kiosk, Rust orchestrator | `password123` | NOPASSWD (fallback `archipelago`) |

### Bug being fixed

Dashboard sequence when user clicks **Stop LND**:
1. UI collapses the Start/Stop buttons into a single spinner button ("Stopping…") via `loadingApps.add('lnd')`.
2. Frontend calls `container-stop` RPC. Server runs `podman stop -t 330 lnd` **synchronously inside the RPC handler** (via `orchestrator.stop()`). RPC blocks up to **5.5 min** for LND (330s timeout + overhead).
3. Meanwhile the 30-second package-scan loop in `server.rs:scan_and_update_packages` keeps running. It rebuilds `PackageDataEntry` from podman inspect — podman still reports `running` (stop hasn't completed) — and **blindly overwrites** the store entry at `server.rs:854`.
4. `container-list` RPC reads `state_manager` snapshot → returns `state = "running"`.
5. Frontend polling sees `running` → `getAppState()` returns `'running'` → the two-button (Start | Stop) block re-renders → the transitional button disappears → **UI looks like the stop silently failed**.
6. Eventually `podman stop` finishes → next scan → state flips to `Stopped` → buttons change _again_.

Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".

### Decisions already locked in (do not re-ask)

- **Full scope fix** (not minimal hotfix). User chose "Go full scope, do it right".
- **Async-spawn lives in the RPC layer**, not in the `ContainerOrchestrator` trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
- **`PackageState` already has `Stopping`/`Starting`/`Restarting`/`Installing`/`Updating`/`Removing`** variants — enum at `core/archipelago/src/data_model.rs:107-124`. No schema change needed.
- **UI collapses to one full-width button** with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when `not-installed`).
- **Helper API shape**: `RpcHandler::spawn_transitional(op: Op, app_id: String)` where `Op` is an enum `{Stop, Start, Restart}`. Helper dispatches to `orchestrator.stop/start/restart` internally, knows each op's transitional+final states, handles error → revert + `install_log()`.
- **`mark_user_stopped` must run BEFORE the spawn** (preserves ordering the crash recovery layer depends on — see `runtime.rs:145-148`).

### Implementation order (4 commits, local only)

**Commit 1 — `feat(rpc): spawn_transitional helper for async lifecycle ops`**
- New file: `core/archipelago/src/api/rpc/transitional.rs` (or extend `container.rs`; prefer new file for cohesion with future stacks/package variants)
- `enum Op { Stop, Start, Restart }` with `transitional_state()`, `final_state_on_success()`, `log_prefix()`, and async `dispatch(&orch, &app_id)` method
- `impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }`
  - Capture `Arc<dyn ContainerOrchestrator>` + `Arc<StateManager>` clones
  - Set transitional state via `state_manager.update_data()` (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
  - `tokio::spawn(async move { ... })`
  - Inside spawn: `install_log("{LOG_PREFIX}: {app_id}")`, `op.dispatch(&orch, &app_id).await`, on success set final state, on error log + `install_log("{LOG_PREFIX} FAIL: …")` + revert state to previous (cache pre-transition state in a local)
  - Return `Ok(())` immediately after spawn
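
The helper described in Commit 1 could look roughly like the sketch below. It runs under stated assumptions and is not the final implementation: `std::thread::spawn` stands in for `tokio::spawn` so the example needs no external crates, `Orchestrator` is a hypothetical one-method stand-in for `ContainerOrchestrator`, the store is a plain `Mutex<HashMap>` rather than `StateManager`, `eprintln!` replaces `install_log()`, and the function returns the `JoinHandle` only so a test can wait for completion.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

#[derive(Clone, Copy, Debug, PartialEq)]
enum PackageState { Running, Stopped, Starting, Stopping, Restarting }

#[derive(Clone, Copy, Debug)]
enum Op { Stop, Start, Restart }

impl Op {
    fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }
    fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }
    fn log_prefix(self) -> &'static str {
        match self { Op::Stop => "STOP", Op::Start => "START", Op::Restart => "RESTART" }
    }
}

/// Hypothetical stand-in for the synchronous `ContainerOrchestrator` trait.
trait Orchestrator: Send + Sync {
    fn run(&self, op: Op, app_id: &str) -> Result<(), String>;
}

type Store = Arc<Mutex<HashMap<String, PackageState>>>;

fn spawn_transitional(
    orch: Arc<dyn Orchestrator>,
    store: Store,
    op: Op,
    app_id: String,
) -> JoinHandle<()> {
    // Set the transitional state only if an entry exists, and remember the
    // pre-transition state so a failure can revert to it.
    let previous = {
        let mut map = store.lock().unwrap();
        let old = map
            .get_mut(&app_id)
            .map(|slot| std::mem::replace(slot, op.transitional_state()));
        old
    };
    // The slow operation runs in the background; the caller returns at once.
    thread::spawn(move || {
        eprintln!("{}: {}", op.log_prefix(), app_id);
        let outcome = orch.run(op, &app_id);
        if let Err(ref err) = outcome {
            eprintln!("{} FAIL: {}: {}", op.log_prefix(), app_id, err);
        }
        // Never create an entry for an app that had none before the call.
        if let Some(before) = previous {
            let next = if outcome.is_ok() { op.final_state_on_success() } else { before };
            store.lock().unwrap().insert(app_id, next);
        }
    })
}
```

In the real handler, `mark_user_stopped` would run before this call, and the RPC would return `{ "status": "stopping" }` as soon as the spawn has happened.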

**Commit 2 — `fix(rpc): async container stop/start/restart; widen state mapping`**
- `api/rpc/container.rs:85-107` — rewrite `handle_container_stop` body: `validate_app_id`, `mark_user_stopped`, `spawn_transitional(Op::Stop, app_id.to_string()).await?`, return `Ok(json!({ "status": "stopping" }))`
- `api/rpc/container.rs:61-83` — rewrite `handle_container_start`: `clear_user_stopped`, `spawn_transitional(Op::Start, …)`, return `{ "status": "starting" }`
- **Add** `handle_container_restart` (currently missing in `container.rs` — only exists as `package.restart` at `runtime.rs:176-242`). Register RPC route name `container-restart`. Add matching frontend client method in `container-client.ts`.
- `api/rpc/container.rs:148-154` — widen the `container-list` state mapping: add arms for `Stopping → "stopping"`, `Starting → "starting"`, `Restarting → "restarting"`, `Installing → "installing"`, `Updating → "updating"`, `Removing → "removing"`, `Installed → "installed"`, `CreatingBackup`/`RestoringBackup`/`BackingUp` → their kebab-case strings. No more `"unknown"` fallback unless the variant is genuinely unknown.
- Mirror same spawn treatment in `api/rpc/package/runtime.rs`: `handle_package_start` (L28-119), `handle_package_stop` (L122-173), `handle_package_restart` (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) _inside_ the spawned future, not in the RPC body.
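
The widened `container-list` mapping from Commit 2 can be sketched as an exhaustive `match`. The enum below is a stand-in that lists only the variants this document names (the real `PackageState` may have more), and `container_list_state` is a hypothetical function name.

```rust
/// Stand-in for `PackageState`, limited to the variants this document names.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PackageState {
    Running,
    Stopped,
    Installed,
    Starting,
    Stopping,
    Restarting,
    Installing,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

/// Maps a state to the kebab-case string returned by `container-list`.
/// There is deliberately no `_ =>` arm: adding a variant without choosing
/// its wire string becomes a compile error instead of a silent "unknown".
fn container_list_state(state: PackageState) -> &'static str {
    match state {
        PackageState::Running => "running",
        PackageState::Stopped => "stopped",
        PackageState::Installed => "installed",
        PackageState::Starting => "starting",
        PackageState::Stopping => "stopping",
        PackageState::Restarting => "restarting",
        PackageState::Installing => "installing",
        PackageState::Updating => "updating",
        PackageState::Removing => "removing",
        PackageState::CreatingBackup => "creating-backup",
        PackageState::RestoringBackup => "restoring-backup",
        PackageState::BackingUp => "backing-up",
    }
}
```

Leaving out the wildcard arm is a design choice rather than a requirement; it trades a little verbosity for the compiler catching any state the frontend union has not been taught about.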

**Commit 3 — `fix(state): preserve transitional state across container scans`**
- `server.rs:847-857` — in the merge loop, before the `merged.insert(id.clone(), pkg.clone())` overwrite, check `merged.get(id).state` and skip overwrite if it's transitional: `matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)`
- Still allow _non-state_ fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep `existing.state` but merge updated fields from `pkg`. Write a tiny helper `merge_preserving_transitional(existing, fresh) -> PackageDataEntry`.
- Unit test: construct `existing.state = Stopping`, `fresh.state = Running`, assert merged.state stays `Stopping`.
- **Also check**: is there a timeout escape hatch? If `Stopping` is set and podman finishes the stop but the spawned task dies before writing the final state (process crash, panic), the entry stays `Stopping` forever. Mitigation: track a `transitional_since: Instant` per entry (not persisted; an in-memory side table on `StateManager`), and once more than 2× the stop timeout has elapsed, let the podman scan state override. This could land in this commit or a follow-up; lean toward including it here, because fleet reliability matters.
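
A minimal sketch of the merge helper and the staleness check from Commit 3 follows. The struct is a cut-down stand-in for `PackageDataEntry` with two illustrative non-state fields, and `is_transitional` and `transitional_expired` are hypothetical helper names; the real entry has more fields and the real side table lives on `StateManager`.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq)]
enum PackageState {
    Running, Stopped, Installing, Stopping, Starting, Restarting,
    Updating, Removing, CreatingBackup, RestoringBackup, BackingUp,
}

/// Cut-down stand-in for `PackageDataEntry`.
#[derive(Clone, Debug, PartialEq)]
struct PackageDataEntry {
    state: PackageState,
    lan_address: Option<String>,
    health: Option<String>,
}

fn is_transitional(state: PackageState) -> bool {
    matches!(
        state,
        PackageState::Installing
            | PackageState::Stopping
            | PackageState::Starting
            | PackageState::Restarting
            | PackageState::Updating
            | PackageState::Removing
            | PackageState::CreatingBackup
            | PackageState::RestoringBackup
            | PackageState::BackingUp
    )
}

/// Takes every field from the fresh scan, but keeps the existing state while
/// a lifecycle operation is still in flight.
fn merge_preserving_transitional(
    existing: &PackageDataEntry,
    fresh: &PackageDataEntry,
) -> PackageDataEntry {
    let mut merged = fresh.clone();
    if is_transitional(existing.state) {
        merged.state = existing.state;
    }
    merged
}

/// Escape hatch: true once a transitional state has outlived twice the stop
/// timeout, at which point the scan result may override it.
fn transitional_expired(since: Instant, now: Instant, stop_timeout: Duration) -> bool {
    now.saturating_duration_since(since) > stop_timeout * 2
}
```

With LND's 330s stop timeout, the escape hatch in this sketch would open after 660s. A caller would consult `transitional_expired` first and fall back to a plain overwrite when it returns true.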

**Commit 4 — `fix(ui): single-button lifecycle control with transitional labels`**
- `neode-ui/src/api/container-client.ts` — extend `ContainerStatus.state` union to: `'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'`. Add `restartContainer(appId)` method calling `container-restart`.
- `neode-ui/src/stores/container.ts` — add computed `getAppVisualState(appId)` that returns one of: `'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'`. Maps `exited`→`stopped`, `created`→`stopped`, `paused`→`stopped`, `installed`→`stopped`. Add `restartContainer(appId)` action (sets `loadingApps` for request dedup, calls client, does NOT `fetchContainers` immediately because server will broadcast state; a final `fetchContainers` after a short delay can backstop if WebSocket push is absent).
- `neode-ui/src/views/ContainerApps.vue:85-136` — replace the two-button conditional with a single full-width button bound to `getAppVisualState(app.id)`. Table:
  | visual state    | click action   | label          | spinner | disabled |
  |-----------------|----------------|----------------|---------|----------|
  | `not-installed` | installApp     | Install        | no      | no       |
  | `running`       | stopContainer  | Stop           | no      | no       |
  | `stopped`       | startContainer | Start          | no      | no       |
  | `starting`      | —              | Starting…      | yes     | yes      |
  | `stopping`      | —              | Stopping…      | yes     | yes      |
  | `restarting`    | —              | Restarting…    | yes     | yes      |
  | `installing`    | —              | Installing…    | yes     | yes      |
  | `updating`      | —              | Updating…      | yes     | yes      |
  | `removing`      | —              | Removing…      | yes     | yes      |
  - Add a separate Restart button next to the primary one when state is `running`, calling new `restartContainer` action. Restart button hides while transitional.
- `neode-ui/src/views/ContainerAppDetails.vue:83` (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
- Also audit line 239 of `ContainerApps.vue` (`some((app) => store.getAppState(app.id) === 'created')`) and the logic around lines 276, 295, 309, 312 — make sure they use `getAppVisualState` where appropriate.

### Verification gates (do not skip)

1. `~/.cargo/bin/cargo check -p archipelago` on .116 via SSH
2. `~/.cargo/bin/cargo test -p archipelago` on .116 via SSH — at least the new merge helper test must pass
3. Build release binary on .116: `nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown`. Poll until done.
4. SCP binary to .228 `/usr/local/bin/archipelago`, back up prior to `/usr/local/bin/archipelago.bak-pre-async-stop`. `sudo systemctl restart archipelago` on .228.
5. **Manual LND stop test on .228**:
   - Open dashboard, confirm LND is Running (first: `ssh archipelago@192.168.1.228 'podman start lnd'` — LND is currently Exited(0) from the demo)
   - Click Stop
   - Expected: button _immediately_ becomes "Stopping…" with spinner (RPC returns <1s)
   - Dashboard should stay on "Stopping…" until `podman stop` completes (up to ~5.5 min for LND with its 330s timeout)
   - Then the button flips to "Start"
   - At no point should it revert to "Running" mid-stop
6. Same test with Bitcoin Core stop (longest timeout, 600s)
7. Frontend build: `cd ~/Projects/archy/neode-ui && npm run type-check && npm run build`. Rsync `dist/` to `archipelago@192.168.1.228:/var/lib/archipelago/web-ui/` (or wherever the active web root is — check `/etc/nginx` on .228 first).
8. Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
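
The pass criterion for the manual stop test in gate 5 can be written down as a small predicate over the states polled from `container-list` after the click. This is an illustrative sketch only: `stop_sequence_ok` is a hypothetical name, and accepting either `"stopped"` or `"exited"` as the final state is an assumption about what the server reports once podman finishes.

```rust
/// Returns true when the polled states show one or more "stopping" samples
/// followed by one or more final samples, with nothing else in between.
/// Any "running" sample after the click fails the check.
fn stop_sequence_ok(polled: &[&str]) -> bool {
    let mut seen_stopping = false;
    let mut seen_final = false;
    for &state in polled {
        match state {
            // Still in flight: only valid before the final state appears.
            "stopping" if !seen_final => seen_stopping = true,
            // Finished: only valid after at least one "stopping" sample.
            "stopped" | "exited" if seen_stopping => seen_final = true,
            // Anything else (for example "running") is the bug reappearing.
            _ => return false,
        }
    }
    seen_final
}
```

The same predicate could back an automated assertion later, but the manual test remains the gate described above.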

### Key files (exact lines of interest)

- `core/archipelago/src/api/rpc/container.rs:85-107` — `handle_container_stop` (blocking — target of fix)
- `core/archipelago/src/api/rpc/container.rs:61-83` — `handle_container_start`
- `core/archipelago/src/api/rpc/container.rs:148-154` — narrow state mapping (drops transitional → "unknown")
- `core/archipelago/src/api/rpc/package/runtime.rs:11-24` — `stop_timeout_secs` table (reference, unchanged)
- `core/archipelago/src/api/rpc/package/runtime.rs:122-173` — `handle_package_stop` (also blocking, mirror treatment)
- `core/archipelago/src/api/rpc/package/runtime.rs:28-119` — `handle_package_start`
- `core/archipelago/src/api/rpc/package/runtime.rs:176-242` — `handle_package_restart`
- `core/archipelago/src/api/rpc/package/progress.rs` — existing broadcast pattern to mirror (`set_install_progress`, `set_uninstall_stage`)
- `core/archipelago/src/api/rpc/mod.rs:62-100` — `RpcHandler` struct (already holds `Arc<dyn ContainerOrchestrator>` + state_manager)
- `core/archipelago/src/server.rs:812-857` — `scan_and_update_packages` (merge loop at L850-857 is where transitional-state clobber happens)
- `core/archipelago/src/container/docker_packages.rs:636-663` — `convert_state` + `package_state_str` (read-only reference, no change)
- `core/archipelago/src/container/traits.rs` — `ContainerOrchestrator` trait (stays synchronous, do not change)
- `core/archipelago/src/crash_recovery.rs` — `mark_user_stopped` / `clear_user_stopped` (call order preserved)
- `core/archipelago/src/data_model.rs:107-124` — `PackageState` enum (no change — all variants exist)
- `neode-ui/src/api/container-client.ts` — `ContainerStatus` type + RPC methods (extend)
- `neode-ui/src/stores/container.ts:93-312` — Pinia store (add `getAppVisualState`, add `restartContainer` action)
- `neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383` — two-button block + state reads
- `neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232` — details page Stop/Start

### Chaos harness (not in repo — lives on .116)

- `archipelago@192.168.1.116:~/ui-chaos/` — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
- `/tmp/chaos/` on laptop — canonical source for rsync to .116.
- Run: `cd ~/ui-chaos && npx playwright test tests/<spec>`
- Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
- Uses SSH+Playwright hybrid per design; includes the `bash -lc '<escaped>'` single-quote fix for ssh argv flattening and JSON-parsed `podman inspect` instead of Go templates.
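
The 32-case target is the cross product of containers and scenarios. The sketch below enumerates it; the scenario names come from the list above, while the four container ids are inferred from the smoke tests mentioned there and should be treated as an assumption. `chaos_matrix` is a hypothetical name, and the harness itself is Playwright-based rather than Rust.

```rust
/// Assumed container ids, inferred from the smoke tests this section mentions.
const CONTAINERS: [&str; 4] = ["bitcoin-core", "lnd", "electrumx", "bitcoin-ui"];

/// Scenario names as listed for the chaos matrix.
const SCENARIOS: [&str; 8] = [
    "install-fresh",
    "graceful-stop",
    "sigkill",
    "rm-container",
    "oom-kill",
    "rm-image",
    "restart-service",
    "network-partition",
];

/// Every (container, scenario) pair, grouped by container.
fn chaos_matrix() -> Vec<(&'static str, &'static str)> {
    let mut cases = Vec::with_capacity(CONTAINERS.len() * SCENARIOS.len());
    for container in CONTAINERS.iter() {
        for scenario in SCENARIOS.iter() {
            cases.push((*container, *scenario));
        }
    }
    cases
}
```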

### Pre-existing bugs still deferred (do not fix until Stop UX lands)

1. `archipelago --version` spawns server (should be a pure CLI query)
2. RPC unknown-method returns generic error (should return method-not-found with the bad method name)
3. `docker_packages.rs` filters out UI containers (`archy-lnd-ui`, `archy-electrs-ui`) — some views need them visible
4. `lnd.lan_address` stale on .228
5. first-boot silent failure on some hardware
6. `web-ui.failed.*` scar on .228 (benign systemd unit state)
7. `test_parse_image_versions` pre-existing broken assertion — fix or `#[ignore]` when touching that area

---

## Where we are

Working through the 11-step plan in [`rust-orchestrator-migration.md`](./rust-orchestrator-migration.md).

- [x] **Step 1** — `3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
- [x] **Step 2** — `34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
- [x] **Step 3** — `b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
- [x] **Step 4** — `e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore ._*)
- [x] **Step 5** — `fc39b04b` BootReconciler with Arc<Notify> shutdown, 4 paused-time tests pass
- [x] **Step 6** — `48f08aa3` main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
- [x] **Step 7** — `069bc4a5` bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
- [x] **Step 8a** — `a0707f4d` retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
- [x] **Step 9** — **Hot-swap on .228 verified.** All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
- [x] **.228 dashboard bugs** — ExtraHost `192.168.1.254` bug (`3ee192ba`) + LND macaroon permission bug (`be960023`). See "Post-Step 9 bug hunt" below.
- [ ] **Step 8b** — Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
- [ ] **Step 8c** — Rename `first-boot-containers.sh` → `first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
- [ ] **Step 10** — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
- [ ] **Step 11** — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)

## Post-Step 9 bug hunt (.228, 2026-04-23)

User reported three visible dashboard bugs after Step 9 verification:
1. LND — "no connect details or QR"
2. ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
3. bitcoin-core — in scope for chaos testing

**Root cause #1 (ExtraHost, commit `3ee192ba`)**: `scripts/first-boot-containers.sh` computed `HOST_GATEWAY` from `ip route show default`, which returns the **LAN router** (e.g. 192.168.1.254), not the gateway to the host. Every container configured with `--add-host=host.containers.internal:$HOST_GATEWAY` was dialing the WiFi router instead of the host. LND crash-looped with `dial tcp 192.168.1.254:8332: connection refused`; ElectrumX's DAEMON_URL hit the same dead end; any `archy-net` bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic `host-gateway` literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected `--add-host`; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).

**Root cause #2 (macaroon permissions, commit `be960023`)**: LND's `admin.macaroon` lives at `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon`, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (`getinfo`, `connect-info`, `export-channel-backup`) plus the shared `lnd_client()` helper failed with "Failed to read LND admin macaroon". **Confirmed pre-existing on .116 too** (long-standing bug unrelated to Step 9). Fix: centralised the path as `LND_ADMIN_MACAROON_PATH`, added a `read_lnd_admin_macaroon()` helper in `api/rpc/lnd/mod.rs` that tries direct read first then falls back to `sudo -n cat` (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — `curl -k https://<host>/lnd-connect-info` now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
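
The read-then-fallback pattern behind `read_lnd_admin_macaroon()` can be sketched with the standard library alone. This is not the code from `api/rpc/lnd/mod.rs`: `read_with_sudo_fallback` is a hypothetical, path-generic helper, and the error wording is invented for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Tries a direct read first; if that fails, retries through `sudo -n cat`.
/// `sudo -n` never prompts: it fails right away if a password would be needed.
fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        Err(direct_err) => {
            let output = Command::new("sudo").arg("-n").arg("cat").arg(path).output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::PermissionDenied,
                    format!(
                        "direct read failed ({}); sudo fallback failed: {}",
                        direct_err,
                        String::from_utf8_lossy(&output.stderr).trim()
                    ),
                ))
            }
        }
    }
}
```

Trying the direct read first keeps the common case free of a subprocess, and the fallback only helps on hosts where passwordless sudo covers `cat` for that path.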

## Step 9 evidence (.228, 2026-04-23)

- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (`be960023`); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
- DEV_MODE override disabled (`override.conf` → `override.conf.disabled-pre-step9`).
- `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` populated.
- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
- Post-start snapshot:
  - `🔗 Adopted 1 existing container(s): ["electrs-ui"]` — adoption of 13h-running container worked without recreation
  - `🔄 Boot reconciler started (interval: 30s)` — every 30s, all three app_ids reach `NoOp` after the initial install pass
  - `bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18` — pre-start hook fires in `install_fresh`
  - `curl localhost:8334` → HTTP 200 (bitcoin-ui), `:8081` → 200 (lnd-ui), `:50002` → 200 (electrs-ui)
  - OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)

## Bugs fixed this session

1. **`parse_memory_limit` truncation bug** (`732df1b8`): lowercased "128Mi" → "128mi" → `trim_end_matches('m')` cannot reduce that to a bare number because of the IEC `i` suffix → f64 parse fails → `None.unwrap_or(0)` → OCI `memory.limit:0` → systemd rejects MemoryMax=0. 6 regression tests; `create_container` now omits the limit instead of emitting 0.
2. **`archipelago.service` cgroup delegation missing** (`ba83f9bc`): belt-and-braces `Delegate=memory pids cpu io`.
3. **ExtraHost `192.168.1.254`** (`3ee192ba`): see Post-Step 9 bug hunt above.
4. **LND admin.macaroon unreadable** (`be960023`): see Post-Step 9 bug hunt above.
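
The suffix handling behind bug 1 can be illustrated with a small sketch. It is not the code from `732df1b8`: the unit table, the case-sensitive matching, and the treatment of bare `K`/`M`/`G` as decimal multiples are assumptions made for the example. The point it shows is matching whole suffixes and returning `None`, so a caller can omit the limit rather than emit 0.

```rust
/// Parse a memory quantity such as "128Mi" or "1G" into bytes.
/// Returns `None` for anything unparseable or non-positive, so the caller
/// can omit the limit instead of emitting 0.
pub fn parse_memory_limit(input: &str) -> Option<u64> {
    // The empty suffix comes last: every string ends with "", so it acts
    // as the fallback for plain byte counts.
    const UNITS: [(&str, u64); 7] = [
        ("Ki", 1 << 10),
        ("Mi", 1 << 20),
        ("Gi", 1 << 30),
        ("K", 1_000),
        ("M", 1_000_000),
        ("G", 1_000_000_000),
        ("", 1),
    ];
    let s = input.trim();
    for &(suffix, factor) in UNITS.iter() {
        if let Some(number) = s.strip_suffix(suffix) {
            // A matched suffix with a non-numeric prefix is an error, not a retry.
            let value: f64 = number.trim().parse().ok()?;
            if !value.is_finite() || value <= 0.0 {
                return None;
            }
            return Some((value * factor as f64) as u64);
        }
    }
    None
}
```

With this shape, the mangled string from the original bug (`"128i"`) falls through to the empty suffix, fails the numeric parse, and yields `None` instead of a zero limit.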

## Commits made this session

```
3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
4b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer}  (Step 8a)
1c81a739 docs: split Step 8 into 8a/8b/8c
6e46932f docs: STATUS.md through Step 7
069bc4a5 feat: bitcoin-ui pre-start hook (Step 7)
```

Branch is **19 commits ahead of tx1138/main** (local only — user pushes to mirrors personally).

## Uncommitted state

Clean. Only untracked: `tests/` (bats harness from prior session, not in scope), `tmp-dump-spec.py` (scratch).

## Answered design questions (no need to re-ask)

1. UI container naming → `archy-<app_id>` for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
3. Reconciler interval → 30 seconds
4. Concurrency → per-app `Mutex<()>` in a `DashMap`
5. Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
6. Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
7. Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.
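
Answers 4 and 7 can be sketched with the standard library alone. This is an illustration under stated assumptions: the project uses a `DashMap`, which is replaced here by a `Mutex<HashMap<..>>` to keep the example dependency-free, and the names `AppLocks`, `lock_for`, and `write_atomically` are hypothetical.

```rust
use std::collections::HashMap;
use std::ffi::OsString;
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// One lock per app_id, so operations on the same app serialize
/// while different apps proceed independently.
#[derive(Default)]
pub struct AppLocks {
    inner: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl AppLocks {
    /// Return the lock for `app_id`, creating it on first use.
    pub fn lock_for(&self, app_id: &str) -> Arc<Mutex<()>> {
        let mut map = self.inner.lock().unwrap();
        map.entry(app_id.to_string()).or_default().clone()
    }
}

/// Write `contents` to `<dest>.tmp`, flush it, then rename over `dest`.
/// On POSIX filesystems the rename replaces the target atomically as long as
/// both paths are on the same filesystem, so readers never see a partial file.
pub fn write_atomically(dest: &Path, contents: &[u8]) -> io::Result<()> {
    let mut tmp_name: OsString = dest.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);
    let mut file = File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?; // make sure the bytes are on disk before the rename
    drop(file);
    fs::rename(&tmp, dest)
}
```

A caller would typically hold the guard from `lock_for(app_id)` for the duration of a render-and-restart sequence, so a reconcile pass and an RPC-triggered install cannot interleave on the same app.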

## Context: which host is what

| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| `archy` | 192.168.1.116 | **Dev ThinkPad** (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| `archy228` | 192.168.1.228 | Kiosk HP ProDesk. **Step 9 landing zone** — now running Rust-orchestrator binary in prod mode. | password123 | archipelago |

Both are development alpha nodes — **full destructive latitude**, no need to ask before stop/start/rebuild.

## Next action

**Step 10 — Hot-swap on .116.**

Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.

Steps:
1. Disable DEV_MODE on .116 (check if override.conf exists — `/etc/systemd/system/archipelago.service.d/`)
2. Stage the already-built binary at `~/Projects/archy/core/target/release/archipelago` → `/usr/local/bin/archipelago.new`
3. Ensure `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` present (copy from repo)
4. Ensure `/opt/archipelago/docker/bitcoin-ui/` matches the Step-7 layout (no baked nginx.conf)
5. Snapshot: `podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}"` → save to `/tmp/pre-step10-containers.txt`
6. `systemctl stop archipelago` → install binary → `systemctl start archipelago`
7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
8. If broken → restore `.bak` binary, re-enable DEV_MODE override.
9. Commit STATUS.md update.
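
The "no container was recreated" check in step 7 can be made mechanical by comparing the step 5 snapshot with one taken after the swap. The sketch below assumes both snapshots use the tab-separated `Names / Status / CreatedAt` layout from step 5; the function names are hypothetical, and the timestamps in any usage are sample data.

```rust
use std::collections::HashMap;

/// Parse `name<TAB>status<TAB>created_at` lines into a name -> created_at map.
/// Lines with fewer than three columns are skipped.
fn created_by_name(snapshot: &str) -> HashMap<&str, &str> {
    snapshot
        .lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let name = cols.next()?;
            let _status = cols.next()?;
            let created = cols.next()?;
            Some((name, created))
        })
        .collect()
}

/// Containers from the pre-swap snapshot that are missing afterwards or whose
/// creation timestamp changed (meaning they were recreated, not adopted).
pub fn recreated_or_missing(pre: &str, post: &str) -> Vec<String> {
    let after = created_by_name(post);
    let mut bad: Vec<String> = created_by_name(pre)
        .into_iter()
        .filter(|(name, created)| after.get(name) != Some(created))
        .map(|(name, _)| name.to_string())
        .collect();
    bad.sort();
    bad
}
```

An empty result means every pre-existing container kept its original creation time. The status column is deliberately ignored, since uptime text changes between snapshots even when adoption succeeds.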

**Risk on .116:** If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is `install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago`.

**After Step 10 we are blocked on Step 8b** (multi-day manifest ports) before Step 11 (chaos matrix).

---

### Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:
- `first-boot-containers.sh` creates **30+ containers** with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, BootReconciler covers) is safe to execute before we port manifests.

---

# Archipelago — Current State, Plan, and Releases

Updated: 2026-04-22

This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in [`bulletproof-containers.md`](./bulletproof-containers.md).

---

## Current state

### Fleet status

All four Gitea mirrors are synced to v1.7.40-alpha:

| Mirror | Host | Status |
|---|---|---|
| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via `podman system renumber` — see below) |
| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |

Fleet test nodes:

| Node | Version | State |
|---|---|---|
| .103 (dev) | 1.7.40 | running, being developed against |
| .116 (this box) | 1.7.40 | healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui` after v1.7.38/39 bug |
| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
| .249 (ISO test) | | unreachable today |
| .253 | 1.7.39 → 1.7.40-alpha | healed manually |

### Known open issues (drives the plan below)

1. **UI companion containers disappear** on .228 after daemon restarts — no auto-recreate (planned fix: v1.7.45 Quadlet migration)
2. **bitcoin.conf rpcauth drifts** from canonical secret → ElectrumX "Daemon connection problem" (planned fix: v1.7.43 `reconcile::derived`)
3. **`host.containers.internal`** resolves to LAN gateway inside containers on some versions (planned fix: v1.7.42 containers.conf)
4. **Podman state DB loss** requires manual recovery (planned fix: v1.7.44 startup self-heal)
5. **LND "Connect Wallet" info** vanishing after crashes — symptom of the same drift class as #2
6. **ElectrumX not syncing** on .228 — downstream of #2; will resolve when bitcoin.conf is reconciled

### Recent field incident (2026-04-22)

- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
- Full failure-mode audit is in [`bulletproof-containers.md`](./bulletproof-containers.md).

---

## Plan

We're shipping a level-triggered **reconciler + Quadlet** architecture over a series of incremental releases, listed in the roadmap below. Each of the first five releases closes at least one failure mode; the later ones finish the Quadlet migration and the daemon refactor. See [`bulletproof-containers.md`](./bulletproof-containers.md) for the full design, code layout, test harness, chaos matrix, sources.

### Release roadmap

| Release | Closes | What lands | Status |
|---|---|---|---|
| **v1.7.41** | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts | **in flight — deploying to .228 for test** |
| **v1.7.42** | FM4 (`host.containers.internal` wrong) | `/etc/containers/containers.conf` w/ `host_containers_internal_ip = 10.89.0.1`; every container gets `--add-host=host.archipelago:10.89.0.1` | pending |
| **v1.7.43** | FM2 (config drift) | `reconcile::derived::render_bitcoin_conf` — pure fn over canonical secret, rewrites on drift. Same for `lnd.conf` | pending |
| **v1.7.44** | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via `/run/user/$UID/*` clear + `system renumber` | pending |
| **v1.7.45** | FM1 + FM3 (companion orphans) | `archy-bitcoin-ui` → Quadlet `.container` unit in `/etc/containers/systemd/`. systemd (not archipelago) owns it | pending |
| **v1.7.46** | — | `archy-lnd-ui` → Quadlet | pending |
| **v1.7.47** | — | `archy-electrs-ui` → Quadlet | pending |
| **v1.7.48+** | all (full daemon refactor) | `core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too | pending |

Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.
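
The v1.7.43 row describes a level-triggered pattern: render the desired file from the canonical secret, then rewrite only if what is on disk differs. A minimal sketch of that drift check follows. The `render_example` function and its two-line output are placeholders invented for illustration, not the real `render_bitcoin_conf`, and `reconcile_derived` is a hypothetical name.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Placeholder renderer: a pure function from inputs to file contents.
/// The real config has more fields; this shape is an assumption.
pub fn render_example(rpcauth: &str) -> String {
    format!("server=1\nrpcauth={}\n", rpcauth)
}

/// Bring `path` in line with `desired`. Returns Ok(true) if the file was
/// (re)written and Ok(false) if it already matched.
pub fn reconcile_derived(path: &Path, desired: &str) -> io::Result<bool> {
    match fs::read_to_string(path) {
        // Already converged: nothing to do.
        Ok(current) if current == desired => Ok(false),
        // Drifted: overwrite with the rendered contents.
        Ok(_) => {
            fs::write(path, desired)?;
            Ok(true)
        }
        // Missing file counts as drift.
        Err(err) if err.kind() == io::ErrorKind::NotFound => {
            fs::write(path, desired)?;
            Ok(true)
        }
        Err(err) => Err(err),
    }
}
```

Because the function is idempotent, calling it on every reconcile pass is safe: a converged file costs one read and no write.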

---

## Release history

### [v1.7.41-alpha](/releases/v1.7.41-alpha/) — IN FLIGHT — 2026-04-22
**Post-OTA auto-rollback.** After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.

Changes:
- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot — probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server
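
The probe-with-deadline behaviour of the verifier can be sketched independently of HTTP. In the sketch below the probe is an injected closure standing in for the `https://127.0.0.1/` request, and `probe_until` is a hypothetical name; the real `verify_pending_update()` may be structured differently.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Call `probe` repeatedly until it succeeds or `deadline` has elapsed.
/// Returns true on the first success, false if the deadline passes first.
pub fn probe_until(
    mut probe: impl FnMut() -> bool,
    deadline: Duration,
    interval: Duration,
) -> bool {
    let start = Instant::now();
    loop {
        if probe() {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}
```

With a 90 second deadline, a `false` result is the signal to restore `web-ui.bak` and roll back. Keeping the probe injectable makes the timing logic testable without a running nginx.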

### [v1.7.40-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.40-alpha/) — 2026-04-22
**Proper fix for the 500 error.** Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly `chmod 755` before tar; `--mode=u=rwX,go=rX` normalizes archive perms; pre-ship assertion aborts release if `tar tvzf | head -1` isn't `drwxr-xr-x`.

Changes:
- `scripts/create-release-manifest.sh`: pre-tar chmod + tar --mode flag + post-tar verify
- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)

### [v1.7.39-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.39-alpha/) — 2026-04-22
**Hotfix attempt** for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in `main.rs` and post-extract chmod in `update.rs` OTA applier.

### [v1.7.38-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.38-alpha/) — 2026-04-22
**Onboarding auto-heal + silent logins + App Store trim.**

Changes:
- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
- `useOnboarding`: tri-state — backend-unreachable no longer defaults to `/onboarding/intro`
- Login sounds gated by `isFirstInstallPhase()` — silent after onboarding, typing sounds unaffected
- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
- Deleted 15 image versions from tx1138, .168, gitea-local registries
- AIUI baked into release tarball via `demo/aiui/`
- `prebuild` hook syncs `app-catalog/catalog.json` → `public/catalog.json`

(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)

### [v1.7.37-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.37-alpha/) — 2026-04-22
**Bitcoin Core install fixes + dynamic node UI + full-archive default.**

- Bitcoin Core passes explicit `-rpcbind/-rpcallowip/etc.` CLI args so vanilla image exposes RPC
- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
- Storage (Full Archive · X GB / Pruned) indicator on dashboard
- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
- Pull fallback to `docker.io` when no mirror carries the image
- Removed `prune=550` hardcode — full archive default

---

## Key docs

- [`bulletproof-containers.md`](./bulletproof-containers.md) — full reconcile architecture, code layout, test matrix, chaos scenarios, sources
- [`BETA-RELEASE-CHECKLIST.md`](./BETA-RELEASE-CHECKLIST.md) — existing beta checklist
- [`BETA-ISSUES-20260328.md`](./BETA-ISSUES-20260328.md) — prior beta-blocker tracking
- [`hotfix-process.md`](./hotfix-process.md) — release workflow
- [`architecture.md`](./architecture.md) — system architecture overview

---

## How to resume

1. Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
2. Read [`bulletproof-containers.md`](./bulletproof-containers.md) for the current plan
3. Check task list (`/list` or via Claude Code) for the in-flight release
4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified
### FUSE / SSHFS development loop

**Why this exists**: editing the repo directly on the ThinkPad over raw SSH means no IDE, no tool-native file reads, no glob/grep speed. SSHFS mounts the remote filesystem as a local directory so OpenCode's file tools work transparently. But SSHFS is a leaky abstraction — know the gotchas or you'll waste hours.

**Stack** (macOS laptop):
- **macFUSE** — kernel extension providing FUSE on macOS. Install via `brew install --cask macfuse` (requires reboot + security approval in System Settings the first time).
- **sshfs** — userspace mount tool. Install via `brew install gromgit/fuse/sshfs-mac` (the homebrew core `sshfs` was removed; use this tap).
- Verify: `which sshfs` → `/opt/homebrew/bin/sshfs`, `sshfs --version` → `SSHFS version 2.10 / FUSE library version 2.9.9`.

**Actual mount command currently running** (verified from `ps`):
```
sshfs archy:Projects/archy /Users/dorian/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad
```

Breakdown:
- `archy:Projects/archy` — remote path via the `archy` SSH alias (uses `~/.ssh/archy_opencode`, no password prompt).
- `~/mnt/archy-thinkpad` — local mount point. Create once: `mkdir -p ~/mnt/archy-thinkpad`.
- `reconnect` — sshfs auto-reconnects if the TCP session drops (WiFi flap, laptop sleep). Without this, the mount turns into a zombie immediately.
- `ServerAliveInterval=15` — sends a keepalive every 15s.
- `ServerAliveCountMax=3` — disconnect after 3 missed keepalives (45s). Tune up if your network is flaky.
- `volname=archy-thinkpad` — Finder display name.

**Check mount health**:
```
mount | grep archy-thinkpad
# should print: archy:Projects/archy on /Users/dorian/mnt/archy-thinkpad (macfuse, nodev, nosuid, synchronous, mounted by dorian)
ls ~/mnt/archy-thinkpad/ | head
# should list repo contents fast (<1s). If it hangs, mount is stale.
```
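
Because a stale mount makes `ls` hang rather than fail, a scripted health check needs its own timeout. The following is one possible sketch using only the standard library; `dir_responds_within` is a hypothetical name, and the approach (a worker thread plus a channel with `recv_timeout`) is a design choice made here, not something the repo is known to contain.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Return true if listing `path` completes successfully within `timeout`.
/// A hung FUSE mount shows up as a timeout, which is reported as false.
pub fn dir_responds_within(path: PathBuf, timeout: Duration) -> bool {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Opening the directory and pulling one entry forces a real round-trip.
        let ok = fs::read_dir(&path)
            .map(|mut entries| {
                let _ = entries.next();
                true
            })
            .unwrap_or(false);
        // The receiver may already have given up; ignore a failed send.
        let _ = tx.send(ok);
    });
    rx.recv_timeout(timeout).unwrap_or(false)
}
```

Note that on a genuinely hung mount the worker thread stays blocked in the kernel until the mount is force-unmounted; the function only guarantees that the caller gets an answer on time.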

**Recovery when the mount hangs / goes stale** (this WILL happen — laptop sleeps, WiFi drops, ThinkPad reboots):
```
# 1. Force-unmount (macOS — `umount` alone often fails on a hung FUSE mount)
sudo diskutil unmount force ~/mnt/archy-thinkpad
# fallback if diskutil can't see it:
sudo umount -f ~/mnt/archy-thinkpad
# 2. Kill any zombie sshfs process
pkill -f "sshfs archy:Projects/archy"
# 3. Remount
sshfs archy:Projects/archy ~/mnt/archy-thinkpad \
  -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3,volname=archy-thinkpad
# 4. Verify
ls ~/mnt/archy-thinkpad/ | head
```

If the mount point itself got wedged (`ls: /Users/dorian/mnt/archy-thinkpad: Device not configured`), the sequence above still works — macFUSE garbage-collects the inode after the force-unmount.

**When to use which path** (rules, not suggestions):

| Operation | Use | Why |
|---|---|---|
| `read` / `edit` / `write` | SSHFS mount | OpenCode tools want local paths |
| `glob` / `grep` | SSHFS mount | Local FS traversal is fine; remote would need rg over SSH |
| Reading many files | SSHFS mount | Each read is a round-trip but parallelizable |
| `git status` / `git diff` / `git log` | SSH | Git over FUSE is painfully slow (lots of stat calls) |
| `git add` / `git commit` | SSH | Same — commit times grow linearly with tree size on FUSE |
| `cargo check` / `cargo test` / `cargo build` | SSH | Compiling over FUSE would take hours; cargo's incremental stat pattern destroys FUSE performance |
| `npm install` / `npm run build` | SSH | Same reason — massive file churn |
| Running the server / tailing journal | SSH | Service lives on .116 |
| Deploying to .228 | SSH from .116 | SCP from ThinkPad; laptop isn't in the critical path |

**Don't do this** (will bite you):
- `cargo build` from the mount — will try to write target/ over FUSE, gets orders of magnitude slower, may hang.
- `rsync` without `--exclude="._*"` — macOS writes AppleDouble metadata files, they leak to the remote as `._*` siblings of every real file. `.gitignore` already excludes them (commit `13858842`), but they clutter the tree.
- Writing big binary files via the mount — use `scp` over SSH instead.
- Relying on file-change-watcher tools (watchman, chokidar) — they get confused by FUSE event semantics.

**Editing workflow in a typical session**:
1. Laptop: OpenCode `read`s a file via `/Users/dorian/mnt/archy-thinkpad/...`. FUSE fetches it over SSH, caches briefly.
2. Laptop: OpenCode `edit`s the file — FUSE writes the new bytes back to .116 immediately (synchronous mount).
3. Laptop: `ssh archy "cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago"` — runs on the real filesystem on .116, sees the edit.
4. Laptop: `ssh archy "cd ~/Projects/archy && git diff path/to/file"` — confirms the edit landed.
5. Laptop: `ssh archy "cd ~/Projects/archy && git add path/to/file && git commit -m '...'"` — commit from .116.

The SSHFS mount and the SSH shell are pointing at **the same inodes** — edits via the mount are instantly visible to `cargo`/`git` over SSH. There's no "sync" step.

**Cache caveat**: macFUSE caches attributes briefly (default ~1s). If you write via SSH and read via the mount within that window, you may see stale metadata. The mount's `synchronous` flag (visible in `mount` output) minimizes but doesn't eliminate this. If you get a weird diff between what SSH and the mount report, re-read after a second, or run `stat ~/mnt/archy-thinkpad/<file>` to request fresh attributes (macOS ships BSD `stat`, which has no `--file-system` flag).

**Direct SSH** access (use when FUSE isn't the right tool):
- `ssh archy` → `archipelago@192.168.1.116` using `~/.ssh/archy_opencode`
- `ssh archy228` → `archipelago@192.168.1.228` using `~/.ssh/archy_opencode`
- Full host form also works: `ssh archipelago@192.168.1.116` / `ssh archipelago@192.168.1.228` (same key resolves via IdentitiesOnly).

### SSH keys — what's where

**Laptop `~/.ssh/` (macOS, user `dorian`)**:

| File | Purpose |
|---|---|
| `archy_opencode` / `.pub` | **Primary key for this project.** Unlocks both `archy` (.116) and `archy228` (.228). Created 2026-04-22 specifically for OpenCode work. |
| `archipelago-deploy` / `.pub` | Older archipelago deploy key. Not needed for current work. |
| `id_ed25519` / `.pub` | Personal default key. Not used by archy/archy228 configs (`IdentitiesOnly yes` forces `archy_opencode`). |
| `id_ed25519_angor` / `.pub` | Angor project. Unrelated. |
| `id_ed25519_start9` / `.pub` | Start9 project. Unrelated. |
| `vps-ci-setup` / `.pub` | VPS CI. Unrelated. |
| `config` | Host aliases (shown above) |

**.116 `/home/archipelago/.ssh/`**:

| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's `archy_opencode.pub` + 3 other keys (4 lines total). |
| `id_ed25519` / `.pub` | .116's OWN identity key. This is what lets `.116 → .228` work passwordless. |
| `archipelago-deploy` | Symlink → `id_ed25519` (legacy alias). |
| `id_ed25519_vps168` / `.pub` | For SSH to `146.59.87.168` (VPS). Unrelated to this work. |
| `config` | Host entry for the VPS only. |

**.228 `/home/archipelago/.ssh/`**:

| File | Purpose |
|---|---|
| `authorized_keys` | Accepts: laptop's `archy_opencode.pub` + .116's `id_ed25519.pub` + 2 others (4 lines total). |
| _(no `id_ed25519`)_ | .228 has no outbound key — it's a terminal node. Don't try to `ssh` _from_ .228 _to_ anywhere. |

**Connectivity matrix (all verified 2026-04-23)**:

| From → To | Works passwordless | Via |
|---|---|---|
| Laptop → .116 | ✅ | `archy_opencode` |
| Laptop → .228 | ✅ | `archy_opencode` |
| .116 → .228 | ✅ | .116's `id_ed25519` |
| .228 → anywhere | ❌ | no outbound key (by design) |
 								### Sudo — verified state
 								**.116** (dev ThinkPad):
 								- User `archipelago` is in `sudo` group.
 								- Sudo password required: **`ThisIsWeb54321@`**
 								- Sudoers drop-ins present: `/etc/sudoers.d/archipelago-ci`, `/etc/sudoers.d/archipelago-wg` (scope-limited NOPASSWD for specific CI/wg commands — not full NOPASSWD).
 								- For most dev work you don't need sudo on .116.
 								**.228** (prod kiosk):
 								- User `archipelago` has **full passwordless sudo** via `/etc/sudoers.d/archipelago` containing `archipelago ALL=(ALL) NOPASSWD:ALL`.
 								- User is also in `sudo` group.
 								- Sudo password (if ever prompted, shouldn't be): **`archipelago`**
 								- Dashboard password: **`password123`**
 								### Cargo / npm / paths
 								- **Cargo PATH gotcha**: non-interactive SSH login has no cargo in PATH. Always use `~/.cargo/bin/cargo` over SSH.
 								  - Example (pointing cargo at the workspace manifest): `ssh archy '~/.cargo/bin/cargo check -p archipelago --manifest-path ~/Projects/archy/core/Cargo.toml'`
 								  - Or cd first: `ssh archy 'cd ~/Projects/archy && ~/.cargo/bin/cargo check -p archipelago'`
 								- **Long cargo builds** (>2 min Bash tool timeout): launch detached and poll the log:
 								  ```
 								  ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
 								  ssh archy 'tail -30 /tmp/cargo-build.log'
 								  ssh archy 'pgrep -a cargo'   # to check if still running
 								  ```
 								- **npm / frontend** lives at `~/Projects/archy/neode-ui/` on .116 (also accessible via laptop mount at `~/mnt/archy-thinkpad/neode-ui/`). Node is on interactive PATH; for scripted SSH, `source ~/.nvm/nvm.sh && nvm use` or call the absolute path if nvm is used.
 								- Repo on .116: `~/Projects/archy/` (Cargo workspace at `core/Cargo.toml`).
 								- Web root on .228: check `/etc/nginx/sites-enabled/` for the live path; historically `/var/lib/archipelago/web-ui/` or `/opt/archipelago/web-ui/`.
 								### Deploying new server binary to .228
 								```
 								# 1. Build on .116 (detached — takes ~3-5 min for release)
 								ssh archy 'cd ~/Projects/archy && nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown'
 								# wait / tail log until "Finished `release` profile"
 								# 2. SCP .116 → .228 (uses .116's id_ed25519 → .228's authorized_keys, passwordless)
 								ssh archy 'scp ~/Projects/archy/core/target/release/archipelago archipelago@192.168.1.228:/tmp/archipelago.new'
 								# 3. Atomic swap on .228 with backup
 								ssh archy228 'sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-pre-async-stop && sudo mv /tmp/archipelago.new /usr/local/bin/archipelago && sudo chmod +x /usr/local/bin/archipelago && sudo systemctl restart archipelago'
 								# 4. Verify
 								ssh archy228 'systemctl status archipelago --no-pager | head -20 && sudo journalctl -u archipelago -n 50 --no-pager'
 								```
 								### Git workflow
 								- Branch: `main` on .116, currently **22 commits ahead of `tx1138/main`**.
 								- Remote `tx1138` exists but **do NOT push** — user mirrors to 4 Gitea remotes personally after reviewing.
 								- Atomic commits, one logical change per commit. Conventional Commits format (`feat:`, `fix:`, `docs:`, `refactor:`, `chore:`, `test:`, `perf:`).
 								- Never `--amend` unless the commit you're amending was created in this session AND has not been pushed. Safer: new commit.
 								- Never `--force` push. Never modify git config.
 								- If pre-commit hooks fail, create a NEW commit with the fix — don't `--amend` after a failed commit.
 								### Other
 								- Full destructive latitude on both nodes. Announce multi-hour ops (OTA, full rebuild, apt upgrade). Don't ask for routine stop/start/rebuild permission.
 								- No ship pressure. Do it properly.
 								- Use `question` tool for ambiguous decisions (don't guess user intent on design choices).
 								- Keep `docs/STATUS.md` fresh between sessions — it IS the session handoff.
 								### Hosts reference (quick)
 								| Host | IP | SSH alias | Role | Dashboard | Sudo |
 								|---|---|---|---|---|---|
 								| `archy` (ThinkPad X250) | 192.168.1.116 | `ssh archy` | dev host, Debian 13 | `archipelago` | `ThisIsWeb54321@` |
 								| `archy228` (HP ProDesk) | 192.168.1.228 | `ssh archy228` | prod kiosk, Rust orchestrator | `password123` | NOPASSWD (fallback `archipelago`) |
 								### Bug being fixed
 								Dashboard sequence when user clicks **Stop LND**:
 1. UI collapses Start/Stop buttons to single spinner-button ("Stopping…") via `loadingApps.add('lnd')`.
 2. Frontend calls `container-stop` RPC. Server runs `podman stop -t 330 lnd` **synchronously inside the RPC handler** (via `orchestrator.stop()`). RPC blocks up to **5.5 min** for LND (330s timeout + overhead).
 3. Meanwhile the 30-second package-scan loop in `server.rs:scan_and_update_packages` keeps running. It rebuilds `PackageDataEntry` from podman inspect — podman still reports `running` (stop hasn't completed) — and **blindly overwrites** the store entry at `server.rs:854`.
 4. `container-list` RPC reads `state_manager` snapshot → returns `state = "running"`.
 5. Frontend polling sees `running` → `getAppState()` returns `'running'` → the two-button (Start | Stop) block re-renders → the transitional button disappears → **UI looks like the stop silently failed**.
 6. Eventually `podman stop` finishes → next scan → state flips to `Stopped` → buttons change _again_.
 								Net visible bug: button spins briefly, reverts to Running, then several minutes later suddenly shows Stopped. User rightly calls this "out of sync and confusing".
 								### Decisions already locked in (do not re-ask)
 								- **Full scope fix** (not minimal hotfix). User chose "Go full scope, do it right".
 								- **Async-spawn lives in the RPC layer**, not in the `ContainerOrchestrator` trait. Trait stays synchronous so the reconciler, boot flow, unit tests, and the chaos harness retain deterministic behaviour.
 								- **`PackageState` already has `Stopping`/`Starting`/`Restarting`/`Installing`/`Updating`/`Removing`** variants — enum at `core/archipelago/src/data_model.rs:107-124`. No schema change needed.
 								- **UI collapses to one full-width button** with spinner during every transitional state. Labels: Start / Stop / Starting… / Stopping… / Restarting… / Installing… / Updating… / Removing… / Install (when `not-installed`).
 								- **Helper API shape**: `RpcHandler::spawn_transitional(op: Op, app_id: String)` where `Op` is an enum `{Stop, Start, Restart}`. Helper dispatches to `orchestrator.stop/start/restart` internally, knows each op's transitional+final states, handles error → revert + `install_log()`.
 								- **`mark_user_stopped` must run BEFORE the spawn** (preserves ordering the crash recovery layer depends on — see `runtime.rs:145-148`).
 								### Implementation order (4 commits, local only)
 								**Commit 1 — `feat(rpc): spawn_transitional helper for async lifecycle ops`**
 								- New file: `core/archipelago/src/api/rpc/transitional.rs` (or extend `container.rs`; prefer new file for cohesion with future stacks/package variants)
 								- `enum Op { Stop, Start, Restart }` with `transitional_state()`, `final_state_on_success()`, `log_prefix()`, and async `dispatch(&orch, &app_id)` method
 								- `impl RpcHandler { pub(super) async fn spawn_transitional(&self, op: Op, app_id: String) -> Result<()> }`
 								  - Capture `Arc<dyn ContainerOrchestrator>` + `Arc<StateManager>` clones
 								  - Set transitional state via `state_manager.update_data()` (if entry exists; skip if not — Start on never-installed shouldn't create an entry)
 								  - `tokio::spawn(async move { ... })`
 								  - Inside spawn: `install_log("{LOG_PREFIX}: {app_id}")`, `op.dispatch(&orch, &app_id).await`, on success set final state, on error log + `install_log("{LOG_PREFIX} FAIL: …")` + revert state to previous (cache pre-transition state in a local)
 								  - Return `Ok(())` immediately after spawn
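
A minimal sketch of the Commit 1 helper, under stated assumptions: `std::thread` stands in for `tokio::spawn` so the example runs without dependencies, the `Store` map is a hypothetical stand-in for `StateManager`, the `Dispatch` function pointer stands in for `orchestrator.stop/start/restart`, and `eprintln!` stands in for `install_log()`. The log prefix strings and the trimmed `PackageState` are illustrative, not taken from the codebase.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

// Trimmed local copy of the enum; only the variants this sketch needs.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PackageState { Running, Stopped, Starting, Stopping, Restarting }

#[derive(Clone, Copy, Debug)]
enum Op { Stop, Start, Restart }

impl Op {
    fn transitional_state(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopping,
            Op::Start => PackageState::Starting,
            Op::Restart => PackageState::Restarting,
        }
    }
    fn final_state_on_success(self) -> PackageState {
        match self {
            Op::Stop => PackageState::Stopped,
            Op::Start | Op::Restart => PackageState::Running,
        }
    }
    fn log_prefix(self) -> &'static str {
        match self { Op::Stop => "STOP", Op::Start => "START", Op::Restart => "RESTART" }
    }
}

// Hypothetical stand-ins for StateManager and the orchestrator call.
type Store = Arc<Mutex<HashMap<String, PackageState>>>;
type Dispatch = fn(Op, &str) -> Result<(), String>;

fn spawn_transitional(store: &Store, op: Op, app_id: String, dispatch: Dispatch) -> JoinHandle<()> {
    // Flip to the transitional state only if an entry already exists,
    // and remember the previous state so a failure can revert to it.
    let previous = store
        .lock()
        .unwrap()
        .get_mut(&app_id)
        .map(|slot| std::mem::replace(slot, op.transitional_state()));

    let store = Arc::clone(store);
    thread::spawn(move || {
        eprintln!("{}: {app_id}", op.log_prefix());
        let result = dispatch(op, &app_id);
        if let Err(e) = &result {
            eprintln!("{} FAIL: {app_id}: {e}", op.log_prefix());
        }
        // Never create an entry that did not exist before the transition.
        if let Some(prev) = previous {
            let next = if result.is_ok() { op.final_state_on_success() } else { prev };
            store.lock().unwrap().insert(app_id, next);
        }
    })
}
```

In the real handler the RPC would return as soon as the spawn is issued; the join handle is returned here only so a caller (or a test) can wait for the background work to finish.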
 								**Commit 2 — `fix(rpc): async container stop/start/restart; widen state mapping`**
 								- `api/rpc/container.rs:85-107` — rewrite `handle_container_stop` body: `validate_app_id`, `mark_user_stopped`, `spawn_transitional(Op::Stop, app_id.to_string()).await?`, return `Ok(json!({ "status": "stopping" }))`
 								- `api/rpc/container.rs:61-83` — rewrite `handle_container_start`: `clear_user_stopped`, `spawn_transitional(Op::Start, …)`, return `{ "status": "starting" }`
 								- **Add** `handle_container_restart` (currently missing in `container.rs` — only exists as `package.restart` at `runtime.rs:176-242`). Register RPC route name `container-restart`. Add matching frontend client method in `container-client.ts`.
 								- `api/rpc/container.rs:148-154` — widen the `container-list` state mapping: add arms for `Stopping → "stopping"`, `Starting → "starting"`, `Restarting → "restarting"`, `Installing → "installing"`, `Updating → "updating"`, `Removing → "removing"`, `Installed → "installed"`, `CreatingBackup`/`RestoringBackup`/`BackingUp` → their kebab-case strings. No more `"unknown"` fallback unless the variant is genuinely unknown.
 								- Mirror same spawn treatment in `api/rpc/package/runtime.rs`: `handle_package_start` (L28-119), `handle_package_stop` (L122-173), `handle_package_restart` (L176-242). Keep the existing verification loops (post-start exit-check at L82-117; restart stop+start fallback at L215-235) _inside_ the spawned future, not in the RPC body.
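
One possible shape for the widened `container-list` mapping. The enum below is a local stand-in that lists only the variants named in this section; the function name `container_list_state`, the `Running`/`Stopped` arms, and the exact kebab-case spellings of the backup variants are assumptions to be checked against `data_model.rs`.

```rust
// Local stand-in for the real PackageState enum (assumed variant set).
#[derive(Clone, Copy, Debug, PartialEq)]
enum PackageState {
    Running,
    Stopped,
    Installed,
    Installing,
    Starting,
    Stopping,
    Restarting,
    Updating,
    Removing,
    CreatingBackup,
    RestoringBackup,
    BackingUp,
}

// Exhaustive match with no wildcard arm: adding a new variant becomes a
// compile error here instead of silently mapping to "unknown".
fn container_list_state(state: PackageState) -> &'static str {
    match state {
        PackageState::Running => "running",
        PackageState::Stopped => "stopped",
        PackageState::Installed => "installed",
        PackageState::Installing => "installing",
        PackageState::Starting => "starting",
        PackageState::Stopping => "stopping",
        PackageState::Restarting => "restarting",
        PackageState::Updating => "updating",
        PackageState::Removing => "removing",
        PackageState::CreatingBackup => "creating-backup",
        PackageState::RestoringBackup => "restoring-backup",
        PackageState::BackingUp => "backing-up",
    }
}
```

Dropping the wildcard arm is a design choice rather than a requirement: it trades a little verbosity for the compiler flagging any future variant that lacks a wire string.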
 								**Commit 3 — `fix(state): preserve transitional state across container scans`**
 								- `server.rs:847-857` — in the merge loop, before the `merged.insert(id.clone(), pkg.clone())` overwrite, check `merged.get(id).state` and skip overwrite if it's transitional: `matches!(existing.state, Installing | Stopping | Starting | Restarting | Updating | Removing | CreatingBackup | RestoringBackup | BackingUp)`
 								- Still allow _non-state_ fields (lan_address, health, ports) to update. Simplest: when existing is transitional, keep `existing.state` but merge updated fields from `pkg`. Write a tiny helper `merge_preserving_transitional(existing, fresh) -> PackageDataEntry`.
 								- Unit test: construct `existing.state = Stopping`, `fresh.state = Running`, assert merged.state stays `Stopping`.
 								- **Also check**: Is there a timeout escape hatch? If `Stopping` is set and podman actually finishes but the spawn died before writing the final state (process crash, panic), the entry will be stuck `Stopping` forever. Mitigation: track a `transitional_since: Instant` in the entry (not persisted, just in-memory side table on StateManager), and if > 2× the stop timeout has elapsed, allow podman scan state to override. Scope for this commit or follow-up — lean toward: include it, because fleet reliability matters.
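
The merge rule and the timeout escape hatch described above could be sketched as follows. This is an illustration, not the final code: `PackageDataEntry` is reduced to three hypothetical fields, `transitional_expired` is an invented helper name, and the "2× the stop timeout" threshold is taken directly from the note above.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug, PartialEq)]
enum PackageState {
    Running, Stopped, Installing, Stopping, Starting, Restarting,
    Updating, Removing, CreatingBackup, RestoringBackup, BackingUp,
}

impl PackageState {
    fn is_transitional(self) -> bool {
        use PackageState::*;
        matches!(
            self,
            Installing | Stopping | Starting | Restarting | Updating
                | Removing | CreatingBackup | RestoringBackup | BackingUp
        )
    }
}

// Reduced stand-in for the real entry; field names are assumptions.
#[derive(Clone, Debug, PartialEq)]
struct PackageDataEntry {
    state: PackageState,
    lan_address: Option<String>,
    health: Option<String>,
}

/// Keep a transitional `existing.state`, but take every other field from
/// the fresh scan. A settled entry is replaced by the scan result outright.
fn merge_preserving_transitional(existing: &PackageDataEntry, fresh: PackageDataEntry) -> PackageDataEntry {
    if existing.state.is_transitional() {
        PackageDataEntry { state: existing.state, ..fresh }
    } else {
        fresh
    }
}

/// Escape hatch: once more than twice the stop timeout has elapsed, the
/// scan is allowed to override a transitional state that may be stuck.
fn transitional_expired(since: Instant, now: Instant, stop_timeout: Duration) -> bool {
    now.saturating_duration_since(since) > stop_timeout * 2
}
```

The scan loop would call `transitional_expired` first and fall through to a plain overwrite when it returns `true`, so a crashed spawn cannot pin an entry in `Stopping` indefinitely.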
 								**Commit 4 — `fix(ui): single-button lifecycle control with transitional labels`**
 								- `neode-ui/src/api/container-client.ts` — extend `ContainerStatus.state` union to: `'created' | 'running' | 'stopped' | 'exited' | 'paused' | 'unknown' | 'stopping' | 'starting' | 'restarting' | 'installing' | 'updating' | 'removing' | 'installed'`. Add `restartContainer(appId)` method calling `container-restart`.
 								- `neode-ui/src/stores/container.ts` — add computed `getAppVisualState(appId)` that returns one of: `'not-installed' | 'running' | 'stopped' | 'starting' | 'stopping' | 'restarting' | 'installing' | 'updating' | 'removing'`. Maps `exited`→`stopped`, `created`→`stopped`, `paused`→`stopped`, `installed`→`stopped`. Add `restartContainer(appId)` action (sets `loadingApps` for request dedup, calls client, does NOT `fetchContainers` immediately because server will broadcast state; a final `fetchContainers` after a short delay can backstop if WebSocket push is absent).
 								- `neode-ui/src/views/ContainerApps.vue:85-136` — replace the two-button conditional with a single full-width button bound to `getAppVisualState(app.id)`. Table:
 								  | visual state    | click action   | label          | spinner | disabled |
 								  |-----------------|----------------|----------------|---------|----------|
 								  | `not-installed` | installApp     | Install        | no      | no       |
 								  | `running`       | stopContainer  | Stop           | no      | no       |
 								  | `stopped`       | startContainer | Start          | no      | no       |
 								  | `starting`      | —              | Starting…      | yes     | yes      |
 								  | `stopping`      | —              | Stopping…      | yes     | yes      |
 								  | `restarting`    | —              | Restarting…    | yes     | yes      |
 								  | `installing`    | —              | Installing…    | yes     | yes      |
 								  | `updating`      | —              | Updating…      | yes     | yes      |
 								  | `removing`      | —              | Removing…      | yes     | yes      |
 								  - Add a separate Restart button next to the primary one when state is `running`, calling new `restartContainer` action. Restart button hides while transitional.
 								- `neode-ui/src/views/ContainerAppDetails.vue:83` (and full stop/start button blocks around L220, L232) — mirror the same single-button pattern.
 								- Also audit line 239 of `ContainerApps.vue` (`some((app) => store.getAppState(app.id) === 'created')`) and the logic around lines 276, 295, 309, 312 — make sure they use `getAppVisualState` where appropriate.
 								### Verification gates (do not skip)
 1. `~/.cargo/bin/cargo check -p archipelago` on .116 via SSH
 2. `~/.cargo/bin/cargo test -p archipelago` on .116 via SSH — at least the new merge helper test must pass
 3. Build release binary on .116: `nohup ~/.cargo/bin/cargo build --release -p archipelago > /tmp/cargo-build.log 2>&1 < /dev/null & disown`. Poll until done.
 4. Deploy the binary to .228 at `/usr/local/bin/archipelago`, first backing up the previous one to `/usr/local/bin/archipelago.bak-pre-async-stop`. Then `sudo systemctl restart archipelago` on .228.
 5. **Manual LND stop test on .228**:
 								   - Open dashboard, confirm LND is Running (first: `ssh archipelago@192.168.1.228 'podman start lnd'` — LND is currently Exited(0) from the demo)
 								   - Click Stop
 								   - Expected: button _immediately_ becomes "Stopping…" with spinner (RPC returns <1s)
 								   - Dashboard should stay on "Stopping…" for ~5 min
 								   - Then the button flips to "Start"
 								   - At no point should it revert to "Running" mid-stop
 6. Same test with Bitcoin Core stop (longest timeout, 600s)
 7. Frontend build: `cd ~/Projects/archy/neode-ui && npm run type-check && npm run build`. Rsync `dist/` to `archipelago@192.168.1.228:/var/lib/archipelago/web-ui/` (or wherever the active web root is — check `/etc/nginx` on .228 first).
 8. Then and only then: resume chaos matrix. First recover LND/ElectrumX via UI (great end-to-end test of the new async Start path), then run smoke → full 32-case matrix.
 								### Key files (exact lines of interest)
 								- `core/archipelago/src/api/rpc/container.rs:85-107` — `handle_container_stop` (blocking — target of fix)
 								- `core/archipelago/src/api/rpc/container.rs:61-83` — `handle_container_start`
 								- `core/archipelago/src/api/rpc/container.rs:148-154` — narrow state mapping (drops transitional → "unknown")
 								- `core/archipelago/src/api/rpc/package/runtime.rs:11-24` — `stop_timeout_secs` table (reference, unchanged)
 								- `core/archipelago/src/api/rpc/package/runtime.rs:122-173` — `handle_package_stop` (also blocking, mirror treatment)
 								- `core/archipelago/src/api/rpc/package/runtime.rs:28-119` — `handle_package_start`
 								- `core/archipelago/src/api/rpc/package/runtime.rs:176-242` — `handle_package_restart`
 								- `core/archipelago/src/api/rpc/package/progress.rs` — existing broadcast pattern to mirror (`set_install_progress`, `set_uninstall_stage`)
 								- `core/archipelago/src/api/rpc/mod.rs:62-100` — `RpcHandler` struct (already holds `Arc<dyn ContainerOrchestrator>` + state_manager)
 								- `core/archipelago/src/server.rs:812-857` — `scan_and_update_packages` (merge loop at L850-857 is where transitional-state clobber happens)
 								- `core/archipelago/src/container/docker_packages.rs:636-663` — `convert_state` + `package_state_str` (read-only reference, no change)
 								- `core/archipelago/src/container/traits.rs` — `ContainerOrchestrator` trait (stays synchronous, do not change)
 								- `core/archipelago/src/crash_recovery.rs` — `mark_user_stopped` / `clear_user_stopped` (call order preserved)
 								- `core/archipelago/src/data_model.rs:107-124` — `PackageState` enum (no change — all variants exist)
 								- `neode-ui/src/api/container-client.ts` — `ContainerStatus` type + RPC methods (extend)
 								- `neode-ui/src/stores/container.ts:93-312` — Pinia store (add `getAppVisualState`, add `restartContainer` action)
 								- `neode-ui/src/views/ContainerApps.vue:85-136, 239, 276, 295, 309-312, 383` — two-button block + state reads
 								- `neode-ui/src/views/ContainerAppDetails.vue:83, 220, 232` — details page Stop/Start
 								### Chaos harness (not in repo — lives on .116)
 								- `archipelago@192.168.1.116:~/ui-chaos/` — deployed, playwright + deps installed, smoke test for bitcoin-core passes (2.1 min). LND/ElectrumX/bitcoin-ui smoke tests not yet run (blocked on the async-stop fix landing; LND currently Exited on .228 from the demo).
 								- `/tmp/chaos/` on laptop — canonical source for rsync to .116.
 								- Run: `cd ~/ui-chaos && npx playwright test tests/<spec>`
 								- Target: 32 cases = 4 core containers × 8 scenarios (install-fresh, graceful-stop, sigkill, rm-container, oom-kill, rm-image, restart-service, network-partition).
 								- Uses SSH+Playwright hybrid per design; includes the `bash -lc '<escaped>'` single-quote fix for ssh argv flattening and JSON-parsed `podman inspect` instead of Go templates.
 								### Pre-existing bugs still deferred (do not fix until Stop UX lands)
 1. `archipelago --version` spawns server (should be a pure CLI query)
 2. RPC unknown-method returns generic error (should return method-not-found with the bad method name)
 3. `docker_packages.rs` filters out UI containers (`archy-lnd-ui`, `archy-electrs-ui`) — some views need them visible
 4. `lnd.lan_address` stale on .228
 5. first-boot silent failure on some hardware
 6. `web-ui.failed.*` scar on .228 (benign systemd unit state)
 7. `test_parse_image_versions` pre-existing broken assertion — fix or `#[ignore]` when touching that area
 								---
 								## Where we are
 								Working through the 11-step plan in [`rust-orchestrator-migration.md`](./rust-orchestrator-migration.md).
 								- [x] **Step 1** — `3767c267` ContainerConfig schema with `build:`, `ResolvedSource` enum, `resolve()`, 10 tests
 								- [x] **Step 2** — `34af4d9d` ContainerRuntime trait gained `image_exists` + `build_image`, 4 argv tests, 25/25 pass
 								- [x] **Step 3** — `b6a04d31` ProdContainerOrchestrator (999 LOC), 16 tests all pass, not yet wired to main.rs
 								- [x] **Step 4** — `e8a59c93` ContainerOrchestrator trait, RpcHandler uses it in prod (+ `13858842` chore gitignore ._*)
 								- [x] **Step 5** — `fc39b04b` BootReconciler with `Arc<Notify>` shutdown, 4 paused-time tests pass
 								- [x] **Step 6** — `48f08aa3` main.rs wire-up (orchestrator construction + adopt_existing + BootReconciler spawn + shutdown Notify)
 								- [x] **Step 7** — `069bc4a5` bitcoin-ui pre-start hook + embedded nginx.conf template (8 unit tests + 1 integration test), 39/39 container:: tests pass
 								- [x] **Step 8a** — `a0707f4d` retire archipelago-reconcile.{service,timer} + ISO builder touchpoints, keep scripts for update.rs
 								- [x] **Step 9** — **Hot-swap on .228 verified.** All three UIs (bitcoin-ui/lnd-ui/electrs-ui) installing + serving HTTP 200.
 								- [x] **.228 dashboard bugs** — ExtraHost `192.168.1.254` bug (`3ee192ba`) + LND macaroon permission bug (`be960023`). See "Post-Step 9 bug hunt" below.
 								- [ ] **Step 8b** — Port remaining ~25 container creations from `first-boot-containers.sh` into `apps/<id>/manifest.yml`, then port `update.rs` to orchestrator (deferred, multi-day work)
 								- [ ] **Step 8c** — Rename `first-boot-containers.sh` → `first-boot-setup.sh`, strip container ops, keep setup. Delete `reconcile-containers.sh` + `container-specs.sh`. Add ISO lines to copy `apps/` (final one-way door, requires 8b complete)
 								- [ ] **Step 10** — Hot-swap + verify on .116 (adoption-heavy test — .116 already has all containers running)
 								- [ ] **Step 11** — Chaos matrix on both nodes (all 8 scenarios × all containers incl. bitcoin-core)
 								## Post-Step 9 bug hunt (.228, 2026-04-23)
 								User reported three visible dashboard bugs after Step 9 verification:
 1. LND — "no connect details or QR"
 2. ElectrumX — stuck at "Building index (2 KB / ~130 GB)" for days
 3. bitcoin-core — in scope for chaos testing
 								**Root cause #1 (ExtraHost, commit `3ee192ba`)**: `scripts/first-boot-containers.sh` computed `HOST_GATEWAY` from `ip route show default`, which returns the **LAN router** (e.g. 192.168.1.254), not the gateway to the host. Every container configured with `--add-host=host.containers.internal:$HOST_GATEWAY` was dialing the WiFi router instead of the host. LND crash-looped with `dial tcp 192.168.1.254:8332: connection refused`; ElectrumX's DAEMON_URL hit the same dead end; any `archy-net` bridge consumer of bitcoin-core's RPC was broken. Fixed by replacing the computed value with podman's magic `host-gateway` literal (supported since 4.4; we ship 5.4.2). Live-recreated bitcoin-core/electrumx/lnd on .228 with the corrected `--add-host`; LND reached chain backend; ElectrumX resumed indexing (went from 2 KB → 164.9 MB in under an hour).
 								**Root cause #2 (macaroon permissions, commit `be960023`)**: LND's `admin.macaroon` lives at `/var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon`, owned by rootless-podman subordinate UID 100000, mode 640. The archipelago server runs as host UID 1000 and literally cannot read the file. Every LND RPC (`getinfo`, `connect-info`, `export-channel-backup`) plus the shared `lnd_client()` helper failed with "Failed to read LND admin macaroon". **Confirmed pre-existing on .116 too** (long-standing bug unrelated to Step 9). Fix: centralised the path as `LND_ADMIN_MACAROON_PATH`, added a `read_lnd_admin_macaroon()` helper in `api/rpc/lnd/mod.rs` that tries direct read first then falls back to `sudo -n cat` (mirrors the pattern already used for Tor onion hostnames). Four call sites routed through the helper. Verified on .228 — `curl -k https://<host>/lnd-connect-info` now returns 200 with cert + macaroon + tor_onion; dashboard QR unblocked.
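
The direct-read-then-`sudo -n cat` pattern described for `read_lnd_admin_macaroon()` might look roughly like the sketch below. The function name `read_with_sudo_fallback` and the error wording are hypothetical; the real helper in `api/rpc/lnd/mod.rs` may differ in signature and error type.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Try a direct read first; if that fails, fall back to `sudo -n cat <path>`.
fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    match fs::read(path) {
        Ok(bytes) => Ok(bytes),
        Err(direct_err) => {
            // `-n` makes sudo fail instead of prompting, so a missing
            // NOPASSWD rule surfaces as an error rather than a hang.
            let output = Command::new("sudo").arg("-n").arg("cat").arg(path).output()?;
            if output.status.success() {
                Ok(output.stdout)
            } else {
                Err(io::Error::new(
                    direct_err.kind(),
                    format!(
                        "direct read failed ({direct_err}); sudo -n cat failed: {}",
                        String::from_utf8_lossy(&output.stderr).trim()
                    ),
                ))
            }
        }
    }
}
```

Keeping the direct read as the first attempt means hosts where the file is already readable never shell out, and the fallback only matters where the macaroon is owned by a subordinate UID.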
 								## Step 9 evidence (.228, 2026-04-23)
 								- Binary: Step 9 build with `732df1b8` + `ba83f9bc`, scp'd to .228 as `/usr/local/bin/archipelago`. Old binary backed up at `/usr/local/bin/archipelago.bak-pre-step9`. Later replaced with macaroon-fix build (`be960023`); previous backed up at `/usr/local/bin/archipelago.bak-pre-macaroon`.
 								- DEV_MODE override disabled (`override.conf` → `override.conf.disabled-pre-step9`).
 								- `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` populated.
 								- `/opt/archipelago/docker/bitcoin-ui/Dockerfile` replaced with the Step 7 version (no `COPY nginx.conf`). Old dir backed up as `bitcoin-ui.bak-pre-step9`.
 								- Post-start snapshot:
 								  - `🔗 Adopted 1 existing container(s): ["electrs-ui"]` — adoption of 13h-running container worked without recreation
 								  - `🔄 Boot reconciler started (interval: 30s)` — every 30s, all three app_ids reach `NoOp` after the initial install pass
 								  - `bitcoin-ui nginx.conf rendered path=/var/lib/archipelago/bitcoin-ui/nginx.conf auth_hash=97af1c18` — pre-start hook fires in `install_fresh`
 								  - `curl localhost:8334` → HTTP 200 (bitcoin-ui), `:8081` → 200 (lnd-ui), `:50002` → 200 (electrs-ui)
 								  - OCI memory limits correctly applied: bitcoin-ui=128Mi, electrs-ui=128Mi, lnd-ui=64Mi (was emitted as 0 pre-fix)
## Bugs fixed this session
1. **`parse_memory_limit` truncation bug** (`732df1b8`): lowercased "128Mi" → "128mi" → `trim_end_matches('m')` → "128i" → f64 parse fails → `None.unwrap_or(0)` → OCI `memory.limit:0` → systemd rejects MemoryMax=0. Added 6 regression tests; `create_container` now omits the limit instead of emitting 0.
2. **`archipelago.service` cgroup delegation missing** (`ba83f9bc`): belt-and-braces `Delegate=memory pids cpu io`.
3. **ExtraHost `192.168.1.254`** (`3ee192ba`): see Post-Step 9 bug hunt above.
4. **LND admin.macaroon unreadable** (`be960023`): see Post-Step 9 bug hunt above.
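
For the first bug, a suffix-aware parser avoids the single-letter trimming that produced "128i". The following is a minimal sketch, not the code from `732df1b8`: it assumes only the IEC suffixes `Ki`/`Mi`/`Gi` and bare byte counts need to be accepted, and it returns `None` for anything else so the caller can omit the limit rather than emit 0.

```rust
/// Parses a memory quantity such as "128Mi" into bytes.
/// Returns `None` for unknown suffixes, non-numeric input, or non-positive values.
pub fn parse_memory_limit(input: &str) -> Option<u64> {
    let s = input.trim();
    // Two-letter IEC suffixes are matched as whole units, never letter by letter.
    const UNITS: [(&str, u64); 3] = [("Ki", 1 << 10), ("Mi", 1 << 20), ("Gi", 1 << 30)];

    let (number, factor) = UNITS
        .iter()
        .find_map(|(suffix, factor)| s.strip_suffix(*suffix).map(|n| (n, *factor)))
        .unwrap_or((s, 1)); // no suffix: treat the value as plain bytes

    let value: f64 = number.trim().parse().ok()?;
    if !value.is_finite() || value <= 0.0 {
        return None;
    }
    Some((value * factor as f64) as u64)
}
```

Returning `Option<u64>` keeps the "omit instead of emitting 0" decision at the call site, where a `None` simply means no memory field is written.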
## Commits made this session

```
3ee192ba fix(first-boot): use podman host-gateway magic for host.containers.internal
be960023 fix(lnd): read admin macaroon via sudo fallback
b8ef0a0 docs: STATUS.md through Step 9 (.228 hot-swap verified)
ba83f9bc feat(systemd): delegate cgroup controllers to archipelago.service
732df1b8 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
a0707f4d refactor: retire archipelago-reconcile.{service,timer}  (Step 8a)
c81a739 docs: split Step 8 into 8a/8b/8c
e46932f docs: STATUS.md through Step 7
bc4a5 feat: bitcoin-ui pre-start hook (Step 7)
```
Branch is **19 commits ahead of tx1138/main** (local only; user pushes to mirrors personally).
## Uncommitted state
Clean. Only untracked: `tests/` (bats harness from prior session, not in scope), `tmp-dump-spec.py` (scratch).
## Answered design questions (no need to re-ask)

1. UI container naming → `archy-<app_id>` for UIs only; existing bitcoin-knots/lnd/electrumx keep bare names
2. BITCOIN_RPC_AUTH injection → runtime bind-mount of nginx.conf (no build-args, no envsubst)
3. Reconciler interval → 30 seconds
4. Concurrency → per-app `Mutex<()>` in a `DashMap`
5. Bash scripts → split into 8a/8b/8c; 8a done, 8b/8c deferred
6. Step 4 extension → `ContainerOrchestrator` trait includes `install(app_id)`; the `manifest_path`-based install RPC stays dev-only
7. Step 7 bitcoin-ui template → embed via `include_str!`, render on install + every reconcile, atomic tmp+rename to `/var/lib/archipelago/bitcoin-ui/nginx.conf`, bind-mount into container. RPC user hardcoded `archipelago`, password from `/var/lib/archipelago/secrets/bitcoin-rpc-password`.
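
The "atomic tmp+rename" part of the last answer can be sketched with the standard library alone. This is an illustrative version, not the Step 7 code: the function name `write_atomic` and the `.tmp` sibling naming are assumptions.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Writes `contents` to a sibling temp file, then renames it over `dest`.
/// Readers of `dest` see either the old file or the new one, never a partial write.
pub fn write_atomic(dest: &Path, contents: &[u8]) -> io::Result<()> {
    let file_name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "dest has no file name"))?;

    // Keep the temp file in the same directory so the rename stays on one filesystem.
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(".tmp");
    let tmp = dest.with_file_name(tmp_name);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?; // flush data to disk before the rename makes it visible
    drop(file);

    fs::rename(&tmp, dest)
}
```

Because the file is re-rendered on every reconcile, the rename matters: a container that has the path bind-mounted should never observe a half-written `nginx.conf`.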
## Context: which host is what

| Host | IP | Role | Dashboard pw | Sudo pw |
|---|---|---|---|---|
| `archy` | 192.168.1.116 | **Dev ThinkPad** (Lenovo X250, Debian 13). Currently running v1.7.42-alpha (DEV_MODE). Step 10 target. | archipelago | ThisIsWeb54321@ |
| `archy228` | 192.168.1.228 | Kiosk HP ProDesk. **Step 9 landing zone**: now running Rust-orchestrator binary in prod mode. | password123 | archipelago |
Both are development alpha nodes: **full destructive latitude**, no need to ask before stop/start/rebuild.

## Next action
**Step 10: Hot-swap on .116.**
Unlike .228 (which tested the INSTALL path for net-new UI containers), .116 tests the ADOPTION path: it already has all three UIs and all backend containers running from prior v1.7.42-alpha runs. We want to verify the new prod orchestrator adopts every existing container without recreating or restarting them.
Steps:

1. Disable DEV_MODE on .116 (check whether `override.conf` exists in `/etc/systemd/system/archipelago.service.d/`)
2. Stage the already-built binary at `~/Projects/archy/core/target/release/archipelago` → `/usr/local/bin/archipelago.new`
3. Ensure `/opt/archipelago/apps/{bitcoin-ui,electrs-ui,lnd-ui}/manifest.yml` present (copy from repo)
4. Ensure `/opt/archipelago/docker/bitcoin-ui/` matches the Step-7 layout (no baked nginx.conf)
5. Snapshot: `podman ps -a --format "{{.Names}}\t{{.Status}}\t{{.CreatedAt}}"` → save to `/tmp/pre-step10-containers.txt`
6. `systemctl stop archipelago` → install binary → `systemctl start archipelago`
7. Verify in journal: every running container appears in "Adopted N existing container(s)"; no container was recreated; all HTTP smokes still 200; BootReconciler reaches NoOp on every app_id after one pass.
8. If broken → restore `.bak` binary, re-enable DEV_MODE override.
9. Commit STATUS.md update.
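
The "no container was recreated" check in the verify step can be made mechanical by comparing the pre-swap snapshot with a second one taken after the restart. A minimal sketch, assuming both snapshots are tab-separated `name`, `status`, `created_at` lines as produced by the snapshot step; the function names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Parses `name\tstatus\tcreated_at` lines into a map of name -> created_at.
/// Lines with fewer than three columns are skipped.
pub fn parse_snapshot(text: &str) -> BTreeMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let name = cols.next()?.trim();
            let _status = cols.next()?; // status changes are expected; ignored here
            let created = cols.next()?.trim();
            if name.is_empty() {
                return None;
            }
            Some((name.to_string(), created.to_string()))
        })
        .collect()
}

/// Names of containers that vanished or whose creation timestamp changed.
/// An empty result means every pre-existing container was kept as-is.
pub fn recreated_or_missing(
    before: &BTreeMap<String, String>,
    after: &BTreeMap<String, String>,
) -> Vec<String> {
    before
        .iter()
        .filter(|(name, created)| after.get(*name) != Some(*created))
        .map(|(name, _)| name.clone())
        .collect()
}
```

Only the creation timestamp is compared, because the status column ("Up 13 hours") changes on its own and would produce false alarms.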
**Risk on .116:** If adoption fails mid-flight, we'd lose the running v1.7.42 backend that I'm currently typing at. Keep a second SSH session open to the ThinkPad for emergency revert. The backup plan is `install /usr/local/bin/archipelago.bak /usr/local/bin/archipelago && systemctl restart archipelago`.
**After Step 10 we are blocked on Step 8b** (multi-day manifest ports) before Step 11 (chaos matrix).
---

### Why Step 8 got split (discovered 2026-04-23)

Original plan was one commit "delete bash + edit ISO builder". But on investigation:

- `first-boot-containers.sh` creates **30+ containers** with per-container logic (wallets, DB init, rpcauth derivations, post-create health waits). The repo only has manifests for 3 (bitcoin-ui, electrs-ui, lnd-ui from Step 7). Deleting bash now = brick first-boot on fresh installs.
- Script also does non-container setup: secret generation (RPC pw, DB pw, FileBrowser admin pw), UID-mapping chowns for rootless podman subuid, Tor hostnames dir, WireGuard, firewall rules, nostr-relay dir. None of this lives in the Rust orchestrator.
- `update.rs` (OTA update RPC) invokes `reconcile-containers.sh` at two sites. Deleting the script breaks package updates. Porting those call sites to the orchestrator needs all containers to have manifests.
- Design doc §505 updated to split 8 → 8a/8b/8c. Only 8a (delete the reconcile systemd unit + timer, which BootReconciler covers) is safe to execute before we port manifests.
 								---
 								# Archipelago — Current State, Plan, and Releases
 								Updated: 2026-04-22
 								This is the "pick this up tomorrow" page. One-stop summary of where we are, what the plan is, and what's shipped. Detailed plan lives in [`bulletproof-containers.md`](./bulletproof-containers.md).
 								---
 								## Current state
 								### Fleet status
 								All four Gitea mirrors are synced to v1.7.40-alpha:
 								| Mirror | Host | Status |
 								|---|---|---|
 								| tx1138 | https://git.tx1138.com | ✅ v1.7.40-alpha live |
 								| gitea-local | http://localhost:3000 | ✅ v1.7.40-alpha live |
 								| .160 | http://23.182.128.160:3000 | ✅ v1.7.40-alpha live (Gitea recovered via `podman system renumber` — see below) |
 								| .168 | http://146.59.87.168:3000 | ✅ v1.7.40-alpha live |
 								Fleet test nodes:
 								| Node | Version | State |
 								|---|---|---|
 								| .103 (dev) | 1.7.40 | running, being developed against |
 								| .116 (this box) | 1.7.40 | healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui` after v1.7.38/39 bug |
 								| .198 | 1.7.39 → 1.7.40-alpha | healed manually |
 								| .228 (primary test) | 1.7.40-alpha | healed manually; bitcoin-core + lnd + electrumx running; UI companions currently missing; bitcoin.conf rpcauth patched live |
 								| .249 (ISO test) | unreachable today | |
 								| .253 | 1.7.39 → 1.7.40-alpha | healed manually |
 								### Known open issues (drives the plan below)
1. **UI companion containers disappear** on .228 after daemon restarts, with no auto-recreate (fixed by v1.7.45 Quadlet migration)
2. **bitcoin.conf rpcauth drifts** from canonical secret → ElectrumX "Daemon connection problem" (fixed by v1.7.43 reconcile::derived)
3. **`host.containers.internal`** resolves to LAN gateway inside containers on some versions (fixed by v1.7.42 containers.conf)
4. **Podman state DB loss** requires manual recovery (fixed by v1.7.44 startup self-heal)
5. **LND "Connect Wallet" info** vanishing after crashes: symptom of the same drift class as #2
6. **ElectrumX not syncing** on .228: downstream of #2; will resolve when bitcoin.conf is reconciled
 								### Recent field incident (2026-04-22)
 								- Shipped v1.7.38 + v1.7.39, both broke nginx fleet-wide because the frontend tarball's root dir was `drwx------` (700). Every node that OTA'd got 500 errors on every page.
 								- Root-cause fix shipped in v1.7.40 (`create-release-manifest.sh` chmod + pre-ship assertion that `tar tvzf | head -1` shows `drwxr-xr-x`).
 								- .160 Gitea was down all day (502) because its rootless podman's `libpod/bolt_state.db` had vanished. Recovered via clearing `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber`.
 								- Full failure-mode audit is in [`bulletproof-containers.md`](./bulletproof-containers.md).
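
The pre-ship assertion from the incident notes above reduces to a check on the first line of the archive listing. A minimal sketch, assuming the listing text has already been captured from `tar tvzf`; the function name is hypothetical and the real check lives in the shell script.

```rust
/// Returns true when the first entry of a `tar tvzf` listing is a directory
/// whose mode column is exactly `drwxr-xr-x` (755).
pub fn root_dir_is_traversable(listing: &str) -> bool {
    listing
        .lines()
        .next()
        // The mode string is the first whitespace-separated column.
        .and_then(|line| line.split_whitespace().next())
        .map_or(false, |mode| mode == "drwxr-xr-x")
}
```

An empty listing fails the check too, which is the safe default for a release gate.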
 								---
 								## Plan
 								We're shipping a level-triggered **reconciler + Quadlet** architecture over six incremental releases. Each release closes one failure mode. See [`bulletproof-containers.md`](./bulletproof-containers.md) for the full design, code layout, test harness, chaos matrix, sources.
 								### Release roadmap
 								| Release | Closes | What lands | Status |
 								|---|---|---|---|
 								| **v1.7.41** | FM5 (bad OTA nginx 500) | Post-OTA auto-rollback. New binary probes `https://127.0.0.1/` on boot; if non-200 within 90s, restores `web-ui.bak` + calls `rollback_update()` + restarts | **in flight — deploying to .228 for test** |
 								| **v1.7.42** | FM4 (`host.containers.internal` wrong) | `/etc/containers/containers.conf` w/ `host_containers_internal_ip = 10.89.0.1`; every container gets `--add-host=host.archipelago:10.89.0.1` | pending |
 								| **v1.7.43** | FM2 (config drift) | `reconcile::derived::render_bitcoin_conf` — pure fn over canonical secret, rewrites on drift. Same for `lnd.conf` | pending |
 								| **v1.7.44** | FM6 (podman state loss) | Startup probe detects broken podman state, auto-recovers via `/run/user/$UID/*` clear + `system renumber` | pending |
 								| **v1.7.45** | FM1 + FM3 (companion orphans) | `archy-bitcoin-ui` → Quadlet `.container` unit in `/etc/containers/systemd/`. systemd (not archipelago) owns it | pending |
 								| **v1.7.46** | — | `archy-lnd-ui` → Quadlet | pending |
 								| **v1.7.47** | — | `archy-electrs-ui` → Quadlet | pending |
 								| **v1.7.48+** | all (full daemon refactor) | `core/archipelago/src/reconcile/` module replaces imperative `install.rs` container management. Main app containers become Quadlet too | pending |
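
The v1.7.43 row describes a pure render function plus a rewrite on drift. A minimal sketch of that shape follows; the config body is deliberately reduced to two lines, and `render_conf`, `reconcile`, and `Drift` are hypothetical names rather than the planned `reconcile::derived` API.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Drift {
    InSync,
    /// Carries the text that should replace the on-disk file.
    Rewrite(String),
}

/// Pure render step: the same canonical value always yields the same text.
/// A real bitcoin.conf has many more lines; this is a simplification.
pub fn render_conf(rpcauth: &str) -> String {
    format!("server=1\nrpcauth={}\n", rpcauth)
}

/// Compares the on-disk text (if any) with the rendered text.
pub fn reconcile(on_disk: Option<&str>, rpcauth: &str) -> Drift {
    let desired = render_conf(rpcauth);
    match on_disk {
        Some(current) if current == desired => Drift::InSync,
        _ => Drift::Rewrite(desired),
    }
}
```

Keeping the render step free of I/O makes drift detection a plain string comparison, which is straightforward to unit-test.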
 								Test harness (bats + Goss + Chaos Toolkit + vmtest) lands scaffold in v1.7.41, first lifecycle tests blocking v1.7.45, full matrix blocking beta tag.
 								---
 								## Release history
 								### [v1.7.41-alpha](/releases/v1.7.41-alpha/) — IN FLIGHT — 2026-04-22
 								**Post-OTA auto-rollback.** After an update lands, the node probes its own web UI through nginx — if the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.
 								Changes:
 								- `core/archipelago/src/update.rs`: `PendingVerification` struct, write marker before service restart, `verify_pending_update()` on new binary boot — probes `https://127.0.0.1/`, on fail restores `web-ui.bak` + calls `rollback_update()` + `systemctl restart archipelago`
 								- `core/archipelago/src/main.rs`: startup task invokes verifier concurrently with server
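
The probe-then-rollback logic described above is a retry loop with a deadline. A minimal sketch, with the HTTP probe abstracted into a closure so it runs without a network; `verify_with_deadline` and `Verdict` are hypothetical names, and the real `verify_pending_update()` may be structured differently.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Healthy,
    RollBack,
}

/// Calls `probe` until it succeeds or `deadline` has elapsed.
/// `probe` stands in for "GET https://127.0.0.1/ returned 200".
pub fn verify_with_deadline<F>(mut probe: F, deadline: Duration, interval: Duration) -> Verdict
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        if probe() {
            return Verdict::Healthy;
        }
        if start.elapsed() >= deadline {
            return Verdict::RollBack;
        }
        thread::sleep(interval);
    }
}
```

With the 90-second window from the release notes, a caller would pass `Duration::from_secs(90)` as the deadline and act on `Verdict::RollBack` by restoring `web-ui.bak`.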
 								### [v1.7.40-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.40-alpha/) — 2026-04-22
 								**Proper fix for the 500 error.** Fixed the v1.7.38/39 tarball-perms bug at its source — staging dir is now explicitly `chmod 755` before tar; `--mode=u=rwX,go=rX` normalizes archive perms; pre-ship assertion aborts release if `tar tvzf | head -1` isn't `drwxr-xr-x`.
 								Changes:
 								- `scripts/create-release-manifest.sh`: pre-tar chmod + tar --mode flag + post-tar verify
 								- Everything from .38 + .39 still in place (onboarding auto-heal, silent logins, app purge, AIUI in tarball)
 								### [v1.7.39-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.39-alpha/) — 2026-04-22
 								**Hotfix attempt** for v1.7.38's nginx 500 (didn't fully work — still shipped broken tarball perms). Added startup self-heal chmod in `main.rs` and post-extract chmod in `update.rs` OTA applier.
 								### [v1.7.38-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.38-alpha/) — 2026-04-22
 								**Onboarding auto-heal + silent logins + App Store trim.**
 								Changes:
 								- `auth.rs`: `is_onboarding_complete()` auto-heals from `setup_complete` + `password_hash` (prevents clear-cache → onboarding wizard bug)
 								- `useOnboarding`: tri-state — backend-unreachable no longer defaults to `/onboarding/intro`
 								- Login sounds gated by `isFirstInstallPhase()` — silent after onboarding, typing sounds unaffected
 								- Removed FIPS app, Nostr Relay, Nostr VPN, Routstr, Penpot from catalog + Rust + docker + icons
 								- Deleted 15 image versions from tx1138, .168, gitea-local registries
 								- AIUI baked into release tarball via `demo/aiui/`
 								- `prebuild` hook syncs `app-catalog/catalog.json` → `public/catalog.json`
 								(Shipped with tarball-perms bug; fleet had to be healed before v1.7.40.)
 								### [v1.7.37-alpha](https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.37-alpha/) — 2026-04-22
 								**Bitcoin Core install fixes + dynamic node UI + full-archive default.**
 								- Bitcoin Core passes explicit `-rpcbind/-rpcallowip/etc.` CLI args so vanilla image exposes RPC
 								- Split `bitcoin-core` from `bitcoin-knots` in backend `AppMetadata`
 								- bitcoin-ui auto-detects Core vs. Knots from subversion, swaps branding at runtime
 								- Storage (Full Archive · X GB / Pruned) indicator on dashboard
 								- Node Settings modal shows real values (network, storage, txindex, ZMQ, RPC port)
 								- Pull fallback to `docker.io` when no mirror carries the image
 								- Removed `prune=550` hardcode — full archive default
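
The Core vs. Knots auto-detection above keys off the node's `subversion` string. A minimal sketch of the idea in Rust, assuming Knots reports an extra `Knots:` component in its user agent (for example `/Satoshi:27.1.0/Knots:20240801/`); the actual detection lives in the bitcoin-ui frontend and may differ.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum NodeFlavor {
    Core,
    Knots,
}

/// Classifies a node by its `subversion` (user agent) string.
/// Assumption: Knots adds a `Knots:<date>` component; anything else is treated as Core.
pub fn detect_flavor(subversion: &str) -> NodeFlavor {
    let is_knots = subversion
        .split('/')
        .any(|component| component.starts_with("Knots:"));
    if is_knots {
        NodeFlavor::Knots
    } else {
        NodeFlavor::Core
    }
}
```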
 								---
 								## Key docs
 								- [`bulletproof-containers.md`](./bulletproof-containers.md) — full reconcile architecture, code layout, test matrix, chaos scenarios, sources
 								- [`BETA-RELEASE-CHECKLIST.md`](./BETA-RELEASE-CHECKLIST.md) — existing beta checklist
 								- [`BETA-ISSUES-20260328.md`](./BETA-ISSUES-20260328.md) — prior beta-blocker tracking
 								- [`hotfix-process.md`](./hotfix-process.md) — release workflow
 								- [`architecture.md`](./architecture.md) — system architecture overview
 								---
 								## How to resume
1. Check fleet mirrors are all live: `curl -sS https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/manifest.json | jq .version`
2. Read [`bulletproof-containers.md`](./bulletproof-containers.md) for the current plan
3. Check task list (`/list` or via Claude Code) for the in-flight release
4. Latest in-flight work: v1.7.41 deploying to .228 for test; will ship to all 4 mirrors once verified