docs(master-plan): WS-F — uninstall-hang root cause fixed + cascade validated

Workstream F is now in progress. The immich/grafana uninstall hang, which
produced the ghost entry, stuck progress bar, and reinstall block, is
root-caused to unbounded systemctl/podman calls in quadlet::disable_remove
and fixed in 71cc9ac4; cascade-uninstall.bats passes 7/7 on .228. Also
records the remaining F items and the pending gate-wiring decision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
archipelago 2026-06-26 05:18:39 -04:00
parent 71cc9ac46a
commit 292a2650df
3 changed files with 18 additions and 75 deletions

@ -1,5 +0,0 @@
# Meshtastic - uses official image
FROM meshtastic/meshtastic:latest
# Default configuration is in the image
# No additional setup needed

@ -1,69 +0,0 @@
app:
  id: meshtastic
  name: Meshtastic
  version: 2-daily-alpine
  description: Open-source mesh networking for LoRa radios. Create decentralized communication networks.
  container:
    image: docker.io/meshtastic/meshtasticd:daily-alpine
    pull_policy: if-not-present
  dependencies:
    - storage: 1Gi
  resources:
    cpu_limit: 1
    memory_limit: 512Mi
    disk_limit: 1Gi
  security:
    capabilities: [NET_ADMIN, SYS_ADMIN] # Required for LoRa radio access
    readonly_root: false # Needs write access for device management
    no_new_privileges: true
    user: 1000
    seccomp_profile: default
    network_policy: host # Requires host network for radio access
    apparmor_profile: meshtastic
  ports:
    - host: 4403
      container: 4403
      protocol: tcp # Meshtastic TCP API
  devices:
    - /dev/ttyUSB0 # LoRa radio device (if connected)
  volumes:
    - type: bind
      source: /var/lib/archipelago/meshtastic
      target: /var/lib/meshtasticd
      options: [rw]
  files:
    - path: /var/lib/archipelago/meshtastic/config.yaml
      content: |
        General:
          MACAddress: AA:BB:CC:DD:EE:01
        Webserver:
          Port: 4403
  environment:
    - MESHTASTIC_PORT=/dev/ttyUSB0
    - MESHTASTIC_SERIAL=true
  health_check:
    type: cmd
    endpoint: test -f /var/lib/meshtasticd/config.yaml
    interval: 30s
    timeout: 30s
    retries: 5
  networking:
    mesh_enabled: true
    local_network_access: true
  metadata:
    icon: /assets/img/app-icons/meshcore.svg
    category: networking
    tier: recommended
    repo: https://github.com/meshtastic/firmware

@ -70,7 +70,7 @@ real nodes. Until then, this plan is the priority.
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 02 code-complete (worktree) |
| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **NEW (2026-06-23)** — real bugs already found in manual multinode testing; sequenced after netbird + Phase-3 |
| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **IN PROGRESS (2026-06-26)** — root bug FIXED: uninstall could hang → ghost/stuck-bar/reinstall-block (`71cc9ac4`, unbounded systemctl/podman in `quadlet::disable_remove`); `cascade-uninstall.bats` **7/7 green on .228** w/ binary `ae349a75`. Remaining: wire CASCADE into the canonical gate run, progress-UI truthfulness, all-apps matrix, guardian/IBD state. |
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
@ -156,10 +156,27 @@ reinstall, install-progress UI, and most apps were never under test.
Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct
wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).
**✅ 2026-06-26 — root cause of the immich/grafana uninstall trio FOUND + FIXED (`71cc9ac4`).**
Single cause: `quadlet::disable_remove()` (first op in uninstall teardown, via companion +
orchestrator) ran `systemctl --user stop` / `daemon-reload` / `podman rm -f` with **no timeout**.
On rootless podman a generated unit can wedge "deactivating" while podman hangs → `systemctl stop`
blocks forever → the spawned uninstall task returns neither Ok nor Err, so (a) `set_uninstall_stage`
never fires → **frozen full-red bar**, (b) `remove_package_state_entry` never runs → **ghost stuck in
`Removing`**, (c) the install guard rejects reinstall (`already Removing`). The spawn wrapper already
reverts state on Err/removes on Ok — only a *hang* stranded it. Fix bounds all three calls
(stop→`QUADLET_STOP_TIMEOUT` + SIGKILL/reset-failed escalation; daemon-reload→30s; podman rm→timeout).
**Validated live: `cascade-uninstall.bats` 7/7 on .228** (binary `ae349a75`) — grafana install →
uninstall (no ghost, data dir gone) → reinstall → running → cleanup. NOTE: proves the happy path +
no-regression; the original hang was load/timing-induced and not separately reproduced.
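
The bounding technique described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the code from `71cc9ac4`: `run_bounded` and `Bounded` are hypothetical names, it uses only the standard library, and it kills the child directly at the deadline rather than performing the `reset-failed` escalation mentioned in the note.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a bounded command (hypothetical type for this sketch):
/// the child either exited on its own or was killed at the deadline.
#[derive(Debug)]
enum Bounded {
    Exited(ExitStatus),
    TimedOut,
}

/// Run `program args...` and poll for exit; if `limit` elapses first,
/// kill the child so the caller always gets an answer instead of hanging.
fn run_bounded(program: &str, args: &[&str], limit: Duration) -> io::Result<Bounded> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + limit;
    loop {
        // try_wait() never blocks: Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Bounded::Exited(status));
        }
        if Instant::now() >= deadline {
            // On Unix, Child::kill sends SIGKILL; ignore the error in case
            // the child exited between the poll and the kill.
            let _ = child.kill();
            // Reap the process so it does not linger as a zombie.
            child.wait()?;
            return Ok(Bounded::TimedOut);
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

With a wrapper like this, a call such as `run_bounded("systemctl", &["--user", "stop", unit], limit)` returns `Bounded::TimedOut` instead of blocking, so the surrounding task can still report Ok or Err and the state-revert path runs.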
**Workstream F scope — the gate must grow to (in priority order):**
1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps /
`container-list` / package state (no ghost), data preserved per policy, then reinstall →
verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs.
*(Test EXISTS + passes — `bats/cascade-uninstall.bats`, gated on `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`;
`run-gate.sh` still never sets it. DECISION PENDING: run a single cascade pass alongside the 5×
destructive loop vs. a dedicated cascade gate — do NOT fold uninstall/reinstall into all 5
iterations, it balloons runtime.)*
2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress
(not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal
success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
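
As an illustration of what a progress assertion could check, here is a minimal Rust sketch. The `Progress` enum and `check_progress` function are hypothetical and do not reflect the project's actual event types; the sketch only encodes the two properties named above: percentages never go backwards, and the stream ends in exactly one terminal success or failure.

```rust
/// Hypothetical progress event for this sketch; the real event type may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Progress {
    Stage { percent: u8 },
    Succeeded,
    Failed,
}

/// Check that a recorded event stream is monotonic and ends in a terminal event.
fn check_progress(events: &[Progress]) -> Result<(), String> {
    if events.is_empty() {
        return Err("no progress events".into());
    }
    let mut last = 0u8;
    for (i, ev) in events.iter().enumerate() {
        let is_last = i + 1 == events.len();
        match *ev {
            Progress::Stage { percent } => {
                if percent > 100 {
                    return Err(format!("event {i}: percent {percent} exceeds 100"));
                }
                if percent < last {
                    return Err(format!("event {i}: percent went backwards ({last} -> {percent})"));
                }
                if is_last {
                    // A stream that stops on a stage is the "silent hang" case.
                    return Err("stream ended without a terminal event".into());
                }
                last = percent;
            }
            Progress::Succeeded | Progress::Failed => {
                if !is_last {
                    return Err(format!("event {i}: terminal event before end of stream"));
                }
            }
        }
    }
    Ok(())
}
```

A gate test could record the events emitted during an install or uninstall and fail the run when `check_progress` returns `Err`.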