docs(master-plan): WS-F — uninstall-hang root cause fixed + cascade validated
Workstream F now in-progress: the immich/grafana uninstall hang → ghost/stuck-bar/reinstall-block is root-caused (unbounded systemctl/ podman in quadlet::disable_remove) and fixed (71cc9ac4); cascade- uninstall.bats 7/7 on .228. Records the remaining F items + the pending gate-wiring decision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
71cc9ac46a
commit
292a2650df
@ -1,5 +0,0 @@
|
||||
# Meshtastic - uses official image
|
||||
FROM meshtastic/meshtastic:latest
|
||||
|
||||
# Default configuration is in the image
|
||||
# No additional setup needed
|
||||
@ -1,69 +0,0 @@
|
||||
app:
|
||||
id: meshtastic
|
||||
name: Meshtastic
|
||||
version: 2-daily-alpine
|
||||
description: Open-source mesh networking for LoRa radios. Create decentralized communication networks.
|
||||
|
||||
container:
|
||||
image: docker.io/meshtastic/meshtasticd:daily-alpine
|
||||
pull_policy: if-not-present
|
||||
|
||||
dependencies:
|
||||
- storage: 1Gi
|
||||
|
||||
resources:
|
||||
cpu_limit: 1
|
||||
memory_limit: 512Mi
|
||||
disk_limit: 1Gi
|
||||
|
||||
security:
|
||||
capabilities: [NET_ADMIN, SYS_ADMIN] # Required for LoRa radio access
|
||||
readonly_root: false # Needs write access for device management
|
||||
no_new_privileges: true
|
||||
user: 1000
|
||||
seccomp_profile: default
|
||||
network_policy: host # Requires host network for radio access
|
||||
apparmor_profile: meshtastic
|
||||
|
||||
ports:
|
||||
- host: 4403
|
||||
container: 4403
|
||||
protocol: tcp # Meshtastic TCP API
|
||||
|
||||
devices:
|
||||
- /dev/ttyUSB0 # LoRa radio device (if connected)
|
||||
|
||||
volumes:
|
||||
- type: bind
|
||||
source: /var/lib/archipelago/meshtastic
|
||||
target: /var/lib/meshtasticd
|
||||
options: [rw]
|
||||
|
||||
files:
|
||||
- path: /var/lib/archipelago/meshtastic/config.yaml
|
||||
content: |
|
||||
General:
|
||||
MACAddress: AA:BB:CC:DD:EE:01
|
||||
Webserver:
|
||||
Port: 4403
|
||||
|
||||
environment:
|
||||
- MESHTASTIC_PORT=/dev/ttyUSB0
|
||||
- MESHTASTIC_SERIAL=true
|
||||
|
||||
health_check:
|
||||
type: cmd
|
||||
endpoint: test -f /var/lib/meshtasticd/config.yaml
|
||||
interval: 30s
|
||||
timeout: 30s
|
||||
retries: 5
|
||||
|
||||
networking:
|
||||
mesh_enabled: true
|
||||
local_network_access: true
|
||||
|
||||
metadata:
|
||||
icon: /assets/img/app-icons/meshcore.svg
|
||||
category: networking
|
||||
tier: recommended
|
||||
repo: https://github.com/meshtastic/firmware
|
||||
@ -70,7 +70,7 @@ real nodes. Until then, this plan is the priority.
|
||||
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
|
||||
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 0–2 code-complete (worktree) |
|
||||
| E | **Production test gate** — 5× lifecycle on **.228**, per-app L1/L2 matrix; multinode is split out → `multinode-testing-plan.md` | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **✅ .228 5×-GREEN (110/110 ×5, 0 not-ok, 2026-06-23)** — but this is DESTRUCTIVE-tier / ~8 core apps only; see §6c for the coverage gaps |
|
||||
| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **NEW (2026-06-23)** — real bugs already found in manual multinode testing; sequenced after netbird + Phase-3 |
|
||||
| F | **Lifecycle perfection — cascade + progress + ALL apps** — extend the gate to uninstall/reinstall (cascade), real install/uninstall progress UI, and EVERY installed app (not just the 8 core). The "insanely-perfect OS/container environment" bar. | §6c (below), `tests/lifecycle/TESTING.md` | **IN PROGRESS (2026-06-26)** — root bug FIXED: uninstall could hang → ghost/stuck-bar/reinstall-block (`71cc9ac4`, unbounded systemctl/podman in `quadlet::disable_remove`); `cascade-uninstall.bats` **7/7 green on .228** w/ binary `ae349a75`. Remaining: wire CASCADE into the canonical gate run, progress-UI truthfulness, all-apps matrix, guardian/IBD state. |
|
||||
|
||||
**Orchestrator architecture** (foundation for A/B): `rust-orchestrator-migration.md`
|
||||
(ProdContainerOrchestrator, BootReconciler 30s level-triggered reconcile, adoption
|
||||
@ -156,10 +156,27 @@ reinstall, install-progress UI, and most apps were never under test.
|
||||
Bitcoin finishes initial sync" / "starting"** on that node — verify this is correct
|
||||
wait-for-IBD behavior vs a stuck/false state (it's a backend that depends on bitcoin sync).
|
||||
|
||||
**✅ 2026-06-26 — root cause of the immich/grafana uninstall trio FOUND + FIXED (`71cc9ac4`).**
|
||||
Single cause: `quadlet::disable_remove()` (first op in uninstall teardown, via companion +
|
||||
orchestrator) ran `systemctl --user stop` / `daemon-reload` / `podman rm -f` with **no timeout**.
|
||||
On rootless podman a generated unit can wedge "deactivating" while podman hangs → `systemctl stop`
|
||||
blocks forever → the spawned uninstall task returns neither Ok nor Err, so (a) `set_uninstall_stage`
|
||||
never fires → **frozen full-red bar**, (b) `remove_package_state_entry` never runs → **ghost stuck in
|
||||
`Removing`**, (c) the install guard rejects reinstall (`already Removing`). The spawn wrapper already
|
||||
reverts state on Err/removes on Ok — only a *hang* stranded it. Fix bounds all three calls
|
||||
(stop→`QUADLET_STOP_TIMEOUT` + SIGKILL/reset-failed escalation; daemon-reload→30s; podman rm→timeout).
|
||||
**Validated live: `cascade-uninstall.bats` 7/7 on .228** (binary `ae349a75`) — grafana install →
|
||||
uninstall (no ghost, data dir gone) → reinstall → running → cleanup. NOTE: proves the happy path +
|
||||
no-regression; the original hang was load/timing-induced and not separately reproduced.
|
||||
|
||||
**Workstream F scope — the gate must grow to (in priority order):**
|
||||
1. **CASCADE tier in the canonical gate:** uninstall → verify the app is GONE from My Apps /
|
||||
`container-list` / package state (no ghost), data preserved per policy, then reinstall →
|
||||
verify it returns healthy. Catch the immich/grafana ghost + reinstall-stops bugs.
|
||||
*(Test EXISTS + passes — `bats/cascade-uninstall.bats`, gated on `ARCHY_ALLOW_CASCADE_DESTRUCTIVE`;
|
||||
`run-gate.sh` still never sets it. DECISION PENDING: run a single cascade pass alongside the 5×
|
||||
destructive loop vs. a dedicated cascade gate — do NOT fold uninstall/reinstall into all 5
|
||||
iterations, it balloons runtime.)*
|
||||
2. **Progress-UI assertions:** install AND uninstall must report monotonic, truthful progress
|
||||
(not a stuck full-red bar); a long op must surface a real stage/percentage and a terminal
|
||||
success/failure — no silent hang. (Likely both a backend progress-event fix AND a UI fix.)
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user