# Archipelago Container Infrastructure: Critical Issues Report

**Date:** 2026-03-31

**Status:** Server .228 rebooted. Some apps recovered, many did not. The UI showed everything as "crashed" during the recovery window.

**Purpose:** Fix guide for getting container lifecycle to production quality.

---

## Executive Summary

The container system has **7 systemic failures** that compound each other:

1. **Silent failures everywhere**: errors are swallowed with `|| true`, `.unwrap_or_default()`, and warn-level logs. Nothing actually tells the user (or the system) that something broke.
2. **Health checks are fake**: manifests define real health checks (HTTP probes, exec checks) but they are **never executed**. "Healthy" just means `podman ps` shows "running".
3. **Duplicate polling burns CPU**: health monitor + metrics collector both call `podman stats` every 60 seconds independently. Add crash recovery snapshots, disk monitor, and frontend polling, and the result is constant subprocess spawning.
4. **Uninstall doesn't clean up**: no volume removal, no network cleanup, force-kills stateful containers (risking wallet/DB corruption), returns 200 OK on partial failure.
5. **Two divergent install paths**: `first-boot-containers.sh` and the Rust RPC installer use different passwords, ports, capabilities, memory limits, and Bitcoin config. They are never in sync.
6. **UI misrepresents state**: `Exited` (even clean exit code 0) shows as "crashed". No "recovering" or "starting up" state exists. During boot recovery, the UI shows a wall of red/gray "crashed" labels.
7. **Dependency-blind restarts**: health monitor restarts services without restarting their dependencies first, so they immediately fail again and burn through the 3-attempt limit.
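
Item 1 can be made concrete with a small sketch. The Rust below is not taken from the Archipelago codebase: `StartFailure`, `start_all`, and the `start` callback are hypothetical names used only for illustration. It shows one way to continue past a failed container while still reporting every failure at the end, rather than discarding it the way `|| true` does.

```rust
/// One failed start attempt (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct StartFailure {
    pub name: String,
    pub reason: String,
}

/// Tries to start every container in `names` using the caller-supplied
/// `start` callback. A failure does not abort the loop, but it is recorded,
/// and the function returns `Err` with all recorded failures if any occurred.
pub fn start_all<F>(names: &[&str], mut start: F) -> Result<(), Vec<StartFailure>>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut failures = Vec::new();
    for &name in names {
        if let Err(reason) = start(name) {
            failures.push(StartFailure {
                name: name.to_string(),
                reason,
            });
        }
    }
    if failures.is_empty() {
        Ok(())
    } else {
        Err(failures)
    }
}
```

A caller could pass a closure that shells out to `podman start` and turn a non-empty failure list into a non-zero exit status or an RPC error, so that systemd and the UI see the partial failure.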
---

## LIVE EVIDENCE: .228 Reboot on 2026-03-31

After rebooting .228, here's the actual container state 30 minutes later:

### Permanently Dead (exceeded 3 restart attempts, abandoned)

| Container | Exit Code | Cause |
|-----------|-----------|-------|
| `indeedhub-postgres` | 0 (clean) | Shut down by reboot. Health monitor tried 3 restarts, it keeps exiting cleanly. Once abandoned, all dependent services die too. |
| `indeedhub-redis` | 0 | Same: clean exit, 3 failed restart attempts, abandoned |
| `indeedhub-minio` | 0 | Same |
| `indeedhub-relay` | 0 | Same |
| `indeedhub` | 0 | Same |
| `indeedhub-api` | 1 | Can't resolve hostname `indeedhub-postgres` (postgres is dead, DNS entry gone from network) |
| `jellyfin` | 137 (OOM) | "Failed to create CoreCLR": memory limit too low for .NET runtime. Exit 137 = SIGKILL, consistent with an OOM kill. 3 attempts exhausted. |

### Crash-Looping (still failing on every restart)

| Container | Cause |
|-----------|-------|
| `mempool-api` | `ECONNREFUSED 10.89.0.42:3306`: DB (`archy-mempool-db`) just restarted, not ready yet |
| `portainer` | "database schema version does not align with server version": image upgraded, DB not migrated. Will NEVER recover. |
| `photoprism` | "Failed creating test file in storage folder": volume permission issue (rootless UID mapping) |

### Never Started (stuck in "Created" state)

| Container | Cause |
|-----------|-------|
| `archy-mempool-web` | "cannot assign requested address": network binding failure |
| `fedimint` | Same network error |

### Running but Unhealthy

| Container | Notes |
|-----------|-------|
| `homeassistant` | Up 14 min, health check failing |
| `searxng` | Up 13 min, health check failing |
| `onlyoffice` | Up 10 min, health check failing |

### Actually Recovered (healthy)

`filebrowser`, `bitcoin-knots`, `vaultwarden`, `nginx-proxy-manager`, `archy-btcpay-db`, `lnd`, `electrumx`, `grafana`

### Key Observations

1. 
   **All containers have `unless-stopped` restart policy**, but this doesn't help because containers that exit cleanly (code 0) don't get restarted by Podman. The health monitor is the only restart mechanism, and it gives up after 3 attempts.
2. **The entire IndeedHub stack died** because postgres was abandoned first. Once postgres hit 3 restart attempts, every dependent service (api, redis, minio, relay, main) also failed and hit their own 3-attempt limit. **No dependency awareness.**
3. **Containers in "Created" state** were never even started, due to some kind of network assignment failure during creation. The health monitor doesn't handle "Created" state containers.
4. **The UI showed ALL apps as "crashed"** during the first few minutes, even the ones that eventually recovered. This is because `Exited` state (even exit code 0) maps to the label "crashed" in `appsConfig.ts`.

---

## Problem 1: Containers Don't Start or Recover After Reboot

**Confirmed:** All apps crashed after .228 reboot on 2026-03-31.

### Root Causes

#### A. Crash recovery has a 30-second timeout that's too short

**File:** `core/archipelago/src/crash_recovery.rs:265-271`

```rust
let result = tokio::time::timeout(
    std::time::Duration::from_secs(30),
    tokio::process::Command::new("podman").args(["start", &record.name]).output(),
).await;
```

On a cold boot with many containers, Podman is under load. 30 seconds is not enough. If it times out, the container is **skipped** with no retry.

#### B. If `podman ps` itself times out, recovery finds zero containers

**File:** `core/archipelago/src/crash_recovery.rs:318`

The `podman ps -a` call to discover stopped containers has a 30-second timeout. On a busy system post-reboot, this can time out. Result: `all_names` is empty, and recovery silently exits having started nothing.

#### C. Boot tier ordering uses a catch-all that misses dependencies

**File:** `core/archipelago/src/crash_recovery.rs:374-385`

```rust
fn container_boot_tier(name: &str) -> u8 {
    match id {
        "btcpay-db" | "mempool-db" | ... => 0,  // databases
        "bitcoin-knots" | ... => 1,             // bitcoin
        "lnd" | "electrumx" | ... => 2,         // depends on bitcoin
        "mempool-web" | ... => 4,               // frontend
        _ => 3,  // EVERYTHING ELSE - may start before its dependencies
    }
}
```

Any app not explicitly listed gets tier 3, which may be before its dependencies are ready.

#### D. First-boot script swallows ALL errors

**File:** `scripts/first-boot-containers.sh:8` (no `set -e`)

48+ commands have `|| true` appended. Every `podman run` failure is silently ignored. The script always exits 0 and reports "complete" to systemd even if 50% of containers failed.

#### E. Install RPC returns success before container is actually running

**File:** `core/archipelago/src/api/rpc/package/install.rs:260-294`

After container creation, the installer polls for 30 seconds (6 checks x 5 seconds). If the container is still in "created" or "starting" state after 30 seconds:

```rust
if i == 5 {
    debug!("Container {} health check timeout (30s) -- continuing anyway");
}
```

It logs at debug level and **returns success**. The user sees "installed" but the container never actually started.

### Fixes Required

1. **Increase crash recovery timeout to 120s** and add retry with backoff (3 attempts per container)
2. **Increase `podman ps` timeout to 60s** during boot recovery
3. **Replace tier catch-all**: every container must be explicitly listed or derived from manifest dependencies
4. **Remove `|| true`** from critical commands in first-boot-containers.sh. Use proper error handling: log the error, record the failure, continue to next container, but report actual failures at the end
5. **Install RPC must return failure** if container isn't running after timeout, not silently succeed
6. 
   **Add `--restart unless-stopped`** to container creation in the Podman client (`core/container/src/podman_client.rs:303-335`). It is currently missing, so Podman itself never auto-restarts crashed containers.

---

## Problem 2: Health Checks Are Fake

### Root Causes

#### A. "Healthy" just means "running": application health is never checked

**File:** `core/archipelago/src/container/dev_orchestrator.rs:239-249`

```rust
pub async fn get_health_status(&self, app_id: &str) -> Result {
    match status.state {
        ContainerState::Running => Ok("healthy".to_string()),  // <-- THIS IS THE ENTIRE CHECK
        ContainerState::Stopped | ContainerState::Exited => Ok("unhealthy".to_string()),
        ...
    }
}
```

A container can be "running" while the application inside is completely broken. This is reported as "healthy".

#### B. Manifest health checks exist but are never executed

All 30+ app manifests in `image-recipe/build/debian-iso/custom/archipelago/apps/*/manifest.yml` define health checks like:

```yaml
health_check:
  type: http
  endpoint: http://localhost:4080
  path: /api/health
  interval: 30s
  timeout: 5s
  retries: 3
```

The `HealthMonitor` struct at `core/container/src/health_monitor.rs` can execute these checks. **But it is never instantiated.** No code path creates a `HealthMonitor` from the manifest health check definitions.

#### C. Health status is never pushed to the frontend via WebSocket

**File:** `core/archipelago/src/data_model.rs:120-127`

```rust
pub struct PackageDataEntry {
    pub health: Option,  // Field exists but is NEVER POPULATED
}
```

The health field in the data model is always `None`. The frontend can only get health via an explicit RPC call, which it almost never makes.

#### D. Frontend never polls health status

**File:** `neode-ui/src/stores/container.ts:169-175`

`fetchHealthStatus()` is only called after `startContainer()` and `startBundledApp()`. There is **no setInterval, no periodic polling, no watch**. After the initial call, health status is never refreshed.

### Fixes Required

1. 
   **Wire up manifest health checks**: instantiate `HealthMonitor` from manifest definitions, run actual HTTP/exec probes instead of just checking `podman ps`
2. **Populate the `health` field in `PackageDataEntry`** so WebSocket pushes real health status to frontend
3. **Add 30-second health polling** in the frontend container store (with backoff to 60s when all healthy)
4. **Fix `get_health_status()`** in dev_orchestrator to call actual health checks, not just check container state

---

## Problem 3: CPU Exhaustion from Duplicate Polling

### Root Causes

#### A. Two independent monitors both call `podman stats` every 60 seconds

- **Health monitor:** `core/archipelago/src/health_monitor.rs:17` (`CHECK_INTERVAL_SECS = 60`)
  - Runs `podman ps -a --format json` (line 305-323)
  - Runs `podman stats --no-stream` every 5 cycles (line 442-450)
- **Metrics collector:** `core/archipelago/src/monitoring/mod.rs:28` (60-second interval)
  - Runs `podman stats --no-stream --format json` independently (collector.rs:220-224)

These are **not coordinated**. Both spawn separate subprocesses. On a system with 15+ containers, each `podman stats` call is expensive.

#### B. Total subprocess spawning frequency

| Component | Interval | What it runs |
|-----------|----------|-------------|
| Health monitor | 60s | `podman ps`, `podman stats` (every 5th), restart attempts |
| Metrics collector | 60s | `podman stats` (duplicate!) |
| Crash recovery snapshot | 120s | `podman ps` |
| Disk monitor | 300s | `df`, `sudo dmesg`, potentially `podman image prune` |
| Telemetry | 900s | `podman stats` (another duplicate) |
| Systemd watchdog | 120s | sd_notify ping |
| Frontend fleet polling | 60s | RPC calls that trigger more podman commands |

That's roughly **one `podman` subprocess every 10-15 seconds** on average, plus all the triggered operations.

#### C. No restart policy means polling-driven restarts

**File:** `core/container/src/podman_client.rs:303-335`

Container creation spec does NOT include `RestartPolicy`. Podman itself never restarts crashed containers. Instead, the health monitor's 60-second poll detects the crash and attempts a restart. This is far more CPU-intensive than Podman's built-in restart mechanism.

#### D. Health monitor restart attempts with exponential backoff still spawn processes

When a container fails, the health monitor tries restarts at 10s, 30s, 90s backoff. Each attempt spawns `podman start`, `podman inspect`, etc. If multiple containers are unhealthy, this multiplies.

### Fixes Required

1. **Deduplicate `podman stats`**: create a shared cache layer. One component fetches, others read from cache (TTL: 30s)
2. **Add `RestartPolicy: unless-stopped` with MaxRetryCount: 5** to all container creation, so Podman handles restarts natively instead of polling
3. **Increase health monitor interval to 120s** (60s is too aggressive when health checks are just `podman ps`)
4. **Remove duplicate `podman stats`** call from metrics collector and share data with health monitor
5. **Make frontend fleet polling viewport-aware**: only poll when user is actually viewing the fleet page
6. **Batch all container queries**: use a single `podman ps -a --format json` per check cycle, shared across all consumers

---

## Problem 4: Uninstall Doesn't Work

### Root Causes

#### A. No volume removal

**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`

The uninstall function stops containers, removes containers, releases ports, and attempts data directory cleanup. It **never removes Podman volumes**. Orphaned volumes accumulate forever.

#### B. No network cleanup

**File:** `core/archipelago/src/api/rpc/package/runtime.rs:172-289`

Multi-container stacks create networks (`archy-net`, `immich-net`, `penpot-net`) during install (`stacks.rs:89, 211`). These are **never cleaned up** during uninstall.
Leftover networks can prevent reinstallation.

#### C. Force-kills stateful containers without graceful shutdown

**File:** `core/archipelago/src/api/rpc/package/runtime.rs:226`

```rust
let rm_out = tokio::process::Command::new("podman")
    .args(["rm", "-f", name])  // -f = force kill
    .output().await;
```

The code defines proper shutdown timeouts (Bitcoin: 600s, LND: 330s, databases: 120s) but only uses them for `stop`. The `rm -f` that follows **ignores these timeouts** and force-kills immediately. This risks corrupting Bitcoin's UTXO set, LND channel state, or database WAL.

#### D. Returns 200 OK even on partial failure

**File:** `core/archipelago/src/api/rpc/package/runtime.rs:268-289`

```rust
Ok(serde_json::json!({
    "status": if errors.is_empty() { "uninstalled" } else { "partial" },
    ...
}))
```

Returns HTTP 200 with `"partial"` status. Frontend at `neode-ui/src/views/apps/useAppsActions.ts:74` doesn't check for "partial" and deletes the app from the UI regardless.

#### E. Data directory cleanup requires sudo and fails silently

**File:** `core/archipelago/src/api/rpc/package/runtime.rs:256-265`

```rust
let rm_out = tokio::process::Command::new("sudo")
    .args(["rm", "-rf", dir]).output().await;
if let Ok(o) = rm_out {
    if !o.status.success() {
        tracing::warn!(...);  // Warning only, continues
    }
}
```

If sudo isn't configured or fails, data remains on disk but the UI shows "uninstalled".

#### F. Container name detection has gaps

**File:** `core/archipelago/src/api/rpc/package/config.rs:287-340`

Container names are hardcoded patterns. If a container was created with a different naming convention (e.g., by first-boot-containers.sh vs RPC installer), it won't be found and won't be removed.

### Fixes Required

1. **Add `podman volume rm`** for all volumes associated with the app after container removal
2. **Add network cleanup**: remove app-specific networks after all containers on that network are gone
3. 
   **Use `podman stop -t {timeout}` then `podman rm`** (without -f): respect graceful shutdown timeouts, especially for Bitcoin/LND/databases
4. **Return an error (not 200)** when uninstall has failures. Frontend must check and display errors
5. **Surface "partial" failures to the user** with specific error messages
6. **Unify container naming**: derive names from a single source (manifest), not hardcoded patterns in multiple files

---

## Problem 5: Two Divergent Install Paths

The first-boot bash script and the Rust RPC installer create containers with **different configurations**. This is a major source of bugs.

### Specific Divergences

#### A. Database passwords

- **First-boot** (`scripts/first-boot-containers.sh:118-127`): Generates random passwords with `openssl rand -base64 24`, stores in `/var/lib/archipelago/secrets/`
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:456,484,514-515,610`): Uses hardcoded `"btcpaypass"`, `"mempoolpass"`, `"rootpass"`, `"immichpass"`

**Result:** Apps installed via RPC after first-boot can't connect to databases because passwords don't match.

#### B. Bitcoin configuration

- **First-boot** (`scripts/first-boot-containers.sh:295-313`): Dynamically sets `-prune=550` on small disks, `-txindex=1` on large disks
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:415-420`): No custom args at all

**Result:** Bitcoin installed via RPC has no pruning or txindex regardless of disk size.

#### C. ZMQ configuration for LND

- **First-boot** (`scripts/first-boot-containers.sh:100-114`): Bitcoin.conf generated without ZMQ publisher settings
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:438-439`): LND configured to connect to `tcp://bitcoin-knots:28332` and `tcp://bitcoin-knots:28333`

**Result:** LND can't receive block notifications from Bitcoin because the ZMQ publishers aren't enabled on the Bitcoin side in either path.

#### D. Port conflicts

- **First-boot** (`scripts/first-boot-containers.sh:813,835`): Both strfry and indeedhub bind to host port 7777
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:734`): IndeedHub uses `8190:3000`

**Result:** On first-boot, whichever of strfry/indeedhub starts second fails. Via RPC, IndeedHub ends up on a different port entirely.

#### E. Memory limits

- **First-boot** (`scripts/first-boot-containers.sh:253-283`): Ollama gets 1g on low-mem systems
- **Rust RPC** (`core/archipelago/src/api/rpc/package/config.rs:245-280`): Ollama gets 4g always

**Result:** Same app gets different resource limits depending on how it was installed.

#### F. Version mismatches in marketplace UI

- `scripts/image-versions.sh:17`: LND image is `v0.18.4-beta`
- `neode-ui/src/views/marketplace/marketplaceData.ts:155`: Shows `0.17.4`
- `scripts/image-versions.sh:21-22`: Mempool images are `v3.0.0`
- `neode-ui/src/views/marketplace/marketplaceData.ts:177`: Shows `2.5.0`

### Fixes Required

1. **Single source of truth for container config**: Rust config must read passwords from `/var/lib/archipelago/secrets/`, not hardcode them
2. **Add ZMQ config** to Bitcoin startup in both paths: `zmqpubrawblock=tcp://0.0.0.0:28332` and `zmqpubrawtx=tcp://0.0.0.0:28333`
3. **Fix port 7777 conflict**: assign unique ports to strfry and indeedhub
4. **Add disk-aware Bitcoin config** to Rust installer (prune/txindex based on disk size)
5. **Sync memory limits** between first-boot and Rust config
6. **Update marketplace version strings** to match actual image versions in `image-versions.sh`
7. **Long-term: eliminate first-boot-containers.sh** and have the backend handle all container creation using the same Rust code path

---

## Problem 6: Post-Install Hooks Run Async and Fail Silently

**File:** `core/archipelago/src/api/rpc/package/install.rs:541-625`

Post-install hooks (setting FileBrowser password, configuring NextCloud, etc.)
are spawned as background tasks:

```rust
tokio::spawn(async move {
    let _ = tokio::fs::create_dir_all(secret_dir).await;
    let _ = tokio::fs::write(...).await;
});
```

The install RPC returns success **before hooks complete**. If a hook fails (network timeout, service not ready), the error is logged but the user is told installation succeeded. Credentials aren't set, configs aren't applied.

### Fix Required

Await post-install hooks before returning success, or return a "configuring" status and let the frontend poll for completion.

---

## Problem 7: Podman Client Swallows Errors

**File:** `core/container/src/podman_client.rs`

#### A. JSON serialization failures return empty strings (line 182-183)

```rust
let body_str = body.map(|b| serde_json::to_string(&b).unwrap_or_default()).unwrap_or_default();
```

#### B. Container ID parsing failures return empty string (line 344-348)

```rust
let id = result["Id"].as_str().unwrap_or("").to_string();
Ok(id)  // Empty string = success?
```

#### C. Socket timeout is only 5 seconds (line 154-160)

On a busy system or during boot, the Podman socket may take >5s to respond. When that happens, every API call fails. There is no retry logic.

### Fixes Required

1. Replace `.unwrap_or_default()` with proper error propagation using `?`
2. Return `Err` when container ID is empty
3. Increase socket timeout to 15-30s
4. Add retry with backoff (3 attempts) on socket connection

---

## Problem 8: UI Misrepresents Container State

### Root Causes

#### A. "Exited" always displays as "Crashed", even for clean shutdowns

**File:** `neode-ui/src/views/apps/appsConfig.ts:119-146`

```typescript
getStatusLabel(state, health):
  - "exited" → "crashed"   // <-- THIS IS THE PROBLEM
```

Every container that exited, whether from a clean reboot (exit 0), OOM kill (exit 137), or app error (exit 1), shows the same "crashed" label. After a reboot, the UI is a wall of "crashed" labels even though containers are in the process of starting up.

#### B. No "recovering" or "boot in progress" state exists

**File:** `core/archipelago/src/data_model.rs:103-119`

PackageState enum has `Starting`, but it's only set during **explicit user start actions**, not during automatic crash recovery. During boot recovery, containers transition from `Exited → Running` without ever passing through `Starting`, so the UI never shows a spinner or "starting up" message.

#### C. Backend skips sub-containers from package listing, so their state is invisible

**File:** `core/archipelago/src/container/docker_packages.rs:39-117`

The excluded_services list filters out backend services like `mempool-db`, `btcpay-db`, `nbxplorer`, `penpot-postgres`, etc. UI containers ending in `-ui` are also skipped. These containers are invisible to the user even when they're the actual cause of a stack failure (e.g., `indeedhub-postgres` being dead kills the entire IndeedHub stack, but only `indeedhub-api` errors are visible).

#### D. No distinction between "needs manual intervention" and "will recover soon"

The UI shows the same visual treatment for:

- Portainer (DB migration error: will NEVER recover without manual intervention)
- mempool-api (DB not ready yet: will recover in 30 seconds)
- IndeedHub (dependencies abandoned: won't recover until deps are manually restarted)

### Fixes Required

1. **Differentiate exit codes**: Exit 0 = "stopped" (gray), Exit non-zero = "crashed" (red), Exit 137 = "killed (OOM)" (red with warning)
2. **Add a "recovering" state**: During boot/crash recovery window (first 5 minutes after backend start), show "Starting up..." instead of "crashed" for exited containers
3. **Show sub-container health**: When a parent app is unhealthy, show which sub-service caused the failure (e.g., "IndeedHub: postgres is down")
4. **Distinguish recoverable from permanent failures**: After health monitor gives up (3 attempts), change label to "Needs attention" instead of keeping "crashed"
5. 
   **Add recovery progress indicator**: During boot, show "Recovering containers: 15/22 started" on the dashboard

---

## Problem 9: Dependency-Blind Restarts

### Root Cause (Confirmed by .228 reboot)

The health monitor restarts containers individually without considering dependencies. This was proven by the IndeedHub stack failure:

1. `indeedhub-postgres` exits cleanly (code 0) on reboot
2. Health monitor restarts postgres. It starts, but exits again (likely needs volume mount or network ready)
3. After 3 attempts, postgres is **abandoned**
4. Meanwhile, `indeedhub-api` tries to connect to postgres → `ENOTFOUND indeedhub-postgres` → exits
5. Health monitor restarts api → same DNS failure → exits
6. After 3 attempts, api is **abandoned**
7. Same cascade for redis, minio, relay, main container: all abandoned within minutes

**File:** `core/archipelago/src/health_monitor.rs:500-530`

The restart loop treats each container independently. There's no logic to:

- Check if a container's dependencies are running before restarting it
- Restart dependencies first when a dependent container fails
- Reset attempt counters when a dependency comes back online

**3 attempts is too few**, especially when dependencies need time:

- Attempt 1: 10s backoff → dependency still starting
- Attempt 2: 30s backoff → dependency crashed and is being restarted
- Attempt 3: 90s backoff → dependency hit its own 3-attempt limit and was abandoned
- Game over. Entire stack is dead.

### Fixes Required

1. **Dependency-aware restart ordering**: Before restarting a container, check if its dependencies are running. If not, restart dependencies first.
2. **Increase max restart attempts to 5-10** for containers with dependencies
3. **Reset attempt counters** when a dependency comes back online (the dependent container failed because of the dependency, not itself)
4. 
   **Add a "stack restart" concept**: When restarting any container in a multi-container stack (indeedhub, mempool, btcpay, immich, penpot), restart the entire stack in dependency order
5. **Handle "Created" state containers**: `archy-mempool-web` and `fedimint` are in "Created" state (never started). The health monitor should detect these and attempt to start them.

---

## Priority Order for Fixes

### P0: System is broken without these (reboot = broken system)

1. **Dependency-aware restarts** in health_monitor.rs: restart dependencies before dependents, reset attempt counters when deps recover
2. **Increase max restart attempts to 10** (currently 3): dependency chains need more time on boot
3. **Handle "Created" state**: containers stuck in Created are never started by health monitor
4. **Fix UI state labels**: "exited" code 0 should say "stopped", not "crashed". Add "recovering" state during boot window.
5. Fix Rust config to read secrets from `/var/lib/archipelago/secrets/` instead of hardcoded passwords
6. Fix port 7777 conflict (strfry vs indeedhub)
7. Add ZMQ config to Bitcoin for LND block notifications

### P1: Core functionality broken

8. Wire up manifest health checks (replace fake "running = healthy" with actual HTTP/exec probes)
9. Fix uninstall to clean up volumes, networks, and respect graceful shutdown timeouts
10. Return actual errors from install/uninstall instead of silent success on partial failure
11. Remove `|| true` from critical first-boot commands
12. Show sub-container health in UI (which dependency is actually broken)

### P2: Performance and CPU

13. Deduplicate `podman stats` calls (health monitor + metrics collector both call every 60s independently)
14. Increase health monitor interval to 120s
15. Add frontend health polling via WebSocket push (populate `health` field in data model)
16. Make fleet polling viewport-aware (don't poll when user isn't viewing)

### P3: Consistency and correctness

17. 
    Sync memory limits between first-boot and Rust config
18. Update marketplace version strings (LND shows 0.17.4, actual is 0.18.4; Mempool shows 2.5.0, actual is 3.0.0)
19. Unify container naming conventions between first-boot script and Rust config
20. Add disk-aware Bitcoin config (prune/txindex) to Rust installer
21. Distinguish "needs manual intervention" from "will recover soon" in UI

---

## Key Files to Modify

| File | What to fix |
|------|-------------|
| `core/archipelago/src/health_monitor.rs` | Dependency-aware restarts, increase MAX_RESTART_ATTEMPTS to 10, handle Created state, deduplicate with metrics collector |
| `core/container/src/podman_client.rs` | Add RestartPolicy to container creation spec, fix `.unwrap_or_default()` error swallowing, increase socket timeout to 15-30s |
| `core/archipelago/src/crash_recovery.rs` | Increase timeouts to 120s, add retry with backoff, fix tier ordering catch-all |
| `core/archipelago/src/api/rpc/package/install.rs` | Return failure on timeout (not silent success), await post-install hooks |
| `core/archipelago/src/api/rpc/package/runtime.rs` | Add volume/network cleanup on uninstall, use `podman stop -t` then `podman rm` (not `-f`), return errors on partial failure |
| `core/archipelago/src/api/rpc/package/config.rs` | Read secrets from disk, fix port 7777, add ZMQ config, sync memory limits |
| `core/archipelago/src/container/dev_orchestrator.rs` | Wire up manifest-defined health checks instead of just checking podman state |
| `core/archipelago/src/container/docker_packages.rs` | Stop filtering sub-containers from state, or expose their health as part of parent app status |
| `core/archipelago/src/data_model.rs` | Populate `health` field for WebSocket push, add exit code to state |
| `core/archipelago/src/monitoring/mod.rs` | Share podman stats data with health monitor instead of duplicate subprocess calls |
| `neode-ui/src/views/apps/appsConfig.ts` | Fix state labels: exit 0 = "stopped", exit non-zero = "crashed", add "recovering" during boot window |
| `neode-ui/src/stores/container.ts` | Add periodic health polling (30s) |
| `neode-ui/src/views/apps/useAppsActions.ts` | Check for "partial" uninstall status, show errors to user |
| `neode-ui/src/views/marketplace/marketplaceData.ts` | Fix version strings to match image-versions.sh |
| `scripts/first-boot-containers.sh` | Remove `\|\| true` from critical commands, fix port 7777 conflict, add proper error reporting |