# Archipelago: Production Excellence Plan

**Duration**: 12 months (48 weeks)
**Goal**: Code so good no developer could question any decision. Apple-level reliability. Every failure visible and recoverable. Every operation bounded. Every line justified.
**Audited**: 2026-03-20 — 122 Rust files, 38 Vue views, 180+ frontend files, 80+ shell scripts

## CONSTRAINTS

- **DEPLOY ONLY TO .198** — Never .228. All verification on .198.
- **BETA FREEZE** — Behavior-preserving only. No new features/UI/endpoints.
- **Tests before every refactor** — Capture current behavior first. Tests must pass unchanged after.
- **Atomic commits** — One logical change per commit. Every step compiles + passes tests.

```bash
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198
```

---

## COMPLETE ISSUE REGISTRY

### Backend Rust — 122 files audited

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| R1 | Health RPC endpoint has no handler — returns "Unknown method" | `api/rpc/mod.rs` | P0 |
| R2 | Nostr client.connect() hangs indefinitely (4 calls, no timeout) | `nostr_handshake.rs:124,161,262,282` | P0 |
| R3 | Backup restore extracts directly to live dir — no atomic rollback | `backup/full.rs:122-149` | P0 |
| R4 | Rate limiter cleanup() never spawned — HashMap grows forever | `session.rs:566-579` | P1 |
| R5 | Login rate limiter same issue — entries never evicted | `session.rs:452-472` | P1 |
| R6 | Blocking std::fs in async — session.rs (6 calls) | `session.rs:77,128,370,413,423,425` | P1 |
| R7 | Blocking std::fs in async — docker_packages.rs | `docker_packages.rs:561,573` | P1 |
| R8 | Blocking std::fs in async — port_allocator.rs | `port_allocator.rs:59,73,77` | P1 |
| R9 | Blocking std::fs in async — peers.rs, node_message.rs | `peers.rs:30`, `node_message.rs:65` | P1 |
| R10 | Blocking std::fs in async — identity.rs, identity_manager.rs | `identity.rs:50`, `identity_manager.rs:164` | P1 |
| R11 | Blocking std::fs in async — nostr_discovery.rs | `nostr_discovery.rs:55` | P1 |
| R12 | Sync TCP I/O in async context — electrs_status.rs | `electrs_status.rs:5,40,78,81` | P1 |
| R13 | .expect() in main.rs startup | `main.rs:124,159` | P2 |
| R14 | .parse().unwrap() in session.rs rate limiting | `session.rs:665,676,688` | P1 |
| R15 | 7 .unwrap()/.expect() in mesh/protocol.rs | `protocol.rs:582,592,614,649,679,713,728` | P1 |
| R16 | .expect() in identity.rs crypto | `identity.rs:114,119` | P2 |
| R17 | .unwrap() in helpers/lib.rs (5 calls) | `helpers/lib.rs:167,172,180,233,253` | P2 |
| R18 | .unwrap() in helpers/rsync.rs (5 calls) | `rsync.rs:196,199,202,210,220` | P2 |
| R19 | .unwrap() in js-engine/lib.rs | `js-engine/lib.rs:130,249` | P2 |
| R20 | 14 #[allow(dead_code)] suppressions in mesh/mod.rs | `mesh/mod.rs:7-25` | P2 |
| R21 | Dead code in lnd.rs, data_manager.rs, dev_orchestrator.rs | Multiple | P2 |
| R22 | Bitcoin RPC URL hardcoded in 4+ files | `bitcoin.rs:89`, `mesh/mod.rs:624,649,663`, `listener.rs:1509+` | P2 |
| R23 | DWN health URL hardcoded | `dwn_sync.rs:76` | P2 |
| R24 | Update manifest URL hardcoded | `update.rs:11` | P3 |
| R25 | DNS-over-HTTPS URLs hardcoded (4 providers) | `network/dns.rs:98,102,106,110` | P3 |
| R26 | DWN protocol URIs hardcoded in server.rs | `server.rs:453-456` | P3 |
| R27 | Missing timeouts on mesh Bitcoin RPC calls | `mesh/mod.rs:624,649,663` | P1 |
| R28 | Missing timeouts on LND proxy calls (68 .send() calls) | `api/rpc/lnd.rs` | P2 |
| R29 | Missing timeout on DWN health check | `dwn_sync.rs:76` | P2 |
| R30 | TODO: track last-seen timestamp | `handshake.rs:77` | P3 |
| R31 | TODO: lnd.lookupinvoice RPC endpoint | `marketplace.rs:183` | P3 |
| R32 | TODO: trigger auto-restart or alert | `container/health_monitor.rs:140` | P3 |
| R33 | TODO: configure Podman to use AppArmor profile | `security/container_policies.rs:68` | P3 |
| R34 | Tor rotation deletes old .onion immediately — no transition | `api/rpc/tor.rs:184-240` | P1 |
| R35 | package.rs god file — 1,795 lines | `api/rpc/package.rs` | P2 |
| R36 | mesh/listener.rs god file — 1,799 lines | `mesh/listener.rs` | P2 |
| R37 | rpc/mod.rs god file — 1,092 lines | `api/rpc/mod.rs` | P2 |
| R38 | lnd.rs god file — 1,068 lines | `api/rpc/lnd.rs` | P2 |
| R39 | monitoring/mod.rs — 993 lines | `monitoring/mod.rs` | P3 |
| R40 | api/handler.rs — 911 lines | `api/handler.rs` | P3 |
| R41 | 30+ functions exceed 50 lines across codebase | Multiple | P3 |

### Frontend — 180+ files audited

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| F1 | WebSocket subscription registered multiple times — race condition | `stores/app.ts:88-134` | P0 |
| F2 | Unprotected concurrent mesh state mutations | `stores/mesh.ts:249-268,294-324` | P0 |
| F3 | No global Vue error handler — white screen on error | `main.ts` | P0 |
| F4 | Stale data after WebSocket reconnect — no full refresh | `stores/app.ts:88-163` | P1 |
| F5 | Message polling timer never stopped after logout | `composables/useMessageToast.ts:60` | P1 |
| F6 | AppLauncher NIP-07 message listener leak on close | `stores/appLauncher.ts:295-301` | P1 |
| F7 | Audio player listeners stack — never cleaned up | `composables/useAudioPlayer.ts:1-91` | P1 |
| F8 | WebSocket reconnection race — parallel connect() attempts | `api/websocket.ts:212-238` | P2 |
| F9 | WebSocket parse error silently caught — stale UI forever | `api/websocket.ts:164-172` | P2 |
| F10 | WebSocket stale connection detection too aggressive (5min) | `api/websocket.ts:284-299` | P2 |
| F11 | RPC client backoff + timeout = 40s max wait | `api/rpc-client.ts:31-117` | P2 |
| F12 | No code splitting — monolithic bundle | `vite.config.ts` | P2 |
| F13 | v-html on QR code without DOMPurify | `views/Settings.vue:441` | P2 |
| F14 | Goals store O(n) alias lookup on every computed | `stores/goals.ts:16-20,38-89` | P2 |
| F15 | localStorage save without try/catch (5+ instances) | `stores/goals.ts:34-36` + others | P2 |
| F16 | FileBrowser auth token duality — memory + cookie | `api/filebrowser-client.ts:39,50-68` | P2 |
| F17 | CSRF token cookie parsing brittle — regex only | `api/rpc-client.ts:18-21` | P2 |
| F18 | aiPermissions.ts Set uses unsafe type assertion | `stores/aiPermissions.ts:91-103` | P3 |
| F19 | Untracked setTimeout in AppSession — fires after unmount | `views/AppSession.vue:507` | P3 |
| F20 | Dashboard navigation missing aria-current="page" | `views/Dashboard.vue` | P3 |
| F21 | Search performance — string re-lowercasing every keystroke | `views/Apps.vue:510-537` | P3 |
| F22 | 30+ backdrop-filter blur elements — GPU overload on mobile | `style.css` | P3 |
| F23 | Record on sensitive DID operations | `types/api.ts` + `rpc-client.ts` | P3 |
| F24 | checkInterval timer leak on connect race | `api/websocket.ts:82-96` | P3 |
| F25 | Web5.vue god component — 3,940 lines | `views/Web5.vue` | P2 |
| F26 | Mesh.vue — 2,106 lines | `views/Mesh.vue` | P2 |
| F27 | Dashboard.vue — 1,819 lines | `views/Dashboard.vue` | P2 |
| F28 | Settings.vue — 1,792 lines | `views/Settings.vue` | P2 |
| F29 | Marketplace.vue — 1,293 lines | `views/Marketplace.vue` | P3 |
| F30 | Server.vue — 1,132 lines | `views/Server.vue` | P3 |
| F31 | Home.vue — 1,059 lines | `views/Home.vue` | P3 |
| F32 | AppDetails.vue — 1,036 lines | `views/AppDetails.vue` | P3 |
| F33 | useAppStore god store — 324 lines, 16 methods, 8+ responsibilities | `stores/app.ts` | P2 |

### Shell Scripts — 80+ files audited

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| S1 | 60+ instances of `sudo podman` — should be rootless | `fix-indeedhub(28)`, `deploy-bitcoin(11)`, `deploy-tailscale(2+)` | P0 |
| S2 | Zero container health checks in first-boot (30 containers) | `first-boot-containers.sh` | P0 |
| S3 | 50+ `:latest` image tags across all scripts | `first-boot(15)`, `deploy(11)`, `tailscale(18)`, `iso(7)` | P1 |
| S4 | No `set -e` in first-boot — silent container failures | `first-boot-containers.sh:1-9` | P1 |
| S5 | `eval "$DB_PASSWORDS"` — code injection risk | `deploy-to-target.sh:940` | P1 |
| S6 | No deploy locking — concurrent deploys corrupt state | `deploy-to-target.sh` | P1 |
| S7 | No deploy rollback — failed deploy leaves broken system | `deploy-to-target.sh` | P1 |
| S8 | sshpass usage in trust-archipelago-cert.sh | `trust-archipelago-cert.sh:23-26` | P1 |
| S9 | MariaDB password in command line — visible in ps | `first-boot-containers.sh:285` | P1 |
| S10 | 80+ instances of `2>/dev/null \|\| true` masking errors | `deploy-to-target.sh` | P2 |
| S11 | No trap cleanup for temp files | Multiple scripts | P2 |
| S12 | Unquoted variables (word splitting risk) | Multiple scripts | P2 |
| S13 | Hardcoded IPs in 6+ scripts | `deploy-to-target.sh:26`, `deploy-tailscale.sh:26`, etc. | P2 |
| S14 | No input validation on deploy targets | `deploy-tailscale.sh` | P2 |
| S15 | Missing memory limits on some containers in deploy | `deploy-to-target.sh:842-880` | P2 |
| S16 | ISO build not reproducible — dynamic image capture + :latest | `build-auto-installer-iso.sh:500-594` | P2 |
| S17 | No disk space pre-flight in deploy | `deploy-to-target.sh` | P2 |
| S18 | deploy-to-target.sh — 1,728 lines monolith | `deploy-to-target.sh` | P3 |
| S19 | build-auto-installer-iso.sh — 1,850 lines monolith | `build-auto-installer-iso.sh` | P3 |
| S20 | first-boot-containers.sh — 855 lines monolith | `first-boot-containers.sh` | P3 |
| S21 | No shared script library — duplicated functions | `scripts/` | P3 |

### Infrastructure

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| I1 | Nginx: /archipelago/, /content, /dwn missing timeout+rate-limit+body-size | `nginx-archipelago.conf:116-180` | P0 |
| I2 | Systemd: no MemoryMax, LimitNOFILE, TasksMax | `archipelago.service` | P1 |
| I3 | Tor rotation kills old address immediately — federation downtime | `api/rpc/tor.rs:184-240` | P1 |

---

## MONTH 1: CRASH PREVENTION (Weeks 1–4)

> Fix every issue that can crash the system, hang
indefinitely, or lose data.

### Week 1: P0 Backend — Things That Hang or Lose Data

**R1 — Health endpoint handler**
- File: `core/archipelago/src/api/rpc/mod.rs`
- Add handler for `"health"` method that checks: crash recovery complete, Podman socket responsive, session store loaded
- Tests: health returns JSON status, degraded when Podman unreachable, degraded during recovery
- Verify: `curl http://192.168.1.198/rpc/v1 -d '{"method":"health"}'` returns real status

**R2 — Nostr connect timeout**
- File: `core/archipelago/src/nostr_handshake.rs` lines 124, 161, 262, 282
- Wrap all 4 `client.connect().await` in `tokio::time::timeout(Duration::from_secs(10), ...)`
- Tests: connect timeout returns Err after 10s, successful connect within timeout works

**R3 — Backup restore atomic rollback**
- File: `core/archipelago/src/backup/full.rs` lines 122-149
- Rewrite: decrypt → extract to staging dir → validate required files → atomic rename → rollback on failure
- Tests: valid backup restores, corrupt backup fails without touching live data, partial extraction rolls back, disk space check fails early

**I1 — Nginx unauthenticated endpoint protection**
- File: `image-recipe/configs/nginx-archipelago.conf` lines 116-180
- Add to `/archipelago/`, `/content`, `/dwn`:
  - `limit_req zone=peer burst=20 nodelay;`
  - `client_max_body_size 10m;`
  - `proxy_connect_timeout 30s; proxy_read_timeout 60s; proxy_send_timeout 30s;`
- Tests: >10MB payload → 413, slow client → timeout, burst 30 → 429 after 20 (needs `limit_req_status 429;`, since nginx rejects rate-limited requests with 503 by default)

### Week 2: P0 Frontend + Scripts — Things That Break UI or Containers

**F1 — WebSocket subscription race condition**
- File: `neode-ui/src/stores/app.ts` lines 88-134
- Fix: Return unsubscribe function from `wsClient.subscribe()`, call it before re-subscribing. Use a subscription ID to prevent duplicates.
- Tests: rapid connectWebSocket() calls produce only one active subscription

**F2 — Mesh concurrent state mutations**
- File: `neode-ui/src/stores/mesh.ts` lines 249-324
- Fix: Add `isSending` ref as mutex. Queue concurrent sends. `fetchMessages()` called once after all sends complete.
- Tests: 3 concurrent sendMessage() calls → all succeed, messages list consistent

**F3 — Global error handler**
- File: `neode-ui/src/main.ts`
- Add `app.config.errorHandler` that shows toast + logs structured error
- Tests: thrown error in component shows toast, nested errors don't crash handler

**S1 — Eliminate all `sudo podman`**
- Files: `fix-indeedhub-containers.sh` (28), `deploy-bitcoin-knots.sh` (11), `deploy-tailscale.sh` (2+), `uptime-monitor.sh` (1), `setup-aiui-server.sh`
- Replace every `sudo podman` with `podman` (runs as archipelago user)
- Tests: grep for `sudo podman` across all scripts returns zero matches

**S2 — Container health checks for all 30 containers**
- File: `scripts/first-boot-containers.sh`
- Add `--health-cmd`, `--health-interval=30s`, `--health-timeout=5s`, `--health-retries=3` to every `$DOCKER run`
- Health commands per type:
  - Bitcoin: `bitcoin-cli -rpcuser=... getblockchaininfo || exit 1`
  - HTTP apps: `curl -sf http://localhost:{port}/ || exit 1`
  - LND: `curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1`
  - Databases: `mariadb -u root -p...
-e "SELECT 1" || exit 1`
- Tests: script grep confirms every `$DOCKER run` has `--health-cmd`

### Week 3: P1 Backend — Blocking I/O and Memory Leaks

**R4+R5 — Rate limiter cleanup**
- File: `core/archipelago/src/session.rs`
- Spawn background tasks for both `EndpointRateLimiter::cleanup()` and `LoginRateLimiter` cleanup, every 5 min
- Tests: after cleanup, stale entries removed; active entries preserved

**R6 — session.rs blocking I/O (6 calls)**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 77, 370, 413
- Replace `std::fs::write` → `tokio::fs::write` at lines 128, 425
- Replace `std::fs::create_dir_all` → `tokio::fs::create_dir_all` at line 423
- Tests: session load/save/persist still works correctly

**R7 — docker_packages.rs blocking I/O**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 561, 573
- Tests: app metadata loading works

**R8 — port_allocator.rs blocking I/O**
- Replace all 3 std::fs calls → tokio::fs at lines 59, 73, 77
- Tests: port allocation/persistence works

**R9+R10+R11 — Remaining blocking I/O**
- `peers.rs:30`, `node_message.rs:65`, `identity.rs:50`, `identity_manager.rs:164`, `nostr_discovery.rs:55`
- Convert all to tokio::fs
- Tests: each module's file operations still work

**R12 — electrs_status.rs sync TCP I/O**
- Convert synchronous TCP client to async (tokio::net::TcpStream)
- Tests: ElectrumX status query works, timeout on connection failure

### Week 4: P1 Frontend — Memory Leaks and Stale State

**F4 — WebSocket reconnect full state refresh**
- File: `neode-ui/src/stores/app.ts`
- After reconnect, call `rpcClient.call({method: 'server.get-state'})` to get fresh state before accepting patches
- Tests: after simulated disconnect+reconnect, state matches server

**F5 — Message polling timer cleanup**
- File: `neode-ui/src/composables/useMessageToast.ts`
- Tie polling lifecycle to auth state: stop on logout, start on login. Export cleanup function.
- Tests: polling stops when auth false, restarts when auth true, no timer after unmount

**F6 — AppLauncher message listener leak**
- File: `neode-ui/src/stores/appLauncher.ts`
- Ensure listener is removed when app closes (even if not via close button — e.g., route navigation)
- Tests: navigate away from app → listener removed, new app opens clean

**F7 — Audio player listener stacking**
- File: `neode-ui/src/composables/useAudioPlayer.ts`
- Create Audio element once, register listeners once. Track initialization flag.
- Tests: calling play() 10 times → still only 6 listeners total (not 60)

**S3 — Pin all container images (remove :latest)**
- Files: `first-boot-containers.sh` (15), `deploy-to-target.sh` (11), `deploy-tailscale.sh` (18), `build-auto-installer-iso.sh` (7)
- Replace every `:latest` with specific version tag
- Create `image-versions.env` sourced by all scripts — single source of truth
- Tests: `grep -r ':latest' scripts/ image-recipe/` returns zero matches (excluding comments)

---

## MONTH 2: OPERATIONAL SAFETY (Weeks 5–8)

> Fix everything that makes deploys dangerous, scripts unreliable, or operations opaque.

### Week 5: Deploy Script Hardening

**S4 — first-boot error handling**
- Add per-section error checking: if Bitcoin fails, skip dependent containers (LND, Mempool, BTCPay)
- Add `wait_for_container` return value checking
- Tests: first-boot with broken Bitcoin image → Bitcoin deps skipped, independent apps still start

**S5 — Replace eval with safe construct**
- File: `deploy-to-target.sh:940`
- Replace `eval "$DB_PASSWORDS"` with explicit variable assignment from SSH output
- Tests: passwords parsed correctly without eval

**S6 — Deploy locking**
- File: `deploy-to-target.sh`
- Add remote `flock` on `/var/lock/archipelago-deploy.lock`. Second deploy fails immediately with message. Stale lock (>30 min) broken automatically.
- Tests: two parallel deploys → second fails, stale lock → broken and deploy proceeds

**S7 — Deploy rollback**
- File: `deploy-to-target.sh`
- Before overwriting binary: `cp archipelago archipelago.bak`
- Before overwriting frontend: `cp -r web-ui web-ui.bak`
- If health check fails post-restart: restore from .bak, restart again
- Tests: intentionally broken binary → deploy detects, rolls back, system healthy

**S8 — Eliminate sshpass**
- File: `trust-archipelago-cert.sh`
- Rewrite to use SSH key only: `ssh -i ~/.ssh/archipelago-deploy`
- Tests: script works with key auth, fails gracefully without key

### Week 6: Script Quality

**S9 — MariaDB password not on command line**
- File: `first-boot-containers.sh:285`
- Use `$DOCKER exec -i ... mariadb -uroot <<< "SET PASSWORD..."` so the statement arrives on stdin instead of in the argument list
- Tests: `ps aux` during execution doesn't show password

**S10 — Replace silent error masking**
- File: `deploy-to-target.sh` (80+ instances)
- Pattern: replace `2>/dev/null || echo ""` with `|| { log_warn "..."; echo ""; }`
- At minimum, log what failed before masking
- Tests: failed health check produces log entry

**S11 — Trap cleanup for temp files**
- All scripts that create /tmp files: add `trap "rm -rf /tmp/deploy-$$" EXIT` at start
- Files: deploy-to-target.sh, deploy-tailscale.sh, build-auto-installer-iso.sh
- Tests: script interrupted mid-execution → temp files cleaned up

**S12 — Quote all variables**
- Audit and fix unquoted `$VARIABLE` in command arguments across all scripts
- Tests: shellcheck passes on all modified scripts

**S13 — Extract hardcoded IPs to config**
- Create `scripts/deploy-config-defaults.sh` with all node IPs as named variables
- Source from all scripts instead of hardcoding
- Tests: changing IP in config → all scripts use new IP

### Week 7: Infrastructure Hardening

**I2 — Systemd resource limits**
- File: `image-recipe/configs/archipelago.service`
- Add: `MemoryMax=4G`, `LimitNOFILE=65535`, `TasksMax=2048`
- Tests: `systemctl show archipelago` confirms
limits applied, service starts normally

**I3 — Tor rotation transition period**
- File: `core/archipelago/src/api/rpc/tor.rs`
- Keep old hidden service running for 24h after rotation. Both addresses active. Notify peers of new address. Schedule old deletion.
- Tests: after rotation old address still resolves, peers receive notification, old removed after transition

**S14 — Input validation on deploy targets**
- Add regex validation for hostnames/IPs before SSH
- Tests: invalid hostname → clear error, valid hostname → proceeds

**S15 — Memory limits on all deploy containers**
- File: `deploy-to-target.sh` lines 842-880
- Add `--memory=$(mem_limit ...)` to all UI container builds
- Tests: every container in deploy has `--memory` flag

**S17 — Disk space pre-flight**
- File: `deploy-to-target.sh`
- Check target disk <85% before deploying. Abort with clear message if full.
- Tests: deploy to 90% full disk → aborted, deploy to 50% full → succeeds

### Week 8: Remaining P1 Backend

**R14 — Fix .parse().unwrap() in session rate limiting**
- File: `session.rs:665,676,688`
- Replace `.parse().unwrap()` with `.parse().context("...")?`
- Tests: invalid IP handling works gracefully

**R15 — Fix 7 unwrap/expect in mesh/protocol.rs**
- File: `mesh/protocol.rs:582,592,614,649,679,713,728`
- Replace all with `?` operator + proper error types
- Tests: protocol parsing with malformed data returns error, not panic

**R27 — Add timeouts to mesh Bitcoin RPC calls**
- File: `mesh/mod.rs:624,649,663`
- Add `tokio::time::timeout(Duration::from_secs(10), ...)` to all Bitcoin RPC calls
- Tests: RPC timeout returns error after 10s

**R34 — Tor rotation transition**
- (Covered by I3 above)

---

## MONTH 3: PRODUCTION POLISH (Weeks 9–12)

> Fix every remaining P2 issue — unwraps, hardcoded values, frontend quality, resilience.
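
The unwrap removals scheduled for this month, like the R14 and R15 fixes in Week 8, share one shape: a fallible call that used to panic now returns a `Result` the caller can act on. Below is a minimal sketch of that shape for the rate-limiter IP parsing case, using only the standard library. `parse_client_ip` and `RateLimitError` are hypothetical names for illustration, not identifiers from `session.rs`, and the real code may well use `.context("...")?` as the plan describes.

```rust
use std::fmt;
use std::net::IpAddr;

/// Hypothetical error type for illustration; not part of the audited code.
#[derive(Debug, PartialEq)]
pub enum RateLimitError {
    /// The raw client address could not be parsed as an IPv4 or IPv6 address.
    InvalidIp(String),
}

impl fmt::Display for RateLimitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RateLimitError::InvalidIp(raw) => write!(f, "invalid client IP: {raw:?}"),
        }
    }
}

impl std::error::Error for RateLimitError {}

/// Before: `raw.parse::<IpAddr>().unwrap()` panics on malformed input.
/// After: the parse failure becomes an error value the caller must handle.
pub fn parse_client_ip(raw: &str) -> Result<IpAddr, RateLimitError> {
    raw.trim()
        .parse::<IpAddr>()
        .map_err(|_| RateLimitError::InvalidIp(raw.to_string()))
}
```

A caller inside a function that itself returns `Result` can then write `let ip = parse_client_ip(header_value)?;`, so a malformed address turns into an error response rather than a crashed task.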
### Week 9: Remaining Backend Unwraps + Dead Code

**R13 — main.rs .expect() → .context()**
- Replace 2 `.expect()` calls with `.context("...")?` and proper startup error handling

**R16 — identity.rs .expect() → safe handling**
- Replace 2 `.expect()` in crypto operations with result propagation

**R17+R18 — helpers unwraps**
- Fix 10 `.unwrap()` calls in `helpers/lib.rs` and `helpers/rsync.rs`
- Replace with `?` operator or `.context()`

**R19 — js-engine unwraps**
- Fix 2 `.unwrap()` in `js-engine/lib.rs:130,249`

**R20+R21 — Dead code elimination**
- Remove all 14 `#[allow(dead_code)]` in `mesh/mod.rs`. Either use the fields or delete them.
- Same for `lnd.rs`, `data_manager.rs`, `dev_orchestrator.rs`
- Tests: `cargo clippy` zero warnings, `cargo test` passes

### Week 10: Hardcoded Values → Constants

**R22 — Bitcoin RPC URL constant**
- Create `const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/";` in a shared constants module
- Use across `bitcoin.rs`, `mesh/mod.rs`, `mesh/listener.rs`
- Tests: all Bitcoin RPC calls still work

**R23 — DWN health URL constant**
**R24 — Update manifest URL constant**
**R25 — DNS-over-HTTPS URLs → constants array**
**R26 — DWN protocol URIs → constants**
- Centralize all hardcoded URLs/URIs into `core/archipelago/src/constants.rs`
- Tests: all modules reference constants, no hardcoded strings remain

**R28 — LND proxy timeouts**
- Audit all 68 `.send()` calls in `api/rpc/lnd.rs`. Ensure each has explicit timeout.
- Tests: LND proxy call with unresponsive LND → timeout error, not hang

**R29 — DWN health check timeout**
- Add timeout to `dwn_sync.rs:76` health check

**R30-R33 — Resolve all TODOs**
- Either implement the TODO or remove the dead code path. Per project rules: no TODO/FIXME in commits.

### Week 11: Frontend P2 Fixes

**F8 — WebSocket reconnection race**
- Add `isReconnecting` flag. Skip if already reconnecting.
- Tests: rapid close events → only one reconnect attempt

**F9 — WebSocket parse error handling**
- Count consecutive parse errors. After 3, force reconnect.
- Tests: 3 malformed messages → reconnect triggered; single bad message → logged only

**F10 — Stale connection detection tuning**
- Require mutual pong response within 30s. Don't close valid connections that are simply quiet.
- Tests: quiet but healthy connection → stays open; no pong for 30s → reconnects

**F11 — RPC client backoff reduction**
- Reduce default timeout from 30s to 15s. Add jitter to backoff. Cap total retry time at 20s.
- Tests: server outage → user sees error within 20s, not 40s

**F12 — Code splitting**
- Lazy-load all routes: `() => import('./views/Web5.vue')`
- Add manual chunks in vite.config.ts for vendor/api
- Tests: build produces multiple chunks, initial bundle < 200KB gzipped

**F13 — DOMPurify on QR v-html**
- Add DOMPurify.sanitize() to QR SVG before v-html rendering
- Tests: XSS payload in QR content → sanitized

### Week 12: Frontend P2 Continued + Performance

**F14 — Goals computed memoization**
- Replace O(n) alias lookup with Map. Add deep equality check.
- Tests: goalStatuses computed runs in <1ms with 100 apps

**F15 — localStorage error handling**
- Wrap all localStorage.setItem in try/catch. Show toast on quota exceeded.
- Tests: full localStorage → toast shown, app continues

**F16 — FileBrowser auth consolidation**
- Use cookie-only auth. Remove in-memory token.
- Tests: login persists across page reload, logout clears cookie

**F17 — CSRF token parsing robustness**
- Add header fallback for CSRF token. Handle edge cases.
- Tests: missing cookie → falls back to header, both missing → error

**F22 — CSS backdrop-filter mobile performance**
- Add media query: reduce blur to 8px on mobile. Remove backdrop-filter from non-visible elements.
- Tests: mobile Lighthouse performance score > 80

---

## MONTH 4-5: BACKEND ARCHITECTURE (Weeks 13–20)

> Split every Rust god file.
Target: no file > 500 lines.

### Week 13–14: Split package.rs (1,795 lines)

```
api/rpc/package/
├── mod.rs        — Re-exports (~50 lines)
├── config.rs     — get_app_config(), get_app_capabilities(), needs_archy_net()
├── lifecycle.rs  — install, start, stop, restart, uninstall
├── validation.rs — Input validation, dependency checking, image validation
└── progress.rs   — Progress streaming, install status tracking
```

Pre-split tests: test every `get_app_config()` variant, validation path, lifecycle transition
Post-split: all RPC calls return identical responses, `cargo test` passes

### Week 15–16: Split mesh/listener.rs (1,799 lines)

```
mesh/listener/
├── mod.rs      — Re-exports + spawn_mesh_listener()
├── session.rs  — run_mesh_session() loop
├── frames.rs   — handle_frame() dispatcher
├── identity.rs — handle_identity_received(), handle_typed_message()
├── sync.rs     — sync_queued_messages(), store_typed_message()
└── bitcoin.rs  — Bitcoin relay operations, RPC calls
```

### Week 17–18: Split rpc/mod.rs (1,092 lines) + lnd.rs (1,068 lines)

**rpc/mod.rs** → `dispatcher.rs` (method routing), `middleware.rs` (CSRF/session/rate-limit), `response.rs` (response building)

**lnd.rs** → `lnd/wallet.rs`, `lnd/channels.rs`, `lnd/info.rs`, `lnd/payments.rs`

### Week 19–20: Split monitoring (993), handler (911), mesh (865)

Split each into sub-modules. Target: no file > 500 lines. All pre-split tests, all post-split verification.

---

## MONTH 6-8: FRONTEND ARCHITECTURE (Weeks 21–32)

> Split every Vue god component. Target: no component > 500 lines.
### Week 21–22: Split Web5.vue (3,940 lines → 8 sub-views)

```
views/web5/
├── Web5.vue            — Router shell (~150 lines)
├── Web5Identity.vue    — DID management
├── Web5Wallet.vue      — Wallet operations
├── Web5Nostr.vue       — Nostr relays/profiles
├── Web5Credentials.vue — Verifiable Credentials
├── Web5Peers.vue       — P2P federation nodes
├── Web5Storage.vue     — DWN storage/explorer
├── Web5Goals.vue       — Goals/voting
└── Web5Marketplace.vue — Decentralized marketplace
```

Add nested routes. Component tests for each section. All sections render identically.

### Week 23–24: Split Mesh.vue (2,106) + Dashboard.vue (1,819)

**Mesh.vue** → `MeshRadio.vue`, `MeshChat.vue`, `MeshNetwork.vue`, `MeshFederation.vue`

**Dashboard.vue** → `DashboardHome.vue`, `DashboardApps.vue`, `DashboardSystem.vue`

### Week 25–26: Split Settings.vue (1,792) + Server.vue (1,132)

**Settings.vue** → `SettingsAccount.vue`, `SettingsSystem.vue`, `SettingsNetwork.vue`, `SettingsAppearance.vue`

**Server.vue** → `ServerOverview.vue`, `ServerContainers.vue`, `ServerLogs.vue`

### Week 27–28: Split Marketplace.vue (1,293) + AppDetails.vue (1,036) + Home.vue (1,059)

Each into 3-4 focused sub-components.

### Week 29–30: Decompose useAppStore (324 lines, 16 methods)

```
stores/
├── app.ts      — Thin re-export for backward compat (~50 lines)
├── auth.ts     — Login, logout, session, password, TOTP
├── server.ts   — Server info, system stats, reboot/shutdown
├── realtime.ts — WebSocket connection, subscriptions, heartbeat
└── packages.ts — Package install/uninstall, marketplace data
```

Tests: every existing import of `useAppStore` still works. State transitions identical.

### Week 31–32: Remaining frontend P3 issues

**F18** — aiPermissions runtime validation
**F19** — Track AppSession timeout
**F20** — Dashboard aria-current
**F21** — Debounce search + memoize
**F23** — Branded types for DID operations
**F24** — Fix checkInterval leak

---

## MONTH 9-10: SCRIPT ARCHITECTURE + ISO (Weeks 33–40)

> Split every monolithic script.
Target: no script > 400 lines.

### Week 33–34: Create shared script library

```
scripts/lib/
├── common.sh       — Colors, logging, error handling, SSH helpers
├── health.sh       — Health check polling, container status
├── deploy-utils.sh — Rsync, file sync, backup/restore
├── container.sh    — Podman helpers, image management, mem_limit()
└── network.sh      — IP validation, port checking
```

Tests: each library function tested in `scripts/tests/`

### Week 35–36: Split deploy-to-target.sh (1,728 lines)

```
scripts/
├── deploy-to-target.sh — Orchestrator + arg parsing (~300 lines)
├── deploy/
│   ├── frontend.sh   — Build + sync frontend
│   ├── backend.sh    — Build + sync binary
│   ├── configs.sh    — Sync nginx, systemd, scripts
│   ├── containers.sh — Container creation/update
│   ├── verify.sh     — Post-deploy health checks
│   └── rollback.sh   — Rollback on failure
```

### Week 37–38: Split ISO build (1,850 lines) + first-boot (855 lines)

**build-auto-installer-iso.sh** → `build/capture-images.sh`, `build/create-rootfs.sh`, `build/install-packages.sh`, `build/bundle-configs.sh`, `build/package-iso.sh`

**first-boot-containers.sh** → `first-boot/databases.sh`, `first-boot/bitcoin.sh`, `first-boot/lightning.sh`, `first-boot/apps.sh`, `first-boot/networking.sh`

### Week 39–40: ISO Reproducibility + Integration Tests

**S16 — Make ISO builds reproducible**
- Create `image-versions.env` with pinned digests for every container image
- ISO build sources this file, never pulls `:latest`
- Build manifest records exactly what shipped
- Tests: two consecutive ISO builds produce identical image sets

**E2E smoke test script**

```bash
# scripts/smoke-test.sh — Run against .198
# 1. curl /health → OK
# 2. Login → get session
# 3. Get server info → valid JSON
# 4. List containers → all healthy
# 5. Check every /app/* proxy → responds
# 6. Check Tor hidden service → resolves
# 7. Check WebSocket upgrade → 101
# Exit 0 only if all pass
```

---

## MONTH 11: INTEGRATION TESTS (Weeks 41–44)

> Comprehensive test suites that prove everything works.

### Week 41–42: Backend Integration Tests

```
core/archipelago/tests/
├── test_auth_flow.rs           — Login → session → CSRF → auth request → logout
├── test_container_lifecycle.rs — Install → start → health → stop → uninstall
├── test_federation.rs          — Generate invite → join → sync → verify
├── test_rpc_validation.rs      — Every endpoint with invalid input → proper error
├── test_session_persist.rs     — Create session → restart → session survives
├── test_rate_limiting.rs       — Flood → 429 → wait → allowed
├── test_backup_restore.rs      — Create → verify → restore → validate
└── test_health_endpoint.rs     — Healthy → degraded → recovery
```

Target: 25+ backend integration tests passing

### Week 43–44: Frontend Integration Tests

```
neode-ui/src/__tests__/integration/
├── auth-flow.spec.ts      — Login → dashboard → timeout → redirect
├── app-lifecycle.spec.ts  — Marketplace → install → progress → launch → uninstall
├── websocket.spec.ts      — Connect → update → disconnect → reconnect → state consistent
├── settings-flow.spec.ts  — Change password → re-login → 2FA setup → verify
├── spotlight.spec.ts      — Open → search → navigate → close
├── mesh-chat.spec.ts      — Connect → send → receive → disconnect
├── error-handling.spec.ts — Network error → toast → retry → success
└── code-splitting.spec.ts — Route navigation → chunks loaded lazily
```

Target: 20+ frontend integration tests passing

---

## MONTH 12: TYPE SYNC + CI/CD PLAN (Weeks 45–48)

### Week 45–46: Rust↔TypeScript Type Sync

**Approach**: `ts-rs` crate to auto-generate TypeScript types from Rust structs

1. Add `ts-rs` to `core/models/Cargo.toml`
2. Add `#[derive(TS)]` to all API request/response types
3. Build script generates `neode-ui/src/types/generated.ts`
4. Replace manual types in `types/api.ts` with imports from generated file
5. Verification: regenerate → diff → must be zero (types committed)

Tests: frontend type-check passes with generated types, manual api.ts reduced to non-API types

### Week 47–48: CI/CD Planning (Document Only — Execute Later)

> This section is the PLAN for CI/CD. Do not execute during this phase. Document everything needed so it can be implemented in a future sprint.

**CI Pipeline Design** (`.github/workflows/ci.yml`):

```yaml
# Triggers: push to main, all PRs
# Jobs:
#   rust-checks (Linux runner):
#     - cargo clippy --all-targets --all-features (zero warnings gate)
#     - cargo fmt --all -- --check (formatting gate)
#     - cargo test --all-features (all tests gate)
#
#   frontend-checks (Node 20):
#     - npm run type-check (TypeScript strictness gate)
#     - npm run lint (ESLint gate)
#     - npm test (Vitest suite gate)
#
#   integration (Linux runner, optional):
#     - scripts/smoke-test.sh against staging
#
# Merge policy: all checks must pass before merge
# Branch protection: require PR, require checks, no force push to main
```

**Release Pipeline Design** (`.github/workflows/release.yml`):

```yaml
# Triggers: tag push (v*)
# Jobs:
#   build-linux-binary:
#     - Cross-compile Rust for x86_64 + ARM64
#   build-frontend:
#     - npm run build
#   build-iso:
#     - SSH to build server, run ISO build
#     - Upload ISO as release asset
#   smoke-test:
#     - Boot ISO in QEMU
#     - Run smoke-test.sh
#     - Gate release on pass
```

**Pre-requisites to implement**:
- [ ] GitHub Actions runner with Rust toolchain + cross-compilation
- [ ] Node.js 20 runner for frontend
- [ ] SSH key for build server accessible from CI
- [ ] Branch protection rules configured
- [ ] Image digest manifest for reproducible ISO builds
- [ ] QEMU-based ISO verification script

**Estimated implementation time**: 2 weeks when ready to execute

---

## VERIFICATION PROTOCOL (Every Week)

1. `cargo clippy --all-targets --all-features` — zero warnings
2. `cargo fmt --all`
3. `cargo test --all-features` — all pass
4. `cd neode-ui && npm run type-check` — zero errors
5. `cd neode-ui && npm test` — all pass
6. `./scripts/deploy-to-target.sh --target 192.168.1.198` — **ONLY .198**
7. `curl http://192.168.1.198/health` — returns OK with service status
8. Navigate all affected views in browser — identical behavior
9. Atomic commit: `refactor: ` or `fix: `

---

## EXIT CRITERIA (Month 12 Complete)

### Reliability (Zero Tolerance)

- [ ] Health endpoint returns real service status
- [ ] All async operations have bounded timeouts
- [ ] Zero blocking I/O in async context (no std::fs in async functions)
- [ ] Zero .unwrap()/.expect() in production code
- [ ] All rate limiters have cleanup tasks
- [ ] Backup restore uses staging + atomic swap + rollback
- [ ] All 30 containers have health checks + memory limits
- [ ] All container images pinned to specific versions
- [ ] Nginx unauthenticated endpoints protected (timeout + rate limit + body size)
- [ ] Systemd service has resource limits
- [ ] Tor rotation preserves old address during transition
- [ ] Deploy has locking + disk check + rollback
- [ ] Zero `sudo podman` in any script
- [ ] Zero `:latest` image tags anywhere
- [ ] Zero silent error masking without logging

### Frontend (Zero Tolerance)

- [ ] Global error handler catches and displays all errors
- [ ] WebSocket: single subscription, reconnect refreshes state, bounded retries
- [ ] All timers/listeners cleaned up on unmount
- [ ] Code splitting: initial bundle < 200KB gzipped
- [ ] v-html always uses DOMPurify
- [ ] All localStorage operations wrapped in try/catch

### Architecture (Target: File Size Limits)

- [ ] No Rust file > 500 lines (excluding generated code)
- [ ] No Vue component > 500 lines
- [ ] No shell script > 400 lines
- [ ] No Pinia store has more than 1 responsibility
- [ ] All hardcoded URLs/ports extracted to constants
- [ ] Shared script library eliminates duplication
- [ ] TypeScript types auto-generated from Rust structs

### Testing

- [ ] 25+ backend
integration tests passing - [ ] 20+ frontend integration tests passing - [ ] E2E smoke test script passes on .198 - [ ] ISO builds are reproducible (pinned digests) ### CI/CD (Planned, Not Executed) - [ ] CI pipeline design documented - [ ] Release pipeline design documented - [ ] Pre-requisites list complete - [ ] Ready for 2-week implementation sprint ### Zero Behavior Changes Every feature works identically. Every existing test passes. Every user flow unchanged.
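
The file-size limits listed under Architecture in the exit criteria lend themselves to a mechanical check. The following is a minimal sketch, not part of the plan itself: `check_line_limit` is a hypothetical helper name, and the directories passed to it in the usage note are assumptions about the repository layout.

```bash
#!/usr/bin/env bash
# Sketch only: check_line_limit is a hypothetical helper, not an existing
# Archipelago script.
#
# Usage: check_line_limit <max_lines> <dir> <name-pattern>
# Prints "<line count> <path>" for every matching file longer than
# <max_lines> and returns 1 if any were found, 0 otherwise.
check_line_limit() {
    local max_lines="$1"
    local dir="$2"
    local pattern="$3"
    local offenders

    # find hands the matching files to a small inline script that counts
    # lines per file; this keeps paths containing spaces intact.
    offenders=$(
        find "$dir" -type f -name "$pattern" -exec sh -c '
            max="$1"
            shift
            for f in "$@"; do
                n=$(( $(wc -l < "$f") ))   # arithmetic strips wc padding
                if [ "$n" -gt "$max" ]; then
                    printf "%s %s\n" "$n" "$f"
                fi
            done
        ' sh "$max_lines" {} +
    )

    if [ -n "$offenders" ]; then
        printf '%s\n' "$offenders"
        return 1
    fi
    return 0
}
```

Under these assumptions, `check_line_limit 400 scripts '*.sh'` would list every shell script over the 400-line target, and the same helper could be pointed at Rust or Vue sources with a limit of 500. It does not know which files are generated, so the "excluding generated code" exception would still need to be handled separately.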