2026-03-22 03:30:21 +00:00
# Archipelago: Production Excellence Plan
2026-03-08 08:06:52 +00:00
**Duration**: 12 months (48 weeks)
**Goal**: Code so good no developer could question any decision. Apple-level reliability. Every failure visible and recoverable. Every operation bounded. Every line justified.
**Audited**: 2026-03-20 — 122 Rust files, 38 Vue views, 180+ frontend files, 80+ shell scripts
## CONSTRAINTS
- **DEPLOY ONLY TO .198** — Never .228. All verification on .198.
- **BETA FREEZE** — Behavior-preserving only. No new features/UI/endpoints.
- **Tests before every refactor** — Capture current behavior first. Tests must pass unchanged after.
- **Atomic commits** — One logical change per commit. Every step compiles + passes tests.
```bash
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198
```
---
## COMPLETE ISSUE REGISTRY
### Backend Rust — 122 files audited
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| R1 | Health RPC endpoint has no handler — returns "Unknown method" | `api/rpc/mod.rs` | P0 |
| R2 | Nostr client.connect() hangs indefinitely (4 calls, no timeout) | `nostr_handshake.rs:124,161,262,282` | P0 |
| R3 | Backup restore extracts directly to live dir — no atomic rollback | `backup/full.rs:122-149` | P0 |
| R4 | Rate limiter cleanup() never spawned — HashMap grows forever | `session.rs:566-579` | P1 |
| R5 | Login rate limiter same issue — entries never evicted | `session.rs:452-472` | P1 |
| R6 | Blocking std::fs in async — session.rs (6 calls) | `session.rs:77,128,370,413,423,425` | P1 |
| R7 | Blocking std::fs in async — docker_packages.rs | `docker_packages.rs:561,573` | P1 |
| R8 | Blocking std::fs in async — port_allocator.rs | `port_allocator.rs:59,73,77` | P1 |
| R9 | Blocking std::fs in async — peers.rs, node_message.rs | `peers.rs:30`, `node_message.rs:65` | P1 |
| R10 | Blocking std::fs in async — identity.rs, identity_manager.rs | `identity.rs:50`, `identity_manager.rs:164` | P1 |
| R11 | Blocking std::fs in async — nostr_discovery.rs | `nostr_discovery.rs:55` | P1 |
| R12 | Sync TCP I/O in async context — electrs_status.rs | `electrs_status.rs:5,40,78,81` | P1 |
| R13 | .expect() in main.rs startup | `main.rs:124,159` | P2 |
| R14 | .parse().unwrap() in session.rs rate limiting | `session.rs:665,676,688` | P1 |
| R15 | 7 .unwrap()/.expect() in mesh/protocol.rs | `protocol.rs:582,592,614,649,679,713,728` | P1 |
| R16 | .expect() in identity.rs crypto | `identity.rs:114,119` | P2 |
| R17 | .unwrap() in helpers/lib.rs (5 calls) | `helpers/lib.rs:167,172,180,233,253` | P2 |
| R18 | .unwrap() in helpers/rsync.rs (5 calls) | `rsync.rs:196,199,202,210,220` | P2 |
| R19 | .unwrap() in js-engine/lib.rs | `js-engine/lib.rs:130,249` | P2 |
| R20 | 14 #[allow(dead_code)] suppressions in mesh/mod.rs | `mesh/mod.rs:7-25` | P2 |
| R21 | Dead code in lnd.rs, data_manager.rs, dev_orchestrator.rs | Multiple | P2 |
| R22 | Bitcoin RPC URL hardcoded in 4+ files | `bitcoin.rs:89`, `mesh/mod.rs:624,649,663`, `listener.rs:1509+` | P2 |
| R23 | DWN health URL hardcoded | `dwn_sync.rs:76` | P2 |
| R24 | Update manifest URL hardcoded | `update.rs:11` | P3 |
| R25 | DNS-over-HTTPS URLs hardcoded (4 providers) | `network/dns.rs:98,102,106,110` | P3 |
| R26 | DWN protocol URIs hardcoded in server.rs | `server.rs:453-456` | P3 |
| R27 | Missing timeouts on mesh Bitcoin RPC calls | `mesh/mod.rs:624,649,663` | P1 |
| R28 | Missing timeouts on LND proxy calls (68 .send() calls) | `api/rpc/lnd.rs` | P2 |
| R29 | Missing timeout on DWN health check | `dwn_sync.rs:76` | P2 |
| R30 | TODO: track last-seen timestamp | `handshake.rs:77` | P3 |
| R31 | TODO: lnd.lookupinvoice RPC endpoint | `marketplace.rs:183` | P3 |
| R32 | TODO: trigger auto-restart or alert | `container/health_monitor.rs:140` | P3 |
| R33 | TODO: configure Podman to use AppArmor profile | `security/container_policies.rs:68` | P3 |
| R34 | Tor rotation deletes old .onion immediately — no transition | `api/rpc/tor.rs:184-240` | P1 |
| R35 | package.rs god file — 1,795 lines | `api/rpc/package.rs` | P2 |
| R36 | mesh/listener.rs god file — 1,799 lines | `mesh/listener.rs` | P2 |
| R37 | rpc/mod.rs god file — 1,092 lines | `api/rpc/mod.rs` | P2 |
| R38 | lnd.rs god file — 1,068 lines | `api/rpc/lnd.rs` | P2 |
| R39 | monitoring/mod.rs — 993 lines | `monitoring/mod.rs` | P3 |
| R40 | api/handler.rs — 911 lines | `api/handler.rs` | P3 |
| R41 | 30+ functions exceed 50 lines across codebase | Multiple | P3 |
### Frontend — 180+ files audited
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| F1 | WebSocket subscription registered multiple times — race condition | `stores/app.ts:88-134` | P0 |
| F2 | Unprotected concurrent mesh state mutations | `stores/mesh.ts:249-268,294-324` | P0 |
| F3 | No global Vue error handler — white screen on error | `main.ts` | P0 |
| F4 | Stale data after WebSocket reconnect — no full refresh | `stores/app.ts:88-163` | P1 |
| F5 | Message polling timer never stopped after logout | `composables/useMessageToast.ts:60` | P1 |
| F6 | AppLauncher NIP-07 message listener leak on close | `stores/appLauncher.ts:295-301` | P1 |
| F7 | Audio player listeners stack — never cleaned up | `composables/useAudioPlayer.ts:1-91` | P1 |
| F8 | WebSocket reconnection race — parallel connect() attempts | `api/websocket.ts:212-238` | P2 |
| F9 | WebSocket parse error silently caught — stale UI forever | `api/websocket.ts:164-172` | P2 |
| F10 | WebSocket stale connection detection too aggressive (5min) | `api/websocket.ts:284-299` | P2 |
| F11 | RPC client backoff + timeout = 40s max wait | `api/rpc-client.ts:31-117` | P2 |
| F12 | No code splitting — monolithic bundle | `vite.config.ts` | P2 |
| F13 | v-html on QR code without DOMPurify | `views/Settings.vue:441` | P2 |
| F14 | Goals store O(n) alias lookup on every computed | `stores/goals.ts:16-20,38-89` | P2 |
| F15 | localStorage save without try/catch (5+ instances) | `stores/goals.ts:34-36` + others | P2 |
| F16 | FileBrowser auth token duality — memory + cookie | `api/filebrowser-client.ts:39,50-68` | P2 |
| F17 | CSRF token cookie parsing brittle — regex only | `api/rpc-client.ts:18-21` | P2 |
| F18 | aiPermissions.ts Set uses unsafe type assertion | `stores/aiPermissions.ts:91-103` | P3 |
| F19 | Untracked setTimeout in AppSession — fires after unmount | `views/AppSession.vue:507` | P3 |
| F20 | Dashboard navigation missing aria-current="page" | `views/Dashboard.vue` | P3 |
| F21 | Search performance — string re-lowercasing every keystroke | `views/Apps.vue:510-537` | P3 |
| F22 | 30+ backdrop-filter blur elements — GPU overload on mobile | `style.css` | P3 |
| F23 | `Record<string, unknown>` on sensitive DID operations | `types/api.ts` + `rpc-client.ts` | P3 |
| F24 | checkInterval timer leak on connect race | `api/websocket.ts:82-96` | P3 |
| F25 | Web5.vue god component — 3,940 lines | `views/Web5.vue` | P2 |
| F26 | Mesh.vue — 2,106 lines | `views/Mesh.vue` | P2 |
| F27 | Dashboard.vue — 1,819 lines | `views/Dashboard.vue` | P2 |
| F28 | Settings.vue — 1,792 lines | `views/Settings.vue` | P2 |
| F29 | Marketplace.vue — 1,293 lines | `views/Marketplace.vue` | P3 |
| F30 | Server.vue — 1,132 lines | `views/Server.vue` | P3 |
| F31 | Home.vue — 1,059 lines | `views/Home.vue` | P3 |
| F32 | AppDetails.vue — 1,036 lines | `views/AppDetails.vue` | P3 |
| F33 | useAppStore god store — 324 lines, 16 methods, 8+ responsibilities | `stores/app.ts` | P2 |
### Shell Scripts — 80+ files audited
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| S1 | 60+ instances of `sudo podman` — should be rootless | `fix-indeedhub(28)`, `deploy-bitcoin(11)`, `deploy-tailscale(2+)` | P0 |
| S2 | Zero container health checks in first-boot (30 containers) | `first-boot-containers.sh` | P0 |
| S3 | 50+ `:latest` image tags across all scripts | `first-boot(15)`, `deploy(11)`, `tailscale(18)`, `iso(7)` | P1 |
| S4 | No `set -e` in first-boot — silent container failures | `first-boot-containers.sh:1-9` | P1 |
| S5 | `eval "$DB_PASSWORDS"` — code injection risk | `deploy-to-target.sh:940` | P1 |
| S6 | No deploy locking — concurrent deploys corrupt state | `deploy-to-target.sh` | P1 |
| S7 | No deploy rollback — failed deploy leaves broken system | `deploy-to-target.sh` | P1 |
| S8 | sshpass usage in trust-archipelago-cert.sh | `trust-archipelago-cert.sh:23-26` | P1 |
| S9 | MariaDB password in command line — visible in ps | `first-boot-containers.sh:285` | P1 |
| S10 | 80+ instances of `2>/dev/null \|\| true` masking errors | `deploy-to-target.sh` | P2 |
| S11 | No trap cleanup for temp files | Multiple scripts | P2 |
| S12 | Unquoted variables (word splitting risk) | Multiple scripts | P2 |
| S13 | Hardcoded IPs in 6+ scripts | `deploy-to-target.sh:26`, `deploy-tailscale.sh:26`, etc. | P2 |
| S14 | No input validation on deploy targets | `deploy-tailscale.sh` | P2 |
| S15 | Missing memory limits on some containers in deploy | `deploy-to-target.sh:842-880` | P2 |
| S16 | ISO build not reproducible — dynamic image capture + :latest | `build-auto-installer-iso.sh:500-594` | P2 |
| S17 | No disk space pre-flight in deploy | `deploy-to-target.sh` | P2 |
| S18 | deploy-to-target.sh — 1,728 lines monolith | `deploy-to-target.sh` | P3 |
| S19 | build-auto-installer-iso.sh — 1,850 lines monolith | `build-auto-installer-iso.sh` | P3 |
| S20 | first-boot-containers.sh — 855 lines monolith | `first-boot-containers.sh` | P3 |
| S21 | No shared script library — duplicated functions | `scripts/` | P3 |
### Infrastructure
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| I1 | Nginx: /archipelago/, /content, /dwn missing timeout+rate-limit+body-size | `nginx-archipelago.conf:116-180` | P0 |
| I2 | Systemd: no MemoryMax, LimitNOFILE, TasksMax | `archipelago.service` | P1 |
| I3 | Tor rotation kills old address immediately — federation downtime | `api/rpc/tor.rs:184-240` | P1 |
---
## MONTH 1: CRASH PREVENTION (Weeks 1–4)
> Fix every issue that can crash the system, hang indefinitely, or lose data.
### Week 1: P0 Backend — Things That Hang or Lose Data
**R1 — Health endpoint handler**
- File: `core/archipelago/src/api/rpc/mod.rs`
- Add handler for `"health"` method that checks: crash recovery complete, Podman socket responsive, session store loaded
- Tests: health returns JSON status, degraded when Podman unreachable, degraded during recovery
- Verify: `curl http://192.168.1.198/rpc/v1 -d '{"method":"health"}'` returns real status
**R2 — Nostr connect timeout**
- File: `core/archipelago/src/nostr_handshake.rs` lines 124, 161, 262, 282
- Wrap all 4 `client.connect().await` in `tokio::time::timeout(Duration::from_secs(10), ...)`
- Tests: connect timeout returns Err after 10s, successful connect within timeout works
**R3 — Backup restore atomic rollback**
- File: `core/archipelago/src/backup/full.rs` lines 122-149
- Rewrite: decrypt → extract to staging dir → validate required files → atomic rename → rollback on failure
- Tests: valid backup restores, corrupt backup fails without touching live data, partial extraction rolls back, disk space check fails early
**I1 — Nginx unauthenticated endpoint protection**
- File: `image-recipe/configs/nginx-archipelago.conf` lines 116-180
- Add to `/archipelago/`, `/content`, `/dwn`:
  - `limit_req zone=peer burst=20 nodelay;`
  - `client_max_body_size 10m;`
  - `proxy_connect_timeout 30s; proxy_read_timeout 60s; proxy_send_timeout 30s;`
- Tests: >10MB payload → 413, slow client → timeout, burst 30 → 429 after 20
### Week 2: P0 Frontend + Scripts — Things That Break UI or Containers
**F1 — WebSocket subscription race condition**
- File: `neode-ui/src/stores/app.ts` lines 88-134
- Fix: Return unsubscribe function from `wsClient.subscribe()`, call it before re-subscribing. Use a subscription ID to prevent duplicates.
- Tests: rapid connectWebSocket() calls produce only one active subscription
**F2 — Mesh concurrent state mutations**
- File: `neode-ui/src/stores/mesh.ts` lines 249-324
- Fix: Add `isSending` ref as mutex. Queue concurrent sends. `fetchMessages()` called once after all sends complete.
- Tests: 3 concurrent sendMessage() calls → all succeed, messages list consistent
**F3 — Global error handler**
- File: `neode-ui/src/main.ts`
- Add `app.config.errorHandler` that shows toast + logs structured error
- Tests: thrown error in component shows toast, nested errors don't crash handler
**S1 — Eliminate all `sudo podman`**
- Files: `fix-indeedhub-containers.sh` (28), `deploy-bitcoin-knots.sh` (11), `deploy-tailscale.sh` (2+), `uptime-monitor.sh` (1), `setup-aiui-server.sh`
- Replace every `sudo podman` with `podman` (runs as archipelago user)
- Tests: grep for `sudo podman` across all scripts returns zero matches
**S2 — Container health checks for all 30 containers**
- File: `scripts/first-boot-containers.sh`
- Add `--health-cmd`, `--health-interval=30s`, `--health-timeout=5s`, `--health-retries=3` to every `$DOCKER run`
- Health commands per type:
  - Bitcoin: `bitcoin-cli -rpcuser=... getblockchaininfo || exit 1`
  - HTTP apps: `curl -sf http://localhost:{port}/ || exit 1`
  - LND: `curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1`
  - Databases: `mariadb -u root -p... -e "SELECT 1" || exit 1`
- Tests: script grep confirms every `$DOCKER run` has `--health-cmd`
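The check named in the Tests bullet could be sketched as follows. It assumes containers are launched with a literal `$DOCKER run` and that multi-line commands use backslash continuations; the function name `missing_health_cmd` is hypothetical.

```bash
# Print every `$DOCKER run` command in a script that lacks --health-cmd.
# Backslash-continued lines are joined first so a multi-line command
# is checked as a single unit.
missing_health_cmd() {
    awk '
        { line = line $0 }
        /\\$/ { sub(/\\$/, " ", line); next }   # continuation: keep accumulating
        {
            if (line ~ /\$DOCKER run/ && line !~ /--health-cmd/) print line
            line = ""
        }
    ' "$1"
}
```

An empty result means every `$DOCKER run` in the file carries a health command. A comment line that mentions `$DOCKER run` would also be reported, so the sketch errs on the side of false positives.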
### Week 3: P1 Backend — Blocking I/O and Memory Leaks
**R4+R5 — Rate limiter cleanup**
- File: `core/archipelago/src/session.rs`
- Spawn background tasks for both `EndpointRateLimiter::cleanup()` and `LoginRateLimiter` cleanup, every 5 min
- Tests: after cleanup, stale entries removed; active entries preserved
**R6 — session.rs blocking I/O (6 calls)**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 77, 370, 413
- Replace `std::fs::write` → `tokio::fs::write` at lines 128, 425
- Replace `std::fs::create_dir_all` → `tokio::fs::create_dir_all` at line 423
- Tests: session load/save/persist still works correctly
**R7 — docker_packages.rs blocking I/O**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 561, 573
- Tests: app metadata loading works
**R8 — port_allocator.rs blocking I/O**
- Replace all 3 std::fs calls → tokio::fs at lines 59, 73, 77
- Tests: port allocation/persistence works
**R9+R10+R11 — Remaining blocking I/O**
- `peers.rs:30`, `node_message.rs:65`, `identity.rs:50`, `identity_manager.rs:164`, `nostr_discovery.rs:55`
- Convert all to tokio::fs
- Tests: each module's file operations still work
**R12 — electrs_status.rs sync TCP I/O**
- Convert synchronous TCP client to async (tokio::net::TcpStream)
- Tests: ElectrumX status query works, timeout on connection failure
### Week 4: P1 Frontend — Memory Leaks and Stale State
**F4 — WebSocket reconnect full state refresh**
- File: `neode-ui/src/stores/app.ts`
- After reconnect, call `rpcClient.call({method: 'server.get-state'})` to get fresh state before accepting patches
- Tests: after simulated disconnect+reconnect, state matches server
**F5 — Message polling timer cleanup**
- File: `neode-ui/src/composables/useMessageToast.ts`
- Tie polling lifecycle to auth state: stop on logout, start on login. Export cleanup function.
- Tests: polling stops when auth false, restarts when auth true, no timer after unmount
**F6 — AppLauncher message listener leak**
- File: `neode-ui/src/stores/appLauncher.ts`
- Ensure listener is removed when app closes (even if not via close button — e.g., route navigation)
- Tests: navigate away from app → listener removed, new app opens clean
**F7 — Audio player listener stacking**
- File: `neode-ui/src/composables/useAudioPlayer.ts`
- Create Audio element once, register listeners once. Track initialization flag.
- Tests: calling play() 10 times → still only 6 listeners total (not 60)
**S3 — Pin all container images (remove :latest)**
- Files: `first-boot-containers.sh` (15), `deploy-to-target.sh` (11), `deploy-tailscale.sh` (18), `build-auto-installer-iso.sh` (7)
- Replace every `:latest` with specific version tag
- Create `image-versions.env` sourced by all scripts — single source of truth
- Tests: `grep -r ':latest' scripts/ image-recipe/` returns zero matches (excluding comments)
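A minimal sketch of the "zero matches (excluding comments)" check, assuming comment lines start with `#` after optional whitespace. The function name `find_latest_tags` is hypothetical, and a `:latest` inside a trailing comment on a code line would still be reported.

```bash
# Report :latest references under the given paths, ignoring comment-only lines.
# Prints "file:line:content" per hit; exit status 0 means at least one hit remains.
find_latest_tags() {
    grep -rnH ':latest' "$@" | grep -Ev '^[^:]+:[0-9]+:[[:space:]]*#'
}
```

Usage would look like `if find_latest_tags scripts/ image-recipe/; then exit 1; fi`, so a CI step fails as soon as an unpinned tag reappears.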
---
## MONTH 2: OPERATIONAL SAFETY (Weeks 5–8)
> Fix everything that makes deploys dangerous, scripts unreliable, or operations opaque.
### Week 5: Deploy Script Hardening
**S4 — first-boot error handling**
- Add per-section error checking: if Bitcoin fails, skip dependent containers (LND, Mempool, BTCPay)
- Add `wait_for_container` return value checking
- Tests: first-boot with broken Bitcoin image → Bitcoin deps skipped, independent apps still start
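The dependency-skip behaviour might look like the sketch below. `start_container` is a hypothetical stand-in for the real `$DOCKER run` invocation, and `wait_for_container` is stubbed here only so the example runs on its own; the real script would use its existing helper.

```bash
# Hypothetical stand-ins so the sketch is self-contained.
start_container() { echo "starting $1"; }
wait_for_container() { return 0; }

# Start a parent container, then its dependents only if the parent came up.
start_with_dependents() {
    parent=$1; shift
    if start_container "$parent" && wait_for_container "$parent"; then
        for dep in "$@"; do
            start_container "$dep" || echo "WARN: $dep failed to start" >&2
        done
    else
        echo "WARN: $parent failed; skipping dependents: $*" >&2
        return 1
    fi
}
```

For example, `start_with_dependents bitcoin lnd mempool btcpay || true` would skip the three dependents when Bitcoin fails, while later, independent sections of the script still run.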
**S5 — Replace eval with safe construct**
- File: `deploy-to-target.sh:940`
- Replace `eval "$DB_PASSWORDS"` with explicit variable assignment from SSH output
- Tests: passwords parsed correctly without eval
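One way to parse the SSH output without `eval` is sketched below. It assumes the remote side emits plain `KEY=VALUE` lines; the variable names `MARIADB_ROOT_PASSWORD` and `POSTGRES_PASSWORD` are hypothetical placeholders for whatever `$DB_PASSWORDS` actually contains.

```bash
# Read KEY=VALUE lines from stdin and assign only allow-listed names.
# The input is treated as data: nothing in it is ever executed.
load_db_passwords() {
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$key" in
            MARIADB_ROOT_PASSWORD) MARIADB_ROOT_PASSWORD=$value ;;   # hypothetical name
            POSTGRES_PASSWORD)     POSTGRES_PASSWORD=$value ;;       # hypothetical name
            *) ;;                                                     # ignore unknown keys
        esac
    done
}
```

The function should read from a file or here-document (`load_db_passwords < "$tmpfile"`) rather than from a pipe, because a pipeline stage runs in a subshell and its assignments would be lost.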
**S6 — Deploy locking**
- File: `deploy-to-target.sh`
- Add remote `flock` on `/var/lock/archipelago-deploy.lock`. Second deploy fails immediately with message. Stale lock (>30 min) broken automatically.
- Tests: two parallel deploys → second fails, stale lock → broken and deploy proceeds
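A sketch of the non-blocking lock, assuming `flock(1)` from util-linux is available on the machine where it runs (the target, via SSH, in the real deploy). The wrapper name `with_deploy_lock` is hypothetical.

```bash
# Run a command while holding an exclusive, non-blocking lock on a lock file.
# A second caller fails immediately instead of waiting.
with_deploy_lock() {
    lockfile=$1; shift
    (
        flock -n 9 || { echo "another deploy holds $lockfile; aborting" >&2; exit 1; }
        "$@"
    ) 9>"$lockfile"
}
```

Because the kernel releases a `flock` lock when its holder exits, a crashed deploy does not leave a stale lock behind; the 30-minute rule would only need to handle a deploy that is still running but hung.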
**S7 — Deploy rollback**
- File: `deploy-to-target.sh`
- Before overwriting binary: `cp archipelago archipelago.bak`
- Before overwriting frontend: `cp -r web-ui web-ui.bak`
- If health check fails post-restart: restore from .bak, restart again
- Tests: intentionally broken binary → deploy detects, rolls back, system healthy
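The backup, check, and restore sequence can be sketched for a single file as follows. The function name and its argument order are hypothetical, and the service restarts are left as comments because the restart command is not specified here.

```bash
# Replace $1 with $2, keep a .bak copy, and restore it if the health command fails.
# Everything after the first two arguments is the health check command.
swap_with_rollback() {
    live=$1; candidate=$2; shift 2
    cp -p "$live" "$live.bak" || return 1
    cp -p "$candidate" "$live" || return 1
    # A real deploy would restart the service here before checking health.
    if "$@"; then
        return 0
    fi
    echo "health check failed; restoring $live from $live.bak" >&2
    cp -p "$live.bak" "$live"
    # ...and restart again after restoring.
    return 1
}
```

The same pattern would apply to the `web-ui` directory with `cp -r` in place of `cp -p`.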
**S8 — Eliminate sshpass**
- File: `trust-archipelago-cert.sh`
- Rewrite to use SSH key only: `ssh -i ~/.ssh/archipelago-deploy`
- Tests: script works with key auth, fails gracefully without key
### Week 6: Script Quality
**S9 — MariaDB password not on command line**
- File: `first-boot-containers.sh:285`
- Use `$DOCKER exec -i ... mariadb -uroot <<< "SET PASSWORD..."` so the statement arrives on stdin instead of as a command-line argument
- Tests: `ps aux` during execution doesn't show password
**S10 — Replace silent error masking**
- File: `deploy-to-target.sh` (80+ instances)
- Pattern: replace `2>/dev/null || echo ""` with `|| { log_warn "..."; echo ""; }`
- At minimum, log what failed before masking
- Tests: failed health check produces log entry
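A minimal sketch of the replacement pattern. `log_warn` is named in the plan but its implementation is assumed here, and the wrapper `soft_fail` is hypothetical.

```bash
# Assumed logger: the real script may already define its own log_warn.
log_warn() { printf '[WARN] %s\n' "$*" >&2; }

# Run a command; on failure, log what failed and emit an empty fallback
# value instead of silently discarding the error.
soft_fail() {
    desc=$1; shift
    "$@" || { rc=$?; log_warn "$desc failed (exit $rc)"; echo ""; }
}
```

A call such as `status=$(soft_fail "health query" curl -sf "$url")` keeps the old "empty string on failure" behaviour while leaving a log line that says what went wrong.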
**S11 — Trap cleanup for temp files**
- All scripts that create /tmp files: add `trap "rm -rf /tmp/deploy-$$" EXIT` at start
- Files: deploy-to-target.sh, deploy-tailscale.sh, build-auto-installer-iso.sh
- Tests: script interrupted mid-execution → temp files cleaned up
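The cleanup trap could be sketched as below. This is a variant that uses `mktemp -d` for a unique directory rather than the fixed `/tmp/deploy-$$` path; the wrapper name `with_tmpdir` and the `WORKDIR` variable are hypothetical.

```bash
# Run a command with a private scratch directory in $WORKDIR that is removed
# on normal exit and on interruption (INT/TERM are turned into an exit so the
# EXIT trap still fires).
with_tmpdir() {
    (
        WORKDIR=$(mktemp -d "${TMPDIR:-/tmp}/deploy.XXXXXX") || exit 1
        trap 'rm -rf "$WORKDIR"' EXIT
        trap 'exit 130' INT TERM
        "$@"
    )
}
```

The single quotes around the trap body delay expansion of `$WORKDIR` until the trap runs, which avoids surprises if the path ever contains spaces.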
**S12 — Quote all variables**
- Audit and fix unquoted `$VARIABLE` in command arguments across all scripts
- Tests: shellcheck passes on all modified scripts
**S13 — Extract hardcoded IPs to config**
- Create `scripts/deploy-config-defaults.sh` with all node IPs as named variables
- Source from all scripts instead of hardcoding
- Tests: changing IP in config → all scripts use new IP
### Week 7: Infrastructure Hardening
**I2 — Systemd resource limits**
- File: `image-recipe/configs/archipelago.service`
- Add: `MemoryMax=4G`, `LimitNOFILE=65535`, `TasksMax=2048`
- Tests: `systemctl show archipelago` confirms limits applied, service starts normally
**I3 — Tor rotation transition period**
- File: `core/archipelago/src/api/rpc/tor.rs`
- Keep old hidden service running for 24h after rotation. Both addresses active. Notify peers of new address. Schedule old deletion.
- Tests: after rotation old address still resolves, peers receive notification, old removed after transition
**S14 — Input validation on deploy targets**
- Add regex validation for hostnames/IPs before SSH
- Tests: invalid hostname → clear error, valid hostname → proceeds
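A sketch of the target check. It uses a `case` pattern allow-list in place of a regex, which also rejects embedded newlines and a leading `-` that `ssh` could otherwise read as an option. The function name `validate_target` is hypothetical.

```bash
# Accept only hostname/IPv4-shaped strings: letters, digits, dots and hyphens,
# with no leading hyphen, no leading or trailing dot, and no empty labels.
validate_target() {
    case "$1" in
        '' | -* | .* | *. | *..* | *[!A-Za-z0-9.-]*)
            echo "invalid deploy target: '$1'" >&2
            return 1 ;;
    esac
}
```

This checks shape only; it does not confirm that an address is reachable or that it is the permitted deploy node.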
**S15 — Memory limits on all deploy containers**
- File: `deploy-to-target.sh` lines 842-880
- Add `--memory=$(mem_limit ...)` to all UI container builds
- Tests: every container in deploy has `--memory` flag
**S17 — Disk space pre-flight**
- File: `deploy-to-target.sh`
- Check target disk usage < 85% before deploying. Abort with clear message if full.
- Tests: deploy to 90% full disk → aborted, deploy to 50% full → succeeds
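The pre-flight could be sketched with `df -P`, whose POSIX output format puts the capacity percentage in the fifth column. Both function names are hypothetical, and in the real deploy the check would run on the target over SSH.

```bash
# Print the used-space percentage (digits only) for the filesystem holding $1.
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Fail if usage is at or above the limit (default 85).
check_disk_space() {
    limit=${2:-85}
    used=$(disk_used_percent "$1")
    case "$used" in
        '' | *[!0-9]*) echo "could not read disk usage for $1" >&2; return 1 ;;
    esac
    if [ "$used" -ge "$limit" ]; then
        echo "disk ${used}% full on $1 (limit ${limit}%); aborting deploy" >&2
        return 1
    fi
}
```

Treating an unreadable value as a failure keeps the deploy from proceeding when `df` itself breaks.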
### Week 8: Remaining P1 Backend
**R14 — Fix .parse().unwrap() in session rate limiting**
- File: `session.rs:665,676,688`
- Replace `.parse().unwrap()` with `.parse().context("...")?`
- Tests: invalid IP handling works gracefully
**R15 — Fix 7 unwrap/expect in mesh/protocol.rs**
- File: `mesh/protocol.rs:582,592,614,649,679,713,728`
- Replace all with `?` operator + proper error types
- Tests: protocol parsing with malformed data returns error, not panic
**R27 — Add timeouts to mesh Bitcoin RPC calls**
- File: `mesh/mod.rs:624,649,663`
- Add `tokio::time::timeout(Duration::from_secs(10), ...)` to all Bitcoin RPC calls
- Tests: RPC timeout returns error after 10s
**R34 — Tor rotation transition**
- (Covered by I3 above)
---
## MONTH 3: PRODUCTION POLISH (Weeks 9–12)
> Fix every remaining P2 issue — unwraps, hardcoded values, frontend quality, resilience.
### Week 9: Remaining Backend Unwraps + Dead Code
**R13 — main.rs .expect() → .context()**
- Replace 2 `.expect()` calls with `.context("...")?` and proper startup error handling
**R16 — identity.rs .expect() → safe handling**
- Replace 2 `.expect()` in crypto operations with result propagation
**R17+R18 — helpers unwraps**
- Fix 10 `.unwrap()` calls in `helpers/lib.rs` and `helpers/rsync.rs`
- Replace with `?` operator or `.context()`
**R19 — js-engine unwraps**
- Fix 2 `.unwrap()` in `js-engine/lib.rs:130,249`
**R20+R21 — Dead code elimination**
- Remove all 14 `#[allow(dead_code)]` in `mesh/mod.rs`. Either use the fields or delete them.
- Same for `lnd.rs`, `data_manager.rs`, `dev_orchestrator.rs`
- Tests: `cargo clippy` zero warnings, `cargo test` passes
### Week 10: Hardcoded Values → Constants
**R22 — Bitcoin RPC URL constant**
- Create `const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/";` in a shared constants module
- Use across `bitcoin.rs`, `mesh/mod.rs`, `mesh/listener.rs`
- Tests: all Bitcoin RPC calls still work
**R23 — DWN health URL constant**
**R24 — Update manifest URL constant**
**R25 — DNS-over-HTTPS URLs → constants array**
**R26 — DWN protocol URIs → constants**
- Centralize all hardcoded URLs/URIs into `core/archipelago/src/constants.rs`
- Tests: all modules reference constants, no hardcoded strings remain
**R28 — LND proxy timeouts**
- Audit all 68 `.send()` calls in `api/rpc/lnd.rs` . Ensure each has explicit timeout.
- Tests: LND proxy call with unresponsive LND → timeout error, not hang
**R29 — DWN health check timeout**
- Add timeout to `dwn_sync.rs:76` health check
**R30-R33 — Resolve all TODOs**
- Either implement the TODO or remove the dead code path. Per project rules: no TODO/FIXME in commits.
### Week 11: Frontend P2 Fixes
**F8 — WebSocket reconnection race**
- Add `isReconnecting` flag. Skip if already reconnecting.
- Tests: rapid close events → only one reconnect attempt
**F9 — WebSocket parse error handling**
- Count consecutive parse errors. After 3, force reconnect.
- Tests: 3 malformed messages → reconnect triggered; single bad message → logged only
**F10 — Stale connection detection tuning**
- Require mutual pong response within 30s. Don't close valid connections that are simply quiet.
- Tests: quiet but healthy connection → stays open; no pong for 30s → reconnects
**F11 — RPC client backoff reduction**
- Reduce default timeout from 30s to 15s. Add jitter to backoff. Cap total retry time at 20s.
- Tests: server outage → user sees error within 20s, not 40s
**F12 — Code splitting**
- Lazy-load all routes: `() => import('./views/Web5.vue')`
- Add manual chunks in vite.config.ts for vendor/api
- Tests: build produces multiple chunks, initial bundle < 200KB gzipped
**F13 — DOMPurify on QR v-html**
- Add DOMPurify.sanitize() to QR SVG before v-html rendering
- Tests: XSS payload in QR content → sanitized
### Week 12: Frontend P2 Continued + Performance
**F14 — Goals computed memoization**
- Replace O(n) alias lookup with Map. Add deep equality check.
- Tests: goalStatuses computed runs in < 1ms with 100 apps
**F15 — localStorage error handling**
- Wrap all localStorage.setItem in try/catch. Show toast on quota exceeded.
- Tests: full localStorage → toast shown, app continues
**F16 — FileBrowser auth consolidation**
- Use cookie-only auth. Remove in-memory token.
- Tests: login persists across page reload, logout clears cookie
**F17 — CSRF token parsing robustness**
- Add header fallback for CSRF token. Handle edge cases.
- Tests: missing cookie → falls back to header, both missing → error
**F22 — CSS backdrop-filter mobile performance**
- Add media query: reduce blur to 8px on mobile. Remove backdrop-filter from non-visible elements.
- Tests: mobile Lighthouse performance score > 80
---
## MONTH 4-5: BACKEND ARCHITECTURE (Weeks 13–20)
> Split every Rust god file. Target: no file > 500 lines.
### Week 13–14: Split package.rs (1,795 lines)
```
api/rpc/package/
├── mod.rs — Re-exports (~50 lines)
├── config.rs — get_app_config(), get_app_capabilities(), needs_archy_net()
├── lifecycle.rs — install, start, stop, restart, uninstall
├── validation.rs — Input validation, dependency checking, image validation
└── progress.rs — Progress streaming, install status tracking
```
Pre-split tests: test every `get_app_config()` variant, validation path, lifecycle transition
Post-split: all RPC calls return identical responses, `cargo test` passes
### Week 15–16: Split mesh/listener.rs (1,799 lines)
```
mesh/listener/
├── mod.rs — Re-exports + spawn_mesh_listener()
├── session.rs — run_mesh_session() loop
├── frames.rs — handle_frame() dispatcher
├── identity.rs — handle_identity_received(), handle_typed_message()
├── sync.rs — sync_queued_messages(), store_typed_message()
└── bitcoin.rs — Bitcoin relay operations, RPC calls
```
### Week 17–18: Split rpc/mod.rs (1,092 lines) + lnd.rs (1,068 lines)
**rpc/mod.rs** → `dispatcher.rs` (method routing), `middleware.rs` (CSRF/session/rate-limit), `response.rs` (response building)
**lnd.rs** → `lnd/wallet.rs`, `lnd/channels.rs`, `lnd/info.rs`, `lnd/payments.rs`
### Week 19–20: Split monitoring (993), handler (911), mesh (865)
Split each into sub-modules. Target: no file > 500 lines.
All pre-split tests, all post-split verification.
---
## MONTH 6-8: FRONTEND ARCHITECTURE (Weeks 21–32)
> Split every Vue god component. Target: no component > 500 lines.
### Week 21–22: Split Web5.vue (3,940 lines → 8 sub-views)
```
views/web5/
├── Web5.vue — Router shell (~150 lines)
├── Web5Identity.vue — DID management
├── Web5Wallet.vue — Wallet operations
├── Web5Nostr.vue — Nostr relays/profiles
├── Web5Credentials.vue — Verifiable Credentials
├── Web5Peers.vue — P2P federation nodes
├── Web5Storage.vue — DWN storage/explorer
├── Web5Goals.vue — Goals/voting
└── Web5Marketplace.vue — Decentralized marketplace
```
Add nested routes. Component tests for each section. All sections render identically.
### Week 23–24: Split Mesh.vue (2,106) + Dashboard.vue (1,819)
**Mesh.vue** → `MeshRadio.vue`, `MeshChat.vue`, `MeshNetwork.vue`, `MeshFederation.vue`
**Dashboard.vue** → `DashboardHome.vue`, `DashboardApps.vue`, `DashboardSystem.vue`
### Week 25–26: Split Settings.vue (1,792) + Server.vue (1,132)
**Settings.vue** → `SettingsAccount.vue`, `SettingsSystem.vue`, `SettingsNetwork.vue`, `SettingsAppearance.vue`
**Server.vue** → `ServerOverview.vue`, `ServerContainers.vue`, `ServerLogs.vue`
### Week 27–28: Split Marketplace.vue (1,293) + AppDetails.vue (1,036) + Home.vue (1,059)
Each into 3-4 focused sub-components.
### Week 29–30: Decompose useAppStore (324 lines, 16 methods)
```
stores/
├── app.ts — Thin re-export for backward compat (~50 lines)
├── auth.ts — Login, logout, session, password, TOTP
├── server.ts — Server info, system stats, reboot/shutdown
├── realtime.ts — WebSocket connection, subscriptions, heartbeat
└── packages.ts — Package install/uninstall, marketplace data
```
Tests: every existing import of `useAppStore` still works. State transitions identical.
### Week 31–32: Remaining frontend P3 issues
**F18** — aiPermissions runtime validation
**F19** — Track AppSession timeout
**F20** — Dashboard aria-current
**F21** — Debounce search + memoize
**F23** — Branded types for DID operations
**F24** — Fix checkInterval leak
---
## MONTH 9-10: SCRIPT ARCHITECTURE + ISO (Weeks 33–40)
> Split every monolithic script. Target: no script > 400 lines.
### Week 33–34: Create shared script library
```
scripts/lib/
├── common.sh — Colors, logging, error handling, SSH helpers
├── health.sh — Health check polling, container status
├── deploy-utils.sh — Rsync, file sync, backup/restore
├── container.sh — Podman helpers, image management, mem_limit()
└── network.sh — IP validation, port checking
```
Tests: each library function tested in `scripts/tests/`
### Week 35–36: Split deploy-to-target.sh (1,728 lines)
```
scripts/
├── deploy-to-target.sh — Orchestrator + arg parsing (~300 lines)
├── deploy/
│ ├── frontend.sh — Build + sync frontend
│ ├── backend.sh — Build + sync binary
│ ├── configs.sh — Sync nginx, systemd, scripts
│ ├── containers.sh — Container creation/update
│ ├── verify.sh — Post-deploy health checks
│ └── rollback.sh — Rollback on failure
```
### Week 37–38: Split ISO build (1,850 lines) + first-boot (855 lines)
**build-auto-installer-iso.sh** → `build/capture-images.sh`, `build/create-rootfs.sh`, `build/install-packages.sh`, `build/bundle-configs.sh`, `build/package-iso.sh`
**first-boot-containers.sh** → `first-boot/databases.sh`, `first-boot/bitcoin.sh`, `first-boot/lightning.sh`, `first-boot/apps.sh`, `first-boot/networking.sh`
### Week 39–40: ISO Reproducibility + Integration Tests
**S16 — Make ISO builds reproducible**
- Create `image-versions.env` with pinned digests for every container image
- ISO build sources this file, never pulls `:latest`
- Build manifest records exactly what shipped
- Tests: two consecutive ISO builds produce identical image sets
**E2E smoke test script**
```bash
# scripts/smoke-test.sh — Run against .198
# 1. curl /health → OK
# 2. Login → get session
# 3. Get server info → valid JSON
# 4. List containers → all healthy
# 5. Check every /app/* proxy → responds
# 6. Check Tor hidden service → resolves
# 7. Check WebSocket upgrade → 101
# Exit 0 only if all pass
```
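One way to get the "exit 0 only if all pass" behavior while still reporting every failing step is to count failures instead of aborting on the first one. The sketch below uses hypothetical helper names (`run_check`, `smoke_result`); the commented-out `curl` line shows how step 1 might plug in.

```bash
#!/usr/bin/env bash
# Sketch only: run_check, smoke_result and FAILED are hypothetical names.
FAILED=0

# Run a named check; record a failure instead of aborting so every check reports.
run_check() {
    local name="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $name"
    else
        echo "FAIL  $name"
        FAILED=$((FAILED + 1))
    fi
}

# Overall result: succeeds only if no check failed.
smoke_result() {
    [ "$FAILED" -eq 0 ]
}

# Example wiring (commented out so sourcing this file has no side effects):
# BASE="http://192.168.1.198"
# run_check "health" curl --fail --silent --max-time 10 "$BASE/health"
# smoke_result || exit 1
```

Giving every network call a `--max-time` bound keeps the smoke test itself from hanging on a dead service.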
---
## MONTH 11: INTEGRATION TESTS (Weeks 41–44)
> Comprehensive test suites that prove everything works.
### Week 41–42: Backend Integration Tests
```
core/archipelago/tests/
├── test_auth_flow.rs — Login → session → CSRF → auth request → logout
├── test_container_lifecycle.rs — Install → start → health → stop → uninstall
├── test_federation.rs — Generate invite → join → sync → verify
├── test_rpc_validation.rs — Every endpoint with invalid input → proper error
├── test_session_persist.rs — Create session → restart → session survives
├── test_rate_limiting.rs — Flood → 429 → wait → allowed
├── test_backup_restore.rs — Create → verify → restore → validate
├── test_health_endpoint.rs — Healthy → degraded → recovery
```
Target: 25+ backend integration tests passing
### Week 43–44: Frontend Integration Tests
```
neode-ui/src/__tests__/integration/
├── auth-flow.spec.ts — Login → dashboard → timeout → redirect
├── app-lifecycle.spec.ts — Marketplace → install → progress → launch → uninstall
├── websocket.spec.ts — Connect → update → disconnect → reconnect → state consistent
├── settings-flow.spec.ts — Change password → re-login → 2FA setup → verify
├── spotlight.spec.ts — Open → search → navigate → close
├── mesh-chat.spec.ts — Connect → send → receive → disconnect
├── error-handling.spec.ts — Network error → toast → retry → success
├── code-splitting.spec.ts — Route navigation → chunks loaded lazily
```
Target: 20+ frontend integration tests passing
---
## MONTH 12: TYPE SYNC + CI/CD PLAN (Weeks 45–48)
### Week 45–46: Rust↔TypeScript Type Sync
**Approach**: `ts-rs` crate to auto-generate TypeScript types from Rust structs
1. Add `ts-rs` to `core/models/Cargo.toml`
2. Add `#[derive(TS)]` to all API request/response types
3. Build script generates `neode-ui/src/types/generated.ts`
4. Replace manual types in `types/api.ts` with imports from generated file
5. Verification: regenerate → diff → must be zero (types committed)
Tests: frontend type-check passes with generated types, manual api.ts reduced to non-API types
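The "regenerate → diff → must be zero" step could be scripted roughly as follows. This is a sketch: `check_types_in_sync` is a hypothetical helper, and how the fresh copy is produced depends on the build script from step 3, which is not defined yet.

```bash
#!/usr/bin/env bash
# Sketch only: check_types_in_sync is a hypothetical helper name.

# Compare the committed generated types ($1) against a fresh regeneration ($2).
# Prints a unified diff and returns 1 if they differ.
check_types_in_sync() {
    local committed="$1" fresh="$2"
    if diff -u "$committed" "$fresh"; then
        echo "generated types are in sync"
    else
        echo "generated types are stale: regenerate and commit" >&2
        return 1
    fi
}
```

The first argument would be `neode-ui/src/types/generated.ts`; the second is a freshly generated copy written to a temporary path.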
### Week 47–48: CI/CD Planning (Document Only, Execute Later)
> This section is the PLAN for CI/CD. Do not execute during this phase. Document everything needed so it can be implemented in a future sprint.
**CI Pipeline Design** (`.github/workflows/ci.yml`):
```yaml
# Triggers: push to main, all PRs
# Jobs:
# rust-checks (Linux runner):
#     - cargo clippy --all-targets --all-features -- -D warnings (zero warnings gate)
# - cargo fmt --all -- --check (formatting gate)
# - cargo test --all-features (all tests gate)
#
# frontend-checks (Node 20):
# - npm run type-check (TypeScript strictness gate)
# - npm run lint (ESLint gate)
# - npm test (Vitest suite gate)
#
# integration (Linux runner, optional):
# - scripts/smoke-test.sh against staging
#
# Merge policy: all checks must pass before merge
# Branch protection: require PR, require checks, no force push to main
```
**Release Pipeline Design** (`.github/workflows/release.yml`):
```yaml
# Triggers: tag push (v*)
# Jobs:
# build-linux-binary:
# - Cross-compile Rust for x86_64 + ARM64
# build-frontend:
# - npm run build
# build-iso:
# - SSH to build server, run ISO build
# - Upload ISO as release asset
# smoke-test:
# - Boot ISO in QEMU
# - Run smoke-test.sh
# - Gate release on pass
```
**Pre-requisites to implement**:
- [ ] GitHub Actions runner with Rust toolchain + cross-compilation
- [ ] Node.js 20 runner for frontend
- [ ] SSH key for build server accessible from CI
- [ ] Branch protection rules configured
- [ ] Image digest manifest for reproducible ISO builds
- [ ] QEMU-based ISO verification script
**Estimated implementation time**: 2 weeks when ready to execute
---
## VERIFICATION PROTOCOL (Every Week)
1. `cargo clippy --all-targets --all-features` — zero warnings
2. `cargo fmt --all`
3. `cargo test --all-features` — all pass
4. `cd neode-ui && npm run type-check` — zero errors
5. `cd neode-ui && npm test` — all pass
6. `./scripts/deploy-to-target.sh --target 192.168.1.198` — **ONLY .198**
7. `curl http://192.168.1.198/health` — returns OK with service status
8. Navigate all affected views in browser — identical behavior
9. Atomic commit: `refactor: <description>` or `fix: <description>`
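Step 6 relies on the operator typing the right address. A guard along these lines could make the .198-only rule mechanical; it is a sketch, and `ALLOWED_TARGET` and `assert_allowed_target` are hypothetical names rather than part of the current deploy script.

```bash
#!/usr/bin/env bash
# Sketch only: ALLOWED_TARGET and assert_allowed_target are hypothetical names.
ALLOWED_TARGET="192.168.1.198"

# Refuse any deploy target other than the single allowed host.
assert_allowed_target() {
    local target="$1"
    if [[ "$target" != "$ALLOWED_TARGET" ]]; then
        echo "refusing to deploy to '$target': only $ALLOWED_TARGET is allowed" >&2
        return 1
    fi
}
```

Called right after argument parsing, for example `assert_allowed_target "$target" || exit 1`, this would stop a mistyped `.228` before any file is synced.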
---
## EXIT CRITERIA (Month 12 Complete)
### Reliability (Zero Tolerance)
- [ ] Health endpoint returns real service status
- [ ] All async operations have bounded timeouts
- [ ] Zero blocking I/O in async context (no std::fs in async functions)
- [ ] Zero .unwrap()/.expect() in production code
- [ ] All rate limiters have cleanup tasks
- [ ] Backup restore uses staging + atomic swap + rollback
- [ ] All 30 containers have health checks + memory limits
- [ ] All container images pinned to specific versions
- [ ] Nginx unauthenticated endpoints protected (timeout + rate limit + body size)
- [ ] Systemd service has resource limits
- [ ] Tor rotation preserves old address during transition
- [ ] Deploy has locking + disk check + rollback
- [ ] Zero `sudo podman` in any script
- [ ] Zero `:latest` image tags anywhere
- [ ] Zero silent error masking without logging
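The two grep-able criteria above (`sudo podman` and `:latest`) lend themselves to an automated scan. A rough sketch, assuming a hypothetical helper `scan_forbidden` and that a plain recursive `grep` over a directory is precise enough:

```bash
#!/usr/bin/env bash
# Sketch only: scan_forbidden is a hypothetical helper name.

# Print every line under $1 that uses `sudo podman` or a `:latest` image tag.
# Returns 0 if the tree is clean, 1 if matches were found, 2 if grep failed.
scan_forbidden() {
    local dir="$1" rc=0
    grep -rnE 'sudo[[:space:]]+podman|:latest([^[:alnum:]_.-]|$)' "$dir" || rc=$?
    case "$rc" in
        0) return 1 ;;  # grep found at least one forbidden line
        1) return 0 ;;  # no matches: tree is clean
        *) return 2 ;;  # grep itself failed (e.g. unreadable path)
    esac
}
```

Treating a grep error as a distinct failure avoids reporting a clean result when the directory was never actually scanned.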
### Frontend (Zero Tolerance)
- [ ] Global error handler catches and displays all errors
- [ ] WebSocket: single subscription, reconnect refreshes state, bounded retries
- [ ] All timers/listeners cleaned up on unmount
- [ ] Code splitting: initial bundle < 200KB gzipped
- [ ] v-html always uses DOMPurify
- [ ] All localStorage operations wrapped in try/catch
### Architecture (Target: File Size Limits)
- [ ] No Rust file > 500 lines (excluding generated code)
- [ ] No Vue component > 500 lines
- [ ] No shell script > 400 lines
- [ ] No Pinia store has more than 1 responsibility
- [ ] All hardcoded URLs/ports extracted to constants
- [ ] Shared script library eliminates duplication
- [ ] TypeScript types auto-generated from Rust structs
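The line-count limits above could be enforced with a small check like the one below. It is a sketch under assumptions: `find_oversized` is a hypothetical name, and it does not yet exclude generated code, which the Rust limit allows for.

```bash
#!/usr/bin/env bash
# Sketch only: find_oversized is a hypothetical helper name.

# List files under $1 matching glob $2 that exceed $3 lines.
# Prints "<lines> <path>" per offender; returns 1 if any were found.
find_oversized() {
    local dir="$1" pattern="$2" limit="$3" file lines status=0
    while IFS= read -r -d '' file; do
        # Arithmetic expansion strips the padding some wc implementations emit.
        lines=$(( $(wc -l < "$file") ))
        if (( lines > limit )); then
            echo "$lines $file"
            status=1
        fi
    done < <(find "$dir" -type f -name "$pattern" -print0)
    return "$status"
}
```

For example, `find_oversized scripts '*.sh' 400` would cover the shell-script limit, with similar calls for `*.rs` and `*.vue` at 500 lines.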
### Testing
- [ ] 25+ backend integration tests passing
- [ ] 20+ frontend integration tests passing
- [ ] E2E smoke test script passes on .198
- [ ] ISO builds are reproducible (pinned digests)
### CI/CD (Planned, Not Executed)
- [ ] CI pipeline design documented
- [ ] Release pipeline design documented
- [ ] Pre-requisites list complete
- [ ] Ready for 2-week implementation sprint
### Zero Behavior Changes
Every feature works identically. Every existing test passes. Every user flow unchanged.