# Archipelago: Production Excellence Plan

**Duration**: 12 months (48 weeks)
**Goal**: Code so good no developer could question any decision. Apple-level reliability. Every failure visible and recoverable. Every operation bounded. Every line justified.
**Audited**: 2026-03-20 — 122 Rust files, 38 Vue views, 180+ frontend files, 80+ shell scripts

## CONSTRAINTS

- **DEPLOY ONLY TO .198** — Never .228. All verification on .198.
- **BETA FREEZE** — Behavior-preserving only. No new features/UI/endpoints.
- **Tests before every refactor** — Capture current behavior first. Tests must pass unchanged after.
- **Atomic commits** — One logical change per commit. Every step compiles + passes tests.

```bash
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198
```

---

## COMPLETE ISSUE REGISTRY

### Backend Rust — 122 files audited

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| R1 | Health RPC endpoint has no handler — returns "Unknown method" | `api/rpc/mod.rs` | P0 |
| R2 | Nostr client.connect() hangs indefinitely (4 calls, no timeout) | `nostr_handshake.rs:124,161,262,282` | P0 |
| R3 | Backup restore extracts directly to live dir — no atomic rollback | `backup/full.rs:122-149` | P0 |
| R4 | Rate limiter cleanup() never spawned — HashMap grows forever | `session.rs:566-579` | P1 |
| R5 | Login rate limiter same issue — entries never evicted | `session.rs:452-472` | P1 |
| R6 | Blocking std::fs in async — session.rs (6 calls) | `session.rs:77,128,370,413,423,425` | P1 |
| R7 | Blocking std::fs in async — docker_packages.rs | `docker_packages.rs:561,573` | P1 |
| R8 | Blocking std::fs in async — port_allocator.rs | `port_allocator.rs:59,73,77` | P1 |
| R9 | Blocking std::fs in async — peers.rs, node_message.rs | `peers.rs:30`, `node_message.rs:65` | P1 |
| R10 | Blocking std::fs in async — identity.rs, identity_manager.rs | `identity.rs:50`, `identity_manager.rs:164` | P1 |
| R11 | Blocking std::fs in async — nostr_discovery.rs | `nostr_discovery.rs:55` | P1 |
| R12 | Sync TCP I/O in async context — electrs_status.rs | `electrs_status.rs:5,40,78,81` | P1 |
| R13 | .expect() in main.rs startup | `main.rs:124,159` | P2 |
| R14 | .parse().unwrap() in session.rs rate limiting | `session.rs:665,676,688` | P1 |
| R15 | 7 .unwrap()/.expect() in mesh/protocol.rs | `protocol.rs:582,592,614,649,679,713,728` | P1 |
| R16 | .expect() in identity.rs crypto | `identity.rs:114,119` | P2 |
| R17 | .unwrap() in helpers/lib.rs (5 calls) | `helpers/lib.rs:167,172,180,233,253` | P2 |
| R18 | .unwrap() in helpers/rsync.rs (5 calls) | `rsync.rs:196,199,202,210,220` | P2 |
| R19 | .unwrap() in js-engine/lib.rs | `js-engine/lib.rs:130,249` | P2 |
| R20 | 14 #[allow(dead_code)] suppressions in mesh/mod.rs | `mesh/mod.rs:7-25` | P2 |
| R21 | Dead code in lnd.rs, data_manager.rs, dev_orchestrator.rs | Multiple | P2 |
| R22 | Bitcoin RPC URL hardcoded in 4+ files | `bitcoin.rs:89`, `mesh/mod.rs:624,649,663`, `listener.rs:1509+` | P2 |
| R23 | DWN health URL hardcoded | `dwn_sync.rs:76` | P2 |
| R24 | Update manifest URL hardcoded | `update.rs:11` | P3 |
| R25 | DNS-over-HTTPS URLs hardcoded (4 providers) | `network/dns.rs:98,102,106,110` | P3 |
| R26 | DWN protocol URIs hardcoded in server.rs | `server.rs:453-456` | P3 |
| R27 | Missing timeouts on mesh Bitcoin RPC calls | `mesh/mod.rs:624,649,663` | P1 |
| R28 | Missing timeouts on LND proxy calls (68 .send() calls) | `api/rpc/lnd.rs` | P2 |
| R29 | Missing timeout on DWN health check | `dwn_sync.rs:76` | P2 |
| R30 | TODO: track last-seen timestamp | `handshake.rs:77` | P3 |
| R31 | TODO: lnd.lookupinvoice RPC endpoint | `marketplace.rs:183` | P3 |
| R32 | TODO: trigger auto-restart or alert | `container/health_monitor.rs:140` | P3 |
| R33 | TODO: configure Podman to use AppArmor profile | `security/container_policies.rs:68` | P3 |
| R34 | Tor rotation deletes old .onion immediately — no transition | `api/rpc/tor.rs:184-240` | P1 |
| R35 | package.rs god file — 1,795 lines | `api/rpc/package.rs` | P2 |
| R36 | mesh/listener.rs god file — 1,799 lines | `mesh/listener.rs` | P2 |
| R37 | rpc/mod.rs god file — 1,092 lines | `api/rpc/mod.rs` | P2 |
| R38 | lnd.rs god file — 1,068 lines | `api/rpc/lnd.rs` | P2 |
| R39 | monitoring/mod.rs — 993 lines | `monitoring/mod.rs` | P3 |
| R40 | api/handler.rs — 911 lines | `api/handler.rs` | P3 |
| R41 | 30+ functions exceed 50 lines across codebase | Multiple | P3 |

### Frontend — 180+ files audited

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| F1 | WebSocket subscription registered multiple times — race condition | `stores/app.ts:88-134` | P0 |
| F2 | Unprotected concurrent mesh state mutations | `stores/mesh.ts:249-268,294-324` | P0 |
| F3 | No global Vue error handler — white screen on error | `main.ts` | P0 |
| F4 | Stale data after WebSocket reconnect — no full refresh | `stores/app.ts:88-163` | P1 |
| F5 | Message polling timer never stopped after logout | `composables/useMessageToast.ts:60` | P1 |
| F6 | AppLauncher NIP-07 message listener leak on close | `stores/appLauncher.ts:295-301` | P1 |
| F7 | Audio player listeners stack — never cleaned up | `composables/useAudioPlayer.ts:1-91` | P1 |
| F8 | WebSocket reconnection race — parallel connect() attempts | `api/websocket.ts:212-238` | P2 |
| F9 | WebSocket parse error silently caught — stale UI forever | `api/websocket.ts:164-172` | P2 |
| F10 | WebSocket stale connection detection too aggressive (5min) | `api/websocket.ts:284-299` | P2 |
| F11 | RPC client backoff + timeout = 40s max wait | `api/rpc-client.ts:31-117` | P2 |
| F12 | No code splitting — monolithic bundle | `vite.config.ts` | P2 |
| F13 | v-html on QR code without DOMPurify | `views/Settings.vue:441` | P2 |
| F14 | Goals store O(n) alias lookup on every computed | `stores/goals.ts:16-20,38-89` | P2 |
| F15 | localStorage save without try/catch (5+ instances) | `stores/goals.ts:34-36` + others | P2 |
| F16 | FileBrowser auth token duality — memory + cookie | `api/filebrowser-client.ts:39,50-68` | P2 |
| F17 | CSRF token cookie parsing brittle — regex only | `api/rpc-client.ts:18-21` | P2 |
| F18 | aiPermissions.ts Set uses unsafe type assertion | `stores/aiPermissions.ts:91-103` | P3 |
| F19 | Untracked setTimeout in AppSession — fires after unmount | `views/AppSession.vue:507` | P3 |
| F20 | Dashboard navigation missing aria-current="page" | `views/Dashboard.vue` | P3 |
| F21 | Search performance — string re-lowercasing every keystroke | `views/Apps.vue:510-537` | P3 |
| F22 | 30+ backdrop-filter blur elements — GPU overload on mobile | `style.css` | P3 |
| F23 | Record<string, unknown> on sensitive DID operations | `types/api.ts` + `rpc-client.ts` | P3 |
| F24 | checkInterval timer leak on connect race | `api/websocket.ts:82-96` | P3 |
| F25 | Web5.vue god component — 3,940 lines | `views/Web5.vue` | P2 |
| F26 | Mesh.vue — 2,106 lines | `views/Mesh.vue` | P2 |
| F27 | Dashboard.vue — 1,819 lines | `views/Dashboard.vue` | P2 |
| F28 | Settings.vue — 1,792 lines | `views/Settings.vue` | P2 |
| F29 | Marketplace.vue — 1,293 lines | `views/Marketplace.vue` | P3 |
| F30 | Server.vue — 1,132 lines | `views/Server.vue` | P3 |
| F31 | Home.vue — 1,059 lines | `views/Home.vue` | P3 |
| F32 | AppDetails.vue — 1,036 lines | `views/AppDetails.vue` | P3 |
| F33 | useAppStore god store — 324 lines, 16 methods, 8+ responsibilities | `stores/app.ts` | P2 |

### Shell Scripts — 80+ files audited

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| S1 | 60+ instances of `sudo podman` — should be rootless | `fix-indeedhub(28)`, `deploy-bitcoin(11)`, `deploy-tailscale(2+)` | P0 |
| S2 | Zero container health checks in first-boot (30 containers) | `first-boot-containers.sh` | P0 |
| S3 | 50+ `:latest` image tags across all scripts | `first-boot(15)`, `deploy(11)`, `tailscale(18)`, `iso(7)` | P1 |
| S4 | No `set -e` in first-boot — silent container failures | `first-boot-containers.sh:1-9` | P1 |
| S5 | `eval "$DB_PASSWORDS"` — code injection risk | `deploy-to-target.sh:940` | P1 |
| S6 | No deploy locking — concurrent deploys corrupt state | `deploy-to-target.sh` | P1 |
| S7 | No deploy rollback — failed deploy leaves broken system | `deploy-to-target.sh` | P1 |
| S8 | sshpass usage in trust-archipelago-cert.sh | `trust-archipelago-cert.sh:23-26` | P1 |
| S9 | MariaDB password in command line — visible in ps | `first-boot-containers.sh:285` | P1 |
| S10 | 80+ instances of `2>/dev/null \|\| true` masking errors | `deploy-to-target.sh` | P2 |
| S11 | No trap cleanup for temp files | Multiple scripts | P2 |
| S12 | Unquoted variables (word splitting risk) | Multiple scripts | P2 |
| S13 | Hardcoded IPs in 6+ scripts | `deploy-to-target.sh:26`, `deploy-tailscale.sh:26`, etc. | P2 |
| S14 | No input validation on deploy targets | `deploy-tailscale.sh` | P2 |
| S15 | Missing memory limits on some containers in deploy | `deploy-to-target.sh:842-880` | P2 |
| S16 | ISO build not reproducible — dynamic image capture + :latest | `build-auto-installer-iso.sh:500-594` | P2 |
| S17 | No disk space pre-flight in deploy | `deploy-to-target.sh` | P2 |
| S18 | deploy-to-target.sh — 1,728 lines monolith | `deploy-to-target.sh` | P3 |
| S19 | build-auto-installer-iso.sh — 1,850 lines monolith | `build-auto-installer-iso.sh` | P3 |
| S20 | first-boot-containers.sh — 855 lines monolith | `first-boot-containers.sh` | P3 |
| S21 | No shared script library — duplicated functions | `scripts/` | P3 |

### Infrastructure

| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| I1 | Nginx: /archipelago/, /content, /dwn missing timeout+rate-limit+body-size | `nginx-archipelago.conf:116-180` | P0 |
| I2 | Systemd: no MemoryMax, LimitNOFILE, TasksMax | `archipelago.service` | P1 |
| I3 | Tor rotation kills old address immediately — federation downtime | `api/rpc/tor.rs:184-240` | P1 |

---

## MONTH 1: CRASH PREVENTION (Weeks 1–4)

> Fix every issue that can crash the system, hang indefinitely, or lose data.

### Week 1: P0 Backend — Things That Hang or Lose Data

**R1 — Health endpoint handler**
- File: `core/archipelago/src/api/rpc/mod.rs`
- Add handler for `"health"` method that checks: crash recovery complete, Podman socket responsive, session store loaded
- Tests: health returns JSON status, degraded when Podman unreachable, degraded during recovery
- Verify: `curl http://192.168.1.198/rpc/v1 -d '{"method":"health"}'` returns real status

**R2 — Nostr connect timeout**
- File: `core/archipelago/src/nostr_handshake.rs` lines 124, 161, 262, 282
- Wrap all 4 `client.connect()` futures in `tokio::time::timeout(Duration::from_secs(10), ...)` and `.await` the wrapper
- Tests: connect timeout returns Err after 10s, successful connect within timeout works

**R3 — Backup restore atomic rollback**
- File: `core/archipelago/src/backup/full.rs` lines 122-149
- Rewrite: decrypt → extract to staging dir → validate required files → atomic rename → rollback on failure
- Tests: valid backup restores, corrupt backup fails without touching live data, partial extraction rolls back, disk space check fails early

**I1 — Nginx unauthenticated endpoint protection**
- File: `image-recipe/configs/nginx-archipelago.conf` lines 116-180
- Add to `/archipelago/`, `/content`, `/dwn`:
  - `limit_req zone=peer burst=20 nodelay;`
  - `limit_req_status 429;` (nginx answers rate-limited requests with 503 unless this is set)
  - `client_max_body_size 10m;`
  - `proxy_connect_timeout 30s; proxy_read_timeout 60s; proxy_send_timeout 30s;`
- Tests: >10MB payload → 413, slow client → timeout, burst 30 → 429 after 20

### Week 2: P0 Frontend + Scripts — Things That Break UI or Containers

**F1 — WebSocket subscription race condition**
- File: `neode-ui/src/stores/app.ts` lines 88-134
- Fix: Return unsubscribe function from `wsClient.subscribe()`, call it before re-subscribing. Use a subscription ID to prevent duplicates.
- Tests: rapid connectWebSocket() calls produce only one active subscription

**F2 — Mesh concurrent state mutations**
- File: `neode-ui/src/stores/mesh.ts` lines 249-324
- Fix: Add `isSending` ref as mutex. Queue concurrent sends. `fetchMessages()` called once after all sends complete.
- Tests: 3 concurrent sendMessage() calls → all succeed, messages list consistent

**F3 — Global error handler**
- File: `neode-ui/src/main.ts`
- Add `app.config.errorHandler` that shows toast + logs structured error
- Tests: thrown error in component shows toast, nested errors don't crash handler

**S1 — Eliminate all `sudo podman`**
- Files: `fix-indeedhub-containers.sh` (28), `deploy-bitcoin-knots.sh` (11), `deploy-tailscale.sh` (2+), `uptime-monitor.sh` (1), `setup-aiui-server.sh`
- Replace every `sudo podman` with `podman` (runs as archipelago user)
- Tests: grep for `sudo podman` across all scripts returns zero matches

**S2 — Container health checks for all 30 containers**
- File: `scripts/first-boot-containers.sh`
- Add `--health-cmd`, `--health-interval=30s`, `--health-timeout=5s`, `--health-retries=3` to every `$DOCKER run`
- Health commands per type:
  - Bitcoin: `bitcoin-cli -rpcuser=... getblockchaininfo || exit 1`
  - HTTP apps: `curl -sf http://localhost:{port}/ || exit 1`
  - LND: `curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1`
  - Databases: `mariadb -u root -p... -e "SELECT 1" || exit 1`
- Tests: script grep confirms every `$DOCKER run` has `--health-cmd`

### Week 3: P1 Backend — Blocking I/O and Memory Leaks

**R4+R5 — Rate limiter cleanup**
- File: `core/archipelago/src/session.rs`
- Spawn background tasks for both `EndpointRateLimiter::cleanup()` and `LoginRateLimiter` cleanup, every 5 min
- Tests: after cleanup, stale entries removed; active entries preserved

**R6 — session.rs blocking I/O (6 calls)**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 77, 370, 413
- Replace `std::fs::write` → `tokio::fs::write` at lines 128, 425
- Replace `std::fs::create_dir_all` → `tokio::fs::create_dir_all` at line 423
- Tests: session load/save/persist still works correctly

**R7 — docker_packages.rs blocking I/O**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 561, 573
- Tests: app metadata loading works

**R8 — port_allocator.rs blocking I/O**
- Replace all 3 std::fs calls → tokio::fs at lines 59, 73, 77
- Tests: port allocation/persistence works

**R9+R10+R11 — Remaining blocking I/O**
- `peers.rs:30`, `node_message.rs:65`, `identity.rs:50`, `identity_manager.rs:164`, `nostr_discovery.rs:55`
- Convert all to tokio::fs
- Tests: each module's file operations still work

**R12 — electrs_status.rs sync TCP I/O**
- Convert synchronous TCP client to async (tokio::net::TcpStream)
- Tests: electrs status query works, timeout on connection failure

### Week 4: P1 Frontend — Memory Leaks and Stale State

**F4 — WebSocket reconnect full state refresh**
- File: `neode-ui/src/stores/app.ts`
- After reconnect, call `rpcClient.call({method: 'server.get-state'})` to get fresh state before accepting patches
- Tests: after simulated disconnect+reconnect, state matches server

**F5 — Message polling timer cleanup**
- File: `neode-ui/src/composables/useMessageToast.ts`
- Tie polling lifecycle to auth state: stop on logout, start on login. Export cleanup function.
- Tests: polling stops when auth false, restarts when auth true, no timer after unmount

**F6 — AppLauncher message listener leak**
- File: `neode-ui/src/stores/appLauncher.ts`
- Ensure listener is removed when app closes (even if not via close button — e.g., route navigation)
- Tests: navigate away from app → listener removed, new app opens clean

**F7 — Audio player listener stacking**
- File: `neode-ui/src/composables/useAudioPlayer.ts`
- Create Audio element once, register listeners once. Track initialization flag.
- Tests: calling play() 10 times → still only 6 listeners total (not 60)

**S3 — Pin all container images (remove :latest)**
- Files: `first-boot-containers.sh` (15), `deploy-to-target.sh` (11), `deploy-tailscale.sh` (18), `build-auto-installer-iso.sh` (7)
- Replace every `:latest` with specific version tag
- Create `image-versions.env` sourced by all scripts — single source of truth
- Tests: `grep -r ':latest' scripts/ image-recipe/` returns zero matches (excluding comments)

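The S3 test ("zero matches, excluding comments") could be automated along these lines. This is a minimal sketch: `find_latest_tags` and `assert_images_pinned` are hypothetical helper names, and a line counts as a comment only when its content starts with `#`, so a trailing comment that mentions `:latest` is still reported.

```bash
#!/usr/bin/env bash
# Sketch: report ':latest' on non-comment lines (helper names are hypothetical).

find_latest_tags() {
    # -r recurse, -n line numbers, -H always print the file name.
    # The second grep drops "file:line:   # comment" records.
    grep -rnH -- ':latest' "$@" 2>/dev/null \
        | grep -Ev '^[^:]+:[0-9]+:[[:space:]]*#' || true
}

# Status 0 only when no non-comment ':latest' reference remains.
assert_images_pinned() {
    local hits
    hits="$(find_latest_tags "$@")"
    if [ -n "$hits" ]; then
        printf 'Unpinned images found:\n%s\n' "$hits" >&2
        return 1
    fi
}
```

Running `assert_images_pinned scripts/ image-recipe/` from the repository root would then serve as the pass/fail gate for S3.
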
---

## MONTH 2: OPERATIONAL SAFETY (Weeks 5–8)

> Fix everything that makes deploys dangerous, scripts unreliable, or operations opaque.

### Week 5: Deploy Script Hardening

**S4 — first-boot error handling**
- Add per-section error checking: if Bitcoin fails, skip dependent containers (LND, Mempool, BTCPay)
- Add `wait_for_container` return value checking
- Tests: first-boot with broken Bitcoin image → Bitcoin deps skipped, independent apps still start

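One way to express "skip dependents, keep starting independent apps" is a small wrapper that records failures. This is a sketch under assumptions: `start_with_dep`, `start_container`, and the `FAILED` list are hypothetical names, and `start_container` is a stub standing in for the real `$DOCKER run` plus `wait_for_container` logic.

```bash
#!/usr/bin/env bash
# Sketch: skip containers whose dependency failed (all names are hypothetical).

FAILED=" "   # space-delimited list of containers that failed or were skipped

start_container() {
    # Stub: the real script would run "$DOCKER run ..." and check wait_for_container.
    echo "starting $1"
}

# start_with_dep NAME [DEPENDENCY]
start_with_dep() {
    local name="$1" dep="${2:-}"
    if [ -n "$dep" ] && [[ "$FAILED" == *" $dep "* ]]; then
        echo "SKIP $name: dependency $dep is not running" >&2
        FAILED+="$name "        # dependents of a skipped container are skipped too
        return 0
    fi
    if ! start_container "$name"; then
        echo "FAIL $name" >&2
        FAILED+="$name "
    fi
    return 0                    # never abort first-boot; failures are recorded instead
}
```

With this shape, `start_with_dep bitcoin` followed by `start_with_dep lnd bitcoin` skips LND when Bitcoin failed, while a call without a dependency argument always attempts to start.
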
**S5 — Replace eval with safe construct**
- File: `deploy-to-target.sh:940`
- Replace `eval "$DB_PASSWORDS"` with explicit variable assignment from SSH output
- Tests: passwords parsed correctly without eval

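A possible eval-free replacement reads `KEY=VALUE` lines and assigns only whitelisted names. This is a sketch, not the script's actual format: the variable names in `ALLOWED_KEYS` and the `KEY=VALUE` layout of the SSH output are assumptions.

```bash
#!/usr/bin/env bash
# Sketch: assign whitelisted KEY=VALUE pairs from stdin; nothing is ever executed.
# ALLOWED_KEYS holds hypothetical names; substitute the real password variables.

ALLOWED_KEYS="MARIADB_ROOT_PASSWORD POSTGRES_PASSWORD"

load_db_passwords() {
    local line key value
    while IFS= read -r line || [ -n "$line" ]; do
        [[ "$line" == *=* ]] || continue                    # ignore lines without '='
        key="${line%%=*}"
        value="${line#*=}"                                  # value may itself contain '='
        [[ "$key" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || continue
        case " $ALLOWED_KEYS " in
            *" $key "*) printf -v "$key" '%s' "$value" ;;   # plain assignment, no evaluation
            *) echo "ignoring unexpected key: $key" >&2 ;;
        esac
    done
}
```

Feed the function with a redirect or process substitution rather than a pipe, since a pipe would run it in a subshell and the assignments would be lost.
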
**S6 — Deploy locking**
- File: `deploy-to-target.sh`
- Add remote `flock` on `/var/lock/archipelago-deploy.lock`. Second deploy fails immediately with message. Stale lock (>30 min) broken automatically.
- Tests: two parallel deploys → second fails, stale lock → broken and deploy proceeds

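The "second deploy fails immediately" behavior maps onto a non-blocking `flock`. The sketch below assumes `flock(1)` from util-linux is available where the lock is taken; `acquire_deploy_lock` is a hypothetical name, and the lock path is a parameter so the function can be exercised locally instead of on the target.

```bash
#!/usr/bin/env bash
# Sketch: exclusive, non-blocking deploy lock (function name is hypothetical).

acquire_deploy_lock() {
    local lock_file="$1"
    exec 9>>"$lock_file" || return 1     # fd 9 stays open for the life of the script
    if ! flock -n 9; then                # -n: fail at once instead of waiting
        echo "Another deploy holds $lock_file; aborting." >&2
        return 1
    fi
}
```

The kernel releases the lock when fd 9 closes, so a crashed deploy never leaves a stale lock behind. A deploy that hangs still holds it, which means the 30-minute stale-lock rule needs separate handling that this sketch omits.
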
**S7 — Deploy rollback**
- File: `deploy-to-target.sh`
- Before overwriting binary: `cp archipelago archipelago.bak`
- Before overwriting frontend: `cp -r web-ui web-ui.bak`
- If health check fails post-restart: restore from .bak, restart again
- Tests: intentionally broken binary → deploy detects, rolls back, system healthy

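The backup, install, verify, restore sequence for a single file might look like the following. It is a sketch: `install_with_rollback` is a hypothetical helper, the health command is passed in by the caller, and service restarts are left out.

```bash
#!/usr/bin/env bash
# Sketch: install a file and restore the backup when the health check fails.
# Usage: install_with_rollback TARGET NEW_FILE HEALTH_CMD [ARGS...]

install_with_rollback() {
    local target="$1" new_file="$2"
    shift 2
    cp -p "$target" "$target.bak" || return 1    # keep the known-good copy
    cp -p "$new_file" "$target"   || return 1
    if "$@"; then                                # caller-supplied health check
        return 0
    fi
    echo "Health check failed; restoring $target from backup" >&2
    cp -p "$target.bak" "$target"
    return 1
}
```

In the real deploy the health command would be the post-restart check from this plan, and the restore step would be followed by a second restart.
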
**S8 — Eliminate sshpass**
- File: `trust-archipelago-cert.sh`
- Rewrite to use SSH key only: `ssh -i ~/.ssh/archipelago-deploy`
- Tests: script works with key auth, fails gracefully without key

### Week 6: Script Quality

**S9 — MariaDB password not on command line**
- File: `first-boot-containers.sh:285`
- Use `$DOCKER exec -i ... mariadb -uroot <<< "SET PASSWORD..."` so the statement arrives on stdin, not in the argument list
- Tests: `ps aux` during execution doesn't show password

**S10 — Replace silent error masking**
- File: `deploy-to-target.sh` (80+ instances)
- Pattern: replace `2>/dev/null || echo ""` with `|| { log_warn "..."; echo ""; }`
- At minimum, log what failed before masking
- Tests: failed health check produces log entry

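With 80+ call sites, the pattern may be easier to apply through one wrapper than by editing each line by hand. This sketch assumes a `log_warn` helper that writes to stderr; both `log_warn` as written here and `run_or_warn` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: keep the "never abort" behavior, but log the failure before masking it.

log_warn() {
    printf '[WARN] %s\n' "$*" >&2
}

# run_or_warn DESCRIPTION COMMAND [ARGS...]
# Prints the command's stdout; on failure logs a warning and prints an empty line.
run_or_warn() {
    local description="$1" output status
    shift
    if output="$("$@")"; then
        printf '%s\n' "$output"
    else
        status=$?
        log_warn "$description failed (exit $status)"
        echo ""
    fi
}
```

The command's own stderr is no longer discarded, so the cause of a failure ends up in the deploy log next to the warning.
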
**S11 — Trap cleanup for temp files**
- All scripts that create /tmp files: add `trap "rm -rf /tmp/deploy-$$" EXIT` at start
- Files: deploy-to-target.sh, deploy-tailscale.sh, build-auto-installer-iso.sh
- Tests: script interrupted mid-execution → temp files cleaned up

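A variant of the trap that also covers interruption is sketched below. It swaps the fixed `/tmp/deploy-$$` path for `mktemp -d`, which is a substitution and not what the scripts currently do; `setup_work_dir` and `WORK_DIR` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: private temp dir that is removed on every exit path.

setup_work_dir() {
    WORK_DIR="$(mktemp -d "${TMPDIR:-/tmp}/deploy.XXXXXX")" || return 1
    trap 'rm -rf -- "$WORK_DIR"' EXIT   # normal exit and `exit N`
    trap 'exit 130' INT                 # Ctrl-C: turn into an exit so the EXIT trap runs
    trap 'exit 143' TERM                # kill: same
}
```

Calling `setup_work_dir` once near the top of each script and writing all scratch files under `"$WORK_DIR"` keeps the cleanup in one place.
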
**S12 — Quote all variables**
- Audit and fix unquoted `$VARIABLE` in command arguments across all scripts
- Tests: shellcheck passes on all modified scripts

**S13 — Extract hardcoded IPs to config**
- Create `scripts/deploy-config-defaults.sh` with all node IPs as named variables
- Source from all scripts instead of hardcoding
- Tests: changing IP in config → all scripts use new IP

### Week 7: Infrastructure Hardening

**I2 — Systemd resource limits**
- File: `image-recipe/configs/archipelago.service`
- Add: `MemoryMax=4G`, `LimitNOFILE=65535`, `TasksMax=2048`
- Tests: `systemctl show archipelago` confirms limits applied, service starts normally

**I3 — Tor rotation transition period**
- File: `core/archipelago/src/api/rpc/tor.rs`
- Keep old hidden service running for 24h after rotation. Both addresses active. Notify peers of new address. Schedule old deletion.
- Tests: after rotation old address still resolves, peers receive notification, old removed after transition

**S14 — Input validation on deploy targets**
- Add regex validation for hostnames/IPs before SSH
- Tests: invalid hostname → clear error, valid hostname → proceeds

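The regex validation could be sketched as below, assuming targets are either dotted-quad IPv4 addresses or RFC 1123 style hostnames (IPv6 is not handled). `is_valid_target` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: accept only IPv4 addresses or plain hostnames before passing them to ssh.

is_valid_target() {
    local target="$1" octet
    local ipv4_re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
    local label='[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?'
    local host_re="^${label}(\.${label})*\$"

    if [[ "$target" =~ $ipv4_re ]]; then
        local IFS=.
        for octet in $target; do                     # split on dots
            (( 10#$octet <= 255 )) || return 1       # 10# avoids octal parsing of "08"
        done
        return 0
    fi
    [ "${#target}" -le 253 ] && [[ "$target" =~ $host_re ]]
}
```

Because every label must start with an alphanumeric character, values such as `-oProxyCommand=...` are rejected before they can be interpreted as an ssh option.
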
**S15 — Memory limits on all deploy containers**
- File: `deploy-to-target.sh` lines 842-880
- Add `--memory=$(mem_limit ...)` to all UI container builds
- Tests: every container in deploy has `--memory` flag

**S17 — Disk space pre-flight**
- File: `deploy-to-target.sh`
- Check that target disk usage is below 85% before deploying. Abort with a clear message otherwise.
- Tests: deploy to 90% full disk → aborted, deploy to 50% full → succeeds

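The pre-flight check reduces to reading the usage column from `df -P` and comparing it with the threshold. This is a sketch with hypothetical function names; in the plan the `df` call would run on the target over SSH, which is omitted here.

```bash
#!/usr/bin/env bash
# Sketch: abort when the filesystem holding a path is at or above the usage limit.

disk_usage_percent() {
    # df -P prints one POSIX-format record per filesystem; column 5 is e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# usage_allows_deploy USED_PERCENT [LIMIT]  -> status 0 when usage is below the limit
usage_allows_deploy() {
    local used="$1" limit="${2:-85}"
    [ "$used" -lt "$limit" ]
}

preflight_disk() {
    local path="$1" used
    used="$(disk_usage_percent "$path")" || return 1
    if ! usage_allows_deploy "$used"; then
        echo "Aborting: $path is ${used}% full (limit 85%)" >&2
        return 1
    fi
}
```

Keeping the comparison in its own function lets the 90% and 50% test cases run without needing a disk that is actually full.
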
### Week 8: Remaining P1 Backend

**R14 — Fix .parse().unwrap() in session rate limiting**
- File: `session.rs:665,676,688`
- Replace `.parse().unwrap()` with `.parse().context("...")?`
- Tests: invalid IP handling works gracefully

**R15 — Fix 7 unwrap/expect in mesh/protocol.rs**
- File: `mesh/protocol.rs:582,592,614,649,679,713,728`
- Replace all with `?` operator + proper error types
- Tests: protocol parsing with malformed data returns error, not panic

**R27 — Add timeouts to mesh Bitcoin RPC calls**
- File: `mesh/mod.rs:624,649,663`
- Add `tokio::time::timeout(Duration::from_secs(10), ...)` to all Bitcoin RPC calls
- Tests: RPC timeout returns error after 10s

**R34 — Tor rotation transition**
- (Covered by I3 above)

---

## MONTH 3: PRODUCTION POLISH (Weeks 9–12)

> Fix every remaining P2 issue — unwraps, hardcoded values, frontend quality, resilience.

### Week 9: Remaining Backend Unwraps + Dead Code

**R13 — main.rs .expect() → .context()**
- Replace 2 `.expect()` calls with `.context("...")?` and proper startup error handling

**R16 — identity.rs .expect() → safe handling**
- Replace 2 `.expect()` in crypto operations with result propagation

**R17+R18 — helpers unwraps**
- Fix 10 `.unwrap()` calls in `helpers/lib.rs` and `helpers/rsync.rs`
- Replace with `?` operator or `.context()`

**R19 — js-engine unwraps**
- Fix 2 `.unwrap()` in `js-engine/lib.rs:130,249`

**R20+R21 — Dead code elimination**
- Remove all 14 `#[allow(dead_code)]` in `mesh/mod.rs`. Either use the fields or delete them.
- Same for `lnd.rs`, `data_manager.rs`, `dev_orchestrator.rs`
- Tests: `cargo clippy` zero warnings, `cargo test` passes

### Week 10: Hardcoded Values → Constants

**R22 — Bitcoin RPC URL constant**
- Create `const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/";` in a shared constants module
- Use across `bitcoin.rs`, `mesh/mod.rs`, `mesh/listener.rs`
- Tests: all Bitcoin RPC calls still work

**R23 — DWN health URL constant**
**R24 — Update manifest URL constant**
**R25 — DNS-over-HTTPS URLs → constants array**
**R26 — DWN protocol URIs → constants**
- Centralize all hardcoded URLs/URIs into `core/archipelago/src/constants.rs`
- Tests: all modules reference constants, no hardcoded strings remain

**R28 — LND proxy timeouts**
- Audit all 68 `.send()` calls in `api/rpc/lnd.rs`. Ensure each has explicit timeout.
- Tests: LND proxy call with unresponsive LND → timeout error, not hang

**R29 — DWN health check timeout**
- Add timeout to `dwn_sync.rs:76` health check

**R30-R33 — Resolve all TODOs**
- Either implement the TODO or remove the dead code path. Per project rules: no TODO/FIXME in commits.

### Week 11: Frontend P2 Fixes

**F8 — WebSocket reconnection race**
- Add `isReconnecting` flag. Skip if already reconnecting.
- Tests: rapid close events → only one reconnect attempt

**F9 — WebSocket parse error handling**
- Count consecutive parse errors. After 3, force reconnect.
- Tests: 3 malformed messages → reconnect triggered; single bad message → logged only

**F10 — Stale connection detection tuning**
- Require a pong response within 30s of each ping. Don't close valid connections that are simply quiet.
- Tests: quiet but healthy connection → stays open; no pong for 30s → reconnects

**F11 — RPC client backoff reduction**
- Reduce default timeout from 30s to 15s. Add jitter to backoff. Cap total retry time at 20s.
- Tests: server outage → user sees error within 20s, not 40s

**F12 — Code splitting**
- Lazy-load all routes: `() => import('./views/Web5.vue')`
- Add manual chunks in vite.config.ts for vendor/api
- Tests: build produces multiple chunks, initial bundle < 200KB gzipped

**F13 — DOMPurify on QR v-html**
- Add DOMPurify.sanitize() to QR SVG before v-html rendering
- Tests: XSS payload in QR content → sanitized

### Week 12: Frontend P2 Continued + Performance

**F14 — Goals computed memoization**
- Replace O(n) alias lookup with Map. Add deep equality check.
- Tests: goalStatuses computed runs in <1ms with 100 apps

**F15 — localStorage error handling**
- Wrap all localStorage.setItem in try/catch. Show toast on quota exceeded.
- Tests: full localStorage → toast shown, app continues

**F16 — FileBrowser auth consolidation**
- Use cookie-only auth. Remove in-memory token.
- Tests: login persists across page reload, logout clears cookie

**F17 — CSRF token parsing robustness**
- Add header fallback for CSRF token. Handle edge cases.
- Tests: missing cookie → falls back to header, both missing → error

**F22 — CSS backdrop-filter mobile performance**
- Add media query: reduce blur to 8px on mobile. Remove backdrop-filter from non-visible elements.
- Tests: mobile Lighthouse performance score > 80

---

## MONTH 4-5: BACKEND ARCHITECTURE (Weeks 13–20)

> Split every Rust god file. Target: no file > 500 lines.

### Week 13–14: Split package.rs (1,795 lines)

```
api/rpc/package/
├── mod.rs — Re-exports (~50 lines)
├── config.rs — get_app_config(), get_app_capabilities(), needs_archy_net()
├── lifecycle.rs — install, start, stop, restart, uninstall
├── validation.rs — Input validation, dependency checking, image validation
└── progress.rs — Progress streaming, install status tracking
```

Pre-split tests: test every `get_app_config()` variant, validation path, lifecycle transition
Post-split: all RPC calls return identical responses, `cargo test` passes

### Week 15–16: Split mesh/listener.rs (1,799 lines)

```
mesh/listener/
├── mod.rs — Re-exports + spawn_mesh_listener()
├── session.rs — run_mesh_session() loop
├── frames.rs — handle_frame() dispatcher
├── identity.rs — handle_identity_received(), handle_typed_message()
├── sync.rs — sync_queued_messages(), store_typed_message()
└── bitcoin.rs — Bitcoin relay operations, RPC calls
```

### Week 17–18: Split rpc/mod.rs (1,092 lines) + lnd.rs (1,068 lines)

**rpc/mod.rs** → `dispatcher.rs` (method routing), `middleware.rs` (CSRF/session/rate-limit), `response.rs` (response building)

**lnd.rs** → `lnd/wallet.rs`, `lnd/channels.rs`, `lnd/info.rs`, `lnd/payments.rs`

### Week 19–20: Split monitoring (993), handler (911), mesh (865)

Split each into sub-modules. Target: no file > 500 lines.
Apply the same pre-split tests and post-split verification as for the earlier splits.

---

## MONTH 6-8: FRONTEND ARCHITECTURE (Weeks 21–32)

> Split every Vue god component. Target: no component > 500 lines.

### Week 21–22: Split Web5.vue (3,940 lines → 8 sub-views)

```
views/web5/
├── Web5.vue — Router shell (~150 lines)
├── Web5Identity.vue — DID management
├── Web5Wallet.vue — Wallet operations
├── Web5Nostr.vue — Nostr relays/profiles
├── Web5Credentials.vue — Verifiable Credentials
├── Web5Peers.vue — P2P federation nodes
├── Web5Storage.vue — DWN storage/explorer
├── Web5Goals.vue — Goals/voting
└── Web5Marketplace.vue — Decentralized marketplace
```

Add nested routes. Component tests for each section. All sections render identically.

### Week 23–24: Split Mesh.vue (2,106) + Dashboard.vue (1,819)

**Mesh.vue** → `MeshRadio.vue`, `MeshChat.vue`, `MeshNetwork.vue`, `MeshFederation.vue`
**Dashboard.vue** → `DashboardHome.vue`, `DashboardApps.vue`, `DashboardSystem.vue`

### Week 25–26: Split Settings.vue (1,792) + Server.vue (1,132)

**Settings.vue** → `SettingsAccount.vue`, `SettingsSystem.vue`, `SettingsNetwork.vue`, `SettingsAppearance.vue`
**Server.vue** → `ServerOverview.vue`, `ServerContainers.vue`, `ServerLogs.vue`

### Week 27–28: Split Marketplace.vue (1,293) + AppDetails.vue (1,036) + Home.vue (1,059)

Each into 3-4 focused sub-components.

### Week 29–30: Decompose useAppStore (324 lines, 16 methods)

```
stores/
├── app.ts — Thin re-export for backward compat (~50 lines)
├── auth.ts — Login, logout, session, password, TOTP
├── server.ts — Server info, system stats, reboot/shutdown
├── realtime.ts — WebSocket connection, subscriptions, heartbeat
└── packages.ts — Package install/uninstall, marketplace data
```

Tests: every existing import of `useAppStore` still works. State transitions identical.

### Week 31–32: Remaining frontend P3 issues

**F18** — aiPermissions runtime validation
**F19** — Track AppSession timeout
**F20** — Dashboard aria-current
**F21** — Debounce search + memoize
**F23** — Branded types for DID operations
**F24** — Fix checkInterval leak

---

## MONTH 9-10: SCRIPT ARCHITECTURE + ISO (Weeks 33–40)
|
||
|
||
> Split every monolithic script. Target: no script > 400 lines.
|
||
|
||
### Week 33–34: Create shared script library
|
||
|
||
```
|
||
scripts/lib/
|
||
├── common.sh — Colors, logging, error handling, SSH helpers
|
||
├── health.sh — Health check polling, container status
|
||
├── deploy-utils.sh — Rsync, file sync, backup/restore
|
||
├── container.sh — Podman helpers, image management, mem_limit()
|
||
└── network.sh — IP validation, port checking
|
||
```
|
||
|
||
Tests: each library function tested in `scripts/tests/`
|
||
|
||
### Week 35–36: Split deploy-to-target.sh (1,728 lines)

```
scripts/
├── deploy-to-target.sh — Orchestrator + arg parsing (~300 lines)
├── deploy/
│   ├── frontend.sh — Build + sync frontend
│   ├── backend.sh — Build + sync binary
│   ├── configs.sh — Sync nginx, systemd, scripts
│   ├── containers.sh — Container creation/update
│   ├── verify.sh — Post-deploy health checks
│   └── rollback.sh — Rollback on failure
```

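One way the orchestrator could tie the step files together is a loop that runs each step in order and calls the rollback step on the first failure. The sketch below is an assumption about how that might look: `run_deploy` and the step function names in the usage note are hypothetical, and each step file is assumed to expose a single function.

```bash
#!/usr/bin/env bash
# Hypothetical core of the deploy orchestrator.
# run_deploy and the step function names are illustrative assumptions.

# run_deploy <rollback_fn> <step_fn...>
# Runs each step in order. On the first failing step it calls the
# rollback function once, skips the remaining steps, and returns 1.
run_deploy() {
  local rollback_fn="$1"; shift
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! "$step"; then
      echo "step '${step}' failed, rolling back" >&2
      "$rollback_fn" || echo "rollback itself failed" >&2
      return 1
    fi
  done
  return 0
}
```

With step functions sourced from `deploy/*.sh`, the orchestrator body could then reduce to something like `run_deploy deploy_rollback deploy_frontend deploy_backend deploy_configs deploy_containers deploy_verify`. Passing the steps as arguments keeps the ordering visible in one place and lets tests substitute stub functions.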
### Week 37–38: Split ISO build (1,850 lines) + first-boot (855 lines)

**build-auto-installer-iso.sh** → `build/capture-images.sh`, `build/create-rootfs.sh`, `build/install-packages.sh`, `build/bundle-configs.sh`, `build/package-iso.sh`

**first-boot-containers.sh** → `first-boot/databases.sh`, `first-boot/bitcoin.sh`, `first-boot/lightning.sh`, `first-boot/apps.sh`, `first-boot/networking.sh`

### Week 39–40: ISO Reproducibility + Integration Tests

**S16 — Make ISO builds reproducible**
- Create `image-versions.env` with pinned digests for every container image
- ISO build sources this file, never pulls `:latest`
- Build manifest records exactly what shipped
- Tests: two consecutive ISO builds produce identical image sets

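The pinning rule for `image-versions.env` can be enforced mechanically before the ISO build sources the file. The following is a minimal sketch under stated assumptions: the file is taken to contain plain `KEY=value` lines (optionally double-quoted), and `is_pinned_ref` and `check_image_versions` are hypothetical helper names. A build script could call `check_image_versions image-versions.env || exit 1` as its first step.

```bash
#!/usr/bin/env bash
# Hypothetical validator for image-versions.env.
# Assumes KEY=value lines; helper names are illustrative.

# Return 0 if $1 is pinned by digest: <name>@sha256:<64 lowercase hex chars>.
is_pinned_ref() {
  local re='^[^@[:space:]]+@sha256:[0-9a-f]{64}$'
  [[ "$1" =~ $re ]]
}

# Check every assignment in the env file. Prints each unpinned entry to
# stderr and returns 1 if any entry is not pinned by digest.
check_image_versions() {
  local file="$1" line value bad=0
  while IFS= read -r line || [[ -n "$line" ]]; do
    # Skip blank lines and comments.
    [[ -z "$line" || "$line" == \#* ]] && continue
    value="${line#*=}"
    # Strip one pair of surrounding double quotes, if present.
    value="${value%\"}"
    value="${value#\"}"
    if ! is_pinned_ref "$value"; then
      echo "not pinned by digest: ${line}" >&2
      bad=1
    fi
  done < "$file"
  return "$bad"
}
```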
**E2E smoke test script**
```bash
# scripts/smoke-test.sh — Run against .198
# 1. curl /health → OK
# 2. Login → get session
# 3. Get server info → valid JSON
# 4. List containers → all healthy
# 5. Check every /app/* proxy → responds
# 6. Check Tor hidden service → resolves
# 7. Check WebSocket upgrade → 101
# Exit 0 only if all pass
```

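The "exit 0 only if all pass" rule is easiest to keep honest with a small check runner that records every failure instead of stopping at the first one. This is a sketch only: `run_check`, `smoke_result`, and the `FAILURES` counter are hypothetical names, and the individual checks (login, container listing, WebSocket upgrade) are not shown because their exact requests depend on the API. The commented line at the end shows how step 1 might be wired in.

```bash
#!/usr/bin/env bash
# Hypothetical check runner for scripts/smoke-test.sh.
# Names are illustrative assumptions, not existing project code.

FAILURES=0

# run_check <name> <command...>
# Runs the command with output suppressed and records PASS or FAIL.
# Never aborts, so every check is reported in a single run.
run_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  ${name}"
  else
    echo "FAIL  ${name}"
    FAILURES=$((FAILURES + 1))
  fi
}

# Return 0 only when no check has failed so far.
smoke_result() {
  [[ "$FAILURES" -eq 0 ]]
}

# Example wiring (commented out so sourcing this file has no side effects):
#   run_check "health" curl -fsS --max-time 5 "http://192.168.1.198/health"
#   smoke_result || exit 1
```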
---

## MONTH 11: INTEGRATION TESTS (Weeks 41–44)

> Comprehensive test suites that prove everything works.

### Week 41–42: Backend Integration Tests

```
core/archipelago/tests/
├── test_auth_flow.rs — Login → session → CSRF → auth request → logout
├── test_container_lifecycle.rs — Install → start → health → stop → uninstall
├── test_federation.rs — Generate invite → join → sync → verify
├── test_rpc_validation.rs — Every endpoint with invalid input → proper error
├── test_session_persist.rs — Create session → restart → session survives
├── test_rate_limiting.rs — Flood → 429 → wait → allowed
├── test_backup_restore.rs — Create → verify → restore → validate
├── test_health_endpoint.rs — Healthy → degraded → recovery
```

Target: 25+ backend integration tests passing

### Week 43–44: Frontend Integration Tests

```
neode-ui/src/__tests__/integration/
├── auth-flow.spec.ts — Login → dashboard → timeout → redirect
├── app-lifecycle.spec.ts — Marketplace → install → progress → launch → uninstall
├── websocket.spec.ts — Connect → update → disconnect → reconnect → state consistent
├── settings-flow.spec.ts — Change password → re-login → 2FA setup → verify
├── spotlight.spec.ts — Open → search → navigate → close
├── mesh-chat.spec.ts — Connect → send → receive → disconnect
├── error-handling.spec.ts — Network error → toast → retry → success
├── code-splitting.spec.ts — Route navigation → chunks loaded lazily
```

Target: 20+ frontend integration tests passing

---

## MONTH 12: TYPE SYNC + CI/CD PLAN (Weeks 45–48)

### Week 45–46: Rust↔TypeScript Type Sync

**Approach**: `ts-rs` crate to auto-generate TypeScript types from Rust structs

1. Add `ts-rs` to `core/models/Cargo.toml`
2. Add `#[derive(TS)]` to all API request/response types
3. Build script generates `neode-ui/src/types/generated.ts`
4. Replace manual types in `types/api.ts` with imports from generated file
5. Verification: regenerate → diff → must be zero (types committed)

Tests: frontend type-check passes with generated types, manual api.ts reduced to non-API types

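The step 5 check (regenerate, diff, expect zero) could be scripted along the following lines. This sketch assumes the build script can write a fresh copy of the generated types to a temporary path, which is then compared with the committed file; `types_in_sync` is a hypothetical helper name and the regeneration command itself is deliberately left out.

```bash
#!/usr/bin/env bash
# Hypothetical drift check for the generated TypeScript types.
# How the regenerated file is produced is outside this sketch.

# types_in_sync <committed_file> <regenerated_file>
# Returns 0 when the files are byte-identical. Otherwise prints a
# unified diff to stderr and returns 1.
types_in_sync() {
  local committed="$1" regenerated="$2"
  if cmp -s "$committed" "$regenerated"; then
    return 0
  fi
  echo "generated types are stale; diff follows:" >&2
  # diff exits 1 when files differ; that is expected here.
  diff -u "$committed" "$regenerated" >&2 || true
  return 1
}
```

A byte-for-byte comparison is stricter than a type-level comparison, so it also catches reordered or reformatted output. That is the intended behavior if the generated file is committed verbatim.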
### Week 47–48: CI/CD Planning (Document Only — Execute Later)

> This section is the PLAN for CI/CD. Do not execute during this phase. Document everything needed so it can be implemented in a future sprint.

**CI Pipeline Design** (`.github/workflows/ci.yml`):

```yaml
# Triggers: push to main, all PRs
# Jobs:
#   rust-checks (Linux runner):
#     - cargo clippy --all-targets --all-features (zero warnings gate)
#     - cargo fmt --all -- --check (formatting gate)
#     - cargo test --all-features (all tests gate)
#
#   frontend-checks (Node 20):
#     - npm run type-check (TypeScript strictness gate)
#     - npm run lint (ESLint gate)
#     - npm test (Vitest suite gate)
#
#   integration (Linux runner, optional):
#     - scripts/smoke-test.sh against staging
#
# Merge policy: all checks must pass before merge
# Branch protection: require PR, require checks, no force push to main
```

**Release Pipeline Design** (`.github/workflows/release.yml`):
```yaml
# Triggers: tag push (v*)
# Jobs:
#   build-linux-binary:
#     - Cross-compile Rust for x86_64 + ARM64
#   build-frontend:
#     - npm run build
#   build-iso:
#     - SSH to build server, run ISO build
#     - Upload ISO as release asset
#   smoke-test:
#     - Boot ISO in QEMU
#     - Run smoke-test.sh
#     - Gate release on pass
```

**Pre-requisites to implement**:
- [ ] GitHub Actions runner with Rust toolchain + cross-compilation
- [ ] Node.js 20 runner for frontend
- [ ] SSH key for build server accessible from CI
- [ ] Branch protection rules configured
- [ ] Image digest manifest for reproducible ISO builds
- [ ] QEMU-based ISO verification script

**Estimated implementation time**: 2 weeks when ready to execute

---

## VERIFICATION PROTOCOL (Every Week)

1. `cargo clippy --all-targets --all-features` — zero warnings
2. `cargo fmt --all`
3. `cargo test --all-features` — all pass
4. `cd neode-ui && npm run type-check` — zero errors
5. `cd neode-ui && npm test` — all pass
6. `./scripts/deploy-to-target.sh --target 192.168.1.198` — **ONLY .198**
7. `curl http://192.168.1.198/health` — returns OK with service status
8. Navigate all affected views in browser — identical behavior
9. Atomic commit: `refactor: <description>` or `fix: <description>`

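The "ONLY .198" rule in step 6 depends on the operator typing the right address, so a guard at the top of the deploy script could make the constraint mechanical. The sketch below is an assumption about how such a guard might look: `require_allowed_target` and `ALLOWED_TARGET` are hypothetical names, and the orchestrator would be expected to call `require_allowed_target "$target" || exit 1` right after argument parsing.

```bash
#!/usr/bin/env bash
# Hypothetical deploy-target guard for the beta freeze.
# Names are illustrative assumptions, not existing project code.

# The single host that deploys are allowed to reach during this plan.
ALLOWED_TARGET="192.168.1.198"

# require_allowed_target <target>
# Returns 0 only for the allowed host. Any other value, including an
# empty string, is refused with a message on stderr.
require_allowed_target() {
  local target="$1"
  if [[ "$target" != "$ALLOWED_TARGET" ]]; then
    echo "refusing to deploy to '${target}': only ${ALLOWED_TARGET} is allowed" >&2
    return 1
  fi
  return 0
}
```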
---

## EXIT CRITERIA (Month 12 Complete)

### Reliability (Zero Tolerance)
- [ ] Health endpoint returns real service status
- [ ] All async operations have bounded timeouts
- [ ] Zero blocking I/O in async context (no std::fs in async functions)
- [ ] Zero .unwrap()/.expect() in production code
- [ ] All rate limiters have cleanup tasks
- [ ] Backup restore uses staging + atomic swap + rollback
- [ ] All 30 containers have health checks + memory limits
- [ ] All container images pinned to specific versions
- [ ] Nginx unauthenticated endpoints protected (timeout + rate limit + body size)
- [ ] Systemd service has resource limits
- [ ] Tor rotation preserves old address during transition
- [ ] Deploy has locking + disk check + rollback
- [ ] Zero `sudo podman` in any script
- [ ] Zero `:latest` image tags anywhere
- [ ] Zero silent error masking without logging

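The two "zero occurrences" items for `sudo podman` and `:latest` are text-level rules, so they could be checked with a recursive grep. This is a rough sketch: `find_forbidden` is a hypothetical helper, and the patterns shown in the usage note are assumptions that may need tuning, since a plain text match can also hit comments or documentation.

```bash
#!/usr/bin/env bash
# Hypothetical audit helper for forbidden patterns in scripts and configs.

# find_forbidden <extended-regex> <directory>
# Prints every matching line as file:line:text.
# Returns 1 if anything matched, 0 if the tree is clean.
find_forbidden() {
  local pattern="$1" dir="$2"
  if grep -rnE -- "$pattern" "$dir"; then
    return 1
  fi
  return 0
}
```

Usage could look like `find_forbidden 'sudo[[:space:]]+podman' scripts/` and `find_forbidden ':latest([[:space:]]|$)' scripts/`, with a non-zero return failing the weekly verification run.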
### Frontend (Zero Tolerance)
- [ ] Global error handler catches and displays all errors
- [ ] WebSocket: single subscription, reconnect refreshes state, bounded retries
- [ ] All timers/listeners cleaned up on unmount
- [ ] Code splitting: initial bundle < 200KB gzipped
- [ ] v-html always uses DOMPurify
- [ ] All localStorage operations wrapped in try/catch

### Architecture (Target: File Size Limits)
- [ ] No Rust file > 500 lines (excluding generated code)
- [ ] No Vue component > 500 lines
- [ ] No shell script > 400 lines
- [ ] No Pinia store has more than 1 responsibility
- [ ] All hardcoded URLs/ports extracted to constants
- [ ] Shared script library eliminates duplication
- [ ] TypeScript types auto-generated from Rust structs

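The file size limits above are simple to measure, so they could be checked by a script rather than by inspection. The following is a minimal sketch; `files_over_limit` is a hypothetical helper name, and selecting which files to pass in (for example, excluding generated code) is left to the caller.

```bash
#!/usr/bin/env bash
# Hypothetical line-count gate for the architecture exit criteria.

# files_over_limit <max_lines> <file...>
# Prints one line per file that exceeds max_lines.
# Returns 1 if any file is over the limit, 0 otherwise.
files_over_limit() {
  local max="$1"; shift
  local file lines over=0
  for file in "$@"; do
    # Arithmetic expansion strips the padding some wc builds emit.
    lines=$(( $(wc -l < "$file") ))
    if (( lines > max )); then
      echo "${file}: ${lines} lines (limit ${max})"
      over=1
    fi
  done
  return "$over"
}
```

For the shell script limit this might be invoked as `files_over_limit 400 scripts/*.sh scripts/deploy/*.sh`, and with 500 for Rust files and Vue components.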
### Testing
- [ ] 25+ backend integration tests passing
- [ ] 20+ frontend integration tests passing
- [ ] E2E smoke test script passes on .198
- [ ] ISO builds are reproducible (pinned digests)

### CI/CD (Planned, Not Executed)
- [ ] CI pipeline design documented
- [ ] Release pipeline design documented
- [ ] Pre-requisites list complete
- [ ] Ready for 2-week implementation sprint

### Zero Behavior Changes
Every feature works identically. Every existing test passes. Every user flow unchanged.