# Archipelago: Production Excellence Plan
**Duration**: 12 months (48 weeks)
**Goal**: Code so good no developer could question any decision. Apple-level reliability. Every failure visible and recoverable. Every operation bounded. Every line justified.
**Audited**: 2026-03-20 — 122 Rust files, 38 Vue views, 180+ frontend files, 80+ shell scripts
## CONSTRAINTS
- **DEPLOY ONLY TO .198** — Never .228. All verification on .198.
- **BETA FREEZE** — Behavior-preserving only. No new features/UI/endpoints.
- **Tests before every refactor** — Capture current behavior first. Tests must pass unchanged after.
- **Atomic commits** — One logical change per commit. Every step compiles + passes tests.
```bash
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198
```
---
## COMPLETE ISSUE REGISTRY
### Backend Rust — 122 files audited
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| R1 | Health RPC endpoint has no handler — returns "Unknown method" | `api/rpc/mod.rs` | P0 |
| R2 | Nostr client.connect() hangs indefinitely (4 calls, no timeout) | `nostr_handshake.rs:124,161,262,282` | P0 |
| R3 | Backup restore extracts directly to live dir — no atomic rollback | `backup/full.rs:122-149` | P0 |
| R4 | Rate limiter cleanup() never spawned — HashMap grows forever | `session.rs:566-579` | P1 |
| R5 | Login rate limiter same issue — entries never evicted | `session.rs:452-472` | P1 |
| R6 | Blocking std::fs in async — session.rs (6 calls) | `session.rs:77,128,370,413,423,425` | P1 |
| R7 | Blocking std::fs in async — docker_packages.rs | `docker_packages.rs:561,573` | P1 |
| R8 | Blocking std::fs in async — port_allocator.rs | `port_allocator.rs:59,73,77` | P1 |
| R9 | Blocking std::fs in async — peers.rs, node_message.rs | `peers.rs:30`, `node_message.rs:65` | P1 |
| R10 | Blocking std::fs in async — identity.rs, identity_manager.rs | `identity.rs:50`, `identity_manager.rs:164` | P1 |
| R11 | Blocking std::fs in async — nostr_discovery.rs | `nostr_discovery.rs:55` | P1 |
| R12 | Sync TCP I/O in async context — electrs_status.rs | `electrs_status.rs:5,40,78,81` | P1 |
| R13 | .expect() in main.rs startup | `main.rs:124,159` | P2 |
| R14 | .parse().unwrap() in session.rs rate limiting | `session.rs:665,676,688` | P1 |
| R15 | 7 .unwrap()/.expect() in mesh/protocol.rs | `protocol.rs:582,592,614,649,679,713,728` | P1 |
| R16 | .expect() in identity.rs crypto | `identity.rs:114,119` | P2 |
| R17 | .unwrap() in helpers/lib.rs (5 calls) | `helpers/lib.rs:167,172,180,233,253` | P2 |
| R18 | .unwrap() in helpers/rsync.rs (5 calls) | `rsync.rs:196,199,202,210,220` | P2 |
| R19 | .unwrap() in js-engine/lib.rs | `js-engine/lib.rs:130,249` | P2 |
| R20 | 14 #[allow(dead_code)] suppressions in mesh/mod.rs | `mesh/mod.rs:7-25` | P2 |
| R21 | Dead code in lnd.rs, data_manager.rs, dev_orchestrator.rs | Multiple | P2 |
| R22 | Bitcoin RPC URL hardcoded in 4+ files | `bitcoin.rs:89`, `mesh/mod.rs:624,649,663`, `listener.rs:1509+` | P2 |
| R23 | DWN health URL hardcoded | `dwn_sync.rs:76` | P2 |
| R24 | Update manifest URL hardcoded | `update.rs:11` | P3 |
| R25 | DNS-over-HTTPS URLs hardcoded (4 providers) | `network/dns.rs:98,102,106,110` | P3 |
| R26 | DWN protocol URIs hardcoded in server.rs | `server.rs:453-456` | P3 |
| R27 | Missing timeouts on mesh Bitcoin RPC calls | `mesh/mod.rs:624,649,663` | P1 |
| R28 | Missing timeouts on LND proxy calls (68 .send() calls) | `api/rpc/lnd.rs` | P2 |
| R29 | Missing timeout on DWN health check | `dwn_sync.rs:76` | P2 |
| R30 | TODO: track last-seen timestamp | `handshake.rs:77` | P3 |
| R31 | TODO: lnd.lookupinvoice RPC endpoint | `marketplace.rs:183` | P3 |
| R32 | TODO: trigger auto-restart or alert | `container/health_monitor.rs:140` | P3 |
| R33 | TODO: configure Podman to use AppArmor profile | `security/container_policies.rs:68` | P3 |
| R34 | Tor rotation deletes old .onion immediately — no transition | `api/rpc/tor.rs:184-240` | P1 |
| R35 | package.rs god file — 1,795 lines | `api/rpc/package.rs` | P2 |
| R36 | mesh/listener.rs god file — 1,799 lines | `mesh/listener.rs` | P2 |
| R37 | rpc/mod.rs god file — 1,092 lines | `api/rpc/mod.rs` | P2 |
| R38 | lnd.rs god file — 1,068 lines | `api/rpc/lnd.rs` | P2 |
| R39 | monitoring/mod.rs — 993 lines | `monitoring/mod.rs` | P3 |
| R40 | api/handler.rs — 911 lines | `api/handler.rs` | P3 |
| R41 | 30+ functions exceed 50 lines across codebase | Multiple | P3 |
### Frontend — 180+ files audited
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| F1 | WebSocket subscription registered multiple times — race condition | `stores/app.ts:88-134` | P0 |
| F2 | Unprotected concurrent mesh state mutations | `stores/mesh.ts:249-268,294-324` | P0 |
| F3 | No global Vue error handler — white screen on error | `main.ts` | P0 |
| F4 | Stale data after WebSocket reconnect — no full refresh | `stores/app.ts:88-163` | P1 |
| F5 | Message polling timer never stopped after logout | `composables/useMessageToast.ts:60` | P1 |
| F6 | AppLauncher NIP-07 message listener leak on close | `stores/appLauncher.ts:295-301` | P1 |
| F7 | Audio player listeners stack — never cleaned up | `composables/useAudioPlayer.ts:1-91` | P1 |
| F8 | WebSocket reconnection race — parallel connect() attempts | `api/websocket.ts:212-238` | P2 |
| F9 | WebSocket parse error silently caught — stale UI forever | `api/websocket.ts:164-172` | P2 |
| F10 | WebSocket stale connection detection too aggressive (5min) | `api/websocket.ts:284-299` | P2 |
| F11 | RPC client backoff + timeout = 40s max wait | `api/rpc-client.ts:31-117` | P2 |
| F12 | No code splitting — monolithic bundle | `vite.config.ts` | P2 |
| F13 | v-html on QR code without DOMPurify | `views/Settings.vue:441` | P2 |
| F14 | Goals store O(n) alias lookup on every computed | `stores/goals.ts:16-20,38-89` | P2 |
| F15 | localStorage save without try/catch (5+ instances) | `stores/goals.ts:34-36` + others | P2 |
| F16 | FileBrowser auth token duality — memory + cookie | `api/filebrowser-client.ts:39,50-68` | P2 |
| F17 | CSRF token cookie parsing brittle — regex only | `api/rpc-client.ts:18-21` | P2 |
| F18 | aiPermissions.ts Set uses unsafe type assertion | `stores/aiPermissions.ts:91-103` | P3 |
| F19 | Untracked setTimeout in AppSession — fires after unmount | `views/AppSession.vue:507` | P3 |
| F20 | Dashboard navigation missing aria-current="page" | `views/Dashboard.vue` | P3 |
| F21 | Search performance — string re-lowercasing every keystroke | `views/Apps.vue:510-537` | P3 |
| F22 | 30+ backdrop-filter blur elements — GPU overload on mobile | `style.css` | P3 |
| F23 | Record<string, unknown> on sensitive DID operations | `types/api.ts` + `rpc-client.ts` | P3 |
| F24 | checkInterval timer leak on connect race | `api/websocket.ts:82-96` | P3 |
| F25 | Web5.vue god component — 3,940 lines | `views/Web5.vue` | P2 |
| F26 | Mesh.vue — 2,106 lines | `views/Mesh.vue` | P2 |
| F27 | Dashboard.vue — 1,819 lines | `views/Dashboard.vue` | P2 |
| F28 | Settings.vue — 1,792 lines | `views/Settings.vue` | P2 |
| F29 | Marketplace.vue — 1,293 lines | `views/Marketplace.vue` | P3 |
| F30 | Server.vue — 1,132 lines | `views/Server.vue` | P3 |
| F31 | Home.vue — 1,059 lines | `views/Home.vue` | P3 |
| F32 | AppDetails.vue — 1,036 lines | `views/AppDetails.vue` | P3 |
| F33 | useAppStore god store — 324 lines, 16 methods, 8+ responsibilities | `stores/app.ts` | P2 |
### Shell Scripts — 80+ files audited
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| S1 | 60+ instances of `sudo podman` — should be rootless | `fix-indeedhub(28)`, `deploy-bitcoin(11)`, `deploy-tailscale(2+)` | P0 |
| S2 | Zero container health checks in first-boot (30 containers) | `first-boot-containers.sh` | P0 |
| S3 | 50+ `:latest` image tags across all scripts | `first-boot(15)`, `deploy(11)`, `tailscale(18)`, `iso(7)` | P1 |
| S4 | No `set -e` in first-boot — silent container failures | `first-boot-containers.sh:1-9` | P1 |
| S5 | `eval "$DB_PASSWORDS"` — code injection risk | `deploy-to-target.sh:940` | P1 |
| S6 | No deploy locking — concurrent deploys corrupt state | `deploy-to-target.sh` | P1 |
| S7 | No deploy rollback — failed deploy leaves broken system | `deploy-to-target.sh` | P1 |
| S8 | sshpass usage in trust-archipelago-cert.sh | `trust-archipelago-cert.sh:23-26` | P1 |
| S9 | MariaDB password in command line — visible in ps | `first-boot-containers.sh:285` | P1 |
| S10 | 80+ instances of `2>/dev/null \|\| true` masking errors | `deploy-to-target.sh` | P2 |
| S11 | No trap cleanup for temp files | Multiple scripts | P2 |
| S12 | Unquoted variables (word splitting risk) | Multiple scripts | P2 |
| S13 | Hardcoded IPs in 6+ scripts | `deploy-to-target.sh:26`, `deploy-tailscale.sh:26`, etc. | P2 |
| S14 | No input validation on deploy targets | `deploy-tailscale.sh` | P2 |
| S15 | Missing memory limits on some containers in deploy | `deploy-to-target.sh:842-880` | P2 |
| S16 | ISO build not reproducible — dynamic image capture + :latest | `build-auto-installer-iso.sh:500-594` | P2 |
| S17 | No disk space pre-flight in deploy | `deploy-to-target.sh` | P2 |
| S18 | deploy-to-target.sh — 1,728 lines monolith | `deploy-to-target.sh` | P3 |
| S19 | build-auto-installer-iso.sh — 1,850 lines monolith | `build-auto-installer-iso.sh` | P3 |
| S20 | first-boot-containers.sh — 855 lines monolith | `first-boot-containers.sh` | P3 |
| S21 | No shared script library — duplicated functions | `scripts/` | P3 |
### Infrastructure
| ID | Issue | File(s) | Severity |
|----|-------|---------|----------|
| I1 | Nginx: /archipelago/, /content, /dwn missing timeout+rate-limit+body-size | `nginx-archipelago.conf:116-180` | P0 |
| I2 | Systemd: no MemoryMax, LimitNOFILE, TasksMax | `archipelago.service` | P1 |
| I3 | Tor rotation kills old address immediately — federation downtime | `api/rpc/tor.rs:184-240` | P1 |
---
## MONTH 1: CRASH PREVENTION (Weeks 1-4)
> Fix every issue that can crash the system, hang indefinitely, or lose data.
### Week 1: P0 Backend — Things That Hang or Lose Data
**R1 — Health endpoint handler**
- File: `core/archipelago/src/api/rpc/mod.rs`
- Add handler for `"health"` method that checks: crash recovery complete, Podman socket responsive, session store loaded
- Tests: health returns JSON status, degraded when Podman unreachable, degraded during recovery
- Verify: `curl http://192.168.1.198/rpc/v1 -d '{"method":"health"}'` returns real status
**R2 — Nostr connect timeout**
- File: `core/archipelago/src/nostr_handshake.rs` lines 124, 161, 262, 282
- Wrap all 4 `client.connect().await` in `tokio::time::timeout(Duration::from_secs(10), ...)`
- Tests: connect timeout returns Err after 10s, successful connect within timeout works
**R3 — Backup restore atomic rollback**
- File: `core/archipelago/src/backup/full.rs` lines 122-149
- Rewrite: decrypt → extract to staging dir → validate required files → atomic rename → rollback on failure
- Tests: valid backup restores, corrupt backup fails without touching live data, partial extraction rolls back, disk space check fails early
**I1 — Nginx unauthenticated endpoint protection**
- File: `image-recipe/configs/nginx-archipelago.conf` lines 116-180
- Add to `/archipelago/`, `/content`, `/dwn`:
  - `limit_req zone=peer burst=20 nodelay;`
  - `client_max_body_size 10m;`
  - `proxy_connect_timeout 30s; proxy_read_timeout 60s; proxy_send_timeout 30s;`
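- Tests: >10MB payload → 413, slow client → timeout, burst 30 → 429 after 20 (needs `limit_req_status 429;` unless already set, since nginx rejects rate-limited requests with 503 by default)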
### Week 2: P0 Frontend + Scripts — Things That Break UI or Containers
**F1 — WebSocket subscription race condition**
- File: `neode-ui/src/stores/app.ts` lines 88-134
- Fix: Return unsubscribe function from `wsClient.subscribe()`, call it before re-subscribing. Use a subscription ID to prevent duplicates.
- Tests: rapid connectWebSocket() calls produce only one active subscription
**F2 — Mesh concurrent state mutations**
- File: `neode-ui/src/stores/mesh.ts` lines 249-324
- Fix: Add `isSending` ref as mutex. Queue concurrent sends. `fetchMessages()` called once after all sends complete.
- Tests: 3 concurrent sendMessage() calls → all succeed, messages list consistent
**F3 — Global error handler**
- File: `neode-ui/src/main.ts`
- Add `app.config.errorHandler` that shows toast + logs structured error
- Tests: thrown error in component shows toast, nested errors don't crash handler
**S1 — Eliminate all `sudo podman`**
- Files: `fix-indeedhub-containers.sh` (28), `deploy-bitcoin-knots.sh` (11), `deploy-tailscale.sh` (2+), `uptime-monitor.sh` (1), `setup-aiui-server.sh`
- Replace every `sudo podman` with `podman` (runs as archipelago user)
- Tests: grep for `sudo podman` across all scripts returns zero matches
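The zero-match test above can be scripted rather than run by hand. A minimal sketch, assuming the scripts live under a single directory; `find_sudo_podman` is a hypothetical helper name, not something that exists in the repo:

```bash
#!/usr/bin/env bash
# Hypothetical helper: list every non-comment line that still calls `sudo podman`.
# Returns 0 when the tree is clean, 1 when matches remain.
find_sudo_podman() {
  local dir="$1" matches
  # -r recurse, -n line numbers, -H always print file names, -E extended regex.
  # The second grep drops lines whose content is only a shell comment.
  matches=$(grep -rnHE 'sudo[[:space:]]+podman' "$dir" 2>/dev/null \
    | grep -vE '^[^:]+:[0-9]+:[[:space:]]*#' || true)
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
    return 1
  fi
  return 0
}
```

Calling `find_sudo_podman scripts/` as the last step of the change prints the remaining offenders with file and line, which is easier to act on than a bare count.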
**S2 — Container health checks for all 30 containers**
- File: `scripts/first-boot-containers.sh`
- Add `--health-cmd`, `--health-interval=30s`, `--health-timeout=5s`, `--health-retries=3` to every `$DOCKER run`
- Health commands per type:
  - Bitcoin: `bitcoin-cli -rpcuser=... getblockchaininfo || exit 1`
  - HTTP apps: `curl -sf http://localhost:{port}/ || exit 1`
  - LND: `curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1`
  - Databases: `mariadb -u root -p... -e "SELECT 1" || exit 1`
- Tests: script grep confirms every `$DOCKER run` has `--health-cmd`
### Week 3: P1 Backend — Blocking I/O and Memory Leaks
**R4+R5 — Rate limiter cleanup**
- File: `core/archipelago/src/session.rs`
- Spawn background tasks for both `EndpointRateLimiter::cleanup()` and `LoginRateLimiter` cleanup, every 5 min
- Tests: after cleanup, stale entries removed; active entries preserved
**R6 — session.rs blocking I/O (6 calls)**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 77, 370, 413
- Replace `std::fs::write` → `tokio::fs::write` at lines 128, 425
- Replace `std::fs::create_dir_all` → `tokio::fs::create_dir_all` at line 423
- Tests: session load/save/persist still works correctly
**R7 — docker_packages.rs blocking I/O**
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 561, 573
- Tests: app metadata loading works
**R8 — port_allocator.rs blocking I/O**
- Replace all 3 std::fs calls → tokio::fs at lines 59, 73, 77
- Tests: port allocation/persistence works
**R9+R10+R11 — Remaining blocking I/O**
- `peers.rs:30`, `node_message.rs:65`, `identity.rs:50`, `identity_manager.rs:164`, `nostr_discovery.rs:55`
- Convert all to tokio::fs
- Tests: each module's file operations still work
**R12 — electrs_status.rs sync TCP I/O**
- Convert synchronous TCP client to async (tokio::net::TcpStream)
- Tests: ElectrumX status query works, timeout on connection failure
### Week 4: P1 Frontend — Memory Leaks and Stale State
**F4 — WebSocket reconnect full state refresh**
- File: `neode-ui/src/stores/app.ts`
- After reconnect, call `rpcClient.call({method: 'server.get-state'})` to get fresh state before accepting patches
- Tests: after simulated disconnect+reconnect, state matches server
**F5 — Message polling timer cleanup**
- File: `neode-ui/src/composables/useMessageToast.ts`
- Tie polling lifecycle to auth state: stop on logout, start on login. Export cleanup function.
- Tests: polling stops when auth false, restarts when auth true, no timer after unmount
**F6 — AppLauncher message listener leak**
- File: `neode-ui/src/stores/appLauncher.ts`
- Ensure listener is removed when app closes (even if not via close button — e.g., route navigation)
- Tests: navigate away from app → listener removed, new app opens clean
**F7 — Audio player listener stacking**
- File: `neode-ui/src/composables/useAudioPlayer.ts`
- Create Audio element once, register listeners once. Track initialization flag.
- Tests: calling play() 10 times → still only 6 listeners total (not 60)
**S3 — Pin all container images (remove :latest)**
- Files: `first-boot-containers.sh` (15), `deploy-to-target.sh` (11), `deploy-tailscale.sh` (18), `build-auto-installer-iso.sh` (7)
- Replace every `:latest` with specific version tag
- Create `image-versions.env` sourced by all scripts — single source of truth
- Tests: `grep -r ':latest' scripts/ image-recipe/` returns zero matches (excluding comments)
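One way to keep `image-versions.env` honest is to validate each reference when it is loaded. The sketch below is an assumption about how that could look; `is_pinned_image` and `BITCOIN_IMAGE` are hypothetical names, and no real image versions are implied:

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only if an image reference is pinned,
# either by digest or by a tag other than `latest`.
is_pinned_image() {
  local ref="$1" last
  case "$ref" in
    *@sha256:*) return 0 ;;   # digest pin
    *:latest)   return 1 ;;   # floating tag
  esac
  # Only the last path component can carry a tag; a ':' earlier in the
  # reference is a registry port (e.g. localhost:5000/app).
  last="${ref##*/}"
  case "$last" in
    *:*) return 0 ;;
    *)   return 1 ;;          # no tag at all resolves to latest
  esac
}

# Usage in a script (file and variable names are placeholders):
#   . ./image-versions.env
#   is_pinned_image "$BITCOIN_IMAGE" || { echo "unpinned image" >&2; exit 1; }
```

Rejecting untagged references matters because a bare image name resolves to `:latest`, which a plain grep for `:latest` would not catch.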
---
## MONTH 2: OPERATIONAL SAFETY (Weeks 5-8)
> Fix everything that makes deploys dangerous, scripts unreliable, or operations opaque.
### Week 5: Deploy Script Hardening
**S4 — first-boot error handling**
- Add per-section error checking: if Bitcoin fails, skip dependent containers (LND, Mempool, BTCPay)
- Add `wait_for_container` return value checking
- Tests: first-boot with broken Bitcoin image → Bitcoin deps skipped, independent apps still start
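The skip-dependents behaviour could be sketched as below. This is an illustration under assumptions, not the existing first-boot logic: `run_section`, `run_if_deps_ok`, and the section names are hypothetical, and each section is assumed to be a command or function that returns non-zero on failure.

```bash
#!/usr/bin/env bash
# Space-delimited list of sections that failed or were skipped.
FAILED_SECTIONS=""

# Run one section; record its name if the command fails.
run_section() {
  local name="$1"
  shift
  if "$@"; then
    echo "[ok] $name"
  else
    echo "[fail] $name" >&2
    FAILED_SECTIONS="$FAILED_SECTIONS $name "
    return 1
  fi
}

# Run a section only if none of its comma-separated dependencies failed.
# A skipped section is recorded too, so its own dependents are skipped as well.
run_if_deps_ok() {
  local name="$1" deps="$2" dep
  shift 2
  for dep in $(printf '%s' "$deps" | tr ',' ' '); do
    case "$FAILED_SECTIONS" in
      *" $dep "*)
        echo "[skip] $name (depends on $dep)" >&2
        FAILED_SECTIONS="$FAILED_SECTIONS $name "
        return 0
        ;;
    esac
  done
  # A failing section must not abort the whole script.
  run_section "$name" "$@" || true
}

# Usage (start_* functions are placeholders):
#   run_if_deps_ok bitcoin ""        start_bitcoin
#   run_if_deps_ok lnd     "bitcoin" start_lnd
#   run_if_deps_ok mempool "bitcoin" start_mempool
```

Independent apps are declared with an empty dependency list, so they start regardless of what failed earlier.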
**S5 — Replace eval with safe construct**
- File: `deploy-to-target.sh:940`
- Replace `eval "$DB_PASSWORDS"` with explicit variable assignment from SSH output
- Tests: passwords parsed correctly without eval
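A possible shape for the eval-free parser, assuming the remote side prints plain, unquoted `KEY=VALUE` lines. The key names `MARIADB_ROOT_PASSWORD` and `POSTGRES_PASSWORD` are placeholders; the real keys in `deploy-to-target.sh` may differ.

```bash
#!/usr/bin/env bash
# Read KEY=VALUE lines from stdin and assign only the keys we expect.
# Values are treated as data and never executed.
parse_db_passwords() {
  local line key value
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      *=*) ;;          # looks like an assignment
      *)   continue ;; # blank or malformed line
    esac
    key=${line%%=*}    # text before the first '='
    value=${line#*=}   # everything after it, including any further '='
    case "$key" in
      MARIADB_ROOT_PASSWORD) MARIADB_ROOT_PASSWORD=$value ;;
      POSTGRES_PASSWORD)     POSTGRES_PASSWORD=$value ;;
      *) echo "ignoring unexpected key: $key" >&2 ;;
    esac
  done
}

# Usage: feed the captured SSH output through a here-document so the
# assignments happen in the current shell, not in a pipeline subshell.
#   remote_output=$(ssh ... 'cat /path/to/passwords')
#   parse_db_passwords <<EOF
#   $remote_output
#   EOF
```

Splitting with parameter expansion instead of `IFS='=' read` keeps values that end in `=` (common with base64) intact.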
**S6 — Deploy locking**
- File: `deploy-to-target.sh`
- Add remote `flock` on `/var/lock/archipelago-deploy.lock`. Second deploy fails immediately with message. Stale lock (>30 min) broken automatically.
- Tests: two parallel deploys → second fails, stale lock → broken and deploy proceeds
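The fail-fast half of the locking can be sketched with `flock` on a file descriptor. This assumes the function runs in the shell that performs the deploy steps on the target and that util-linux `flock` is installed there; `acquire_deploy_lock` is a hypothetical name. Because the kernel drops a `flock` lock when its holder exits, a crashed deploy leaves nothing stale behind; the 30-minute rule would only need to handle a deploy that is hung but still alive, which is not shown here.

```bash
#!/usr/bin/env bash
# Take an exclusive, non-blocking lock on the given file.
# fd 9 stays open for the life of the shell, so the lock is released
# automatically when the deploy process exits for any reason.
acquire_deploy_lock() {
  local lockfile="$1"
  exec 9>"$lockfile" || return 1
  if ! flock -n 9; then
    echo "another deploy holds $lockfile; aborting" >&2
    return 1
  fi
}

# Usage:
#   acquire_deploy_lock /var/lock/archipelago-deploy.lock || exit 1
```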
**S7 — Deploy rollback**
- File: `deploy-to-target.sh`
- Before overwriting binary: `cp archipelago archipelago.bak`
- Before overwriting frontend: `cp -r web-ui web-ui.bak`
- If health check fails post-restart: restore from .bak, restart again
- Tests: intentionally broken binary → deploy detects, rolls back, system healthy
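The backup, replace, verify, restore sequence might look like the following. It is a sketch under assumptions: `deploy_with_rollback` is a hypothetical helper, the health check is whatever command the caller passes (for example a function that restarts the service and probes it), and restarting again after a rollback is left to the caller.

```bash
#!/usr/bin/env bash
# Replace $target (file or directory) with $new, keeping a .bak copy.
# Remaining arguments are the health check command; if it fails,
# the previous version is restored and the function returns 1.
deploy_with_rollback() {
  local new="$1" target="$2"
  shift 2
  # Keep the previous version so a failed deploy can be undone.
  rm -rf "$target.bak"
  cp -a "$target" "$target.bak" || return 1
  # Remove first so a directory is replaced rather than copied into.
  rm -rf "$target"
  if ! cp -a "$new" "$target"; then
    rm -rf "$target"
    mv "$target.bak" "$target"
    return 1
  fi
  if "$@"; then
    return 0
  fi
  echo "health check failed; restoring $target from backup" >&2
  rm -rf "$target"
  mv "$target.bak" "$target"
  return 1
}

# Usage (check_health is a placeholder for restart + probe):
#   deploy_with_rollback ./build/archipelago /opt/archipelago/archipelago check_health
```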
**S8 — Eliminate sshpass**
- File: `trust-archipelago-cert.sh`
- Rewrite to use SSH key only: `ssh -i ~/.ssh/archipelago-deploy`
- Tests: script works with key auth, fails gracefully without key
### Week 6: Script Quality
**S9 — MariaDB password not on command line**
- File: `first-boot-containers.sh:285`
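- Use `$DOCKER exec -i ... mariadb -uroot <<< "SET PASSWORD..."` so the statement is passed on stdin, not in the argument list (the extra `< /dev/stdin` is redundant: the here-string overrides it)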
- Tests: `ps aux` during execution doesn't show password
**S10 — Replace silent error masking**
- File: `deploy-to-target.sh` (80+ instances)
- Pattern: replace `2>/dev/null || echo ""` with `|| { log_warn "..."; echo ""; }`
- At minimum, log what failed before masking
- Tests: failed health check produces log entry
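A small wrapper makes the "log before masking" rule hard to forget. In this sketch `log_warn` is a stand-in for whatever logging function the scripts already use, and `try_or_warn` is a hypothetical name:

```bash
#!/usr/bin/env bash
# Stand-in logger; the real scripts may already define their own.
log_warn() { printf '[WARN] %s\n' "$*" >&2; }

# Run a command. On failure, log what failed and print an empty line,
# so callers that capture output still get "" but the error is recorded.
try_or_warn() {
  local desc="$1"
  shift
  "$@" || { log_warn "$desc failed (exit $?)"; echo ""; }
}

# Usage (container_status is a placeholder):
#   status=$(try_or_warn "container status probe" container_status bitcoin)
```

The command's own stderr is deliberately left visible; redirecting it to `/dev/null` again would hide the detail the log entry is meant to point at.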
**S11 — Trap cleanup for temp files**
- All scripts that create /tmp files: add `trap "rm -rf /tmp/deploy-$$" EXIT` at start
- Files: deploy-to-target.sh, deploy-tailscale.sh, build-auto-installer-iso.sh
- Tests: script interrupted mid-execution → temp files cleaned up
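A variation on the trap above, sketched with `mktemp -d` in place of the fixed `/tmp/deploy-$$` path so the directory name is not predictable. `setup_workdir` and `WORKDIR` are hypothetical names:

```bash
#!/usr/bin/env bash
# Create a private scratch directory and remove it on any exit.
setup_workdir() {
  WORKDIR=$(mktemp -d "${TMPDIR:-/tmp}/deploy.XXXXXX") || return 1
  # Single quotes: $WORKDIR is expanded when the trap runs, not when it is set.
  trap 'rm -rf "$WORKDIR"' EXIT
  # Turn signals into a normal exit so the EXIT trap also fires
  # on Ctrl-C or kill.
  trap 'exit 130' INT
  trap 'exit 143' TERM
}

# Usage at the top of a script:
#   setup_workdir || exit 1
#   cp something "$WORKDIR/"
```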
**S12 — Quote all variables**
- Audit and fix unquoted `$VARIABLE` in command arguments across all scripts
- Tests: shellcheck passes on all modified scripts
**S13 — Extract hardcoded IPs to config**
- Create `scripts/deploy-config-defaults.sh` with all node IPs as named variables
- Source from all scripts instead of hardcoding
- Tests: changing IP in config → all scripts use new IP
### Week 7: Infrastructure Hardening
**I2 — Systemd resource limits**
- File: `image-recipe/configs/archipelago.service`
- Add: `MemoryMax=4G`, `LimitNOFILE=65535`, `TasksMax=2048`
- Tests: `systemctl show archipelago` confirms limits applied, service starts normally
**I3 — Tor rotation transition period**
- File: `core/archipelago/src/api/rpc/tor.rs`
- Keep old hidden service running for 24h after rotation. Both addresses active. Notify peers of new address. Schedule old deletion.
- Tests: after rotation old address still resolves, peers receive notification, old removed after transition
**S14 — Input validation on deploy targets**
- Add regex validation for hostnames/IPs before SSH
- Tests: invalid hostname → clear error, valid hostname → proceeds
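The validation could be sketched as follows, assuming targets are either dotted-quad IPv4 addresses or RFC 1123 style hostnames (IPv6 is not handled). `is_valid_target` is a hypothetical name:

```bash
#!/usr/bin/env bash
# Succeed only for a plausible IPv4 address or hostname.
is_valid_target() {
  local t="$1" octet nl='
'
  # Reject empty input, embedded newlines, and anything starting with '-'
  # (ssh would parse that as an option).
  case "$t" in
    ''|-*|*"$nl"*) return 1 ;;
  esac
  # Dotted quad: four numeric fields, each at most 255.
  if printf '%s\n' "$t" | grep -Eq '^[0-9]+(\.[0-9]+){3}$'; then
    for octet in $(printf '%s' "$t" | tr '.' ' '); do
      [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
  fi
  # Hostname: dot-separated labels of letters, digits, and inner hyphens.
  printf '%s\n' "$t" | grep -Eq \
    '^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$'
}

# Usage:
#   is_valid_target "$TARGET" || { echo "invalid deploy target: $TARGET" >&2; exit 1; }
```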
**S15 — Memory limits on all deploy containers**
- File: `deploy-to-target.sh` lines 842-880
- Add `--memory=$(mem_limit ...)` to all UI container builds
- Tests: every container in deploy has `--memory` flag
**S17 — Disk space pre-flight**
- File: `deploy-to-target.sh`
- Check that target disk usage is below 85% before deploying. Abort with a clear message otherwise.
- Tests: deploy to 90% full disk aborted, deploy to 50% full succeeds
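The pre-flight check can be built on `df -P`, whose POSIX output format puts the usage percentage in the fifth column. A minimal sketch with hypothetical function names; in the deploy script it would run on the target over SSH:

```bash
#!/usr/bin/env bash
# Print the usage percentage (integer, no % sign) of the filesystem
# that holds the given path.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Succeed only if usage is below the threshold (default 85).
check_disk_space() {
  local path="$1" max="${2:-85}" used
  used=$(disk_usage_pct "$path") || return 1
  case "$used" in
    ''|*[!0-9]*)
      echo "could not read disk usage for $path" >&2
      return 1
      ;;
  esac
  if [ "$used" -ge "$max" ]; then
    echo "disk ${used}% full on $path (limit ${max}%); aborting deploy" >&2
    return 1
  fi
}

# Usage:
#   check_disk_space / 85 || exit 1
```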
### Week 8: Remaining P1 Backend
**R14 — Fix .parse().unwrap() in session rate limiting**
- File: `session.rs:665,676,688`
- Replace `.parse().unwrap()` with `.parse().context("...")?`
- Tests: invalid IP handling works gracefully
**R15 — Fix 7 unwrap/expect in mesh/protocol.rs**
- File: `mesh/protocol.rs:582,592,614,649,679,713,728`
- Replace all with `?` operator + proper error types
- Tests: protocol parsing with malformed data returns error, not panic
**R27 — Add timeouts to mesh Bitcoin RPC calls**
- File: `mesh/mod.rs:624,649,663`
- Add `tokio::time::timeout(Duration::from_secs(10), ...)` to all Bitcoin RPC calls
- Tests: RPC timeout returns error after 10s
**R34 — Tor rotation transition**
- (Covered by I3 above)
---
## MONTH 3: PRODUCTION POLISH (Weeks 9-12)
> Fix every remaining P2 issue — unwraps, hardcoded values, frontend quality, resilience.
### Week 9: Remaining Backend Unwraps + Dead Code
**R13 — main.rs .expect() → .context()**
- Replace 2 `.expect()` calls with `.context("...")?` and proper startup error handling
**R16 — identity.rs .expect() → safe handling**
- Replace 2 `.expect()` in crypto operations with result propagation
**R17+R18 — helpers unwraps**
- Fix 10 `.unwrap()` calls in `helpers/lib.rs` and `helpers/rsync.rs`
- Replace with `?` operator or `.context()`
**R19 — js-engine unwraps**
- Fix 2 `.unwrap()` in `js-engine/lib.rs:130,249`
**R20+R21 — Dead code elimination**
- Remove all 14 `#[allow(dead_code)]` in `mesh/mod.rs`. Either use the fields or delete them.
- Same for `lnd.rs`, `data_manager.rs`, `dev_orchestrator.rs`
- Tests: `cargo clippy` zero warnings, `cargo test` passes
### Week 10: Hardcoded Values → Constants
**R22 — Bitcoin RPC URL constant**
- Create `const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/";` in a shared constants module
- Use across `bitcoin.rs`, `mesh/mod.rs`, `mesh/listener.rs`
- Tests: all Bitcoin RPC calls still work
**R23 — DWN health URL constant**
**R24 — Update manifest URL constant**
**R25 — DNS-over-HTTPS URLs → constants array**
**R26 — DWN protocol URIs → constants**
- Centralize all hardcoded URLs/URIs into `core/archipelago/src/constants.rs`
- Tests: all modules reference constants, no hardcoded strings remain
**R28 — LND proxy timeouts**
- Audit all 68 `.send()` calls in `api/rpc/lnd.rs`. Ensure each has explicit timeout.
- Tests: LND proxy call with unresponsive LND → timeout error, not hang
**R29 — DWN health check timeout**
- Add timeout to `dwn_sync.rs:76` health check
**R30-R33 — Resolve all TODOs**
- Either implement the TODO or remove the dead code path. Per project rules: no TODO/FIXME in commits.
### Week 11: Frontend P2 Fixes
**F8 — WebSocket reconnection race**
- Add `isReconnecting` flag. Skip if already reconnecting.
- Tests: rapid close events → only one reconnect attempt
**F9 — WebSocket parse error handling**
- Count consecutive parse errors. After 3, force reconnect.
- Tests: 3 malformed messages → reconnect triggered; single bad message → logged only
**F10 — Stale connection detection tuning**
- Require mutual pong response within 30s. Don't close valid connections that are simply quiet.
- Tests: quiet but healthy connection stays open; no pong for 30s → reconnects
**F11 — RPC client backoff reduction**
- Reduce default timeout from 30s to 15s. Add jitter to backoff. Cap total retry time at 20s.
- Tests: server outage → user sees error within 20s, not 40s
**F12 — Code splitting**
- Lazy-load all routes: `() => import('./views/Web5.vue')`
- Add manual chunks in vite.config.ts for vendor/api
- Tests: build produces multiple chunks, initial bundle < 200KB gzipped
**F13 — DOMPurify on QR v-html**
- Add DOMPurify.sanitize() to QR SVG before v-html rendering
- Tests: XSS payload in QR content sanitized
### Week 12: Frontend P2 Continued + Performance
**F14 — Goals computed memoization**
- Replace O(n) alias lookup with Map. Add deep equality check.
- Tests: goalStatuses computed runs in <1ms with 100 apps
**F15 — localStorage error handling**
- Wrap all localStorage.setItem in try/catch. Show toast on quota exceeded.
- Tests: full localStorage → toast shown, app continues
**F16 — FileBrowser auth consolidation**
- Use cookie-only auth. Remove in-memory token.
- Tests: login persists across page reload, logout clears cookie
**F17 — CSRF token parsing robustness**
- Add header fallback for CSRF token. Handle edge cases.
- Tests: missing cookie → falls back to header, both missing → error
**F22 — CSS backdrop-filter mobile performance**
- Add media query: reduce blur to 8px on mobile. Remove backdrop-filter from non-visible elements.
- Tests: mobile Lighthouse performance score > 80
---
## MONTH 4-5: BACKEND ARCHITECTURE (Weeks 13-20)
> Split every Rust god file. Target: no file > 500 lines.
### Week 13-14: Split package.rs (1,795 lines)
```
api/rpc/package/
├── mod.rs — Re-exports (~50 lines)
├── config.rs — get_app_config(), get_app_capabilities(), needs_archy_net()
├── lifecycle.rs — install, start, stop, restart, uninstall
├── validation.rs — Input validation, dependency checking, image validation
└── progress.rs — Progress streaming, install status tracking
```
Pre-split tests: test every `get_app_config()` variant, validation path, lifecycle transition
Post-split: all RPC calls return identical responses, `cargo test` passes
### Week 15-16: Split mesh/listener.rs (1,799 lines)
```
mesh/listener/
├── mod.rs — Re-exports + spawn_mesh_listener()
├── session.rs — run_mesh_session() loop
├── frames.rs — handle_frame() dispatcher
├── identity.rs — handle_identity_received(), handle_typed_message()
├── sync.rs — sync_queued_messages(), store_typed_message()
└── bitcoin.rs — Bitcoin relay operations, RPC calls
```
### Week 17-18: Split rpc/mod.rs (1,092 lines) + lnd.rs (1,068 lines)
**rpc/mod.rs** → `dispatcher.rs` (method routing), `middleware.rs` (CSRF/session/rate-limit), `response.rs` (response building)
**lnd.rs** → `lnd/wallet.rs`, `lnd/channels.rs`, `lnd/info.rs`, `lnd/payments.rs`
### Week 19-20: Split monitoring (993), handler (911), mesh (865)
Split each into sub-modules. Target: no file > 500 lines.
All pre-split tests, all post-split verification.
---
## MONTH 6-8: FRONTEND ARCHITECTURE (Weeks 21-32)
> Split every Vue god component. Target: no component > 500 lines.
### Week 21-22: Split Web5.vue (3,940 lines → 8 sub-views)
```
views/web5/
├── Web5.vue — Router shell (~150 lines)
├── Web5Identity.vue — DID management
├── Web5Wallet.vue — Wallet operations
├── Web5Nostr.vue — Nostr relays/profiles
├── Web5Credentials.vue — Verifiable Credentials
├── Web5Peers.vue — P2P federation nodes
├── Web5Storage.vue — DWN storage/explorer
├── Web5Goals.vue — Goals/voting
└── Web5Marketplace.vue — Decentralized marketplace
```
Add nested routes. Component tests for each section. All sections render identically.
### Week 23-24: Split Mesh.vue (2,106) + Dashboard.vue (1,819)
**Mesh.vue** → `MeshRadio.vue`, `MeshChat.vue`, `MeshNetwork.vue`, `MeshFederation.vue`
**Dashboard.vue** → `DashboardHome.vue`, `DashboardApps.vue`, `DashboardSystem.vue`
### Week 25-26: Split Settings.vue (1,792) + Server.vue (1,132)
**Settings.vue** → `SettingsAccount.vue`, `SettingsSystem.vue`, `SettingsNetwork.vue`, `SettingsAppearance.vue`
**Server.vue** → `ServerOverview.vue`, `ServerContainers.vue`, `ServerLogs.vue`
### Week 27-28: Split Marketplace.vue (1,293) + AppDetails.vue (1,036) + Home.vue (1,059)
Each into 3-4 focused sub-components.
### Week 29-30: Decompose useAppStore (324 lines, 16 methods)
```
stores/
├── app.ts — Thin re-export for backward compat (~50 lines)
├── auth.ts — Login, logout, session, password, TOTP
├── server.ts — Server info, system stats, reboot/shutdown
├── realtime.ts — WebSocket connection, subscriptions, heartbeat
└── packages.ts — Package install/uninstall, marketplace data
```
Tests: every existing import of `useAppStore` still works. State transitions identical.
### Week 31-32: Remaining frontend P3 issues
**F18** — aiPermissions runtime validation
**F19** — Track AppSession timeout
**F20** — Dashboard aria-current
**F21** — Debounce search + memoize
**F23** — Branded types for DID operations
**F24** — Fix checkInterval leak
---
## MONTH 9-10: SCRIPT ARCHITECTURE + ISO (Weeks 33-40)
> Split every monolithic script. Target: no script > 400 lines.
### Week 33-34: Create shared script library
```
scripts/lib/
├── common.sh — Colors, logging, error handling, SSH helpers
├── health.sh — Health check polling, container status
├── deploy-utils.sh — Rsync, file sync, backup/restore
├── container.sh — Podman helpers, image management, mem_limit()
└── network.sh — IP validation, port checking
```
Tests: each library function tested in `scripts/tests/`
### Week 35-36: Split deploy-to-target.sh (1,728 lines)
```
scripts/
├── deploy-to-target.sh — Orchestrator + arg parsing (~300 lines)
├── deploy/
│ ├── frontend.sh — Build + sync frontend
│ ├── backend.sh — Build + sync binary
│ ├── configs.sh — Sync nginx, systemd, scripts
│ ├── containers.sh — Container creation/update
│ ├── verify.sh — Post-deploy health checks
│ └── rollback.sh — Rollback on failure
```
### Week 37-38: Split ISO build (1,850 lines) + first-boot (855 lines)
**build-auto-installer-iso.sh** → `build/capture-images.sh`, `build/create-rootfs.sh`, `build/install-packages.sh`, `build/bundle-configs.sh`, `build/package-iso.sh`
**first-boot-containers.sh** → `first-boot/databases.sh`, `first-boot/bitcoin.sh`, `first-boot/lightning.sh`, `first-boot/apps.sh`, `first-boot/networking.sh`
### Week 39-40: ISO Reproducibility + Integration Tests
**S16 — Make ISO builds reproducible**
- Create `image-versions.env` with pinned digests for every container image
- ISO build sources this file, never pulls `:latest`
- Build manifest records exactly what shipped
- Tests: two consecutive ISO builds produce identical image sets
**E2E smoke test script**
```bash
# scripts/smoke-test.sh — Run against .198
# 1. curl /health → OK
# 2. Login → get session
# 3. Get server info → valid JSON
# 4. List containers → all healthy
# 5. Check every /app/* proxy → responds
# 6. Check Tor hidden service → resolves
# 7. Check WebSocket upgrade → 101
# Exit 0 only if all pass
```
---
## MONTH 11: INTEGRATION TESTS (Weeks 41-44)
> Comprehensive test suites that prove everything works.
### Week 41-42: Backend Integration Tests
```
core/archipelago/tests/
├── test_auth_flow.rs — Login → session → CSRF → auth request → logout
├── test_container_lifecycle.rs — Install → start → health → stop → uninstall
├── test_federation.rs — Generate invite → join → sync → verify
├── test_rpc_validation.rs — Every endpoint with invalid input → proper error
├── test_session_persist.rs — Create session → restart → session survives
├── test_rate_limiting.rs — Flood → 429 → wait → allowed
├── test_backup_restore.rs — Create → verify → restore → validate
├── test_health_endpoint.rs — Healthy → degraded → recovery
```
Target: 25+ backend integration tests passing
### Week 43-44: Frontend Integration Tests
```
neode-ui/src/__tests__/integration/
├── auth-flow.spec.ts — Login → dashboard → timeout → redirect
├── app-lifecycle.spec.ts — Marketplace → install → progress → launch → uninstall
├── websocket.spec.ts — Connect → update → disconnect → reconnect → state consistent
├── settings-flow.spec.ts — Change password → re-login → 2FA setup → verify
├── spotlight.spec.ts — Open → search → navigate → close
├── mesh-chat.spec.ts — Connect → send → receive → disconnect
├── error-handling.spec.ts — Network error → toast → retry → success
├── code-splitting.spec.ts — Route navigation → chunks loaded lazily
```
Target: 20+ frontend integration tests passing
---
## MONTH 12: TYPE SYNC + CI/CD PLAN (Weeks 45-48)
### Week 45-46: Rust↔TypeScript Type Sync
**Approach**: `ts-rs` crate to auto-generate TypeScript types from Rust structs
1. Add `ts-rs` to `core/models/Cargo.toml`
2. Add `#[derive(TS)]` to all API request/response types
3. Build script generates `neode-ui/src/types/generated.ts`
4. Replace manual types in `types/api.ts` with imports from generated file
5. Verification: regenerate → diff → must be zero (types committed)
Tests: frontend type-check passes with generated types, manual api.ts reduced to non-API types
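Step 5 (regenerate, diff, must be zero) could be enforced with a wrapper like this. It is a hedged sketch: `verify_types_in_sync` is a hypothetical helper, and the regeneration command is passed in by the caller because this plan does not name it.

```bash
#!/usr/bin/env bash
# Drift check for committed generated types.
# verify_types_in_sync is a hypothetical helper; the caller supplies the
# command that regenerates the file.

# Usage: verify_types_in_sync <generated-file> <regenerate-command> [args...]
# Returns 0 if regeneration leaves the file unchanged, 1 on drift,
# 2 if the snapshot or the regeneration itself fails.
verify_types_in_sync() {
  local generated="$1"
  shift
  local snapshot rc
  snapshot="$(mktemp)" || return 2
  # Keep a copy of the committed file before regenerating over it.
  if ! cp "$generated" "$snapshot"; then
    rm -f "$snapshot"
    return 2
  fi
  # Run the caller-supplied generator.
  if ! "$@"; then
    rm -f "$snapshot"
    return 2
  fi
  # Any difference means the committed types are stale.
  if diff -u "$snapshot" "$generated"; then
    rc=0
  else
    echo "generated types are out of date: $generated" >&2
    rc=1
  fi
  rm -f "$snapshot"
  return "$rc"
}
```

Usage would look like `verify_types_in_sync neode-ui/src/types/generated.ts <generator command>`. Because the unified diff is printed on drift, the output shows exactly which types changed.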
### Week 47-48: CI/CD Planning (Document Only, Execute Later)
> This section is the PLAN for CI/CD. Do not execute during this phase. Document everything needed so it can be implemented in a future sprint.
**CI Pipeline Design** (`.github/workflows/ci.yml`):
```yaml
# Triggers: push to main, all PRs
# Jobs:
# rust-checks (Linux runner):
# - cargo clippy --all-targets --all-features (zero warnings gate)
# - cargo fmt --all -- --check (formatting gate)
# - cargo test --all-features (all tests gate)
#
# frontend-checks (Node 20):
# - npm run type-check (TypeScript strictness gate)
# - npm run lint (ESLint gate)
# - npm test (Vitest suite gate)
#
# integration (Linux runner, optional):
# - scripts/smoke-test.sh against staging
#
# Merge policy: all checks must pass before merge
# Branch protection: require PR, require checks, no force push to main
```
**Release Pipeline Design** (`.github/workflows/release.yml`):
```yaml
# Triggers: tag push (v*)
# Jobs:
# build-linux-binary:
# - Cross-compile Rust for x86_64 + ARM64
# build-frontend:
# - npm run build
# build-iso:
# - SSH to build server, run ISO build
# - Upload ISO as release asset
# smoke-test:
# - Boot ISO in QEMU
# - Run smoke-test.sh
# - Gate release on pass
```
**Pre-requisites to implement**:
- [ ] GitHub Actions runner with Rust toolchain + cross-compilation
- [ ] Node.js 20 runner for frontend
- [ ] SSH key for build server accessible from CI
- [ ] Branch protection rules configured
- [ ] Image digest manifest for reproducible ISO builds
- [ ] QEMU-based ISO verification script
**Estimated implementation time**: 2 weeks when ready to execute
---
## VERIFICATION PROTOCOL (Every Week)
1. `cargo clippy --all-targets --all-features` — zero warnings
2. `cargo fmt --all`
3. `cargo test --all-features` — all pass
4. `cd neode-ui && npm run type-check` — zero errors
5. `cd neode-ui && npm test` — all pass
6. `./scripts/deploy-to-target.sh --target 192.168.1.198` (**ONLY .198**)
7. `curl http://192.168.1.198/health` — returns OK with service status
8. Navigate all affected views in browser — identical behavior
9. Atomic commit: `refactor: <description>` or `fix: <description>`
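Steps 1 through 6 can be chained so that a failing gate blocks the deploy. The following is a sketch under assumptions: the function names are hypothetical, `-- -D warnings` is added so that clippy warnings produce a non-zero exit, and `npm --prefix neode-ui` stands in for `cd neode-ui && npm ...`. Steps 7 to 9 stay manual.

```bash
#!/usr/bin/env bash
# Weekly verification gates, steps 1-6. Function names are hypothetical.

ALLOWED_TARGET="192.168.1.198"

# Refuse any deploy target other than .198.
assert_allowed_target() {
  if [ "$1" != "$ALLOWED_TARGET" ]; then
    echo "refusing to deploy to $1 (only $ALLOWED_TARGET is allowed)" >&2
    return 1
  fi
}

# Run one gate and report which one failed.
run_gate() {
  local label="$1"
  shift
  echo "==> $label"
  if ! "$@"; then
    echo "FAILED: $label" >&2
    return 1
  fi
}

# Steps 1-6 in order; the && chain stops at the first failing gate,
# so nothing is deployed unless every earlier check passed.
verify_week() {
  local target="${1:-$ALLOWED_TARGET}"
  assert_allowed_target "$target" &&
    run_gate "clippy" cargo clippy --all-targets --all-features -- -D warnings &&
    run_gate "fmt" cargo fmt --all &&
    run_gate "cargo test" cargo test --all-features &&
    run_gate "type-check" npm --prefix neode-ui run type-check &&
    run_gate "frontend tests" npm --prefix neode-ui test &&
    run_gate "deploy" ./scripts/deploy-to-target.sh --target "$target"
}
```

Calling `verify_week` with no argument deploys to .198. Passing any other address fails before a single command runs.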
---
## EXIT CRITERIA (Month 12 Complete)
### Reliability (Zero Tolerance)
- [ ] Health endpoint returns real service status
- [ ] All async operations have bounded timeouts
- [ ] Zero blocking I/O in async context (no std::fs in async functions)
- [ ] Zero .unwrap()/.expect() in production code
- [ ] All rate limiters have cleanup tasks
- [ ] Backup restore uses staging + atomic swap + rollback
- [ ] All 30 containers have health checks + memory limits
- [ ] All container images pinned to specific versions
- [ ] Nginx unauthenticated endpoints protected (timeout + rate limit + body size)
- [ ] Systemd service has resource limits
- [ ] Tor rotation preserves old address during transition
- [ ] Deploy has locking + disk check + rollback
- [ ] Zero `sudo podman` in any script
- [ ] Zero `:latest` image tags anywhere
- [ ] Zero silent error masking without logging
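Two of the criteria above (zero `sudo podman`, zero `:latest`) are mechanically checkable with grep. A minimal sketch follows, assuming a hypothetical helper named `audit_forbidden`. The patterns are plain text matches, so they will also flag occurrences in comments and documentation.

```bash
#!/usr/bin/env bash
# Grep-based audit for forbidden patterns. audit_forbidden is a hypothetical helper.

# Usage: audit_forbidden <label> <extended-regex> <dir>
# Prints every matching line (file:line:text) and returns 1 if any exist,
# otherwise prints an OK line and returns 0.
audit_forbidden() {
  local label="$1" pattern="$2" dir="$3" hits
  # grep exits 1 when nothing matches; "|| true" keeps that from looking like an error.
  hits="$(grep -rnE -- "$pattern" "$dir" 2>/dev/null || true)"
  if [ -n "$hits" ]; then
    printf 'FAIL  %s\n%s\n' "$label" "$hits"
    return 1
  fi
  echo "OK    $label"
}
```

Example calls: `audit_forbidden "no sudo podman" 'sudo[[:space:]]+podman' scripts` and `audit_forbidden "no :latest tags" ':latest' scripts`. Point the directory argument at source trees rather than the repository root, since this plan file itself contains both strings.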
### Frontend (Zero Tolerance)
- [ ] Global error handler catches and displays all errors
- [ ] WebSocket: single subscription, reconnect refreshes state, bounded retries
- [ ] All timers/listeners cleaned up on unmount
- [ ] Code splitting: initial bundle < 200KB gzipped
- [ ] v-html always uses DOMPurify
- [ ] All localStorage operations wrapped in try/catch
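The bundle-size criterion can be measured rather than estimated. This sketch assumes "200KB" means 200 × 1024 = 204800 bytes and that the caller knows which built file is the initial bundle. Both function names are hypothetical, and an initial load made of several files would need their sizes summed.

```bash
#!/usr/bin/env bash
# Gzipped-size budget check. Function names and the default limit are assumptions.

# Print the gzip-compressed size of a file in bytes.
gzipped_bytes() {
  echo $(( $(gzip -c "$1" | wc -c) ))
}

# Usage: check_bundle_budget <file> [limit-bytes]
# Passes only when the gzipped size is strictly below the limit.
# Returns 2 if the file does not exist, so a missing build cannot pass silently.
check_bundle_budget() {
  local file="$1" limit="${2:-204800}" size
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 2
  fi
  size="$(gzipped_bytes "$file")"
  if [ "$size" -ge "$limit" ]; then
    echo "FAIL  $file: $size bytes gzipped (limit $limit)"
    return 1
  fi
  echo "OK    $file: $size bytes gzipped (limit $limit)"
}
```

The number comes from gzip's default compression level, so it approximates what a server would send rather than matching it exactly.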
### Architecture (Target: File Size Limits)
- [ ] No Rust file > 500 lines (excluding generated code)
- [ ] No Vue component > 500 lines
- [ ] No shell script > 400 lines
- [ ] No Pinia store has more than 1 responsibility
- [ ] All hardcoded URLs/ports extracted to constants
- [ ] Shared script library eliminates duplication
- [ ] TypeScript types auto-generated from Rust structs
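The three line-count limits above lend themselves to an automated check. A minimal sketch, assuming a hypothetical helper `oversized_files`. It does not implement the "excluding generated code" exception, so it should be pointed at hand-written source directories only.

```bash
#!/usr/bin/env bash
# Line-count audit for the file size limits. oversized_files is a hypothetical helper.

# Usage: oversized_files <extension> <max-lines> <dir>
# Lists every file over the limit and returns 1 if any exist, 0 otherwise.
oversized_files() {
  local ext="$1" limit="$2" dir="$3" report
  report="$(
    find "$dir" -type f -name "*.$ext" |
      while IFS= read -r file; do
        # Arithmetic expansion strips the padding some wc implementations add.
        lines=$(( $(wc -l < "$file") ))
        if [ "$lines" -gt "$limit" ]; then
          echo "$file: $lines lines (limit $limit)"
        fi
      done
  )"
  if [ -n "$report" ]; then
    echo "$report"
    return 1
  fi
  return 0
}
```

Matching the limits above, the calls would be `oversized_files rs 500 core`, `oversized_files vue 500 neode-ui/src`, and `oversized_files sh 400 scripts`.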
### Testing
- [ ] 25+ backend integration tests passing
- [ ] 20+ frontend integration tests passing
- [ ] E2E smoke test script passes on .198
- [ ] ISO builds are reproducible (pinned digests)
### CI/CD (Planned, Not Executed)
- [ ] CI pipeline design documented
- [ ] Release pipeline design documented
- [ ] Pre-requisites list complete
- [ ] Ready for 2-week implementation sprint
### Zero Behavior Changes
Every feature works identically. Every existing test passes. Every user flow unchanged.