archy/.claude/plans/plan.md

Archipelago: Production Excellence Plan

Duration: 12 months (48 weeks)
Goal: Code so good no developer could question any decision. Apple-level reliability. Every failure visible and recoverable. Every operation bounded. Every line justified.
Audited: 2026-03-20 (122 Rust files, 38 Vue views, 180+ frontend files, 80+ shell scripts)

CONSTRAINTS

  • DEPLOY ONLY TO .198 — Never .228. All verification on .198.
  • BETA FREEZE — Behavior-preserving only. No new features/UI/endpoints.
  • Tests before every refactor — Capture current behavior first. Tests must pass unchanged after.
  • Atomic commits — One logical change per commit. Every step compiles + passes tests.
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198

COMPLETE ISSUE REGISTRY

Backend Rust — 122 files audited

| ID | Issue | File(s) | Severity |
|---|---|---|---|
| R1 | Health RPC endpoint has no handler; returns "Unknown method" | api/rpc/mod.rs | P0 |
| R2 | Nostr client.connect() hangs indefinitely (4 calls, no timeout) | nostr_handshake.rs:124,161,262,282 | P0 |
| R3 | Backup restore extracts directly to live dir; no atomic rollback | backup/full.rs:122-149 | P0 |
| R4 | Rate limiter cleanup() never spawned; HashMap grows forever | session.rs:566-579 | P1 |
| R5 | Login rate limiter same issue; entries never evicted | session.rs:452-472 | P1 |
| R6 | Blocking std::fs in async: session.rs (6 calls) | session.rs:77,128,370,413,423,425 | P1 |
| R7 | Blocking std::fs in async: docker_packages.rs | docker_packages.rs:561,573 | P1 |
| R8 | Blocking std::fs in async: port_allocator.rs | port_allocator.rs:59,73,77 | P1 |
| R9 | Blocking std::fs in async: peers.rs, node_message.rs | peers.rs:30, node_message.rs:65 | P1 |
| R10 | Blocking std::fs in async: identity.rs, identity_manager.rs | identity.rs:50, identity_manager.rs:164 | P1 |
| R11 | Blocking std::fs in async: nostr_discovery.rs | nostr_discovery.rs:55 | P1 |
| R12 | Sync TCP I/O in async context: electrs_status.rs | electrs_status.rs:5,40,78,81 | P1 |
| R13 | .expect() in main.rs startup | main.rs:124,159 | P2 |
| R14 | .parse().unwrap() in session.rs rate limiting | session.rs:665,676,688 | P1 |
| R15 | 7 .unwrap()/.expect() in mesh/protocol.rs | protocol.rs:582,592,614,649,679,713,728 | P1 |
| R16 | .expect() in identity.rs crypto | identity.rs:114,119 | P2 |
| R17 | .unwrap() in helpers/lib.rs (5 calls) | helpers/lib.rs:167,172,180,233,253 | P2 |
| R18 | .unwrap() in helpers/rsync.rs (5 calls) | rsync.rs:196,199,202,210,220 | P2 |
| R19 | .unwrap() in js-engine/lib.rs | js-engine/lib.rs:130,249 | P2 |
| R20 | 14 #[allow(dead_code)] suppressions in mesh/mod.rs | mesh/mod.rs:7-25 | P2 |
| R21 | Dead code in lnd.rs, data_manager.rs, dev_orchestrator.rs | Multiple | P2 |
| R22 | Bitcoin RPC URL hardcoded in 4+ files | bitcoin.rs:89, mesh/mod.rs:624,649,663, listener.rs:1509+ | P2 |
| R23 | DWN health URL hardcoded | dwn_sync.rs:76 | P2 |
| R24 | Update manifest URL hardcoded | update.rs:11 | P3 |
| R25 | DNS-over-HTTPS URLs hardcoded (4 providers) | network/dns.rs:98,102,106,110 | P3 |
| R26 | DWN protocol URIs hardcoded in server.rs | server.rs:453-456 | P3 |
| R27 | Missing timeouts on mesh Bitcoin RPC calls | mesh/mod.rs:624,649,663 | P1 |
| R28 | Missing timeouts on LND proxy calls (68 .send() calls) | api/rpc/lnd.rs | P2 |
| R29 | Missing timeout on DWN health check | dwn_sync.rs:76 | P2 |
| R30 | TODO: track last-seen timestamp | handshake.rs:77 | P3 |
| R31 | TODO: lnd.lookupinvoice RPC endpoint | marketplace.rs:183 | P3 |
| R32 | TODO: trigger auto-restart or alert | container/health_monitor.rs:140 | P3 |
| R33 | TODO: configure Podman to use AppArmor profile | security/container_policies.rs:68 | P3 |
| R34 | Tor rotation deletes old .onion immediately; no transition | api/rpc/tor.rs:184-240 | P1 |
| R35 | package.rs god file: 1,795 lines | api/rpc/package.rs | P2 |
| R36 | mesh/listener.rs god file: 1,799 lines | mesh/listener.rs | P2 |
| R37 | rpc/mod.rs god file: 1,092 lines | api/rpc/mod.rs | P2 |
| R38 | lnd.rs god file: 1,068 lines | api/rpc/lnd.rs | P2 |
| R39 | monitoring/mod.rs: 993 lines | monitoring/mod.rs | P3 |
| R40 | api/handler.rs: 911 lines | api/handler.rs | P3 |
| R41 | 30+ functions exceed 50 lines across codebase | Multiple | P3 |

Frontend — 180+ files audited

| ID | Issue | File(s) | Severity |
|---|---|---|---|
| F1 | WebSocket subscription registered multiple times; race condition | stores/app.ts:88-134 | P0 |
| F2 | Unprotected concurrent mesh state mutations | stores/mesh.ts:249-268,294-324 | P0 |
| F3 | No global Vue error handler; white screen on error | main.ts | P0 |
| F4 | Stale data after WebSocket reconnect; no full refresh | stores/app.ts:88-163 | P1 |
| F5 | Message polling timer never stopped after logout | composables/useMessageToast.ts:60 | P1 |
| F6 | AppLauncher NIP-07 message listener leak on close | stores/appLauncher.ts:295-301 | P1 |
| F7 | Audio player listeners stack; never cleaned up | composables/useAudioPlayer.ts:1-91 | P1 |
| F8 | WebSocket reconnection race: parallel connect() attempts | api/websocket.ts:212-238 | P2 |
| F9 | WebSocket parse error silently caught; stale UI forever | api/websocket.ts:164-172 | P2 |
| F10 | WebSocket stale connection detection too aggressive (5min) | api/websocket.ts:284-299 | P2 |
| F11 | RPC client backoff + timeout = 40s max wait | api/rpc-client.ts:31-117 | P2 |
| F12 | No code splitting; monolithic bundle | vite.config.ts | P2 |
| F13 | v-html on QR code without DOMPurify | views/Settings.vue:441 | P2 |
| F14 | Goals store O(n) alias lookup on every computed | stores/goals.ts:16-20,38-89 | P2 |
| F15 | localStorage save without try/catch (5+ instances) | stores/goals.ts:34-36 + others | P2 |
| F16 | FileBrowser auth token duality: memory + cookie | api/filebrowser-client.ts:39,50-68 | P2 |
| F17 | CSRF token cookie parsing brittle; regex only | api/rpc-client.ts:18-21 | P2 |
| F18 | aiPermissions.ts Set uses unsafe type assertion | stores/aiPermissions.ts:91-103 | P3 |
| F19 | Untracked setTimeout in AppSession; fires after unmount | views/AppSession.vue:507 | P3 |
| F20 | Dashboard navigation missing aria-current="page" | views/Dashboard.vue | P3 |
| F21 | Search performance: string re-lowercasing every keystroke | views/Apps.vue:510-537 | P3 |
| F22 | 30+ backdrop-filter blur elements; GPU overload on mobile | style.css | P3 |
| F23 | Record<string, unknown> on sensitive DID operations | types/api.ts + rpc-client.ts | P3 |
| F24 | checkInterval timer leak on connect race | api/websocket.ts:82-96 | P3 |
| F25 | Web5.vue god component: 3,940 lines | views/Web5.vue | P2 |
| F26 | Mesh.vue: 2,106 lines | views/Mesh.vue | P2 |
| F27 | Dashboard.vue: 1,819 lines | views/Dashboard.vue | P2 |
| F28 | Settings.vue: 1,792 lines | views/Settings.vue | P2 |
| F29 | Marketplace.vue: 1,293 lines | views/Marketplace.vue | P3 |
| F30 | Server.vue: 1,132 lines | views/Server.vue | P3 |
| F31 | Home.vue: 1,059 lines | views/Home.vue | P3 |
| F32 | AppDetails.vue: 1,036 lines | views/AppDetails.vue | P3 |
| F33 | useAppStore god store: 324 lines, 16 methods, 8+ responsibilities | stores/app.ts | P2 |

Shell Scripts — 80+ files audited

| ID | Issue | File(s) | Severity |
|---|---|---|---|
| S1 | 60+ instances of sudo podman; should be rootless | fix-indeedhub(28), deploy-bitcoin(11), deploy-tailscale(2+) | P0 |
| S2 | Zero container health checks in first-boot (30 containers) | first-boot-containers.sh | P0 |
| S3 | 50+ :latest image tags across all scripts | first-boot(15), deploy(11), tailscale(18), iso(7) | P1 |
| S4 | No set -e in first-boot; silent container failures | first-boot-containers.sh:1-9 | P1 |
| S5 | eval "$DB_PASSWORDS": code injection risk | deploy-to-target.sh:940 | P1 |
| S6 | No deploy locking; concurrent deploys corrupt state | deploy-to-target.sh | P1 |
| S7 | No deploy rollback; failed deploy leaves broken system | deploy-to-target.sh | P1 |
| S8 | sshpass usage in trust-archipelago-cert.sh | trust-archipelago-cert.sh:23-26 | P1 |
| S9 | MariaDB password in command line; visible in ps | first-boot-containers.sh:285 | P1 |
| S10 | 80+ instances of 2>/dev/null \|\| true masking errors | deploy-to-target.sh | P2 |
| S11 | No trap cleanup for temp files | Multiple scripts | P2 |
| S12 | Unquoted variables (word splitting risk) | Multiple scripts | P2 |
| S13 | Hardcoded IPs in 6+ scripts | deploy-to-target.sh:26, deploy-tailscale.sh:26, etc. | P2 |
| S14 | No input validation on deploy targets | deploy-tailscale.sh | P2 |
| S15 | Missing memory limits on some containers in deploy | deploy-to-target.sh:842-880 | P2 |
| S16 | ISO build not reproducible: dynamic image capture + :latest | build-auto-installer-iso.sh:500-594 | P2 |
| S17 | No disk space pre-flight in deploy | deploy-to-target.sh | P2 |
| S18 | deploy-to-target.sh: 1,728 lines monolith | deploy-to-target.sh | P3 |
| S19 | build-auto-installer-iso.sh: 1,850 lines monolith | build-auto-installer-iso.sh | P3 |
| S20 | first-boot-containers.sh: 855 lines monolith | first-boot-containers.sh | P3 |
| S21 | No shared script library; duplicated functions | scripts/ | P3 |

Infrastructure

| ID | Issue | File(s) | Severity |
|---|---|---|---|
| I1 | Nginx: /archipelago/, /content, /dwn missing timeout+rate-limit+body-size | nginx-archipelago.conf:116-180 | P0 |
| I2 | Systemd: no MemoryMax, LimitNOFILE, TasksMax | archipelago.service | P1 |
| I3 | Tor rotation kills old address immediately; federation downtime | api/rpc/tor.rs:184-240 | P1 |

MONTH 1: CRASH PREVENTION (Weeks 1-4)

Fix every issue that can crash the system, hang indefinitely, or lose data.

Week 1: P0 Backend — Things That Hang or Lose Data

R1 — Health endpoint handler

  • File: core/archipelago/src/api/rpc/mod.rs
  • Add handler for "health" method that checks: crash recovery complete, Podman socket responsive, session store loaded
  • Tests: health returns JSON status, degraded when Podman unreachable, degraded during recovery
  • Verify: curl http://192.168.1.198/rpc/v1 -d '{"method":"health"}' returns real status

R2 — Nostr connect timeout

  • File: core/archipelago/src/nostr_handshake.rs lines 124, 161, 262, 282
  • Wrap all 4 client.connect().await in tokio::time::timeout(Duration::from_secs(10), ...)
  • Tests: connect timeout returns Err after 10s, successful connect within timeout works

R3 — Backup restore atomic rollback

  • File: core/archipelago/src/backup/full.rs lines 122-149
  • Rewrite: decrypt → extract to staging dir → validate required files → atomic rename → rollback on failure
  • Tests: valid backup restores, corrupt backup fails without touching live data, partial extraction rolls back, disk space check fails early

I1 — Nginx unauthenticated endpoint protection

  • File: image-recipe/configs/nginx-archipelago.conf lines 116-180
  • Add to /archipelago/, /content, /dwn:
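    • limit_req zone=peer burst=20 nodelay; (assumes a matching limit_req_zone with zone=peer is defined in the http block)
    • limit_req_status 429; (nginx answers rate-limited requests with 503 unless this is set)
    • client_max_body_size 10m;
    • proxy_connect_timeout 30s; proxy_read_timeout 60s; proxy_send_timeout 30s;
  • Tests: >10MB payload → 413, slow client → timeout, burst 30 → 429 after 20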

Week 2: P0 Frontend + Scripts — Things That Break UI or Containers

F1 — WebSocket subscription race condition

  • File: neode-ui/src/stores/app.ts lines 88-134
  • Fix: Return unsubscribe function from wsClient.subscribe(), call it before re-subscribing. Use a subscription ID to prevent duplicates.
  • Tests: rapid connectWebSocket() calls produce only one active subscription
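
The unsubscribe-before-resubscribe fix can be sketched as follows. This is a minimal illustration, not the actual store code: `WsLikeClient`, `createSubscriptionGuard`, and `createFakeWsClient` are hypothetical names, and the real `wsClient.subscribe()` signature in app.ts may differ.

```typescript
// Hypothetical minimal client shape: subscribe() returns an unsubscribe function.
type PatchHandler = (patch: unknown) => void;
type Unsubscribe = () => void;

interface WsLikeClient {
  subscribe(handler: PatchHandler): Unsubscribe;
}

// Tracks the single active subscription and replaces it on every reconnect.
function createSubscriptionGuard(client: WsLikeClient) {
  let active: Unsubscribe | null = null;

  return {
    // Safe to call repeatedly: the previous subscription is dropped first.
    resubscribe(handler: PatchHandler): void {
      if (active) {
        active();
      }
      active = client.subscribe(handler);
    },
    // Call on logout or store teardown.
    dispose(): void {
      if (active) {
        active();
        active = null;
      }
    },
  };
}

// In-memory stand-in for the real client, used only to demonstrate the guard.
function createFakeWsClient(): WsLikeClient & { handlerCount(): number } {
  const handlers = new Set<PatchHandler>();
  return {
    subscribe(handler: PatchHandler): Unsubscribe {
      handlers.add(handler);
      return () => {
        handlers.delete(handler);
      };
    },
    handlerCount: () => handlers.size,
  };
}
```

Holding the unsubscribe function plays the same role as the subscription ID mentioned in the fix: either way, the store keeps exactly one handle and releases it before taking a new one.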

F2 — Mesh concurrent state mutations

  • File: neode-ui/src/stores/mesh.ts lines 249-324
  • Fix: Add isSending ref as mutex. Queue concurrent sends. fetchMessages() called once after all sends complete.
  • Tests: 3 concurrent sendMessage() calls → all succeed, messages list consistent
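
One way to get the queueing behaviour described above is a promise chain that runs sends one at a time and refreshes only when the queue drains. The sketch below is an assumption about how this could look; `createSendQueue`, `sendFn`, and `refreshFn` are hypothetical names standing in for the store's real send and `fetchMessages()` calls.

```typescript
// Serializes async sends and refreshes the message list once the queue drains.
// sendFn and refreshFn are hypothetical stand-ins for the store's RPC calls.
function createSendQueue(
  sendFn: (text: string) => Promise<void>,
  refreshFn: () => Promise<void>,
) {
  let tail: Promise<void> = Promise.resolve();
  let pending = 0;

  return function enqueueSend(text: string): Promise<void> {
    pending += 1;
    const run = tail.then(async () => {
      try {
        await sendFn(text);
      } finally {
        pending -= 1;
        // Only the last queued send triggers the refresh.
        if (pending === 0) {
          await refreshFn();
        }
      }
    });
    // Keep the chain alive even if this send rejects.
    tail = run.catch(() => undefined);
    return run;
  };
}
```

Each caller still receives its own promise, so a failed send rejects only for that caller while later sends continue to run in order.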

F3 — Global error handler

  • File: neode-ui/src/main.ts
  • Add app.config.errorHandler that shows toast + logs structured error
  • Tests: thrown error in component shows toast, nested errors don't crash handler
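
A handler of the kind described above might be built by a small factory so it can be unit-tested without mounting a Vue app. This is a sketch under assumptions: `createGlobalErrorHandler`, `showToast`, and `log` are hypothetical names, and the toast wording is illustrative. The three-argument shape follows Vue 3's `app.config.errorHandler` signature.

```typescript
// Builds a handler compatible with Vue 3's app.config.errorHandler.
// showToast and log are hypothetical hooks supplied by the application.
function createGlobalErrorHandler(
  showToast: (message: string) => void,
  log: (entry: { level: string; message: string; info: string; at: string }) => void,
) {
  return function handleError(err: unknown, _instance: unknown, info: string): void {
    try {
      const message = err instanceof Error ? err.message : String(err);
      log({ level: "error", message, info, at: new Date().toISOString() });
      showToast("Something went wrong. Please try again.");
    } catch (inner) {
      // The handler itself must never throw, or the app is back to a white screen.
      console.error("global error handler failed", inner);
    }
  };
}
```

In main.ts this would be wired up as `app.config.errorHandler = createGlobalErrorHandler(...)`, passing whatever toast and logging functions the UI already has.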

S1 — Eliminate all sudo podman

  • Files: fix-indeedhub-containers.sh (28), deploy-bitcoin-knots.sh (11), deploy-tailscale.sh (2+), uptime-monitor.sh (1), setup-aiui-server.sh
  • Replace every sudo podman with podman (runs as archipelago user)
  • Tests: grep for sudo podman across all scripts returns zero matches

S2 — Container health checks for all 30 containers

  • File: scripts/first-boot-containers.sh
  • Add --health-cmd, --health-interval=30s, --health-timeout=5s, --health-retries=3 to every $DOCKER run
  • Health commands per type:
    • Bitcoin: bitcoin-cli -rpcuser=... getblockchaininfo || exit 1
    • HTTP apps: curl -sf http://localhost:{port}/ || exit 1
    • LND: curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1
    • Databases: mariadb -u root -p... -e "SELECT 1" || exit 1
  • Tests: script grep confirms every $DOCKER run has --health-cmd

Week 3: P1 Backend — Blocking I/O and Memory Leaks

R4+R5 — Rate limiter cleanup

  • File: core/archipelago/src/session.rs
  • Spawn background tasks for both EndpointRateLimiter::cleanup() and LoginRateLimiter cleanup, every 5 min
  • Tests: after cleanup, stale entries removed; active entries preserved

R6 — session.rs blocking I/O (6 calls)

  • Replace std::fs::read_to_string → tokio::fs::read_to_string at lines 77, 370, 413
  • Replace std::fs::write → tokio::fs::write at lines 128, 425
  • Replace std::fs::create_dir_all → tokio::fs::create_dir_all at line 423
  • Tests: session load/save/persist still works correctly

R7 — docker_packages.rs blocking I/O

  • Replace std::fs::read_to_string → tokio::fs::read_to_string at lines 561, 573
  • Tests: app metadata loading works

R8 — port_allocator.rs blocking I/O

  • Replace all 3 std::fs calls → tokio::fs at lines 59, 73, 77
  • Tests: port allocation/persistence works

R9+R10+R11 — Remaining blocking I/O

  • peers.rs:30, node_message.rs:65, identity.rs:50, identity_manager.rs:164, nostr_discovery.rs:55
  • Convert all to tokio::fs
  • Tests: each module's file operations still work

R12 — electrs_status.rs sync TCP I/O

  • Convert synchronous TCP client to async (tokio::net::TcpStream)
  • Tests: ElectrumX status query works, timeout on connection failure

Week 4: P1 Frontend — Memory Leaks and Stale State

F4 — WebSocket reconnect full state refresh

  • File: neode-ui/src/stores/app.ts
  • After reconnect, call rpcClient.call({method: 'server.get-state'}) to get fresh state before accepting patches
  • Tests: after simulated disconnect+reconnect, state matches server

F5 — Message polling timer cleanup

  • File: neode-ui/src/composables/useMessageToast.ts
  • Tie polling lifecycle to auth state: stop on logout, start on login. Export cleanup function.
  • Tests: polling stops when auth false, restarts when auth true, no timer after unmount
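
Tying the timer to auth state could look roughly like the sketch below. The names `createAuthBoundPoller` and `IntervalApi` are hypothetical, and the timer functions are injected only so the logic can be tested without real timers; in the composable they would be the browser's `setInterval` and `clearInterval`.

```typescript
// Injected timer functions keep the sketch testable; in the browser these
// would simply wrap setInterval / clearInterval.
interface IntervalApi {
  set(fn: () => void, ms: number): number;
  clear(id: number): void;
}

function createAuthBoundPoller(poll: () => void, intervalMs: number, timers: IntervalApi) {
  let timerId: number | null = null;

  function stopPolling(): void {
    if (timerId !== null) {
      timers.clear(timerId);
      timerId = null;
    }
  }

  function startPolling(): void {
    // Guard against double starts so repeated logins never stack timers.
    if (timerId === null) {
      timerId = timers.set(poll, intervalMs);
    }
  }

  return {
    // Call from a watcher on the auth flag: true starts, false stops.
    onAuthChange(isAuthenticated: boolean): void {
      if (isAuthenticated) startPolling();
      else stopPolling();
    },
    // Exported cleanup for unmount.
    cleanup: stopPolling,
    isRunning: (): boolean => timerId !== null,
  };
}
```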

F6 — AppLauncher message listener leak

  • File: neode-ui/src/stores/appLauncher.ts
  • Ensure listener is removed when app closes (even if not via close button — e.g., route navigation)
  • Tests: navigate away from app → listener removed, new app opens clean

F7 — Audio player listener stacking

  • File: neode-ui/src/composables/useAudioPlayer.ts
  • Create Audio element once, register listeners once. Track initialization flag.
  • Tests: calling play() 10 times → still only 6 listeners total (not 60)
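
The initialization-flag idea can be illustrated with a lazily created singleton. This is a sketch only: `createPlayerInit` and `ListenerTarget` are hypothetical names, and the event names used with it are placeholders, since the plan does not list the six listeners the composable registers.

```typescript
// Minimal surface the sketch needs from an audio element.
interface ListenerTarget {
  addEventListener(type: string, listener: () => void): void;
}

// Creates the audio element and registers its listeners exactly once,
// no matter how many times the returned function is called.
function createPlayerInit(
  createAudio: () => ListenerTarget,
  handlers: Record<string, () => void>,
) {
  let audio: ListenerTarget | null = null;

  return function ensureAudio(): ListenerTarget {
    if (audio === null) {
      const created = createAudio();
      Object.keys(handlers).forEach((type) => {
        const handler = handlers[type];
        if (handler) {
          created.addEventListener(type, handler);
        }
      });
      audio = created;
    }
    return audio;
  };
}
```

A `play()` implementation would call `ensureAudio()` first and then operate on the returned element, so repeated calls reuse the same element and listeners.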

S3 — Pin all container images (remove :latest)

  • Files: first-boot-containers.sh (15), deploy-to-target.sh (11), deploy-tailscale.sh (18), build-auto-installer-iso.sh (7)
  • Replace every :latest with specific version tag
  • Create image-versions.env sourced by all scripts — single source of truth
  • Tests: grep -r ':latest' scripts/ image-recipe/ returns zero matches (excluding comments)

MONTH 2: OPERATIONAL SAFETY (Weeks 5-8)

Fix everything that makes deploys dangerous, scripts unreliable, or operations opaque.

Week 5: Deploy Script Hardening

S4 — first-boot error handling

  • Add per-section error checking: if Bitcoin fails, skip dependent containers (LND, Mempool, BTCPay)
  • Add wait_for_container return value checking
  • Tests: first-boot with broken Bitcoin image → Bitcoin deps skipped, independent apps still start

S5 — Replace eval with safe construct

  • File: deploy-to-target.sh:940
  • Replace eval "$DB_PASSWORDS" with explicit variable assignment from SSH output
  • Tests: passwords parsed correctly without eval

S6 — Deploy locking

  • File: deploy-to-target.sh
  • Add remote flock on /var/lock/archipelago-deploy.lock. Second deploy fails immediately with message. Stale lock (>30 min) broken automatically.
  • Tests: two parallel deploys → second fails, stale lock → broken and deploy proceeds

S7 — Deploy rollback

  • File: deploy-to-target.sh
  • Before overwriting binary: cp archipelago archipelago.bak
  • Before overwriting frontend: cp -r web-ui web-ui.bak
  • If health check fails post-restart: restore from .bak, restart again
  • Tests: intentionally broken binary → deploy detects, rolls back, system healthy

S8 — Eliminate sshpass

  • File: trust-archipelago-cert.sh
  • Rewrite to use SSH key only: ssh -i ~/.ssh/archipelago-deploy
  • Tests: script works with key auth, fails gracefully without key

Week 6: Script Quality

S9 — MariaDB password not on command line

  • File: first-boot-containers.sh:285
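  • Use $DOCKER exec -i ... mariadb -uroot <<< "SET PASSWORD..." so the statement arrives on stdin and never appears in the argument list (the extra < /dev/stdin redirect is redundant once the here-string is used)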
  • Tests: ps aux during execution doesn't show password

S10 — Replace silent error masking

  • File: deploy-to-target.sh (80+ instances)
  • Pattern: replace 2>/dev/null || echo "" with || { log_warn "..."; echo ""; }
  • At minimum, log what failed before masking
  • Tests: failed health check produces log entry

S11 — Trap cleanup for temp files

  • All scripts that create /tmp files: add trap "rm -rf /tmp/deploy-$$" EXIT at start
  • Files: deploy-to-target.sh, deploy-tailscale.sh, build-auto-installer-iso.sh
  • Tests: script interrupted mid-execution → temp files cleaned up

S12 — Quote all variables

  • Audit and fix unquoted $VARIABLE in command arguments across all scripts
  • Tests: shellcheck passes on all modified scripts

S13 — Extract hardcoded IPs to config

  • Create scripts/deploy-config-defaults.sh with all node IPs as named variables
  • Source from all scripts instead of hardcoding
  • Tests: changing IP in config → all scripts use new IP

Week 7: Infrastructure Hardening

I2 — Systemd resource limits

  • File: image-recipe/configs/archipelago.service
  • Add: MemoryMax=4G, LimitNOFILE=65535, TasksMax=2048
  • Tests: systemctl show archipelago confirms limits applied, service starts normally

I3 — Tor rotation transition period

  • File: core/archipelago/src/api/rpc/tor.rs
  • Keep old hidden service running for 24h after rotation. Both addresses active. Notify peers of new address. Schedule old deletion.
  • Tests: after rotation old address still resolves, peers receive notification, old removed after transition

S14 — Input validation on deploy targets

  • Add regex validation for hostnames/IPs before SSH
  • Tests: invalid hostname → clear error, valid hostname → proceeds

S15 — Memory limits on all deploy containers

  • File: deploy-to-target.sh lines 842-880
  • Add --memory=$(mem_limit ...) to all UI container builds
  • Tests: every container in deploy has --memory flag

S17 — Disk space pre-flight

  • File: deploy-to-target.sh
  • Check target disk <85% before deploying. Abort with clear message if full.
  • Tests: deploy to 90% full disk → aborted, deploy to 50% full → succeeds

Week 8: Remaining P1 Backend

R14 — Fix .parse().unwrap() in session rate limiting

  • File: session.rs:665,676,688
  • Replace .parse().unwrap() with .parse().context("...")?
  • Tests: invalid IP handling works gracefully

R15 — Fix 7 unwrap/expect in mesh/protocol.rs

  • File: mesh/protocol.rs:582,592,614,649,679,713,728
  • Replace all with ? operator + proper error types
  • Tests: protocol parsing with malformed data returns error, not panic

R27 — Add timeouts to mesh Bitcoin RPC calls

  • File: mesh/mod.rs:624,649,663
  • Add tokio::time::timeout(Duration::from_secs(10), ...) to all Bitcoin RPC calls
  • Tests: RPC timeout returns error after 10s

R34 — Tor rotation transition

  • (Covered by I3 above)

MONTH 3: PRODUCTION POLISH (Weeks 9-12)

Fix every remaining P2 issue — unwraps, hardcoded values, frontend quality, resilience.

Week 9: Remaining Backend Unwraps + Dead Code

R13 — main.rs .expect() → .context()

  • Replace 2 .expect() calls with .context("...")? and proper startup error handling

R16 — identity.rs .expect() → safe handling

  • Replace 2 .expect() in crypto operations with result propagation

R17+R18 — helpers unwraps

  • Fix 10 .unwrap() calls in helpers/lib.rs and helpers/rsync.rs
  • Replace with ? operator or .context()

R19 — js-engine unwraps

  • Fix 2 .unwrap() in js-engine/lib.rs:130,249

R20+R21 — Dead code elimination

  • Remove all 14 #[allow(dead_code)] in mesh/mod.rs. Either use the fields or delete them.
  • Same for lnd.rs, data_manager.rs, dev_orchestrator.rs
  • Tests: cargo clippy zero warnings, cargo test passes

Week 10: Hardcoded Values → Constants

R22 — Bitcoin RPC URL constant

  • Create const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/"; in a shared constants module
  • Use across bitcoin.rs, mesh/mod.rs, mesh/listener.rs
  • Tests: all Bitcoin RPC calls still work

R23 — DWN health URL constant
R24 — Update manifest URL constant
R25 — DNS-over-HTTPS URLs → constants array
R26 — DWN protocol URIs → constants

  • Centralize all hardcoded URLs/URIs into core/archipelago/src/constants.rs
  • Tests: all modules reference constants, no hardcoded strings remain

R28 — LND proxy timeouts

  • Audit all 68 .send() calls in api/rpc/lnd.rs. Ensure each has explicit timeout.
  • Tests: LND proxy call with unresponsive LND → timeout error, not hang

R29 — DWN health check timeout

  • Add timeout to dwn_sync.rs:76 health check

R30-R33 — Resolve all TODOs

  • Either implement the TODO or remove the dead code path. Per project rules: no TODO/FIXME in commits.

Week 11: Frontend P2 Fixes

F8 — WebSocket reconnection race

  • Add isReconnecting flag. Skip if already reconnecting.
  • Tests: rapid close events → only one reconnect attempt
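
The `isReconnecting` guard amounts to a few lines. In this sketch `createReconnector` is a hypothetical wrapper and `connect` stands in for whatever the WebSocket client's real connect routine is.

```typescript
// Wraps a connect routine so overlapping close events start only one attempt.
function createReconnector(connect: () => Promise<void>) {
  let isReconnecting = false;

  // Resolves to true if this call performed the reconnect, false if it was skipped.
  return async function reconnect(): Promise<boolean> {
    if (isReconnecting) {
      return false; // another attempt is already running
    }
    isReconnecting = true;
    try {
      await connect();
      return true;
    } finally {
      // Reset on both success and failure so later attempts are not blocked.
      isReconnecting = false;
    }
  };
}
```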

F9 — WebSocket parse error handling

  • Count consecutive parse errors. After 3, force reconnect.
  • Tests: 3 malformed messages → reconnect triggered; single bad message → logged only
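
Counting consecutive failures, as opposed to total failures, is the important detail here. A minimal sketch follows, with `createParseGuard` and `forceReconnect` as hypothetical names; the threshold of 3 is the one stated above.

```typescript
// Counts consecutive JSON parse failures; a successful parse resets the count.
// forceReconnect is a hypothetical hook into the WebSocket client.
function createParseGuard(forceReconnect: () => void, threshold = 3) {
  let consecutiveErrors = 0;

  // Returns the parsed payload, or undefined when the message was malformed.
  return function parseMessage(raw: string): unknown {
    try {
      const parsed: unknown = JSON.parse(raw);
      consecutiveErrors = 0;
      return parsed;
    } catch (err) {
      consecutiveErrors += 1;
      console.warn(`WebSocket parse error ${consecutiveErrors}/${threshold}`, err);
      if (consecutiveErrors >= threshold) {
        consecutiveErrors = 0;
        forceReconnect();
      }
      return undefined;
    }
  };
}
```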

F10 — Stale connection detection tuning

  • Require mutual pong response within 30s. Don't close valid connections that are simply quiet.
  • Tests: quiet but healthy connection → stays open; no pong for 30s → reconnects

F11 — RPC client backoff reduction

  • Reduce default timeout from 30s to 15s. Add jitter to backoff. Cap total retry time at 20s.
  • Tests: server outage → user sees error within 20s, not 40s
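
To make the jitter and the cap concrete, the sketch below plans the waits between attempts using full-jitter exponential backoff and stops once the summed waits would exceed a budget. `planRetryDelays` is a hypothetical helper and its numeric defaults are illustrative, not taken from rpc-client.ts. It covers only the waiting between attempts, so the budget passed in would need to be whatever remains of the 20s after per-request timeouts are accounted for.

```typescript
// Full-jitter exponential backoff whose summed delays never exceed a budget.
// All numeric defaults here are illustrative assumptions.
function planRetryDelays(
  maxTotalMs: number,
  baseMs = 500,
  maxAttempts = 5,
  random: () => number = Math.random,
): number[] {
  const delays: number[] = [];
  let total = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // The ceiling doubles each attempt; jitter picks a point below it.
    const ceiling = baseMs * Math.pow(2, attempt);
    const delay = Math.floor(random() * ceiling);
    if (total + delay > maxTotalMs) {
      break;
    }
    delays.push(delay);
    total += delay;
  }
  return delays;
}
```

Injecting `random` keeps the function deterministic under test while production code uses the `Math.random` default.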

F12 — Code splitting

  • Lazy-load all routes: () => import('./views/Web5.vue')
  • Add manual chunks in vite.config.ts for vendor/api
  • Tests: build produces multiple chunks, initial bundle < 200KB gzipped

F13 — DOMPurify on QR v-html

  • Add DOMPurify.sanitize() to QR SVG before v-html rendering
  • Tests: XSS payload in QR content → sanitized

Week 12: Frontend P2 Continued + Performance

F14 — Goals computed memoization

  • Replace O(n) alias lookup with Map. Add deep equality check.
  • Tests: goalStatuses computed runs in <1ms with 100 apps
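
Replacing the linear scan with a Map might look like the following. The data shape is an assumption: `AppAliasEntry`, `buildAliasIndex`, and `resolveAppId` are hypothetical names, and case-insensitive matching is assumed rather than confirmed by goals.ts.

```typescript
// Hypothetical shape: each app has a canonical id plus optional aliases.
interface AppAliasEntry {
  id: string;
  aliases: string[];
}

// Build the index once (O(n)); every later lookup is O(1).
function buildAliasIndex(apps: AppAliasEntry[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const app of apps) {
    index.set(app.id.toLowerCase(), app.id);
    for (const alias of app.aliases) {
      index.set(alias.toLowerCase(), app.id);
    }
  }
  return index;
}

// Resolves an id or alias to the canonical app id, or undefined if unknown.
function resolveAppId(index: Map<string, string>, name: string): string | undefined {
  return index.get(name.toLowerCase());
}
```

The index would be rebuilt only when the app list changes, so the computed that depends on it does constant-time lookups on every evaluation.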

F15 — localStorage error handling

  • Wrap all localStorage.setItem in try/catch. Show toast on quota exceeded.
  • Tests: full localStorage → toast shown, app continues
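
A single shared helper avoids repeating the try/catch at every call site. The sketch below is an assumption about how that helper might be shaped; `safeSetItem` and `KeyValueStore` are hypothetical names, and the storage object is passed in so the helper can be exercised outside a browser.

```typescript
// Minimal subset of the Web Storage API so the helper can be exercised
// without a browser; localStorage provides a compatible setItem method.
interface KeyValueStore {
  setItem(key: string, value: string): void;
}

// Returns true on success. On failure (for example a quota error) it reports
// through onError instead of throwing, so the caller keeps running.
function safeSetItem(
  store: KeyValueStore,
  key: string,
  value: unknown,
  onError: (message: string) => void,
): boolean {
  try {
    store.setItem(key, JSON.stringify(value));
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.name : "unknown error";
    onError(`Could not save "${key}": ${reason}`);
    return false;
  }
}
```

In the stores, `onError` would be the existing toast function, which gives the "toast shown, app continues" behaviour the tests call for.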

F16 — FileBrowser auth consolidation

  • Use cookie-only auth. Remove in-memory token.
  • Tests: login persists across page reload, logout clears cookie

F17 — CSRF token parsing robustness

  • Add header fallback for CSRF token. Handle edge cases.
  • Tests: missing cookie → falls back to header, both missing → error
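
Cookie parsing without a regex, plus the fallback, can be sketched as below. The cookie name `csrf_token` and the function names `readCookie` and `resolveCsrfToken` are assumptions for illustration; the actual names used by rpc-client.ts and the backend may differ.

```typescript
// Parses a Cookie-style string without a regex. Names are matched exactly,
// so "csrf_token_old" is never mistaken for "csrf_token".
function readCookie(cookieHeader: string, name: string): string | undefined {
  for (const part of cookieHeader.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue;
    if (part.slice(0, eq).trim() !== name) continue;
    const raw = part.slice(eq + 1).trim();
    try {
      return decodeURIComponent(raw);
    } catch {
      return raw; // malformed percent-encoding: fall back to the raw value
    }
  }
  return undefined;
}

// Cookie first, then a fallback value (for example from a response header).
// The cookie name "csrf_token" is an assumption.
function resolveCsrfToken(cookieHeader: string, headerToken: string | null): string {
  const token = readCookie(cookieHeader, "csrf_token") || headerToken;
  if (!token) {
    throw new Error("CSRF token missing from both cookie and header");
  }
  return token;
}
```

In the browser the first argument would be `document.cookie`; passing it in explicitly keeps both functions pure and easy to test.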

F22 — CSS backdrop-filter mobile performance

  • Add media query: reduce blur to 8px on mobile. Remove backdrop-filter from non-visible elements.
  • Tests: mobile Lighthouse performance score > 80

MONTH 4-5: BACKEND ARCHITECTURE (Weeks 13-20)

Split every Rust god file. Target: no file > 500 lines.

Week 13-14: Split package.rs (1,795 lines)

api/rpc/package/
├── mod.rs          — Re-exports (~50 lines)
├── config.rs       — get_app_config(), get_app_capabilities(), needs_archy_net()
├── lifecycle.rs    — install, start, stop, restart, uninstall
├── validation.rs   — Input validation, dependency checking, image validation
└── progress.rs     — Progress streaming, install status tracking

Pre-split tests: test every get_app_config() variant, validation path, lifecycle transition
Post-split: all RPC calls return identical responses, cargo test passes

Week 15-16: Split mesh/listener.rs (1,799 lines)

mesh/listener/
├── mod.rs          — Re-exports + spawn_mesh_listener()
├── session.rs      — run_mesh_session() loop
├── frames.rs       — handle_frame() dispatcher
├── identity.rs     — handle_identity_received(), handle_typed_message()
├── sync.rs         — sync_queued_messages(), store_typed_message()
└── bitcoin.rs      — Bitcoin relay operations, RPC calls

Week 17-18: Split rpc/mod.rs (1,092 lines) + lnd.rs (1,068 lines)

rpc/mod.rs → dispatcher.rs (method routing), middleware.rs (CSRF/session/rate-limit), response.rs (response building)

lnd.rs → lnd/wallet.rs, lnd/channels.rs, lnd/info.rs, lnd/payments.rs

Week 19-20: Split monitoring (993), handler (911), mesh (865)

Split each into sub-modules. Target: no file > 500 lines. All pre-split tests, all post-split verification.


MONTH 6-8: FRONTEND ARCHITECTURE (Weeks 21-32)

Split every Vue god component. Target: no component > 500 lines.

Week 21-22: Split Web5.vue (3,940 lines → 8 sub-views)

views/web5/
├── Web5.vue            — Router shell (~150 lines)
├── Web5Identity.vue    — DID management
├── Web5Wallet.vue      — Wallet operations
├── Web5Nostr.vue       — Nostr relays/profiles
├── Web5Credentials.vue — Verifiable Credentials
├── Web5Peers.vue       — P2P federation nodes
├── Web5Storage.vue     — DWN storage/explorer
├── Web5Goals.vue       — Goals/voting
└── Web5Marketplace.vue — Decentralized marketplace

Add nested routes. Component tests for each section. All sections render identically.

Week 23-24: Split Mesh.vue (2,106) + Dashboard.vue (1,819)

Mesh.vue → MeshRadio.vue, MeshChat.vue, MeshNetwork.vue, MeshFederation.vue
Dashboard.vue → DashboardHome.vue, DashboardApps.vue, DashboardSystem.vue

Week 25-26: Split Settings.vue (1,792) + Server.vue (1,132)

Settings.vue → SettingsAccount.vue, SettingsSystem.vue, SettingsNetwork.vue, SettingsAppearance.vue
Server.vue → ServerOverview.vue, ServerContainers.vue, ServerLogs.vue

Week 27-28: Split Marketplace.vue (1,293) + AppDetails.vue (1,036) + Home.vue (1,059)

Each into 3-4 focused sub-components.

Week 29-30: Decompose useAppStore (324 lines, 16 methods)

stores/
├── app.ts          — Thin re-export for backward compat (~50 lines)
├── auth.ts         — Login, logout, session, password, TOTP
├── server.ts       — Server info, system stats, reboot/shutdown
├── realtime.ts     — WebSocket connection, subscriptions, heartbeat
└── packages.ts     — Package install/uninstall, marketplace data

Tests: every existing import of useAppStore still works. State transitions identical.

Week 31-32: Remaining frontend P3 issues

F18 — aiPermissions runtime validation
F19 — Track AppSession timeout
F20 — Dashboard aria-current
F21 — Debounce search + memoize
F23 — Branded types for DID operations
F24 — Fix checkInterval leak


MONTH 9-10: SCRIPT ARCHITECTURE + ISO (Weeks 33-40)

Split every monolithic script. Target: no script > 400 lines.

Week 33-34: Create shared script library

scripts/lib/
├── common.sh       — Colors, logging, error handling, SSH helpers
├── health.sh       — Health check polling, container status
├── deploy-utils.sh — Rsync, file sync, backup/restore
├── container.sh    — Podman helpers, image management, mem_limit()
└── network.sh      — IP validation, port checking

Tests: each library function tested in scripts/tests/

Week 35-36: Split deploy-to-target.sh (1,728 lines)

scripts/
├── deploy-to-target.sh  — Orchestrator + arg parsing (~300 lines)
├── deploy/
│   ├── frontend.sh      — Build + sync frontend
│   ├── backend.sh       — Build + sync binary
│   ├── configs.sh       — Sync nginx, systemd, scripts
│   ├── containers.sh    — Container creation/update
│   ├── verify.sh        — Post-deploy health checks
│   └── rollback.sh      — Rollback on failure

Week 37-38: Split ISO build (1,850 lines) + first-boot (855 lines)

build-auto-installer-iso.sh → build/capture-images.sh, build/create-rootfs.sh, build/install-packages.sh, build/bundle-configs.sh, build/package-iso.sh

first-boot-containers.sh → first-boot/databases.sh, first-boot/bitcoin.sh, first-boot/lightning.sh, first-boot/apps.sh, first-boot/networking.sh

Week 39-40: ISO Reproducibility + Integration Tests

S16 — Make ISO builds reproducible

  • Create image-versions.env with pinned digests for every container image
  • ISO build sources this file, never pulls :latest
  • Build manifest records exactly what shipped
  • Tests: two consecutive ISO builds produce identical image sets

E2E smoke test script

# scripts/smoke-test.sh — Run against .198
# 1. curl /health → OK
# 2. Login → get session
# 3. Get server info → valid JSON
# 4. List containers → all healthy
# 5. Check every /app/* proxy → responds
# 6. Check Tor hidden service → resolves
# 7. Check WebSocket upgrade → 101
# Exit 0 only if all pass

MONTH 11: INTEGRATION TESTS (Weeks 41-44)

Comprehensive test suites that prove everything works.

Week 41-42: Backend Integration Tests

core/archipelago/tests/
├── test_auth_flow.rs           — Login → session → CSRF → auth request → logout
├── test_container_lifecycle.rs — Install → start → health → stop → uninstall
├── test_federation.rs          — Generate invite → join → sync → verify
├── test_rpc_validation.rs      — Every endpoint with invalid input → proper error
├── test_session_persist.rs     — Create session → restart → session survives
├── test_rate_limiting.rs       — Flood → 429 → wait → allowed
├── test_backup_restore.rs      — Create → verify → restore → validate
├── test_health_endpoint.rs     — Healthy → degraded → recovery

Target: 25+ backend integration tests passing

Week 43-44: Frontend Integration Tests

neode-ui/src/__tests__/integration/
├── auth-flow.spec.ts           — Login → dashboard → timeout → redirect
├── app-lifecycle.spec.ts       — Marketplace → install → progress → launch → uninstall
├── websocket.spec.ts           — Connect → update → disconnect → reconnect → state consistent
├── settings-flow.spec.ts       — Change password → re-login → 2FA setup → verify
├── spotlight.spec.ts           — Open → search → navigate → close
├── mesh-chat.spec.ts           — Connect → send → receive → disconnect
├── error-handling.spec.ts      — Network error → toast → retry → success
├── code-splitting.spec.ts      — Route navigation → chunks loaded lazily

Target: 20+ frontend integration tests passing


MONTH 12: TYPE SYNC + CI/CD PLAN (Weeks 45-48)

Week 45-46: Rust↔TypeScript Type Sync

Approach: ts-rs crate to auto-generate TypeScript types from Rust structs

  1. Add ts-rs to core/models/Cargo.toml
  2. Add #[derive(TS)] to all API request/response types
  3. Build script generates neode-ui/src/types/generated.ts
  4. Replace manual types in types/api.ts with imports from generated file
  5. Verification: regenerate → diff → must be zero (types committed)

Tests: frontend type-check passes with generated types, manual api.ts reduced to non-API types

Week 47-48: CI/CD Planning (Document Only, Execute Later)

This section is the PLAN for CI/CD. Do not execute during this phase. Document everything needed so it can be implemented in a future sprint.

CI Pipeline Design (.github/workflows/ci.yml):

# Triggers: push to main, all PRs
# Jobs:
#   rust-checks (Linux runner):
#     - cargo clippy --all-targets --all-features (zero warnings gate)
#     - cargo fmt --all -- --check (formatting gate)
#     - cargo test --all-features (all tests gate)
#
#   frontend-checks (Node 20):
#     - npm run type-check (TypeScript strictness gate)
#     - npm run lint (ESLint gate)
#     - npm test (Vitest suite gate)
#
#   integration (Linux runner, optional):
#     - scripts/smoke-test.sh against staging
#
# Merge policy: all checks must pass before merge
# Branch protection: require PR, require checks, no force push to main

Release Pipeline Design (.github/workflows/release.yml):

# Triggers: tag push (v*)
# Jobs:
#   build-linux-binary:
#     - Cross-compile Rust for x86_64 + ARM64
#   build-frontend:
#     - npm run build
#   build-iso:
#     - SSH to build server, run ISO build
#     - Upload ISO as release asset
#   smoke-test:
#     - Boot ISO in QEMU
#     - Run smoke-test.sh
#     - Gate release on pass

Pre-requisites to implement:

  • GitHub Actions runner with Rust toolchain + cross-compilation
  • Node.js 20 runner for frontend
  • SSH key for build server accessible from CI
  • Branch protection rules configured
  • Image digest manifest for reproducible ISO builds
  • QEMU-based ISO verification script

Estimated implementation time: 2 weeks when ready to execute


VERIFICATION PROTOCOL (Every Week)

  1. cargo clippy --all-targets --all-features — zero warnings
  2. cargo fmt --all
  3. cargo test --all-features — all pass
  4. cd neode-ui && npm run type-check — zero errors
  5. cd neode-ui && npm test — all pass
  6. ./scripts/deploy-to-target.sh --target 192.168.1.198 (ONLY .198)
  7. curl http://192.168.1.198/health — returns OK with service status
  8. Navigate all affected views in browser — identical behavior
  9. Atomic commit: refactor: <description> or fix: <description>

EXIT CRITERIA (Month 12 Complete)

Reliability (Zero Tolerance)

  • Health endpoint returns real service status
  • All async operations have bounded timeouts
  • Zero blocking I/O in async context (no std::fs in async functions)
  • Zero .unwrap()/.expect() in production code
  • All rate limiters have cleanup tasks
  • Backup restore uses staging + atomic swap + rollback
  • All 30 containers have health checks + memory limits
  • All container images pinned to specific versions
  • Nginx unauthenticated endpoints protected (timeout + rate limit + body size)
  • Systemd service has resource limits
  • Tor rotation preserves old address during transition
  • Deploy has locking + disk check + rollback
  • Zero sudo podman in any script
  • Zero :latest image tags anywhere
  • Zero silent error masking without logging

Frontend (Zero Tolerance)

  • Global error handler catches and displays all errors
  • WebSocket: single subscription, reconnect refreshes state, bounded retries
  • All timers/listeners cleaned up on unmount
  • Code splitting: initial bundle < 200KB gzipped
  • v-html always uses DOMPurify
  • All localStorage operations wrapped in try/catch

Architecture (Target: File Size Limits)

  • No Rust file > 500 lines (excluding generated code)
  • No Vue component > 500 lines
  • No shell script > 400 lines
  • No Pinia store has more than 1 responsibility
  • All hardcoded URLs/ports extracted to constants
  • Shared script library eliminates duplication
  • TypeScript types auto-generated from Rust structs

Testing

  • 25+ backend integration tests passing
  • 20+ frontend integration tests passing
  • E2E smoke test script passes on .198
  • ISO builds are reproducible (pinned digests)

CI/CD (Planned, Not Executed)

  • CI pipeline design documented
  • Release pipeline design documented
  • Pre-requisites list complete
  • Ready for 2-week implementation sprint

Zero Behavior Changes

Every feature works identically. Every existing test passes. Every user flow unchanged.