Archipelago: Production Excellence Plan
Duration: 12 months (48 weeks)
Goal: Code so good no developer could question any decision. Apple-level reliability. Every failure visible and recoverable. Every operation bounded. Every line justified.
Audited: 2026-03-20 — 122 Rust files, 38 Vue views, 180+ frontend files, 80+ shell scripts
CONSTRAINTS
- DEPLOY ONLY TO .198 — Never .228. All verification on .198.
- BETA FREEZE — Behavior-preserving only. No new features/UI/endpoints.
- Tests before every refactor — Capture current behavior first. Tests must pass unchanged after.
- Atomic commits — One logical change per commit. Every step compiles + passes tests.
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198
COMPLETE ISSUE REGISTRY
Backend Rust — 122 files audited
| ID | Issue | File(s) | Severity |
|---|---|---|---|
| R1 | Health RPC endpoint has no handler — returns "Unknown method" | api/rpc/mod.rs | P0 |
| R2 | Nostr client.connect() hangs indefinitely (4 calls, no timeout) | nostr_handshake.rs:124,161,262,282 | P0 |
| R3 | Backup restore extracts directly to live dir — no atomic rollback | backup/full.rs:122-149 | P0 |
| R4 | Rate limiter cleanup() never spawned — HashMap grows forever | session.rs:566-579 | P1 |
| R5 | Login rate limiter same issue — entries never evicted | session.rs:452-472 | P1 |
| R6 | Blocking std::fs in async — session.rs (6 calls) | session.rs:77,128,370,413,423,425 | P1 |
| R7 | Blocking std::fs in async — docker_packages.rs | docker_packages.rs:561,573 | P1 |
| R8 | Blocking std::fs in async — port_allocator.rs | port_allocator.rs:59,73,77 | P1 |
| R9 | Blocking std::fs in async — peers.rs, node_message.rs | peers.rs:30, node_message.rs:65 | P1 |
| R10 | Blocking std::fs in async — identity.rs, identity_manager.rs | identity.rs:50, identity_manager.rs:164 | P1 |
| R11 | Blocking std::fs in async — nostr_discovery.rs | nostr_discovery.rs:55 | P1 |
| R12 | Sync TCP I/O in async context — electrs_status.rs | electrs_status.rs:5,40,78,81 | P1 |
| R13 | .expect() in main.rs startup | main.rs:124,159 | P2 |
| R14 | .parse().unwrap() in session.rs rate limiting | session.rs:665,676,688 | P1 |
| R15 | 7 .unwrap()/.expect() in mesh/protocol.rs | protocol.rs:582,592,614,649,679,713,728 | P1 |
| R16 | .expect() in identity.rs crypto | identity.rs:114,119 | P2 |
| R17 | .unwrap() in helpers/lib.rs (5 calls) | helpers/lib.rs:167,172,180,233,253 | P2 |
| R18 | .unwrap() in helpers/rsync.rs (5 calls) | rsync.rs:196,199,202,210,220 | P2 |
| R19 | .unwrap() in js-engine/lib.rs | js-engine/lib.rs:130,249 | P2 |
| R20 | 14 #[allow(dead_code)] suppressions in mesh/mod.rs | mesh/mod.rs:7-25 | P2 |
| R21 | Dead code in lnd.rs, data_manager.rs, dev_orchestrator.rs | Multiple | P2 |
| R22 | Bitcoin RPC URL hardcoded in 4+ files | bitcoin.rs:89, mesh/mod.rs:624,649,663, listener.rs:1509+ | P2 |
| R23 | DWN health URL hardcoded | dwn_sync.rs:76 | P2 |
| R24 | Update manifest URL hardcoded | update.rs:11 | P3 |
| R25 | DNS-over-HTTPS URLs hardcoded (4 providers) | network/dns.rs:98,102,106,110 | P3 |
| R26 | DWN protocol URIs hardcoded in server.rs | server.rs:453-456 | P3 |
| R27 | Missing timeouts on mesh Bitcoin RPC calls | mesh/mod.rs:624,649,663 | P1 |
| R28 | Missing timeouts on LND proxy calls (68 .send() calls) | api/rpc/lnd.rs | P2 |
| R29 | Missing timeout on DWN health check | dwn_sync.rs:76 | P2 |
| R30 | TODO: track last-seen timestamp | handshake.rs:77 | P3 |
| R31 | TODO: lnd.lookupinvoice RPC endpoint | marketplace.rs:183 | P3 |
| R32 | TODO: trigger auto-restart or alert | container/health_monitor.rs:140 | P3 |
| R33 | TODO: configure Podman to use AppArmor profile | security/container_policies.rs:68 | P3 |
| R34 | Tor rotation deletes old .onion immediately — no transition | api/rpc/tor.rs:184-240 | P1 |
| R35 | package.rs god file — 1,795 lines | api/rpc/package.rs | P2 |
| R36 | mesh/listener.rs god file — 1,799 lines | mesh/listener.rs | P2 |
| R37 | rpc/mod.rs god file — 1,092 lines | api/rpc/mod.rs | P2 |
| R38 | lnd.rs god file — 1,068 lines | api/rpc/lnd.rs | P2 |
| R39 | monitoring/mod.rs — 993 lines | monitoring/mod.rs | P3 |
| R40 | api/handler.rs — 911 lines | api/handler.rs | P3 |
| R41 | 30+ functions exceed 50 lines across codebase | Multiple | P3 |
Frontend — 180+ files audited
| ID | Issue | File(s) | Severity |
|---|---|---|---|
| F1 | WebSocket subscription registered multiple times — race condition | stores/app.ts:88-134 | P0 |
| F2 | Unprotected concurrent mesh state mutations | stores/mesh.ts:249-268,294-324 | P0 |
| F3 | No global Vue error handler — white screen on error | main.ts | P0 |
| F4 | Stale data after WebSocket reconnect — no full refresh | stores/app.ts:88-163 | P1 |
| F5 | Message polling timer never stopped after logout | composables/useMessageToast.ts:60 | P1 |
| F6 | AppLauncher NIP-07 message listener leak on close | stores/appLauncher.ts:295-301 | P1 |
| F7 | Audio player listeners stack — never cleaned up | composables/useAudioPlayer.ts:1-91 | P1 |
| F8 | WebSocket reconnection race — parallel connect() attempts | api/websocket.ts:212-238 | P2 |
| F9 | WebSocket parse error silently caught — stale UI forever | api/websocket.ts:164-172 | P2 |
| F10 | WebSocket stale connection detection too aggressive (5min) | api/websocket.ts:284-299 | P2 |
| F11 | RPC client backoff + timeout = 40s max wait | api/rpc-client.ts:31-117 | P2 |
| F12 | No code splitting — monolithic bundle | vite.config.ts | P2 |
| F13 | v-html on QR code without DOMPurify | views/Settings.vue:441 | P2 |
| F14 | Goals store O(n) alias lookup on every computed | stores/goals.ts:16-20,38-89 | P2 |
| F15 | localStorage save without try/catch (5+ instances) | stores/goals.ts:34-36 + others | P2 |
| F16 | FileBrowser auth token duality — memory + cookie | api/filebrowser-client.ts:39,50-68 | P2 |
| F17 | CSRF token cookie parsing brittle — regex only | api/rpc-client.ts:18-21 | P2 |
| F18 | aiPermissions.ts Set uses unsafe type assertion | stores/aiPermissions.ts:91-103 | P3 |
| F19 | Untracked setTimeout in AppSession — fires after unmount | views/AppSession.vue:507 | P3 |
| F20 | Dashboard navigation missing aria-current="page" | views/Dashboard.vue | P3 |
| F21 | Search performance — string re-lowercasing every keystroke | views/Apps.vue:510-537 | P3 |
| F22 | 30+ backdrop-filter blur elements — GPU overload on mobile | style.css | P3 |
| F23 | Record<string, unknown> on sensitive DID operations | types/api.ts + rpc-client.ts | P3 |
| F24 | checkInterval timer leak on connect race | api/websocket.ts:82-96 | P3 |
| F25 | Web5.vue god component — 3,940 lines | views/Web5.vue | P2 |
| F26 | Mesh.vue — 2,106 lines | views/Mesh.vue | P2 |
| F27 | Dashboard.vue — 1,819 lines | views/Dashboard.vue | P2 |
| F28 | Settings.vue — 1,792 lines | views/Settings.vue | P2 |
| F29 | Marketplace.vue — 1,293 lines | views/Marketplace.vue | P3 |
| F30 | Server.vue — 1,132 lines | views/Server.vue | P3 |
| F31 | Home.vue — 1,059 lines | views/Home.vue | P3 |
| F32 | AppDetails.vue — 1,036 lines | views/AppDetails.vue | P3 |
| F33 | useAppStore god store — 324 lines, 16 methods, 8+ responsibilities | stores/app.ts | P2 |
Shell Scripts — 80+ files audited
| ID | Issue | File(s) | Severity |
|---|---|---|---|
| S1 | 60+ instances of sudo podman — should be rootless | fix-indeedhub(28), deploy-bitcoin(11), deploy-tailscale(2+) | P0 |
| S2 | Zero container health checks in first-boot (30 containers) | first-boot-containers.sh | P0 |
| S3 | 50+ :latest image tags across all scripts | first-boot(15), deploy(11), tailscale(18), iso(7) | P1 |
| S4 | No set -e in first-boot — silent container failures | first-boot-containers.sh:1-9 | P1 |
| S5 | eval "$DB_PASSWORDS" — code injection risk | deploy-to-target.sh:940 | P1 |
| S6 | No deploy locking — concurrent deploys corrupt state | deploy-to-target.sh | P1 |
| S7 | No deploy rollback — failed deploy leaves broken system | deploy-to-target.sh | P1 |
| S8 | sshpass usage in trust-archipelago-cert.sh | trust-archipelago-cert.sh:23-26 | P1 |
| S9 | MariaDB password in command line — visible in ps | first-boot-containers.sh:285 | P1 |
| S10 | 80+ instances of 2>/dev/null \|\| true masking errors | deploy-to-target.sh | P2 |
| S11 | No trap cleanup for temp files | Multiple scripts | P2 |
| S12 | Unquoted variables (word splitting risk) | Multiple scripts | P2 |
| S13 | Hardcoded IPs in 6+ scripts | deploy-to-target.sh:26, deploy-tailscale.sh:26, etc. | P2 |
| S14 | No input validation on deploy targets | deploy-tailscale.sh | P2 |
| S15 | Missing memory limits on some containers in deploy | deploy-to-target.sh:842-880 | P2 |
| S16 | ISO build not reproducible — dynamic image capture + :latest | build-auto-installer-iso.sh:500-594 | P2 |
| S17 | No disk space pre-flight in deploy | deploy-to-target.sh | P2 |
| S18 | deploy-to-target.sh — 1,728 lines monolith | deploy-to-target.sh | P3 |
| S19 | build-auto-installer-iso.sh — 1,850 lines monolith | build-auto-installer-iso.sh | P3 |
| S20 | first-boot-containers.sh — 855 lines monolith | first-boot-containers.sh | P3 |
| S21 | No shared script library — duplicated functions | scripts/ | P3 |
Infrastructure
| ID | Issue | File(s) | Severity |
|---|---|---|---|
| I1 | Nginx: /archipelago/, /content, /dwn missing timeout+rate-limit+body-size | nginx-archipelago.conf:116-180 | P0 |
| I2 | Systemd: no MemoryMax, LimitNOFILE, TasksMax | archipelago.service | P1 |
| I3 | Tor rotation kills old address immediately — federation downtime | api/rpc/tor.rs:184-240 | P1 |
MONTH 1: CRASH PREVENTION (Weeks 1–4)
Fix every issue that can crash the system, hang indefinitely, or lose data.
Week 1: P0 Backend — Things That Hang or Lose Data
R1 — Health endpoint handler
- File: `core/archipelago/src/api/rpc/mod.rs`
- Add handler for `"health"` method that checks: crash recovery complete, Podman socket responsive, session store loaded
- Tests: health returns JSON status, degraded when Podman unreachable, degraded during recovery
- Verify: `curl http://192.168.1.198/rpc/v1 -d '{"method":"health"}'` returns real status
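As a rough illustration of the three checks above, the handler's result could be modelled as a small struct that collapses to "ok" or "degraded". This is a minimal sketch, not the project's code: `HealthChecks`, `HealthStatus`, the field names, and the JSON shape are all assumptions made for the example.

```rust
/// Results of the three readiness checks named in the plan.
/// All type and field names are hypothetical, not taken from the codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HealthChecks {
    pub crash_recovery_complete: bool,
    pub podman_socket_responsive: bool,
    pub session_store_loaded: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthStatus {
    Ok,
    Degraded,
}

impl HealthChecks {
    /// Healthy only when every check passes; anything else is degraded.
    pub fn status(&self) -> HealthStatus {
        if self.crash_recovery_complete
            && self.podman_socket_responsive
            && self.session_store_loaded
        {
            HealthStatus::Ok
        } else {
            HealthStatus::Degraded
        }
    }

    /// Hand-built JSON keeps the sketch dependency-free; real code would
    /// more likely serialize a struct with a JSON library.
    pub fn to_json(&self) -> String {
        let status = match self.status() {
            HealthStatus::Ok => "ok",
            HealthStatus::Degraded => "degraded",
        };
        format!(
            "{{\"status\":\"{}\",\"crash_recovery_complete\":{},\"podman_socket_responsive\":{},\"session_store_loaded\":{}}}",
            status,
            self.crash_recovery_complete,
            self.podman_socket_responsive,
            self.session_store_loaded
        )
    }
}
```

Reporting each check individually, rather than a single flag, lets the "degraded when Podman unreachable" and "degraded during recovery" tests assert on the specific cause.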
R2 — Nostr connect timeout
- File: `core/archipelago/src/nostr_handshake.rs` lines 124, 161, 262, 282
- Wrap all 4 `client.connect().await` in `tokio::time::timeout(Duration::from_secs(10), ...)`
- Tests: connect timeout returns Err after 10s, successful connect within timeout works
R3 — Backup restore atomic rollback
- File: `core/archipelago/src/backup/full.rs` lines 122-149
- Rewrite: decrypt → extract to staging dir → validate required files → atomic rename → rollback on failure
- Tests: valid backup restores, corrupt backup fails without touching live data, partial extraction rolls back, disk space check fails early
I1 — Nginx unauthenticated endpoint protection
- File: `image-recipe/configs/nginx-archipelago.conf` lines 116-180
- Add to `/archipelago/`, `/content`, `/dwn`:
  - `limit_req zone=peer burst=20 nodelay;`
  - `client_max_body_size 10m;`
  - `proxy_connect_timeout 30s; proxy_read_timeout 60s; proxy_send_timeout 30s;`
- Tests: >10MB payload → 413, slow client → timeout, burst 30 → 429 after 20
Week 2: P0 Frontend + Scripts — Things That Break UI or Containers
F1 — WebSocket subscription race condition
- File: `neode-ui/src/stores/app.ts` lines 88-134
- Fix: Return unsubscribe function from `wsClient.subscribe()`, call it before re-subscribing. Use a subscription ID to prevent duplicates.
- Tests: rapid connectWebSocket() calls produce only one active subscription
F2 — Mesh concurrent state mutations
- File: `neode-ui/src/stores/mesh.ts` lines 249-324
- Fix: Add `isSending` ref as mutex. Queue concurrent sends. `fetchMessages()` called once after all sends complete.
- Tests: 3 concurrent sendMessage() calls → all succeed, messages list consistent
F3 — Global error handler
- File: `neode-ui/src/main.ts`
- Add `app.config.errorHandler` that shows toast + logs structured error
- Tests: thrown error in component shows toast, nested errors don't crash handler
S1 — Eliminate all sudo podman
- Files: `fix-indeedhub-containers.sh` (28), `deploy-bitcoin-knots.sh` (11), `deploy-tailscale.sh` (2+), `uptime-monitor.sh` (1), `setup-aiui-server.sh`
- Replace every `sudo podman` with `podman` (runs as archipelago user)
- Tests: grep for `sudo podman` across all scripts returns zero matches
S2 — Container health checks for all 30 containers
- File: `scripts/first-boot-containers.sh`
- Add `--health-cmd`, `--health-interval=30s`, `--health-timeout=5s`, `--health-retries=3` to every `$DOCKER run`
- Health commands per type:
  - Bitcoin: `bitcoin-cli -rpcuser=... getblockchaininfo || exit 1`
  - HTTP apps: `curl -sf http://localhost:{port}/ || exit 1`
  - LND: `curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1`
  - Databases: `mariadb -u root -p... -e "SELECT 1" || exit 1`
- Tests: script grep confirms every `$DOCKER run` has `--health-cmd`
Week 3: P1 Backend — Blocking I/O and Memory Leaks
R4+R5 — Rate limiter cleanup
- File: `core/archipelago/src/session.rs`
- Spawn background tasks for both `EndpointRateLimiter::cleanup()` and `LoginRateLimiter` cleanup, every 5 min
- Tests: after cleanup, stale entries removed; active entries preserved
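The eviction itself can be shown without an async runtime. The sketch below is a simplified stand-in, not the limiters in `session.rs`: `RateLimiter`, `record`, `tracked`, and the choice to key on `IpAddr` are assumptions for illustration. In the real service a periodic background task would call `cleanup` with the current time; passing `now` explicitly keeps the eviction logic testable without sleeping.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the limiters in session.rs: it only tracks
/// when each client was last seen.
pub struct RateLimiter {
    window: Duration,
    last_seen: HashMap<IpAddr, Instant>,
}

impl RateLimiter {
    pub fn new(window: Duration) -> Self {
        RateLimiter { window, last_seen: HashMap::new() }
    }

    /// Remember that `ip` made a request at `now`.
    pub fn record(&mut self, ip: IpAddr, now: Instant) {
        self.last_seen.insert(ip, now);
    }

    /// Number of clients currently tracked.
    pub fn tracked(&self) -> usize {
        self.last_seen.len()
    }

    /// Evict entries idle for at least `window`; returns how many were removed.
    pub fn cleanup(&mut self, now: Instant) -> usize {
        let before = self.last_seen.len();
        let window = self.window;
        self.last_seen
            .retain(|_, seen| now.saturating_duration_since(*seen) < window);
        before - self.last_seen.len()
    }
}
```

Returning the number of evicted entries gives the "stale entries removed; active entries preserved" test something concrete to assert on, and could also feed a log line or metric.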
R6 — session.rs blocking I/O (6 calls)
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 77, 370, 413
- Replace `std::fs::write` → `tokio::fs::write` at lines 128, 425
- Replace `std::fs::create_dir_all` → `tokio::fs::create_dir_all` at line 423
- Tests: session load/save/persist still works correctly
R7 — docker_packages.rs blocking I/O
- Replace `std::fs::read_to_string` → `tokio::fs::read_to_string` at lines 561, 573
- Tests: app metadata loading works
R8 — port_allocator.rs blocking I/O
- Replace all 3 std::fs calls → tokio::fs at lines 59, 73, 77
- Tests: port allocation/persistence works
R9+R10+R11 — Remaining blocking I/O
- `peers.rs:30`, `node_message.rs:65`, `identity.rs:50`, `identity_manager.rs:164`, `nostr_discovery.rs:55`
- Convert all to tokio::fs
- Tests: each module's file operations still work
R12 — electrs_status.rs sync TCP I/O
- Convert synchronous TCP client to async (tokio::net::TcpStream)
- Tests: ElectrumX status query works, timeout on connection failure
Week 4: P1 Frontend — Memory Leaks and Stale State
F4 — WebSocket reconnect full state refresh
- File: `neode-ui/src/stores/app.ts`
- After reconnect, call `rpcClient.call({method: 'server.get-state'})` to get fresh state before accepting patches
- Tests: after simulated disconnect+reconnect, state matches server
F5 — Message polling timer cleanup
- File: `neode-ui/src/composables/useMessageToast.ts`
- Tie polling lifecycle to auth state: stop on logout, start on login. Export cleanup function.
- Tests: polling stops when auth false, restarts when auth true, no timer after unmount
F6 — AppLauncher message listener leak
- File: `neode-ui/src/stores/appLauncher.ts`
- Ensure listener is removed when app closes (even if not via close button — e.g., route navigation)
- Tests: navigate away from app → listener removed, new app opens clean
F7 — Audio player listener stacking
- File: `neode-ui/src/composables/useAudioPlayer.ts`
- Create Audio element once, register listeners once. Track initialization flag.
- Tests: calling play() 10 times → still only 6 listeners total (not 60)
S3 — Pin all container images (remove :latest)
- Files: `first-boot-containers.sh` (15), `deploy-to-target.sh` (11), `deploy-tailscale.sh` (18), `build-auto-installer-iso.sh` (7)
- Replace every `:latest` with specific version tag
- Create `image-versions.env` sourced by all scripts — single source of truth
- Tests: `grep -r ':latest' scripts/ image-recipe/` returns zero matches (excluding comments)
MONTH 2: OPERATIONAL SAFETY (Weeks 5–8)
Fix everything that makes deploys dangerous, scripts unreliable, or operations opaque.
Week 5: Deploy Script Hardening
S4 — first-boot error handling
- Add per-section error checking: if Bitcoin fails, skip dependent containers (LND, Mempool, BTCPay)
- Add `wait_for_container` return value checking
- Tests: first-boot with broken Bitcoin image → Bitcoin deps skipped, independent apps still start
S5 — Replace eval with safe construct
- File: `deploy-to-target.sh:940`
- Replace `eval "$DB_PASSWORDS"` with explicit variable assignment from SSH output
- Tests: passwords parsed correctly without eval
S6 — Deploy locking
- File: `deploy-to-target.sh`
- Add remote `flock` on `/var/lock/archipelago-deploy.lock`. Second deploy fails immediately with message. Stale lock (>30 min) broken automatically.
- Tests: two parallel deploys → second fails, stale lock → broken and deploy proceeds
S7 — Deploy rollback
- File: `deploy-to-target.sh`
- Before overwriting binary: `cp archipelago archipelago.bak`
- Before overwriting frontend: `cp -r web-ui web-ui.bak`
- If health check fails post-restart: restore from .bak, restart again
- Tests: intentionally broken binary → deploy detects, rolls back, system healthy
S8 — Eliminate sshpass
- File: `trust-archipelago-cert.sh`
- Rewrite to use SSH key only: `ssh -i ~/.ssh/archipelago-deploy`
- Tests: script works with key auth, fails gracefully without key
Week 6: Script Quality
S9 — MariaDB password not on command line
- File: `first-boot-containers.sh:285`
- Use `$DOCKER exec -i ... mariadb -uroot < /dev/stdin <<< "SET PASSWORD..."`
- Tests: `ps aux` during execution doesn't show password
S10 — Replace silent error masking
- File: `deploy-to-target.sh` (80+ instances)
- Pattern: replace `2>/dev/null || echo ""` with `|| { log_warn "..."; echo ""; }`
- At minimum, log what failed before masking
- Tests: failed health check produces log entry
S11 — Trap cleanup for temp files
- All scripts that create /tmp files: add `trap "rm -rf /tmp/deploy-$$" EXIT` at start
- Files: deploy-to-target.sh, deploy-tailscale.sh, build-auto-installer-iso.sh
- Tests: script interrupted mid-execution → temp files cleaned up
S12 — Quote all variables
- Audit and fix unquoted `$VARIABLE` in command arguments across all scripts
- Tests: shellcheck passes on all modified scripts
S13 — Extract hardcoded IPs to config
- Create `scripts/deploy-config-defaults.sh` with all node IPs as named variables
- Source from all scripts instead of hardcoding
- Tests: changing IP in config → all scripts use new IP
Week 7: Infrastructure Hardening
I2 — Systemd resource limits
- File: `image-recipe/configs/archipelago.service`
- Add: `MemoryMax=4G`, `LimitNOFILE=65535`, `TasksMax=2048`
- Tests: `systemctl show archipelago` confirms limits applied, service starts normally
I3 — Tor rotation transition period
- File: `core/archipelago/src/api/rpc/tor.rs`
- Keep old hidden service running for 24h after rotation. Both addresses active. Notify peers of new address. Schedule old deletion.
- Tests: after rotation old address still resolves, peers receive notification, old removed after transition
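The 24h overlap amounts to keeping the old address alongside a deletion deadline. The following is a minimal bookkeeping sketch only: `OnionAddresses`, `rotate`, `active`, and `take_expired` are hypothetical names, and peer notification and the actual hidden-service teardown are left out.

```rust
use std::time::{Duration, SystemTime};

/// 24h overlap, as stated in the plan.
pub const TRANSITION_PERIOD: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical bookkeeping for one rotation; not the real tor.rs types.
pub struct OnionAddresses {
    current: String,
    /// Old address plus the time after which it may be deleted.
    retiring: Option<(String, SystemTime)>,
}

impl OnionAddresses {
    pub fn new(current: &str) -> Self {
        OnionAddresses { current: current.to_string(), retiring: None }
    }

    /// Switch to `new_address` but keep the old one until the deadline.
    /// Simplification: a second rotation inside the window overwrites the
    /// retiring slot instead of queueing it.
    pub fn rotate(&mut self, new_address: &str, now: SystemTime) {
        let old = std::mem::replace(&mut self.current, new_address.to_string());
        self.retiring = Some((old, now + TRANSITION_PERIOD));
    }

    /// Addresses that should be served at `now`, newest first.
    pub fn active(&self, now: SystemTime) -> Vec<String> {
        let mut out = vec![self.current.clone()];
        if let Some(entry) = self.retiring.as_ref() {
            if now < entry.1 {
                out.push(entry.0.clone());
            }
        }
        out
    }

    /// Returns the old address once its deadline has passed, so the caller
    /// can delete the corresponding hidden service.
    pub fn take_expired(&mut self, now: SystemTime) -> Option<String> {
        let due = self.retiring.as_ref().map_or(false, |entry| now >= entry.1);
        if due {
            self.retiring.take().map(|entry| entry.0)
        } else {
            None
        }
    }
}
```

Storing the deadline as wall-clock time rather than a monotonic instant is a deliberate choice here, since the schedule would presumably need to survive a service restart.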
S14 — Input validation on deploy targets
- Add regex validation for hostnames/IPs before SSH
- Tests: invalid hostname → clear error, valid hostname → proceeds
S15 — Memory limits on all deploy containers
- File: `deploy-to-target.sh` lines 842-880
- Add `--memory=$(mem_limit ...)` to all UI container builds
- Tests: every container in deploy has `--memory` flag
S17 — Disk space pre-flight
- File: `deploy-to-target.sh`
- Check target disk <85% before deploying. Abort with clear message if full.
- Tests: deploy to 90% full disk → aborted, deploy to 50% full → succeeds
Week 8: Remaining P1 Backend
R14 — Fix .parse().unwrap() in session rate limiting
- File: `session.rs:665,676,688`
- Replace `.parse().unwrap()` with `.parse().context("...")?`
- Tests: invalid IP handling works gracefully
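The plan's `.context("...")?` form relies on an error-context helper crate. To stay dependency-free, the sketch below shows the same idea with a hand-written error type instead; `InvalidClientIp` and `parse_client_ip` are hypothetical names, and the assumption that the parsed value is a client IP comes only from the "invalid IP handling" test note.

```rust
use std::fmt;
use std::net::IpAddr;

/// Hypothetical error type standing in for an error-with-context value.
#[derive(Debug, PartialEq, Eq)]
pub struct InvalidClientIp {
    pub raw: String,
}

impl fmt::Display for InvalidClientIp {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "invalid client IP: {:?}", self.raw)
    }
}

impl std::error::Error for InvalidClientIp {}

/// Parse without panicking: a malformed address becomes an `Err` that the
/// caller can propagate with `?` or turn into a rejected request.
pub fn parse_client_ip(raw: &str) -> Result<IpAddr, InvalidClientIp> {
    raw.trim()
        .parse::<IpAddr>()
        .map_err(|_| InvalidClientIp { raw: raw.to_string() })
}
```

Keeping the offending input in the error makes the resulting log line actionable, which the bare `unwrap()` panic message was not.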
R15 — Fix 7 unwrap/expect in mesh/protocol.rs
- File: `mesh/protocol.rs:582,592,614,649,679,713,728`
- Replace all with `?` operator + proper error types
- Tests: protocol parsing with malformed data returns error, not panic
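To make "returns error, not panic" concrete, here is a sketch of a parser that uses `?` with a typed error instead of indexing blindly. The wire format is invented purely for illustration (1 byte kind, 2 bytes big-endian payload length, then the payload) and is not the actual mesh protocol; `Frame`, `FrameError`, and `parse_frame` are hypothetical names.

```rust
/// Hypothetical frame for illustration only (not the real mesh protocol):
/// 1 byte kind, 2 bytes big-endian payload length, then the payload.
#[derive(Debug, PartialEq, Eq)]
pub struct Frame<'a> {
    pub kind: u8,
    pub payload: &'a [u8],
}

#[derive(Debug, PartialEq, Eq)]
pub enum FrameError {
    TooShort { needed: usize, got: usize },
    LengthMismatch { declared: usize, actual: usize },
}

pub fn parse_frame<'a>(bytes: &'a [u8]) -> Result<Frame<'a>, FrameError> {
    const HEADER_LEN: usize = 3;
    // `get` returns None instead of panicking on short input.
    let header = bytes.get(..HEADER_LEN).ok_or(FrameError::TooShort {
        needed: HEADER_LEN,
        got: bytes.len(),
    })?;
    let declared = u16::from_be_bytes([header[1], header[2]]) as usize;
    // Safe to slice: the header check above guarantees at least 3 bytes.
    let payload = &bytes[HEADER_LEN..];
    if payload.len() != declared {
        return Err(FrameError::LengthMismatch { declared, actual: payload.len() });
    }
    Ok(Frame { kind: header[0], payload })
}
```

Distinct error variants let the malformed-data tests assert on why a frame was rejected, not just that it was.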
R27 — Add timeouts to mesh Bitcoin RPC calls
- File: `mesh/mod.rs:624,649,663`
- Add `tokio::time::timeout(Duration::from_secs(10), ...)` to all Bitcoin RPC calls
- Tests: RPC timeout returns error after 10s
R34 — Tor rotation transition
- (Covered by I3 above)
MONTH 3: PRODUCTION POLISH (Weeks 9–12)
Fix every remaining P2 issue — unwraps, hardcoded values, frontend quality, resilience.
Week 9: Remaining Backend Unwraps + Dead Code
R13 — main.rs .expect() → .context()
- Replace 2 `.expect()` calls with `.context("...")?` and proper startup error handling
R16 — identity.rs .expect() → safe handling
- Replace 2 `.expect()` in crypto operations with result propagation
R17+R18 — helpers unwraps
- Fix 10 `.unwrap()` calls in `helpers/lib.rs` and `helpers/rsync.rs`
- Replace with `?` operator or `.context()`
R19 — js-engine unwraps
- Fix 2 `.unwrap()` in `js-engine/lib.rs:130,249`
R20+R21 — Dead code elimination
- Remove all 14 `#[allow(dead_code)]` in `mesh/mod.rs`. Either use the fields or delete them.
- Same for `lnd.rs`, `data_manager.rs`, `dev_orchestrator.rs`
- Tests: `cargo clippy` zero warnings, `cargo test` passes
Week 10: Hardcoded Values → Constants
R22 — Bitcoin RPC URL constant
- Create `const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/";` in a shared constants module
- Use across `bitcoin.rs`, `mesh/mod.rs`, `mesh/listener.rs`
- Tests: all Bitcoin RPC calls still work
R23 — DWN health URL constant
R24 — Update manifest URL constant
R25 — DNS-over-HTTPS URLs → constants array
R26 — DWN protocol URIs → constants
- Centralize all hardcoded URLs/URIs into `core/archipelago/src/constants.rs`
- Tests: all modules reference constants, no hardcoded strings remain
R28 — LND proxy timeouts
- Audit all 68 `.send()` calls in `api/rpc/lnd.rs`. Ensure each has explicit timeout.
- Tests: LND proxy call with unresponsive LND → timeout error, not hang
R29 — DWN health check timeout
- Add timeout to `dwn_sync.rs:76` health check
R30-R33 — Resolve all TODOs
- Either implement the TODO or remove the dead code path. Per project rules: no TODO/FIXME in commits.
Week 11: Frontend P2 Fixes
F8 — WebSocket reconnection race
- Add `isReconnecting` flag. Skip if already reconnecting.
- Tests: rapid close events → only one reconnect attempt
F9 — WebSocket parse error handling
- Count consecutive parse errors. After 3, force reconnect.
- Tests: 3 malformed messages → reconnect triggered; single bad message → logged only
F10 — Stale connection detection tuning
- Require mutual pong response within 30s. Don't close valid connections that are simply quiet.
- Tests: quiet but healthy connection → stays open; no pong for 30s → reconnects
F11 — RPC client backoff reduction
- Reduce default timeout from 30s to 15s. Add jitter to backoff. Cap total retry time at 20s.
- Tests: server outage → user sees error within 20s, not 40s
F12 — Code splitting
- Lazy-load all routes: `() => import('./views/Web5.vue')`
- Add manual chunks in vite.config.ts for vendor/api
- Tests: build produces multiple chunks, initial bundle < 200KB gzipped
F13 — DOMPurify on QR v-html
- Add DOMPurify.sanitize() to QR SVG before v-html rendering
- Tests: XSS payload in QR content → sanitized
Week 12: Frontend P2 Continued + Performance
F14 — Goals computed memoization
- Replace O(n) alias lookup with Map. Add deep equality check.
- Tests: goalStatuses computed runs in <1ms with 100 apps
F15 — localStorage error handling
- Wrap all localStorage.setItem in try/catch. Show toast on quota exceeded.
- Tests: full localStorage → toast shown, app continues
F16 — FileBrowser auth consolidation
- Use cookie-only auth. Remove in-memory token.
- Tests: login persists across page reload, logout clears cookie
F17 — CSRF token parsing robustness
- Add header fallback for CSRF token. Handle edge cases.
- Tests: missing cookie → falls back to header, both missing → error
F22 — CSS backdrop-filter mobile performance
- Add media query: reduce blur to 8px on mobile. Remove backdrop-filter from non-visible elements.
- Tests: mobile Lighthouse performance score > 80
MONTH 4-5: BACKEND ARCHITECTURE (Weeks 13–20)
Split every Rust god file. Target: no file > 500 lines.
Week 13–14: Split package.rs (1,795 lines)
api/rpc/package/
├── mod.rs — Re-exports (~50 lines)
├── config.rs — get_app_config(), get_app_capabilities(), needs_archy_net()
├── lifecycle.rs — install, start, stop, restart, uninstall
├── validation.rs — Input validation, dependency checking, image validation
└── progress.rs — Progress streaming, install status tracking
Pre-split tests: test every get_app_config() variant, validation path, lifecycle transition
Post-split: all RPC calls return identical responses, cargo test passes
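One way to keep the split behavior-preserving is to leave the submodules private and re-export their public names from `mod.rs`, so existing `package::...` call sites compile unchanged. The sketch below uses inline modules to stay self-contained. `needs_archy_net` is named in the layout above, but its signature and body here are placeholders, and `validate_app_id` is a purely hypothetical example of a validation helper.

```rust
/// Stand-in for `api/rpc/package/`: submodules stay private and `mod.rs`
/// re-exports the public names, so callers keep using `package::...`.
/// Function bodies and signatures are hypothetical placeholders.
pub mod package {
    mod config {
        pub fn needs_archy_net(app_id: &str) -> bool {
            // Placeholder rule for the sketch only.
            app_id.starts_with("archy-")
        }
    }

    mod validation {
        pub fn validate_app_id(app_id: &str) -> Result<(), String> {
            if app_id.is_empty() {
                return Err("app id must not be empty".to_string());
            }
            let allowed = |c: char| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-';
            if !app_id.chars().all(allowed) {
                return Err(format!("invalid app id: {}", app_id));
            }
            Ok(())
        }
    }

    // The re-exports are what keep existing call sites compiling unchanged.
    pub use self::config::needs_archy_net;
    pub use self::validation::validate_app_id;
}
```

With this shape, the pre-split tests can keep importing from `package` and run unmodified against the post-split layout.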
Week 15–16: Split mesh/listener.rs (1,799 lines)
mesh/listener/
├── mod.rs — Re-exports + spawn_mesh_listener()
├── session.rs — run_mesh_session() loop
├── frames.rs — handle_frame() dispatcher
├── identity.rs — handle_identity_received(), handle_typed_message()
├── sync.rs — sync_queued_messages(), store_typed_message()
└── bitcoin.rs — Bitcoin relay operations, RPC calls
Week 17–18: Split rpc/mod.rs (1,092 lines) + lnd.rs (1,068 lines)
rpc/mod.rs → dispatcher.rs (method routing), middleware.rs (CSRF/session/rate-limit), response.rs (response building)
lnd.rs → lnd/wallet.rs, lnd/channels.rs, lnd/info.rs, lnd/payments.rs
Week 19–20: Split monitoring (993), handler (911), mesh (865)
Split each into sub-modules. Target: no file > 500 lines. All pre-split tests, all post-split verification.
MONTH 6-8: FRONTEND ARCHITECTURE (Weeks 21–32)
Split every Vue god component. Target: no component > 500 lines.
Week 21–22: Split Web5.vue (3,940 lines → 8 sub-views)
views/web5/
├── Web5.vue — Router shell (~150 lines)
├── Web5Identity.vue — DID management
├── Web5Wallet.vue — Wallet operations
├── Web5Nostr.vue — Nostr relays/profiles
├── Web5Credentials.vue — Verifiable Credentials
├── Web5Peers.vue — P2P federation nodes
├── Web5Storage.vue — DWN storage/explorer
├── Web5Goals.vue — Goals/voting
└── Web5Marketplace.vue — Decentralized marketplace
Add nested routes. Component tests for each section. All sections render identically.
Week 23–24: Split Mesh.vue (2,106) + Dashboard.vue (1,819)
Mesh.vue → MeshRadio.vue, MeshChat.vue, MeshNetwork.vue, MeshFederation.vue
Dashboard.vue → DashboardHome.vue, DashboardApps.vue, DashboardSystem.vue
Week 25–26: Split Settings.vue (1,792) + Server.vue (1,132)
Settings.vue → SettingsAccount.vue, SettingsSystem.vue, SettingsNetwork.vue, SettingsAppearance.vue
Server.vue → ServerOverview.vue, ServerContainers.vue, ServerLogs.vue
Week 27–28: Split Marketplace.vue (1,293) + AppDetails.vue (1,036) + Home.vue (1,059)
Each into 3-4 focused sub-components.
Week 29–30: Decompose useAppStore (324 lines, 16 methods)
stores/
├── app.ts — Thin re-export for backward compat (~50 lines)
├── auth.ts — Login, logout, session, password, TOTP
├── server.ts — Server info, system stats, reboot/shutdown
├── realtime.ts — WebSocket connection, subscriptions, heartbeat
└── packages.ts — Package install/uninstall, marketplace data
Tests: every existing import of useAppStore still works. State transitions identical.
Week 31–32: Remaining frontend P3 issues
F18 — aiPermissions runtime validation
F19 — Track AppSession timeout
F20 — Dashboard aria-current
F21 — Debounce search + memoize
F23 — Branded types for DID operations
F24 — Fix checkInterval leak
MONTH 9-10: SCRIPT ARCHITECTURE + ISO (Weeks 33–40)
Split every monolithic script. Target: no script > 400 lines.
Week 33–34: Create shared script library
scripts/lib/
├── common.sh — Colors, logging, error handling, SSH helpers
├── health.sh — Health check polling, container status
├── deploy-utils.sh — Rsync, file sync, backup/restore
├── container.sh — Podman helpers, image management, mem_limit()
└── network.sh — IP validation, port checking
Tests: each library function tested in scripts/tests/
Week 35–36: Split deploy-to-target.sh (1,728 lines)
scripts/
├── deploy-to-target.sh — Orchestrator + arg parsing (~300 lines)
├── deploy/
│ ├── frontend.sh — Build + sync frontend
│ ├── backend.sh — Build + sync binary
│ ├── configs.sh — Sync nginx, systemd, scripts
│ ├── containers.sh — Container creation/update
│ ├── verify.sh — Post-deploy health checks
│ └── rollback.sh — Rollback on failure
Week 37–38: Split ISO build (1,850 lines) + first-boot (855 lines)
build-auto-installer-iso.sh → build/capture-images.sh, build/create-rootfs.sh, build/install-packages.sh, build/bundle-configs.sh, build/package-iso.sh
first-boot-containers.sh → first-boot/databases.sh, first-boot/bitcoin.sh, first-boot/lightning.sh, first-boot/apps.sh, first-boot/networking.sh
Week 39–40: ISO Reproducibility + Integration Tests
S16 — Make ISO builds reproducible
- Create `image-versions.env` with pinned digests for every container image
- ISO build sources this file, never pulls `:latest`
- Build manifest records exactly what shipped
- Tests: two consecutive ISO builds produce identical image sets
E2E smoke test script
# scripts/smoke-test.sh — Run against .198
# 1. curl /health → OK
# 2. Login → get session
# 3. Get server info → valid JSON
# 4. List containers → all healthy
# 5. Check every /app/* proxy → responds
# 6. Check Tor hidden service → resolves
# 7. Check WebSocket upgrade → 101
# Exit 0 only if all pass
MONTH 11: INTEGRATION TESTS (Weeks 41–44)
Comprehensive test suites that prove everything works.
Week 41–42: Backend Integration Tests
core/archipelago/tests/
├── test_auth_flow.rs — Login → session → CSRF → auth request → logout
├── test_container_lifecycle.rs — Install → start → health → stop → uninstall
├── test_federation.rs — Generate invite → join → sync → verify
├── test_rpc_validation.rs — Every endpoint with invalid input → proper error
├── test_session_persist.rs — Create session → restart → session survives
├── test_rate_limiting.rs — Flood → 429 → wait → allowed
├── test_backup_restore.rs — Create → verify → restore → validate
├── test_health_endpoint.rs — Healthy → degraded → recovery
Target: 25+ backend integration tests passing
Week 43–44: Frontend Integration Tests
neode-ui/src/__tests__/integration/
├── auth-flow.spec.ts — Login → dashboard → timeout → redirect
├── app-lifecycle.spec.ts — Marketplace → install → progress → launch → uninstall
├── websocket.spec.ts — Connect → update → disconnect → reconnect → state consistent
├── settings-flow.spec.ts — Change password → re-login → 2FA setup → verify
├── spotlight.spec.ts — Open → search → navigate → close
├── mesh-chat.spec.ts — Connect → send → receive → disconnect
├── error-handling.spec.ts — Network error → toast → retry → success
├── code-splitting.spec.ts — Route navigation → chunks loaded lazily
Target: 20+ frontend integration tests passing
MONTH 12: TYPE SYNC + CI/CD PLAN (Weeks 45–48)
Week 45–46: Rust↔TypeScript Type Sync
Approach: ts-rs crate to auto-generate TypeScript types from Rust structs
- Add `ts-rs` to `core/models/Cargo.toml`
- Add `#[derive(TS)]` to all API request/response types
- Build script generates `neode-ui/src/types/generated.ts`
- Replace manual types in `types/api.ts` with imports from generated file
- Verification: regenerate → diff → must be zero (types committed)
Tests: frontend type-check passes with generated types, manual api.ts reduced to non-API types
Week 47–48: CI/CD Planning (Document Only — Execute Later)
This section is the PLAN for CI/CD. Do not execute during this phase. Document everything needed so it can be implemented in a future sprint.
CI Pipeline Design (.github/workflows/ci.yml):
# Triggers: push to main, all PRs
# Jobs:
# rust-checks (Linux runner):
# - cargo clippy --all-targets --all-features (zero warnings gate)
# - cargo fmt --all -- --check (formatting gate)
# - cargo test --all-features (all tests gate)
#
# frontend-checks (Node 20):
# - npm run type-check (TypeScript strictness gate)
# - npm run lint (ESLint gate)
# - npm test (Vitest suite gate)
#
# integration (Linux runner, optional):
# - scripts/smoke-test.sh against staging
#
# Merge policy: all checks must pass before merge
# Branch protection: require PR, require checks, no force push to main
Release Pipeline Design (.github/workflows/release.yml):
# Triggers: tag push (v*)
# Jobs:
# build-linux-binary:
# - Cross-compile Rust for x86_64 + ARM64
# build-frontend:
# - npm run build
# build-iso:
# - SSH to build server, run ISO build
# - Upload ISO as release asset
# smoke-test:
# - Boot ISO in QEMU
# - Run smoke-test.sh
# - Gate release on pass
Pre-requisites to implement:
- GitHub Actions runner with Rust toolchain + cross-compilation
- Node.js 20 runner for frontend
- SSH key for build server accessible from CI
- Branch protection rules configured
- Image digest manifest for reproducible ISO builds
- QEMU-based ISO verification script
Estimated implementation time: 2 weeks when ready to execute
VERIFICATION PROTOCOL (Every Week)
- `cargo clippy --all-targets --all-features` — zero warnings
- `cargo fmt --all`
- `cargo test --all-features` — all pass
- `cd neode-ui && npm run type-check` — zero errors
- `cd neode-ui && npm test` — all pass
- `./scripts/deploy-to-target.sh --target 192.168.1.198` — ONLY .198
- `curl http://192.168.1.198/health` — returns OK with service status
- Navigate all affected views in browser — identical behavior
- Atomic commit: `refactor: <description>` or `fix: <description>`
EXIT CRITERIA (Month 12 Complete)
Reliability (Zero Tolerance)
- Health endpoint returns real service status
- All async operations have bounded timeouts
- Zero blocking I/O in async context (no std::fs in async functions)
- Zero .unwrap()/.expect() in production code
- All rate limiters have cleanup tasks
- Backup restore uses staging + atomic swap + rollback
- All 30 containers have health checks + memory limits
- All container images pinned to specific versions
- Nginx unauthenticated endpoints protected (timeout + rate limit + body size)
- Systemd service has resource limits
- Tor rotation preserves old address during transition
- Deploy has locking + disk check + rollback
- Zero `sudo podman` in any script
- Zero `:latest` image tags anywhere
- Zero silent error masking without logging
Frontend (Zero Tolerance)
- Global error handler catches and displays all errors
- WebSocket: single subscription, reconnect refreshes state, bounded retries
- All timers/listeners cleaned up on unmount
- Code splitting: initial bundle < 200KB gzipped
- v-html always uses DOMPurify
- All localStorage operations wrapped in try/catch
Architecture (Target: File Size Limits)
- No Rust file > 500 lines (excluding generated code)
- No Vue component > 500 lines
- No shell script > 400 lines
- No Pinia store has more than 1 responsibility
- All hardcoded URLs/ports extracted to constants
- Shared script library eliminates duplication
- TypeScript types auto-generated from Rust structs
Testing
- 25+ backend integration tests passing
- 20+ frontend integration tests passing
- E2E smoke test script passes on .198
- ISO builds are reproducible (pinned digests)
CI/CD (Planned, Not Executed)
- CI pipeline design documented
- Release pipeline design documented
- Pre-requisites list complete
- Ready for 2-week implementation sprint
Zero Behavior Changes
Every feature works identically. Every existing test passes. Every user flow unchanged.