lfg2025/archy

Dorian e4e0ef4f11 bug fixing and deploy and build diagnostics

2026-03-22 03:30:21 +00:00


Archipelago: Production Excellence Plan

Duration: 12 months (48 weeks)
Goal: Code so good no developer could question any decision. Apple-level reliability. Every failure visible and recoverable. Every operation bounded. Every line justified.
Audited: 2026-03-20 (122 Rust files, 38 Vue views, 180+ frontend files, 80+ shell scripts)

CONSTRAINTS

DEPLOY ONLY TO .198 — Never .228. All verification on .198.
BETA FREEZE — Behavior-preserving only. No new features/UI/endpoints.
Tests before every refactor — Capture current behavior first. Tests must pass unchanged after.
Atomic commits — One logical change per commit. Every step compiles + passes tests.

ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.198

COMPLETE ISSUE REGISTRY

Backend Rust — 122 files audited

ID	Issue	File(s)	Severity
R1	Health RPC endpoint has no handler — returns "Unknown method"	`api/rpc/mod.rs`	P0
R2	Nostr client.connect() hangs indefinitely (4 calls, no timeout)	`nostr_handshake.rs:124,161,262,282`	P0
R3	Backup restore extracts directly to live dir — no atomic rollback	`backup/full.rs:122-149`	P0
R4	Rate limiter cleanup() never spawned — HashMap grows forever	`session.rs:566-579`	P1
R5	Login rate limiter same issue — entries never evicted	`session.rs:452-472`	P1
R6	Blocking std::fs in async — session.rs (6 calls)	`session.rs:77,128,370,413,423,425`	P1
R7	Blocking std::fs in async — docker_packages.rs	`docker_packages.rs:561,573`	P1
R8	Blocking std::fs in async — port_allocator.rs	`port_allocator.rs:59,73,77`	P1
R9	Blocking std::fs in async — peers.rs, node_message.rs	`peers.rs:30`, `node_message.rs:65`	P1
R10	Blocking std::fs in async — identity.rs, identity_manager.rs	`identity.rs:50`, `identity_manager.rs:164`	P1
R11	Blocking std::fs in async — nostr_discovery.rs	`nostr_discovery.rs:55`	P1
R12	Sync TCP I/O in async context — electrs_status.rs	`electrs_status.rs:5,40,78,81`	P1
R13	.expect() in main.rs startup	`main.rs:124,159`	P2
R14	.parse().unwrap() in session.rs rate limiting	`session.rs:665,676,688`	P1
R15	7 .unwrap()/.expect() in mesh/protocol.rs	`protocol.rs:582,592,614,649,679,713,728`	P1
R16	.expect() in identity.rs crypto	`identity.rs:114,119`	P2
R17	.unwrap() in helpers/lib.rs (5 calls)	`helpers/lib.rs:167,172,180,233,253`	P2
R18	.unwrap() in helpers/rsync.rs (5 calls)	`rsync.rs:196,199,202,210,220`	P2
R19	.unwrap() in js-engine/lib.rs	`js-engine/lib.rs:130,249`	P2
R20	14 #[allow(dead_code)] suppressions in mesh/mod.rs	`mesh/mod.rs:7-25`	P2
R21	Dead code in lnd.rs, data_manager.rs, dev_orchestrator.rs	Multiple	P2
R22	Bitcoin RPC URL hardcoded in 4+ files	`bitcoin.rs:89`, `mesh/mod.rs:624,649,663`, `listener.rs:1509+`	P2
R23	DWN health URL hardcoded	`dwn_sync.rs:76`	P2
R24	Update manifest URL hardcoded	`update.rs:11`	P3
R25	DNS-over-HTTPS URLs hardcoded (4 providers)	`network/dns.rs:98,102,106,110`	P3
R26	DWN protocol URIs hardcoded in server.rs	`server.rs:453-456`	P3
R27	Missing timeouts on mesh Bitcoin RPC calls	`mesh/mod.rs:624,649,663`	P1
R28	Missing timeouts on LND proxy calls (68 .send() calls)	`api/rpc/lnd.rs`	P2
R29	Missing timeout on DWN health check	`dwn_sync.rs:76`	P2
R30	TODO: track last-seen timestamp	`handshake.rs:77`	P3
R31	TODO: lnd.lookupinvoice RPC endpoint	`marketplace.rs:183`	P3
R32	TODO: trigger auto-restart or alert	`container/health_monitor.rs:140`	P3
R33	TODO: configure Podman to use AppArmor profile	`security/container_policies.rs:68`	P3
R34	Tor rotation deletes old .onion immediately — no transition	`api/rpc/tor.rs:184-240`	P1
R35	package.rs god file — 1,795 lines	`api/rpc/package.rs`	P2
R36	mesh/listener.rs god file — 1,799 lines	`mesh/listener.rs`	P2
R37	rpc/mod.rs god file — 1,092 lines	`api/rpc/mod.rs`	P2
R38	lnd.rs god file — 1,068 lines	`api/rpc/lnd.rs`	P2
R39	monitoring/mod.rs — 993 lines	`monitoring/mod.rs`	P3
R40	api/handler.rs — 911 lines	`api/handler.rs`	P3
R41	30+ functions exceed 50 lines across codebase	Multiple	P3

Frontend — 180+ files audited

ID	Issue	File(s)	Severity
F1	WebSocket subscription registered multiple times — race condition	`stores/app.ts:88-134`	P0
F2	Unprotected concurrent mesh state mutations	`stores/mesh.ts:249-268,294-324`	P0
F3	No global Vue error handler — white screen on error	`main.ts`	P0
F4	Stale data after WebSocket reconnect — no full refresh	`stores/app.ts:88-163`	P1
F5	Message polling timer never stopped after logout	`composables/useMessageToast.ts:60`	P1
F6	AppLauncher NIP-07 message listener leak on close	`stores/appLauncher.ts:295-301`	P1
F7	Audio player listeners stack — never cleaned up	`composables/useAudioPlayer.ts:1-91`	P1
F8	WebSocket reconnection race — parallel connect() attempts	`api/websocket.ts:212-238`	P2
F9	WebSocket parse error silently caught — stale UI forever	`api/websocket.ts:164-172`	P2
F10	WebSocket stale connection detection too aggressive (5min)	`api/websocket.ts:284-299`	P2
F11	RPC client backoff + timeout = 40s max wait	`api/rpc-client.ts:31-117`	P2
F12	No code splitting — monolithic bundle	`vite.config.ts`	P2
F13	v-html on QR code without DOMPurify	`views/Settings.vue:441`	P2
F14	Goals store O(n) alias lookup on every computed	`stores/goals.ts:16-20,38-89`	P2
F15	localStorage save without try/catch (5+ instances)	`stores/goals.ts:34-36` + others	P2
F16	FileBrowser auth token duality — memory + cookie	`api/filebrowser-client.ts:39,50-68`	P2
F17	CSRF token cookie parsing brittle — regex only	`api/rpc-client.ts:18-21`	P2
F18	aiPermissions.ts Set uses unsafe type assertion	`stores/aiPermissions.ts:91-103`	P3
F19	Untracked setTimeout in AppSession — fires after unmount	`views/AppSession.vue:507`	P3
F20	Dashboard navigation missing aria-current="page"	`views/Dashboard.vue`	P3
F21	Search performance — string re-lowercasing every keystroke	`views/Apps.vue:510-537`	P3
F22	30+ backdrop-filter blur elements — GPU overload on mobile	`style.css`	P3
F23	Record<string, unknown> on sensitive DID operations	`types/api.ts` + `rpc-client.ts`	P3
F24	checkInterval timer leak on connect race	`api/websocket.ts:82-96`	P3
F25	Web5.vue god component — 3,940 lines	`views/Web5.vue`	P2
F26	Mesh.vue — 2,106 lines	`views/Mesh.vue`	P2
F27	Dashboard.vue — 1,819 lines	`views/Dashboard.vue`	P2
F28	Settings.vue — 1,792 lines	`views/Settings.vue`	P2
F29	Marketplace.vue — 1,293 lines	`views/Marketplace.vue`	P3
F30	Server.vue — 1,132 lines	`views/Server.vue`	P3
F31	Home.vue — 1,059 lines	`views/Home.vue`	P3
F32	AppDetails.vue — 1,036 lines	`views/AppDetails.vue`	P3
F33	useAppStore god store — 324 lines, 16 methods, 8+ responsibilities	`stores/app.ts`	P2

Shell Scripts — 80+ files audited

ID	Issue	File(s)	Severity
S1	60+ instances of `sudo podman` — should be rootless	`fix-indeedhub(28)`, `deploy-bitcoin(11)`, `deploy-tailscale(2+)`	P0
S2	Zero container health checks in first-boot (30 containers)	`first-boot-containers.sh`	P0
S3	50+ `:latest` image tags across all scripts	`first-boot(15)`, `deploy(11)`, `tailscale(18)`, `iso(7)`	P1
S4	No `set -e` in first-boot — silent container failures	`first-boot-containers.sh:1-9`	P1
S5	`eval "$DB_PASSWORDS"` — code injection risk	`deploy-to-target.sh:940`	P1
S6	No deploy locking — concurrent deploys corrupt state	`deploy-to-target.sh`	P1
S7	No deploy rollback — failed deploy leaves broken system	`deploy-to-target.sh`	P1
S8	sshpass usage in trust-archipelago-cert.sh	`trust-archipelago-cert.sh:23-26`	P1
S9	MariaDB password in command line — visible in ps	`first-boot-containers.sh:285`	P1
S10	80+ instances of `2>/dev/null || true` masking errors	`deploy-to-target.sh`	P2
S11	No trap cleanup for temp files	Multiple scripts	P2
S12	Unquoted variables (word splitting risk)	Multiple scripts	P2
S13	Hardcoded IPs in 6+ scripts	`deploy-to-target.sh:26`, `deploy-tailscale.sh:26`, etc.	P2
S14	No input validation on deploy targets	`deploy-tailscale.sh`	P2
S15	Missing memory limits on some containers in deploy	`deploy-to-target.sh:842-880`	P2
S16	ISO build not reproducible — dynamic image capture + :latest	`build-auto-installer-iso.sh:500-594`	P2
S17	No disk space pre-flight in deploy	`deploy-to-target.sh`	P2
S18	deploy-to-target.sh — 1,728 lines monolith	`deploy-to-target.sh`	P3
S19	build-auto-installer-iso.sh — 1,850 lines monolith	`build-auto-installer-iso.sh`	P3
S20	first-boot-containers.sh — 855 lines monolith	`first-boot-containers.sh`	P3
S21	No shared script library — duplicated functions	`scripts/`	P3

Infrastructure

ID	Issue	File(s)	Severity
I1	Nginx: /archipelago/, /content, /dwn missing timeout+rate-limit+body-size	`nginx-archipelago.conf:116-180`	P0
I2	Systemd: no MemoryMax, LimitNOFILE, TasksMax	`archipelago.service`	P1
I3	Tor rotation kills old address immediately — federation downtime	`api/rpc/tor.rs:184-240`	P1

MONTH 1: CRASH PREVENTION (Weeks 1–4)

Fix every issue that can crash the system, hang indefinitely, or lose data.

Week 1: P0 Backend — Things That Hang or Lose Data

R1 — Health endpoint handler

File: core/archipelago/src/api/rpc/mod.rs
Add handler for "health" method that checks: crash recovery complete, Podman socket responsive, session store loaded
Tests: health returns JSON status, degraded when Podman unreachable, degraded during recovery
Verify: curl http://192.168.1.198/rpc/v1 -d '{"method":"health"}' returns real status

R2 — Nostr connect timeout

File: core/archipelago/src/nostr_handshake.rs lines 124, 161, 262, 282
Wrap all 4 client.connect().await in tokio::time::timeout(Duration::from_secs(10), ...)
Tests: connect timeout returns Err after 10s, successful connect within timeout works

R3 — Backup restore atomic rollback

File: core/archipelago/src/backup/full.rs lines 122-149
Rewrite: decrypt → extract to staging dir → validate required files → atomic rename → rollback on failure
Tests: valid backup restores, corrupt backup fails without touching live data, partial extraction rolls back, disk space check fails early

I1 — Nginx unauthenticated endpoint protection

File: image-recipe/configs/nginx-archipelago.conf lines 116-180
Add to /archipelago/, /content, /dwn:
- limit_req zone=peer burst=20 nodelay;
- client_max_body_size 10m;
- proxy_connect_timeout 30s; proxy_read_timeout 60s; proxy_send_timeout 30s;
Tests: >10MB payload → 413, slow client → timeout, burst of 30 → rejected after 20 (nginx answers 503 by default; add limit_req_status 429; to return 429)

Week 2: P0 Frontend + Scripts — Things That Break UI or Containers

F1 — WebSocket subscription race condition

File: neode-ui/src/stores/app.ts lines 88-134
Fix: Return unsubscribe function from wsClient.subscribe(), call it before re-subscribing. Use a subscription ID to prevent duplicates.
Tests: rapid connectWebSocket() calls produce only one active subscription
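
The unsubscribe-before-resubscribe fix can be sketched roughly as follows. This is a minimal illustration, not the project's actual client: `FakeWsClient`, `activeCount`, and `patchesSeen` are hypothetical names, and the real `wsClient.subscribe()` may take different arguments.

```typescript
// Hypothetical stand-in for the real WebSocket client; only the shape matters here.
type Handler = (patch: unknown) => void;
type Unsubscribe = () => void;

class FakeWsClient {
  private handlers = new Map<number, Handler>();
  private nextId = 1;

  // Returns an unsubscribe function bound to a unique subscription ID.
  subscribe(handler: Handler): Unsubscribe {
    const id = this.nextId++;
    this.handlers.set(id, handler);
    return () => {
      this.handlers.delete(id);
    };
  }

  // Delivers a patch to every currently registered handler.
  emit(patch: unknown): void {
    this.handlers.forEach((handler) => handler(patch));
  }

  get activeCount(): number {
    return this.handlers.size;
  }
}

const wsClient = new FakeWsClient();
let unsubscribe: Unsubscribe | null = null;
let patchesSeen = 0;

// Safe to call repeatedly: the previous subscription is always dropped first.
function connectWebSocket(): void {
  unsubscribe?.();
  unsubscribe = wsClient.subscribe(() => {
    patchesSeen++;
  });
}
```

With this shape, calling `connectWebSocket()` several times in a row leaves exactly one handler registered, which is the property the test above is meant to capture.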

F2 — Mesh concurrent state mutations

File: neode-ui/src/stores/mesh.ts lines 249-324
Fix: Add isSending ref as mutex. Queue concurrent sends. fetchMessages() called once after all sends complete.
Tests: 3 concurrent sendMessage() calls → all succeed, messages list consistent
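
One way to queue concurrent sends is a promise chain, sketched below. It swaps the boolean `isSending` flag for a chained `tail` promise, which serializes callers without polling. `createSendQueue`, `send`, `refresh`, and `idle` are hypothetical names: `send` stands in for the mesh RPC call and `refresh` for `fetchMessages()`.

```typescript
// Hypothetical transport: stands in for the real mesh RPC call.
type SendFn = (text: string) => Promise<void>;

function createSendQueue(send: SendFn, refresh: () => Promise<void>) {
  let tail: Promise<void> = Promise.resolve();
  let pending = 0;

  // Each call is chained behind the previous one, so sends never interleave.
  function sendMessage(text: string): Promise<void> {
    pending++;
    const result = tail.then(() => send(text));
    tail = result
      .catch(() => undefined) // a failed send must not block the queue
      .then(async () => {
        pending--;
        if (pending === 0) await refresh(); // one refresh after the last queued send
      })
      .catch(() => undefined); // a failed refresh must not wedge later sends
    return result;
  }

  // Resolves once every queued send and the trailing refresh have finished.
  const idle = (): Promise<void> => tail;

  return { sendMessage, idle };
}
```

Callers still receive the result of their own send (including its rejection), while the message list is refreshed once per burst rather than once per message.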

F3 — Global error handler

File: neode-ui/src/main.ts
Add app.config.errorHandler that shows toast + logs structured error
Tests: thrown error in component shows toast, nested errors don't crash handler

S1 — Eliminate all sudo podman

Files: fix-indeedhub-containers.sh (28), deploy-bitcoin-knots.sh (11), deploy-tailscale.sh (2+), uptime-monitor.sh (1), setup-aiui-server.sh
Replace every sudo podman with podman (runs as archipelago user)
Tests: grep for sudo podman across all scripts returns zero matches

S2 — Container health checks for all 30 containers

File: scripts/first-boot-containers.sh
Add --health-cmd, --health-interval=30s, --health-timeout=5s, --health-retries=3 to every $DOCKER run
Health commands per type:
- Bitcoin: bitcoin-cli -rpcuser=... getblockchaininfo || exit 1
- HTTP apps: curl -sf http://localhost:{port}/ || exit 1
- LND: curl -sf --insecure https://localhost:8080/v1/getinfo || exit 1
- Databases: mariadb -u root -p... -e "SELECT 1" || exit 1
Tests: script grep confirms every $DOCKER run has --health-cmd

Week 3: P1 Backend — Blocking I/O and Memory Leaks

R4+R5 — Rate limiter cleanup

File: core/archipelago/src/session.rs
Spawn background tasks for both EndpointRateLimiter::cleanup() and LoginRateLimiter cleanup, every 5 min
Tests: after cleanup, stale entries removed; active entries preserved

R6 — session.rs blocking I/O (6 calls)

Replace std::fs::read_to_string → tokio::fs::read_to_string at lines 77, 370, 413
Replace std::fs::write → tokio::fs::write at lines 128, 425
Replace std::fs::create_dir_all → tokio::fs::create_dir_all at line 423
Tests: session load/save/persist still works correctly

R7 — docker_packages.rs blocking I/O

Replace std::fs::read_to_string → tokio::fs::read_to_string at lines 561, 573
Tests: app metadata loading works

R8 — port_allocator.rs blocking I/O

Replace all 3 std::fs calls → tokio::fs at lines 59, 73, 77
Tests: port allocation/persistence works

R9+R10+R11 — Remaining blocking I/O

peers.rs:30, node_message.rs:65, identity.rs:50, identity_manager.rs:164, nostr_discovery.rs:55
Convert all to tokio::fs
Tests: each module's file operations still work

R12 — electrs_status.rs sync TCP I/O

Convert synchronous TCP client to async (tokio::net::TcpStream)
Tests: electrs status query works, timeout on connection failure

Week 4: P1 Frontend — Memory Leaks and Stale State

F4 — WebSocket reconnect full state refresh

File: neode-ui/src/stores/app.ts
After reconnect, call rpcClient.call({method: 'server.get-state'}) to get fresh state before accepting patches
Tests: after simulated disconnect+reconnect, state matches server

F5 — Message polling timer cleanup

File: neode-ui/src/composables/useMessageToast.ts
Tie polling lifecycle to auth state: stop on logout, start on login. Export cleanup function.
Tests: polling stops when auth false, restarts when auth true, no timer after unmount
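
A minimal sketch of tying the timer to auth state is shown below. `createPoller`, `onAuthChange`, and `isRunning` are hypothetical names, and `poll` stands in for the real message fetch; the composable's actual structure may differ.

```typescript
// Hypothetical poller; `poll` stands in for the real message fetch.
function createPoller(poll: () => void, intervalMs: number) {
  let timer: ReturnType<typeof setInterval> | null = null;

  function start(): void {
    if (timer !== null) return; // already running: never stack timers
    timer = setInterval(poll, intervalMs);
  }

  // Exported cleanup: idempotent, so it is safe on logout and on unmount.
  function stop(): void {
    if (timer === null) return;
    clearInterval(timer);
    timer = null;
  }

  // Call on every auth change: login starts polling, logout stops it.
  function onAuthChange(isAuthenticated: boolean): void {
    if (isAuthenticated) start();
    else stop();
  }

  return { onAuthChange, stop, isRunning: () => timer !== null };
}
```

In a Vue composable this would presumably be driven by a watcher on the auth flag, with `stop` also registered as an unmount hook.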

F6 — AppLauncher message listener leak

File: neode-ui/src/stores/appLauncher.ts
Ensure listener is removed when app closes (even if not via close button — e.g., route navigation)
Tests: navigate away from app → listener removed, new app opens clean

F7 — Audio player listener stacking

File: neode-ui/src/composables/useAudioPlayer.ts
Create Audio element once, register listeners once. Track initialization flag.
Tests: calling play() 10 times → still only 6 listeners total (not 60)

S3 — Pin all container images (remove :latest)

Files: first-boot-containers.sh (15), deploy-to-target.sh (11), deploy-tailscale.sh (18), build-auto-installer-iso.sh (7)
Replace every :latest with specific version tag
Create image-versions.env sourced by all scripts — single source of truth
Tests: grep -r ':latest' scripts/ image-recipe/ returns zero matches (excluding comments)

MONTH 2: OPERATIONAL SAFETY (Weeks 5–8)

Fix everything that makes deploys dangerous, scripts unreliable, or operations opaque.

Week 5: Deploy Script Hardening

S4 — first-boot error handling

Add per-section error checking: if Bitcoin fails, skip dependent containers (LND, Mempool, BTCPay)
Add wait_for_container return value checking
Tests: first-boot with broken Bitcoin image → Bitcoin deps skipped, independent apps still start

S5 — Replace eval with safe construct

File: deploy-to-target.sh:940
Replace eval "$DB_PASSWORDS" with explicit variable assignment from SSH output
Tests: passwords parsed correctly without eval

S6 — Deploy locking

File: deploy-to-target.sh
Add remote flock on /var/lock/archipelago-deploy.lock. Second deploy fails immediately with message. Stale lock (>30 min) broken automatically.
Tests: two parallel deploys → second fails, stale lock → broken and deploy proceeds

S7 — Deploy rollback

File: deploy-to-target.sh
Before overwriting binary: cp archipelago archipelago.bak
Before overwriting frontend: cp -r web-ui web-ui.bak
If health check fails post-restart: restore from .bak, restart again
Tests: intentionally broken binary → deploy detects, rolls back, system healthy

S8 — Eliminate sshpass

File: trust-archipelago-cert.sh
Rewrite to use SSH key only: ssh -i ~/.ssh/archipelago-deploy
Tests: script works with key auth, fails gracefully without key

Week 6: Script Quality

S9 — MariaDB password not on command line

File: first-boot-containers.sh:285
Use $DOCKER exec -i ... mariadb -uroot <<< "SET PASSWORD..." (the statement is fed on stdin, so the password never appears in the process arguments)
Tests: ps aux during execution doesn't show password

S10 — Replace silent error masking

File: deploy-to-target.sh (80+ instances)
Pattern: replace 2>/dev/null || echo "" with || { log_warn "..."; echo ""; }
At minimum, log what failed before masking
Tests: failed health check produces log entry

S11 — Trap cleanup for temp files

All scripts that create /tmp files: add trap "rm -rf /tmp/deploy-$$" EXIT at start
Files: deploy-to-target.sh, deploy-tailscale.sh, build-auto-installer-iso.sh
Tests: script interrupted mid-execution → temp files cleaned up

S12 — Quote all variables

Audit and fix unquoted $VARIABLE in command arguments across all scripts
Tests: shellcheck passes on all modified scripts

S13 — Extract hardcoded IPs to config

Create scripts/deploy-config-defaults.sh with all node IPs as named variables
Source from all scripts instead of hardcoding
Tests: changing IP in config → all scripts use new IP

Week 7: Infrastructure Hardening

I2 — Systemd resource limits

File: image-recipe/configs/archipelago.service
Add: MemoryMax=4G, LimitNOFILE=65535, TasksMax=2048
Tests: systemctl show archipelago confirms limits applied, service starts normally

I3 — Tor rotation transition period

File: core/archipelago/src/api/rpc/tor.rs
Keep old hidden service running for 24h after rotation. Both addresses active. Notify peers of new address. Schedule old deletion.
Tests: after rotation old address still resolves, peers receive notification, old removed after transition

S14 — Input validation on deploy targets

Add regex validation for hostnames/IPs before SSH
Tests: invalid hostname → clear error, valid hostname → proceeds

S15 — Memory limits on all deploy containers

File: deploy-to-target.sh lines 842-880
Add --memory=$(mem_limit ...) to all UI container builds
Tests: every container in deploy has --memory flag

S17 — Disk space pre-flight

File: deploy-to-target.sh
Check that target disk usage is below 85% before deploying. Abort with a clear message otherwise.
Tests: deploy to 90% full disk → aborted, deploy to 50% full → succeeds

Week 8: Remaining P1 Backend

R14 — Fix .parse().unwrap() in session rate limiting

File: session.rs:665,676,688
Replace .parse().unwrap() with .parse().context("...")?
Tests: invalid IP handling works gracefully

R15 — Fix 7 unwrap/expect in mesh/protocol.rs

File: mesh/protocol.rs:582,592,614,649,679,713,728
Replace all with ? operator + proper error types
Tests: protocol parsing with malformed data returns error, not panic

R27 — Add timeouts to mesh Bitcoin RPC calls

File: mesh/mod.rs:624,649,663
Add tokio::time::timeout(Duration::from_secs(10), ...) to all Bitcoin RPC calls
Tests: RPC timeout returns error after 10s

R34 — Tor rotation transition

(Covered by I3 above)

MONTH 3: PRODUCTION POLISH (Weeks 9–12)

Fix every remaining P2 issue — unwraps, hardcoded values, frontend quality, resilience.

Week 9: Remaining Backend Unwraps + Dead Code

R13 — main.rs .expect() → .context()

Replace 2 .expect() calls with .context("...")? and proper startup error handling

R16 — identity.rs .expect() → safe handling

Replace 2 .expect() in crypto operations with result propagation

R17+R18 — helpers unwraps

Fix 10 .unwrap() calls in helpers/lib.rs and helpers/rsync.rs
Replace with ? operator or .context()

R19 — js-engine unwraps

Fix 2 .unwrap() in js-engine/lib.rs:130,249

R20+R21 — Dead code elimination

Remove all 14 #[allow(dead_code)] in mesh/mod.rs. Either use the fields or delete them.
Same for lnd.rs, data_manager.rs, dev_orchestrator.rs
Tests: cargo clippy zero warnings, cargo test passes

Week 10: Hardcoded Values → Constants

R22 — Bitcoin RPC URL constant

Create const BITCOIN_RPC_URL: &str = "http://127.0.0.1:8332/"; in a shared constants module
Use across bitcoin.rs, mesh/mod.rs, mesh/listener.rs
Tests: all Bitcoin RPC calls still work

R23 — DWN health URL constant
R24 — Update manifest URL constant
R25 — DNS-over-HTTPS URLs → constants array
R26 — DWN protocol URIs → constants

Centralize all hardcoded URLs/URIs into core/archipelago/src/constants.rs
Tests: all modules reference constants, no hardcoded strings remain

R28 — LND proxy timeouts

Audit all 68 .send() calls in api/rpc/lnd.rs. Ensure each has explicit timeout.
Tests: LND proxy call with unresponsive LND → timeout error, not hang

R29 — DWN health check timeout

Add timeout to dwn_sync.rs:76 health check

R30-R33 — Resolve all TODOs

Either implement the TODO or remove the dead code path. Per project rules: no TODO/FIXME in commits.

Week 11: Frontend P2 Fixes

F8 — WebSocket reconnection race

Add isReconnecting flag. Skip if already reconnecting.
Tests: rapid close events → only one reconnect attempt

F9 — WebSocket parse error handling

Count consecutive parse errors. After 3, force reconnect.
Tests: 3 malformed messages → reconnect triggered; single bad message → logged only
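
The consecutive-error rule could look roughly like this. It is a sketch under assumptions: `createMessageParser`, `onMessage`, and `forceReconnect` are hypothetical hooks, not names from `api/websocket.ts`.

```typescript
const MAX_CONSECUTIVE_PARSE_ERRORS = 3; // threshold from the plan

// Hypothetical helper: `onMessage` and `forceReconnect` stand in for real client hooks.
function createMessageParser(
  onMessage: (data: unknown) => void,
  forceReconnect: () => void,
) {
  let consecutiveErrors = 0;

  return function handleRaw(raw: string): void {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch (err) {
      consecutiveErrors++;
      // A single bad message is only logged.
      console.warn(`WebSocket parse error ${consecutiveErrors}/${MAX_CONSECUTIVE_PARSE_ERRORS}`, err);
      if (consecutiveErrors >= MAX_CONSECUTIVE_PARSE_ERRORS) {
        consecutiveErrors = 0;
        forceReconnect();
      }
      return;
    }
    consecutiveErrors = 0; // any good message resets the streak
    onMessage(parsed);
  };
}
```

Resetting the counter on every successfully parsed message is what distinguishes "three in a row" from "three ever", so an occasional malformed frame on a long-lived connection does not trigger a reconnect.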

F10 — Stale connection detection tuning

Detect staleness by ping/pong: require a pong within 30s of each ping. Don't close valid connections that are simply quiet.
Tests: quiet but healthy connection → stays open; no pong for 30s → reconnects

F11 — RPC client backoff reduction

Reduce default timeout from 30s to 15s. Add jitter to backoff. Cap total retry time at 20s.
Tests: server outage → user sees error within 20s, not 40s
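
A jittered backoff with a hard total budget can be sketched as a pure function. The 20s cap comes from the plan; `BASE_DELAY_MS`, the full-jitter strategy, and the `nextDelayMs` name are assumptions chosen for illustration.

```typescript
const TOTAL_RETRY_BUDGET_MS = 20_000; // cap from the plan
const BASE_DELAY_MS = 500; // assumed starting delay

// Full-jitter exponential backoff: each delay is random in [0, base * 2^attempt),
// clamped so the caller never waits past the total budget.
// Returns null once the budget is spent, meaning: stop retrying and surface the error.
function nextDelayMs(
  attempt: number,
  elapsedMs: number,
  random: () => number = Math.random,
): number | null {
  const remaining = TOTAL_RETRY_BUDGET_MS - elapsedMs;
  if (remaining <= 0) return null;
  const ceiling = BASE_DELAY_MS * 2 ** attempt;
  return Math.min(Math.floor(random() * ceiling), remaining);
}
```

Here `elapsedMs` is meant to be wall-clock time since the first attempt, so time spent waiting on each request's own timeout counts against the same 20s budget.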

F12 — Code splitting

Lazy-load all routes: () => import('./views/Web5.vue')
Add manual chunks in vite.config.ts for vendor/api
Tests: build produces multiple chunks, initial bundle < 200KB gzipped

F13 — DOMPurify on QR v-html

Add DOMPurify.sanitize() to QR SVG before v-html rendering
Tests: XSS payload in QR content → sanitized

Week 12: Frontend P2 Continued + Performance

F14 — Goals computed memoization

Replace O(n) alias lookup with Map. Add deep equality check.
Tests: goalStatuses computed runs in <1ms with 100 apps
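
The Map-based lookup might be built along these lines. `AppEntry`, its `aliases` field, and `buildAliasIndex` are hypothetical; the real goals store may model aliases differently.

```typescript
// Hypothetical shape: the real store may model apps and aliases differently.
interface AppEntry {
  id: string;
  aliases: string[];
}

// Build the alias index once; rebuild only when the app list itself changes.
function buildAliasIndex(apps: AppEntry[]): Map<string, AppEntry> {
  const index = new Map<string, AppEntry>();
  for (const app of apps) {
    index.set(app.id, app);
    for (const alias of app.aliases) index.set(alias, app);
  }
  return index;
}

// Each lookup is now a single hash probe instead of a scan over every app.
function findApp(index: Map<string, AppEntry>, nameOrAlias: string): AppEntry | undefined {
  return index.get(nameOrAlias);
}
```

If the index lives in its own computed value keyed on the app list, dependent computeds such as goal status would only pay the rebuild cost when apps are added or removed.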

F15 — localStorage error handling

Wrap all localStorage.setItem in try/catch. Show toast on quota exceeded.
Tests: full localStorage → toast shown, app continues
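
A guarded write helper could look like the sketch below. `safeSetItem`, `StorageLike`, and `notify` are hypothetical names; `notify` stands in for whatever toast function the app uses, and the storage is passed in so the helper can be exercised without a browser.

```typescript
// Minimal slice of the Web Storage API, so the helper can be tested without a browser.
interface StorageLike {
  setItem(key: string, value: string): void;
}

// Returns true on success; on failure reports via `notify` and returns false
// so the caller can keep running with in-memory state.
function safeSetItem(
  storage: StorageLike,
  key: string,
  value: unknown,
  notify: (message: string) => void,
): boolean {
  try {
    storage.setItem(key, JSON.stringify(value));
    return true;
  } catch (err) {
    // Browsers throw a DOMException (commonly named QuotaExceededError) when storage is full.
    const reason = err instanceof Error ? err.name : "unknown error";
    notify(`Could not save "${key}": ${reason}`);
    return false;
  }
}
```

In the app this would be called as `safeSetItem(localStorage, ...)`; routing all five-plus call sites through one helper also gives a single place to grep for unguarded `localStorage.setItem` calls.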

F16 — FileBrowser auth consolidation

Use cookie-only auth. Remove in-memory token.
Tests: login persists across page reload, logout clears cookie

F17 — CSRF token parsing robustness

Add header fallback for CSRF token. Handle edge cases.
Tests: missing cookie → falls back to header, both missing → error
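
Cookie parsing without a regex, plus the fallback, might be sketched as follows. The cookie name `csrf_token` is an assumption, and `headerToken` models a token previously captured from a response header; neither is confirmed by the plan.

```typescript
// Splits a Cookie-style string on ";" and matches names exactly.
function readCookie(cookieHeader: string, name: string): string | null {
  for (const part of cookieHeader.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip fragments without a value
    if (part.slice(0, eq).trim() !== name) continue;
    const raw = part.slice(eq + 1).trim();
    try {
      return decodeURIComponent(raw);
    } catch {
      return raw; // malformed percent-encoding: use the raw value
    }
  }
  return null;
}

// `csrf_token` is an assumed cookie name; the fallback is a token taken from a header.
function resolveCsrfToken(cookieHeader: string, headerToken: string | null): string {
  const token = readCookie(cookieHeader, "csrf_token") ?? headerToken;
  if (!token) throw new Error("CSRF token missing from both cookie and header");
  return token;
}
```

Exact name matching avoids the usual regex pitfall where a cookie such as `x_csrf_token` is mistaken for `csrf_token`.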

F22 — CSS backdrop-filter mobile performance

Add media query: reduce blur to 8px on mobile. Remove backdrop-filter from non-visible elements.
Tests: mobile Lighthouse performance score > 80

MONTH 4–5: BACKEND ARCHITECTURE (Weeks 13–20)

Split every Rust god file. Target: no file > 500 lines.

Week 13–14: Split package.rs (1,795 lines)

api/rpc/package/
├── mod.rs          — Re-exports (~50 lines)
├── config.rs       — get_app_config(), get_app_capabilities(), needs_archy_net()
├── lifecycle.rs    — install, start, stop, restart, uninstall
├── validation.rs   — Input validation, dependency checking, image validation
└── progress.rs     — Progress streaming, install status tracking

Pre-split tests: test every get_app_config() variant, validation path, lifecycle transition
Post-split: all RPC calls return identical responses, cargo test passes

Week 15–16: Split mesh/listener.rs (1,799 lines)

mesh/listener/
├── mod.rs          — Re-exports + spawn_mesh_listener()
├── session.rs      — run_mesh_session() loop
├── frames.rs       — handle_frame() dispatcher
├── identity.rs     — handle_identity_received(), handle_typed_message()
├── sync.rs         — sync_queued_messages(), store_typed_message()
└── bitcoin.rs      — Bitcoin relay operations, RPC calls

Week 17–18: Split rpc/mod.rs (1,092 lines) + lnd.rs (1,068 lines)

rpc/mod.rs → dispatcher.rs (method routing), middleware.rs (CSRF/session/rate-limit), response.rs (response building)

lnd.rs → lnd/wallet.rs, lnd/channels.rs, lnd/info.rs, lnd/payments.rs

Week 19–20: Split monitoring (993), handler (911), mesh (865)

Split each into sub-modules. Target: no file > 500 lines. Same pre-split tests and post-split verification as the earlier splits.

MONTH 6–8: FRONTEND ARCHITECTURE (Weeks 21–32)

Split every Vue god component. Target: no component > 500 lines.

Week 21–22: Split Web5.vue (3,940 lines → 8 sub-views)

views/web5/
├── Web5.vue            — Router shell (~150 lines)
├── Web5Identity.vue    — DID management
├── Web5Wallet.vue      — Wallet operations
├── Web5Nostr.vue       — Nostr relays/profiles
├── Web5Credentials.vue — Verifiable Credentials
├── Web5Peers.vue       — P2P federation nodes
├── Web5Storage.vue     — DWN storage/explorer
├── Web5Goals.vue       — Goals/voting
└── Web5Marketplace.vue — Decentralized marketplace

Add nested routes. Component tests for each section. All sections render identically.

Week 23–24: Split Mesh.vue (2,106) + Dashboard.vue (1,819)

Mesh.vue → MeshRadio.vue, MeshChat.vue, MeshNetwork.vue, MeshFederation.vue
Dashboard.vue → DashboardHome.vue, DashboardApps.vue, DashboardSystem.vue

Week 25–26: Split Settings.vue (1,792) + Server.vue (1,132)

Settings.vue → SettingsAccount.vue, SettingsSystem.vue, SettingsNetwork.vue, SettingsAppearance.vue
Server.vue → ServerOverview.vue, ServerContainers.vue, ServerLogs.vue

Week 27–28: Split Marketplace.vue (1,293) + AppDetails.vue (1,036) + Home.vue (1,059)

Each into 3-4 focused sub-components.

Week 29–30: Decompose useAppStore (324 lines, 16 methods)

stores/
├── app.ts          — Thin re-export for backward compat (~50 lines)
├── auth.ts         — Login, logout, session, password, TOTP
├── server.ts       — Server info, system stats, reboot/shutdown
├── realtime.ts     — WebSocket connection, subscriptions, heartbeat
└── packages.ts     — Package install/uninstall, marketplace data

Tests: every existing import of useAppStore still works. State transitions identical.

Week 31–32: Remaining frontend P3 issues

F18 — aiPermissions runtime validation
F19 — Track AppSession timeout
F20 — Dashboard aria-current
F21 — Debounce search + memoize
F23 — Branded types for DID operations
F24 — Fix checkInterval leak

MONTH 9–10: SCRIPT ARCHITECTURE + ISO (Weeks 33–40)

Split every monolithic script. Target: no script > 400 lines.

Week 33–34: Create shared script library

scripts/lib/
├── common.sh       — Colors, logging, error handling, SSH helpers
├── health.sh       — Health check polling, container status
├── deploy-utils.sh — Rsync, file sync, backup/restore
├── container.sh    — Podman helpers, image management, mem_limit()
└── network.sh      — IP validation, port checking

Tests: each library function tested in scripts/tests/

Week 35–36: Split deploy-to-target.sh (1,728 lines)

scripts/
├── deploy-to-target.sh  — Orchestrator + arg parsing (~300 lines)
├── deploy/
│   ├── frontend.sh      — Build + sync frontend
│   ├── backend.sh       — Build + sync binary
│   ├── configs.sh       — Sync nginx, systemd, scripts
│   ├── containers.sh    — Container creation/update
│   ├── verify.sh        — Post-deploy health checks
│   └── rollback.sh      — Rollback on failure

Week 37–38: Split ISO build (1,850 lines) + first-boot (855 lines)

build-auto-installer-iso.sh → build/capture-images.sh, build/create-rootfs.sh, build/install-packages.sh, build/bundle-configs.sh, build/package-iso.sh

first-boot-containers.sh → first-boot/databases.sh, first-boot/bitcoin.sh, first-boot/lightning.sh, first-boot/apps.sh, first-boot/networking.sh

Week 39–40: ISO Reproducibility + Integration Tests

S16 — Make ISO builds reproducible

Create image-versions.env with pinned digests for every container image
ISO build sources this file, never pulls :latest
Build manifest records exactly what shipped
Tests: two consecutive ISO builds produce identical image sets

E2E smoke test script

# scripts/smoke-test.sh — Run against .198
# 1. curl /health → OK
# 2. Login → get session
# 3. Get server info → valid JSON
# 4. List containers → all healthy
# 5. Check every /app/* proxy → responds
# 6. Check Tor hidden service → resolves
# 7. Check WebSocket upgrade → 101
# Exit 0 only if all pass

MONTH 11: INTEGRATION TESTS (Weeks 41–44)

Comprehensive test suites that prove everything works.

Week 41–42: Backend Integration Tests

core/archipelago/tests/
├── test_auth_flow.rs           — Login → session → CSRF → auth request → logout
├── test_container_lifecycle.rs — Install → start → health → stop → uninstall
├── test_federation.rs          — Generate invite → join → sync → verify
├── test_rpc_validation.rs      — Every endpoint with invalid input → proper error
├── test_session_persist.rs     — Create session → restart → session survives
├── test_rate_limiting.rs       — Flood → 429 → wait → allowed
├── test_backup_restore.rs      — Create → verify → restore → validate
├── test_health_endpoint.rs     — Healthy → degraded → recovery

Target: 25+ backend integration tests passing

Week 43–44: Frontend Integration Tests

neode-ui/src/__tests__/integration/
├── auth-flow.spec.ts           — Login → dashboard → timeout → redirect
├── app-lifecycle.spec.ts       — Marketplace → install → progress → launch → uninstall
├── websocket.spec.ts           — Connect → update → disconnect → reconnect → state consistent
├── settings-flow.spec.ts       — Change password → re-login → 2FA setup → verify
├── spotlight.spec.ts           — Open → search → navigate → close
├── mesh-chat.spec.ts           — Connect → send → receive → disconnect
├── error-handling.spec.ts      — Network error → toast → retry → success
├── code-splitting.spec.ts      — Route navigation → chunks loaded lazily

Target: 20+ frontend integration tests passing

MONTH 12: TYPE SYNC + CI/CD PLAN (Weeks 45–48)

Week 45–46: Rust↔TypeScript Type Sync

Approach: ts-rs crate to auto-generate TypeScript types from Rust structs

Add ts-rs to core/models/Cargo.toml
Add #[derive(TS)] to all API request/response types
Build script generates neode-ui/src/types/generated.ts
Replace manual types in types/api.ts with imports from generated file
Verification: regenerate → diff → must be zero (types committed)

Tests: frontend type-check passes with generated types, manual api.ts reduced to non-API types

Week 47–48: CI/CD Planning (Document Only — Execute Later)

This section is the PLAN for CI/CD. Do not execute during this phase. Document everything needed so it can be implemented in a future sprint.

CI Pipeline Design (.github/workflows/ci.yml):

# Triggers: push to main, all PRs
# Jobs:
#   rust-checks (Linux runner):
#     - cargo clippy --all-targets --all-features (zero warnings gate)
#     - cargo fmt --all -- --check (formatting gate)
#     - cargo test --all-features (all tests gate)
#
#   frontend-checks (Node 20):
#     - npm run type-check (TypeScript strictness gate)
#     - npm run lint (ESLint gate)
#     - npm test (Vitest suite gate)
#
#   integration (Linux runner, optional):
#     - scripts/smoke-test.sh against staging
#
# Merge policy: all checks must pass before merge
# Branch protection: require PR, require checks, no force push to main

Release Pipeline Design (.github/workflows/release.yml):

# Triggers: tag push (v*)
# Jobs:
#   build-linux-binary:
#     - Cross-compile Rust for x86_64 + ARM64
#   build-frontend:
#     - npm run build
#   build-iso:
#     - SSH to build server, run ISO build
#     - Upload ISO as release asset
#   smoke-test:
#     - Boot ISO in QEMU
#     - Run smoke-test.sh
#     - Gate release on pass

Pre-requisites to implement:

GitHub Actions runner with Rust toolchain + cross-compilation
Node.js 20 runner for frontend
SSH key for build server accessible from CI
Branch protection rules configured
Image digest manifest for reproducible ISO builds
QEMU-based ISO verification script

Estimated implementation time: 2 weeks when ready to execute

VERIFICATION PROTOCOL (Every Week)

cargo clippy --all-targets --all-features — zero warnings
cargo fmt --all
cargo test --all-features — all pass
cd neode-ui && npm run type-check — zero errors
cd neode-ui && npm test — all pass
./scripts/deploy-to-target.sh --target 192.168.1.198 — ONLY .198
curl http://192.168.1.198/health — returns OK with service status
Navigate all affected views in browser — identical behavior
Atomic commit: refactor: <description> or fix: <description>

EXIT CRITERIA (Month 12 Complete)

Reliability (Zero Tolerance)

Health endpoint returns real service status
All async operations have bounded timeouts
Zero blocking I/O in async context (no std::fs in async functions)
Zero .unwrap()/.expect() in production code
All rate limiters have cleanup tasks
Backup restore uses staging + atomic swap + rollback
All 30 containers have health checks + memory limits
All container images pinned to specific versions
Nginx unauthenticated endpoints protected (timeout + rate limit + body size)
Systemd service has resource limits
Tor rotation preserves old address during transition
Deploy has locking + disk check + rollback
Zero sudo podman in any script
Zero :latest image tags anywhere
Zero silent error masking without logging

Frontend (Zero Tolerance)

Global error handler catches and displays all errors
WebSocket: single subscription, reconnect refreshes state, bounded retries
All timers/listeners cleaned up on unmount
Code splitting: initial bundle < 200KB gzipped
v-html always uses DOMPurify
All localStorage operations wrapped in try/catch

Architecture (Target: File Size Limits)

No Rust file > 500 lines (excluding generated code)
No Vue component > 500 lines
No shell script > 400 lines
No Pinia store has more than 1 responsibility
All hardcoded URLs/ports extracted to constants
Shared script library eliminates duplication
TypeScript types auto-generated from Rust structs

Testing

25+ backend integration tests passing
20+ frontend integration tests passing
E2E smoke test script passes on .198
ISO builds are reproducible (pinned digests)

CI/CD (Planned, Not Executed)

CI pipeline design documented
Release pipeline design documented
Pre-requisites list complete
Ready for 2-week implementation sprint

Zero Behavior Changes

Every feature works identically. Every existing test passes. Every user flow unchanged.
