archy/.claude/plans/plan.md
Dorian e8a0e1af19 feat: add Ollama proxy timeouts, SSH key migration, polish skills, and demo content
- Update all skill SSH commands from sshpass to key-based auth (~/.ssh/archipelago-deploy)
- Add proxy_connect_timeout 120s to nginx Ollama location blocks
- Add new polish/sweep skills for overnight automation
- Add demo content (documents, photos) for demo stack
- Add .ssh/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 08:06:52 +00:00

22 KiB
Raw Blame History

Archipelago Production Polish Plan

Duration: 8 weeks (March 10 May 4, 2026) Goal: Zero new features. Every existing feature polished to flawless production quality. Philosophy: The iPhone moment — everything just works, feels inevitable, no rough edges.

SSH Access

All remote commands use SSH key auth (password auth is disabled):

ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228

Never use sshpass. The deploy script handles this automatically via SSH_KEY.


Audit Summary

Full codebase audit completed March 8, 2026. Findings:

Layer Critical High Medium Low
Frontend (Vue/TS) 4 6 10 4
Backend (Rust) 6 6 6 7
Infrastructure 5 6 7 3
UX Flows 4 4 6 3
Total 19 22 29 17

Skills Required

Existing Skills (14)

deploy, deploy-both, diagnose, check-server, frontend-dev, sync-configs, build-iso, server-logs, add-app, harden, test, lint, ux-review, refactor

New Skills (9)

Skill Purpose
polish Main orchestrator — reads this plan, detects week, executes tasks
polish-errors Fix silent error handling, add user-facing error states
polish-loading Add skeleton loaders, loading indicators, empty states
polish-forms Input validation, trimming, real-time feedback
polish-backend Fix unwrap/expect, add timeouts, connection pooling
polish-deploy Add rollback, health checks, pre-deploy validation
polish-security Systemd hardening, nginx CSP, secrets management
polish-websocket Reconnection UX, connection status indicator, heartbeat
sweep Full automated quality sweep: lint + type-check + verify fixes

Week 1: Silent Failures & Error Handling (March 1016)

Theme: Nothing fails silently. Every error is visible, actionable, recoverable.

Tasks

1.1 Frontend: Kill all silent catch blocks

  • Files: Settings.vue, Web5.vue, router/index.ts, Apps.vue, OnboardingIntro.vue
  • Action: Replace 21+ .catch(() => {}) patterns with proper error handling
  • Pattern: Log to console in dev, show toast/inline error to user in prod
  • Acceptance: Zero .catch(() => {}) in codebase (grep confirms)
  • Skill: /polish-errors

1.2 Frontend: Remove all console.log from production

  • Files: stores/app.ts (15+), api/websocket.ts (12+)
  • Action: Replace with conditional dev-only logging or remove
  • Pattern: if (import.meta.env.DEV) console.log(...) or remove entirely
  • Acceptance: Zero console.log outside of dev guards (grep confirms)
  • Skill: /lint

1.3 Backend: Fix all unwrap/expect in handler.rs

  • Files: core/archipelago/src/api/handler.rs (11 unwraps)
  • Action: Replace .unwrap() on Response builders with .map_err() and ?
  • Acceptance: Zero unwrap() in handler.rs
  • Skill: /polish-backend

1.4 Backend: Fix unwrap/expect across all production paths

  • Files: main.rs, identity.rs, totp.rs, rpc/mod.rs, image_verifier.rs
  • Action: Audit all 32 .unwrap()/.expect() calls, replace with ? or .context()
  • Acceptance: Zero unwrap/expect outside of test modules
  • Skill: /polish-backend

1.5 Backend: Hardcoded Bitcoin RPC credentials

  • Files: core/archipelago/src/api/rpc/bitcoin.rs:89
  • Action: Move archipelago/archipelago123 to env var or secrets manager
  • Pattern: std::env::var("ARCHIPELAGO_BITCOIN_RPC_USER").unwrap_or("archipelago".into())
  • Acceptance: No hardcoded credentials in Rust source

1.6 Deploy & verify

  • Run /lint to confirm zero violations
  • Run /deploy to live server
  • Run /check-server to verify health
  • Manual spot-check: trigger errors in UI, confirm they're visible

Week 2: Loading States & Visual Feedback (March 1723)

Theme: The user always knows what's happening. No blank screens, no mystery waits.

Tasks

2.1 Add skeleton loaders to all async views

  • Files: Apps.vue, AppDetails.vue, Marketplace.vue, Cloud.vue, Server.vue, Settings.vue
  • Action: Create SkeletonLoader.vue component, add to every view that fetches data
  • Pattern: Show skeleton immediately, swap to real content on load
  • Acceptance: Every view shows placeholder content during load
  • Skill: /polish-loading

2.2 Add timeout warnings to long operations

  • Files: Login.vue (server startup), Marketplace.vue (app install)
  • Action: After 15s show "Taking longer than expected...", after 30s show troubleshoot options
  • Acceptance: No operation silently hangs

2.3 Fix Start/Stop button state mismatch

  • Files: Apps.vue, AppDetails.vue, ContainerApps.vue
  • Action: Button reflects actual backend state, not a fixed 5s timer
  • Pattern: Poll backend every 2s during state transition, update button immediately on response
  • Acceptance: Button state always matches container state within 3s

2.4 Connection status indicator

  • Files: Create ConnectionStatus.vue, integrate into App.vue header
  • Action: Show green/amber/red dot based on WebSocket connection state
  • Pattern: Use wsClient.isConnected() — green=connected, amber=reconnecting, red=disconnected
  • Acceptance: User always knows if they're connected
  • Skill: /polish-websocket

2.5 Fix OnlineStatusPill to use real data

  • Files: components/OnlineStatusPill.vue
  • Action: Connect to actual WebSocket state instead of hardcoded "Online"
  • Acceptance: Pill reflects real connection state

2.6 Empty states for all views

  • Files: Apps.vue, Cloud.vue, ContainerApps.vue
  • Action: When no data, show helpful message with CTA (e.g., "No apps installed — Browse Marketplace")
  • Acceptance: Every view handles the zero-data case gracefully

2.7 Deploy & verify

  • /deploy then /check-server
  • Test: disconnect network, observe status indicator
  • Test: slow network (throttle), observe skeleton loaders
  • Test: fresh account with no apps, observe empty states

Week 3: Form Validation & Input Quality (March 2430)

Theme: Every input feels responsive, validated, impossible to misuse.

Tasks

3.1 Real-time password validation

  • Files: Login.vue (password setup), Settings.vue (password change)
  • Action: Show inline validation as user types: length check, match check, strength meter
  • Pattern: Debounced validation on input, green checkmark / red X per rule
  • Acceptance: User sees validation state before clicking submit
  • Skill: /polish-forms

3.2 TOTP input improvements

  • Files: Login.vue (TOTP verify step)
  • Action: Auto-submit on 6 digits, show session countdown timer, trim whitespace
  • Pattern: watch(code, () => { if (code.length === 6) submit() })
  • Acceptance: TOTP flow is fast and clear, session timeout is visible

3.3 Input trimming on all forms

  • Files: Login.vue, Settings.vue, any form input
  • Action: .trim() all text inputs before submission
  • Acceptance: Leading/trailing whitespace never causes failures

3.4 Disable submit buttons during operations

  • Files: Settings.vue (password change), Login.vue (login), Marketplace.vue (install)
  • Action: Add :disabled="isSubmitting" to all action buttons
  • Pattern: Button shows spinner + disabled state during async operation
  • Acceptance: No button can be double-clicked during an operation

3.5 Error message consistency

  • Files: All views with error messages
  • Action: Create formatError() utility that normalizes error messages
  • Pattern: Network errors -> "Can't reach server", Auth errors -> "Session expired", Server errors -> "Something went wrong"
  • Acceptance: Error messages are user-friendly, never show raw error strings

3.6 Deploy & verify

  • Test every form: login, password change, TOTP setup, app install
  • Try invalid inputs, verify feedback is immediate and clear

Week 4: Backend Robustness (March 31 April 6)

Theme: The backend never crashes, never hangs, handles every edge case.

Tasks

4.1 Add timeouts to all container operations

  • Files: core/archipelago/src/container/dev_orchestrator.rs
  • Action: Wrap all podman calls with tokio::time::timeout(Duration::from_secs(30), ...)
  • Acceptance: No container operation can hang indefinitely

4.2 Add timeouts to all external HTTP calls

  • Files: bitcoin.rs, handler.rs (LND proxy)
  • Action: Explicit reqwest::Client with timeout, not default
  • Pattern: Reuse a single Client stored in RpcHandler state
  • Acceptance: Every HTTP call has an explicit timeout

4.3 Connection pooling for Bitcoin RPC

  • Files: core/archipelago/src/api/rpc/bitcoin.rs
  • Action: Store reqwest::Client in RpcHandler, reuse across requests
  • Acceptance: One client instance, connection pooled

4.4 Fix all clippy warnings

  • Action: Run cargo clippy --all-targets --all-features on dev server, fix all 10 warnings
  • Warnings: should_implement_trait, get_first, assign_op_pattern, wildcard_in_or_patterns, redundant_field_names, unused_import, ptr_arg, very_complex_type, if_else_collapse, io::Error::other
  • Acceptance: cargo clippy returns zero warnings
  • Skill: /lint

4.5 Rate limiting on unauthenticated endpoints

  • Files: core/archipelago/src/api/handler.rs
  • Action: Add per-IP rate limiting to /archipelago/node-message and /electrs-status
  • Pattern: In-memory rate limiter with 60 req/min per IP
  • Acceptance: Endpoints return 429 when rate exceeded

4.6 Consistent error codes and messages

  • Files: All RPC endpoints
  • Action: Define error code constants, consistent capitalization
  • Pattern: const ERR_AUTH: i32 = -1001; etc.
  • Acceptance: All error responses use defined constants

4.7 Remove dead code

  • Files: identity.rs (unused field, unused methods), auth.rs (dead_code allows)
  • Action: Remove identity_dir field, remove unused verify() and did_key() methods, remove #[allow(dead_code)] and verify usage
  • Acceptance: Zero #[allow(dead_code)] outside of generated code

4.8 Replace println/eprintln with tracing

  • Files: core/startos/src/* (23+ instances)
  • Action: Replace println! -> tracing::info!, eprintln! -> tracing::warn!
  • Acceptance: Zero println! / eprintln! in non-test code

4.9 Deploy & verify

  • /deploy then /check-server then /diagnose
  • Test: kill Bitcoin container, verify backend doesn't crash
  • Test: flood unauthenticated endpoint, verify rate limiting
  • Test: restart backend, verify graceful startup

Week 5: WebSocket & Real-Time Quality (April 713)

Theme: Real-time updates are bulletproof. Connection issues are transparent to the user.

Tasks

5.1 WebSocket reconnection UX

  • Files: api/websocket.ts, App.vue
  • Action: After max reconnect attempts, show persistent banner "Connection lost. Click to retry."
  • Pattern: Don't silently give up after 10 attempts
  • Acceptance: User always has a path to reconnect
  • Skill: /polish-websocket

5.2 WebSocket heartbeat improvement

  • Files: api/websocket.ts
  • Action: Send ping every 30s, expect pong within 5s, reconnect if missed
  • Acceptance: Stale connections detected within 35s, not 60s

5.3 RPC client session detection

  • Files: api/rpc-client.ts
  • Action: On 401/403 response, redirect to login page instead of showing generic error
  • Pattern: if (status === 401) { router.push('/login'); return; }
  • Acceptance: Expired sessions redirect to login immediately

5.4 Message queuing during reconnection

  • Files: api/rpc-client.ts, api/websocket.ts
  • Action: If WebSocket is down, queue state-update subscriptions, replay on reconnect
  • Pattern: Don't lose container state updates during brief disconnects
  • Acceptance: State is consistent after reconnection without page refresh

5.5 WebSocket race condition fix

  • Files: stores/app.ts, api/websocket.ts
  • Action: Fix duplicate listener issue on rapid reconnect (isWsSubscribed flag)
  • Pattern: Use a Set of listener IDs, deduplicate on registration
  • Acceptance: No duplicate event handlers after reconnect cycles

5.6 Deploy & verify

  • Test: kill backend, observe frontend reconnection behavior
  • Test: toggle wifi, observe status indicator + reconnection
  • Test: let session expire, verify redirect to login

Week 6: Deployment & Infrastructure Hardening (April 1420)

Theme: Deployments are safe, reversible, and verified. Infrastructure is production-grade.

Tasks

6.1 Deploy script: add rollback capability

  • Files: scripts/deploy-to-target.sh
  • Action: Before overwriting binary/frontend, backup to .backup suffix
  • Pattern: On health check failure after restart, restore from backup
  • Acceptance: Failed deploy auto-restores previous working version
  • Skill: /polish-deploy

6.2 Deploy script: pre-deploy sanity checks

  • Files: scripts/deploy-to-target.sh
  • Action: Check disk space (2GB min), verify SSH key exists, verify target dir exists
  • Acceptance: Deploy fails early with clear message if preconditions not met

6.3 Deploy script: post-deploy health verification

  • Files: scripts/deploy-to-target.sh
  • Action: After restart, poll /health endpoint for 30s. If no 200, trigger rollback
  • Acceptance: Every deploy is verified healthy before declaring success

6.4 Deploy script: deployment locking

  • Files: scripts/deploy-to-target.sh
  • Action: Use flock to prevent concurrent deploys
  • Acceptance: Second simultaneous deploy fails immediately with message

6.5 First-boot script: add error handling

  • Files: scripts/first-boot-containers.sh
  • Action: Add set -e, verify each container starts before creating dependents
  • Acceptance: If Bitcoin fails, Mempool is not attempted

6.6 Systemd service hardening

  • Files: image-recipe/configs/archipelago.service
  • Action: Add PrivateTmp=yes, NoNewPrivileges=true, ProtectSystem=strict, ProtectHome=yes, SystemCallFilter=@system-service
  • Acceptance: Service runs with minimal privileges
  • Skill: /harden

6.7 Nginx security headers

  • Files: image-recipe/configs/nginx-archipelago.conf
  • Action: Add HSTS, fix CSP (remove unsafe-inline), add rate limiting zones, custom log format that strips tokens
  • Acceptance: Security headers pass Mozilla Observatory scan

6.8 Nginx config: test before reload

  • Files: scripts/deploy-to-target.sh
  • Action: nginx -t failure should abort deploy and restore backup config
  • Acceptance: Invalid nginx config never goes live

6.9 Deploy & verify

  • Test: deploy with intentionally broken binary, verify rollback
  • Test: deploy with invalid nginx config, verify rollback
  • Test: concurrent deploy attempt, verify lock
  • Run /diagnose full check

Week 7: Accessibility, Polish & Edge Cases (April 2127)

Theme: Every interaction is crisp. Keyboard users, slow networks, edge cases — all handled.

Tasks

7.1 ARIA labels on all interactive elements

  • Files: All views and components
  • Action: Add aria-label to buttons, links, form inputs that lack visible labels
  • Pattern: <button aria-label="Install Bitcoin Core" ...>
  • Acceptance: Every interactive element has accessible name

7.2 Focus management in modals

  • Files: Apps.vue (uninstall modal), Marketplace.vue (filter modal), Settings.vue
  • Action: Trap focus inside modals, return focus on close, autofocus first interactive element
  • Pattern: Use useFocusTrap composable
  • Acceptance: Tab key never leaves modal; Escape closes; focus returns to trigger

7.3 Keyboard navigation completeness

  • Files: All views
  • Action: Verify every action is reachable via keyboard (Tab/Enter/Escape)
  • Acceptance: Full app usable without mouse

7.4 Fix inline Tailwind violations

  • Files: Web5.vue, AppDetails.vue, Cloud.vue, onboarding views
  • Action: Extract inline classes to global classes in style.css
  • Pattern: px-3 py-1.5 rounded-lg bg-white/5 -> .info-row class
  • Acceptance: Zero inline Tailwind utility classes in components
  • Skill: /ux-review

7.5 Touch feedback on mobile

  • Files: style.css, app card components
  • Action: Add :active states for mobile touch feedback
  • Pattern: .app-card:active { transform: scale(0.98); }
  • Acceptance: Every tappable element has tactile feedback

7.6 Responsive edge cases

  • Files: Marketplace.vue, Dashboard.vue, AppDetails.vue
  • Action: Test at 320px, 375px, 768px, 1024px, 1440px widths
  • Fix: Any overflow, text truncation, or broken layouts
  • Acceptance: No horizontal scroll or broken layout at any standard width

7.7 Fix template crash risks

  • Files: ContainerApps.vue:76 (app.image.split('/').pop())
  • Action: Add null guards on all template expressions that chain methods
  • Pattern: app.image?.split('/').pop() ?? 'unknown'
  • Acceptance: No template expression can crash on null/undefined data

7.8 Remove all TODO/FIXME from production code

  • Files: Web5.vue, AppDetails.vue, backend TODO comments
  • Action: Either implement the TODO or remove the dead code
  • Pattern: If feature isn't ready, remove the UI element entirely
  • Acceptance: Zero TODO/FIXME/HACK in committed code
  • Skill: /refactor

7.9 Deploy & verify

  • Test: navigate entire app with keyboard only
  • Test: resize browser through all breakpoints
  • Test: screen reader (VoiceOver) basic navigation
  • Run /ux-review on every view

Week 8: Integration Testing, Final Sweep & ISO (April 28 May 4)

Theme: Everything works together. The final product is tested end-to-end and burned to ISO.

Tasks

8.1 Create critical path tests — Frontend

  • Files: Create neode-ui/src/__tests__/ directory
  • Tests to write:
    • Login flow: valid password, invalid password, TOTP, session timeout
    • App lifecycle: install -> start -> launch -> stop -> uninstall
    • Settings: password change, TOTP setup, TOTP disable
    • WebSocket: connect, disconnect, reconnect
  • Framework: Vitest + @vue/test-utils (already in package.json)
  • Acceptance: 10+ critical path tests passing
  • Skill: /test

8.2 Create critical path tests — Backend

  • Tests to write:
    • RPC endpoint validation (good/bad input for each endpoint)
    • Session management (create, validate, expire, invalidate)
    • Container manifest parsing (valid, invalid, missing fields)
    • Rate limiting (under limit, at limit, over limit)
  • Acceptance: 10+ backend tests passing
  • Skill: /test

8.3 Create deployment verification test

  • Files: scripts/verify-deploy.sh (new)
  • Action: Script that hits every endpoint, checks every container, verifies every UI route
  • Pattern: Automated smoke test run after every deploy
  • Acceptance: Script exits 0 only if everything works

8.4 Full quality sweep

  • Run /lint — zero violations
  • Run /harden — zero findings
  • Run /ux-review — zero findings
  • Run /diagnose — all green
  • Run /sweep — clean bill of health
  • Acceptance: All skills report zero issues

8.5 Build final ISO

  • Sync all configs: /sync-configs
  • Build ISO: /build-iso
  • Flash to USB, boot on clean hardware
  • Verify first-boot experience end-to-end
  • Acceptance: ISO boots, onboarding works, Bitcoin syncs, apps install

8.6 Performance baseline

  • Measure and document:
    • Time to first meaningful paint (target: <2s)
    • Login flow completion time (target: <3s)
    • App install completion time (document actual)
    • WebSocket reconnection time (target: <5s)
    • Backend cold start time (target: <3s)
  • Acceptance: All targets met or documented with explanation

8.7 Final documentation pass

  • Update docs/current-state.md to reflect production status
  • Update CHANGELOG.md with all polish work
  • Verify all CLAUDE.md instructions are still accurate
  • Acceptance: Docs match reality

Metrics & Definition of Done

Per-Week Exit Criteria

Each week is "done" when:

  1. All tasks for that week have acceptance criteria met
  2. /sweep returns zero violations for that week's focus area
  3. /deploy succeeds and /check-server is green
  4. Manual spot-check of affected features passes

Project Exit Criteria (Week 8)

The project is done when ALL of these are true:

  • Zero .catch(() => {}) in frontend
  • Zero console.log outside dev guards
  • Zero unwrap()/expect() in backend production paths
  • Zero clippy warnings
  • Zero inline Tailwind in components
  • Zero TODO/FIXME in committed code
  • Every view has: loading state, error state, empty state
  • Every form has: real-time validation, disabled during submit
  • Every button action has: loading feedback, error feedback
  • WebSocket shows connection status to user
  • Session timeout redirects to login
  • Deploy has: rollback, health check, locking
  • Systemd service is hardened
  • Nginx has: HSTS, proper CSP, rate limiting, clean logs
  • 10+ frontend tests passing
  • 10+ backend tests passing
  • ISO boots and onboards successfully
  • All performance targets met

Risk Register

Risk Mitigation
Skeleton loaders change visual feel Match exact glassmorphism style, use existing color tokens
Backend changes break existing functionality Deploy to secondary server (198) first, test, then primary
Nginx CSP changes break app iframes Test each framed app individually before deploying
Rate limiting blocks legitimate use Set generous limits (60/min), monitor false positives
Test suite becomes maintenance burden Only test critical paths, no unit tests for trivial code
ISO build captures incomplete state Always build ISO from clean deploy, never mid-development