archy/.claude/plans/plan.md

515 lines
22 KiB
Markdown
Raw Normal View History

# Archipelago Production Polish Plan
**Duration**: 8 weeks (March 10 May 4, 2026)
**Goal**: Zero new features. Every existing feature polished to flawless production quality.
**Philosophy**: The iPhone moment — everything just works, feels inevitable, no rough edges.
## SSH Access
All remote commands use SSH key auth (password auth is disabled):
```bash
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228
```
Never use `sshpass`. The deploy script handles this automatically via `SSH_KEY`.
---
## Audit Summary
Full codebase audit completed March 8, 2026. Findings:
| Layer | Critical | High | Medium | Low |
|-------|----------|------|--------|-----|
| Frontend (Vue/TS) | 4 | 6 | 10 | 4 |
| Backend (Rust) | 6 | 6 | 6 | 7 |
| Infrastructure | 5 | 6 | 7 | 3 |
| UX Flows | 4 | 4 | 6 | 3 |
| **Total** | **19** | **22** | **29** | **17** |
---
## Skills Required
### Existing Skills (14)
`deploy`, `deploy-both`, `diagnose`, `check-server`, `frontend-dev`, `sync-configs`, `build-iso`, `server-logs`, `add-app`, `harden`, `test`, `lint`, `ux-review`, `refactor`
### New Skills (9)
| Skill | Purpose |
|-------|---------|
| `polish` | Main orchestrator — reads this plan, detects week, executes tasks |
| `polish-errors` | Fix silent error handling, add user-facing error states |
| `polish-loading` | Add skeleton loaders, loading indicators, empty states |
| `polish-forms` | Input validation, trimming, real-time feedback |
| `polish-backend` | Fix unwrap/expect, add timeouts, connection pooling |
| `polish-deploy` | Add rollback, health checks, pre-deploy validation |
| `polish-security` | Systemd hardening, nginx CSP, secrets management |
| `polish-websocket` | Reconnection UX, connection status indicator, heartbeat |
| `sweep` | Full automated quality sweep: lint + type-check + verify fixes |
---
## Week 1: Silent Failures & Error Handling (March 1016)
**Theme**: Nothing fails silently. Every error is visible, actionable, recoverable.
### Tasks
#### 1.1 Frontend: Kill all silent catch blocks
- **Files**: Settings.vue, Web5.vue, router/index.ts, Apps.vue, OnboardingIntro.vue
- **Action**: Replace 21+ `.catch(() => {})` patterns with proper error handling
- **Pattern**: Log to console in dev, show toast/inline error to user in prod
- **Acceptance**: Zero `.catch(() => {})` in codebase (grep confirms)
- **Skill**: `/polish-errors`
#### 1.2 Frontend: Remove all console.log from production
- **Files**: stores/app.ts (15+), api/websocket.ts (12+)
- **Action**: Replace with conditional dev-only logging or remove
- **Pattern**: `if (import.meta.env.DEV) console.log(...)` or remove entirely
- **Acceptance**: Zero `console.log` outside of dev guards (grep confirms)
- **Skill**: `/lint`
#### 1.3 Backend: Fix all unwrap/expect in handler.rs
- **Files**: core/archipelago/src/api/handler.rs (11 unwraps)
- **Action**: Replace `.unwrap()` on Response builders with `.map_err()` and `?`
- **Acceptance**: Zero `unwrap()` in handler.rs
- **Skill**: `/polish-backend`
#### 1.4 Backend: Fix unwrap/expect across all production paths
- **Files**: main.rs, identity.rs, totp.rs, rpc/mod.rs, image_verifier.rs
- **Action**: Audit all 32 `.unwrap()`/`.expect()` calls, replace with `?` or `.context()`
- **Acceptance**: Zero unwrap/expect outside of test modules
- **Skill**: `/polish-backend`
#### 1.5 Backend: Hardcoded Bitcoin RPC credentials
- **Files**: core/archipelago/src/api/rpc/bitcoin.rs:89
- **Action**: Move `archipelago/archipelago123` to env var or secrets manager
- **Pattern**: `std::env::var("ARCHIPELAGO_BITCOIN_RPC_USER").unwrap_or("archipelago".into())`
- **Acceptance**: No hardcoded credentials in Rust source
#### 1.6 Deploy & verify
- Run `/lint` to confirm zero violations
- Run `/deploy` to live server
- Run `/check-server` to verify health
- Manual spot-check: trigger errors in UI, confirm they're visible
---
## Week 2: Loading States & Visual Feedback (March 1723)
**Theme**: The user always knows what's happening. No blank screens, no mystery waits.
### Tasks
#### 2.1 Add skeleton loaders to all async views
- **Files**: Apps.vue, AppDetails.vue, Marketplace.vue, Cloud.vue, Server.vue, Settings.vue
- **Action**: Create `SkeletonLoader.vue` component, add to every view that fetches data
- **Pattern**: Show skeleton immediately, swap to real content on load
- **Acceptance**: Every view shows placeholder content during load
- **Skill**: `/polish-loading`
#### 2.2 Add timeout warnings to long operations
- **Files**: Login.vue (server startup), Marketplace.vue (app install)
- **Action**: After 15s show "Taking longer than expected...", after 30s show troubleshoot options
- **Acceptance**: No operation silently hangs
#### 2.3 Fix Start/Stop button state mismatch
- **Files**: Apps.vue, AppDetails.vue, ContainerApps.vue
- **Action**: Button reflects actual backend state, not a fixed 5s timer
- **Pattern**: Poll backend every 2s during state transition, update button immediately on response
- **Acceptance**: Button state always matches container state within 3s
#### 2.4 Connection status indicator
- **Files**: Create `ConnectionStatus.vue`, integrate into App.vue header
- **Action**: Show green/amber/red dot based on WebSocket connection state
- **Pattern**: Use `wsClient.isConnected()` — green=connected, amber=reconnecting, red=disconnected
- **Acceptance**: User always knows if they're connected
- **Skill**: `/polish-websocket`
#### 2.5 Fix OnlineStatusPill to use real data
- **Files**: components/OnlineStatusPill.vue
- **Action**: Connect to actual WebSocket state instead of hardcoded "Online"
- **Acceptance**: Pill reflects real connection state
#### 2.6 Empty states for all views
- **Files**: Apps.vue, Cloud.vue, ContainerApps.vue
- **Action**: When no data, show helpful message with CTA (e.g., "No apps installed — Browse Marketplace")
- **Acceptance**: Every view handles the zero-data case gracefully
#### 2.7 Deploy & verify
- `/deploy` then `/check-server`
- Test: disconnect network, observe status indicator
- Test: slow network (throttle), observe skeleton loaders
- Test: fresh account with no apps, observe empty states
---
## Week 3: Form Validation & Input Quality (March 2430)
**Theme**: Every input feels responsive, validated, impossible to misuse.
### Tasks
#### 3.1 Real-time password validation
- **Files**: Login.vue (password setup), Settings.vue (password change)
- **Action**: Show inline validation as user types: length check, match check, strength meter
- **Pattern**: Debounced validation on input, green checkmark / red X per rule
- **Acceptance**: User sees validation state before clicking submit
- **Skill**: `/polish-forms`
#### 3.2 TOTP input improvements
- **Files**: Login.vue (TOTP verify step)
- **Action**: Auto-submit on 6 digits, show session countdown timer, trim whitespace
- **Pattern**: `watch(code, () => { if (code.length === 6) submit() })`
- **Acceptance**: TOTP flow is fast and clear, session timeout is visible
#### 3.3 Input trimming on all forms
- **Files**: Login.vue, Settings.vue, any form input
- **Action**: `.trim()` all text inputs before submission
- **Acceptance**: Leading/trailing whitespace never causes failures
#### 3.4 Disable submit buttons during operations
- **Files**: Settings.vue (password change), Login.vue (login), Marketplace.vue (install)
- **Action**: Add `:disabled="isSubmitting"` to all action buttons
- **Pattern**: Button shows spinner + disabled state during async operation
- **Acceptance**: No button can be double-clicked during an operation
#### 3.5 Error message consistency
- **Files**: All views with error messages
- **Action**: Create `formatError()` utility that normalizes error messages
- **Pattern**: Network errors -> "Can't reach server", Auth errors -> "Session expired", Server errors -> "Something went wrong"
- **Acceptance**: Error messages are user-friendly, never show raw error strings
#### 3.6 Deploy & verify
- Test every form: login, password change, TOTP setup, app install
- Try invalid inputs, verify feedback is immediate and clear
---
## Week 4: Backend Robustness (March 31 April 6)
**Theme**: The backend never crashes, never hangs, handles every edge case.
### Tasks
#### 4.1 Add timeouts to all container operations
- **Files**: core/archipelago/src/container/dev_orchestrator.rs
- **Action**: Wrap all podman calls with `tokio::time::timeout(Duration::from_secs(30), ...)`
- **Acceptance**: No container operation can hang indefinitely
#### 4.2 Add timeouts to all external HTTP calls
- **Files**: bitcoin.rs, handler.rs (LND proxy)
- **Action**: Explicit `reqwest::Client` with timeout, not default
- **Pattern**: Reuse a single `Client` stored in `RpcHandler` state
- **Acceptance**: Every HTTP call has an explicit timeout
#### 4.3 Connection pooling for Bitcoin RPC
- **Files**: core/archipelago/src/api/rpc/bitcoin.rs
- **Action**: Store `reqwest::Client` in `RpcHandler`, reuse across requests
- **Acceptance**: One client instance, connection pooled
#### 4.4 Fix all clippy warnings
- **Action**: Run `cargo clippy --all-targets --all-features` on dev server, fix all 10 warnings
- **Warnings**: `should_implement_trait`, `get_first`, `assign_op_pattern`, `wildcard_in_or_patterns`, `redundant_field_names`, `unused_import`, `ptr_arg`, `very_complex_type`, `if_else_collapse`, `io::Error::other`
- **Acceptance**: `cargo clippy` returns zero warnings
- **Skill**: `/lint`
#### 4.5 Rate limiting on unauthenticated endpoints
- **Files**: core/archipelago/src/api/handler.rs
- **Action**: Add per-IP rate limiting to `/archipelago/node-message` and `/electrs-status`
- **Pattern**: In-memory rate limiter with 60 req/min per IP
- **Acceptance**: Endpoints return 429 when rate exceeded
#### 4.6 Consistent error codes and messages
- **Files**: All RPC endpoints
- **Action**: Define error code constants, consistent capitalization
- **Pattern**: `const ERR_AUTH: i32 = -1001;` etc.
- **Acceptance**: All error responses use defined constants
#### 4.7 Remove dead code
- **Files**: identity.rs (unused field, unused methods), auth.rs (dead_code allows)
- **Action**: Remove `identity_dir` field, remove unused `verify()` and `did_key()` methods, remove `#[allow(dead_code)]` and verify usage
- **Acceptance**: Zero `#[allow(dead_code)]` outside of generated code
#### 4.8 Replace println/eprintln with tracing
- **Files**: core/startos/src/* (23+ instances)
- **Action**: Replace `println!` -> `tracing::info!`, `eprintln!` -> `tracing::warn!`
- **Acceptance**: Zero `println!` / `eprintln!` in non-test code
#### 4.9 Deploy & verify
- `/deploy` then `/check-server` then `/diagnose`
- Test: kill Bitcoin container, verify backend doesn't crash
- Test: flood unauthenticated endpoint, verify rate limiting
- Test: restart backend, verify graceful startup
---
## Week 5: WebSocket & Real-Time Quality (April 713)
**Theme**: Real-time updates are bulletproof. Connection issues are transparent to the user.
### Tasks
#### 5.1 WebSocket reconnection UX
- **Files**: api/websocket.ts, App.vue
- **Action**: After max reconnect attempts, show persistent banner "Connection lost. Click to retry."
- **Pattern**: Don't silently give up after 10 attempts
- **Acceptance**: User always has a path to reconnect
- **Skill**: `/polish-websocket`
#### 5.2 WebSocket heartbeat improvement
- **Files**: api/websocket.ts
- **Action**: Send ping every 30s, expect pong within 5s, reconnect if missed
- **Acceptance**: Stale connections detected within 35s, not 60s
#### 5.3 RPC client session detection
- **Files**: api/rpc-client.ts
- **Action**: On 401/403 response, redirect to login page instead of showing generic error
- **Pattern**: `if (status === 401) { router.push('/login'); return; }`
- **Acceptance**: Expired sessions redirect to login immediately
#### 5.4 Message queuing during reconnection
- **Files**: api/rpc-client.ts, api/websocket.ts
- **Action**: If WebSocket is down, queue state-update subscriptions, replay on reconnect
- **Pattern**: Don't lose container state updates during brief disconnects
- **Acceptance**: State is consistent after reconnection without page refresh
#### 5.5 WebSocket race condition fix
- **Files**: stores/app.ts, api/websocket.ts
- **Action**: Fix duplicate listener issue on rapid reconnect (`isWsSubscribed` flag)
- **Pattern**: Use a Set of listener IDs, deduplicate on registration
- **Acceptance**: No duplicate event handlers after reconnect cycles
#### 5.6 Deploy & verify
- Test: kill backend, observe frontend reconnection behavior
- Test: toggle wifi, observe status indicator + reconnection
- Test: let session expire, verify redirect to login
---
## Week 6: Deployment & Infrastructure Hardening (April 1420)
**Theme**: Deployments are safe, reversible, and verified. Infrastructure is production-grade.
### Tasks
#### 6.1 Deploy script: add rollback capability
- **Files**: scripts/deploy-to-target.sh
- **Action**: Before overwriting binary/frontend, backup to `.backup` suffix
- **Pattern**: On health check failure after restart, restore from backup
- **Acceptance**: Failed deploy auto-restores previous working version
- **Skill**: `/polish-deploy`
#### 6.2 Deploy script: pre-deploy sanity checks
- **Files**: scripts/deploy-to-target.sh
- **Action**: Check disk space (2GB min), verify SSH key exists, verify target dir exists
- **Acceptance**: Deploy fails early with clear message if preconditions not met
#### 6.3 Deploy script: post-deploy health verification
- **Files**: scripts/deploy-to-target.sh
- **Action**: After restart, poll `/health` endpoint for 30s. If no 200, trigger rollback
- **Acceptance**: Every deploy is verified healthy before declaring success
#### 6.4 Deploy script: deployment locking
- **Files**: scripts/deploy-to-target.sh
- **Action**: Use flock to prevent concurrent deploys
- **Acceptance**: Second simultaneous deploy fails immediately with message
#### 6.5 First-boot script: add error handling
- **Files**: scripts/first-boot-containers.sh
- **Action**: Add `set -e`, verify each container starts before creating dependents
- **Acceptance**: If Bitcoin fails, Mempool is not attempted
#### 6.6 Systemd service hardening
- **Files**: image-recipe/configs/archipelago.service
- **Action**: Add `PrivateTmp=yes`, `NoNewPrivileges=true`, `ProtectSystem=strict`, `ProtectHome=yes`, `SystemCallFilter=@system-service`
- **Acceptance**: Service runs with minimal privileges
- **Skill**: `/harden`
#### 6.7 Nginx security headers
- **Files**: image-recipe/configs/nginx-archipelago.conf
- **Action**: Add HSTS, fix CSP (remove unsafe-inline), add rate limiting zones, custom log format that strips tokens
- **Acceptance**: Security headers pass Mozilla Observatory scan
#### 6.8 Nginx config: test before reload
- **Files**: scripts/deploy-to-target.sh
- **Action**: `nginx -t` failure should abort deploy and restore backup config
- **Acceptance**: Invalid nginx config never goes live
#### 6.9 Deploy & verify
- Test: deploy with intentionally broken binary, verify rollback
- Test: deploy with invalid nginx config, verify rollback
- Test: concurrent deploy attempt, verify lock
- Run `/diagnose` full check
---
## Week 7: Accessibility, Polish & Edge Cases (April 2127)
**Theme**: Every interaction is crisp. Keyboard users, slow networks, edge cases — all handled.
### Tasks
#### 7.1 ARIA labels on all interactive elements
- **Files**: All views and components
- **Action**: Add `aria-label` to buttons, links, form inputs that lack visible labels
- **Pattern**: `<button aria-label="Install Bitcoin Core" ...>`
- **Acceptance**: Every interactive element has accessible name
#### 7.2 Focus management in modals
- **Files**: Apps.vue (uninstall modal), Marketplace.vue (filter modal), Settings.vue
- **Action**: Trap focus inside modals, return focus on close, autofocus first interactive element
- **Pattern**: Use `useFocusTrap` composable
- **Acceptance**: Tab key never leaves modal; Escape closes; focus returns to trigger
#### 7.3 Keyboard navigation completeness
- **Files**: All views
- **Action**: Verify every action is reachable via keyboard (Tab/Enter/Escape)
- **Acceptance**: Full app usable without mouse
#### 7.4 Fix inline Tailwind violations
- **Files**: Web5.vue, AppDetails.vue, Cloud.vue, onboarding views
- **Action**: Extract inline classes to global classes in style.css
- **Pattern**: `px-3 py-1.5 rounded-lg bg-white/5` -> `.info-row` class
- **Acceptance**: Zero inline Tailwind utility classes in components
- **Skill**: `/ux-review`
#### 7.5 Touch feedback on mobile
- **Files**: style.css, app card components
- **Action**: Add `:active` states for mobile touch feedback
- **Pattern**: `.app-card:active { transform: scale(0.98); }`
- **Acceptance**: Every tappable element has tactile feedback
#### 7.6 Responsive edge cases
- **Files**: Marketplace.vue, Dashboard.vue, AppDetails.vue
- **Action**: Test at 320px, 375px, 768px, 1024px, 1440px widths
- **Fix**: Any overflow, text truncation, or broken layouts
- **Acceptance**: No horizontal scroll or broken layout at any standard width
#### 7.7 Fix template crash risks
- **Files**: ContainerApps.vue:76 (`app.image.split('/').pop()`)
- **Action**: Add null guards on all template expressions that chain methods
- **Pattern**: `app.image?.split('/').pop() ?? 'unknown'`
- **Acceptance**: No template expression can crash on null/undefined data
#### 7.8 Remove all TODO/FIXME from production code
- **Files**: Web5.vue, AppDetails.vue, backend TODO comments
- **Action**: Either implement the TODO or remove the dead code
- **Pattern**: If feature isn't ready, remove the UI element entirely
- **Acceptance**: Zero TODO/FIXME/HACK in committed code
- **Skill**: `/refactor`
#### 7.9 Deploy & verify
- Test: navigate entire app with keyboard only
- Test: resize browser through all breakpoints
- Test: screen reader (VoiceOver) basic navigation
- Run `/ux-review` on every view
---
## Week 8: Integration Testing, Final Sweep & ISO (April 28 May 4)
**Theme**: Everything works together. The final product is tested end-to-end and burned to ISO.
### Tasks
#### 8.1 Create critical path tests — Frontend
- **Files**: Create `neode-ui/src/__tests__/` directory
- **Tests to write**:
- Login flow: valid password, invalid password, TOTP, session timeout
- App lifecycle: install -> start -> launch -> stop -> uninstall
- Settings: password change, TOTP setup, TOTP disable
- WebSocket: connect, disconnect, reconnect
- **Framework**: Vitest + @vue/test-utils (already in package.json)
- **Acceptance**: 10+ critical path tests passing
- **Skill**: `/test`
#### 8.2 Create critical path tests — Backend
- **Tests to write**:
- RPC endpoint validation (good/bad input for each endpoint)
- Session management (create, validate, expire, invalidate)
- Container manifest parsing (valid, invalid, missing fields)
- Rate limiting (under limit, at limit, over limit)
- **Acceptance**: 10+ backend tests passing
- **Skill**: `/test`
#### 8.3 Create deployment verification test
- **Files**: scripts/verify-deploy.sh (new)
- **Action**: Script that hits every endpoint, checks every container, verifies every UI route
- **Pattern**: Automated smoke test run after every deploy
- **Acceptance**: Script exits 0 only if everything works
#### 8.4 Full quality sweep
- Run `/lint` — zero violations
- Run `/harden` — zero findings
- Run `/ux-review` — zero findings
- Run `/diagnose` — all green
- Run `/sweep` — clean bill of health
- **Acceptance**: All skills report zero issues
#### 8.5 Build final ISO
- Sync all configs: `/sync-configs`
- Build ISO: `/build-iso`
- Flash to USB, boot on clean hardware
- Verify first-boot experience end-to-end
- **Acceptance**: ISO boots, onboarding works, Bitcoin syncs, apps install
#### 8.6 Performance baseline
- Measure and document:
- Time to first meaningful paint (target: <2s)
- Login flow completion time (target: <3s)
- App install completion time (document actual)
- WebSocket reconnection time (target: <5s)
- Backend cold start time (target: <3s)
- **Acceptance**: All targets met or documented with explanation
#### 8.7 Final documentation pass
- Update `docs/current-state.md` to reflect production status
- Update `CHANGELOG.md` with all polish work
- Verify all CLAUDE.md instructions are still accurate
- **Acceptance**: Docs match reality
---
## Metrics & Definition of Done
### Per-Week Exit Criteria
Each week is "done" when:
1. All tasks for that week have acceptance criteria met
2. `/sweep` returns zero violations for that week's focus area
3. `/deploy` succeeds and `/check-server` is green
4. Manual spot-check of affected features passes
### Project Exit Criteria (Week 8)
The project is done when ALL of these are true:
- [ ] Zero `.catch(() => {})` in frontend
- [ ] Zero `console.log` outside dev guards
- [ ] Zero `unwrap()`/`expect()` in backend production paths
- [ ] Zero clippy warnings
- [ ] Zero inline Tailwind in components
- [ ] Zero TODO/FIXME in committed code
- [ ] Every view has: loading state, error state, empty state
- [ ] Every form has: real-time validation, disabled during submit
- [ ] Every button action has: loading feedback, error feedback
- [ ] WebSocket shows connection status to user
- [ ] Session timeout redirects to login
- [ ] Deploy has: rollback, health check, locking
- [ ] Systemd service is hardened
- [ ] Nginx has: HSTS, proper CSP, rate limiting, clean logs
- [ ] 10+ frontend tests passing
- [ ] 10+ backend tests passing
- [ ] ISO boots and onboards successfully
- [ ] All performance targets met
---
## Risk Register
| Risk | Mitigation |
|------|------------|
| Skeleton loaders change visual feel | Match exact glassmorphism style, use existing color tokens |
| Backend changes break existing functionality | Deploy to secondary server (198) first, test, then primary |
| Nginx CSP changes break app iframes | Test each framed app individually before deploying |
| Rate limiting blocks legitimate use | Set generous limits (60/min), monitor false positives |
| Test suite becomes maintenance burden | Only test critical paths, no unit tests for trivial code |
| ISO build captures incomplete state | Always build ISO from clean deploy, never mid-development |