archy/.claude/plans/plan.md
Dorian e8a0e1af19 feat: add Ollama proxy timeouts, SSH key migration, polish skills, and demo content
- Update all skill SSH commands from sshpass to key-based auth (~/.ssh/archipelago-deploy)
- Add proxy_connect_timeout 120s to nginx Ollama location blocks
- Add new polish/sweep skills for overnight automation
- Add demo content (documents, photos) for demo stack
- Add .ssh/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 08:06:52 +00:00

515 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Archipelago Production Polish Plan
**Duration**: 8 weeks (March 10 May 4, 2026)
**Goal**: Zero new features. Every existing feature polished to flawless production quality.
**Philosophy**: The iPhone moment — everything just works, feels inevitable, no rough edges.
## SSH Access
All remote commands use SSH key auth (password auth is disabled):
```bash
ssh -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228
```
Never use `sshpass`. The deploy script handles this automatically via `SSH_KEY`.
---
## Audit Summary
Full codebase audit completed March 8, 2026. Findings:
| Layer | Critical | High | Medium | Low |
|-------|----------|------|--------|-----|
| Frontend (Vue/TS) | 4 | 6 | 10 | 4 |
| Backend (Rust) | 6 | 6 | 6 | 7 |
| Infrastructure | 5 | 6 | 7 | 3 |
| UX Flows | 4 | 4 | 6 | 3 |
| **Total** | **19** | **22** | **29** | **17** |
---
## Skills Required
### Existing Skills (14)
`deploy`, `deploy-both`, `diagnose`, `check-server`, `frontend-dev`, `sync-configs`, `build-iso`, `server-logs`, `add-app`, `harden`, `test`, `lint`, `ux-review`, `refactor`
### New Skills (9)
| Skill | Purpose |
|-------|---------|
| `polish` | Main orchestrator — reads this plan, detects week, executes tasks |
| `polish-errors` | Fix silent error handling, add user-facing error states |
| `polish-loading` | Add skeleton loaders, loading indicators, empty states |
| `polish-forms` | Input validation, trimming, real-time feedback |
| `polish-backend` | Fix unwrap/expect, add timeouts, connection pooling |
| `polish-deploy` | Add rollback, health checks, pre-deploy validation |
| `polish-security` | Systemd hardening, nginx CSP, secrets management |
| `polish-websocket` | Reconnection UX, connection status indicator, heartbeat |
| `sweep` | Full automated quality sweep: lint + type-check + verify fixes |
---
## Week 1: Silent Failures & Error Handling (March 1016)
**Theme**: Nothing fails silently. Every error is visible, actionable, recoverable.
### Tasks
#### 1.1 Frontend: Kill all silent catch blocks
- **Files**: Settings.vue, Web5.vue, router/index.ts, Apps.vue, OnboardingIntro.vue
- **Action**: Replace 21+ `.catch(() => {})` patterns with proper error handling
- **Pattern**: Log to console in dev, show toast/inline error to user in prod
- **Acceptance**: Zero `.catch(() => {})` in codebase (grep confirms)
- **Skill**: `/polish-errors`
#### 1.2 Frontend: Remove all console.log from production
- **Files**: stores/app.ts (15+), api/websocket.ts (12+)
- **Action**: Replace with conditional dev-only logging or remove
- **Pattern**: `if (import.meta.env.DEV) console.log(...)` or remove entirely
- **Acceptance**: Zero `console.log` outside of dev guards (grep confirms)
- **Skill**: `/lint`
#### 1.3 Backend: Fix all unwrap/expect in handler.rs
- **Files**: core/archipelago/src/api/handler.rs (11 unwraps)
- **Action**: Replace `.unwrap()` on Response builders with `.map_err()` and `?`
- **Acceptance**: Zero `unwrap()` in handler.rs
- **Skill**: `/polish-backend`
#### 1.4 Backend: Fix unwrap/expect across all production paths
- **Files**: main.rs, identity.rs, totp.rs, rpc/mod.rs, image_verifier.rs
- **Action**: Audit all 32 `.unwrap()`/`.expect()` calls, replace with `?` or `.context()`
- **Acceptance**: Zero unwrap/expect outside of test modules
- **Skill**: `/polish-backend`
#### 1.5 Backend: Hardcoded Bitcoin RPC credentials
- **Files**: core/archipelago/src/api/rpc/bitcoin.rs:89
- **Action**: Move `archipelago/archipelago123` to env var or secrets manager
- **Pattern**: `std::env::var("ARCHIPELAGO_BITCOIN_RPC_USER").unwrap_or("archipelago".into())`
- **Acceptance**: No hardcoded credentials in Rust source
#### 1.6 Deploy & verify
- Run `/lint` to confirm zero violations
- Run `/deploy` to live server
- Run `/check-server` to verify health
- Manual spot-check: trigger errors in UI, confirm they're visible
---
## Week 2: Loading States & Visual Feedback (March 1723)
**Theme**: The user always knows what's happening. No blank screens, no mystery waits.
### Tasks
#### 2.1 Add skeleton loaders to all async views
- **Files**: Apps.vue, AppDetails.vue, Marketplace.vue, Cloud.vue, Server.vue, Settings.vue
- **Action**: Create `SkeletonLoader.vue` component, add to every view that fetches data
- **Pattern**: Show skeleton immediately, swap to real content on load
- **Acceptance**: Every view shows placeholder content during load
- **Skill**: `/polish-loading`
#### 2.2 Add timeout warnings to long operations
- **Files**: Login.vue (server startup), Marketplace.vue (app install)
- **Action**: After 15s show "Taking longer than expected...", after 30s show troubleshoot options
- **Acceptance**: No operation silently hangs
#### 2.3 Fix Start/Stop button state mismatch
- **Files**: Apps.vue, AppDetails.vue, ContainerApps.vue
- **Action**: Button reflects actual backend state, not a fixed 5s timer
- **Pattern**: Poll backend every 2s during state transition, update button immediately on response
- **Acceptance**: Button state always matches container state within 3s
#### 2.4 Connection status indicator
- **Files**: Create `ConnectionStatus.vue`, integrate into App.vue header
- **Action**: Show green/amber/red dot based on WebSocket connection state
- **Pattern**: Use `wsClient.isConnected()` — green=connected, amber=reconnecting, red=disconnected
- **Acceptance**: User always knows if they're connected
- **Skill**: `/polish-websocket`
#### 2.5 Fix OnlineStatusPill to use real data
- **Files**: components/OnlineStatusPill.vue
- **Action**: Connect to actual WebSocket state instead of hardcoded "Online"
- **Acceptance**: Pill reflects real connection state
#### 2.6 Empty states for all views
- **Files**: Apps.vue, Cloud.vue, ContainerApps.vue
- **Action**: When no data, show helpful message with CTA (e.g., "No apps installed — Browse Marketplace")
- **Acceptance**: Every view handles the zero-data case gracefully
#### 2.7 Deploy & verify
- `/deploy` then `/check-server`
- Test: disconnect network, observe status indicator
- Test: slow network (throttle), observe skeleton loaders
- Test: fresh account with no apps, observe empty states
---
## Week 3: Form Validation & Input Quality (March 2430)
**Theme**: Every input feels responsive, validated, impossible to misuse.
### Tasks
#### 3.1 Real-time password validation
- **Files**: Login.vue (password setup), Settings.vue (password change)
- **Action**: Show inline validation as user types: length check, match check, strength meter
- **Pattern**: Debounced validation on input, green checkmark / red X per rule
- **Acceptance**: User sees validation state before clicking submit
- **Skill**: `/polish-forms`
#### 3.2 TOTP input improvements
- **Files**: Login.vue (TOTP verify step)
- **Action**: Auto-submit on 6 digits, show session countdown timer, trim whitespace
- **Pattern**: `watch(code, () => { if (code.length === 6) submit() })`
- **Acceptance**: TOTP flow is fast and clear, session timeout is visible
#### 3.3 Input trimming on all forms
- **Files**: Login.vue, Settings.vue, any form input
- **Action**: `.trim()` all text inputs before submission
- **Acceptance**: Leading/trailing whitespace never causes failures
#### 3.4 Disable submit buttons during operations
- **Files**: Settings.vue (password change), Login.vue (login), Marketplace.vue (install)
- **Action**: Add `:disabled="isSubmitting"` to all action buttons
- **Pattern**: Button shows spinner + disabled state during async operation
- **Acceptance**: No button can be double-clicked during an operation
#### 3.5 Error message consistency
- **Files**: All views with error messages
- **Action**: Create `formatError()` utility that normalizes error messages
- **Pattern**: Network errors -> "Can't reach server", Auth errors -> "Session expired", Server errors -> "Something went wrong"
- **Acceptance**: Error messages are user-friendly, never show raw error strings
#### 3.6 Deploy & verify
- Test every form: login, password change, TOTP setup, app install
- Try invalid inputs, verify feedback is immediate and clear
---
## Week 4: Backend Robustness (March 31 April 6)
**Theme**: The backend never crashes, never hangs, handles every edge case.
### Tasks
#### 4.1 Add timeouts to all container operations
- **Files**: core/archipelago/src/container/dev_orchestrator.rs
- **Action**: Wrap all podman calls with `tokio::time::timeout(Duration::from_secs(30), ...)`
- **Acceptance**: No container operation can hang indefinitely
#### 4.2 Add timeouts to all external HTTP calls
- **Files**: bitcoin.rs, handler.rs (LND proxy)
- **Action**: Explicit `reqwest::Client` with timeout, not default
- **Pattern**: Reuse a single `Client` stored in `RpcHandler` state
- **Acceptance**: Every HTTP call has an explicit timeout
#### 4.3 Connection pooling for Bitcoin RPC
- **Files**: core/archipelago/src/api/rpc/bitcoin.rs
- **Action**: Store `reqwest::Client` in `RpcHandler`, reuse across requests
- **Acceptance**: One client instance, connection pooled
#### 4.4 Fix all clippy warnings
- **Action**: Run `cargo clippy --all-targets --all-features` on dev server, fix all 10 warnings
- **Warnings**: `should_implement_trait`, `get_first`, `assign_op_pattern`, `wildcard_in_or_patterns`, `redundant_field_names`, `unused_import`, `ptr_arg`, `very_complex_type`, `if_else_collapse`, `io::Error::other`
- **Acceptance**: `cargo clippy` returns zero warnings
- **Skill**: `/lint`
#### 4.5 Rate limiting on unauthenticated endpoints
- **Files**: core/archipelago/src/api/handler.rs
- **Action**: Add per-IP rate limiting to `/archipelago/node-message` and `/electrs-status`
- **Pattern**: In-memory rate limiter with 60 req/min per IP
- **Acceptance**: Endpoints return 429 when rate exceeded
#### 4.6 Consistent error codes and messages
- **Files**: All RPC endpoints
- **Action**: Define error code constants, consistent capitalization
- **Pattern**: `const ERR_AUTH: i32 = -1001;` etc.
- **Acceptance**: All error responses use defined constants
#### 4.7 Remove dead code
- **Files**: identity.rs (unused field, unused methods), auth.rs (dead_code allows)
- **Action**: Remove `identity_dir` field, remove unused `verify()` and `did_key()` methods, remove `#[allow(dead_code)]` and verify usage
- **Acceptance**: Zero `#[allow(dead_code)]` outside of generated code
#### 4.8 Replace println/eprintln with tracing
- **Files**: core/startos/src/* (23+ instances)
- **Action**: Replace `println!` -> `tracing::info!`, `eprintln!` -> `tracing::warn!`
- **Acceptance**: Zero `println!` / `eprintln!` in non-test code
#### 4.9 Deploy & verify
- `/deploy` then `/check-server` then `/diagnose`
- Test: kill Bitcoin container, verify backend doesn't crash
- Test: flood unauthenticated endpoint, verify rate limiting
- Test: restart backend, verify graceful startup
---
## Week 5: WebSocket & Real-Time Quality (April 713)
**Theme**: Real-time updates are bulletproof. Connection issues are transparent to the user.
### Tasks
#### 5.1 WebSocket reconnection UX
- **Files**: api/websocket.ts, App.vue
- **Action**: After max reconnect attempts, show persistent banner "Connection lost. Click to retry."
- **Pattern**: Don't silently give up after 10 attempts
- **Acceptance**: User always has a path to reconnect
- **Skill**: `/polish-websocket`
#### 5.2 WebSocket heartbeat improvement
- **Files**: api/websocket.ts
- **Action**: Send ping every 30s, expect pong within 5s, reconnect if missed
- **Acceptance**: Stale connections detected within 35s, not 60s
#### 5.3 RPC client session detection
- **Files**: api/rpc-client.ts
- **Action**: On 401/403 response, redirect to login page instead of showing generic error
- **Pattern**: `if (status === 401) { router.push('/login'); return; }`
- **Acceptance**: Expired sessions redirect to login immediately
#### 5.4 Message queuing during reconnection
- **Files**: api/rpc-client.ts, api/websocket.ts
- **Action**: If WebSocket is down, queue state-update subscriptions, replay on reconnect
- **Pattern**: Don't lose container state updates during brief disconnects
- **Acceptance**: State is consistent after reconnection without page refresh
#### 5.5 WebSocket race condition fix
- **Files**: stores/app.ts, api/websocket.ts
- **Action**: Fix duplicate listener issue on rapid reconnect (`isWsSubscribed` flag)
- **Pattern**: Use a Set of listener IDs, deduplicate on registration
- **Acceptance**: No duplicate event handlers after reconnect cycles
#### 5.6 Deploy & verify
- Test: kill backend, observe frontend reconnection behavior
- Test: toggle wifi, observe status indicator + reconnection
- Test: let session expire, verify redirect to login
---
## Week 6: Deployment & Infrastructure Hardening (April 1420)
**Theme**: Deployments are safe, reversible, and verified. Infrastructure is production-grade.
### Tasks
#### 6.1 Deploy script: add rollback capability
- **Files**: scripts/deploy-to-target.sh
- **Action**: Before overwriting binary/frontend, backup to `.backup` suffix
- **Pattern**: On health check failure after restart, restore from backup
- **Acceptance**: Failed deploy auto-restores previous working version
- **Skill**: `/polish-deploy`
#### 6.2 Deploy script: pre-deploy sanity checks
- **Files**: scripts/deploy-to-target.sh
- **Action**: Check disk space (2GB min), verify SSH key exists, verify target dir exists
- **Acceptance**: Deploy fails early with clear message if preconditions not met
#### 6.3 Deploy script: post-deploy health verification
- **Files**: scripts/deploy-to-target.sh
- **Action**: After restart, poll `/health` endpoint for 30s. If no 200, trigger rollback
- **Acceptance**: Every deploy is verified healthy before declaring success
#### 6.4 Deploy script: deployment locking
- **Files**: scripts/deploy-to-target.sh
- **Action**: Use flock to prevent concurrent deploys
- **Acceptance**: Second simultaneous deploy fails immediately with message
#### 6.5 First-boot script: add error handling
- **Files**: scripts/first-boot-containers.sh
- **Action**: Add `set -e`, verify each container starts before creating dependents
- **Acceptance**: If Bitcoin fails, Mempool is not attempted
#### 6.6 Systemd service hardening
- **Files**: image-recipe/configs/archipelago.service
- **Action**: Add `PrivateTmp=yes`, `NoNewPrivileges=true`, `ProtectSystem=strict`, `ProtectHome=yes`, `SystemCallFilter=@system-service`
- **Acceptance**: Service runs with minimal privileges
- **Skill**: `/harden`
#### 6.7 Nginx security headers
- **Files**: image-recipe/configs/nginx-archipelago.conf
- **Action**: Add HSTS, fix CSP (remove unsafe-inline), add rate limiting zones, custom log format that strips tokens
- **Acceptance**: Security headers pass Mozilla Observatory scan
#### 6.8 Nginx config: test before reload
- **Files**: scripts/deploy-to-target.sh
- **Action**: `nginx -t` failure should abort deploy and restore backup config
- **Acceptance**: Invalid nginx config never goes live
#### 6.9 Deploy & verify
- Test: deploy with intentionally broken binary, verify rollback
- Test: deploy with invalid nginx config, verify rollback
- Test: concurrent deploy attempt, verify lock
- Run `/diagnose` full check
---
## Week 7: Accessibility, Polish & Edge Cases (April 2127)
**Theme**: Every interaction is crisp. Keyboard users, slow networks, edge cases — all handled.
### Tasks
#### 7.1 ARIA labels on all interactive elements
- **Files**: All views and components
- **Action**: Add `aria-label` to buttons, links, form inputs that lack visible labels
- **Pattern**: `<button aria-label="Install Bitcoin Core" ...>`
- **Acceptance**: Every interactive element has accessible name
#### 7.2 Focus management in modals
- **Files**: Apps.vue (uninstall modal), Marketplace.vue (filter modal), Settings.vue
- **Action**: Trap focus inside modals, return focus on close, autofocus first interactive element
- **Pattern**: Use `useFocusTrap` composable
- **Acceptance**: Tab key never leaves modal; Escape closes; focus returns to trigger
#### 7.3 Keyboard navigation completeness
- **Files**: All views
- **Action**: Verify every action is reachable via keyboard (Tab/Enter/Escape)
- **Acceptance**: Full app usable without mouse
#### 7.4 Fix inline Tailwind violations
- **Files**: Web5.vue, AppDetails.vue, Cloud.vue, onboarding views
- **Action**: Extract inline classes to global classes in style.css
- **Pattern**: `px-3 py-1.5 rounded-lg bg-white/5` -> `.info-row` class
- **Acceptance**: Zero inline Tailwind utility classes in components
- **Skill**: `/ux-review`
#### 7.5 Touch feedback on mobile
- **Files**: style.css, app card components
- **Action**: Add `:active` states for mobile touch feedback
- **Pattern**: `.app-card:active { transform: scale(0.98); }`
- **Acceptance**: Every tappable element has tactile feedback
#### 7.6 Responsive edge cases
- **Files**: Marketplace.vue, Dashboard.vue, AppDetails.vue
- **Action**: Test at 320px, 375px, 768px, 1024px, 1440px widths
- **Fix**: Any overflow, text truncation, or broken layouts
- **Acceptance**: No horizontal scroll or broken layout at any standard width
#### 7.7 Fix template crash risks
- **Files**: ContainerApps.vue:76 (`app.image.split('/').pop()`)
- **Action**: Add null guards on all template expressions that chain methods
- **Pattern**: `app.image?.split('/').pop() ?? 'unknown'`
- **Acceptance**: No template expression can crash on null/undefined data
#### 7.8 Remove all TODO/FIXME from production code
- **Files**: Web5.vue, AppDetails.vue, backend TODO comments
- **Action**: Either implement the TODO or remove the dead code
- **Pattern**: If feature isn't ready, remove the UI element entirely
- **Acceptance**: Zero TODO/FIXME/HACK in committed code
- **Skill**: `/refactor`
#### 7.9 Deploy & verify
- Test: navigate entire app with keyboard only
- Test: resize browser through all breakpoints
- Test: screen reader (VoiceOver) basic navigation
- Run `/ux-review` on every view
---
## Week 8: Integration Testing, Final Sweep & ISO (April 28 May 4)
**Theme**: Everything works together. The final product is tested end-to-end and burned to ISO.
### Tasks
#### 8.1 Create critical path tests — Frontend
- **Files**: Create `neode-ui/src/__tests__/` directory
- **Tests to write**:
- Login flow: valid password, invalid password, TOTP, session timeout
- App lifecycle: install -> start -> launch -> stop -> uninstall
- Settings: password change, TOTP setup, TOTP disable
- WebSocket: connect, disconnect, reconnect
- **Framework**: Vitest + @vue/test-utils (already in package.json)
- **Acceptance**: 10+ critical path tests passing
- **Skill**: `/test`
#### 8.2 Create critical path tests — Backend
- **Tests to write**:
- RPC endpoint validation (good/bad input for each endpoint)
- Session management (create, validate, expire, invalidate)
- Container manifest parsing (valid, invalid, missing fields)
- Rate limiting (under limit, at limit, over limit)
- **Acceptance**: 10+ backend tests passing
- **Skill**: `/test`
#### 8.3 Create deployment verification test
- **Files**: scripts/verify-deploy.sh (new)
- **Action**: Script that hits every endpoint, checks every container, verifies every UI route
- **Pattern**: Automated smoke test run after every deploy
- **Acceptance**: Script exits 0 only if everything works
#### 8.4 Full quality sweep
- Run `/lint` — zero violations
- Run `/harden` — zero findings
- Run `/ux-review` — zero findings
- Run `/diagnose` — all green
- Run `/sweep` — clean bill of health
- **Acceptance**: All skills report zero issues
#### 8.5 Build final ISO
- Sync all configs: `/sync-configs`
- Build ISO: `/build-iso`
- Flash to USB, boot on clean hardware
- Verify first-boot experience end-to-end
- **Acceptance**: ISO boots, onboarding works, Bitcoin syncs, apps install
#### 8.6 Performance baseline
- Measure and document:
- Time to first meaningful paint (target: <2s)
- Login flow completion time (target: <3s)
- App install completion time (document actual)
- WebSocket reconnection time (target: <5s)
- Backend cold start time (target: <3s)
- **Acceptance**: All targets met or documented with explanation
#### 8.7 Final documentation pass
- Update `docs/current-state.md` to reflect production status
- Update `CHANGELOG.md` with all polish work
- Verify all CLAUDE.md instructions are still accurate
- **Acceptance**: Docs match reality
---
## Metrics & Definition of Done
### Per-Week Exit Criteria
Each week is "done" when:
1. All tasks for that week have acceptance criteria met
2. `/sweep` returns zero violations for that week's focus area
3. `/deploy` succeeds and `/check-server` is green
4. Manual spot-check of affected features passes
### Project Exit Criteria (Week 8)
The project is done when ALL of these are true:
- [ ] Zero `.catch(() => {})` in frontend
- [ ] Zero `console.log` outside dev guards
- [ ] Zero `unwrap()`/`expect()` in backend production paths
- [ ] Zero clippy warnings
- [ ] Zero inline Tailwind in components
- [ ] Zero TODO/FIXME in committed code
- [ ] Every view has: loading state, error state, empty state
- [ ] Every form has: real-time validation, disabled during submit
- [ ] Every button action has: loading feedback, error feedback
- [ ] WebSocket shows connection status to user
- [ ] Session timeout redirects to login
- [ ] Deploy has: rollback, health check, locking
- [ ] Systemd service is hardened
- [ ] Nginx has: HSTS, proper CSP, rate limiting, clean logs
- [ ] 10+ frontend tests passing
- [ ] 10+ backend tests passing
- [ ] ISO boots and onboards successfully
- [ ] All performance targets met
---
## Risk Register
| Risk | Mitigation |
|------|------------|
| Skeleton loaders change visual feel | Match exact glassmorphism style, use existing color tokens |
| Backend changes break existing functionality | Deploy to secondary server (198) first, test, then primary |
| Nginx CSP changes break app iframes | Test each framed app individually before deploying |
| Rate limiting blocks legitimate use | Set generous limits (60/min), monitor false positives |
| Test suite becomes maintenance burden | Only test critical paths, no unit tests for trivial code |
| ISO build captures incomplete state | Always build ISO from clean deploy, never mid-development |