archy/docs/refactoring-plan.md


2026-03-15 00:40:55 +00:00
# Archy Refactoring Plan — Codebase Quality & Reliability
**Period**: March 2026 — March 2029
**Scope**: Refactoring, bug fixes, library adoption, testing, performance only
**Out of scope**: New features, design changes, UI changes
This plan exists alongside the feature roadmap. Refactoring work should be interleaved with feature sprints — not blocked by them.
---
## Year 1: Fix What's Broken, Adopt Proper Libraries (March 2026 — Feb 2027)
### Q1 2026: Critical Fixes & Database
#### 1. Enable SQLite via sqlx (HIGH — crash resilience)
- **Problem**: All state is in-memory. Crashes lose everything except container snapshots. `sqlx` is commented out in `core/Cargo.toml`.
- **Fix**: Uncomment sqlx, create migrations for: sessions, user data, peer state, metrics history, notification log. Keep the in-memory `DataModel` as a read cache backed by SQLite.
- **Files**: `core/Cargo.toml`, `core/archipelago/src/state.rs`, new `core/archipelago/src/db/` module
- **Why not Postgres**: Archy is a single-user appliance, so SQLite is the right choice: zero config, file-based, embedded.
#### 2. Enforce RBAC (HIGH — security)
- **Problem**: `UserRole::can_access()` is implemented in `auth.rs` but never called in `rpc/mod.rs`. Every authenticated user has full admin access.
- **Fix**: Add role check in `RpcHandler::handle()` before dispatching to method handlers. Wire up role assignment during onboarding.
- **Files**: `core/archipelago/src/api/rpc/mod.rs`, `core/archipelago/src/auth.rs`
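A minimal sketch of "authorize first, dispatch second", using only the standard library. The role variants, the `can_access(method)` signature, the method names, and the `handle` function are all assumptions for illustration; the real `UserRole` in `auth.rs` and `RpcHandler::handle()` may look different.

```rust
/// Hypothetical roles; the real `UserRole` in `auth.rs` may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UserRole {
    Admin,
    Viewer,
}

impl UserRole {
    /// Assumed signature: decide access from the RPC method name.
    /// In this sketch a viewer may only call `.get` and `.list` methods.
    pub fn can_access(self, method: &str) -> bool {
        match self {
            UserRole::Admin => true,
            UserRole::Viewer => method.ends_with(".get") || method.ends_with(".list"),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum RpcError {
    Forbidden { method: String },
    MethodNotFound { method: String },
}

/// Stand-in for `RpcHandler::handle()`: the role check runs before any
/// method handler is reached.
pub fn handle(role: UserRole, method: &str) -> Result<&'static str, RpcError> {
    if !role.can_access(method) {
        return Err(RpcError::Forbidden { method: method.to_string() });
    }
    // Hypothetical method table; real handlers would be called here.
    match method {
        "container.list" => Ok("[]"),
        "container.remove" => Ok("removed"),
        _ => Err(RpcError::MethodNotFound { method: method.to_string() }),
    }
}
```

Because the check runs before the method lookup, an unauthorized caller gets `Forbidden` even for methods that do not exist, so the error does not reveal which methods are registered.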
#### 3. Fix session TTL clock bug (HIGH — correctness)
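- **Problem**: `session.rs` uses `Instant::now()` for TTL. `Instant` is monotonic, but on Linux it typically does not advance while the system is suspended, so a session can outlive its intended TTL by however long the device slept. Sleep and hibernate are common on the hardware Archy targets.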
- **Fix**: Use `SystemTime::now()` for session expiry timestamps, or better — use `tower-sessions` with the new SQLite backend.
- **Files**: `core/archipelago/src/session.rs`
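A sketch of the `SystemTime` variant, assuming a `Session` struct that stores an absolute expiry timestamp. The struct and its field are hypothetical; the current shape of `session.rs` is not shown in this plan. Passing `now` in as a parameter keeps the expiry logic testable without sleeping.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical session record storing a wall-clock expiry.
pub struct Session {
    expires_at: SystemTime,
}

impl Session {
    /// Creates a session that expires `ttl` after `now`.
    pub fn new(now: SystemTime, ttl: Duration) -> Self {
        Session { expires_at: now + ttl }
    }

    /// True once the wall clock has reached the expiry timestamp,
    /// regardless of how much of that time the machine spent suspended.
    pub fn is_expired(&self, now: SystemTime) -> bool {
        now >= self.expires_at
    }
}
```

The trade-off is that `SystemTime` follows the wall clock, so a clock adjustment (NTP sync, manual change) can shorten or extend a session. An absolute timestamp is also what a SQLite-backed session table would store.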
#### 4. Fix 10 failing frontend tests (MEDIUM)
- **Problem**: `appLauncher.test.ts` and `settings.test.ts` are out of sync with current implementation.
- **Fix**: Update test expectations to match current behavior. Don't mock what doesn't need mocking.
- **Files**: `neode-ui/src/stores/__tests__/appLauncher.test.ts`, `neode-ui/src/views/__tests__/settings.test.ts`
#### 5. Remove dead dependencies (LOW)
- **Problem**: `dockerode` in `package.json` is unused (container ops go through RPC).
- **Fix**: `npm uninstall dockerode @types/dockerode`
- **Files**: `neode-ui/package.json`
### Q2 2026: WebSocket Efficiency & Validation
#### 6. Add json-patch crate to backend (HIGH — performance)
- **Problem**: Backend broadcasts the entire `DataModel` on every state change. Frontend already has `fast-json-patch` and supports incremental updates. Backend just doesn't generate patches.
- **Fix**: Add `json-patch` crate. Before broadcasting, diff old vs new `DataModel`, send only the RFC 6902 patch. Fall back to full sync if patch is larger than full model.
- **Files**: `core/Cargo.toml`, `core/archipelago/src/state.rs`
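The fallback rule can be sketched on its own, independent of how the patch is produced. This assumes both payloads are already serialized to JSON strings; in the real code the patch would come from the `json-patch` crate's `diff` over two `serde_json::Value`s. The `Broadcast` enum and `choose_broadcast` function are hypothetical names.

```rust
/// Which payload to send for one state change.
#[derive(Debug, PartialEq, Eq)]
pub enum Broadcast<'a> {
    /// Serialized RFC 6902 patch (a JSON array of operations).
    Patch(&'a str),
    /// Serialized full `DataModel`.
    Full(&'a str),
}

/// Sends the patch only when it is strictly smaller than the full model;
/// otherwise falls back to a full sync.
pub fn choose_broadcast<'a>(patch_json: &'a str, full_json: &'a str) -> Broadcast<'a> {
    if patch_json.len() < full_json.len() {
        Broadcast::Patch(patch_json)
    } else {
        Broadcast::Full(full_json)
    }
}
```

Comparing serialized byte lengths is the simplest proxy for wire cost. It requires serializing both payloads before choosing, which is worth measuring on large models.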
#### 7. Add form validation with zod (MEDIUM — maintainability)
- **Problem**: Manual inline validation scattered across Login, Settings, Onboarding. As forms grow, this becomes a maintenance burden.
- **Fix**: `npm install zod`. Create validation schemas in `src/types/schemas.ts`. Use in forms and RPC request builders. This is especially important for onboarding where bad input causes cryptographic key generation to fail silently.
- **Files**: `neode-ui/package.json`, new `neode-ui/src/types/schemas.ts`, `Login.vue`, `Settings.vue`, onboarding views
#### 8. Move hardcoded app metadata to manifest files (MEDIUM — maintainability)
- **Problem**: `docker_packages.rs` has hardcoded port mappings, titles, descriptions, and icon paths for ~20 apps. App manifests exist in `apps/` but aren't the source of truth.
- **Fix**: Make `apps/{app-id}/manifest.yml` the single source of truth. Load metadata from manifests at startup. Remove hardcoded maps from Rust source.
- **Files**: `core/archipelago/src/container/docker_packages.rs`, `apps/*/manifest.yml`
### Q3 2026: Error Handling & Testing
#### 9. Structured error types per backend module (MEDIUM — debuggability)
- **Problem**: Everything uses `anyhow::Result`. When errors bubble up through RPC, you lose the module context. User-facing vs system errors aren't distinguished at the type level.
- **Fix**: Create `thiserror` error enums for each major module: `AuthError`, `ContainerError`, `FederationError`, `IdentityError`. Map to appropriate HTTP status codes and user-friendly messages in the RPC layer.
- **Files**: Each module in `core/archipelago/src/`
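A sketch of the shape this could take, written against the standard library only so it compiles without dependencies. With `thiserror`, the `Display` and `From` impls below would be derived instead of hand-written. The variants, status codes, and messages are hypothetical, and only two of the four proposed enums are shown.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug)]
pub enum AuthError {
    InvalidCredentials,
    SessionExpired,
}

#[derive(Debug)]
pub enum ContainerError {
    NotFound(String),
    RuntimeUnavailable,
}

impl fmt::Display for AuthError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AuthError::InvalidCredentials => write!(f, "invalid credentials"),
            AuthError::SessionExpired => write!(f, "session expired"),
        }
    }
}

impl fmt::Display for ContainerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ContainerError::NotFound(id) => write!(f, "container {} not found", id),
            ContainerError::RuntimeUnavailable => write!(f, "container runtime unavailable"),
        }
    }
}

impl Error for AuthError {}
impl Error for ContainerError {}

/// RPC-layer error that keeps the originating module in the type.
#[derive(Debug)]
pub enum RpcError {
    Auth(AuthError),
    Container(ContainerError),
}

impl From<AuthError> for RpcError {
    fn from(e: AuthError) -> Self {
        RpcError::Auth(e)
    }
}

impl From<ContainerError> for RpcError {
    fn from(e: ContainerError) -> Self {
        RpcError::Container(e)
    }
}

impl fmt::Display for RpcError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RpcError::Auth(e) => write!(f, "auth error: {}", e),
            RpcError::Container(e) => write!(f, "container error: {}", e),
        }
    }
}

impl RpcError {
    /// Maps each error to an HTTP status code in one place.
    pub fn status_code(&self) -> u16 {
        match self {
            RpcError::Auth(_) => 401,
            RpcError::Container(ContainerError::NotFound(_)) => 404,
            RpcError::Container(ContainerError::RuntimeUnavailable) => 503,
        }
    }
}
```

The `From` impls let module code keep using `?` while the RPC layer still sees which module failed. The exhaustive `match` in `status_code` means a new variant will not compile until it has been given a status.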
#### 10. Backend integration tests for RPC endpoints (HIGH — reliability)
- **Problem**: 312 unit tests exist but zero integration tests for 80+ RPC endpoints. No test ever makes an actual HTTP request to the server.
- **Fix**: Create integration test harness that spins up a real server instance (with test config, temp data dir). Test auth flow, container operations, identity, federation. Use `reqwest` as test client.
- **Files**: New `core/archipelago/tests/` directory
#### 11. Frontend 404 route (LOW — UX)
- **Problem**: No catch-all route. Invalid URLs silently show nothing.
- **Fix**: Add `/:pathMatch(.*)*` catch-all route that shows a "Page not found" view with navigation back to dashboard.
- **Files**: `neode-ui/src/router/index.ts`, new `neode-ui/src/views/NotFound.vue`
### Q4 2026: Clean Up Dead Code & CI
#### 12. Remove dead code and #[allow(dead_code)] (LOW — cleanliness)
- **Problem**: `auth.rs` has `#[allow(dead_code)]` on `OnboardingState` fields and `AuthManager` methods. Either use them or remove them.
- **Fix**: Audit all `#[allow(dead_code)]`, `#[allow(unused)]`. Remove genuinely unused code. Wire up code that should be used (like RBAC — covered in item 2).
- **Files**: `core/archipelago/src/auth.rs` and others
#### 13. Set up CI pipeline (HIGH — process)
- **Problem**: No automated testing on push/PR. All testing is manual or via deploy scripts.
- **Fix**: GitHub Actions workflow: `cargo clippy`, `cargo test`, `npm run type-check`, `npm run test` on every push. Fail the build on warnings.
- **Files**: New `.github/workflows/ci.yml`
#### 14. Cosign container image verification (MEDIUM — security)
- **Problem**: `podman_client.rs:95` has a TODO for cosign signature verification. Container images are pulled without validation.
- **Fix**: Implement cosign verification using the `sigstore` crate, or shell out to `cosign verify` as a first step. At minimum, verify image digests against a pinned manifest.
- **Files**: `core/container/src/podman_client.rs`, `core/security/`
---
## Year 2: Robustness & Performance (March 2027 — Feb 2028)
### Q1 2027: Backend Architecture
#### 15. Migrate from hyper to axum (MEDIUM — maintainability)
- **Problem**: Raw `hyper` 0.14 with manual routing in `handler.rs` (813 lines). Route matching, middleware, and error handling are all hand-rolled. `hyper` 0.14 is also end-of-life.
- **Fix**: Migrate to `axum` (built on hyper 1.x, maintained by tokio team). Axum gives you: extractors, middleware stack, typed routing, tower integration. The RPC methods stay the same — only the HTTP layer changes.
- **Files**: `core/archipelago/src/api/handler.rs`, `core/archipelago/src/api/mod.rs`, `core/Cargo.toml`
- **Risk**: Medium. Do this on a branch, test thoroughly. The RPC logic doesn't change, just the HTTP glue.
#### 16. Replace custom rate limiter with tower middleware (LOW — correctness)
- **Problem**: Hand-rolled in-memory rate limiter in `rpc/mod.rs`. Works for single instance but not distributed.
- **Fix**: Use `tower::limit::RateLimitLayer` or `governor` crate. Cleaner, tested, configurable per-route.
- **Files**: `core/archipelago/src/api/rpc/mod.rs`
#### 17. Persistent sessions in SQLite (MEDIUM — UX)
- **Problem**: Sessions are in-memory. Server restart logs out all users.
- **Fix**: With SQLite from item 1, store sessions in DB. Users stay logged in across restarts.
- **Files**: `core/archipelago/src/session.rs`
### Q2 2027: Frontend Architecture
#### 18. Audit and optimize bundle size (MEDIUM — performance)
- **Problem**: D3 is a large dependency (~240KB) used only for `LineChart.vue`. The bundle size target is <500KB gzipped.
- **Fix**: Replace full `d3` import with only `d3-scale`, `d3-shape`, `d3-axis` (tree-shakeable). Or evaluate `unovis` or native Canvas for simple line charts. Measure before and after.
- **Files**: `neode-ui/package.json`, `neode-ui/src/components/LineChart.vue`
#### 19. Vue Router route transitions (LOW — polish)
- **Problem**: No transition animations between routes. Pages appear/disappear instantly.
- **Fix**: Add `<RouterView v-slot>` with `<Transition>` wrapper. Simple fade (200ms) is enough — matches the existing glassmorphism feel without changing the design.
- **Files**: `neode-ui/src/App.vue`
- **Note**: This is not a design change — it's a missing standard Vue pattern.
#### 20. TypeScript strict cleanup (LOW — type safety)
- **Problem**: WebSocket callback types in `app.ts:105` use inline object types instead of importing the `Update` type from `@/types/api`.
- **Fix**: Audit all stores and components for inline type definitions that should reference shared types. Centralize in `src/types/`.
- **Files**: `neode-ui/src/stores/app.ts`, `neode-ui/src/types/`
### Q3 2027: Testing & Observability
#### 21. Reach 60% test coverage (HIGH — reliability)
- **Problem**: Frontend has ~505 passing tests but many views untested. Backend has zero RPC integration tests.
- **Fix**: Prioritize testing for: auth flow, container lifecycle, WebSocket reconnection, federation handshake, backup/restore. Use coverage reports to find gaps.
- **Target**: 60% line coverage frontend, 50% backend
#### 22. Add OpenTelemetry tracing (MEDIUM — observability)
- **Problem**: `tracing` is used for logging but there's no distributed tracing or metrics export. When something goes wrong in production, you're reading log files.
- **Fix**: Add `tracing-opentelemetry` and `opentelemetry-otlp`. Export traces to a local collector (Grafana is already a supported app). Instrument RPC handlers, container operations, federation sync.
- **Files**: `core/Cargo.toml`, `core/archipelago/src/main.rs`
#### 23. Prometheus metrics export (MEDIUM — monitoring)
- **Problem**: `MetricsStore` collects data but doesn't expose it. No way to monitor Archy health externally.
- **Fix**: Add `/metrics` endpoint in Prometheus format using `prometheus` crate. Expose: RPC latency histograms, active sessions, container health, WebSocket connections, memory usage.
- **Files**: `core/archipelago/src/api/handler.rs`, `core/archipelago/src/monitoring/`
### Q4 2027: Performance
#### 24. Optimize container scanner (MEDIUM — CPU)
- **Problem**: `docker_packages.rs` scans all containers every 10 seconds with full JSON parsing. On a system with 30+ containers, this is unnecessary CPU churn.
- **Fix**: Use Podman events API (`podman events --format json`) to watch for container state changes instead of polling. Fall back to polling every 60s as a safety net.
- **Files**: `core/archipelago/src/container/docker_packages.rs`
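The "events first, polling as a safety net" loop can be sketched with a channel and a receive timeout. This assumes some other task parses the `podman events` output and pushes `ContainerEvent` values into the channel. The `ContainerEvent` and `ScanAction` types and the `next_action` function are hypothetical, and the real backend would likely use an async channel rather than `std::sync::mpsc`.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical parsed event from the Podman event stream.
#[derive(Debug, PartialEq, Eq)]
pub struct ContainerEvent {
    pub container_id: String,
    pub status: String,
}

/// What the scanner should do next.
#[derive(Debug, PartialEq, Eq)]
pub enum ScanAction {
    /// An event arrived: refresh only the affected container.
    Refresh(ContainerEvent),
    /// No event within the fallback window: do a full scan.
    FullScan,
    /// The event source is gone: stop, or restart the watcher.
    Stop,
}

/// Blocks until an event arrives or `fallback` elapses, whichever is first.
pub fn next_action(events: &Receiver<ContainerEvent>, fallback: Duration) -> ScanAction {
    match events.recv_timeout(fallback) {
        Ok(event) => ScanAction::Refresh(event),
        Err(RecvTimeoutError::Timeout) => ScanAction::FullScan,
        Err(RecvTimeoutError::Disconnected) => ScanAction::Stop,
    }
}
```

In this form the full scan only fires after a quiet period of `fallback` length. If events arrive continuously, a strict "every 60s" safety net would need a separate deadline tracked across calls.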
#### 25. Lazy-load i18n locales (LOW — bundle size)
- **Problem**: Spanish locale exists but loading behavior isn't optimized.
- **Fix**: Use Vue i18n's lazy loading: load only the active locale on startup, fetch others on demand.
- **Files**: `neode-ui/src/i18n.ts`
---
## Year 3: Production Hardening (March 2028 — March 2029)
### Q1 2028: Resilience
#### 26. Database migration system (MEDIUM — upgradability)
- **Problem**: Once SQLite is in use, schema changes need managed migrations.
- **Fix**: Use `sqlx` migrations (already supported). Create `core/archipelago/migrations/` directory. Run migrations on startup before serving requests.
- **Files**: `core/archipelago/migrations/`, `core/archipelago/src/main.rs`
#### 27. Graceful degradation for container failures (MEDIUM — UX)
- **Problem**: If Podman is down or unresponsive, the entire backend can hang on container operations.
- **Fix**: Add timeouts to all Podman CLI calls (some already have them, make it universal). Show degraded state in UI rather than hanging. Container operations should never block the main RPC handler.
- **Files**: `core/container/src/podman_client.rs`
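A blocking, std-only sketch of a universal timeout wrapper for CLI calls. The `run_with_timeout` name and its return shape are assumptions; the existing timeout code in `podman_client.rs` is not shown in this plan, and an async backend would more likely wrap the call in `tokio::time::timeout`. Output is discarded here to keep the sketch short.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `cmd` and waits at most `timeout` for it to exit.
/// Returns `Ok(None)` if the deadline passed; in that case the child
/// has been killed and reaped so it does not linger as a zombie.
pub fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking check: has the child exited yet?
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Ignore kill errors: the child may have exited in between.
            let _ = child.kill();
            child.wait()?;
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

A caller could treat `Ok(None)` as "Podman unresponsive" and report a degraded state to the UI instead of blocking. Capturing stdout (for example from `podman ps --format json`) would require reading the pipe while waiting, otherwise a full pipe buffer can stall the child.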
#### 28. WebSocket backpressure handling (LOW — stability)
- **Problem**: Broadcast channel capacity is 100. If a slow client can't keep up, messages are dropped silently.
- **Fix**: Detect `RecvError::Lagged`, send full resync to that client. Log when clients fall behind consistently.
- **Files**: `core/archipelago/src/api/handler.rs`
### Q2 2028: Security Hardening
#### 29. Full security audit pass (HIGH — security)
- **Problem**: Various small issues accumulated: CORS could be tighter, rate limiting coverage is incomplete, error messages could leak internal paths.
- **Fix**: Systematic pass through all 80+ RPC endpoints. Verify: input validation, authorization, rate limiting, error sanitization, path traversal prevention. Document findings.
- **Files**: All RPC handlers
#### 30. Automated dependency security scanning (MEDIUM — supply chain)
- **Problem**: No automated `cargo audit` or `npm audit` in CI.
- **Fix**: Add to CI pipeline. Run weekly via cron. Block releases on known vulnerabilities (with severity threshold).
- **Files**: `.github/workflows/ci.yml`, `scripts/audit-deps.sh`
### Q3 2028: Final Quality
#### 31. Reach 80% test coverage (HIGH — confidence)
- **Target**: 80% line coverage across frontend and backend
- **Focus**: Edge cases, error paths, recovery scenarios, concurrent operations
#### 32. Load testing (MEDIUM — capacity planning)
- **Problem**: No load testing. Unknown how many concurrent users, containers, or WebSocket connections Archy can handle on target hardware.
- **Fix**: Create a load test suite with `k6` for end-to-end load (concurrent RPC calls, WebSocket connections, container operations), and use `criterion` for Rust micro-benchmarks of hot paths. Document capacity limits per hardware tier.
- **Files**: New `tests/load/` directory
#### 33. Code documentation pass (LOW — maintainability)
- **Problem**: Module-level docs are sparse. New contributors (or future you) need to understand the architecture from code alone.
- **Fix**: Add `//!` module docs to every Rust module. Add JSDoc to every Vue composable and store. Document the "why" of architectural decisions inline.
- **Files**: All modules
### Q4 2028: Polish & Maintenance
#### 34. Dependency update cycle (ONGOING)
- Monthly: `cargo update`, `npm update`, review changelogs
- Quarterly: Major version upgrades (evaluate breaking changes)
- Yearly: Evaluate if any custom code can be replaced by now-mature libraries
#### 35. Refactoring retrospective
- Review this plan against actual state
- Document what worked, what didn't
- Create Year 4+ maintenance plan if needed
---
## Priority Summary
| Priority | Item | Impact |
|----------|------|--------|
| **Critical** | 1. SQLite database | Crash resilience |
| **Critical** | 2. Enforce RBAC | Security |
| **Critical** | 3. Fix session TTL bug | Correctness |
| **Critical** | 6. JSON patch broadcasting | Performance |
| **Critical** | 13. CI pipeline | Process |
| **High** | 4. Fix failing tests | Reliability |
| **High** | 10. Backend integration tests | Reliability |
| **High** | 14. Cosign verification | Security |
| **High** | 15. Migrate hyper → axum | Maintainability |
| **High** | 21. 60% test coverage | Reliability |
| **High** | 29. Security audit | Security |
| **High** | 31. 80% test coverage | Confidence |
| **Medium** | 7. Zod validation | Maintainability |
| **Medium** | 8. Manifest-driven metadata | Maintainability |
| **Medium** | 9. Structured error types | Debuggability |
| **Medium** | 17. Persistent sessions | UX |
| **Medium** | 18. D3 tree-shaking | Bundle size |
| **Medium** | 22. OpenTelemetry | Observability |
| **Medium** | 23. Prometheus metrics | Monitoring |
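| **Medium** | 24. Container scanner optimization | CPU |
| **Medium** | 26. Database migration system | Upgradability |
| **Medium** | 27. Graceful degradation for container failures | UX |
| **Medium** | 30. Dependency security scanning | Supply chain |
| **Medium** | 32. Load testing | Capacity planning |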
| **Low** | 5. Remove dockerode | Cleanliness |
| **Low** | 11. 404 route | UX |
| **Low** | 12. Dead code cleanup | Cleanliness |
| **Low** | 16. Tower rate limiter | Correctness |
| **Low** | 19. Route transitions | Polish |
| **Low** | 20. TypeScript cleanup | Type safety |
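| **Low** | 25. Lazy i18n | Bundle size |
| **Low** | 28. WebSocket backpressure handling | Stability |
| **Low** | 33. Code documentation pass | Maintainability |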
---
## Guiding Principles
1. **Use established crates and packages** — Don't reinvent what's solved. `sqlx`, `axum`, `tower`, `json-patch`, `zod`, `governor` exist for a reason.
2. **Keep custom what's custom** — Federation, marketplace, DWN, the design system — these are genuinely yours. Don't force a library where none fits.
3. **Test what matters** — Auth, container lifecycle, data persistence, WebSocket reliability. Not every utility function needs a test.
4. **Refactor in place** — No rewrites. Migrate incrementally. Every commit should leave the codebase better than it found it.
5. **No design changes** — The glassmorphism system, the layout, the UX flow — all stay exactly as they are. This plan only touches the internals.