archy/docs/refactoring-plan.md
2026-03-15 00:40:55 +00:00

Archy Refactoring Plan — Codebase Quality & Reliability

Period: March 2026 to March 2029
Scope: Refactoring, bug fixes, library adoption, testing, performance only
Out of scope: New features, design changes, UI changes

This plan exists alongside the feature roadmap. Refactoring work should be interleaved with feature sprints — not blocked by them.


Year 1: Fix What's Broken, Adopt Proper Libraries (March 2026 — Feb 2027)

Q1 2026: Critical Fixes & Database

1. Enable SQLite via sqlx (HIGH — crash resilience)

  • Problem: All state is in-memory. Crashes lose everything except container snapshots. sqlx is commented out in core/Cargo.toml.
  • Fix: Uncomment sqlx, create migrations for: sessions, user data, peer state, metrics history, notification log. Keep the in-memory DataModel as a read cache backed by SQLite.
  • Files: core/Cargo.toml, core/archipelago/src/state.rs, new core/archipelago/src/db/ module
  • Why not Postgres: Archy is a single-user appliance, so SQLite is the right choice: zero config, file-based, embedded.

2. Enforce RBAC (HIGH — security)

  • Problem: UserRole::can_access() is implemented in auth.rs but never called in rpc/mod.rs. Every authenticated user has full admin access.
  • Fix: Add role check in RpcHandler::handle() before dispatching to method handlers. Wire up role assignment during onboarding.
  • Files: core/archipelago/src/api/rpc/mod.rs, core/archipelago/src/auth.rs
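
The role check described above could sit in front of dispatch roughly as follows. This is a minimal sketch, not the project's code: the `UserRole` variants, the `can_access(method)` signature, the method names, and `RpcError` are all assumptions made for illustration, since the plan only states that `UserRole::can_access()` exists and is never called.

```rust
/// Hypothetical roles; the real `UserRole` in auth.rs may have other variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UserRole {
    Admin,
    Viewer,
}

impl UserRole {
    /// Assumed signature: decide access from the RPC method name.
    pub fn can_access(self, method: &str) -> bool {
        match self {
            UserRole::Admin => true,
            // Assumption: read-only roles may only call `.get` / `.list` methods.
            UserRole::Viewer => method.ends_with(".get") || method.ends_with(".list"),
        }
    }
}

/// Hypothetical error type returned by the RPC layer.
#[derive(Debug, PartialEq, Eq)]
pub enum RpcError {
    Forbidden { method: String },
    UnknownMethod { method: String },
}

/// Check the caller's role first, then dispatch to the method handler.
pub fn handle(role: UserRole, method: &str) -> Result<String, RpcError> {
    if !role.can_access(method) {
        return Err(RpcError::Forbidden { method: method.to_string() });
    }
    // Placeholder handlers standing in for the real method table.
    match method {
        "container.list" => Ok("[]".to_string()),
        "container.remove" => Ok("removed".to_string()),
        _ => Err(RpcError::UnknownMethod { method: method.to_string() }),
    }
}
```

Putting the check before the `match` means a newly added method is denied to restricted roles by default unless `can_access` explicitly allows it.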

3. Fix session TTL clock bug (HIGH — correctness)

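  • Problem: session.rs uses Instant::now() for TTL. Instant is monotonic, but on Linux it typically does not advance while the system is suspended, so a session can outlive its intended TTL after sleep/hibernate, which is common on the hardware Archy targets.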
  • Fix: Use SystemTime::now() for session expiry timestamps, or better — use tower-sessions with the new SQLite backend.
  • Files: core/archipelago/src/session.rs
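
A wall-clock expiry might look like the sketch below. The `Session` struct and its field name are hypothetical; the point is only that the expiry is stored as an absolute Unix timestamp, which also maps directly onto a SQLite integer column once item 1 lands.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical session record; the real struct in session.rs will differ.
#[derive(Debug, Clone)]
pub struct Session {
    /// Absolute expiry in seconds since the Unix epoch (wall-clock time).
    pub expires_at_unix: u64,
}

/// Convert a `SystemTime` to Unix seconds; a clock set before 1970 maps to 0.
fn unix_secs(t: SystemTime) -> u64 {
    t.duration_since(UNIX_EPOCH).map(|d| d.as_secs()).unwrap_or(0)
}

impl Session {
    /// Create a session that expires `ttl` after `now`.
    pub fn new(now: SystemTime, ttl: Duration) -> Self {
        Session {
            expires_at_unix: unix_secs(now).saturating_add(ttl.as_secs()),
        }
    }

    /// `now` is a parameter so tests can simulate time passing, including
    /// time spent suspended. Production code would pass `SystemTime::now()`.
    pub fn is_expired(&self, now: SystemTime) -> bool {
        unix_secs(now) >= self.expires_at_unix
    }
}
```

One trade-off to keep in mind: wall-clock time can jump when the clock is corrected (for example by NTP), so a session may expire slightly early or late after a large adjustment. For session TTLs measured in minutes or hours that is usually acceptable.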

4. Fix 10 failing frontend tests (MEDIUM)

  • Problem: appLauncher.test.ts and settings.test.ts are out of sync with current implementation.
  • Fix: Update test expectations to match current behavior. Don't mock what doesn't need mocking.
  • Files: neode-ui/src/stores/__tests__/appLauncher.test.ts, neode-ui/src/views/__tests__/settings.test.ts

5. Remove dead dependencies (LOW)

  • Problem: dockerode in package.json is unused (container ops go through RPC).
  • Fix: npm uninstall dockerode @types/dockerode
  • Files: neode-ui/package.json

Q2 2026: WebSocket Efficiency & Validation

6. Add json-patch crate to backend (HIGH — performance)

  • Problem: Backend broadcasts the entire DataModel on every state change. Frontend already has fast-json-patch and supports incremental updates. Backend just doesn't generate patches.
  • Fix: Add json-patch crate. Before broadcasting, diff old vs new DataModel, send only the RFC 6902 patch. Fall back to full sync if patch is larger than full model.
  • Files: core/Cargo.toml, core/archipelago/src/state.rs
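
The fallback rule in the fix can be isolated from the diffing itself. The sketch below assumes the patch has already been produced (for instance by the json-patch crate) and serialized; `Broadcast` and `choose_broadcast` are hypothetical names, and the comparison uses serialized byte length as the size measure.

```rust
/// Which payload to send to WebSocket clients for one state change.
#[derive(Debug, PartialEq, Eq)]
pub enum Broadcast<'a> {
    /// Send only the RFC 6902 patch document.
    Patch(&'a str),
    /// Send the whole serialized DataModel.
    FullSync(&'a str),
}

/// Pick the patch unless it is larger than the full model.
pub fn choose_broadcast<'a>(full_json: &'a str, patch_json: &'a str) -> Broadcast<'a> {
    if patch_json.len() > full_json.len() {
        Broadcast::FullSync(full_json)
    } else {
        Broadcast::Patch(patch_json)
    }
}
```

For a tiny model, a one-operation patch such as `[{"op":"replace","path":"/a","value":3}]` is already longer than `{"a":1,"b":2}`, so the function picks the full sync. For a realistically sized DataModel the patch is chosen.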

7. Add form validation with zod (MEDIUM — maintainability)

  • Problem: Manual inline validation scattered across Login, Settings, Onboarding. As forms grow, this becomes a maintenance burden.
  • Fix: npm install zod. Create validation schemas in src/types/schemas.ts. Use in forms and RPC request builders. This is especially important for onboarding where bad input causes cryptographic key generation to fail silently.
  • Files: neode-ui/package.json, new neode-ui/src/types/schemas.ts, Login.vue, Settings.vue, onboarding views

8. Move hardcoded app metadata to manifest files (MEDIUM — maintainability)

  • Problem: docker_packages.rs has hardcoded port mappings, titles, descriptions, and icon paths for ~20 apps. App manifests exist in apps/ but aren't the source of truth.
  • Fix: Make apps/{app-id}/manifest.yml the single source of truth. Load metadata from manifests at startup. Remove hardcoded maps from Rust source.
  • Files: core/archipelago/src/container/docker_packages.rs, apps/*/manifest.yml

Q3 2026: Error Handling & Testing

9. Structured error types per backend module (MEDIUM — debuggability)

  • Problem: Everything uses anyhow::Result. When errors bubble up through RPC, you lose the module context. User-facing vs system errors aren't distinguished at the type level.
  • Fix: Create thiserror error enums for each major module: AuthError, ContainerError, FederationError, IdentityError. Map to appropriate HTTP status codes and user-friendly messages in the RPC layer.
  • Files: Each module in core/archipelago/src/
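
As an illustration of the shape such an enum could take, here is a hand-written version using only the standard library. With thiserror, the `Display` and `Error` impls would be derived instead. The variants, status codes, and messages are assumptions, not the project's actual error model.

```rust
use std::fmt;

/// Hypothetical variants; the real AuthError will reflect actual failure modes.
#[derive(Debug)]
pub enum AuthError {
    InvalidCredentials,
    SessionExpired,
    /// Internal failure; the detail is for logs, never for the user.
    Storage(String),
}

impl fmt::Display for AuthError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AuthError::InvalidCredentials => write!(f, "invalid credentials"),
            AuthError::SessionExpired => write!(f, "session expired"),
            AuthError::Storage(detail) => write!(f, "auth storage failure: {}", detail),
        }
    }
}

impl std::error::Error for AuthError {}

impl AuthError {
    /// Status code the RPC layer would return for this error.
    pub fn http_status(&self) -> u16 {
        match self {
            AuthError::InvalidCredentials | AuthError::SessionExpired => 401,
            AuthError::Storage(_) => 500,
        }
    }

    /// Message that is safe to show to the user (no internal paths or details).
    pub fn user_message(&self) -> &'static str {
        match self {
            AuthError::InvalidCredentials => "Incorrect password.",
            AuthError::SessionExpired => "Your session has expired. Please log in again.",
            AuthError::Storage(_) => "Something went wrong. Please try again.",
        }
    }
}
```

Keeping `Display` (for logs) separate from `user_message` (for the UI) makes the user-facing versus system distinction explicit at the type level.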

10. Backend integration tests for RPC endpoints (HIGH — reliability)

  • Problem: 312 unit tests exist but zero integration tests for 80+ RPC endpoints. No test ever makes an actual HTTP request to the server.
  • Fix: Create integration test harness that spins up a real server instance (with test config, temp data dir). Test auth flow, container operations, identity, federation. Use reqwest as test client.
  • Files: New core/archipelago/tests/ directory

11. Frontend 404 route (LOW — UX)

  • Problem: No catch-all route. Invalid URLs silently show nothing.
  • Fix: Add /:pathMatch(.*)* catch-all route that shows a "Page not found" view with navigation back to dashboard.
  • Files: neode-ui/src/router/index.ts, new neode-ui/src/views/NotFound.vue

Q4 2026: Clean Up Dead Code & CI

12. Remove dead code and #[allow(dead_code)] (LOW — cleanliness)

  • Problem: auth.rs has #[allow(dead_code)] on OnboardingState fields and AuthManager methods. Either use them or remove them.
  • Fix: Audit all #[allow(dead_code)], #[allow(unused)]. Remove genuinely unused code. Wire up code that should be used (like RBAC — covered in item 2).
  • Files: core/archipelago/src/auth.rs and others

13. Set up CI pipeline (HIGH — process)

  • Problem: No automated testing on push/PR. All testing is manual or via deploy scripts.
  • Fix: GitHub Actions workflow: cargo clippy, cargo test, npm run type-check, npm run test on every push. Fail the build on warnings.
  • Files: New .github/workflows/ci.yml

14. Cosign container image verification (MEDIUM — security)

  • Problem: podman_client.rs:95 has a TODO for cosign signature verification. Container images are pulled without validation.
  • Fix: Implement cosign verification using the sigstore crate, or shell out to cosign verify as a first step. At minimum, verify image digests against a pinned manifest.
  • Files: core/container/src/podman_client.rs, core/security/

Year 2: Robustness & Performance (March 2027 — Feb 2028)

Q1 2027: Backend Architecture

15. Migrate from hyper to axum (MEDIUM — maintainability)

  • Problem: Raw hyper 0.14 with manual routing in handler.rs (813 lines). Route matching, middleware, and error handling are all hand-rolled. hyper 0.14 is also end-of-life.
  • Fix: Migrate to axum (built on hyper 1.x, maintained by tokio team). Axum gives you: extractors, middleware stack, typed routing, tower integration. The RPC methods stay the same — only the HTTP layer changes.
  • Files: core/archipelago/src/api/handler.rs, core/archipelago/src/api/mod.rs, core/Cargo.toml
  • Risk: Medium. Do this on a branch, test thoroughly. The RPC logic doesn't change, just the HTTP glue.

16. Replace custom rate limiter with tower middleware (LOW — correctness)

  • Problem: Hand-rolled in-memory rate limiter in rpc/mod.rs. Works for single instance but not distributed.
  • Fix: Use tower::limit::RateLimitLayer or governor crate. Cleaner, tested, configurable per-route.
  • Files: core/archipelago/src/api/rpc/mod.rs

17. Persistent sessions in SQLite (MEDIUM — UX)

  • Problem: Sessions are in-memory. Server restart logs out all users.
  • Fix: With SQLite from item 1, store sessions in DB. Users stay logged in across restarts.
  • Files: core/archipelago/src/session.rs

Q2 2027: Frontend Architecture

18. Audit and optimize bundle size (MEDIUM — performance)

  • Problem: D3 is a large dependency (~240KB) used only for LineChart.vue. Target is <500KB gzipped.
  • Fix: Replace full d3 import with only d3-scale, d3-shape, d3-axis (tree-shakeable). Or evaluate unovis or native Canvas for simple line charts. Measure before and after.
  • Files: neode-ui/package.json, neode-ui/src/components/LineChart.vue

19. Vue Router route transitions (LOW — polish)

  • Problem: No transition animations between routes. Pages appear/disappear instantly.
  • Fix: Add <RouterView v-slot> with <Transition> wrapper. Simple fade (200ms) is enough — matches the existing glassmorphism feel without changing the design.
  • Files: neode-ui/src/App.vue
  • Note: This is not a design change — it's a missing standard Vue pattern.

20. TypeScript strict cleanup (LOW — type safety)

  • Problem: WebSocket callback types in app.ts:105 use inline object types instead of importing the Update type from @/types/api.
  • Fix: Audit all stores and components for inline type definitions that should reference shared types. Centralize in src/types/.
  • Files: neode-ui/src/stores/app.ts, neode-ui/src/types/

Q3 2027: Testing & Observability

21. Reach 60% test coverage (HIGH — reliability)

  • Problem: Frontend has ~505 passing tests but many views untested. Backend has zero RPC integration tests.
  • Fix: Prioritize testing for: auth flow, container lifecycle, WebSocket reconnection, federation handshake, backup/restore. Use coverage reports to find gaps.
  • Target: 60% line coverage frontend, 50% backend

22. Add OpenTelemetry tracing (MEDIUM — observability)

  • Problem: tracing is used for logging but there's no distributed tracing or metrics export. When something goes wrong in production, you're reading log files.
  • Fix: Add tracing-opentelemetry and opentelemetry-otlp. Export traces to a local collector (Grafana is already a supported app). Instrument RPC handlers, container operations, federation sync.
  • Files: core/Cargo.toml, core/archipelago/src/main.rs

23. Prometheus metrics export (MEDIUM — monitoring)

  • Problem: MetricsStore collects data but doesn't expose it. No way to monitor Archy health externally.
  • Fix: Add /metrics endpoint in Prometheus format using prometheus crate. Expose: RPC latency histograms, active sessions, container health, WebSocket connections, memory usage.
  • Files: core/archipelago/src/api/handler.rs, core/archipelago/src/monitoring/
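
For orientation, the Prometheus text exposition format that a /metrics endpoint returns is simple enough to sketch by hand. In practice the prometheus crate would handle registration and encoding. The `Gauge` struct and the metric name below are hypothetical.

```rust
/// Hypothetical gauge sample; real code would use the prometheus crate's types.
pub struct Gauge {
    pub name: &'static str,
    pub help: &'static str,
    pub value: f64,
}

/// Render gauges in the Prometheus text exposition format:
/// a HELP line, a TYPE line, then `name value` for each metric.
pub fn render_metrics(gauges: &[Gauge]) -> String {
    let mut out = String::new();
    for g in gauges {
        out.push_str(&format!("# HELP {} {}\n", g.name, g.help));
        out.push_str(&format!("# TYPE {} gauge\n", g.name));
        out.push_str(&format!("{} {}\n", g.name, g.value));
    }
    out
}
```

Latency histograms need bucket, sum, and count series per metric, which is a good reason to rely on the crate rather than extend a hand-rolled renderer like this one.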

Q4 2027: Performance

24. Optimize container scanner (MEDIUM — CPU)

  • Problem: docker_packages.rs scans all containers every 10 seconds with full JSON parsing. On a system with 30+ containers, this is unnecessary CPU churn.
  • Fix: Use Podman events API (podman events --format json) to watch for container state changes instead of polling. Fall back to polling every 60s as a safety net.
  • Files: core/archipelago/src/container/docker_packages.rs
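
The "events first, polling as a safety net" loop can be reduced to a single wait with a timeout. In this sketch, a separate reader (not shown) is assumed to forward each line of `podman events --format json` into a channel; `ScanTrigger` and `next_trigger` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Why the scanner should run its next pass.
#[derive(Debug, PartialEq, Eq)]
pub enum ScanTrigger {
    /// A container event arrived (raw JSON line from the event stream).
    Event(String),
    /// No event within the fallback interval; do a full safety-net scan.
    FallbackPoll,
    /// The event reader went away; the caller should restart it or stop.
    Shutdown,
}

/// Block until an event arrives or the fallback interval elapses.
pub fn next_trigger(events: &Receiver<String>, fallback: Duration) -> ScanTrigger {
    match events.recv_timeout(fallback) {
        Ok(line) => ScanTrigger::Event(line),
        Err(RecvTimeoutError::Timeout) => ScanTrigger::FallbackPoll,
        Err(RecvTimeoutError::Disconnected) => ScanTrigger::Shutdown,
    }
}
```

The scanner loop would call `next_trigger(&rx, Duration::from_secs(60))` repeatedly: on `Event` it refreshes only the affected container, and on `FallbackPoll` it runs the existing full scan.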

25. Lazy-load i18n locales (LOW — bundle size)

  • Problem: Spanish locale exists but loading behavior isn't optimized.
  • Fix: Use Vue i18n's lazy loading: load only the active locale on startup, fetch others on demand.
  • Files: neode-ui/src/i18n.ts

Year 3: Production Hardening (March 2028 — March 2029)

Q1 2028: Resilience

26. Database migration system (MEDIUM — upgradability)

  • Problem: Once SQLite is in use, schema changes need managed migrations.
  • Fix: Use sqlx migrations (already supported). Create core/archipelago/migrations/ directory. Run migrations on startup before serving requests.
  • Files: core/archipelago/migrations/, core/archipelago/src/main.rs

27. Graceful degradation for container failures (MEDIUM — UX)

  • Problem: If Podman is down or unresponsive, the entire backend can hang on container operations.
  • Fix: Add timeouts to all Podman CLI calls (some already have them, make it universal). Show degraded state in UI rather than hanging. Container operations should never block the main RPC handler.
  • Files: core/container/src/podman_client.rs
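
A universal timeout wrapper could follow the pattern below. This is a std-only sketch with hypothetical names (`CallOutcome`, `run_with_timeout`); the closure stands in for whatever actually invokes the Podman CLI.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Result of a call that is not allowed to block forever.
#[derive(Debug, PartialEq, Eq)]
pub enum CallOutcome<T> {
    Done(T),
    /// The work did not finish in time; the UI can show a degraded state.
    TimedOut,
    /// The worker thread ended without producing a value (it panicked).
    Failed,
}

/// Run `work` on a separate thread and wait at most `timeout` for its result.
pub fn run_with_timeout<T, F>(timeout: Duration, work: F) -> CallOutcome<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // After a timeout the receiver is gone; ignoring the send error is fine.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => CallOutcome::Done(value),
        Err(RecvTimeoutError::Timeout) => CallOutcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => CallOutcome::Failed,
    }
}
```

Note that this only stops waiting; it does not cancel the work, so a hung child process would linger. In an async backend, wrapping the child process future in `tokio::time::timeout` and enabling `kill_on_drop(true)` on `tokio::process::Command` is one way to also terminate the process, assuming the backend runs on tokio.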

28. WebSocket backpressure handling (LOW — stability)

  • Problem: Broadcast channel capacity is 100. If a slow client can't keep up, messages are dropped silently.
  • Fix: Detect RecvError::Lagged, send full resync to that client. Log when clients fall behind consistently.
  • Files: core/archipelago/src/api/handler.rs

Q2 2028: Security Hardening

29. Full security audit pass (HIGH — security)

  • Problem: Various small issues accumulated: CORS could be tighter, rate limiting coverage is incomplete, error messages could leak internal paths.
  • Fix: Systematic pass through all 80+ RPC endpoints. Verify: input validation, authorization, rate limiting, error sanitization, path traversal prevention. Document findings.
  • Files: All RPC handlers

30. Automated dependency security scanning (MEDIUM — supply chain)

  • Problem: No automated cargo audit or npm audit in CI.
  • Fix: Add to CI pipeline. Run weekly via cron. Block releases on known vulnerabilities (with severity threshold).
  • Files: .github/workflows/ci.yml, scripts/audit-deps.sh

Q3 2028: Final Quality

31. Reach 80% test coverage (HIGH — confidence)

  • Target: 80% line coverage across frontend and backend
  • Focus: Edge cases, error paths, recovery scenarios, concurrent operations

32. Load testing (MEDIUM — capacity planning)

  • Problem: No load testing. Unknown how many concurrent users, containers, or WebSocket connections Archy can handle on target hardware.
  • Fix: Create a load test suite with k6, and use criterion (Rust) for micro-benchmarks of hot paths. Test: concurrent RPC calls, WebSocket connections, container operations. Document capacity limits per hardware tier.
  • Files: New tests/load/ directory

33. Code documentation pass (LOW — maintainability)

  • Problem: Module-level docs are sparse. New contributors (or future you) need to understand the architecture from code alone.
  • Fix: Add //! module docs to every Rust module. Add JSDoc to every Vue composable and store. Document the "why" of architectural decisions inline.
  • Files: All modules

Q4 2028: Polish & Maintenance

34. Dependency update cycle (ONGOING)

  • Monthly: cargo update, npm update, review changelogs
  • Quarterly: Major version upgrades (evaluate breaking changes)
  • Yearly: Evaluate if any custom code can be replaced by now-mature libraries

35. Refactoring retrospective

  • Review this plan against actual state
  • Document what worked, what didn't
  • Create Year 4+ maintenance plan if needed

Priority Summary

| Priority | Item | Impact |
|----------|------|--------|
| Critical | 1. SQLite database | Crash resilience |
| Critical | 2. Enforce RBAC | Security |
| Critical | 3. Fix session TTL bug | Correctness |
| Critical | 6. JSON patch broadcasting | Performance |
| Critical | 13. CI pipeline | Process |
| High | 4. Fix failing tests | Reliability |
| High | 10. Backend integration tests | Reliability |
| High | 14. Cosign verification | Security |
| High | 15. Migrate hyper → axum | Maintainability |
| High | 21. 60% test coverage | Reliability |
| High | 29. Security audit | Security |
| High | 31. 80% test coverage | Confidence |
| Medium | 7. Zod validation | Maintainability |
| Medium | 8. Manifest-driven metadata | Maintainability |
| Medium | 9. Structured error types | Debuggability |
| Medium | 17. Persistent sessions | UX |
| Medium | 18. D3 tree-shaking | Bundle size |
| Medium | 22. OpenTelemetry | Observability |
| Medium | 23. Prometheus metrics | Monitoring |
| Medium | 24. Container scanner optimization | CPU |
| Low | 5. Remove dockerode | Cleanliness |
| Low | 11. 404 route | UX |
| Low | 12. Dead code cleanup | Cleanliness |
| Low | 16. Tower rate limiter | Correctness |
| Low | 19. Route transitions | Polish |
| Low | 20. TypeScript cleanup | Type safety |
| Low | 25. Lazy i18n | Bundle size |

Guiding Principles

  1. Use established crates and packages — Don't reinvent what's solved. sqlx, axum, tower, json-patch, zod, governor exist for a reason.
  2. Keep custom what's custom — Federation, marketplace, DWN, the design system — these are genuinely yours. Don't force a library where none fits.
  3. Test what matters — Auth, container lifecycle, data persistence, WebSocket reliability. Not every utility function needs a test.
  4. Refactor in place — No rewrites. Migrate incrementally. Every commit should leave the codebase better than it found it.
  5. No design changes — The glassmorphism system, the layout, the UX flow — all stay exactly as they are. This plan only touches the internals.