archy/.claude/plans/memoized-plotting-sifakis.md
Dorian 207e53144c feat: architecture review fixes, self-update system, CI pipeline, supply chain hardening
Architecture review (all P0+P1 issues now fixed):
- Add 10s timeout to 6 bare Nostr client.connect() calls
- Pin all 12 crypto deps to exact versions from Cargo.lock
- Pin all 15 floating container image tags to exact patch versions
- Add CI pipeline (cargo fmt + clippy + tests, frontend type-check + build)

Self-update system (git.tx1138.com):
- scripts/self-update.sh: pull, build, install, restart with rollback
- systemd timer checks daily at 3 AM
- update.check RPC does git-based checks when repo is present
- update.git-apply RPC triggers self-update from UI
- Default update URL changed from GitHub to git.tx1138.com
- Git added to ISO package list for fresh installs

Documentation:
- CHANGELOG v1.3.1 with all changes
- README updated (version, update system section)
- BETA-PROGRESS session #6 logged
- architecture-review.html: 4 issues marked FIXED, 8/12 refactoring done

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 15:52:26 +00:00

146 lines
6.5 KiB
Markdown

# Architecture Review — Fix Remaining Issues
## Context
The architecture review (`docs/architecture-review.html`) identified 4 P0, 6 P1, and 6 medium-priority issues across the codebase. After research, **all 4 P0s and 4 of 6 P1s are already fixed**. This plan addresses the remaining open items that improve reliability and security during the beta freeze.
**What's already fixed:** P0-1 (health RPC), P0-2 (health checks), P0-3 (backup rollback), P0-4 (nginx protections), P1-B (rate limiter cleanup), P1-C (systemd limits), P1-E (WS reconnect), P1-F (Vue error handler), Issue 11 (session async I/O).
**What we're fixing now (4 items):**
---
## Item 1: Add 10s timeout to 6 bare `client.connect()` calls — DONE
**Why:** A down Nostr relay hangs the async task indefinitely, blocking identity publishing, node discovery, and marketplace operations. Direct uptime impact.
### Files & locations
| File | Line | Function |
|------|------|----------|
| `core/archipelago/src/identity_manager.rs` | 409 | `publish_profile()` |
| `core/archipelago/src/nostr_discovery.rs` | 113 | `publish_node_revocation()` |
| `core/archipelago/src/nostr_discovery.rs` | 200 | `verify_revocation()` |
| `core/archipelago/src/nostr_discovery.rs` | 264 | `discover_archipelago_nodes()` |
| `core/archipelago/src/marketplace.rs` | 298 | `discover()` |
| `core/archipelago/src/marketplace.rs` | 406 | `publish()` |
### Pattern (from `nostr_handshake.rs:126`)
Replace each `client.connect().await;` with:
```rust
if tokio::time::timeout(Duration::from_secs(10), client.connect()).await.is_err() {
tracing::warn!("Nostr relay connection timed out after 10s, continuing anyway");
}
```
Ensure `use std::time::Duration;` is imported in each file. `tracing::warn!` is already available in all three files.
### Risk: LOW — Mechanical pattern replication, no logic changes.
---
## Item 2: Pin all crypto dependency versions exactly — DONE
**Why:** Floating versions (`"2.1"` instead of `"2.2.0"`) allow `cargo update` to silently change crypto libraries. Supply chain risk + project rules violation.
### Versions (verified from Cargo.lock)
**`core/archipelago/Cargo.toml`:**
| Line | Current | Pin to |
|------|---------|--------|
| 44 | `sha2 = "0.10"` | `"0.10.9"` |
| 45 | `hmac = "0.12"` | `"0.12.1"` |
| 50 | `ed25519-dalek = { version = "2.1", ... }` | `version = "2.2.0"` |
| 51 | `curve25519-dalek = "4"` | `"4.1.3"` |
| 52 | `rand = "0.8"` | `"0.8.5"` |
| 69 | `argon2 = "0.5"` | `"0.5.3"` |
| 70 | `chacha20poly1305 = "0.10"` | `"0.10.1"` |
| 81 | `zeroize = { version = "1.7", ... }` | `version = "1.8.2"` |
| 92 | `hkdf = "0.12"` | `"0.12.4"` |
**`core/security/Cargo.toml`:**
| Line | Current | Pin to |
|------|---------|--------|
| 16 | `aes-gcm = "0.10"` | `"0.10.3"` |
| 17 | `rand = "0.8"` | `"0.8.5"` |
| 19 | `zeroize = { version = "1", ... }` | `version = "1.8.2"` |
**Note:** `core/models/Cargo.toml` has `ed25519-dalek = "2.0.0"` but this crate is NOT in the workspace — it's dead code. Skip it.
### Risk: LOW — Pins to versions already resolved in Cargo.lock. No actual dependency changes.
---
## Item 3: Pin all floating container image tags — DONE
**Why:** Floating tags (`:1`, `:7`, `:alpine`, `:main`) mean two installs a week apart get different software. Supply chain risk and a support nightmare.
### File: `scripts/image-versions.sh`
| Line | Variable | Current Tag | Action |
|------|----------|-------------|--------|
| 16 | `MARIADB_IMAGE` | `:11.4` | SSH -> get exact patch version |
| 21 | `POSTGRES_IMAGE` | `:15` | SSH -> get exact patch version |
| 22 | `BTCPAY_POSTGRES_IMAGE` | `:15` | SSH -> get exact patch version |
| 25 | `HOMEASSISTANT_IMAGE` | `:2024.12` | SSH -> get exact patch version |
| 27 | `UPTIME_KUMA_IMAGE` | `:1` | SSH -> get exact patch version |
| 32 | `NEXTCLOUD_IMAGE` | `:29` | SSH -> get exact patch version |
| 34 | `ONLYOFFICE_IMAGE` | `:8.2` | SSH -> get exact patch version |
| 35 | `FILEBROWSER_IMAGE` | `:v2` | SSH -> get exact patch version |
| 36 | `NPM_IMAGE` | `:2` | SSH -> get exact patch version |
| 49 | `REDIS_IMAGE` | `:7` | SSH -> get exact patch version |
| 52 | `VALKEY_IMAGE` | `:8` | SSH -> get exact patch version |
| 60 | `INDEEDHUB_POSTGRES_IMAGE` | `:16-alpine` | SSH -> get exact patch version |
| 61 | `INDEEDHUB_REDIS_IMAGE` | `:7-alpine` | SSH -> get exact patch version |
| 64 | `DWN_SERVER_IMAGE` | `:main` | SSH -> get image digest, pin by SHA or tag |
| 68 | `NGINX_ALPINE_IMAGE` | `:alpine` | SSH -> get exact version |
### Pre-work required
Run on 192.168.1.228: `podman images --format '{{.Repository}}:{{.Tag}}'` to get exact versions currently deployed. Pin to THOSE — don't upgrade.
### Risk: MEDIUM — Must match what's actually running. Wrong pin = containers fail on next creation.
---
## Item 4: Add CI pipeline for Rust + frontend checks — DONE
**Why:** No tests or linting run in CI. Regressions from Items 1-3 (and all future beta fixes) go undetected until they hit the server.
### File to create: `.github/workflows/ci.yml`
Two parallel jobs:
1. **`rust`** (ubuntu-latest): `cargo fmt --check` -> `cargo clippy -D warnings` -> `cargo test`
2. **`frontend`** (ubuntu-latest): `npm ci` -> `npm run type-check` -> `npm test`
Trigger: push to `main` + all PRs. Reference existing `build-macos.yml` for action versions (checkout@v4, setup-node@v4 with Node 18).
### Risk: LOW — Additive only, new file, doesn't affect existing workflows.
---
## Execution Order
1. **Item 1** (Nostr timeouts) — lowest risk, immediate reliability gain
2. **Item 2** (crypto pins) — batch with Item 1 for single deploy
3. **Item 3** (container image pins) — requires SSH query first
4. **Item 4** (CI) — validates everything, no deploy needed
Items 1+2 deploy together. Item 3 deploys separately (script only). Item 4 is push-only.
## Verification
- Items 1+2: `cargo clippy --all-targets --all-features` on dev server (zero warnings), then deploy + test identity/discovery/marketplace features
- Item 3: `source scripts/image-versions.sh` + verify all vars have exact patch versions
- Item 4: Push to branch, verify both CI jobs pass green on GitHub Actions
## Deferred (post-beta)
- Issue 6: Generate TS types from Rust (ts-rs) — new dependency
- Issue 7: Consolidate container metadata to single source — structural refactor
- Issue 8: Split deploy/ISO scripts into modules — already planned in script comments
- Issue 9: Single app manifest driving all 6+ locations — architectural change
- Issue 12: useAsyncState composable — touches 14+ views, risky during freeze