archy/.claude/plans/memoized-plotting-sifakis.md
Dorian 207e53144c feat: architecture review fixes, self-update system, CI pipeline, supply chain hardening
Architecture review (all P0+P1 issues now fixed):
- Add 10s timeout to 6 bare Nostr client.connect() calls
- Pin all 12 crypto deps to exact versions from Cargo.lock
- Pin all 15 floating container image tags to exact patch versions
- Add CI pipeline (cargo fmt + clippy + tests, frontend type-check + build)

Self-update system (git.tx1138.com):
- scripts/self-update.sh: pull, build, install, restart with rollback
- systemd timer checks daily at 3 AM
- update.check RPC does git-based checks when repo is present
- update.git-apply RPC triggers self-update from UI
- Default update URL changed from GitHub to git.tx1138.com
- Git added to ISO package list for fresh installs

Documentation:
- CHANGELOG v1.3.1 with all changes
- README updated (version, update system section)
- BETA-PROGRESS session #6 logged
- architecture-review.html: 4 issues marked FIXED, 8/12 refactoring done

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 15:52:26 +00:00

6.5 KiB

Architecture Review — Fix Remaining Issues

Context

The architecture review (docs/architecture-review.html) identified 4 P0, 6 P1, and 6 medium-priority issues across the codebase. After research, all 4 P0s and 4 of 6 P1s are already fixed. This plan addresses the remaining open items that improve reliability and security during the beta freeze.

What's already fixed: P0-1 (health RPC), P0-2 (health checks), P0-3 (backup rollback), P0-4 (nginx protections), P1-B (rate limiter cleanup), P1-C (systemd limits), P1-E (WS reconnect), P1-F (Vue error handler), Issue 11 (session async I/O).

What we're fixing now (4 items):


Item 1: Add 10s timeout to 6 bare client.connect() calls — DONE

Why: A down Nostr relay hangs the async task indefinitely, blocking identity publishing, node discovery, and marketplace operations. Direct uptime impact.

Files & locations

File Line Function
core/archipelago/src/identity_manager.rs 409 publish_profile()
core/archipelago/src/nostr_discovery.rs 113 publish_node_revocation()
core/archipelago/src/nostr_discovery.rs 200 verify_revocation()
core/archipelago/src/nostr_discovery.rs 264 discover_archipelago_nodes()
core/archipelago/src/marketplace.rs 298 discover()
core/archipelago/src/marketplace.rs 406 publish()

Pattern (from nostr_handshake.rs:126)

Replace each client.connect().await; with:

if tokio::time::timeout(Duration::from_secs(10), client.connect()).await.is_err() {
    tracing::warn!("Nostr relay connection timed out after 10s, continuing anyway");
}

Ensure use std::time::Duration; is imported in each file. tracing::warn! is already available in all three files.

Risk: LOW — Mechanical pattern replication, no logic changes.


Item 2: Pin all crypto dependency versions exactly — DONE

Why: Floating versions ("2.1" instead of "2.2.0") allow cargo update to silently change crypto libraries. Supply chain risk + project rules violation.

Versions (verified from Cargo.lock)

core/archipelago/Cargo.toml:

Line Current Pin to
44 sha2 = "0.10" "0.10.9"
45 hmac = "0.12" "0.12.1"
50 ed25519-dalek = { version = "2.1", ... } version = "2.2.0"
51 curve25519-dalek = "4" "4.1.3"
52 rand = "0.8" "0.8.5"
69 argon2 = "0.5" "0.5.3"
70 chacha20poly1305 = "0.10" "0.10.1"
81 zeroize = { version = "1.7", ... } version = "1.8.2"
92 hkdf = "0.12" "0.12.4"

core/security/Cargo.toml:

Line Current Pin to
16 aes-gcm = "0.10" "0.10.3"
17 rand = "0.8" "0.8.5"
19 zeroize = { version = "1", ... } version = "1.8.2"

Note: core/models/Cargo.toml has ed25519-dalek = "2.0.0" but this crate is NOT in the workspace — it's dead code. Skip it.

Risk: LOW — Pins to versions already resolved in Cargo.lock. No actual dependency changes.


Item 3: Pin all floating container image tags — DONE

Why: Floating tags (:1, :7, :alpine, :main) mean two installs a week apart get different software. Supply chain risk and a support nightmare.

File: scripts/image-versions.sh

Line Variable Current Tag Action
16 MARIADB_IMAGE :11.4 SSH -> get exact patch version
21 POSTGRES_IMAGE :15 SSH -> get exact patch version
22 BTCPAY_POSTGRES_IMAGE :15 SSH -> get exact patch version
25 HOMEASSISTANT_IMAGE :2024.12 SSH -> get exact patch version
27 UPTIME_KUMA_IMAGE :1 SSH -> get exact patch version
32 NEXTCLOUD_IMAGE :29 SSH -> get exact patch version
34 ONLYOFFICE_IMAGE :8.2 SSH -> get exact patch version
35 FILEBROWSER_IMAGE :v2 SSH -> get exact patch version
36 NPM_IMAGE :2 SSH -> get exact patch version
49 REDIS_IMAGE :7 SSH -> get exact patch version
52 VALKEY_IMAGE :8 SSH -> get exact patch version
60 INDEEDHUB_POSTGRES_IMAGE :16-alpine SSH -> get exact patch version
61 INDEEDHUB_REDIS_IMAGE :7-alpine SSH -> get exact patch version
64 DWN_SERVER_IMAGE :main SSH -> get image digest, pin by SHA or tag
68 NGINX_ALPINE_IMAGE :alpine SSH -> get exact version

Pre-work required

Run on 192.168.1.228: podman images --format '{{.Repository}}:{{.Tag}}' to get exact versions currently deployed. Pin to THOSE — don't upgrade.

Risk: MEDIUM — Must match what's actually running. Wrong pin = containers fail on next creation.


Item 4: Add CI pipeline for Rust + frontend checks — DONE

Why: No tests or linting run in CI. Regressions from Items 1-3 (and all future beta fixes) go undetected until they hit the server.

File to create: .github/workflows/ci.yml

Two parallel jobs:

  1. rust (ubuntu-latest): cargo fmt --check -> cargo clippy -D warnings -> cargo test
  2. frontend (ubuntu-latest): npm ci -> npm run type-check -> npm test

Trigger: push to main + all PRs. Reference existing build-macos.yml for action versions (checkout@v4, setup-node@v4 with Node 18).

Risk: LOW — Additive only, new file, doesn't affect existing workflows.


Execution Order

  1. Item 1 (Nostr timeouts) — lowest risk, immediate reliability gain
  2. Item 2 (crypto pins) — batch with Item 1 for single deploy
  3. Item 3 (container image pins) — requires SSH query first
  4. Item 4 (CI) — validates everything, no deploy needed

Items 1+2 deploy together. Item 3 deploys separately (script only). Item 4 is push-only.

Verification

  • Items 1+2: cargo clippy --all-targets --all-features on dev server (zero warnings), then deploy + test identity/discovery/marketplace features
  • Item 3: source scripts/image-versions.sh + verify all vars have exact patch versions
  • Item 4: Push to branch, verify both CI jobs pass green on GitHub Actions

Deferred (post-beta)

  • Issue 6: Generate TS types from Rust (ts-rs) — new dependency
  • Issue 7: Consolidate container metadata to single source — structural refactor
  • Issue 8: Split deploy/ISO scripts into modules — already planned in script comments
  • Issue 9: Single app manifest driving all 6+ locations — architectural change
  • Issue 12: useAsyncState composable — touches 14+ views, risky during freeze