archy/loop/plan.md
Dorian 66bf30547b fix: resolve container crash loops on .228 — UFW blocking Podman DNS
Root cause: UFW firewall was blocking all traffic from Podman container
subnets (10.88.0.0/16, 10.89.0.0/16) to the host, which prevented
Aardvark DNS resolution. Containers could not resolve each other by
hostname, causing mempool-web, mempool-api, nbxplorer, btcpay-server,
and immich_server to crash loop (6000+ total restarts).

Fix: Added UFW allow rules for Podman network subnets. Also removed
unused ollama container. All 32 containers now stable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-13 22:35:04 +00:00

# Archipelago 5-Year Production Hardening Plan
**Version**: 2.0
**Period**: March 2026 -- March 2031
**Goal**: Production-ready Bitcoin Node OS at 10,000 users with zero failures, 100% uptime, full inter-node federation
**Visual constraint**: NEVER change animations, user experience, or flow -- only clean up duplications, information hierarchy, and cosmetic issues
**Web5 additions**: did:dht, DWN protocol definitions for interoperable schemas, Verifiable Credentials (per TBD assessment)
**Primary test node**: `192.168.1.228` (Arch 1) — 4-core i3-8100T, 16GB RAM, 1.8TB NVMe
**Secondary test node**: `192.168.1.198` (Arch 2) — 8GB RAM, 457GB disk
**SSH**: `ssh -i ~/.ssh/archipelago-deploy archipelago@{IP}`
**Deploy**: `./scripts/deploy-to-target.sh --both`

---
## Critical Findings from Investigation (2026-03-13)
### Server .228 Issues
- **6 containers in crash loops**: archy-nbxplorer (3,535 restarts), archy-mempool-web (2,041), mempool-api (906), btcpay-server (888), mempool-electrs (529), immich_server (439)
- **Root cause**: Container networking DNS failures — mempool-web can't resolve "mempool-api" upstream, nbxplorer can't connect to Postgres
- **Load average 5.44 on 4 cores** — entirely caused by crash/restart cycles consuming CPU
- **ollama in Created state** — never started, consuming a container slot
- **Podman rootless warning**: "/" is not a shared mount
### Server .198 Issues
- **No federation configured** — /var/lib/archipelago/federation/ is empty
- **Tor container outdated** (v0.4.6.10) — warns "missing protocols: FlowCtrl=2 Relay=4", will eventually stop working
- **Tor failing every 5 minutes**: "No more HSDir available to query" — can't resolve .onion addresses
- **Memory critically low**: 147MB free of 8GB, NO SWAP configured
- **Nostr identity revoked** — nostr_revoked file exists but empty
- **Containers run under root** — rootless podman shows nothing, sudo podman shows 35 containers
### Cross-Node Issues
- .228 → .198 HTTP health: OK (basic connectivity works)
- .198 → .228 HTTP health: OK
- .198 has ZERO federation peers — no nodes.json, never joined federation
- Tor-based federation impossible from .198 — Tor can't resolve hidden services
- No swap on either server — OOM kills likely under load
- ping not installed on .228 (missing iputils-ping)
---
## User Stories & Acceptance Tests
Every test must pass **10 consecutive times** from BOTH .228→.198 AND .198→.228 directions.
### US-01: System Health
> As a node operator, I want my server to boot cleanly with all services running, zero crashed containers, and stable resource usage, so I never have to manually intervene.
### US-02: Container Lifecycle
> As a node operator, I want every installed app to start, run, survive reboots, and recover from crashes automatically, so my services are always available.
### US-03: Federation Join
> As a node operator, I want to invite another node to my federation using an invite code, so we can share status and deploy apps to each other.
### US-04: Federation Sync
> As a node operator, I want to see all my federated peers' status (online/offline, apps, resources) updated every 5 minutes, so I know my network health.
### US-05: Tor Hidden Services
> As a node operator, I want each app to have a .onion address that works reliably, so my services are accessible over Tor without exposing my IP.
### US-06: Nostr Discovery
> As a node operator, I want my node to publish its identity to Nostr relays and discover other nodes, so peers can find me without manual configuration.
### US-07: File Sharing
> As a node operator, I want to share files with federated peers over Tor with access controls (free, peers-only, paid), so I can selectively distribute content.
### US-08: DWN Sync
> As a node operator, I want DWN messages and protocols to replicate bidirectionally between my federated nodes over Tor, so my decentralized data is available everywhere.
### US-09: NIP-07 Signing
> As a node operator, I want iframe apps to use window.nostr to sign events with my node's Nostr key (with consent), so I can use Nostr apps with my sovereign identity.
### US-10: Backup/Restore
> As a node operator, I want to create encrypted backups and restore them on a fresh install, so I never lose my data or identity.
### US-11: Dashboard Monitoring
> As a node operator, I want real-time CPU, RAM, disk, and container health displayed on my dashboard, so I can spot problems before they escalate.
### US-12: Auto-Updates
> As a node operator, I want my node to check for updates, download them with integrity verification, and apply them with rollback capability.
### US-13: Identity & Credentials
> As a node operator, I want W3C DID Documents and Verifiable Credentials that work with did:dht for discoverable DIDs and proper VCs for proving identity claims between nodes.
### US-14: Web UI Navigation
> As a node operator, I want every page in the UI to load correctly, show real data (not hardcoded), and navigate without broken links or dead buttons.
### US-15: Boot Recovery
> As a node operator, I want all containers to automatically restart after any reboot, crash, or power loss, with zero manual intervention required.
---
## Phase 1: Emergency Stabilization (Week 1-2)
### Sprint 1: Stop the Crash Loops
- [x] **CRASH-01** — Fix container networking on .228. **Root cause**: UFW blocking all traffic from Podman subnets (10.88.0.0/16, 10.89.0.0/16) to host, preventing Aardvark DNS resolution. **Fix**: `ufw allow from 10.88.0.0/16` and `ufw allow from 10.89.0.0/16`. All containers on archy-net can now resolve hostnames. mempool-web stable 30+ minutes, 0 restarts.
- [x] **CRASH-02** — Fix archy-nbxplorer Postgres connection on .228. **Same root cause as CRASH-01**: UFW blocking DNS. After UFW fix, nbxplorer resolves archy-btcpay-db hostname and connects to Postgres. Both nbxplorer and btcpay-server stable 30+ minutes.
- [x] **CRASH-03** — Fix immich_server crash loop on .228. **Same root cause as CRASH-01**: UFW blocking DNS. Immich components on immich-net could not resolve each other. After UFW fix, immich_server started and is running stable 30+ minutes. Logs show successful Nest application startup on port 2283.
- [x] **CRASH-04** — Removed ollama on .228. `sudo podman rm ollama`. Container gone, total count reduced from 33 to 32.
- [x] **CRASH-05** — Verified .228 stability. All 32 containers running, zero exited, zero new crash loops for 30+ minutes. Load avg ~5.3 (high due to 32 containers on 4-core machine, not crash loops — was same before). Memory 1.8GB available (needs swap, see STAB-02). Health checks passing.
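
The "zero new crash loops for 30+ minutes" check in CRASH-05 can be sketched as a comparison of two restart-count snapshots taken 30 minutes apart. This is only a sketch: the helper name `new_restarts` and the two-column snapshot format (`container_name restart_count`, one pair per line) are assumptions for illustration, and how the snapshots are collected from Podman is left out.

```bash
# new_restarts BEFORE AFTER
# BEFORE and AFTER are snapshot files with one "name count" pair per line
# (assumed format). Prints every container whose restart count grew, or
# that is new in AFTER with a non-zero count. Returns 1 if any were found.
new_restarts() {
  awk '
    FILENAME == ARGV[1] { before[$1] = $2 + 0; next }   # first file: remember counts
    (($1 in before) ? ($2 + 0 > before[$1]) : ($2 + 0 > 0)) {
      printf "%s +%d\n", $1, $2 - before[$1]
      found = 1
    }
    END { exit (found ? 1 : 0) }
  ' "$1" "$2"
}
```

An empty output with exit status 0 corresponds to the "zero new crash loops" result recorded above.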
### Sprint 2: Stabilize .198
- [ ] **STAB-01** — Add swap on .198. Server has only 8GB RAM, 147MB free, no swap. Create a 4GB swap file: `sudo fallocate -l 4G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile`. Add to `/etc/fstab` for persistence. **Acceptance**: `free -h` shows 4GB swap. `swapon --show` lists /swapfile. Survives reboot.
- [ ] **STAB-02** — Add swap on .228. Even with 16GB, swap prevents OOM kills under load. Create 8GB swap: `sudo fallocate -l 8G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile`. Add to `/etc/fstab`. **Acceptance**: `free -h` shows 8GB swap on .228. Survives reboot.
- [ ] **STAB-03** — Update Tor container on .198. Current version 0.4.6.10 is critically outdated — warns it "will eventually stop working". Pull latest Tor image. Stop archy-tor, update image, restart. **Acceptance**: `sudo podman exec archy-tor tor --version` shows >= 0.4.8.x. Tor logs stop showing "missing protocols" warning. Hidden service hostnames are readable.
- [ ] **STAB-04** — Fix Tor hidden service resolution on .198. After updating Tor, check if .onion resolution works. Test: `sudo podman exec archy-tor curl --socks5-hostname 127.0.0.1:9050 -s http://$(cat /var/lib/tor/hidden_service_archipelago/hostname)/health`. If still failing, check torrc config, hidden service directories, and restart. **Acceptance**: Can resolve at least the local node's .onion address. Tor logs stop showing "No more HSDir available" errors.
- [ ] **STAB-05** — Fix Nostr identity on .198. The nostr_revoked file exists but is empty. Check if the Nostr keypair is valid: call `node.nostr-pubkey` RPC. If revoked, generate a new Nostr keypair via `identity.create-nostr-key` or similar. Remove the empty revocation file if the key is valid. **Acceptance**: `curl -s -X POST -H "Content-Type: application/json" -d '{"method":"node.nostr-pubkey"}' http://localhost:5678/rpc/v1` returns a valid hex pubkey. `node.nostr-discover` can publish to at least 1 relay.
- [ ] **STAB-06** — Establish federation between .228 and .198. On .228: generate invite code via `federation.invite` RPC. On .198: join federation via `federation.join` RPC with the invite code. Verify mutual trust established. **Acceptance**: On .228, `federation.list-nodes` shows .198 as trusted. On .198, `federation.list-nodes` shows .228 as trusted. `federation.sync-state` returns app lists from both nodes. Run 10 times from each direction.
- [ ] **STAB-07** — Verify rootless vs root podman on .198. Containers run under root (sudo podman) but the backend may be calling rootless podman. Check `core/archipelago/src/container/` to see if it uses `sudo podman` or just `podman`. Align the backend config with the actual container runtime. **Acceptance**: Backend RPC `container.list` returns all 35 containers. Health monitor can detect and restart containers.
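
STAB-01 and STAB-02 both say "Add to `/etc/fstab`" without spelling out the entry. A minimal sketch of an idempotent version follows; the function name `ensure_swap_fstab` is hypothetical, and the entry uses the conventional `none swap sw 0 0` fields. On a node it would be run as root after `swapon`, for example `ensure_swap_fstab /swapfile`; `swapon --show` and a reboot remain the real acceptance checks.

```bash
# ensure_swap_fstab SWAPFILE [FSTAB]
# Appends a swap entry for SWAPFILE to FSTAB (default /etc/fstab) unless a
# non-comment line for that path already exists. Safe to run repeatedly.
ensure_swap_fstab() {
  local swapfile="$1" fstab="${2:-/etc/fstab}"
  # A commented-out entry has "#" in its first field, so it does not match.
  if awk -v f="$swapfile" '$1 == f { found = 1 } END { exit !found }' "$fstab"; then
    return 0
  fi
  printf '%s none swap sw 0 0\n' "$swapfile" >> "$fstab"
}
```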
---
## Phase 2: Cross-Node Test Suite (Week 3-4)
### Sprint 3: Create Bulletproof Test Harness
- [ ] **TEST-01** — Create `scripts/test-cross-node.sh` master test script. This script runs every test from BOTH directions (.228→.198 and .198→.228). Takes `--iterations N` flag (default 10). Each test runs N times and must pass all N. Outputs TAP-format results. SSH into each node and runs checks. Exit code 0 only if ALL tests pass ALL iterations from BOTH directions. **Acceptance**: Script exists, runs, and produces clear pass/fail output per test.
- [ ] **TEST-02** — US-01 tests: System Health (10x each direction). From .228 SSH to .198 (and vice versa): (1) `curl /health` returns "OK", (2) `systemctl is-active archipelago nginx` both "active", (3) `free -h` available > 1GB, (4) load average < number of cores, (5) disk usage < 85%, (6) zero exited containers in `sudo podman ps -a`. Run each check 10 times. **Acceptance**: 60 checks per direction (6 checks x 10 iterations), all pass, both directions = 120 total passes.
- [ ] **TEST-03** US-02 tests: Container Lifecycle (10x each direction). From each node: (1) List all containers and confirm all are running, (2) Stop filebrowser, wait 90s, verify health monitor restarts it, (3) Install a test container, verify it starts, (4) Reboot the node, wait 120s, verify all containers come back. Run lifecycle test 10 times (skip reboot for 9 of 10, run reboot test once). **Acceptance**: 30+ checks per direction, all pass.
- [ ] **TEST-04** US-03 tests: Federation Join (10x). Already joined in STAB-06. Test: (1) Verify both nodes appear in each other's `federation.list-nodes`, (2) Trust level is "trusted" on both sides, (3) DID and onion address present, (4) `last_seen` within last 10 minutes. Run 10 times from each direction. **Acceptance**: 80 checks (4 x 10 x 2 directions), all pass.
- [ ] **TEST-05** US-04 tests: Federation Sync (10x). (1) Trigger `federation.sync-state` from .228 to .198, verify .198 app list returned, (2) From .198 to .228, verify .228 app list returned, (3) Verify last_seen updates, (4) Verify app count matches `sudo podman ps -q | wc -l` (`-q` omits the header row, which would otherwise inflate the count by one). Run 10 times each direction. **Acceptance**: 80 checks, all pass.
- [ ] **TEST-06** US-05 tests: Tor Hidden Services (10x). (1) `tor.list-services` returns at least "archipelago" service with valid .onion address, (2) From the OTHER node via Tor SOCKS proxy, resolve the .onion address and curl /health, (3) Per-app .onion addresses are reachable. Run 10 times each direction (Tor latency means each test may take 10-30s). **Acceptance**: 60 checks, all pass. Tor resolution works from both nodes.
- [ ] **TEST-07** US-06 tests: Nostr Discovery (10x). (1) `node.nostr-pubkey` returns valid hex pubkey, (2) `node.nostr-discover` finds at least the other test node, (3) Published Nostr event has valid onion address, (4) Both nodes' npubs are discoverable from each other. Run 10 times. **Acceptance**: 80 checks, all pass.
- [ ] **TEST-08** US-07 tests: File Sharing (10x). (1) On .228: share a test file via `content.add`, (2) From .198: `content.browse-peer` with .228's onion sees the file, (3) Download the file over Tor, verify checksum, (4) Reverse: share from .198, browse from .228. (5) Test access modes: free (accessible), peers_only (accessible from peer, blocked from anonymous). Run 10 times. **Acceptance**: 100 checks, all pass.
- [ ] **TEST-09** US-08 tests: DWN Sync (10x). (1) On .228: register protocol, write 3 messages, (2) Trigger DWN sync, (3) On .198: query messages, verify all 3 present, (4) Reverse: write on .198, sync, verify on .228, (5) Verify bidirectional sync: both nodes have all messages. Run 10 times. **Acceptance**: 100 checks, all pass.
- [ ] **TEST-10** US-09 tests: NIP-07 Signing (10x). (1) Verify nostr-provider.js is injected in iframe app HTML (curl /app/mempool/ and check for script tag), (2) `node.nostr-sign` RPC signs an event and returns valid sig, (3) `node.nostr-pubkey` matches the signing key, (4) NIP-04 encrypt/decrypt roundtrip. Run 10 times per node. **Acceptance**: 80 checks, all pass.
- [ ] **TEST-11** US-10 tests: Backup/Restore (10x). (1) Create encrypted backup via `backup.create`, (2) List backups via `backup.list`, verify it appears, (3) Verify backup integrity via `backup.verify`, (4) Delete backup via `backup.delete`. (5) Once: restore backup and verify identity survives. Run 10 times (skip restore for 9). **Acceptance**: 80+ checks, all pass.
- [ ] **TEST-12** US-15 tests: Boot Recovery (10x from each node). (1) Record running containers, (2) Reboot node, (3) Wait for backend health, (4) Verify ALL containers restarted within 120s, (5) Verify no containers exited. Run full reboot test 3 times per node, container recovery check 10 times. **Acceptance**: All containers survive every reboot. Zero manual intervention needed.
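
The core of the TEST-01 harness (run each check N times, pass only if all N pass, emit TAP) can be sketched as below. This is a sketch under assumptions, not the contents of `scripts/test-cross-node.sh`: the names `run_tap_test`, `tap_finish`, `TAP_COUNT` and `TAP_FAILED` are hypothetical, and the SSH wrapping of each check is omitted.

```bash
TAP_COUNT=0   # tests emitted so far
TAP_FAILED=0  # tests that failed at least one iteration

# run_tap_test ITERATIONS DESCRIPTION COMMAND [ARGS...]
# Runs COMMAND up to ITERATIONS times and stops at the first failure.
# Emits exactly one TAP result line per test.
run_tap_test() {
  local iterations="$1" desc="$2" i=1
  shift 2
  TAP_COUNT=$((TAP_COUNT + 1))
  while [ "$i" -le "$iterations" ]; do
    if ! "$@" >/dev/null 2>&1; then
      TAP_FAILED=$((TAP_FAILED + 1))
      echo "not ok $TAP_COUNT - $desc (failed on iteration $i of $iterations)"
      return 1
    fi
    i=$((i + 1))
  done
  echo "ok $TAP_COUNT - $desc ($iterations/$iterations)"
}

# tap_finish: print the TAP plan line; non-zero status if any test failed.
tap_finish() {
  echo "1..$TAP_COUNT"
  [ "$TAP_FAILED" -eq 0 ]
}

# Example (hypothetical check function):
#   run_tap_test 10 "US-01 health .228 to .198" check_health_on_198
#   tap_finish
```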
---
## Phase 3: UI Cosmetic Cleanup (Week 5-6)
### Sprint 4: Information Hierarchy & Deduplication
- [ ] **UI-CLEAN-01** Audit all views for hardcoded/fake data. SSH into .228, open each page, and call the RPC endpoints that feed them. Compare what the UI shows vs what the RPC returns. Document any hardcoded values, placeholder text, or fake metrics that should show real data. **Acceptance**: Audit document listing every discrepancy.
- [ ] **UI-CLEAN-02** Fix Dashboard (Home.vue) data accuracy. Verify: CPU/RAM/disk gauges show real `system.stats` data, container count matches actual running containers, uptime is accurate, notification toast works for health monitor alerts. Fix any discrepancies. Deploy and verify at http://192.168.1.228. **Acceptance**: All dashboard metrics match server reality. No fake data.
- [ ] **UI-CLEAN-03** Fix Server.vue information hierarchy. Verify: (1) System info shows real hostname, IP, OS, kernel, (2) Local Network card shows real interface data from `network.list-interfaces`, (3) VPN status from `vpn.status`, (4) DNS config from `network.dns-status`, (5) Web3 card shows "Coming Soon" not fake numbers. Remove any duplicate information that also appears on other pages. **Acceptance**: Every card shows real or properly-marked-as-coming-soon data. No duplication with Dashboard.
- [ ] **UI-CLEAN-04** Fix Web5.vue information hierarchy. Verify: (1) DID section shows real DID from `node.did`, (2) Nostr section shows real npub from `node.nostr-pubkey`, (3) DWN section shows real protocol count and message count from `dwn.status`, (4) Credentials section shows real credential count. Remove any "3 active" or placeholder numbers. **Acceptance**: All Web5 data is real or shows "0" / "Not configured".
- [ ] **UI-CLEAN-05** Fix Settings.vue deduplication. Verify no section duplicates information from Server.vue or Web5.vue. Specifically: (1) Account section is unique to Settings, (2) Security (2FA) is unique, (3) Tor section should NOT duplicate Web5 Tor info (keep Tor management in Settings only), (4) Backup section is unique, (5) System Updates link goes to update page. Remove any duplicated sections. **Acceptance**: Zero information duplication between Settings and other pages.
- [ ] **UI-CLEAN-06** Fix Marketplace.vue curated app list accuracy. Verify every app in `getCuratedAppList()` has: correct Docker image that exists on Docker Hub, correct default port, correct icon in `neode-ui/public/assets/img/app-icons/`, correct description. Remove any apps whose images don't exist. **Acceptance**: Every marketplace app can be installed successfully. No 404 icons. No broken image references.
- [ ] **UI-CLEAN-07** Fix Cloud.vue file management. Verify: (1) File type tabs (Photos, Music, Documents, All) correctly filter from FileBrowser, (2) "Peer Files" tab shows federated peers and can browse their catalogs, (3) Upload works, (4) Download works. No hardcoded file lists. **Acceptance**: All Cloud operations work with real data from both nodes.
- [ ] **UI-CLEAN-08** Fix Federation.vue accuracy. Verify: (1) Node list shows real peers from `federation.list-nodes`, (2) Online/offline status based on `last_seen` freshness, (3) Network map (D3.js) renders correctly with real node data, (4) Generate invite works, (5) Sync button triggers real sync. Fix any cosmetic issues (alignment, spacing, truncation). **Acceptance**: Federation page shows accurate real-time data for .228 and .198.
- [ ] **UI-CLEAN-09** Fix Chat.vue state. Verify Chat page works or shows proper "not configured" state if Claude proxy isn't available on the node. Should not show errors or broken UI. **Acceptance**: Chat page either works (if proxy configured) or shows clean "Configure AI Chat in Settings" message.
- [ ] **UI-CLEAN-10** Fix Apps.vue installed app display. Verify: (1) Shows only actually-installed containers, (2) Status badges match container state (running=green, stopped=red, installing=orange), (3) Click opens AppDetails with correct info, (4) No phantom apps that don't exist. **Acceptance**: App list exactly matches `sudo podman ps -a` on the server.
- [ ] **UI-CLEAN-11** Run type-check and fix all TypeScript errors. `cd neode-ui && npm run type-check`. Fix every error. Zero `any` types, zero unused imports, zero type mismatches. **Acceptance**: `npm run type-check` exits 0.
- [ ] **UI-CLEAN-12** Run frontend build and verify zero warnings. `cd neode-ui && npm run build`. Fix any warnings (unused variables, missing imports, deprecated APIs). **Acceptance**: `npm run build` exits 0 with zero warnings.
---
## Phase 4: Backend Hardening (Week 7-10)
### Sprint 5: Container Management Reliability
- [ ] **CONT-01** Audit container network topology on both nodes. Document every podman network, which containers are on each network, and which containers need to communicate. Create a network diagram. Fix any containers that should be on the same network but aren't (the initially suspected cause of CRASH-01 and CRASH-02, which were ultimately traced to UFW blocking DNS). **Acceptance**: Network diagram exists. All dependent containers share a network. No DNS resolution failures.
- [ ] **CONT-02** Add container dependency ordering to startup. In `crash_recovery.rs` `start_stopped_containers()`, implement proper startup ordering: (1) Databases first (postgres, redis, mariadb), (2) Core services second (bitcoin-knots, lnd), (3) Dependent services third (electrs, mempool-api, btcpay-server, nbxplorer), (4) UI containers last (mempool-web, bitcoin-ui, lnd-ui). Wait for each tier's health before starting the next. **Acceptance**: After reboot, containers start in dependency order. Zero crash-restart cycles. Run 10 reboot tests: all containers healthy within 120s every time.
- [ ] **CONT-03** Add container health check definitions for all apps. In `get_app_config()`, add `--health-cmd`, `--health-interval`, `--health-retries` to every container that doesn't have one. Currently only filebrowser, jellyfin, vaultwarden, and uptime-kuma have health checks. Add for: bitcoin-knots (`bitcoin-cli getblockchaininfo`), lnd (`lncli getinfo`), mempool-api (HTTP check), btcpay-server (HTTP check), nextcloud, etc. **Acceptance**: `sudo podman ps` shows "(healthy)" for every running container.
- [ ] **CONT-04** Cap health monitor restart attempts with exponential backoff. Currently max 3 restarts with no delay. Change to: restart 1 at 10s, restart 2 at 30s, restart 3 at 90s. After 3 failures, mark container as "failed" and notify (don't keep trying). Reset counter after 1 hour of stability. **Acceptance**: A permanently broken container stops restarting after 3 attempts. No infinite crash loops consuming CPU.
- [ ] **CONT-05** Add memory limits to all containers. Review `get_app_config()` memory limits. Set appropriate `--memory` flags: bitcoin-knots (2GB), lnd (512MB), electrs (1GB), mempool-api (512MB), mempool-web (256MB), nextcloud (1GB), immich_server (1GB), onlyoffice (2GB), etc. Prevent any single container from OOM-killing others. **Acceptance**: `sudo podman stats` shows all containers have MEM LIMIT set. No container exceeds its limit.
- [ ] **CONT-06** Fix rootless podman mount warning on .228. The warning "/ is not a shared mount" appears on every podman command. Fix by making the mount shared: add `mount --make-rshared /` to systemd startup, or configure in `/etc/containers/storage.conf`. **Acceptance**: `sudo podman ps` produces no warnings.
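
The restart schedule in CONT-04 (10s, 30s, 90s, then give up) is a geometric backoff with factor 3. The real logic would live in the Rust health monitor; the shell sketch below only illustrates the arithmetic, and the names `backoff_delay` and `MAX_RESTARTS` are hypothetical.

```bash
MAX_RESTARTS=3

# backoff_delay ATTEMPT
# Prints the wait in seconds before restart ATTEMPT (1-based):
# 10 * 3^(ATTEMPT - 1), i.e. 10, 30, 90. Returns 1 when ATTEMPT is out of
# range, which the caller would treat as "mark failed and notify".
backoff_delay() {
  local attempt="$1" delay=10 i=1
  if [ "$attempt" -lt 1 ] || [ "$attempt" -gt "$MAX_RESTARTS" ]; then
    return 1
  fi
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 3))
    i=$((i + 1))
  done
  echo "$delay"
}
```

A fourth attempt returns non-zero instead of a delay, which is the point where the container would be marked "failed" rather than restarted again.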
### Sprint 6: Backend Security & Reliability
- [ ] **SEC-01** Audit all RPC endpoints for input validation. In `core/archipelago/src/api/rpc/mod.rs`, list every registered route. For each endpoint, verify: input params are validated (length limits, format checks, no path traversal), auth is required (except health/public endpoints), error messages don't leak internals. **Acceptance**: Audit document with pass/fail per endpoint. All critical endpoints pass.
- [ ] **SEC-02** Add rate limiting to federation endpoints. Federation endpoints (`federation.join`, `federation.invite`) should be rate-limited to prevent invite-code brute force. Max 5 join attempts per minute per source IP. **Acceptance**: 6th join attempt within 60s returns 429.
- [ ] **SEC-03** Verify CSRF on all state-changing endpoints. Call every POST RPC endpoint without the X-CSRF-Token header; each call should return 403. Verify the CSRF token is properly generated on login and validated on every mutation. **Acceptance**: 100% of state-changing endpoints reject requests without valid CSRF token.
- [ ] **SEC-04** Audit container security profiles. For every container in `get_app_config()`, verify: `--cap-drop=ALL`, only required capabilities added back, `--security-opt=no-new-privileges:true`, `--read-only` where possible, non-root UID, specific image version pinned (not :latest). Fix any violations. **Acceptance**: All containers pass security checklist. `sudo podman inspect {name} --format "{{.HostConfig.CapDrop}}"` shows ALL for every container.
- [ ] **SEC-05** Implement proper log rotation. Check `/var/lib/archipelago/logs/` and `/var/log/` for log file sizes. Add logrotate config for: archipelago backend logs, container logs, nginx logs. Rotate daily, keep 7 days, compress. **Acceptance**: `du -sh /var/log/` < 500MB. Logrotate config exists and runs daily.
- [ ] **SEC-06** Verify nginx security headers on both nodes. `curl -I http://192.168.1.228` and `curl -I http://192.168.1.198`. Must include: X-Frame-Options, X-Content-Type-Options, Content-Security-Policy, Referrer-Policy. Fix any missing. **Acceptance**: All 4 security headers present on both nodes.
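
The SEC-06 header check can be automated with a small filter over the `curl -I` output, for example `curl -sI http://192.168.1.228 | missing_security_headers`. This is a sketch; the function name is hypothetical, and it only checks that each header is present, not that its value is sensible.

```bash
# missing_security_headers
# Reads raw HTTP response headers on stdin and prints each required header
# that is absent (header names are matched case-insensitively).
# Returns 1 if at least one header is missing.
missing_security_headers() {
  local headers required name missing
  missing=0
  # Strip carriage returns and lowercase everything for matching.
  headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
  required="x-frame-options x-content-type-options content-security-policy referrer-policy"
  for name in $required; do
    if ! printf '%s\n' "$headers" | grep -q "^$name:"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```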
---
## Phase 5: Reboot & Uptime Hardening (Week 11-14)
### Sprint 7: Zero-Downtime Reboot Testing
- [ ] **REBOOT-01** Create reboot survival test script. `scripts/test-reboot-survival.sh` that: (1) Records all container names and states, (2) Reboots the node via `sudo reboot`, (3) Waits for SSH to come back (poll every 10s, max 180s), (4) Verifies ALL containers are running, (5) Verifies health endpoint returns OK, (6) Verifies no containers have restart counts > 0 since boot. Run on .228. **Acceptance**: Script passes. All containers survive reboot.
- [ ] **REBOOT-02** — Run reboot survival test 10 times on .228. Execute test-reboot-survival.sh 10 times with 5-minute rest between reboots. Track: time to full recovery, any containers that fail to start, any services that don't come back. **Acceptance**: 10/10 reboots recover fully within 120s. Zero failed containers.
- [ ] **REBOOT-03** — Run reboot survival test 10 times on .198. Same as REBOOT-02 but on .198. **Acceptance**: 10/10 reboots recover fully. Zero failed containers.
- [ ] **REBOOT-04** — Test simultaneous reboot of both nodes. Reboot .228 and .198 at the same time. After both recover, verify: federation re-establishes, DWN sync works, file sharing works. **Acceptance**: Both nodes fully recover. Federation sync succeeds within 10 minutes of both being back.
- [ ] **REBOOT-05** — Test power-cut simulation (SIGKILL). On each node: `sudo kill -9 $(pgrep archipelago)`. Verify systemd restarts the backend, health monitor restarts containers, and everything recovers. Run 10 times per node. **Acceptance**: Full recovery within 90s, 10/10 times.
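
Step (3) of REBOOT-01 (poll every 10s, give up after 180s) is a generic poll-with-timeout loop. A minimal sketch, with the hypothetical helper name `wait_until`; note that it counts only the time spent sleeping, so a slow probe command makes the real wall-clock wait somewhat longer.

```bash
# wait_until INTERVAL MAX_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds. Gives up when
# the next sleep would push the accumulated wait past MAX_SECONDS.
# On success, prints the seconds spent waiting (sleep time only).
wait_until() {
  local interval="$1" max="$2" elapsed=0
  shift 2
  while true; do
    if "$@" >/dev/null 2>&1; then
      echo "$elapsed"
      return 0
    fi
    if [ $((elapsed + interval)) -gt "$max" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

For the reboot test this might be called as `wait_until 10 180 ssh -o ConnectTimeout=5 -i ~/.ssh/archipelago-deploy archipelago@192.168.1.228 true`, and the printed value gives a rough time-to-SSH for the recovery tracking in REBOOT-02.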
### Sprint 8: Memory & Storage Monitoring
- [ ] **MEM-01** — Add OOM-kill detection. In health_monitor.rs, check `dmesg | grep -i oom` and `/var/log/kern.log` for OOM kills. If detected, report via WebSocket notification with which process was killed. **Acceptance**: Trigger an intentional OOM (cgroup limit), verify notification fires.
- [ ] **MEM-02** — Add container memory leak detection. Track per-container RSS over time in the monitoring collector. If a container's memory grows by >50% in 24h without corresponding workload increase, flag as potential leak. **Acceptance**: Monitoring page shows memory trend per container. Alert fires for simulated leak (container with growing allocation).
- [ ] **MEM-03** — Add disk growth alerting. Track disk usage trend. If disk is growing > 1GB/day, alert. If disk > 85%, auto-trigger `system.disk-cleanup`. If > 90%, send critical notification. **Acceptance**: Alert fires when disk threshold crossed. Auto-cleanup runs at 85%; critical notification fires at 90%.
- [ ] **MEM-04** — Add systemd watchdog to archipelago service. In `archipelago.service`, add `WatchdogSec=60`. In the backend, implement `sd_notify(WATCHDOG=1)` every 30s via the `sd-notify` crate. If backend hangs (stops sending watchdog), systemd auto-restarts it. **Acceptance**: Kill the backend's main loop (not the process), verify systemd detects the hang and restarts within 90s.
- [ ] **MEM-05** — Run 7-day continuous monitoring on both nodes. Deploy uptime-monitor.sh on both nodes. Cron every 5 minutes. Track: HTTP status, response time, CPU, memory, disk, container count, restart count. After 7 days, generate summary. **Acceptance**: Both nodes maintain > 99.9% uptime (< 10 minutes total downtime including intentional tests). Zero OOM kills. Zero unexpected restarts.
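
The MEM-05 target can be checked with simple arithmetic: 7 days is 10,080 minutes, so 99.9% uptime leaves a budget of about 10.08 minutes of downtime. The sketch below turns that into two helpers for the 7-day summary; the function names are hypothetical, and downtime is assumed to be counted in whole minutes (for instance, failed 5-minute cron probes multiplied by 5).

```bash
# uptime_percent TOTAL_MINUTES DOWN_MINUTES
# Prints uptime as a percentage with four decimal places.
uptime_percent() {
  awk -v total="$1" -v down="$2" \
    'BEGIN { printf "%.4f\n", (total - down) / total * 100 }'
}

# meets_target TOTAL_MINUTES DOWN_MINUTES TARGET_PERCENT
# Succeeds only if uptime is strictly greater than TARGET_PERCENT.
meets_target() {
  awk -v total="$1" -v down="$2" -v target="$3" \
    'BEGIN { exit !((total - down) / total * 100 > target) }'
}

# Example: 10 minutes down over 7 days passes, 11 minutes does not.
#   meets_target 10080 10 99.9   # exit status 0
#   meets_target 10080 11 99.9   # exit status 1
```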
---
## Phase 6: did:dht & Interoperable Schemas (Week 15-20)
### Sprint 9: did:dht Implementation
- [ ] **DHT-01** Research and document did:dht integration approach. Study the did:dht spec (which uses the BitTorrent Mainline DHT). Document: how to publish DIDs to the DHT, how to resolve them, what library/crate to use (or implement), how it fits alongside existing did:key. Write to `docs/did-dht-integration.md`. **Acceptance**: Architecture document with specific implementation plan.
- [ ] **DHT-02** Implement did:dht creation in identity_manager.rs. Add `create_dht_did()` method that: (1) generates Ed25519 keypair, (2) creates a DNS packet encoding per did:dht spec, (3) publishes to Mainline DHT using a Rust BitTorrent DHT library (e.g., `mainline` crate). The node should have BOTH did:key (local, offline) and did:dht (discoverable, no server needed). Add `identity.create-dht-did` RPC endpoint. **Acceptance**: Can create a did:dht and resolve it from another machine using the DHT.
- [ ] **DHT-03** Implement did:dht resolution. Add `identity.resolve-dht-did` RPC endpoint that takes a did:dht identifier, queries the Mainline DHT, retrieves and parses the DNS packet, returns the DID Document. Cache resolved DIDs for 1 hour. **Acceptance**: Can resolve a did:dht created on .228 from .198 without Tor, without Nostr relays, using only the BitTorrent DHT.
- [ ] **DHT-04** Update Web5 UI for did:dht. Show both did:key and did:dht in the identity section. Add "Publish to DHT" button. Show DHT resolution status. **Acceptance**: Web5 page shows both DID types. DHT publish and resolve work from the UI.
### Sprint 10: DWN Protocol Definitions for Interoperable Schemas
- [ ] **SCHEMA-01** Define Archipelago DWN protocol schemas. Create protocol definitions for the data types Archipelago shares between nodes: (1) Node identity announcements, (2) File sharing catalogs, (3) Federation state, (4) App deployment requests. Use the DWN protocol definition format so other apps implementing DWN could read Archipelago data. Document in `docs/dwn-protocols.md`. **Acceptance**: 4 protocol definitions documented with JSON schemas.
- [ ] **SCHEMA-02** Register Archipelago protocols in DWN on both nodes. On startup, the backend should auto-register all 4 Archipelago protocols via `dwn.register-protocol`. Verify protocols are registered on both .228 and .198. **Acceptance**: `dwn.list-protocols` on both nodes shows all 4 Archipelago protocols.
- [ ] **SCHEMA-03** Migrate file sharing catalog to DWN protocol format. Instead of (or in addition to) the custom `content.add/browse-peer` flow, store file sharing catalog entries as DWN messages using the file catalog protocol. This makes the catalog queryable by any DWN-compatible app. **Acceptance**: File sharing still works between .228 and .198. Catalog entries are also available via `dwn.query-messages` with the file catalog protocol filter.
- [ ] **SCHEMA-04** Migrate federation state to DWN protocol format. Store federation node announcements as DWN messages. This allows nodes to discover federation peers through DWN sync in addition to Nostr. **Acceptance**: Federation still works. Node announcements are also available as DWN messages.
### Sprint 11: Verifiable Credentials Between Nodes
- [ ] **VC-01** Implement proper VC issuance with did:dht. Update `credentials.rs` to support did:dht as issuer/subject (currently only did:key). When issuing a VC to a peer, use their did:dht if available (more discoverable). **Acceptance**: Can issue a VC with did:dht issuer, verify it, and present it.
- [ ] **VC-02** Add inter-node identity verification VCs. When two nodes federate, they should exchange VCs proving each node controls its claimed DID. The VC attests: "did:dht:X is a trusted peer of did:dht:Y, established on DATE". Store these VCs in the DWN. **Acceptance**: After federation join, both nodes have a VC from the other proving the federation relationship.
- [ ] **VC-03** Add VC presentation in federation handshake. Update `federation.join` and `federation.get-state` to include VC presentations. Peers can verify the VC chain before trusting a node. **Acceptance**: Federation join includes VC exchange. `federation.list-nodes` includes VC verification status per peer.
- [ ] **VC-04** Test VC flow between .228 and .198 (10x). (1) Issue VC on .228 to .198's DID, (2) Verify VC on .198, (3) Create presentation on .198 including the VC, (4) Verify presentation on .228. Run 10 times each direction. **Acceptance**: 80 checks, all pass.
---
## Phase 7: Deploy Pipeline & ISO Hardening (Week 21-26)
### Sprint 12: Deploy Script Hardening
- [ ] **DEPLOY-01** Audit `deploy-to-target.sh` for reliability. Read the entire script. Check: error handling (`set -e`?), rollback on failure, health check after deploy, idempotency, atomic swaps for binary and frontend. Fix any issues. **Acceptance**: Deploy script has proper error handling, health verification, and rollback capability.
- [ ] **DEPLOY-02** Add canary deploy mode. Deploy to .198 first, run health checks, then deploy to .228. If .198 health fails, abort before touching .228. Add `--canary` flag to deploy script. **Acceptance**: `./scripts/deploy-to-target.sh --canary` deploys to .198, verifies, then .228.
- [ ] **DEPLOY-03** Add deploy rollback capability. Before deploying, back up the current binary and frontend. If the post-deploy health check is still failing after 60s, automatically roll back to the previous version. Store rollback artifacts in `/opt/archipelago/rollback/`. **Acceptance**: Intentionally deploy a broken binary. Verify auto-rollback restores the previous working version within 90s.
- [ ] **DEPLOY-04** Add `--dry-run` flag to deploy script. Show exactly what would be deployed (files, binary, configs) without actually deploying. **Acceptance**: `./scripts/deploy-to-target.sh --dry-run --live` shows the plan without executing.
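A minimal sketch of the canary ordering described in DEPLOY-02, not the actual `deploy-to-target.sh` logic. `deploy_node` and `health_ok` are hypothetical placeholders, and probing port 5678 at `/health` from the deploying machine is an assumption borrowed from the PERF-01 health URL.

```shell
# Hypothetical helpers: swap in the real deploy and health-check steps.
deploy_node() { echo "deploying to $1"; }
health_ok()   { curl -fsS --max-time 5 "http://$1:5678/health" >/dev/null; }

# Deploy to the canary first; touch the primary only if the canary is healthy.
canary_deploy() {
  canary=$1
  primary=$2
  deploy_node "$canary" || { echo "canary deploy failed" >&2; return 1; }
  if ! health_ok "$canary"; then
    echo "canary $canary unhealthy, aborting before $primary" >&2
    return 1
  fi
  deploy_node "$primary" || return 1
  health_ok "$primary"
}
```

Usage would look like `canary_deploy 192.168.1.198 192.168.1.228`. A non-zero exit means the primary was either never touched or failed its own health check.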
### Sprint 13: ISO Build Hardening
- [ ] **ISO-01** Audit ISO build script for all current apps. Verify `CAPTURE_PATTERNS` and `CONTAINER_IMAGES` in `build-auto-installer-iso.sh` include ALL apps currently running on .228 (33+ containers). Any missing container means a fresh install won't have that app. **Acceptance**: ISO capture list matches the full container inventory on .228.
- [ ] **ISO-02** Add swap file creation to first-boot. In the first-boot script, auto-create a swap file sized at 50% of RAM (min 2GB, max 8GB). Add to fstab. **Acceptance**: Fresh install from ISO has swap configured automatically.
- [ ] **ISO-03** Add container dependency ordering to first-boot. Same startup ordering as CONT-02 but for the `first-boot-containers.sh` script. **Acceptance**: Fresh install starts containers in dependency order with zero crash loops.
- [ ] **ISO-04** Test fresh install from ISO on physical hardware. Build ISO, flash to USB, install on test machine, verify: all containers start, health OK, can federate with .228, can browse files, DWN sync works. **Acceptance**: Fresh install works end-to-end without manual intervention.
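The swap sizing rule in ISO-02 (50% of RAM, min 2GB, max 8GB) can be sketched as below. The function names and the `/swapfile` default path are illustrative assumptions, and `fallocate` may not suit every filesystem, so treat `create_swapfile` as a starting point rather than the first-boot implementation.

```shell
# Swap size in GB: half of RAM, clamped to the 2GB..8GB range from ISO-02.
swap_size_gb() {
  ram_gb=$1
  size=$(( ram_gb / 2 ))
  if [ "$size" -lt 2 ]; then size=2; fi
  if [ "$size" -gt 8 ]; then size=8; fi
  echo "$size"
}

# Create and enable the swap file, then persist it in fstab (needs root).
# The default path is an assumption, not taken from the first-boot script.
create_swapfile() {
  size_gb=$1
  path=${2:-/swapfile}
  fallocate -l "${size_gb}G" "$path" &&
    chmod 600 "$path" &&
    mkswap "$path" &&
    swapon "$path" &&
    echo "$path none swap sw 0 0" >> /etc/fstab
}
```

With these rules, a 16GB node such as .228 gets 8GB of swap and an 8GB node such as .198 gets 4GB. RAM could be read with `awk '/MemTotal/ { print int($2 / 1048576 + 0.5) }' /proc/meminfo`.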
---
## Phase 8: Scale Testing for 10K Users (Week 27-36)
### Sprint 14: Resource Budget for 10K Users
- [ ] **SCALE-01** Create resource budget document. Based on current .228 metrics (33 containers, 6.5GB RAM, 1.2TB disk, load 5.44), calculate per-node resource requirements. Estimate: RAM per container (avg), disk per container, CPU per container. Project for 10K users across different hardware tiers. Document in `docs/scale-budget.md`. **Acceptance**: Document with clear resource requirements per hardware tier.
- [ ] **SCALE-02** Identify resource bottlenecks. Profile the top CPU and memory consumers. Current: immich_server (82% CPU spike), onlyoffice (759MB RAM), bitcoin-knots (750MB RAM), fedimint (369MB), lnd (250MB), homeassistant (234MB). Determine which apps should be optional vs core for a minimal install. **Acceptance**: Tiered app list: Core (must-have), Recommended, Optional. Core tier uses < 4GB RAM.
- [ ] **SCALE-03** Implement app tier system in backend. Add a `tier` field to app metadata: `core`, `recommended`, `optional`. First-install only installs core tier. Marketplace shows tier badges. Users choose additional tiers. **Acceptance**: Fresh install only starts core apps. Total RAM < 4GB for core tier.
- [ ] **SCALE-04** Add resource monitoring alerts for scale limits. Alert when: total container memory > 80% of system RAM, CPU load > 2x core count sustained for 5 min, disk > 80%. These proactive alerts prevent scale-related failures. **Acceptance**: Alerts fire at correct thresholds. Tested on both nodes.
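The three SCALE-04 thresholds can be expressed as small predicates. This is a sketch with hypothetical function names; how the backend actually collects the inputs is not specified here.

```shell
# Each function exits 0 when the alert should fire.

# Container memory above 80% of system RAM. Both inputs in MB.
mem_alert() { [ $(( $1 * 100 )) -gt $(( $2 * 80 )) ]; }

# Load average above 2x the core count. Load is fractional, so awk compares.
load_alert() { awk -v l="$1" -v c="$2" 'BEGIN { exit !(l > 2 * c) }'; }

# Disk usage above 80 percent. Input is an integer percentage.
disk_alert() { [ "$1" -gt 80 ]; }
```

Feeding `load_alert` the 5-minute figure (second field of `/proc/loadavg`) and the core count from `nproc` is one way to approximate "sustained for 5 min". Under these rules the 5.44 load recorded on the 4-core .228 would not fire, because the threshold there is 8.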
### Sprint 15: Automated Fleet Testing
- [ ] **FLEET-01** Create automated test-all-features script. `scripts/test-all-features.sh` that runs every feature test in sequence: system health, container lifecycle, federation, Tor, Nostr, file sharing, DWN sync, NIP-07, backup, monitoring, identity/VCs. Takes a target IP and runs all checks 10 times. **Acceptance**: One command validates an entire node. Exit 0 = production ready.
- [ ] **FLEET-02** Run test-all-features on .228. Execute the full test suite 10 iterations. Document any failures, fix them, rerun until 10/10 clean. **Acceptance**: 10 consecutive clean runs on .228.
- [ ] **FLEET-03** Run test-all-features on .198. Same as FLEET-02 but on .198. **Acceptance**: 10 consecutive clean runs on .198.
- [ ] **FLEET-04** Run cross-node test suite 10 times. Execute `test-cross-node.sh --iterations 10` covering all bidirectional tests. **Acceptance**: All cross-node tests pass 10/10 from both directions.
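One possible skeleton for the FLEET-01 runner: a loop that calls each check function against a target for N iterations and exits 0 only when nothing failed. `run_suite` and the check function names are hypothetical, and the real feature checks would replace the stubs.

```shell
# Run every named check against a target for N iterations.
# The body runs in a subshell so loop variables cannot leak into the checks' caller.
run_suite() (
  target=$1
  iterations=$2
  shift 2
  failures=0
  i=1
  while [ "$i" -le "$iterations" ]; do
    for check in "$@"; do
      if ! "$check" "$target"; then
        echo "FAIL iteration=$i check=$check target=$target" >&2
        failures=$(( failures + 1 ))
      fi
    done
    i=$(( i + 1 ))
  done
  echo "failures=$failures"
  [ "$failures" -eq 0 ]
)

# Example check (assumed endpoint, based on the PERF-01 health URL).
check_health() { curl -fsS --max-time 5 "http://$1:5678/health" >/dev/null; }
```

A call such as `run_suite 192.168.1.228 10 check_health` would then serve as the "one command" gate, with further checks appended as extra arguments.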
### Sprint 16: Long-Duration Soak Test
- [ ] **SOAK-01** Run 30-day soak test on both nodes. Deploy monitoring, leave both nodes running for 30 days. Monitor: uptime, memory trend (leak detection), disk growth, container restart counts, federation sync success rate, Tor uptime. **Acceptance**: Both nodes > 99.95% uptime. No memory leaks (RSS stable ±10% over 30 days). Zero unexpected restarts.
- [ ] **SOAK-02** Run hourly federation sync verification for 30 days. Cron job every hour: trigger federation sync, verify success, log result. After 30 days, calculate sync success rate. **Acceptance**: > 99% sync success rate over 30 days.
- [ ] **SOAK-03** Run daily reboot test for 30 days. Automated daily reboot at 4 AM, verify full recovery by 4:05 AM. Log recovery time each day. **Acceptance**: 30/30 successful recoveries. Average recovery < 120s.
- [ ] **SOAK-04** Compile final stability report. After 30-day soak, generate report: uptime %, memory trend, disk trend, federation reliability, container health, incident log. This becomes the go/no-go for declaring production ready. **Acceptance**: Report shows all metrics meeting production targets.
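The soak targets translate into concrete budgets, sketched below. The log format (one line per sync attempt, ending in `OK` or `FAIL`) and both function names are assumptions for illustration.

```shell
# Allowed downtime in minutes for an uptime percentage over a number of days.
downtime_budget_min() {
  awk -v pct="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - pct) / 100 }'
}

# Success rate in percent from a log whose lines end in OK or FAIL.
sync_success_rate() {
  awk '
    $NF == "OK"                  { ok++ }
    $NF == "OK" || $NF == "FAIL" { n++ }
    END {
      if (n) printf "%.2f\n", 100 * ok / n
      else   print "0.00"
    }
  ' "$1"
}
```

`downtime_budget_min 99.95 30` prints `21.6`, so the SOAK-01 target leaves about 21.6 minutes of downtime per node. If the SOAK-03 reboots ran in the same window and counted as downtime, 30 recoveries averaging close to 120s would use up to 60 minutes, so planned reboots likely need to be excluded or scheduled separately. For SOAK-02, 30 days of hourly runs is 720 attempts, which means more than 7 failures would miss the 99% target.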
---
## Phase 9: Production Polish (Week 37-44)
### Sprint 17: Performance Optimization
- [ ] **PERF-01** Optimize backend startup time. Target: < 3 seconds from process start to healthy response. Profile with tracing. Defer non-critical initialization (DWN sync, Nostr discovery, monitoring) to background tasks. **Acceptance**: `time curl http://localhost:5678/health` after restart < 3s.
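- [ ] **PERF-02** Optimize frontend bundle size. Target: < 500KB gzipped initial load. Analyze with vite-bundle-visualizer. Lazy-load heavy components (D3.js network map, monitoring charts). **Acceptance**: The initial-load JavaScript in `web/dist/neode-ui/assets/` totals < 500KB after gzip. (`ls -la web/dist/neode-ui/assets/*.js | awk '{sum+=$5}END{print sum}'` sums uncompressed bytes, not the gzipped size, so compress before measuring.)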
- [ ] **PERF-03** Optimize container image sizes. Pull all container images and check sizes. Replace any > 1GB images with smaller alternatives (alpine-based). Remove any cached layers for old versions. **Acceptance**: Total container image disk usage reduced by > 20%.
- [ ] **PERF-04** Add caching for RPC responses. Frequently-called read endpoints (`system.stats`, `container.list`, `federation.list-nodes`) should cache results for 5-10 seconds to reduce CPU. **Acceptance**: 100 concurrent `system.stats` calls complete in < 500ms total.
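For PERF-02's 500KB gzipped target, the size has to be measured after compression. A small sketch, with a hypothetical function name, that sums the gzipped size of every `.js` file in a directory:

```shell
# Sum the gzipped size, in bytes, of every .js file directly inside a directory.
gzipped_js_bytes() {
  total=0
  for f in "$1"/*.js; do
    [ -f "$f" ] || continue          # skip the literal pattern when nothing matches
    size=$(gzip -c "$f" | wc -c)
    total=$(( total + size ))
  done
  echo "$total"
}
```

A check might then read `[ "$(gzipped_js_bytes web/dist/neode-ui/assets)" -lt 512000 ]`, assuming 500KB means 500 x 1024 bytes. Because this sums every chunk, it overstates the initial load once heavy components are lazy-loaded. Restricting it to the entry chunks would be closer to the stated target.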
### Sprint 18: Documentation Update
- [ ] **DOC-01** Update CHANGELOG.md for v1.2.0. Document all changes from this hardening cycle: crash loop fixes, cross-node testing, did:dht, DWN protocols, VCs, reboot hardening, memory/swap fixes. **Acceptance**: CHANGELOG updated with all changes.
- [ ] **DOC-02** Update architecture.md for current state. The current doc references StartOS, Docker, macOS. Update to reflect: Debian 12, Podman, multi-node federation, did:dht, DWN protocols. **Acceptance**: Architecture doc matches actual system.
- [ ] **DOC-03** Update current-state.md. Remove references to StartOS dependencies (already removed). Document actual current state: pure Archipelago backend, Podman, 33+ containers, 2-node federation. **Acceptance**: current-state.md reflects reality.
- [ ] **DOC-04** Create operations runbook. `docs/operations-runbook.md` covering: how to check node health, how to fix crashed containers, how to add/remove federation peers, how to rotate Tor address, how to create/restore backups, how to update, how to diagnose high CPU/memory. **Acceptance**: Runbook covers top 20 operational scenarios.
---
## Phase 10: Year 2-5 Roadmap (Month 13-60)
### Year 2 (2027): Multi-Hardware & Community
- [ ] **Y2-01** Test and certify on 5 hardware platforms: generic x86_64 PC, Intel NUC, Raspberry Pi 5, mini-PC (N100), used ThinkCentre. Document per-platform quirks. **Acceptance**: ISO boots and works on all 5 platforms.
- [ ] **Y2-02** Community app submission pipeline. Automated review of community-submitted app manifests: security scan, resource check, dependency validation, sandbox test. **Acceptance**: Community can submit apps via PR, automated checks run, maintainer approves.
- [ ] **Y2-03** Multi-language support. Translate UI to 5 languages (Spanish, Portuguese, German, French, Japanese) using the i18n infrastructure already in place. **Acceptance**: Language selector in Settings, all strings translated.
- [ ] **Y2-04** Mobile companion app (read-only). Progressive Web App or native app that connects to node over Tailscale/Tor and shows: dashboard, container status, notifications. No mutations: read-only for safety. **Acceptance**: Can view node status from phone.
### Year 3 (2028): Enterprise & Scale
- [ ] **Y3-01** Multi-user support. Add user roles (admin, viewer, app-user). Admin can manage everything. Viewer sees dashboard only. App-user accesses specific apps. **Acceptance**: 3 user roles with proper permission separation.
- [ ] **Y3-02** Automated backup to S3-compatible storage. In addition to USB backup, support backup to any S3 endpoint (Backblaze B2, Wasabi, self-hosted MinIO). Encrypted before upload. **Acceptance**: Backup to S3 works, restore from S3 works.
- [ ] **Y3-03** Cluster mode for high availability. 3+ nodes form a cluster where apps have replicas. If one node goes down, apps fail over to another. Uses Raft or similar consensus. **Acceptance**: Stop one node in a 3-node cluster; apps continue serving from the remaining nodes.
- [ ] **Y3-04** Hardware attestation with TPM 2.0. Nodes with TPM chips can cryptographically prove their hardware identity. Adds trust layer to federation. **Acceptance**: TPM-equipped node includes hardware attestation in its DID Document.
### Year 4 (2029): Ecosystem & Market
- [ ] **Y4-01** App developer SDK. Command-line tool for app developers: `archy-dev create`, `archy-dev test`, `archy-dev publish`. Scaffolds manifest, runs security checks, publishes to marketplace. **Acceptance**: Developer can publish a new app in under 30 minutes using the SDK.
- [ ] **Y4-02** Paid app marketplace. Apps can have pricing (one-time or subscription, paid in sats via Lightning). Revenue split between developer and node operator. Uses Cashu or Lightning invoices. **Acceptance**: End-to-end payment flow works.
- [ ] **Y4-03** Node analytics dashboard (opt-in). Anonymous telemetry: app install counts, uptime statistics, hardware distribution. Helps prioritize development. Strictly opt-in. **Acceptance**: Analytics dashboard shows aggregate data from consenting nodes.
- [ ] **Y4-04** Cross-chain support (Monero, Liquid). Add support for Monero full node and Liquid sidechain containers. Federation supports multi-chain status reporting. **Acceptance**: Can run Bitcoin + Monero + Liquid on same node.
### Year 5 (2030-2031): Production at Scale
- [ ] **Y5-01** Achieve 10,000 active nodes. Track via opt-in analytics. Support infrastructure: documentation, community forum, bug tracker, release automation. **Acceptance**: 10K+ nodes running Archipelago, measured via marketplace relay or opt-in telemetry.
- [ ] **Y5-02** Zero-downtime updates. Update mechanism that migrates containers one-by-one with health checks between each. No service interruption during update. **Acceptance**: Update from v2.x to v2.y with zero downtime measured by external monitor.
- [ ] **Y5-03** Formal security audit by third party. Engage professional security firm to audit: backend code, container isolation, authentication, cryptography, network security. Fix all findings. **Acceptance**: Clean audit report with no critical/high findings.
- [ ] **Y5-04** v3.0 release with all Year 5 features. Stable, audited, scale-tested release for mass adoption. **Acceptance**: Tagged v3.0.0 release with full documentation and ISO downloads.
---
## Test Matrix Summary
| Test Category | # Checks | Directions | Iterations | Total Passes Required |
|---|---|---|---|---|
| System Health (US-01) | 6 | x2 | x10 | 120 |
| Container Lifecycle (US-02) | 4 | x2 | x10 | 80 |
| Federation Join (US-03) | 4 | x2 | x10 | 80 |
| Federation Sync (US-04) | 4 | x2 | x10 | 80 |
| Tor Hidden Services (US-05) | 3 | x2 | x10 | 60 |
| Nostr Discovery (US-06) | 4 | x2 | x10 | 80 |
| File Sharing (US-07) | 5 | x2 | x10 | 100 |
| DWN Sync (US-08) | 5 | x2 | x10 | 100 |
| NIP-07 Signing (US-09) | 4 | x2 | x10 | 80 |
| Backup/Restore (US-10) | 4 | x2 | x10 | 80 |
| Boot Recovery (US-15) | 5 | x2 | x3 | 30 |
| **TOTAL** | **48** | | | **890** |
Every single one of these 890 test passes must succeed before declaring production-ready.
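The totals can be re-derived from the table, since each category contributes checks x 2 directions x iterations. The snippet below is only a cross-check of that arithmetic; `MATRIX` and `matrix_totals` are illustrative names.

```shell
# Category, checks, iterations (copied from the table; every row runs in 2 directions).
MATRIX='US-01 6 10
US-02 4 10
US-03 4 10
US-04 4 10
US-05 3 10
US-06 4 10
US-07 5 10
US-08 5 10
US-09 4 10
US-10 4 10
US-15 5 3'

# Print "<total checks> <total passes>".
matrix_totals() {
  printf '%s\n' "$MATRIX" |
    awk '{ checks += $2; passes += $2 * 2 * $3 } END { print checks, passes }'
}
```

`matrix_totals` prints `48 890`, matching the TOTAL row.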
---
## Milestone Summary
| Date | Milestone | Key Deliverables |
|---|---|---|
| Mar 2026 Week 2 | Phase 1 Complete | Crash loops fixed, .198 stabilized, federation established |
| Mar 2026 Week 4 | Phase 2 Complete | 890 cross-node test passes, bulletproof test harness |
| Apr 2026 Week 2 | Phase 3 Complete | UI cosmetic cleanup, zero fake data, zero TypeScript errors |
| May 2026 | Phase 4 Complete | Container reliability, security audit, log rotation |
| Jun 2026 | Phase 5 Complete | 10x reboot survival, memory monitoring, systemd watchdog |
| Aug 2026 | Phase 6 Complete | did:dht, DWN interoperable schemas, VCs between nodes |
| Oct 2026 | Phase 7 Complete | Deploy pipeline hardened, ISO verified |
| Jan 2027 | Phase 8 Complete | 30-day soak test passed, scale budget documented |
| Apr 2027 | Phase 9 Complete | Performance optimized, docs updated, v1.2.0 tagged |
| 2028 | Year 2 Complete | Multi-hardware, community apps, mobile companion |
| 2029 | Year 3 Complete | Multi-user, S3 backup, cluster HA, TPM attestation |
| 2030 | Year 4 Complete | App SDK, paid marketplace, cross-chain |
| 2031 | **Year 5 Complete** | **10K users, zero-downtime updates, security audit, v3.0** |
---
## Execution Instructions
For each task in order:
1. Find the first unchecked `- [ ]` item
2. Read the task description and acceptance criteria carefully
3. Read ALL relevant source files before making changes
4. Implement following CLAUDE.md conventions strictly
5. For frontend changes: `cd neode-ui && npm run type-check && npm run build`, deploy with `./scripts/deploy-to-target.sh --both`
6. For backend changes: deploy with `./scripts/deploy-to-target.sh --both` (builds on server, not macOS)
7. For test scripts: create on local, rsync to server, run via SSH
8. Verify acceptance criteria are met ON BOTH SERVERS
9. Mark it done `- [x]` in this file
10. Commit: `type: description`
11. Move to the next unchecked task immediately
**CRITICAL**: Every change must be deployed to BOTH .228 AND .198. Tests must pass from BOTH directions.
**Total tasks**: 98 across 18 sprints over 5 years.
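Step 1 of the loop can be mechanized with `grep`. A sketch, assuming every task line starts with `- [ ] ` at column 0 as in this file. The function names are hypothetical and the plan path is passed as an argument.

```shell
# Print "<line number>:<task line>" for the first unchecked task, or nothing if all are done.
next_task() { grep -n -m 1 '^- \[ ] ' "$1"; }

# Count the unchecked tasks that remain.
remaining_tasks() { grep -c '^- \[ ] ' "$1"; }
```

`next_task` exits non-zero when no unchecked task is left, which gives a natural stop condition for the loop.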