# ADR-007: DID-Based Federation Trust
## Status
Accepted
## Context
Archipelago supports federation: multiple nodes form a trusted group for remote monitoring, app deployment, and state synchronization. Federation needs a mechanism for establishing trust between nodes. Three options were considered:
- **Centralized PKI** (Certificate Authorities): requires internet access and introduces third-party trust
- **Pre-shared keys**: simple, but does not scale and provides no identity verification
- **DID-based bilateral verification**: each node verifies the other's cryptographic identity directly
## Decision
Use **bilateral DID-based verification** with single-use invite codes for federation trust establishment.
### How It Works
1. **Node A** generates a single-use invite code containing its DID, .onion address, and a shared secret
2. **Node B** receives the code (out-of-band: QR code, message, etc.) and submits it
3. **Both nodes** verify each other's DIDs by exchanging signed challenges over Tor
4. **Trust is established** — each node stores the other's DID and public key
5. **Ongoing communication** uses DID-authenticated messages over Tor hidden services
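
The invite-code part of the flow above (steps 1 and 2) can be sketched as follows. This is a minimal illustration, not the project's implementation: the JSON-over-base64 encoding, the `InviteIssuer` and `parse_invite` names, and the 32-byte secret length are all assumptions. The ADR only states that a code carries the DID, the .onion address, and a shared secret, and that it is single-use.

```python
import base64
import hashlib
import json
import secrets
from dataclasses import dataclass, field


@dataclass
class InviteIssuer:
    """Node A's side: issues invite codes and redeems each one at most once."""

    did: str
    onion: str
    # SHA-256 digests of secrets that were issued and not yet redeemed.
    _pending: set = field(default_factory=set, repr=False)

    def issue(self) -> str:
        """Create a new invite code (encoding is hypothetical)."""
        secret = secrets.token_bytes(32)
        self._pending.add(hashlib.sha256(secret).hexdigest())
        payload = {"did": self.did, "onion": self.onion, "secret": secret.hex()}
        raw = json.dumps(payload, separators=(",", ":")).encode()
        return base64.urlsafe_b64encode(raw).decode()

    def redeem(self, secret_hex: str) -> bool:
        """Return True only the first time a valid secret is presented."""
        try:
            digest = hashlib.sha256(bytes.fromhex(secret_hex)).hexdigest()
        except ValueError:
            return False  # not valid hex, so it cannot be one of our secrets
        if digest in self._pending:
            self._pending.remove(digest)  # consuming it makes the code single-use
            return True
        return False


def parse_invite(code: str) -> dict:
    """Node B's side: decode an invite code back into its three fields."""
    payload = json.loads(base64.urlsafe_b64decode(code.encode()))
    missing = {"did", "onion", "secret"} - payload.keys()
    if missing:
        raise ValueError(f"invite is missing fields: {sorted(missing)}")
    return payload
```

In this sketch Node A remembers only a hash of each outstanding secret and deletes it on first use, so presenting the same code a second time fails. The signed-challenge exchange in step 3 would follow a successful `redeem`.
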
### Trust Levels
- **Trusted**: Full access — can view status, deploy apps, sync state
- **Observer**: Read-only access — can view status but not modify
- **Untrusted**: Blocked from federation operations
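
One way to read the three levels is as a fixed mapping from level to permitted operations. The sketch below assumes that reading. The operation names (`view_status`, `deploy_app`, `sync_state`) are hypothetical labels for the capabilities listed above, not identifiers from the codebase.

```python
from enum import Enum


class TrustLevel(Enum):
    TRUSTED = "trusted"
    OBSERVER = "observer"
    UNTRUSTED = "untrusted"


# Hypothetical operation names standing in for "view status",
# "deploy apps", and "sync state" from the list above.
ALLOWED_OPERATIONS = {
    TrustLevel.TRUSTED: frozenset({"view_status", "deploy_app", "sync_state"}),
    TrustLevel.OBSERVER: frozenset({"view_status"}),
    TrustLevel.UNTRUSTED: frozenset(),
}


def is_allowed(level: TrustLevel, operation: str) -> bool:
    """Return True if a peer at `level` may perform `operation`."""
    return operation in ALLOWED_OPERATIONS[level]
```

Because unknown operation names are in none of the sets, this sketch denies anything not explicitly listed.
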
## Consequences
### Positive
- No third-party trust dependency (no CA, no central server)
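- Verification depends on no external service: the invite code travels out-of-band, and the signed-challenge exchange needs only a Tor route between the two nodes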
- Strong cryptographic identity (Ed25519 keys)
- Granular trust levels for different access patterns
- Invite codes are single-use, so a code that has already been redeemed cannot be replayed
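
The signed-challenge exchange that backs the identity and replay properties above could look roughly like the sketch below. It is illustrative only. The message layout in `challenge_message`, the `ChallengeVerifier` class, and the one-outstanding-nonce-per-peer policy are assumptions. The signature check is injected as a function, because Python's standard library has no Ed25519; the `demo_sign` and `demo_verify` helpers are an HMAC-SHA256 stand-in so the example runs, and they are not the scheme the ADR specifies.

```python
import hashlib
import hmac
import secrets
from typing import Callable, Dict

# (public_key, message, signature) -> bool. A real node would wrap an
# Ed25519 verify call here; which library provides it is left open.
VerifyFn = Callable[[bytes, bytes, bytes], bool]


def challenge_message(signer_did: str, nonce: bytes) -> bytes:
    """Bytes the challenged node signs. Binding the DID to the nonce is an
    assumption of this sketch, not something the ADR specifies."""
    return b"federation-challenge\x00" + signer_did.encode() + b"\x00" + nonce


class ChallengeVerifier:
    """Issues one fresh nonce per peer and accepts at most one response to it."""

    def __init__(self, verify: VerifyFn) -> None:
        self._verify = verify
        self._outstanding: Dict[str, bytes] = {}  # peer DID -> nonce

    def new_challenge(self, peer_did: str) -> bytes:
        nonce = secrets.token_bytes(32)
        self._outstanding[peer_did] = nonce  # replaces any older challenge
        return nonce

    def check_response(self, peer_did: str, public_key: bytes, signature: bytes) -> bool:
        # pop() consumes the nonce, so a replayed response has nothing to match.
        nonce = self._outstanding.pop(peer_did, None)
        if nonce is None:
            return False
        return self._verify(public_key, challenge_message(peer_did, nonce), signature)


# --- Demo only: HMAC-SHA256 stand-in so the sketch runs on the stdlib. ---
# This is NOT Ed25519: the "public key" below is really a shared key.
def demo_sign(key: bytes, message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()


def demo_verify(key: bytes, message: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(demo_sign(key, message), signature)
```

Each node would run this in both directions: A challenges B and B challenges A. A captured signature is useless later, because every challenge uses a new random nonce and each nonce is discarded after one check.
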
### Negative
- Requires out-of-band code exchange (can't auto-discover peers for federation)
- No revocation mechanism beyond removing the peer from the local trust store
- Key rotation requires re-establishing trust with all peers
- Trust is bilateral: each node maintains its own trust decisions, so there is no shared, federation-wide view of who is trusted
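
The revocation and key-rotation drawbacks both follow from the trust store being purely local. The sketch below assumes a simple in-memory map from DID to public key; the `TrustStore` class and its method names are hypothetical, not taken from the codebase.

```python
from typing import Dict, Optional


class TrustStore:
    """Local record of verified peers: DID -> public key bytes."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def add(self, did: str, public_key: bytes) -> None:
        """Record a peer after the bilateral verification has succeeded."""
        self._keys[did] = public_key

    def remove(self, did: str) -> bool:
        """The only form of revocation: forget the peer on this node.
        Other nodes that trust the same peer are not affected."""
        return self._keys.pop(did, None) is not None

    def accepts(self, did: str, presented_key: bytes) -> bool:
        """A message is accepted only if it carries the stored key."""
        stored: Optional[bytes] = self._keys.get(did)
        return stored is not None and stored == presented_key
```

In this model a peer that rotates its key presents a key the store does not recognize, so it is rejected until trust is re-established and `add` is called again with the new key.
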
### Mitigations
- Nostr-based node discovery (ADR-003) handles finding nodes; federation handles trusting them
- Tor hidden services provide transport encryption and anonymity
- State sync includes heartbeat/health checks to detect unreachable peers
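
Heartbeat-based detection of unreachable peers can be sketched as a last-seen table with a timeout. The `HeartbeatTracker` name and the 90-second default are placeholders chosen for illustration; the ADR does not state an interval or threshold.

```python
import time
from typing import Callable, Dict, List


class HeartbeatTracker:
    """Tracks when each peer was last heard from and flags silent ones."""

    def __init__(self, timeout_s: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s  # hypothetical threshold
        self._clock = clock          # injectable so it can be tested
        self._last_seen: Dict[str, float] = {}

    def record(self, peer_did: str) -> None:
        """Call whenever a heartbeat or any valid message arrives from a peer."""
        self._last_seen[peer_did] = self._clock()

    def unreachable(self) -> List[str]:
        """Return the DIDs of peers silent for longer than the timeout."""
        now = self._clock()
        return sorted(did for did, seen in self._last_seen.items()
                      if now - seen > self._timeout_s)
```

Using a monotonic clock avoids false alarms when the system clock is adjusted. What a node does with an unreachable peer (retry, alert, pause sync) is outside this sketch.
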