# ADR-004: Tor Hidden Services for Peer Communication

**Status**: Accepted

**Date**: 2026-03

## Context

Federated nodes need to communicate directly for state sync, app deployment, and peer verification. Options: direct IP, VPN tunnel, Tor hidden services, I2P.

## Decision

Use Tor hidden services (.onion addresses) for all inter-node communication.
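
In practice a node reaches a peer's .onion address through the local Tor daemon's SOCKS5 proxy, passing the hostname unresolved so that Tor itself handles the lookup. The following is a minimal Python sketch of that handshake (RFC 1928), assuming the daemon listens on Tor's default SOCKS port `127.0.0.1:9050`. The names `open_onion_stream` and `socks5_connect_request` are illustrative and not taken from this project's code.

```python
import socket
import struct

# Assumption: the Tor daemon exposes its default SOCKS5 listener here.
TOR_SOCKS_ADDR = ("127.0.0.1", 9050)


def socks5_connect_request(host: str, port: int) -> bytes:
    """Build a SOCKS5 CONNECT request that addresses the target by hostname."""
    name = host.encode("ascii")
    if not 1 <= len(name) <= 255:
        raise ValueError("hostname must be 1-255 bytes")
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name), name length
    header = struct.pack("!BBBBB", 5, 1, 0, 3, len(name))
    return header + name + struct.pack("!H", port)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or raise if the proxy closes the connection."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("SOCKS5 proxy closed the connection")
        buf += chunk
    return buf


def open_onion_stream(onion_host: str, port: int, timeout: float = 60.0) -> socket.socket:
    """Open a TCP stream to an onion service via the local Tor SOCKS5 proxy."""
    sock = socket.create_connection(TOR_SOCKS_ADDR, timeout=timeout)
    try:
        # Greeting: version 5, one auth method offered, 0x00 = no authentication
        sock.sendall(b"\x05\x01\x00")
        if _recv_exact(sock, 2) != b"\x05\x00":
            raise ConnectionError("SOCKS5 proxy rejected the no-auth method")
        sock.sendall(socks5_connect_request(onion_host, port))
        reply = _recv_exact(sock, 4)
        if reply[0] != 5 or reply[1] != 0:
            raise ConnectionError(f"SOCKS5 connect failed with code {reply[1]}")
        # Consume the bound address that follows the reply header.
        atyp = reply[3]
        if atyp == 1:
            addr_len = 4
        elif atyp == 4:
            addr_len = 16
        elif atyp == 3:
            addr_len = _recv_exact(sock, 1)[0]
        else:
            raise ConnectionError(f"unexpected SOCKS5 address type {atyp}")
        _recv_exact(sock, addr_len + 2)  # address plus 2-byte port
        return sock
    except BaseException:
        sock.close()
        raise
```

The generous default timeout reflects that building a circuit to an onion service can take several seconds. The returned socket behaves like any other TCP stream, so an RPC layer can run on top of it unchanged.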

## Consequences

### Positive
- **NAT traversal**: Works behind most NATs and firewalls without port forwarding, because a node only makes outbound connections to the Tor network
- **IP privacy**: Nodes never expose their real IP addresses to each other
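- **End-to-end encryption**: Onion-service connections are encrypted end to end and authenticated by the .onion address itself, so no additional TLS setup is needed
- **Censorship resistance**: Onion routing hides which peers a node talks to, which makes targeted blocking and traffic analysis difficult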
- **Stable addressing**: .onion addresses persist across IP changes and network migrations
- **No central infrastructure**: No project-operated VPN server, STUN/TURN server, or relay is needed; the public Tor network carries the traffic

### Negative
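- **Latency**: Tor adds roughly 200-500ms per hop, and an onion-service connection traverses about six relays (three on each side of the rendezvous point), so the delay is noticeable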
- **Bandwidth**: Tor network has limited bandwidth; not suitable for bulk data transfer
- **Reliability**: Tor circuits can break; connections may need retry logic
- **Setup complexity**: Requires running a Tor daemon (`archy-tor` container)
- **Blocked networks**: Some networks block Tor; bridges can help but add complexity

### Mitigation
- Use Tor only for the RPC/control plane; bulk data (container images) is pulled from registries instead
- Implement retry with backoff for Tor connections
- The `archy-tor` container starts automatically and uses host networking for hidden service access
- Federation sync interval (5 min) tolerates occasional connection failures
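
The retry mitigation can be sketched as capped exponential backoff with jitter. This is a minimal Python illustration under stated assumptions, not the project's implementation: the names `backoff_delays` and `call_with_retry`, the 2-second base, the 60-second cap, and the five attempts are all hypothetical defaults.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 2.0, cap: float = 60.0, count: int = 4) -> Iterator[float]:
    """Yield `count` delays that double from `base` and never exceed `cap`."""
    for attempt in range(count):
        yield min(cap, base * (2 ** attempt))


def call_with_retry(
    op: Callable[[], T],
    *,
    base: float = 2.0,
    cap: float = 60.0,
    attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Run `op`, retrying on network errors with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = list(backoff_delays(base, cap, attempts - 1))
    for attempt in range(attempts):
        try:
            return op()
        except OSError as err:  # covers ConnectionError and socket timeouts
            if attempt == attempts - 1:
                raise ConnectionError(
                    f"peer unreachable after {attempts} attempts"
                ) from err
            # Sleep between 50% and 100% of the nominal delay so that many
            # nodes retrying at once do not hit the peer in lockstep.
            sleep(delays[attempt] * (0.5 + 0.5 * jitter()))
    raise AssertionError("unreachable")
```

With these assumed defaults the worst-case time spent sleeping is 2 + 4 + 8 + 16 = 30 seconds, which stays well inside the 5-minute federation sync interval. A peer that is still unreachable after the last attempt is simply picked up again on the next sync.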