release(v1.7.41-alpha): post-OTA auto-rollback so a bad release cannot strand the fleet

Closes failure mode FM5 from docs/bulletproof-containers.md: the v1.7.38 +
v1.7.39 rollouts left every affected node on an unreachable UI (nginx 500)
with no recovery path short of SSH. This release adds a self-check
guardrail to the update flow.

What changed:
- apply_update() writes a pending-verify marker with old+new version and
  a 150s deadline immediately before scheduling the service restart.
- verify_pending_update() runs from main.rs startup. If the marker is
  present and within its freshness window, the new binary waits 15s for
  nginx + backend to settle, then probes https://127.0.0.1/ every 5s for
  up to 90s (self-signed certs accepted).
- On any probe success within the window, the marker is cleared and
  nothing else happens.
- On window-exhaust, the new binary:
    1. Moves the broken /opt/archipelago/web-ui to web-ui.failed.<ts>
       (quarantined, not deleted, so we can post-mortem).
    2. Restores web-ui.bak on top of web-ui.
    3. Calls rollback_update() to restore the previous binary.
    4. Updates state.current_version to reflect the rollback.
    5. systemctl --no-block restart archipelago so the OLD binary boots.
- Markers older than 10 minutes are treated as stale and cleared without
  probing, so a crashed-during-startup marker from weeks ago cannot
  spontaneously roll back a healthy node on a later reboot.
- rollback_update() binary copy now goes through host_sudo instead of
  tokio::fs::copy, so it escapes the service's ProtectSystem=strict
  mount namespace. Without this, the rollback silently failed with
  EROFS on /usr/local/bin and orphaned the rollback - the exact
  opposite of what auto-rollback is for.

Tests: 4 new unit tests in update::tests covering marker round-trip,
absent-marker noop, no-panic on verify_pending_update with nothing to
verify, and an invariant assert that the 90s probe window stays below
the 600s stale threshold. All passing.

Side fix: scripts/create-release-manifest.sh was dying with exit 141
(SIGPIPE from the tar tvzf | head | awk pipeline) under set -euo pipefail.
Replaced with a single awk NR==1 that doesn't short-circuit the upstream
pipe, so the release-build flow is idempotent again.
archipelago 2026-04-22 16:14:35 -04:00
parent 50744952b7
commit 048679065e
11 changed files with 645 additions and 24 deletions

core/Cargo.lock generated

@ -80,7 +80,7 @@ checksum = "a23eb6b1614318a8071c9b2521f36b424b2c83db5eb3a0fead4a6c0809af6e61"
[[package]]
name = "archipelago"
version = "1.7.40-alpha"
version = "1.7.41-alpha"
dependencies = [
"anyhow",
"archipelago-container",


@ -1,6 +1,6 @@
[package]
name = "archipelago"
version = "1.7.40-alpha"
version = "1.7.41-alpha"
edition = "2021"
description = "Archipelago Bitcoin Node OS - Native backend"
authors = ["Archipelago Team"]


@ -196,6 +196,19 @@ async fn main() -> Result<()> {
// on a 30s poll of fips0 — so a post-onboarding fips.install brings it
// online without needing an archipelago restart.
// Post-OTA verification: if apply_update() wrote a pending-verify
// marker right before the restart, probe the frontend now and auto-
// rollback if it's broken. This is the guardrail that stops fleet-
// wide breakage when an OTA lands a subtly-bad release (v1.7.38/39
// tarball-perms → nginx 500 was the trigger). Runs concurrently
// with normal startup — doesn't delay the server coming up.
{
let data_dir = config.data_dir.clone();
tokio::spawn(async move {
update::verify_pending_update(&data_dir).await;
});
}
// Spawn background update scheduler
let update_data_dir = config.data_dir.clone();
tokio::spawn(async move {


@ -74,6 +74,23 @@ const DEFAULT_TERTIARY_MIRROR_URL: &str =
"http://146.59.87.168:3000/lfg2025/archy/raw/branch/main/releases/manifest.json";
const UPDATE_STATE_FILE: &str = "update_state.json";
const UPDATE_MIRRORS_FILE: &str = "update-mirrors.json";
/// Marker written by apply_update() just before the service restart and
/// consumed by verify_pending_update() in the NEW binary's startup path.
/// If present, the new binary probes the frontend; if the probe fails,
/// rollback_update() runs and the service restarts on the old binary.
/// Closes the "OTA broke nginx fleet-wide with no auto-rollback" failure
/// mode from 2026-04-22 (v1.7.38/39 tarball-perms bug).
const PENDING_VERIFY_FILE: &str = "update-pending-verify.json";
/// Total probe window for the frontend health check, including
/// retries. Generous: the new binary has to come fully up, the health
/// monitor has to settle, and nginx has to re-read any snippet changes.
/// 90s is comfortably longer than the slowest observed startup.
const PENDING_VERIFY_WINDOW_SECS: u64 = 90;
/// If the marker is older than this on read, treat it as stale and
/// delete it without probing. Prevents a node that never ran
/// verification (e.g. it crashed during startup) from spontaneously
/// rolling back days later when the user reboots.
const PENDING_VERIFY_MAX_AGE_SECS: i64 = 600;
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct UpdateMirror {
@ -268,6 +285,189 @@ impl Default for UpdateState {
}
}
/// Marker written by apply_update() just before the service restart and
/// consumed by verify_pending_update() in the NEW binary's startup path.
/// See PENDING_VERIFY_FILE for the full rationale — this is the hook
/// that turns "nginx 500 on every page after OTA" from an unrecoverable
/// field incident into an automatic rollback.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct PendingVerification {
/// RFC3339 timestamp of the apply that wrote this marker.
pub applied_at: String,
/// Version we just applied (what the NEW binary should be running).
pub new_version: String,
/// Version the outgoing binary was running (what we roll back to).
pub previous_version: String,
/// Unix epoch seconds after which the probe should give up and
/// trigger rollback. NOTE: verify_pending_update() does not currently
/// read this field; it bounds probing with PENDING_VERIFY_WINDOW_SECS.
pub deadline_ts: i64,
}
async fn write_pending_verification(
data_dir: &Path,
marker: &PendingVerification,
) -> Result<()> {
let path = data_dir.join(PENDING_VERIFY_FILE);
let data = serde_json::to_string_pretty(marker)
.context("serialize pending-verify marker")?;
fs::write(&path, data)
.await
.with_context(|| format!("write pending-verify marker to {}", path.display()))?;
Ok(())
}
async fn read_pending_verification(data_dir: &Path) -> Option<PendingVerification> {
let path = data_dir.join(PENDING_VERIFY_FILE);
let data = fs::read_to_string(&path).await.ok()?;
serde_json::from_str(&data).ok()
}
async fn clear_pending_verification(data_dir: &Path) {
let path = data_dir.join(PENDING_VERIFY_FILE);
let _ = fs::remove_file(&path).await;
}
/// Probe the local frontend through nginx once. Returns Ok(()) if the
/// response is 2xx or 3xx; errors on timeout / connection refused /
/// any 4xx/5xx. Invalid certs are accepted (danger_accept_invalid_certs)
/// because nodes use a self-signed cert that reqwest's default root set
/// doesn't trust.
async fn probe_frontend_once() -> Result<()> {
let client = reqwest::Client::builder()
.danger_accept_invalid_certs(true)
.timeout(std::time::Duration::from_secs(5))
.build()
.context("build probe client")?;
// Prefer HTTPS since that's the failure mode we're catching (nginx
// 500 on the PWA). HTTP usually redirects to HTTPS and would mask
// the bug.
let resp = client
.get("https://127.0.0.1/")
.send()
.await
.context("probe GET https://127.0.0.1/")?;
let status = resp.status();
if status.is_success() || status.is_redirection() {
return Ok(());
}
anyhow::bail!("frontend probe returned HTTP {}", status);
}
/// Called from main.rs startup. If a pending-verification marker is
/// present, probe the frontend; on failure, trigger rollback and
/// restart the service so the OLD binary boots.
///
/// This is the "post-OTA auto-rollback" guardrail. If ANY problem in
/// the new version takes down the PWA (bad tarball perms as in v1.7.38,
/// a broken service worker, a missing asset, a backend panic on first
/// boot), the node self-heals back to the previous working state
/// without SSH intervention.
pub async fn verify_pending_update(data_dir: &Path) {
let marker = match read_pending_verification(data_dir).await {
Some(m) => m,
None => return, // No update pending; nothing to verify.
};
// Guard against a marker left behind by some earlier crash path —
// don't want a user who reboots days later to suddenly get
// rolled back because the marker was never cleared.
let applied_at = chrono::DateTime::parse_from_rfc3339(&marker.applied_at);
if let Ok(ts) = applied_at {
let age = chrono::Utc::now() - ts.with_timezone(&chrono::Utc);
if age.num_seconds() > PENDING_VERIFY_MAX_AGE_SECS {
tracing::warn!(
age_secs = age.num_seconds(),
"pending-verify marker is stale, clearing without probing"
);
clear_pending_verification(data_dir).await;
return;
}
}
info!(
new_version = %marker.new_version,
previous_version = %marker.previous_version,
"Post-OTA verification: probing frontend at https://127.0.0.1/"
);
// Give the new service time to bind its listeners + nginx to
// pick up any config changes. 15s matches what we observed on
// .116 during the v1.7.40 rollout recovery.
tokio::time::sleep(std::time::Duration::from_secs(15)).await;
let deadline =
std::time::Instant::now() + std::time::Duration::from_secs(PENDING_VERIFY_WINDOW_SECS);
let mut attempt = 0u32;
let mut last_err: Option<String> = None;
while std::time::Instant::now() < deadline {
attempt += 1;
match probe_frontend_once().await {
Ok(()) => {
info!(
attempt,
"Post-OTA verification succeeded — clearing marker"
);
clear_pending_verification(data_dir).await;
return;
}
Err(e) => {
let msg = e.to_string();
tracing::warn!(attempt, error = %msg, "Post-OTA probe failed, retrying");
last_err = Some(msg);
}
}
tokio::time::sleep(std::time::Duration::from_secs(5)).await;
}
tracing::error!(
attempts = attempt,
window_secs = PENDING_VERIFY_WINDOW_SECS,
last_error = last_err.as_deref().unwrap_or(""),
new_version = %marker.new_version,
previous_version = %marker.previous_version,
"Post-OTA verification FAILED — rolling back"
);
// Restore web-ui.bak on top of web-ui. update.rs keeps web-ui.bak
// from the previous apply; moving it back is the frontend half of
// the rollback. The binary half is handled by rollback_update().
let web_ui_bak = Path::new("/opt/archipelago/web-ui.bak");
let web_ui = "/opt/archipelago/web-ui";
if web_ui_bak.exists() {
let ts = chrono::Utc::now().timestamp_millis();
let quarantine = format!("/opt/archipelago/web-ui.failed.{}", ts);
let _ = host_sudo(&["mv", web_ui, &quarantine]).await;
let _ = host_sudo(&["mv", web_ui_bak.to_str().unwrap_or(""), web_ui]).await;
tracing::info!(quarantined = %quarantine, "Restored web-ui from web-ui.bak");
} else {
tracing::warn!(
"web-ui.bak not present — frontend cannot be rolled back, only binary"
);
}
if let Err(e) = rollback_update(data_dir).await {
tracing::error!(error = %e, "rollback_update() failed during post-OTA verification");
// Leave the marker in place so a future boot gets another shot.
return;
}
clear_pending_verification(data_dir).await;
// Record the rolled-back version so state matches the binary that boots next.
if let Ok(mut state) = load_state(data_dir).await {
state.current_version = marker.previous_version.clone();
if let Err(e) = save_state(data_dir, &state).await {
tracing::warn!(error = %e, "Failed to update state after rollback");
}
}
// Restart so the old binary takes over. --no-block because we're
// the service; systemd can't wait for us to exit before starting
// the old process.
let _ = host_sudo(&["systemctl", "--no-block", "restart", "archipelago"]).await;
}
pub async fn load_state(data_dir: &Path) -> Result<UpdateState> {
let path = data_dir.join(UPDATE_STATE_FILE);
if !path.exists() {
@ -985,15 +1185,42 @@ pub async fn apply_update(data_dir: &Path) -> Result<()> {
}
// Update state
let previous_version = {
let state = load_state(data_dir).await?;
state.current_version.clone()
};
let mut state = load_state(data_dir).await?;
if let Some(manifest) = &state.available_update {
let new_version = if let Some(manifest) = &state.available_update {
state.current_version = manifest.version.clone();
}
manifest.version.clone()
} else {
state.current_version.clone()
};
state.available_update = None;
state.update_in_progress = false;
state.rollback_available = true;
save_state(data_dir, &state).await?;
// Write the post-OTA verification marker BEFORE we schedule the
// restart. The new binary will read it on startup, probe the
// frontend, and auto-rollback if nginx is serving 5xx. Covers the
// class of failure where "apply succeeds, restart succeeds, but
// the UI is dead" (v1.7.38/39 tarball-perms bug). Best-effort —
// a failed marker write shouldn't abort the apply.
let marker = PendingVerification {
applied_at: chrono::Utc::now().to_rfc3339(),
new_version,
previous_version,
deadline_ts: chrono::Utc::now().timestamp()
+ PENDING_VERIFY_WINDOW_SECS as i64
+ 60,
};
if let Err(e) = write_pending_verification(data_dir, &marker).await {
tracing::warn!(error = %e, "Failed to write post-OTA verify marker — rollback disabled for this OTA");
} else {
info!("Post-OTA verify marker written; new binary will probe on boot");
}
// Clean staging
let _ = fs::remove_dir_all(&staging_dir).await;
@ -1023,9 +1250,24 @@ pub async fn rollback_update(data_dir: &Path) -> Result<()> {
let backup_binary = backup_dir.join("archipelago");
if backup_binary.exists() {
fs::copy(&backup_binary, "/usr/local/bin/archipelago")
// Use host_sudo + cp so we escape the archipelago service's
// ProtectSystem=strict mount namespace. A plain fs::copy or
// `sudo cp` from inside the service hits EROFS on /usr/local/bin,
// which would silently orphan the rollback — exactly the
// opposite of what auto-rollback is for. Pattern matches
// apply_update()'s binary swap above.
let backup_str = backup_binary.to_string_lossy().to_string();
let _ = host_sudo(&["chmod", "0755", &backup_str]).await;
let _ = host_sudo(&["chown", "root:root", &backup_str]).await;
let status = host_sudo(&["cp", &backup_str, "/usr/local/bin/archipelago"])
.await
.context("Failed to restore backup binary")?;
.context("Failed to restore backup binary via host_sudo")?;
if !status.success() {
anyhow::bail!(
"cp backup binary into /usr/local/bin failed (exit {:?})",
status.code()
);
}
info!("Binary rolled back to previous version");
}
@ -1449,4 +1691,45 @@ mod tests {
assert_eq!(status.current_version, env!("CARGO_PKG_VERSION"));
assert!(status.rollback_available);
}
#[tokio::test]
async fn test_pending_verification_round_trip() {
let dir = tempfile::tempdir().unwrap();
let marker = PendingVerification {
applied_at: chrono::Utc::now().to_rfc3339(),
new_version: "1.7.41-alpha".into(),
previous_version: "1.7.40-alpha".into(),
deadline_ts: chrono::Utc::now().timestamp() + 150,
};
write_pending_verification(dir.path(), &marker).await.unwrap();
let read = read_pending_verification(dir.path()).await.unwrap();
assert_eq!(read.new_version, "1.7.41-alpha");
assert_eq!(read.previous_version, "1.7.40-alpha");
clear_pending_verification(dir.path()).await;
assert!(read_pending_verification(dir.path()).await.is_none());
}
#[tokio::test]
async fn test_pending_verification_absent_is_none() {
let dir = tempfile::tempdir().unwrap();
assert!(read_pending_verification(dir.path()).await.is_none());
}
#[tokio::test]
async fn test_verify_pending_update_noop_without_marker() {
let dir = tempfile::tempdir().unwrap();
// No marker written -- must return quickly without doing anything
// risky (network probes, rollback calls). We're just asserting
// it doesn't panic or hang.
verify_pending_update(dir.path()).await;
}
#[test]
fn test_pending_verify_constants_are_sensible() {
// Window must be generous enough for nginx + backend startup,
// but less than the stale-marker threshold so a normal cycle
// can complete without the marker being considered stale.
assert!(PENDING_VERIFY_WINDOW_SECS < PENDING_VERIFY_MAX_AGE_SECS as u64);
assert!(PENDING_VERIFY_WINDOW_SECS >= 60);
}
}


@ -0,0 +1,314 @@
# Bulletproof Containers for Beta
**Status**: plan agreed 2026-04-22, implementation started.
**Target**: zero-manual-intervention container lifecycle for the beta launch. A user installs, uninstalls, reboots, updates, or loses power — every combination must leave the node in a known-good state without SSH.
**Project memory**: `~/.claude/projects/-home-archipelago-Projects-archy/memory/project_reconcile_architecture.md`
**Failure log**: `~/.claude/projects/-home-archipelago-Projects-archy/memory/feedback_container_lifecycle_failure_modes.md`
---
## Why we're doing this
The v1.7.38 and v1.7.39 rollouts on 2026-04-22 exposed a cluster of container-lifecycle failures that required manual SSH recovery on every affected node (.116, .198, .228, .253). If a user had been on those nodes, they'd have been stuck with "can't reach" or 500 errors and no path forward. We can't ship beta with this class of failure on the table.
The pattern under every failure: **the canonical source of truth had the right answer, but derived state drifted away from it and nothing noticed or fixed it.**
### The six failure modes
| # | Symptom | Root cause |
|---|---|---|
| FM1 | `archy-bitcoin-ui` + `archy-lnd-ui` disappeared from `podman ps -a` after a daemon restart | Archipelago owns container creation imperatively; no owner recreates companions after a crash mid-transition |
| FM2 | ElectrumX "Daemon connection problem" | `bitcoin.conf`'s `rpcauth` drifted from `/var/lib/archipelago/secrets/bitcoin-rpc-password` — config written once at install, never re-derived |
| FM3 | archipelago.service `status=226/NAMESPACE` crash-loop SIGKILL'd every child container | Containers were children of archipelago's cgroup; systemd teardown killed them. `KillMode=control-group` default |
| FM4 | `host.containers.internal` inside containers resolved to LAN gateway (192.168.1.254) | Known podman bug on bridge networks pre-5.3 ([#22644](https://github.com/containers/podman/issues/22644)) |
| FM5 | Nginx 500 fleet-wide after OTA | Tarball root dir was `drwx------` (700), extracted identically on every node. Fixed in v1.7.40 at build time; still need post-OTA auto-rollback |
| FM6 | Rootless podman's `libpod/bolt_state.db` vanished → whole registry node unreachable | No detection of corrupt state; required manual `rm -rf /run/user/$UID/libpod` + `podman system renumber` |
---
## Architecture decision
**Adopt balena-style, level-triggered, desired-state reconciler built on Quadlet + sdnotify.**
This is the one architecture that would have prevented all six failures, because each one is "reality drifted from the intended config and nothing noticed" — the exact problem reconcilers are designed for.
### Why not the alternatives
- **Keep imperative + patch per-failure** — we've been doing this. Five releases in a day. Doesn't scale.
- **Migrate to LXC (StartOS's path)** — 6-month project. Our investment in podman (`install.rs`, `docker_packages.rs`, `image_versions.rs`) is substantial. Quadlet gives us StartOS's isolation property without the migration.
- **Ship k3s / MicroShift** — 400-800 MB RAM baseline on top of bitcoind/electrs. Overkill for a home node OS.
- **Edge-triggered like Umbrel** — their `app.ts` has an explicit TODO admitting they don't handle failure events. We'd inherit the same bug class.
### The four patterns (from mature players)
1. **Desired-state-first, level-triggered reconcile.** balena-supervisor, Kubernetes operators, NixOS. A supervisor owns a manifest of *what should run*; on every tick it diffs against *what is running* and issues steps.
2. **Every container is its own systemd unit, not a child of the daemon.** Red Hat's Quadlet pattern: a `.container` file is parsed by a systemd *generator* into a normal `.service`. The daemon can crash without taking any containers with it.
3. **sdnotify readiness + HealthCmd + rollback.** Podman v3.4+ has real rollback: bad image fails health check, systemd considers service failed, Podman re-tags the previous image digest.
4. **Credentials and config derived from canonical secrets on every apply.** Not trusted across upgrades; re-rendered idempotently from single source of truth.
### Fix-per-failure
| Failure | Fix |
|---|---|
| FM1 | Move companions to Quadlet `.container` files in `/etc/containers/systemd/`. systemd (not archipelago) owns them |
| FM2 | `reconcile::derived::render_bitcoin_conf(secrets)` — pure function, runs every tick, atomic rewrite + HUP on drift |
| FM3 | `KillMode=mixed` in archipelago.service + containers in their own `archipelago-apps.slice`. Quadlet units already live outside archipelago's cgroup |
| FM4 | Ship `/etc/containers/containers.conf` with `host_containers_internal_ip = "10.89.0.1"` + `default_rootless_network_cmd = "pasta"`; also `--add-host=host.archipelago:10.89.0.1` in every unit |
| FM5 | Post-OTA `curl -k https://127.0.0.1/` health probe in new binary startup. If no 2xx/3xx response arrives within 90s, roll back to `web-ui.bak` + binary-backup |
| FM6 | Startup probe: `podman info` with timeout. On "invalid internal status", clear `/run/user/$UID/{containers,libpod,podman}` + `podman system renumber` + reconcile tick rebuilds from Quadlet units |
---
## New code layout (lands in v1.7.48)
```
core/archipelago/src/reconcile/
  mod.rs       run_reconcile_loop, reconcile_once (called from main.rs)
  desired.rs   DesiredState built from packages.json + catalog + secrets
  current.rs   snapshot via `systemctl list-units archy-*.service` + `podman ps -a --format json`
  diff.rs      pure: reconcile(desired, current) -> Vec<Step> (unit-testable without podman)
  apply.rs     step executor with timeouts, structured logs, backoff
  quadlet.rs   write `.container` / `.volume` / `.network` units atomically
  derived.rs   render_bitcoin_conf, render_containers_conf, render_nginx_app_routes
  backoff.rs   restart-history tracking (moved from health_monitor.rs)
```
### Step types (idempotent)
```rust
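use std::path::PathBuf;

// Field types are illustrative; the plan only fixes the variant names.
enum Step {
    WriteQuadletUnit { path: PathBuf, content: String },
    WriteDerivedFile { path: PathBuf, content: String },
    WriteSecret { path: PathBuf, content: String },
    DaemonReload,
    EnsureStarted { unit: String },
    StopUnit { unit: String },
    RestartUnit { unit: String },
    PullImage { image_ref: String }, // `ref` is a reserved word in Rust
}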
```
### Triggers
- 30s interval tick
- install/uninstall RPC
- update-applied event
- explicit `/rpc/v1/reconcile.tick`
- podman event stream (if available)
Level-triggered + idempotent — every call considers full desired vs current diff. Missed ticks/events are irrelevant.
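A minimal sketch of what the pure diff in `diff.rs` could look like, assuming a reduced state model. The `Desired` enum, the two-variant `Step`, the `bool` "is running" snapshot, and the rule that units absent from the desired set get stopped are all simplifications made for illustration, not the planned types.

```rust
use std::collections::BTreeMap;

/// Desired lifecycle for one unit (simplified stand-in for AppSpec.desired_state).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Desired {
    Started,
    UserStopped,
}

/// Reduced, hypothetical step set; the planned enum has more variants.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Step {
    EnsureStarted(String),
    StopUnit(String),
}

/// Pure diff: compare the full desired state with the full observed state
/// and emit the steps that close the gap. `current` maps unit name to
/// "is it running right now".
fn reconcile(desired: &BTreeMap<String, Desired>, current: &BTreeMap<String, bool>) -> Vec<Step> {
    let mut steps = Vec::new();
    for (unit, want) in desired {
        let running = current.get(unit).copied().unwrap_or(false);
        match (want, running) {
            (Desired::Started, false) => steps.push(Step::EnsureStarted(unit.clone())),
            (Desired::UserStopped, true) => steps.push(Step::StopUnit(unit.clone())),
            _ => {} // already converged, nothing to do
        }
    }
    // Assumption: anything running that is not in the desired set is stopped.
    for (unit, running) in current {
        if *running && !desired.contains_key(unit) {
            steps.push(Step::StopUnit(unit.clone()));
        }
    }
    steps
}
```

Because the function looks only at the two snapshots and never at what changed since the last call, running it again after a missed tick or a dropped event produces the same steps, and a converged system yields an empty list.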
### Edits to existing code
- **`src/main.rs`**: replace `tokio::spawn(crash_recovery::start_stopped_containers)` with `tokio::spawn(reconcile::run_reconcile_loop(state))`. Keep self-heal perms + PID-marker crash detection.
- **`src/api/rpc/package/install.rs`**: stop calling `podman run` directly. Writes desired state + Quadlet unit + signals reconciler. Reconciler does pull + `systemctl start`.
- **`src/api/rpc/package/runtime.rs`** + `lifecycle.rs` + `stacks.rs`: same pattern — mutate desired state, reconciler applies.
- **`src/crash_recovery.rs`**: keep PID-marker + snapshot. Delete `start_stopped_containers` (reconciler handles cold boot). Keep `user-stopped.json` as `AppSpec.desired_state: Started | UserStopped | Uninstalled`.
- **`src/health_monitor.rs`**: strip restart logic. Keep memory-leak detection; push unhealthy events as `Trigger::ContainerUnhealthy(name)`.
- **`src/bitcoin_rpc.rs`**: add `pub fn derive_rpcauth_line(user, pass) -> String` (HMAC-SHA256 per Bitcoin Core's `rpcauth.py`).
- **`src/update.rs`**: post-swap health probe + auto-rollback (v1.7.41).
---
## Shipping order
Each release is independently deployable. Not a big-bang rewrite.
### v1.7.41 — Post-OTA health probe + auto-rollback (closes FM5)
- In `update.rs`: write `/var/lib/archipelago/update-pending-verify.json` just before service restart, with `applied_at`, `new_version`, `previous_version`, deadline.
- In `main.rs` startup: read marker, spawn verification task. Wait 15s for full startup, then `curl -k https://127.0.0.1/` with retries up to 90s.
- On a 2xx/3xx response: delete marker.
- If no 2xx/3xx response arrives within the window: restore `web-ui.bak`, call `rollback_update(data_dir)` (already exists), and restart the service to boot the old binary.
- Smallest diff, highest ROI.
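The retry-until-deadline shape of the probe can be sketched with the standard library alone. This is an illustration, not the shipped code: `Verdict` and `verify_within_window` are hypothetical names, the probe is passed in as a closure so no HTTP client is needed, and any 2xx or 3xx status is assumed to count as healthy (as `probe_frontend_once` does in this commit's `update.rs`).

```rust
use std::time::{Duration, Instant};

/// Outcome of the post-OTA verification window.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Healthy { attempts: u32 },
    RollBack { attempts: u32 },
}

/// Call `probe` repeatedly until it reports a healthy status or `window`
/// elapses. `probe` returns the HTTP status code, or an error string for
/// timeouts and refused connections.
fn verify_within_window<F>(mut probe: F, window: Duration, interval: Duration) -> Verdict
where
    F: FnMut() -> Result<u16, String>,
{
    let deadline = Instant::now() + window;
    let mut attempts = 0u32;
    while Instant::now() < deadline {
        attempts += 1;
        // 2xx and 3xx count as healthy; anything else, or an error, retries.
        if let Ok(status) = probe() {
            if (200..400).contains(&status) {
                return Verdict::Healthy { attempts };
            }
        }
        std::thread::sleep(interval);
    }
    Verdict::RollBack { attempts }
}
```

With the values from this plan the call would use a 90s window and a 5s interval, after the initial 15s settle delay. The caller deletes the marker on `Healthy` and starts the rollback on `RollBack`.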
### v1.7.42 — containers.conf + host.archipelago alias (closes FM4)
- Idempotent write of `/etc/containers/containers.conf` on startup (archipelago compares hash, rewrites only on drift).
- Add `--add-host=host.archipelago:10.89.0.1` to every generated container in `install.rs` / `docker_packages.rs`.
- ElectrumX `DAEMON_URL` migrates from `host.containers.internal` → `host.archipelago`.
### v1.7.43 — `reconcile::derived` for bitcoin.conf / lnd.conf (closes FM2)
- Pure function `render_bitcoin_conf(secrets) -> String`.
- Tick every 30s: read secret, derive `rpcauth`, compare to on-disk, atomic rewrite (via `tempfile::NamedTempFile::persist`) + `podman exec ... kill -HUP 1` on drift.
- Same pattern for `lnd.conf`.
- First user of the eventual `reconcile::` module — ships the `derived.rs` piece early.
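The compare-then-atomically-rewrite step could look roughly like the sketch below. It swaps in plain `std::fs::write` plus `std::fs::rename` where the plan names `tempfile::NamedTempFile::persist`, so that the example needs no external crate. `write_if_drifted` is a hypothetical helper name, and the sketch neither calls `fsync` nor sends the HUP.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Rewrite `path` only when its content differs from `rendered`.
/// Returns Ok(true) if the file was (re)written, Ok(false) if it already matched.
fn write_if_drifted(path: &Path, rendered: &str) -> io::Result<bool> {
    match fs::read_to_string(path) {
        Ok(on_disk) if on_disk == rendered => return Ok(false),
        Ok(_) => {}                                          // drifted: fall through and rewrite
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}  // missing: create it
        Err(e) => return Err(e),
    }
    // Write a sibling temp file, then rename it over the target. A rename
    // within one filesystem is atomic, so readers see either the old or the
    // new content, never a partial write.
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, rendered)?;
    fs::rename(&tmp, path)?;
    Ok(true)
}
```

The boolean return lets the tick send the HUP only when something actually changed, so a steady-state tick does no writes and no restarts.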
### v1.7.44 — Podman state self-heal on startup (closes FM6)
- Startup probe: `podman info --format '{{.Host.OS}}'` with 10s timeout.
- On "invalid internal status" or similar:
  - `systemctl --user stop podman.socket podman.service`
  - `rm -rf /run/user/$UID/{containers,libpod,podman}`
  - `podman system renumber`
  - Trigger reconcile tick (will rebuild containers from their source of truth)
- Surface clear error on `/health` if recovery fails — don't silently serve 502.
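One way to keep this self-heal testable is to separate the decision from the execution. The sketch below is illustrative only: `ProbeOutcome`, `needs_state_reset` and `recovery_commands` are hypothetical names, and treating a probe timeout as grounds for a reset is an assumption that the plan's "or similar" leaves open.

```rust
/// Result of running the startup probe (`podman info`) under a timeout.
enum ProbeOutcome {
    Ok,
    TimedOut,
    Failed { stderr: String },
}

/// Decide whether podman's runtime state should be wiped and rebuilt.
fn needs_state_reset(outcome: &ProbeOutcome) -> bool {
    match outcome {
        ProbeOutcome::Ok => false,
        // Assumption: a hung `podman info` is treated like corrupt state.
        ProbeOutcome::TimedOut => true,
        ProbeOutcome::Failed { stderr } => stderr.contains("invalid internal status"),
    }
}

fn args(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// The recovery sequence from the plan as argv vectors. The brace expansion
/// is spelled out because no shell is involved when spawning directly.
fn recovery_commands(uid: u32) -> Vec<Vec<String>> {
    let run_dir = format!("/run/user/{uid}");
    vec![
        args(&["systemctl", "--user", "stop", "podman.socket", "podman.service"]),
        vec![
            "rm".to_string(),
            "-rf".to_string(),
            format!("{run_dir}/containers"),
            format!("{run_dir}/libpod"),
            format!("{run_dir}/podman"),
        ],
        args(&["podman", "system", "renumber"]),
    ]
}
```

Returning the commands as data means a unit test can assert on the exact paths to be deleted without running `rm -rf` on a developer machine.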
### v1.7.45–47 — Quadlet migration per companion (closes FM1 + FM3)
One companion per release so regressions have a narrow blame window:
- **v1.7.45**: `archy-bitcoin-ui` → Quadlet `.container` unit
- **v1.7.46**: `archy-lnd-ui` → Quadlet
- **v1.7.47**: `archy-electrs-ui` → Quadlet
Each:
1. Write `.container` file to `/etc/containers/systemd/<name>.container`
2. `systemctl daemon-reload`
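3. `systemctl start <name>.service` (generated Quadlet units cannot be `systemctl enable`d; start-on-boot comes from an `[Install]` section in the `.container` file)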
4. Remove the `podman run` path from `install.rs` for that name
5. Add Goss probe for the lifecycle test matrix
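Step 1 amounts to rendering a small INI-style file. A rough sketch of such a renderer follows. The `Companion` struct, the image name and the port mapping used in the usage note are placeholders, and `Restart=always` and `WantedBy=default.target` are assumed values rather than decisions from this plan. `ContainerName=`, `Image=`, `PublishPort=` and `PodmanArgs=` are keys documented in podman-systemd.unit(5).

```rust
/// Inputs for one companion container. Hypothetical struct for illustration.
struct Companion<'a> {
    name: &'a str,
    image: &'a str,
    publish_port: &'a str,
}

/// Render the text of `/etc/containers/systemd/<name>.container`.
fn render_quadlet_unit(c: &Companion) -> String {
    let lines = [
        "[Unit]".to_string(),
        format!("Description={} (managed by archipelago)", c.name),
        String::new(),
        "[Container]".to_string(),
        format!("ContainerName={}", c.name),
        format!("Image={}", c.image),
        format!("PublishPort={}", c.publish_port),
        // Same host alias the plan adds to every unit.
        "PodmanArgs=--add-host=host.archipelago:10.89.0.1".to_string(),
        String::new(),
        "[Service]".to_string(),
        "Restart=always".to_string(), // assumed restart policy
        String::new(),
        "[Install]".to_string(),
        "WantedBy=default.target".to_string(), // assumed target
    ];
    let mut unit = lines.join("\n");
    unit.push('\n');
    unit
}
```

For example, `render_quadlet_unit(&Companion { name: "archy-bitcoin-ui", image: "localhost/archy-bitcoin-ui:latest", publish_port: "127.0.0.1:8080:80" })` yields the unit text that would then be written atomically before `systemctl daemon-reload`.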
### v1.7.48+ — Full reconcile module
- `core/archipelago/src/reconcile/` replaces imperative `install.rs` container management.
- Main app containers (bitcoin-knots, bitcoin-core, lnd, electrumx, btcpay-server, mempool, fedimint) become Quadlet units.
- `install.rs` shrinks to ~300 lines of "write desired state, poke reconciler."
- Biggest diff, lands last.
---
## Test harness (parallel track)
### Stack
- **Outer runner**: `bats-core` — TAP-style bash testing, readable by anyone
- **Verifier**: `goss` — YAML assertions on ports, processes, HTTP endpoints, files. Reused by CI + live probe
- **Chaos layer**: Chaos Toolkit JSON experiments (steady-state-hypothesis → method → rollback → verify)
- **VM layer**: `vmtest` (Go) for reboot-survival + ISO-boot tests, or raw QEMU+SSH
- **Tor probe**: curl through archipelago's own tor SOCKS5 (`--socks5-hostname 127.0.0.1:9050`), 60-180s retry window
- **Live probe**: small Rust agent on every fleet node, ships same Goss YAMLs to Prometheus. Neither Umbrel nor StartOS has this — real differentiator.
- **Reproducibility**: btrfs subvolume snapshots primary (fast), QEMU qcow2 for ISO/kernel-level repro
### Directory layout
```
tests/lifecycle/
  bats/
    _helpers.bash                        # install_app, wait_healthy, assert_no_orphans
    00_bootstrap.bats
    10_install.bats                      # per-app install
    20_ui_reachable.bats                 # direct port + HTTPS proxy + iframe
    30_tor_reachable.bats                # .onion probe
    40_stop_start.bats
    50_restart.bats
    60_reboot.bats                       # vmtest-driven
    70_reinstall.bats                    # idempotence + data preservation
    80_uninstall.bats                    # leak check
    90_soak.bats                         # 2-6h hold, periodic probe
  goss/
    bitcoin-knots.yaml
    bitcoin-core.yaml
    lnd.yaml
    electrumx.yaml
    btcpay-server.yaml
    mempool.yaml
    fedimint.yaml
  chaos/
    kill9_archipelago_mid_install.json
    wipe_bolt_db.json
    kill9_bitcoind.json
    reboot_during_ota.json
    corrupt_bitcoin_conf.json
    systemctl_restart_mid_install.json
    fill_disk_99_percent.json
    kill_tor.json
    delete_nginx_snippet.json
    clock_jump_30min.json
  vm/
    iso_boot_smoke.go
    reboot_survival.go
  ci/
    vm_runner.sh
    collect_artifacts.sh
  probe/archy-probe/                     # Rust bin, reuses goss YAMLs, ships to fleet
  Makefile                               # `make beta-matrix`, `make chaos`, `make soak`
```
### Minimum beta matrix
7 apps × 9 lifecycle events × 10 chaos scenarios. Pass = every MUST-ship cell green on fresh rootless-podman single-node CI.
| Case \ App | knots | core | lnd | electrumx | btcpay | mempool | fedimint |
|---|---|---|---|---|---|---|---|
| Fresh install | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| UI direct port | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| UI HTTPS proxy | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| UI iframe | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Tor .onion reachable | ✓ | ✓ | ✓ | — | ✓ | ✓ | ✓ |
| Stop → ports released | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Restart → integrations | — | — | ✓↔btc | ✓↔btc | ✓↔btc,lnd | ✓↔electrs | — |
| Reboot survival | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Reinstall idempotent | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Uninstall no orphans | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 6h soak | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
**Harness scaffold lands in v1.7.41.** First lifecycle tests blocking v1.7.45. Full matrix + chaos suite blocking beta tag.
### Chaos scenarios (10)
Ordered by likelihood × severity:
1. `kill -9 archipelagod` mid-install → systemd restart, in-flight install resumes or cleanly rolls back
2. `rm bolt_state.db` while service stopped → restart regenerates, no data loss in named volumes
3. `systemctl restart archipelago` mid-install → no orphans, no half-state
4. Reboot mid-OTA → old version intact OR new version active, never half
5. Corrupt `bitcoin.conf` → container restart-loops; UI surfaces banner; reconcile re-derives; other apps unaffected
6. Fill `/var` to 99% → graceful degradation, disk-pressure report
7. Revoke rootless-netns → self-heal within Tor descriptor window
8. `pkill -9 tor` → supervisor restarts; onions reachable within 3–5 min
9. Delete nginx conf snippet → reconciler rewrites or `archipelago doctor` flags drift
10. Clock jump +30min → daemons survive; Tor recovers
---
## Decision log
| Decision | Answer | Rationale |
|---|---|---|
| Scope | 6+ incremental releases, not big-bang rewrite | Each closes one failure class, narrow blame window |
| Quadlet migration | Yes | Isolation from daemon crashes, systemd-native recovery, free from Red Hat's production patterns. Minimum podman version becomes 4.4+ (fine for modern Debian) |
| Live probe to Prometheus | Yes, part of beta | Genuine differentiator — neither Umbrel nor StartOS has this. Adds Grafana dep |
| Test gating | Scaffold in v1.7.41, first tests blocking v1.7.45, full matrix blocking beta tag | Gradual rather than all-or-nothing |
---
## Key sources
### Architecture
- Umbrel [app.ts](https://raw.githubusercontent.com/getumbrel/umbrel/master/packages/umbreld/source/modules/apps/app.ts) — edge-triggered, TODO on failure handling
- StartOS [repo](https://github.com/Start9Labs/start-os), [v0.4 podman→LXC announce](https://community.start9.com/t/startos-v0-4-0-alpha-10-has-replaced-podman-new-commands-for-terminal/4062)
- balena-supervisor [repo](https://github.com/balena-os/balena-supervisor), [Supervisor API](https://docs.balena.io/reference/supervisor/supervisor-api)
- Quadlet: [Dan Walsh 2023 blog](https://www.redhat.com/en/blog/quadlet-podman), [podman-systemd.unit(5)](https://docs.podman.io/en/latest/markdown/podman-systemd.unit.5.html)
- Podman rollback: [auto-update blog](https://www.redhat.com/en/blog/podman-auto-updates-rollbacks), [podman-auto-update(1)](https://docs.podman.io/en/latest/markdown/podman-auto-update.1.html)
- Kubernetes operator pattern: [Kubebuilder reconcile](https://deepwiki.com/kubernetes-sigs/kubebuilder/5.2-reconciliation-loop), [good practices](https://book.kubebuilder.io/reference/good-practices)
- NixOS containers: [wiki](https://wiki.nixos.org/wiki/NixOS_Containers)
### Known bugs & references
- `host.containers.internal` → LAN: [podman #22644](https://github.com/containers/podman/issues/22644), [#23782](https://github.com/containers/podman/issues/23782)
- `bolt_state.db` recovery: [podman #17730](https://github.com/containers/podman/issues/17730), [staticdir mismatch #20872](https://github.com/containers/podman/issues/20872)
- aardvark-dns flakiness: [#20396](https://github.com/containers/podman/issues/20396), [#22407](https://github.com/containers/podman/issues/22407)
- systemd 226/NAMESPACE: [Arch forum](https://bbs.archlinux.org/viewtopic.php?id=156963), [systemd #29526](https://github.com/systemd/systemd/issues/29526)
- [systemd CGROUP_DELEGATION](https://systemd.io/CGROUP_DELEGATION/), [systemd.kill(5)](https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html)
### Test harness prior art
- Umbrel [ci.yml](https://github.com/getumbrel/umbrel/blob/master/.github/workflows/ci.yml) — Vitest + qemu matrix fan-out
- [YunoHost package_check](https://github.com/YunoHost/package_check) — closest analog, scored per-app lifecycle harness on LXC
- [bats-core](https://github.com/bats-core/bats-core)
- [Goss](https://github.com/goss-org/goss), [dgoss](https://github.com/aelsabbahy/goss-docker)
- [Chaos Toolkit](https://chaostoolkit.org/)
- [vmtest (Go)](https://github.com/anatol/vmtest)
### Tor
- [rend-spec-v3](https://github.com/torproject/torspec/blob/main/rend-spec-v3.txt) — descriptor lifetime + republish cadence
- [stem](https://stem.torproject.org/) — Python Tor controller for `HS_DESC UPLOADED` waits
---
## To resume
1. Read project memory: `~/.claude/projects/-home-archipelago-Projects-archy/memory/project_reconcile_architecture.md`
2. Read failure-mode memory: `~/.claude/projects/-home-archipelago-Projects-archy/memory/feedback_container_lifecycle_failure_modes.md`
3. Check task list for current release (should start with v1.7.41)
4. Current state on fleet as of 2026-04-22:
   - All 4 mirrors (tx1138, gitea-local, .160, .168) synced to v1.7.40-alpha
   - .116, .198, .228, .253 healed manually via `systemd-run chmod 755 /opt/archipelago/web-ui`
   - .228 still has stale `bitcoin.conf` rpcauth (regenerated during triage; will drift again until v1.7.43)
   - .228 UI companions (archy-bitcoin-ui, archy-lnd-ui) keep vanishing (Quadlet migration in v1.7.45+ fixes)
   - .160 Gitea required `podman system renumber` recovery (v1.7.44 automates this)
5. Implementation is in progress on `main` branch — next edit is `core/archipelago/src/update.rs` for v1.7.41.

@@ -1,7 +1,7 @@
{
"name": "neode-ui",
"private": true,
-"version": "1.7.40-alpha",
+"version": "1.7.41-alpha",
"type": "module",
"scripts": {
"start": "./start-dev.sh",

@@ -180,6 +180,16 @@ init()
</button>
</div>
<div class="overflow-y-auto flex-1 min-h-0 space-y-6 pr-1">
+<!-- v1.7.41-alpha -->
+<div>
+<div class="flex items-center gap-2 mb-3">
+<span class="text-xs font-mono px-2 py-0.5 rounded bg-orange-500/20 text-orange-300">v1.7.41-alpha</span>
+<span class="text-xs text-white/40">Apr 22, 2026</span>
+</div>
+<div class="space-y-3 text-sm text-white/80 pl-3 border-l border-white/10">
+<p>Updates now self-check. After an update lands, the node probes its own web UI through nginx. If the frontend isn't answering cleanly within 90 seconds, the node automatically rolls back to the previous version and restarts. A bad release can no longer leave the fleet stranded on an unreachable node.</p>
+</div>
+</div>
<!-- v1.7.40-alpha -->
<div>
<div class="flex items-center gap-2 mb-3">

@@ -1,28 +1,26 @@
{
-"version": "1.7.40-alpha",
+"version": "1.7.41-alpha",
"release_date": "2026-04-22",
"changelog": [
-"Proper fix for the 500 / Internal Server Error after update. The v1.7.38 and v1.7.39 frontend archives had the wrong permissions baked into the archive itself — the tarball's root directory entry was private, so every node that extracted it ended up with a web UI directory nginx couldn't read. v1.7.40 packages the archive with correct world-readable permissions from the start, verified before the release is even cut.",
-"Signing in is quiet after the first boot. The intro music, welcome voice, and transition sounds only play during initial onboarding — every login after that is silent. Typing sounds in the search bar and dashboard are unaffected.",
-"Nodes that completed setup no longer get bounced back through the onboarding wizard after clearing browser cache, updating, or rebooting. The node self-heals so already-onboarded nodes always go straight to the login screen.",
-"Trimmed the App Store — FIPS, Nostr Relay, Nostr VPN, Routstr, and Penpot are no longer listed and their container images have been removed from all registries. Your node's built-in FIPS transport is untouched."
+"Updates now self-check. When a new version lands, the node probes its own web UI through nginx within the first 90 seconds after the service restarts. If the frontend isn't answering cleanly, the node automatically rolls back to the previous working version and reboots the service. A bad release can no longer leave the fleet stranded on an unreachable UI — the kind of failure that required SSH recovery on every affected node during the v1.7.38 and v1.7.39 rollouts is now self-healing.",
+"Rollback is hardened against the service's own mount namespace. Restoring the previous binary goes through the same privileged helper as every other write into /opt/archipelago, so it no longer silently fails with EROFS when ProtectSystem is strict. Both the binary and the previous web UI tarball are restored together; the broken web UI is quarantined rather than deleted so you can inspect it after the fact."
],
"components": [
{
"name": "archipelago",
-"current_version": "1.7.37-alpha",
-"new_version": "1.7.40-alpha",
-"download_url": "https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.40-alpha/archipelago",
-"sha256": "5c8c0c6e4700f4da3e1cb58167ddea6d93f46d5c7d7f0352f7367b998c672708",
-"size_bytes": 41107136
+"current_version": "1.7.40-alpha",
+"new_version": "1.7.41-alpha",
+"download_url": "https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.41-alpha/archipelago",
+"sha256": "eb6eeb9720720c566db614861c1a878f48630e6f6c90276cbc8c032bfd910afc",
+"size_bytes": 41215800
},
{
-"name": "archipelago-frontend-1.7.40-alpha.tar.gz",
-"current_version": "1.7.37-alpha",
-"new_version": "1.7.40-alpha",
-"download_url": "https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.40-alpha/archipelago-frontend-1.7.40-alpha.tar.gz",
-"sha256": "0bb58abd5276c83d42a92b0f09697162a300f0222962ad52c8175fb4c904e3e8",
-"size_bytes": 162084678
+"name": "archipelago-frontend-1.7.41-alpha.tar.gz",
+"current_version": "1.7.40-alpha",
+"new_version": "1.7.41-alpha",
+"download_url": "https://git.tx1138.com/lfg2025/archy/raw/branch/main/releases/v1.7.41-alpha/archipelago-frontend-1.7.41-alpha.tar.gz",
+"sha256": "b1ac88b8bb056033aff8818f5a143b69388b66e388aa1e096b064dfbe892130c",
+"size_bytes": 162084894
}
]
}

Binary file not shown.

@@ -108,7 +108,10 @@ if [ -z "$FRONTEND_ARCHIVE" ]; then
# Verify the archive root entry is world-readable before we
# declare success — catches regressions in tar-flag handling
# (BSD tar, busybox tar) that might silently drop --mode.
-root_mode=$(tar tvzf "$FRONTEND_ARCHIVE" | head -1 | awk '{print $1}')
+# SIGPIPE-safe: use awk to read only the first line and exit,
+# then terminate the tar pipeline explicitly so `pipefail`+SIGPIPE
+# don't kill the whole `set -euo pipefail` script.
+root_mode=$(tar tvzf "$FRONTEND_ARCHIVE" 2>/dev/null | awk 'NR==1{print $1; exit}')
case "$root_mode" in
drwxr-xr-x|drwxr-x*x*)
echo " Tarball root perms OK: $root_mode"