archipelago 281e65e697 fix(quadlet): TimeoutStartSec=600 when Notify=healthy is set
Bug surfaced live on .228 2026-05-02 — every backend Quadlet unit
(lnd, electrumx, fedimint, btcpay-server, mempool-api, bitcoin-knots)
hit systemd's default 90s start timeout because Notify=healthy makes
systemctl wait for the first green health probe, but
HealthInterval=30s × HealthRetries=3 = 90s minimum even on a healthy
service. Race: timeout fires the moment the third probe MIGHT succeed.

Result was three different post-states (inactive+running, failed+missing,
inactive+stopped) depending on whether systemd's ExecStopPost ran
podman rm before the orchestrator's adoption logic re-grabbed the
container.

Fix: when health is set, render TimeoutStartSec=600 (10 minutes) into
[Service]. Long enough for slow-starting backends (electrumx index
replay, lnd wallet unlock) without being so long that a truly stuck
unit hangs forever. Companions stay unchanged (no health → no override,
default 90s applies).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 07:14:48 -04:00
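
The timeout arithmetic and the conditional override described in the commit message above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from the file: `Health`, `probe_window_secs`, and `render_service` are hypothetical names, and the fixed `Restart=on-failure` line is only there so the output resembles a backend unit.

```rust
use std::fmt::Write as _;

/// Hypothetical, simplified stand-in for the health settings of a unit.
struct Health {
    interval_secs: u32,
    retries: u32,
}

/// The probe window the commit message reasons about: interval x retries.
fn probe_window_secs(h: &Health) -> u32 {
    h.interval_secs * h.retries
}

/// Render only a [Service] section. The 600s override is emitted only
/// when a health spec is present; otherwise systemd's default applies.
fn render_service(health: Option<&Health>) -> String {
    let mut s = String::new();
    let _ = writeln!(s, "[Service]");
    let _ = writeln!(s, "Restart=on-failure");
    let _ = writeln!(s, "RestartSec=10");
    if health.is_some() {
        let _ = writeln!(s, "TimeoutStartSec=600");
    }
    s
}

fn main() {
    let h = Health { interval_secs: 30, retries: 3 };
    // 30s x 3 = 90s, which leaves no headroom under a 90s start timeout.
    assert_eq!(probe_window_secs(&h), 90);
    // The 600s override leaves ample headroom over that window.
    assert!(probe_window_secs(&h) < 600);
    print!("{}", render_service(Some(&h)));
}
```

Calling `render_service(None)` omits the `TimeoutStartSec=` line entirely, which corresponds to the companion case in the commit message: no health spec, no override.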

1074 lines
41 KiB
Rust

//! Render and manage the lifecycle of Quadlet `.container` units for
//! companion UI containers (archy-bitcoin-ui, archy-lnd-ui, archy-electrs-ui).
//!
//! Why Quadlet: companions used to run as fire-and-forget `tokio::spawn`
//! blocks from `install.rs`. If archipelago crashed mid-spawn or the
//! kernel reaped a parent cgroup, companions vanished from `podman ps`
//! entirely and only a manual `podman run` brought them back. Putting the
//! unit on disk and letting systemd own start/restart removes that whole
//! class of failure: the daemon is now systemd, archipelago is just the
//! provisioner.
//!
//! Design constraints kept this module small on purpose:
//!
//! - **Single responsibility**: render → write → enable → disable. We do
//! NOT pull images here — the caller is expected to have the image
//! present locally (companions either build from `/opt/archipelago/docker/`
//! or are pre-pulled by `install_companion_image`). The quadlet unit
//! declares `Pull=never` so a missing image surfaces immediately
//! instead of silently retrying behind systemd's restart loop.
//! - **Atomic writes**: `tempfile + rename` so a partially-written unit
//! is never visible to systemd. A daemon-reload during a rolling
//! update can't see half a file.
//! - **Idempotent**: `write_if_changed` compares bytes before touching
//! the file. No daemon-reload, no service-restart cascade if the
//! rendered bytes match what's on disk.
//! - **systemctl --user only**: archipelago runs as uid=1000 with
//! linger enabled. We never touch the system bus from here.
//!
//! See `docs/rust-orchestrator-migration.md` and the failure-mode log in
//! `feedback_container_lifecycle_failure_modes.md` for the incident
//! that motivated the move.
use anyhow::{anyhow, Context, Result};
use archipelago_container::AppManifest;
use std::fmt::Write as _;
use std::path::{Path, PathBuf};
use tokio::fs;
use tokio::process::Command;
/// Default rootless quadlet directory. Resolved per-user at runtime via
/// `unit_dir()`. Tests pass an explicit dir.
pub const DEFAULT_REL_UNIT_DIR: &str = ".config/containers/systemd";
#[derive(Debug, Clone)]
pub struct BindMount {
pub host: PathBuf,
pub container: PathBuf,
pub read_only: bool,
}
#[derive(Debug, Clone, Default)]
#[allow(dead_code)] // Bridge is reserved for Phase 5 per-app network isolation.
pub enum NetworkMode {
#[default]
Host,
/// A user-defined podman network — quadlet creates the container
/// attached to it. The network must already exist (orchestrator's
/// `ensure_container_network` handles that on every reconcile tick).
Bridge(String),
}
/// systemd Restart= policy for the generated `.service` unit. Companions
/// use Always (any exit triggers a restart). Backends use OnFailure
/// (clean exits — e.g. operator-issued `systemctl stop` — stay stopped,
/// only crashes get restarted automatically).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicy {
Always,
/// Used by `from_manifest` for backend manifests. Wired through
/// `install_via_quadlet` (gated by `Config::use_quadlet_backends`).
OnFailure,
}
impl Default for RestartPolicy {
fn default() -> Self {
Self::Always
}
}
impl RestartPolicy {
fn as_systemd(self) -> &'static str {
match self {
Self::Always => "always",
Self::OnFailure => "on-failure",
}
}
}
/// Container healthcheck wired through to systemd via `Notify=healthy`.
/// When set, `systemctl start <name>.service` blocks until the container's
/// own healthcheck reports green — eliminating the "container up but RPC
/// not ready" race that the orchestrator currently papers over with
/// post-start polling.
///
/// Fields roughly mirror the manifest's HealthCheck struct: `cmd` is the
/// shell form (`/usr/bin/curl -fsS http://localhost:8332/health` etc.),
/// `interval`/`timeout` use systemd time format ("30s", "5m"), `retries`
/// is the consecutive-failures threshold before "unhealthy" trips.
#[derive(Debug, Clone)]
pub struct HealthSpec {
pub cmd: String,
pub interval: String,
pub timeout: String,
pub retries: u32,
}
/// One Quadlet `.container` unit. Field set is deliberately small —
/// add a new field only when a real manifest needs it.
#[derive(Debug, Clone, Default)]
pub struct QuadletUnit {
pub name: String,
pub description: String,
pub image: String,
pub network: NetworkMode,
pub user: Option<String>,
pub memory_mb: Option<u32>,
pub cap_drop_all: bool,
pub cap_add: Vec<String>,
pub bind_mounts: Vec<BindMount>,
pub extra_podman_args: Vec<String>,
pub depends_on: Vec<String>,
/// Phase 3.4: when present the rendered unit emits HealthCmd=,
/// HealthInterval=, HealthTimeout=, HealthRetries=, AND Notify=healthy
/// so systemctl start blocks on a green health probe.
pub health: Option<HealthSpec>,
// Backend-manifest extensions (Phase 3.1). Companion units leave
// these defaulted; the renderer skips empty/false directives so a
// companion's rendered bytes are unchanged from before this PR.
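/// Published ports as (host port, container port, protocol). An empty
/// protocol renders as tcp.
pub ports: Vec<(u16, u16, String)>,
/// Entries already shaped as "KEY=VALUE".
pub environment: Vec<String>,
pub devices: Vec<String>,
/// Extra /etc/hosts entries as (hostname, IP).
pub add_hosts: Vec<(String, String)>,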
pub entrypoint: Option<Vec<String>>,
pub command: Vec<String>,
pub read_only_root: bool,
pub no_new_privileges: bool,
pub cpu_quota: Option<u32>,
pub restart_policy: RestartPolicy,
}
impl QuadletUnit {
/// File name on disk: `<name>.container`. Quadlet translates this
/// into a `<name>.service` unit at daemon-reload time.
pub fn unit_filename(&self) -> String {
format!("{}.container", self.name)
}
/// systemd service name created by Quadlet for this unit.
pub fn service_name(&self) -> String {
format!("{}.service", self.name)
}
/// Render the canonical Quadlet unit text. Pure function — no I/O.
pub fn render(&self) -> String {
let mut s = String::with_capacity(512);
let _ = writeln!(s, "# Generated by archipelago. DO NOT EDIT.");
let _ = writeln!(s, "# Edits are overwritten on the next reconcile.");
let _ = writeln!(s);
let _ = writeln!(s, "[Unit]");
let _ = writeln!(s, "Description={}", self.description);
let _ = writeln!(s, "After=network-online.target");
let _ = writeln!(s, "Wants=network-online.target");
for dep in &self.depends_on {
let _ = writeln!(s, "Requires={dep}");
let _ = writeln!(s, "After={dep}");
}
let _ = writeln!(s);
let _ = writeln!(s, "[Container]");
let _ = writeln!(s, "ContainerName={}", self.name);
let _ = writeln!(s, "Image={}", self.image);
// Pull=never: companions are pre-pulled or built. A missing image
// must surface as a unit start failure, not a silent retry storm.
let _ = writeln!(s, "Pull=never");
match &self.network {
NetworkMode::Host => {
let _ = writeln!(s, "Network=host");
}
NetworkMode::Bridge(net) => {
let _ = writeln!(s, "Network={net}");
}
}
if let Some(user) = &self.user {
let _ = writeln!(s, "User={user}");
}
if self.cap_drop_all {
let _ = writeln!(s, "DropCapability=ALL");
}
for cap in &self.cap_add {
let _ = writeln!(s, "AddCapability={cap}");
}
if let Some(mb) = self.memory_mb {
let _ = writeln!(s, "PodmanArgs=--memory={mb}m");
}
for bm in &self.bind_mounts {
let mode = if bm.read_only { ":ro,Z" } else { ":Z" };
let _ = writeln!(
s,
"Volume={}:{}{}",
bm.host.display(),
bm.container.display(),
mode
);
}
for (host, container, proto) in &self.ports {
let p = if proto.is_empty() { "tcp" } else { proto.as_str() };
let _ = writeln!(s, "PublishPort={host}:{container}/{p}");
}
for env in &self.environment {
// env entries already arrive shaped as "KEY=VALUE"; quadlet
// accepts that form on a single Environment= line per pair.
let _ = writeln!(s, "Environment={env}");
}
for dev in &self.devices {
let _ = writeln!(s, "AddDevice={dev}");
}
for (name, ip) in &self.add_hosts {
let _ = writeln!(s, "AddHost={name}:{ip}");
}
if self.read_only_root {
let _ = writeln!(s, "ReadOnly=true");
}
if self.no_new_privileges {
let _ = writeln!(s, "NoNewPrivileges=true");
}
if let Some(cpus) = self.cpu_quota {
let _ = writeln!(s, "PodmanArgs=--cpus={cpus}");
}
if let Some(h) = &self.health {
let _ = writeln!(s, "HealthCmd={}", h.cmd);
let _ = writeln!(s, "HealthInterval={}", h.interval);
let _ = writeln!(s, "HealthTimeout={}", h.timeout);
let _ = writeln!(s, "HealthRetries={}", h.retries);
// Notify=healthy: systemd treats the unit as "started" only
// after the first green health probe. Start ordering
// (Requires=/After=) downstream of this unit therefore
// doesn't fire until the app is actually serving requests.
let _ = writeln!(s, "Notify=healthy");
}
if let Some(ep) = &self.entrypoint {
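// Quadlet's Exec= sets the command line passed to the container
// (the arguments after the image name). When the manifest provides
// both entrypoint and command we concatenate them into that one
// line; if only command is set we emit it on its own below. Note
// that Exec= by itself does not override an ENTRYPOINT baked into
// the image.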
let mut parts: Vec<String> = ep.clone();
parts.extend(self.command.iter().cloned());
let _ = writeln!(s, "Exec={}", shell_join(&parts));
} else if !self.command.is_empty() {
let _ = writeln!(s, "Exec={}", shell_join(&self.command));
}
for arg in &self.extra_podman_args {
let _ = writeln!(s, "PodmanArgs={arg}");
}
let _ = writeln!(s);
let _ = writeln!(s, "[Service]");
// Restart policy + 10s backoff. RestartSec keeps a crash-loop
// from saturating the journal. Companions: Always. Backends:
// OnFailure (clean stops stay stopped).
let _ = writeln!(s, "Restart={}", self.restart_policy.as_systemd());
let _ = writeln!(s, "RestartSec=10");
if self.health.is_some() {
// Notify=healthy makes systemd block the unit's "started"
// state on the first green health probe. systemd's default
// TimeoutStartSec is 90s — but `HealthInterval=30s` ×
// `HealthRetries=3` is itself 90s, so the timeout fires the
// moment the third probe MIGHT succeed. On .228 every backend
// (lnd, electrumx, fedimint, btcpay-server, mempool-api,
// bitcoin-knots) timed out at 90s and systemd terminated the
// container while it was still warming up. Bump to 600s — long
// enough for slow-starting backends (electrumx replays its
// index, lnd unlocks its wallet) without being so long that a
// truly stuck unit hangs forever.
let _ = writeln!(s, "TimeoutStartSec=600");
}
let _ = writeln!(s);
let _ = writeln!(s, "[Install]");
let _ = writeln!(s, "WantedBy=default.target");
s
}
}
/// Render a manifest's argv-style list as a single Exec= line. We do
/// the minimum quoting needed so quadlet's parser sees one element per
/// item: anything containing whitespace, quotes, or shell metacharacters
/// gets wrapped in double quotes with embedded `"` and `\` escaped.
fn shell_join(parts: &[String]) -> String {
parts
.iter()
.map(|p| {
if p.is_empty() || p.chars().any(|c| c.is_whitespace() || "\"\\$`".contains(c)) {
let escaped = p.replace('\\', "\\\\").replace('"', "\\\"");
format!("\"{escaped}\"")
} else {
p.clone()
}
})
.collect::<Vec<_>>()
.join(" ")
}
impl QuadletUnit {
/// Build a backend-flavour QuadletUnit from a parsed AppManifest.
/// Wired through `prod_orchestrator::install_via_quadlet`, gated by
/// `Config::use_quadlet_backends`.
///
/// `name` is the on-disk container name (typically the manifest's
/// `app.id`, but the orchestrator may rename — see
/// `compute_container_name`). The returned unit is NOT yet written;
/// the caller is expected to merge in any environment overrides
/// (resolve_dynamic_env, secret_env) before calling write_if_changed.
pub fn from_manifest(manifest: &AppManifest, name: &str) -> Self {
let app = &manifest.app;
let network = match app.security.network_policy.as_str() {
"host" => NetworkMode::Host,
// Bridge name comes from the manifest's container.network if
// set; otherwise the orchestrator manages a default network
// separately and we fall back to host. Quadlet won't refuse
// either form.
other if !other.is_empty() && other != "isolated" => NetworkMode::Bridge(other.into()),
_ => match app.container.network.as_deref() {
Some(n) if !n.is_empty() && n != "host" => NetworkMode::Bridge(n.into()),
_ => NetworkMode::Host,
},
};
let bind_mounts = app
.volumes
.iter()
.filter(|v| v.volume_type != "tmpfs" && !v.source.is_empty())
.map(|v| BindMount {
host: PathBuf::from(&v.source),
container: PathBuf::from(&v.target),
read_only: v.options.iter().any(|o| o == "ro"),
})
.collect::<Vec<_>>();
let memory_mb = app.resources.memory_limit.as_ref().and_then(|s| {
// Manifests use forms like "1g", "512m", "1024". Convert to
// MiB. Anything we can't parse gets dropped (renderer skips
// None) — better to lose the limit than to mis-cap.
parse_memory_mib(s)
});
Self {
name: name.to_string(),
description: format!("Archipelago app: {}", app.id),
image: app.container.image_ref().unwrap_or_default(),
network,
user: None,
memory_mb,
cap_drop_all: true,
cap_add: app.security.capabilities.clone(),
bind_mounts,
extra_podman_args: vec![],
depends_on: vec![],
health: app.health_check.as_ref().and_then(translate_health_check),
ports: app
.ports
.iter()
.map(|p| (p.host, p.container, p.protocol.clone()))
.collect(),
environment: app.environment.clone(),
devices: app.devices.clone(),
add_hosts: vec![("host.archipelago".into(), "10.89.0.1".into())],
entrypoint: app.container.entrypoint.clone(),
command: app.container.custom_args.clone(),
read_only_root: app.security.readonly_root,
no_new_privileges: true,
cpu_quota: app.resources.cpu_limit,
restart_policy: RestartPolicy::OnFailure,
}
}
}
/// Translate the manifest's HealthCheck shape into a HealthSpec the
/// renderer understands. Returns None when the manifest's health spec
/// is malformed or unsupported — we'd rather skip Notify=healthy than
/// emit a broken HealthCmd that fails the unit start forever.
///
/// Supported shapes:
/// - type: tcp, endpoint: "host:port" → `nc -z host port`
/// - type: http, endpoint: "host:port" or "http(s)://host:port", path → curl
/// - type: cmd, endpoint: "<shell command>" → `<shell command>` verbatim
///
/// For type=http we accept the endpoint with or without scheme; manifests
/// in the wild use both forms (`localhost:8175` and
/// `http://localhost:8175/`). Earlier we blindly prepended `http://` even
/// when one was already there, producing `http://http://...` HealthCmds
/// that shipped on .228 2026-05-02 and failed every probe.
fn translate_health_check(
hc: &archipelago_container::HealthCheck,
) -> Option<HealthSpec> {
let cmd = match hc.check_type.as_str() {
"tcp" => {
let endpoint = hc.endpoint.as_deref()?;
let (host, port) = endpoint.rsplit_once(':')?;
// nc is in busybox/coreutils on every base image we ship.
// The -z flag does a "scan" that exits 0 on connect, 1 otherwise.
format!("nc -z {host} {port}")
}
"http" => {
let endpoint = hc.endpoint.as_deref()?.trim();
// Accept either bare host:port or a full URL. If endpoint
// already includes a scheme we use it as-is; otherwise we
// prepend http://. This keeps existing http://foo manifests
// working and stops the http://http:// double-prefix bug.
let url = if endpoint.starts_with("http://") || endpoint.starts_with("https://") {
endpoint.to_string()
} else {
format!("http://{endpoint}")
};
// If the endpoint already carried a path component, honour it
// and ignore hc.path (manifests that bake the path into the
// endpoint don't expect to merge a separate path field).
// Otherwise append hc.path (default "/").
let already_has_path = url
.splitn(4, '/')
.nth(3)
.map(|p| !p.is_empty())
.unwrap_or(false);
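let final_url = if already_has_path {
url
} else {
let path = hc.path.as_deref().unwrap_or("/");
// Drop a trailing '/' from the endpoint first so that
// "http://host:port/" + "/" does not become "http://host:port//".
format!("{}{path}", url.trim_end_matches('/'))
};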
// -f: fail on HTTP error statuses (>= 400); -s: silent; -S: still print errors.
// -m 5: per-request timeout matches the default manifest timeout.
format!("curl -fsS -m 5 {final_url}")
}
"cmd" => hc.endpoint.as_deref()?.to_string(),
_ => return None,
};
Some(HealthSpec {
cmd,
interval: hc.interval.clone(),
timeout: hc.timeout.clone(),
retries: hc.retries,
})
}
/// Parse the manifest's memory_limit string into MiB. Recognises the
/// forms our manifests actually use: "<n>", "<n>m"/"<n>M", "<n>g"/"<n>G".
/// Returns None for anything else; the caller treats None as unlimited.
fn parse_memory_mib(raw: &str) -> Option<u32> {
let trimmed = raw.trim();
if trimmed.is_empty() {
return None;
}
let (num_part, mul) = match trimmed.chars().last()? {
'g' | 'G' => (&trimmed[..trimmed.len() - 1], 1024u32),
'm' | 'M' => (&trimmed[..trimmed.len() - 1], 1u32),
'k' | 'K' => return None, // sub-MiB precision: drop, not worth it
c if c.is_ascii_digit() => (trimmed, 1u32), // bare number, treat as MiB
_ => return None,
};
num_part.trim().parse::<u32>().ok()?.checked_mul(mul)
}
/// Resolve the per-user quadlet dir under $HOME. Created if missing.
pub async fn unit_dir() -> Result<PathBuf> {
let home = std::env::var_os("HOME")
.map(PathBuf::from)
.ok_or_else(|| anyhow!("HOME not set; cannot locate quadlet unit dir"))?;
let dir = home.join(DEFAULT_REL_UNIT_DIR);
fs::create_dir_all(&dir)
.await
.with_context(|| format!("create_dir_all {}", dir.display()))?;
Ok(dir)
}
/// Atomically write `unit` into `dir/<name>.container` if the bytes
/// differ from what's already there. Returns true if the file changed.
pub async fn write_if_changed(unit: &QuadletUnit, dir: &Path) -> Result<bool> {
let path = dir.join(unit.unit_filename());
let new_bytes = unit.render();
if let Ok(old) = fs::read_to_string(&path).await {
if old == new_bytes {
return Ok(false);
}
}
fs::create_dir_all(dir)
.await
.with_context(|| format!("create_dir_all {}", dir.display()))?;
let tmp = path.with_extension("container.tmp");
fs::write(&tmp, new_bytes.as_bytes())
.await
.with_context(|| format!("write tmp {}", tmp.display()))?;
fs::rename(&tmp, &path)
.await
.with_context(|| format!("rename {} -> {}", tmp.display(), path.display()))?;
Ok(true)
}
/// Reload the user systemd manager. Required after any quadlet write
/// or removal so systemd picks up the generated `.service` translation.
pub async fn daemon_reload_user() -> Result<()> {
let status = Command::new("systemctl")
.args(["--user", "daemon-reload"])
.status()
.await
.context("spawn systemctl --user daemon-reload")?;
if !status.success() {
return Err(anyhow!("systemctl --user daemon-reload exited {status}"));
}
Ok(())
}
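/// Start a quadlet-generated service. Despite the name, this does not run
/// `systemctl enable`: generated units cannot be enabled (see below), and
/// start-at-boot comes from the unit's `[Install] WantedBy=` instead.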
pub async fn enable_now(service: &str) -> Result<()> {
// Quadlet-generated units cannot be `enable`d directly because the
// .service file lives under /run, not /etc — `enable` would refuse
// ("transient or generated"). The unit's `[Install] WantedBy` is
// honoured at daemon-reload, so we just start it.
let status = Command::new("systemctl")
.args(["--user", "start", service])
.status()
.await
.with_context(|| format!("spawn systemctl --user start {service}"))?;
if !status.success() {
return Err(anyhow!("systemctl --user start {service} exited {status}"));
}
Ok(())
}
/// Stop + remove a quadlet unit and its on-disk file. Best-effort: only a
/// failure to delete the unit file is returned as an error. Failures from
/// `systemctl stop`, daemon-reload, and `podman rm -f` are ignored because
/// the unit or container may already be gone.
pub async fn disable_remove(unit_name: &str, dir: &Path) -> Result<()> {
let svc = format!("{unit_name}.service");
// Stop first; ignore failure (unit may already be down).
let _ = Command::new("systemctl")
.args(["--user", "stop", &svc])
.status()
.await;
let path = dir.join(format!("{unit_name}.container"));
if fs::try_exists(&path).await.unwrap_or(false) {
fs::remove_file(&path)
.await
.with_context(|| format!("remove {}", path.display()))?;
}
daemon_reload_user().await.ok();
// Defensive: kill the actual container too, in case quadlet left it.
let _ = Command::new("podman")
.args(["rm", "-f", unit_name])
.status()
.await;
Ok(())
}
/// Is the quadlet-generated service currently active?
pub async fn is_active(service: &str) -> bool {
Command::new("systemctl")
.args(["--user", "is-active", "--quiet", service])
.status()
.await
.map(|s| s.success())
.unwrap_or(false)
}
#[cfg(test)]
mod tests {
use super::*;
use tempfile::tempdir;
fn sample_unit() -> QuadletUnit {
QuadletUnit {
name: "archy-bitcoin-ui".into(),
description: "Bitcoin RPC UI proxy".into(),
image: "146.59.87.168:3000/lfg2025/bitcoin-ui:latest".into(),
network: NetworkMode::Host,
user: Some("0:0".into()),
memory_mb: Some(128),
cap_drop_all: true,
cap_add: vec![
"CHOWN".into(),
"DAC_OVERRIDE".into(),
"NET_BIND_SERVICE".into(),
"SETUID".into(),
"SETGID".into(),
],
bind_mounts: vec![BindMount {
host: PathBuf::from("/var/lib/archipelago/bitcoin-ui/nginx.conf"),
container: PathBuf::from("/etc/nginx/conf.d/default.conf"),
read_only: true,
}],
extra_podman_args: vec![],
depends_on: vec![],
..QuadletUnit::default()
}
}
#[test]
fn render_contains_required_directives() {
let s = sample_unit().render();
assert!(s.contains("[Container]"));
assert!(s.contains("ContainerName=archy-bitcoin-ui"));
assert!(s.contains("Image=146.59.87.168:3000/lfg2025/bitcoin-ui:latest"));
assert!(s.contains("Pull=never"));
assert!(s.contains("Network=host"));
assert!(s.contains("DropCapability=ALL"));
assert!(s.contains("AddCapability=CHOWN"));
assert!(s.contains("AddCapability=NET_BIND_SERVICE"));
assert!(s.contains("PodmanArgs=--memory=128m"));
assert!(s.contains(
"Volume=/var/lib/archipelago/bitcoin-ui/nginx.conf:/etc/nginx/conf.d/default.conf:ro,Z"
));
assert!(s.contains("[Service]"));
assert!(s.contains("Restart=always"));
assert!(s.contains("WantedBy=default.target"));
}
#[test]
fn render_bridge_network_emits_network_name() {
let mut u = sample_unit();
u.network = NetworkMode::Bridge("archy-bitcoin-ui-net".into());
let s = u.render();
assert!(s.contains("Network=archy-bitcoin-ui-net"));
assert!(!s.contains("Network=host"));
}
#[test]
fn unit_filename_and_service_name_are_consistent() {
let u = sample_unit();
assert_eq!(u.unit_filename(), "archy-bitcoin-ui.container");
assert_eq!(u.service_name(), "archy-bitcoin-ui.service");
}
#[tokio::test]
async fn write_if_changed_writes_first_time_then_noops() {
let dir = tempdir().unwrap();
let u = sample_unit();
let changed = write_if_changed(&u, dir.path()).await.unwrap();
assert!(changed, "first write must report changed");
let on_disk = tokio::fs::read_to_string(dir.path().join(u.unit_filename()))
.await
.unwrap();
assert!(on_disk.starts_with("# Generated by archipelago"));
let changed2 = write_if_changed(&u, dir.path()).await.unwrap();
assert!(!changed2, "second write with identical bytes must no-op");
}
#[tokio::test]
async fn write_if_changed_rewrites_when_field_changes() {
let dir = tempdir().unwrap();
let mut u = sample_unit();
write_if_changed(&u, dir.path()).await.unwrap();
u.memory_mb = Some(256);
let changed = write_if_changed(&u, dir.path()).await.unwrap();
assert!(changed, "field change must trigger rewrite");
let on_disk = tokio::fs::read_to_string(dir.path().join(u.unit_filename()))
.await
.unwrap();
assert!(on_disk.contains("PodmanArgs=--memory=256m"));
}
#[tokio::test]
async fn write_if_changed_atomic_rename_leaves_no_tmp() {
let dir = tempdir().unwrap();
write_if_changed(&sample_unit(), dir.path()).await.unwrap();
let mut entries = tokio::fs::read_dir(dir.path()).await.unwrap();
while let Some(e) = entries.next_entry().await.unwrap() {
assert!(
!e.file_name().to_string_lossy().ends_with(".tmp"),
"atomic rename must leave no .tmp residue"
);
}
}
// ────────────────────────────────────────────────────────────────
// Phase 3.1 backend renderer tests
// ────────────────────────────────────────────────────────────────
#[test]
fn parse_memory_mib_recognises_common_forms() {
assert_eq!(parse_memory_mib("1024"), Some(1024));
assert_eq!(parse_memory_mib("512m"), Some(512));
assert_eq!(parse_memory_mib("512M"), Some(512));
assert_eq!(parse_memory_mib("2g"), Some(2048));
assert_eq!(parse_memory_mib("2G"), Some(2048));
assert_eq!(parse_memory_mib("1k"), None); // sub-MiB rejected
assert_eq!(parse_memory_mib("garbage"), None);
assert_eq!(parse_memory_mib(""), None);
assert_eq!(parse_memory_mib(" 256m "), Some(256));
}
#[test]
fn shell_join_quotes_only_when_needed() {
assert_eq!(shell_join(&["bitcoind".into()]), "bitcoind");
assert_eq!(
shell_join(&["bitcoind".into(), "-server=1".into()]),
"bitcoind -server=1"
);
// Whitespace forces quoting:
assert_eq!(
shell_join(&["bash".into(), "-c".into(), "echo hi".into()]),
"bash -c \"echo hi\""
);
// Embedded quotes must escape:
assert_eq!(
shell_join(&[r#"say "hi""#.into()]),
r#""say \"hi\"""#
);
}
#[test]
fn restart_policy_emits_correct_systemd_string() {
assert_eq!(RestartPolicy::Always.as_systemd(), "always");
assert_eq!(RestartPolicy::OnFailure.as_systemd(), "on-failure");
}
#[test]
fn render_emits_backend_directives_when_set() {
let u = QuadletUnit {
name: "bitcoin-knots".into(),
description: "Bitcoin Knots backend".into(),
image: "registry/bitcoin-knots:latest".into(),
network: NetworkMode::Bridge("archy-net".into()),
cap_drop_all: true,
cap_add: vec!["NET_BIND_SERVICE".into()],
ports: vec![(8332, 8332, "tcp".into()), (8333, 8333, "tcp".into())],
environment: vec![
"BITCOIN_RPC_USER=archipelago".into(),
"BITCOIN_RPC_PASS=secret".into(),
],
devices: vec!["/dev/kvm".into()],
add_hosts: vec![("host.archipelago".into(), "10.89.0.1".into())],
entrypoint: Some(vec!["/usr/local/bin/bitcoind".into()]),
command: vec!["-server=1".into(), "-rpcbind=0.0.0.0".into()],
read_only_root: true,
no_new_privileges: true,
cpu_quota: Some(2),
restart_policy: RestartPolicy::OnFailure,
..QuadletUnit::default()
};
let s = u.render();
assert!(s.contains("PublishPort=8332:8332/tcp"));
assert!(s.contains("PublishPort=8333:8333/tcp"));
assert!(s.contains("Environment=BITCOIN_RPC_USER=archipelago"));
assert!(s.contains("Environment=BITCOIN_RPC_PASS=secret"));
assert!(s.contains("AddDevice=/dev/kvm"));
assert!(s.contains("AddHost=host.archipelago:10.89.0.1"));
assert!(s.contains("ReadOnly=true"));
assert!(s.contains("NoNewPrivileges=true"));
assert!(s.contains("PodmanArgs=--cpus=2"));
assert!(s.contains("Exec=/usr/local/bin/bitcoind -server=1 -rpcbind=0.0.0.0"));
assert!(s.contains("Restart=on-failure"));
assert!(s.contains("Network=archy-net"));
}
#[test]
fn render_skips_backend_directives_when_default() {
// Companion-style unit: backend extension fields all defaulted.
// Rendered bytes must not include any of the backend directives,
// so existing companion units stay byte-identical to before.
let s = sample_unit().render();
assert!(!s.contains("PublishPort="));
assert!(!s.contains("Environment="));
assert!(!s.contains("AddDevice="));
assert!(!s.contains("AddHost="));
assert!(!s.contains("ReadOnly="));
assert!(!s.contains("NoNewPrivileges="));
assert!(!s.contains("Exec="));
assert!(!s.contains("--cpus="));
// Default RestartPolicy is Always — companions rely on this.
assert!(s.contains("Restart=always"));
}
#[test]
fn from_manifest_translates_a_typical_backend() {
let yaml = r#"
app:
  id: bitcoin-knots
  name: Bitcoin Knots
  version: 1.0.0
  container:
    image: registry/bitcoin-knots:1.0
    entrypoint: ["/usr/local/bin/bitcoind"]
    custom_args: ["-server=1", "-rpcbind=0.0.0.0"]
  ports:
    - host: 8332
      container: 8332
      protocol: tcp
  volumes:
    - type: bind
      source: /var/lib/archipelago/bitcoin
      target: /home/bitcoin/.bitcoin
      options: []
  environment:
    - BITCOIN_NETWORK=mainnet
  devices: []
  resources:
    cpu_limit: 4
    memory_limit: 2g
  security:
    capabilities: ["NET_BIND_SERVICE"]
    readonly_root: true
    network_policy: archy-net
"#;
let m = AppManifest::parse(yaml).expect("manifest must parse");
let u = QuadletUnit::from_manifest(&m, "bitcoin-knots");
assert_eq!(u.name, "bitcoin-knots");
assert_eq!(u.image, "registry/bitcoin-knots:1.0");
assert!(matches!(u.network, NetworkMode::Bridge(ref n) if n == "archy-net"));
assert_eq!(u.memory_mb, Some(2048));
assert_eq!(u.cpu_quota, Some(4));
assert!(u.read_only_root);
assert!(u.no_new_privileges);
assert_eq!(u.cap_add, vec!["NET_BIND_SERVICE"]);
assert_eq!(u.ports, vec![(8332, 8332, "tcp".to_string())]);
assert_eq!(u.environment, vec!["BITCOIN_NETWORK=mainnet"]);
assert_eq!(u.bind_mounts.len(), 1);
assert_eq!(
u.bind_mounts[0].host,
PathBuf::from("/var/lib/archipelago/bitcoin")
);
assert!(!u.bind_mounts[0].read_only);
assert_eq!(u.entrypoint, Some(vec!["/usr/local/bin/bitcoind".into()]));
assert_eq!(u.command, vec!["-server=1", "-rpcbind=0.0.0.0"]);
assert!(u.add_hosts.iter().any(|(n, ip)| n == "host.archipelago" && ip == "10.89.0.1"));
assert_eq!(u.restart_policy, RestartPolicy::OnFailure);
}
#[test]
fn from_manifest_marks_ro_volumes_read_only() {
let yaml = r#"
app:
  id: x
  name: X
  version: 1.0.0
  container:
    image: x:latest
  volumes:
    - type: bind
      source: /etc/host-conf
      target: /etc/conf
      options: ["ro"]
"#;
let m = AppManifest::parse(yaml).unwrap();
let u = QuadletUnit::from_manifest(&m, "x");
assert_eq!(u.bind_mounts.len(), 1);
assert!(u.bind_mounts[0].read_only);
}
#[test]
fn from_manifest_skips_tmpfs_volumes() {
let yaml = r#"
app:
  id: x
  name: X
  version: 1.0.0
  container:
    image: x:latest
  volumes:
    - type: tmpfs
      target: /tmp
      tmpfs_options: "rw,size=64m"
    - type: bind
      source: /var/lib/x
      target: /data
      options: []
"#;
let m = AppManifest::parse(yaml).unwrap();
let u = QuadletUnit::from_manifest(&m, "x");
// tmpfs entry is dropped from bind_mounts; bind entry survives.
assert_eq!(u.bind_mounts.len(), 1);
assert_eq!(u.bind_mounts[0].host, PathBuf::from("/var/lib/x"));
}
#[test]
fn render_emits_health_directives_when_set() {
let mut u = QuadletUnit::default();
u.name = "lnd".into();
u.image = "x:1".into();
u.health = Some(HealthSpec {
cmd: "nc -z localhost 10009".into(),
interval: "30s".into(),
timeout: "5s".into(),
retries: 3,
});
let s = u.render();
assert!(s.contains("HealthCmd=nc -z localhost 10009"));
assert!(s.contains("HealthInterval=30s"));
assert!(s.contains("HealthTimeout=5s"));
assert!(s.contains("HealthRetries=3"));
assert!(s.contains("Notify=healthy"));
// Notify=healthy needs a long-enough TimeoutStartSec or systemd
// kills the unit before the first probe can pass — observed live
// on .228 2026-05-02 across all six backends.
assert!(s.contains("TimeoutStartSec=600"), "got: {s}");
}
#[test]
fn render_skips_health_directives_when_absent() {
// No health spec → no Notify=healthy, no HealthCmd, no
// TimeoutStartSec override (default 90s applies). Companions rely
// on this so their rendered bytes stay unchanged.
let s = sample_unit().render();
assert!(!s.contains("HealthCmd="));
assert!(!s.contains("Notify=healthy"));
assert!(!s.contains("HealthRetries="));
assert!(!s.contains("TimeoutStartSec="));
}
#[test]
fn translate_health_check_handles_each_supported_type() {
use archipelago_container::HealthCheck;
let tcp = HealthCheck {
check_type: "tcp".into(),
endpoint: Some("localhost:10009".into()),
path: None,
interval: "30s".into(),
timeout: "5s".into(),
retries: 3,
};
let h = translate_health_check(&tcp).expect("tcp must translate");
assert_eq!(h.cmd, "nc -z localhost 10009");
assert_eq!(h.retries, 3);
let http = HealthCheck {
check_type: "http".into(),
endpoint: Some("localhost:8080".into()),
path: Some("/health".into()),
interval: "10s".into(),
timeout: "3s".into(),
retries: 5,
};
let h = translate_health_check(&http).expect("http must translate");
assert_eq!(h.cmd, "curl -fsS -m 5 http://localhost:8080/health");
let cmdck = HealthCheck {
check_type: "cmd".into(),
endpoint: Some("/usr/local/bin/probe.sh".into()),
path: None,
interval: "60s".into(),
timeout: "15s".into(),
retries: 2,
};
let h = translate_health_check(&cmdck).expect("cmd must translate");
assert_eq!(h.cmd, "/usr/local/bin/probe.sh");
// Unknown type → None (renderer skips Notify=healthy entirely
// rather than emit a broken HealthCmd that hangs the unit start).
let bad = HealthCheck {
check_type: "exec".into(),
endpoint: Some("foo".into()),
path: None,
interval: "30s".into(),
timeout: "5s".into(),
retries: 3,
};
assert!(translate_health_check(&bad).is_none());
// Malformed tcp endpoint → None (no port separator).
let badtcp = HealthCheck {
check_type: "tcp".into(),
endpoint: Some("hostonly".into()),
path: None,
interval: "30s".into(),
timeout: "5s".into(),
retries: 3,
};
assert!(translate_health_check(&badtcp).is_none());
}
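
// Illustrative sketch only: one way the tcp branch exercised above could
// turn "host:port" into an `nc -z` probe. `sketch_tcp_probe_cmd` is a
// hypothetical stand-in, not the real translate_health_check. Rejecting an
// empty host or a non-numeric port is an assumption of this sketch; the
// test above only pins down the missing-separator case.
fn sketch_tcp_probe_cmd(endpoint: &str) -> Option<String> {
    // Split on the last ':' so the port is always the trailing component.
    let (host, port) = endpoint.rsplit_once(':')?;
    if host.is_empty() || port.parse::<u16>().is_err() {
        return None;
    }
    Some(format!("nc -z {host} {port}"))
}

#[test]
fn sketch_tcp_probe_cmd_matches_expected_shapes() {
    assert_eq!(
        sketch_tcp_probe_cmd("localhost:10009").as_deref(),
        Some("nc -z localhost 10009")
    );
    // No port separator, so there is nothing to probe.
    assert!(sketch_tcp_probe_cmd("hostonly").is_none());
}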
#[test]
fn translate_health_check_http_does_not_double_prefix_scheme() {
// Regression: on .228 2026-05-02 we shipped HealthCmds reading
// `curl -fsS -m 5 http://http://localhost:8175/` because manifests
// in the wild carry the scheme inside the endpoint string. Every
// probe failed and the unit looped. Now we accept either form.
use archipelago_container::HealthCheck;
let with_scheme = HealthCheck {
check_type: "http".into(),
endpoint: Some("http://localhost:8175".into()),
path: Some("/".into()),
interval: "30s".into(),
timeout: "5s".into(),
retries: 3,
};
let h = translate_health_check(&with_scheme).expect("with-scheme must translate");
assert_eq!(h.cmd, "curl -fsS -m 5 http://localhost:8175/");
assert!(!h.cmd.contains("http://http://"), "got: {}", h.cmd);
let with_https = HealthCheck {
check_type: "http".into(),
endpoint: Some("https://example.local/health".into()),
path: None,
interval: "30s".into(),
timeout: "5s".into(),
retries: 3,
};
let h = translate_health_check(&with_https).expect("https must translate");
// Endpoint already has /health → don't append the default "/".
assert_eq!(h.cmd, "curl -fsS -m 5 https://example.local/health");
}
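
// Illustrative sketch only: a scheme-tolerant URL builder consistent with
// the expectations in the http tests above. `sketch_http_probe_url` is a
// hypothetical stand-in, not the real translation code. Assumptions of this
// sketch: a bare endpoint defaults to http://, an explicit path is appended
// verbatim, and the default "/" is added only when the endpoint carries no
// path of its own.
fn sketch_http_probe_url(endpoint: &str, path: Option<&str>) -> String {
    let has_scheme = endpoint.starts_with("http://") || endpoint.starts_with("https://");
    let base = if has_scheme {
        endpoint.to_string()
    } else {
        format!("http://{endpoint}")
    };
    // Look for a path component after the "scheme://" prefix.
    let after_scheme = base.split_once("://").map_or(base.as_str(), |(_, rest)| rest);
    let endpoint_has_path = after_scheme.contains('/');
    match path {
        Some(p) => format!("{base}{p}"),
        None if endpoint_has_path => base,
        None => format!("{base}/"),
    }
}

#[test]
fn sketch_http_probe_url_never_doubles_the_scheme() {
    assert_eq!(
        sketch_http_probe_url("localhost:8080", Some("/health")),
        "http://localhost:8080/health"
    );
    assert_eq!(
        sketch_http_probe_url("http://localhost:8175", Some("/")),
        "http://localhost:8175/"
    );
    assert_eq!(
        sketch_http_probe_url("https://example.local/health", None),
        "https://example.local/health"
    );
}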
#[test]
fn from_manifest_picks_up_health_check() {
let yaml = r#"
app:
id: lnd
name: LND
version: 1.0.0
container:
image: x:1
health_check:
type: tcp
endpoint: localhost:10009
interval: 15s
timeout: 4s
retries: 5
"#;
let m = AppManifest::parse(yaml).unwrap();
let u = QuadletUnit::from_manifest(&m, "lnd");
let h = u.health.as_ref().expect("health should be populated");
assert_eq!(h.cmd, "nc -z localhost 10009");
assert_eq!(h.interval, "15s");
assert_eq!(h.timeout, "4s");
assert_eq!(h.retries, 5);
assert!(u.render().contains("Notify=healthy"));
}
#[test]
fn from_manifest_renders_to_a_systemd_unit() {
// End-to-end: parse a real-shape manifest, build the unit, render
// the bytes, and assert the unit body contains the directives a
// human would write by hand.
let yaml = r#"
app:
id: lnd
name: LND
version: 1.0.0
container:
image: registry/lnd:latest
ports:
- host: 10009
container: 10009
protocol: tcp
volumes:
- type: bind
source: /var/lib/archipelago/lnd
target: /root/.lnd
options: []
environment:
- LND_NETWORK=mainnet
resources:
memory_limit: 1g
security:
capabilities: []
network_policy: archy-net
"#;
let m = AppManifest::parse(yaml).unwrap();
let body = QuadletUnit::from_manifest(&m, "lnd").render();
assert!(body.contains("ContainerName=lnd"));
assert!(body.contains("Image=registry/lnd:latest"));
assert!(body.contains("Network=archy-net"));
assert!(body.contains("PublishPort=10009:10009/tcp"));
assert!(body.contains("Volume=/var/lib/archipelago/lnd:/root/.lnd:Z"));
assert!(body.contains("Environment=LND_NETWORK=mainnet"));
assert!(body.contains("PodmanArgs=--memory=1024m"));
assert!(body.contains("AddHost=host.archipelago:10.89.0.1"));
assert!(body.contains("DropCapability=ALL"));
assert!(body.contains("NoNewPrivileges=true"));
assert!(body.contains("Restart=on-failure"));
}
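
// Illustrative sketch only: how a manifest memory_limit such as "1g" could
// become the "--memory=1024m" asserted in the test above.
// `sketch_memory_limit_to_mib` is a hypothetical helper. Supporting only the
// "g" and "m" suffixes and treating 1g as 1024m are assumptions of this
// sketch, not a description of the real renderer.
fn sketch_memory_limit_to_mib(limit: &str) -> Option<u64> {
    let limit = limit.trim().to_ascii_lowercase();
    let (digits, multiplier) = if let Some(n) = limit.strip_suffix('g') {
        (n, 1024)
    } else if let Some(n) = limit.strip_suffix('m') {
        (n, 1)
    } else {
        return None;
    };
    digits.parse::<u64>().ok().map(|n| n * multiplier)
}

#[test]
fn sketch_memory_limit_normalizes_to_mebibytes() {
    let mib = sketch_memory_limit_to_mib("1g").expect("1g must parse");
    assert_eq!(format!("PodmanArgs=--memory={mib}m"), "PodmanArgs=--memory=1024m");
    assert_eq!(sketch_memory_limit_to_mib("512m"), Some(512));
    // Unknown suffixes are rejected rather than guessed at.
    assert_eq!(sketch_memory_limit_to_mib("lots"), None);
}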
}