feat(immich): manifest-driven stack via orchestrator — live-migrated on .228

Completes the immich migration off the legacy hardcoded install_immich_stack
(podman run + sudo chown) to the registry-manifest + orchestrator path. Validated
live on .228 (clean single set, healthy v2.7.4, data dir ownership correct).

- install_immich_stack now tries install_stack_via_orchestrator(immich_stack_app_ids)
  first; legacy remains only as the no-manifests fallback.
- immich-{postgres,redis,server} manifests corrected from live findings:
  * named by app_id (dropped container_name override) — using container_name
    spawned DUPLICATE containers (app_id-named install vs name-override reconcile)
    on the same PGDATA, which corrupted a postgres cluster. Server reaches its
    siblings via app_id aliases (DB_HOSTNAME=immich-postgres,
    REDIS_HOSTNAME=immich-redis).
  * immich-postgres data_uid 100998:100998 (postgres drops to container 999 →
    host 100998 under rootless; verified the fresh dir is chowned correctly).
  * immich-server version "release"→"2.7.4" (manifest validation requires a digit;
    the bad version made the manifest silently skip → partial orchestrator install
    → legacy fallback → the duplicate corruption above).
- HARDEN install_stack_via_orchestrator: only fall back to the legacy installer
  when NOTHING was installed yet. An "unknown app_id" AFTER a member is up now
  errors instead of double-creating containers on shared data (the corruption
  root cause).
- Tighten the all-manifests round-trip test: fail (not skip) on any invalid
  shipped manifest. Skipping was the gap that let the bad immich-server version
  through.
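
The hardened fallback rule described above can be sketched in isolation. This is a simplified, synchronous model and not the code in this commit: `UnknownAppAction`, `on_unknown_app`, and `plan_stack` are hypothetical names, and "known" stands in for "the orchestrator has a manifest for this app_id".

```rust
/// What a stack install should do when a member fails with "unknown app_id".
#[derive(Debug, PartialEq)]
enum UnknownAppAction {
    /// No member is up yet, so deferring to the legacy installer is safe.
    FallBackToLegacy,
    /// A member is already installed: abort instead of double-creating
    /// containers on shared data.
    Abort { already_installed: usize },
}

/// Hypothetical helper: decide from the number of members installed so far.
fn on_unknown_app(installed: usize) -> UnknownAppAction {
    if installed == 0 {
        UnknownAppAction::FallBackToLegacy
    } else {
        UnknownAppAction::Abort { already_installed: installed }
    }
}

/// Walk the stack in dependency order. `known` models the set of app_ids
/// that have a manifest (an assumption made for this sketch).
fn plan_stack(app_ids: &[&str], known: &[&str]) -> Result<usize, UnknownAppAction> {
    let mut installed = 0usize;
    for id in app_ids {
        if known.contains(id) {
            installed += 1;
        } else {
            return Err(on_unknown_app(installed));
        }
    }
    Ok(installed)
}
```

With no manifests deployed the sketch falls back to legacy; with only `immich-postgres` and `immich-redis` known it aborts at `immich-server`, which is the case that previously led to duplicate containers.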

Known follow-up (pre-existing, platform-wide): orchestrator-installed backends
(immich, btcpay-db) run as podman --restart, not Quadlet, and podman-restart.service
is disabled on .228 → reboot-survival gap independent of this migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
archipelago 2026-06-21 07:08:45 -04:00
parent 011081d180
commit 9e6c5370fc
6 changed files with 61 additions and 18 deletions


@@ -4,15 +4,20 @@ app:
version: "14-vectorchord0.4.3-pgvectors0.2.0"
description: Postgres (pgvecto.rs / vectorchord) backend for Immich.
# The Immich server connects via DB_HOSTNAME=immich_postgres, so the container
# name (and thus its archy-net alias) must be the underscore form.
extensions:
container_name: immich_postgres
# No container_name override: the container is named by app_id (immich-postgres),
# which is also its archy-net alias and the server's DB_HOSTNAME. (Overriding the
# name diverges from the orchestrator's app_id-based naming and spawns duplicate
# containers — mirror the btcpay stack, which names members by app_id.)
container:
image: 146.59.87.168:3000/lfg2025/immich-postgres:14-vectorchord0.4.3-pgvectors0.2.0
pull_policy: if-not-present
network: archy-net
# postgres drops to its own uid (container 999 → host 100998 under rootless),
# so the data dir must be owned by that mapped uid — mirrors archy-btcpay-db.
# Verified on .228: the live immich-db is owned 100998. Without this a FRESH
# install's dir would be service-user-owned and postgres would EACCES.
data_uid: "100998:100998"
generated_secrets:
- name: immich-db-password
kind: hex32
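
The `data_uid` value follows from how a rootless user namespace maps uids: container uid 0 maps to the invoking user, and container uids 1 and up map consecutively into the user's subordinate uid range. A minimal sketch of that arithmetic, assuming the subuid range starts at 100000 (consistent with the 999 → 100998 mapping noted above, but not stated in the commit) and a hypothetical service-user uid of 1000; `rootless_host_uid` is an illustrative name:

```rust
/// Map a container uid to the host uid under a rootless user namespace.
/// ASSUMPTION: a single contiguous subuid range beginning at `subuid_start`.
fn rootless_host_uid(container_uid: u32, user_uid: u32, subuid_start: u32) -> u32 {
    if container_uid == 0 {
        // Container root is the unprivileged user that launched the container.
        user_uid
    } else {
        // Container uid 1 is the first subordinate uid, uid 2 the second, ...
        subuid_start + container_uid - 1
    }
}
```

Under those assumptions, `rootless_host_uid(999, 1000, 100_000)` gives 100998, the owner the postgres data dir needs.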


@@ -4,9 +4,7 @@ app:
version: "7-alpine"
description: Valkey (Redis-compatible) cache for Immich.
# Immich server connects via REDIS_HOSTNAME=immich_redis — alias must match.
extensions:
container_name: immich_redis
# Named by app_id (immich-redis) = archy-net alias = server's REDIS_HOSTNAME.
container:
image: 146.59.87.168:3000/lfg2025/valkey:7-alpine


@@ -1,11 +1,11 @@
app:
id: immich-server
name: Immich
version: "release"
version: "2.7.4"
description: Self-hosted photo and video backup with mobile apps and search.
extensions:
container_name: immich_server
# Named by app_id (immich-server); connects to its siblings by their app_id
# aliases on archy-net (see DB_HOSTNAME / REDIS_HOSTNAME below).
container:
image: 146.59.87.168:3000/lfg2025/immich-server:release
@@ -41,10 +41,10 @@ app:
options: [rw]
environment:
- DB_HOSTNAME=immich_postgres
- DB_HOSTNAME=immich-postgres
- DB_USERNAME=postgres
- DB_DATABASE_NAME=immich
- REDIS_HOSTNAME=immich_redis
- REDIS_HOSTNAME=immich-redis
- UPLOAD_LOCATION=/usr/src/app/upload
health_check:


@@ -620,16 +620,25 @@ async fn install_stack_via_orchestrator(
))
.await;
let mut installed = 0usize;
for app_id in app_ids {
match orchestrator.install(app_id).await {
Ok(container_name) => {
installed += 1;
install_log(&format!(
"INSTALL ORCH: {} stack — app {} installed as {}",
stack_name, app_id, container_name
))
.await;
}
Err(e) if e.to_string().contains("unknown app_id") => {
Err(e) if e.to_string().contains("unknown app_id") && installed == 0 => {
// None of the stack's manifests are known — the orchestrator
// can't render this stack at all, so defer to the legacy
// installer. Only safe when NOTHING was installed yet: once an
// earlier member is up, falling back would let the legacy path
// double-create containers on the same data dir (observed
// corrupting an immich postgres cluster — two postmasters, one
// PGDATA). A partial set means a deploy bug, not a legacy node.
install_log(&format!(
"INSTALL ORCH SKIP: {} stack — app {} unknown, falling back to legacy stack installer",
stack_name, app_id
@@ -637,6 +646,17 @@ async fn install_stack_via_orchestrator(
.await;
return Ok(None);
}
Err(e) if e.to_string().contains("unknown app_id") => {
install_log(&format!(
"INSTALL ORCH FAIL: {} stack — app {} unknown AFTER {} installed; refusing legacy fallback (would double-create on shared data)",
stack_name, app_id, installed
))
.await;
return Err(e.context(format!(
"orchestrator stack install {} aborted: app {} has no manifest but {} member(s) already installed — deploy all stack manifests",
stack_name, app_id, installed
)));
}
Err(e) => {
install_log(&format!(
"INSTALL ORCH FAIL: {} stack — app {} failed: {}",
@@ -668,6 +688,11 @@ fn mempool_stack_app_ids() -> &'static [&'static str] {
&["archy-mempool-db", "mempool-api", "archy-mempool-web"]
}
fn immich_stack_app_ids() -> &'static [&'static str] {
// Install order = dependency order: db + cache before the server.
&["immich-postgres", "immich-redis", "immich-server"]
}
const REGISTRY: &str = "146.59.87.168:3000/lfg2025";
const NETBIRD_DASHBOARD_IMAGE: &str = "docker.io/netbirdio/dashboard:v2.38.0";
@@ -734,6 +759,17 @@ async fn pull_image_with_retry(image: &str) -> Result<()> {
impl RpcHandler {
/// Install Immich stack (postgres + redis + server).
pub(super) async fn install_immich_stack(&self) -> Result<serde_json::Value> {
// Manifest-driven path (workstream B/C): render the stack from
// apps/immich-*/manifest.yml via the orchestrator (app_id-named
// containers, generated_secrets). Falls back to the legacy installer
// below only when the orchestrator knows none of these app_ids
// (manifests not yet deployed); a partially known stack is an error,
// not a fallback. See docs/PRODUCTION-MASTER-PLAN.md.
if let Some(orchestrated) =
install_stack_via_orchestrator(self, "immich", immich_stack_app_ids()).await?
{
return Ok(orchestrated);
}
if let Some(adopted) = adopt_stack_if_exists(
"immich_server",
"immich",


@@ -3778,10 +3778,14 @@ app:
if !mf.exists() {
continue;
}
let m = match AppManifest::from_file(&mf) {
Ok(m) => m,
Err(_) => continue, // a malformed disk manifest is a separate concern
};
// Every shipped manifest MUST be valid. load_manifests() silently
// skips malformed ones in prod, which once let an invalid app.version
// ("release", no digit) ship — the app then vanished from the
// orchestrator and a stack install half-fell-back to the legacy path.
// Fail loudly here instead.
let m = AppManifest::from_file(&mf).unwrap_or_else(|e| {
panic!("shipped manifest {} must be valid: {e}", mf.display())
});
let id = m.app.id.clone();
let is_build = m.app.container.build.is_some();
let value = serde_json::to_value(&m).expect("manifest serializes to JSON");
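
The difference between the lenient production loader and the strict test can be illustrated with a toy model. The only rule assumed here is the one named in the commit message, that a version must contain a digit; the real validation may check more. `version_is_valid`, `load_lenient`, and `load_strict` are hypothetical names, and manifests are reduced to `(app_id, version)` pairs.

```rust
/// ASSUMED rule (from the commit message): a version needs at least one digit.
fn version_is_valid(version: &str) -> bool {
    version.chars().any(|c| c.is_ascii_digit())
}

/// Lenient load: silently drop invalid manifests, as prod is described to do.
fn load_lenient<'a>(shipped: &[(&'a str, &'a str)]) -> Vec<&'a str> {
    shipped
        .iter()
        .filter(|(_, version)| version_is_valid(version))
        .map(|(id, _)| *id)
        .collect()
}

/// Strict load: report the first invalid manifest instead of skipping it.
fn load_strict<'a>(shipped: &[(&'a str, &'a str)]) -> Result<Vec<&'a str>, String> {
    let mut ids = Vec::new();
    for (id, version) in shipped {
        if !version_is_valid(version) {
            return Err(format!(
                "shipped manifest {id} must be valid: version {version:?} has no digit"
            ));
        }
        ids.push(*id);
    }
    Ok(ids)
}
```

In this model a manifest with version `"release"` simply disappears from the lenient result, while the strict variant names it in an error, which is the behaviour the tightened test enforces for shipped manifests.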


@@ -63,7 +63,7 @@ real nodes. Until then, this plan is the priority.
| # | Workstream | Detail doc | Status |
|---|-----------|-----------|--------|
| A | **Manifest-driven app platform** — packaging contract, single/multi-container runtime, routing, controlled hooks, dev tooling (6 phases, security model, migration rules) | `APP-PACKAGING-MIGRATION-PLAN.md` | mostly done; immich + multi-container polish remain |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **design done — implementing phase 1** |
| B | **Registry-distributed manifests** — catalog carries full signed manifest; orchestrator installs from registry; disk = migration fallback | `registry-manifest-design.md` | **phases 1+2 done** (node consume + opt-in publisher embed); not yet flipped on for the fleet |
| C | **Developer-ready external registry** — 3rd-party DID-signed manifests, decentralized Nostr discovery (NIP-78 kind 30078) + trust score, `archy app …` tooling | `marketplace-protocol.md`, `app-developer-guide.md` | design exists; tooling + trust UX pending |
| D | **Distribution backbone** — signed catalog, BLAKE3 content-addressing, iroh swarm (origin-always-wins) | `dht-distribution-design.md` | phases 02 code-complete (worktree) |
| E | **Production test gate** — 20× lifecycle on .228 + .198, per-app L1/L2 matrix | `tests/lifecycle/TESTING.md`, `bulletproof-containers.md` | **never green — exit criterion** |