989 Commits

Author SHA1 Message Date
archipelago
05b41f8946 fix(lnd-ui): align container port across all specs
The LND UI container was unreachable on .228 after the v1.7.43-alpha
deploy because the sources of truth disagreed on which port nginx
listens on inside the container (8081, 8080, or 80):

  - docker/lnd-ui/nginx.conf        listen 8081
  - docker/lnd-ui/Dockerfile        EXPOSE 8080
  - apps/lnd-ui/manifest.yml        host networking, ports: []
  - scripts/first-boot-containers.sh  -p 8081:8080
  - scripts/deploy-to-target.sh        -p 8081:80     (de-facto)
  - scripts/deploy-tailscale.sh        -p 8081:80
  - scripts/container-specs.sh        SPEC_PORTS=8081:80

Result: podman published host 8081 to container port 80, but nothing was
listening on 80 inside, so connections were reset. Canonicalize on
container:80 with host:8081 publish, matching the three deploy paths
already in agreement.

Changes:
  - docker/lnd-ui/nginx.conf: listen 8081 -> listen 80
  - docker/lnd-ui/Dockerfile: EXPOSE 8080 -> EXPOSE 80
  - apps/lnd-ui/manifest.yml: replace host-network (never true) with
    bridge networking and explicit 8081:80 port mapping, correcting a
    documentation-vs-reality mismatch
  - scripts/first-boot-containers.sh: -p 8081:8080 -> -p 8081:80, and
    fix the internal-port comment

Verified on .228 after rebuild: curl http://127.0.0.1:8081/ returns HTTP
200 and the /app/lnd/ host-nginx proxy resolves cleanly.
2026-04-23 15:42:49 -04:00
archipelago
ed73e4709b chore(release): archive ISO build recipes, tarball-only releases
Releases no longer ship as bootable ISOs. Archipelago updates are
distributed as the backend binary plus a frontend tarball referenced by
releases/manifest.json. Nodes OTA-update via scripts/self-update.sh.

Filebrowser and AIUI remain bundled inside the frontend tarball and are
deployed atomically. Both were verified present in the v1.7.43-alpha release
artifact (189 AIUI files, filebrowser-client bundle).

Archived under image-recipe/_archived/ (resurrectable if ISO distribution
is reintroduced):
  - build-auto-installer-iso.sh
  - build-unbundled-iso.sh
  - test-iso-qemu.sh
  - scripts/convert-iso-to-disk.sh
  - BUILD-ISO-STATUS.md, ISO-BUILD-CHECKLIST.md
  - branding/isohdpfx.bin
  - .gitea/workflows/build-iso-dev.yml

Updated release process docs to drop ISO references:
  - scripts/create-release.sh (next-steps text)
  - docs/BETA-RELEASE-CHECKLIST.md
  - docs/hotfix-process.md
  - README.md
2026-04-23 15:36:00 -04:00
archipelago
0bd4e49a8c docs(release-notes): v1.7.43-alpha bullet for AIUI preservation fix v1.7.43-alpha 2026-04-23 13:22:28 -04:00
archipelago
310c709aba chore(release): bump version to 1.7.43-alpha 2026-04-23 13:21:58 -04:00
archipelago
dbf755e908 fix(aiui): bundle demo/aiui in self-update and ISO builds so updates never wipe it
Every OTA self-update and every ISO capture was implicitly relying on
/opt/archipelago/web-ui/aiui/ already being present on disk. Any node that
had its web-ui directory atomically swapped (for example by a manual
deployment shipping only neode-ui dist output) lost aiui entirely and the
AI Assistant tab fell through to the "needs to be enabled" placeholder.

self-update.sh: drop the rsync --exclude aiui preservation trick and
instead stage demo/aiui into the freshly-built dist tree before rsync.
demo/aiui in the repo is now the source of truth; every update overwrites
the on-disk copy with a matching version rather than carrying forward
whatever stale bundle happened to survive.

build-auto-installer-iso.sh: prepend demo/aiui to the AIUI search list so
ISO builds from a fresh repo clone pick it up automatically, without
requiring a side-checkout of the AIUI project or a live dev server.

This matches create-release-manifest.sh which already bakes demo/aiui
into the release tarball (lines 86-89).
2026-04-23 13:21:49 -04:00
archipelago
2572688468 docs(release-notes): v1.7.43-alpha bullets for chunking, avatar, outbox, parser
Four production-code fixes merit user-visible mention: the transport
chunking data-corruption fix (real user-affecting bug for multi-chunk
mesh payloads), the avatar u16 overflow panic (backend crash on certain
seeds), the outbox TTL boundary, and the image-versions parser hardening.
2026-04-23 13:03:49 -04:00
archipelago
4bf35f95e6 test: repair stale test fixtures across identity, mesh, update, wallet, fips
Several tests had drifted from the current production behavior:

- identity_manager: create() already auto-provisions a Nostr key, so the
  explicit create_nostr_key() call failed with "already exists". Rewrite
  the test to assert on record.nostr_npub from create() directly.
- mesh/protocol: test_build_app_start read the app name from frame[4..]
  but the v2 layout is [0:marker][1-2:len][3:cmd][4:version][5..:name].
  test_identity_broadcast_roundtrip expected input DID = output DID but
  the v2 decoder derives DID from the ed25519 pubkey, so the roundtrip
  compares against did_key_from_pubkey_hex(&pub) now.
- mesh/bitcoin_relay: test_build_block_header_announcement asserted
  sig.is_some(), but the builder intentionally emits an unsigned envelope
  to fit the 160-byte LoRa limit; assert sig.is_none(). Also widen
  placeholder hashes to the required 64 hex chars (32 bytes).
- update: load_mirrors() now merges default mirrors post-migration, so
  the roundtrip test must assert the custom mirror survives alongside
  the defaults rather than strict equality.
- wallet/cashu: test_proof_c_as_pubkey used hex that is not on the curve;
  replace with the secp256k1 generator point G so parsing succeeds.
- fips: test_status_reports_no_key_pre_onboarding asserted npub.is_none(),
  which fails on dev boxes where the fips daemon is already running. Keep
  the !key_present assertion and drop the npub one.
2026-04-23 13:02:45 -04:00
archipelago
4edc420459 test(credentials): seed identity/node_key in test helper so encrypt/decrypt works
Credentials tests created a fresh tempdir and immediately invoked
encrypt/decrypt, but load_encryption_key reads <dir>/identity/node_key
which did not exist, so every test failed with "node key not found".
Add a test_dir_with_node_key() helper that writes a deterministic 32-byte
key and switch all 8 call sites to it.
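
A minimal sketch of such a helper, assuming it only has to create
<dir>/identity/node_key with fixed contents. The signature (taking a base
path) and the 0x42 fill byte are illustrative assumptions; the real helper
may create its own tempdir.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create `<base>/identity/node_key` holding a fixed 32-byte key and
/// return `base`. Hypothetical signature, not the project's actual helper.
fn test_dir_with_node_key(base: &Path) -> io::Result<PathBuf> {
    let identity = base.join("identity");
    fs::create_dir_all(&identity)?;
    // Deterministic key so encrypt/decrypt results are reproducible.
    fs::write(identity.join("node_key"), [0x42u8; 32])?;
    Ok(base.to_path_buf())
}
```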
2026-04-23 13:02:28 -04:00
archipelago
7af048cc1a fix(session): add test-only constructor so tests do not read real sessions
SessionStore::new() reads /var/lib/archipelago/sessions.json, which on
any node with an active dashboard contains live sessions that pollute
test state and cause intermittent failures. Introduce a cfg(test) only
new_for_tests(PathBuf) constructor and switch the test suite to it so
tests always start from a clean tempdir.
2026-04-23 13:02:22 -04:00
archipelago
2843cc1e84 fix(container/image_versions): reject entries that are not image references
The parser retained any key ending in _IMAGE, so a harmless-looking
variable like NOT_AN_IMAGE="something" would be treated as a pinned
container image. Add a value-shape check: the value must contain both
a registry separator (/) and a tag separator (:) to qualify.
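
The value-shape check can be sketched as follows. This is an illustration
under assumptions, not the project's parser: the function names, the
Vec return type, and the example image reference are hypothetical.

```rust
/// True when `value` contains both a registry separator and a tag separator.
fn looks_like_image_ref(value: &str) -> bool {
    value.contains('/') && value.contains(':')
}

/// Parse `KEY="value"` lines, keeping only `*_IMAGE` keys whose value
/// is shaped like an image reference. Comment lines are skipped.
fn parse_image_versions(src: &str) -> Vec<(String, String)> {
    src.lines()
        .filter(|line| !line.trim_start().starts_with('#'))
        .filter_map(|line| line.trim().split_once('='))
        .map(|(k, v)| (k.trim().to_string(), v.trim().trim_matches('"').to_string()))
        .filter(|(k, v)| k.ends_with("_IMAGE") && looks_like_image_ref(v))
        .collect()
}
```

With this check, NOT_AN_IMAGE="something" is dropped because its value has
neither separator, while a registry/name:tag value is kept.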
2026-04-23 13:02:15 -04:00
archipelago
c5ea41d0cb fix(mesh/outbox): expire messages with zero TTL immediately
is_expired used age > ttl_secs, so a message with ttl_secs=0 whose age
rounded to 0 seconds was considered live forever. Switch to >= so the
zero-TTL boundary expires on the first check, matching the intuitive
meaning of TTL and the behavior the tests assert.
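
The boundary change, sketched with free functions over plain seconds (the
real is_expired presumably lives on the message type; these signatures are
assumptions):

```rust
/// Old comparison: a zero-TTL message at age 0 is never expired.
fn is_expired_old(age_secs: u64, ttl_secs: u64) -> bool {
    age_secs > ttl_secs
}

/// New comparison: expiry is inclusive of the TTL boundary.
fn is_expired(age_secs: u64, ttl_secs: u64) -> bool {
    age_secs >= ttl_secs
}
```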
2026-04-23 13:02:07 -04:00
archipelago
9d42645aa3 fix(avatar): prevent u16 overflow panic when seed byte is large
hue_color and accent_color computed (seed as u16) * 360, which overflows
u16 when seed >= 183 (183 * 360 = 65,880 > 65,535): debug builds panicked,
release builds wrapped silently. Widen to u32 before the multiplication.

This also unblocks several identity_manager tests that constructed avatars
through master_node_svg and were aborting on the panic.
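
A small sketch of the arithmetic: 182 * 360 = 65,520 still fits in u16,
while 183 * 360 = 65,880 does not. The function names and the `/ 256`
scaling to degrees are assumptions for illustration, not the actual
hue_color body.

```rust
/// The old u16 product, written with checked_mul so overflow is
/// reported as None instead of panicking or wrapping.
fn hue_product_u16(seed: u8) -> Option<u16> {
    (seed as u16).checked_mul(360)
}

/// Widened arithmetic: u32 holds 255 * 360 = 91_800 without overflow.
fn hue_degrees(seed: u8) -> u32 {
    (seed as u32) * 360 / 256 // scaling by 256 is an assumption
}
```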
2026-04-23 13:02:01 -04:00
archipelago
f6efe2f356 fix(transport/chunking): stop overwriting first 4 bytes of user data
encode_chunked() split the payload into shards first, then overwrote
the first 4 bytes of shard 0 with a u32 length header, then re-ran
Reed-Solomon to regenerate parity over the now-corrupted shards. The
decoder correctly read the length header and trimmed `[4..4+len]`
from the reconstructed buffer, but those first 4 bytes had already
been destroyed on the encode side, so every chunked mesh payload
lost its first 4 bytes.

Restructure: reserve 4 bytes for the length header up front, build
a single contiguous [len][data][pad] buffer, then split into shards.
Parity is computed over the correct shards on the first pass, no
double-encode needed.

Update test_chunk_roundtrip_medium: 500 bytes + 4-byte header = 504
bytes, which is 5 data shards (ceil(504/124)), not 4. The old test
assertion was wrong all along and masked the corruption bug because
it only checked the roundtripped bytes, which is exactly what we
need to verify. New assertion is correct.
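
The header-first layout can be sketched like this. Reed-Solomon parity is
omitted; the big-endian header, zero padding, and function names are
assumptions, and the 124-byte shard size is taken from the
ceil(504/124) arithmetic above.

```rust
use std::convert::TryInto;

const SHARD_SIZE: usize = 124; // data bytes per shard
const HEADER_LEN: usize = 4; // u32 length prefix

/// Build one contiguous [len][data][pad] buffer, then split it into shards.
fn frame_into_shards(data: &[u8]) -> Vec<Vec<u8>> {
    let mut buf = Vec::with_capacity(HEADER_LEN + data.len());
    buf.extend_from_slice(&(data.len() as u32).to_be_bytes()); // byte order assumed
    buf.extend_from_slice(data);
    // Zero-pad so the buffer is a whole number of shards.
    let shards = (buf.len() + SHARD_SIZE - 1) / SHARD_SIZE;
    buf.resize(shards * SHARD_SIZE, 0);
    buf.chunks(SHARD_SIZE).map(|c| c.to_vec()).collect()
}

/// Reassemble the shards, read the header, and trim to [4..4+len].
fn unframe(shards: &[Vec<u8>]) -> Option<Vec<u8>> {
    let buf: Vec<u8> = shards.concat();
    let header: [u8; 4] = buf.get(..HEADER_LEN)?.try_into().ok()?;
    let len = u32::from_be_bytes(header) as usize;
    buf.get(HEADER_LEN..HEADER_LEN + len).map(|s| s.to_vec())
}
```

Because the header is reserved before splitting, no user byte is ever
overwritten, and a 500-byte payload yields 5 shards.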

Verified: all 7 transport::chunking tests pass.
2026-04-23 12:29:10 -04:00
archipelago
c4efb30382 docs(release-notes): v1.7.43-alpha bullet for install-log fix; prune stale RESUME note 2026-04-23 12:04:20 -04:00
archipelago
cd6f8bad70 fix(install-log): pre-create /var/log/archipelago/ so non-root backend can write
The backend runs as `archipelago` and calls `install_log()` to append
audit lines to the install log on every install / update / remove /
start / stop / restart. Target path was /var/log/archipelago-container-installs.log,
which does not exist and cannot be created by the service because
/var/log/ is root-owned. OpenOptions errors were silently swallowed,
so the log was never written on any node.

Ship a tmpfiles.d rule that pre-creates /var/log/archipelago/ and
container-installs.log with archipelago:archipelago ownership. Move
the const path to match, keeping logs inside the directory logrotate
already rotates (image-recipe/configs/logrotate.conf). Install the
rule from both the ISO build and self-update, and apply it
immediately on self-update so existing nodes get a working log
without needing a reboot.

Verified on .228: file created, backend user can write, backend
binary rebuilt with new const.
2026-04-23 12:02:46 -04:00
archipelago
9f3d66e24e docs(release-notes): v1.7.43-alpha bullet for self-update script refresh
Document that OTA updates now refresh the reconcile helper scripts,
closing the deploy gap that kept fixes to those scripts from
reaching existing nodes.
2026-04-23 11:51:04 -04:00
archipelago
a272a79706 fix(self-update): install reconcile scripts on OTA updates
The OTA self-update path only refreshed image-versions.sh, leaving
reconcile-containers.sh and container-specs.sh frozen at whatever
version was baked into the ISO that originally provisioned the
node. Any fix to those scripts (notably the --create-missing flag
and the DISK_GB detection fix shipped this round) never reached
existing nodes, and on .228 both scripts were outright missing
because the node predated their inclusion in the ISO recipe.

Install all three helper scripts to /opt/archipelago/scripts/ on
every self-update run. Also preserve the legacy copy of
image-versions.sh at /opt/archipelago/image-versions.sh for any
older backend binaries still looking there first.
2026-04-23 10:07:53 -04:00
archipelago
694e5b0a9d fix(update): pass --create-missing when rollback recreates a destroyed container
The update flow removes the old container before starting the new
one. If the update fails after removal, the rollback path tries
`podman start <name>` first, then falls back to reconcile. But
reconcile without --create-missing treats the now-absent container
as an optional one that the install flow will (re)create later,
and skips it. Result: container stays destroyed until someone
notices and runs reconcile manually.

Add --create-missing to the rollback reconcile invocation so the
fallback actually rebuilds the container from its canonical spec.

Fixes the failure mode observed on .228 where a bitcoin-knots
update left the node with no bitcoin-knots container at all.
2026-04-23 10:06:55 -04:00
archipelago
0f1ad47aec docs(release-notes): v1.7.43-alpha bullets for disk-detection and rollback recovery
Add two user-facing release notes for fixes shipped this round:
- Full-archive Bitcoin nodes no longer silently get pruned on reconcile
  because the disk-size check was reading the OS partition.
- Failed updates can now recover via reconcile --create-missing instead
  of leaving a destroyed container behind.
2026-04-23 10:02:32 -04:00
archipelago
06dcdafda4 fix(specs): measure DISK_GB at /var/lib/archipelago, not /
The reconcile spec for bitcoin-knots auto-enables prune=550 when
DISK_GB < 1000. DISK_GB was measured via `df /`, which on every
archy install reports the ~30 GB OS partition because user data
lives on a separate encrypted /var/lib/archipelago volume.

Result: every archy node with a 2 TB data drive was silently being
configured as a pruned node, and any bitcoin-knots container
recreated by reconcile would delete its historical blocks down to
the 550 MB prune window on next start.

Observed on .228 (2 TB box): blocks dir went from 384 GB to 926 MB
after a reconcile-triggered restart. Historical archive unrecoverable
without full re-IBD from genesis.

Fix: check /var/lib/archipelago first (where bitcoin data actually
lives). Fall back to / only on first-boot before the data partition
is mounted.
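
The decision logic, sketched in Rust for illustration only (the actual
fix lives in a shell script). How the script detects the data partition is
not stated, so the data_dir_present flag and both function names are
assumptions.

```rust
use std::path::Path;

const DATA_DIR: &str = "/var/lib/archipelago";

/// Pick the path whose filesystem should be measured for DISK_GB.
fn disk_probe_path(data_dir_present: bool) -> &'static Path {
    if data_dir_present {
        Path::new(DATA_DIR)
    } else {
        Path::new("/") // first-boot fallback
    }
}

/// Prune setting implied by the measured size: Some(550) below
/// 1000 GB, None (full archive) otherwise.
fn prune_for(disk_gb: u64) -> Option<u32> {
    if disk_gb < 1000 {
        Some(550)
    } else {
        None
    }
}
```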
2026-04-23 09:54:16 -04:00
archipelago
92612ddc70 feat(reconcile): add --create-missing flag for recovering from failed-update rollbacks
Context: when package update fails after remove-old-container but
before reconcile-recreate, the rollback path in update.rs tries to
restart the old container by name. If the container is already gone
(removed in step 3 of the update), rollback fails silently and the
node is left with no live container for that app but on-disk data
still intact. This is exactly the state .228 ended up in after the
reconcile-script-missing bug killed bitcoin-knots and lnd.

Reconcile was designed to only repair existing containers for
optional apps (SPEC_OPTIONAL=true): it skips "not installed" entries
on the assumption that the install RPC creates them. That safety
check is correct for normal operation but blocks recovery when an
optional-marked container has been destroyed by a failed update.

Fix: add --create-missing flag that overrides the SPEC_OPTIONAL skip.
When set, reconcile treats absent containers exactly the same as
broken containers — it creates them from the canonical spec using
the existing on-disk data directory. Narrow-scope override; the
default behaviour is unchanged.
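
The skip rule and its override can be sketched as a decision function.
This is a Rust illustration of shell-script logic; the Action enum, the
healthy flag, and the assumption that absent non-optional containers are
always created are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Action {
    Leave,  // container exists and is fine
    Repair, // container exists but is broken
    Skip,   // absent optional app: left for the install RPC
    Create, // absent container rebuilt from its canonical spec
}

fn plan(exists: bool, healthy: bool, optional: bool, create_missing: bool) -> Action {
    match (exists, healthy) {
        (true, true) => Action::Leave,
        (true, false) => Action::Repair,
        // Default behaviour: do not create optional apps that are absent.
        (false, _) if optional && !create_missing => Action::Skip,
        // --create-missing treats absent like broken.
        (false, _) => Action::Create,
    }
}
```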

Updated --help to document all four flags.

Verified on .228: after the failed bitcoin-core update took out both
bitcoin-knots and lnd, running reconcile --container=bitcoin-knots
--create-missing --force (as the archipelago user, not root —
podman is rootless) brought bitcoin-knots back using the pruned
chainstate at /var/lib/archipelago/bitcoin. Repeated for lnd. All
containers now running; electrumx reconnecting; UIs recovering.

Does NOT fix the underlying update-flow rollback hole (rollback
should be able to re-create a container from spec, not just restart
by name). That is a separate commit — this flag is the manual
recovery tool plus the primitive the improved rollback will call.
2026-04-23 09:42:19 -04:00
archipelago
353825b66c docs: release-note image-versions fix, add marketplace QA tracker, update RESUME
- AccountInfoSection.vue: append 5th bullet to v1.7.43-alpha entry
  explaining that update-available badges and version comparisons
  work again now that the pinned-image catalog is found at the
  correct deployed path.

- docs/MARKETPLACE-QA.md: new tracker for the upcoming app-by-app
  install walk on .228. Documents the per-app fix workflow, the
  four layers we might need to fix at (app recipe, registry image,
  backend orchestrator, frontend), status-key table for tracking
  each catalog entry, and the release-notes policy for the walk.

- docs/RESUME.md: refresh with a9908597 commit, updated binary md5
  on .228, and split Immediate Next Step into Phase 1 (browser
  verification) and Phase 2 (marketplace walk) with a pointer to
  the new tracker.
2026-04-23 09:32:41 -04:00
archipelago
12f93cc15e fix(image-versions): locate image-versions.sh at its actual deployed path
The Rust search path listed /opt/archipelago/image-versions.sh and
scripts/image-versions.sh (repo-relative for dev), but the image
recipe deploys the file to /opt/archipelago/scripts/image-versions.sh.
Production nodes therefore silently failed every lookup: find_file
returned None, load_image_versions returned an empty HashMap, and
both pinned_image_for_app and pinned_images_for_stack returned no
matches.

Symptom on deployed nodes: every container scan emitted
"image-versions.sh not found in any search path" at DEBUG level, and
the version-comparison logic in docker_packages.rs plus the
update-check logic in api/rpc/package/update.rs silently degraded to
no-op — users would not see update-available badges and upgrade RPCs
could not resolve pinned targets.

Fix: put the canonical deployed path first in PATHS, keep the older
/opt/archipelago/image-versions.sh as a fallback for not-yet-updated
nodes, and retain scripts/image-versions.sh as the dev-repo-relative
fallback. Verified on .228: backend now logs "Parsed 57 image
versions from /opt/archipelago/scripts/image-versions.sh" on scan.
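
A sketch of the ordered lookup, assuming find_file returns the first
candidate that exists. The exists callback is a hypothetical parameter
added so the ordering can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

const PATHS: [&str; 3] = [
    "/opt/archipelago/scripts/image-versions.sh", // canonical deployed path
    "/opt/archipelago/image-versions.sh",         // legacy location
    "scripts/image-versions.sh",                  // dev-repo relative
];

/// Return the first candidate for which `exists` reports true.
fn find_file(paths: &[&str], exists: impl Fn(&Path) -> bool) -> Option<PathBuf> {
    for candidate in paths {
        let path = Path::new(candidate);
        if exists(path) {
            return Some(path.to_path_buf());
        }
    }
    None
}
```

In production the callback would be something like |p| p.exists(); when
both /opt paths are present the canonical one wins.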

Pre-existing test_parse_image_versions failure in this module is
unrelated (the NOT_AN_IMAGE assertion was broken before this change
because the parser's _IMAGE-suffix retain keeps it). Leaving that for
the general cargo-test cleanup pass.
2026-04-23 09:29:15 -04:00
archipelago
4faac9cb74 docs(resume): add RESUME.md for context-restart recovery
Consolidated single-file snapshot of plan + progress for a fresh
OpenCode session to pick up the install UX polish work:

- Where we are: v1.7.43-alpha shipped, 5 commits on main, deployed
  to .228, browser verification in progress.
- Immediate next step: await user's verification results from
  https://192.168.1.228/ browser checklist.
- Working layout: SSHFS mount, ssh archy / archy228, deploy recipes.
- Architecture patterns: async-spawn lifecycle, phase-based install
  progress, scanner kick, .23 auto-purge migration.
- Backlog: Vaultwarden exit-on-start, install log perms, 22 stale
  cargo test failures, historical changelog entries left intact.
- User preferences: "best long-term first", one-by-one, no push,
  Bitcoin-only, conventional commits.

Complements STATUS.md (which remains the engineering log) with a
tighter resume-the-work narrative focused on the current round.
2026-04-23 09:14:36 -04:00
archipelago
b62b731db0 docs(status): record rounds 3-5 + config migration + changelog as shipped
Adds a new top section to STATUS.md covering v1.7.43-alpha:

- Round 3: phase-based install progress bar
- Round 4: post-install scanner kick for instant Launch button
- Round 5: .23 VPS retirement, .168 promoted to Server 1
- Config migration: auto-purge .23 from saved registry/mirror JSONs
- Changelog: new v1.7.43-alpha entry in AccountInfoSection

All 5 commits, deployment md5, verification notes, and git remote
cleanup captured. Round 2 rollback command still valid for the full
stack since backups predate every round in this session.
2026-04-23 09:09:02 -04:00
archipelago
6c8cb50679 docs(changelog): add v1.7.43-alpha entry covering async lifecycle + .23 retirement
Four release-note bullets describing the user-visible changes shipped
in this round:

- async-spawn install/update/uninstall (UI no longer freezes)
- phase-based install progress bar (Preparing through Finalizing)
- scanner kick post-install (Launch button appears immediately)
- .23 Hetzner VPS retired, .168 OVH promoted to Server 1 with
  auto-purge migration for existing nodes

Matches the tone of existing changelog entries: what changed from the
operator's perspective, not internal implementation detail.
2026-04-23 09:07:29 -04:00
archipelago
28e38a36a9 fix(config): auto-purge decommissioned .23 VPS from saved registry/mirror configs
load_registries + load_mirrors normally only ADD missing defaults to
the persisted JSON — explicit removals stick. After retiring the .23
Hetzner VPS we need the opposite: existing nodes have .23 baked into
their saved configs and would spend seconds per install/update timing
out against a dead host until the operator manually removes it via
the Settings UI.

Add a targeted one-time migration in both loaders: if any saved entry
has 23.182.128.160 in its URL, drop it on load and rewrite the file.
This is an exception to the usual "explicit removals stick" rule —
the user never chose to add this mirror, it was a default.

Narrow-scope migration (one hardcoded IP match, no schema version)
because the cost/benefit of a general migration system isn't worth
it for a single decommissioned host. Future retirements can follow
the same pattern.
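
The migration amounts to a filter plus a changed flag. A minimal sketch,
assuming the saved entries reduce to URL strings; the function name is
hypothetical and the real loaders work on their own config structs.

```rust
const RETIRED_HOST: &str = "23.182.128.160";

/// Drop every saved URL that mentions the retired host. Returns true
/// when something was removed, i.e. the file needs rewriting.
fn purge_retired(urls: &mut Vec<String>) -> bool {
    let before = urls.len();
    urls.retain(|u| !u.contains(RETIRED_HOST));
    urls.len() != before
}
```

A second load finds nothing to remove and returns false, so the file is
rewritten only once.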
2026-04-23 08:51:26 -04:00
archipelago
d9d5fa65e5 chore: retire .23 VPS mirror, promote .168 OVH to primary
The Hetzner VPS at 23.182.128.160 was decommissioned. Replace it
everywhere with the OVH VPS at 146.59.87.168, which was previously
the tertiary mirror.

  - update.rs: drop DEFAULT_TERTIARY_MIRROR_URL, promote .168 into
    the secondary slot as "Server 1 (OVH)"; tx1138 becomes Server 2.
    Default mirror list shrinks from 3 to 2.
  - container/registry.rs: default RegistryConfig drops .23, promotes
    .168 to Server 1 / priority 0, tx1138 stays Server 2 / priority 10.
  - api/rpc/package/config.rs: trusted-registry allowlist swaps .23
    for .168.
  - api/handler/mod.rs: app-catalog fallback URL uses .168.
  - neode-ui/views/marketplace/marketplaceData.ts: REGISTRY uses .168.
  - scripts/image-versions.sh: ARCHY_REGISTRY_FALLBACK uses .168.
  - image-recipe/build-auto-installer-iso.sh: installer ISO registries
    use .168 (both podman registries.conf and backend registries.json).

Tests updated to assert on the new 2-entry default lists (registry +
mirror). URL-parser fixture tests in update.rs retain .23 strings —
they exercise string-parsing logic, not mirror policy.

Git remotes: dropped `gitea-vps` and the .23 push URL on the `origin`
multi-push alias (not part of this commit — pure working-copy change).
2026-04-23 08:22:32 -04:00
archipelago
980c1b25f4 fix(install): kick scanner post-install so Launch button appears immediately
After install completes, the async-spawn wrapper wrote state=Running
but the skeletal install-time manifest (interfaces: None) persisted
until the next scheduled 60s scan. The frontend saw state=running but
hasUI=false and hid the Launch button for up to a full minute.

Add a shared Notify/watch pair between RpcHandler and the scan loop:
  - scan_kick (Notify): scan loop uses select! between the 60s interval
    and this notify, running immediately on either.
  - scan_tick (watch<u64>): scan loop bumps the counter after each
    completed scan so callers can await completion.

Install and update success paths now call kick_scanner_and_wait before
flipping to Running. The scan merges via merge_preserving_transitional
(state stays Installing/Updating, manifest refreshed from live podman
with interfaces.main.ui populated from real port bindings). 2s timeout
falls back to pre-fix behavior on slow podman — no regression.
2026-04-23 07:59:03 -04:00
archipelago
7e62ea07f7 feat(install): phase-based progress bar replaces unparseable pull bytes
Podman emits zero parseable progress when stderr is piped (no TTY), so
the old byte-counter regex never matched in real installs. Users saw
0% for the whole pull, then a jump to 95%, then silence through
create-container, health-check, and post-install hooks.

Replace with 7 explicit lifecycle phases wired through install.rs and
update.rs: Preparing (5%), PullingImage (20%), CreatingContainer (70%),
StartingContainer (80%), WaitingHealthy (88%), PostInstall (95%),
Done (100%). Each maps to a fixed UI progress and status message.

Frontend PHASE_INFO mapper in stores/server.ts prioritizes phase when
present, falls back to byte-counter for legacy. A Math.max forward-only
guard ensures the bar never regresses. Deleted the duplicate watcher
in Discover.vue that was fighting the store's watcher with stale byte
logic. Added shimmer CSS on the fill (with prefers-reduced-motion
opt-out) so the bar looks alive during long phases.
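
The phase table and the forward-only guard, sketched together. The
percentages are the ones listed above; the method and function names are
assumptions, and the real guard lives in the TypeScript store rather
than in Rust.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Phase {
    Preparing,
    PullingImage,
    CreatingContainer,
    StartingContainer,
    WaitingHealthy,
    PostInstall,
    Done,
}

impl Phase {
    /// Fixed UI progress for each lifecycle phase.
    fn percent(self) -> u8 {
        match self {
            Phase::Preparing => 5,
            Phase::PullingImage => 20,
            Phase::CreatingContainer => 70,
            Phase::StartingContainer => 80,
            Phase::WaitingHealthy => 88,
            Phase::PostInstall => 95,
            Phase::Done => 100,
        }
    }
}

/// Forward-only guard: never display less than what was already shown.
fn next_display(current: u8, phase: Phase) -> u8 {
    current.max(phase.percent())
}
```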
2026-04-23 07:58:43 -04:00
archipelago
576ff1a6de docs(status): mark install/uninstall/update async-spawn as shipped 2026-04-23 06:58:45 -04:00
archipelago
49b98e0271 fix(rpc): empty icon in transient install entry to avoid broken-image flicker
create_installing_entry hardcoded /assets/img/app-icons/<id>.png for
every new install. About half the app icons ship as .svg or .webp
(lnd.svg, vaultwarden.webp, bitcoin-knots.webp, mempool.webp), so the
browser 404s on the wrong extension and renders the default broken-image
glyph for the 10-30s window before the scanner refreshes with real
manifest data.

Send empty icon. The frontend's icon computed in AppCard.vue falls
through to curatedMap which has correct extensions for bundled apps,
and handleImageError still guards any remaining misses with a
placeholder SVG.
2026-04-23 06:58:12 -04:00
archipelago
702b5d64d3 fix(ui): shorten install/uninstall/update timeouts for async RPCs
With the backend flipped to async-spawn, install/uninstall/update return
immediately with a { status, package_id } envelope. Client timeouts of
45m/11m were a leftover from synchronous handlers and masked real RPC
failures.

Drop all install/uninstall/update RPC timeouts to 15s. Progress and
terminal state still arrive through the live state stream — the RPC
only needs to confirm the spawn was accepted.

Return-type annotations updated in rpc-client.ts and stores/server.ts.
Five direct rpcClient.call sites across Marketplace.vue, Discover.vue,
and MarketplaceAppDetails.vue updated with the shorter timeout.
2026-04-23 06:58:02 -04:00
archipelago
1ad889608f feat(rpc): async-spawn install/uninstall/update lifecycle
Extend the async-spawn treatment previously shipped for Stop/Start/Restart
to the three remaining long-running lifecycle RPCs. Each wrapper validates
params, rejects duplicate in-flight ops, flips state to the transitional
variant (Installing/Removing/Updating), then spawns the existing inner
handler on tokio. RPC returns immediately with { status, package_id }; the
spawn task owns the terminal state write.
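
The wrapper's duplicate-op guard and state flip can be sketched as below.
The tokio spawn itself is left out; the State enum, the HashMap store, and
begin_op are simplified stand-ins, not the actual handler types.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Running,
    Installing,
    Removing,
    Updating,
}

impl State {
    fn is_transitional(self) -> bool {
        matches!(self, State::Installing | State::Removing | State::Updating)
    }
}

/// Reject a second in-flight op on the same package, otherwise flip the
/// entry to the transitional `target` state. The caller would then
/// spawn the inner handler and return immediately.
fn begin_op(
    states: &mut HashMap<String, State>,
    id: &str,
    target: State,
) -> Result<(), String> {
    if let Some(current) = states.get(id) {
        if current.is_transitional() {
            return Err(format!("{}: operation already in flight", id));
        }
    }
    states.insert(id.to_string(), target);
    Ok(())
}
```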

Install and update success arms explicitly set state=Running. The scan
loop merge (merge_preserving_transitional) refuses to overwrite
transitional states, so the spawn task must write the terminal state.
Uninstall's inner handler removes the entry entirely, so no explicit
terminal write is needed there.

Dispatcher and handler now thread self as Arc<Self> / &Arc<Self> so
spawned tasks can hold their own Arc without extra field cloning.

Transient install entry uses empty icon string. Hardcoding
/assets/img/app-icons/<id>.png 404s for apps that ship .svg or .webp
assets, which produces a broken-image flicker until the scanner refreshes
with manifest data. Empty string causes the frontend's icon computed to
fall through to the curated map, which has correct extensions.

Removed the inner "already updating" guard in update.rs — the wrapper
now owns duplicate-op detection for all three operations.
2026-04-23 06:57:50 -04:00
archipelago
0ea4f96de9 docs(status): mark async-spawn lifecycle fix as shipped
Records the four landed commits, the .228 deploy (binary + frontend
paths, backups, md5), the manual LND Stop verification, and the
rollback incantation. Leaves the older "NEXT SESSION" design block
in place as historical reference with a note that it's stale.

Adds a follow-ups list: chaos matrix is now unblocked, bundled-app
RPCs are still sync (deprecate or mirror-async?), transitional_since
is in-memory only, and there are 22 pre-existing test failures in
unrelated modules that should get their own cleanup pass.
2026-04-23 05:30:45 -04:00
archipelago
a8158b1ef5 fix(ui): single-button lifecycle control with transitional labels
The app card and details view previously used a pair of Start/Stop
buttons whose labels were driven off isAppLoading(), a client-side
"I just clicked the button" flag. When the backend's graceful stop
took longer than the RPC round-trip (up to 600s on bitcoin-core),
the flag cleared while the container was still shutting down, the
UI flipped back to "Running" as soon as the next 10s scan saw the
still-alive container, and the user had no indication the stop was
still in flight.

Now that the backend flips PackageState to Stopping / Starting /
Restarting / Installing / Updating / Removing for the duration of
each lifecycle operation and the scan loop preserves those states,
the UI can drive its label off the container state itself. A single
full-width primary button replaces the Start/Stop pair. Its label,
color, and disabled state come from getAppVisualState(), which
collapses resting states (exited/created/paused/installed) into
"stopped" and passes transitional states through untouched.

Changes:

- container-client.ts: widen ContainerStatus.state union to include
  the six transitional variants plus "installed". Add
  restartContainer() calling the new container-restart RPC.
- stores/container.ts: add getAppVisualState() computed and the
  restartContainer() action.
- ContainerApps.vue: single primary button (Start / Stop / Starting
  / Stopping / Restarting etc.) plus a separate circular Restart
  button visible only when running. Critically, handleStartApp and
  handleStopApp now route through store.startContainer and
  stopContainer (which call container-start / container-stop, the
  async RPCs) instead of the legacy synchronous bundled-app-start /
  bundled-app-stop path. Transitional-state polling widened from
  just "created" to the full set of transitional variants.
- ContainerAppDetails.vue: same single-button pattern, Restart
  button now calls container-restart instead of the old
  stop-sleep-start sequence, added 2s polling interval for
  transitional states.
- components/ContainerStatus.vue: widen state prop to match the
  shared union, render transitional labels with a trailing ellipsis
  and a yellow dot.

No new tests — this is presentation logic. Manual verification on
.228 will confirm the end-to-end async path: click Stop on LND,
button becomes "Stopping" in under a second, stays that way for
roughly 5 minutes, then flips to "Start" with a grey dot. The UI
must never revert to "Running" mid-stop.
2026-04-23 05:20:15 -04:00
archipelago
cd69c3b2f6 fix(state): preserve transitional state across container scans
The 30s package scan loop used to blindly overwrite every package
entry from podman inspect. While a user-initiated Stop / Start /
Restart was in flight, the RPC spawn task would flip the state to
Stopping / Starting / Restarting, the next scan would see podman
still reporting "running" (for the duration of the graceful stop,
up to 600s for bitcoin-core), and clobber the transitional state
back to Running. The dashboard would then flip Running -> Stopping
-> Running -> Stopped, making it look like the stop had silently
failed until it eventually completed.

The merge loop now treats transitional variants (Stopping, Starting,
Restarting, Installing, Updating, Removing, and the three backup
variants) as owned by the RPC spawn task. For those variants,
merge_preserving_transitional keeps the existing state while still
taking live observability fields (health, exit_code, installed,
lan_address, manifest, static_files, available_update) from the
fresh scan so the UI continues to see live health readings.

Adds an escape hatch via a per-scan transitional_since side table:
if a package has been in a transitional state for more than 1200s
(2x the longest graceful stop at 600s on bitcoin-core), the scan
loop assumes the spawn task died without cleanup and overrides with
podman's live state. Prevents a crashed background task from wedging
a package in Stopping forever.

Three unit tests cover the merge rule, the observability passthrough,
and the transitional-variant classifier.
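
A minimal sketch of the merge rule and the stale-state escape hatch
described above. The PackageState and PackageEntry types, the single
health field, and the function signature are assumptions made for
illustration; only the name merge_preserving_transitional and the 1200s
threshold come from this message.

```rust
use std::time::{Duration, Instant};

/// Hypothetical subset of the package states named in the message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PackageState {
    Running,
    Stopped,
    Stopping,
    Starting,
    Restarting,
    Installing,
    Updating,
    Removing,
}

/// Hypothetical package entry: the owned state plus one observability field.
#[derive(Clone, Debug, PartialEq)]
pub struct PackageEntry {
    pub state: PackageState,
    pub health: Option<String>,
}

/// Escape hatch threshold: 2x the 600s graceful stop.
pub const STALE_AFTER: Duration = Duration::from_secs(1200);

/// Everything except the two settled states counts as transitional here.
pub fn is_transitional(state: PackageState) -> bool {
    !matches!(state, PackageState::Running | PackageState::Stopped)
}

/// Keep a transitional state that the spawn task owns, but always take the
/// observability fields from the fresh scan. `since` is the time the
/// existing entry entered its transitional state, if it is in one.
pub fn merge_preserving_transitional(
    existing: &PackageEntry,
    fresh: &PackageEntry,
    since: Option<Instant>,
    now: Instant,
) -> PackageEntry {
    let stale = since.map_or(false, |t| now.duration_since(t) > STALE_AFTER);
    let state = if is_transitional(existing.state) && !stale {
        existing.state
    } else {
        // Settled state, or the spawn task is presumed dead: trust the scan.
        fresh.state
    };
    PackageEntry {
        state,
        health: fresh.health.clone(),
    }
}
```

In this sketch the caller passes the timestamp in, which keeps the merge
function pure and lets a test drive the 1200s cutoff without sleeping.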
2026-04-23 05:15:13 -04:00
archipelago
39dd1d9dcc fix(rpc): async container stop/start/restart; widen state mapping
RPC handlers no longer block on podman operations. container-stop on
bitcoin-core used to hold the connection for up to 600s while the UI
showed a frozen spinner; it now returns in under a second with
{status: stopping} after flipping the package state to Stopping and
broadcasting over WebSocket. Same treatment for container-start and
the new container-restart route.

Widens container-list state mapping to emit the transitional variants
(stopping, starting, restarting, installing, updating, removing,
installed, and the backup states) instead of collapsing them to
"unknown". Keeps the mapping in sync with the UI ContainerStatus.state
union so the dashboard can render the right transitional label.

Mirrors the treatment in package/runtime.rs for package.start,
package.stop, and package.restart. The body of each handler is lifted
into pure do_package_* helpers that the background task runs; state
flipping is bracketed around the spawn with revert on error. The
pre-existing post-start exit-check verification and restart stop+start
fallback run inside the spawned task, not the RPC body.

Adds container-restart route to the dispatcher. mark_user_stopped
continues to run BEFORE the spawn, preserving the ordering contract
with the crash recovery layer at runtime.rs:145-148.
2026-04-23 04:59:45 -04:00
archipelago
5baced5f5b feat(rpc): spawn_transitional helper for async lifecycle ops
Introduces a new RPC-layer helper that bridges the synchronous
ContainerOrchestrator trait with RPC handlers that must return in <1s.

The helper flips the package state to a transitional variant
(Stopping / Starting / Restarting) in the StateManager so WebSocket
clients see the live label immediately, then tokio::spawns the
actual orchestrator call. On success it writes the final state; on
error it reverts to the pre-transition state and logs via
install_log().

The ContainerOrchestrator trait stays synchronous so the reconciler,
boot flow, unit tests, and chaos harness keep deterministic
behaviour. Async only lives in the RPC layer.

Not wired to any handler yet — Commit 2 consumes this helper.
Widens install_log visibility from pub(super) to
pub(in crate::api::rpc) so the new sibling module can reach it.
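
A rough sketch of the flip-then-spawn pattern described above, not the
helper itself. It assumes a plain Arc<Mutex<State>> in place of the
StateManager, std::thread::spawn in place of tokio::spawn, and eprintln!
in place of install_log(); the State enum and the signature are
hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical, reduced state set for the sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum State {
    Running,
    Stopping,
    Stopped,
}

/// Flip `shared` to `transitional` immediately, then run the blocking `op`
/// on a background thread. On success write `done`; on error restore the
/// state observed before the flip. The handle is returned so callers and
/// tests can wait for completion; an RPC handler would simply drop it.
pub fn spawn_transitional<F>(
    shared: Arc<Mutex<State>>,
    transitional: State,
    done: State,
    op: F,
) -> JoinHandle<()>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    // The flip happens before the spawn, so observers see the
    // transitional label as soon as this function returns.
    let previous = {
        let mut guard = shared.lock().unwrap();
        let previous = *guard;
        *guard = transitional;
        previous
    };
    thread::spawn(move || {
        // The lock is not held while the slow operation runs.
        let outcome = op();
        let mut guard = shared.lock().unwrap();
        match outcome {
            Ok(()) => *guard = done,
            Err(err) => {
                eprintln!("lifecycle op failed, reverting: {}", err);
                *guard = previous;
            }
        }
    })
}
```

Because the orchestrator call stays an ordinary blocking closure, the
synchronous callers listed above would not need to change in this shape.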
2026-04-23 04:55:52 -04:00
archipelago
cad63bdd76 docs: STATUS.md — FUSE/SSHFS development loop section
Dedicated section covering the file-ops-via-mount + git/cargo-via-ssh
split that makes this dev setup work. Includes:

- Exact running mount command (pulled from ps)
- macFUSE + sshfs-mac brew install path
- Health check + recovery sequence for when mount hangs (it will)
- Full which-path-for-which-operation table
- Don't-do list (cargo from mount, rsync without AppleDouble exclude, etc)
- Cache caveat and inode-sharing note between mount and SSH views

No code change.
2026-04-23 04:51:53 -04:00
archipelago
bb2e3fab42 docs: STATUS.md — complete SSH/key/sudo/deploy reference for next session
Expands NEXT SESSION header with fully verified access info so a fresh
agent has zero ambiguity:

- SSH key inventory across laptop, .116, .228 (every file, purpose noted)
- Actual SSH config aliases (archy, archy228) with IdentitiesOnly
- Verified connectivity matrix (laptop -> both; .116 -> .228; .228 has no outbound key)
- Corrected sudo state: .228 sudoers file is /etc/sudoers.d/archipelago
  (not archipelago-ci); .116 has archipelago-ci + archipelago-wg scope-limited drop-ins
- SSHFS mount source command + AppleDouble gotcha
- Cargo over SSH PATH gotcha + detached build pattern for >2min timeout
- End-to-end deploy-to-.228 recipe (build, SCP, atomic swap, verify)
- Git workflow rules (no push, no amend, no force, conventional commits)

Removes duplicate host-reference block that the prior edit left trailing.
No code change.
2026-04-23 04:49:45 -04:00
archipelago
6a5fab709a docs: STATUS.md — dashboard Stop UX bug diagnosis + async-spawn fix plan
Captures full design for the next session:
- Full bug sequence (5.5min blocking RPC + 30s scan clobbering transitional state)
- 4-commit implementation order with exact file:line targets
- Single-button UI spec with full label table
- Verification gates including manual LND stop test on .228
- Architectural decision: spawn lives in RPC layer, orchestrator trait stays sync

No code change yet; next session implements.
2026-04-23 04:45:12 -04:00
archipelago
2a2f10608b docs: STATUS.md — .228 dashboard bugs fixed (macaroon + ExtraHost)
2026-04-23 04:17:56 -04:00
archipelago
7257f72f4a fix(first-boot): use podman host-gateway magic for host.containers.internal
The previous code computed HOST_GATEWAY from `ip route show default` to
work around an alleged podman 4.3.x limitation. Two problems:

1. The comment was wrong. Podman 4.4+ supports --add-host=host-gateway
   natively, and we ship 5.4.2.

2. More critically, `ip route show default` returns the LAN router
   (e.g. 192.168.1.254) — the gateway to the internet, not the gateway
   to the host. Every container configured with DAEMON_URL or
   --bitcoind.rpchost=host.containers.internal was therefore dialing
   the WiFi router instead of the host machine, silently failing.

Symptoms this caused on .228:
- LND crash-looped with "dial tcp 192.168.1.254:8332: connection refused"
- Dashboard showed no LND connect details or QR
- ElectrumX DAEMON_URL broken; stuck at 2 KB index for days
- Any service reaching bitcoin-core through the `archy-net` bridge was affected

Replace the computed value with the literal string "host-gateway",
which podman translates to the correct in-network gateway at container
start. Also drop the stale HOST_GATEWAY reference in the Tor-bootstrap
branch (it always fell back to TARGET_IP anyway). Verified on .228:
after recreating bitcoin-core/electrumx/lnd with the new flag, LND
reached the chain backend, ElectrumX resumed indexing, and the
dashboard /lnd-connect-info endpoint succeeded.
2026-04-23 04:16:42 -04:00
archipelago
30b31b3670 fix(lnd): read admin macaroon via sudo fallback
LND's admin.macaroon is owned by a rootless-podman subordinate UID
(typically 100000) with mode 640. The archipelago server runs as UID
1000 and cannot read the file directly, which caused every dashboard
LND RPC (getinfo, connect-info, export-channel-backup) and lnd_client
to fail with "Failed to read LND admin macaroon".

Add a read_lnd_admin_macaroon() helper that first tries a direct read
(for operators who have relaxed permissions) then falls back to
`sudo -n cat`, mirroring the pattern already used for Tor hidden
service hostnames in handle_lnd_connect_info. Centralise the canonical
macaroon path as LND_ADMIN_MACAROON_PATH and route all four callers
through the helper.

Verified on .228: GET /lnd-connect-info now returns 200 with cert,
macaroon, and tor_onion fields. Dashboard QR/connect-string UI
unblocked.
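
A minimal sketch of the direct-read-then-sudo fallback described above.
The function name read_with_sudo_fallback and its error handling are
assumptions; the real helper is read_lnd_admin_macaroon() and may differ.
It also assumes a passwordless sudoers rule that permits `cat` on the
target path.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::Command;

/// Try a direct read first; if that fails, fall back to `sudo -n cat`.
/// `-n` makes sudo fail instead of prompting when no passwordless rule
/// applies, so a server process never blocks on a password prompt.
pub fn read_with_sudo_fallback(path: &Path) -> io::Result<Vec<u8>> {
    if let Ok(bytes) = fs::read(path) {
        return Ok(bytes);
    }
    let output = Command::new("sudo")
        .arg("-n")
        .arg("cat")
        .arg(path)
        .output()?;
    if output.status.success() {
        Ok(output.stdout)
    } else {
        Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            format!(
                "sudo -n cat {} failed: {}",
                path.display(),
                String::from_utf8_lossy(&output.stderr).trim()
            ),
        ))
    }
}
```

Returning raw bytes rather than a String matters here because a macaroon
is binary data and is not guaranteed to be valid UTF-8.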
2026-04-23 04:15:44 -04:00
archipelago
28819d1197 docs: STATUS.md through Step 9 (.228 hot-swap verified)
Logs Step 9 acceptance evidence, the two bugs caught and fixed during
the hot-swap (parse_memory_limit IEC suffix bug in 732df1b8 and
cgroup Delegate in ba83f9bc), and outlines the Step 10 plan for .116.
2026-04-23 03:46:23 -04:00
archipelago
80765c5755 feat(systemd): delegate cgroup controllers to archipelago.service
Adds Delegate=memory pids cpu io to the archipelago.service unit.

Context: the service runs as User=archipelago under system.slice with
rootless podman. When podman creates transient libpod-*.scope units for
containers under user.slice, systemd needs the caller to hold
CAP_SYS_ADMIN on the target cgroup subtree, which happens iff
Delegate= lists the controllers we want to set. Without Delegate, any
future code path that goes through the podman CLI (runtime.rs) instead
of the libpod HTTP API (podman_client.rs) would hit MemoryMax
rejections that have exactly the same symptom as the bug I just fixed
in parse_memory_limit but with a completely different root cause.

Belt-and-braces: current production path uses PodmanClient and was
fixed in the preceding commit. But the DockerRuntime CLI path in
runtime.rs:262-268 (cmd.arg("--memory")) is still reachable via
AutoRuntime fallback on hosts without podman, and future Rust
orchestrator code may legitimately need cgroup delegation. This
directive is a harmless no-op on hosts that already delegate upstream
(systemd gracefully handles duplicate/nested delegation).
2026-04-23 03:44:36 -04:00
archipelago
8acf7d1112 fix: parse_memory_limit accepts Ki/Mi/Gi IEC binary suffixes
The libpod HTTP API path (PodmanClient::create_container) ran manifest
memory_limit values like "128Mi" through parse_memory_limit which
lowercased+trim_end_matches("m"), leaving "128i" which parse::<f64>()
rejected. The resulting None became 0 via .unwrap_or(0), and podman
serialised that into the OCI config as memory.limit:0. At container
start time systemd then rejected MemoryMax=0 with "Value specified in
MemoryMax is out of range".

Silently wrong for every manifest in apps/ that uses Kubernetes-style
suffixes (all of them). Became visible on .228 when Step 9 first
exercised the ProdContainerOrchestrator path for bitcoin-ui and lnd-ui
installs; the old first-boot-containers.sh bash script used podman
run --memory 128m directly, which podman-the-CLI parses correctly, so
the bug never surfaced before.

Two parts:
- parse_memory_limit now recognises Ki/Mi/Gi/Ti (IEC binary, what k8s
  and our manifests use), kB/MB/GB/TB (SI decimal), k/K/m/M/g/G/t/T
  (docker shorthand, treated as IEC binary for backwards compat), and
  bare byte integers. Filters out zero/negative results.
- create_container omits the memory/cpu fields entirely when the
  manifest has no limit or parsing fails, rather than emitting 0. The
  libpod API treats absent as unlimited; 0 is "set MemoryMax=0" which
  systemd rightly rejects. Defence in depth against the next weird
  suffix someone puts in a manifest.

Six regression tests in the new tests module cover IEC, SI, shorthand,
raw bytes, invalid input (empty/garbage/0/negative), and whitespace.
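
A sketch of a parser that follows the suffix rules listed above. The
Option<u64> return type, the suffix table, and the use of f64 for the
numeric part are assumptions; the actual parse_memory_limit may be
structured differently.

```rust
/// Parse a manifest memory limit into bytes. Returns None for empty,
/// unparseable, zero, or negative input so the caller can omit the field
/// instead of sending 0.
pub fn parse_memory_limit(raw: &str) -> Option<u64> {
    // (suffix, base, power). Two-letter suffixes are listed first so that
    // "Mi" and "MB" are matched as whole units.
    const SUFFIXES: [(&str, f64, i32); 16] = [
        ("Ki", 1024.0, 1), ("Mi", 1024.0, 2), ("Gi", 1024.0, 3), ("Ti", 1024.0, 4),
        ("kB", 1000.0, 1), ("MB", 1000.0, 2), ("GB", 1000.0, 3), ("TB", 1000.0, 4),
        // Docker shorthand, treated as IEC binary.
        ("k", 1024.0, 1), ("K", 1024.0, 1), ("m", 1024.0, 2), ("M", 1024.0, 2),
        ("g", 1024.0, 3), ("G", 1024.0, 3), ("t", 1024.0, 4), ("T", 1024.0, 4),
    ];

    let s = raw.trim();
    // No recognised suffix means the value is a bare byte count.
    let (number, multiplier) = SUFFIXES
        .iter()
        .find_map(|(suffix, base, power)| {
            s.strip_suffix(*suffix).map(|n| (n, base.powi(*power)))
        })
        .unwrap_or((s, 1.0));

    let value: f64 = number.trim().parse().ok()?;
    let bytes = value * multiplier;
    // Reject zero, negatives, NaN and infinities.
    if bytes.is_finite() && bytes >= 1.0 {
        Some(bytes as u64)
    } else {
        None
    }
}
```

With a None result the caller can leave the memory field out of the
create request entirely, which matches the second part of the fix.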
2026-04-23 03:44:23 -04:00
archipelago
c396be8068 feat(iso): Step 8a — retire archipelago-reconcile systemd timer
BootReconciler (in-process, 30s interval, spawned from main.rs as of
Step 6 commit 48f08aa3) fully replaces the timer-driven bash
reconciliation path. Delete the systemd unit + timer and their
ISO-builder touchpoints.

Removed:
- image-recipe/configs/archipelago-reconcile.service
- image-recipe/configs/archipelago-reconcile.timer
- image-recipe/build-auto-installer-iso.sh L412-413 (COPY unit+timer)
- image-recipe/build-auto-installer-iso.sh L449 (systemctl enable)
- image-recipe/build-auto-installer-iso.sh L542-543 (cp to WORK_DIR)

Kept (intentionally):
- scripts/reconcile-containers.sh
- scripts/container-specs.sh

Reason: core/archipelago/src/api/rpc/package/update.rs still invokes
reconcile-containers.sh at two sites (OTA update + rollback paths).
Porting those call sites to ContainerOrchestrator::upgrade() requires
manifests for every container update.rs might touch — that scope
belongs in Step 8b. Until then the script stays on disk, just no
longer runs on a periodic timer.

No Rust code changes. cargo check -p archipelago clean, 6 pre-existing
warnings. Skipped full ISO rebuild validation per user decision —
edits are 5 textual deletions with zero behavioral ambiguity; Step 9
live hot-swap on .228 will catch any regression.
2026-04-23 03:04:58 -04:00
archipelago
236a2dee85 docs: split Step 8 into 8a/8b/8c
Discovered during Step 8 execution that first-boot-containers.sh
creates 30+ containers with per-container logic (wallet loads, DB
init, rpcauth derivations, post-create health waits) and does
substantial non-container setup (secret gen, rootless-podman subuid
chowns, Tor hostnames, WireGuard, firewall, nostr-relay). Only 3 of
the 30+ containers have manifests today (the UIs from Step 7).

Deleting the bash in a single step bricks first-boot on fresh
installs. Split into:

- 8a: delete reconcile-containers.sh + container-specs.sh + reconcile
  systemd unit + timer. BootReconciler fully covers these. Safe,
  atomic, no manifest porting required.
- 8b: port remaining ~25 containers into apps/<id>/manifest.yml. One
  manifest per commit, validated against current bash behavior.
  Multi-day scope.
- 8c: rename first-boot-containers.sh -> first-boot-setup.sh, strip
  container ops, keep secret/dir/Tor/WG/firewall setup. Final
  one-way door, requires 8b complete.
2026-04-23 02:34:43 -04:00