archy/docs/CHAT_TRANSCRIPT_2026-05-02.md
2026-05-06 09:23:57 -04:00

# Chat Transcript And Working Notes
Date: 2026-05-02
This file captures the current chat context, decisions, progress, and next steps so work can continue from another device/session.
## User Request
The user asked to continue hardening the Archipelago app/container lifecycle, then asked several times to save the plan, progress, and next steps, and finally to save the entire chat to Markdown.
Key user constraints and corrections:
- Continue if next steps are clear; ask only if blocked.
- Exhaustively harden app/container lifecycle before release.
- Preserve data during destructive lifecycle testing unless explicitly instructed otherwise.
- Do not rely on `/app/...` proxy paths for app launch/testing. The user corrected: “we never use paths only ports.”
- LND/Electrum wallet-connect tests must validate real connection details and QR, including Tor.
## Earlier Progress Summary
Before the latest work, the project already had substantial lifecycle hardening in progress:
- Remote lifecycle harness exists at `tests/lifecycle/remote-lifecycle.sh`.
- `.198` SSH works with `/home/archipelago/.ssh/id_ed25519`.
- `.228` RPC works, but SSH is blocked with `Permission denied (publickey,password)`.
- Multiple backend release binaries were built and deployed to `.198` with backups in `/usr/local/bin/archipelago.bak-*`.
- Fixed stale package scanner state recovery from `Removing -> Running` when a container is actually live.
- Fixed startup ordering so crash recovery runs before BootReconciler.
- Removed dangerous automatic Podman runtime directory deletion on `podman info` failure.
- Narrowed generic crash recovery to safe legacy containers.
- Fixed companion reconciliation on install/start/restart.
- Fixed uninstall/reinstall behavior so uninstall disables manifest apps instead of deleting manifest availability, and reinstall re-enables them.
- Fixed LND config generation/repair:
  - `bitcoin.active=true`
  - `bitcoin.mainnet=true`
  - `bitcoin.node=bitcoind`
  - `bitcoind.rpchost=bitcoin-knots:8332`
  - sudo fallback for writing container-owned config paths.
- `.198` had previously passed focused lifecycle for `filebrowser`, `bitcoin-knots`, and a looser LND launch test.
## Major Files Touched In This Session
- `docs/CONTAINER_LIFECYCLE_HANDOFF.md`
- `docs/CHAT_TRANSCRIPT_2026-05-02.md`
- `tests/lifecycle/remote-lifecycle.sh`
- `core/archipelago/src/container/lnd.rs`
- `core/archipelago/src/container/companion.rs`
- `core/archipelago/src/container/prod_orchestrator.rs`
- `core/archipelago/src/container/docker_packages.rs`
- `core/container/src/podman_client.rs`
- `core/archipelago/src/port_allocator.rs`
- `apps/lnd-ui/manifest.yml`
- `neode-ui/src/views/appSession/appSessionConfig.ts`
- `neode-ui/src/stores/container.ts`
- `neode-ui/src/stores/appLauncher.ts`
- `neode-ui/src/views/appDetails/appDetailsData.ts`
- nginx config/snippet files under `scripts/` and `image-recipe/`
## LND Wallet Bootstrap Investigation
Initial strict LND probe failed because `/lnd-connect-info` could not read `admin.macaroon`:
```text
Failed to read LND admin macaroon — is LND installed?
direct: Permission denied (os error 13)
sudo: cat: /var/lib/archipelago/lnd/data/chain/bitcoin/mainnet/admin.macaroon: No such file or directory
```
LND logs showed the wallet was uninitialized/locked:
```text
Waiting for wallet encryption password. Use lncli create...
```
Tests showed `lncli create` is interactive and does not support `--stdin`:
```text
[lncli] flag provided but not defined: -stdin
```
`lncli unlock --stdin` is supported, so the final approach was:
- Use LND REST unlocker endpoints for new wallet creation.
- Use `lncli unlock --stdin` only for an existing wallet.
- Treat “wallet already exists” from REST as a signal to unlock.
- Use sudo-aware checks/reads for wallet artifacts because LND data directories are container-owned and `0700`.
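The create-or-unlock decision above can be sketched in shell. This is a minimal illustration, not the code in `lnd.rs`: the helper name `wallet_action_for` is hypothetical, and the match on the phrase "wallet already exists" is an assumption taken from the wording of these notes.

```bash
# Hypothetical helper: decide the next step from an unlocker REST reply.
# $1 = HTTP status code, $2 = response body.
# Assumption: the "already exists" error body contains that phrase.
wallet_action_for() {
  local http_status="$1" body="$2"
  if [ "$http_status" = "200" ]; then
    echo "created"   # new wallet was initialized over REST
  elif printf '%s' "$body" | grep -qi 'wallet already exists'; then
    echo "unlock"    # existing wallet: fall back to `lncli unlock --stdin`
  else
    echo "fail"      # anything else is surfaced as an error
  fi
}

wallet_action_for 500 '{"message":"wallet already exists"}'   # prints "unlock"
```

Keeping the decision in one small function makes the "already exists means unlock" rule easy to test without a live LND instance.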
Implemented in `core/archipelago/src/container/lnd.rs`:
- `ensure_wallet_initialized()`
- `file_exists_as_root()`
- `read_file_as_root()`
- `init_wallet_via_rest()`
- `get_lnd_unlocker_json()`
- `post_lnd_unlocker_json()`
- `unlock_existing_wallet()`
- `wait_for_admin_macaroon()`
- `lnd_getinfo_ready()`
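For manual checks from a shell, the sudo-aware helpers can be approximated as follows. This is a rough shell analogue under assumptions, not a translation of the Rust code: the function names are reused only for readability, and `sudo -n` is chosen here so a missing sudo credential fails instead of prompting.

```bash
# Shell analogue (assumed behavior): try direct access first, then sudo.
file_exists_as_root() {
  local path="$1"
  [ -e "$path" ] && return 0
  # -n: never prompt for a password, so an automated run cannot hang here.
  sudo -n test -e "$path" 2>/dev/null
}

read_file_as_root() {
  local path="$1"
  if [ -r "$path" ]; then
    cat -- "$path"
  else
    sudo -n cat -- "$path" 2>/dev/null
  fi
}
```

A direct `[ -e ... ]` test returns false for files inside a `0700` directory owned by another user, which is why the fallback is needed even when the file is present.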
Focused Rust test passes:
```bash
cd /home/archipelago/Projects/archy/core
cargo test -p archipelago --bin archipelago lnd
```
Result:
```text
7 passed; 0 failed
```
## LND UI Port Collision
The strict LND UI test then failed with HTTP `502`.
Investigation found a real port collision:
- `nostr-rs-relay` uses host `8081`.
- Old `archy-lnd-ui` also used host `8081`.
- nginx `/app/lnd/` proxy also pointed at `8081`.
Fix implemented:
- Move LND UI companion to host port `18083`, container port `80`.
- Keep `nostr-rs-relay` on `8081`.
- Update app metadata/routing to `18083`.
- Update tests to expect direct port launch.
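A collision like this one can be caught mechanically before deployment. The sketch below assumes the port assignments have already been reduced to `<container-name> <host-port>` lines; the function name `find_port_collisions` is hypothetical and not part of the harness.

```bash
# Print every host port claimed by more than one container.
# Input on stdin: one "<container-name> <host-port>" pair per line.
find_port_collisions() {
  awk '
    {
      if ($2 in owners) owners[$2] = owners[$2] "," $1
      else owners[$2] = $1
      count[$2]++
    }
    END {
      for (port in count)
        if (count[port] > 1) print port ": " owners[port]
    }
  ' | sort
}

# The pre-fix assignment described above:
printf '%s\n' 'nostr-rs-relay 8081' 'archy-lnd-ui 8081' 'lnd 8080' | find_port_collisions
# prints: 8081: nostr-rs-relay,archy-lnd-ui
```

With the LND UI moved to `18083`, the same check prints nothing.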
Important correction from user:
```text
we never use paths only ports, how many times do you need to be told
```
Action taken after correction:
- Stop validating through `/app/lnd/` and `/app/electrumx/` in the lifecycle harness.
- Switch `launch_url_for()` to direct app ports.
- Switch app session resolver to direct `http://host:port` launch, even from HTTPS parent pages.
- Remove use of `HTTPS_PROXY_PATHS[id]` in `resolveAppUrl()`.
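A direct-port `launch_url_for()` could look roughly like the sketch below. It is an assumption about shape only: the real function lives in `tests/lifecycle/remote-lifecycle.sh` and may differ, the host is passed as a second argument here for testability, and only the two ports these notes tie to an app id are mapped.

```bash
# Sketch of a ports-only launch URL resolver (signature is an assumption).
# $1 = app id, $2 = host. No /app/... proxy paths are ever produced.
launch_url_for() {
  local app_id="$1" host="$2"
  case "$app_id" in
    lnd)       echo "http://${host}:18083/" ;;
    electrumx) echo "http://${host}:50002/" ;;
    *)
      echo "no direct port known for ${app_id}" >&2
      return 1
      ;;
  esac
}

launch_url_for lnd 192.168.1.198   # prints http://192.168.1.198:18083/
```

Returning a non-zero status for unmapped apps, rather than falling back to a path, keeps the ports-only rule enforceable in tests.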
Direct-port LND audit command:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd tests/lifecycle/remote-lifecycle.sh
```
Result:
```text
### 192.168.1.198 iteration 1 / 1 ###
lnd state=running
all checks passed
```
The audit now validates `http://192.168.1.198:18083/`, not `/app/lnd/`.
## Lifecycle Harness Changes
`tests/lifecycle/remote-lifecycle.sh` changes made:
- Normalize package states with `ascii_downcase` because API returned `Running`.
- Direct port launch URLs:
  - LND: `http://${ARCHY_HOST}:18083/`
  - Electrum/Electrs: `http://${ARCHY_HOST}:50002/`
  - Bitcoin UI: `http://${ARCHY_HOST}:8334/`
  - Other apps mapped to direct ports where known.
- LND probe checks:
  - `Connect Your Wallet`
  - `id="lndQrBox"`
  - `id="connHost"`
  - `value="rest-tor"`
  - `value="grpc-tor"`
  - `value="rest-local"`
  - `value="grpc-local"`
  - `Copy lndconnect URI`
  - `/lnd-connect-info` cert, macaroon, ports, and Tor onion.
- Electrum probe checks:
  - local QR container and address field
  - Tor QR container and onion field
  - port `50001`
  - QR renderer
  - direct `http://${ARCHY_HOST}:50002/qrcode.js`
  - `/electrs-status` Tor onion.
- Full lifecycle now fails immediately on any failed phase with `|| return 1` so a later reinstall cannot mask a failed restart/probe.
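The state normalization, marker probes, and fail-fast chaining listed above can be sketched together. All function names here are hypothetical, the `phase_<name>` dispatch is an assumed convention, and `tr` stands in for the `ascii_downcase` filter that the harness applies in jq. The onion check relies only on the fact that a v3 onion hostname is 56 base32 characters followed by `.onion`; it validates format, not reachability.

```bash
# Lowercase a package state so "Running" and "running" compare equal.
normalize_state() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# Return 0 only if every marker string occurs in the page body.
# $1 = page body, remaining args = fixed-string markers.
page_has_markers() {
  local body="$1" marker
  shift
  for marker in "$@"; do
    if ! printf '%s' "$body" | grep -qF -- "$marker"; then
      echo "missing marker: $marker" >&2
      return 1
    fi
  done
}

# Format-only check for a v3 onion hostname.
is_v3_onion() { printf '%s' "$1" | grep -qE '^[a-z2-7]{56}\.onion$'; }

# Fail-fast chaining: the first failing phase aborts the lifecycle,
# so a later reinstall cannot mask a failed restart or probe.
# Assumes one function per phase named phase_<name> (hypothetical).
run_lifecycle() {
  local app="$1" phase
  shift
  for phase in "$@"; do
    echo "== ${app}: ${phase} =="
    "phase_${phase}" "$app" || return 1
  done
}
```

A probe for the LND page would pass the fetched HTML plus markers such as `'Connect Your Wallet'` and `'id="lndQrBox"'` to `page_has_markers`.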
## Deployments To `.198`
Several release builds were made and deployed:
```bash
cd /home/archipelago/Projects/archy/core
cargo build -p archipelago --bin archipelago --release
```
Deploy pattern:
```bash
scp -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no \
/home/archipelago/Projects/archy/core/target/release/archipelago \
archipelago@192.168.1.198:/tmp/archipelago.new
ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no \
archipelago@192.168.1.198 \
"sudo cp /usr/local/bin/archipelago /usr/local/bin/archipelago.bak-<timestamp> && \
sudo install -m 0755 /tmp/archipelago.new /usr/local/bin/archipelago && \
sudo systemctl restart archipelago.service && \
systemctl is-active archipelago.service"
```
Latest deploy returned:
```text
active
```
## `.198` Current Observations
After forcing LND package restart, companion reconciliation succeeded:
```text
nostr-rs-relay Up ... 0.0.0.0:8081->8080/tcp
lnd Up ... 0.0.0.0:8080->8080/tcp, 0.0.0.0:9735->9735/tcp, 0.0.0.0:10009->10009/tcp
archy-lnd-ui Up ... 0.0.0.0:18083->80/tcp
```
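Listings like this can be reduced to name/port pairs for automated checks. The sketch below assumes each line starts with the container name and contains mappings of the form `<host-port>-><container-port>/tcp`; the helper name `host_port_pairs` is hypothetical.

```bash
# Turn "name status ports" lines into "name host-port" pairs.
host_port_pairs() {
  local name rest mapping
  while read -r name rest; do
    # Each mapping looks like 8081->8080/tcp; keep the part before "->".
    for mapping in $(printf '%s\n' "$rest" | grep -oE '[0-9]+->[0-9]+/tcp'); do
      echo "$name ${mapping%%->*}"
    done
  done
}

host_port_pairs <<'EOF'
nostr-rs-relay Up ... 0.0.0.0:8081->8080/tcp
archy-lnd-ui Up ... 0.0.0.0:18083->80/tcp
EOF
# prints:
# nostr-rs-relay 8081
# archy-lnd-ui 18083
```

Piping the pairs through `awk '{print $2}' | sort | uniq -d` lists any host port that appears more than once; for the listing above it prints nothing.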
Direct UI test from `.198` returned `200`:
```bash
curl -i http://127.0.0.1:18083/
```
Strict direct-port LND audit is green:
```text
lnd state=running
all checks passed
```
## Full LND Lifecycle Status
Full direct-port lifecycle was started:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh
```
It reached:
```text
### 192.168.1.198 iteration 1 / 1 ###
== lnd: install ==
== lnd: stop ==
```
Then the user aborted the command while asking to save memory/transcript.
The next continuation point is to rerun full LND direct-port lifecycle from scratch and inspect the stop phase if it hangs/fails.
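To tell a hang apart from a failure on the rerun, the harness invocation can be wrapped in a deadline. This is a sketch using `timeout` from GNU coreutils, which exits with status 124 when the deadline is reached; the wrapper name `run_with_deadline` and any specific deadline value are assumptions.

```bash
# Run a command with a deadline and report ok / hung / failed.
# $1 = deadline in seconds, remaining args = command to run.
run_with_deadline() {
  local seconds="$1" status=0
  shift
  timeout "$seconds" "$@" || status=$?
  case "$status" in
    0)   echo "ok" ;;
    124) echo "hung after ${seconds}s" ;;   # timeout's deadline status
    *)   echo "failed with status ${status}" ;;
  esac
  return "$status"
}
```

The full lifecycle command would be passed as the trailing arguments, for example `run_with_deadline 900 tests/lifecycle/remote-lifecycle.sh` with the `ARCHY_*` variables exported beforehand; 900 seconds is an illustrative value, not a measured one.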
## Handoff File
A durable handoff file was also created:
```text
docs/CONTAINER_LIFECYCLE_HANDOFF.md
```
It contains the plan, progress, current blockers, and next steps.
## Immediate Next Steps
1. Rerun full strict LND direct-port lifecycle:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=lnd ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh
```
2. If it hangs/fails at `stop`, inspect package runtime stop path and logs:
```bash
ssh -i /home/archipelago/.ssh/id_ed25519 -o StrictHostKeyChecking=no archipelago@192.168.1.198 \
'journalctl -u archipelago.service -n 260 --no-pager | grep -Ei "package\.(stop|start|restart|install|uninstall)|lnd|companion|error|failed" | sed -n "1,220p"; podman ps -a --format "{{.Names}} {{.Status}} {{.Ports}}" | grep -E "lnd|nostr" || true'
```
3. If stop is unreliable, inspect/fix:
   - `core/archipelago/src/api/rpc/package/runtime.rs`
   - `core/archipelago/src/container/prod_orchestrator.rs`

   Likely causes to check:
   - Reconciler restarting LND while stop is expected.
   - State scanner reporting stale `running`.
   - Companion handling interfering with parent app state.
   - Async lifecycle returning before actual stop completes.
4. Once LND full lifecycle is green, run Electrum strict lifecycle with direct port `50002`:
```bash
ARCHY_HOST=192.168.1.198 ARCHY_PASSWORD=password123 ARCHY_APPS=electrumx ARCHY_FULL_LIFECYCLE=1 tests/lifecycle/remote-lifecycle.sh
```
5. Continue with app groups after LND/Electrum:
   - `filebrowser`
   - `bitcoin-knots`
   - `lnd`
   - `electrumx`
   - `mempool`
   - `btcpay-server`
   - `fedimint`
   - remaining catalog apps.
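For the "async lifecycle returning before actual stop completes" cause in step 3, one way to confirm it is to poll the reported state after the stop call returns. The sketch below is an assumption-laden illustration: `wait_for_state` is a hypothetical helper, and the state-reporting command is supplied by the caller because these notes do not record the exact RPC call.

```bash
# Poll a state-reporting command until it prints the wanted state.
# Usage: wait_for_state <wanted> <max_attempts> <delay_seconds> <command...>
# The command's output is lowercased before comparison.
wait_for_state() {
  local wanted="$1" max_attempts="$2" delay="$3"
  shift 3
  local attempt=1 state=""
  while [ "$attempt" -le "$max_attempts" ]; do
    state="$("$@" | tr '[:upper:]' '[:lower:]')"
    if [ "$state" = "$wanted" ]; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "state still '${state}' after ${max_attempts} attempts" >&2
  return 1
}
```

If the state only reaches `stopped` several polls after the stop call has returned, the stop path is asynchronous and the harness should wait rather than probe immediately.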
## Important Instruction To Preserve
Use ports only for app launch/testing. Do not add or rely on `/app/...` path proxy launch behavior unless the user explicitly changes this requirement.