From e3baaa5de3f3d16ffed95699967f6ab5eb24de84 Mon Sep 17 00:00:00 2001
From: archipelago
Date: Wed, 1 Jul 2026 11:01:27 -0400
Subject: [PATCH] docs: record fleet-deploy ENOSPC bug + fix + cleanup outcome

Co-Authored-By: Claude Sonnet 5
---
 docs/PRODUCTION-MASTER-PLAN.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/docs/PRODUCTION-MASTER-PLAN.md b/docs/PRODUCTION-MASTER-PLAN.md
index 7319561a..20592d68 100644
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -1226,3 +1226,35 @@ fails under full-parallel `--workspace` runs, and never on the same test twice
 test-fixture/tempfile collision generating non-UTF8 bytes under parallelism, not a real credentials
 bug and not related to anything touched this session. Worth a real fix at some point (a test
 isolation issue makes CI flaky) but out of scope here.
+
+## 15. Fleet deploy of this session's fixes + deploy-script ENOSPC bug (2026-07-01)
+
+User asked to build+deploy all 8 fixes above to `.116`/`.198`/`.228` via
+`scripts/deploy-to-target.sh`. **Found and fixed a real bug in the deploy script itself**: its
+`rsync --exclude` list never excluded `releases/` (the local repo's own historical build artifacts
+— dozens of versioned binaries + frontend tarballs, 7-10GB) or `reticulum-daemon/.venv` (a Python
+virtualenv bundling PyInstaller, ~87MB-several hundred MB depending on state) — every deploy synced
+these to the target's root disk. This **filled `.198` (29GB disk) to exactly 100% mid-deploy**,
+aborting that deploy with `rsync: ... No space left on device`, and **filled `.228` to 100% right
+after a "successful" deploy** (the post-deploy health check kept passing throughout — it doesn't
+check free disk space, so nothing alarmed on it). Neither node's actual services were corrupted by
+this (verified: containers unaffected, HTTP/HTTPS still 200 after disk was freed) — the risk was
+latent (next log/DB write failing), not realized.
+
+**Fixed**: added `--exclude 'releases'` (`aa849849`) and `--exclude '.venv'` (`84d35b3b`) to the
+rsync command in `scripts/deploy-to-target.sh:545-559`. Manually removed the already-synced
+`releases/`+`.venv` copies from `.116`/`.198`/`.228` (safe — these are deploy-staging copies of
+build artifacts, not live node data). Re-ran `.198`'s deploy after the fix; it and `.228`/`.116` are
+now all on `84d35b3b` and healthy.
+
+**Also checked** (per user request) the broader Tailscale fleet for the same bloat, at IPs the user
+supplied: `100.72.136.5`, `100.89.209.89`, `100.70.96.88`, `100.82.34.38` were all clean (no
+`releases/`/`.venv`, 13-32% disk used) — not part of this deploy round, just checked for bloat.
+`100.66.157.120` was intentionally **not touched** (reserved as another developer's dev node per
+[[reference_test_deploy_roster]]). `100.64.83.15` and `100.102.169.103` were **unreachable** with
+every credential combination in memory (both `archipelago`/`debian` users, all 3 known passwords,
+plus a `tailscale nc` proxy attempt for the timed-out one) — need the user to supply correct
+access details if these need checking later.
+
+`.116`'s HTTPS not responding is **not a bug** — that node's nginx only binds `:80` by design (a
+pre-existing dev-node config, see [[reference_116_dev_node]]), unrelated to this deploy.
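
Follow-up sketch, not part of the patch above. The write-up notes that the post-deploy health check kept passing on a full disk because it never looks at free space. A free-space probe along the following lines could close that gap. This is only a minimal illustration: the function names `free_fraction` and `disk_ok` and the 10% threshold are assumptions made here, and nothing in the patch shows how the real health check is structured or what language it is written in.

```python
import shutil


def free_fraction(path: str = "/") -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (all in bytes)
    if usage.total == 0:
        # Pseudo-filesystems can report a zero size; treat that as "no free space".
        return 0.0
    return usage.free / usage.total


def disk_ok(path: str = "/", min_free: float = 0.10) -> bool:
    """True when at least `min_free` of the filesystem is free.

    The 10% default is a hypothetical threshold, not a value taken from the
    deploy script or its health check.
    """
    return free_fraction(path) >= min_free
```

A health check could call `disk_ok("/")` on the target after the rsync step and fail the deploy when it returns `False`. That would have flagged `.228` when its disk reached 100% after an otherwise "successful" deploy.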