docs: record fleet-deploy ENOSPC bug + fix + cleanup outcome

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
2026-07-01 11:01:27 -04:00
commit e3baaa5de3
parent 84d35b3b68
1 changed files with 32 additions and 0 deletions
--- a/docs/PRODUCTION-MASTER-PLAN.md
+++ b/docs/PRODUCTION-MASTER-PLAN.md
@@ -1226,3 +1226,35 @@ fails under full-parallel `--workspace` runs, and never on the same test twice
 test-fixture/tempfile collision generating non-UTF8 bytes under parallelism, not a real credentials
 bug and not related to anything touched this session. Worth a real fix at some point (a test isolation
 issue makes CI flaky) but out of scope here.
 ## 15. Fleet deploy of this session's fixes + deploy-script ENOSPC bug (2026-07-01)
 User asked to build+deploy all 8 fixes above to `.116`/`.198`/`.228` via
 `scripts/deploy-to-target.sh`. **Found and fixed a real bug in the deploy script itself**: its
 `rsync --exclude` list never excluded `releases/` (the local repo's own historical build artifacts:
 dozens of versioned binaries + frontend tarballs, 7-10GB) or `reticulum-daemon/.venv` (a Python
 virtualenv bundling PyInstaller, from ~87MB to several hundred MB depending on state), so every
 deploy synced both to the target's root disk. This **filled `.198` (29GB disk) to 100% mid-deploy**,
 aborting that deploy with `rsync: ... No space left on device`, and **filled `.228` to 100% right
 after a "successful" deploy** (the post-deploy health check kept passing throughout because it
 doesn't check free disk space, so nothing alerted). Neither node's services were corrupted
 (verified: containers unaffected, HTTP/HTTPS still 200 after disk was freed); the risk was
 latent (the next log/DB write failing), not realized.
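 A minimal sketch of the kind of free-space check the health check lacks. This is not the existing
 health-check code: `disk_used_pct`, `disk_has_headroom`, and the 90% default threshold are
 hypothetical names and values chosen for illustration. It assumes `df -P` (POSIX output format) is
 available on the target and that the filesystem name contains no spaces.

 ```shell
 # Hypothetical helper: print the used-space percentage (digits only) of the
 # filesystem that holds the path given as $1, parsed from POSIX-format df output.
 disk_used_pct() {
   df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
 }

 # Hypothetical check: succeed only if the used percentage ($1) is below the
 # allowed maximum ($2, defaulting to 90, an assumed threshold).
 disk_has_headroom() {
   [ "$1" -lt "${2:-90}" ]
 }

 # Possible use at the end of a post-deploy check (commented out in this sketch):
 # used="$(disk_used_pct /)"
 # disk_has_headroom "$used" 90 || { echo "root disk at ${used}%" >&2; exit 1; }
 ```

 With a check like this, a node sitting at 100% after a "successful" deploy would fail the health
 check instead of passing silently.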
 **Fixed**: added `--exclude 'releases'` (`aa849849`) and `--exclude '.venv'` (`84d35b3b`) to the
 rsync command in `scripts/deploy-to-target.sh:545-559`. Manually removed the already-synced
 `releases/`+`.venv` copies from `.116`/`.198`/`.228` (safe — these are deploy-staging copies of
 build artifacts, not live node data). Re-ran `.198`'s deploy after the fix; all three nodes
 (`.116`/`.198`/`.228`) are now on `84d35b3b` and healthy.
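 A sketch of how such excludes could be kept in one place. This is not the actual command in
 `scripts/deploy-to-target.sh`: `build_excludes` is a hypothetical helper, and the source/destination
 in the commented usage line are placeholders. It uses rsync's `--exclude=PATTERN` form, which is
 equivalent to the `--exclude 'PATTERN'` form used in the fix.

 ```shell
 # Hypothetical helper: turn a list of directory names into rsync exclude flags,
 # one flag per line, so bulky local-only directories are listed in a single spot.
 build_excludes() {
   for name in "$@"; do
     printf '%s\n' "--exclude=$name"
   done
 }

 # Possible use (commented out; destination is a placeholder, not a real path).
 # --dry-run with --stats reports what would be transferred without copying anything,
 # which makes an unexpectedly large sync visible before it reaches the target disk.
 # rsync -a --dry-run --stats $(build_excludes releases .venv) ./ user@target:/some/dir/
 ```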
 **Also checked** (per user request) the broader Tailscale fleet for the same bloat, at IPs the user
 supplied: `100.72.136.5`, `100.89.209.89`, `100.70.96.88`, `100.82.34.38` were all clean (no
 `releases/`/`.venv`, 13-32% disk used) — not part of this deploy round, just checked for bloat.
 `100.66.157.120` was intentionally **not touched** (reserved as another developer's dev node per
 [[reference_test_deploy_roster]]). `100.64.83.15` and `100.102.169.103` were **unreachable** with
 every credential combination in memory (both `archipelago`/`debian` users, all 3 known passwords,
 plus a `tailscale nc` proxy attempt for the timed-out one) — need the user to supply correct
 access details if these need checking later.
 `.116`'s HTTPS not responding is **not a bug** — that node's nginx only binds `:80` by design (a
 pre-existing dev-node config, see [[reference_116_dev_node]]), unrelated to this deploy.