lfg2025/archy

Dorian 1e283daf13 fix: overhaul container lifecycle — recovery, health, uninstall, UI state

Container recovery:
- Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s
- Dependency-aware restarts: won't restart services before their deps
- Reset dependent counters when a dependency recovers
- Handle "created" state containers (were invisible to health monitor)
- Added IndeedHub, mempool-api, mysql to tier system
- Crash recovery: podman start timeout 30s→120s with retry
- Podman client: socket timeout 5s→30s, added restart policy

UI state representation:
- Exit code 0 shows "stopped" (gray), not "crashed" (red)
- Exit code 137 shows "killed (OOM)"
- Non-zero exit shows "crashed" (red)
- Added exit_code field to PackageDataEntry

Install/uninstall fixes:
- Install returns error when container doesn't start (was silent success)
- Post-install hooks awaited instead of fire-and-forget tokio::spawn
- Uninstall: graceful rm before force, volume prune, network cleanup
- Uninstall returns error on partial failure (was 200 OK)

Config consistency:
- DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded)
- Bitcoin: added ZMQ ports 28332/28333 for LND block notifications
- IndeedHub port 7777→8190 (was conflicting with strfry)
- Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0

Performance:
- Metrics collector interval 60s→300s (was duplicating health monitor)
- Podman client: proper error propagation instead of unwrap_or_default

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-31 07:03:57 +01:00

14 KiB

Raw Blame History

Gold Standard Claude Code Configuration — Archipelago

Context

The last optimization (2026-03-28) cut CLAUDE.md from 130→101 lines and skills from 33→11. That was the right first pass. This plan is the second pass: fixing structural issues the first cleanup didn't address — hook duplication, memory chaos, a leaked API key, missing path scoping, context budget waste, and underutilized agent/permission systems. The goal is a configuration so tight that re-running this audit would produce zero suggestions.

Research base: Every file in .claude/ (project + global), all 26 project memories, all 8 auto-memories, all 11 skills, all 5 rules, all 11 hooks, both settings files, the iframe-specialist agent, the full project structure (core/, neode-ui/, scripts/, image-recipe/, apps/, .gitea/), latest Claude Code docs (CLAUDE.md best practices, hooks v2.1.85+, skills frontmatter, agents, memory, permissions, MCP, context management, agent teams), and the 2026-03-28 cleanup feedback.

Governing principle (carried from cleanup): Every line must prevent a specific mistake Claude would otherwise make. If Claude does it right without the instruction, it's noise.

Phase 0: CRITICAL — Remove Leaked Secret

File: .claude/memory/deploy-automation.md (line 11) Contains a plaintext Anthropic API key: sk-ant-api03-...

Action: Remove the key immediately. Replace with: "ANTHROPIC_API_KEY from secrets store (never stored in memory files)"

This is the only blocking item. Everything else is optimization.

Phase 1: CLAUDE.md — Trim to ~75 Lines

File: /Users/dorian/Projects/archy/CLAUDE.md Current: 101 lines | Target: ~75 lines | Saves: ~500 tokens/session

What to cut (reference data that doesn't prevent mistakes)

Section	Lines	Action	Reason
Infrastructure table	21-30	Move to auto-memory	Reference data, not a rule. Already in memory files
ISO debug commands	79-84	Move to `iso-debug` skill reference	Diagnostic commands, not rules
Kiosk toggle info	85-86	Move to auto-memory or delete	Reference, not a rule
"Backend binds 127.0.0.1"	63	Move to new backend rule	Claude can read the code
"Timeouts on all external operations"	65	Move to new backend rule	Already in `rules/api.md`

What to add

## Compact Instructions
When compacting, preserve: list of modified files, test results, deploy target state, current branch.

This costs 2 lines but saves entire sessions from losing critical context.

Resulting structure (~75 lines)

Lines 1-2:   Project description + stack
Lines 3-6:   Beta freeze notice
Lines 7-12:  Quick reference (dev, build, deploy commands)
Lines 13-18: Architecture diagram (compact)
Lines 19-20: Data paths
Lines 21-26: Critical Rules (5 rules)
Lines 27-33: App Integration Checklist
Lines 34-36: Git conventions
Lines 37-39: Compact instructions

Infrastructure table moves to auto-memory where it's still loaded at session start.

Phase 2: Hook Deduplication — Eliminate Double Execution

Problem

Every Bash call runs both global pretooluse-bash.sh AND project block-risky-bash.sh. Every Edit|Write call runs both global pretooluse-files.sh AND project protect-files.sh. They overlap on ~80% of patterns (rm -rf, git reset --hard, .git/ edits, .env files, etc.).

Cost: 2 extra Python processes per tool call, checking the same patterns twice.

Solution: Project hooks become project-specific only

File: .claude/hooks/block-risky-bash.sh Action: Strip all patterns already covered by global hook. Keep ONLY:

Cargo build on macOS (Archy-specific: "build on dev server via SSH")
Path traversal with rm (more aggressive check than global)

~15 lines instead of ~80.

File: .claude/hooks/protect-files.sh Action: Strip all patterns already covered by global hook. Keep ONLY:

scripts/deploy-config.sh (Archy-specific credential file)
Path-outside-project check (project-specific boundary)

~20 lines instead of ~75.

Global hooks stay unchanged — they're the universal baseline.

Result

Before: 4 Python processes per Bash call (2 global + 2 project parsing same JSON)
After: 2 Python processes per Bash call (1 global comprehensive + 1 tiny project-specific)

Phase 3: Memory System — Consolidate and Clean

Problem

Two separate memory systems with overlapping content:

Auto-memory (~/.claude/projects/-Users-dorian-Projects-archy/memory/) — 8 files, auto-loaded
Project memory (.claude/memory/) — 26 files, NOT auto-loaded

Claude sees auto-memory every session. Project memory only loads if Claude manually reads it.

Solution: Curate auto-memory, keep project memory as archive

Auto-memory MEMORY.md — restructure to ~25 lines with the most critical feedback:

# Archipelago Project Memory

## Critical Feedback (prevent recurring mistakes)
- [Direct Port Rule](feedback_apps_always_direct_port.md) — Apps MUST use direct port, NEVER proxy paths
- [External URLs](feedback_external_urls_iframe.md) — Open https:// directly, never /ext/
- [Deploy All Nodes](feedback_indeedhub_deploy_all_servers.md) — Deploy to ALL nodes
- [No Tor Publishing](feedback_no_tor_relay_publishing.md) — Never publish .onion to relays
- [UFW Forward](feedback_podman_ufw_forward.md) — DEFAULT_FORWARD_POLICY=ACCEPT
- [Deploy Patterns](feedback_deploy_patterns.md) — Rootless port 80, cred sync, image export
- [Asset Workflow](feedback_asset_workflow.md) — Never generate images, user is designer
- [ASCII Logo](feedback_logo_ascii.md) — Block-letter logo locked, never change
- [Claude Cleanup](feedback_claude_cleanup.md) — Instruction optimization principles

## Infrastructure
- [CI/CD & Registry](reference_cicd_registry.md) — git.tx1138.com, act_runner, insecure registry
- [Multi-Node Deploy](reference_multi_node_deploy.md) — 5 nodes, SSH keys, deploy methods
- [Infrastructure Quick Ref](reference_infrastructure.md) — IPs, passwords, SSH keys (moved from CLAUDE.md)

## Project State
- [ISO Testing](project_iso_testing_plan.md) — Hardware matrix, boot compatibility
- [ISO Custom Base](project_iso_size_reduction.md) — Debootstrap ISO, remaining issues

## Archive
Detailed project memory in .claude/memory/MEMORY.md (26 files, not auto-loaded).

New auto-memory files to create (migrated from project memory):

feedback_apps_always_direct_port.md — Broken THREE TIMES, highest-value feedback
feedback_deploy_patterns.md — Hard-won container patterns
feedback_asset_workflow.md — Prevents wasted effort generating images
feedback_logo_ascii.md — Prevents changing locked-in branding
reference_infrastructure.md — Infrastructure table from CLAUDE.md (IPs, SSH, passwords)

Project memory (.claude/memory/):

Add comment at top of MEMORY.md: 
Fix deploy-automation.md (Phase 0 — remove API key)
Update unbundled-iso.md (still says "NOT YET BUILT")

Phase 4: Permissions — Auto-Approve Safe Commands

File: .claude/settings.local.json

Current: Only ssh:* and gh api:* allowed.

Updated — add read-only and build/test commands:

{
  "permissions": {
    "allow": [
      "Bash(ssh:*)",
      "Bash(gh api:*)",
      "Bash(cd neode-ui*)",
      "Bash(npm run *)",
      "Bash(npm test*)",
      "Bash(npm start*)",
      "Bash(npx vue-tsc*)",
      "Bash(npx vitest*)",
      "Bash(git log*)",
      "Bash(git diff*)",
      "Bash(git status*)",
      "Bash(git branch*)",
      "Bash(git show*)",
      "Bash(git stash*)",
      "Bash(cargo check*)",
      "Bash(cargo clippy*)",
      "Bash(cargo test*)",
      "Bash(journalctl*)",
      "Bash(systemctl status*)",
      "Bash(ls *)",
      "Bash(wc *)",
      "Bash(file *)",
      "Bash(xxd *)",
      "Bash(df *)",
      "Bash(du *)"
    ]
  }
}

NOT auto-approved (still require confirmation):

git push/commit — Affects remote/creates state
cargo build — Blocked by hook on macOS anyway
npm install — Modifies dependencies
./scripts/deploy-* — Deploys to servers
rm, mv, cp — Potentially destructive

Phase 5: Merge iso-branding into build-iso

Problem: iso-branding is a pure design reference, only relevant during ISO builds. Its description consumes skill budget.

Action:

Move .claude/skills/iso-branding/SKILL.md content → .claude/skills/build-iso/references/branding.md
Update build-iso/SKILL.md to reference the branding file
Delete .claude/skills/iso-branding/ directory

Skill count: 11 → 10

Phase 6: Add Backend Rule File

Problem: No path-scoped rule for Rust backend. 3 backend rules sit in CLAUDE.md (loaded every session even for frontend-only work).

New file: .claude/rules/backend.md

---
globs:
  - "core/**/*.rs"
  - "core/**/Cargo.toml"
---

# Backend Rules (Archipelago — Rust)

- Backend binds `127.0.0.1` only — nginx handles external access
- Validate all input before path construction — reject `..`, `/`, null bytes
- Timeouts on all external operations (10s default, 30s heavy)
- Use `anyhow::Result` for error propagation, not `.unwrap()` in handlers
- Log with `tracing`, never `println!` or `eprintln!` in production paths
- Container commands through `PodmanClient` (core/container/), never raw Command::new("podman")

Delete the Backend section from CLAUDE.md (moved here).

Phase 7: Tighten prompt-injection-detect.sh

Problem: context_manipulation pattern matches IMPORTANT:, CRITICAL:, <system> — normal in code/docs. Creates false positive warnings.

Action: Tighten the context_manipulation regex to require injection-specific signatures:

# OLD (too broad):
"IMPORTANT:|CRITICAL:|SYSTEM:|ADMIN:|<system>|</system>|<instructions>"

# NEW (specific):
"(?:^|\s)(?:SYSTEM|ADMIN):\s*(?:you are|ignore|forget|override|new instructions)|<(?:system|instructions)>.*(?:ignore|override|forget)"

Phase 8: Add 2 Focused Agents

Current: 1 agent (iframe-specialist, 678 lines)

Add:

`.claude/agents/deploy-specialist.md`

---
name: deploy-specialist
description: Deploys to all 5 Archipelago nodes. Knows SSH access, build capabilities, post-deploy verification.
tools: Bash, Read, Grep, Glob
model: sonnet
---

Body: Node inventory, deploy workflow, IndeedHub multi-node rules, post-deploy checklist.

`.claude/agents/code-reviewer.md`

---
name: code-reviewer
description: Reviews code against Archipelago standards — frontend patterns, Rust safety, container security, crypto rules.
tools: Read, Grep, Glob
model: sonnet
---

Body: Frontend rules, backend rules, container rules, security checklist.

Agent count: 1 → 3

Phase 9: Skill Frontmatter Audit

Problem: Action skills that have side effects should have disable-model-invocation: true to prevent Claude from auto-invoking them.

Skill	Has `disable-model-invocation: true`?	Needs it?
add-app	Yes	Yes (side effects)
add-web-app	Verify	Yes
build-iso	Verify	Yes (builds ISO)
iso-debug	Verify	Yes (runs diagnostics)
podman	Verify	Yes (modifies containers)
polish	Verify	Yes (modifies code)
sweep	Verify	Yes (runs checks, may fix)
mesh	No	No (reference knowledge)
design-pixel-retro	No	No (reference knowledge)
gamepad-nav	No	No (reference knowledge)

Action: Verify and add disable-model-invocation: true to all 7 action skills.

Summary

Phase	Impact	Files Changed	Benefit
0. Remove API key	CRITICAL	1	Security
1. Trim CLAUDE.md	HIGH	1	~500 tokens/session saved
2. Dedup hooks	HIGH	2	~200ms faster per tool call
3. Memory consolidate	HIGH	~8	Cleaner context, no stale data
4. Permissions	MEDIUM	1	~3s saved per safe command
5. Merge iso-branding	LOW	3	1 less skill description
6. Backend rule	MEDIUM	2	Path-scoped, not always-loaded
7. Injection hook	LOW	1	Fewer false positives
8. New agents	MEDIUM	2 new	Better delegation
9. Skill frontmatter	LOW	~5	Prevents unintended auto-invoke

Net changes: CLAUDE.md 101→~75 lines, skills 11→10, agents 1→3, rules 5→6, hooks 60% smaller

What This Plan Does NOT Change (and why each was evaluated)

Global CLAUDE.md (36 lines) — Already optimized, passes the "would removing cause mistakes?" test
Global hooks (8 scripts) — Universal baseline, well-tuned, no project overlap
Global rules (api, crypto, bitcoin) — Correct glob scoping, concise content
Global settings.json — Plugins, effort level, hook config all justified
iframe-specialist agent — Deep reference, correctly scoped, rarely loaded
Skills mesh/gamepad-nav/design-pixel-retro — Tiny description cost (~120 chars each), valuable on-demand
MCP servers — Not needed (self-hosted infra, no external API integrations)
Agent teams — Experimental, single-developer project doesn't benefit
Project .claude/memory/ (26 files) — Kept as archive with annotation

Verification Checklist

After implementation:

grep -r "sk-ant" .claude/ returns zero results
New session auto-loads MEMORY.md with all critical feedback
git status auto-approves without permission prompt
/sweep skill loads and executes correctly
Project hooks run fast (no duplicate pattern checks)
cd neode-ui && npx vue-tsc -b --noEmit passes
Spawning deploy-specialist agent works
CLAUDE.md is ≤80 lines
/context shows reasonable token budget

14 KiB Raw Blame History

Gold Standard Claude Code Configuration — Archipelago

Context

Phase 0: CRITICAL — Remove Leaked Secret

Phase 1: CLAUDE.md — Trim to ~75 Lines

What to cut (reference data that doesn't prevent mistakes)

What to add

Resulting structure (~75 lines)

Phase 2: Hook Deduplication — Eliminate Double Execution

Problem

Solution: Project hooks become project-specific only

Result

Phase 3: Memory System — Consolidate and Clean

Problem

Solution: Curate auto-memory, keep project memory as archive

Phase 4: Permissions — Auto-Approve Safe Commands

Phase 5: Merge iso-branding into build-iso

Phase 6: Add Backend Rule File

Phase 7: Tighten prompt-injection-detect.sh

Phase 8: Add 2 Focused Agents

.claude/agents/deploy-specialist.md

.claude/agents/code-reviewer.md

Phase 9: Skill Frontmatter Audit

Summary

What This Plan Does NOT Change (and why each was evaluated)

Verification Checklist

14 KiB

Raw Blame History

`.claude/agents/deploy-specialist.md`

`.claude/agents/code-reviewer.md`