archy/.claude/plans/mutable-roaming-pancake.md

358 lines
14 KiB
Markdown
Raw Normal View History

fix: overhaul container lifecycle — recovery, health, uninstall, UI state Container recovery: - Health monitor: MAX_RESTART_ATTEMPTS 3→10, interval 60s→120s - Dependency-aware restarts: won't restart services before their deps - Reset dependent counters when a dependency recovers - Handle "created" state containers (were invisible to health monitor) - Added IndeedHub, mempool-api, mysql to tier system - Crash recovery: podman start timeout 30s→120s with retry - Podman client: socket timeout 5s→30s, added restart policy UI state representation: - Exit code 0 shows "stopped" (gray), not "crashed" (red) - Exit code 137 shows "killed (OOM)" - Non-zero exit shows "crashed" (red) - Added exit_code field to PackageDataEntry Install/uninstall fixes: - Install returns error when container doesn't start (was silent success) - Post-install hooks awaited instead of fire-and-forget tokio::spawn - Uninstall: graceful rm before force, volume prune, network cleanup - Uninstall returns error on partial failure (was 200 OK) Config consistency: - DB passwords read from /var/lib/archipelago/secrets/ (was hardcoded) - Bitcoin: added ZMQ ports 28332/28333 for LND block notifications - IndeedHub port 7777→8190 (was conflicting with strfry) - Marketplace versions: LND 0.17.4→0.18.4, Mempool 2.5.0→3.0.0 Performance: - Metrics collector interval 60s→300s (was duplicating health monitor) - Podman client: proper error propagation instead of unwrap_or_default Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 07:03:57 +01:00
# Gold Standard Claude Code Configuration — Archipelago
## Context
The last optimization (2026-03-28) cut CLAUDE.md from 130→101 lines and skills from 33→11. That was the right first pass. This plan is the second pass: fixing structural issues the first cleanup didn't address — hook duplication, memory chaos, a leaked API key, missing path scoping, context budget waste, and underutilized agent/permission systems. The goal is a configuration so tight that re-running this audit would produce zero suggestions.
**Research base**: Every file in `.claude/` (project + global), all 26 project memories, all 8 auto-memories, all 11 skills, all 5 rules, all 11 hooks, both settings files, the iframe-specialist agent, the full project structure (core/, neode-ui/, scripts/, image-recipe/, apps/, .gitea/), latest Claude Code docs (CLAUDE.md best practices, hooks v2.1.85+, skills frontmatter, agents, memory, permissions, MCP, context management, agent teams), and the 2026-03-28 cleanup feedback.
**Governing principle** (carried from cleanup): *Every line must prevent a specific mistake Claude would otherwise make. If Claude does it right without the instruction, it's noise.*
---
## Phase 0: CRITICAL — Remove Leaked Secret
**File**: `.claude/memory/deploy-automation.md` (line 11)
Contains a plaintext Anthropic API key: `sk-ant-api03-...`
**Action**: Remove the key immediately. Replace with: `"ANTHROPIC_API_KEY from secrets store (never stored in memory files)"`
This is the only blocking item. Everything else is optimization.
---
## Phase 1: CLAUDE.md — Trim to ~75 Lines
**File**: `/Users/dorian/Projects/archy/CLAUDE.md`
**Current**: 101 lines | **Target**: ~75 lines | **Saves**: ~500 tokens/session
### What to cut (reference data that doesn't prevent mistakes)
| Section | Lines | Action | Reason |
|---------|-------|--------|--------|
| Infrastructure table | 21-30 | Move to auto-memory | Reference data, not a rule. Already in memory files |
| ISO debug commands | 79-84 | Move to `iso-debug` skill reference | Diagnostic commands, not rules |
| Kiosk toggle info | 85-86 | Move to auto-memory or delete | Reference, not a rule |
| "Backend binds 127.0.0.1" | 63 | Move to new backend rule | Claude can read the code |
| "Timeouts on all external operations" | 65 | Move to new backend rule | Already in `rules/api.md` |
### What to add
```markdown
## Compact Instructions
When compacting, preserve: list of modified files, test results, deploy target state, current branch.
```
This costs 2 lines but saves entire sessions from losing critical context.
### Resulting structure (~75 lines)
```
Lines 1-2: Project description + stack
Lines 3-6: Beta freeze notice
Lines 7-12: Quick reference (dev, build, deploy commands)
Lines 13-18: Architecture diagram (compact)
Lines 19-20: Data paths
Lines 21-26: Critical Rules (5 rules)
Lines 27-33: App Integration Checklist
Lines 34-36: Git conventions
Lines 37-39: Compact instructions
```
Infrastructure table moves to auto-memory where it's still loaded at session start.
---
## Phase 2: Hook Deduplication — Eliminate Double Execution
### Problem
Every `Bash` call runs **both** global `pretooluse-bash.sh` AND project `block-risky-bash.sh`. Every `Edit|Write` call runs **both** global `pretooluse-files.sh` AND project `protect-files.sh`. They overlap on ~80% of patterns (rm -rf, git reset --hard, .git/ edits, .env files, etc.).
**Cost**: 2 extra Python processes per tool call, checking the same patterns twice.
### Solution: Project hooks become project-specific only
**File**: `.claude/hooks/block-risky-bash.sh`
**Action**: Strip all patterns already covered by global hook. Keep ONLY:
- Cargo build on macOS (Archy-specific: "build on dev server via SSH")
- Path traversal with rm (more aggressive check than global)
~15 lines instead of ~80.
**File**: `.claude/hooks/protect-files.sh`
**Action**: Strip all patterns already covered by global hook. Keep ONLY:
- `scripts/deploy-config.sh` (Archy-specific credential file)
- Path-outside-project check (project-specific boundary)
~20 lines instead of ~75.
**Global hooks stay unchanged** — they're the universal baseline.
### Result
- Before: 4 Python processes per Bash call (2 global + 2 project parsing same JSON)
- After: 2 Python processes per Bash call (1 global comprehensive + 1 tiny project-specific)
---
## Phase 3: Memory System — Consolidate and Clean
### Problem
Two separate memory systems with overlapping content:
1. **Auto-memory** (`~/.claude/projects/-Users-dorian-Projects-archy/memory/`) — 8 files, auto-loaded
2. **Project memory** (`.claude/memory/`) — 26 files, NOT auto-loaded
Claude sees auto-memory every session. Project memory only loads if Claude manually reads it.
### Solution: Curate auto-memory, keep project memory as archive
**Auto-memory MEMORY.md** — restructure to ~25 lines with the most critical feedback:
```markdown
# Archipelago Project Memory
## Critical Feedback (prevent recurring mistakes)
- [Direct Port Rule](feedback_apps_always_direct_port.md) — Apps MUST use direct port, NEVER proxy paths
- [External URLs](feedback_external_urls_iframe.md) — Open https:// directly, never /ext/
- [Deploy All Nodes](feedback_indeedhub_deploy_all_servers.md) — Deploy to ALL nodes
- [No Tor Publishing](feedback_no_tor_relay_publishing.md) — Never publish .onion to relays
- [UFW Forward](feedback_podman_ufw_forward.md) — DEFAULT_FORWARD_POLICY=ACCEPT
- [Deploy Patterns](feedback_deploy_patterns.md) — Rootless port 80, cred sync, image export
- [Asset Workflow](feedback_asset_workflow.md) — Never generate images, user is designer
- [ASCII Logo](feedback_logo_ascii.md) — Block-letter logo locked, never change
- [Claude Cleanup](feedback_claude_cleanup.md) — Instruction optimization principles
## Infrastructure
- [CI/CD & Registry](reference_cicd_registry.md) — git.tx1138.com, act_runner, insecure registry
- [Multi-Node Deploy](reference_multi_node_deploy.md) — 5 nodes, SSH keys, deploy methods
- [Infrastructure Quick Ref](reference_infrastructure.md) — IPs, passwords, SSH keys (moved from CLAUDE.md)
## Project State
- [ISO Testing](project_iso_testing_plan.md) — Hardware matrix, boot compatibility
- [ISO Custom Base](project_iso_size_reduction.md) — Debootstrap ISO, remaining issues
## Archive
Detailed project memory in .claude/memory/MEMORY.md (26 files, not auto-loaded).
```
**New auto-memory files to create** (migrated from project memory):
- `feedback_apps_always_direct_port.md` — Broken THREE TIMES, highest-value feedback
- `feedback_deploy_patterns.md` — Hard-won container patterns
- `feedback_asset_workflow.md` — Prevents wasted effort generating images
- `feedback_logo_ascii.md` — Prevents changing locked-in branding
- `reference_infrastructure.md` — Infrastructure table from CLAUDE.md (IPs, SSH, passwords)
**Project memory (.claude/memory/)**:
- Add comment at top of MEMORY.md: `<!-- Archive: not auto-loaded. Active memory at ~/.claude/projects/.../memory/ -->`
- Fix `deploy-automation.md` (Phase 0 — remove API key)
- Update `unbundled-iso.md` (still says "NOT YET BUILT")
---
## Phase 4: Permissions — Auto-Approve Safe Commands
**File**: `.claude/settings.local.json`
**Current**: Only `ssh:*` and `gh api:*` allowed.
**Updated** — add read-only and build/test commands:
```json
{
"permissions": {
"allow": [
"Bash(ssh:*)",
"Bash(gh api:*)",
"Bash(cd neode-ui*)",
"Bash(npm run *)",
"Bash(npm test*)",
"Bash(npm start*)",
"Bash(npx vue-tsc*)",
"Bash(npx vitest*)",
"Bash(git log*)",
"Bash(git diff*)",
"Bash(git status*)",
"Bash(git branch*)",
"Bash(git show*)",
"Bash(git stash*)",
"Bash(cargo check*)",
"Bash(cargo clippy*)",
"Bash(cargo test*)",
"Bash(journalctl*)",
"Bash(systemctl status*)",
"Bash(ls *)",
"Bash(wc *)",
"Bash(file *)",
"Bash(xxd *)",
"Bash(df *)",
"Bash(du *)"
]
}
}
```
**NOT auto-approved** (still require confirmation):
- `git push/commit` — Affects remote/creates state
- `cargo build` — Blocked by hook on macOS anyway
- `npm install` — Modifies dependencies
- `./scripts/deploy-*` — Deploys to servers
- `rm`, `mv`, `cp` — Potentially destructive
---
## Phase 5: Merge iso-branding into build-iso
**Problem**: `iso-branding` is a pure design reference, only relevant during ISO builds. Its description consumes skill budget.
**Action**:
1. Move `.claude/skills/iso-branding/SKILL.md` content → `.claude/skills/build-iso/references/branding.md`
2. Update `build-iso/SKILL.md` to reference the branding file
3. Delete `.claude/skills/iso-branding/` directory
**Skill count**: 11 → 10
---
## Phase 6: Add Backend Rule File
**Problem**: No path-scoped rule for Rust backend. 3 backend rules sit in CLAUDE.md (loaded every session even for frontend-only work).
**New file**: `.claude/rules/backend.md`
```markdown
---
globs:
- "core/**/*.rs"
- "core/**/Cargo.toml"
---
# Backend Rules (Archipelago — Rust)
- Backend binds `127.0.0.1` only — nginx handles external access
- Validate all input before path construction — reject `..`, `/`, null bytes
- Timeouts on all external operations (10s default, 30s heavy)
- Use `anyhow::Result` for error propagation, not `.unwrap()` in handlers
- Log with `tracing`, never `println!` or `eprintln!` in production paths
- Container commands through `PodmanClient` (core/container/), never raw Command::new("podman")
```
Delete the Backend section from CLAUDE.md (moved here).
---
## Phase 7: Tighten prompt-injection-detect.sh
**Problem**: `context_manipulation` pattern matches `IMPORTANT:`, `CRITICAL:`, `<system>` — normal in code/docs. Creates false positive warnings.
**Action**: Tighten the `context_manipulation` regex to require injection-specific signatures:
```bash
# OLD (too broad):
"IMPORTANT:|CRITICAL:|SYSTEM:|ADMIN:|<system>|</system>|<instructions>"
# NEW (specific):
"(?:^|\s)(?:SYSTEM|ADMIN):\s*(?:you are|ignore|forget|override|new instructions)|<(?:system|instructions)>.*(?:ignore|override|forget)"
```
---
## Phase 8: Add 2 Focused Agents
**Current**: 1 agent (iframe-specialist, 678 lines)
**Add**:
### `.claude/agents/deploy-specialist.md`
```yaml
---
name: deploy-specialist
description: Deploys to all 5 Archipelago nodes. Knows SSH access, build capabilities, post-deploy verification.
tools: Bash, Read, Grep, Glob
model: sonnet
---
```
Body: Node inventory, deploy workflow, IndeedHub multi-node rules, post-deploy checklist.
### `.claude/agents/code-reviewer.md`
```yaml
---
name: code-reviewer
description: Reviews code against Archipelago standards — frontend patterns, Rust safety, container security, crypto rules.
tools: Read, Grep, Glob
model: sonnet
---
```
Body: Frontend rules, backend rules, container rules, security checklist.
**Agent count**: 1 → 3
---
## Phase 9: Skill Frontmatter Audit
**Problem**: Action skills that have side effects should have `disable-model-invocation: true` to prevent Claude from auto-invoking them.
| Skill | Has `disable-model-invocation: true`? | Needs it? |
|-------|--------------------------------------|-----------|
| add-app | Yes | Yes (side effects) |
| add-web-app | Verify | Yes |
| build-iso | Verify | Yes (builds ISO) |
| iso-debug | Verify | Yes (runs diagnostics) |
| podman | Verify | Yes (modifies containers) |
| polish | Verify | Yes (modifies code) |
| sweep | Verify | Yes (runs checks, may fix) |
| mesh | No | No (reference knowledge) |
| design-pixel-retro | No | No (reference knowledge) |
| gamepad-nav | No | No (reference knowledge) |
Action: Verify and add `disable-model-invocation: true` to all 7 action skills.
---
## Summary
| Phase | Impact | Files Changed | Benefit |
|-------|--------|---------------|---------|
| 0. Remove API key | CRITICAL | 1 | Security |
| 1. Trim CLAUDE.md | HIGH | 1 | ~500 tokens/session saved |
| 2. Dedup hooks | HIGH | 2 | ~200ms faster per tool call |
| 3. Memory consolidate | HIGH | ~8 | Cleaner context, no stale data |
| 4. Permissions | MEDIUM | 1 | ~3s saved per safe command |
| 5. Merge iso-branding | LOW | 3 | 1 less skill description |
| 6. Backend rule | MEDIUM | 2 | Path-scoped, not always-loaded |
| 7. Injection hook | LOW | 1 | Fewer false positives |
| 8. New agents | MEDIUM | 2 new | Better delegation |
| 9. Skill frontmatter | LOW | ~5 | Prevents unintended auto-invoke |
**Net changes**: CLAUDE.md 101→~75 lines, skills 11→10, agents 1→3, rules 5→6, hooks 60% smaller
---
## What This Plan Does NOT Change (and why each was evaluated)
- **Global CLAUDE.md** (36 lines) — Already optimized, passes the "would removing cause mistakes?" test
- **Global hooks** (8 scripts) — Universal baseline, well-tuned, no project overlap
- **Global rules** (api, crypto, bitcoin) — Correct glob scoping, concise content
- **Global settings.json** — Plugins, effort level, hook config all justified
- **iframe-specialist agent** — Deep reference, correctly scoped, rarely loaded
- **Skills mesh/gamepad-nav/design-pixel-retro** — Tiny description cost (~120 chars each), valuable on-demand
- **MCP servers** — Not needed (self-hosted infra, no external API integrations)
- **Agent teams** — Experimental, single-developer project doesn't benefit
- **Project .claude/memory/ (26 files)** — Kept as archive with annotation
---
## Verification Checklist
After implementation:
- [ ] `grep -r "sk-ant" .claude/` returns zero results
- [ ] New session auto-loads MEMORY.md with all critical feedback
- [ ] `git status` auto-approves without permission prompt
- [ ] `/sweep` skill loads and executes correctly
- [ ] Project hooks run fast (no duplicate pattern checks)
- [ ] `cd neode-ui && npx vue-tsc -b --noEmit` passes
- [ ] Spawning deploy-specialist agent works
- [ ] CLAUDE.md is ≤80 lines
- [ ] `/context` shows reasonable token budget