Project · Shipped
System Health
Something in my personal OS breaks silently roughly once a fortnight — a launchd job stops firing, an OAuth token expires, a scheduled task gets disabled. System Health is the Python script that notices before I do.
What it is
System Health scans the personal OS every 15 minutes — launchd plists, scheduled tasks, git hooks, sync scripts, OAuth tokens, vault integrity — verifies each one is doing what it claims, and writes the result to a health_checks table in os.db plus a markdown report. The morning briefing reads that report at 8:27am and surfaces warnings with the exact command to run.
Currently 47 ok / 1 warn across 48 checks. It started at 22; the system keeps growing, and so does the check list — for free.
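The table itself isn't spelled out above. As a rough sketch of what a `health_checks` table shaped like this could look like, the column names, types, and composite key below are assumptions, not the real os.db schema:

```python
import sqlite3

# Hypothetical sketch of the health_checks table; the column names and
# the (name, checked_at) key are assumptions, not the actual layout.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS health_checks (
        name       TEXT NOT NULL,   -- e.g. 'launchd:com.reasons.sync'
        status     TEXT NOT NULL,   -- 'ok' | 'warn' | 'fail'
        message    TEXT,            -- carries the fix command on warn/fail
        checked_at TEXT NOT NULL,   -- ISO-8601, lets old ok rows be pruned
        PRIMARY KEY (name, checked_at)
    )
""")
# Upsert: re-running the same check in the same instant updates in place.
conn.execute(
    "INSERT INTO health_checks VALUES (?, ?, ?, datetime('now')) "
    "ON CONFLICT(name, checked_at) DO UPDATE SET "
    "status = excluded.status, message = excluded.message",
    ("launchd:com.reasons.sync", "ok", ""),
)
status = conn.execute(
    "SELECT status FROM health_checks WHERE name = ?",
    ("launchd:com.reasons.sync",),
).fetchone()[0]
```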
Why it exists
A personal OS with ten-plus automations fails silently. Something stops firing; nothing alerts me; I notice three days later when vault-state.md has yesterday's timestamp. ADHD compounds this specifically: task initiation is the hard part, and I can't hold the specifics of "is the Whoop token still auto-refreshing?", "is the post-commit hook still executable?", and "did the morning briefing actually read the file it was supposed to?" in working memory at once. I need the system to tell me what's wrong, not require me to remember what to check.
Shape
Three steps, in order.
- Discover. Walk the filesystem — ~/Library/LaunchAgents/com.reasons.*.plist, ~/.claude/scheduled-tasks/*/SKILL.md, ~/reasons_brain/.git/hooks/*, ~/.claude/skills/*/SKILL.md, ~/reasons-os/packages/sync/*.py, ~/.whoop_tokens.json, os.db tables. No hardcoded checklist. New launchd jobs pick up monitoring automatically.
- Check. Run the type-specific test — plist loaded? log file fresh? token not expired? output file less than 2× its interval old? — and return ok | warn | fail plus a message.
- Write. Upsert into health_checks; render 40-reference/system-health-report.md grouped by severity.
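The three steps above reduce to a single loop. This is a sketch, not the real script: check_freshness and the 900-second interval are illustrative stand-ins for the per-type tests.

```python
import glob
import os
import time
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    status: str   # 'ok' | 'warn' | 'fail'
    message: str  # on warn/fail, includes the exact command to run

def check_freshness(path: str, interval_secs: float) -> CheckResult:
    """One type-specific test: an output file is healthy while it is
    less than 2x its expected interval old."""
    if not os.path.exists(path):
        return CheckResult(path, "fail", f"{path} missing")
    age = time.time() - os.path.getmtime(path)
    if age < 2 * interval_secs:
        return CheckResult(path, "ok", "")
    return CheckResult(path, "warn", f"{path} stale by {age:.0f}s")

def run_checks() -> list[CheckResult]:
    # Discover: glob the filesystem instead of a hardcoded checklist,
    # so anything new that matches the pattern is monitored automatically.
    pattern = os.path.expanduser("~/Library/LaunchAgents/com.reasons.*.plist")
    return [check_freshness(p, interval_secs=900) for p in glob.glob(pattern)]
```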
Load-bearing bits
Fix commands as a design constraint
Every check that can fail ships with a concrete remediation command in its message field. Not "check your Whoop token" — python ~/reasons-os/packages/sync/pull_whoop.py --auth. The gap between "I see it's broken" and "I know what to type" is where ADHD tasks stall out. Hardcoding the what-to-type back into the warning closes that loop.
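A token check in this style might look like the following sketch. The expires_at field name and the file layout are assumptions; the point is the fix command travelling inside the message.

```python
import json
import os
import time

def check_whoop_token(path: str = "~/.whoop_tokens.json") -> tuple[str, str]:
    """Return (status, message); the message carries the exact fix command."""
    fix = "python ~/reasons-os/packages/sync/pull_whoop.py --auth"
    full = os.path.expanduser(path)
    if not os.path.exists(full):
        return ("fail", f"token file missing, run: {fix}")
    with open(full) as f:
        tokens = json.load(f)
    # 'expires_at' as a unix timestamp is an assumed field name.
    if tokens.get("expires_at", 0) < time.time():
        return ("warn", f"token expired, run: {fix}")
    return ("ok", "")
```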
Two cadences
Fast checks (tokens, freshness, launchd status) run every 15 minutes with the sync pipeline. Vault integrity (orphan notes, broken wikilinks, now.md staleness) runs daily via a --full flag before the morning briefing — same-day visibility without paying the vault-walk cost twelve times an hour.
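A minimal sketch of the two-cadence split, assuming a plain argparse --full flag; the check-group names are illustrative:

```python
import argparse

def main(argv=None) -> list[str]:
    parser = argparse.ArgumentParser(description="system health sweep")
    # Default run is the fast 15-minute pass; --full adds the expensive
    # vault walk once a day, before the morning briefing.
    parser.add_argument("--full", action="store_true",
                        help="include daily vault-integrity checks")
    args = parser.parse_args(argv)
    checks = ["tokens", "freshness", "launchd"]
    if args.full:
        checks += ["orphan-notes", "broken-wikilinks", "now.md-staleness"]
    return checks
```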
Self-monitoring sentinel
The script touches .health_check_last_success on successful completion. On the next run, that sentinel's age is the first check. If the health checker itself is broken, the report says so before discussing anything else. Turtles-all-the-way-down, handled explicitly.
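A sketch of that sentinel logic, with the path passed in for clarity; the 2×-interval threshold is borrowed from the freshness rule as an assumption:

```python
import pathlib
import time

def check_self(sentinel: pathlib.Path,
               interval_secs: float = 900) -> tuple[str, str]:
    # First check of every run: did the previous run finish recently?
    if not sentinel.exists():
        return ("fail", "sentinel missing; health checker never completed")
    age = time.time() - sentinel.stat().st_mtime
    if age > 2 * interval_secs:
        return ("fail", f"health checker itself is stale ({age:.0f}s old)")
    return ("ok", "")

def mark_success(sentinel: pathlib.Path) -> None:
    # Touched only after the whole run completes without raising.
    sentinel.touch()
```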
Retention
DELETE FROM health_checks WHERE checked_at < date('now', '-30 days') AND status = 'ok'. Warn and fail rows stay forever. If something has been broken for thirty days, that history is evidence, not noise.
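The rule can be demonstrated end to end in a few lines; the three-column table here is a stand-in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health_checks (name TEXT, status TEXT, checked_at TEXT)")
conn.executemany(
    "INSERT INTO health_checks VALUES (?, ?, date('now', ?))",
    [
        ("old-ok",   "ok",   "-45 days"),  # pruned: stale and healthy
        ("old-warn", "warn", "-45 days"),  # kept: broken history is evidence
        ("new-ok",   "ok",   "-1 days"),   # kept: inside the 30-day window
    ],
)
conn.execute(
    "DELETE FROM health_checks "
    "WHERE checked_at < date('now', '-30 days') AND status = 'ok'"
)
remaining = sorted(n for (n,) in conn.execute("SELECT name FROM health_checks"))
```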
What caught the first real failure
An expired Whoop token. The briefing that morning opened with python ~/reasons-os/packages/sync/pull_whoop.py --auth. I ran it before finishing coffee. Worked.