Project · Shipped
System Health
Something in my personal OS breaks silently roughly once a fortnight — a launchd job stops firing, an OAuth token expires, a scheduled task gets disabled. System Health is the Python script that notices before I do.
What it is
System Health scans the personal OS every 15 minutes — launchd plists, scheduled tasks, git hooks, sync scripts, OAuth tokens, vault integrity — verifies each one is doing what it claims, and writes the result to a health_checks table in os.db plus a markdown report. The morning briefing reads that report at 8:27am and surfaces warnings with the exact command to run.
Currently 47 ok / 1 warn across 48 checks. It started at 22; the system keeps growing, and so does the check list — for free.
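The table itself isn't spelled out above. As a rough sketch of what a `health_checks` table shaped like this could look like, the column names, types, and composite key below are assumptions, not the real os.db schema:

```python
import sqlite3

# Hypothetical sketch of the health_checks table; the column names and
# the (name, checked_at) key are assumptions, not the actual layout.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS health_checks (
        name       TEXT NOT NULL,   -- e.g. 'launchd:com.reasons.sync'
        status     TEXT NOT NULL,   -- 'ok' | 'warn' | 'fail'
        message    TEXT,            -- carries the fix command on warn/fail
        checked_at TEXT NOT NULL,   -- ISO-8601, lets old ok rows be pruned
        PRIMARY KEY (name, checked_at)
    )
""")
# Upsert: re-running the same check in the same instant updates in place.
conn.execute(
    "INSERT INTO health_checks VALUES (?, ?, ?, datetime('now')) "
    "ON CONFLICT(name, checked_at) DO UPDATE SET "
    "status = excluded.status, message = excluded.message",
    ("launchd:com.reasons.sync", "ok", ""),
)
status = conn.execute(
    "SELECT status FROM health_checks WHERE name = ?",
    ("launchd:com.reasons.sync",),
).fetchone()[0]
```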
Why it exists
A personal OS with ten-plus automations fails silently. Something stops firing; nothing alerts me; I notice three days later when vault-state.md has yesterday's timestamp. ADHD compounds this specifically: task initiation is the hard part, and I can't hold the specifics of "is the Whoop token still auto-refreshing?", "is the post-commit hook still executable?", and "did the morning briefing actually read the file it was supposed to?" in working memory at once. I need the system to tell me what's wrong, not require me to remember what to check.
Shape
Three steps, in order.
- Discover. Walk the filesystem — ~/Library/LaunchAgents/com.reasons.*.plist, ~/.claude/scheduled-tasks/*/SKILL.md, ~/reasons_brain/.git/hooks/*, ~/.claude/skills/*/SKILL.md, ~/reasons-os/packages/sync/*.py, ~/.whoop_tokens.json, os.db tables. No hardcoded checklist. New launchd jobs pick up monitoring automatically.
- Check. Run the type-specific test — plist loaded? log file fresh? token not expired? output file less than 2× its interval old? — and return ok | warn | fail plus a message.
- Write. Upsert into health_checks; render 40-reference/system-health-report.md grouped by severity.
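The three steps above reduce to a single loop. This is a sketch, not the real script: check_freshness and the 900-second interval are illustrative stand-ins for the per-type tests.

```python
import glob
import os
import time
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    status: str   # 'ok' | 'warn' | 'fail'
    message: str  # on warn/fail, includes the exact command to run

def check_freshness(path: str, interval_secs: float) -> CheckResult:
    """One type-specific test: an output file is healthy while it is
    less than 2x its expected interval old."""
    if not os.path.exists(path):
        return CheckResult(path, "fail", f"{path} missing")
    age = time.time() - os.path.getmtime(path)
    if age < 2 * interval_secs:
        return CheckResult(path, "ok", "")
    return CheckResult(path, "warn", f"{path} stale by {age:.0f}s")

def run_checks() -> list[CheckResult]:
    # Discover: glob the filesystem instead of a hardcoded checklist,
    # so anything new that matches the pattern is monitored automatically.
    pattern = os.path.expanduser("~/Library/LaunchAgents/com.reasons.*.plist")
    return [check_freshness(p, interval_secs=900) for p in glob.glob(pattern)]
```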
Load-bearing bits
Fix commands as a design constraint
Every check that can fail ships with a concrete remediation command in its message field. Not "check your Whoop token" — python ~/reasons-os/packages/sync/pull_whoop.py --auth. The gap between "I see it's broken" and "I know what to type" is where ADHD tasks stall out. Hardcoding the what-to-type back into the warning closes that loop.
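A token check in this style might look like the following sketch. The expires_at field name and the file layout are assumptions; the point is the fix command travelling inside the message.

```python
import json
import os
import time

def check_whoop_token(path: str = "~/.whoop_tokens.json") -> tuple[str, str]:
    """Return (status, message); the message carries the exact fix command."""
    fix = "python ~/reasons-os/packages/sync/pull_whoop.py --auth"
    full = os.path.expanduser(path)
    if not os.path.exists(full):
        return ("fail", f"token file missing, run: {fix}")
    with open(full) as f:
        tokens = json.load(f)
    # 'expires_at' as a unix timestamp is an assumed field name.
    if tokens.get("expires_at", 0) < time.time():
        return ("warn", f"token expired, run: {fix}")
    return ("ok", "")
```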
Two cadences
Fast checks (tokens, freshness, launchd status) run every 15 minutes with the sync pipeline. Vault integrity (orphan notes, broken wikilinks, now.md staleness) runs daily via a --full flag before the morning briefing — same-day visibility without paying the vault-walk cost twelve times an hour.
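A minimal sketch of the two-cadence split, assuming a plain argparse --full flag; the check-group names are illustrative:

```python
import argparse

def main(argv=None) -> list[str]:
    parser = argparse.ArgumentParser(description="system health sweep")
    # Default run is the fast 15-minute pass; --full adds the expensive
    # vault walk once a day, before the morning briefing.
    parser.add_argument("--full", action="store_true",
                        help="include daily vault-integrity checks")
    args = parser.parse_args(argv)
    checks = ["tokens", "freshness", "launchd"]
    if args.full:
        checks += ["orphan-notes", "broken-wikilinks", "now.md-staleness"]
    return checks
```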
Self-monitoring sentinel
The script touches .health_check_last_success on successful completion. On the next run, that sentinel's age is the first check. If the health checker itself is broken, the report says so before discussing anything else. Turtles-all-the-way-down, handled explicitly.
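A sketch of that sentinel logic, with the path passed in for clarity; the 2×-interval threshold is borrowed from the freshness rule as an assumption:

```python
import pathlib
import time

def check_self(sentinel: pathlib.Path,
               interval_secs: float = 900) -> tuple[str, str]:
    # First check of every run: did the previous run finish recently?
    if not sentinel.exists():
        return ("fail", "sentinel missing; health checker never completed")
    age = time.time() - sentinel.stat().st_mtime
    if age > 2 * interval_secs:
        return ("fail", f"health checker itself is stale ({age:.0f}s old)")
    return ("ok", "")

def mark_success(sentinel: pathlib.Path) -> None:
    # Touched only after the whole run completes without raising.
    sentinel.touch()
```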
Retention
DELETE FROM health_checks WHERE checked_at < date('now', '-30 days') AND status = 'ok'. Warn and fail rows stay forever. If something has been broken for thirty days, that history is evidence, not noise.
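The rule can be demonstrated end to end in a few lines; the three-column table here is a stand-in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health_checks (name TEXT, status TEXT, checked_at TEXT)")
conn.executemany(
    "INSERT INTO health_checks VALUES (?, ?, date('now', ?))",
    [
        ("old-ok",   "ok",   "-45 days"),  # pruned: stale and healthy
        ("old-warn", "warn", "-45 days"),  # kept: broken history is evidence
        ("new-ok",   "ok",   "-1 days"),   # kept: inside the 30-day window
    ],
)
conn.execute(
    "DELETE FROM health_checks "
    "WHERE checked_at < date('now', '-30 days') AND status = 'ok'"
)
remaining = sorted(n for (n,) in conn.execute("SELECT name FROM health_checks"))
```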
What caught the first real failure
An expired Whoop token. The briefing that morning opened with python ~/reasons-os/packages/sync/pull_whoop.py --auth. I ran it before finishing coffee. Worked.