Vitals — a three-layer personal-data stack for an AI fitness coach

What I built

Three layers, joined at a SQLite file.

Layer 1 — a pull layer. Adapter classes for Oura (biometrics, sleep, workouts, sub-minute HR during efforts), Renpho (body composition since 2020), and an iMessage roll-up that turns ten years of chat.db into per-day behavioral signals — late-night message volume, reply latency percentiles, sleep/alcohol/conflict/intimacy keyword counts, max 5-minute burst. Each adapter is idempotent on a natural key with a 2-day overlap on re-pulls. Launchd runs the sync twice a day. No webhooks; nothing leaves the machine.

Layer 2 — a derivation layer. Six analysis modules that materialize trustworthy derived tables on top of the raw samples — personal HR zones (versioned, because fitness drifts), per-day time-in-zone with cadence-aware weights, Banister TRIMP and Coggan CTL/ATL/TSB curves, HR-only session reconstruction independent of the wearable's own workout tags, cardiac-drift analysis on long steady-state efforts, recovery-load regressions with block-bootstrap CIs and BH-FDR correction. The agent never reads raw HR samples directly — it consults the derived tables. This is the layer I didn't know I was missing until I built it.

Layer 3 — a persona contract. A single markdown file (coach.md) that tells whatever LLM you load it into: your role is a quantitative fitness coach; query the derivation tables before answering; cite specific values and rolling baselines; coach with a recommendation, don't just narrate the numbers; cross-reference across tables when relevant. The contract is model-agnostic — it's the system prompt for OpenAI, the system instruction for Gemini, the --system flag for an Ollama-served local 9B, or the CLAUDE.md when working in Claude Code. The LLM is interchangeable; the contract pointed at the derived tables is the actual artifact.

Why I built it

Even with the pull layer and the persona contract, the agent was wrong.

When I asked it "is my training elite?" it answered confidently: "99.8% moderate intensity, 0% hard time, you live entirely in the gray zone." All three numbers were wrong by factors of five to fifty. Not roughly wrong — categorically wrong. And it was confident.

The proximate cause was a timezone bug. I'd joined heart-rate samples (stored in UTC with a Z suffix) against workout windows (stored in local time with ±HH:MM offsets) by string comparison. ISO 8601 strings with different offsets don't compare moments in time — they're just strings, and the lexicographic compare is meaningless across offsets. The join silently dropped most samples inside workout windows. The agent dutifully ratified the corrupt analysis and I had to retract three paragraphs of confident coaching to myself.

The deeper cause was that the only signal cheap enough for an agent to grab on demand was Oura's pre-computed intensity field — a coarse per-workout label that turns out to be ground-truth-disconnected. The pull layer gives an agent raw sensor data. The persona contract instructs the agent to query before answering. Neither layer is sufficient when the cheapest query returns a misleading number, because that's the answer the model will use.

What was missing was the layer in the middle. A derivation tier that takes 3.4 million raw samples and produces a small set of trustworthy tables — proper zone-time, proper training load, HRV gates anchored on personal baselines — that the agent reads instead of reaching for whatever pre-computed shortcut the upstream vendor happened to ship. That's Layer 2. It exists because Layer 3 was demonstrably wrong without it.

The derivation layer is the unlock. The pull layer gives the agent data. The persona contract gives it instructions. Neither makes the agent correct. The derived tables are what make the agent's answers trustworthy — they're the single source of truth the contract points at. Strip the model out of the loop and the derivation layer is what describes what the agent is for.

How it works

Layer 1 — the pull layer

Each source — Oura, Renpho, iMessage — is its own subpackage exposing a sync_* function with a shared shape: read the cursor from sync_state, pull with a 2-day overlap because upstream APIs sometimes back-fill late, upsert into the source's table keyed on a natural key (day, document id, or (timestamp, source)), update sync_state. Re-running a sync is always safe and idempotent. The contract is convention, not an ABC — a new source is a new subpackage plus three lines in cli.py.

The raw API JSON is kept in a raw_json column on every table. If the schema misses a field I want later, json_extract(raw_json, '$.path') gets it without a re-sync. Heart-rate samples come in at the API's native cadence — about 45 seconds during workouts, 12 minutes at rest, which works out to roughly 3.4 million rows over four years on a single Oura account. Well within SQLite's comfort zone, queryable with no index gymnastics.

Layer 2 — the derivation layer

Six analysis modules in src/health/analysis/, each materializing its own derived table:

hr_clean.py — TZ-correct UTC sample loader, gap-aware time weights (60-second cap for workout-source samples, 600-second cap for rest/awake; never extend across a gap), per-sample local-day assignment from the nearest sleep-period offset. The unglamorous foundation everything else depends on.
zones.py → zone_config. Personal HRmax (rolling 365-day robust max, requires ≥3 sustained samples), HRrest (5th percentile of sleep lowest_heart_rate), LTHR (peak 30-min rolling-average HR). Karvonen %HRR and Friel %LTHR zones. Versioned — fitness drifts, so zones get a new row every six months or whenever a boundary moves ≥2 bpm.
daily_load.py → daily_load. One row per local day with weighted time-in-zone, Banister exp-weighted TRIMP, Edwards zone-sum TRIMP, and Coggan CTL/ATL/TSB curves. Idempotent rebuild; the EWMAs are explicitly seeded from the prior row so partial rebuilds preserve history.
sessions.py → hr_session. Reconstructs sessions from HR alone, independent of the wearable's own workout tags. Smooths bpm, identifies sustained elevation, merges adjacent bouts, matches against logged workouts by temporal IoU. Surfaced that 51% of my real sessions over a 90-day window were unlogged.
decoupling.py — cardiac drift on long sub-threshold sessions. Honest naming: without a power signal, this is HR drift over time, not true Pa:HR. Still useful as an aerobic-base field indicator under the Uphill Athlete framework.
recovery_coupling.py — pre-registered hypothesis tests for daily training load against next-day HRV, RHR, and readiness. Block-bootstrap 95% CIs (14-day block, n=10,000 resamples) and BH-FDR correction across the hypothesis family. Three of seven were significant in the expected direction; four were null, including chronic-load measures — evidence that my system has adapted to high baseline volume.

Plus an HRV autoregulation gate (hrv_swc.py) implementing the Plews-Buchheit smallest-worthwhile-change framework. Daily green / yellow / red verdict against a personal 60-day baseline (travel days excluded — TZ shifts perturb HRV mechanically), with explicit detection of the "rMSSD-CV rising while mean falling" pattern that flags non-functional overreaching. Backtested over my prior 90 days, the gate would have told me to back off on 44 of them. I trained over 100 TRIMP on every single one. Adherence: zero. Building the gate didn't change my training, but it made the cost legible — that's what derivation layers do.

Layer 3 — the persona contract

The whole "agent" is a markdown file. An abbreviated excerpt:

# Vitals — Coach Persona

You are a fitness coach and quantitative health analyst.

When asked anything about recovery, training, body
composition, or how those metrics interact:

1. Query the derivation tables first — daily_load,
   hr_session, hrv_verdict, zone_config, protocol_state.
   Never re-aggregate raw sensor data on the fly.
2. Be quantitative. Cite specific values, % changes,
   day-over-day deltas. No "your sleep seems okay" —
   give the score, the 7d trend, the rolling baseline.
3. Coach, don't just report. After the numbers, give
   a concrete recommendation.
4. Cross-reference. Sleep, HRV, body temperature, and
   message-derived stress signals interact. Surface
   lagged correlations across tables when relevant.

That's the full contract — it fits in a screenshot. Loaded into Claude Code it lives as CLAUDE.md. Pasted into the OpenAI Responses API as a system message, the same words work. Fed to a local Qwen 3.5 9B via the local LLM stack as the system prompt, ditto. The contract names which derived tables to consult and forbids re-computation on the fly; the derivation layer ensures the answer to that consultation is trustworthy. The contract plus the derivation layer plus a model: that's the agent.

Cross-domain joins

Because every source lands in the same SQLite — and the derivation layer materializes its outputs to the same place — every cross-domain question is a one-line query the agent runs and cites. The figure below plots a random sample of 200 days from the corpus — late-night messages sent vs. next-day HRV. Each dot is one day. The lower-right cluster is the kind of pattern that's invisible to a single-domain app: high late-night phone activity followed by depressed next-day autonomic recovery. The agent finds these when asked; it doesn't need a separate analytics dashboard.

Scatter plot: late-night messages sent (x-axis) vs next-day HRV in ms (y-axis). 200 sampled days across 2024–2026. Most points cluster at zero late-night messages with HRV spread across 20–110. High-late-night-message days (10+) skew toward lower HRV values. — Sampled cross-domain join: messages × biometrics. Real anonymized data, no dates or contact identifiers.

Local and pull-based

The whole stack is sync httpx, SQLite, pandas/numpy, and launchd. No webhooks, no daemon, no VPS, no cloud function. The trade-off is real — there's no real-time signal — and it's the right one, because the data is daily-grain anyway and the cloud APIs deliver intraday HR at sub-minute resolution only during detected workouts. Real-time would be theatrical, not useful. Two cron firings a day catches everything that matters and the entire system runs in user-space on one laptop.

The stack

Language

Python 3.11

Storage

SQLite (local)

Compute

pandas · numpy

HTTP

httpx · sync

Scheduler

launchd · 2× daily

Reference sources

Oura v2 · Renpho · iMessage

Agent

Any LLM via coach.md

What it gets me

A coach that's quantitative and correct. Asking "how am I feeling today?" returns readiness score, sleep contributors, body-temp deviation vs 10-day average, and a concrete recommendation — all sourced from derived tables the agent can cite. The contract forbids hand-waving; the derivation layer forbids the cheap-shortcut wrong answer.
Falsifiable training protocols. On top of the derivation tables sits a pyramidal-distribution polarization protocol — explicit phases (deload → re-anchor → rebuild), HRV-gated daily prescriptions, and pre-registered exit criteria evaluated automatically against daily_load, hrv_verdict, and the readiness panel. The protocol either works against measurable criteria or it doesn't. No vibes.
Cross-domain answers a single health app can't give. "Why was my recovery index 40 last night?" pulls body temp, recent workout load, late-night message count, and alcohol keyword mentions in one query and answers with citations.
Provider-independence. When the next-generation model ships or I want to run on the plane, I swap it out. The persona contract, the derivation modules, the data layer, and the CLI don't change. The agent is the contract pointed at the derived tables; the vendor is just the interpreter.
A reference implementation for the broader pattern. Plug in Whoop instead of Oura, a Withings scale instead of Renpho, an Apple Health export instead of iMessage — the sync shape is the same, the derivation tables sit on top, the persona contract still applies. github.com/gabrielkagan/vitals ships the abstracted version.

Privacy

Nothing on this page is identifying. The scatter is real but anonymized — no dates, no contact IDs, no names. The corpus and the agent's outputs stay on my machine, encrypted, never synced anywhere. The open-source repo ships with no data either. If you want a vitals corpus of your body and your messages, you point the adapters at your own accounts on your own laptop. That's the whole architecture.