Personal Project

Local LLM Stack

A cheap inference tier on a 24 GB MacBook — stdlib proxy, fallback chains, prompt cache, and just enough telemetry to know what's working. The tier below Claude, not a replacement for it.

Architecture diagram: ask CLI, OpenCode, and cron jobs all hit a local proxy on port 11436, which forwards to Ollama and llama-swap and writes telemetry to a JSONL log and SQLite cache.
Clients on the left, the proxy in the middle doing the orchestration, backends on the right. Telemetry hangs off the proxy.
24 GB RAM budget · ~250 LOC proxy · 8 logical models · $0 marginal cost

What I built

A small stack that lets me run real LLM work on my MacBook for zero marginal cost — and decide when each request is worth the higher Claude tier and when it isn't.

Three pieces. A 250-line stdlib Python proxy on :11436 that speaks the OpenAI chat-completions API and forwards to Ollama (and a deferred llama.cpp backend for heavier weights). An ask CLI for one-shot prompts, with a SQLite prompt cache and a JSONL telemetry log so I can see what's actually being used. A council pseudo-model that runs two members in parallel and a stronger chairman to synthesize, for the harder questions where I want a second opinion before I trust an answer.
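The forwarding path is small enough to sketch. This is a sketch rather than the real file: it assumes Ollama's OpenAI-compatible endpoint on its default :11434 and leaves out the keep_alive handling, fallback, and telemetry the real proxy layers on top.

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

OLLAMA = "http://127.0.0.1:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # only the chat-completions route is proxied
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # forward the request body to Ollama unchanged and relay the response
        upstream = Request(OLLAMA, data=body, headers={"Content-Type": "application/json"})
        with urlopen(upstream, timeout=120) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

HTTPServer(("127.0.0.1", 11436), ProxyHandler).serve_forever()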

The whole thing is opinionated. I don't auto-route. I pick the tier explicitly: ask --9b, ask --think, ask --heavy, ask --council. The proxy doesn't try to be smart about which model to use — it just makes the model I asked for actually work, and tells me if it had to fall back to a smaller one.
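Roughly what the tier picking looks like at the CLI layer. The flags are the real ones; the think-9b name and the mapping of --heavy to coder-30b are illustrative stand-ins, not necessarily what the proxy actually calls them.

import argparse

def pick_model(argv=None):
    # map ask's explicit tier flags onto logical model names on the proxy
    parser = argparse.ArgumentParser(prog="ask")
    parser.add_argument("--9b", dest="tier", action="store_const", const="fast-9b")
    parser.add_argument("--think", dest="tier", action="store_const", const="think-9b")   # assumed name
    parser.add_argument("--heavy", dest="tier", action="store_const", const="coder-30b")  # assumed mapping
    parser.add_argument("--council", dest="tier", action="store_const", const="council")
    parser.add_argument("prompt")
    args = parser.parse_args(argv)
    return args.tier or "fast-9b", args.prompt  # no flag: default to the warm interactive 9B

# pick_model(["--heavy", "refactor this loop"]) -> ("coder-30b", "refactor this loop")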

Why I built it

The real reason is cost discipline.

I'm running an algorithmic trading bot on Kalshi, and a chunk of the work it spawns is mechanical: format this digest, summarize this log, classify this news headline, write a one-line commit message. That work doesn't need a frontier model. Sending it to Claude is paying a Michelin-star kitchen to make toast. So I built a tier below Claude — fast enough for the cheap work, present enough to actually use it instead of defaulting to the API. The local stack is the cheap tier; Claude is reserved for the work that actually needs it.

The on-the-go angle is a real but secondary benefit. I'm often in cafes, on planes, on hotel wifi. The bot's primary machine is a VPS, but my development laptop is a MacBook, and being able to run real inference on it without a reachable API is genuinely useful. The hard requirement, though, was cost: making the cheap tier good enough that I'd actually reach for it instead of habitually defaulting to Claude for work that didn't need it.

Never bet against the model. Boris Cherny's framing for Claude Code: any scaffolding you build to boost performance 10–20% becomes obsolete when the next model handles it natively. So this stack is deliberately thin. The proxy is one file. There's no auto-routing classifier, no quality detector, no clever orchestration framework. The model picks the answer; I pick the tier.

The 24 GB constraint

The interesting engineering is the hardware budget. macOS plus everything I have open eats 12–16 GB before any model loads, which leaves roughly 8–12 GB for inference. A 9B model in Q4 is 6.6 GB. A 30B coder is 18 GB. They can't both be resident at once. Anything I do has to respect that.

So every logical model in the proxy has a keep_alive policy. The interactive 9B is held warm for 15 minutes — sporadic ask calls don't pay cold-start latency, which is the difference between a tool I reach for and a tool I don't. The 30B coder evicts in 30 seconds, because it can't be allowed to squat on RAM that the interactive tier needs. Smaller models have shorter warmth too. None of this is rocket science, but if you don't think about it the experience degrades silently and you stop using the stack.
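In code the policy is just a table the proxy consults when it forwards a request. A sketch, assuming keep_alive can ride along on the forwarded request (Ollama takes it per request on its native API); the durations for the smaller models are illustrative.

import json

KEEP_ALIVE = {
    "fast-9b":   "15m",  # interactive tier: held warm so sporadic ask calls skip cold start
    "coder-30b": "30s",  # evicts fast so it can't squat on RAM the interactive tier needs
    "fast-3b":   "5m",   # illustrative
    "fast-1b":   "5m",   # illustrative
}

def with_keep_alive(body: bytes) -> bytes:
    # inject Ollama's per-request keep_alive field into an OpenAI-style chat request
    req = json.loads(body)
    req["keep_alive"] = KEEP_ALIVE.get(req.get("model", ""), "1m")
    return json.dumps(req).encode()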

There's also a stress test in the repo: fire prompts at fast-9b and coder-30b concurrently, watch Ollama RSS and macOS pageouts, and see if the system thrashes. It does, sometimes, when the 30B is loading. That's known. I don't pretend it doesn't.
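A rough sketch of what that test amounts to, assuming it samples Ollama's resident set with ps while the requests are in flight; the real script also watches pageouts.

import json, subprocess, threading, time
from urllib.request import Request, urlopen

PROXY = "http://127.0.0.1:11436/v1/chat/completions"

def hit(model):
    # one request per tier so both models have to load and generate at once
    body = json.dumps({"model": model,
                       "messages": [{"role": "user", "content": "Summarize the history of the transistor."}]})
    urlopen(Request(PROXY, data=body.encode(),
                    headers={"Content-Type": "application/json"}), timeout=600).read()

def ollama_rss_mb():
    # sum resident set size (KB) across ollama processes via ps
    pids = subprocess.run(["pgrep", "-f", "ollama"], capture_output=True, text=True).stdout.split()
    kb = 0
    for pid in pids:
        out = subprocess.run(["ps", "-o", "rss=", "-p", pid], capture_output=True, text=True).stdout.strip()
        kb += int(out or 0)
    return kb // 1024

threads = [threading.Thread(target=hit, args=(m,)) for m in ("fast-9b", "coder-30b")]
for t in threads:
    t.start()
while any(t.is_alive() for t in threads):
    print("ollama RSS:", ollama_rss_mb(), "MB")
    time.sleep(2)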

Orchestration patterns

Two patterns layered into the proxy, both lifted from the literature, neither novel.

Steinberger-style fallback chains

Every model has a fallback list. fast-9b → fast-3b → fast-1b. If a request fails (timeout, HTTP error, connection drop), the proxy retries down the chain. Importantly the fallback crosses model families — if 9B fails because of RAM pressure or a runner crash, the next attempt hits a different runtime. Same-blob fallback would hit the same broken state.

The fallback isn't smart — it's reliability, not routing. And critically, it's surfaced: when a request gets demoted, the proxy returns an X-Served-By header and the ask CLI prints [degraded: fast-9b → fast-1b] to stderr. That way I know the answer came from a smaller model and can re-run when 9B is healthy. Silent demotion is worse than honest failure.
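The chain walk is a few lines. A sketch: forward() stands in for the real upstream call, and the second return value is what the proxy surfaces as X-Served-By.

import json
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

OLLAMA = "http://127.0.0.1:11434/v1/chat/completions"
FALLBACKS = {"fast-9b": ["fast-3b", "fast-1b"]}

def forward(model, body):
    # one attempt against the upstream, with the candidate model substituted in
    payload = json.dumps(dict(body, model=model)).encode()
    req = Request(OLLAMA, data=payload, headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=120) as resp:
        return resp.read()

def complete(model, body):
    # walk the chain: requested model first, then its fallbacks; report who answered
    for candidate in [model] + FALLBACKS.get(model, []):
        try:
            return forward(candidate, body), candidate  # second value becomes X-Served-By
        except (HTTPError, URLError, TimeoutError):
            continue
    raise RuntimeError(f"all fallbacks exhausted for {model}")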

Karpathy-style council

For harder questions, ask --council "..." runs two members in parallel (the same 9B at temperature 0.7, for self-consistency-style sampling diversity), then a stronger chairman (the thinking-mode 9B) reads both responses anonymized as Response A and Response B and synthesizes a single final answer. Members can be the same model with sampled diversity; the chairman is deliberately stronger so the synthesis adds something a member couldn't.
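The whole flow is small enough to sketch. think-9b here is a stand-in for whatever the thinking-mode 9B is actually registered as; everything else follows the description above.

import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

PROXY = "http://127.0.0.1:11436/v1/chat/completions"

def chat(model, prompt, temperature):
    # one chat-completions call through the proxy
    body = json.dumps({"model": model, "temperature": temperature,
                       "messages": [{"role": "user", "content": prompt}]}).encode()
    with urlopen(Request(PROXY, data=body, headers={"Content-Type": "application/json"}),
                 timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def council(question):
    # two members: the same 9B, sampled at temperature 0.7 for diversity
    with ThreadPoolExecutor(max_workers=2) as pool:
        a, b = pool.map(lambda _: chat("fast-9b", question, 0.7), range(2))
    # chairman sees the members anonymized as Response A / Response B
    prompt = (f"Question: {question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
              "Synthesize a single, better answer from the two responses.")
    return chat("think-9b", prompt, 0.2)  # lower chairman temperature is illustrative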

It's not magic. It's a pattern from Karpathy's nanoChat work — running the same model multiple times with temperature, then composing. It helps on questions where I want the model to argue with itself before I trust the output. For trivial questions it's overkill and I don't use it.

Telemetry & cache

This part I added late, after staring at a dashboard of nothing and realizing I had no idea whether the local stack was actually paying off.

Every ask call now writes one line to ~/.local/share/ask/log.jsonl: timestamp, model requested, model actually served, prompt hash, prompt and response sizes, latency, whether the cache hit, whether the request was demoted, exit code. One jq query over a week of data tells me everything that matters: what models I actually use, how often the cache helps, how often demotion fires, where the latency goes.

The cache is a SQLite table keyed on sha256(model + prompt) with a 30-day TTL. Misses run inference; hits return in under 5 milliseconds. For repetitive cron work (the hourly digest formatting, the news classifier that hits the same headlines), this is where the win is concentrated. For interactive work the hit rate is near zero, and that's fine. The cache earns its keep on the cron side. The council is excluded (intentionally stochastic).

{"ts":1777978067.96,"model":"fast-9b","served_by":"fast-9b",
 "prompt_hash":"775a61d9","prompt_chars":22,"response_chars":2,
 "latency_s":0.003,"cache_hit":true,"demoted":false,"exit_code":0}

The stack

Hardware: M4 Pro · 24 GB
Proxy: Python stdlib · 250 LOC
Backends: Ollama + llama.cpp HEAD
Primary model: Qwen 3.5 9B (Q4)
Cache: SQLite · sha256 keys
Telemetry: JSONL · jq-friendly

What it gets me