v0.5.0 shipping · source-available · what's new →

The control plane
for AI inference.

AI inference is burning megawatts of GPU power, and datacenter buildout is racing to keep up. Meanwhile, your inference stack pays again at every hop, on top of the GPU bill. Models think in tokens, but the rest of the stack speaks text. Every gateway, router, tool dispatcher, and middleware in the path performs the same ritual: detokenize the model's IDs to text, encode as UTF-8, wrap in JSON, ship it, parse it, decode UTF-8, and re-tokenize back to IDs. That burns CPU, memory, and latency on lossy conversions the AI never asked for, and it risks KV-cache corruption when the re-tokenize doesn't round-trip cleanly.

Codec is a drop-in upgrade that keeps token IDs as the wire format end-to-end: gateways forward IDs verbatim, tool dispatchers match on raw IDs, and cross-model handoffs translate vocabularies in-process. Same model, same prompts, same answers; typically 16× less data on the wire on real agent traffic, and up to ~1,700× when the content compresses well. How big the win is depends on what your AI generates; full receipts are below. On mobile: a snappier app and a lighter cloud bill. At fleet scale: megawatt-hours of network energy and middleware CPU not burned on bytes nobody reads.

Plug-in libraries for TypeScript, Python, Rust, Java, .NET, and C work with the AI servers you already use (sglang, vllm, llama.cpp). Your code doesn't change. We can't make the model smaller, but we can cut the waste. And by shrinking the wire 1,000+×, Codec opens AI access to the ~5 billion people on slow, expensive, or metered connections that JSON-SSE prices out.

What it gives you → Protocol map github / wdunn001/Codec

~$400M+/yr: total wire + GPU savings worldwide. ~$320M cloud egress (heavy-agent baseline; tool-use + A2A is default at Claude/ChatGPT/Gemini) + ~$50M GPU on blocked prompts + ~$35M Starlink; sub-agent-heavy flows push to $500–700M/yr.
up to 10×: faster on mobile. 2 K-token reply over 10 Mbps 4G.
~5B people: AI accessible where it wasn't. 2.6B offline + 2–3B on slow / expensive mobile (ITU 2024).

Token IDs straight on the wire. Tool-call dispatch, observability, cross-vocab handoff — all the things you'd want to do at the inference layer reduce to integer compares on the stream. Detokenize becomes a byproduct, not a per-token cost.

control-plane primitives

Three operations. All on raw token IDs.

Codec gives the inference layer the same primitives a service mesh gives a microservice fleet: route, dispatch, translate. Run them on raw uint32 tokens, never on text. The compression you see in the receipts below is what falls out for free when you stop reserializing every hop.

route

Tokens all the way down

Models think in tokens. Every middleware in your stack — gateway, router, log sink — speaks text, so it detokenizes, JSON-wraps, ships, parses, re-tokenizes — once per hop, burning CPU and risking KV-cache drift. Codec keeps token IDs as the wire format end-to-end; UTF-8 happens once, at the edge that actually displays text. Same compression options on top (gzip / brotli / dict-zstd). Same framing on every engine; six client languages decode byte-identically.

16–1,700× less wire (workload-dependent)
3 engines, one wire

dispatch

MCP tool calls, leaf-mode bypass

The MCP path normally tokenizes the tool result at the gateway, every call. A Codec-aware MCP server (codec-time-leaf) attaches token IDs to its result via _meta['ai.codec/leaf-tokenization']; the codec-metamcp gateway forwards them verbatim — [Codec][leaf] fires, the gateway becomes a transparent ID pipe, and the consumer skips its BPE re-tokenize. tools/list across a 40-tool namespace: 21.4 KB → 5.9 KB (3.6×). ToolWatcher detects tool boundaries on the raw ID stream at 26.7× the speed of detokenize+regex (lab EPYC, 481 Mtok/s).

12.4× leaf consumer CPU
26.7× ToolWatcher vs detok+regex

translate

Cross-vocab agent handoff

A Llama-3 agent's stream feeds a Qwen-2 agent through one in-process detokenize / retokenize step. UTF-8 never crosses the wire. At 2 K tokens the Codec path ships 15.1× fewer wire bytes (10.4 KB → 709 B) at bridge CPU within noise of the JSON-SSE+retokenize path. Both paths emit byte-identical Qwen-2 output; the bench asserts strict equality before reporting numbers.

15.1× smaller wire @ 2K
≡ byte-identical output
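
A minimal sketch of the in-process translate step described in this card, assuming two toy vocabularies and a greedy longest-match tokenizer. The names VOCAB_A, VOCAB_B, and bridge are hypothetical stand-ins for real Llama-3 and Qwen-2 tokenizers, which use BPE rather than this simplified matching. The point is only that text exists inside the bridge and never on the wire.

```python
# Toy vocabularies for illustration only; real models ship tens of thousands of entries.
VOCAB_A = {0: "hello", 1: " world", 2: "!"}               # source: ID -> text piece
VOCAB_B = {"hello": 10, " ": 11, "world": 12, "!": 13}    # target: text piece -> ID

def detokenize(ids, vocab):
    """Turn source-vocab IDs into text, in memory."""
    return "".join(vocab[i] for i in ids)

def retokenize(text, vocab):
    """Greedy longest-match tokenization over a toy vocab (not real BPE)."""
    pieces = sorted(vocab, key=len, reverse=True)
    out, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                out.append(vocab[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no token covers position {pos}")
    return out

def bridge(ids_a):
    """IDs in, IDs out: the intermediate text never leaves this function."""
    return retokenize(detokenize(ids_a, VOCAB_A), VOCAB_B)

print(bridge([0, 1, 2]))  # prints [10, 11, 12, 13]
```

Both sides of the bridge see only integer IDs; which tokenizer pair is loaded would be decided by the vocab handshake.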

latents

Image & video diffusion on the wire

The same wire format extends to diffusion models: VAE latents stream in length-prefixed binary frames instead of decoded pixels. The client runs vae_decode locally; pixels never touch the wire. Measured on the lab against codec-diffusers running SD-1.5: a 512×512 latent at int8 packs to 16.4 KB (~5–10× smaller than JPEG, ~90× smaller than raw fp16 pixels). The int4 pipeline halves it again. Pipeline math validates byte-for-byte against spec/PIPELINES.md.

3.9× int4 vs raw latent
~90× vs raw fp16 pixels
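
The size arithmetic behind these figures can be sketched as follows, assuming the standard SD-1.5 latent shape of 4 × 64 × 64 values for a 512×512 image. The symmetric quantizer, the nibble order, and the 4-byte length prefix are illustrative assumptions, not the layout defined in spec/PIPELINES.md. The payload sizes exclude any scale metadata or headers the real pipeline adds.

```python
import math
import struct

LATENT_VALUES = 4 * 64 * 64  # 16,384 values for a 512x512 SD-1.5 latent

def quantize(values, bits):
    """Symmetric linear quantization to signed integers of the given width."""
    qmax = (1 << (bits - 1)) - 1            # 127 for int8, 7 for int4
    peak = max(abs(v) for v in values) or 1.0
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale

def pack_int8(q):
    """One two's-complement byte per value."""
    return bytes(v & 0xFF for v in q)

def pack_int4(q):
    """Two two's-complement nibbles per byte, low nibble first (assumed order)."""
    nibbles = [v & 0xF for v in q] + ([0] if len(q) % 2 else [])
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))

def frame(payload):
    """Length-prefixed binary frame: 4-byte big-endian length, then payload."""
    return struct.pack(">I", len(payload)) + payload

# Synthetic latent values; a real latent would come from the diffusion model.
latent = [math.sin(i * 0.01) for i in range(LATENT_VALUES)]
q8, _ = quantize(latent, 8)
q4, _ = quantize(latent, 4)
print(len(pack_int8(q8)), len(pack_int4(q4)))  # prints 16384 8192
```

For comparison, the same latent at fp16 is 2 bytes per value, or 32,768 bytes before framing.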

receipts

What falls out when the inference layer stays token-native.

Compression isn't the headline — the primitives are. But once every hop runs on raw uint32 token IDs, the wire reduction and the tool-call latency floor are measurable byproducts. Numbers below are from the cross-stack benchmark matrix: same prompt, same model, three real inference engines, six real client languages. Every cell is measured. Full SCHEMA-v1 result JSONs in packages/bench/results/.

Lab run — all three pathways measured

vinez@192.168.1.88 · 2× RTX 3090 · v0.4.1 cut, 2026-05-15 (wire numbers unchanged from v0.3.x — v0.4.x is wire-additive; v0.4.1 ships dict-zstd across all 6 clients + br + zstd on llama.cpp)

3.9× int4 latent vs raw

MCP gateway

tools/list across 40 tools

encoding	wire	vs JSON
JSON-RPC	21.4 KB	1.0×
msgpack + gzip	5.9 KB	3.6×

[Codec][leaf] log fires end-to-end on codec-time-leaf tool calls — gateway becomes a transparent ID pipe.

Latents (v0.3)

512×512 SD-1.5 latent on the wire

pipeline	wire	vs raw
raw fp16	32.4 KB	1.0×
int8	16.4 KB	2.0×
int4	8.4 KB	3.9×

For comparison: same image as a JPEG ~80–150 KB; raw fp16 pixels 1.5 MB. int4 = ~10× smaller than JPEG, ~180× vs raw pixels.

Streaming wire bytes by payload size

sglang · lower is better · Y axis is log scale (each gridline is 10×)

1,707× smaller at 2,048 tokens

Chart series: JSON-SSE · Codec (identity) · Codec + dict-zstd

Same Codec, three engines

2,048-token reply · best-available Codec wire vs JSON-SSE baseline

3 engines, one wire format

Engine	JSON-SSE	Best-available Codec wire	Codec bytes	Reduction
sglang	485 KB	msgpack + dict-zstd	291 B	1,707× @ 44.7 ms
vllm	518 KB	msgpack + gzip	3.9 KB	137× @ 59.0 ms
llama.cpp	529 KB	msgpack + dict-zstd	140 B	3,868× @ 40.8 ms

Time-to-first-byte at 2 K tokens

Codec msgpack + gzip · median across reps · first body byte (Python httpx)

~ 40 ms on local-network sglang & llama.cpp

Cross-vocab handoff — Llama-3 → Qwen-2

Same source IDs, two wire paths · gzip on both · bridge produces byte-identical Qwen-2 output

30% faster bridge response @ 2 K · 15.1× smaller wire @ 2 K

Bridge response time — the latency the next agent waits on

CPU cost to turn the inbound stream into Qwen-2 IDs ready for agent B. Lower is better — this is the wall-clock the handoff blocks on, before agent B's first new token.

Wire bytes — the bandwidth the bridge has to ingest

What the bridge has to receive before any translation can run. Network-bound: the slower the link, the more this dominates response time on top of the bridge CPU above.

Agent loops — end-to-end tool dispatch

codec-sglang:v0.5.0 · Qwen2.5-0.5B · prompt → model emits tool call → real dispatch → tool result → final answer

16.9–18.0× wire reduction across three tool surfaces

Surface	JSON wire	Codec wire	Reduction	JSON total	Codec total	Speedup
mock `get_weather`	13,419 B	794 B	16.9×	1,662 ms	189 ms	8.8×
SearXNG (live web)	42,302 B	2,348 B	18.0×	2,078 ms	1,257 ms	1.65×
MetaMCP gateway (Time MCP)	18,072 B	1,061 B	17.0×	210 ms	216 ms	~neutral

+ MCP leaf-mode — tool-result-side axis

Complementary to the rows above (those measure the model-emission side — ToolWatcher fires on raw IDs in the inference stream). Leaf measures the tool-result hop: a Codec-aware tool ships token IDs in _meta['ai.codec/leaf-tokenization'] alongside its text so the consumer reads ids directly instead of re-tokenizing.

Path (`get_current_time`, ~30 char result)	wire (bytes)	consumer tokenize	total
plain MCP (consumer re-tokenizes text)	105	0.052 ms	0.5 ms
mcp-leaf (consumer reads ids from `_meta`)	316	0.004 ms	0.4 ms
delta	+211 B (leaf ~3× larger on wire)	12.4× faster	—

The leaf _meta envelope is a fixed ~210-byte cost per text block; the consumer-CPU savings scale linearly with text length. The wire crossover where leaf ≤ plain sits at ~300+ characters per text block — timestamps pay a wire tax for the CPU win, while paginated docs / search results / large MCP outputs win on both axes. 20/20 integrity: every leaf sample's ids equal tokenizer.encode(text) under the declared map_id.
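
The consumer-side choice described above can be sketched roughly as follows. The _meta key comes from the text; the field names ids and map_id inside it, the block shape, and the byte-level toy_encode tokenizer are assumptions for illustration and may not match the real envelope.

```python
LEAF_KEY = "ai.codec/leaf-tokenization"  # _meta key named in the text above

def toy_encode(text):
    """Stand-in tokenizer: one ID per UTF-8 byte. A real consumer would call its BPE tokenizer."""
    return list(text.encode("utf-8"))

def consume_text_block(block, expected_map_id, encode=toy_encode, verify=False):
    """Return (ids, used_leaf).

    Reads IDs from the leaf envelope when it is present and was produced
    for the vocab this consumer speaks; otherwise re-tokenizes the text.
    """
    leaf = (block.get("_meta") or {}).get(LEAF_KEY)
    if leaf and leaf.get("map_id") == expected_map_id:
        ids = list(leaf["ids"])
        # Optional integrity check: leaf IDs must equal encode(text).
        if verify and ids != encode(block["text"]):
            raise ValueError("leaf ids do not match encode(text)")
        return ids, True
    return encode(block["text"]), False

leaf_block = {
    "type": "text",
    "text": "hi",
    "_meta": {LEAF_KEY: {"map_id": "m1", "ids": [104, 105]}},
}
print(consume_text_block(leaf_block, "m1", verify=True))  # prints ([104, 105], True)
print(consume_text_block({"text": "hi"}, "m1"))           # prints ([104, 105], False)
```

The fallback path keeps a leaf-aware consumer compatible with plain MCP servers, and a mismatched map_id is treated the same as a missing envelope.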

Tool-call detection on a 1 M-token stream

Time to detect a tool-call region without detokenizing · v0.4.1 lab measurement on EPYC 8124P + gcc:13

26.7× faster than detokenize+regex

What the bench numbers cost — in dollars and watts

honest translations from the measured wire/CPU numbers above · public egress rates · published per-platform reply volumes (ChatGPT 2.5B, Claude 900M, Gemini 600M, …)

~$400M+/yr cloud egress + Starlink + GPU-on-blocked-prompts saved at worldwide AI scale (heavy-agent baseline — tool-use + A2A is now default across Claude, ChatGPT, Gemini)

Codec is a wire + dispatch primitive, not an inference accelerator. The model still runs at the same TPS on the same GPU. The cost story is on the network (egress, mobile data, radio energy), the client CPU (BPE tokenize + JSON parse removed from the hot path), and the server CPU floor (response-side serialize + UTF-8 encode removed, raising the concurrent-request ceiling per GPU). It is NOT on GPU compute.

cloud egress

AWS S3 outbound @ $0.09/GB — compound rate

A 2K-token chat reply ships 485 KB JSON-SSE vs 291 B Codec on sglang — but per visible user reply, real bytes-out are ~4× that because every major AI platform now defaults to tool-use + agent-to-agent: initial response + final response + 2–3 tool requests/results + sub-agent handoffs that span regions. The "single chat reply" era is over — Claude Code, ChatGPT-with-tools, Gemini-Agentic are all multi-hop by design.

Scope	Replies/day	Chat-only floor	Heavy-agent baseline (~4×)
Anthropic Claude	~900M	$14M/yr	~$56M/yr
OpenAI ChatGPT + Copilot	~2.5B	$40M/yr	~$160M/yr
Google Gemini	~600M	$9M/yr	~$36M/yr
Others (Grok, Perplexity, …)	~300M	$5M/yr	~$20M/yr
Worldwide AI traffic (heavy baseline)	~5B	$80M/yr	~$320M/yr

Chat-only floor: 485 KB JSON-SSE per reply × replies/day × $0.09/GB (AWS S3 outbound). The heavy-agent baseline multiplies by ~4× for the topology that's now default at every major provider: multi-tool dispatch, A2A handoffs, sub-agent invocations, and RAG context retrieval all crossing egress. The worldwide row is computed from ~5B replies/day, not by summing the provider rows above (~4.3B). Extreme deep-research / sub-agent-heavy flows push to 6–8× ($480–640M/yr); only legacy chat-only deployments hit the floor. GCP and Azure egress are in the same ballpark ($0.08–$0.12/GB).
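
The egress formula above is simple enough to re-run with your own numbers. This is a back-of-envelope sketch that assumes decimal units (1 KB = 1,000 B, 1 GB = 10^9 B) and a flat per-GB rate; the function and parameter names are illustrative, not part of Codec.

```python
def annual_egress_usd(replies_per_day, bytes_per_reply=485_000,
                      usd_per_gb=0.09, hop_multiplier=1.0):
    """Yearly egress cost of JSON-SSE replies at a flat per-GB rate."""
    gb_per_day = replies_per_day * bytes_per_reply * hop_multiplier / 1e9
    return gb_per_day * usd_per_gb * 365

for name, replies in [("Claude", 900e6), ("ChatGPT + Copilot", 2.5e9), ("Worldwide", 5e9)]:
    floor = annual_egress_usd(replies)                    # chat-only floor
    heavy = annual_egress_usd(replies, hop_multiplier=4)  # heavy-agent baseline (~4x)
    print(f"{name}: ${floor / 1e6:.0f}M/yr chat-only, ${heavy / 1e6:.0f}M/yr heavy-agent")
```

The results land within rounding of the table. Swap in your own contract rate, reply size, and hop multiplier to get a figure for a specific deployment.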

mobile + radio

Per-response data + battery

Carrier additional-data rate: ~$10/GB. Radio-link energy: ~50 nJ/bit (a conservative cross-technology estimate from published 4G/5G/Wi-Fi measurements). Per 2K-token chat reply:

Per reply	JSON-SSE	Codec
Data cost	$0.0049	$0.000003
Radio energy	194 mJ	0.12 mJ
Bits over the air	3.88 Mb	2.3 Kb

Per-response cost is tiny on a phone; the number becomes intuitive once you multiply by the user base. ~20M Claude users × ~50 mobile chat replies/day each ≈ 1B replies/day across a mobile fleet: ~194 MJ vs ~0.12 MJ on radio links. That is ~54 kWh/day saved at the airlink alone, roughly the daily electricity of 1.8 average US households (EIA ~30 kWh/household-day), plus the per-user battery and data-cap relief. The full non-GPU energy delta (radio + network + client CPU) is bigger; see the power+latency card below.
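
The radio-energy figures follow directly from bytes × 8 × energy per bit. A small sketch of that arithmetic, using the 50 nJ/bit estimate and the per-reply sizes quoted above; the variable names are illustrative and 1 KB is taken as 1,000 B.

```python
NJ_PER_BIT = 50      # radio-link energy estimate quoted above
J_PER_KWH = 3.6e6    # joules in one kilowatt-hour

def radio_joules(payload_bytes, nj_per_bit=NJ_PER_BIT):
    """Energy spent moving a payload across the radio link, in joules."""
    return payload_bytes * 8 * nj_per_bit * 1e-9

json_j = radio_joules(485_000)   # JSON-SSE reply
codec_j = radio_joules(291)      # Codec reply

replies_per_day = 20e6 * 50      # ~20M users x ~50 mobile replies/day
saved_kwh = (json_j - codec_j) * replies_per_day / J_PER_KWH

print(f"{json_j * 1e3:.0f} mJ vs {codec_j * 1e3:.2f} mJ per reply; {saved_kwh:.0f} kWh/day saved")
# prints: 194 mJ vs 0.12 mJ per reply; 54 kWh/day saved
```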

client CPU

Tokenize / detokenize removed from the hot path

Two measured points from the bench section above, both worth real CPU on the consumer side:

ToolWatcher — 2.08 ns/token (single 32-bit compare) vs 55.42 ns/token for detokenize+regex match. 26.7× less CPU on every tool-detection pass.
mcp-leaf — 0.004 ms meta-read vs 0.052 ms BPE tokenize per tool result. 12.4× less CPU per tool call where the result includes _meta['ai.codec/leaf-tokenization'].

Per call it's microseconds. At fleet scale (100M consumers × 100 tool-bearing turns/day), the agent mesh saves on the order of 1,000 CPU-hours/day across the consumer fleet. On a single laptop running an agent loop locally: less fan noise, longer battery.
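
A rough sketch of what tool detection by integer compare means, written in Python for readability. The begin and end marker IDs here are hypothetical, and the function is not the ToolWatcher API; the measured implementation is the C benchmark (bench_watcher.c).

```python
def find_tool_regions(ids, begin_id, end_id):
    """Return (start, end) index pairs of tool-call regions in a token-ID stream.

    Each token costs one integer compare: against begin_id while outside a
    region, against end_id while inside one. Nothing is detokenized.
    """
    regions = []
    start = None
    for i, tok in enumerate(ids):
        if start is None:
            if tok == begin_id:
                start = i
        elif tok == end_id:
            regions.append((start, i))
            start = None
    return regions

# Hypothetical marker IDs; a real vocab defines its own tool-call tokens.
TOOL_BEGIN, TOOL_END = 100, 101
stream = [5, TOOL_BEGIN, 7, 8, TOOL_END, 9, TOOL_BEGIN, TOOL_END]
print(find_tool_regions(stream, TOOL_BEGIN, TOOL_END))  # prints [(1, 4), (6, 7)]
```

The detokenize+regex baseline has to materialize text for every token before it can match anything, which is where the per-token cost difference comes from.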

GPU compute (blocked prompts)

The one place Codec actually saves GPU $

Codec-aware clients with web-safety enabled refuse doomed prompts locally (policy violations, safety-policy mismatches, malformed payloads) before any wire round-trip. Those requests never reach the GPU. At a ~10% client-side block rate on ~5B daily requests, that is ~500M GPU requests/day avoided.

Assumption	Per call	At 500M blocks/day
GPU-seconds avoided	~1 s avg	~139K GPU-hr/day
@ $1/GPU-hr blended	$0.000278	~$50M/yr saved
@ $2/GPU-hr (premium)	$0.000556	~$100M/yr saved

This is the only place Codec actually reduces GPU dollars — not because the model runs faster, but because the request never runs at all. On the ~90% of requests that DO reach the GPU, compute is unchanged. ~$50–100M/yr is the defensible range at 10% block rate; more aggressive client-side dedup / safety / format-validation pushes it higher.
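
The blocked-prompt estimate is a three-factor product, sketched here so the assumptions can be swapped out. The defaults mirror the table above (5B requests/day, 10% block rate, ~1 GPU-second per avoided call); the function name is illustrative.

```python
def blocked_prompt_savings(requests_per_day=5e9, block_rate=0.10,
                           gpu_seconds_per_call=1.0, usd_per_gpu_hour=1.0):
    """Return (GPU-hours avoided per day, USD saved per year)."""
    blocked_per_day = requests_per_day * block_rate
    gpu_hours_per_day = blocked_per_day * gpu_seconds_per_call / 3600
    return gpu_hours_per_day, gpu_hours_per_day * usd_per_gpu_hour * 365

hours, usd_blended = blocked_prompt_savings()
_, usd_premium = blocked_prompt_savings(usd_per_gpu_hour=2.0)
print(f"{hours:,.0f} GPU-hr/day; ${usd_blended / 1e6:.1f}M/yr blended, ${usd_premium / 1e6:.1f}M/yr premium")
```

The block rate is the dominant unknown: halving it halves every figure in the table.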

server CPU floor

What Codec does NOT change on served requests

The model still runs at the same TPS. Codec doesn't accelerate token generation, doesn't change weights, doesn't change KV-cache footprint. If you're GPU-bound, Codec is invisible to your $/token.

What it DOES reduce on the server: response-side serialize + UTF-8 encode + per-token JSON envelope. That's typically 1–5% of server CPU on a busy node and dominates only at high concurrency. The practical effect is the concurrent-request ceiling per GPU moves up — lab measurements on sglang show ~5–10% more sustained concurrent streams before TTFB degrades, because the response-builder thread isn't UTF-8-encoding every token. That translates to fewer GPUs needed to hit the same QPS target, not faster individual responses.

Caveat: this is observation from the bench rig, not a controlled GPU-utilisation A/B against vanilla. Treat as directional — the safe claim is "Codec doesn't slow down inference and removes a known serialize bottleneck at concurrency." We do NOT claim a $/token reduction on the model itself.

What about the energy / CO₂ story? Codec does reduce non-GPU CPU + network electricity for every detokenize/retokenize/parse cycle it eliminates. Under realistic heavy-agent workflow assumptions, the savings work out to roughly 60–200 cars-equivalent of CO₂ per year today and ~250–800 by 2030: real, but small compared to the dollar, accessibility, and IoT framings on this page. We don't lead with this. The methodology and reproducible harness live at packages/bench/docs/ENERGY_METHODOLOGY.md and packages/bench/scripts/energy_bench.py for anyone who wants to plug in their own conversion-event count, per-byte tokenizer cost, and blocked-prompt rate.

Source numbers: cross-stack matrix (wire bytes), leaf-live.ts (consumer CPU), bench_watcher.c (ToolWatcher ns/token). Egress rates: AWS S3 standard tier; GCP equivalent. All cost translations are back-of-envelope from public rates × measured bytes — production deployments should re-run with their own egress contract + traffic mix.

Accessibility — who can use AI when the wire shrinks 1,000×

AI as utility vs AI as luxury · the populations Codec brings inside the door

~5B people on slow / expensive / metered connections

JSON-SSE makes AI a rich-country product. Each real AI request actually moves ~4 MB across the wire (~8 wire round-trips × bidirectional; the heavy-agent baseline the cost card uses) — fine on US fibre, a luxury where mobile data costs $2–10/GB, a no-show on satellite or weak cell. Codec at ~2.4 KB per request works in all of those places at the same TTFB.

Per-AI-request data cost across the world

Mobile data retail prices vary >100× across countries. Wages don't. The per-request cost is what shows up on a metered customer's bill.

Region / connection	$ per GB	JSON-SSE / request	Codec / request
US, postpaid mobile add'l	~$10	$0.040	$0.000024
India, prepaid mobile	~$0.20	$0.0008	$0.0000005
Kenya / Sub-Saharan Africa avg	~$2–5	$0.008–0.020	~$0.000005–0.000012
Starlink Roam add'l data	~$2	$0.008	$0.000005
Starlink Mobile Priority overage	~$1–2	$0.004–0.008	~$0.000005
Starlink Maritime Mobile Priority	~$10	$0.040	$0.000024
In-flight Wi-Fi (Gogo / Viasat)	~$15–30/hr	~$0.80	~$0.0005
Iridium satellite (legacy maritime)	~$5–15/MB	~$20–60	$0.012–0.036

Bottom row is the dramatic one: at Iridium maritime rates, a single agentic AI request bills $20–60 just for envelopes. Same answer on Codec wire bills under 4 cents. That's not a small efficiency win — it's the difference between "AI on this connection" and "AI is impossible on this connection."
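
Every row in the table is the same multiplication: bytes per request × price per byte. A short sketch, assuming decimal units and the ~4 MB / ~2.4 KB per-request sizes used above; the names are illustrative.

```python
JSON_SSE_BYTES = 4_000_000   # ~4 MB per heavy-agent request
CODEC_BYTES = 2_400          # ~2.4 KB per heavy-agent request

def usd_per_request(bytes_per_request, usd_per_gb):
    """Metered data cost of one request at a flat per-GB price."""
    return bytes_per_request / 1e9 * usd_per_gb

# Prices in USD per GB; Iridium's $5-15/MB becomes $5,000-15,000/GB.
rates = {"US postpaid": 10, "India prepaid": 0.20, "Iridium (low)": 5_000, "Iridium (high)": 15_000}
for name, rate in rates.items():
    json_cost = usd_per_request(JSON_SSE_BYTES, rate)
    codec_cost = usd_per_request(CODEC_BYTES, rate)
    print(f"{name}: JSON-SSE ${json_cost:.4f}, Codec ${codec_cost:.7f}")
```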

Starlink — "pay-per-byte AI" at fleet scale

Starlink's metered tiers (Roam add'l, Mobile Priority overage, Maritime Priority) make the per-byte savings show up as a line-item dollar figure on the customer's bill, every month.

Customer profile	AI requests/mo	Starlink tier	JSON-SSE / mo	Codec saves
RV / van-life heavy user (100 requests/day)	~3K	Roam $2/GB	$24/mo	~$24/mo / ~$290/yr
Remote homestead remote-work (300 requests/day)	~9K	Roam $2/GB	$72/mo	~$72/mo / ~$865/yr
Offshore vessel, 20-crew running AI agents (~1K req/day)	~30K	Maritime $10/GB	$1,200/mo	~$1.2K/mo / ~$14.4K/yr per vessel
Mining / oil-rig field ops, 100 users (~5K req/day)	~150K	Maritime $10/GB	$6,000/mo	~$6K/mo / ~$72K/yr per site
Starlink fleet estimate — ~1M subs on metered tiers running heavy-agent AI	~3B	blended ~$2–3/GB	~$12M/mo	~$150M+/yr Starlink bandwidth saved

The big number isn't consumer Starlink — it's fleet deployments. An offshore vessel running heavy-agent AI tools through Starlink Maritime at $10/GB now pays ~$1,200/month just for the JSON-SSE round-trip overhead on a 20-crew, 1K-request/day workload. Drop that to Codec wire and the same workload bills out under $1/month — a $14K/year line item per vessel that goes away. A maritime operator running 100 vessels: $1.4M/year of Starlink Maritime bandwidth invoiced for envelopes. Aviation customers ($2K–25K/month all-in plans) don't see a per-GB line item but get many more concurrent AI sessions on the same priority pool. Starlink Business / Maritime / Aviation tariffs are public. Numbers scale with the multi-round-trip multiplier — chat-only deployments are 4× smaller, deep-research / sub-agent-heavy flows push higher.

Who this brings inside the door: the ITU's 2024 Facts & Figures puts ~2.6 billion people offline entirely and another ~2–3 billion online via slow or expensive mobile-only connections; A4AI's affordability index flags 1 GB of mobile data as costing >5% of monthly income across most of sub-Saharan Africa and parts of Asia. JSON-SSE at ~4 MB per heavy-agent AI request (the actual real-world transaction size, both directions, all round-trips) prices those users out; Codec at ~2.4 KB per request pushes per-request cost into rounding error in any region, on any connection, on any device. The Response Time card below shows where JSON-SSE crosses from "sluggish" to "unusable" while Codec keeps working — mobile 4G to 2G/EDGE to satellite voice link, all viable. Plus: $50 Android phones can run agent loops on Codec frames where JSON-SSE parse + re-tokenize would burn the battery.

And then there's everything that isn't a phone — IoT & low-bandwidth networks

There are entire device categories that physically can't fit a JSON-SSE AI conversation in their network budget. Not "too expensive" — impossible. Codec brings AI to them for the first time.

Network class	Typical budget	JSON-SSE AI request (~4 MB)	Codec AI request (~2.4 KB)
LoRaWAN sensors / smart meters / asset trackers	11–242 B per uplink, ~handful/day	~20,000× too big	~10 max-size uplinks (a single 291 B reply fits in 2)
NB-IoT / LTE-M meters, telematics, wearables, industrial	~hundreds of KB/day total	can't fit one request	hundreds of requests/day
Sigfox endpoints	12 B/msg, 140 msgs/day	doesn't start	control-frame responses fit
Satellite IoT — Iridium SBD, Swarm, Astrocast	$/byte; tens-hundreds of B per msg	economically nonsensical	inside the cost envelope
Mesh — Meshtastic, Helium, Reticulum, tactical mesh	tens to hundreds of bps	unavailable	viable
Industrial bus — Modbus / CAN / BACnet gateways	tight gateway uplink windows	doesn't fit	rides through

What this opens up: smart agriculture with AI-derived irrigation recommendations on a daily LoRaWAN uplink; wildlife / anti-poaching collars with on-the-fly classification over satellite; cold-chain logistics with AI anomaly alerts en route instead of at port arrival; disaster response + field medicine on degraded networks; pipeline / grid / remote-infrastructure predictive maintenance; maritime, aviation, and expedition AI advisories on the satellite link that previously could only carry position pings; smart-city endpoints (parking / lighting / water / waste) running adaptive AI on the same NB-IoT links they already use. These are massive markets the AI industry has treated as out of scope — not because AI can't help them, because JSON envelopes can't reach them.

The Codec C99 library is small enough for microcontrollers. Six client packages (TypeScript, Python, Rust, Java, .NET, C) cover everything from a phone browser down to a battery-budget LoRaWAN endpoint. For the first time, AI is something a sensor can call — not just something the cloud can do to a sensor's data.

Response time — what the user feels

TTFB unchanged · wire-transfer time goes from dominant to invisible as link speed drops

~3.5 s → ~0.4 s per AI request on 10 Mbps mobile (heavy-agent baseline); JSON-SSE crosses into "unusable" around 1 Mbps, Codec stays under half a second

Same scope as the cost card: per real AI request, not per single response. A modern AI call makes ~8 wire round-trips behind the scenes (user→API, agent-to-agent, tool calls, tool results, sub-agent, synthesis, final response). Each round-trip pays TTFB — the per-request totals below are the heavy-agent baseline at 8 round-trips. TTFB itself is unchanged by Codec (the model still runs first); what changes is the payload-transfer time at every hop, which goes from "dominates" to "invisible."

Wall-clock per heavy-agent AI request, by link speed

~8 round-trips × (~45 ms TTFB + payload transfer at link rate). Chat-only deployments are ~¼ these numbers.

Link	JSON-SSE per request	Codec per request
Datacenter LAN — 1 Gbps	~390 ms	~360 ms
Office Wi-Fi — 100 Mbps	~680 ms	~360 ms
Mobile 4G — 10 Mbps	~3.5 s	~0.4 s
Edge / weak mobile — 1 Mbps	~32 s	~0.4 s
Satellite / IoT — 256 Kbps	~2 min	~0.5 s

The interesting line isn't the speedup ratio — it's the absolute wall-clock. JSON-SSE on a 1 Mbps edge link takes half a minute per agentic AI request. On 256 Kbps satellite it's 2 minutes. That's unusable. Codec keeps the same workload under half a second on every connection.
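
The model behind the table is round-trips × TTFB plus payload bits divided by link rate. A sketch under the stated assumptions (8 round-trips, ~45 ms TTFB each, ~4 MB vs ~2.4 KB per request, decimal units); the function name is illustrative.

```python
def request_seconds(total_bytes, link_bps, round_trips=8, ttfb_s=0.045):
    """Wall-clock for one heavy-agent request: fixed TTFB cost plus transfer time."""
    return round_trips * ttfb_s + total_bytes * 8 / link_bps

links = {
    "1 Gbps": 1e9,
    "100 Mbps": 100e6,
    "10 Mbps": 10e6,
    "1 Mbps": 1e6,
    "256 Kbps": 256e3,
}
for name, bps in links.items():
    json_s = request_seconds(4_000_000, bps)
    codec_s = request_seconds(2_400, bps)
    print(f"{name}: JSON-SSE {json_s:.2f} s, Codec {codec_s:.2f} s")
```

The fixed term is 0.36 s on every link, which is why the Codec column barely moves: its transfer term stays below 0.1 s even at 256 Kbps.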

Agent loop end-to-end — measured, not extrapolated

From the v0.4.1 agent-loop bench. Wall-clock includes the model emitting the tool call + dispatch + tool result + final answer.

Tool surface	JSON-SSE total	Codec total	Speedup	What dominates
mock `get_weather` (in-process)	1,662 ms	189 ms	8.8×	wire + serialize
SearXNG (live web tool)	2,078 ms	1,257 ms	1.65×	tool latency
MetaMCP gateway (Time MCP)	210 ms	216 ms	~neutral	model + tool

TTFB numbers from MATRIX.md §4 (Python httpx aiter_raw first-body-byte median). Agent-loop wall-clock from results/2026-05-15T20-00-00Z/agent-loop/ — wire numbers measured, not extrapolated; speedups are real wall-clock from POST to final answer. Per-AI-request totals assume the heavy-agent baseline of ~8 wire round-trips per visible request. Energy and dollar aggregation across the fleet (network electricity, datacenter pressure, cars-off-road equivalence) live in the cost card above — this card is just about the wall-clock one user feels.

Source: cross-stack MATRIX.md · codec-sglang, codec-vllm, codec-llamacpp (all v0.5.0) · Qwen-2.5 0.5B · RTX 3090 · temp 0.0 · reproducible from packages/bench/scripts/run-all-langs.sh (cross-stack matrix), synthetic_wire_bench.py (§1 protocol-only), and leaf-live.ts (MCP leaf-mode tool-result-side).

how it works

Three pieces. That's the whole spec.

01

Handshake the vocab.

Client and server agree on which tokenizer to speak before any token ID crosses the wire. Maps are sha256-addressed JSON: pull a pre-generated one from codec-maps, or generate one for any model with a tokenizer.json via the maps CLI.
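
One way to picture sha256 addressing, as a sketch only: the canonical JSON encoding, the "sha256:" prefix, and the handshake function below are assumptions for illustration, not the format used by codec-maps or the maps CLI.

```python
import hashlib
import json

def map_id(vocab_map):
    """Content address for a tokenizer map: sha256 over a canonical JSON encoding (assumed form)."""
    canonical = json.dumps(vocab_map, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()

def handshake(client_map_id, server_map_ids):
    """Return the agreed map ID, or None if the server does not hold the client's map."""
    return client_map_id if client_map_id in server_map_ids else None

toy_map = {"hello": 0, " world": 1, "!": 2}   # toy vocab, not a real tokenizer
mid = map_id(toy_map)
print(handshake(mid, {mid}) == mid)           # prints True
print(handshake(mid, set()))                  # prints None
```

Because the ID is derived from the content, two peers that compute the same ID can be confident they hold the same vocabulary before any token crosses the wire.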
02

Stream uint32.

Token IDs go directly on the wire as 32-bit big-endian integers. No JSON envelope, no UTF-8 round-trip, no per-message structural overhead. Four bytes per token, every token.
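
The identity encoding described here, before any compression, can be sketched with the standard struct module. The function names are illustrative and the example IDs are arbitrary.

```python
import struct

def encode_ids(ids):
    """Pack token IDs as consecutive 32-bit big-endian unsigned integers."""
    return struct.pack(f">{len(ids)}I", *ids)

def decode_ids(buf):
    """Unpack a buffer of 32-bit big-endian words back into token IDs."""
    if len(buf) % 4:
        raise ValueError("stream length must be a multiple of 4 bytes")
    return list(struct.unpack(f">{len(buf) // 4}I", buf))

ids = [9906, 1917, 0]            # arbitrary example IDs
wire = encode_ids(ids)
print(len(wire), wire.hex())     # prints 12 000026b20000077d00000000
print(decode_ids(wire) == ids)   # prints True
```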
03

Frame with control words.

The high byte of each word distinguishes data from control. Roles, tool calls, completion boundaries, and stream resets ride in-band as reserved control IDs. One framing layer covers everything.
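
A sketch of in-band framing under loud assumptions: the tag value 0xFF and the three control codes below are hypothetical, chosen only to show how a reader could separate data words from control words by inspecting the high byte. The real reserved IDs are defined by the spec.

```python
import struct

CONTROL_TAG = 0xFF  # assumption: the actual high-byte value comes from the spec

# Hypothetical control words, for illustration only.
TOOL_CALL_BEGIN = (CONTROL_TAG << 24) | 0x01
TOOL_CALL_END = (CONTROL_TAG << 24) | 0x02
END_OF_STREAM = (CONTROL_TAG << 24) | 0x03

def is_control(word):
    """A word is control if its high byte carries the control tag."""
    return (word >> 24) == CONTROL_TAG

def split_stream(buf):
    """Split a big-endian uint32 stream into (data token IDs, control words)."""
    words = struct.unpack(f">{len(buf) // 4}I", buf)
    data = [w for w in words if not is_control(w)]
    control = [w for w in words if is_control(w)]
    return data, control

buf = struct.pack(">5I", 10, TOOL_CALL_BEGIN, 11, TOOL_CALL_END, END_OF_STREAM)
data, control = split_stream(buf)
print(data)                                # prints [10, 11]
print([hex(w) for w in control])           # prints ['0xff000001', '0xff000002', '0xff000003']
```

Reserving a high-byte range this way leaves the low 24 bits for token IDs, which comfortably covers vocabularies in the hundreds of thousands of entries.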

Ready to ship bytes, not sentences?

Codec is source-available under BSL 1.1, free for non-production use and for production use under US $5M annual revenue.

Read the spec Commercial licensing
