AI inference is burning megawatts of GPU power and datacenter
buildout is racing to keep up — meanwhile your inference stack
is paying again at every hop on top of the GPU bill. Models
think in tokens, but the rest of the stack speaks text. Every gateway,
router, tool dispatcher, and middleware in the path does the same
ritual: detokenize the model's IDs to text, encode as UTF-8, wrap in
JSON, ship it, parse it, decode UTF-8, re-tokenize back to IDs —
burning CPU, memory, and latency on lossy conversions the AI never
asked for, and risking KV-cache corruption when the re-tokenize
doesn't round-trip cleanly. Codec is a drop-in upgrade that keeps
token IDs as the wire format end-to-end: gateways forward
IDs verbatim, tool dispatchers match on raw IDs, cross-model handoffs
translate vocabularies in-process. Same model, same prompts, same
answers; typically 16× less data on the wire on real
agent traffic, up to ~1,700× when the content compresses
well — how big the win is depends on what your AI
generates, full receipts below. On mobile: snappier app, lighter
cloud bill. At fleet scale: megawatt-hours of network energy and
middleware CPU not burned on bytes nobody reads. Plug-in
libraries for TypeScript, Python, Rust, Java, .NET, and C work with
the AI servers you already use (sglang, vllm, llama.cpp). Your code
doesn't change. We can't make the model smaller, but we can shrink
the waste. And by shrinking the wire by up to 1,000× or more, Codec opens AI
access to the ~5 billion people on slow,
expensive, or metered connections that JSON-SSE prices out.
~$400M+/yr: total wire + GPU savings worldwide. ~$320M cloud egress (heavy-agent baseline; tool-use + A2A is default at Claude/ChatGPT/Gemini) + ~$50M GPU on blocked prompts + ~$35M Starlink; sub-agent-heavy flows push to $500–700M/yr
up to 10× faster on mobile: 2 K-token reply over 10 Mbps 4G
~5B people: AI accessible where it wasn't. 2.6B offline + 2–3B on slow / expensive mobile (ITU 2024)
Token IDs straight on the wire. Tool-call dispatch, observability,
cross-vocab handoff — all the things you'd want to do at the
inference layer reduce to integer compares on the stream. Detokenize
becomes a byproduct, not a per-token cost.
control-plane primitives
Three operations. All on raw token IDs.
Codec gives the inference layer the same primitives a service mesh gives
a microservice fleet: route, dispatch, translate.
Run them on raw uint32 tokens, never on text. The
compression you see in the receipts below is what falls out for free
when you stop reserializing every hop.
route
Tokens all the way down
Models think in tokens. Every middleware in your stack — gateway, router, log sink — speaks text, so it detokenizes, JSON-wraps, ships, parses, re-tokenizes — once per hop, burning CPU and risking KV-cache drift. Codec keeps token IDs as the wire format end-to-end; UTF-8 happens once, at the edge that actually displays text. Same compression options on top (gzip / brotli / dict-zstd). Same framing on every engine; six client languages decode byte-identically.
16–1,700× less wire (workload-dep.)
3 engines, one wire
dispatch
MCP tool calls, leaf-mode bypass
The MCP path normally tokenizes the tool result at the gateway on every call. A Codec-aware MCP server (codec-time-leaf) attaches token IDs to its result via _meta['ai.codec/leaf-tokenization']; the codec-metamcp gateway forwards them verbatim. The [Codec][leaf] log line fires, the gateway becomes a transparent ID pipe, and the consumer skips its BPE re-tokenize. tools/list across a 40-tool namespace: 21.4 KB → 5.9 KB (3.6×). ToolWatcher detects tool boundaries on the raw ID stream at 26.7× the speed of detokenize+regex (lab EPYC, 481 Mtok/s).
12.4× less leaf consumer CPU
26.7× ToolWatcher vs detok+regex
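To make "integer compares on the stream" concrete, here is a minimal sketch of tool-boundary detection on raw IDs. It is not ToolWatcher's implementation: the function name and the two marker IDs are hypothetical, and real marker IDs depend on the model's vocabulary.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical special-token IDs; real values depend on the model's vocabulary.
TOOL_CALL_START = 151657
TOOL_CALL_END = 151658


def find_tool_regions(ids: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs for each tool-call region in an ID stream.

    start is the index of the opening marker, end the index of the closing one.
    No detokenization happens: every step is a single integer compare.
    """
    start = None
    for i, tok in enumerate(ids):
        if tok == TOOL_CALL_START:
            start = i
        elif tok == TOOL_CALL_END and start is not None:
            yield (start, i)
            start = None


stream = [9906, 11, TOOL_CALL_START, 5018, 609, 794, TOOL_CALL_END, 13]
regions = list(find_tool_regions(stream))
print(regions)  # [(2, 6)]
```

The detokenize+regex alternative has to rebuild text for every token before it can match anything, which is where the measured gap comes from.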
translate
Cross-vocab agent handoff
A Llama-3 agent's stream feeds a Qwen-2 agent through one
in-process detokenize / retokenize step. UTF-8 never crosses
the wire. At 2 K tokens the Codec path ships
15.1× fewer wire bytes
(10.4 KB → 709 B) at bridge CPU within noise of
the JSON-SSE+retokenize path. Both paths emit byte-identical
Qwen-2 output; the bench asserts strict equality before
reporting numbers.
15.1× smaller wire @ 2K
≡ byte-identical output
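The bridge step can be pictured with two toy vocabularies. This is an illustrative sketch only: the vocabularies are made up, and the greedy longest-match tokenizer stands in for a real BPE tokenizer, which works differently.

```python
# Toy vocabularies standing in for two real tokenizers (purely illustrative).
VOCAB_A = {0: "hello", 1: " world", 2: "!"}
VOCAB_B = {"hel": 10, "lo": 11, " world": 12, "!": 13}


def detokenize_a(ids):
    """Map source-vocabulary IDs back to text."""
    return "".join(VOCAB_A[i] for i in ids)


def tokenize_b(text):
    """Greedy longest-match tokenizer over the target toy vocabulary."""
    pieces = sorted(VOCAB_B, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(VOCAB_B[piece])
                pos += len(piece)
                break
        else:
            raise ValueError(f"no token covers position {pos}")
    return ids


def bridge(ids_a):
    """Translate a source-vocab ID stream into target-vocab IDs in-process.

    Text exists only inside this function; it never goes on the wire.
    """
    return tokenize_b(detokenize_a(ids_a))


ids_b = bridge([0, 1, 2])
print(ids_b)  # [10, 11, 12, 13]
```

Only the inbound IDs and the outbound IDs cross a process boundary, which is why the wire shrinks while bridge CPU stays roughly where it was.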
latents
Image & video diffusion on the wire
The same wire format extends to diffusion
models: VAE latents stream in length-prefixed binary frames
instead of decoded pixels. The client runs vae_decode
locally; pixels never touch the wire. Measured in the lab against
codec-diffusers running SD-1.5: the latent for a 512×512 image packs at int8
to 16.4 KB (~5–10× smaller than
JPEG, ~90× smaller than raw fp16 pixels). The int4 pipeline
halves it again. Pipeline math validates byte-for-byte against
spec/PIPELINES.md.
3.9× int4 vs raw latent
~90× vs raw fp16 pixels
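The size ratios follow from the latent's shape. The sketch below assumes a 4×64×64 SD-1.5 latent and a simple symmetric linear quantizer; the actual pipelines are defined in spec/PIPELINES.md and may differ. It reports raw payload sizes only, so the measured figures above come out slightly larger once framing is included.

```python
import random

# An SD-1.5 latent for a 512x512 image has shape 4 x 64 x 64 = 16,384 values.
N = 4 * 64 * 64
random.seed(0)
latent = [random.gauss(0.0, 1.0) for _ in range(N)]  # stand-in data


def quantize(values, bits):
    """Symmetric linear quantization to signed integers of the given width (assumed scheme)."""
    qmax = (1 << (bits - 1)) - 1  # 127 for int8, 7 for int4
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def pack_int8(q):
    """One byte per value, two's complement."""
    return bytes(v & 0xFF for v in q)


def pack_int4(q):
    """Two values per byte: even index in the high nibble, odd index in the low nibble."""
    return bytes(((a & 0xF) << 4) | (b & 0xF) for a, b in zip(q[0::2], q[1::2]))


q8, _ = quantize(latent, 8)
q4, _ = quantize(latent, 4)
sizes = {"fp16": N * 2, "int8": len(pack_int8(q8)), "int4": len(pack_int4(q4))}
print(sizes)  # {'fp16': 32768, 'int8': 16384, 'int4': 8192}
```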
receipts
What falls out when the inference layer stays token-native.
Compression isn't the headline — the primitives are. But once
every hop runs on raw uint32 token IDs, the wire reduction
and the tool-call latency floor are measurable byproducts. Numbers
below are from the cross-stack benchmark matrix: same prompt, same
model, three real inference engines, six real client languages. Every
cell is measured. Full SCHEMA-v1 result JSONs in
packages/bench/results/.
Lab run — all three pathways measured
vinez@192.168.1.88 · 2× RTX 3090 · v0.4.1 cut, 2026-05-15 (wire numbers unchanged from v0.3.x — v0.4.x is wire-additive; v0.4.1 ships dict-zstd across all 6 clients + br + zstd on llama.cpp)
3.9× int4 latent vs raw
MCP gateway
tools/list across 40 tools

| encoding | wire | vs JSON |
|---|---|---|
| JSON-RPC | 21.4 KB | 1.0× |
| msgpack + gzip | 5.9 KB | 3.6× |

The [Codec][leaf] log fires end-to-end on codec-time-leaf tool calls; the gateway becomes a transparent ID pipe.
Latents (v0.3)
512×512 SD-1.5 latent on the wire

| pipeline | wire | vs raw |
|---|---|---|
| raw fp16 | 32.4 KB | 1.0× |
| int8 | 16.4 KB | 2.0× |
| int4 | 8.4 KB | 3.9× |

For comparison: the same image as a JPEG is ~80–150 KB; raw fp16 pixels are 1.5 MB. int4 is ~10× smaller than JPEG and ~180× smaller than raw pixels.
[Figure: Streaming wire bytes by payload size. sglang · lower is better · log-scale Y axis (each gridline is 10×). Series: JSON-SSE, Codec (identity), Codec + dict-zstd. Headline: 1,707× smaller at 2,048 tokens.]
Same Codec, three engines
2,048-token reply · best-available Codec wire vs JSON-SSE baseline
3 engines, one wire format

| engine | JSON-SSE | Codec | reduction |
|---|---|---|---|
| sglang | 485 KB | msgpack + dict-zstd, 291 B | 1,707× @ 44.7 ms |
| vllm | 518 KB | msgpack + gzip, 3.9 KB | 137× @ 59.0 ms |
| llama.cpp | 529 KB | msgpack + dict-zstd, 140 B | 3,868× @ 40.8 ms |
Time-to-first-byte at 2 K tokens
Codec msgpack + gzip · median across reps · first body byte (Python httpx)
~40 ms on local-network sglang & llama.cpp
Cross-vocab handoff — Llama-3 → Qwen-2
Same source IDs, two wire paths · gzip on both · bridge produces byte-identical Qwen-2 output
30% faster bridge response @ 2 K · 15.1× smaller wire @ 2 K
Bridge response time — the latency the next agent waits on
CPU cost to turn the inbound stream into Qwen-2 IDs ready for agent B. Lower is better — this is the wall-clock the handoff blocks on, before agent B's first new token.
Wire bytes — the bandwidth the bridge has to ingest
What the bridge has to receive before any translation can run. Network-bound: the slower the link, the more this dominates response time on top of the bridge CPU above.
Agent loops — end-to-end tool dispatch
codec-sglang:v0.5.0 · Qwen2.5-0.5B · prompt → model emits tool call → real dispatch → tool result → final answer
16.9–18.0× wire reduction across three tool surfaces

| Surface | JSON wire | Codec wire | Reduction | JSON total | Codec total | Speedup |
|---|---|---|---|---|---|---|
| mock get_weather | 13,419 B | 794 B | 16.9× | 1,662 ms | 189 ms | 8.8× |
| SearXNG (live web) | 42,302 B | 2,348 B | 18.0× | 2,078 ms | 1,257 ms | 1.65× |
| MetaMCP gateway (Time MCP) | 18,072 B | 1,061 B | 17.0× | 210 ms | 216 ms | ~neutral |
+ MCP leaf-mode — tool-result-side axis
Complementary to the rows above (those measure the model-emission side — ToolWatcher fires on raw IDs in the inference stream). Leaf measures the tool-result hop: a Codec-aware tool ships token IDs in _meta['ai.codec/leaf-tokenization'] alongside its text so the consumer reads ids directly instead of re-tokenizing.
| Path (get_current_time, ~30 char result) | wire (bytes) | consumer tokenize | total |
|---|---|---|---|
| plain MCP (consumer re-tokenizes text) | 105 | 0.052 ms | 0.5 ms |
| mcp-leaf (consumer reads ids from _meta) | 316 | 0.004 ms | 0.4 ms |
| delta | +211 bytes (leaf 3× larger on wire) | 12.4× faster | n/a |
The leaf _meta envelope is a fixed ~210-byte cost per text block; the consumer-CPU savings scale linearly with text length. The wire crossover where leaf ≤ plain sits at ~300+ characters per text block — timestamps pay a wire tax for the CPU win, while paginated docs / search results / large MCP outputs win on both axes. 20/20 integrity: every leaf sample's ids equal tokenizer.encode(text) under the declared map_id.
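A consumer-side sketch of the leaf path, under stated assumptions: the _meta key comes from the text above, but the field names inside it ("map_id", "ids"), the block layout, and the toy one-ID-per-byte tokenizer are all hypothetical stand-ins.

```python
LEAF_KEY = "ai.codec/leaf-tokenization"  # key name taken from the text above


def toy_encode(text):
    """Stand-in tokenizer: one ID per UTF-8 byte. A real consumer would call its BPE tokenizer."""
    return list(text.encode("utf-8"))


def consumer_ids(block, expected_map_id):
    """Return (ids, path) for a tool-result text block, preferring leaf IDs when usable.

    The field names inside the _meta entry ("map_id", "ids") are assumptions.
    """
    leaf = block.get("_meta", {}).get(LEAF_KEY)
    if leaf and leaf.get("map_id") == expected_map_id:
        return leaf["ids"], "leaf"           # cheap path: read the IDs directly
    return toy_encode(block["text"]), "retokenized"  # fallback: tokenize the text


text = "2026-05-15T12:00:00Z"
leaf_block = {"type": "text", "text": text,
              "_meta": {LEAF_KEY: {"map_id": "sha256:demo", "ids": toy_encode(text)}}}
plain_block = {"type": "text", "text": text}

ids_leaf, path_leaf = consumer_ids(leaf_block, "sha256:demo")
ids_plain, path_plain = consumer_ids(plain_block, "sha256:demo")

# Integrity check in the spirit of the 20/20 result: leaf IDs equal encode(text).
assert ids_leaf == toy_encode(text)
print(path_leaf, path_plain)  # leaf retokenized
```

Falling back to re-tokenizing when the declared map does not match keeps a mismatched leaf from feeding the wrong IDs to the model.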
Tool-call detection on a 1 M-token stream
Time to detect a tool-call region without detokenizing · v0.4.1 lab measurement on EPYC 8124P + gcc:13
26.7× faster than detokenize+regex
What the bench numbers cost — in dollars and watts
honest translations from the measured wire/CPU numbers above · public egress rates · published per-platform reply volumes (ChatGPT 2.5B, Claude 900M, Gemini 600M, …)
~$400M+/yr: cloud egress + Starlink + GPU-on-blocked-prompts saved at worldwide AI scale (heavy-agent baseline; tool-use + A2A is now default across Claude, ChatGPT, Gemini)
Codec is a wire + dispatch primitive, not an inference accelerator.
The model still runs at the same TPS on the same GPU. The cost story is on the
network (egress, mobile data, radio energy), the client CPU
(BPE tokenize + JSON parse removed from the hot path), and the
server CPU floor (response-side serialize + UTF-8 encode removed,
raising the concurrent-request ceiling per GPU). It is NOT on GPU compute.
cloud egress
AWS S3 outbound @ $0.09/GB — compound rate
A 2K-token chat reply ships 485 KB JSON-SSE vs 291 B Codec on sglang — but per visible user reply, real bytes-out are ~4× that because every major AI platform now defaults to tool-use + agent-to-agent: initial response + final response + 2–3 tool requests/results + sub-agent handoffs that span regions. The "single chat reply" era is over — Claude Code, ChatGPT-with-tools, Gemini-Agentic are all multi-hop by design.
| Scope | Replies/day | Chat-only floor | Heavy-agent baseline (~4×) |
|---|---|---|---|
| Anthropic Claude | ~900M | $14M/yr | ~$56M/yr |
| OpenAI ChatGPT + Copilot | ~2.5B | $40M/yr | ~$160M/yr |
| Google Gemini | ~600M | $9M/yr | ~$36M/yr |
| Others (Grok, Perplexity, …) | ~300M | $5M/yr | ~$20M/yr |
| Worldwide AI traffic (heavy baseline) | ~5B | $80M/yr | ~$320M/yr |
Chat-only floor: 485 KB JSON-SSE per reply × replies/day × $0.09/GB AWS S3. Heavy-agent baseline multiplies by ~4× for the topology that's now default at every major provider — multi-tool dispatch, A2A handoffs, sub-agent invocations, RAG context retrieval all crossing egress. Extreme deep-research / sub-agent-heavy flows push to 6-8× ($480-640M/yr); only legacy chat-only deployments hit the floor. GCP + Azure egress are in the same ballpark ($0.08-$0.12/GB).
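The chat-only floor formula can be reproduced in a few lines. This sketch assumes decimal units (1 KB = 1,000 B, 1 GB = 10^9 B) and treats the Codec payload as negligible, so the JSON-SSE bill stands in for the savings; the function and constant names are illustrative.

```python
BYTES_PER_REPLY = 485_000      # 485 KB JSON-SSE per 2K-token reply (sglang measurement)
EGRESS_USD_PER_GB = 0.09       # public egress rate used in the text
HEAVY_AGENT_MULTIPLIER = 4     # heavy-agent baseline from the text


def egress_usd_per_year(replies_per_day, multiplier=1):
    """Annual egress cost in USD for JSON-SSE replies at the given daily volume."""
    gb_per_day = replies_per_day * BYTES_PER_REPLY / 1e9
    return gb_per_day * EGRESS_USD_PER_GB * 365 * multiplier


for name, replies in [("Claude", 900e6), ("ChatGPT + Copilot", 2.5e9),
                      ("Gemini", 600e6), ("Others", 300e6), ("Worldwide", 5e9)]:
    floor = egress_usd_per_year(replies)
    heavy = egress_usd_per_year(replies, HEAVY_AGENT_MULTIPLIER)
    print(f"{name:18s} floor ${floor / 1e6:5.1f}M/yr   heavy ${heavy / 1e6:6.1f}M/yr")
```

Swapping in your own reply volume, payload size, or egress rate is a one-line change.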
mobile + radio
Per-response data + battery
Carrier add'l-data rate ~$10/GB. Radio-link energy ~50 nJ/bit (conservative cross-tech estimate from published 4G/5G/Wi-Fi measurements). Per 2K-token chat reply:
| Metric | JSON-SSE | Codec |
|---|---|---|
| Data cost | $0.0049 | $0.000003 |
| Radio energy | 194 mJ | 0.12 mJ |
| Bits over the air | 3.88 Mb | 2.3 Kb |
Per-response cost is tiny on a phone; the number becomes intuitive only when multiplied by the user base. ~20M Claude users × ~50 mobile chat replies/day each ≈ 1B replies/day across a mobile fleet: ~194 MJ vs ~0.12 MJ on radio links, or ~54 kWh/day saved at the airlink alone. That is about 1.8 average US households' daily electricity (EIA ~30 kWh/household-day), plus the per-user battery and data-cap relief. The full non-GPU energy delta (radio + network + client CPU) is bigger; see the power+latency card below.
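The per-reply and fleet figures above are straightforward unit conversions. A sketch, assuming decimal bytes and the 50 nJ/bit and $10/GB rates quoted in the text (function and variable names are illustrative):

```python
NJ_PER_BIT = 50      # radio-link energy estimate from the text
USD_PER_GB = 10.0    # carrier additional-data rate from the text


def per_reply(payload_bytes):
    """Return (bits, energy in mJ, data cost in USD) for one reply."""
    bits = payload_bytes * 8
    energy_mj = bits * NJ_PER_BIT * 1e-6        # nJ -> mJ
    cost_usd = payload_bytes / 1e9 * USD_PER_GB
    return bits, energy_mj, cost_usd


json_bits, json_mj, json_usd = per_reply(485_000)  # JSON-SSE
codec_bits, codec_mj, codec_usd = per_reply(291)   # Codec

# Fleet scale: 1B replies/day, converted from mJ to kWh (1 kWh = 3.6e9 mJ).
saved_kwh_per_day = (json_mj - codec_mj) * 1e9 / 3.6e9
print(f"{json_mj:.0f} mJ vs {codec_mj:.2f} mJ per reply; {saved_kwh_per_day:.0f} kWh/day at 1B replies")
```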
client CPU
Tokenize / detokenize removed from the hot path
Two measured points from the bench section above, both worth real CPU on the consumer side:
ToolWatcher — 2.08 ns/token (single 32-bit compare) vs 55.42 ns/token for detokenize+regex match. 26.7× less CPU on every tool-detection pass.
mcp-leaf — 0.004 ms meta-read vs 0.052 ms BPE tokenize per tool result. 12.4× less CPU per tool call where the result includes _meta['ai.codec/leaf-tokenization'].
Per call it's microseconds. At fleet scale (100M consumers × 100 tool-bearing turns/day) the agent-mesh saves on the order of ~1,000 CPU-hours/day across the consumer fleet. On a single laptop running an agent loop locally: less fan noise, longer battery.
GPU compute (blocked prompts)
The one place Codec actually saves GPU $
Codec-aware clients with web-safety enabled refuse doomed prompts locally (policy violations, safety-policy mismatches, malformed payloads) before any wire round-trip. Those requests never reach the GPU. At a ~10% client-side block rate on ~5B daily requests, that is ~500M GPU requests/day avoided.
| Assumption | Per call | At 500M blocks/day |
|---|---|---|
| GPU-seconds avoided | ~1 s avg | ~138K GPU-hr/day |
| @ $1/GPU-hr blended | $0.000278 | ~$50M/yr saved |
| @ $2/GPU-hr (premium) | $0.000556 | ~$100M/yr saved |
This is the only place Codec actually reduces GPU dollars — not because the model runs faster, but because the request never runs at all. On the ~90% of requests that DO reach the GPU, compute is unchanged. ~$50–100M/yr is the defensible range at 10% block rate; more aggressive client-side dedup / safety / format-validation pushes it higher.
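The range above is plain arithmetic on three assumptions: block rate, GPU-seconds per avoided call, and GPU price. A small sketch for plugging in different values (the function name is illustrative):

```python
def blocked_prompt_savings(daily_requests, block_rate, gpu_seconds_per_call, usd_per_gpu_hour):
    """Return (GPU-hours avoided per day, USD saved per year)."""
    blocked = daily_requests * block_rate
    gpu_hours_per_day = blocked * gpu_seconds_per_call / 3600
    return gpu_hours_per_day, gpu_hours_per_day * usd_per_gpu_hour * 365


hours, usd_blended = blocked_prompt_savings(5e9, 0.10, 1.0, 1.0)
_, usd_premium = blocked_prompt_savings(5e9, 0.10, 1.0, 2.0)
print(f"{hours:,.0f} GPU-hr/day, ${usd_blended / 1e6:.1f}M to ${usd_premium / 1e6:.1f}M per year")
```

The result is linear in every input, so halving the block rate or the average GPU-seconds halves the savings.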
server CPU floor
What Codec does NOT change on served requests
The model still runs at the same TPS. Codec doesn't accelerate token generation, doesn't change weights, doesn't change KV-cache footprint. If you're GPU-bound, Codec is invisible to your $/token.
What it DOES reduce on the server: response-side serialize + UTF-8 encode + per-token JSON envelope. That's typically 1–5% of server CPU on a busy node and dominates only at high concurrency. The practical effect is the concurrent-request ceiling per GPU moves up — lab measurements on sglang show ~5–10% more sustained concurrent streams before TTFB degrades, because the response-builder thread isn't UTF-8-encoding every token. That translates to fewer GPUs needed to hit the same QPS target, not faster individual responses.
Caveat: this is observation from the bench rig, not a controlled GPU-utilisation A/B against vanilla. Treat as directional — the safe claim is "Codec doesn't slow down inference and removes a known serialize bottleneck at concurrency." We do NOT claim a $/token reduction on the model itself.
What about the energy / CO₂ story? Codec does reduce non-GPU CPU + network electricity at every detokenize/retokenize/parse cycle that gets eliminated. At realistic heavy-agent workflow assumptions the savings work out to roughly 60-200 cars-equivalent of CO₂ per year today, ~250-800 by 2030 — real but small compared to the dollar + accessibility + IoT framings above. We don't lead with this; the methodology + reproducible harness live at packages/bench/docs/ENERGY_METHODOLOGY.md + packages/bench/scripts/energy_bench.py for anyone who wants to plug in their own conversion-event count, per-byte tokenizer cost, and blocked-prompt rate.
Accessibility — who can use AI when the wire shrinks 1,000×
AI as utility vs AI as luxury · the populations Codec brings inside the door
~5B people on slow / expensive / metered connections
JSON-SSE makes AI a rich-country product. Each real AI request actually moves ~4 MB across the wire (~8 wire round-trips × bidirectional; the heavy-agent baseline the cost card uses) — fine on US fibre, a luxury where mobile data costs $2–10/GB, a no-show on satellite or weak cell. Codec at ~2.4 KB per request works in all of those places at the same TTFB.
Per-AI-request data cost across the world
Mobile data retail prices vary >100× across countries. Wages don't. The per-request cost is what shows up on a metered customer's bill.
| Region / connection | Price ($ per GB unless noted) | JSON-SSE / request | Codec / request |
|---|---|---|---|
| US, postpaid mobile add'l | ~$10 | $0.040 | $0.000024 |
| India, prepaid mobile | ~$0.20 | $0.0008 | $0.0000005 |
| Kenya / Sub-Saharan Africa avg | ~$2–5 | $0.008–0.020 | ~$0.000005–0.000012 |
| Starlink Roam add'l data | ~$2 | $0.008 | $0.000005 |
| Starlink Mobile Priority overage | ~$1–2 | $0.004–0.008 | ~$0.000005 |
| Starlink Maritime Mobile Priority | ~$10 | $0.040 | $0.000024 |
| In-flight Wi-Fi (Gogo / Viasat) | ~$15–30/hr | ~$0.80 | ~$0.0005 |
| Iridium satellite (legacy maritime) | ~$5–15/MB | ~$20–60 | $0.012–0.036 |
Bottom row is the dramatic one: at Iridium maritime rates, a single agentic AI request bills $20–60 just for envelopes. Same answer on Codec wire bills under 4 cents. That's not a small efficiency win — it's the difference between "AI on this connection" and "AI is impossible on this connection."
Starlink — "pay-per-byte AI" at fleet scale
Starlink's metered tiers (Roam add'l, Mobile Priority overage, Maritime Priority) make the per-byte savings show up as a line-item dollar figure on the customer's bill, every month.
| Customer profile | AI requests/mo | Starlink tier | JSON-SSE / mo | Codec saves |
|---|---|---|---|---|
| RV / van-life heavy user (100 requests/day) | ~3K | Roam $2/GB | $24/mo | ~$24/mo / ~$290/yr |
| Remote homestead remote-work (300 requests/day) | ~9K | Roam $2/GB | $72/mo | ~$72/mo / ~$865/yr |
| Offshore vessel, 20-crew running AI agents (~1K req/day) | ~30K | Maritime $10/GB | $1,200/mo | ~$1.2K/mo / ~$14.4K/yr per vessel |
| Mining / oil-rig field ops, 100 users (~5K req/day) | ~150K | Maritime $10/GB | $6,000/mo | ~$6K/mo / ~$72K/yr per site |
| Starlink fleet estimate: ~1M subs on metered tiers running heavy-agent AI | ~3B | blended ~$2–3/GB | ~$12M/mo | ~$150M+/yr Starlink bandwidth saved |
The big number isn't consumer Starlink — it's fleet deployments. An offshore vessel running heavy-agent AI tools through Starlink Maritime at $10/GB now pays ~$1,200/month just for the JSON-SSE round-trip overhead on a 20-crew, 1K-request/day workload. Drop that to Codec wire and the same workload bills out under $1/month — a $14K/year line item per vessel that goes away. A maritime operator running 100 vessels: $1.4M/year of Starlink Maritime bandwidth invoiced for envelopes. Aviation customers ($2K–25K/month all-in plans) don't see a per-GB line item but get many more concurrent AI sessions on the same priority pool. Starlink Business / Maritime / Aviation tariffs are public. Numbers scale with the multi-round-trip multiplier — chat-only deployments are 4× smaller, deep-research / sub-agent-heavy flows push higher.
Who this brings inside the door: the ITU's 2024 Facts & Figures puts ~2.6 billion people offline entirely and another ~2–3 billion online via slow or expensive mobile-only connections; A4AI's affordability index flags 1 GB of mobile data as costing >5% of monthly income across most of sub-Saharan Africa and parts of Asia. JSON-SSE at ~4 MB per heavy-agent AI request (the actual real-world transaction size, both directions, all round-trips) prices those users out; Codec at ~2.4 KB per request pushes per-request cost into rounding error in any region, on any connection, on any device. The Response Time card below shows where JSON-SSE crosses from "sluggish" to "unusable" while Codec keeps working — mobile 4G to 2G/EDGE to satellite voice link, all viable. Plus: $50 Android phones can run agent loops on Codec frames where JSON-SSE parse + re-tokenize would burn the battery.
And then there's everything that isn't a phone — IoT & low-bandwidth networks
There are entire device categories that physically can't fit a JSON-SSE AI conversation in their network budget. Not "too expensive" — impossible. Codec brings AI to them for the first time.
What this opens up: smart agriculture with AI-derived irrigation recommendations on a daily LoRaWAN uplink; wildlife / anti-poaching collars with on-the-fly classification over satellite; cold-chain logistics with AI anomaly alerts en route instead of at port arrival; disaster response + field medicine on degraded networks; pipeline / grid / remote-infrastructure predictive maintenance; maritime, aviation, and expedition AI advisories on the satellite link that previously could only carry position pings; smart-city endpoints (parking / lighting / water / waste) running adaptive AI on the same NB-IoT links they already use. These are massive markets the AI industry has treated as out of scope — not because AI can't help them, because JSON envelopes can't reach them.
The Codec C99 library is small enough for microcontrollers. Six client packages (TypeScript, Python, Rust, Java, .NET, C) cover everything from a phone browser down to a battery-budget LoRaWAN endpoint. For the first time, AI is something a sensor can call — not just something the cloud can do to a sensor's data.
Response time — what the user feels
TTFB unchanged · wire-transfer time goes from dominant to invisible as link speed drops
~3.5 s → ~0.4 s per AI request on 10 Mbps mobile (heavy-agent baseline); JSON-SSE crosses into "unusable" around 1 Mbps, Codec stays under half a second
Same scope as the cost card: per real AI request, not per single response. A modern AI call makes ~8 wire round-trips behind the scenes (user→API, agent-to-agent, tool calls, tool results, sub-agent, synthesis, final response). Each round-trip pays TTFB — the per-request totals below are the heavy-agent baseline at 8 round-trips. TTFB itself is unchanged by Codec (the model still runs first); what changes is the payload-transfer time at every hop, which goes from "dominates" to "invisible."
Wall-clock per heavy-agent AI request, by link speed
~8 round-trips × (~45 ms TTFB + payload transfer at link rate). Chat-only deployments are ~¼ these numbers.
| Link | JSON-SSE per request | Codec per request |
|---|---|---|
| Datacenter LAN, 1 Gbps | ~390 ms | ~360 ms |
| Office Wi-Fi, 100 Mbps | ~680 ms | ~360 ms |
| Mobile 4G, 10 Mbps | ~3.5 s | ~0.4 s |
| Edge / weak mobile, 1 Mbps | ~32 s | ~0.4 s |
| Satellite / IoT, 256 Kbps | ~2 min | ~0.5 s |
The interesting line isn't the speedup ratio — it's the absolute wall-clock. JSON-SSE on a 1 Mbps edge link takes half a minute per agentic AI request. On 256 Kbps satellite it's 2 minutes. That's unusable. Codec keeps the same workload under half a second on every connection.
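The table follows from the stated model: ~8 round-trips × (~45 ms TTFB + payload transfer at link rate). The sketch below assumes each round-trip carries the 2K-token payload measured on sglang (485 KB JSON-SSE, 291 B Codec) in decimal bytes; that per-hop payload is an assumption inferred from the ~4 MB and ~2.4 KB per-request totals.

```python
ROUND_TRIPS = 8
TTFB_S = 0.045
JSON_BYTES = 485_000   # per round-trip, JSON-SSE (assumed equal on every hop)
CODEC_BYTES = 291      # per round-trip, Codec (assumed equal on every hop)


def request_seconds(payload_bytes, link_bps):
    """Wall-clock for one heavy-agent request: round-trips x (TTFB + transfer time)."""
    return ROUND_TRIPS * (TTFB_S + payload_bytes * 8 / link_bps)


links = {"1 Gbps": 1e9, "100 Mbps": 100e6, "10 Mbps": 10e6, "1 Mbps": 1e6, "256 Kbps": 256e3}
for name, bps in links.items():
    print(f"{name:9s} JSON-SSE {request_seconds(JSON_BYTES, bps):7.2f} s   "
          f"Codec {request_seconds(CODEC_BYTES, bps):5.2f} s")
```

On the Codec side the TTFB term dominates at every link speed, which is why that column barely moves.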
Agent loop end-to-end — measured, not extrapolated
From the v0.4.1 agent-loop bench. Wall-clock includes the model emitting the tool call + dispatch + tool result + final answer.
Client and server agree on which tokenizer to speak before any token
ID crosses the wire. Maps are sha256-addressed JSON: pull a
pre-generated one from codec-maps,
or generate one for any model with a tokenizer.json
via the maps CLI.
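A rough picture of what "sha256-addressed JSON" buys you: both sides hash the map and compare IDs instead of shipping the vocabulary. The canonical form used here (sorted keys, compact separators, UTF-8), the "sha256:" prefix, and the map layout are assumptions for illustration, not Codec's actual rules.

```python
import hashlib
import json


def map_id(tokenizer_map: dict) -> str:
    """Content address for a tokenizer map: sha256 over a canonical JSON encoding.

    The canonical form (sorted keys, compact separators, UTF-8) is an assumption.
    """
    canonical = json.dumps(tokenizer_map, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


client_map = {"model": "toy", "vocab": {"hello": 0, " world": 1}}
server_map = {"vocab": {" world": 1, "hello": 0}, "model": "toy"}  # same content, different key order
other_map = {"model": "toy", "vocab": {"hello": 0, " world": 2}}   # one ID differs

assert map_id(client_map) == map_id(server_map)   # safe to exchange raw IDs
assert map_id(client_map) != map_id(other_map)    # mismatch caught before any token is sent
```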
02
Stream uint32.
Token IDs go directly on the wire as 32-bit big-endian integers. No
JSON envelope, no UTF-8 round-trip, no per-message structural
overhead. Four bytes per token, every token.
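In Python terms, the data path is a single struct call in each direction. This sketch covers only the raw word layout described above (unsigned 32-bit, big-endian); the helper names are illustrative and it leaves out control words and any compression layer.

```python
import struct


def encode_ids(ids):
    """Pack token IDs as unsigned 32-bit big-endian words, four bytes per token."""
    return struct.pack(f">{len(ids)}I", *ids)


def decode_ids(payload):
    """Inverse of encode_ids."""
    return list(struct.unpack(f">{len(payload) // 4}I", payload))


ids = [9906, 1917, 0, 128009]
wire = encode_ids(ids)

assert decode_ids(wire) == ids        # lossless round-trip
assert len(wire) == 4 * len(ids)      # four bytes per token
print(wire[:4].hex())  # 000026b2
```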
03
Frame with control words.
The high byte of each word distinguishes data from control. Roles,
tool calls, completion boundaries, and stream resets ride in-band as
reserved control IDs. One framing layer covers everything.
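One way to picture the in-band framing, as a sketch only: it assumes a zero high byte marks a data word and any non-zero high byte marks control, and the three control IDs are made-up placeholders rather than Codec's reserved values.

```python
import struct

# Assumed layout: high byte 0x00 marks a data word; any other high byte marks control.
# The control IDs below are made-up placeholders, not Codec's reserved values.
CTRL_ROLE_ASSISTANT = 0xFF000001
CTRL_TOOL_CALL = 0xFF000002
CTRL_END = 0xFF0000FF


def is_control(word):
    """A word is control when its high byte is non-zero (assumed convention)."""
    return (word >> 24) != 0


def split_frames(payload):
    """Separate a big-endian uint32 stream into data token IDs and control words."""
    data, control = [], []
    for (word,) in struct.iter_unpack(">I", payload):
        (control if is_control(word) else data).append(word)
    return data, control


payload = struct.pack(">5I", CTRL_ROLE_ASSISTANT, 9906, 1917, 0, CTRL_END)
data, control = split_frames(payload)
print(data)                           # [9906, 1917, 0]
print([hex(w) for w in control])      # ['0xff000001', '0xff0000ff']
```

Under this assumed convention, data token IDs are limited to 24 bits, which still leaves room for vocabularies of up to ~16.7M entries.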
Ready to ship bytes, not sentences?
Codec is source-available under BSL 1.1, free for non-production use
and for production use under US $5M annual revenue.