v0.1 · source-available

The control plane for AI inference.

Codec is the substrate that lets gateways, routers, agents, and tool dispatchers operate on raw token IDs end-to-end. Detokenize once, at the only edge that needs text — never per-token, never per-hop. Compression and wire reduction are byproducts; what you actually buy is the ability to run the inference layer like infrastructure.

Token IDs straight on the wire. Tool-call dispatch, observability, cross-vocab handoff — all the things you'd want to do at the inference layer reduce to integer compares on the stream. Detokenization drops to a single edge step, not a per-token cost.

control-plane primitives

Three operations. All on raw token IDs.

Codec gives the inference layer the same primitives a service mesh gives a microservice fleet: route, dispatch, translate. Run them on raw uint32 tokens, never on text. The compression you see in the receipts below is what falls out for free when you stop reserializing every hop.

route

Wire-native streaming

Token IDs flow as length-prefixed binary frames — no JSON envelope per token, no UTF-8 round-trip at every hop. Compression is a layer on the same wire (gzip / brotli / dict-zstd). Same framing on every engine in the matrix; clients in six languages producing byte-identical output. Receipts: a short chat reply shrinks ~67× (15.2 KB JSON-SSE → 226 B Codec+gzip), a 2 K-token agent stream 1,404×. TTFB unchanged.

  • 1,404× peak wire reduction
  • 3 engines, one wire
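
For concreteness, here is a minimal sketch of what the consuming side of that wire can look like, assuming an illustrative layout of a 4-byte big-endian length prefix followed by big-endian uint32 token IDs (the helper name and exact layout are assumptions, not the normative Codec framing):

```python
import struct
from typing import BinaryIO, Iterator

def read_frames(stream: BinaryIO) -> Iterator[list[int]]:
    """Yield one list of token IDs per length-prefixed frame.

    Assumed layout for illustration: a 4-byte big-endian payload
    length, then that many bytes of big-endian uint32 token IDs.
    """
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # clean EOF between frames
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated mid-frame")
        # Four bytes per token; no JSON envelope, no UTF-8 round-trip.
        yield list(struct.unpack(f">{length // 4}I", payload))
```

Compression sits under this reader, not inside it: whether the transport negotiated gzip, brotli, or dict-zstd, the decompressed byte stream the reader sees is identical.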

dispatch

Tool calls without detokenize

ToolWatcher matches reserved control IDs in the raw token stream with a single 32-bit compare per token — no detokenize, no regex, no per-chunk text scan. The MetaMCP gateway is where the primitive lives in production, but the same hook works in any inference proxy, agent runtime, or middleware. Detokenize runs once at the JSON-RPC seam; everything upstream stays token-native. Microbench: 0.61 ms vs 60.4 ms on a 1 M-token stream.

  • 100× vs detokenize+regex
  • 0 text on the hot path
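
The hot path is small enough to sketch. The reserved ID values below are placeholders (the real ones come from the spec's control range), but the shape, a single equality compare per token, is the whole mechanism:

```python
from typing import Iterable, Iterator, Optional

# Placeholder reserved control IDs for illustration; the actual values
# are defined by the Codec spec, not by this sketch.
TOOL_CALL_BEGIN = 0xFF00_0001
TOOL_CALL_END = 0xFF00_0002

def watch(tokens: Iterable[int]) -> Iterator[tuple[int, int]]:
    """Yield (start, end) index pairs for tool-call regions.

    One 32-bit equality compare per token: no detokenize, no regex,
    no per-chunk text scan.
    """
    start: Optional[int] = None
    for i, tok in enumerate(tokens):
        if tok == TOOL_CALL_BEGIN:
            start = i
        elif tok == TOOL_CALL_END and start is not None:
            yield (start, i)
            start = None
```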

translate

Cross-vocab agent handoff

A Llama-3 agent's stream feeds a Qwen-2 agent through one in-process detokenize / retokenize step. UTF-8 never crosses the wire. At 2 K tokens the bridge produces target-vocab IDs ~30% sooner (10.9 ms → 7.7 ms of bridge CPU) on 15.1× fewer wire bytes. Both paths emit byte-identical Qwen-2 output; the bench asserts strict equality.

  • 30% faster bridge
  • 15.1× smaller wire
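
A sketch of the hop, with Hugging Face tokenizers standing in for Codec's Translator (the production Translator buffers on word boundaries rather than decoding whole messages; the model names here are assumptions):

```python
from transformers import AutoTokenizer

# Stand-in tokenizers for illustration only.
llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

def bridge(llama_ids: list[int]) -> list[int]:
    """Turn agent A's Llama-3 IDs into agent B's Qwen-2 IDs.

    UTF-8 exists only inside this function; both wire legs carry
    raw token IDs.
    """
    text = llama.decode(llama_ids, skip_special_tokens=True)
    return qwen.encode(text, add_special_tokens=False)
```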

receipts

What falls out when the inference layer stays token-native.

Compression isn't the headline — the primitives are. But once every hop runs on raw uint32 token IDs, the wire reduction and the tool-call latency floor are measurable byproducts. Numbers below are from the cross-stack benchmark matrix: same prompt, same model, three real inference engines, six real client languages. Every cell is measured. Full SCHEMA-v1 result JSONs in packages/bench/results/.

Streaming wire bytes by payload size

sglang · lower is better · Y axis is log scale (each gridline is 10×)

1,404× smaller at 2,048 tokens
[Chart: streaming wire bytes at 64 / 512 / 2,048 tokens, log-scale Y axis. Data labels at 2 K tokens: JSON-SSE 485.2 KB · Codec (identity) 30.0 KB · Codec + gzip 354 B. Inset: TTFB @ 2 K tokens, sglang, same wire: JSON-SSE ~46 ms vs Codec + gzip 45.6 ms, first-body-byte median, body-byte cohort.]
  • JSON-SSE
  • Codec (identity)
  • Codec + gzip (dict-zstd at 2 K)
At 2 K tokens the JSON-SSE stream ships 485 KB; Codec msgpack with dict-zstd ships 354 bytes. That's 237 B/token → 0.17 B/token at the same TTFB — first-body-byte lands at ~45 ms either way on the same sglang server. The wire reduction is essentially free in latency.
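
The per-token figures are straight division over the chart's 2 K data points; a quick check, reading the chart's KB as 10³ bytes:

```python
# Worked check of the per-token wire cost at 2,048 tokens.
print(f"{485_200 / 2048:.1f} B/token")  # 236.9 -> the ~237 above (JSON-SSE)
print(f"{354 / 2048:.2f} B/token")      # 0.17 (Codec msgpack + dict-zstd)
```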

Same Codec, three engines

2,048-token reply · best-available Codec wire vs JSON-SSE baseline

3 engines, one wire format
engine      JSON-SSE baseline   best-available Codec wire      reduction @ TTFB
sglang      485 KB              msgpack + dict-zstd · 354 B    1,404× @ 45.6 ms
vllm        479 KB              msgpack + gzip · 3.9 KB        126× @ 67.3 ms
llama.cpp   529 KB              msgpack + gzip · 16 KB         33× @ 40.7 ms
sglang reaches 1,404× because it has the full compression stack (gzip + brotli + dict-zstd) wired into the Codec chunker. vllm hits 126× with gzip alone — the dict-loader hook to engage zstd is in flight on the wdunn001/vllm fork. llama.cpp's HTTP layer ships gzip only; 33× at zero protocol cost.

Time-to-first-byte at 2 K tokens

Codec msgpack + gzip · median across reps · first body byte (Python httpx)

~40 ms on local-network sglang & llama.cpp
llama.cpp 40.7 ms · sglang 45.6 ms · vllm 67.3 ms
Codec doesn't trade latency for size. llama.cpp 40.7 ms, sglang 45.6 ms, and vllm 67.3 ms from POST to first body byte at 2 K tokens — the JSON-SSE baseline against the same engines is within 1 ms either way. The decode-side payoff (no UTF-8 reparse, no JSON.parse on every chunk) is pure cost reduction on the client.

Cross-vocab handoff — Llama-3 → Qwen-2

Same source IDs, two wire paths · gzip on both · bridge produces byte-identical Qwen-2 output

30% faster bridge response @ 2 K · 15.1× smaller wire @ 2 K

Bridge response time — the latency the next agent waits on

CPU cost to turn the inbound stream into Qwen-2 IDs ready for agent B. Lower is better — this is the wall-clock the handoff blocks on, before agent B's first new token.

tokens   JSON-SSE   Codec msgpack   delta
64       1.21 ms    1.16 ms         tied
512      4.79 ms    3.88 ms         −19%
2,048    10.9 ms    7.67 ms         −30%

Wire bytes — the bandwidth the bridge has to ingest

What the bridge has to receive before any translation can run. Network-bound: the slower the link, the more this dominates response time on top of the bridge CPU above.

tokens   JSON-SSE   Codec msgpack   reduction
64       585 B      215 B           2.7×
512      2.9 KB     672 B           4.3×
2,048    10.4 KB    709 B           15.1×
At a 2 K-token agent A→B handoff, the Codec bridge produces Qwen-2 IDs 3.2 ms sooner (10.9 ms → 7.7 ms, ~30% less CPU) and ingests 15× fewer wire bytes (10.4 KB → 709 B) along the way. The CPU win is the irreducible cost — the bridge skips JSON-SSE parsing and BPE-from-scratch, and runs Translator's word-boundary-buffered re-tokenize instead. The wire win compounds it on slow links: on a 10 Mbps mobile uplink that's another ~8 ms saved on top of the CPU savings, every handoff. Both paths emit the same Qwen-2 token stream byte-for-byte; the bench asserts strict equality before reporting numbers.
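
The ~8 ms uplink figure is plain arithmetic over the wire table above; a quick check, reading the table's KB as 10³ bytes:

```python
# Worked check of the 10 Mbps uplink claim at the 2 K-token handoff.
json_sse_b = 10.4e3   # wire bytes into the bridge, JSON-SSE
codec_b = 709         # wire bytes into the bridge, Codec msgpack
uplink_bps = 10e6     # 10 Mbps mobile uplink

saved_ms = (json_sse_b - codec_b) * 8 / uplink_bps * 1e3
print(f"~{saved_ms:.1f} ms saved per handoff")  # ~7.8 ms, i.e. the ~8 ms above
```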

Tool-call detection on a 1 M-token stream

Time to detect that a token range is a tool-call region · lower is better

100× faster, no detokenize
Detokenize + regex 60.4 ms · Codec ToolWatcher 0.61 ms
ToolWatcher matches reserved control IDs in the token stream with a single 32-bit compare per token. The text path has to detokenize back to UTF-8 first — the same work the model just did in reverse, every chunk. Same advantage shows up in agent loops: a sub-millisecond in-band signal beats a multi-millisecond out-of-band parse on every single tool boundary.
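
The token-native half of that microbench is easy to approximate. An illustrative reproduction, not the harness in packages/bench; the vocab size and the placeholder control ID are assumptions:

```python
import time
import numpy as np

TOOL_CALL_BEGIN = 0xFF00_0001  # placeholder reserved ID, as in the dispatch sketch

# 1 M random token IDs (~Qwen-2 vocab size) with one control ID planted.
tokens = np.random.randint(0, 152_000, size=1_000_000).astype(np.uint32)
tokens[500_000] = TOOL_CALL_BEGIN

t0 = time.perf_counter()
hits = np.flatnonzero(tokens == TOOL_CALL_BEGIN)  # one compare per token
dt_ms = (time.perf_counter() - t0) * 1e3
print(f"token-native scan: {dt_ms:.2f} ms, hit at {hits.tolist()}")
```

The text path has no equivalent shortcut: it must detokenize the same million tokens back to UTF-8 before any regex can run, which is where the other two orders of magnitude go.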

Source: cross-stack MATRIX.md · sglang nightly, vllm nightly + PR #41765, llama.cpp + PR #22757 · Qwen-2.5 0.5B · RTX 3090 · temp 0.0 · reproducible from packages/bench/scripts/run-all-langs.sh.

how it works

Three pieces. That's the whole spec.

  1. Handshake the vocab.

    Client and server agree on which tokenizer to speak before any token ID crosses the wire. Maps are sha256-addressed JSON: pull a pre-generated one from codec-maps, or generate one for any model with a tokenizer.json via the maps CLI.

  2. Stream uint32.

    Token IDs go directly on the wire as 32-bit big-endian integers. No JSON envelope, no UTF-8 round-trip, no per-message structural overhead. Four bytes per token, every token.

  3. Frame with control words.

    The high byte of each word distinguishes data from control. Roles, tool calls, completion boundaries, and stream resets ride in-band as reserved control IDs. One framing layer covers everything; all three pieces are sketched below.
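
A minimal end-to-end sketch of the three pieces. The handshake field names, the 0xFF control byte, and the helper names are illustrative assumptions, not the normative spec:

```python
import hashlib
import json
import struct

# 01 -- handshake: maps are sha256-addressed, so both sides can pin the
# tokenizer by content hash before any token ID crosses the wire.
# The message fields here are placeholders, not the spec's field names.
def hello(map_path: str) -> bytes:
    digest = hashlib.sha256(open(map_path, "rb").read()).hexdigest()
    return json.dumps({"codec": "0.1", "map_sha256": digest}).encode()

# 02 -- stream: four bytes per token, big-endian uint32.
def encode(ids: list[int]) -> bytes:
    return struct.pack(f">{len(ids)}I", *ids)

def decode(buf: bytes) -> list[int]:
    return list(struct.unpack(f">{len(buf) // 4}I", buf))

# 03 -- control words: the high byte splits data from control.
# 0xFF stands in for the spec's actual reserved range.
def is_control(word: int) -> bool:
    return (word >> 24) == 0xFF
```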

Ready to ship bytes, not sentences?

Codec is source-available under BSL 1.1, free for non-production use and for production use under US $5M annual revenue.

Patent posture: Quasarke is pursuing patent protection on certain Codec mechanisms. The wire format, handshake, and content-addressed map distribution described in the spec are intended to be made available on royalty-free or FRAND terms to implementers of the Codec specification when patents issue. See PATENTS.md for details.