v0.1 · source-available

The control plane for AI inference.

Codec is the substrate that lets gateways, routers, agents, and tool dispatchers operate on raw token IDs end-to-end. Detokenize once, at the only edge that needs text — never per-token, never per-hop. Compression and wire reduction are byproducts; what you actually buy is the ability to run the inference layer like infrastructure.

Token IDs straight on the wire. Tool-call dispatch, observability, cross-vocab handoff — all the things you'd want to do at the inference layer reduce to integer compares on the stream. Detokenization drops to a single edge step, not a per-token cost.

control-plane primitives

Three operations. All on raw token IDs.

Codec gives the inference layer the same primitives a service mesh gives a microservice fleet: route, dispatch, translate. Run them on raw uint32 tokens, never on text. The compression you see in the receipts below is what falls out for free when you stop reserializing every hop.

route

Wire-native streaming

Token IDs flow as length-prefixed binary frames — no JSON envelope per token, no UTF-8 round-trip at every hop. Compression is a layer on the same wire (gzip / brotli / dict-zstd). Same framing on every engine in the matrix; clients in six languages producing byte-identical output. Receipts: a short chat reply shrinks ~67× (15.2 KB JSON-SSE → 226 B Codec+gzip), a 2 K-token agent stream 1,404×. TTFB unchanged.

  • 1,404× peak wire reduction
  • 3 engines, one wire
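
For concreteness, here is a minimal sketch of what the consuming side of that wire can look like, assuming an illustrative layout of a 4-byte big-endian length prefix followed by big-endian uint32 token IDs (the helper name and exact layout are assumptions, not the normative Codec framing):

```python
import struct
from typing import BinaryIO, Iterator

def read_frames(stream: BinaryIO) -> Iterator[list[int]]:
    """Yield one list of token IDs per length-prefixed frame.

    Assumed layout for illustration: a 4-byte big-endian payload
    length, then that many bytes of big-endian uint32 token IDs.
    """
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return  # clean EOF between frames
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated mid-frame")
        # Four bytes per token; no JSON envelope, no UTF-8 round-trip.
        yield list(struct.unpack(f">{length // 4}I", payload))
```

Compression sits under this reader, not inside it: whether the transport negotiated gzip, brotli, or dict-zstd, the decompressed byte stream the reader sees is identical.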

dispatch

Tool calls without detokenize

ToolWatcher matches reserved control IDs in the raw token stream with a single 32-bit compare per token — no detokenize, no regex, no per-chunk text scan. The MetaMCP gateway is where the primitive lives in production, but the same hook works in any inference proxy, agent runtime, or middleware. Detokenize runs once at the JSON-RPC seam; everything upstream stays token-native. Microbench: 0.61 ms vs 60.4 ms on a 1 M-token stream.

  • 100× vs detokenize+regex
  • 0 text on the hot path
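
The hot path is small enough to sketch. The reserved ID values below are placeholders (the real ones come from the spec's control range), but the shape, a single equality compare per token, is the whole mechanism:

```python
from typing import Iterable, Iterator, Optional

# Placeholder reserved control IDs for illustration; the actual values
# are defined by the Codec spec, not by this sketch.
TOOL_CALL_BEGIN = 0xFF00_0001
TOOL_CALL_END = 0xFF00_0002

def watch(tokens: Iterable[int]) -> Iterator[tuple[int, int]]:
    """Yield (start, end) index pairs for tool-call regions.

    One 32-bit equality compare per token: no detokenize, no regex,
    no per-chunk text scan.
    """
    start: Optional[int] = None
    for i, tok in enumerate(tokens):
        if tok == TOOL_CALL_BEGIN:
            start = i
        elif tok == TOOL_CALL_END and start is not None:
            yield (start, i)
            start = None
```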

translate

Cross-vocab agent handoff

A Llama-3 agent's stream feeds a Qwen-2 agent through one in-process detokenize / retokenize step. UTF-8 never crosses the wire. At 2 K tokens the bridge produces target-vocab IDs ~30% sooner (10.9 ms → 7.7 ms of bridge CPU) on 15.1× fewer wire bytes. Both paths emit byte-identical Qwen-2 output; the bench asserts strict equality.

  • 30% faster bridge
  • 15.1× smaller wire
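
A sketch of the hop, with Hugging Face tokenizers standing in for Codec's Translator (the production Translator buffers on word boundaries rather than decoding whole messages; the model names here are assumptions):

```python
from transformers import AutoTokenizer

# Stand-in tokenizers for illustration only.
llama = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

def bridge(llama_ids: list[int]) -> list[int]:
    """Turn agent A's Llama-3 IDs into agent B's Qwen-2 IDs.

    UTF-8 exists only inside this function; both wire legs carry
    raw token IDs.
    """
    text = llama.decode(llama_ids, skip_special_tokens=True)
    return qwen.encode(text, add_special_tokens=False)
```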

receipts

What falls out when the inference layer stays token-native.

Compression isn't the headline — the primitives are. But once every hop runs on raw uint32 token IDs, the wire reduction and the tool-call latency floor are measurable byproducts. Numbers below are from the cross-stack benchmark matrix: same prompt, same model, three real inference engines, six real client languages. Every cell is measured. Full SCHEMA-v1 result JSONs in packages/bench/results/.

Streaming wire bytes by payload size

sglang · lower is better · Y axis is log scale (each gridline is 10×)

1,404× smaller at 2,048 tokens
[Chart: streaming wire bytes at 64 / 512 / 2,048 tokens, log-scale Y axis. Data labels at 2 K tokens: JSON-SSE 485.2 KB · Codec (identity) 30.0 KB · Codec + gzip 354 B. Inset: TTFB @ 2 K tokens, sglang, same wire: JSON-SSE ~46 ms vs Codec + gzip 45.6 ms, first-body-byte median, body-byte cohort.]
  • JSON-SSE
  • Codec (identity)
  • Codec + gzip (dict-zstd at 2 K)
At 2 K tokens the JSON-SSE stream ships 485 KB; Codec msgpack with dict-zstd ships 354 bytes. That's 237 B/token → 0.17 B/token at the same TTFB — first-body-byte lands at ~45 ms either way on the same sglang server. The wire reduction is essentially free in latency.
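
The per-token figures are straight division over the chart's 2 K data points; a quick check, reading the chart's KB as 10³ bytes:

```python
# Worked check of the per-token wire cost at 2,048 tokens.
print(f"{485_200 / 2048:.1f} B/token")  # 236.9 -> the ~237 above (JSON-SSE)
print(f"{354 / 2048:.2f} B/token")      # 0.17 (Codec msgpack + dict-zstd)
```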

Same Codec, three engines

2,048-token reply · best-available Codec wire vs JSON-SSE baseline

3 engines, one wire format
engine      JSON-SSE baseline   best-available Codec wire      reduction @ TTFB
sglang      485 KB              msgpack + dict-zstd · 354 B    1,404× @ 45.6 ms
vllm        479 KB              msgpack + gzip · 3.9 KB        126× @ 67.3 ms
llama.cpp   529 KB              msgpack + gzip · 16 KB         33× @ 40.7 ms
sglang reaches 1,404× because it has the full compression stack (gzip + brotli + dict-zstd) wired into the Codec chunker. vllm hits 126× with gzip alone — the dict-loader hook to engage zstd is in flight on the wdunn001/vllm fork. llama.cpp's HTTP layer ships gzip only; 33× at zero protocol cost.

Time-to-first-byte at 2 K tokens

Codec msgpack + gzip · median across reps · first body byte (Python httpx)

~40 ms on local-network sglang & llama.cpp
llama.cpp 40.7 ms · sglang 45.6 ms · vllm 67.3 ms
Codec doesn't trade latency for size. llama.cpp 40.7 ms, sglang 45.6 ms, and vllm 67.3 ms from POST to first body byte at 2 K tokens — the JSON-SSE baseline against the same engines is within 1 ms either way. The decode-side payoff (no UTF-8 reparse, no JSON.parse on every chunk) is pure cost reduction on the client.

Cross-vocab handoff — Llama-3 → Qwen-2

Same source IDs, two wire paths · gzip on both · bridge produces byte-identical Qwen-2 output

30% faster bridge response @ 2 K · 15.1× smaller wire @ 2 K

Bridge response time — the latency the next agent waits on

CPU cost to turn the inbound stream into Qwen-2 IDs ready for agent B. Lower is better — this is the wall-clock the handoff blocks on, before agent B's first new token.

tokens   JSON-SSE   Codec msgpack   delta
64       1.21 ms    1.16 ms         tied
512      4.79 ms    3.88 ms         −19%
2,048    10.9 ms    7.67 ms         −30%

Wire bytes — the bandwidth the bridge has to ingest

What the bridge has to receive before any translation can run. Network-bound: the slower the link, the more this dominates response time on top of the bridge CPU above.

tokens   JSON-SSE   Codec msgpack   reduction
64       585 B      215 B           2.7×
512      2.9 KB     672 B           4.3×
2,048    10.4 KB    709 B           15.1×
At a 2 K-token agent A→B handoff, the Codec bridge produces Qwen-2 IDs 3.2 ms sooner (10.9 ms → 7.7 ms, ~30% less CPU) and ingests 15× fewer wire bytes (10.4 KB → 709 B) along the way. The CPU win is the irreducible cost — the bridge skips JSON-SSE parsing and BPE-from-scratch, and runs Translator's word-boundary-buffered re-tokenize instead. The wire win compounds it on slow links: on a 10 Mbps mobile uplink that's another ~8 ms saved on top of the CPU savings, every handoff. Both paths emit the same Qwen-2 token stream byte-for-byte; the bench asserts strict equality before reporting numbers.
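
The ~8 ms uplink figure is plain arithmetic over the wire table above; a quick check, reading the table's KB as 10³ bytes:

```python
# Worked check of the 10 Mbps uplink claim at the 2 K-token handoff.
json_sse_b = 10.4e3   # wire bytes into the bridge, JSON-SSE
codec_b = 709         # wire bytes into the bridge, Codec msgpack
uplink_bps = 10e6     # 10 Mbps mobile uplink

saved_ms = (json_sse_b - codec_b) * 8 / uplink_bps * 1e3
print(f"~{saved_ms:.1f} ms saved per handoff")  # ~7.8 ms, i.e. the ~8 ms above
```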

Tool-call detection on a 1 M-token stream

Time to detect that a token range is a tool-call region · lower is better

100× faster, no detokenize
Detokenize + regex 60.4 ms · Codec ToolWatcher 0.61 ms
ToolWatcher matches reserved control IDs in the token stream with a single 32-bit compare per token. The text path has to detokenize back to UTF-8 first — the same work the model just did in reverse, every chunk. Same advantage shows up in agent loops: a sub-millisecond in-band signal beats a multi-millisecond out-of-band parse on every single tool boundary.
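
The token-native half of that microbench is easy to approximate. An illustrative reproduction, not the harness in packages/bench; the vocab size and the placeholder control ID are assumptions:

```python
import time
import numpy as np

TOOL_CALL_BEGIN = 0xFF00_0001  # placeholder reserved ID, as in the dispatch sketch

# 1 M random token IDs (~Qwen-2 vocab size) with one control ID planted.
tokens = np.random.randint(0, 152_000, size=1_000_000).astype(np.uint32)
tokens[500_000] = TOOL_CALL_BEGIN

t0 = time.perf_counter()
hits = np.flatnonzero(tokens == TOOL_CALL_BEGIN)  # one compare per token
dt_ms = (time.perf_counter() - t0) * 1e3
print(f"token-native scan: {dt_ms:.2f} ms, hit at {hits.tolist()}")
```

The text path has no equivalent shortcut: it must detokenize the same million tokens back to UTF-8 before any regex can run, which is where the other two orders of magnitude go.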

Source: cross-stack MATRIX.md · sglang nightly, vllm nightly + PR #41765, llama.cpp + PR #22757 · Qwen-2.5 0.5B · RTX 3090 · temp 0.0 · reproducible from packages/bench/scripts/run-all-langs.sh.

how it works

Three pieces. That's the whole spec.

  1. Handshake the vocab.

    Client and server agree on which tokenizer to speak before any token ID crosses the wire. Maps are sha256-addressed JSON: pull a pre-generated one from codec-maps, or generate one for any model with a tokenizer.json via the maps CLI.

  2. Stream uint32.

    Token IDs go directly on the wire as 32-bit big-endian integers. No JSON envelope, no UTF-8 round-trip, no per-message structural overhead. Four bytes per token, every token.

  3. Frame with control words.

    The high byte of each word distinguishes data from control. Roles, tool calls, completion boundaries, and stream resets ride in-band as reserved control IDs. One framing layer covers everything; all three pieces are sketched below.
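
A minimal end-to-end sketch of the three pieces. The handshake field names, the 0xFF control byte, and the helper names are illustrative assumptions, not the normative spec:

```python
import hashlib
import json
import struct

# 01 -- handshake: maps are sha256-addressed, so both sides can pin the
# tokenizer by content hash before any token ID crosses the wire.
# The message fields here are placeholders, not the spec's field names.
def hello(map_path: str) -> bytes:
    digest = hashlib.sha256(open(map_path, "rb").read()).hexdigest()
    return json.dumps({"codec": "0.1", "map_sha256": digest}).encode()

# 02 -- stream: four bytes per token, big-endian uint32.
def encode(ids: list[int]) -> bytes:
    return struct.pack(f">{len(ids)}I", *ids)

def decode(buf: bytes) -> list[int]:
    return list(struct.unpack(f">{len(buf) // 4}I", buf))

# 03 -- control words: the high byte splits data from control.
# 0xFF stands in for the spec's actual reserved range.
def is_control(word: int) -> bool:
    return (word >> 24) == 0xFF
```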

Ready to ship bytes, not sentences?

Codec is source-available under BSL 1.1, free for non-production use and for production use under US $5M annual revenue.

Patent posture: Quasarke is pursuing patent protection on certain Codec mechanisms. The wire format, handshake, and content-addressed map distribution described in the spec are intended to be made available on royalty-free or FRAND terms to implementers of the Codec specification when patents issue. See PATENTS.md for details.