Codec is the substrate that lets gateways, routers, agents, and tool
dispatchers operate on raw token IDs end-to-end. Detokenize once, at
the only edge that needs text — never per-token, never per-hop.
Compression and wire reduction are byproducts; what you actually buy is
the ability to run the inference layer like infrastructure.
Token IDs straight on the wire. Tool-call dispatch, observability,
cross-vocab handoff — all the things you'd want to do at the
inference layer reduce to integer compares on the stream. Detokenize
becomes an edge step, not a per-token cost.
control-plane primitives
Three operations. All on raw token IDs.
Codec gives the inference layer the same primitives a service mesh gives
a microservice fleet: route, dispatch, translate.
Run them on raw uint32 tokens, never on text. The
compression you see in the receipts below is what falls out for free
when you stop reserializing every hop.
route
Wire-native streaming
Token IDs flow as length-prefixed binary frames — no JSON
envelope per token, no UTF-8 round-trip at every hop. Compression
is a layer on the same wire (gzip / brotli / dict-zstd). Same
framing on every engine in the matrix; clients in six languages
producing byte-identical output. Receipts: a short chat reply
shrinks ~67× (15.2 KB JSON-SSE →
226 B Codec+gzip), a 2 K-token agent stream
1,404×. TTFB unchanged.
1,404× peak wire reduction
3 engines, one wire
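The wire shape is simple enough to sketch. A minimal illustration in Python, assuming a 4-byte big-endian length prefix and a one-byte compression flag per frame; the real Codec layout, and its brotli/dict-zstd modes, are defined by the spec:

```python
import gzip
import struct

def encode_frame(token_ids: list[int], compress: bool = False) -> bytes:
    """One frame: [uint32 length][uint8 codec flag][payload].

    Payload is raw big-endian uint32 token IDs, optionally gzipped.
    No JSON envelope, no UTF-8 round-trip, no per-token structure.
    """
    payload = struct.pack(f">{len(token_ids)}I", *token_ids)
    if compress:
        payload = gzip.compress(payload)
    return struct.pack(">IB", len(payload), int(compress)) + payload

def decode_frame(frame: bytes) -> list[int]:
    length, flag = struct.unpack_from(">IB", frame, 0)
    payload = frame[5 : 5 + length]
    if flag:
        payload = gzip.decompress(payload)
    return list(struct.unpack(f">{len(payload) // 4}I", payload))

ids = [128000, 9906, 1917]
assert decode_frame(encode_frame(ids, compress=True)) == ids
```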
dispatch
Tool calls without detokenize
ToolWatcher matches reserved control IDs in the raw token stream
with a single 32-bit compare per token — no detokenize, no
regex, no per-chunk text scan. The
MetaMCP gateway is where the
primitive lives in production, but the same hook works in any
inference proxy, agent runtime, or middleware. Detokenize runs
once at the JSON-RPC seam; everything upstream stays
token-native. Microbench: 0.61 ms vs 60.4 ms
on a 1 M-token stream.
100× vs detokenize+regex
0 text on the hot path
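The dispatch primitive is small. A sketch with hypothetical control-ID values (the real IDs come from the negotiated map, and ToolWatcher's actual API may differ):

```python
# Hypothetical reserved control IDs; real values come from the negotiated map.
TOOL_CALL_BEGIN = 0xFF000001
TOOL_CALL_END = 0xFF000002

def watch_tool_calls(token_stream):
    """Yield (start, end) index pairs marking tool-call regions.

    One 32-bit integer compare per token: no detokenize, no regex,
    no per-chunk text scan. Everything upstream stays token-native.
    """
    start = None
    for i, word in enumerate(token_stream):
        if word == TOOL_CALL_BEGIN:
            start = i
        elif word == TOOL_CALL_END and start is not None:
            yield (start, i)
            start = None

# e.g. list(watch_tool_calls([9906, 0xFF000001, 42, 7, 0xFF000002, 1917]))
# -> [(1, 4)]
```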
translate
Cross-vocab agent handoff
A Llama-3 agent's stream feeds a Qwen-2 agent through one
in-process detokenize / retokenize step. UTF-8 never crosses
the wire. At 2 K tokens the bridge produces target-vocab
IDs ~30% sooner (10.9 ms →
7.7 ms of bridge CPU) on 15.1×
fewer wire bytes. Both paths emit byte-identical Qwen-2 output;
the bench asserts strict equality.
30% faster bridge
15.1× smaller wire
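The translate step reduces to one decode and one encode inside the bridge process. A minimal sketch using Hugging Face tokenizers, with illustrative checkpoints, not the production bridge:

```python
from transformers import AutoTokenizer

# Illustrative checkpoints; any Llama-3 / Qwen-2 tokenizer pair works.
src = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
dst = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B")

def bridge(src_ids: list[int]) -> list[int]:
    """One in-process detokenize/retokenize hop.

    Llama-3 IDs in, Qwen-2 IDs out. UTF-8 exists only inside this
    function; it never crosses the wire on either side.
    """
    text = src.decode(src_ids, skip_special_tokens=True)
    return dst.encode(text, add_special_tokens=False)
```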
receipts
What falls out when the inference layer stays token-native.
Compression isn't the headline — the primitives are. But once
every hop runs on raw uint32 token IDs, the wire reduction
and the tool-call latency floor are measurable byproducts. Numbers
below are from the cross-stack benchmark matrix: same prompt, same
model, three real inference engines, six real client languages. Every
cell is measured. Full SCHEMA-v1 result JSONs in
packages/bench/results/.
Streaming wire bytes by payload size
sglang · lower is better · Y axis is log scale (each gridline is 10×)
1,404× smaller at 2,048 tokens
Series: JSON-SSE · Codec (identity) · Codec + gzip (dict-zstd at 2 K)
Same Codec, three engines
2,048-token reply · best-available Codec wire vs JSON-SSE baseline
3 engines, one wire format
sglang
JSON-SSE: 485 KB
Codec msgpack + dict-zstd: 354 B
1,404× @ 45.6 ms
vllm
JSON-SSE: 479 KB
Codec msgpack + gzip: 3.9 KB
126× @ 67.3 ms
llama.cpp
JSON-SSE: 529 KB
Codec msgpack + gzip: 16 KB
33× @ 40.7 ms
Time-to-first-byte at 2 K tokens
Codec msgpack + gzip · median across reps · first body byte (Python httpx)
~40 ms on local-network sglang & llama.cpp
Cross-vocab handoff — Llama-3 → Qwen-2
Same source IDs, two wire paths · gzip on both · bridge produces byte-identical Qwen-2 output
30% faster bridge response @ 2 K · 15.1× smaller wire @ 2 K
Bridge response time — the latency the next agent waits on
CPU cost to turn the inbound stream into Qwen-2 IDs ready for agent B. Lower is better — this is the wall-clock the handoff blocks on, before agent B's first new token.
Wire bytes — the bandwidth the bridge has to ingest
What the bridge has to receive before any translation can run. Network-bound: the slower the link, the more this dominates response time on top of the bridge CPU above.
Tool-call detection on a 1 M-token stream
Time to detect that a token range is a tool-call region · lower is better
01
Handshake the map.
Client and server agree on which tokenizer to speak before any token
ID crosses the wire. Maps are sha256-addressed JSON: pull a
pre-generated one from codec-maps,
or generate one for any model with a tokenizer.json
via the maps CLI.
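The address itself is just a digest. A sketch, assuming the digest is taken over the exact bytes of the map JSON (the spec defines the canonical form; map_address is an illustrative name):

```python
import hashlib

def map_address(map_path: str) -> str:
    """sha256 content address of a tokenizer map file.

    Client and server compare digests during the handshake, before
    any token ID crosses the wire; a mismatch triggers a map fetch.
    """
    with open(map_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```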
02
Stream uint32.
Token IDs go directly on the wire as 32-bit big-endian integers. No
JSON envelope, no UTF-8 round-trip, no per-message structural
overhead. Four bytes per token, every token.
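In code, the whole encoding step is one pack call. A sketch with Python's struct module, not the client library:

```python
import struct

ids = [128000, 9906, 1917]                 # any uint32 token IDs
wire = struct.pack(f">{len(ids)}I", *ids)  # 32-bit big-endian, no envelope
assert len(wire) == 4 * len(ids)           # four bytes per token, every token
assert list(struct.unpack(f">{len(ids)}I", wire)) == ids
```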
03
Frame with control words.
The high byte of each word distinguishes data from control. Roles,
tool calls, completion boundaries, and stream resets ride in-band as
reserved control IDs. One framing layer covers everything.
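A sketch of the discriminator, assuming 0xFF as the reserved high byte (illustrative; the spec fixes the actual reserved ranges):

```python
import struct

CONTROL_HIGH_BYTE = 0xFF  # illustrative reserved prefix

def classify(word: int) -> str:
    """One shift and one compare per 32-bit word: data token or control."""
    return "control" if (word >> 24) == CONTROL_HIGH_BYTE else "data"

stream = struct.unpack(">3I", struct.pack(">3I", 9906, 0xFF000001, 1917))
assert [classify(w) for w in stream] == ["data", "control", "data"]
```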
Ready to ship bytes, not sentences?
Codec is source-available under BSL 1.1, free for non-production use
and for production use under US $5M annual revenue.
Patent posture: Quasarke is pursuing patent protection
on certain Codec mechanisms. The wire format, handshake, and
content-addressed map distribution described in the spec are intended
to be made available on royalty-free or FRAND terms to implementers of
the Codec specification when patents issue. See
PATENTS.md
for details.