What is Codec?
Codec is a control-plane primitive for AI inference. It’s the substrate that lets gateways, routers, agents, tool dispatchers, and observers operate on raw token IDs end-to-end — no detokenize on the hot path, no JSON-parse per token, no UTF-8 round-trip at every hop. Compression and wire reduction are byproducts of the framing; what you actually buy is the ability to run the inference layer like infrastructure.
It separates three concerns that today's `/v1/chat/completions` mashes together:
- Token IDs — the model layer. The protocol carries `uint32` IDs directly. Routing, sampling, observability, and tool dispatch all reduce to integer compares on the stream.
- Framing — the transport layer. Length-prefixed msgpack or protobuf frames; the same wire on every engine in the matrix.
- UTF-8 text — the presentation layer. Detokenization happens at the edge, only at the boundary that has a real reason to need text (a human display, a JSON-RPC tool call, a logging sink). Never per-token, never per-hop.
Three primitives fall out of that layering:
- Wire-native streaming. Token IDs flow as length-prefixed binary frames over plain HTTP. Compression layers on top of the same wire. Receipts: a short chat reply ships in 226 B (vs 15.2 KB JSON-SSE = ~67×); a 2 K-token agent stream in 354 B (1,404×) on sglang with the full stack. TTFB is within 1 ms of the JSON path on the same server. Cross-stack matrix covers three engines and six client languages.
- Tool-call dispatch without detokenization. `ToolWatcher` matches reserved control IDs in the raw token stream — one 32-bit compare per token. Microbench: 0.61 ms vs 60.4 ms on a 1 M-token stream (~100× faster than detokenize+regex). The MetaMCP gateway is the canonical place this primitive lives in production; the same hook works in any inference proxy, agent runtime, or middleware.
- Cross-vocab agent handoff. `Translator` pipes a stream from V_A to V_B via one in-process detokenize/retokenize step — UTF-8 never crosses the wire. Llama-3 → Qwen-2 at 2 K tokens: the bridge produces target-vocab IDs ~30% sooner (10.9 ms → 7.7 ms bridge CPU) on 15.1× fewer wire bytes. Both paths emit byte-identical Qwen-2 output; the bench asserts strict equality.
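The compare-per-token idea behind `ToolWatcher` is small enough to sketch. A minimal version in Python — the control IDs below are invented for illustration; a real deployment learns them from the vocab handshake:

```python
# Hypothetical reserved control IDs (illustrative only — real IDs come
# from the tokenizer's vocab map exchanged at handshake).
TOOL_CALL_START = 128010
TOOL_CALL_END = 128011

def watch(token_ids, control_ids):
    # One integer membership test per token; no detokenize, no regex.
    return [(i, tid) for i, tid in enumerate(token_ids) if tid in control_ids]

stream = [42, 7, TOOL_CALL_START, 99, 5, TOOL_CALL_END, 3]
print(watch(stream, {TOOL_CALL_START, TOOL_CALL_END}))
# → [(2, 128010), (5, 128011)]
```

Because the check is a set lookup on a `uint32`, it composes with any transport that delivers raw IDs; the text-touching dispatch runs only on the span between the two hits.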
Wire format in five sentences
Each Codec frame is 4-byte big-endian length + msgpack or protobuf body. The body carries a packed array of uint32 token IDs, a done boolean, and an optional finish_reason. A vocab handshake (a sha256-addressed JSON map — see codec-maps) tells both ends which tokenizer the IDs belong to. Frames stream over plain HTTP responses, with Accept-Encoding: gzip for streaming-safe compression. That’s the whole spec — nothing else.
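The framing layer is simple enough to sketch directly. Below is a minimal length-prefix writer and reader in Python; the body is treated as opaque bytes here — a real client would msgpack- or protobuf-decode it into the token-ID array, `done` flag, and `finish_reason`:

```python
import struct
from io import BytesIO

def write_frame(stream, body: bytes) -> None:
    # 4-byte big-endian length prefix, then the msgpack/protobuf body.
    stream.write(struct.pack(">I", len(body)) + body)

def read_frames(stream):
    # Yield each frame body until the stream is exhausted.
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack(">I", header)
        yield stream.read(length)

buf = BytesIO()
for body in (b"\x01\x02", b"hello"):
    write_frame(buf, body)
buf.seek(0)
print(list(read_frames(buf)))  # → [b'\x01\x02', b'hello']
```

The same loop works unchanged over an HTTP response body, which is why compression can layer on top without touching the framing.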
Where Codec earns its keep
Codec is opt-in per request (stream_format: "msgpack" | "protobuf", default "json"), so adding it never disturbs existing JSON-SSE traffic on the same endpoint. Pick the format that fits the call.
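Opting in is a single extra field on the request body. A sketch of what that looks like — the model name is illustrative, and fields other than `stream_format` follow the usual OpenAI-compatible shape:

```python
# Hypothetical request body; only stream_format is Codec-specific.
request = {
    "model": "llama-3-8b",                 # illustrative model name
    "messages": [{"role": "user", "content": "hi"}],
    "stream": True,
    "stream_format": "msgpack",            # opt-in; omit or "json" for JSON-SSE
}
```

Clients that never set the field keep getting JSON-SSE, which is what makes rollout on a shared endpoint safe.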
- Inference gateways. Token IDs at the wire and at the dispatch layer. ToolWatcher signals fire on raw IDs in real time; the gateway’s text-touching code (JSON-RPC dispatch to MCP servers, human-display sinks) runs once at the seam. The MetaMCP gateway ships this pattern as a docker image; the same primitive is reusable in any proxy, agent runtime, or middleware.
- Heterogeneous model meshes. `Translator` carries one model's stream into another's vocabulary without UTF-8 ever crossing the wire. Measured Llama-3 → Qwen-2 handoff: ~30% less bridge CPU and 15.1× smaller wire, byte-identical Qwen-2 output. Source: `packages/bench/results/2026-05-08T01-15-02Z/translator/`.
- Agent-to-agent traffic. No human is reading these tokens. Vocab fixed at handshake, dict pre-shared, msgpack frames collapse to a control byte plus a delta. The headline 1,404× at 2 K tokens lives here.
- Streaming long outputs at scale. Wire bytes ≈ content-length / 4 uncompressed; with dict-zstd the marginal cost per extra token is two compressed bytes. The win compounds with payload size.
- Human-facing chat UIs. ~67× smaller wire on a short reply, ~1,400× on long ones, TTFB within 1 ms of JSON-SSE. The client decodes once into a string at the edge before render; mobile, edge, and chat-platform-scale traffic all benefit. Bandwidth drops without users seeing anything except faster paint on flaky networks.
- Observable inference. Routing, sampling decisions, anomaly detection, SLO checks — everything you’d want a service mesh to do — reduces to integer comparisons on token streams. No log-scraping a JSON envelope; no detokenize-every-chunk pipeline.
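The mesh handoff above reduces to one in-process detokenize/retokenize at the bridge. A toy sketch with dict-based stand-in tokenizers — the vocab contents are invented, and real `Translator` deployments operate on actual tokenizers, but the shape of the data flow is the same:

```python
# Invented stand-in vocabularies for illustration only.
VOCAB_A = {1: "Hello", 2: " world"}        # source model's ID → text pieces
VOCAB_B = {"Hello world": 7}               # target model's text → ID

def translate(ids_a, detok_a, retok_b):
    text = detok_a(ids_a)   # UTF-8 exists only inside this process
    return retok_b(text)    # target-vocab IDs go back on the wire

ids_b = translate(
    [1, 2],
    lambda ids: "".join(VOCAB_A[i] for i in ids),
    lambda s: [VOCAB_B[s]],
)
print(ids_b)  # → [7]
```

The point of the sketch: the wire on both sides carries only integers, and the only text ever materialized lives inside the bridge process.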
The one constraint: Codec is a wire-and-dispatch primitive, not a JSON-to-Codec transformation gateway. Both client and server need to speak it. If you’re calling a third-party JSON-only API you don’t control, that’s outside Codec’s scope — we can’t compress bytes a service refuses to emit. Stand up your own and you control both ends.
Stand up a Codec-speaking server in 30 seconds
Three pre-built Docker images, each `docker run`-ready and OpenAI-compatible. Pick the engine that fits your model + GPU stack:
- `codec-sglang` — full Codec stack (msgpack/protobuf, gzip + brotli + dict-zstd). The 1,404× headline lane.
- `codec-vllm` — Codec PR over upstream vllm with dicts pre-baked. 126× today via gzip; ~1,400× once the lifespan dict-loader hook lands on the wdunn001/vllm fork.
- `codec-llamacpp` — llama.cpp built from the fork with the Codec PR + streaming gzip middleware. 33× at zero protocol cost; ideal for CPU/edge boxes.
If you’d rather build from source against vanilla upstream, see sglang — vanilla setup for the DIY path. The wire is bit-identical between the bundled and DIY paths.
Source-available, BSL 1.1
The protocol and the six reference implementations are published under BSL 1.1 by Quasarke LLC. Free for non-production use and for production use under US $5M annual revenue. Each release auto-converts to Apache-2.0 four years after publication. For commercial licensing, licensing@quasarke.com.
Patent posture
Quasarke is pursuing patent protection on certain Codec mechanisms. The wire format, handshake, and content-addressed map distribution described in the spec are intended to be made available on royalty-free or FRAND terms to implementers of the Codec specification when patents issue. Adjacent improvements (ToolWatcher, Translator, the dictionary system, Codec-Zstd-Dict negotiation) may be commercially licensed separately — a Codec-compliant implementation does not require those modules. Full text in PATENTS.md.
Next: pick a runtime in the sidebar, or jump to the quickstart.