sglang (vanilla)

Run upstream sglang with the Codec patches applied yourself. Use this when you need a custom build; otherwise prefer the pre-built codec-sglang Docker image.

If you just want a working server, use the pre-built codec-sglang Docker image — one container, GPU-ready, supervisor and patches already applied. This page is for the DIY path: vanilla upstream sglang plus the two Codec PRs.

sglang is the easiest path to a Codec-speaking server. As of PR #24483 (merged) the standard sglang /v1/completions endpoint accepts stream_format: "msgpack" | "protobuf" against any model it can serve.

Run a Codec-capable sglang server

You don’t need a fork. Any nightly with PR #24483 and (for tool-call detection) PR #24557 merged works:

docker run --gpus all -p 30000:30000 --ipc host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 30000

That’s it — the server now speaks both JSON-SSE and Codec on the same endpoint. Clients pick which one they want via the request body.

Server prerequisites in version terms: sglang nightly tag nightly-dev-cu12-20260506-22cf7d2b or later. Check the release notes for the exact build that lands on your platform.

Request shape

The client picks the wire format on a per-request basis by adding stream_format:

POST /v1/completions HTTP/1.1
Host: localhost:30000
Content-Type: application/json
Accept-Encoding: gzip

{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "prompt": "Explain entropy.",
  "stream_format": "msgpack",
  "max_tokens": 256
}

Server responds with Content-Type: application/codec+msgpack and a sequence of length-prefixed msgpack frames. Every other knob (temperature, top-p, max_tokens, stop sequences) works exactly as before.

If the client omits stream_format (or sets stream: true), sglang behaves exactly as upstream — you get JSON-SSE. So one server can simultaneously serve a JSON SaaS chat UI and a Codec-native agent fleet.

Server-side ToolWatcher (PR #24557)

PR #24557 adds an in-server tool-call detector that runs on the token-ID stream before the stream leaves the server. When the model emits a tool-call region, sglang surfaces it as a discrete event in the Codec stream — with a reserved control-ID frame the client picks up via ToolWatcher.feed().

This is the source of the ~100× tool-call detection speedup in RESULTS.md §3. The server skips its usual “detokenize and regex-match against tool delimiters” step entirely; the client gets pre-segmented frames and can dispatch with no further parsing.

from codecai import Detokenizer, ToolWatcher, decode_msgpack_stream, load_map

watcher = ToolWatcher(map, start="<tool_call>", end="</tool_call>")

async for frame in decode_msgpack_stream(resp.aiter_raw()):
    for ev in watcher.feed(frame.ids):
        if ev.kind == "captured":
            await dispatch(detok.render(ev.ids))

The same client-side code works whether or not the server has PR #24557 merged — if the server doesn’t pre-segment, the watcher segments client-side. PR #24557 just moves the work earlier in the pipeline.

End-to-end agentic example numbers

Live runs from RESULTS.md on a real sglang server with two-turn tool dispatch:

Path	Wire (2 turns)	TTFB	Total
JSON-SSE + client regex	61.9 KB	52 ms	2,426 ms
Codec + ToolWatcher	3.4 KB	16 ms	1,954 ms

18× less wire, 20% faster end-to-end on a real-world agent loop with a real-world tool (SearXNG search).

Configuration knobs

stream_format is the only Codec-specific request knob. The compression negotiation is the standard HTTP Accept-Encoding. gzip, identity is the safe default for any streaming workload — gzip preserves TTFT and gives 30–40× wire savings on Codec frames.

zstd, with a dict

zstd is supported but only when the server has a pre-trained dictionary loaded for the request’s stream_format — see Protocol » Compression for the rule and the rationale. Without a dict, sglang’s compression module falls through to gzip even if the client advertised zstd — no-dict zstd’s bytes match gzip but the buffered middleware adds a 334× TTFB cliff at 2K tokens.

The reference dicts in the main repo’s dictionaries/ are trained for Qwen-2.5 / msgpack and Qwen-2.5 / protobuf. Load them at server startup:

from sglang.srt.entrypoints.codec_compression import set_zstd_dict

with open("qwen2.5-msgpack-v1.dict", "rb") as f:
    set_zstd_dict("msgpack", f.read())
with open("qwen2.5-protobuf-v1.dict", "rb") as f:
    set_zstd_dict("protobuf", f.read())

After this, requests with Accept-Encoding: zstd, gzip come back with:

Content-Encoding: zstd
Codec-Zstd-Dict:  sha256:<hex of the dict bytes>
Vary:             Accept-Encoding

Clients use the Codec-Zstd-Dict header to pick the right local dict before decompressing. With dict-zstd the wire is 16–38% smaller than gzip (RESULTS.md §1g) at +0.13 ms TTFB — sub-millisecond, dwarfed by network. For deployments that ship a dict alongside the model, zstd is the right pick for both interactive and agent traffic.

Deployments that don’t load a dict keep getting gzip on every request — backward-compatible by default. Empty registry → zstd never picked.

vLLM

vLLM support is in flight in PR #41765, with the same surface area: stream_format request field, server-side ToolWatcher, dict-gated zstd, and the Codec-Zstd-Dict response header. Watch that PR for status.