codec-llamacpp (Docker)

Pre-built llama.cpp server with the Codec patches applied and a control plane bolted on, in one GPU container. OpenAI-compatible. Smallest of the three.

codec-llamacpp is the easy way to stand up a Codec-speaking inference server on top of llama.cpp. It’s a pre-built Docker image bundling:

  • llama-server — statically-linked CUDA binary built from the Codec fork (ggml-org/llama.cpp#22757 for token-native binary transport on the OpenAI-compatible server, plus the stacked feat/codec-compression follow-ups: server-side ToolWatcher, streaming gzip, zstd-dict-header docs).
  • codec-supervisor — the same FastAPI admin sidecar as codec-sglang, handling model uploads, Hugging Face pulls, hot-swaps, and reverse-proxying the llama-server backend.
  • Static linking (GGML_BACKEND_DL=OFF, BUILD_SHARED_LIBS=OFF) — the CUDA backend is compiled into the binary, so there are no .so plugins to load at runtime and no LD_LIBRARY_PATH configuration (see the build sketch below).
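
For reference, those flags correspond roughly to a CMake configuration like the following. This is a sketch, not the image's actual Dockerfile; GGML_CUDA=ON and CMAKE_CUDA_ARCHITECTURES=86 are assumptions inferred from the sm_86 target noted in the quick start.

# Sketch of the static-build configure step (CUDA flag and arch value assumed)
cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_BACKEND_DL=OFF \
  -DBUILD_SHARED_LIBS=OFF \
  -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --target llama-server -j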

This image is ~3.6 GB — an order of magnitude smaller than codec-sglang or codec-vllm because llama.cpp doesn’t ship a heavy ML Python stack.

Quick start

Default boot downloads Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M (~400 MB) and serves it.

docker run -d --gpus all \
  -p 8080:8080 \
  -v codec-models:/models \
  -v llamacpp-cache:/root/.cache/llama.cpp \
  --shm-size 8g \
  wdunn001/codec-llamacpp:latest
# OpenAI-compatible (JSON-SSE)
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"x","prompt":"Hello","max_tokens":20}'

# Codec wire format - msgpack frames of token IDs
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"x","prompt":"Hello","max_tokens":20,"stream":true,"stream_format":"msgpack"}'

llama-server ignores the model field for routing (single-model-per-process), so "x" is fine.

GPU prereq: NVIDIA Container Toolkit + --gpus all. The image is built for compute capability sm_86 (RTX 3090); use --build-arg CUDA_DOCKER_ARCH=<arch> if you rebuild for a different GPU.
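
A rebuild for another architecture might look like the following; the build context path and image tag are illustrative, and the arch value follows llama.cpp's CUDA_DOCKER_ARCH convention (here 80 for an A100-class GPU):

# Illustrative rebuild for sm_80; checkout path and tag are assumptions
docker build \
  --build-arg CUDA_DOCKER_ARCH=80 \
  -t codec-llamacpp:sm80 \
  .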

Model spec: HF id or local file

CODEC_INITIAL_MODEL accepts two forms; the supervisor’s LlamaCppBackend picks --model vs -hf automatically:

Spec                             What llama-server gets                 Use case
Owner/Repo-GGUF:filename-glob    -hf Owner/Repo-GGUF:filename-glob      Pull a quantized GGUF from Hugging Face directly. The default.
/absolute/path/to/file.gguf      --model /absolute/path/to/file.gguf    Bind-mounted local model file.

Examples

HF GGUF id:

docker run --gpus all -p 8080:8080 \
  -e CODEC_INITIAL_MODEL='Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M' \
  -v llamacpp-cache:/root/.cache/llama.cpp \
  wdunn001/codec-llamacpp:latest

Local .gguf file:

docker run --gpus all -p 8080:8080 \
  -e CODEC_INITIAL_MODEL=/models/my-model.gguf \
  -v /path/to/my-model.gguf:/models/my-model.gguf:ro \
  wdunn001/codec-llamacpp:latest

Hot-swap via admin API:

curl -X POST http://localhost:8080/admin/load \
  -H "Content-Type: application/json" \
  -d '{"name":"Qwen/Qwen2.5-7B-Instruct-GGUF:Q4_K_M","allow_remote":true}'

Configuration

Variable              Default                                    Effect
CODEC_INITIAL_MODEL   Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M     Model spec (HF id with :filename-glob, or absolute .gguf path).
CODEC_BACKEND_ARGS    --ctx-size 4096 --gpu-layers 999           Verbatim arguments to llama-server. --gpu-layers 999 offloads all layers to the GPU.
CODEC_PORT            8080                                       Supervisor port.
HF_TOKEN              (unset)                                    Required only for gated GGUF repos.
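
For example, to raise the context window and run the supervisor on a different port (values here are purely illustrative):

docker run -d --gpus all -p 9000:9000 \
  -e CODEC_PORT=9000 \
  -e CODEC_BACKEND_ARGS='--ctx-size 8192 --gpu-layers 999' \
  -v llamacpp-cache:/root/.cache/llama.cpp \
  wdunn001/codec-llamacpp:latest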

Admin API

Identical to codec-sglang.

See also

  • codec-sglang — same image story, sglang backend (best throughput on supported models).
  • codec-vllm — same image story, vLLM backend.