codec-comfyui (Docker)

ComfyUI image-generation server with the Codec v0.3 latent transport patch. Streams VAE latents on the wire instead of decoded pixels — 48× smaller, decoder runs at the leaf.

codec-comfyui is a pre-built Docker image of ComfyUI with the Codec v0.3 latent transport patch applied. Stand it up like any image-gen server, point any Codec-aware client at it, and image generations ship as VAE latents instead of decoded pixels — same physics as text-token streams in codec-sglang / codec-vllm, but for diffusion.

Why latents and not pixels: a 512×512 RGB frame at fp16 is ~1.5 MB; the SD-1 latent that produced it is 4×64×64 fp16 = 32 KB, a 48× reduction. With per-channel int8 quantization on top, the wire weight collapses further. The client does vae_decode locally and never re-encodes, so the round-trip pixel quality is bounded by the published per-pipeline LPIPS thresholds (see spec/PIPELINES.md).
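
The same arithmetic in code, as a sanity check (the four fp32 per-channel scales on the int8 line are an assumption about the scale layout, not a spec quote):

// Wire-size arithmetic for one 512×512 SD-1 generation.
const pixelBytes  = 512 * 512 * 3 * 2;       // RGB fp16 frame: 1,572,864 B (~1.5 MB)
const latentBytes = 4 * 64 * 64 * 2;         // 4×64×64 fp16 latent: 32,768 B (32 KB)
console.log(pixelBytes / latentBytes);       // 48, the headline reduction
const int8Bytes   = 4 * 64 * 64 + 4 * 4;     // int8 pipeline: 1 B per value + four fp32 scales
console.log(Math.round(pixelBytes / int8Bytes)); // ~96 with per-channel int8 on top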

This image is built from the wdunn001/ComfyUI fork at branch feat/codec-latent-transport. The fork is the canonical surface — ComfyUI’s plugin/custom-node architecture would let us ship the codec endpoints as a custom node, but the latent-frame emitter and zstd-dict overlay touch enough of the request loop that maintaining a downstream fork is cleaner.

Quick start

Default boot loads stabilityai/sd-vae-ft-mse (SD-1 VAE) and serves it.

docker run -d --gpus all \
  -p 8080:8080 \
  -v codec-models:/models \
  --shm-size 8g \
  wdunn001/codec-comfyui:latest
# Codec wire format — msgpack frames of LatentStreamHeader + LatentFrame
curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Accept: application/x-codec-msgpack" \
  -H "Accept-Encoding: zstd" \
  -d '{
    "model": "sd1.5",
    "prompt": "a wide-angle photograph of a snowy mountain at dusk",
    "stream_format": "msgpack",
    "modality":      "image-latents",
    "latent_space":  "stabilityai/sd-vae-ft-mse",
    "pipeline":      "int8",
    "size": "512x512", "steps": 30, "seed": 42
  }'
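
For typed callers, that body maps onto a small interface. This is a sketch: CodecImageRequest is an illustrative name, not a type exported by @codecai/web, and the seven pipeline literals come from the table in the Pipelines section below.

// Illustrative shape of the generation request; mirrors the curl body above.
interface CodecImageRequest {
  model: string;                      // e.g. "sd1.5"
  prompt: string;
  stream_format: "msgpack";
  modality: "image-latents";
  latent_space: string;               // VAE identifier, e.g. "stabilityai/sd-vae-ft-mse"
  pipeline: "raw" | "int8" | "int4" | "int8-adaptive"
          | "int4-adaptive" | "delta+int8" | "delta+int4";
  size?: string;                      // "WIDTHxHEIGHT", e.g. "512x512"
  steps?: number;
  seed?: number;
}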

The response carries:

  • Content-Encoding: zstd (when a per-pipeline zstd dict is loaded)
  • Codec-Latent-Map: sha256:… — the latent-space-map document hash, so the client can fail fast if it doesn’t have a matching map loaded
  • Codec-Zstd-Dict: sha256:… — the active dict identifier

Body is one LatentStreamHeader followed by one LatentFrame (image) or N LatentFrames (video).
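
A client can fail fast against those headers before parsing a single frame. A sketch follows; loadedMapHashes stands in for however your client tracks the latent-space maps it has loaded:

// Check the map and dict hashes before decoding anything.
const resp = await fetch("http://localhost:8080/v1/images/generations", {
  /* …method, headers, body as in the curl above… */
});
const mapHash  = resp.headers.get("Codec-Latent-Map");  // "sha256:…"
const dictHash = resp.headers.get("Codec-Zstd-Dict");   // "sha256:…" when a dict is active

const loadedMapHashes = new Set<string>([/* hashes of the maps this client bundles */]);
if (mapHash && !loadedMapHashes.has(mapHash)) {
  throw new Error(`no latent-space map loaded for ${mapHash}`);
}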

Pipelines

codec-comfyui advertises the seven Codec v0.3 pipelines documented in spec/PIPELINES.md:

Pipeline         Wire shape                              Reduction vs raw          Use case
raw              Packed tensor in row-major order        —                         Bit-exact baseline
int8             Per-channel symmetric int8              2× over fp16              Default for SD-family images
int4             Per-channel symmetric int4 (packed)     4× over fp16              Aggressive lossy mode
int8-adaptive    int8 with per-keyframe scales           ~2×                       Heterogeneous frames
int4-adaptive    int4 with per-keyframe scales           ~4×                       Same use case, more lossy
delta+int8       int8 residual against prior keyframe    2× + temporal collapse    Video only
delta+int4       int4 residual against prior keyframe    4× + temporal collapse    Video, most aggressive

Adding a pipeline is an additive v0.3+ point release — the registry is normative, not extensible per-deployment.
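
As an illustration of what the int8 wire shape means at the decoder, here is a minimal dequantization sketch. The field names (data, scales) are illustrative, not the actual LatentFrame schema, which LatentStreamDecoder handles for you:

// Per-channel symmetric int8 dequantization: value = q * scale[channel].
// Symmetric means no zero-point; the int8 range is centred on zero.
function dequantInt8(
  data: Int8Array,        // C*H*W quantized values, channel-first
  scales: Float32Array,   // one scale per channel
  channels: number,
): Float32Array {
  const perChannel = data.length / channels;
  const out = new Float32Array(data.length);
  for (let c = 0; c < channels; c++) {
    const s = scales[c];
    for (let i = 0; i < perChannel; i++) {
      out[c * perChannel + i] = data[c * perChannel + i] * s;
    }
  }
  return out;
}

The delta+int8 pipeline applies the same dequantization to a residual and adds it to the previously decoded keyframe; the adaptive variants carry fresh scales at each keyframe.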

Pointing a Codec client at it

Any @codecai/web client (v0.4+) speaks the latent wire shape via LatentStreamDecoder:

import {
  decodeLatentHeaderMsgpack,
  decodeLatentFrameMsgpack,
  LatentStreamDecoder,
} from "@codecai/web";

const resp = await fetch("http://localhost:8080/v1/images/generations", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Accept": "application/x-codec-msgpack",
    "Accept-Encoding": "zstd",
  },
  body: JSON.stringify({ /* …request as above… */ }),
});

// Frames stream length-prefixed; iterate them as Uint8Array chunks.
// splitLengthPrefixed is a sketch defined after this block; decodeMsgpackStream
// in @codecai/web is the real streaming helper.
const body = new Uint8Array(await resp.arrayBuffer());
const [headerBytes, ...frameChunks] = splitLengthPrefixed(body);
const header = decodeLatentHeaderMsgpack(headerBytes);
const decoder = new LatentStreamDecoder(header);

for (const chunk of frameChunks) {
  const frame = decodeLatentFrameMsgpack(chunk);
  const latent = decoder.decodeFrame(frame); // Float32Array, channel-first
  // Hand `latent` to a browser-side VAE (WebGPU / ONNX-Web / etc.)
}
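
The split used above could look like the following. This is a sketch that assumes each msgpack chunk is preceded by a 4-byte little-endian length; the prefix width and endianness are assumptions, and decodeMsgpackStream in @codecai/web implements the real framing:

// Split a fully buffered response body into length-prefixed msgpack chunks.
function splitLengthPrefixed(body: Uint8Array): Uint8Array[] {
  const view = new DataView(body.buffer, body.byteOffset, body.byteLength);
  const chunks: Uint8Array[] = [];
  let offset = 0;
  while (offset + 4 <= body.byteLength) {
    const len = view.getUint32(offset, true); // little-endian u32, an assumption
    offset += 4;
    chunks.push(body.subarray(offset, offset + len));
    offset += len;
  }
  return chunks;
}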

The Python client (codecai) and the polyglot clients (rust / java / dotnet / c) carry the same parser surface — a single tokenizer-map and latent-space-map registry; one wire shape; six languages.

When to use this

  • Use codec-comfyui when you want browser- or edge-side VAE decoding, when you’re streaming frames into a downstream vision model that accepts latents directly, or when bandwidth is the bottleneck.
  • Use upstream ComfyUI when you need the full ComfyUI workflow surface (custom nodes, queue management, the visual graph editor) and pixel output is fine.

The Codec patch is fully backwards-compatible per request — JSON-SSE clients see exactly the upstream behaviour.
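
For example, a request that sends neither the codec Accept header nor stream_format gets the upstream path. This is a sketch assuming the negotiation keys off those two fields from the quick start:

// No codec Accept header, no stream_format: "msgpack", so the upstream path applies.
const upstream = await fetch("http://localhost:8080/v1/images/generations", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ model: "sd1.5", prompt: "a snowy mountain at dusk", size: "512x512" }),
});
// Response is whatever the base ComfyUI version returns: pixels, not LatentFrames.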

See also

  • codec-diffusers — sister image, also a v0.3 latent server. Doubles as the bench/golden perceptual reference.
  • codec-metamcp — gateway in front of latent servers + tool servers.
  • Protocol overview — the wire-format spec that this image’s framing implements.