Browser safety — @codecai/web-safety

Optional client-side safety layer. Catches secrets, PII, jailbreak templates, dangerous commands, and host-blocked patterns before the prompt hits the wire — keeps doomed inputs out of the inference budget. New in v0.4.

@codecai/web-safety ships with Codec v0.4 as a sibling of @codecai/web: install the two together when you want to keep doomed prompts from consuming bandwidth, server inference budget, or classifier-tier compute.

The package is framework-free: host apps (leet, codec-website, future clients) render their own UI on top of the SafetyGate state machine.

Install

npm install @codecai/web-safety @codecai/web

Optional peer dependencies, needed only if you opt into the corresponding classifier:

npm install @huggingface/transformers   # for the default Prompt Guard 86M classifier
npm install @mlc-ai/web-llm             # for the opt-in Llama Guard 3 1B (WebGPU) tier

Two layers

Layer 1 — Prefilter (always-on, no network, no model load)

Catches obviously doomed inputs via regex plus Shannon-entropy detection. Pure JavaScript; runs in browsers, Node, and edge runtimes. Five categories:

  • secrets: AWS / GitHub / OpenAI / Anthropic / Google / Slack / Stripe keys, SSH key headers, JWTs. Examples: AKIA…, ghp_…, sk-ant-…
  • pii: email addresses, US phone numbers, SSNs, Luhn-valid credit-card candidates.
  • high_entropy: base64/hex runs ≥ 24 chars with Shannon entropy ≥ 4.0 bits. Example: unknown-vendor API keys.
  • dangerous_action: jailbreak templates, malware/exploit authoring asks, destructive command literals. Examples: ignore previous instructions, write working ransomware, rm -rf /, dd if=/dev/zero of=/dev/sda.
  • blocked_action: host-supplied patterns, empty by default. Examples: internal hostnames, --privileged, "no DROP TABLE prod_*".

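The high_entropy heuristic can be sketched roughly as follows. This is an illustrative sketch, not the package's implementation: the function names (shannonBits, looksHighEntropy) and the base64/hex token test are assumptions; only the thresholds (≥ 24 chars, ≥ 4.0 bits) come from the rule description above.

```typescript
// Illustrative sketch of a high-entropy token detector (not the package API).
// Flags base64/hex-looking runs of >= 24 chars whose Shannon entropy is
// >= 4.0 bits per character — the profile of a random API key.
function shannonBits(s: string): number {
  const counts = new Map<string, number>();
  for (const ch of s) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let bits = 0;
  for (const n of counts.values()) {
    const p = n / s.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

function looksHighEntropy(token: string): boolean {
  // Rough "base64/hex-shaped" test; the real prefilter's tokenisation may differ.
  const base64ish = /^[A-Za-z0-9+/=_-]+$/.test(token);
  return base64ish && token.length >= 24 && shannonBits(token) >= 4.0;
}
```
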
import { SafetyGate } from "@codecai/web-safety";

const gate = new SafetyGate({
  // Optional: telemetry sink that sees categories + rule IDs only,
  // never the matched values.
  audit: (e) => {
    if (e.kind === "blocked") console.info(`prefilter: ${e.categories}`);
  },
});

const decision = gate.check(promptText);
if (decision.kind === "blocked") {
  // Host renders a redact / send-anyway / cancel dialog using
  // decision.matches; user picks; gate.apply() returns the final
  // text or a cancel signal.
  const action = await showHostModal(decision);
  const result = gate.apply(decision, action);
  if (result.kind === "cancel") return;
  promptText = result.text;        // possibly redacted with [REDACTED:<rule>]
}
// ... tokenize and send via @codecai/web as usual

Layer 2 — Browser-side classifier registry (opt-in)

When regex doesn’t catch the nuance, fall through to a semantic classifier. The registry mirrors the codec-supervisor server registry exactly so policy decisions stay symmetric across hosts.

Two shipped classifiers:

  • Prompt Guard 86M (default tier) — Transformers.js, ≈80 MB ONNX, CPU/WASM. Best for always-on inbound-prompt classification.
  • Llama Guard 3 1B (opt-in tier) — @mlc-ai/web-llm, ≈1 GB WebGPU quant. Same 14-category Llama Guard taxonomy as the server-side classifier, so policy decisions stay symmetric across mesh peers.

import { registerPromptGuard86m } from "@codecai/web-safety/classifiers/prompt-guard-86m";
import { registerLlamaGuard31B } from "@codecai/web-safety/classifiers/llama-guard-3-1b";
import { resolveClassifier } from "@codecai/web-safety";

registerPromptGuard86m();
registerLlamaGuard31B();          // opt-in

const { classifier, downgraded } = await resolveClassifier("Llama-Guard-3-1B");
// downgraded === true → registry fell back to Prompt Guard because
// the device couldn't load Llama Guard (no WebGPU, insufficient memory).
// Surface a "downgraded enforcement" badge in your UI.

const result = await classifier.score({ form: "text", payload: userMessage });
if (result.scores.jailbreak >= 0.5) {
  // host policy decides: stop, redact, regenerate, flag
}

Host-supplied blocked patterns

Deployments often need patterns the generic rules can’t anticipate — internal hostnames, “no rm -rf /prod”, regulator-mandated refusals. Inject them via PrefilterOptions.blockedActionPatterns:

import { scanText } from "@codecai/web-safety";

const matches = scanText(promptText, {
  blockedActionPatterns: [
    { rule: "no_prod_db",       pattern: /\b(?:db|database)-prod-\w+\b/g },
    { rule: "no_privileged_run", pattern: /docker\s+run\s+[^\n]*--privileged/g },
    { rule: "no_drop_table_prod", pattern: /\bDROP\s+TABLE\s+prod_\w+/gi },
  ],
});

These patterns are decided by the host application and don’t ship in the npm package. They never cross the wire either — the prefilter runs locally before any encode + send.
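For illustration only (not the package API), host patterns can be applied locally with the same [REDACTED:<rule>] placeholder convention the gate uses. A minimal, self-contained sketch:

```typescript
// Hypothetical helper: replace every host-blocked pattern match with the
// [REDACTED:<rule>] placeholder the gate emits. Names are illustrative,
// not the published @codecai/web-safety types.
interface BlockedPattern {
  rule: string;
  pattern: RegExp; // must carry the /g flag to redact all occurrences
}

function redactBlocked(text: string, patterns: BlockedPattern[]): string {
  return patterns.reduce(
    (out, { rule, pattern }) => out.replace(pattern, `[REDACTED:${rule}]`),
    text,
  );
}
```
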

Public-by-design vs. server-side private

The client-side prefilter rules are public by design: they ship in the npm package source, visible via npm view @codecai/web-safety or by reading src/prefilter.ts in the Codec repo. The vendor-anchored secret patterns are already public (AWS publishes the AKIA prefix; GitHub publishes the ghp_ prefix), the jailbreak templates are well documented in the adversarial-prompt literature, and the destructive-command literals are common Unix knowledge.

This is the opposite boundary from the server-side policy disclosure contract introduced in Codec v0.4:

  • Server-side, private: operator-internal banned-token-ID lists, regex patterns, classifier thresholds, multi-token patterns. Live in codec-supervisor/policies_dir/. Never serialised to the wire.
  • Server-side, public: the sanitized descriptor at .well-known/codec/policies/<id>.json — categories + actions + classifier family + summary counts. Listed publicly so clients can verify what shape of enforcement applies, without leaking what’s enforced.
  • Client-side, public (this package): regex rules that run in the browser before transmission. The output of the prefilter (gate-redacted text, or “user cancelled”) reaches the wire, never the rule list.
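
As an illustration of the shape described above, a sanitized descriptor at .well-known/codec/policies/<id>.json might look like this. Every field name here is hypothetical (chosen to mirror the listed contents: categories, actions, classifier family, summary counts), not the actual Codec schema:

```json
{
  "id": "default",
  "classifierFamily": "llama-guard-3",
  "categories": ["secrets", "pii", "dangerous_action"],
  "actions": ["block", "redact", "flag"],
  "summary": { "regexRules": 42, "bannedTokenPatterns": 17 }
}
```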

The two halves are complementary, not duplicative. A host that runs both gets defense in depth: cheap client-side regex catches the obvious cases before transmission, and server-side enforcement catches the subtle cases the model would otherwise have complied with.

See also