Cortex: Unknown Behavior

01 · What it does

LLM-as-judge, deployable.

Most teams have already discovered that having a second LLM evaluate the first improves output quality. The problem is that the evaluator is usually a one-off prompt buried in a notebook, with no rule library, no retry policy, no audit log, and no way to ship it as part of the application's runtime.

Cortex is the productionized version of that pattern. The worker and verifier are first-class roles, the rules are declarative, retries are bounded, and every challenge is logged so a reviewer can read why a response was blocked without re-running the model.

Model-agnostic on both sides. Anthropic, OpenAI, in-house models, or a mix. The verifier does not have to be a stronger model than the worker; it just has to read the ruleset and report what fired.
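One way to picture "model-agnostic on both sides" is a single small interface that every backend satisfies. The sketch below is illustrative only: `TextModel`, `Roles`, and the two stand-in backends are hypothetical names, not Cortex's actual API, and no real provider SDK is called.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """Hypothetical minimal interface: anything that maps a prompt to text."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for an in-house model; a real backend would call its own API."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class CannedModel:
    """Stand-in for a hosted model; always returns a fixed reply."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


@dataclass
class Roles:
    """Worker and verifier are separate slots, so the backends can be mixed."""

    worker: TextModel
    verifier: TextModel


roles = Roles(worker=EchoModel(), verifier=CannedModel("no rules fired"))
draft = roles.worker.complete("Summarize the release notes.")
verdict = roles.verifier.complete(f"Check this draft against the ruleset:\n{draft}")
```

Because both roles depend only on the interface, swapping either side for a different provider would not touch the other.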

02 · How it works

Three stages.

01

Worker drafts

A primary LLM produces a candidate response to the user request. This is the model the application thinks is in charge. It runs the same way it would without Cortex.

02

Verifier challenges

A second LLM evaluates the candidate against a user-defined ruleset (factual, policy, format, tone). Each rule that fires returns a structured failure with an explanation, not a score.

03

Reconcile, retry, or block

If the verifier flags a failure, the worker is asked to revise with the failure in context. After a configurable retry budget, persistent failures are blocked rather than silently passed downstream.
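The three stages can be sketched as a single bounded loop. This is a minimal illustration under stated assumptions, not the Cortex implementation: `Failure`, `Outcome`, `run_pipeline`, and the stub worker and verifier are hypothetical names, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Failure:
    rule: str          # name of the rule that fired
    explanation: str   # why it fired, phrased for the worker (not a score)


@dataclass
class Outcome:
    status: str                # "passed" or "blocked"
    response: Optional[str]    # None when the response was blocked
    attempts: int              # number of drafts the worker produced
    failures: List[Failure]    # failures from the final verification


# Assumed signatures: the worker sees the request plus any prior failures.
Worker = Callable[[str, List[Failure]], str]
Verifier = Callable[[str], List[Failure]]


def run_pipeline(request: str, worker: Worker, verifier: Verifier,
                 retry_budget: int = 2) -> Outcome:
    if retry_budget < 0:
        raise ValueError("retry_budget must be >= 0")
    max_attempts = retry_budget + 1  # one initial draft plus the retries
    failures: List[Failure] = []
    for attempt in range(1, max_attempts + 1):
        candidate = worker(request, failures)   # stage 1: draft or revise
        failures = verifier(candidate)          # stage 2: challenge
        if not failures:
            return Outcome("passed", candidate, attempt, [])
    # Stage 3: budget exhausted, so block instead of passing downstream.
    return Outcome("blocked", None, max_attempts, failures)


def stub_worker(request: str, failures: List[Failure]) -> str:
    # Stand-in for a model call: returns JSON only after seeing a failure.
    return '{"answer": 42}' if failures else "forty-two"


def stub_verifier(candidate: str) -> List[Failure]:
    # Stand-in for a verifier model applying a single format rule.
    if candidate.startswith("{"):
        return []
    return [Failure("format.json", "Response must be a JSON object.")]


outcome = run_pipeline("What is six times seven?", stub_worker, stub_verifier)
```

With these stubs the first draft fails the format rule and the revision passes, so the outcome is a pass on the second attempt. A worker that never corrects itself ends in a blocked outcome once the budget is spent.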

Live

Run the demo.

Cortex dashboard

Hosted on Streamlit. Inspect worker and verifier traffic in real time: rule fires, blocks, retries, the three-strike shutdown, and the audit log that drives the view.


03 · Behind the demo

Methodology.

Rules are declarative. A rule names what it checks, what counts as a violation, and how the verifier should phrase the failure back to the worker. Rules can be factual (a citation must match its source), policy (no medical advice outside scope), format (must return JSON of shape X), or tone (no second-person address on a partner-facing surface).
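A declarative rule of this kind could be represented as plain data. The sketch below assumes a hypothetical `Rule` record with one field per element named above; the field names, rule names, and `render_ruleset` helper are illustrative and do not come from Cortex itself.

```python
from dataclasses import dataclass
from typing import List

KINDS = {"factual", "policy", "format", "tone"}


@dataclass(frozen=True)
class Rule:
    name: str              # hypothetical identifier for the rule
    kind: str              # one of KINDS
    checks: str            # what the rule checks
    violation: str         # what counts as a violation
    failure_message: str   # how the verifier phrases the failure to the worker

    def __post_init__(self) -> None:
        if self.kind not in KINDS:
            raise ValueError(f"unknown rule kind: {self.kind!r}")


RULESET: List[Rule] = [
    Rule(
        name="format.json_shape",
        kind="format",
        checks="The response is a JSON object of the agreed shape.",
        violation="Non-JSON output, or a missing required key.",
        failure_message="Return only a JSON object with the required keys.",
    ),
    Rule(
        name="policy.medical_scope",
        kind="policy",
        checks="The response stays within the product's advice scope.",
        violation="Medical advice outside that scope.",
        failure_message="Remove the medical advice and point to a professional.",
    ),
]


def render_ruleset(rules: List[Rule]) -> str:
    """Render the rules as text that could be placed in a verifier prompt."""
    blocks = []
    for r in rules:
        blocks.append(
            f"[{r.kind}] {r.name}\n"
            f"  checks: {r.checks}\n"
            f"  violation: {r.violation}\n"
            f"  on failure, say: {r.failure_message}"
        )
    return "\n\n".join(blocks)


rendered = render_ruleset(RULESET)
```

Keeping rules as data rather than prompt text means the same ruleset can be reviewed, versioned, and handed to whichever model plays the verifier.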

Retries are bounded by a configurable budget. The worker receives the failure as part of its context for the next attempt, not a generic "try again." Failures that persist after the budget is exhausted are blocked. A blocked response is an audit event, not a silent fall-through.
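To make the difference from a generic "try again" concrete, here is a minimal sketch of a retry prompt that carries the failure, plus the record a block might emit. The function names, dictionary keys, and prompt wording are assumptions for illustration, not Cortex's actual format.

```python
from typing import Dict, List


def build_revision_prompt(request: str, previous: str,
                          failures: List[Dict[str, str]]) -> str:
    """Build the worker's next-attempt prompt with the fired rules in context."""
    fired = "\n".join(f"- {f['rule']}: {f['explanation']}" for f in failures)
    return (
        f"Original request:\n{request}\n\n"
        f"Your previous response:\n{previous}\n\n"
        f"The verifier flagged these rules:\n{fired}\n\n"
        "Revise the response so that none of these rules fire."
    )


def blocked_event(request_id: str, attempts: int,
                  failures: List[Dict[str, str]]) -> Dict[str, object]:
    """Describe a block as an explicit audit record (hypothetical schema)."""
    return {
        "event": "blocked",
        "request_id": request_id,
        "attempts": attempts,
        "rules": [f["rule"] for f in failures],
    }


failures = [{"rule": "format.json",
             "explanation": "Response must be a JSON object."}]
prompt = build_revision_prompt("What is six times seven?", "forty-two", failures)
event = blocked_event("req-001", attempts=3, failures=failures)
```

The revision prompt names the rule and repeats its explanation, so the worker knows what to change. The block record lists which rules were still firing when the budget ran out.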

The dashboard reads from the audit log, so the same data that powers the runtime is what a reviewer inspects later. No separate observability path; the log is the observability surface.
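A dashboard view built directly on the audit log could be as simple as aggregating its records. The sketch below assumes a JSON Lines log with hypothetical `event`, `rule`, `request_id`, and `attempt` fields; the real log schema is not described here, and the sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable

# Made-up sample records in an assumed JSON Lines format.
SAMPLE_LOG = """\
{"request_id": "r1", "event": "rule_fired", "rule": "format.json", "attempt": 1}
{"request_id": "r1", "event": "retry", "attempt": 2}
{"request_id": "r1", "event": "passed", "attempt": 2}
{"request_id": "r2", "event": "rule_fired", "rule": "policy.medical_scope", "attempt": 1}
{"request_id": "r2", "event": "rule_fired", "rule": "policy.medical_scope", "attempt": 2}
{"request_id": "r2", "event": "blocked", "attempt": 2}
"""


def summarize(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count event types and rule fires from audit log lines."""
    events: Counter = Counter()
    rules: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        events[record["event"]] += 1
        if record["event"] == "rule_fired":
            rules[record["rule"]] += 1
    return {"events": events, "rules": rules}


summary = summarize(SAMPLE_LOG.splitlines())
```

Because the summary is computed from the same records the runtime writes, a reviewer sees exactly what the pipeline saw, with no second data path to keep in sync.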