Your AI gives a different answer every time you ask.
Your production system needs the same answer every time.
That tension is the central engineering problem of agent-directed software, and it is the one most teams are trying to solve at the wrong layer. The instinct is to make the model more consistent — better prompts, more examples, lower temperature, a fine-tune. None of these strategies resolves the underlying tension because the tension is not a defect in the model. The variance is the model. Asking a probabilistic system to behave deterministically by tightening its inputs is asking water to be a brick.
The teams that ship trusted output stopped trying to make the model behave like a deterministic system. They made the system around the model deterministic instead. The model still varies between runs. The infrastructure surrounding it does not. That asymmetry is the entire game.
The variance is not a bug
A large language model is, mechanically, a function that produces a probability distribution over the next token given the context that came before. Sampling from that distribution is what generates each token of output. The same prompt, sampled twice, produces two different outputs because the sampling step is stochastic. Lowering the temperature narrows the distribution; it does not eliminate it. Setting temperature to zero removes one source of variance but leaves several others — the underlying model is still nondeterministic across hardware, across batch sizes, across precision modes, and across versions.
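To make the mechanics concrete, here is a minimal sketch of temperature sampling. The logits and vocabulary are hypothetical, and a real model adds the hardware- and batching-level variance described above on top of the sampling step shown here:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float,
                      rng: np.random.Generator) -> int:
    """Sample a token index from a temperature-scaled distribution."""
    if temperature == 0.0:
        # Greedy decoding: the sampling step becomes deterministic, but
        # hardware, batching, and precision still perturb the logits themselves.
        return int(np.argmax(logits))
    scaled = logits / temperature           # lower temperature narrows the distribution
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])   # hypothetical next-token logits
rng = np.random.default_rng()
print([sample_next_token(logits, 0.7, rng) for _ in range(5)])  # varies run to run
```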
This is not a complaint about today's models. It is a description of what they are. The same property that allows the model to generalize across instructions, recover from imperfect prompts, and produce useful output in domains it has never seen before is the property that makes its output vary. Determinism would require erasing the structure that makes the model useful. The variance is not a bug. The variance is the design.
The implication for production systems is direct. Production systems have requirements that variance violates by definition. A payment processor must produce the same balance from the same ledger every time. An authorization layer must reject the same input every time. An invoice must total the same amount every time. None of these systems can tolerate "the answer is usually right." Usually right at scale becomes wrong predictably, and wrong predictably becomes a regulatory event, a customer escalation, or a security incident. The bar for production has always been determinism, because the cost of nondeterminism in a system humans depend on is that humans cannot depend on it.
So the question is not "how do we make the model deterministic." The model cannot be made deterministic without ceasing to be the kind of model that is useful. The question is "how do we build a system that is deterministic with a probabilistic component inside it." That question has a precise answer, and it is older than the current AI cycle.
Probability inside, determinism outside
Every reliable system humans have built that contains a stochastic process has the same architectural shape. The stochastic process is wrapped in deterministic infrastructure that translates probabilistic behavior into reliable output. Manufacturing tolerances around variable parts. Statistical process control around variable yields. Contractual SLAs around variable network latency. Type systems around variable developer behavior. The pattern is so consistent across domains that it is barely noticed: the world is full of probabilistic substrates, and the engineering that makes them useful is the deterministic layer around them.
The harness is the version of this pattern for agent-directed systems. The model is the probabilistic substrate. The harness is the deterministic infrastructure that surrounds it. Every boundary between the model and the rest of the system is a typed tool call with a defined contract — the model proposes, the contract disposes. Every output the model produces passes through a validation layer that decides whether it is admissible before anything downstream sees it. Every workflow step routes based on deterministic, tool-produced values rather than on the model's free-form opinion. The model can be as creative as it likes inside the box. The box does not negotiate.
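A sketch of what one such typed boundary can look like, with hypothetical names: the model proposes a refund as free-form JSON, and the contract decides whether the proposal is admissible before anything downstream runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int

def parse_refund_proposal(raw: dict) -> RefundRequest:
    """Deterministic gate: same input, same verdict, every time."""
    order_id = raw.get("order_id")
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    amount = raw.get("amount_cents")
    if not isinstance(amount, int) or not 0 < amount <= 50_000:
        raise ValueError("amount_cents must be an integer in (0, 50000]")
    return RefundRequest(order_id=order_id, amount_cents=amount)

# The model's free-form proposal either parses into the typed contract
# or is rejected before any downstream system sees it.
request = parse_refund_proposal({"order_id": "ord_123", "amount_cents": 1250})
```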
This is the inversion that most AI deployments have not made yet. The default configuration treats the model as the system: a frontier model with API access, a few prompt templates, and engineers comparing notes about what worked. In that configuration, the only variance buffer is human attention — a senior engineer reading the output and deciding whether to ship it. Human attention is expensive, scarce, and gets tired. It does not scale linearly with team size, and it does not scale at all overnight. A system whose variance buffer is human attention has, in practice, no variance buffer.
The harnessed configuration treats the model as a component of the system. The system itself is the specifications, the validation gates, the orchestration graph, and the structured execution history that surrounds the model. That system is deterministic by construction. The model contributes generation; the harness contributes correctness. Replace the model and the harness still works. Replace the harness and the model is back to being a demo.
Validation is the bridge
Specifications, validation, and orchestration are the three layers of the harness, and each one matters. But validation is the layer where the bridge between probability and certainty actually gets built, and it is the layer most teams underbuild.
A specification on its own is a wish. It describes what correct looks like, but nothing about it forces the model to comply. The model is free to interpret the specification, fill gaps with assumptions, and produce output that is plausible without being right. Without enforcement, the specification is a document that everyone agrees to and no one runs against. A wish that is not enforced is paperwork.
Validation is what converts the specification into a contract. The validation layer is the set of automated gates that decide whether the model's output meets the standard before the output reaches anyone or anything that depends on it. Type checkers. Schema validators on every structured field. Acceptance tests derived from the specification's contract. Property-based tests that exercise the edge cases. Performance budgets. Security scanners. Every gate runs every time, with no fatigue, no shortcuts, and no judgment calls about whether the output is "good enough." The contract is the contract.
The critical property of the validation layer is that the gates are deterministic even though the input is not. The model's output varies between runs. The validation gates produce the same verdict from the same input every time. That asymmetry is what produces the bridge. A probabilistic generator feeds a deterministic verifier; the verifier returns a binary verdict; the binary verdict is the signal the rest of the system routes on. The probabilistic generation never reaches the deterministic system without passing through a deterministic gate first. Variance enters; variance is bounded; variance is contained.
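A minimal sketch of that asymmetry, using illustrative stand-in gates rather than a real toolchain:

```python
from typing import Callable

Gate = Callable[[str], bool]  # pure function: artifact in, admissible out

def parses(artifact: str) -> bool:
    try:
        compile(artifact, "<generated>", "exec")  # stand-in for a type checker
        return True
    except SyntaxError:
        return False

def within_budget(artifact: str) -> bool:
    return len(artifact.splitlines()) <= 500  # stand-in for a size/perf budget

def admit(artifact: str, gates: tuple[Gate, ...] = (parses, within_budget)) -> bool:
    # Every gate runs every time; same artifact, same verdict, always.
    return all(gate(artifact) for gate in gates)
```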
A handler is the atomic unit of this layer. It receives structured context, performs one operation, and returns a structured result. It does not reach into global state. It does not produce side effects outside its declared interface. It does not decide what runs next. It answers exactly one question: given this typed input, what is the typed output? Handlers are testable by construction — you build the input context, call the handler, assert on the result. No environment to replicate. No globals to mock. No subjective judgment about whether the output is acceptable. The contract is the contract. This is shift-left quality assurance applied to agent output: the typed contract defines expected behavior, and the validation gate decides admissibility before any human looks at the result.
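A sketch of a handler and its test, assuming a hypothetical invoice-totaling operation. Typed input in, typed output out, no global state, no side effects, no routing decisions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TotalContext:
    line_item_cents: tuple[int, ...]
    tax_rate_bps: int  # basis points, e.g. 875 = 8.75%

@dataclass(frozen=True)
class TotalResult:
    subtotal_cents: int
    total_cents: int

def total_invoice(ctx: TotalContext) -> TotalResult:
    subtotal = sum(ctx.line_item_cents)
    total = subtotal + (subtotal * ctx.tax_rate_bps) // 10_000
    return TotalResult(subtotal_cents=subtotal, total_cents=total)

# Testable by construction: build the input, call the handler, assert.
assert total_invoice(TotalContext((1000, 250), 875)) == TotalResult(1250, 1359)
```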
There is a failure mode worth naming, because it is the most common shortcut and the most dangerous one. A test suite the agent wrote against the agent's own implementation is not a contract. It is a tautology — the agent confirming its own work. The implementation passes its own tests because its own tests were generated from its own behavior. A validation pipeline written against the specification, independent of any particular implementation, is what enforces correctness across the lifetime of the module. When the specification changes, the validation pipeline changes with it. When the implementation changes, the validation pipeline holds the line. The pipeline is the part that does not drift, and it is the only piece that turns probabilistic output into reliable systems.
What unvalidated AI looks like in production
Every team building with AI today has been told that AI-generated code can be insecure, brittle, or hallucinatory. The phrasing is technically accurate and operationally useless, because it locates the problem inside the model rather than inside the system that ships the model's output. The actual operational risk is not that the model produces a flawed output once. It is that the system around the model has no gate to catch the flawed output, and so the flaw ships.
The clearest illustration is in security-sensitive code. Consider an authentication module written by an agent against an underspecified prompt. The agent produces a handler that validates a session token, checks an expiration, and returns a user identity. The implementation looks plausible. It compiles. It runs. The integration tests pass. The senior engineer who reviews it on a Tuesday afternoon, three minutes before a release window, signs off. What did the model actually do?
Without a validation pipeline, no one knows. The agent may have implemented constant-time comparison correctly or it may have used a string equality check that is timing-vulnerable. It may have validated the token signature against the expected key or against any key the token claims. It may have rejected expired tokens or accepted them with a stale clock-skew window that an attacker can extend. Each of these is a known class of bug. Each of them is invisible to a casual reading of working code. Each of them is the kind of issue that a security review agent — or a static analyzer, or a property-based fuzzer, or a regression test derived from the specification — would catch automatically.
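The timing bug is worth seeing, because the vulnerable and safe versions read almost identically in review. A minimal sketch, assuming an HMAC-signed token and a hypothetical key:

```python
import hashlib
import hmac

SECRET = b"hypothetical-versioned-key"

def sign(payload: bytes) -> bytes:
    return hmac.new(SECRET, payload, hashlib.sha256).digest()

def verify_vulnerable(payload: bytes, signature: bytes) -> bool:
    # == short-circuits at the first differing byte, so response time
    # leaks how much of a forged signature is correct.
    return sign(payload) == signature

def verify_safe(payload: bytes, signature: bytes) -> bool:
    # compare_digest runs in time independent of where the bytes differ.
    return hmac.compare_digest(sign(payload), signature)
```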
The team that ships this code in the unvalidated configuration has not adopted AI tooling and accepted some manageable risk. They have shipped variance directly into the security boundary of the system. The variance is not visible because the code looks right. It is structurally identical to the variance you would get if you assigned a different junior engineer to the task each release without code review, security review, or integration testing — and accepted whatever they produced. Most organizations would not accept that with humans. Many of them are accepting it with AI without recognizing it as the same configuration.
The harnessed alternative is structurally different. The specification states what the authentication module must do — accept tokens of a defined shape, validate signatures using a constant-time comparison against a versioned key, reject expired or malformed tokens with a typed error. The validation pipeline runs every time the module changes: a security review agent scans for known vulnerability patterns, a property-based test exercises malformed inputs, a timing test verifies the comparison runs in constant time, a schema validator checks that every error path returns a typed result. The model can implement the module however its sampler decides on any given run; the gates either admit the implementation or they do not. Variance is contained at the gate. Production sees only output that has cleared the gate.
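One of those gates, sketched: a property-based test (here using the hypothesis library) asserting that arbitrary malformed bytes always map to a typed rejection. The token format and all names are hypothetical:

```python
import hashlib
import hmac
from dataclasses import dataclass

from hypothesis import given, strategies as st

KEY = b"hypothetical-versioned-key"

@dataclass(frozen=True)
class AuthError:
    code: str  # typed rejection: "malformed" or "bad_signature"

def validate_token(raw: bytes) -> AuthError | bytes:
    parts = raw.split(b".")
    if len(parts) != 2:
        return AuthError("malformed")
    payload, sig = parts
    expected = hmac.new(KEY, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(expected, sig):
        return AuthError("bad_signature")
    return payload  # validated identity claims

@given(st.binary(max_size=1024))
def test_arbitrary_bytes_never_authenticate(raw: bytes) -> None:
    # Forging the HMAC from random bytes is cryptographically negligible,
    # so every generated input must map to a typed rejection — never an
    # unhandled exception, never a pass.
    assert isinstance(validate_token(raw), AuthError)
```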
This is what containment looks like in practice. The probabilistic substrate produces an implementation. The deterministic infrastructure decides whether the implementation is acceptable. The asymmetry — probabilistic generation, deterministic verification — is the entire mechanism by which AI-generated code becomes safe to ship. It is not magic. It is not optional. It is the layer most teams skip, because skipping it is faster than building it, and the consequences of skipping it do not become visible until later.
Containment, not elimination
The phrase that makes this concrete is the one in the LinkedIn version of this argument: the harness does not eliminate probability. It contains it.
That distinction matters. Elimination would require the model to behave deterministically, which the model cannot do without ceasing to be the kind of model that is useful. Containment requires the deterministic infrastructure around the model to do the work of converting variable output into a reliable system. The model stays probabilistic. The system around it stays deterministic. The boundary between them — the validation gate — is where probability stops and certainty starts.
Every part of the harness participates in containment. Specifications contain probability by removing the gaps the model would otherwise fill with statistical priors from its training data. Validation contains probability by checking each output against the specification before the output goes anywhere. Orchestration contains probability by routing on deterministic, tool-produced values instead of on free-form model decisions, which means a workflow with thirty probabilistic nodes still produces the same execution path on identical inputs. The model varies. The path through the harness does not.
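A sketch of what routing on a tool-produced value can look like. The step names are hypothetical; the point is that the edge taken depends only on deterministic inputs:

```python
def next_step(gate_passed: bool, attempt: int, max_attempts: int = 3) -> str:
    # The edge taken depends only on the gate's verdict and a counter,
    # both tool-produced values, so identical inputs replay an identical
    # path through the graph.
    if gate_passed:
        return "ship"
    if attempt < max_attempts:
        return "regenerate"  # feed the gate's failure report back to the model
    return "escalate"        # deterministic exit, never a silent pass
```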
The reason this distinction matters at the leadership level is that elimination is impossible and many teams are spending budget chasing it anyway. Engineering hours go into prompt-tuning, into custom fine-tunes, into model evaluation harnesses that measure whether the model gets the right answer 94% of the time instead of 91%. Those investments can be useful, but they do not solve the problem. A model that is right 99% of the time is still wrong 1% of the time, and 1% of an unbounded production traffic stream is a stack of incidents. No amount of accuracy improvement reduces the model's variance to zero, and so no amount of accuracy improvement obviates the need for the validation layer. The leverage is not in pushing the model's accuracy up. It is in building the deterministic layer that makes the model's residual error someone else's problem — caught at the gate, never shipped.
This reframes the question every leadership team eventually asks: "is our AI accurate enough to ship?" The honest answer is that the model's accuracy is the wrong measurement. The right measurement is whether the system around the model converts the model's output into reliable production behavior. A 91% accurate model inside a robust harness produces a deterministic, auditable, regenerable system. A 99% accurate model with no harness produces a system that fails in production 1% of the time and ships the other 99% unverified, with no audit trail. The accuracy delta does not predict the reliability delta. The harness does.
What this asks of the leadership team
The hard part of this conversation is that the harness is invisible from outside engineering. The procurement spend on the model is visible — a budget line, a vendor relationship, a slide for the board. The harness investment is engineering hours. It does not show up on a tool-adoption dashboard. It is the time spent writing specifications that did not previously exist, building validation pipelines for outputs that previously went unchecked, and replacing imperative orchestration scripts with declarative graphs. None of those line items survive a one-line summary, and all of them determine whether AI changes the economics of the work or just adds a vendor invoice.
The asymmetry is uncomfortable. The visible spend produces a demo. The invisible spend produces a shipping system. Boards that have not yet developed the literacy to distinguish them will reward the demo and underfund the harness, because the demo is legible and the harness is not. The leadership move is to make the harness legible — to name the validation pipeline as a first-class engineering artifact with its own roadmap, budget, and accountability. The teams that have done this are the ones whose AI work shows up as throughput in production. The teams that have not are the ones whose AI work shows up as a Q3 presentation and a Q4 retraction.
There is no shortcut. The model is available to everyone, and so the model is not where the differentiation lives. The harness is where the differentiation lives, because the harness is what determines whether the model's output can be trusted, regenerated, and shipped. Containment is engineering work. Engineering work is what produces reliable systems. The next-generation model will be more capable, and more capable means it will produce better output inside a harness and marginally better demos outside one. The relative position of harnessed and unharnessed teams will widen, not narrow, as the models improve.
Stop treating AI output as something to admire. Start treating it as something to verify.
Are you still shipping AI output without checking it?
