Methodology

How the diagnostic actually works.

The AI Behavioral Integrity Diagnostic uses a structured methodology developed across two decades of operational pattern recognition, applied to a new failure surface. This page details how the diagnostic maps the reliance chain, identifies failure categories beyond hallucination, and contrasts with standard AI testing approaches. It exists for buyers and technical reviewers who want to understand the work in depth before engaging.

Decision-Reliance Mapping

The diagnostic maps where the answer enters business reliance, not just whether the answer is right.

An AI response is the middle of the chain, not the end of it. The diagnostic traces the path — from the input the model receives, through the reasoning and output it produces, to where the response enters the decisions and actions that depend on it.

InputRetrievalInterpretationOutputDecision SignalHuman RelianceDownstream ActionBusiness Outcome

Most AI evaluation stops at the output. The diagnostic continues through the reliance chain to where AI behavior actually meets the business.

Beyond the three dimensions

Two failure categories that shape how decision-signal integrity behaves in production.

The three-dimension framework — Signal, Boundary, Reliance — introduced on the home page is the structural model. The diagnostic also examines two specific failure categories that shape how those three dimensions actually behave in production.

The first is systematic bias from training, system prompt configuration, and guardrail design — the predictable response tendencies that look like neutral processing but actually shape what the workflow recommends across thousands of interactions.

The second is behavior under sustained pressure— whether the AI maintains calibrated responses when users push back, claim authority, or apply social pressure, and whether the workflow's escalation logic holds up when the AI's confidence is questioned.

In agentic workflows where the model can execute actions rather than only generate text, both failure categories carry materially higher consequence.

Necessary, not sufficient

Standard testing answers a different question than the diagnostic.

Hallucination checks, safety reviews, jailbreak tests, prompt evaluations, governance documentation, and benchmark testing all have a place. They do not answer whether the workflow can support the decisions that depend on it.

Standard testing asksClearMark Advisory asks
Did the system produce an expected answer?Can the answer support the decision that follows?
Did the model hallucinate?Did it preserve source weight, uncertainty, and decision boundaries?
Did the response sound safe?Did careful language weaken the decision signal?
Did the prompt pass evaluation?Did the workflow hold up under real reliance pressure?

Next step

The diagnostic produces a Decision-Risk Findings Brief that translates this methodology into specific findings for one defined AI workflow.

The full brief structure is detailed on the home page. A simulated example of the deliverable is available below.