The AI Engineering Maturity Diagnostic

Score your organization. One point per "yes." Be honest.

The diagnostic that follows is not about how much AI is being used in the organization. Most organizations are using a lot of AI, in a lot of places, and the volume of usage tells the strategy team nothing about whether the usage is producing durable value. The diagnostic is about whether the AI is operating inside engineering infrastructure that makes its output reliable, or whether it is operating in unstructured chat windows where the quality depends on whoever happens to be using it. The two postures look similar from the outside. The diagnostic is designed to surface which one the organization is actually in.

The questions are organized in three categories. Specifications, which determine what the AI is being directed to do. The harness, which determines how the AI's output is constrained and validated. Roles and process, which determine whether the organization is structurally aligned with the engineering practice the harness requires. Each question is binary. Yes counts. Partially does not. The threshold is deliberate, because the value of the diagnostic depends on the team being honest about where it actually is, not where it would like to be.

Specifications (0–3)

Do written specification standards and templates exist?

The first question separates organizations that have decided what a specification looks like from organizations that have not. The standard does not need to be elaborate. It does need to be written, shared, and applied across the team. A specification standard might define the fields a spec must contain, the format it must follow, the tooling that produces it, and the review process it goes through before it is handed to the AI. Without a standard, every engineer writes specs in whatever form they happen to prefer that week, and the quality of AI output varies accordingly. The standard is the scaffolding that makes specification quality measurable across the team.

Do specifications include inputs, outputs, constraints, and acceptance criteria?

The second question gets at whether the specifications, even when written to a standard, are actually directing the AI completely. Inputs the system receives. Outputs it must produce. Constraints it must respect. Acceptance criteria that distinguish correct from almost-correct. A spec missing any of these is a spec that delegates the missing dimension to the model's pattern matching, which is the loose joint where probabilistic output starts to drift from what the team actually wanted. If the team's specs reliably contain all four, the AI is being directed. If they reliably contain only some, the AI is being prompted, and the difference is the difference between engineering and experimentation.

Is specification quality reviewed before AI generates output?

The third question separates teams that treat specifications as artifacts that get reviewed from teams that treat them as throwaway briefs. Reviewing the spec before the AI runs catches the gaps the model would otherwise fill probabilistically. It also produces a feedback loop on specification quality itself — engineers learn what makes a spec sufficient by seeing their drafts critiqued before generation. Teams that do this report higher first-pass output quality and lower rework rates. Teams that do not are running the rework loop without realizing it.

The harness (0–4)

Do AI operations receive structured context automatically?

The fourth question asks whether context is engineered or pasted. Context engineering is a system that decides, for each AI operation, what information enters the model's window, in what structure, drawn from which corpus. Pasting is a developer manually copying files and instructions into the prompt window. The two produce different output quality, because pasting depends on the developer's judgment about what to include, every time, and the judgment varies. Structured context assembly is consistent, observable, and tunable. Pasting is none of those things.

Do enforceable workflows with defined quality gates exist?

The fifth question gets at whether the engineering process the AI runs inside is encoded in software or held in conventions. An enforceable workflow is one the harness will not let an operation skip. Tests have to pass. Security checks have to pass. Validation has to complete. The workflow is the same on Friday at 5pm as it is on Tuesday at 11am, because the workflow is not negotiable. A workflow held in conventions degrades under pressure, because conventions are negotiable. The teams that have moved to enforceable workflows report that the rigor floor rises and stops eroding.

Is there full provenance tracking — who, what, when for every artifact?

The sixth question asks whether the team can reconstruct, after the fact, what the AI did. Provenance is the structured execution history that records, for every operation, which agent ran it, on which spec version, with which inputs, producing which outputs, at which time, with which validation results. This is the audit trail the regulator will eventually ask for, the debugging surface the team will need when something downstream breaks, and the structural evidence that lets the organization defend its AI work to any party that asks. Provenance built into the harness is automatic. Provenance reconstructed from logs after the fact is unreliable.

Is validation automated, not manual review?

The seventh question separates teams whose quality bar is enforced by software from teams whose quality bar depends on what the human reviewer happened to look at this time. Automated validation runs the same checks, at the same depth, on every change. Manual review varies by reviewer and by day. The two systems produce different output quality at the same nominal level of effort. The harness is the layer that lets validation be automated, which is the layer most teams have not yet built deliberately.

Roles and process (0–3)

Do engineers spend more time on specifications than editing AI output?

The eighth question is the time-allocation tell. In a team operating effectively, the engineer's leverage is in directing the AI, not in fixing what the AI produced. The time budget reflects this — more hours on specs, fewer hours editing output. In a team operating in the rework loop, the time budget is inverted, with the engineer spending most of their time finishing the AI's first drafts. The answer to this question, asked honestly across the team, is one of the cleanest signals of whether the organization is in the productive register or the rework register.

Are output correctness and rework rate measured — not just velocity?

The ninth question is the metric tell. Velocity, commit volume, lines of code, time-saved-per-task — these are activity metrics. Output correctness, rework rate, harness reliability, time to validated output — these are effectiveness metrics. Teams measuring effectiveness know whether the AI investment is producing durable value. Teams measuring activity know whether the AI is being used. The question of whether the investment is working is answered by the first set, not the second.

Do job descriptions and career ladders reflect system design skills?

The tenth question asks whether HR has caught up to what AI engineering actually requires. The skills that produce effective AI engineering — specification quality, harness design, validation discipline, multi-agent orchestration, knowledge management — are not the skills that engineering job descriptions emphasized five years ago. If the job descriptions still reward typing speed and individual code mastery, the organization is hiring for the wrong skills, training for the wrong skills, and rewarding the wrong skills. The career ladder change is unfashionable, slow, and structural. It is also a pre-condition for the organization to actually retain the engineers who can do the new work.

Scoring

0–3: Vibe coding. Your AI investment is generating rework, not results. The model is producing output, the team is editing it to ship, and the velocity numbers may look fine, but the rework cost is consuming the leverage the AI was supposed to produce. The organization is operating in the experimental register. The risks of normalization of deviance are accumulating. The competitive position relative to organizations with higher scores is eroding every quarter the work is not done.

4–6: Structured use. A foundation exists, but gaps are costing the team. Some of the right pieces are in place — specifications written to a standard, partial automation of validation, some attention to metrics — but the system is not yet a coherent harness. The output is more reliable than vibe coding, but not yet at the level the organization needs to defend it confidently. The roadmap from this score to the next tier is the most actionable, because the team can identify which specific gaps to close and in what order.

7–9: Engineered. The team is building the harness. Most of the structural pieces are in place. The remaining gaps are usually in the harder categories — full provenance, mature validation pipelines, career ladder alignment. Organizations at this score can scale AI use across teams with confidence, because the foundation is solid enough that adding capacity does not produce chaos. The competitive position is durable. The work that remains is refinement, not foundation.

10: Reliable systems. The organization is in the compounding zone. Every quarter of operation produces an improvement to the harness that the organization keeps. The output is reliable at production scale. The audit trail is complete by construction. The metrics describe effectiveness. The hiring rubric matches the work. The competitive position is structurally strong, and the lead is widening relative to organizations that are still earlier in the curve.

Why the score matters more than the number

Most organizations, asked honestly, score between two and five. The honest answer surprises leadership, because the surface activity — AI tools deployed, engineers using them, demos that work — looks like a higher score than the diagnostic produces. The gap between the surface impression and the diagnostic score is exactly the gap the diagnostic is designed to surface, because the surface impression is what the strategy team has been reporting and the diagnostic score is what the engineering reality actually is.

The organizations that know their score are already ahead of the organizations that do not. Knowing the score does not by itself improve the engineering capability. It does name the gap, which makes the gap addressable. The roadmap from a score of four to a score of seven is a sequence of investments — specification standards, validation automation, provenance, metrics — that the team can execute deliberately. The roadmap from "we have AI" to "we have effective AI" is undefined, because the destination is not specified.

The diagnostic is also useful as a board artifact. The score is structured, defensible, and comparable across time. A board that asks "where are we on AI engineering maturity?" can be answered with a number, a trend line, and a list of the specific gates that remain to be closed. The conversation moves from "we are doing AI" to "we are at six and aiming for nine, here is the plan," which is the conversation the board actually needs to have.

The gap between your score and ten is your AI engineering roadmap. The work to close the gap is the work the organization should be funding. The pace of closing the gap is the pace at which competitive position is being built. The diagnostic does not produce the roadmap by itself, but it produces the structure within which the roadmap can be defined and the progress tracked.

The diagnostic question

What did you score?

Score honestly. The value of the exercise depends on it. A team that scores itself optimistically gets to feel better about its position and continues to defer the work the diagnostic is designed to surface. A team that scores honestly gets a real assessment of where it is, and the assessment is the input to a real plan.

If the score is in the vibe-coding range, the work is foundational. Specifications, harness construction, automation of validation, the metrics that surface effectiveness rather than activity. The team will produce visible improvements within a quarter if the work is funded.

If the score is in the structured-use range, the work is targeted. Identify the specific gates that are not closed and close them in sequence. The team will move into the engineered range within two to three quarters with deliberate effort.

If the score is in the engineered range, the work is refinement. The hard parts have been done. The remaining work is the kind that produces the compounding zone, where each quarter's investment builds on the last and the organization's position relative to competitors widens.

If the score is ten, the organization is operating at the level the rest of the field is trying to reach. The work, at this stage, is sustaining the discipline and resisting the entropy that pulls every system back toward less rigor over time. The harness has to be maintained. The metrics have to be watched. The norms have to be reinforced. The position is real, and it has to be defended.

What did you score? The number is the start of the conversation, not the end of it.