Confidence Engineering didn't emerge from nowhere. It inherits decades of testing thought — and departs from it in specific ways. Here is where it converges, where it diverges, and what the AI era changes about each.
The major schools of software testing — from Context-Driven Testing to Modern Testing to Rapid Software Testing — were developed by serious practitioners who were right about most things, most of the time. Many of their core insights hold up under the pressure of the AI era. Some do not.
The most honest framing is this: the old frameworks diagnosed the right disease — the over-reliance on metrics, the misalignment between what teams measure and what users experience, the failure of gate-based thinking. Confidence Engineering extends that diagnosis to a new set of conditions: AI-generated code that no one fully reads, probabilistic systems that cannot be deterministically verified, and software that evolves faster than any team can review.
This is not a rejection of what came before. It is an attempt to name what comes next.
| Framework | Origin | CE Alignment | AI Era Relevance |
|---|---|---|---|
| Google Testing Approach | Whittaker, Arbon, Carollo | High | Strong in parts |
| Human Experience Testing (HXT) | Tariq King | High | Still strong |
| Modern Testing (MT) | Alan Page, Brent Jensen | High | Still strong |
| Context-Driven Testing (CDT) | Bach, Kaner, Pettichord | High | Still strong |
| Exploratory Testing | Cem Kaner | High | Still strong |
| Risk-Based Testing | Bach, Whittaker, various | High | Still strong |
| Agile Testing | Crispin, Gregory | Medium | Needs evolution |
| Behavior-Driven Development (BDD) | Dan North | Medium | Needs evolution |
| Test-Driven Development (TDD) | Kent Beck | Partial | Needs evolution |
| Rapid Software Testing (RST) | James Bach, Michael Bolton | Medium | Needs evolution |
| ISTQB / ISO 29119 | Standards bodies | Low | Limited |
RST frames testing as a cognitive skill, not a procedure. It draws a sharp distinction between checking (automated verification of known expectations) and testing (a human intellectual activity involving exploration, judgment, and learning). It rejects scripted testing as sufficient and insists that good testing requires intelligent adaptation to context.
RST's distinction between checking and testing maps almost directly onto CE. Checking = automated verification of known expectations. Testing = exploration of what you don't yet know. CE's value of exploration over repetition is RST's central argument restated.
RST's rejection of scripts and pass rates as quality proxies is foundational to CE's position that activity metrics are not quality metrics.
RST's insistence on judgment, context, and skepticism toward process over thinking resonates deeply with CE's belief in risk intelligence over execution volume.
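To make the checking/testing distinction concrete, here is a minimal sketch in Python. The function and test are hypothetical, invented purely for illustration:

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical domain function, used only to illustrate the distinction."""
    return price * (1 - percent / 100)

# A *check*: automated verification of a known expectation.
# It can only confirm or refute what someone already thought to ask.
def test_discount_is_applied():
    assert apply_discount(price=100, percent=50) == 50

# *Testing*, in RST's sense, happens around the check: noticing that
# percent=-10 or percent=150 were never specified, asking what should
# happen there, and turning that learning into new checks.
```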
RST is fundamentally a human discipline. Its models of the skilled tester assume a person is always the primary agent of exploration and judgment. CE extends this: AI is not the enemy of good testing — it is a tool to be directed, much like RST would direct a junior tester.
RST doesn't address what happens when no human authored the code being tested. Its heuristics and oracles are designed for systems humans built and understand; they need significant extension for systems that emerged from model outputs.
RST operates at the individual tester level. CE is also concerned with organizational confidence infrastructure — evals, observability, production telemetry — that sits above any individual tester's work.
RST's critical thinking and humanistic approach are not just still relevant — they are among the most important intellectual foundations CE builds on. The insistence that testing requires judgment, skepticism, and genuine curiosity about how systems fail is exactly right, and in an era of AI-generated code it matters more, not less.
What pulls the rating down is RST's deep skepticism of automation and instrumentation. RST tends to treat automated checks as a lesser form of testing — useful but fundamentally limited compared to skilled human exploration. CE disagrees: at AI-era scale, sophisticated automation, telemetry, evaluation pipelines, and data-driven measurement are not compromises — they are the only way to maintain confidence at all. A framework that treats data and measurement as secondary to human craft will struggle to speak to engineering leaders building systems that process billions of events a day.
The core RST insight — thinking matters more than process — travels perfectly into the CE era. The tooling philosophy does not.
Modern Testing's seven principles argue that quality is the team's responsibility, not a QA function's. Testers should accelerate the team, act as quality coaches, be data-driven, and continuously improve. It pushes testing from a gate-keeping activity into a capability distributed across the whole engineering team.
MT's core argument — quality belongs to everyone, not a QA department — is a precondition for CE. You cannot build organizational confidence infrastructure if quality is siloed.
MT's "accelerate the team" principle maps to CE's belief that Confidence Engineering exists to help organizations move faster, not slower. Confidence is an enabler of speed, not a brake on it.
MT's emphasis on being data-driven and questioning the value of testing activities directly echoes CE's rejection of activity metrics.
MT focuses on team culture and organizational process. CE goes further: even a perfectly organized team with excellent quality culture faces new problems when AI writes the code and AI evaluates it. The cultural shift MT describes is necessary but not sufficient.
MT's coaching model assumes humans remain the primary authors and verifiers of software. AI authorship breaks the model: who does a quality coach coach when the developer is a language model?
MT doesn't address probabilistic systems or the two-unknowns problem. Its data-driven principle assumes data that is interpretable in traditional ways.
CDT holds that there are no universal best practices in software testing — only practices that are good in a given context. Good testing requires skilled people exercising judgment, and the value of any practice depends entirely on the project, product, and people involved. It rejects process dogma and checklist compliance as substitutes for thinking.
CDT's rejection of best practices maps directly onto CE's rejection of coverage targets and pass rates. There is no universal metric that constitutes confidence. What gives you confidence depends on what you're building, who's using it, and what could go wrong.
CDT's emphasis on skilled judgment over process compliance is the philosophical bedrock CE builds on. Directing AI verification toward meaningful risk is a judgment skill, not a process to follow.
CDT's notion that testing is fundamentally about learning about the product — not just running test cases — aligns with CE's comprehension coverage concern. Testing is partly how you maintain understanding of a system you didn't fully author.
CDT is almost entirely concerned with human testing skill applied to human-authored systems. Its rich vocabulary of heuristics, oracles, and test charters was developed for a world where a skilled tester explores a system a human team built. That world still exists — but it no longer describes the whole of software development.
CDT has less to say about organizational confidence infrastructure: the systems of evaluation, observability, and production telemetry that CE treats as first-class engineering concerns.
Exploratory Testing treats test design and test execution as simultaneous rather than sequential activities. The tester learns about the system while testing it, adapting their approach based on what they discover. It is empirical, adaptive, and cognitive — the opposite of scripted test execution.
CE's value of exploration over repetition is Exploratory Testing restated at a higher level of abstraction. Scripted test execution is repetition; exploratory testing is exploration. CE says repetition should be automated and attention directed toward the unknown.
Exploratory Testing's adaptive, learning-while-doing approach is precisely what's needed for probabilistic systems. You cannot pre-script tests for a system whose outputs vary by definition. You must explore.
The emphasis on unknown unknowns — testing not just what you expect to find, but what you didn't know to look for — is arguably more important when AI generates system behavior that no human anticipated.
Traditional Exploratory Testing is a human practice at human speed. A single skilled explorer testing a system for a day will cover a fraction of the behavioral surface area that an AI-assisted exploration can cover in the same time.
CE reframes exploration as something that can be directed and amplified by AI — not replaced by it. The human judgment about where to explore, and what a finding means, remains essential. But the scale of exploration is no longer limited to human bandwidth.
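Property-based testing is an existing, pre-AI example of what amplified exploration looks like: the human supplies the judgment (which invariant matters), and the machine generates far more probes than a person could. A minimal sketch using Python's Hypothesis library; the function under test is hypothetical:

```python
from hypothesis import given, strategies as st

def normalize_whitespace(text: str) -> str:
    """Hypothetical function under test: collapse whitespace runs to single spaces."""
    return " ".join(text.split())

# The human decides what matters: normalization should be idempotent.
# The machine explores: Hypothesis generates hundreds of inputs,
# including ones no person would think to type.
@given(st.text())
def test_normalization_is_idempotent(text):
    once = normalize_whitespace(text)
    assert normalize_whitespace(once) == once
```

AI-directed exploration extends the same division of labor: human judgment chooses the territory and interprets the findings; the machine supplies the bandwidth.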
Risk-Based Testing argues that testing effort should be allocated in proportion to risk — the likelihood and impact of failure. Rather than attempting to cover everything, teams identify what could go wrong, assess severity and probability, and invest testing effort accordingly. It is the argument against uniform coverage made into a practice.
Risk-Based Testing is arguably the direct ancestor of CE's risk intelligence principle. "Direct verification toward meaningful risk" is risk-based testing elevated to a first principle of the discipline.
The underlying logic — not all parts of a system are equally worth testing, and resources should follow risk — is foundational to CE and becomes even more important when AI generates vast amounts of code faster than teams can evaluate all of it.
Traditional risk-based testing still tends to use test counts and coverage as the units of investment once risk areas are identified. CE argues that these are the wrong measures even after you've correctly identified the risk.
Risk-based testing models typically assume you can enumerate risks upfront with reasonable completeness. In AI-generated systems that evolve continuously, risk maps become stale almost immediately. Risk identification itself must become continuous.
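A sketch of what continuous risk allocation might look like, with hypothetical components and scores; weighting by recent change is one assumed way to keep the ranking fresh, not a prescribed formula:

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    failure_likelihood: float  # 0..1, estimated from history and judgment
    impact: float              # cost of failure, in whatever units the team uses
    recent_churn: int          # e.g. commits touching this component recently

def risk_score(c: Component) -> float:
    # Classic risk = likelihood x impact, weighted by recent change so the
    # ranking refreshes as the system evolves instead of going stale.
    return c.failure_likelihood * c.impact * (1 + c.recent_churn)

components = [
    Component("checkout", failure_likelihood=0.2, impact=100, recent_churn=12),
    Component("search_ranking", failure_likelihood=0.4, impact=80, recent_churn=30),
    Component("settings_page", failure_likelihood=0.1, impact=5, recent_churn=0),
]

# Verification effort follows the continuously recomputed ranking.
for c in sorted(components, key=risk_score, reverse=True):
    print(f"{c.name}: risk={risk_score(c):.1f}")
```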
Agile Testing embeds quality throughout the development cycle rather than at the end. It advocates whole-team quality ownership, shift-left testing, fast feedback loops, and a testing quadrants model that balances business-facing against technology-facing tests, and tests that support the team against tests that critique the product, whether manual or automated.
Agile Testing's shift toward whole-team quality ownership and continuous feedback is a necessary precursor to CE. You cannot build confidence infrastructure if testing is a separate department that receives code at the end of a sprint.
The move away from big-bang release testing toward fast, continuous feedback is directionally correct — CE extends this logic all the way to continuous production evaluation.
Agile Testing is still fundamentally sprint-structured. Even "continuous testing" in Agile means testing within each iteration. CE argues that sprint boundaries are as artificial as release gates — software changes continuously, and confidence must be maintained continuously.
The testing quadrants model assumes a relatively stable classification of test types. It has no accommodation for AI-generated tests of AI-generated code, nor for the behavioral distributions that characterize probabilistic systems.
Agile Testing's shift-left assumes humans shift left. When AI generates code at any moment, "left" has no clear location.
TDD requires writing a failing test before writing any production code, then writing the minimum code to pass the test, then refactoring. The test suite becomes a specification of behavior, a safety net for change, and a design tool. Red-green-refactor is the rhythm.
TDD's idea of specifying behavior before implementing it is perhaps more important in the AI era than it was when Kent Beck invented it. When AI generates the implementation, the human's job is to define the behavioral contract that the AI must satisfy. That is TDD's core intent.
TDD's fast feedback loop — know immediately if something breaks — is the seed of CE's continuous evaluation principle.
TDD's unit test focus creates exactly the isolation problem CE addresses. Unit tests verify that components work in isolation; they say nothing about emergent behavior, system interactions, or real-world conditions.
When AI writes both the tests and the code, TDD's verification loop is broken. The tests and implementation may share the same blind spots. A passing TDD suite from an AI pair is not evidence of confidence in the same way a human-authored TDD suite is.
TDD also assumes a deterministic system. The red-green-refactor cycle has no mechanism for probabilistic outputs or behavioral distributions.
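One way to see the gap: red-green expects one input, one output, and a binary verdict, while a probabilistic system needs a verdict over a distribution of outputs. A hedged sketch; the stand-in function, sample size, and threshold are all illustrative assumptions:

```python
import random

def summarize(text: str) -> str:
    """Stand-in for a nondeterministic model call (illustrative only)."""
    words = text.split()
    k = random.randint(1, max(1, len(words) // 2))
    return " ".join(words[:k])

# Deterministic red-green has no verdict here: the same input yields
# different outputs, so exact-match assertions are meaningless.

# A distributional check instead: sample N outputs and require a
# property to hold at or above a pass-rate threshold.
def test_summary_is_shorter(n: int = 100, threshold: float = 0.95):
    text = "the quick brown fox jumps over the lazy dog"
    passes = sum(len(summarize(text)) < len(text) for _ in range(n))
    assert passes / n >= threshold
```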
BDD extends TDD's intent outward — toward shared understanding between business and engineering. Using Given-When-Then scenario language, it creates living documentation that bridges what the business wants with what the system does. Tools like Cucumber and SpecFlow make these specifications executable.
BDD's insistence on human-readable behavioral specifications that connect business intent to system behavior is directly aligned with CE's comprehension coverage concern. When no one fully reads the code, the behavioral spec becomes the primary artifact of understanding.
BDD's idea that shared understanding is a precondition for quality maps to CE's argument that confidence requires a legible model of system intent — especially when the author is an AI.
BDD's tools and ceremony assume a stable, enumerable set of scenarios that humans collaboratively define. In AI-generated systems that evolve continuously, the scenario set is always incomplete and the living documentation quickly becomes stale.
BDD doesn't accommodate probabilistic outputs. Given-When-Then assumes a deterministic "Then" — that the system always produces the same result for the same inputs. AI systems don't.
The overhead of BDD tooling and ceremony doesn't scale to AI-driven development velocity. By the time a Cucumber scenario is written, reviewed, and automated, the behavior it describes may have already changed.
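For contrast, here is what a scenario can look like once the Then clause is an oracle over acceptable behavior rather than an exact expectation. The sketch is plain Python rather than Gherkin, and the scoring oracle is a hypothetical stand-in for whatever judge a team actually uses (rules, embeddings, a model-graded rubric):

```python
def answer_question(question: str) -> str:
    """Stand-in for a probabilistic system under test (canned here)."""
    return "Refunds are available within 30 days; after that, store credit."

def relevance_score(question: str, answer: str) -> float:
    """Hypothetical oracle: fraction of required concepts the answer mentions."""
    required = ["refund", "30 days"]
    hits = sum(term in answer.lower() for term in required)
    return hits / len(required)

def test_refund_policy_scenario():
    # Given a customer with a billing question
    question = "Can I get a refund after 30 days?"
    # When they ask the support assistant
    answer = answer_question(question)
    # Then the answer is acceptable, not identical: exact wording varies
    # run to run, so the Then clause becomes a scored judgment.
    assert relevance_score(question, answer) >= 0.8
```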
Google's testing approach distributes quality responsibility across three roles — Software Engineers in Test (SET), Test Engineers (TE), and Test Engineering Managers (TEM) — and emphasizes testing as an engineering discipline rather than a QA function. It treats testing at scale as a systems problem and pioneered many automation and exploratory techniques at web-scale.
Testing as a first-class engineering discipline, not a separate department, is a direct ancestor of CE. The argument that quality requires engineering investment — not just manual testers running test cases — is foundational.
Google's scale-first thinking — how do you maintain confidence across thousands of engineers and millions of lines of code? — is the right framing for the AI era, where the scale problem has simply become more extreme.
The emphasis on production signals and real user data as quality inputs, not just pre-release testing, anticipates CE's continuous evaluation and observability principles.
Google's approach was built for a world of large-scale deterministic software written by human engineers. The SET/TE/TEM roles were designed for that world. The AI era removes the assumption that humans wrote — and therefore can reason about — the code.
Google's testing approach still relies heavily on automated test suites as the primary confidence signal. CE argues this is insufficient when AI writes both code and tests, potentially sharing the same blind spots.
The approach in *How Google Tests Software* describes what most of Google's software engineering looked like, and it was broadly representative of the industry. But it was not the whole picture. The core Search engineering teams operated on a different model: Search Quality Engineers who sat inside the engineering organization, not in testing or QA.
These engineers were responsible for evaluating whether search results were actually good — measuring relevance, freshness, ranking quality, and user satisfaction across billions of queries. They built evaluation frameworks, defined quality metrics, ran large-scale human rating programs, and tracked confidence in the system's behavior over time. They weren't testing for bugs. They were building and maintaining the organization's confidence that the system was doing the right thing at scale.
That is Confidence Engineering — not by name, but in every meaningful sense. And critically, they were just engineers. They weren't part of the testing field, didn't use testing frameworks, and weren't measured by test coverage. They were measured by whether the product was actually getting better in ways real users experienced. The Search Quality model is arguably the closest real-world precursor to what CE describes — and it emerged not from testing practice, but from the demands of operating a complex, probabilistic, continuously changing system at planetary scale.
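A heavily compressed sketch of that pattern, with hypothetical data and a toy metric: run a fixed query set through the system, score results against graded judgments, and gate on the aggregate against a known-good baseline. Any real pipeline (Search's certainly) is far richer:

```python
# Minimal confidence-eval loop; all names and numbers are illustrative.
golden_set = {
    "best pizza near me": {"r12", "r7"},   # ids judged relevant by raters
    "python list sort": {"d3"},
}

def run_system(query: str) -> list[str]:
    """Stand-in for the system under evaluation: returns ranked result ids."""
    canned = {
        "best pizza near me": ["r7", "r99", "r12"],
        "python list sort": ["d3", "d8"],
    }
    return canned[query]

def precision_at_k(results: list[str], relevant: set[str], k: int = 3) -> float:
    top = results[:k]
    return sum(r in relevant for r in top) / len(top)

scores = [precision_at_k(run_system(q), rel) for q, rel in golden_set.items()]
mean = sum(scores) / len(scores)

BASELINE = 0.55  # last known-good aggregate (illustrative)
print(f"mean precision@3 = {mean:.2f} (baseline {BASELINE})")
assert mean >= BASELINE, "quality regression: investigate before shipping"
```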
HXT expands the scope of testing beyond the software itself to encompass the complete human experience with a product — including physical, emotional, social, and situational factors. Quality is measured not by whether the software functions correctly, but by whether it serves human needs across all the moments people interact with a brand or service.
HXT is the framework CE most directly inherits from and credits. CE's belief that quality is a property of the experience, not the software, is HXT's central claim. The scope expansion — from app to full brand experience — is both frameworks' most distinctive departure from traditional testing.
HXT's inclusion of physical, emotional, and situational context directly informs CE's value of operational reality over synthetic certainty. No test environment captures what real people experience in the real world.
HXT's human-first framing maps to CE's confidence framing: what matters is whether the person trusts and can rely on the experience, not whether the system passed its own tests.
HXT focuses primarily on the experience dimension of quality. CE extends the analysis into the engineering infrastructure required to maintain confidence — evals, observability, production telemetry, AI verification systems. HXT names the problem; CE also addresses the engineering apparatus required to solve it.
HXT predates the AI-as-author problem. How HXT principles apply when AI generates the experiences humans then have — and the experiences are themselves probabilistic — is an open question both frameworks would need to address together.
ISTQB provides a structured, internationally recognized certification framework for software testing. ISO 29119 is a formal testing standard covering test processes, documentation, and terminology. Both aim to create a common, codifiable body of knowledge for the testing profession — a lingua franca for practitioners across industries and organizations.
ISTQB's foundational vocabulary — test levels, test types, defect lifecycle — provides a shared language that is genuinely useful. CE does not reject the idea of a common vocabulary for testing; it argues that vocabulary needs updating for the AI era.
In regulated and safety-critical environments — medical devices, aviation, nuclear — the rigorous documentation and process requirements that ISTQB and ISO 29119 demand remain important. CE's steelman acknowledges that deterministic verification still provides important guarantees in these contexts.
ISTQB treats testing as a codifiable, certifiable body of procedures. CE's foundational claim — informed by RST and CDT — is that testing requires judgment that cannot be reduced to procedure. Coverage percentages and test case counts, the primary metrics of ISTQB-aligned quality programs, are precisely what CE argues don't represent confidence.
ISTQB's certification model assumes testing skill is transferable via structured curriculum. CE treats good testing as contextual judgment that depends on deep domain knowledge and adaptive thinking — not something a syllabus fully conveys.
ISO 29119's heavyweight documentation requirements are incompatible with AI-era development velocity. By the time a formal test plan is approved, the system it describes may have been regenerated.
The striking thing about reviewing these frameworks together is how much the best of them already knew. RST, CDT, Exploratory Testing, and Risk-Based Testing each diagnosed the same disease: that what teams measure (coverage, pass rates, test counts) is not what they care about (whether the software works for real people in real conditions). They said this clearly, and the industry largely ignored them.
The AI era doesn't change that diagnosis. It makes ignoring it more dangerous. When humans wrote all the code, the gap between activity metrics and actual confidence was bad but bounded. When AI generates code faster than teams can review, evaluates it with tools that may share its blind spots, and deploys into systems no individual understands end to end — that gap becomes a chasm.
Confidence Engineering inherits the insight of every framework on this page that rejected the illusion of diligence. It tries to name what the discipline looks like when you take that rejection seriously — not just as a philosophy of skilled individual testing, but as an organizational engineering practice built for a world where the authors of software are increasingly algorithms, and the people responsible for confidence must direct, amplify, and evaluate rather than simply execute.
The old frameworks were right. They just weren't finished.