Working draft

CE and the Testing Landscape

Confidence Engineering didn't emerge from nowhere. It inherits decades of testing thought — and departs from it in specific ways. Here is where it converges, where it diverges, and what the AI era changes about each.

The major schools of software testing — from Context-Driven Testing to Modern Testing to Rapid Software Testing — were developed by serious practitioners who were right about most things, most of the time. Many of their core insights hold up under the pressure of the AI era. Some do not.

The most honest framing is this: the old frameworks diagnosed the right disease — the over-reliance on metrics, the misalignment between what teams measure and what users experience, the failure of gate-based thinking. Confidence Engineering extends that diagnosis to a new set of conditions: AI-generated code that no one fully reads, probabilistic systems that cannot be deterministically verified, and software that evolves faster than any team can review.

This is not a rejection of what came before. It is an attempt to name what comes next.

At a glance
Framework | Origin | CE Alignment | AI Era Relevance
--------- | ------ | ------------ | ----------------
Google Testing Approach | Whittaker, Arbon, Carollo | High | Strong in parts
Human Experience Testing (HXT) | Tariq King | High | Still strong
Modern Testing (MT) | Alan Page, Brent Jensen | High | Still strong
Context-Driven Testing (CDT) | Bach, Kaner, Pettichord | High | Still strong
Exploratory Testing | Cem Kaner | High | Still strong
Risk-Based Testing | Bach, Whittaker, various | High | Still strong
Agile Testing | Crispin, Gregory | Medium | Needs evolution
Behavior-Driven Development (BDD) | Dan North | Medium | Needs evolution
Test-Driven Development (TDD) | Kent Beck | Partial | Needs evolution
Rapid Software Testing (RST) | James Bach, Michael Bolton | Medium | Needs evolution
ISTQB / ISO 29119 | Standards bodies | Low | Limited
Framework by framework
School of thought
Rapid Software Testing (RST)
James Bach · Michael Bolton

RST frames testing as a cognitive skill, not a procedure. It draws a sharp distinction between checking (automated verification of known expectations) and testing (a human intellectual activity involving exploration, judgment, and learning). It rejects scripted testing as sufficient and insists that good testing requires intelligent adaptation to context.

Converges with CE

RST's distinction between checking and testing maps almost directly onto CE. Checking = automated verification of known expectations. Testing = exploration of what you don't yet know. CE's value of exploration over repetition is RST's central argument restated.
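
That distinction can be made concrete. In the sketch below, `apply_discount` is a hypothetical system under test: the first assertion is a check (a known expectation, verified automatically), while the loop is exploration amplified by automation, generating inputs nobody scripted and watching an invariant instead of an exact answer.

```python
import random

def apply_discount(price: float, pct: float) -> float:
    """Hypothetical system under test: apply a percentage discount."""
    return price * (1 - pct / 100)

# Checking: automated verification of a known expectation.
assert apply_discount(100.0, 25) == 75.0

# Exploration, amplified by automation: generate inputs nobody
# scripted and watch an invariant - a discounted price should never
# be negative or exceed the original price.
rng = random.Random(0)
for _ in range(1000):
    price = rng.uniform(0, 10_000)
    pct = rng.uniform(0, 100)
    result = apply_discount(price, pct)
    assert 0 <= result <= price, (price, pct, result)
```

The check can only confirm what you already expected; the invariant probe can surface failures you did not know to look for.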

RST's rejection of scripts and pass rates as quality proxies is foundational to CE's position that activity metrics are not quality metrics.

RST's insistence on judgment, context, and skepticism toward process over thinking resonates deeply with CE's belief in risk intelligence over execution volume.

Diverges from CE

RST is fundamentally a human discipline. Its models of the skilled tester assume a person is always the primary agent of exploration and judgment. CE relaxes that assumption: AI is not the enemy of good testing — it is a tool to be directed, much as an RST practitioner would direct a junior tester.

RST doesn't address what happens when no human authored the code being tested. Its heuristics and oracles are designed for systems humans built and understand; they need significant extension for systems that emerged from model outputs.

RST operates at the individual tester level. CE is also concerned with organizational confidence infrastructure — evals, observability, production telemetry — that sits above any individual tester's work.

Still deeply relevant
  • Checking vs testing distinction — now more important than ever
  • Skepticism toward metrics and coverage targets
  • Exploration as the primary vehicle for finding unknown unknowns
  • Context-sensitivity: no universal best practice
Needs extension in AI era
  • Human-as-primary-agent model doesn't scale to AI generation speeds
  • Heuristics and oracles assume human-authored systems
  • No framework for AI-generated code comprehension gaps
  • Individual skill focus misses systemic confidence infrastructure
⚑ Note on the Medium rating

RST's critical thinking and humanistic approach are not just still relevant — they are among the most important intellectual foundations CE builds on. The insistence that testing requires judgment, skepticism, and genuine curiosity about how systems fail is exactly right, and in an era of AI-generated code it matters more, not less.

What pulls the rating down is RST's deep skepticism of automation and instrumentation. RST tends to treat automated checks as a lesser form of testing — useful but fundamentally limited compared to skilled human exploration. CE disagrees: at AI-era scale, sophisticated automation, telemetry, evaluation pipelines, and data-driven measurement are not compromises — they are the only way to maintain confidence at all. A framework that treats data and measurement as secondary to human craft will struggle to speak to engineering leaders building systems that process billions of events a day.

The core RST insight — thinking matters more than process — travels perfectly into the CE era. The tooling philosophy does not.

School of thought
Modern Testing
Alan Page · Brent Jensen

Modern Testing's seven principles argue that quality is the team's responsibility, not a QA function's. Testers should accelerate the team, act as quality coaches, be data-driven, and continuously improve. It pushes testing from a gate-keeping activity into a capability distributed across the whole engineering team.

Converges with CE

MT's core argument — quality belongs to everyone, not a QA department — is a precondition for CE. You cannot build organizational confidence infrastructure if quality is siloed.

MT's "accelerate the team" principle maps to CE's belief that Confidence Engineering exists to help organizations move faster, not slower. Confidence is an enabler of speed, not a brake on it.

MT's emphasis on being data-driven and questioning the value of testing activities directly echoes CE's rejection of activity metrics.

Diverges from CE

MT focuses on team culture and organizational process. CE goes further: even a perfectly organized team with excellent quality culture faces new problems when AI writes the code and AI evaluates it. The cultural shift MT describes is necessary but not sufficient.

MT's coaching model assumes humans remain the primary authors and verifiers of software. AI authorship breaks the model: who does a quality coach coach when the developer is a language model?

MT doesn't address probabilistic systems or the two-unknowns problem. Its data-driven principle assumes data that is interpretable in traditional ways.

Still deeply relevant
  • Distributed quality ownership — even more critical when AI writes code
  • Testing as acceleration, not gate-keeping
  • Questioning the value of testing activities
  • Data-driven mindset as a foundation
Needs extension in AI era
  • Coaching model breaks when the "developer" is an LLM
  • Doesn't address AI-generated code or verification
  • No framework for probabilistic systems evaluation
  • Culture shift is necessary but not sufficient
School of thought
Context-Driven Testing (CDT)
James Bach · Cem Kaner · Brett Pettichord

CDT holds that there are no universal best practices in software testing — only practices that are good in a given context. Good testing requires skilled people exercising judgment, and the value of any practice depends entirely on the project, product, and people involved. It rejects process dogma and checklist compliance as substitutes for thinking.

Converges with CE

CDT's rejection of best practices maps directly onto CE's rejection of coverage targets and pass rates. There is no universal metric that constitutes confidence. What gives you confidence depends on what you're building, who's using it, and what could go wrong.

CDT's emphasis on skilled judgment over process compliance is the philosophical bedrock CE builds on. Directing AI verification toward meaningful risk is a judgment skill, not a process to follow.

CDT's notion that testing is fundamentally about learning about the product — not just running test cases — aligns with CE's comprehension coverage concern. Testing is partly how you maintain understanding of a system you didn't fully author.

Diverges from CE

CDT is almost entirely concerned with human testing skill applied to human-authored systems. Its rich vocabulary of heuristics, oracles, and test charters was developed for a world where a skilled tester explores a system a human team built. That world still exists — but it no longer describes the whole of software development.

CDT has less to say about organizational confidence infrastructure: the systems of evaluation, observability, and production telemetry that CE treats as first-class engineering concerns.

Still deeply relevant
  • No universal best practices — context always matters
  • Skilled judgment over process compliance
  • Skepticism toward metrics as quality proxies
  • Testing as learning, not just verification
Needs extension in AI era
  • Human-skill focus doesn't address AI-authored systems
  • Heuristics and oracles need extension for probabilistic outputs
  • Limited framework for organizational confidence systems
Practice
Exploratory Testing
Cem Kaner · James Bach · Elisabeth Hendrickson

Exploratory Testing treats test design and test execution as simultaneous rather than sequential activities. The tester learns about the system while testing it, adapting their approach based on what they discover. It is empirical, adaptive, and cognitive — the opposite of scripted test execution.

Converges with CE

CE's value of exploration over repetition is Exploratory Testing restated at a higher level of abstraction. Scripted test execution is repetition; exploratory testing is exploration. CE says repetition should be automated and attention directed toward the unknown.

Exploratory Testing's adaptive, learning-while-doing approach is precisely what's needed for probabilistic systems. You cannot pre-script tests for a system whose outputs vary by definition. You must explore.

The emphasis on unknown unknowns — testing not just what you expect to find, but what you didn't know to look for — is arguably more important when AI generates system behavior that no human anticipated.

Diverges from CE

Traditional Exploratory Testing is a human practice at human speed. A single skilled explorer testing a system for a day will cover a fraction of the behavioral surface area that an AI-assisted exploration can cover in the same time.

CE reframes exploration as something that can be directed and amplified by AI — not replaced by it. The human judgment about where to explore, and what a finding means, remains essential. But the scale of exploration is no longer limited to human bandwidth.

Still deeply relevant
  • Adaptive, learning-based approach to unknown systems
  • Critical for probabilistic AI systems that cannot be fully scripted
  • Focus on unknown unknowns — more important than ever
  • Cognitive flexibility as a core testing skill
Needs extension in AI era
  • Human-speed exploration doesn't cover AI-scale behavioral surfaces
  • Needs a framework for AI-assisted exploration at scale
  • Session-based charter model needs adaptation for continuous systems
Practice
Risk-Based Testing
James Bach · James Whittaker · various

Risk-Based Testing argues that testing effort should be allocated in proportion to risk — the likelihood and impact of failure. Rather than attempting to cover everything, teams identify what could go wrong, assess severity and probability, and invest testing effort accordingly. It is the argument against uniform coverage made into a practice.
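
A minimal sketch of that allocation logic, with invented areas and scores: rank by likelihood times impact, then let testing effort follow the ranking rather than spreading it uniformly.

```python
# Risk-based allocation: rank areas by likelihood x impact and direct
# testing effort proportionally. Area names and scores are illustrative.
areas = {
    # area: (likelihood of failure 0-1, impact if it fails 1-10)
    "payment flow":   (0.30, 10),
    "search ranking": (0.50, 6),
    "settings page":  (0.10, 2),
    "data export":    (0.20, 7),
}

def risk_score(likelihood: float, impact: float) -> float:
    return likelihood * impact

budget_hours = 40
total = sum(risk_score(l, i) for l, i in areas.values())
plan = {
    name: round(budget_hours * risk_score(l, i) / total, 1)
    for name, (l, i) in sorted(
        areas.items(), key=lambda kv: -risk_score(*kv[1])
    )
}
for name, hours in plan.items():
    print(f"{name:15s} {hours:5.1f}h")
```

The settings page gets almost nothing; the payment flow gets the most. That asymmetry, not the total number of tests, is the point of the practice.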

Converges with CE

Risk-Based Testing is arguably the direct ancestor of CE's risk intelligence principle. "Direct verification toward meaningful risk" is risk-based testing elevated to a first principle of the discipline.

The underlying logic — not all parts of a system are equally worth testing, and resources should follow risk — is foundational to CE and becomes even more important when AI generates vast amounts of code faster than teams can evaluate all of it.

Diverges from CE

Traditional risk-based testing still tends to use test counts and coverage as the unit of investment once risk areas are identified. CE argues that these are the wrong measures even after you've correctly identified the risk.

Risk-based testing models typically assume you can enumerate risks upfront with reasonable completeness. In AI-generated systems that evolve continuously, risk maps become stale almost immediately. Risk identification itself must become continuous.

Still deeply relevant
  • Prioritizing by risk, not by coverage — more critical as code volume explodes
  • Explicit framing of likelihood × impact as the basis for investment
  • Rejection of uniform coverage as a goal
Needs extension in AI era
  • Static upfront risk enumeration doesn't work for continuously changing systems
  • Risk identification must itself become continuous and AI-assisted
  • Test counts as the unit of risk coverage must be replaced with behavioral evidence
Process
Agile Testing
Lisa Crispin · Janet Gregory

Agile Testing embeds quality throughout the development cycle rather than at the end. It advocates whole-team quality ownership, shift-left testing, fast feedback loops, and a testing quadrants model that balances business-facing and technology-facing tests, manual and automated, supportive and critical.

Converges with CE

Agile Testing's shift toward whole-team quality ownership and continuous feedback is a necessary precursor to CE. You cannot build confidence infrastructure if testing is a separate department that receives code at the end of a sprint.

The move away from big-bang release testing toward fast, continuous feedback is directionally correct — CE extends this logic all the way to continuous production evaluation.

Diverges from CE

Agile Testing is still fundamentally sprint-structured. Even "continuous testing" in Agile means testing within each iteration. CE argues that sprint boundaries are as artificial as release gates — software changes continuously, and confidence must be maintained continuously.

The testing quadrants model assumes a relatively stable classification of test types. It has no accommodation for AI-generated tests of AI-generated code, nor for the behavioral distributions that characterize probabilistic systems.

Agile Testing's shift-left assumes humans shift left. When AI generates code at any moment, "left" has no clear location.

Still relevant
  • Whole-team quality ownership — foundational
  • Fast feedback loops over slow release gates
  • Testing embedded in development, not appended after
Loses relevance in AI era
  • Sprint-based rhythm doesn't match continuous AI-driven change
  • Testing quadrants don't accommodate probabilistic systems
  • Shift-left assumes a linear development flow that AI disrupts
  • Human-centered team model needs significant rethinking
Practice
Test-Driven Development (TDD)
Kent Beck

TDD requires writing a failing test before writing any production code, then writing the minimum code to pass the test, then refactoring. The test suite becomes a specification of behavior, a safety net for change, and a design tool. Red-green-refactor is the rhythm.
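
The rhythm can be shown in miniature. `slugify` below is a hypothetical example, not from any particular codebase: the test comes first and specifies the behavior, then the minimum implementation makes it pass.

```python
import re

# Red: write a failing test first - it specifies the behavior.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces  everywhere ") == "spaces-everywhere"

# Green: write the minimum code that makes the test pass.
def slugify(text: str) -> str:
    # Lowercase, keep alphanumeric runs, join them with hyphens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)

test_slugify()  # passes; now refactor with the test as a safety net
```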

Converges with CE

TDD's idea of specifying behavior before implementing it is perhaps more important in the AI era than it was when Kent Beck invented it. When AI generates the implementation, the human's job is to define the behavioral contract that the AI must satisfy. That is TDD's core intent.

TDD's fast feedback loop — know immediately if something breaks — is the seed of CE's continuous evaluation principle.

Diverges from CE

TDD's unit test focus creates exactly the isolation problem CE addresses. Unit tests verify that components work in isolation; they say nothing about emergent behavior, system interactions, or real-world conditions.

When AI writes both the tests and the code, TDD's verification loop is broken. The tests and implementation may share the same blind spots. A passing TDD suite from an AI pair is not evidence of confidence in the same way a human-authored TDD suite is.

TDD also assumes a deterministic system. The red-green-refactor cycle has no mechanism for probabilistic outputs or behavioral distributions.
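
One way to adapt the cycle, sketched below with a stand-in for a probabilistic component: sample the behavior many times and assert properties of the distribution rather than one exact output. The function and its templates are invented for illustration.

```python
import random

def generate_reply(prompt: str, rng: random.Random) -> str:
    """Stand-in for a probabilistic component (e.g. an LLM call):
    the same input produces varying output."""
    templates = ["Sure - {p}", "Here you go: {p}", "Regarding {p}..."]
    return rng.choice(templates).format(p=prompt)

# You cannot assert one exact output, so assert properties of the
# distribution over many samples instead.
rng = random.Random(42)
samples = [generate_reply("refund policy", rng) for _ in range(200)]

# Property 1: every sample stays on topic (contains the subject).
assert all("refund policy" in s for s in samples)

# Property 2: the output actually varies rather than collapsing
# onto a single template.
assert len(set(samples)) > 1
```

There is no single red-to-green moment here; confidence comes from the sampled distribution staying inside its specified envelope.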

Still relevant
  • Behavioral specification before implementation — more important than ever
  • Fast feedback loops as a design principle
  • Tests as a specification of intent, not just a safety net
Loses relevance in AI era
  • Unit test isolation misses emergent behavior by design
  • AI-authored TDD suites don't provide the same confidence as human-authored ones
  • No framework for probabilistic or non-deterministic systems
  • Mechanical red-green-refactor cycle doesn't transfer to AI-assisted development
Practice
Behavior-Driven Development (BDD)
Dan North

BDD extends TDD's intent outward — toward shared understanding between business and engineering. Using Given-When-Then scenario language, it creates living documentation that bridges what the business wants with what the system does. Tools like Cucumber and SpecFlow make these specifications executable.
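
The Given-When-Then shape can be sketched as a plain executable test. `Cart` is a hypothetical domain object; real BDD tooling such as Cucumber would bind these steps to a Gherkin scenario, but the structure is the same.

```python
class Cart:
    """Hypothetical domain object for illustration only."""
    def __init__(self):
        self.items = {}

    def add(self, sku: str, qty: int = 1):
        self.items[sku] = self.items.get(sku, 0) + qty

    def total_items(self) -> int:
        return sum(self.items.values())

def test_adding_to_cart():
    # Given an empty cart
    cart = Cart()
    # When the shopper adds two of the same item
    cart.add("sku-123", 2)
    # Then the cart holds two items
    assert cart.total_items() == 2

test_adding_to_cart()
```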

Converges with CE

BDD's insistence on human-readable behavioral specifications that connect business intent to system behavior is directly aligned with CE's comprehension coverage concern. When no one fully reads the code, the behavioral spec becomes the primary artifact of understanding.

BDD's idea that shared understanding is a precondition for quality maps to CE's argument that confidence requires a legible model of system intent — especially when the author is an AI.

Diverges from CE

BDD's tools and ceremony assume a stable, enumerable set of scenarios that humans collaboratively define. In AI-generated systems that evolve continuously, the scenario set is always incomplete and the living documentation quickly becomes stale.

BDD doesn't accommodate probabilistic outputs. Given-When-Then assumes a deterministic "Then" — that the system always produces the same result for the same inputs. AI systems don't.

The overhead of BDD tooling and ceremony doesn't scale to AI-driven development velocity. By the time a Cucumber scenario is written, reviewed, and automated, the behavior it describes may have already changed.

Still relevant
  • Human-readable behavioral specs as the primary understanding artifact
  • Shared understanding between business and engineering
  • Specification of intent independent of implementation
Loses relevance in AI era
  • Tooling ceremony doesn't match AI development velocity
  • Deterministic Given-When-Then breaks for probabilistic systems
  • Living documentation goes stale faster than AI changes the code
Process
How Google Tests Software
James Whittaker · Jason Arbon · Jeff Carollo

Google's testing approach distributes quality responsibility across three roles — Software Engineers in Test (SET), Test Engineers (TE), and Test Engineering Managers (TEM) — and emphasizes testing as an engineering discipline rather than a QA function. It treats testing at scale as a systems problem and pioneered many automation and exploratory techniques at web-scale.

Converges with CE

Testing as a first-class engineering discipline, not a separate department, is a direct ancestor of CE. The argument that quality requires engineering investment — not just manual testers running test cases — is foundational.

Google's scale-first thinking — how do you maintain confidence across thousands of engineers and millions of lines of code? — is the right framing for the AI era, where the scale problem has simply become more extreme.

The emphasis on production signals and real user data as quality inputs, not just pre-release testing, anticipates CE's continuous evaluation and observability principles.

Diverges from CE

Google's approach was built for a world of large-scale deterministic software written by human engineers. The SET/TE/TEM roles were designed for that world. The AI era removes the assumption that humans wrote — and therefore can reason about — the code.

Google's testing approach still relies heavily on automated test suites as the primary confidence signal. CE argues this is insufficient when AI writes both code and tests, potentially sharing the same blind spots.

Still relevant
  • Testing as engineering, not QA overhead
  • Scale-first thinking about confidence systems
  • Production signals as quality inputs
  • Distributed quality ownership across engineering roles
Needs extension in AI era
  • Role model assumes human authors who can be coached
  • Automated suites as primary signal breaks for AI-co-authored code
  • Designed for deterministic software at human-generation speed
⚑ Insider note

The testing approach in How Google Tests Software describes what most of Google's software engineering looked like — and it was representative of the industry. But it was not the whole picture. The core Search engineering teams operated with a different model: Search Quality Engineers who sat inside the engineering organization, not in testing or QA.

These engineers were responsible for evaluating whether search results were actually good — measuring relevance, freshness, ranking quality, and user satisfaction across billions of queries. They built evaluation frameworks, defined quality metrics, ran large-scale human rating programs, and tracked confidence in the system's behavior over time. They weren't testing for bugs. They were building and maintaining the organization's confidence that the system was doing the right thing at scale.

That is Confidence Engineering — not by name, but in every meaningful sense. And critically, they were just engineers. They weren't part of the testing field, didn't use testing frameworks, and weren't measured by test coverage. They were measured by whether the product was actually getting better in ways real users experienced. The Search Quality model is arguably the closest real-world precursor to what CE describes — and it emerged not from testing practice, but from the demands of operating a complex, probabilistic, continuously-changing system at planetary scale.
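
A toy version of that kind of evaluation, with invented queries and ratings: human raters score the top results per query, and the eval aggregates a ranking-quality metric (here DCG, which rewards good results near the top) that can be tracked over time.

```python
import math

def dcg(ratings):
    """Discounted cumulative gain: rewards good results near the top."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings))

# Ratings (0=bad .. 3=excellent) for the top 3 results of each query,
# from a hypothetical human rating program.
rated_queries = {
    "cheap flights": [3, 2, 0],
    "python docs":   [3, 3, 1],
    "weather":       [2, 0, 0],
}

scores = {q: dcg(r) for q, r in rated_queries.items()}
mean_quality = sum(scores.values()) / len(scores)
print(f"mean DCG@3 = {mean_quality:.3f}")
```

Nothing here is a pass/fail test. It is a quality signal, and watching it move is how confidence in a probabilistic system is maintained.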

Experience
Human Experience Testing (HXT)
Tariq King

HXT expands the scope of testing beyond the software itself to encompass the complete human experience with a product — including physical, emotional, social, and situational factors. Quality is measured not by whether the software functions correctly, but by whether it serves human needs across all the moments people interact with a brand or service.

Converges with CE

HXT is the framework CE most directly inherits from and credits. CE's belief that quality is a property of experience, not software is HXT's central claim. The scope expansion — from app to full brand experience — is both frameworks' most distinctive departure from traditional testing.

HXT's inclusion of physical, emotional, and situational context directly informs CE's value of operational reality over synthetic certainty. No test environment captures what real people experience in the real world.

HXT's human-first framing maps to CE's confidence framing: what matters is whether the person trusts and can rely on the experience, not whether the system passed its own tests.

Diverges from CE

HXT focuses primarily on the experience dimension of quality. CE extends the analysis into the engineering infrastructure required to maintain confidence — evals, observability, production telemetry, AI verification systems. HXT names the problem; CE also addresses the engineering apparatus required to solve it.

HXT predates the AI-as-author problem. How HXT principles apply when AI generates the experiences humans then have — and the experiences are themselves probabilistic — is an open question both frameworks would need to address together.

Still deeply relevant
  • Quality as human experience — the most durable framing
  • Physical, emotional, situational context in quality assessment
  • Full brand arc, not just software boundary
  • Human-first definition of what "works" means
Needs extension in AI era
  • Doesn't yet address AI-generated experiences and probabilistic outputs
  • Engineering infrastructure for confidence is outside HXT's current scope
Standard
ISTQB / ISO 29119
International Software Testing Qualifications Board · ISO

ISTQB provides a structured, internationally recognized certification framework for software testing. ISO 29119 is a formal testing standard covering test processes, documentation, and terminology. Both aim to create a common, codifiable body of knowledge for the testing profession — a lingua franca for practitioners across industries and organizations.

Converges with CE

ISTQB's foundational vocabulary — test levels, test types, defect lifecycle — provides a shared language that is genuinely useful. CE does not reject the idea of a common vocabulary for testing; it argues that vocabulary needs updating for the AI era.

In regulated and safety-critical environments — medical devices, aviation, nuclear — the rigorous documentation and process requirements that ISTQB and ISO 29119 demand remain important. CE's steelman acknowledges that deterministic verification still provides important guarantees in these contexts.

Diverges from CE

ISTQB treats testing as a codifiable, certifiable body of procedures. CE's foundational claim — informed by RST and CDT — is that testing requires judgment that cannot be reduced to procedure. Coverage percentages and test case counts, the primary metrics of ISTQB-aligned quality programs, are precisely what CE argues don't represent confidence.

ISTQB's certification model assumes testing skill is transferable via structured curriculum. CE treats good testing as contextual judgment that depends on deep domain knowledge and adaptive thinking — not something a syllabus fully conveys.

ISO 29119's heavyweight documentation requirements are incompatible with AI-era development velocity. By the time a formal test plan is approved, the system it describes may have been regenerated.

Still relevant
  • Common vocabulary across the profession
  • Rigorous process requirements in safety-critical, regulated domains
  • Foundational testing concepts that remain valid regardless of methodology
Largely inapplicable in AI era
  • Coverage % and test case counts as primary quality signals
  • Certification model assumes codifiable skill over contextual judgment
  • Documentation overhead incompatible with AI development velocity
  • Assumes deterministic systems and human authors throughout
Where this leaves us

The striking thing about reviewing these frameworks together is how much the best of them already knew. RST, CDT, Exploratory Testing, and Risk-Based Testing each diagnosed the same disease: that what teams measure (coverage, pass rates, test counts) is not what they care about (whether the software works for real people in real conditions). They said this clearly, and the industry largely ignored them.

The AI era doesn't change that diagnosis. It makes ignoring it more dangerous. When humans wrote all the code, the gap between activity metrics and actual confidence was bad but bounded. When AI generates code faster than teams can review, evaluates it with tools that may share its blind spots, and deploys into systems no individual understands end to end — that gap becomes a chasm.

Confidence Engineering inherits the insight of every framework on this page that rejected the illusion of diligence. It tries to name what the discipline looks like when you take that rejection seriously — not just as a philosophy of skilled individual testing, but as an organizational engineering practice built for a world where the authors of software are increasingly algorithms, and the people responsible for confidence must direct, amplify, and evaluate rather than simply execute.

The old frameworks were right. They just weren't finished.