Confidence Engineering didn't emerge from nowhere. It inherits decades of testing thought — and departs from it in specific ways. Here is where it converges, where it diverges, and what the AI era changes about each.
The major schools of software testing — from Context-Driven Testing to Modern Testing to Rapid Software Testing — were developed by serious practitioners who were right about most things, most of the time. Many of their core insights hold up under the pressure of the AI era. Some do not.
The most honest framing is this: the old frameworks diagnosed the right disease — the over-reliance on metrics, the misalignment between what teams measure and what users experience, the failure of gate-based thinking. Confidence Engineering extends that diagnosis to a new set of conditions: AI-generated code that no one fully reads, probabilistic systems that cannot be deterministically verified, and software that evolves faster than any team can review.
This is not a rejection of what came before. It is an attempt to name what comes next.
| Framework | Origin | CE Alignment | AI Era Relevance |
|---|---|---|---|
| Google Testing Approach | Whittaker, Arbon, Carollo | High | Strong in parts |
| Human Experience Testing (HXT) | Tariq King | High | Still strong |
| Modern Testing (MT) | Alan Page, Brent Jensen | High | Still strong |
| Context-Driven Testing (CDT) | Bach, Kaner, Pettichord | High | Still strong |
| Exploratory Testing | Cem Kaner | High | Still strong |
| Risk-Based Testing | Bach, Whittaker, various | High | Still strong |
| Agile Testing | Crispin, Gregory | Medium | Needs evolution |
| Behavior-Driven Development (BDD) | Dan North | Medium | Needs evolution |
| Test-Driven Development (TDD) | Kent Beck | Partial | Needs evolution |
| Rapid Software Testing (RST) | James Bach, Michael Bolton | Medium | Needs evolution |
| ISTQB / ISO 29119 | Standards bodies | Low | Limited |
RST frames testing as a cognitive skill, not a procedure. It draws a sharp distinction between checking (automated verification of known expectations) and testing (a human intellectual activity involving exploration, judgment, and learning). It rejects scripted testing as sufficient and insists that good testing requires intelligent adaptation to context.
RST's distinction between checking and testing maps almost directly onto CE. Checking = automated verification of known expectations. Testing = exploration of what you don't yet know. CE's value of exploration over repetition is RST's central argument restated.
RST's rejection of scripts and pass rates as quality proxies is foundational to CE's position that activity metrics are not quality metrics.
RST's insistence on judgment, context, and skepticism toward process over thinking resonates deeply with CE's belief in risk intelligence over execution volume.
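To make the checking/testing distinction concrete, here is a minimal sketch in Python. The function and test are hypothetical, invented purely for illustration:

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical domain function, used only to illustrate the distinction."""
    return price * (1 - percent / 100)

# A *check*: automated verification of a known expectation.
# It can only confirm or refute what someone already thought to ask.
def test_discount_is_applied():
    assert apply_discount(price=100, percent=50) == 50

# *Testing*, in RST's sense, happens around the check: noticing that
# percent=-10 or percent=150 were never specified, asking what should
# happen there, and turning that learning into new checks.
```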
RST is fundamentally a human discipline. Its models of the skilled tester assume a person is always the primary agent of exploration and judgment. CE extends this: AI is not the enemy of good testing — it is a tool to be directed, much like RST would direct a junior tester.
RST doesn't address what happens when no human authored the code being tested. Its heuristics and oracles are designed for systems humans built and understand; they need significant extension for systems that emerged from model outputs.
RST operates at the individual tester level. CE is also concerned with organizational confidence infrastructure — evals, observability, production telemetry — that sits above any individual tester's work.
RST's critical thinking and humanistic approach are not just still relevant — they are among the most important intellectual foundations CE builds on. The insistence that testing requires judgment, skepticism, and genuine curiosity about how systems fail is exactly right, and in an era of AI-generated code it matters more, not less.
What pulls the rating down is RST's deep skepticism of automation and instrumentation. RST tends to treat automated checks as a lesser form of testing — useful but fundamentally limited compared to skilled human exploration. CE disagrees: at AI-era scale, sophisticated automation, telemetry, evaluation pipelines, and data-driven measurement are not compromises — they are the only way to maintain confidence at all. A framework that treats data and measurement as secondary to human craft will struggle to speak to engineering leaders building systems that process billions of events a day.
The core RST insight — thinking matters more than process — travels perfectly into the CE era. The tooling philosophy does not.
Modern Testing's seven principles argue that quality is the team's responsibility, not a QA function's. Testers should accelerate the team, act as quality coaches, be data-driven, and continuously improve. It pushes testing from a gate-keeping activity into a capability distributed across the whole engineering team.
MT's core argument — quality belongs to everyone, not a QA department — is a precondition for CE. You cannot build organizational confidence infrastructure if quality is siloed.
MT's "accelerate the team" principle maps to CE's belief that Confidence Engineering exists to help organizations move faster, not slower. Confidence is an enabler of speed, not a brake on it.
MT's emphasis on being data-driven and questioning the value of testing activities directly echoes CE's rejection of activity metrics.
MT focuses on team culture and organizational process. CE goes further: even a perfectly organized team with excellent quality culture faces new problems when AI writes the code and AI evaluates it. The cultural shift MT describes is necessary but not sufficient.
MT's coaching model assumes humans remain the primary authors and verifiers of software. AI authorship breaks the model: who does a quality coach coach when the developer is a language model?
MT doesn't address probabilistic systems or the two-unknowns problem. Its data-driven principle assumes data that is interpretable in traditional ways.
CDT holds that there are no universal best practices in software testing — only practices that are good in a given context. Good testing requires skilled people exercising judgment, and the value of any practice depends entirely on the project, product, and people involved. It rejects process dogma and checklist compliance as substitutes for thinking.
CDT's rejection of best practices maps directly onto CE's rejection of coverage targets and pass rates. There is no universal metric that constitutes confidence. What gives you confidence depends on what you're building, who's using it, and what could go wrong.
CDT's emphasis on skilled judgment over process compliance is the philosophical bedrock CE builds on. Directing AI verification toward meaningful risk is a judgment skill, not a process to follow.
CDT's notion that testing is fundamentally about learning about the product — not just running test cases — aligns with CE's comprehension coverage concern. Testing is partly how you maintain understanding of a system you didn't fully author.
CDT is almost entirely concerned with human testing skill applied to human-authored systems. Its rich vocabulary of heuristics, oracles, and test charters was developed for a world where a skilled tester explores a system a human team built. That world still exists — but it no longer describes the whole of software development.
CDT has less to say about organizational confidence infrastructure: the systems of evaluation, observability, and production telemetry that CE treats as first-class engineering concerns.
Exploratory Testing treats test design and test execution as simultaneous rather than sequential activities. The tester learns about the system while testing it, adapting their approach based on what they discover. It is empirical, adaptive, and cognitive — the opposite of scripted test execution.
CE's value of exploration over repetition is Exploratory Testing restated at a higher level of abstraction. Scripted test execution is repetition; exploratory testing is exploration. CE says repetition should be automated and attention directed toward the unknown.
Exploratory Testing's adaptive, learning-while-doing approach is precisely what's needed for probabilistic systems. You cannot pre-script tests for a system whose outputs vary by definition. You must explore.
The emphasis on unknown unknowns — testing not just what you expect to find, but what you didn't know to look for — is arguably more important when AI generates system behavior that no human anticipated.
Traditional Exploratory Testing is a human practice at human speed. A single skilled explorer testing a system for a day will cover a fraction of the behavioral surface area that an AI-assisted exploration can cover in the same time.
CE reframes exploration as something that can be directed and amplified by AI — not replaced by it. The human judgment about where to explore, and what a finding means, remains essential. But the scale of exploration is no longer limited to human bandwidth.
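Property-based testing is an existing, pre-AI example of what amplified exploration looks like: the human supplies the judgment (which invariant matters), and the machine generates far more probes than a person could. A minimal sketch using Python's Hypothesis library; the function under test is hypothetical:

```python
from hypothesis import given, strategies as st

def normalize_whitespace(text: str) -> str:
    """Hypothetical function under test: collapse whitespace runs to single spaces."""
    return " ".join(text.split())

# The human decides what matters: normalization should be idempotent.
# The machine explores: Hypothesis generates hundreds of inputs,
# including ones no person would think to type.
@given(st.text())
def test_normalization_is_idempotent(text):
    once = normalize_whitespace(text)
    assert normalize_whitespace(once) == once
```

AI-directed exploration extends the same division of labor: human judgment chooses the territory and interprets the findings; the machine supplies the bandwidth.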
Risk-Based Testing argues that testing effort should be allocated in proportion to risk — the likelihood and impact of failure. Rather than attempting to cover everything, teams identify what could go wrong, assess severity and probability, and invest testing effort accordingly. It is the argument against uniform coverage made into a practice.
Risk-Based Testing is arguably the direct ancestor of CE's risk intelligence principle. "Direct verification toward meaningful risk" is risk-based testing elevated to a first principle of the discipline.
The underlying logic — not all parts of a system are equally worth testing, and resources should follow risk — is foundational to CE and becomes even more important when AI generates vast amounts of code faster than teams can evaluate all of it.
Traditional risk-based testing still tends to use test counts and coverage as the units of investment once risk areas are identified. CE argues that these are the wrong measures even after you've correctly identified the risk.
Risk-based testing models typically assume you can enumerate risks upfront with reasonable completeness. In AI-generated systems that evolve continuously, risk maps become stale almost immediately. Risk identification itself must become continuous.
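A sketch of what continuous risk allocation might look like, with hypothetical components and scores; weighting by recent change is one assumed way to keep the ranking fresh, not a prescribed formula:

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    failure_likelihood: float  # 0..1, estimated from history and judgment
    impact: float              # cost of failure, in whatever units the team uses
    recent_churn: int          # e.g. commits touching this component recently

def risk_score(c: Component) -> float:
    # Classic risk = likelihood x impact, weighted by recent change so the
    # ranking refreshes as the system evolves instead of going stale.
    return c.failure_likelihood * c.impact * (1 + c.recent_churn)

components = [
    Component("checkout", failure_likelihood=0.2, impact=100, recent_churn=12),
    Component("search_ranking", failure_likelihood=0.4, impact=80, recent_churn=30),
    Component("settings_page", failure_likelihood=0.1, impact=5, recent_churn=0),
]

# Verification effort follows the continuously recomputed ranking.
for c in sorted(components, key=risk_score, reverse=True):
    print(f"{c.name}: risk={risk_score(c):.1f}")
```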
Agile Testing embeds quality throughout the development cycle rather than at the end. It advocates whole-team quality ownership, shift-left testing, fast feedback loops, and a testing quadrants model that balances business-facing against technology-facing tests, and tests that support the team against tests that critique the product, whether manual or automated.
Agile Testing's shift toward whole-team quality ownership and continuous feedback is a necessary precursor to CE. You cannot build confidence infrastructure if testing is a separate department that receives code at the end of a sprint.
The move away from big-bang release testing toward fast, continuous feedback is directionally correct — CE extends this logic all the way to continuous production evaluation.
Agile Testing is still fundamentally sprint-structured. Even "continuous testing" in Agile means testing within each iteration. CE argues that sprint boundaries are as artificial as release gates — software changes continuously, and confidence must be maintained continuously.
The testing quadrants model assumes a relatively stable classification of test types. It has no accommodation for AI-generated tests of AI-generated code, nor for the behavioral distributions that characterize probabilistic systems.
Agile Testing's shift-left assumes humans shift left. When AI generates code at any moment, "left" has no clear location.
TDD requires writing a failing test before writing any production code, then writing the minimum code to pass the test, then refactoring. The test suite becomes a specification of behavior, a safety net for change, and a design tool. Red-green-refactor is the rhythm.
TDD's idea of specifying behavior before implementing it is perhaps more important in the AI era than it was when Kent Beck invented it. When AI generates the implementation, the human's job is to define the behavioral contract that the AI must satisfy. That is TDD's core intent.
TDD's fast feedback loop — know immediately if something breaks — is the seed of CE's continuous evaluation principle.
TDD's unit test focus creates exactly the isolation problem CE addresses. Unit tests verify that components work in isolation; they say nothing about emergent behavior, system interactions, or real-world conditions.
When AI writes both the tests and the code, TDD's verification loop is broken. The tests and implementation may share the same blind spots. A passing TDD suite from an AI pair is not evidence of confidence in the same way a human-authored TDD suite is.
TDD also assumes a deterministic system. The red-green-refactor cycle has no mechanism for probabilistic outputs or behavioral distributions.
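One way to see the gap: red-green expects one input, one output, and a binary verdict, while a probabilistic system needs a verdict over a distribution of outputs. A hedged sketch; the stand-in function, sample size, and threshold are all illustrative assumptions:

```python
import random

def summarize(text: str) -> str:
    """Stand-in for a nondeterministic model call (illustrative only)."""
    words = text.split()
    k = random.randint(1, max(1, len(words) // 2))
    return " ".join(words[:k])

# Deterministic red-green has no verdict here: the same input yields
# different outputs, so exact-match assertions are meaningless.

# A distributional check instead: sample N outputs and require a
# property to hold at or above a pass-rate threshold.
def test_summary_is_shorter(n: int = 100, threshold: float = 0.95):
    text = "the quick brown fox jumps over the lazy dog"
    passes = sum(len(summarize(text)) < len(text) for _ in range(n))
    assert passes / n >= threshold
```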
BDD extends TDD's intent outward — toward shared understanding between business and engineering. Using Given-When-Then scenario language, it creates living documentation that bridges what the business wants with what the system does. Tools like Cucumber and SpecFlow make these specifications executable.
BDD's insistence on human-readable behavioral specifications that connect business intent to system behavior is directly aligned with CE's comprehension coverage concern. When no one fully reads the code, the behavioral spec becomes the primary artifact of understanding.
BDD's idea that shared understanding is a precondition for quality maps to CE's argument that confidence requires a legible model of system intent — especially when the author is an AI.
BDD's tools and ceremony assume a stable, enumerable set of scenarios that humans collaboratively define. In AI-generated systems that evolve continuously, the scenario set is always incomplete and the living documentation quickly becomes stale.
BDD doesn't accommodate probabilistic outputs. Given-When-Then assumes a deterministic "Then" — that the system always produces the same result for the same inputs. AI systems don't.
The overhead of BDD tooling and ceremony doesn't scale to AI-driven development velocity. By the time a Cucumber scenario is written, reviewed, and automated, the behavior it describes may have already changed.
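For contrast, here is what a scenario can look like once the Then clause is an oracle over acceptable behavior rather than an exact expectation. The sketch is plain Python rather than Gherkin, and the scoring oracle is a hypothetical stand-in for whatever judge a team actually uses (rules, embeddings, a model-graded rubric):

```python
def answer_question(question: str) -> str:
    """Stand-in for a probabilistic system under test (canned here)."""
    return "Refunds are available within 30 days; after that, store credit."

def relevance_score(question: str, answer: str) -> float:
    """Hypothetical oracle: fraction of required concepts the answer mentions."""
    required = ["refund", "30 days"]
    hits = sum(term in answer.lower() for term in required)
    return hits / len(required)

def test_refund_policy_scenario():
    # Given a customer with a billing question
    question = "Can I get a refund after 30 days?"
    # When they ask the support assistant
    answer = answer_question(question)
    # Then the answer is acceptable, not identical: exact wording varies
    # run to run, so the Then clause becomes a scored judgment.
    assert relevance_score(question, answer) >= 0.8
```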
Google's testing approach distributes quality responsibility across three roles — Software Engineers in Test (SET), Test Engineers (TE), and Test Engineering Managers (TEM) — and emphasizes testing as an engineering discipline rather than a QA function. It treats testing at scale as a systems problem and pioneered many automation and exploratory techniques at web-scale.
Testing as a first-class engineering discipline, not a separate department, is a direct ancestor of CE. The argument that quality requires engineering investment — not just manual testers running test cases — is foundational.
Google's scale-first thinking — how do you maintain confidence across thousands of engineers and millions of lines of code? — is the right framing for the AI era, where the scale problem has simply become more extreme.
The emphasis on production signals and real user data as quality inputs, not just pre-release testing, anticipates CE's continuous evaluation and observability principles.
Google's approach was built for a world of large-scale deterministic software written by human engineers. The SET/TE/TEM roles were designed for that world. The AI era removes the assumption that humans wrote — and therefore can reason about — the code.
Google's testing approach still relies heavily on automated test suites as the primary confidence signal. CE argues this is insufficient when AI writes both code and tests, potentially sharing the same blind spots.
The approach in *How Google Tests Software* describes what most of Google's software engineering looked like, and it was broadly representative of the industry. But it was not the whole picture. The core Search engineering teams operated on a different model: Search Quality Engineers who sat inside the engineering organization, not in testing or QA.
These engineers were responsible for evaluating whether search results were actually good — measuring relevance, freshness, ranking quality, and user satisfaction across billions of queries. They built evaluation frameworks, defined quality metrics, ran large-scale human rating programs, and tracked confidence in the system's behavior over time. They weren't testing for bugs. They were building and maintaining the organization's confidence that the system was doing the right thing at scale.
That is Confidence Engineering — not by name, but in every meaningful sense. And critically, they were just engineers. They weren't part of the testing field, didn't use testing frameworks, and weren't measured by test coverage. They were measured by whether the product was actually getting better in ways real users experienced. The Search Quality model is arguably the closest real-world precursor to what CE describes — and it emerged not from testing practice, but from the demands of operating a complex, probabilistic, continuously changing system at planetary scale.
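A heavily compressed sketch of that pattern, with hypothetical data and a toy metric: run a fixed query set through the system, score results against graded judgments, and gate on the aggregate against a known-good baseline. Any real pipeline (Search's certainly) is far richer:

```python
# Minimal confidence-eval loop; all names and numbers are illustrative.
golden_set = {
    "best pizza near me": {"r12", "r7"},   # ids judged relevant by raters
    "python list sort": {"d3"},
}

def run_system(query: str) -> list[str]:
    """Stand-in for the system under evaluation: returns ranked result ids."""
    canned = {
        "best pizza near me": ["r7", "r99", "r12"],
        "python list sort": ["d3", "d8"],
    }
    return canned[query]

def precision_at_k(results: list[str], relevant: set[str], k: int = 3) -> float:
    top = results[:k]
    return sum(r in relevant for r in top) / len(top)

scores = [precision_at_k(run_system(q), rel) for q, rel in golden_set.items()]
mean = sum(scores) / len(scores)

BASELINE = 0.55  # last known-good aggregate (illustrative)
print(f"mean precision@3 = {mean:.2f} (baseline {BASELINE})")
assert mean >= BASELINE, "quality regression: investigate before shipping"
```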
HXT expands the scope of testing beyond the software itself to encompass the complete human experience with a product — including physical, emotional, social, and situational factors. Quality is measured not by whether the software functions correctly, but by whether it serves human needs across all the moments people interact with a brand or service.
HXT is the framework CE most directly inherits from and credits. CE's belief that quality is a property of the experience, not the software, is HXT's central claim. The scope expansion — from app to full brand experience — is both frameworks' most distinctive departure from traditional testing.
HXT's inclusion of physical, emotional, and situational context directly informs CE's value of operational reality over synthetic certainty. No test environment captures what real people experience in the real world.
HXT's human-first framing maps to CE's confidence framing: what matters is whether the person trusts and can rely on the experience, not whether the system passed its own tests.
HXT focuses primarily on the experience dimension of quality. CE extends the analysis into the engineering infrastructure required to maintain confidence — evals, observability, production telemetry, AI verification systems. HXT names the problem; CE also addresses the engineering apparatus required to solve it.
HXT predates the AI-as-author problem. How HXT principles apply when AI generates the experiences humans then have — and the experiences are themselves probabilistic — is an open question both frameworks would need to address together.
ISTQB provides a structured, internationally recognized certification framework for software testing. ISO 29119 is a formal testing standard covering test processes, documentation, and terminology. Both aim to create a common, codifiable body of knowledge for the testing profession — a lingua franca for practitioners across industries and organizations.
ISTQB's foundational vocabulary — test levels, test types, defect lifecycle — provides a shared language that is genuinely useful. CE does not reject the idea of a common vocabulary for testing; it argues that vocabulary needs updating for the AI era.
In regulated and safety-critical environments — medical devices, aviation, nuclear — the rigorous documentation and process requirements that ISTQB and ISO 29119 demand remain important. CE's steelman acknowledges that deterministic verification still provides important guarantees in these contexts.
ISTQB treats testing as a codifiable, certifiable body of procedures. CE's foundational claim — informed by RST and CDT — is that testing requires judgment that cannot be reduced to procedure. Coverage percentages and test case counts, the primary metrics of ISTQB-aligned quality programs, are precisely what CE argues don't represent confidence.
ISTQB's certification model assumes testing skill is transferable via structured curriculum. CE treats good testing as contextual judgment that depends on deep domain knowledge and adaptive thinking — not something a syllabus fully conveys.
ISO 29119's heavyweight documentation requirements are incompatible with AI-era development velocity. By the time a formal test plan is approved, the system it describes may have been regenerated.
The striking thing about reviewing these frameworks together is how much the best of them already knew. RST, CDT, Exploratory Testing, and Risk-Based Testing each diagnosed the same disease: that what teams measure (coverage, pass rates, test counts) is not what they care about (whether the software works for real people in real conditions). They said this clearly, and the industry largely ignored them.
The AI era doesn't change that diagnosis. It makes ignoring it more dangerous. When humans wrote all the code, the gap between activity metrics and actual confidence was bad but bounded. When AI generates code faster than teams can review, evaluates it with tools that may share its blind spots, and deploys into systems no individual understands end to end — that gap becomes a chasm.
Confidence Engineering inherits the insight of every framework on this page that rejected the illusion of diligence. It tries to name what the discipline looks like when you take that rejection seriously — not just as a philosophy of skilled individual testing, but as an organizational engineering practice built for a world where the authors of software are increasingly algorithms, and the people responsible for confidence must direct, amplify, and evaluate rather than simply execute.
The old frameworks were right. They just weren't finished.