
Manifesto for Confidence Engineering

Perfect software isn't achievable, but confidence is. Ship with real evidence, not optimism and vanity metrics.

[Figure: delivery cadence and complexity over time/scale. As cadence moves from Waterfall to Agile to AI-era continuous merges, total complexity (runtime and scale interactions) outpaces the code generated, and the confidence gap grows exponentially.]

Through work on real-world projects, we have come to value:

Justified confidence over performative coverage

94% code coverage and 10,000 passing unit tests are not confidence — they are the illusion of diligence. Running more tests is not the same as understanding more risk. Real confidence is grounded in evidence of actual behavior under real conditions. The question is never "how many?" — it's "do we know enough to ship?"

Understanding emergent behavior over verifying isolated functionality

Modern systems fail less from single defects and more from unintended interactions between independently functioning components, agents, models, and services.

Continuous evaluation over static validation

AI systems, distributed systems, and continuously evolving architectures require ongoing evaluation, monitoring, and adaptation — not one-time verification. As systems become probabilistic and autonomous, evaluations themselves become core engineering infrastructure that must evolve continuously with the systems they cover.
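
A minimal sketch of what treating evaluations as infrastructure can look like, assuming a text-generating system under evaluation; the names below (EvalCase, run_model, EVAL_CASES, run_eval) are illustrative placeholders, not any particular framework's API:

```python
# Sketch: an evaluation defined as code and run on every deploy (or on a
# schedule), with its score tracked over time instead of asserted once.

from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    must_contain: str  # a behavioral expectation, not an exact expected output

EVAL_CASES = [
    EvalCase("Summarize the refund policy", "refund"),
    EvalCase("Explain the SLA in one sentence", "99.9"),
]

def run_model(prompt: str) -> str:
    # Placeholder for the system under evaluation; in practice this is a call
    # to the model or service being shipped.
    canned = {
        "Summarize the refund policy": "Customers may request a refund within 30 days.",
        "Explain the SLA in one sentence": "We commit to 99.9% monthly uptime.",
    }
    return canned.get(prompt, "")

def run_eval() -> float:
    """Return the fraction of cases meeting their behavioral expectation."""
    passed = sum(case.must_contain in run_model(case.prompt) for case in EVAL_CASES)
    return passed / len(EVAL_CASES)

# In a continuous setup this score is recorded per deploy and compared against
# a rolling baseline, so behavioral regressions surface as trends, not as a
# single red build.
print(f"eval score: {run_eval():.0%}")
```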

AI amplification over AI resistance

AI is not the enemy of Confidence Engineering. AI dramatically expands the ability to explore, simulate, analyze, and evaluate systems at scales impossible for humans alone.

Exploration over repetition

Repetitive execution will increasingly be automated. Human attention — and AI-powered exploration — becomes most valuable when directed toward ambiguity, unintended behavior, adversarial thinking, and unknown unknowns. That requires visibility across code, telemetry, production behavior, and AI orchestration layers, not just a passing test suite.

Operational reality over synthetic certainty

Your unit tests pass against a mock. Your staging environment has different data. Your users are doing things you didn't anticipate. And a person's confidence in a brand spans far beyond the app — the website, the in-store experience, the delivery, the support call, every touchpoint away from a screen. The measure of quality is not whether your system passes its own tests — it's whether it works for the people depending on it, across the full arc of how they experience it.

Probabilistic evaluation over binary assertion

AI systems rarely produce the same output twice. Pass/fail test assertions assume deterministic behavior — a model that no longer holds. Confidence in AI-generated and AI-operated systems requires statistical evaluation, behavioral distributions, and confidence intervals, not green checkboxes.
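
A minimal sketch of one such probabilistic gate, assuming a nondeterministic system sampled repeatedly against a behavioral check; call_system, check_output, and the 60% threshold are hypothetical stand-ins:

```python
# Sketch: instead of asserting one output, sample the system N times, estimate
# the pass rate, and ship only if the lower bound of a 95% confidence interval
# clears a threshold.

import math
import random

def call_system(prompt: str) -> str:
    # Stand-in for a call to a nondeterministic model or service.
    return random.choice(["valid answer", "valid answer", "hallucinated answer"])

def check_output(output: str) -> bool:
    # A behavioral check, not an exact-match assertion.
    return "valid" in output

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = p_hat + z ** 2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
    return (center - margin) / denom

trials = 200
successes = sum(check_output(call_system("same prompt")) for _ in range(trials))
lower = wilson_lower_bound(successes, trials)

# The release decision becomes a calibrated statement ("we are 95% confident
# the true pass rate is at least `lower`"), not a single green or red result.
SHIP_THRESHOLD = 0.60
print(f"observed pass rate: {successes / trials:.2%}, 95% lower bound: {lower:.2%}")
print("ship" if lower >= SHIP_THRESHOLD else "hold")
```

The Wilson interval is used in this sketch because it behaves better than the plain normal approximation at small sample sizes and extreme pass rates; the specific bound and threshold are choices each team calibrates to its own risk.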

We believe
The existing models of software quality are broken. Waterfall assumed software changed rarely, so teams tested it once before release. Agile acknowledged faster change and introduced sprint-based testing cycles. But neither model anticipated what we face now: software that changes continuously, generated faster than any team can review, deploy, or validate using practices designed for a slower era. Yet we have already seen what works. Teams building at scale in exactly these conditions — in search engines, web indexing, and SaaS platforms that iterate without pause, used at scales never originally envisioned, and operating in ways their creators never anticipated — have developed approaches to shipping with genuine confidence under this pressure. The practices exist. It is time for the whole industry to adopt them.
Quality is not a property of software — it is a property of the experience a person has with a brand or service. That experience includes physical spaces, human interactions, hardware, logistics, and moments that no unit test has ever considered. The scope of confidence must expand to match the scope of how people actually live with these systems.
Perfect software isn't achievable — but confidence is. Code coverage percentages, unit test counts, and pass rates are not quality metrics — they are activity metrics. Optimizing for them produces better-looking dashboards, not better software. The goal is software that doesn't fail the people depending on it, and the evidence to know the difference.
The next era of engineering will not be defined by who can generate the most software — it will be defined by who can justify confidence in what they generate.
When AI generates both the code and the verification of that code, you have two unknowns interacting with each other. Neither was fully read or understood by a human. Multi-model pipelines — one LLM generating code, another evaluating it — make this worse: each handoff adds a layer of behavior no human directly authored. Because both models may share the same blind spots, the complexity doesn't just add — it compounds.
Confidence itself is becoming a scarce and strategic engineering resource.
The future role is not to compete with AI on speed or scale, but to direct both AI-powered verification and human exploration of the domain toward meaningful risk.
Evaluations, observability, production intelligence, adversarial analysis, and continuous feedback loops are becoming foundational engineering capabilities.
Organizations will increasingly generate more software than they can meaningfully understand. When AI writes the code, no one fully reads it. No one fully understands it. Confidence Engineering is the discipline that fills that comprehension gap — maintaining a legible model of system intent when the author is an algorithm.
And the discipline emerging to solve this problem is Confidence Engineering.
A new role emerges

The Confidence Engineer

As software generation accelerates and systems grow more complex and autonomous, a new engineering discipline is taking shape. We call it Confidence Engineering — and the people who practice it are Confidence Engineers.

A Confidence Engineer is not a tester who writes scripts. They are not a QA analyst checking requirements. They are an engineer who owns the question that every team is now forced to answer: do we actually know enough about this system to ship it?

They understand risk, not just coverage. They think in systems, not scenarios. They use AI as an amplifier, not a replacement. They are as comfortable in production telemetry and exploring AI coding agent contexts as they are in a test plan. They own the evaluation infrastructure — the evals, the behavioral benchmarks, the observability pipelines — and treat that infrastructure as a product, not an afterthought. And they know that the most dangerous moment in software development is when a team believes their green dashboards.

This role already exists in practice — in the engineers who push back when the coverage number looks fine but something feels wrong, in the teams who run chaos experiments before launches, in the people who ask "what could actually go wrong?" when everyone else is asking "are we done?"

We believe it's time to name it, own it, and build a discipline around it.

⚖ Steelman

Confidence Engineering is not an argument that traditional testing was misguided or without value. Deterministic testing, regression suites, exploratory investigation, and human judgment evolved because they solved real problems in systems that changed more slowly, behaved more predictably, and required explicit evidence of control. Many of those techniques still provide important guarantees today, especially in regulated and safety-critical environments.

But testing itself has always been somewhat nondeterministic in nature — we simply built processes and tooling that treated software as more stable and fully understandable than modern systems increasingly are. AI exposes that reality rather than creating it.

The argument is that software systems are now evolving faster than the verification models built around them. AI-assisted evaluation, autonomous exploration, probabilistic analysis, and continuous observability can increasingly cover broader behavioral spaces faster, cheaper, and often more effectively than traditional scripted testing alone.

Confidence Engineering does not reject prior disciplines; it argues that they are no longer sufficient by themselves for increasingly autonomous, probabilistic, and highly interconnected systems.


Sign the Manifesto

Add your name alongside engineers, testers, and builders who share this vision.


Spread the word — the more engineers who stand behind this, the louder the signal.

Why "Confidence"?

Confidence was chosen deliberately. It carries no legacy baggage — it isn't tied to "QA," "testing," or any particular methodology or toolchain. It isn't an AI-specific term, but it sits naturally alongside AI: confidence scores, model confidence, probabilistic outputs. That adjacency matters as AI becomes central to how software is built and verified.

More importantly, confidence is what engineering and business actually want. Not a coverage number. Not a pass rate. Not a green pipeline. Those are proxies — means to an end. The end is being able to say: we are confident this works, in production, for real users, under real conditions. Every metric in software quality is ultimately an attempt to approximate that feeling. Confidence Engineering names the goal directly.

This also applies recursively: every metric should itself be held to a confidence standard. How confident are we that this coverage number means what we think it means? How confident are we that these passing tests reflect real behavior? The discipline isn't just about measuring software — it's about knowing how much to trust the measurements.

A note on confidence intervals: in statistics, a confidence interval is a range of values that likely contains the true value of something you're measuring — paired with a probability, like "we are 95% confident the true defect rate is between 0.2% and 0.8%." It quantifies uncertainty honestly rather than collapsing it into a single point estimate. That's the spirit of Confidence Engineering: not a binary pass/fail, but a calibrated, evidence-based statement about how much you can rely on what you've built.
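
As a worked illustration, with hypothetical counts of 25 defects observed across 5,000 sessions, such an interval can be computed with the standard normal approximation:

```python
# Worked example: a 95% confidence interval for a defect rate,
# using p_hat ± z * sqrt(p_hat * (1 - p_hat) / n).
import math

defects, sessions, z = 25, 5_000, 1.96   # hypothetical counts; z = 1.96 for 95%
p_hat = defects / sessions               # observed defect rate: 0.50%
margin = z * math.sqrt(p_hat * (1 - p_hat) / sessions)
print(f"{p_hat - margin:.2%} to {p_hat + margin:.2%}")  # ≈ 0.30% to 0.70%
```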