Perfect software isn't achievable, but confidence is. Ship with real evidence, not optimism and vanity metrics.
Through work on real-world projects, we have come to value:
94% code coverage and 10,000 passing unit tests are not confidence — they are the illusion of diligence. Running more tests is not the same as understanding more risk. Real confidence is grounded in evidence of actual behavior under real conditions. The question is never "how many?" — it's "do we know enough to ship?"
Modern systems fail less from single defects and more from unintended interactions between independently functioning components, agents, models, and services.
AI systems, distributed systems, and continuously evolving architectures require ongoing evaluation, monitoring, and adaptation — not one-time verification. As systems become probabilistic and autonomous, evaluations themselves become core engineering infrastructure that must evolve continuously with the systems they cover.
AI is not the enemy of Confidence Engineering. AI dramatically expands the ability to explore, simulate, analyze, and evaluate systems at scales impossible for humans alone.
Repetitive execution will increasingly be automated. Human attention — and AI-powered exploration — becomes most valuable when directed toward ambiguity, unintended behavior, adversarial thinking, and unknown unknowns. That requires visibility across code, telemetry, production behavior, and AI orchestration layers, not just a passing test suite.
Your unit tests pass against a mock. Your staging environment has different data. Your users are doing things you didn't anticipate. And a person's confidence in a brand spans far beyond the app — the website, the in-store experience, the delivery, the support call, every touchpoint away from a screen. The measure of quality is not whether your system passes its own tests — it's whether it works for the people depending on it, across the full arc of how they experience it.
AI systems rarely produce the same output twice. Pass/fail test assertions assume deterministic behavior — an assumption that no longer holds. Confidence in AI-generated and AI-operated systems requires statistical evaluation, behavioral distributions, and confidence intervals, not green checkboxes.
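One way to put this into practice is to replace a single assertion with a statistical gate: run the nondeterministic component many times and decide based on the observed success rate. The sketch below is illustrative only — `generate` is a stand-in for a real model call, and the 200-trial / 90%-threshold numbers are assumptions, not recommendations:

```python
import random

random.seed(0)  # fixed seed so this sketch is reproducible

def generate(prompt: str) -> str:
    # Stand-in for a nondeterministic AI component: it returns the
    # correct answer most of the time, but not always.
    return "4" if random.random() < 0.97 else "5"

def is_acceptable(output: str) -> bool:
    # Behavioral check: does this particular output satisfy the requirement?
    return output == "4"

def statistical_gate(trials: int = 200, threshold: float = 0.90) -> tuple[float, bool]:
    # Run many trials and gate on the empirical success rate,
    # not on a single deterministic pass/fail assertion.
    successes = sum(is_acceptable(generate("What is 2 + 2?")) for _ in range(trials))
    rate = successes / trials
    return rate, rate >= threshold

rate, ship = statistical_gate()
print(f"observed success rate: {rate:.2%}, gate passed: {ship}")
```

The output of such a gate is a distribution summary ("we pass 96% of the time"), which is the kind of evidence a team can actually calibrate a ship decision against.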
As software generation accelerates and systems grow more complex and autonomous, a new engineering discipline is taking shape. We call it Confidence Engineering — and the people who practice it are Confidence Engineers.
A Confidence Engineer is not a tester who writes scripts. They are not a QA analyst checking requirements. They are an engineer who owns the question that every team is now forced to answer: do we actually know enough about this system to ship it?
They understand risk, not just coverage. They think in systems, not scenarios. They use AI as an amplifier, not a replacement. They are as comfortable in production telemetry and exploring AI coding agent contexts as they are in a test plan. They own the evaluation infrastructure — the evals, the behavioral benchmarks, the observability pipelines — and treat that infrastructure as a product, not an afterthought. And they know that the most dangerous moment in software development is when a team believes their green dashboards.
This role already exists in practice — in the engineers who push back when the coverage number looks fine but something feels wrong, in the teams who run chaos experiments before launches, in the people who ask "what could actually go wrong?" when everyone else is asking "are we done?"
We believe it's time to name it, own it, and build a discipline around it.
Confidence Engineering is not an argument that traditional testing was misguided or without value. Deterministic testing, regression suites, exploratory investigation, and human judgment evolved because they solved real problems in systems that changed more slowly, behaved more predictably, and required explicit evidence of control. Many of those techniques still provide important guarantees today, especially in regulated and safety-critical environments.
But testing itself has always been somewhat nondeterministic in nature — we simply built processes and tooling that treated software as more stable and fully understandable than modern systems increasingly are. AI exposes that reality rather than creating it.
The argument is that software systems are now evolving faster than the verification models built around them. AI-assisted evaluation, autonomous exploration, probabilistic analysis, and continuous observability can increasingly evaluate broader behavioral spaces faster, cheaper, and often more effectively than large portions of traditional scripted testing alone.
Confidence Engineering does not reject prior disciplines; it argues that they are no longer sufficient by themselves for increasingly autonomous, probabilistic, and highly interconnected systems.
Add your name alongside engineers, testers, and builders who share this vision.
Spread the word — the more engineers who stand behind this, the louder the signal.
Confidence was chosen deliberately. It carries no legacy baggage — it isn't tied to "QA," "testing," or any particular methodology or toolchain. It isn't an AI-specific term, but it sits naturally alongside AI: confidence scores, model confidence, probabilistic outputs. That adjacency matters as AI becomes central to how software is built and verified.
More importantly, confidence is what engineering and business actually want. Not a coverage number. Not a pass rate. Not a green pipeline. Those are proxies — means to an end. The end is being able to say: we are confident this works, in production, for real users, under real conditions. Every metric in software quality is ultimately an attempt to approximate that feeling. Confidence Engineering names the goal directly.
This also applies recursively: every metric should itself be held to a confidence standard. How confident are we that this coverage number means what we think it means? How confident are we that these passing tests reflect real behavior? The discipline isn't just about measuring software — it's about knowing how much to trust the measurements.
A note on confidence intervals: in statistics, a confidence interval is a range of values that likely contains the true value of something you're measuring — paired with a probability, like "we are 95% confident the true defect rate is between 0.2% and 0.8%." It quantifies uncertainty honestly rather than collapsing it into a single point estimate. That's the spirit of Confidence Engineering: not a binary pass/fail, but a calibrated, evidence-based statement about how much you can rely on what you've built.
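For concreteness, here is how such an interval can be computed. This sketch uses the Wilson score interval for a binomial proportion — a standard choice that behaves better than the naive normal approximation when defect counts are small. The defect and trial counts are hypothetical:

```python
import math

def wilson_interval(defects: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion.
    # z = 1.96 corresponds to roughly 95% confidence.
    p = defects / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Hypothetical numbers: 5 defects observed across 1,000 production runs.
low, high = wilson_interval(defects=5, trials=1000)
print(f"95% CI for defect rate: {low:.2%} to {high:.2%}")
```

The point estimate alone (0.5%) hides how little 1,000 runs actually tell you; the interval makes that uncertainty explicit, which is exactly the calibrated statement Confidence Engineering asks for.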