Engineering Practitioner Brief / 18 May 2026

Test Debt Cost

A team without enough tests pays the cost in three places: more defects reaching production, longer mean-time-to-recover, and slower feature delivery as engineers move cautiously through code they cannot verify. This page sizes all three for a typical 10-engineer team and walks through how to recover from a low-coverage baseline without freezing feature work.

Annual cost of test debt: $625K to $1.1M (for a 10-engineer team under 40% coverage)

Cost of fixing a defect late: 50x to 200x (production fix vs design-time fix, per Boehm)

The Three Cost Vectors

Vector 1: Defect Escape Rate

The defect-escape rate is the percentage of defects introduced in a release that reach production rather than being caught earlier. IEEE-published comparisons across multiple organisations show that low-coverage codebases (under 40 percent) let 8 to 20 percent of injected defects escape to production, while high-coverage codebases (above 70 percent) let 1 to 4 percent escape. For a team merging 100 changes per sprint, and assuming for simplicity about one injected defect per change, the difference between a 12 percent and a 3 percent escape rate is 9 production defects per sprint that did not have to happen.

The cost of those 9 extra defects follows the well-known Boehm curve. A defect fixed at design costs roughly 1x. The same defect found by a unit test costs 5x. By QA, 10x. In production, somewhere between 50x and 200x depending on customer impact. The 200x case is rare (security breach, data corruption), but the 50x case is the everyday one: investigation time, fix, regression test backfill, customer-support load, on-call interruption, post-mortem. At $85 per engineer-hour, a single production defect that consumes 8 engineer-hours across all those activities (a conservative estimate) costs $680. Nine of them per sprint is $6,120, or roughly $159,000 per year over 26 two-week sprints, just from the additional escape rate.
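The arithmetic for this vector can be reproduced with a short sketch. The function name and parameters are illustrative; the one-defect-per-change ratio and the 26 two-week sprints per year are assumptions needed to arrive at the figures above, not values taken from the cited studies.

```python
def extra_escape_cost(defects_per_sprint=100, low_cov_escape=0.12,
                      high_cov_escape=0.03, hours_per_defect=8,
                      hourly_rate=85, sprints_per_year=26):
    """Return (extra escaped defects per sprint, cost per sprint, cost per year)."""
    # Assumption: roughly one injected defect per merged change.
    extra = round(defects_per_sprint * (low_cov_escape - high_cov_escape))
    per_sprint = extra * hours_per_defect * hourly_rate
    return extra, per_sprint, per_sprint * sprints_per_year


extra, per_sprint, per_year = extra_escape_cost()
print(extra, per_sprint, per_year)  # 9 6120 159120
```

Changing hours_per_defect is the quickest way to see how sensitive the annual figure is to the 8-hour estimate.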

Vector 2: Mean Time to Recover

The DORA State of DevOps reports have established a consistent pattern: teams with strong test investment have MTTR measured in hours; teams without have MTTR measured in days. The cost is partly the customer-facing downtime, partly the engineer-hours spent investigating during the incident, and partly the secondary incidents that arise from rushed fixes. For a B2B SaaS at $5M ARR, a single 4-hour outage carries roughly $15,000 in total impact. Pro-rata revenue alone (4 of 8,760 hours) accounts for only about $2,300 of that; the remainder has to come from the engineer-hours and secondary-incident costs listed above. A team with 2x worse MTTR (8 hours instead of 4) pays roughly double per incident. Across a year of typical incident frequency (8 to 20 P1 incidents), the difference is $120,000 to $300,000.
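A minimal sketch of the recovery-cost arithmetic follows. It assumes incident cost scales linearly with outage duration, which is a simplification, and all function names are illustrative. The pro-rata helper is included because revenue share alone comes out well below the $15,000 per-incident figure, which therefore has to be read as a total-impact estimate.

```python
HOURS_PER_YEAR = 8760


def pro_rata_revenue(arr, outage_hours):
    """Share of annual recurring revenue attributable to the outage window."""
    return arr * outage_hours / HOURS_PER_YEAR


def annual_mttr_penalty(cost_per_incident, incidents_per_year, mttr_ratio=2.0):
    """Extra yearly cost for a team whose MTTR is mttr_ratio times the baseline.

    Assumption: incident cost grows linearly with time to recover.
    """
    return cost_per_incident * (mttr_ratio - 1) * incidents_per_year


print(round(pro_rata_revenue(5_000_000, 4)))   # 2283
print(annual_mttr_penalty(15_000, 8))          # 120000.0
print(annual_mttr_penalty(15_000, 20))         # 300000.0
```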

Vector 3: Feature Delivery Tax

The largest of the three vectors. Engineers in low-coverage codebases move slower because every change carries unverifiable risk. Pull requests sit longer in review because reviewers cannot trust the change. Changes are smaller because larger changes feel unsafe. The cumulative effect, per multiple industry productivity studies and the Stripe Developer Coefficient finding that engineers spend 17.3 hours a week on maintenance and debt-related work, is 20 to 35 percent reduction in feature throughput. At $160K fully-loaded per engineer, that is $32,000 to $56,000 per engineer per year of lost capacity. Across 10 engineers, $320K to $560K. This is the bulk of the test-debt cost.
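The delivery-tax range, and a rough sum of the three vectors, can be checked as follows. This is a sketch: the constants for the first two vectors are simply the figures quoted in the text, and the names are illustrative.

```python
def delivery_tax(engineers=10, loaded_cost=160_000, loss_range=(0.20, 0.35)):
    """Return the (low, high) annual cost of lost feature capacity."""
    return tuple(round(engineers * loaded_cost * loss) for loss in loss_range)


# Vector totals as quoted in the text, (low, high) in dollars per year.
ESCAPE = (159_000, 159_000)
MTTR = (120_000, 300_000)

tax = delivery_tax()
total = tuple(e + m + t for e, m, t in zip(ESCAPE, MTTR, tax))
print(tax)    # (320000, 560000)
print(total)  # (599000, 1019000)
```

The sum lands close to the headline range; it excludes the flake tax, which is sized separately.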


The Boehm Defect Cost Curve

Barry Boehm's research, first published in 1976 and updated repeatedly through the 2000s, established that the cost to fix a defect grows roughly exponentially with the development phase in which it is found. The exact multiples vary by study, but the shape is consistent: late discovery is expensive, very late discovery (after release) is dramatically expensive. Modern variants of the curve (NIST, IBM Systems Sciences Institute) confirm the same shape with somewhat different numbers.

Phase defect discovered   | Relative cost (Boehm baseline) | Typical activities
Design / requirements     | 1x                             | Conversation, doc update, sketch change
Code (unit test)          | 5x                             | Fix the function, update the test, push the PR
Integration / QA          | 10x                            | Reproduce, isolate, fix, regression-test, re-deploy to staging
User acceptance test      | 25x                            | Add stakeholder coordination cost
Production                | 50x to 200x                    | Incident response, customer comms, post-mortem, regression backfill

The economic argument for tests is essentially the Boehm curve: a defect caught at the unit stage (5x) instead of in production (50x to 200x) costs 10 to 40 times less to fix. Tests do not need a high catch rate to pay back; even a single prevented production incident often pays for the whole test suite's authoring time.
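The phase multipliers are easy to encode as a lookup, which makes the ratio between any two phases explicit. The dictionary keys and the function name below are illustrative; the values are the baseline multiples from the curve above.

```python
# (low, high) relative cost per phase, from the Boehm baseline above.
BOEHM_MULTIPLIER = {
    "design": (1, 1),
    "unit_test": (5, 5),
    "integration_qa": (10, 10),
    "uat": (25, 25),
    "production": (50, 200),
}


def late_fix_ratio(early, late):
    """Return (low, high): how many times more the late fix costs than the early one."""
    early_lo, early_hi = BOEHM_MULTIPLIER[early]
    late_lo, late_hi = BOEHM_MULTIPLIER[late]
    return late_lo / early_hi, late_hi / early_lo


print(late_fix_ratio("unit_test", "production"))  # (10.0, 40.0)
print(late_fix_ratio("design", "integration_qa"))  # (10.0, 10.0)
```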


The Flake Tax

A flaky test is a test that fails sometimes without any change to the code under test. Google's 2016 Testing Blog post Flaky Tests at Google and How We Mitigate Them reported that almost 16 percent of Google's tests showed some level of flakiness at the time. Each flaky run costs CI minutes (re-run), engineer attention (was it real?), and over time, trust in the test suite. The last cost is the worst because engineers stop treating failures as signal and instead retry until green, which then masks real regressions.

The flake tax for a 1,000-test suite with 5 percent of tests being flaky, run 20 times per day per engineer, and a 3-minute mean CI run-time, comes out to roughly 30 minutes per engineer per day of CI delay alone, plus an additional 15 minutes of engineer attention investigating which failures were real. Across 10 engineers and a 200-day work year, this is 1,500 engineer-hours per year, or roughly $128,000 at $85 per hour. The cost of fixing flakes is a fraction of this. A typical flake-detection-and-fix campaign of 100 flaky tests takes 200 to 400 engineer-hours to resolve and pays back in under 6 months.
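The yearly flake-tax figure follows directly from the per-day minutes quoted above. The sketch below reproduces it and also shows what the 30-minute delay implies: at 3 minutes per run, it corresponds to about 10 re-runs out of 20 daily runs, an inference the text does not state directly. Names are illustrative.

```python
def flake_tax(ci_delay_min=30, attention_min=15, engineers=10,
              work_days=200, hourly_rate=85):
    """Return (engineer-hours per year, dollars per year) lost to flaky tests."""
    hours = (ci_delay_min + attention_min) * engineers * work_days / 60
    return hours, hours * hourly_rate


def implied_reruns(ci_delay_min=30, run_minutes=3):
    """Number of daily re-runs per engineer implied by the CI delay figure."""
    return ci_delay_min / run_minutes


print(flake_tax())       # (1500.0, 127500.0)
print(implied_reruns())  # 10.0
```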


Cost of Adding Tests to a 30%-Covered Legacy Module

Concrete sizing for the most common scenario: a module that has accumulated over years, has minimal tests, and now needs coverage as part of an active refactor or compliance requirement. Assume a 4,000-line module at 30 percent line coverage, target 70 percent.

Lines to cover: 4,000 x (70% - 30%) = 1,600 additional lines under test. A rough rule of thumb (drawn from time-tracked instrumentation studies of test-writing) is one engineer-hour per 8 to 15 LOC of new test coverage, depending on how testable the module is. For a moderately testable legacy module, call it 12 LOC per hour. That is 133 engineer-hours, or roughly $11,300 at $85 per hour.

The catch: that estimate assumes the module is already testable. If the module has hidden dependencies (a global singleton, file-system I/O at module load, time-based code without an injectable clock), the first cost is making it testable. That refactor-to-testability cost is typically 30 to 50 percent additional on top of the test-writing cost. So the full sizing for the 4,000-line module lands at roughly $14,700 to $17,000.
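The sizing above can be written as a small estimator so the inputs can be swapped for another module. This is a sketch under the stated rule of thumb; the function and parameter names are illustrative.

```python
def backfill_cost(loc=4000, current_pct=30, target_pct=70, loc_per_hour=12,
                  hourly_rate=85, testability_overhead=(0.30, 0.50)):
    """Return (lines to cover, base cost, (low, high) cost with testability refactor)."""
    lines = loc * (target_pct - current_pct) // 100
    base = lines / loc_per_hour * hourly_rate
    with_overhead = tuple(round(base * (1 + o)) for o in testability_overhead)
    return lines, round(base), with_overhead


print(backfill_cost())  # (1600, 11333, (14733, 17000))
```

Dropping loc_per_hour to 8 models a module that is harder to test and raises the base cost to $17,000 before any testability refactor.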

For comparison, the carrying cost of that module at 30 percent coverage, applied across the team that touches it, is roughly $20,000 to $40,000 per year in the productivity-tax vector alone. Payback period: under 6 months at the upper end of that carrying-cost range, and roughly 8 to 10 months at the lower end. See legacy code refactoring cost for the related arithmetic of refactoring with versus without a safety net.
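Payback depends on which end of each range applies, so it is worth computing the corners. A minimal sketch, assuming the carrying cost disappears entirely once the module reaches the target coverage (an optimistic simplification); the sizing bounds are the $11,300 base plus 30 and 50 percent, rounded.

```python
def payback_months(one_off_cost, annual_saving):
    """Months until a one-off investment is recovered by a constant annual saving."""
    return 12 * one_off_cost / annual_saving


SIZING = (14_700, 17_000)          # test backfill plus testability refactor
CARRYING_COST = (20_000, 40_000)   # per year, productivity tax only

for cost in SIZING:
    for saving in CARRYING_COST:
        print(cost, saving, round(payback_months(cost, saving), 1))
# 14700 20000 8.8
# 14700 40000 4.4
# 17000 20000 10.2
# 17000 40000 5.1
```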


Why 100 Percent Coverage is the Wrong Target

Coverage above roughly 85 percent runs into diminishing returns. The remaining 15 percent is typically error-handling for impossible-in-practice conditions, defensive code that should probably be deleted, and deeply-nested branches that exercise rare combinations of inputs. Writing tests for these is expensive (often hours per percentage point of additional coverage) and the tests are brittle (they break on any refactor because they pin implementation details rather than behaviour).

The pragmatic target is differential: critical paths (payment, authentication, data corruption surface) near 100 percent, ordinary business logic in the 60 to 75 percent band, infrastructure glue code in the 40 to 60 percent band. Aggregate target around 70 percent for most codebases. The Google Software Engineering book recommends this kind of stratified target rather than a single global number.
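One way to enforce a stratified target is a per-tier floor checked in CI. The sketch below is hypothetical throughout: the tier names, the 95 percent floor standing in for "near 100 percent", and the sample modules and their coverage numbers are all illustrative rather than taken from any real project or tool.

```python
# Minimum line coverage per tier, using the lower bound of each band in the text.
# Assumption: "near 100 percent" is interpreted here as a 95 percent floor.
TIER_FLOOR = {"critical": 95, "business": 60, "glue": 40}


def coverage_failures(modules):
    """Return the names of modules whose coverage is below their tier's floor.

    modules: iterable of (name, tier, line_coverage_pct) tuples.
    """
    return [name for name, tier, pct in modules if pct < TIER_FLOOR[tier]]


# Made-up sample data for illustration only.
sample = [
    ("payments/charge.py", "critical", 98.0),
    ("auth/session.py", "critical", 81.5),
    ("reports/monthly.py", "business", 66.0),
    ("infra/retry.py", "glue", 35.0),
]
print(coverage_failures(sample))  # ['auth/session.py', 'infra/retry.py']
```

Failing the build only on per-tier floors avoids the incentive, common with a single global number, to pad coverage on easy code while critical paths stay thin.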


Frequently Asked Questions

How much does test debt cost per sprint?

For a 10-engineer team, a low-coverage codebase (under 40 percent line coverage with no characterization tests on legacy modules) typically loses 20 to 35 percent of feature throughput. At a $160K fully-loaded engineer cost and 26 two-week sprints a year, that alone is about $12,000 to $22,000 per sprint. Adding the defect-escape and recovery costs brings the total to roughly $24,000 to $42,000 per sprint, or $625K to $1.1M per year.

What is the optimal test coverage target?

There is no single optimum. Google's internal data, published in their Software Engineering at Google book, suggests 60 to 75 percent line coverage as a healthy band for most codebases, with critical paths near 100 percent and incidental code below 50 percent. Above 85 percent the marginal cost of additional coverage exceeds the marginal benefit in nearly every reported case.

Why is fixing bugs late so expensive?

Barry Boehm's defect-cost-curve research from the 1970s through the 2000s consistently shows defects cost roughly 10x more to fix in QA than at design, and 50x to 200x more in production. The cost includes the original fix, the incident response time, the customer support load, the regression-prevention test that should have existed in the first place, and the reputational cost.

Are flaky tests as expensive as missing tests?

Often more expensive over time. A flaky test trains engineers to ignore failures, which then masks real defects. Google's 2016 Testing Blog post 'Flaky Tests at Google and How We Mitigate Them' reported that almost 16 percent of its tests showed some level of flakiness. Each retried CI run costs minutes and engineer attention. A 1,000-test suite with 5 percent of its tests flaky costs roughly 30 minutes of CI delay per developer per day.

Should I add tests to legacy code before refactoring?

Yes for any refactor that changes observable behaviour, no for purely-mechanical changes verified by codemod and type-check. The pattern is Michael Feathers's: write characterization tests for the current behaviour first, then refactor with the safety net of those tests catching regressions. The cost is real but the cost of refactoring without that safety net is higher.
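A characterization test pins what the code does today, quirks included, rather than what a spec says it should do. The sketch below uses a hypothetical legacy function as a stand-in; the expected values are recorded by running the current code once and copying its output.

```python
import unittest


def legacy_shipping_fee(weight_kg, express):
    """Hypothetical stand-in for untested legacy code, quirks and all."""
    fee = 4.0 + 1.5 * weight_kg
    if express:
        fee = fee * 2 - 1  # odd, but this is the behaviour being pinned
    return round(fee, 2)


class ShippingFeeCharacterization(unittest.TestCase):
    # Recorded from the current implementation, not derived from a spec.
    RECORDED = {
        (0, False): 4.0,
        (2, False): 7.0,
        (2, True): 13.0,
        (10, True): 37.0,
    }

    def test_current_behaviour_is_pinned(self):
        for (weight, express), expected in self.RECORDED.items():
            with self.subTest(weight=weight, express=express):
                self.assertEqual(legacy_shipping_fee(weight, express), expected)
```

Run it with python -m unittest before starting the refactor; any failure afterwards means observable behaviour changed, which is either a regression or a deliberate change that needs the recorded value updated.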

What does the DORA research say about test investment?

The Google Cloud DORA State of DevOps reports consistently identify continuous testing as one of the practices most strongly correlated with elite-performer status (high deployment frequency, low lead time, low change failure rate, fast MTTR). The correlation does not prove causation, but the absence of test investment is the single most common feature of underperforming engineering organisations.

Updated 2026-04-27