Engineering Practitioner Brief / 18 May 2026

Parallel Run Refactor Cost

The parallel-run pattern is the highest-confidence refactor verification technique available. Both the old and new implementations execute on every real production request; a comparator records every divergence. The pattern catches behavioural regressions that test suites miss because production traffic contains the edge cases that synthetic tests do not. The cost is real and the benefit is concrete; this page works through both.

The Pattern

The pattern was articulated most clearly by GitHub in 2015 when they published the Scientist Ruby library and the supporting blog post. The Scientist API is built around the experiment metaphor:

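# Pseudocode for the Scientist pattern (requires the scientist gem)
require "scientist"

result = Scientist.run("user-permissions") do |experiment|
  experiment.use { current_implementation(args) }   # control
  experiment.try { new_implementation(args) }       # candidate
end
# 'result' is the value returned by the 'use' block
# Scientist hands the comparison to the experiment's publish method

Both implementations run on every call. The control's result is what the caller sees. The candidate's result is compared to the control and any divergence is published for later analysis; an exception raised by the candidate is recorded rather than propagated to the caller. In the Ruby library the two blocks run sequentially in the calling thread, in randomised order, so a slow candidate does add to user-facing latency. Enabling the experiment for only a sample of calls limits that overhead.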
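The mechanics described above can be sketched without the gem. This is a minimal stand-in, not Scientist's implementation: run_experiment, MISMATCHES and the two permission lambdas are hypothetical names, and both blocks run in the calling thread for simplicity.

```ruby
# Minimal stand-in for the pattern; not the scientist gem's implementation.
# run_experiment, MISMATCHES and the permission lambdas are hypothetical.
MISMATCHES = []

def run_experiment(name, control:, candidate:)
  control_value = control.call
  begin
    candidate_value = candidate.call
    unless candidate_value == control_value
      MISMATCHES << { name: name, control: control_value, candidate: candidate_value }
    end
  rescue StandardError => e
    # A failing candidate must never break the caller; record it instead.
    MISMATCHES << { name: name, control: control_value, error: e.class.name }
  end
  control_value # the caller only ever sees the control's result
end

old_check = ->(user) { user[:role] == "admin" }
new_check = ->(user) { %w[admin owner].include?(user[:role]) }

user = { role: "owner" }
allowed = run_experiment("user-permissions",
                         control: -> { old_check.call(user) },
                         candidate: -> { new_check.call(user) })

puts allowed           # prints "false" (the control's answer)
puts MISMATCHES.length # prints "1" (the candidate disagreed)
```

The caller's behaviour is unchanged by the candidate: a disagreement or an exception only adds an entry to the mismatch log.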

The Scientist pattern has been ported to many languages: JavaScript (scientist.js), Python (laboratory), Java (multiple ports), .NET (Scientist.net), Go (scientist-go). The underlying concept carries over unchanged.

Service-Level Parallel Run with Diffy

Twitter's Diffy generalises the Scientist pattern to whole HTTP services. Diffy sits as a proxy in front of three running services: the current production build (primary), the new candidate, and a control instance of the current production build (which Diffy's documentation calls the secondary). Diffy forwards every request to all three, then compares the candidate's response to the primary's, using the differences between primary and control to subtract out non-deterministic noise.

The control-subtraction step is the clever part. Many services have legitimate non-determinism (timestamps, random IDs, ordering of result lists). Without the control, every response would diverge on these fields and the real signal (actual logic differences) would be drowned in noise. By comparing two production instances against each other in addition to candidate-vs-production, Diffy can identify which divergences are noise and which are signal.
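The subtraction idea can be illustrated on a single request. This is a simplified sketch: Diffy itself aggregates difference rates per field across many requests rather than subtracting sets per request, and differing_fields, signal_fields and the sample payloads are hypothetical.

```ruby
# Simplified per-request noise subtraction; only the core idea.
# differing_fields and signal_fields are hypothetical helper names.
def differing_fields(a, b)
  (a.keys | b.keys).reject { |key| a[key] == b[key] }
end

def signal_fields(primary, control, candidate)
  raw   = differing_fields(primary, candidate) # noise plus real differences
  noise = differing_fields(primary, control)   # same code on both sides, so noise only
  raw - noise
end

primary   = { "total" => 1999, "request_id" => "a1", "served_at" => "12:00:01" }
control   = { "total" => 1999, "request_id" => "b2", "served_at" => "12:00:02" }
candidate = { "total" => 2099, "request_id" => "c3", "served_at" => "12:00:01" }

p signal_fields(primary, control, candidate) # prints ["total"]
```

Here request_id differs in both comparisons and is discarded as noise, while total differs only between primary and candidate and is kept as signal.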

Diffy is open source and still works in 2026, though its commit cadence is low. Several similar tools exist (diffit, replayed, the various traffic-replay frameworks). For most teams the choice is between Diffy and building a service-specific comparator using the same pattern with custom rules for known noise sources.

The Cost Breakdown

Component	One-Time Cost	Recurring Cost
Setup of comparator infrastructure	100 to 500 engineer-hours	Small (logging, alerting)
Wrapping the code path	8 to 40 engineer-hours per refactor	None
Extra compute for candidate	None	~1x marginal compute of the path
Observability data ingest	None	$50 to $500 per service per month
Engineer analysis of divergences	None	20 to 120 engineer-hours over the run

For a typical service-level parallel run lasting four weeks, the total cost is 200 to 800 engineer-hours plus marginal infrastructure. At $85 per hour, that is $17,000 to $68,000 per parallel-run campaign. For high-stakes services (payments, pricing) this is straightforwardly worth doing. For low-stakes services, a feature-flag rollout is usually sufficient.
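The arithmetic can be checked with a few lines. The hour ranges and the $85 rate come from the text and table above; campaign_cost and the constant names are hypothetical.

```ruby
# Figures are taken from the text and table; names are hypothetical.
HOURLY_RATE = 85 # dollars per engineer-hour

def campaign_cost(hours_range, rate = HOURLY_RATE)
  (hours_range.min * rate)..(hours_range.max * rate)
end

cost = campaign_cost(200..800)
puts "$#{cost.min} to $#{cost.max}" # prints "$17000 to $68000"

# Engineer-hour rows of the cost table, summed at both ends of the range.
TABLE_HOURS = { setup: 100..500, wrapping: 8..40, analysis: 20..120 }.freeze
low  = TABLE_HOURS.values.sum(&:min)
high = TABLE_HOURS.values.sum(&:max)
puts "#{low} to #{high} hours" # prints "128 to 660 hours"
```

The itemised rows sum to 128 to 660 hours, so the 200 to 800 headline range presumably includes effort that the table does not break out.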

When Parallel Run Pays Back

Three categories of refactor justify the parallel-run cost:

Pricing and billing logic. Customer-visible dollar amounts that differ between old and new implementations are catastrophic. A single billing-side regression can cost more than the entire parallel-run infrastructure for the year. The conservative pattern is to run pricing changes in shadow mode for 4 to 8 weeks before cutover, comparing every line item.

Regulatory calculations. Tax computations, financial reporting, healthcare claims processing. A divergence between old and new might be invisible to users but visible to auditors. A parallel run with full divergence logging provides the evidence trail that the new system matches the old before cutover is approved.

Machine-learning feature pipelines. Subtle differences in feature engineering between old and new pipelines silently degrade model performance. Parallel-run comparison of feature values for the same input is the only reliable way to verify pipeline parity. The cost is high (feature pipelines have wide outputs) but the cost of silent ML degradation is higher because it surfaces as gradual revenue erosion that is hard to root-cause.

The Non-Determinism Problem

The harder part of any parallel-run setup is handling non-determinism. Most production code has incidental non-determinism that does not reflect logic differences:

Timestamps in responses (always different between calls).
Generated IDs (UUIDs, sequence numbers).
Floating-point arithmetic that may round differently between implementations.
Ordering of items in result lists (set semantics vs list semantics).
Cache state (the candidate has an empty cache; the production instance has a warm one).
External-service dependencies that respond differently on retry.

Each source of non-determinism must either be subtracted out by the comparator (as Diffy does by comparing the primary against the control) or be accepted as background noise, with the divergence threshold raised to match. Either path requires investment in the comparator setup; this is why the one-time setup cost is the largest single line in the parallel-run budget.
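For a service-specific comparator, several of the noise sources listed above can be removed by normalising both responses before comparing them. The following is a sketch under assumptions: the field names, the rounding precision and the response shape are hypothetical, and sorting lists is only valid where order carries no meaning.

```ruby
# Hypothetical normaliser: field names, rounding precision and the
# response shape are assumptions for illustration.
VOLATILE_FIELDS = %w[timestamp request_id].freeze
FLOAT_DIGITS = 6

def normalise(value)
  case value
  when Hash
    value.reject { |key, _| VOLATILE_FIELDS.include?(key) }
         .transform_values { |v| normalise(v) }
  when Array
    # Treat lists as unordered; only valid where order carries no meaning.
    value.map { |v| normalise(v) }.sort_by(&:to_s)
  when Float
    value.round(FLOAT_DIGITS)
  else
    value
  end
end

def equivalent?(control, candidate)
  normalise(control) == normalise(candidate)
end

old_response = { "timestamp" => "2026-05-18T10:00:00Z", "request_id" => "a1",
                 "tax" => 0.1 + 0.2, "items" => %w[b a] }
new_response = { "timestamp" => "2026-05-18T10:00:01Z", "request_id" => "c3",
                 "tax" => 0.3, "items" => %w[a b] }

puts equivalent?(old_response, new_response) # prints "true"
```

Rounding is a coarse tolerance: two values that straddle a rounding boundary still compare as different, so a comparator for monetary fields may prefer an explicit absolute tolerance. Cache state and external-service behaviour cannot be normalised away like this and usually have to be handled through the divergence threshold.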

Exit Criteria

A parallel-run that exits too early misses meaningful divergences. A parallel-run that runs forever consumes engineer attention indefinitely. The standard exit criteria, applied in combination:

Divergence rate has stabilised at or below the agreed threshold for at least 2 consecutive weeks.
All categories of divergence are explained: each one is either accepted (known non-determinism) or fixed (a real bug in the candidate).
The run has covered the major seasonal and event-driven patterns of the service's traffic. A payment service that has not yet seen a Black Friday should not exit parallel run until it has.
The engineers most familiar with the candidate sign off on the comparison results.
The rollback plan for cutover-time issues is documented and rehearsed.

A typical service meets these criteria within 4 to 8 weeks of starting parallel-run. Some services (regulatory, payment-critical, ML pipelines with long evaluation cycles) need 12 to 24 weeks. The length is dictated by the traffic pattern and the divergence rate, not by a fixed calendar.
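The exit criteria can be encoded as a simple readiness check that reports which conditions are still unmet. This is an illustrative sketch: RunStatus, its fields and the 0.1% threshold are hypothetical, and "stable" is reduced here to every one of the most recent weeks being at or below the threshold.

```ruby
# Hypothetical readiness check; the struct fields and the 0.1% threshold
# are assumptions, not values from the text.
RunStatus = Struct.new(:weekly_divergence_rates, :unexplained_categories,
                       :seasonal_peaks_covered, :owners_signed_off,
                       :rollback_rehearsed, keyword_init: true)

def unmet_exit_criteria(status, threshold:, stable_weeks: 2)
  recent = status.weekly_divergence_rates.last(stable_weeks)
  stable = recent.length == stable_weeks && recent.all? { |rate| rate <= threshold }

  unmet = []
  unmet << "divergence rate not stable below threshold" unless stable
  unmet << "unexplained divergence categories remain" unless status.unexplained_categories.empty?
  unmet << "seasonal or event traffic not yet covered" unless status.seasonal_peaks_covered
  unmet << "owning engineers have not signed off" unless status.owners_signed_off
  unmet << "rollback plan not rehearsed" unless status.rollback_rehearsed
  unmet
end

status = RunStatus.new(weekly_divergence_rates: [0.004, 0.0009, 0.0007],
                       unexplained_categories: [],
                       seasonal_peaks_covered: false,
                       owners_signed_off: true,
                       rollback_rehearsed: true)

p unmet_exit_criteria(status, threshold: 0.001)
# prints ["seasonal or event traffic not yet covered"]
```

Returning the list of unmet criteria, rather than a single boolean, keeps the reason for staying in parallel run visible in whatever dashboard or report consumes the check.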

Frequently Asked Questions

What is parallel-run refactoring?

A refactor verification pattern where the new code path runs alongside the original on real production traffic. Both paths execute, the original's result is used, and a comparator records any divergence between the new and original outputs. Once the divergence rate is below an agreed threshold, the new path takes over.

When is parallel run worth the cost?

When correctness matters more than speed of delivery: payment paths, pricing engines, regulatory calculations, and machine-learning feature pipelines. The parallel-run setup cost is non-trivial (typically 100 to 500 engineer-hours per service) but a single avoided correctness incident in production usually pays for the entire investment.

What is GitHub Scientist?

A Ruby library originally written at GitHub for parallel-run experiments. The pattern: wrap a code path in a 'Scientist.run' (or 'science') block with a 'use' (current implementation) and a 'try' (new implementation). Scientist runs both, returns the 'use' result, and records any divergence. The pattern was published in 2015 and ported to many languages.

What is Twitter Diffy?

An open-source HTTP-service diff-comparison tool published by Twitter in 2015. Diffy sits in front of three running services: the current production, the new candidate, and a control instance of the current production. It compares responses between the candidate and the production while using the second production instance as a control to filter out noise from non-determinism. The filtering matters most when real divergences are rare relative to that noise.

How much does parallel-run infrastructure cost?

Two main cost lines. First, the runtime compute for the new path (which does real work but serves no users): roughly one extra copy of the path's marginal compute for the duration of the parallel run, so about double in total. Second, the comparator infrastructure: 100 to 500 engineer-hours for setup, plus $50 to $500 per service per month in observability data ingest during the run.

How long should parallel run last?

Long enough for the production traffic to exercise the meaningful edge cases. Four to eight weeks is typical, with longer windows for services that have seasonal or rare-event traffic patterns. The exit condition is a stable, sufficiently low divergence rate, not a fixed calendar date. Shipping the cutover before the divergence rate stabilises is a common failure pattern.