What Is a Counterfactual in Incrementality Testing? How to Audit the Number Behind Your iROAS

Every incrementality result depends on a number you never observed. Not the revenue, not the spend. It is what your sales would have been if the ads had never run. That missing world is the counterfactual. Build it well and incrementality becomes evidence. Build it badly and the iROAS is a confident guess.

Measurement vendors do not sell you lift. They sell you a counterfactual, and the lift is just the subtraction. So the question that decides a read is not "what was the iROAS?" It is "why should I believe the no-ads world you built?"

This is not a statistics lesson. It is the part of incrementality most marketers never get shown: the missing revenue line every iROAS leans on. Get that line wrong and the budget decision is wrong. So here is how to audit it, in the language of the meeting you will actually be in, not the notation.

What is a counterfactual, in one line?

A counterfactual is what your sales would have been without the ads. Incremental lift is just actual sales minus that counterfactual. You never observe it, so it has to be predicted, which means the counterfactual, not the lift, is the number under the hood that can be right or wrong.

This is the fundamental problem of causal inference: for any unit you see one outcome and the other is missing. So a counterfactual is not "what happened without marketing" in some loose sense. It is a specific, testable claim about what would have happened under one clearly defined alternative.

Tools like Google's CausalImpact predict it with a time-series model built from control data, then report the gap. That is related to synthetic-control thinking, since both estimate an unobserved counterfactual, though it is not identical to a classic weighted synthetic control.

Chart: actual sales (observed) plotted above a predicted counterfactual line with an uncertainty band. The shaded gap between the two lines is the estimated incremental lift.

Illustrative, simulated data. Not a client result. The dashed line is a model's prediction of a world that never happened, not a measured fact, so it carries a range, shown as the band around it. Before launch the two lines sit together, and that overlap is the fit check. A gap that clears the band is stronger evidence; a gap inside it is not a clean budget decision.

Read it left to right. Before the campaign the two lines sit together, because the model was tuned to match that period. After launch the actual line pulls away, and the shaded gap is the only revenue the ads created. The gap is the lift. The dashed line is the argument.

Put numbers on it. A brand pauses Meta in 12 markets for four weeks. Actual revenue in those markets lands at $1.8M. The counterfactual, what would have happened with Meta still on, is predicted at $2.0M. That $200K gap is revenue Meta was carrying, and against $90K of paused spend it is a 2.2x incremental read. Then the interval decides what you do. Measured as actual minus counterfactual, a range of minus $360K to minus $50K sits entirely below zero, so Meta is almost certainly incremental. A range of minus $440K to plus $40K includes zero, so the same 2.2x midpoint is a maybe, not a green light. Illustrative figures.
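The arithmetic in that example can be written out as a short sketch. The function names are made up for illustration, the figures are the illustrative ones above, and the gap is taken as actual minus counterfactual, so it is negative when a pause costs revenue.

```python
def pause_test_read(actual, counterfactual, paused_spend):
    """Return (gap, revenue_carried, iroas) for a pause test.

    gap is actual minus counterfactual, so it is negative when
    pausing the channel cost revenue.
    """
    gap = actual - counterfactual
    revenue_carried = -gap  # revenue the paused channel was driving
    iroas = revenue_carried / paused_spend
    return gap, revenue_carried, iroas


def interval_includes_zero(low, high):
    """True when 'no effect' is still inside the plausible range."""
    return min(low, high) <= 0 <= max(low, high)


gap, carried, iroas = pause_test_read(1_800_000, 2_000_000, 90_000)
print(f"gap: {gap:,}  carried: {carried:,}  iROAS: {iroas:.1f}x")

# The two illustrative intervals, as actual minus counterfactual.
tight_has_zero = interval_includes_zero(-360_000, -50_000)
wide_has_zero = interval_includes_zero(-440_000, 40_000)
print("tight interval includes zero:", tight_has_zero)
print("wide interval includes zero:", wide_has_zero)
```

The midpoint is identical in both cases. Only the zero check differs, and that check is what separates "almost certainly incremental" from "maybe".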

The words, in plain English

Counterfactual: The revenue line you never got to observe. What would have happened without the ads.
Treatment: The thing you changed. Paused Meta, cut spend, dropped prospecting, swapped creative.
Donor pool: The control markets used to build the counterfactual.
Pre-period fit: Whether the model could match reality before the test started.
Placebo test: A fake test that checks whether the model finds lift where nothing happened.
Interval: The range of plausible answers, not just the iROAS midpoint.

Why does attribution dodge the counterfactual?

Attribution assigns credit to the touchpoints a customer saw. It never asks whether the sale would have happened anyway, because it has no no-ads world to compare against. So it counts the customers who passed the ad, while incrementality asks which ones actually needed it to buy.

Gordon and colleagues compared observational measurement against randomized experiments at Facebook, across 15 studies. The observational methods often failed to recover the experimental effect, even after conditioning on rich demographic and behavioral data. For purchase outcomes, in about half the studies the estimated lift was off by a factor of three (Gordon et al., 2019, Marketing Science).

Off, not always up. Attribution often overstates and sometimes understates, but it rarely lands on the causal number, because it has no counterfactual. It counts; it does not compare.

What must a good counterfactual show?

Before you trust an iROAS, make the vendor show you five things about the counterfactual behind it: what changed, what built the missing world, whether it could predict before the campaign, what has to be true, and how wide the answer is. If any are fuzzy, the number is not a measurement yet. It is a midpoint with good posture.

1. What exactly changed?

"Without ads" is too broad to act on. Pin the treatment down: no Meta at all, no prospecting, no spend above a baseline, or a full pause with everything else left running. Each is a different missing world with a different number, and the read only means something once you know which one you bought. Ask your vendor: what exactly changed in the test, and what stayed the same?

2. What built the missing world?

For a geo test, the counterfactual is built from control markets, so the donor pool is the read. Which markets were eligible, and which got cut for promos, stockouts, or contamination from the campaign? This is where vendor content goes quiet, and our practitioner's guide to weighted synthetic controls walks through that selection. Ask your vendor: which markets built the counterfactual, and which did you exclude, and why?

3. Could it predict before the campaign started?

This is the one that separates a real counterfactual from a good-looking one. You do not need to read the code. You need to know whether the model could predict a held-back slice of the pre-period it was never shown. A model that cannot predict the past has not earned the right to explain the present. Ask your vendor: did the model predict data it had not seen yet, and what was the out-of-sample error?

4. What would have to be true?

Every counterfactual rests on assumptions, and the good ones are named out loud. For a geo test: the control markets tracked the test market before launch, nothing spilled between them, and no shock hit only the test market. Abadie's review of the method is blunt that donor-pool quality and pre-period fit decide credibility (Abadie, 2021). Ask your vendor: what has to hold for this read to be valid, and did you check it?

5. How wide is the answer?

An iROAS is a range with a midpoint, and it has two fragile parts: the incremental revenue on top, from the counterfactual, and the spend underneath it. A 2.1x with a tight interval around profitable values is a decision. A 2.1x from 0.4x to 5.8x is a hypothesis with a nice-looking midpoint. The point estimate gets the meeting. The interval makes the decision. Ask your vendor: what is the iROAS range, and does the low end still support the move?
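One way to make "does the low end still support the move?" concrete is to turn the revenue interval into an iROAS interval and compare the whole range against a break-even level. This is a minimal sketch under stated assumptions: the function names, the revenue and spend figures, and the 1.0x break-even are hypothetical, and spend is treated as known exactly, which the paragraph above warns is not always true.

```python
def iroas_range(rev_low, rev_high, spend):
    """Convert an incremental-revenue interval into an iROAS interval."""
    if spend <= 0:
        raise ValueError("spend must be positive")
    low, high = sorted((rev_low / spend, rev_high / spend))
    return low, high


def classify_read(low, high, break_even):
    """Label a read by where the whole range sits against break-even."""
    if low >= break_even:
        return "decision"          # even the low end supports the move
    if high < break_even:
        return "below break-even"  # even the high end falls short
    return "hypothesis"            # the range straddles break-even


BREAK_EVEN = 1.0  # hypothetical; in practice it depends on margin
SPEND = 100_000   # hypothetical spend

tight = iroas_range(180_000, 240_000, SPEND)
wide = iroas_range(40_000, 580_000, SPEND)

for name, (low, high) in (("tight", tight), ("wide", wide)):
    label = classify_read(low, high, BREAK_EVEN)
    print(f"{name}: {low:.1f}x to {high:.1f}x -> {label}")
```

The two ranges share a midpoint near 2.1x to 3.1x territory, yet only the tight one clears break-even at its low end. Raising the break-even to reflect thin margins can flip a "decision" back into a "hypothesis".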

Right question, or the wrong one?

A counterfactual can be right and still answer the wrong question. The model measures one specific change to one specific KPI, and the business quietly uses it to justify a different decision. So before you audit the model, audit the decision it is about to drive.

If the question is "should we scale Meta prospecting?", total revenue may be the wrong read. You may need new-customer revenue or contribution margin, because a channel that mostly reshuffles existing buyers can look fine on blended revenue and still lose money at the margin.

If the question is "can we cut Meta by 30%?", a full pause answers a more extreme question than the one you are deciding. Sales can hold up under a trim and still drop under a shutoff, and the reverse can happen too.

If the question is "what is next quarter's budget?", a four-week test is evidence, not a permanent truth. The counterfactual is a claim about one window, and the business shifts underneath it.

So audit the decision first. What action will this number change? What KPI decides whether that action worked? What range would still make the call obvious? Get those straight and the model has a clear job. Skip them and even a perfect counterfactual answers a question nobody asked.

How do counterfactuals lie?

Counterfactuals rarely fail loudly. They fail by looking reasonable. The pre-period fit looks clean, the model finds lift, the iROAS looks profitable. Every one can be an artifact: a contaminated control, a promo, or an overfit model, not real incremental revenue.

The donor markets saw the same campaign, so the control absorbed some treatment and the gap shrank. The treated region also ran a promo, so the lift is really the promo. The synthetic control tracks revenue beautifully, but only because it overfit pre-period noise. A stockout or seasonality break hit the treated market alone. The KPI is wrong: total revenue when the decision needs new-customer revenue or margin.

Each of these produces a clean-looking chart. That is the point. The question is never "did the model produce a counterfactual?" Any model can. The question is "what would have to be true for this one to be believable?"

How do you validate one before moving budget?

You validate a counterfactual the way you validate any forecast, out of sample, then you stress it for the specific ways it breaks. Fit it on part of the pre-period, predict the held-back slice, run placebo tests, and read the interval. Pre-period prediction is necessary evidence, not proof.

Fit on most of the pre-period and predict the slice you held back. Teams track the out-of-sample error (mean absolute percentage error) and how much movement the model captures (R squared). If it cannot predict a period it did not cause, stop there.
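The holdout check can be sketched in a few lines. Everything here is an assumption for illustration: the weekly numbers are made up, the model is a single-predictor least-squares line from one control series, and the 8/4 split is arbitrary. A production counterfactual model would be richer, but the validation logic has the same shape.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def r_squared(actual, predicted):
    """Share of the holdout's variation the predictions capture."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up weekly revenue (in $K), pre-period only: no campaign effect here.
control = [100, 104, 98, 110, 107, 115, 120, 112, 118, 125, 121, 130]
test = [52, 53, 50, 56, 54, 58, 61, 57, 60, 63, 61, 66]

split = 8  # fit on the first 8 weeks, hold back the last 4
intercept, slope = fit_line(control[:split], test[:split])
predicted = [intercept + slope * c for c in control[split:]]

holdout_mape = mape(test[split:], predicted)
holdout_r2 = r_squared(test[split:], predicted)
print(f"holdout MAPE: {holdout_mape:.2f}%  R squared: {holdout_r2:.3f}")
```

The point of the split is that the model never sees weeks 9 to 12 while fitting, so the error it reports is out of sample rather than a measure of how well it memorized the pre-period.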

If it can, you have earned the right to keep testing, not to believe. The post-period can still break if the treated market hits a unique shock, the controls get contaminated, or the relationship shifts. So run placebo tests on untreated markets and fake treatment dates, and confirm they show no systematic or decision-changing effect. Then report the range. A gap that clears the interval is stronger evidence; a gap inside it is not a clean budget decision. The real question is whether the range changes what you would do once margin, spend, and risk are in view.
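A placebo test on fake treatment dates can be sketched the same way: pretend the campaign started at several dates when nothing actually changed, and see how much "lift" the model reports. This is a minimal sketch with made-up data, a single-control least-squares model, and a hypothetical 2% tolerance. All three are assumptions, not a standard.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def placebo_gap(control, untreated, fake_start, window):
    """Percent gap the model 'finds' in a window where nothing changed."""
    intercept, slope = fit_line(control[:fake_start], untreated[:fake_start])
    stop = fake_start + window
    actual = untreated[fake_start:stop]
    predicted = [intercept + slope * c for c in control[fake_start:stop]]
    return 100 * (sum(actual) - sum(predicted)) / sum(predicted)


# Made-up weekly revenue (in $K) for two markets that were never treated.
control = [100, 104, 98, 110, 107, 115, 120, 112, 118, 125, 121, 130]
untreated = [52, 53, 50, 56, 54, 58, 61, 57, 60, 63, 61, 66]

TOLERANCE_PCT = 2.0  # hypothetical: set it below the lift that would change the decision
gaps = {
    start: placebo_gap(control, untreated, start, window=3)
    for start in (6, 7, 8, 9)
}

for start, gap in gaps.items():
    print(f"fake start week {start}: placebo gap {gap:+.2f}%")

placebos_pass = all(abs(g) < TOLERANCE_PCT for g in gaps.values())
print("placebos pass:", placebos_pass)
```

Size is only half of the check. If every placebo gap leans the same direction, that pattern is worth questioning even when each gap is small, because it suggests a consistent bias rather than noise.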

Across 225 DTC brands that chose to test, the median iROAS came back at 2.31x, with the middle half between 1.36x and 3.24x. That set is self-selected, so it is not a market average or a universal success rate. These are measurement-sophisticated advertisers, and 88.4% of the tests cleared significance at 90% confidence, so read it as what disciplined testing tends to find, not a promise. The spread is still the point: even validated counterfactuals produce a range, and our 2025 DTC incrementality benchmarks show the out-of-sample checks behind those reads.

What should you ask your vendor?

The questions above belong in the meeting, not in a report you read afterward. If you only ask one, ask what exactly was withheld or changed to build the counterfactual, because a vendor who cannot name the missing world cannot defend the iROAS. "Our model handles that" is a deflection, not an answer.

Want the full audit and to see how Stella answers it on real data? Start on our incrementality page.

FAQ

What is a counterfactual in simple terms?

It is what would have happened if you had not run your ads. You cannot see it, because your ads did run, so you estimate it. Your incremental lift is the difference between what actually happened and that estimated no-ads world.

What's the difference between a counterfactual and attribution?

Attribution assigns credit to touchpoints a customer saw. A counterfactual asks whether the sale would have happened anyway. Attribution has no no-ads world to compare against, which is why observational methods often miss the experimental effect. In one large comparison across 15 studies, the estimated lift for purchase outcomes was off by a factor of three in about half of them (Gordon et al., 2019).

How do you know a counterfactual is trustworthy?

Check four things: the treatment is defined precisely, the pre-period fit is close, the model predicts a held-back pre-period it never saw, and placebo tests on untreated markets show no systematic effect. Then read the interval, not just the midpoint.

Can a counterfactual be right and still mislead?

Yes. A counterfactual answers one specific question, like the effect of pausing Meta on total revenue. If your real decision needs new-customer revenue or contribution margin, a technically correct read can still point you the wrong way. Audit the decision, not just the model.

Can you build a counterfactual without a control group?

Yes, but it is weaker. A pre/post model predicts the counterfactual from a test region's own history. With no control regions, anything else that changed during the campaign (seasonality, a promo, a competitor's move) gets misread as lift.

Is a synthetic control the same as a counterfactual?

No. A synthetic control is one method for building a counterfactual, not the counterfactual itself. It blends untreated regions into a weighted stand-in for your test region. The counterfactual is the concept, and synthetic control is a way to predict it.
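To make "a weighted stand-in" concrete, here is a toy sketch of the blending step. It is not how production tools fit a synthetic control: real implementations solve a constrained optimization, while this one swaps in a coarse grid search over non-negative weights that sum to 1. The market names, the revenue figures, and the function names are all hypothetical.

```python
from itertools import product


def synthetic_weights(test, donors, step=0.05):
    """Grid-search non-negative donor weights summing to 1 that best
    match the test market's pre-period (smallest squared error)."""
    names = list(donors)
    steps = round(1 / step)
    best_weights, best_err = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(names) - 1):
        if sum(combo) > steps:
            continue  # the last weight would be negative
        weights = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        err = sum(
            (t - sum(w * donors[n][i] for w, n in zip(weights, names))) ** 2
            for i, t in enumerate(test)
        )
        if err < best_err:
            best_weights, best_err = dict(zip(names, weights)), err
    return best_weights, best_err


def blend(weights, donors):
    """Apply the weights to donor series to get the stand-in series."""
    length = len(next(iter(donors.values())))
    return [sum(weights[n] * donors[n][i] for n in weights) for i in range(length)]


# Made-up pre-period revenue (in $K). The test market is built as
# 60% of A plus 40% of B so the right answer is known in advance.
pre_donors = {
    "A": [100, 110, 105, 120],
    "B": [50, 60, 55, 65],
    "C": [70, 72, 90, 75],
}
pre_test = [80, 90, 85, 98]

weights, fit_error = synthetic_weights(pre_test, pre_donors)
print("weights:", weights)

# Post-period donor revenue (made up) gives the predicted counterfactual.
post_donors = {"A": [125, 130], "B": [70, 72], "C": [80, 85]}
counterfactual = blend(weights, post_donors)
print("counterfactual:", [round(v, 1) for v in counterfactual])
```

The weights are fit only on the pre-period and then frozen. The counterfactual is whatever those frozen weights produce from the donors' post-period revenue, which is why donor contamination or a donor-specific shock flows straight into the read.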

The counterfactual is the argument

A counterfactual is not a chart. It is an argument about a world that never happened, and every incrementality result stands or falls on it. So before you move budget, ask to see four things: the counterfactual, the validation, the interval, and the failure modes. If any of them are missing, the iROAS is not evidence yet. It is a number asking to be believed.

The next time a dashboard hands you an iROAS, the move is not to trust it or reject it. It is to say "show me the counterfactual." If you want that built and stress-tested on your own numbers, see it on your data.