Learn how DTC teams use geo experiments and holdout tests to measure true lift and calculate iROAS with confidence intervals.

Most marketing teams are allocating budget based on data they know is wrong. Attribution shows you correlation. Incrementality testing shows you causation.
The difference is six figures.
Across 225 experiments run through Stella, the median iROAS was 2.31x. The interquartile range was 1.36x to 3.24x. 88.4% of well-designed tests reached statistical significance. That is what a working testing program looks like in practice.
US digital ad spend hit $259 billion in 2024, up nearly 15% year over year. More budget at stake means more exposure to bad decisions. A single misallocated channel can quietly drain six figures before anyone notices.
This post explains how to build a testing program that actually fixes that.
Your tracking setup from two years ago is probably measuring something different than you think. This is not a future problem. It is a current one.
GDPR has applied across the EU since May 2018. The European Data Protection Board finalized updated guidance in October 2024 on how ePrivacy rules apply across all tracking methods, not just cookies. In California, CPRA amendments went into effect January 1, 2023, adding an explicit opt-out right for cross-context behavioral advertising.
Third-party cookies are not deprecated, but the roadmap keeps shifting. Google restricted third-party cookies for 1% of Chrome users starting January 4, 2024 as a testing phase. By July 2024, Google moved to a user-choice model instead of outright deprecation. By April 2025, Google confirmed it would not roll out a standalone prompt for third-party cookies. The UK's Competition and Markets Authority published a formal decision in October 2025 addressing the release of Privacy Sandbox commitments.
The common thread: user-level signal is eroding, platform-reported attribution is increasingly unreliable, and last year's efficiency numbers may not reflect current reality.
Incrementality testing does not depend on cookies or user-level tracking. It works on aggregated or location-based data. Google explicitly positions it as privacy-preserving for this reason. That is not a nice-to-have in 2026. It is a structural advantage.
Incrementality testing is a controlled experiment that measures the causal lift from advertising by comparing outcomes between a group exposed to ads and a group that was not.
Not what your ads touched. Not what they got credit for. What they actually caused.
You split your audience or geography into two groups. One sees your ads. One does not. At the end, you compare outcomes. The difference is your incremental lift.
Google's Conversion Lift documentation describes it clearly: separate customers into comparable geographic groups, measure conversions from the group that saw your ads versus the group that did not, and use the difference to estimate conversions caused by your campaigns.
Clean control group. Real comparison. Causal answer.
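To make the arithmetic concrete, here is a minimal sketch of that comparison, assuming you already have aggregate conversion counts for two comparable groups. The numbers and variable names are illustrative, not from a real test.

```python
# Minimal sketch of the treatment-vs-control comparison. Conversion counts
# are illustrative, and the two groups are assumed to be comparable in size
# (otherwise normalize to a per-capita conversion rate first).

treatment_conversions = 1_480   # group that saw the ads
control_conversions = 1_150     # comparable group that did not

incremental_conversions = treatment_conversions - control_conversions
lift = incremental_conversions / control_conversions

print(f"Incremental conversions: {incremental_conversions}")   # 330
print(f"Lift vs. control: {lift:.1%}")                         # 28.7%
```

The 330 conversions the control group did not produce are the ones your ads actually caused. Everything else would have happened anyway.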
Standard attribution, whether last-click, data-driven, or platform-reported, assigns credit based on correlation. Someone saw your ad and then bought. Attribution says the ad worked.
But maybe they were going to buy anyway.
A peer-reviewed study of large field experiments at Facebook found that ad exposure is non-random. Machine learning delivers ads to people most likely to convert. So the exposed group was already more likely to buy. Attribution cannot separate that from actual ad-driven lift.
This is not a flaw in the tools. It is a structural problem. The only solution is a controlled experiment.
Geo experiments and holdout tests answer the same core question. The setup is different.

Testing awareness-level spend or a new channel? Run a geo experiment. Testing a retargeting campaign with good pixel coverage? Run a holdout test. Most programs run both, in rotation, across different budget questions.
Published Google research on geo experiments defines them as experiments where non-overlapping geographic regions are randomly assigned to control or treatment, with the condition realized via geo-targeted advertising.
The key word is non-overlapping. If regions bleed into each other, your results are noise.
Contamination happens when a user is exposed in a treatment region but converts in a control region, or the reverse. Google Ads help explicitly describes this: contamination reduces the estimated incrementality because the control group is no longer clean.
The fix is intentional region selection. Pick control regions with minimal population overlap with treatment regions. Factor in your delivery method. Plan for a post-test cooldown if your conversion cycle is long.
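One common way to handle that selection, sketched below, is matched-market assignment: pair regions with similar historical conversion volume, then randomly send one of each pair to treatment. The region names and volumes here are made up, and this sketch does not by itself guard against geographic overlap; you still have to exclude adjacent or spillover-prone markets when you build the candidate list.

```python
import random

# Hypothetical matched-market assignment: pair regions on historical
# conversion volume, then randomly assign one of each pair to treatment.
# Assumes an even number of candidate regions.
regions = {
    "Denver": 4_200, "Portland": 4_050,
    "Austin": 6_900, "Charlotte": 6_700,
    "Tampa": 3_100, "Cleveland": 3_000,
}

ranked = sorted(regions, key=regions.get, reverse=True)
pairs = [ranked[i:i + 2] for i in range(0, len(ranked), 2)]

random.seed(42)  # fixed seed keeps the assignment reproducible and auditable
treatment, control = [], []
for pair in pairs:
    random.shuffle(pair)
    treatment.append(pair[0])
    control.append(pair[1])

print("Treatment:", treatment)
print("Control:  ", control)
```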
Google's geo-lift metrics documentation defines the cooldown date range as the period after the test ends when campaigns return to normal, but the system continues collecting data to calculate iROAS. Google explicitly recommends this for advertisers with conversion cycles longer than a few weeks.
If you skip the cooldown, you undercount conversions and your iROAS looks worse than it actually is.
iROAS (incremental return on ad spend) is the number that connects your test to a budget decision.
Google's geo-lift documentation describes it as a point estimate plus a confidence interval. Their example: iROAS of 2.2, confidence interval from 1.3 to 3.5.
Most teams report the 2.2 and move on. That is the mistake.
The interval is the real answer. It tells you the range of outcomes consistent with your data. Ignoring it is like reading a forecast that says "it might be 50 degrees or 90 degrees" and packing for one temperature.
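Here is a simplified sketch of producing both numbers, assuming you already have an incremental revenue estimate per treatment region, with any cooldown-window conversions included. The figures are hypothetical, and resampling regions as if they were interchangeable is a simplification of what dedicated geo-lift tooling does.

```python
import random

# Simplified sketch: iROAS point estimate plus a bootstrap confidence interval.
# Per-region incremental revenue and the test spend are hypothetical.

incremental_revenue_by_region = [4_800, 6_100, 3_900, 7_200, 5_400, 4_100]
test_spend = 14_000  # total treatment-region spend during the test

point_iroas = sum(incremental_revenue_by_region) / test_spend

random.seed(7)
n = len(incremental_revenue_by_region)
boot = []
for _ in range(10_000):
    resample = [random.choice(incremental_revenue_by_region) for _ in range(n)]
    boot.append(sum(resample) / test_spend)
boot.sort()
ci_low, ci_high = boot[250], boot[9_750]   # 2.5th and 97.5th percentiles

print(f"iROAS: {point_iroas:.2f}x (95% CI {ci_low:.2f}x to {ci_high:.2f}x)")
```

If the whole interval clears 1.0x, you have evidence of positive incremental return. If it straddles 1.0x, the honest read is "inconclusive," not "roughly 2.2x."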

Across Stella's 225 experiments, the median iROAS was 2.31x. The IQR of 1.36x to 3.24x shows real variance across channels and brands. Some channels are doing 3x. Some are doing 1.1x. They often look identical in platform dashboards.
Something should always be in test. Not every channel at once, but always something.
A test result reflects a point in time. What worked in Q1 at one spend level may not work in Q3 with different creative, a different competitive environment, or a shifted audience mix. A test from six months ago is a data point, not a standing truth.
Google's geo-lift setup documentation is built around ongoing cadences for a reason. You should have at least one test running or completing every month. Not because testing is the point. Because you are making budget decisions every month, and those decisions should be based on current evidence, not old platform reports.
High-velocity testing compounds. More tests mean more learnings, faster reallocation, and smaller windows of wasted spend. But velocity without design quality is noise. A bad test running every week is worse than a well-designed test running every six weeks.
For brands that need continuous answers rather than periodic snapshots, always-on incrementality maintains a live holdout or geo split that updates your iROAS estimate on an ongoing basis. It costs more to run. It costs far less than discovering mid-year that a channel's efficiency collapsed three months ago and you kept spending anyway.
Building the program itself is simple. No framework names, just a loop:
Step 1: Pick one specific budget question. Not "does paid social work." Something like: does Meta prospecting generate incremental lift at our current $50K/month spend level?
Step 2: Write your decision rule before the test starts. If the full confidence interval is above 1.0x, scale. If it crosses 1.0x, retest. If it is fully below 1.0x, cut. Write this down. Do not change it after seeing results. A code sketch of this rule appears after Step 6.
Step 3: Choose your test type. Geo for channel-level or brand spend. Holdout for campaign or audience-level questions. Use the comparison table above.
Step 4: Design for clean measurement. Non-overlapping regions for geo tests. No overlapping tests on the same audience at the same time. Add a cooldown period for longer conversion cycles. Google's documentation explicitly recommends this.
Note: Google's geo-lift setup documentation states that a campaign can only be active in one study at a time. Plan your calendar accordingly. Stacking tests on the same campaign without a factorial design breaks interpretation.
Step 5: Apply the decision rule. Document everything. Result, context, what changed. This is how a program builds institutional knowledge instead of a folder of one-off test decks.
Step 6: Move to the next question. The value is not any single result. It is knowing more about your channel mix every month than you knew the month before.
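Here is the rule from Step 2 written out as code, so there is no ambiguity when results arrive. The 1.0x threshold and the three actions come directly from the rule above; the function name and the example intervals are ours.

```python
def decide(ci_low: float, ci_high: float, threshold: float = 1.0) -> str:
    """Map an iROAS confidence interval to a pre-registered budget action."""
    if ci_low > threshold:
        return "scale"   # whole interval above 1.0x
    if ci_high < threshold:
        return "cut"     # whole interval below 1.0x
    return "retest"      # interval crosses 1.0x: inconclusive

print(decide(1.3, 3.5))   # scale
print(decide(0.5, 0.9))   # cut
print(decide(0.8, 1.4))   # retest
```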
When you run a holdout test, you intentionally do not show ads to a control group. That group converts less. That shortfall is real money you did not capture during the test.
Finance teams call that waste. It is not.
If a holdout costs $50,000 in short-term revenue and reveals that a $500,000/year channel generates no meaningful incremental lift, you now have a $500,000 decision made with confidence for a $50,000 research cost.
That is a 10x return before the budget moves anywhere better.
The reason DTC teams under-test is usually framing. Holdout cost looks like a sacrifice. It is an insurance premium against much larger misallocation. The test cost is always smaller than the decision it is informing.
Single-channel testing is where most teams start. Portfolio-level testing is where the ROI compounds.
You run a holdout test on non-brand search and find iROAS of 3.1x with the full interval above 1.0x. Solid. You run a holdout on display retargeting and find iROAS of 0.8x, with the full interval below 1.0x. Those retargeting clicks were people who were going to buy anyway.
Now you have a budget decision that is not about opinion. Shift the display budget to search. Retest in 90 days to verify the new baseline.
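A sketch of that portfolio pass, applying the same interval-versus-1.0x rule to both channels. The confidence intervals and monthly budgets are hypothetical; only the point estimates and the direction of each decision mirror the example above.

```python
# Hypothetical portfolio pass over the two test results described above.
results = {
    "non_brand_search":    {"iroas": 3.1, "ci": (1.8, 4.4), "monthly_budget": 40_000},
    "display_retargeting": {"iroas": 0.8, "ci": (0.5, 0.95), "monthly_budget": 25_000},
}

for channel, r in results.items():
    low, high = r["ci"]
    action = "scale" if low > 1.0 else "cut" if high < 1.0 else "retest"
    print(f"{channel}: iROAS {r['iroas']}x, CI {low}x-{high}x -> {action}")

# non_brand_search -> scale, display_retargeting -> cut: shift the $25K/month
# of retargeting budget into search, then retest the new baseline in 90 days.
```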
That is portfolio optimization built on evidence. Not on platform reporting that is incentivized to make every channel look good.
The 2024 IAB/PwC report showed major spend concentrated across search ($102.9B), social ($88.8B), retail media ($53.7B), and digital video ($62.1B). Every one of those categories will tell you it is working. Incrementality tests will tell you what is actually true for your brand at your spend level.

The last row is the most important one. A/B testing tells you which ad performs better within a channel. Incrementality testing tells you whether the channel is worth the money at all. Use both. Do not confuse them.
If you want benchmarks across 225 experiments broken down by channel, spend tier, and industry, start with Stella's incrementality benchmarks.
If you want to understand the full measurement foundation before running your first test, Stella's measurement resources are the right starting point.
Ready to run a test? Schedule a demo or start a free trial.