A Comprehensive Analysis of 225 Geo-Based Incrementality Tests
Stella Incrementality Testing Platform | August 2024 - December 2025
This study analyzes 225 geo-based incrementality tests conducted between August 2024 and December 2025 on Stella's self-service incrementality testing platform, providing comprehensive public benchmarks for digital advertising incremental ROAS across major channels.
Top Findings:
How to Interpret These Benchmarks:
Key Limitations:
Digital advertising measurement suffers from a fundamental problem: correlation does not imply causation. Platform-reported metrics systematically overstate advertising effectiveness by attributing conversions that would have occurred anyway (organic searches, existing brand awareness, direct traffic) to paid campaigns.
Last-click attribution credits the final touchpoint while ignoring the full customer journey. Multi-touch attribution models, though more sophisticated, still rely on observational data and cannot distinguish users who converted because of ads from those who would have converted regardless. View-through windows capture users who saw ads but were already planning to purchase.
These attribution biases lead to systematic overinvestment in channels that appear effective but generate minimal incremental revenue. The gap between platform-reported ROAS and true incremental ROAS often reaches 2-3x, with some channels (particularly branded search and retargeting) showing 5-10x inflation.
Incrementality testing solves this through controlled experimentation. By creating treatment and control groups via geographic randomization, these tests isolate the causal effect of advertising spend. However, running high-quality incrementality tests requires sophisticated statistical methodology, adequate statistical power, and careful experimental design.
Despite growing adoption of incrementality testing, the industry lacks comprehensive public benchmarks. Marketers planning tests face critical questions:
This study addresses these gaps by analyzing 225 incrementality tests using consistent methodology across diverse channels and business verticals, providing transparent, empirically-grounded benchmarks.
This research analyzes tests conducted between August 2024 and December 2025 using Stella's self-service geo-based incrementality testing platform. Data was extracted via SQL query from Stella's BigQuery database, filtering for tests measuring DTC-only revenue (excluding Amazon, retail, and other sales channels).
All tests focus on direct-to-consumer ecommerce brands (90% Shopify, 6% WooCommerce, 3% Salesforce Commerce Cloud, 1% Other) operating primarily in US markets. The dataset represents a mix of fully self-service tests and tests where brands received support from Stella's team, though we lack visibility into the specific proportion.
iROAS = Incremental Revenue / Ad Spend
Incremental revenue is calculated as: (Revenue in Treatment Group - Counterfactual Revenue) where the counterfactual represents what revenue would have been without the advertising treatment.
In this paper, all iROAS values are reported as absolute values (e.g., 2.31x means $2.31 of incremental revenue per $1 spent).
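To make the definition concrete, here is a minimal Python sketch of the calculation; the revenue and spend figures are illustrative, not drawn from the dataset.

```python
# Illustrative iROAS calculation; all figures are hypothetical.
treatment_revenue = 540_000.0       # observed revenue in treatment regions
counterfactual_revenue = 425_000.0  # modeled revenue had ads not run
ad_spend = 50_000.0                 # ad spend during the test period

incremental_revenue = treatment_revenue - counterfactual_revenue
iroas = incremental_revenue / ad_spend

print(f"Incremental revenue: ${incremental_revenue:,.0f}")  # $115,000
print(f"iROAS: {iroas:.2f}x")                               # 2.30x per $1 spent
```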
iROAS (Incremental ROAS): Snapshot measure of how incremental your CURRENT spend level is. Answers: "Is my current $50K/month investment generating positive ROI?"
Marginal ROAS: Forward-looking measure of how effective the NEXT dollar will be. Answers: "If I increase spend to $75K/month, what return will the additional $25K generate?"
Both metrics are essential for optimization. iROAS validates current efficiency; marginal ROAS guides scaling decisions.
Geo Holdout Tests (N=62)
Ads are active in treatment regions and paused in control regions. These tests are designed to measure positive incremental lift by comparing observed revenue in treatment regions against what would have occurred without advertising.
Inverse Holdout Tests (N=163)
Ads are paused in treatment regions and remain active in control regions. These tests measure the revenue lost when advertising is turned off, producing a negative lift signal that is inverted to estimate incremental contribution.
Important clarification:
Geo holdouts look for positive incremental lift from turning ads on, while inverse holdouts look for negative lift from turning ads off. Both designs estimate the same underlying causal effect, just from opposite directions.
In all cases, results are normalized so that higher iROAS always indicates stronger incremental advertising performance, regardless of test direction.
Statistically Significant (90%+ confidence level): Test reached 90%, 95%, or 99% confidence level. Results are reliable for decision-making. Total: 199 tests (88.4%).
Directional (80% confidence level): Test showed consistent directional signal but did not reach 90% threshold. Results suggest a trend but are not conclusive. Total: 13 tests (5.8%).
Not Significant: Test failed to reach 80% confidence level. Results are inconclusive. Total: 13 tests (5.8%).
Critical distinction: Only 90%+ confidence qualifies as "statistically significant" in this paper. 80% confidence is categorized as "directional" and is NOT counted in success rates.
MAPE (Mean Absolute Percentage Error): Measures how accurately the synthetic control matches actual revenue in the pre-treatment period. Lower is better.
R-squared (Coefficient of Determination): Proportion of revenue variance explained by the model. Higher is generally better, but extremely high values (>0.94) may indicate overfitting. Values range 0-1.
Important: R² values exceeding 0.94 are flagged as potentially suspicious. Perfect or near-perfect fit may indicate the model is fitting noise rather than signal, reducing reliability in the post-treatment period.
CI Width = Upper Bound - Lower Bound
Calculated from the 95% confidence interval around the point estimate. Narrower intervals indicate more precise measurement. Median CI width across all tests: 0.89x.
IQR (Interquartile Range): The range between the 25th percentile and 75th percentile. This captures the middle 50% of results and excludes extreme outliers. Reported as [P25 - P75].
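As a reference for how these diagnostics are computed, the short Python sketch below uses illustrative numbers and follows the standard formulas for MAPE, R², CI width, and IQR rather than Stella's internal implementation.

```python
import numpy as np

# Pre-treatment fit diagnostics on illustrative daily revenue (in $K/day).
actual = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.3])
predicted = np.array([11.8, 11.6, 12.6, 12.9, 12.1, 12.2, 13.0])

mape = np.mean(np.abs((actual - predicted) / actual))   # lower is better
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot                          # 0-1, higher is generally better

ci_lower, ci_upper = 1.90, 2.79   # illustrative 95% bootstrap bounds on iROAS
ci_width = ci_upper - ci_lower

iroas_results = np.array([0.8, 1.5, 2.1, 2.3, 2.9, 3.4, 4.8])  # illustrative test results
p25, p75 = np.percentile(iroas_results, [25, 75])               # IQR bounds

print(f"MAPE {mape:.3f} | R² {r_squared:.3f} | "
      f"CI width {ci_width:.2f}x | IQR [{p25:.2f} - {p75:.2f}]")
```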
Valid (N=200, 88.9%): MAPE < 0.30, R² > 0.50, reasonable effect sizes, narrow confidence intervals.
Directional (N=13, 5.8%): Weaker fit or wider intervals, but consistent directional signal.
Invalid (N=10, 4.4%): MAPE > 0.30 or R² < 0.40, indicating poor counterfactual quality.
Possible Overfitted (N=2, 0.9%): Suspiciously perfect pre-test fit (R² > 0.98 with MAPE < 0.02) suggesting model overfitting.
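A rough sketch of how these tiers could be applied programmatically, using only the thresholds stated above (in practice the directional tier also depends on signal consistency, which is not modeled here):

```python
def classify_test(mape: float, r2: float) -> str:
    """Sketch of the quality tiers above; thresholds taken from the text."""
    if r2 > 0.98 and mape < 0.02:
        return "possible_overfitted"   # suspiciously perfect pre-test fit
    if mape > 0.30 or r2 < 0.40:
        return "invalid"               # poor counterfactual quality
    if mape < 0.30 and r2 > 0.50:
        return "valid"
    return "directional"               # weaker fit, consistent directional signal

print(classify_test(mape=0.08, r2=0.91))   # valid
print(classify_test(mape=0.01, r2=0.99))   # possible_overfitted
print(classify_test(mape=0.35, r2=0.62))   # invalid
```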
Stella is a self-service incrementality testing platform, not a consultancy. Users design and execute tests independently through Stella's web interface, with the platform providing automated validation, model selection, and analysis.
The tests analyzed in this study represent a mix of:
Data Collection Method: Results were extracted via SQL query from Stella's BigQuery database, filtering for tests that measured DTC-only revenue (excluding Amazon, retail, wholesale, and other non-DTC channels).
Test Classification:
Test Durations:
Test Budgets:
Channels Tested:
Tests employed various geographic randomization levels based on user choice and business requirements:
Tradeoffs by Granularity:
Self-Selection Bias: Brands using Stella represent a self-selected sample of incrementality-aware advertisers who have already:
Compared to typical DTC advertisers, Stella users likely achieve 15-25% better performance through superior creative quality, more sophisticated targeting, cleaner tracking implementation, and willingness to optimize based on test results.
Survivorship Bias: Brands experiencing persistent measurement failures or discovering poor incrementality across channels may have churned before completing multiple tests, creating bias favoring brands with cleaner data and stronger underlying incrementality.
Interpretation: These benchmarks represent well-executed programs by measurement-sophisticated advertisers rather than universal industry averages. For planning purposes, we recommend discounting these benchmarks by 15-20% to establish conservative targets for new testing programs.
Before any test begins, Stella's platform runs automated validation to ensure regions are suitable for causal inference:
Validation Process:
Important Distinction: This is an automated system validation, not manual curation. The Stella team does not explicitly review each study unless the user requests support.
Stella's platform implements four causal inference methodologies and uses automated model selection based on pre-test validation metrics:
Available Models:
Automated Selection Criteria:
Each model uses an 80/20 train-test split of pre-treatment data:
Platform's algorithm selects model based on:
Models achieving MAPE < 0.15 in the validation period are preferred regardless of other metrics, as this threshold strongly predicts test success.
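The sketch below illustrates this selection logic under stated assumptions: the two candidate "models" are simple placeholders standing in for the platform's causal models, and the selection rule shown is the validation-MAPE comparison described above.

```python
import numpy as np

def validation_mape(fit_fn, series, split=0.8):
    """Fit a candidate model on the first 80% of pre-treatment data,
    then score its forecast on the held-out 20%."""
    cut = int(len(series) * split)
    train, test = series[:cut], series[cut:]
    forecast = fit_fn(train, horizon=len(test))
    return float(np.mean(np.abs((test - forecast) / test)))

# Placeholder candidates standing in for the platform's causal models.
def mean_model(train, horizon):
    return np.full(horizon, train.mean())

def drift_model(train, horizon):
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return train[-1] + slope * np.arange(1, horizon + 1)

rng = np.random.default_rng(0)
pre_period = 100 + np.cumsum(rng.normal(0.2, 1.0, size=60))  # synthetic daily revenue

scores = {name: validation_mape(fn, pre_period)
          for name, fn in {"mean": mean_model, "drift": drift_model}.items()}
best = min(scores, key=scores.get)
print(scores, "-> selected:", best)   # models under 0.15 validation MAPE are preferred
```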
Stella's platform allows users to upload confounding variables via Google Sheets or CSV files. The system incorporates these into the causal models as covariates:
Common Confounders:
Example: A Meta holdout test might include concurrent Google, TikTok, programmatic, and CTV spend as covariates to better isolate Meta's specific contribution.
User Responsibility: Confounding variable identification and upload is optional and user-controlled. Quality of confounder control varies by user sophistication.
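As a hedged illustration of how uploaded covariates can enter a regression-style counterfactual, the sketch below adds two hypothetical confounder series (concurrent Google and TikTok spend) to a design matrix alongside control-geo revenue. This is not Stella's actual model, only a minimal example of the general approach.

```python
import numpy as np

rng = np.random.default_rng(1)
days = 60

# Pre-treatment series (all hypothetical): control-geo revenue plus two
# user-uploaded confounders, e.g. concurrent spend on other channels.
control_geo_revenue = 100 + rng.normal(0, 5, size=days)
google_spend = 20 + rng.normal(0, 2, size=days)
tiktok_spend = 5 + rng.normal(0, 1, size=days)
treatment_revenue = (0.9 * control_geo_revenue
                     + 0.5 * google_spend
                     + rng.normal(0, 3, size=days))

# Design matrix: intercept + control geos + confounder covariates.
X = np.column_stack([np.ones(days), control_geo_revenue, google_spend, tiktok_spend])
coef, *_ = np.linalg.lstsq(X, treatment_revenue, rcond=None)
counterfactual_fit = X @ coef   # in practice, projected forward as the no-ads baseline
print(coef.round(2))
```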
Statistical significance assessed using two-tailed hypothesis tests:
Confidence intervals constructed using blocked bootstrap resampling with 1,000 iterations to account for time series autocorrelation (see Technical Appendix).
Stella recommends post-treatment windows (also called "cooldown periods") to capture delayed conversions, but this feature is optional and controlled by users.
Adoption in This Dataset:
Post-Treatment Window Length (when present):
Purpose of Post-Treatment Windows:
Test Duration (Active Treatment Period):
Total Observation Window (when including post-treatment):
Consistently Measured Across All 225 Tests:
Why This Scope:
Important Limitation: Brands with significant Amazon or retail presence may see different total incrementality when accounting for cross-channel effects. Our DTC-only estimates represent lower bounds on total business impact for omnichannel brands.
Upper-Funnel Channels (YouTube, CTV, Display):
Lower-Funnel Channels (Search, Shopping, Retargeting):
Cross-Channel Comparisons: All benchmark comparisons in this study account for measurement design differences. Channel benchmarks reflect typical measurement approaches (with or without PTW) for each channel type.

Results for channels with fewer than 10 tests should be interpreted with caution, as individual tests can substantially skew medians. For robust channel benchmarks, we recommend a minimum of N=15.
Limited Sample Channels:
These represent our current testing portfolio but should not be treated as definitive industry standards.

Sample Size Limitation: Most business verticals represented by fewer than 20 tests. Only Accessories (N=38) and Workout Gear (N=34) provide sufficient statistical power for robust vertical-level benchmarks.
Top verticals by test volume:
Tests conducted over 16-month period (August 2024 - December 2025). Modest temporal variation observed, with cross-channel differences (>100%) exceeding time-based variation (±20%). No individual channel shows systematic degradation over time.
This section presents empirical observations from the dataset. Hypotheses about why these patterns exist appear in the Interpretation section.
Observed Outcomes Across All 225 Tests:
Central Tendency:
Dispersion:
Statistical Significance:
Test Quality:

By MAPE:

By R-Squared:

Combined Best Fit (MAPE <0.15 AND R² 0.85-0.94):
Variance in incrementality outcomes represents genuine differences in advertising effectiveness driven by creative quality, targeting precision, product-market fit, and execution sophistication. Understanding this variability is as important as understanding central tendency.
iROAS Percentile Distribution:

Distribution Characteristics:
Profitability Distribution:
Performance Tiers:
Key Insight: Over one-third of tests achieved high performance (≥3.0x) while nearly one-third showed low performance (<1.5x). This 2:1 gap between the high- and low-performance thresholds underscores how sensitive outcomes are to execution.
Variance differs substantially by channel, reflecting different sensitivities to execution quality:
Meta (N=63): Tight Distribution
Google (N=98): Moderate Distribution
Tatari CTV (N=18): Wide Distribution
TikTok (N=10): Extreme Distribution
Variance-Informed Planning:
Stella recommends a structured approach to incrementality testing that builds insights progressively:
Phase 1: Channel-Level Baseline (First 1-2 Tests)
Start with channel-level holdouts to establish baseline iROAS for your highest-spend channels.
Objective: Understand how incremental your current spend is at the channel level.
Example: "Is our $100K/month Meta spend generating positive incremental ROI?"
Phase 2: Dive Deeper - Choose Your Path (Tests 3-6)
After establishing channel baseline, choose one of two paths:
Path A: Tactic-Level Decomposition
Objective: Find where channel iROAS comes from.
Example: "Our Google Search shows 1.85x iROAS at channel level. Is that driven by Brand (likely low) or Non-Brand (likely higher)?"
Path B: Scale Testing (Marginal ROAS Mapping)
Objective: Map the marginal growth curve to understand diminishing returns.
Example: "We know Meta generates 2.9x iROAS at $80K/month. What happens at $200K? At $400K?"
Critical Distinction:
Phase 3: Campaign/Creative-Level Optimization (Tests 7+)
After understanding channel and tactic performance, drill into specific campaigns or creative approaches.
Testing Hierarchy:
Monthly channel spend determines which testing levels are accessible:

Why Budget Matters:
Use these benchmarks as probabilistic inputs, not guarantees:
Example 1: Planning a Meta Test
Interpretation: With a $25K budget and 30-day duration, expect an 87% chance of reaching actionable conclusions. A result between 2.25x and 3.30x is normal even if it falls below the median.
Example 2: Evaluating Google Search Branded
Interpretation: If test shows 0.60x iROAS with statistical significance, this is NOT a failure. It confirms platform attribution overstates by 7-10x.
Example 3: TikTok Creative Testing
Interpretation: Budget for 2-3 creative iterations. Platform-native content drives extreme outperformance; generic social ads drive failure.
Platform ROAS exceeding 2-3x incremental ROAS warrants skepticism:

The gap represents attribution inflation - conversions that would have occurred without advertising.
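A quick way to express this gap as an inflation factor, using hypothetical numbers:

```python
# Hypothetical example: platform-reported ROAS vs measured incremental ROAS.
platform_roas = 6.5
incremental_roas = 2.3

inflation_factor = platform_roas / incremental_roas
print(f"Attribution inflation: {inflation_factor:.1f}x")
# ~2.8x: most of the reported revenue would have occurred without the ads.
```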
Rational allocation framework prioritizes by iROAS, but requires understanding iROAS vs Marginal ROAS:
Tier 1: Scale Aggressively (3.0x+ iROAS)
Action: Increase budgets until marginal ROAS declines to Tier 2 levels. Use scale testing to map curve.
Tier 2: Core Performance (2.0-3.0x iROAS)
Action: Maintain current levels, optimize creative/targeting to push toward Tier 1.
Tier 3: Optimize Carefully (1.0-2.0x iROAS)
Action: Improve execution or maintain minimal investment. Test platform-native creative approaches.
Tier 4: Defensive Minimum (<1.0x iROAS)
Action: Reduce to defensive floor (prevent competitor conquest), redirect savings to higher-iROAS channels.
Critical Caveats:
Critical Framing for Correct Usage:
The median iROAS represents the middle value from Stella platform users - half of tests performed better, half performed worse. Your specific result depends on:
Do not treat medians as promises or targets. They are reference points from self-selected platform users for realistic planning.
All benchmarks reflect specific measurement choices made by individual users:
Attribution Window: 55.6% of tests included 5-14 day post-treatment windows (user-optional). Tests with PTW capture delayed conversions that tests without PTW miss. Upper-funnel channels show 10-15% higher iROAS with PTW.
Revenue Scope: 100% measured DTC-only revenue (excludes Amazon, retail, cross-channel effects). Omnichannel brands may see higher total incrementality.
Geographic Design: Mix of State-level (~65%), DMA-level (~30%), and ZIP/cluster (~5%) randomization. Different granularities have different spillover/power tradeoffs.
KPI Selection: iROAS measures revenue efficiency but not customer lifetime value, repeat rate, or brand equity.
Results are measurement-dependent. Different choices produce different estimates of the same underlying advertising effect.
Upper-funnel channels (YouTube, CTV, Display):
Lower-funnel channels (Search, Shopping):
When comparing channels: Ensure similar measurement designs or adjust expectations accordingly.
For high-variance channels (TikTok CV=1.26, CTV CV=0.85):
For low-variance channels (Meta CV=0.35, Shopping CV=0.36):
Risk management: High variance does not equal bad channel. It signals execution sensitivity and optimization leverage.
Brands using Stella represent self-selected measurement-sophisticated advertisers who likely achieve 15-25% better performance than typical DTC advertisers through:
When planning new programs: Discount these benchmarks by 15-20% to establish conservative targets.
Example: Meta median 2.92x becomes 2.34-2.48x for conservative planning (2.92x × 0.80 to 0.85).
Stella's platform employs automated validation before any test begins, but does not prevent users from proceeding when validation flags issues.
Automated Quality Checks:
Important: These are system-generated flags and recommendations, not enforced gates. Users retain full autonomy to proceed with tests that don't meet ideal thresholds.
Our 88.4% statistical significance rate likely reflects three factors:
Factor 1: Automated Pre-Test Optimization
Stella's validation system flags tests unlikely to succeed, though users can override. Tests with poor validation metrics (correlation <0.60, insufficient data, inadequate power) are surfaced in the UI before launch.
This is distinct from rigid quality gates - users maintain autonomy but benefit from automated guidance.
Factor 2: Multi-Model Causal Inference
The platform's automated model selection adapts to each dataset's characteristics rather than applying a single method universally. This likely improves success rates compared to inflexible approaches.
Academic literature reports:
Our 88.4% falls within this range, suggesting multi-model selection provides modest improvement.
Factor 3: Sample Selection and Survivorship Bias
Brands experiencing consistent measurement failures may have:
This creates survivorship bias favoring brands with cleaner data and stronger underlying incrementality.
We cannot fully disentangle these effects. The combination of automated validation, multi-model selection, and sample selection all likely contribute to observed success rates.
Practitioners should expect 75-85% significance rates with similar pre-test optimization and multi-model approaches. Those using simpler methods may see rates closer to 60-70%.
To assess robustness, we examined how results change under different analytical assumptions:
Significance Threshold: Using 95% confidence level rather than 90% as primary threshold reduces overall significance rate to 86.2% (vs 88.4% at 90%). Channel rankings and median iROAS values remain stable.
Test Duration: Excluding tests shorter than 25 days (N=47 removed) increases median iROAS by 3% (2.31x to 2.38x) and significance rate by 2% (88.4% to 90.3%). Shorter tests slightly less effective but don't drive overall patterns.
Fit Quality Filtering: Removing tests flagged as 'possible_overfitted' (N=2) or 'invalid' (N=10) has minimal impact on channel benchmarks (median iROAS changes <2% for all channels with N>20).
Automated Model Selection: When examining only Weighted Synthetic Controls (N=97), significance rate for those specific tests is 94.8%, suggesting model selection provides benefit but is not the sole driver of high success rates.
Geo-based incrementality testing assumes clean boundaries between treatment and control regions. In practice, several spillover mechanisms exist:
Brand Awareness Spillover: National campaigns create awareness that persists when users travel between regions or engage with content shared across geographic boundaries.
Search Spillover: Paid advertising in treatment regions can increase branded search volume in control regions through word-of-mouth and social sharing.
Cross-DMA Commerce: Users may see ads in one DMA but complete purchases in another due to shipping addresses or mobile location tracking limitations.
These spillover effects bias our iROAS estimates conservatively downward - true incrementality is likely 5-15% higher than measured values. Upper-funnel channels (YouTube, CTV) likely underestimated more than lower-funnel channels (search, shopping).
Platform Limitation: While Stella's platform supports retesting with different geographic boundaries when initial results are inconclusive, we lack visibility into which tests in this dataset employed this approach. Users make retesting decisions independently.
Every test in Stella's database includes:
This differs from "black box" platforms where methodology remains opaque.
DTC Ecommerce Focus: All tests involve direct-to-consumer ecommerce brands. Results may not generalize to:
Self-Service Platform Sample: Data extracted from Stella's database represents self-selected users of the platform. We lack visibility into:
Self-Selection Bias: Brands using incrementality testing platforms differ from typical advertisers:
The 15-25% performance premium represents our estimate of this selection effect.
US Market Focus: Tests concentrate in US markets. International markets may show different patterns due to varying digital penetration, ecommerce maturity, consumer behavior, and privacy regulations.
Revenue Scope: Analysis excludes Amazon and retail channel revenue, focusing solely on owned DTC channels. Omnichannel brands may see different incrementality patterns when accounting for cross-channel effects.
User Autonomy: Stella is a self-service platform. Users control:
Quality varies by user sophistication. We cannot verify confounding variable completeness or quality for any individual test.
Automated Validation, Not Enforcement: Platform flags quality issues but doesn't prevent users from proceeding. Unknown proportion of tests launched despite validation warnings.
Limited Visibility: SQL query extraction means we lack context on:
16-Month Window: Tests conducted between August 2024 and December 2025 capture a specific moment in platform evolution:
Results may not reflect 2026+ performance as platforms adapt and competition evolves.
Claims NOT supported by this data:
This study analyzed 225 geo-based incrementality tests conducted on Stella's self-service platform between August 2024 and December 2025, establishing comprehensive public benchmarks for digital advertising incremental ROAS.
Primary Findings:
Overall Performance:
Channel Rankings by Median iROAS:
Success Predictors:
Critical Distinctions:
iROAS vs Marginal ROAS: iROAS measures current spend incrementality (snapshot). Marginal ROAS measures next-dollar effectiveness (forward-looking). Both required for optimization.
R² Sweet Spot: 0.85-0.94 is ideal. Values >0.94 are suspicious (possible overfitting), not excellent.
Variance as Signal: High variance channels (TikTok, CTV) offer execution leverage. Low variance channels (Meta) offer reliability.
The cost of not measuring incrementality increasingly exceeds the cost of testing. As platform attribution degrades due to privacy restrictions and competition intensifies, distinguishing truly incremental investments from attributional mirages becomes essential.
These public benchmarks aim to:
This dataset represents:
Discount benchmarks by 15-20% for conservative planning when starting incrementality testing programs.
The industry's transition from correlation-based attribution to causality-based incrementality is inevitable. The only question is timing.
Incrementality testing is a controlled experiment measuring the causal effect of advertising by comparing revenue in regions where ads ran (treatment) vs regions where ads were paused or changed (control). Unlike attribution, which assigns credit based on correlation, incrementality testing isolates the revenue that would NOT have occurred without advertising.
Stella is a self-service SaaS platform with automated validation and model selection. Users design and execute tests independently through the web interface. The platform provides:
Some users receive support from Stella's team, but most tests are fully self-service. This differs from consultancies that manually design and execute each test.
iROAS (Incremental ROAS): Snapshot of how incremental your CURRENT spend is.
Marginal ROAS: Forward-looking measure of how effective the NEXT dollar will be.
You need BOTH to optimize properly. iROAS validates current efficiency; Marginal ROAS guides scaling decisions.
Too-perfect pre-test fit often indicates the model is fitting noise rather than signal, reducing reliability in the post-treatment period. The sweet spot is R² 0.85-0.94, which indicates:
In this dataset, 38.7% of tests showed R² >0.94 (possibly overfitted) while 24.9% fell in the 0.85-0.94 sweet spot.
No, they're user-optional in Stella's platform. In this dataset:
When to use PTW:
When PTW is less critical:
Tests without PTW may underestimate upper-funnel iROAS by 10-15%.
Stella's platform calculates this automatically during pre-test validation. Key predictors:
Pre-test fit quality (most important):
Test duration: 30-35 days optimal for most scenarios
Budget: $15-30K sweet spot for significance
Platform flags tests unlikely to succeed during validation, though users can proceed.
Below-benchmark performance provides valuable information, not test failure:
If Meta shows 2.0x (below 2.92x median but within IQR 2.25-3.30):
If Google Search Branded shows 0.60x (below 0.70x median):
Remember: These benchmarks come from measurement-sophisticated advertisers. Apply 15-20% discount for conservative planning.
Yes, with appropriate caveats:
These benchmarks should apply if you:
Your results may differ if you:
Statistical significance rates may be lower (60-75%) with less sophisticated approaches, but iROAS medians should directionally apply.

Every $50K/month in total spend opens access to the next testing level. However, it is important to remember that the biggest factor in reaching statistical significance is conversion volume in each region. Granular testing with insufficient budget produces inconclusive results.
Scale test design (3 cells):
Measures: Marginal ROAS at each spend level to map diminishing returns curve.
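To clarify how marginal ROAS is derived from a scale test, here is a small sketch; the spend levels and incremental revenue figures are hypothetical.

```python
# Sketch of deriving marginal ROAS from a 3-cell scale test (hypothetical figures).
cells = [
    {"monthly_spend":  80_000, "incremental_revenue": 232_000},  # current level, ~2.9x iROAS
    {"monthly_spend": 200_000, "incremental_revenue": 470_000},
    {"monthly_spend": 400_000, "incremental_revenue": 700_000},
]

for low, high in zip(cells, cells[1:]):
    extra_spend = high["monthly_spend"] - low["monthly_spend"]
    extra_revenue = high["incremental_revenue"] - low["incremental_revenue"]
    marginal_roas = extra_revenue / extra_spend
    print(f"${low['monthly_spend']:,} -> ${high['monthly_spend']:,}: "
          f"marginal ROAS {marginal_roas:.2f}x")   # diminishing returns as spend scales
```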
Run scale test when:
Don't run scale test when:
State-level (~65% of tests in this dataset):
DMA-level (~30% of tests):
ZIP/Cluster-level (~5% of tests):
Most brands should start with State-level to reduce spillover, then move to DMA-level as they gain sophistication.
Confidence intervals constructed using blocked bootstrap resampling to account for time series autocorrelation:
Procedure:
This follows best practices for time series bootstrap (Politis & Romano, 1994).
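Below is a hedged sketch of a stationary (blocked) bootstrap for the total-lift confidence interval, resampling blocks of random geometric length to preserve autocorrelation. It follows Politis & Romano (1994) in spirit and is not Stella's exact implementation.

```python
import numpy as np

def stationary_bootstrap_ci(daily_lift, n_iter=1000, mean_block=7, alpha=0.05, seed=0):
    """Approximate (1-alpha) CI for total lift via stationary bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(daily_lift)
    totals = []
    for _ in range(n_iter):
        resampled, start = [], rng.integers(n)
        while len(resampled) < n:
            block_len = rng.geometric(1 / mean_block)        # random block length
            idx = (start + np.arange(block_len)) % n         # wrap around the series
            resampled.extend(daily_lift[idx])
            start = rng.integers(n)
        totals.append(np.sum(resampled[:n]))
    lower, upper = np.percentile(totals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

rng = np.random.default_rng(1)
daily_lift = rng.normal(1_500, 800, size=30)   # illustrative daily incremental revenue
print(stationary_bootstrap_ci(daily_lift))
```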
Pre-test power calculations used standard formula:
n = (Z_α/2 + Z_β)² × (σ²_treatment + σ²_control) / (MDE × baseline_mean)²
Where:
Stella's platform calculates this automatically during pre-test validation.
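A minimal Python implementation of the formula above, assuming observations are daily revenue totals per region and MDE is expressed as a fraction of the baseline; the inputs are illustrative.

```python
from statistics import NormalDist

def required_observations(mde, baseline_mean, sd_treatment, sd_control,
                          alpha=0.10, power=0.80):
    """Observations per arm needed to detect a relative lift of `mde`
    on `baseline_mean` (two-tailed test), per the formula above."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha + z_beta) ** 2 * (sd_treatment**2 + sd_control**2)
    return numerator / (mde * baseline_mean) ** 2

# Illustrative inputs: $10K/day baseline, $2.5K/day volatility, 15% detectable lift.
n_days = required_observations(mde=0.15, baseline_mean=10_000,
                               sd_treatment=2_500, sd_control=2_500)
print(round(n_days))   # roughly a month of daily observations per arm
```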
Abadie, A., & Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1), 113-132. https://doi.org/10.1257/000282803321455188
Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490), 493-505. https://doi.org/10.1198/jasa.2009.ap08746
Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015). Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1), 247-274. https://doi.org/10.1214/14-AOAS788
Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2019). A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook. Marketing Science, 38(2), 193-225. https://doi.org/10.1287/mksc.2018.1135
Lewis, R. A., & Rao, J. M. (2015). The unfavorable economics of measuring the returns to advertising. The Quarterly Journal of Economics, 130(4), 1941-1973. https://doi.org/10.1093/qje/qjv023
Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), 1303-1313. https://doi.org/10.1080/01621459.1994.10476870