How to measure ad lift when you can't run an A/B test. The weights, the validation, when it breaks, and the one check most marketers skip.
Updated June 2026
A weighted synthetic control is a custom benchmark for a test market. It blends several untreated markets into one that looked like your test market before the campaign, then uses that blend to estimate what would have happened without the campaign. The gap between actual sales and that benchmark is your estimated incremental lift.
The business question underneath the math is simple. Did the campaign create revenue, or did the platform just take credit for revenue that was already going to happen?
One number for context, since you'll want it: across 225 geo tests in Stella's own dataset, 88.4% reached at least 90% confidence, with a median incremental ROAS of 2.31x and an interquartile range of 1.36x to 3.24x. Treat that as a benchmark for tests that were viable enough to run, from a self-selected set of mostly DTC advertisers. It's not a universal success rate or a guarantee.
Weighted synthetic control is useful when a clean user-level A/B test isn't available, isn't trusted, or isn't practical. It's especially useful for geo tests across Meta, Google, CTV, podcast, and retail media, where user-level exposure is messy. But it isn't magic. It works only when the control markets are clean, the test market can actually be matched, and the diagnostics pass.
Most marketers already know platform ROAS isn't the same thing as incrementality. A dashboard can tell you which conversions were attributed to an ad. It can't tell you which conversions would have happened anyway. That's the difference between reported performance and causal performance.
Three measurement categories get mixed together, and they're not the same:
Weighted synthetic control is one way to read that geo test. It doesn't replace good experimental design. It builds the best possible counterfactual once the test is designed.
Start with a test market. Say you run a six-week campaign in Chicago. To measure incrementality, you need to estimate what Chicago sales would have been if the campaign had never run.
The bad version compares Chicago to one similar city. Detroit, maybe. One market almost never gives you a good enough match.
Weighted synthetic control uses a blend instead. The model might decide Chicago's best benchmark is 38% Detroit, 31% Minneapolis, 22% Cleveland, and 9% other markets. Those weights are chosen because that blend tracked Chicago's pre-campaign sales closely.
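To make the weighting step concrete, here is a minimal sketch of one way to find non-negative weights that sum to 1 and track a test market's pre-campaign sales. It uses a simple Frank-Wolfe loop in plain Python, which is an assumption chosen for readability, not a description of how any particular tool fits its weights. The market names, weekly figures, and function names are all hypothetical.

```python
# Sketch only: hypothetical weekly pre-campaign sales (in $K) for three donor markets.
donors = {
    "detroit":     [100, 104, 98, 110, 107, 115, 111, 120],
    "minneapolis": [80, 86, 83, 85, 92, 90, 97, 95],
    "cleveland":   [60, 58, 65, 63, 61, 70, 66, 72],
}
# Hypothetical pre-campaign sales for the test market.
chicago_pre = [86, 89, 87, 93, 94, 98, 98, 103]


def blend(weights, donors):
    """Weighted sum of the donor series, week by week."""
    n_weeks = len(next(iter(donors.values())))
    return [sum(weights[m] * donors[m][t] for m in donors) for t in range(n_weeks)]


def squared_error(weights, donors, target):
    """Sum of squared gaps between the blend and the target series."""
    return sum((b - y) ** 2 for b, y in zip(blend(weights, donors), target))


def fit_weights(donors, target, iterations=2000):
    """Frank-Wolfe on the simplex: weights stay non-negative and sum to 1."""
    names = list(donors)
    weights = {m: 1.0 / len(names) for m in names}  # start from an equal blend
    for _ in range(iterations):
        current = blend(weights, donors)
        residual = [b - y for b, y in zip(current, target)]
        # Gradient of the squared error per donor weight (up to a factor of 2).
        grad = {m: sum(r * x for r, x in zip(residual, donors[m])) for m in names}
        best = min(names, key=grad.get)  # donor whose extra weight cuts error fastest
        direction = [x - b for x, b in zip(donors[best], current)]
        denom = sum(d * d for d in direction)
        if denom == 0:
            break
        # Exact line search, clipped to [0, 1] so the weights remain a valid blend.
        step = -sum(r * d for r, d in zip(residual, direction)) / denom
        step = max(0.0, min(1.0, step))
        if step == 0.0:
            break  # no move toward any single donor improves the fit
        weights = {m: (1 - step) * w + (step if m == best else 0.0)
                   for m, w in weights.items()}
    return weights


weights = fit_weights(donors, chicago_pre)
for market, w in weights.items():
    print(f"{market}: {w:.0%}")
```

The constraint is the point: because the weights are non-negative and sum to 1, the benchmark is an interpolation between real markets, which is what makes it readable.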
If the synthetic version of Chicago would have produced $1.0M during the test and actual Chicago sales were $1.08M, the estimated lift is $80K. That's not platform-attributed revenue. It's estimated incremental revenue, assuming the synthetic control is a credible counterfactual.
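The arithmetic in that example, as a small sketch. The function name is made up for illustration; the dollar figures are the ones from the paragraph above.

```python
def estimate_lift(actual, synthetic):
    """Return incremental revenue and lift as a share of the synthetic baseline."""
    incremental = actual - synthetic
    return incremental, incremental / synthetic


incremental, lift_pct = estimate_lift(actual=1_080_000, synthetic=1_000_000)
print(f"Incremental revenue: ${incremental:,.0f} ({lift_pct:.1%} lift)")
# prints: Incremental revenue: $80,000 (8.0% lift)
```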
That last phrase matters. The math gives you an estimate. The diagnostics tell you whether the estimate deserves to be believed.
The weights are one of the biggest advantages of synthetic control. You can see which markets built the benchmark, which makes the method far easier to explain than a black-box model.
But readable does not mean valid. A synthetic control can look clean and still be fragile. If one donor carries most of the weight, a random shock in that donor moves your whole estimate. If the weights are spread thin across weak markets, the benchmark can be noisy.
The right question isn't "do the weights look reasonable." The right questions are:
Weights create transparency. They don't replace validation.
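One quick way to put a number on concentration is sketched below. It reports the largest single weight and the "effective number of donors" (1 divided by the sum of squared weights), a common concentration summary that this article doesn't prescribe. The weights reuse the Chicago example, with the 9% "other markets" split evenly across three hypothetical markets as an assumption.

```python
def concentration(weights):
    """Return the top donor, its weight, and the effective number of donors."""
    top_market = max(weights, key=weights.get)
    effective_donors = 1.0 / sum(w * w for w in weights.values())
    return top_market, weights[top_market], effective_donors


# Weights from the Chicago example; the three "other" markets are hypothetical.
example_weights = {
    "detroit": 0.38, "minneapolis": 0.31, "cleveland": 0.22,
    "other_a": 0.03, "other_b": 0.03, "other_c": 0.03,
}

top, top_weight, effective = concentration(example_weights)
print(f"Top donor: {top} at {top_weight:.0%}; effective donors: {effective:.1f}")
```

A low effective donor count isn't automatically a failure. It's a prompt to re-run the read with the top donor removed and see how far the estimate moves.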
A synthetic control result should never be judged by pre-period fit alone. A high R-squared can mislead. It may only prove the model fit the past, and a model can overfit history and still fail the moment the campaign starts. Three checks matter.
Before the campaign period is analyzed, test the model on pre-campaign data it didn't train on. Split the pre-period in two, train on the early weeks, then predict the later weeks. That later section is the holdout. If the model can't predict the holdout before the campaign, it has no business claiming lift after it. Good holdout performance doesn't prove causality. It earns the model the right to be taken seriously.
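A minimal sketch of the split-and-predict step follows. To keep it short, the "model" here is a stand-in: a single least-squares scale factor applied to a donor average, not a full weighted fit. The weekly figures and the four-week holdout length are hypothetical.

```python
def split_pre_period(series, holdout_weeks):
    """Split a pre-campaign series into training weeks and later holdout weeks."""
    return series[:-holdout_weeks], series[-holdout_weeks:]


def fit_scale(donor_series, target_series):
    """Stand-in model: least-squares scale factor from donor level to target level."""
    numerator = sum(d * t for d, t in zip(donor_series, target_series))
    return numerator / sum(d * d for d in donor_series)


def mape(actual, predicted):
    """Mean absolute percentage error."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly pre-campaign sales (in $K).
donor_avg = [100, 102, 98, 105, 103, 108, 110, 107, 112, 115, 111, 118]
test_market = [121, 122, 119, 125, 124, 131, 133, 128, 134, 139, 133, 142]

donor_train, donor_holdout = split_pre_period(donor_avg, holdout_weeks=4)
target_train, target_holdout = split_pre_period(test_market, holdout_weeks=4)

scale = fit_scale(donor_train, target_train)      # trained on early weeks only
predicted = [scale * d for d in donor_holdout]    # predict the weeks it never saw
holdout_mape = mape(target_holdout, predicted)
print(f"Holdout MAPE: {holdout_mape:.2%}")
```

What counts as a passing holdout error depends on how large a lift you expect to detect. An error larger than the expected lift means the model can't separate the campaign from its own noise.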
Placebo tests ask one question: would this method find fake lift in markets where no campaign ran? Treat each donor as if it were the campaign market, apply the same method, and collect the "effects." Since those markets weren't treated, those effects map the false positives your design and model can produce, from noise, shocks, misspecification, or weak donor comparability. Then compare your real test market against that spread.
This produces a permutation-style pseudo p-value. With 32 placebo markets, the smallest possible one-sided p-value using the standard plus-one correction is 1/33 ≈ 0.03, and that only happens if your result beats all 32. If it beats 31 of 32, the corrected p-value is 2/33 ≈ 0.06. Small arithmetic matters here, because this is the number a technical reader checks first.
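The plus-one correction in code, as a sketch. The placebo effects below are invented purely to reproduce the two cases just described, and the function name is hypothetical.

```python
def placebo_p_value(treated_effect, placebo_effects):
    """One-sided permutation-style p-value with the plus-one correction."""
    as_extreme = sum(1 for effect in placebo_effects if effect >= treated_effect)
    return (1 + as_extreme) / (1 + len(placebo_effects))


# 32 invented placebo "effects", from 0.0% to 3.1% lift.
placebo_effects = [i * 0.001 for i in range(32)]

p_beats_all = placebo_p_value(0.08, placebo_effects)    # larger than all 32 placebos
p_beats_31 = placebo_p_value(0.0305, placebo_effects)   # one placebo is larger

print(f"{p_beats_all:.3f}")  # prints: 0.030
print(f"{p_beats_31:.3f}")   # prints: 0.061
```

The plus-one counts the treated market as one of the permutations, which is why the p-value can never reach zero no matter how large the effect.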
A single lift number isn't enough. If the estimate says 6% but the range runs from -2% to 14%, the honest read is "inconclusive," not "the campaign worked." Show the estimate and the uncertainty around it. Finance doesn't need fake precision. It needs a range it can decide from.
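As a sketch of that reading rule: the labels and the function name below are illustrative, and treating any range that includes zero as "inconclusive" is a simplifying assumption.

```python
def read_result(low, high):
    """Classify a lift estimate by whether its uncertainty range excludes zero."""
    if low > 0:
        return "positive lift"
    if high < 0:
        return "negative lift"
    return "inconclusive"


print(read_result(low=-0.02, high=0.14))  # prints: inconclusive
```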
Weighted synthetic control is credible only when a specific set of assumptions holds.
A clean-looking chart does not save a broken test.
It usually fails in predictable ways.
Test in your biggest or weirdest market and the donor pool may not be able to recreate it. This isn't only about size. It's about support: the treated market has to sit inside the range the donors can express. If it doesn't, standard synthetic control hits the boundary and fits poorly. Augmented and regression methods extrapolate, which trades a fit problem for an extrapolation problem. The rule: don't pick test markets from the edge of your business without a good reason.
This is the failure teams miss most. A control market only counts as untreated if it truly saw no campaign. CTV leaks across borders, paid social drifts, search demand spills, and retail media may not respect your test geography. Contaminated controls usually bias the estimate toward zero in a positive-lift test, so your campaign looks less incremental than it really was.
Short histories create unstable weights and hide your seasonality. For businesses with annual seasonality, 52 or more weeks is often preferable. Faster-moving accounts can use shorter windows, but only if the donor relationships are stable and the holdout passes. More data isn't automatically better, since old data goes stale, but too little makes the counterfactual fragile.
A competitor sale, a warehouse issue, weather, or a holiday that behaves differently across markets can all bias the estimate. The model can't tell whether a market moved because of your campaign or because something else changed at the same time. External event checks aren't academic hygiene. They protect budget decisions from bad reads.
No causal method is always right. Pick the one that fits your data shape.
For high-stakes decisions, one method is rarely enough. Run the main read, then use a second as a sensitivity check. Agreement across methods is reassuring, not proof, since two bad models can agree if they share the same bad data. But it convinces a skeptical CFO far more than any single number.
The point of incrementality testing isn't a prettier report. It's to change budget decisions. iROAS here is incremental revenue over media spend, which is a fine starting point, but finance will care about contribution margin, not revenue. A 2.5x revenue iROAS can be great for one business and weak for another depending on gross margin and payback.
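A sketch of why the same revenue iROAS reads differently across businesses. The spend figure and the two margin rates are hypothetical, and multiplying by a single margin rate is a simplification of real contribution margin.

```python
def iroas(incremental_revenue, media_spend):
    """Incremental revenue per dollar of media spend."""
    return incremental_revenue / media_spend


def margin_iroas(incremental_revenue, media_spend, margin_rate):
    """Incremental margin per dollar of media spend (margin_rate is an assumption)."""
    return incremental_revenue * margin_rate / media_spend


revenue_iroas = iroas(incremental_revenue=100_000, media_spend=40_000)
low_margin = margin_iroas(100_000, 40_000, margin_rate=0.30)
high_margin = margin_iroas(100_000, 40_000, margin_rate=0.70)

print(f"Revenue iROAS: {revenue_iroas:.2f}x")
print(f"At 30% margin: {low_margin:.2f}x, at 70% margin: {high_margin:.2f}x")
```

In this made-up case, the 30% margin business gets back less than a dollar of margin per media dollar, while the 70% margin business clears it comfortably.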
The worst outcome is running an incrementality test and changing nothing. Decide before launch what each result would make you do, then do it.
A checklist before you trust the result
Before you act on a synthetic control read, ask for this:
If those pieces are missing, the result may be interesting. It isn't decision-grade yet.
No. A geo holdout is the design: you create different exposure or spend levels across markets. Weighted synthetic control is one way to analyze that design, turning untreated markets into a weighted benchmark instead of a simple average. The holdout creates the comparison; synthetic control improves the counterfactual.
There's no universal number. You need enough clean donors to create a good match and enough placebos to make inference meaningful. A pool of 20 to 40 untreated markets is often a healthy start, but quality matters more than raw count. One or two great matches can produce a good-looking benchmark with weak inference and high sensitivity risk.
Long enough to detect the effect you care about. A larger effect can be detected faster; a smaller one needs more time, more markets, cleaner controls, or a lower-variance KPI. Don't choose duration on business urgency alone. Choose it on minimum detectable effect and statistical power.
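A rough sketch of how duration interacts with minimum detectable effect, using the standard normal approximation: MDE ≈ (z for confidence + z for power) × standard error. The weekly standard error of 6 points of lift is hypothetical, and shrinking it by the square root of the number of weeks assumes independent weekly noise, which real sales data often violates.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(weekly_se, weeks, confidence=0.90, power=0.80):
    """Approximate one-sided MDE under a normal approximation.

    Assumes weekly noise is independent, so the standard error shrinks with sqrt(weeks).
    """
    z_confidence = NormalDist().inv_cdf(confidence)
    z_power = NormalDist().inv_cdf(power)
    return (z_confidence + z_power) * weekly_se / sqrt(weeks)


for weeks in (2, 4, 8):
    mde = minimum_detectable_effect(weekly_se=0.06, weeks=weeks)
    print(f"{weeks} weeks: detectable lift of roughly {mde:.1%} or more")
```

If the lift you realistically expect is smaller than the MDE at your planned duration, the test is underpowered before it starts.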
Not cleanly. Synthetic control needs untreated markets. If every market got the campaign at once, there's no clean donor pool. For national, always-on spend, use media mix modeling, calibrated with experiments where possible.
No. A high pre-period R-squared only means the model fit history. It doesn't prove the counterfactual is valid. Confidence comes from the full design: clean controls, good holdout prediction, placebo tests, sensitivity checks, and no obvious confounding shock.
Weighted synthetic control identifies a causal effect only under a clear set of assumptions: untreated and uncontaminated donor markets, a treated market inside the donor pool's support, a treated-to-donor relationship that stays stable absent the campaign, no confounding shock coinciding with the test, consistent measurement across markets, and a design fixed before anyone sees post-period results. The estimate is a counterfactual under those assumptions, not a proof of causality on its own.
Weighted synthetic control answers the question platform dashboards can't: what would have happened if you hadn't spent the money? It isn't a magic causal machine. It's a disciplined way to build and validate a counterfactual when a clean user-level experiment isn't available.
When the donors are clean, the test market is matchable, the holdout fit is strong, the placebo result stands out, and the uncertainty range supports action, the estimate is defensible. When those checks fail, the answer isn't "the model says no lift." It's simpler: you don't have a test you can trust yet.
That distinction is the difference between incrementality theater and measurement a finance team can actually use.
Run your next campaign through Stella and see the number you can defend.