How to vet a marketing measurement consultancy

Why did you pay for measurement and still not trust the number?

Because the bad ones do not look bad. A weak marketing measurement consultancy hands you a polished lift number, a budget recommendation, and just enough confidence to move money, without showing whether the model would survive data it never trained on. You traded one number you could not audit for another.

You hired a measurement partner because the platforms were over-crediting themselves. Then the partner handed you a number you still could not check. Different dashboard, same problem.

This guide is about telling a real measurement partner from a polished one. Not by the benefit they promise, since everyone promises that, but by the standard of proof they can put on the table.

One disclosure. Stella runs measurement for a living, so we have a stake in this. The standard we are asking you to hold a vendor to is the same one we would want applied to us.

What does a marketing measurement consultancy do?

A marketing measurement consultancy estimates which channels actually cause sales, not which ones collect last-click credit. It runs incrementality tests, builds media mix models, audits your attribution, and turns the results into budget decisions. The good ones own the problem end to end and defend their model.

Start with what they are correcting. Your ad platforms report on their own performance. Meta claims the conversion, Google claims the same conversion, and last-click attribution assigns credit to whichever touchpoint happened to come last.

This is not a small effect. A study of large field experiments at Facebook compared observational measurement methods against randomized controlled experiments and found that the observational estimates often diverged sharply from the experimental results, in many cases overstating the effect of the ads (Gordon et al., Marketing Science 2019).

Add it up and your platforms report more revenue than the business actually made. Every channel looks efficient in isolation while your blended numbers stay flat. Closing that gap is the problem a measurement consultancy exists to solve.

They rely on two main tools. Incrementality testing estimates the causal lift of a single channel through a controlled experiment. Media mix modeling estimates the contribution of every channel at once from historical data.

A well-designed experiment is usually the cleanest evidence available. You withhold a channel from one group, run it for another, and measure the difference in outcomes, with no platform self-reporting involved.

"Well-designed" carries the weight. A weak holdout, a contaminated geography, a promotion that lands in one market and not another, or an underpowered test will still return a number that looks authoritative and means very little.

The whole exercise answers one question. If this channel goes dark, do those sales disappear, or do they arrive anyway through another path? Last-click cannot answer that. A controlled experiment can estimate it.
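The read-out from a simple geo holdout can be sketched in a few lines. This is a minimal illustration, assuming revenue totals for a pre-period and a test period in both groups. The function name `geo_lift`, its inputs, and the ratio-scaled counterfactual are assumptions made for the example, not a description of how any vendor computes lift, and the numbers are hypothetical.

```python
def geo_lift(test_pre, test_post, control_pre, control_post, spend):
    """Estimate incremental revenue for markets where ads kept running.

    Assumption: with ads paused, the test markets would have moved in
    proportion to the control markets (a ratio-scaled counterfactual).
    """
    # How the ad-free control markets changed from pre-period to test period.
    control_growth = control_post / control_pre
    # What the test markets would likely have done without the ads.
    counterfactual = test_pre * control_growth
    incremental = test_post - counterfactual
    return {
        "counterfactual": counterfactual,
        "incremental_revenue": incremental,
        "lift_pct": 100 * incremental / counterfactual,
        "iroas": incremental / spend,  # incremental revenue per ad dollar
    }


# Hypothetical numbers for illustration only.
result = geo_lift(
    test_pre=200_000, test_post=230_000,
    control_pre=100_000, control_post=105_000,
    spend=8_000,
)
```

Here the control markets grew 5 percent, so the test markets are assumed to have reached about 210,000 without ads, leaving roughly 20,000 of incremental revenue on 8,000 of spend. A real test would also report an interval around that figure, since a point estimate alone does not say whether the difference is noise.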

The two main tools, side by side

	Incrementality testing	Media mix modeling
What it answers	Did this one channel cause incremental sales, right now?	How do all channels, plus price, promo, and seasonality, contribute over time?
Method	Controlled geo experiment. Ads run in some markets, paused in others.	Statistical model fit on historical spend and revenue.
Evidence type	Causal, from a real experiment.	Correlational. Strength depends on data variation and controls.
Strengths	Cleanest causal read for a single channel. Hard to argue with when designed well.	Covers the whole mix at once. No holdout revenue at risk.
Weaknesses	Narrow to one channel and period. Costs some holdout revenue. Needs enough conversion volume.	Leans on modeling assumptions. Collinearity can blur which channel did the work.
Best for	Settling a specific channel question. Checking whether platform ROAS is real.	Planning across many channels. Reading long-term trends and interactions.
Watch out for	Underpowered tests, contaminated geos, a promo that hits one market and not another.	Overfitting, multicollinearity, and a fit that only looks good in-sample.

Good programs use both. A test answers one question precisely. MMM gives the broad read. When they disagree, find out why rather than averaging them.

How much does a measurement consultancy cost?

It depends on the structure. Some firms charge per project, a fixed fee for one model build or a set of tests. Others bill a monthly retainer for ongoing measurement. A few take a percentage of media spend, which is the model to watch, since it rewards them when you spend more.

Whatever the structure, the figure that matters is the return. If a consultancy shows that a meaningful share of your budget flows to channels with weak or no incremental lift, the fee can pay for itself on the first reallocation.

For reference, here is how Stella prices it. The structure mirrors the sequence most brands should follow. A full managed engagement, where our team designs, runs, and defends the measurement for you, starts at $10k. Self-serve plans run $6k and $3k for teams ready to run tests themselves. The price steps down as the work stops needing expert hands.

The return math only works if the answer is sound. A cheap partner with weak identification is expensive. A strong one with defensible evidence is usually cheap next to the cost of moving budget on a guess.

Should you hire a consultancy or use software?

Usually both, in sequence. Start with a managed engagement when you have no measurement skill in-house, because the first version of a program is full of small choices that change the answer. Move to self-serve software once the work becomes repeatable and your team can read the output without help.

Managed, self-serve, or in-house

Where you are	Best starting point	Why
No in-house measurement skill, first program	Managed engagement	The first version of a program is full of small choices that change the answer. You want someone to own and defend the model.
Program running, team can read the outputs	Self-serve software	The work is now repeatable. Stop paying expert rates for something your team can run.
Strong data team, just need tooling	Self-serve software or build	You already have the judgment. What you need is speed, compute, and repeatability.
Small spend, simple channel mix	Lighter checks first	A full engagement is overkill. Start with simpler tests, clean blended reporting, and basic incrementality checks.
High-stakes one-off decision	Managed or expert-validated	When a big reallocation or a board sign-off rides on the number, the cost of being wrong dwarfs the engagement fee.

Directional guidance, not rules. The right call depends on your data quality, spend level, and in-house capability.

Early on, you do not yet know what good measurement looks like, so handing the problem to someone who does is worth the cost. They set the methodology, run the first tests, pressure-test the model, and teach your team how to read the outputs.

This is where a managed engagement earns its fee. Not because your team lacks ability, but because those early choices look minor and change the answer. Which channels to test first. Which outcome to model. Which markets qualify for a holdout. Whether the model is identifying signal or fitting noise.

Those are not dashboard questions. They are judgment calls.

Once the program runs and your team can read the output confidently, software sustains it at a fraction of the cost. You stop paying expert rates for work that has become routine.

The brands that get this wrong buy a self-serve platform before they know what to measure, then cannot act on what it shows them. Others keep paying for a managed retainer long after the work became repeatable. Match the spend to where you actually are.

What should you ask before hiring one?

Ask how they measure incrementality, not just attribution. Then ask to see the model receipt: out-of-sample accuracy, the uncertainty around the estimate, collinearity and residual checks, and a clear path from model to budget. If all you get is an in-sample R-squared and a tidy slide, keep asking.

Before you ask what the model concluded, ask how the model was graded.

Not every item on the receipt carries the same weight. There are three layers, in order of how much it should worry you when one is missing.

Prediction. Can the model predict data it did not train on? If it has only been graded on the period it learned from, that is a memory test, not evidence.

Identification. Can the model separate signal from noise, or are the channels too tangled to read cleanly? A model can predict well and still credit the wrong channel.

Decision quality. Is the result strong enough to move budget, or just interesting enough to put in a deck? A lift number with no uncertainty and no recommendation changes nothing.

In plain English, before the jargon arrives:

MAPE: how far off the model's predictions were.
Out-of-sample R-squared: how well the model explained data it had not seen.
VIF: whether channels moved too closely together to separate cleanly.
Residuals: the model's misses, actual minus predicted.
MDE: the smallest lift the test could reliably detect.
Confidence interval: the range of plausible answers, not just the single number in the deck.

Vendor check

Can you see the model receipt?

Mark what your measurement vendor actually shows you. This does not tell you whether the answer is right. It tells you whether you can check the work before you move budget on it.

Directional, not a verdict: a thin receipt does not prove the conclusion is wrong, and a complete one does not prove it is right. The prediction and uncertainty checks carry the most weight here, because across Stella's 225-test benchmark, pre-test fit quality was more predictive of statistical significance than budget or duration. That does not prove causal correctness, but it is a strong first screen.

Out-of-sample accuracy, not in-sample fit

The real test is whether the model predicts a holdout period it was not trained on, measured by out-of-sample MAPE.

Ask for the out-of-sample MAPE and R-squared, the holdout window, and whether they used a rolling backtest or a single convenient slice. Ask what baseline the model beat, because a model is not impressive for beating nothing. It should outperform a naive seasonal forecast a spreadsheet could produce.

And prediction is not causation. A model that forecasts well does not automatically identify what caused growth, which is why the receipt needs identification checks too, not accuracy alone.
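The rolling backtest and the naive seasonal baseline described above can be sketched as follows. This is a minimal illustration on a toy series, not any vendor's procedure. The function names, the season length of 4 (weekly data would more likely use 52), and `seasonal_drift` as a stand-in for "the model" are all assumptions made for the example. It scores MAPE only; an out-of-sample R-squared would be computed on the same holdout windows.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


def seasonal_naive(history, horizon, season=4):
    """Baseline: repeat the value observed one season earlier."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]


def seasonal_drift(history, horizon, season=4):
    """Hypothetical stand-in for a vendor model: seasonal naive plus
    the most recent season-over-season change."""
    drift = history[-1] - history[-1 - season]
    base = seasonal_naive(history, horizon, season)
    return [b + drift * (1 + h // season) for h, b in enumerate(base)]


def rolling_backtest(series, forecast_fn, horizon=4, folds=3):
    """Score forecast_fn on several holdout windows, each one forecast
    using only the data that came before it."""
    scores = []
    for k in range(folds, 0, -1):
        cut = len(series) - k * horizon
        train, holdout = series[:cut], series[cut:cut + horizon]
        scores.append(mape(holdout, forecast_fn(train, horizon)))
    return scores


# Toy series: a repeating 4-period pattern that grows by 4 each cycle.
pattern = [100, 120, 90, 110]
series = [value + 4 * cycle for cycle in range(6) for value in pattern]

baseline_scores = rolling_backtest(series, seasonal_naive)
model_scores = rolling_backtest(series, seasonal_drift)
beats_baseline = all(m < b for m, b in zip(model_scores, baseline_scores))
```

The point of scoring several windows is that a single convenient slice can flatter a model by luck. If the model does not beat the seasonal baseline on most folds, the in-sample fit is not worth much.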

Uncertainty, power, and minimum detectable effect

A test can be well-designed and still too noisy to answer the question. You need the smallest effect it could have detected and the width of the interval around the estimate.

"It was significant" is not enough. Ask for the minimum detectable effect, the confidence interval, and whether the test was powered for the decision you intend to make. A point estimate without an interval is half an answer.
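The relationship between the standard error of a test, its minimum detectable effect, and its confidence interval can be sketched with the usual normal approximation. This is a minimal illustration, assuming a two-sided test at 90 percent confidence and 80 percent power. The function names, the 16-point observed lift, and the 4.8-point standard error are hypothetical values chosen for the example.

```python
from statistics import NormalDist


def minimum_detectable_effect(std_error, alpha=0.10, power=0.80):
    """Smallest true effect a two-sided test would detect with the
    given power, under a normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * std_error


def confidence_interval(estimate, std_error, level=0.90):
    """Two-sided interval around a point estimate."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical test read-out, in percentage points of lift.
observed_lift = 16.0
std_error = 4.8

mde = minimum_detectable_effect(std_error)
low, high = confidence_interval(observed_lift, std_error)
powered = observed_lift >= mde
```

With these made-up inputs the MDE comes out near 12 points and the interval runs from roughly 8 to 24 points. That is enough to say the lift is probably positive, but the interval is wide enough that the size of the budget move should stay modest.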

Collinearity diagnostics

Multicollinearity is one of the hardest problems in MMM. A VIF check does not solve it, but it flags when the model is trying to separate signals that are too entangled to read cleanly. Ask whether they ran it, what it showed, and what they changed as a result. When channels cannot be told apart from the data alone, the honest answer is to say so, not to publish a precise-looking split.
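For intuition, here is a minimal sketch of a VIF check in the simplest case of exactly two channels, where the VIF for both reduces to 1 / (1 - r^2) with r the correlation between their spend series. With more channels, each one is regressed on all the others instead. The function names and spend figures are hypothetical, and the common rule-of-thumb thresholds of 5 or 10 are conventions, not hard limits.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_channels(x, y):
    """Variance inflation factor when the model has exactly two
    predictors. Undefined if the two series are perfectly correlated."""
    r = pearson_r(x, y)
    return 1 / (1 - r ** 2)


# Hypothetical weekly spend, in thousands.
meta = [10, 12, 14, 16, 18, 20]
youtube = [5, 6.2, 6.9, 8.1, 9, 10.1]   # scaled up alongside Meta
search = [7, 3, 8, 4, 9, 5]             # moved independently

vif_meta_youtube = vif_two_channels(meta, youtube)
vif_meta_search = vif_two_channels(meta, search)
```

In this toy data Meta and YouTube were scaled together, so their VIF is far above any usual threshold and a separate ROI for each would not be cleanly identified. Meta and search moved independently, so their VIF sits close to 1.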

Residual diagnostics

A model can post strong fit metrics and still be misspecified. The residuals, the differences between predicted and actual, are where that shows. When the errors cluster, the model is missing structure: a holiday it does not capture, a promotion it overpredicts, a price change it never adjusts for.

Ask whether they examined the residuals and what they found. A defensible answer describes a problem they caught and corrected. "The R-squared was high" is not that answer.
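Two simple residual checks can be sketched as follows: whether the misses come in runs, and whether they concentrate on known events such as promotions. This is a minimal illustration with hypothetical data and function names. The Durbin-Watson statistic is a standard lag-1 autocorrelation check, and values well below 2 suggest errors that cluster.

```python
def residuals(actual, predicted):
    # Convention assumed here: actual minus predicted. Neither check
    # below depends on which sign convention is used.
    return [a - p for a, p in zip(actual, predicted)]


def durbin_watson(res):
    """Near 2 suggests no lag-1 autocorrelation; well below 2 suggests
    the errors come in runs rather than at random."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    den = sum(e ** 2 for e in res)
    return num / den


def mean_residual_by_flag(res, flags):
    """Average miss on flagged weeks (for example promo weeks) versus
    all other weeks."""
    on = [e for e, f in zip(res, flags) if f]
    off = [e for e, f in zip(res, flags) if not f]
    return sum(on) / len(on), sum(off) / len(off)


# Hypothetical weekly revenue; weeks 5 and 6 had a promotion the
# model does not know about.
actual = [100, 102, 98, 101, 140, 138, 99, 100]
predicted = [101, 101, 99, 100, 105, 104, 100, 101]
promo = [False, False, False, False, True, True, False, False]

res = residuals(actual, predicted)
dw = durbin_watson(res)
promo_miss, other_miss = mean_residual_by_flag(res, promo)
```

Here the model misses the two promo weeks by about 35 each while staying within 1 elsewhere, which is exactly the kind of clustered error that headline fit metrics can hide.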

The path from model to budget

This is where most engagements quietly fail. You receive a lift number and a slide, and nothing changes, because no one connected the analysis to a decision.

A finished recommendation states what to move, by how much, and with how much confidence. Not "Meta is 2.9x incremental ROAS," which is a number, but "Meta is estimated at 2.9x incremental ROAS, enough confidence to raise spend 15 percent next month but not enough to double it, while branded search is over-credited by platform reporting and should hold or fall first." That is a decision.
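One way to picture the step from estimate to decision is a rule that reads the interval, not just the point estimate. The sketch below is purely illustrative: the function name `budget_call`, the break-even threshold, the 15 percent step, and the interval values are all assumptions, and a real recommendation would weigh far more context than this.

```python
def budget_call(ci_low, ci_high, breakeven, step_pct=15):
    """Map an incremental-ROAS interval to a cautious budget action.

    Illustrative rule only: act when the whole interval sits on one
    side of break-even, otherwise hold and gather more evidence.
    """
    if ci_low > breakeven:
        return f"increase spend {step_pct} percent, then re-test"
    if ci_high < breakeven:
        return f"cut spend {step_pct} percent, then re-test"
    return "hold spend and run a better-powered test"


# Hypothetical intervals against a hypothetical break-even of 1.5x.
clear_winner = budget_call(ci_low=1.8, ci_high=4.0, breakeven=1.5)
ambiguous = budget_call(ci_low=0.9, ci_high=2.6, breakeven=1.5)
```

The design choice is that an interval straddling break-even produces "hold and re-test" rather than a confident move in either direction.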

This is also where transparency separates measurement from theater. A serious partner can show evidence: a public benchmark, an anonymized validation report, or the diagnostics behind your own model. Stella's benchmark across 225 DTC incrementality tests run between August 2024 and December 2025 puts the median test at 2.31x iROAS, with the middle 50 percent between 1.36x and 3.24x, and 88.4 percent reaching significance at 90 percent confidence or higher. That is a self-selected set of our own tests, not an industry average, and we say so. The value is not that it predicts your result. It is that you can see the numbers, check the method, and decide whether to trust the evidence. Hold any partner to the same standard. Not the same benchmark, the same willingness to show the receipt.

Weak answer vs strong answer

What you ask	A weak answer	A strong answer
What was your out-of-sample MAPE?	"The model fit was strong."	"MAPE was 8.7 percent on a four-week holdout window, and we beat a naive seasonal baseline."
What was your out-of-sample R-squared?	"The R-squared was 0.98."	"Out-of-sample R-squared was 0.89. We treat unusually high values as possible overfit and check them against holdout behavior."
Did you check collinearity?	"The model accounts for that."	"Meta and YouTube were highly correlated, so we did not report separate ROI as if they were cleanly identified."
Was the test powered?	"It reached significance."	"The MDE was 12 percent, the observed lift was 16 percent, and the interval was narrow enough to support a 15 percent budget move."
What should we do next?	"Meta is 2.9x iROAS."	"Increase Meta 15 percent, hold branded search, and re-test YouTube before scaling."

Illustrative. These show the shape of a defensible answer, not specific test results.

What are the red flags to watch for?

The dangerous vendors look polished, not sloppy. Watch for in-sample fit presented as proof, a suspiciously perfect R-squared, a single number with no interval, clean per-channel ROI for channels that obviously move together, and an engagement that ends in a deck instead of a budget decision.

Here is the full list to keep next to you on a sales call.

In-sample fit presented as proof. If the headline number is an R-squared on the data the model trained on, that is a memory test, not a prediction.
A suspiciously perfect fit. An in-sample R-squared of 0.98 often signals a model that has memorized noise and will break on new data, not one that is brilliant.
One number, no interval. A point estimate with no confidence interval or minimum detectable effect is half an answer dressed as a whole one.
Clean ROI for channels that move together. Separate, precise numbers for Meta and YouTube with no mention of collinearity is precision they have not earned.
A deck instead of a decision. If the work ends at "here is the lift" with no budget recommendation and no uncertainty, nothing will change.
Certainty about everything. A partner who never says "we cannot identify this from the data alone" has not stress-tested anything.

None of these mean the vendor is dishonest. They mean you cannot check the work, which for a decision this expensive is the same problem.

When is a measurement consultancy actually worth it?

When the stakes are high and the answer is not obvious. Reallocating a seven-figure budget, settling a Meta-versus-Google dispute, proving incremental lift to a skeptical board. When being wrong is expensive, defensible measurement is cheap by comparison. If your spend is small and your mix is simple, you may not need it yet.

Run the math on being wrong. Move a million dollars of spend on a hunch that is off by 20 percent and you have misplaced 200,000 dollars. An engagement costs a fraction of that.
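The same arithmetic as a small sketch, so you can plug in your own figures. The function name is hypothetical, and the 10,000 fee is an assumed placeholder for illustration, not a quote.

```python
def misallocation_cost(budget_moved, error_pct):
    """Dollars misplaced when a budget move is off by error_pct percent."""
    return budget_moved * error_pct / 100


cost_of_error = misallocation_cost(1_000_000, 20)

# Assumed engagement fee, for comparison only.
assumed_fee = 10_000
fee_share = assumed_fee / cost_of_error
```

With these inputs the misplaced spend is 200,000 dollars and the assumed fee is 5 percent of that.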

It settles arguments, too. Most growth teams have the Meta-versus-Google standoff, where both platforms claim the same conversions and no one can prove who drove the sale. A controlled test turns that from opinion into evidence, for the question tested, in the period tested, with the constraints stated plainly.

And it is worth it when you have to defend the number to someone who does not trust you. A board, a CFO, a new CMO. "The platform says so" does not survive that room. "We ran a controlled test, here are the results, here is how we validated them, and here is what we still do not know" does.

If you are spending six figures or more across several channels, some of that money is likely over-credited, under-measured, or misallocated. Measurement is how you find it.

Frequently asked questions

What is the difference between a measurement consultancy and an agency? An agency buys and manages your media, while a measurement consultancy tells you whether that media actually worked. When the team buying the media also grades it, the conflict is built in, so hold its validation to the same standard you would ask of anyone else.

How long does a measurement engagement take? A single incrementality test usually runs three to six weeks once live, and across Stella's 225-test benchmark, durations ranged from 20 to 59 days with a median of 33. A full MMM build takes longer, and anyone promising a credible causal answer in days is offering direction, not measurement.

Can a consultancy work with my existing attribution tools? Yes, most integrate with your ad platforms, analytics, and warehouse rather than replacing them. The work corrects what last-click and platform-reported ROAS over-credit, so your attribution tools stay useful for operations but stop being the final judge of incrementality.

Do I need a big budget to justify measurement? Not as large as people assume. If you are spending six figures a year across more than two channels, the cost of misallocating that spend is usually greater than the cost of measuring it.

What is the difference between incrementality testing and MMM? Incrementality testing estimates the causal lift of one channel through a controlled experiment, while MMM models every channel at once from historical data. Tests are precise but narrow and MMM is broad but assumption-dependent, so strong programs use both.

See what defensible measurement looks like

If a partner cannot show you the receipt, you do not have measurement. You have a guess with good design.

Book a demo and we will walk you through the validation behind our numbers before you commit to anything.