Random Control, Random Results: Standards For Sales-Lift Measurement Needed

“Data-Driven Thinking” is written by members of the media community and contains fresh ideas on the digital revolution in media.

Today’s column is written by Vijoy Gopalakrishnan, principal at the IRI Media Center of Excellence.

This scenario happens all too often in the digital advertising industry: An agency has just conceived and executed a brilliant campaign for a brand. The ROAS is off the charts; there is a measurable uplift in sales and market share.

The agency was confident that it would get kudos and land a new, juicy assignment, but it ended up getting fired instead because the campaign failed to meet the metrics the brand had in place.

What happened?

There is a saying in life: Measure twice, cut once.

But there is a different saying in the information services industry: Never measure the same thing twice – you may get different results!

Individually, each of us does a good job of measuring things once and in a consistent manner, but collectively we do not. Each partner in the ecosystem, be it the agency, marketer or researcher, brings a different lens to a problem, which can result in different approaches to analyzing a given campaign.

While big data and analytics are typically a blessing, it is critical to apply the right data and analytics to gain a true picture of a product’s market position and the impact of a campaign. Equally important, all partners in the ecosystem must agree on measurement decision points.

Sales lift, in particular, has become a critical metric for gauging the effectiveness of advertising as we expand from measuring click-through rates and impressions to also measuring incremental purchase behavior. Central to the calculus of measuring sales lift is an experimental design, which by definition requires a test group and a control group.
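As a minimal sketch of the arithmetic involved, sales lift can be expressed as the relative difference in mean spend between the test and control groups. The function name and the spend figures below are hypothetical and purely illustrative; real studies typically add matching, weighting and significance testing on top of this.

```python
from statistics import mean


def sales_lift(test_spend, control_spend):
    """Relative lift of the test group's mean spend over the control group's."""
    control_mean = mean(control_spend)
    if control_mean == 0:
        raise ValueError("control mean is zero; relative lift is undefined")
    return (mean(test_spend) - control_mean) / control_mean


# Hypothetical per-household spend during the campaign period, in dollars.
test = [12.0, 0.0, 8.0, 20.0]     # mean 10.0
control = [10.0, 0.0, 6.0, 16.0]  # mean 8.0

print(f"{sales_lift(test, control):.1%}")  # 25.0%
```

The number is only as trustworthy as the control group behind it: swap in a differently constructed control and the same test group yields a different lift.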

Among marketers, that’s where agreement seems to end. The devil is always in the details as to what constitutes a “true” control, and these details make all the difference in measuring a campaign’s magnitude of success or lack thereof.

By design, a control group and test group are “identical” in every way except that the test group is subjected to the treatment, or exposure to ads, in our case, while the control group is not. Do you remember when Bill Clinton famously said, “It depends upon what the meaning of the word ‘is’ is”? Similarly, how we define “identical” is where practitioners diverge.

Different sets of matching variables produce different sales-lift outcomes. In Figure 1, only the pre-period matching time frame has been changed to illustrate the point.

Figure 1: Lift results for five sub-sample iterations using 52 weeks, 13 weeks and one year ago match

If changing just one variable has such an impact, you can imagine the effect that differences in the selection of multiple matching variables can produce.

We do know from past research that context matters in matching decisions. When analyzing sales lift, which primarily combines an evaluation of media exposure and consumer spend, it stands to reason that any predictors of media exposure and consumer spend would naturally be the superset from which to select our variables.

Variables that are top of mind for matching include:

  • Selection of pre-campaign period
  • Purchase behavior characteristics
  • Demographics
  • Media behavior
  • Geography

Given the impact these variables have on measurement, standardization therefore is de rigueur to help bring clarity to the discussion as results are compared.

Such standardization could lead to a different outcome for the agency mentioned at the beginning of this column. During strategy sessions, brands and their agencies should outline not only the metrics for their ad campaigns, but also the matching variables for the analyses of the control groups. All partners should understand that it is important to be consistent and relevant, so that they select a methodology in a standardized manner that is tailored to the campaign objectives but independent of the campaign outcomes.
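One way to make “agreed before the campaign, independent of the outcome” concrete is to write the matching specification down as an immutable record and share a fingerprint of it among the parties. The sketch below is an illustration only: the `MatchingSpec` class, its field names and the example values are assumptions, not an industry standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: the spec cannot be edited after it is agreed
class MatchingSpec:
    pre_period_weeks: int
    matching_variables: tuple
    primary_metric: str

    def fingerprint(self):
        """Stable hash that the brand, agency and researcher can each record."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()


# Hypothetical spec agreed during the strategy session.
spec = MatchingSpec(
    pre_period_weeks=52,
    matching_variables=("purchase_behavior", "demographics",
                        "media_behavior", "geography"),
    primary_metric="sales_lift",
)
print(spec.fingerprint()[:12])
```

If anyone later reruns the analysis with, say, a 13-week pre-period, the fingerprint no longer matches, which makes a post hoc change of methodology visible.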

By going that extra mile, brands create a single point of truth against which to measure their ad campaigns. That extra step ensures that all parties stay on the same page before, during and after campaigns.

Follow IRI (@iriworldwide) and AdExchanger (@adexchanger) on Twitter.



  1. In line with these recommendations, it’s important that marketers have transparent access to data from lift tests as an audit mechanism. It’s all too easy to skew results towards the test by “negatively optimizing” the control. Yet another reason that Bill Duggan’s recommendation that brands ask for log data makes complete sense!

    • Agreed – transparency is the best guard rail to increase credibility of measurement.


    Using synthetic controls will always lead to problems. Hoping that your modeling can accurately reproduce a low-single-digit lift is wishful thinking – one will never be able to account for all counterfactuals. Moreover, these studies often use the wrong labels: matching exposed and unexposed users should be done by the probability of being exposed, not by some abstract definition of “identical”.

    There is a simple solution: conduct randomized controlled experiments (A/B testing) where possible. This is the gold standard for measuring incremental sales, and every marketer should strive to find partners that can conduct properly designed A/B tests in which Test and Control users are randomly selected prior to treatment. Multiple platforms now offer Ghost Ads. When comparing exposed users with “would have been exposed” users is not possible and PSA studies are prohibitively expensive, at least randomly split the audience into Test and Holdout groups. The universe of unexposed users will then be limited to the Holdout group, which is statistically identical to the Test group. If measuring the lift on the entire Test and Holdout groups does not produce a statistically significant result, one should model exposed users in Test and apply the model to the Holdout group to find a control consisting of “would have been exposed” users. This will result in a much more accurate semi-synthetic control.
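A rough sketch of the random Test/Holdout split and the whole-group significance check described in this comment might look as follows. The function names, the salt string and the 10% holdout share are assumptions chosen for illustration, and the p-value relies on a normal approximation that is only reasonable for large groups.

```python
import hashlib
import math
from statistics import mean, variance


def assign_group(user_id, campaign_salt, holdout_share=0.1):
    """Assign a user to 'holdout' or 'test' before any ad is served.

    Hashing the salted user ID gives a stable, effectively random split
    that does not depend on user behavior or campaign outcome.
    """
    digest = hashlib.sha256(f"{campaign_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "holdout" if bucket < holdout_share else "test"


def lift_z_test(test_spend, holdout_spend):
    """Difference in mean spend, z statistic and two-sided p-value."""
    diff = mean(test_spend) - mean(holdout_spend)
    se = math.sqrt(variance(test_spend) / len(test_spend)
                   + variance(holdout_spend) / len(holdout_spend))
    if se == 0:
        raise ValueError("zero variance in both groups; z statistic undefined")
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return diff, z, p_value


# Hypothetical usage: split 10,000 user IDs, then compare observed spend.
groups = [assign_group(uid, "campaign-salt") for uid in range(10_000)]
print(groups.count("holdout") / len(groups))
```

Because assignment happens before treatment and ignores everything about the user except a hashed ID, any difference between the two groups beyond chance can be attributed to the campaign.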

    • Excellent points, Vadim. A/B testing is definitely the better option where we can implement it. There is still a place for synthetic controls, especially for past campaigns (where they are the only option) and for open web campaigns, where building an A/B test, while possible, requires intense logistics.


        I agree. But even when logistics do not allow comprehensive A/B testing, there should be a concerted effort by everyone involved in sales lift measurement to make measurement as clean as possible. When doing any modeling, one has to remember what Dorn proposed more than 70 years ago. He stated that the designer of every observational study should ask “[h]ow would the study be conducted if it were possible to do it by controlled experimentation?”

        It should not be difficult for the marketer or its agency to set up a holdout group for every campaign; users within this group will not be targeted and hence will not be exposed to this particular campaign. This constrains the universe of unexposed users to a group identical to the pre-campaign Test group. Then, one would need to find users in the Holdout group who are most similar to the Test users actually treated by the ads. One can model the Test users actually exposed to the campaign ads against those in Test who were not exposed, and apply this model to the Holdout group to find the “would have been exposed” users in Control. With any modeling, the goal should be to find unexposed users who had the same probability of being exposed as the exposed users in the Test group. This approach will remove much of the selection bias present in most observational studies.
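The approach in this comment can be sketched with a deliberately simple exposure model: the share of Test users exposed within each stratum. The strata, function names and spend figures below are hypothetical, and a real study would more likely fit a richer model (for example a logistic regression on many covariates). Holdout users are then weighted by their estimated probability of exposure to form the “would have been exposed” control.

```python
from collections import defaultdict


def exposure_propensity(test_users):
    """Share of Test users actually exposed, per stratum.

    test_users: iterable of (stratum, exposed, spend) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [exposed, total]
    for stratum, exposed, _ in test_users:
        counts[stratum][0] += int(exposed)
        counts[stratum][1] += 1
    return {s: e / n for s, (e, n) in counts.items()}


def exposed_lift(test_users, holdout_users):
    """Lift of exposed Test users over propensity-weighted Holdout users."""
    propensity = exposure_propensity(test_users)
    exposed_spend = [spend for _, exposed, spend in test_users if exposed]
    exposed_mean = sum(exposed_spend) / len(exposed_spend)
    # Each holdout user counts in proportion to the chance that a similar
    # Test user was exposed.
    num = sum(propensity.get(s, 0.0) * spend for s, spend in holdout_users)
    den = sum(propensity.get(s, 0.0) for s, _ in holdout_users)
    control_mean = num / den
    return (exposed_mean - control_mean) / control_mean


# Hypothetical data: heavy buyers are more likely to be exposed, and the
# campaign raises spend by 10% in both strata.
test_users = [("heavy", True, 22.0)] * 3 + [("heavy", False, 20.0)] \
           + [("light", True, 11.0)] + [("light", False, 10.0)] * 3
holdout_users = [("heavy", 20.0)] * 4 + [("light", 10.0)] * 4

print(round(exposed_lift(test_users, holdout_users), 3))  # 0.1
```

In this toy data, comparing exposed users against the unweighted Holdout (mean spend 15.0) would suggest a lift of about 28%, because exposed users skew toward heavy buyers. Weighting by exposure probability recovers the 10% effect that was built into the example.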