How to use Causal Inference when A/B testing is not available | by Harry Lu | Jan, 2024


Evaluating ad targeting product using causal inference: propensity score matching!

Harry Lu
Towards Data Science
Photo by Tech Daily on Unsplash

Ever caught those pumped-up Nike Ads while tuning in to a podcast recapping last night’s epic NBA showdown? Or how about stumbling upon New Balance ads mid-sneaker review extravaganza on YouTube? That’s the magic of contextual targeting — the matchmaking maestro connecting content and ads based on the vibe of the moment! Say goodbye to ad awkwardness and hello to tailored ad experiences that’ll make you do a happy dance. Picture this: “Would you rather groove to Nike ads in a basketball podcast or spice things up in a politics podcast?”

As tech giants ramp up their investment in protecting user privacy, the old-school behavior targeting (you know, the one that relies on IP addresses and user devices) might find itself in a sticky situation. With fewer cookies and mysterious IP addresses lurking around, it’s like the wild west out there for traditional targeting!

Let’s spice up the measurement game for contextual products. Usually, it’s all about the advertisers, and we’re talking about the typical success metrics: advertiser adoption, retention, referrals, and that sweet, sweet ad revenue. But here’s where the plot thickens: my hypothesis is that providing more relevant ads turns the ad experience into a joyride. Picture this: fewer context switches during ads mean users can stay in a similar context without missing a beat.

However, it’s not easy to run an A/B test to see how users react to contextual targeting products. Why? When advertisers buy contextual targeting for their ads, it’s not just about contextual targeting: they also use other targeting options in the same campaign, so we cannot randomly assign contextual targeting as a treatment. Therefore, randomizing users into two groups is not possible.

Enter the superhero of alternatives: Causal Inference! When A/B testing is not possible because you can’t shuffle users like a deck of cards, we turn to historical data with causal inference!

In this blog post, I will go over how to evaluate ad targeting products using causal inference. So, buckle up if you:

  1. Navigate a domain where A/B testing is not ready yet, whether it’s unethical, costly, or downright impossible.
  2. Tread the thrilling waters of the Ad/Social domain, where the spotlight is on how an ad gets cozy with a specific user and their content.

It’s important to design causal inference research by setting up a hypothesis and metrics first!

Hypothesis: We believe users are more engaged when hearing an ad that was delivered through contextual targeting, and we plan to measure this via ad completion rate (the higher the better) and off focus skip (the lower the better).

Metrics: We started with Ad Completion Rate, a standard metric in the ad space. However, this metric is noisy, so we finally chose Off Focus Skip as our metric.

Our Experiment Unit: users from a 90-day window who received either a treatment ad or a control ad (we filtered out users who received both). It is worth mentioning that we also ran the analysis at the impression level, so we did both.

Population: We collected a 90-day window of users/impressions.

Photo by Eddie Pipocas on Unsplash

We will use Propensity Score Matching (PSM) in this research, because we have two groups of samples and need to synthesize the randomization we could not get from an experiment. You can read more about PSM here, and my summary of PSM is: let’s have our samples find pairs between control and treatment, and then measure the average delta within each pair so that we can attribute any difference we find to the treatment. So let’s start preparing the ingredients for our PSM model!

There are many things that could impact users’ ad experience, and here are the three categories:

  1. User Attribute (e.g., Age / Gender / LHR)
  2. Advertiser Attribute (e.g., Company Past Ad Spending)
  3. Publisher Attribute (e.g., Company Past Ad Revenue / Content Metadata)

We believe controlling for the attributes above isolates the treatment effect of contextually targeted ads versus non-contextually targeted ads. Below is a sample data frame to help understand what the data could look like!

Image by the author: user attribute, treatment, and user engagement (y)

Using logistic regression as an example: when the treatment (exposure) status is regressed on the observed characteristics (covariates), we get a predicted probability that a user is in the treatment group. This number is what we then use to match pairs between treatment and control. Note that you could also use another classifier of your choice! In the end, what you need is a classifier that scores your users, so we can match them accordingly in the next steps.

Y = Treatment [0, 1]
X = User Attributes + Advertiser Attributes + Publisher Attributes

Image by the author: the dataframe now has a new field ps_score from our classifier model.
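To make this step concrete, here is a minimal sketch of fitting a propensity model with scikit-learn. Everything in it is an assumption for illustration: the data is synthetic, and the column names (user_age, advertiser_past_spend, publisher_past_revenue) are hypothetical stand-ins for the three attribute categories above, not the real features from my project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Synthetic, made-up covariates standing in for user / advertiser / publisher attributes.
df = pd.DataFrame({
    "user_age": rng.integers(18, 65, size=n),
    "advertiser_past_spend": rng.normal(0.0, 1.0, size=n),
    "publisher_past_revenue": rng.normal(0.0, 1.0, size=n),
})

# Simulated treatment assignment that depends on the covariates (so it is NOT random).
logit = 0.03 * (df["user_age"] - 40) + 0.8 * df["advertiser_past_spend"]
prob_treated = 1.0 / (1.0 + np.exp(-logit.to_numpy()))
df["treatment"] = rng.binomial(1, prob_treated)

# Y = treatment, X = covariates, exactly as in the model spec above.
covariates = ["user_age", "advertiser_past_spend", "publisher_past_revenue"]
model = LogisticRegression(max_iter=1000)
model.fit(df[covariates], df["treatment"])

# Propensity score = predicted probability of being in the treatment group.
df["ps_score"] = model.predict_proba(df[covariates])[:, 1]
print(df[["treatment", "ps_score"]].head())
```

Any classifier that outputs a probability would slot in the same way; logistic regression is just the easiest one to explain.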

If we plot the distributions of the PS score for the two groups, we will see two overlapping distributions, as my drawing shows below. The PS score distribution will likely look different in the two groups, and that is expected! What we want to compare apples-to-apples is the “matched” area where the two distributions overlap.

Image by the author: distributions of ps score between treatment and control groups.
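One simple way to put numbers on that overlapping area is to compute the range of scores that both groups share (often called the common support). The sketch below assumes made-up scores drawn from two beta distributions; the shapes are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up propensity scores: treated users skew high, control users skew low.
ps_treated = rng.beta(4, 2, size=500)
ps_control = rng.beta(2, 4, size=800)

# Common support: the score range that is covered by BOTH groups.
low = max(ps_treated.min(), ps_control.min())
high = min(ps_treated.max(), ps_control.max())

in_support_treated = (ps_treated >= low) & (ps_treated <= high)
in_support_control = (ps_control >= low) & (ps_control <= high)

print(f"common support: [{low:.3f}, {high:.3f}]")
print(f"treated inside support: {in_support_treated.mean():.1%}")
print(f"control inside support: {in_support_control.mean():.1%}")
```

Samples outside this range have no comparable counterpart in the other group, which is why they tend to drop out in the matching step.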

Once we have assigned each user a propensity score, we match pairs between the treatment and control groups. In the example here, we start to see pairs being formed. Our sample size will also start to change, as some samples may not find a match. (P.S. Use the psmpy package if you are working in Python.)

Image by the author: the data frame has a new column suggesting the pairing between treatment and control groups.
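To show what the pairing does under the hood, here is a hand-rolled sketch of greedy 1:1 nearest-neighbor matching without replacement. This is not the psmpy API; greedy_match and the caliper value are hypothetical choices made for illustration, and the scores are made up.

```python
import numpy as np

def greedy_match(ps_treated, ps_control, caliper=0.05):
    """Pair each treated unit with the closest unused control unit within `caliper`."""
    available = np.ones(len(ps_control), dtype=bool)
    pairs = []
    for t_idx in np.argsort(ps_treated):  # fixed order keeps the result reproducible
        if not available.any():
            break
        dist = np.abs(ps_control - ps_treated[t_idx])
        dist[~available] = np.inf  # controls that are already matched cannot be reused
        c_idx = int(np.argmin(dist))
        if dist[c_idx] <= caliper:
            pairs.append((int(t_idx), c_idx))
            available[c_idx] = False
    return pairs

# Made-up propensity scores for four treated and five control users.
ps_treated = np.array([0.80, 0.55, 0.30, 0.95])
ps_control = np.array([0.28, 0.50, 0.52, 0.10, 0.78])

pairs = greedy_match(ps_treated, ps_control, caliper=0.05)
print(pairs)  # (treated index, control index) for every pair that was formed
print(f"unmatched treated users: {len(ps_treated) - len(pairs)}")
```

The treated user with a score of 0.95 has no control user within the caliper, so it goes unmatched. That is the sample-size shrinkage mentioned above.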

Once we have matched the two groups, their user attributes will look more similar than before! That is because the users that could not be matched are removed from both groups.
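A common way to check this similarity is the standardized mean difference (SMD) of each covariate before and after matching. The sketch below uses made-up ages; a frequently cited rule of thumb treats an absolute SMD below roughly 0.1 as well balanced, but take that as a guideline rather than a hard rule.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """Difference in group means, scaled by the pooled standard deviation."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Made-up user ages before matching: the control group is clearly older.
age_treated_before = [25, 30, 35, 40, 45]
age_control_before = [35, 45, 50, 55, 60]

# Made-up ages after matching: the users without a counterpart were dropped.
age_treated_after = [35, 40, 45]
age_control_after = [35, 45, 50]

smd_before = standardized_mean_difference(age_treated_before, age_control_before)
smd_after = standardized_mean_difference(age_treated_after, age_control_after)
print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
```

In this toy example the gap shrinks but does not vanish, which would be a hint to tighten the caliper or revisit the propensity model.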

Now that we have matched them based on the PS, we can start our measurement work! The main calculation is essentially the one below:

MEAN(Treatment Group Y var) - MEAN(Control Group Y var) = Treatment Effect

We will then have a treatment effect estimate that we can test for statistical significance and practical significance. By pairing up the users and calculating the average delta across pairs, we measure the treatment effect.
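Here is a minimal sketch of that calculation on matched pairs, with a paired t-test from SciPy as one possible significance check. The outcome values are simulated, and the lift of 0.03 is a number I made up for the example, not a result from my project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_pairs = 400

# Simulated outcome per matched pair (for example, share of ads completed).
# Row i of y_treated is paired with row i of y_control.
y_control = rng.normal(0.60, 0.10, size=n_pairs)
y_treated = y_control + rng.normal(0.03, 0.10, size=n_pairs)  # made-up lift of 0.03

# Average delta across pairs; with 1:1 pairs this equals
# MEAN(treatment Y) - MEAN(control Y).
pair_deltas = y_treated - y_control
treatment_effect = pair_deltas.mean()

# Paired t-test: is the average delta distinguishable from zero?
t_stat, p_value = stats.ttest_rel(y_treated, y_control)

print(f"treatment effect: {treatment_effect:.4f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

Statistical significance is only half of the story: whether a lift of that size matters for the product is the practical significance question.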

So if everything is set up correctly so far, we have measured the treatment effect between the two groups. But it is critical to know that causal inference carries a higher risk of missing confounding variables or other potential causes that we did not think of. So to further validate our research, let’s run an AA test!

An AA test is a test where, instead of using the true treatment, we randomly assign a “fake” treatment to our data and conduct the causal inference again. Because it is a fake treatment, we should not detect any treatment effect! Running an AA test works as a good code review and also helps ensure that our process minimizes bias (when the true treatment effect is 0, we detect 0).
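The idea can be sketched in a few lines. This simplified version only re-runs the final difference in means on simulated outcomes with a coin-flip treatment; in a real AA test you would push the fake treatment through the whole pipeline (propensity model, matching, and effect calculation). The function name fake_treatment_effect is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulated outcome for users who received no real treatment at all.
y = rng.normal(0.60, 0.10, size=n)

def fake_treatment_effect(y, rng):
    """Assign a coin-flip 'treatment' and return the difference in group means."""
    fake_treatment = rng.integers(0, 2, size=len(y))
    return y[fake_treatment == 1].mean() - y[fake_treatment == 0].mean()

# Repeat the fake assignment many times to see how the estimate behaves.
effects = np.array([fake_treatment_effect(y, rng) for _ in range(200)])

print(f"mean fake effect: {effects.mean():.5f}")
print(f"spread (std) of fake effects: {effects.std(ddof=1):.5f}")
```

If the fake effects do not center around zero, something in the pipeline is introducing bias and is worth debugging before trusting the real estimate.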

Once we complete our AA test without detecting a treatment effect, we are ready to communicate the insight to engineering / product management! For my project, I ended up publishing my work and sharing it on a company-wide insight forum as the first causal inference work to measure Spotify podcast ad targeting.

This blog post walks through every step of using causal inference to evaluate an ad targeting product that is hard to experiment on due to limitations in randomization: determining the causal relationship, assigning users a propensity score, matching the users, calculating the treatment effect, and sanity-checking the result. I hope you find this article helpful, and let me know if you have any questions!

P.S. Due to confidentiality, I am not allowed to share the test results for Spotify’s Contextual Targeting product specifically, but you can still use this blog post to build up your own causal inference!