Understanding the data and the problem

import pandas as pd

df = pd.read_csv('ab_data.csv')
df.head()

user_id  timestamp                   group    landing_page  converted
851104   2017-01-21 22:11:48.556739  control  old_page      0
  • Analysing the results data from a previously implemented A/B test.
  • The exact experiment design is unknown, but one can make assumptions based on the results.

  • Five columns: user_id, timestamp, group, landing_page, converted (a quick check of these is sketched after the list below)

where:

  1. timestamp is the datetime at which the user clicked on the call-to-action
  2. user_id is the id assigned to the user based on their account id
  3. group states which of the two treatment groups the user was assigned to: {treatment, control}
  4. landing_page is the variable that differs between the groups and hence defines the two user experiences
    • The A variant is the old_page
    • The B variant is the new_page
  5. converted is a flag indicating whether the user clicked on the call-to-action
    • 1: yes
    • 0: no
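A quick structural check of the loaded data can confirm this; a minimal sketch, with the expected values taken from the description above:

# Confirm the columns, dtypes and the categorical values we expect
df.info()
print(df['group'].unique())         # expect: control, treatment
print(df['landing_page'].unique())  # expect: old_page, new_page
print(df['converted'].unique())     # expect: 0, 1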

In a real-world scenario, the experiment would be designed, executed, and then the results analysed. However, since we already have the results, we will be working backwards for learning purposes.

Example Scenario:

The current conversion rate for the old_page (variant A) is 13% on average throughout the year; this is our baseline conversion rate. The product manager says to aim for a 2% uplift: if the conversion rate reaches 15%, the new product will be launched.

Designing the experiment

As conversion is a binary metric (1 or 0), a two-sample z-test on population proportions can be used. The statistics are straightforward: we don't need to bootstrap to construct confidence intervals.

As the randomisation unit of the experiment was at page-level, I assume the analysis unit of the conversion rate is also at page-level due to the independence assumption between units.

conversion rate = sum(converted)/number of pageviews
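In pandas this is simply the mean of the converted flag per group; a minimal sketch using the dataframe loaded above:

# Conversion rate per group: sum(converted) / number of pageviews
conversion_rates = df.groupby('group')['converted'].mean()
print(conversion_rates)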

  • Null hypothesis: the conversion rates of the old and new design are the same
  • Alternative hypothesis: the conversion rates are different
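For reference, the pooled two-proportion z-test that is run later (via statsmodels' proportions_ztest) can be sketched by hand roughly as follows; the helper function name and argument names are my own:

import numpy as np
from scipy.stats import norm

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    # Pooled conversion rate under the null hypothesis of equal rates
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (success_a / n_a - success_b / n_b) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    return z, p_value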

A power analysis is conducted to calculate the minimum sample size required to detect our 2% uplift.

  • Statistical power of the test: the probability of detecting a statistical difference between the groups when a difference is actually present; 0.8 is commonly used.
  • alpha: 0.05 is commonly used, giving a (1 - alpha) confidence level of 95%.
  • Minimum effect size: a delta of 2%, which we can calculate programmatically from the 13% (baseline) and 15% (target) conversion rates.

import statsmodels.stats.api as sms
from math import ceil

# Calculating the effect size based on our expected conversion rates
effect_size = sms.proportion_effectsize(0.13, 0.15)

# Solving for the required sample size per group
required_n = sms.NormalIndPower().solve_power(
    effect_size,
    power=0.8,
    alpha=0.05,
    ratio=1
    )

# Rounding up to the next whole number
required_n = ceil(required_n)
print(required_n)
# 4720

We have to round n up to the next whole number (we can't have half a pageview). The ratio parameter is the SRM (sample ratio mismatch) ratio, which should be 1 for a 50/50 univariate experiment; a random sample of 4720 users needs to be assigned to each treatment group.
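As an aside, a sample ratio mismatch check could be done with a chi-square goodness-of-fit test against an even split; this is an illustrative sketch rather than part of the original analysis:

from scipy.stats import chisquare

# Observed pageviews per group vs. an expected 50/50 split
observed = df['group'].value_counts().values
chi2, p = chisquare(observed)  # f_exp defaults to a uniform split
print(chi2, p)                 # a very small p-value would indicate SRM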

pd.crosstab(df['group'], df['landing_page'])

landing_page  new_page  old_page
group
control           1928    145274
treatment       145311      1965

Validity of Results

The control group must only see the old page and the treatment group only the new page, so the mismatched assignments above need to be investigated with the software engineers: a violation of our independence assumption between units will invalidate the results. However, for this exercise we continue. A possible cause is caching on the client side … this will need to be debugged.
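As an illustration (not part of the original write-up), the mismatched rows could be isolated for the engineers to debug like this, using the dataframe and column names above:

# Rows where the assigned group and the page actually served do not match the design
mismatch = df[
    ((df['group'] == 'control') & (df['landing_page'] == 'new_page')) |
    ((df['group'] == 'treatment') & (df['landing_page'] == 'old_page'))
]
print(len(mismatch))  # 1928 + 1965 = 3893 mismatched pageviews, per the crosstab above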

# Count the number of pageviews per user
page_views = df['user_id'].value_counts(ascending=False)

# Users with more than one pageview saw both experiences
users = page_views[page_views > 1].count()
users_to_drop = page_views[page_views > 1].index

# Keep only users with a single pageview
df = df[~df['user_id'].isin(users_to_drop)]

There were 3894 users who experienced both user experiences. The updated dataset now has 286690 entries.
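A quick sanity check of those figures, reusing the variables defined above (illustrative only):

# After dropping users with more than one pageview, no user_id should repeat
assert df['user_id'].duplicated().sum() == 0
print(users)        # 3894 users saw both experiences
print(df.shape[0])  # 286690 remaining entries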


Statistics

# Sample the required number of users from each group (seeded for reproducibility)
control_sample = df[df['group'] == 'control'].sample(n=required_n, random_state=22)
treatment_sample = df[df['group'] == 'treatment'].sample(n=required_n, random_state=22)

# Combine the two samples into a single analysis dataframe
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Conversion outcomes for each group in the sampled dataset
control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)
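The figures reported below can then be printed from these variables, for example:

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')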

  • z statistic: -0.34
  • p-value: 0.732
  • ci 95% for control group: [0.114, 0.133]
  • ci 95% for treatment group: [0.116, 0.135]

We fail to reject the null hypothesis, as the p-value > alpha: the probability of seeing a result this extreme under the null is greater than the type I error rate we are willing to accept.

Looking at the confidence interval for the treatment group ([0.116, 0.135], i.e. 11.6%-13.5%), we notice that:

  • It includes our baseline value of 13% conversion rate
  • It does not include our target value of 15% (the 2% uplift we were aiming for)

There is no strong evidence to reject the null hypothesis.