Exploring a Starbucks Promotional Campaign Strategy

2019-09-25  愤怒的果壳

Background

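This dataset was originally a Starbucks interview assignment. It contains 120,000 data points, split 2:1 into a training file and a test file. The data simulate an experiment that tested a promotional campaign, to see whether the campaign could attract more customers to buy a specific product priced at $10. Because sending each promotion costs the company $0.15, promotions should ideally go only to the most receptive customers. Each data point has one column indicating whether the person was sent the promotion and another indicating whether that person ultimately bought the product. Each person also has 7 additional features, labeled V1-V7.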

Optimization strategy

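Use the training data to learn which patterns in V1-V7 indicate that a user should be sent the promotion. Specifically, the goal is to maximize two metrics: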

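IRR (incremental response rate) measures how many more customers bought the product because of the campaign, compared with not receiving the promotion. Mathematically, IRR is the ratio of purchasers in the promotion group to the total number of customers in the promotion group (treatment), minus the ratio of purchasers in the non-promotion group to the total number of customers in the non-promotion group (control).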

IRR = \frac{purch_{treat}}{cust_{treat}} - \frac{purch_{ctrl}}{cust_{ctrl}}

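NIR (net incremental revenue) measures how much revenue was gained (or lost) by sending the promotion. Mathematically, NIR is 10 times the number of purchasers who received the promotion, minus 0.15 times the number of promotions sent, minus 10 times the number of purchasers who did not receive the promotion.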

NIR = (10\cdot purch_{treat} - 0.15 \cdot cust_{treat}) - 10 \cdot purch_{ctrl}
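To make the two formulas concrete, here is a minimal sketch in plain Python. The function names and the customer counts are made up for illustration; only the $10 price and the $0.15 cost per promotion come from the problem statement.

```python
def irr(purch_treat, cust_treat, purch_ctrl, cust_ctrl):
    """Incremental response rate: treatment purchase rate minus control purchase rate."""
    return purch_treat / cust_treat - purch_ctrl / cust_ctrl


def nir(purch_treat, cust_treat, purch_ctrl, price=10.0, cost=0.15):
    """Net incremental revenue: treated revenue minus promotion cost minus control revenue."""
    return (price * purch_treat - cost * cust_treat) - price * purch_ctrl


# Hypothetical counts: 1,000 customers per group, 30 treated buyers, 10 control buyers.
example_irr = irr(purch_treat=30, cust_treat=1000, purch_ctrl=10, cust_ctrl=1000)
example_nir = nir(purch_treat=30, cust_treat=1000, purch_ctrl=10)

print('IRR: %.4f' % example_irr)  # 30/1000 - 10/1000
print('NIR: %.2f' % example_nir)  # 300 - 150 - 100
```

With these toy numbers the campaign lifts the purchase rate by 2 percentage points and nets $50 after paying for the 1,000 promotions.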

5.jpg

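The metrics are compared over the individuals predicted to be included in the campaign, i.e. quadrants I and II. Because the first group of customers (in the training set) received the promotion at random, quadrants I and II should contain roughly the same number of participants. Comparing quadrant I with quadrant II is enough to tell how well the promotion strategy would perform in the future. In other words, we compute both metrics over the customers predicted to take part in the campaign and try to maximize them.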

Design

Implementation

Loading and inspecting the data

# Import libraries
import numpy as np
import pandas as pd
import scipy as sp
import sklearn as sk

from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
# Load the data
train_data = pd.read_csv('./training.csv')
test_data = pd.read_csv('./Test.csv')
# Inspect the training set
train_data.head()
6.png
train_data.info()
7.png
train_data['purchase'].value_counts()

0  83494
1   1040

train_data['Promotion'].value_counts()

Yes  42364
No  42170

# Inspect the feature distributions
feature_list = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']
# Features of users who received the promotion and purchased
train_data.query('Promotion == "Yes" and purchase == 1')[feature_list].hist(figsize=(12,12));
# Features of users who received the promotion but did not purchase
train_data.query('Promotion=="Yes" and purchase==0')[feature_list].hist(figsize=(12,12));
# Features of users who purchased without receiving the promotion
train_data.query('Promotion=="No" and purchase==1')[feature_list].hist(figsize=(12,12));
# Features of users who neither received the promotion nor purchased
train_data.query('Promotion=="No" and purchase==0')[feature_list].hist(figsize=(12,12));

Features of users who received the promotion and purchased

1.png

Features of users who received the promotion but did not purchase

2.png

Features of users who purchased without receiving the promotion

3.png

Features of users who neither received the promotion nor purchased

4.png
Findings:

Strategy 1

Data preprocessing
# Work on copies of the data
train = train_data.copy()
test = test_data.copy()
from sklearn import preprocessing
# Standardize V2 and V3
train['V2'] = preprocessing.scale(train['V2'])
train['V3'] = preprocessing.scale(train['V3'])
# One-hot encode V1, V4, V5, V6 and V7
dummy_fields = ['V1', 'V4', 'V5', 'V6','V7']
for V in dummy_fields:
    dummies = pd.get_dummies(train[V],prefix =V,drop_first = False)
    train = pd.concat([train,dummies],axis =1)
train = train.drop(dummy_fields,axis=1)
# Label users who purchased after receiving the promotion as 1, everyone else as 0
train['response'] = 0
train.loc[(train['Promotion']=='Yes') & (train['purchase']==1),'response'] = 1
# Split train into a training set and a validation set
from sklearn.model_selection import train_test_split
Train, Valid = train_test_split(train, test_size=0.2, random_state=0)
features = ['V2', 'V3', 'V1_0', 'V1_1', 'V1_2','V1_3', 'V4_1', 'V4_2','V5_1', 'V5_2', 
            'V5_3', 'V5_4', 'V6_1', 'V6_2','V6_3', 'V6_4', 'V7_1', 'V7_2']
X_train,X_valid = Train[features],Valid[features] 
y_train,y_valid = Train['response'],Valid['response']
Train.head(2)
# Inspect the training-set labels
y_train.value_counts()
8.png
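The continuous features V2 and V3 are standardized and the remaining categorical features are one-hot encoded. Under Strategy 1, customers who purchased after receiving the promotion are labeled 1 and everyone else is labeled 0.
Labels: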
y_valid.value_counts()
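One caveat about the scaling step: preprocessing.scale standardizes whatever it is given using that data's own mean and standard deviation, so calling it separately on the training and test files applies slightly different statistics to each. A common alternative is to compute the statistics on the training data once and reuse them. The sketch below shows the idea in plain Python with made-up V2 values; the helper names are hypothetical, and in scikit-learn a StandardScaler fitted on the training data and applied with transform would play the same role.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Return the mean and population standard deviation of the training values."""
    # Population std (ddof=0) matches what preprocessing.scale uses.
    return mean(values), pstdev(values)


def standardize(values, mu, sigma):
    """Standardize values with statistics that were computed elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical V2 values, for illustration only.
train_v2 = [20.0, 30.0, 40.0]
test_v2 = [30.0, 50.0]

mu, sigma = fit_standardizer(train_v2)          # statistics come from training data only
train_scaled = standardize(train_v2, mu, sigma)
test_scaled = standardize(test_v2, mu, sigma)   # test data reuses the training statistics
```

A test value equal to the training mean maps to exactly 0, which would not be guaranteed if the test file were scaled on its own.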
Classifying directly with XGBoost
from xgboost import XGBClassifier
from sklearn import metrics
eval_set_1 = [(X_train, y_train), (X_valid, y_valid)]
model_1 = XGBClassifier(  learning_rate = 0.05,
                          max_depth = 8,
                          min_child_weight = 1,
                          scale_pos_weight = 114, # weight that offsets the label imbalance; 114 is the ratio of label-0 to label-1 samples
                          objective = 'binary:logistic',
                          seed = 42,
                          gamma = 0.1,
                          silent = True,
                          n_jobs = -1,
                          n_estimators = 200
                           )
model_1.fit(X_train, y_train, eval_set=eval_set_1,
          eval_metric="auc", verbose=True, early_stopping_rounds=30)
valid_pred_1 = model_1.predict(X_valid, ntree_limit=model_1.best_ntree_limit)
sk.metrics.confusion_matrix(y_valid, valid_pred_1)
9.png

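A quick assessment of this classification result: on a validation set with a label ratio of 16773:134, picking at random would amount to a coin flip, with each class split evenly, i.e. 8387:8386:67:67. The actual result of 9057:7716:34:100 clearly clears that minimum bar (100 of the 134 positives are recovered, about 75%, versus 50% for a coin flip), which suggests that setting the weight parameter scale_pos_weight = 114 in XGBoost did help offset the imbalance.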
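The four numbers quoted above can be turned into the usual rates. The sketch below assumes they follow scikit-learn's confusion-matrix layout (true negatives, false positives, false negatives, true positives), which is consistent with the 16773:134 label split; the variable names are only for illustration.

```python
# Confusion-matrix counts quoted in the text, read as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 9057, 7716, 34, 100

recall = tp / (tp + fn)                 # share of true responders that were flagged
specificity = tn / (tn + fp)            # share of non-responders correctly left out
precision = tp / (tp + fp)              # share of flagged customers who responded
flagged_share = (tp + fp) / (tn + fp + fn + tp)  # share of customers flagged overall

print('recall:        %.3f' % recall)
print('specificity:   %.3f' % specificity)
print('precision:     %.4f' % precision)
print('flagged share: %.3f' % flagged_share)
```

Recall is well above the coin-flip level, but precision stays close to 1%, so most flagged customers still do not respond. That is why the IRR and NIR computed next are the more meaningful yardsticks here.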

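# Function returning the incremental response rate (IRR) and net incremental revenue (NIR)
def get_irr_nir(y_pred,df_valid=Valid):
    # Keep only the rows predicted as 1, i.e. the customers the model would promote to
    df_pro = df_valid.iloc[np.where(np.asarray(y_pred)==1)[0]]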
    
    cust_tre = df_pro.loc[df_pro['Promotion']=='Yes',:].shape[0]
    cust_con = df_pro.shape[0] - cust_tre
    purch_tre = df_pro.loc[df_pro['Promotion']=='Yes', 'purchase'].sum()
    purch_con = df_pro.loc[df_pro['Promotion']=='No', 'purchase'].sum()
    
    irr = purch_tre/cust_tre - purch_con/cust_con
    nir = 10*purch_tre - 0.15*cust_tre - 10*purch_con
    return irr,nir
irr,nir = get_irr_nir(valid_pred_1,Valid)
print('IRR: %.4f' % irr)
print('NIR: %.4f' % nir )

Validation-set metrics

Applying the model to the test data
df = test_data[['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7']].copy()  # copy so the assignments below do not modify a view of test_data
# Standardize V2 and V3
V_col = ['V2', 'V3']
for v in V_col:
    df[v] = preprocessing.scale(df[v])
# One-hot encode V1, V4, V5, V6 and V7
dummy_fields = ['V1', 'V4', 'V5', 'V6','V7']
for V in dummy_fields:
    dummies = pd.get_dummies(df[V],prefix =V,drop_first = False)
    df = pd.concat([df,dummies],axis =1)
df = df.drop(dummy_fields,axis=1)

# Predict with the model and print the results
target_pred_1 = model_1.predict(df,ntree_limit=model_1.best_ntree_limit)
irr,nir = get_irr_nir(target_pred_1,test)
print('IRR: %.4f' % irr)
print('NIR: %.4f' % nir )

Output

SMOTE oversampling
# Oversample the minority class with SMOTE
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)  # fit_sample in older imbalanced-learn releases
X_train_over = pd.DataFrame(X_train_over, columns=features)
y_train_over = pd.Series(y_train_over)
y_train_over.value_counts()
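For intuition about what the oversampler does, here is a simplified, SMOTE-style sketch in plain Python. It is not the imbalanced-learn implementation: the function name, the toy points and the parameters are assumptions for illustration. The core idea is the same, though: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them.

```python
import random


def smote_like_sample(minority, k=2, n_new=4, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance), excluding base
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points in a 2-D feature space.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_sample(minority_points)
```

Because the new points are interpolated, one-hot columns such as V1_0 can take fractional values after oversampling. Tree models tolerate this, but it is worth keeping in mind; imbalanced-learn also provides SMOTENC for data that mixes continuous and categorical features.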
Predicting with XGBoost after oversampling
eval_set_2 = [(X_train_over, y_train_over), (X_valid, y_valid)]
model_2 = XGBClassifier(learning_rate = 0.05,
                          max_depth = 8,
                          min_child_weight = 1,
                          objective = 'binary:logistic',
                          seed = 42,
                          gamma = 0.1,
                          silent = True,
                          n_estimators=200)
model_2.fit(X_train_over, y_train_over, eval_set=eval_set_2,
          eval_metric="auc", verbose=True, early_stopping_rounds=30)
valid_pred_2 = model_2.predict(X_valid, ntree_limit=model_2.best_ntree_limit)
sk.metrics.confusion_matrix(y_valid, valid_pred_2)
12.png
irr,nir = get_irr_nir(valid_pred_2,Valid)
print('IRR: %.4f' % irr)
print('NIR: %.4f' % nir )
# Predict with the model and print the results
target_pred_2 = model_2.predict(df,ntree_limit=model_2.best_ntree_limit)
irr,nir = get_irr_nir(target_pred_2,test)
print('IRR: %.4f' % irr)
print('NIR: %.4f' % nir )

Strategy 2

Data preprocessing
# Select the training and validation data for the two models
train_treat = Train[Train['Promotion']=='Yes']
train_cont = Train[Train['Promotion']=='No']
valid_treat = Valid[Valid['Promotion']=='Yes']
valid_cont = Valid[Valid['Promotion']=='No']

# Same one-hot feature list as in Strategy 1
X_tt = train_treat[features]
X_tc = train_cont[features]
y_tt = train_treat['purchase']
y_tc = train_cont['purchase']
X_val_tt = valid_treat[features]
X_val_tc = valid_cont[features]
y_val_tt = valid_treat['purchase']
y_val_tc = valid_cont['purchase']

# Oversample the training data with SMOTE (reusing the sm instance created earlier)
#from imblearn.over_sampling import SMOTE
#sm = SMOTE(random_state=42)
X_tt_over, y_tt_over = sm.fit_resample(X_tt, y_tt)  # fit_sample in older imbalanced-learn releases
X_tt_over = pd.DataFrame(X_tt_over, columns=features)
y_tt_over = pd.Series(y_tt_over)
X_tc_over, y_tc_over = sm.fit_resample(X_tc, y_tc)
X_tc_over = pd.DataFrame(X_tc_over, columns=features)
y_tc_over = pd.Series(y_tc_over)
Model training

Model for users who received the promotion

eval_set_3 = [(X_tt_over, y_tt_over), (X_val_tt, y_val_tt)]
model_3 = XGBClassifier(learning_rate = 0.05,
                          max_depth = 8,
                          min_child_weight = 1,
                          objective = 'binary:logistic',
                          seed = 42,
                          gamma = 0.1,
                          silent = True,
                          n_estimators=200)
model_3.fit(X_tt_over, y_tt_over, eval_set=eval_set_3,
          eval_metric="auc", verbose=True, early_stopping_rounds=30)
valid_pred_3 = model_3.predict(X_val_tt, ntree_limit=model_3.best_ntree_limit)
sk.metrics.confusion_matrix(y_val_tt, valid_pred_3)
13.png

Model for users who did not receive the promotion

eval_set_4 = [(X_tc_over, y_tc_over), (X_val_tc, y_val_tc)]
model_4 = XGBClassifier(learning_rate = 0.01,
                          max_depth = 7,
                          min_child_weight = 5,
                          objective = 'binary:logistic',
                          seed = 42,
                          gamma = 0.2,
                          silent = True,
                          n_estimators=200)
model_4.fit(X_tc_over, y_tc_over, eval_set=eval_set_4,
          eval_metric="auc", verbose=True, early_stopping_rounds=30)
14.png
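# Predict purchase probabilities with both models
p_treat = model_3.predict_proba(df, ntree_limit=model_3.best_ntree_limit)[:,1]
p_cont = model_4.predict_proba(df, ntree_limit=model_4.best_ntree_limit)[:,1]
# Difference between the two probabilities (estimated uplift)
delta_p = p_treat - p_cont
# 70th percentile of the differences
cut_num = np.percentile(delta_p,70)
# Select the users above the cut-off (roughly the top 30%)
test_pred = np.where(delta_p > cut_num,1,0)
# Compute the metrics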
irr,nir = get_irr_nir(test_pred,test)
print('IRR: %.4f' % irr)
print('NIR: %.4f' % nir )
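The selection rule in Strategy 2 can be illustrated on a handful of made-up probabilities. The sketch below is plain Python and ranks customers by estimated uplift instead of calling np.percentile, so it is a close analogue of the code above, not a copy of it; the function name and all probabilities are hypothetical. It also compares the result with a break-even cut-off of 0.15 / 10 = 0.015, the uplift at which the expected extra revenue from a $10 sale just covers the $0.15 promotion cost.

```python
def select_top_uplift(p_treat, p_ctrl, top_share=0.30):
    """Flag the top_share fraction of customers with the largest estimated uplift."""
    uplift = [t - c for t, c in zip(p_treat, p_ctrl)]
    n_select = int(round(top_share * len(uplift)))
    ranked = sorted(range(len(uplift)), key=lambda i: uplift[i], reverse=True)
    chosen = set(ranked[:n_select])
    return uplift, [1 if i in chosen else 0 for i in range(len(uplift))]


# Hypothetical purchase probabilities from the treatment and control models.
p_treat = [0.05, 0.02, 0.01, 0.04, 0.03, 0.02, 0.06, 0.01, 0.02, 0.03]
p_ctrl = [0.01, 0.02, 0.02, 0.01, 0.03, 0.01, 0.01, 0.01, 0.03, 0.02]

uplift, send = select_top_uplift(p_treat, p_ctrl)

# Alternative rule: promote only when expected gain exceeds the promotion cost.
break_even = 0.15 / 10
profitable = [1 if u > break_even else 0 for u in uplift]
```

In this toy example both rules pick the same three customers. On real predictions they can differ, and after SMOTE the predicted probabilities are no longer calibrated to the true purchase rate, so a fixed break-even threshold would need care.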

Summary

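This exploration of a Starbucks promotion strategy is in fact a very representative problem. Businesses want to pinpoint likely buyers, target their promotions precisely, and lower conversion and outreach costs, yet user features often differ only slightly, which makes precise discrimination extremely hard. In classification problems with heavily imbalanced data like this one, accuracy is a misleading metric. XGBoost's scale_pos_weight parameter and SMOTE oversampling offer reasonably good remedies and can quickly improve classification performance, but what matters most is the underlying logic of the strategy: good results come from combining good methods with sound logic.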
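The point about accuracy is easy to check with the purchase counts printed earlier (83494 non-purchasers and 1040 purchasers in the training file). The sketch below scores a hypothetical "model" that predicts 0 for everyone; the variable names are only for illustration.

```python
# Purchase label counts from train_data['purchase'].value_counts().
n_negative, n_positive = 83494, 1040

# A classifier that always predicts 0 gets every negative right and every positive wrong.
accuracy_all_zero = n_negative / (n_negative + n_positive)
recall_all_zero = 0 / n_positive

# Ratio of negatives to positives, the quantity scale_pos_weight is meant to offset.
imbalance_ratio = n_negative / n_positive

print('accuracy: %.4f' % accuracy_all_zero)
print('recall:   %.4f' % recall_all_zero)
print('ratio:    %.1f' % imbalance_ratio)
```

Such a model scores close to 99% accuracy while identifying no buyers at all. Note that this ratio is for the raw purchase label; the value 114 used in Strategy 1 refers to the stricter response label (promoted and purchased), which has fewer positives.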
