E-commerce Transaction Data Analysis

2019-12-22, 汤海怪

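The data source

The data comes from a single CSV file, order_info_2016.csv.

Importing the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./order_info_2016.csv', index_col='id')
df.head()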

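Checking the number of rows

df.shape

There are 104,557 records in total.

Data types

df.info()

Fortunately, only channelId has missing values (8 of them). createTime and payTime will be converted to datetime types later.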
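To read the missing-value counts as plain numbers instead of picking them out of the df.info() output, one option is isnull().sum(). The sketch below runs on a tiny made-up frame; the column names follow the article, but the values are illustrative and not taken from order_info_2016.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; only the column names mirror the article's data
sample = pd.DataFrame({
    'orderId': [1, 2, 3, 4],
    'channelId': ['a', np.nan, 'b', np.nan],
    'payMoney': [100, 200, 300, 400],
})

# Number of missing values per column
missing = sample.isnull().sum()
print(missing)
```

Applied to the real df, the same call should report 8 for channelId and 0 for every other column, matching what df.info() shows.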

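Descriptive analysis

df.describe()

This gives a rough picture of how the data is distributed.

The price values are very large, presumably because the unit is fen (1/100 of a yuan).

The minimum payMoney is negative, which should not happen; those records will be removed later.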

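Data cleaning

orderId

An orderId should be unique within a system, so first check for duplicates by counting the distinct values.

df['orderId'].unique().size

The distinct count is lower than the row count, so duplicates exist. Leave them for now: the other columns may affect which of the duplicated records should be dropped, so clean those columns first.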
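Before deciding which duplicate to drop, it can help to look at the duplicated rows side by side. A minimal sketch on made-up data (the sample frame is hypothetical): duplicated(keep=False) flags every row whose orderId occurs more than once, not just the later copies.

```python
import pandas as pd

# Hypothetical sample with one duplicated orderId
sample = pd.DataFrame({
    'orderId': [10, 11, 11, 12],
    'payMoney': [5.0, 7.0, 7.5, 9.0],
})

# Fewer distinct ids than rows means duplicates exist
has_duplicates = sample['orderId'].nunique() < len(sample)

# keep=False marks every row that shares its orderId with another row
dupes = sample[sample['orderId'].duplicated(keep=False)]
print(has_duplicates)
print(dupes)
```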

userId

df.userId.unique().size

In order data, one user can place several orders, so repeated values are expected here.

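productId

The minimum productId is 0, so check how many records have that value.

df.productId[df.productId == 0].size

There are 177 such records, which is not many. They may be caused by products being listed and delisted. They will be dropped once the other columns have been handled.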

cityId

df.cityId.unique().size

cityId is similar to userId: the values are all within a normal range and need no cleaning.

price

Since price is stored in fen, convert it to yuan.

df.price = df.price / 100
df.price

payMoney

payMoney contains negative values. An order cannot be paid with a negative amount, so drop those records.

df.drop(index=df[df.payMoney < 0].index, inplace=True)
df[df.payMoney < 0].size

Then convert the unit to yuan.

df.payMoney = df.payMoney / 100
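The same cleanup can also be written as a boolean-mask filter instead of drop(..., inplace=True). This is only an alternative sketch on made-up amounts, not the article's code; the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical amounts in fen, including one invalid negative value
sample = pd.DataFrame({'payMoney': [1500, -200, 0, 9900]})

# Keep only non-negative payments, then convert fen to yuan
cleaned = sample[sample['payMoney'] >= 0].copy()
cleaned['payMoney'] = cleaned['payMoney'] / 100
print(cleaned)
```

Filtering into a new frame leaves the original untouched, which makes it easier to re-run a notebook cell without dividing by 100 twice.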

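channelId

According to the df.info() output, channelId has some null values. A likely cause is a client-side bug that failed to send the channelId field when the order was placed. Drop these records directly.

df.drop(index=df[df.channelId.isnull()].index, inplace=True)
df[df.channelId.isnull()].size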

createTime and payTime

Convert createTime and payTime to datetime types.

df.createTime = pd.to_datetime(df.createTime)
df.payTime = pd.to_datetime(df.payTime)
df.dtypes

The analysis covers 2016 only, so drop the records that fall outside it: orders created before 2016-01-01 or paid after 2016-12-31 23:59:59.

import datetime
start_time = datetime.datetime(2016, 1, 1)
end_time = datetime.datetime(2016, 12, 31, 23, 59, 59)

df.drop(index=df[df.createTime < start_time].index, inplace=True)
df.drop(index=df[df.payTime > end_time].index, inplace=True)

Confirm that nothing outside the range is left; both checks should return 0.

df[df.createTime < start_time].size

df[df.payTime > end_time].size
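The year filter can also be expressed as a single boolean mask. The sketch below uses made-up timestamps and an exclusive upper bound of 2017-01-01, which is a small deviation from the inclusive 23:59:59 bound used above: it also catches timestamps with a sub-second part after 23:59:59.

```python
import pandas as pd

# Hypothetical orders: one created in 2015, one fully in 2016, one paid in 2017
sample = pd.DataFrame({
    'createTime': pd.to_datetime(['2015-12-31 23:50:00',
                                  '2016-03-01 10:00:00',
                                  '2016-12-31 23:59:00']),
    'payTime': pd.to_datetime(['2016-01-01 00:05:00',
                               '2016-03-01 10:02:00',
                               '2017-01-01 00:03:00']),
})

start = pd.Timestamp('2016-01-01')
end = pd.Timestamp('2017-01-01')  # exclusive upper bound

# Keep orders created on or after the start and paid before the end
in_2016 = (sample['createTime'] >= start) & (sample['payTime'] < end)
kept = sample[in_2016]
print(kept)
```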

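Now go back and drop the records with a duplicated orderId. The distinct count and the row count differ before the drop and should match afterwards.

df.orderId.unique().size
df.orderId.size
df.drop(index=df[df.orderId.duplicated()].index, inplace=True)
df.orderId.unique().size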
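pandas also offers drop_duplicates for this step. A minimal sketch on a made-up frame (the sample data is hypothetical); like duplicated() with its default keep='first', it keeps the first row for each orderId.

```python
import pandas as pd

# Hypothetical sample: orderId 11 appears twice
sample = pd.DataFrame(
    {'orderId': [10, 11, 11, 12], 'payMoney': [5.0, 7.0, 7.5, 9.0]},
    index=[1, 2, 3, 4],
)

# Keep the first row of each orderId and discard later copies
deduped = sample.drop_duplicates(subset='orderId', keep='first')
print(deduped)
```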

Drop the records whose productId is 0 as well.

df.drop(index=df[df.productId == 0].index, inplace=True)

That completes the data cleaning; the analysis can begin.

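Analyzing the data

First, look at the overall picture: total number of orders, total number of ordering users, total turnover, and the number of products with sales.

print(df.orderId.count())
print(df.userId.unique().size)
print(df.payMoney.sum())
print(df.productId.unique().size)

An analysis can be approached from two angles: dimensions and metrics. A dimension can be thought of as the x-axis and a metric as the y-axis. Several metrics can be analyzed along the same dimension, and a dimension can itself be rolled up or drilled down.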
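As an illustration of one dimension carrying several metrics, the sketch below groups a made-up frame by cityId and computes three metrics in a single pass with named aggregation. The sample values and the output column names (order_count, user_count, turnover) are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned orders
sample = pd.DataFrame({
    'cityId': [1, 1, 2, 2, 2],
    'orderId': [101, 102, 103, 104, 105],
    'userId': [7, 7, 8, 9, 9],
    'payMoney': [10.0, 20.0, 5.0, 5.0, 15.0],
})

# One dimension (cityId), three metrics
summary = sample.groupby('cityId').agg(
    order_count=('orderId', 'count'),
    user_count=('userId', 'nunique'),
    turnover=('payMoney', 'sum'),
)
print(summary)
```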

By productId

By sales volume (number of orders)

productId_orderCoun = df.groupby('productId')['orderId'].count().sort_values(ascending=False)

Top 10 products by sales volume

print(productId_orderCoun.head(10))

Bottom 10 products by sales volume

print(productId_orderCoun.tail(10))

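By turnover

# select payMoney before summing, so the datetime columns are not summed
productId_turnover = df.groupby('productId')['payMoney'].sum().sort_values(ascending=False)

Top 10 products by turnover

productId_turnover.head(10)

Bottom 10 products by turnover

productId_turnover.tail(10)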

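Take the intersection of the bottom 100 by sales volume and the bottom 100 by turnover. Products that do poorly on both are candidates for improvement or delisting.

problem_productIds = productId_turnover.tail(100).index.intersection(productId_orderCoun.tail(100).index)
problem_productIds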
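How the tail intersection behaves is easier to see at small scale. The sketch below uses two made-up rankings of five products and takes the bottom 3 of each; the product ids and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical per-product metrics, sorted from best to worst
order_count = pd.Series(
    {'p1': 50, 'p2': 40, 'p3': 3, 'p4': 2, 'p5': 1}
).sort_values(ascending=False)
turnover = pd.Series(
    {'p1': 900.0, 'p3': 800.0, 'p2': 30.0, 'p5': 20.0, 'p4': 10.0}
).sort_values(ascending=False)

# Products in the bottom 3 of both rankings
weak = order_count.tail(3).index.intersection(turnover.tail(3).index)
print(weak)
```

Here p3 sells few units but earns a lot, and p2 sells many units but earns little, so neither ends up in the intersection. Only p4 and p5 are weak on both metrics.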

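By cityId

Group by city.

# by sales volume (number of orders)
cityId_orderCount = df.groupby('cityId')['orderId'].count().sort_values(ascending=False)
# by turnover; select payMoney before summing, so the datetime columns are not summed
cityId_payMoney = df.groupby('cityId')['payMoney'].sum().sort_values(ascending=False)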

Top 10 cities by sales volume

cityId_orderCount.head(10)

Bottom 10 cities by sales volume

cityId_orderCount.tail(10)

Top 10 cities by turnover

cityId_payMoney.head(10)

Bottom 10 cities by turnover

cityId_payMoney.tail(10)

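Take the intersection of the bottom 100 cities by sales volume and by turnover, so that it can be reviewed later to see what these cities have in common.

problem_cityId = cityId_orderCount.tail(100).index.intersection(cityId_payMoney.tail(100).index)
problem_cityId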

price

Looking at how prices are distributed shows which price ranges sell best.

Start with buckets 100 wide; the price is in yuan.

bins = np.arange(0, 10000, 100)
pd.cut(df.price, bins).value_counts()

Plot a histogram using the same 100-yuan buckets.

plt.figure(figsize=(8, 8))
plt.hist(df['price'], bins)

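Next, find out exactly which price ranges contain no products.

price_cut_count = pd.cut(df.price, bins).value_counts()
zero_cut_result = (price_cut_count == 0)
zero_cut_result[zero_cut_result.values].index

Many price ranges turn out to be empty. If competitor data is available, it is worth checking whether products should be added to fill those ranges.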
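The empty-bucket lookup works because pd.cut returns a categorical, and value_counts on a categorical also reports buckets with a count of 0. A small sketch on made-up prices (hypothetical values) shows this:

```python
import numpy as np
import pandas as pd

# Hypothetical prices in yuan
prices = pd.Series([10.0, 50.0, 150.0, 450.0])
bins = np.arange(0, 600, 100)  # edges 0, 100, ..., 500

# Buckets with no prices still appear, with a count of 0
counts = pd.cut(prices, bins).value_counts(sort=False)
empty = counts[counts == 0].index
print(counts)
print(empty)
```

In this sample the buckets (200, 300] and (300, 400] come back empty.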

Now try buckets 1000 wide.

bins = np.arange(0, 25000, 1000)
price_cut = pd.cut(df.price, bins).value_counts()
# pie chart of the 1000-yuan buckets
m = plt.pie(x=price_cut.values, labels=price_cut.index, autopct='%d%%', shadow=True)

71% of products are priced in the 0-1000 yuan range, 17% in the 1000-2000 range, 6% in the 2000-3000 range, and the remaining 6% above 3000.
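The percentages can also be computed directly rather than read off the pie chart. Note that autopct='%d%%' truncates each share to an integer, so the labels on a chart may not add up to exactly 100. The sketch below uses made-up prices and hypothetical bucket labels.

```python
import pandas as pd

# Hypothetical prices in yuan
prices = pd.Series([120.0, 300.0, 800.0, 1500.0, 2500.0, 4000.0, 950.0, 60.0])
bins = [0, 1000, 2000, 3000, 25000]
labels = ['0-1000', '1000-2000', '2000-3000', '3000+']

# normalize=True returns fractions; multiply by 100 for percentages
share = pd.cut(prices, bins, labels=labels).value_counts(normalize=True, sort=False) * 100
print(share.round(1))
```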

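Order time analysis

df['orderHour'] = df.createTime.dt.hour
df.groupby('orderHour').count()['orderId'].plot()

The line chart shows that orders peak around 12:00, 13:00 and 14:00, presumably during the lunch break, followed by a second peak around 20:00. Around 20:00 is a peak for almost every internet product. During these order peaks, pay attention to the stability and availability of the site.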
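The peak hour can be confirmed numerically instead of by eye. A minimal sketch, assuming a handful of made-up order timestamps:

```python
import pandas as pd

# Hypothetical order creation times
create_time = pd.Series(pd.to_datetime([
    '2016-05-01 12:10:00', '2016-05-01 12:45:00', '2016-05-01 13:05:00',
    '2016-05-02 20:30:00', '2016-05-02 12:01:00', '2016-05-03 09:15:00',
]))

# Orders per hour of day, then the hour with the most orders
per_hour = create_time.dt.hour.value_counts().sort_index()
peak_hour = per_hour.idxmax()
print(per_hour)
print(peak_hour)
```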

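Grouping by day of the week

# dayofweek: Monday is 0, Sunday is 6
df['orderWeek'] = df.createTime.dt.dayofweek
df.groupby('orderWeek').count()['orderId']

By day of the week, Saturday has the most orders, followed by Thursday and Friday.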
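When reading the grouped result, keep in mind that dt.dayofweek numbers the days from Monday = 0 to Sunday = 6, so Saturday is 5. The sketch below checks this on three made-up dates and uses dt.day_name() for readable labels.

```python
import pandas as pd

# 2016-01-04 is a Monday, 2016-01-09 a Saturday, 2016-01-10 a Sunday
days = pd.Series(pd.to_datetime(['2016-01-04', '2016-01-09', '2016-01-10']))

numbers = days.dt.dayofweek   # Monday=0 ... Sunday=6
names = days.dt.day_name()    # English day names
print(list(zip(numbers, names)))
```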

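Time from order to payment

# seconds between creating the order and paying for it
df['payDelta'] = (df['payTime'] - df['createTime']).dt.total_seconds()

# bucket edges in seconds; the original list repeated 10000 as the last edge,
# which pd.cut rejects, so 100000 is assumed here
bins = [0, 25, 50, 100, 1000, 10000, 100000]
pd.cut(df.payDelta, bins).value_counts()

Plot a pie chart.

pd.cut(df.payDelta, bins).value_counts().plot(kind='pie', autopct='%d%%', shadow=True, figsize=(8, 8))

The vast majority of users finish paying within the buckets up to 1000 seconds, roughly a quarter of an hour after placing the order. This suggests that users rarely hesitate and buy with a clear purpose.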
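The share of orders paid within a given time can be computed directly, which backs up what the pie chart shows. A sketch on made-up timestamps (hypothetical data), using 1000 seconds as the threshold:

```python
import pandas as pd

# Hypothetical orders, all created at noon and paid after varying delays
sample = pd.DataFrame({
    'createTime': pd.to_datetime(['2016-05-01 12:00:00'] * 4),
    'payTime': pd.to_datetime(['2016-05-01 12:00:20', '2016-05-01 12:01:30',
                               '2016-05-01 12:10:00', '2016-05-01 14:00:00']),
})

# Delay in seconds between order creation and payment
pay_delta = (sample['payTime'] - sample['createTime']).dt.total_seconds()

# The mean of a boolean mask is the fraction of True values
share_within_1000s = (pay_delta <= 1000).mean()
print(share_within_1000s)
```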

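Monthly turnover

df.set_index('createTime', inplace=True)
# 'M' is month-end frequency; pandas 2.2 and later prefer the alias 'ME'
# select the column before aggregating, so the datetime columns are not summed
turnover = df.resample('M')['payMoney'].sum()
order_count = df.resample('M')['orderId'].count()

turnover.plot()

Turnover rose slowly over the first six months, fell off a cliff from July, and began to recover in November. Possible explanations are that the products are seasonal, that the supply chain ran into problems, or that the product range was being adjusted, which would depress sales for a while.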
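To put a number on how steep a rise or fall is, the month-over-month change can be computed with pct_change. The sketch below uses made-up payments and groups by calendar month via to_period('M') instead of resample; that is a substitute chosen here for the example, not the article's method.

```python
import pandas as pd

# Hypothetical payments indexed by order creation time
sample = pd.DataFrame(
    {'payMoney': [100.0, 50.0, 300.0, 60.0]},
    index=pd.to_datetime(['2016-06-03', '2016-06-20', '2016-07-15', '2016-08-01']),
)

# Total turnover per calendar month
monthly = sample['payMoney'].groupby(sample.index.to_period('M')).sum()

# Relative change versus the previous month (the first month has no predecessor)
change = monthly.pct_change()
print(monthly)
print(change)
```

In this sample, July doubles June's turnover (+100%) and August drops by 80%.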

That wraps up this e-commerce transaction data analysis. Until next time.
