zhaoyqiu的数据分析进阶之路2.0

D7-利用Mysql进行数据分析实例

2020-06-30  本文已影响0人  喝奶茶不加奶茶

一、准备工作及背景
1、准备数据库:data
···
create database data charset=utf8;
···
2、准备数据表:orderinfo (539,414数据量)和userinfo(101,535数据量)

source+表1路径
source+表2路径

3、查看表结构:


表orderinfo相关信息 表userinfo相关信息

二、题目要求..
1、统计不同月份的下单人数
思考过程:

select month(paidTime) from orderinfo limit 10,10;

select count(1) ,month(paidTime)  as mon from orderinfo  group by month(paidTime) ;


select 
year(paidTime),
month(paidTime),
count(1)
from  orderinfo
group by  year(paidTime),month(paidTime);

注意:
不同月份,可能是同一年的不同月份,也有可能是不同年的不同月份,所以在分组的时候需要考虑年份。
结果:


可以利用where去除脏数据(0000-00-00 00:00:00)以及筛选出‘已支付’的用户

select 
year(paidTime),
month(paidTime),
count(1)
from  orderinfo
where isPaid='已支付' 
and
paidTime!='0000-00-00 00:00:00'
group by  year(paidTime),month(paidTime);

注意:group by 放在语句的最后


可能存在同一个用户支付多个订单的情况,所以需要用distinct做区分

select 
year(paidTime),
month(paidTime),
count(distinct userid)
from  orderinfo
where isPaid='已支付' 
and
paidTime!='0000-00-00 00:00:00'
group by  year(paidTime),month(paidTime);

补充:ctrl+c可以退出运行进程。
2、统计用户三月份的回购率和复购率
复购率:当月购买了多次的用户占当月用户的比例。
回购率:上月购买用户中有多少用户又再次购买。
思考过程:

select 
*
from  orderinfo
where isPaid='已支付'
and month(paidTime)='3'
limit 10;

(2)统计每个用户在3月份消费了多少次
思路:group by用户

select 
userid,count(1)
from  orderinfo
where isPaid='已支付'
and month(paidTime)='3'
group by userid
limit 10;

(3)统计出消费次数大于1的用户数、总用户数以及复购率

select 
     sum(if(con>1,1,0)) as fugou_cons,
     count(1) as userid_cons,
     sum(if(con>1,1,0))/count(1) as fugou_rate
from  (select 
       userid,count(1) as con
       from  orderinfo
       where isPaid='已支付'
       and month(paidTime)='3'
       group by userid
     ) t;

注意:嵌套的子表需要取别名(比如,t)且不需要加as;否则会报错如下:


小tip:在notepad++中'alt'+鼠标下拉,可以统一选中多行,进而可以按空格产生缩进,便于代码美观易观看。
或者 选择代码,再按‘tab’也可达到同样缩进效果。

先查看下数据中的月份类别

select distinct(month(paidTime)) from orderinfo;

现在是六月,那么这里的上月就是五月,确认了存在五月的数据,可以继续计算。
五月份的回购率=五月用户中6月又再次购买的人数/5月的用户总数

我们关系的不是用户购买的次数,而是用户有没有购买过。
(1)筛选出五月购买的用户

select * from orderinfo 
where month(paidTime)='5';

(2)按照userid进行分组并显示购买次数

select userid,count(1) as con from orderinfo 
where month(paidTime)='5' group by userid;

(3)统计购买次数大于1 的用户数、总用户数以及回购率

select 
    count(1) as userid_con,
    sum(if(con >1,1,0)) as huigou_con,
    sum(if(con >1,1,0))/count(1) as huigou_rate
from 
    (select userid,count(1) as con from orderinfo 
where month(paidTime)='5' group by userid) t;

===
(1)、统计每年每月的用户消费情况

select 
    userid,
    date_format(paidTime,'%Y-%m') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m') 
limit 10;

注意:我们这里选用date_format()而没有用month(),目的是便于后续判断日期是否相邻。
(2)、相邻月份进行关联,能关联上的用户说明就是回购。
首先写个框架

select 
from () a
left join
() b;

其次,自己和自己进行关联,注意用left join,避免遗漏信息

select 
*
from (select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) a
left join
(select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid 
and date_sub(b.month_dt  ,interval 1 month)=a.month_dt  limit 10; 

?但为什么右列都是null值?

(3)统计每个月份的消费人数情况以及得到回购率
框架:

select
a.month_dt,
    count(a.userid),
    count(b.userid),
    count(b.userid)/count(a.userid)
from
()
group by a.month_dt;

完整代码:

select
    a.month_dt,
    count(a.userid),
    count(b.userid),
    count(b.userid)/count(a.userid)
from
(select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) a
left join
(select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid 
and date_sub(b.month_dt,interval 1 month)=a.month_dt
group by a.month_dt;

存在疑问:
1、date_sub()函数是对日期格式有特殊要求吗?
具体情形:
如果我将paidTime格式转换为'%Y-%m'再进行date_sub(b.month_dt,interval 1 month)与a表日期进行是否相等判断,则匹配的全为null值。
但是参考答案中将paidTime格式转换为'%Y-%m-01'再进行匹配则不全为null值,但是这个‘01’的设置在视频讲解中也提到说不是影响因素,只是强制设置所有的记录的日为01,所以就不清楚这个错误是为什么出现?
解决问题:
date_sub(b.month_dt,interval 1 month)中的日期最少要精确到日,到月的输出Null值



2、表Orderinfo中paidTime的约束是字符串类型(varchar),而date_format (日期,字符串格式)函数的作用是将日期转换为字符串格式,那用date_format(paidTime,'%Y-%m-01'),是把paidTime当成日期类型了吗?还是说我理解错了?
解决问题:
date_format (日期,字符串格式)函数并没有规定这个字段的类型,只要是合法的日期就行
总结:还是需要再练练

注意:count()函数在统计时并不计算null值

3、统计男女用户消费频次是否有差异
思考过程:
(1)统计不同性别下的用户的消费次数

思考过程:
(1)首先通过left join合并一下表userinfo和表orderinfo

select * from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

查看一下sex的类别

select distinct sex from userinfo;

类别为:'男','女',字符串空''

(1)我们需要统计不同性别下的消费情况(次数)

select * from
(select *,o.userid as o_id from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;

出现ERROR 1060 (42S21): Duplicate column name 'userid',已经做好记录并解决了(见:https://www.jianshu.com/p/0c0df1939ee4)。

select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

select * from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;

select * from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' limit 10 ;

#去除sex为空的字段以及根据sex和id进行分组
select o_id as all_id ,sex,count(1) from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by sex,o_id limit 20;

#修改子查询的写法,并改变分组的顺序为id,sex
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by orderid,sex limit 20;
#注意:group by 后面变量的前后顺序不同,得到的结果也会发生改变。

#去掉'已支付'的限制条件
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.sex!='' group by orderid,sex limit 20;

#把left join 换成inner join
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o inner join userinfo as u on u.userid=o.userid) t
where  t.sex!='' group by orderid,sex limit 20;

/*发现分组的依据id以及查询的id使用错误,应该使用userid 而不是orderid,所以还是用回之前的left join,看下结果:*/
select userid ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.sex!='' group by userid,sex limit 20;

#接着再次加上'已支付'的限制条件
select userid ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex limit 20;

这样我们就统计了不同性别下的用户消费次数
接着,对性别做一个消费次数平均计算
同样构建框架:

select 

from ()
group by sex;

详细代码:

select
    sex,
    avg(count(1)) as avg_con
from (select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;

但出现错误:



ERROR 1111 (HY000): Invalid use of group function
这个问题还有待进一步了解。
但是修改avg()后运行成功,
完整代码如下:

select 
sex,
avg(cons) as avg_con
from (select userid ,sex,count(1) as cons from
(select o.*,u.sex,u.birth 
 from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;

错误记录:
1、

select *,o.userid as o_id from
 (select * from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;

ERROR 1060 (42S21): Duplicate column name 'userid'
解决办法:
其实这个问题的本质是子查询:

select * from orderinfo as o left join userinfo as u on u.userid=o.userid;

中出现重复字段名,如下图:


所以,我们在查询的时候需要对字段进行重命名,如下:

select * ,o.userid as o_id, u.userid as u_id from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

但是这样是在查询原有全部信息的同时又增加了重新命名的两列,如何在查询全部信息的同时更改其中某个字段的名称呢?

目前的解决办法是:
自己手动写出需要查询的字段的名称,如下:

 select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
 from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

这样似乎有些蠢笨,等发现有简单方法后再来更新。

方法更新:
如下:

select a.*,b.sex,b.birth 
from orderidfo as a left join userinfo as b on a.userid=b.userid limit 10;

这个的思路是:
这样写就不用修改名称了,a.*查询字段多的表,字段少的表直接手写

错误2:

select
    sex,
    avg(count(1)) as avg_con
from (select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;

ERROR 1111 (HY000): Invalid use of group function


4、统计多次消费的用户,第一次和最后一次消费间隔是多少天
即产品的生命周期。
(1)统计出多次消费的用户

select userid from orderinfo
where  isPaid='已支付'
group by userid
having count(1)>1;

(2)取出第一次和最后一次的时间

select 
    userid ,
    min(paidTime),
    max(paidTime),
    max(paidTime)-min(paidTime) 
from orderinfo
where  isPaid='已支付'
group by userid
having count(1)>1 limit 10;

注意:日期类型不能用简单的算术运算,应该用日期特有的运算(datediff(日期1,日期2))

select 
    userid ,
    min(paidTime),
    max(paidTime),
    datediff(max(paidTime),min(paidTime) )
from orderinfo
where  isPaid='已支付'
group by userid
having count(1)>1 limit 10;

5、统计不同年龄段,用户的消费金额是否有差异
首先计算年龄

select userid,2020-year(birth) as age from userinfo limit 10;

注意:计算年龄可以考虑timestampdiff(year,birth,now())函数,而且该函数未过生日会减一
其中,第一个参数year是返回的类型,也可以是month,hour

select userid,birth,now(),timestampdiff(year,birth,now()) as age
from userinfo limit 10;

注意:系统设置的默认时间为'1900-01-01'(因为发现年龄居然存在一百多岁的),以及存在脏数据‘ 0000-00-00’


所以需要进行筛选:

select userid,birth,now(),timestampdiff(year,birth,now()) as age
from userinfo where year(birth)>1900 limit 30;

年龄如何分段?0-10:1,11-20:2,21-30:3 利用case when
补充:向上取整函数ceil()

select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900 
limit 10;

疑问:
year(birth)>1900 和birth>'1901-00-00'的区别??

统计消费金额

select  price from orderinfo limit 10;

消费金额如何分段?
分段之后通过两个表的userid 连接起来,关联订单信息,获取不同年龄段的一个消费频次和消费金额
框架:

select
 
from orderinfo a 
inner join() b on a.userid=b.userid;

详细代码:

select
    userid,
    age,
    count(1) as con,
    sum(price) as prices
from orderinfo a 
inner join(select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900) b on a.userid=b.userid
group by userid,age;

修改如下:

select
    a.userid,
    age,
    count(1) as con,
    sum(price) as prices
from orderinfo a 
inner join(select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900) b on a.userid=b.userid
group by a.userid,age  limit 30;

(3)再对年龄分层进行聚合,得到不同年龄层的消费情况
框架:

select 
    age,
    avg(con),
    avg(prices)
from 
() a
group by age ;

完整代码:

select 
    age,
    avg(con),
    avg(prices)
from 
(select
    a.userid,
    age,
    count(1) as con,
    sum(price) as prices
from orderinfo a 
inner join(select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900) b on a.userid=b.userid
group by a.userid,age) a
group by age ;

错误记录:
错误3、ERROR 1052 (23000): Column 'userid' in field list is ambiguous


6、统计消费的二八法则,消费的top20%用户,贡献了多少消费额
(1)统计每个用户的消费金额,并进行降序排序

select 
    userid,
    price,
    sum(price)  as total_price
from orderinfo
where isPaid='已支付'
group by userid limit 20;

(2)统计用户总数以及总消费金额
框架:

select 
count(1) as  con,
sum(total_price)  as all_price
from () a;

完整代码:

select 
count(1) as  con,
sum(total_price)  as all_price
from (select 
    userid,
    price,
    sum(price)  as total_price
from orderinfo
where isPaid='已支付'
group by userid) a;

(3)取出前20%的用户进行金额统计。

select
    count(1) as  con,
    sum(total_price)  as all_price
from(
select 
    userid,
    price,
    sum(price)  as total_price
from orderinfo
where isPaid='已支付'
group by userid 
order by total_price desc 
limit 17000) b;

我们发现20%的用户贡献了80%的业务量。


上一篇下一篇

猜你喜欢

热点阅读