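Event-Tracking Data Pitfalls: The Pain of Missing Parameters
Today I want to vent about one pain point analysts hit when they rely on event-tracking data for analysis: missing parameters.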
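Business background:
A learning app offers a series of speaking courses, and the main thing a user does in each learning session is record audio.
We need to analyze the average lesson completion rate and the share of sessions that end without a single recording. The lower the latter, the better.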
Data structure:
| page_name   | user_id | action                    | data_date | time      | lesson_id |
| ----------- | ------- | ------------------------- | --------- | --------- | --------- |
| lesson_page | user_id | page_view (session start) | Date      | timestamp | lesson_id |
| lesson_page | user_id | click_record (recording)  | Date      | timestamp | lesson_id |
| report_page | user_id | page_view (session end)   | Date      | timestamp | lesson_id |
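Missing-parameter problems:
1. A user can study the same lesson several times in one day, so user_id + lesson_id alone cannot tie together one "session start → recordings → session end" sequence.
2. The events could have carried a unique id for each learning session in params, but it was never logged.
3. When a user re-enters a lesson they have already studied, they land on the report page (report_page) of the previous session, and only that page's page_view event is logged.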
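To show how much the missing id costs, here is a minimal sketch of what the analysis might look like if every event carried a per-session id. Both the `session_id` column and the `events_with_session_id` table are hypothetical; neither exists in the real data. The sketch uses Presto-style functions such as `count_if`.

```sql
-- Hypothetical: assumes each event row carries a session_id that is unique per learning session.
-- events_with_session_id and session_id are illustrative names, not real objects.
SELECT lesson_id,
       count(*) AS num_sessions,
       1.0 * count_if(finished) / count(*) AS finish_rate,
       1.0 * count_if(record_num = 0) / count(*) AS bounce_rate
FROM (
    SELECT lesson_id,
           session_id,
           -- a session is finished if it reached the report page
           count_if(page_name = 'report_page' AND action = 'page_view') > 0 AS finished,
           -- number of recordings made in this session
           count_if(page_name = 'lesson_page' AND action = 'click_record') AS record_num
    FROM events_with_session_id
    GROUP BY 1, 2
    -- keep only sessions that actually started; a bare report-page re-entry is not a session
    HAVING count_if(page_name = 'lesson_page' AND action = 'page_view') > 0
) per_session
GROUP BY 1;
```

With such an id, one GROUP BY replaces all the self-joins on time windows that the workaround needs.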
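Solution:
1. For each user_id + lesson_id + report time + data_date, find the matching session start time on the same day (the SQL looks back at most 30 minutes from the report).
2. For each user_id + lesson_id + start_time, work out the session end time.
3. For each user_id + lesson_id, count the recordings that fall between start_time and report_time.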
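To make step 1 concrete, here is a small sketch that runs the report-to-start matching on a few inline events. All sample values are made up, and the inline `events` relation only stands in for the real event table. The syntax is Presto-style.

```sql
-- Made-up sample: one start, one finish 20 minutes later, and a re-entry in the afternoon
-- that lands directly on the old report page.
WITH events AS (
    SELECT *
    FROM (VALUES
        ('lesson_page', 'u1', 'page_view', DATE '2024-01-01', TIMESTAMP '2024-01-01 10:00:00', 'l1'),
        ('report_page', 'u1', 'page_view', DATE '2024-01-01', TIMESTAMP '2024-01-01 10:20:00', 'l1'),
        ('report_page', 'u1', 'page_view', DATE '2024-01-01', TIMESTAMP '2024-01-01 15:00:00', 'l1')
    ) AS t (page_name, user_id, action, data_date, time, lesson_id)
)
SELECT f.user_id,
       f.lesson_id,
       f.time AS report_time,
       min(s.time) AS start_time   -- earliest start in the 30 minutes before the report
FROM events f
JOIN events s
  ON f.user_id = s.user_id
 AND f.lesson_id = s.lesson_id
 AND f.data_date = s.data_date
 AND date_diff('second', s.time, f.time) BETWEEN 1 AND 1800
WHERE f.page_name = 'report_page' AND f.action = 'page_view'
  AND s.page_name = 'lesson_page' AND s.action = 'page_view'
GROUP BY 1, 2, 3;
```

On this sample the query returns a single row pairing the 10:20 report with the 10:00 start. The 15:00 report view has no start within the previous 30 minutes, so the inner join drops it, which is how the re-entry case from problem 3 is filtered out.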
SQL code:
```sql
-- Presto-style SQL. TABLE stands in for the event table.
SELECT lesson_id,
       count(user_id) AS num_user, -- one row per session start, so this counts sessions
       1.0 * count(report_time) / count(user_id) AS finish_rate,
       1.0 * count(if(record_num = 0, user_id)) / count(user_id) AS bounce_rate
FROM
  (SELECT t3.user_id,
          t3.lesson_id,
          t3.start_time,
          t3.st_rn,
          t3.report_time,
          t3.report_time_1,
          count(r.user_id) AS record_num -- step 3: recordings inside the session window
   FROM
     (SELECT t2.*,
             -- Handles start, start, report: a session with no report ends at the user's next
             -- start if that start is for the same lesson, otherwise at a far-future sentinel.
             -- Assumes time is a TIMESTAMP column.
             coalesce(t2.report_time, s1.time, TIMESTAMP '2999-12-31 23:59:59') AS report_time_1
      FROM
        (SELECT s.user_id,
                s.lesson_id,
                s.start_time,
                s.st_rn,
                min(t1.report_time) AS report_time -- handles start, report, report: keep the earliest report
         FROM
           -- every session start, numbered per user in time order
           (SELECT user_id,
                   lesson_id,
                   time AS start_time,
                   row_number() OVER (PARTITION BY user_id
                                      ORDER BY time) AS st_rn
            FROM TABLE
            WHERE page_name = 'lesson_page'
              AND action = 'page_view') s
         LEFT JOIN
           -- step 1: for each report view, the earliest same-day start in the 30 minutes before it
           (SELECT f.user_id,
                   f.lesson_id,
                   f.time AS report_time,
                   min(s.time) AS start_time
            FROM
              (SELECT *
               FROM TABLE
               WHERE page_name = 'report_page'
                 AND action = 'page_view') f
            JOIN
              (SELECT *
               FROM TABLE
               WHERE page_name = 'lesson_page'
                 AND action = 'page_view') s
              ON f.user_id = s.user_id
             AND f.lesson_id = s.lesson_id
             AND f.data_date = s.data_date
             AND date_diff('second', s.time, f.time) BETWEEN 1 AND 1800
            GROUP BY 1, 2, 3) t1
           ON s.user_id = t1.user_id
          AND s.lesson_id = t1.lesson_id
          AND s.start_time = t1.start_time
         GROUP BY 1, 2, 3, 4) t2
      LEFT JOIN
        -- the same numbered starts again, used to look up the user's next start
        (SELECT *,
                row_number() OVER (PARTITION BY user_id
                                   ORDER BY time) AS st_rn
         FROM TABLE
         WHERE page_name = 'lesson_page'
           AND action = 'page_view') s1
        ON t2.user_id = s1.user_id
       AND t2.lesson_id = s1.lesson_id
       AND t2.st_rn = s1.st_rn - 1) t3
   LEFT JOIN
     (SELECT *
      FROM TABLE
      WHERE page_name = 'lesson_page'
        AND action = 'click_record') r
     ON t3.user_id = r.user_id
    AND t3.lesson_id = r.lesson_id
    AND r.time BETWEEN t3.start_time AND t3.report_time_1
   GROUP BY 1, 2, 3, 4, 5, 6) tt -- all six non-aggregated columns must be grouped
GROUP BY 1;
```