ClickHouse应用随笔2——活跃用户留存统计(前方高能~~)

前言

众所周知,产品数据监控体系中,用户留存是一个非常重要的指标,细化到业务场景中,留存其实可以拆解出来多个类别:包括新增用户留存、活跃用户留存、付费用户留存、回流用户留存、新增付费用户留存、回流付费用户留存等等,以下主要针对活跃用户留存设计ClickHouse查询语句。

表结构及查询需求

表结构

time user_id
January 1, 2021, 08:33 AM 112233abc
January 1, 2021, 09:00 AM 221133bac
January 1, 2021, 09:20 AM 113322acb
January 2, 2021, 10:12 AM 113322acb
January 2, 2021, 11:01 AM 221133bac
January 3, 2021, 11:45 AM 112233abc
…… ……

查询需求:在表 User_login 中,按指定时间区间给出该区间内每日的活跃次留、三留、七留、十四留、三十留。

ClickHouse语句如何实现

PS:涉及多重嵌套查询

  1. 查询每个 user_id 的登录时间间隔(按天),按日期分组,得到每个 user_id 两次登录的时间差;
select
	uid,
	base_date,
	retention_date
from
	-- 表a,首日登录的用户
	(select
		distinct 
			date(time) base_date, 
			user_id uid
	from
		User_login
	where
		-- 需要计算留存的日期区间
		base_date between '2021-03-01' and '2021-03-31'
	group by
		uid, base_date) a
left join
	-- 表b,表连接匹配后续登录的情况,计算留存
	(select
		distinct 
			date(time) retention_date, 
			user_id uid
	from
		User_login
	where
		-- 此处不应限制最大日期,否则可能导致后续留存率为0
		retention_date >= '2021-03-01'
	group by
		uid, retention_date) b
on 
	a.uid = b.uid
where 
	-- 保证每条记录不会连接到在当前记录时间之前的数据,如昨天的登录
	base_date <= retention_date
order by
	uid, base_date

结果如下(demo数据):
在这里插入图片描述

  1. 利用 retention_date 减去 base_date 得到两日登录的日期差值;
select
	uid,
	base_date,
	retention_date,
	retention_date - base_date datediff
from
	/* 上一步的查询
	(select
		uid,
		base_date,
		retention_date
	from
		-- 表a,首日登录的用户
		(select
			distinct 
				date(time) base_date, 
				user_id uid
		from
			User_login
		where
			-- 需要计算留存的日期区间
			base_date between '2021-03-01' and '2021-03-31'
		group by
			uid, base_date) a
	left join
		-- 表b,表连接匹配后续登录的情况,计算留存
		(select
			distinct 
				date(time) retention_date, 
				user_id uid
		from
			User_login
		where
			-- 此处不应限制最大日期,否则可能导致后续留存率为0
			retention_date >= '2021-03-01'
		group by
			uid, retention_date) b
	on 
		a.uid = b.uid
	where 
		-- 保证每条记录不会连接到在当前记录时间之前的数据,如昨天的登录
		base_date <= retention_date
	order by
		uid, base_date) c
	*/
order by
	uid, base_date

结果如下(demo数据):
在这里插入图片描述

  1. 对应提取相应的两次登录的日期差,得到当日、次日、三日、七日等多个登录用户数的值;
select
	base_date,
	sum(case when datediff = 0 then 1 else 0 end) day_0,
	sum(case when datediff = 1 then 1 else 0 end) day_1,
	sum(case when datediff = 2 then 1 else 0 end) day_2,
	sum(case when datediff = 6 then 1 else 0 end) day_6,
	sum(case when datediff = 13 then 1 else 0 end) day_13,
	sum(case when datediff = 29 then 1 else 0 end) day_29
from
	/* 上一步的查询
	(select
		uid,
		base_date,
		retention_date,
		retention_date - base_date datediff
	from
		(select
			uid,
			base_date,
			retention_date
		from
			-- 表a,首日登录的用户
			(select
				distinct 
					date(time) base_date, 
					user_id uid
			from
				User_login
			where
				-- 需要计算留存的日期区间
				base_date between '2021-03-01' and '2021-03-31'
			group by
				uid, base_date) a
		left join
			-- 表b,表连接匹配后续登录的情况,计算留存
			(select
				distinct 
					date(time) retention_date, 
					user_id uid
			from
				User_login
			where
				-- 此处不应限制最大日期,否则可能导致后续留存率为0
				retention_date >= '2021-03-01'
			group by
				uid, retention_date) b
		on 
			a.uid = b.uid
		where 
			-- 保证每条记录不会连接到在当前记录时间之前的数据,如昨天的登录
			base_date <= retention_date
		order by
			uid, base_date) c
	order by
		uid, base_date) d
	*/
group by base_date
order by base_date

结果如下(demo数据,一共查询出31天每天的各个留存情况,其中day_0是当日的登录):
在这里插入图片描述

  1. 利用次日、三、七日等登录数除以当日登录数,得到对应留存率;
select
    base_date,
    -- 保留4位小数,用于百分比展示
    round(day_1/day_0, 4) retention_2,
    round(day_2/day_0, 4) retention_3,
    round(day_6/day_0, 4) retention_7,
    round(day_13/day_0, 4) retention_14,
    round(day_29/day_0, 4) retention_30
from
	/* 上一步的查询
	(select
		base_date,
		sum(case when datediff = 0 then 1 else 0 end) day_0,
		sum(case when datediff = 1 then 1 else 0 end) day_1,
		sum(case when datediff = 2 then 1 else 0 end) day_2,
		sum(case when datediff = 6 then 1 else 0 end) day_6,
		sum(case when datediff = 13 then 1 else 0 end) day_13,
		sum(case when datediff = 29 then 1 else 0 end) day_29
	from
		(select
			uid,
			base_date,
			retention_date,
			retention_date - base_date datediff
		from
			(select
				uid,
				base_date,
				retention_date
			from
				-- 表a,首日登录的用户
				(select
					distinct 
						date(time) base_date, 
						user_id uid
				from
					User_login
				where
					-- 需要计算留存的日期区间
					base_date between '2021-03-01' and '2021-03-31'
				group by
					uid, base_date) a
			left join
				-- 表b,表连接匹配后续登录的情况,计算留存
				(select
					distinct 
						date(time) retention_date, 
						user_id uid
				from
					User_login
				where
					-- 此处不应限制最大日期,否则可能导致后续留存率为0
					retention_date >= '2021-03-01'
				group by
					uid, retention_date) b
			on 
				a.uid = b.uid
			where 
				-- 保证每条记录不会连接到在当前记录时间之前的数据,如昨天的登录
				base_date <= retention_date
			order by
				uid, base_date) c
		order by
			uid, base_date) d
	group by base_date
	order by base_date)
	*/

结果如下(demo数据):
在这里插入图片描述
到这里所有活跃留存就查询完毕了,百分比的显示只需在metabase设置中调整一下即可,效果如下:
在这里插入图片描述
emmm…总感觉还是差点啥,是的没错!还有一环!!

  1. 增加日期选择交互框,按指定日期进行查询;

这也是最后一步了,如果在编写查询语句时就把查询时间写死了,当业务方想在平台上查看留存时,还得专门找数据小伙伴来调整时间,这不仅给数据侧带来更多工作量,还降低了业务方工作效率;

如果不限制这个时间,则每次查询都要把整个数据表都过一遍,这就不仅影响到查询效率了,当数据量日渐增多,这查询语句也会给数据库带来越来越大的压力;

因此可以使用时间筛选交互框的方式,让业务方自己选择想要查看的日期区间对应的留存情况,操作如下:(在日期筛选时把条件设为 { {date}} 形式,再在设置里将交互框调整为日期类型即可)

select
    base_date,
    -- 保留4位小数,用于百分比展示
    round(day_1/day_0, 4) retention_2,
    round(day_2/day_0, 4) retention_3,
    round(day_6/day_0, 4) retention_7,
    round(day_13/day_0, 4) retention_14,
    round(day_29/day_0, 4) retention_30
from
	(select
		base_date,
		sum(case when datediff = 0 then 1 else 0 end) day_0,
		sum(case when datediff = 1 then 1 else 0 end) day_1,
		sum(case when datediff = 2 then 1 else 0 end) day_2,
		sum(case when datediff = 6 then 1 else 0 end) day_6,
		sum(case when datediff = 13 then 1 else 0 end) day_13,
		sum(case when datediff = 29 then 1 else 0 end) day_29
	from
		(select
			uid,
			base_date,
			retention_date,
			retention_date - base_date datediff
		from
			(select
				uid,
				base_date,
				retention_date
			from
				-- 表a,首日登录的用户
				(select
					distinct 
						date(time) base_date, 
						user_id uid
				from
					User_login
				where
					-- 需要计算留存的日期区间
					-- 利用时间选择交互,手动选择时间区间后再运行
					base_date between {
   
   {start_date}} and {
   
   {end_date}}
				group by
					uid, base_date) a
			left join
				-- 表b,表连接匹配后续登录的情况,计算留存
				(select
					distinct 
						date(time) retention_date, 
						user_id uid
				from
					User_login
				where
					-- 此处不应限制最大日期,否则可能导致后续留存率为0
					retention_date >= {
   
   {start_date}}
				group by
					uid, retention_date) b
			on 
				a.uid = b.uid
			where 
				-- 保证每条记录不会连接到在当前记录时间之前的数据,如昨天的登录
				base_date <= retention_date
			order by
				uid, base_date) c
		order by
			uid, base_date) d
	group by base_date
	order by base_date)

效果如下:
在这里插入图片描述
在这里插入图片描述



猜你喜欢

转载自blog.csdn.net/weixin_44181744/article/details/115430787