Typical Hive Use Cases Explained: Beginner Level

1. Word count in SQL
Given the following text file:
hello tom hello jim
hello rose hello tom
tom love rose rose love jim
jim love tom love is what
what is love
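We need to do a word count with Hive.
-- Create the table that maps onto the text file
create table t_wc(sentence string);

-- Load the data
load data local inpath '/root/hivetest/xx.txt' into table t_wc;
Implementation:
Split each sentence on spaces, use explode to turn the resulting array into one row per word, then group by word and count.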

SELECT word
    ,count(1) as cnts
FROM (
    SELECT explode(split(sentence, ' ')) AS word
    FROM t_wc
    ) tmp
GROUP BY word
order by cnts desc
;
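
To check the expected counts without a Hive cluster, the same logic can be sketched in plain Python. This is only an illustration of what the query computes, not how Hive executes it; the variable names (lines, words, word_counts) are made up for the example.

```python
from collections import Counter

# The five sentences from the sample text file
lines = [
    "hello tom hello jim",
    "hello rose hello tom",
    "tom love rose rose love jim",
    "jim love tom love is what",
    "what is love",
]

# split(sentence, ' ') followed by explode: one entry per word
words = [word for sentence in lines for word in sentence.split(" ")]

# GROUP BY word with count(1), then ORDER BY cnts DESC
word_counts = Counter(words).most_common()

for word, cnts in word_counts:
    print(word, cnts)
```

The word with the highest count is love (5 occurrences), followed by hello and tom with 4 each. The relative order of words with equal counts is not guaranteed by the Hive query, since it only sorts by cnts.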

2. Cumulative report query
(each step uses the result of the previous step as the input to the next)
Given the following data:

A,2015-01-08,5
A,2015-01-11,15
B,2015-01-12,5
A,2015-01-12,8
B,2015-01-13,25
A,2015-01-13,5
C,2015-01-09,10
C,2015-01-11,20
A,2015-02-10,4
A,2015-02-11,6
C,2015-01-12,30
C,2015-02-13,10
B,2015-02-10,10
B,2015-02-11,5
A,2015-03-20,14
A,2015-03-21,6
B,2015-03-11,20
B,2015-03-12,25
C,2015-03-10,10
C,2015-03-11,20

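We need to write an HQL script that produces the following cumulative report (first rows shown):
| user | month | monthly total | cumulative total up to that month |
| A | 2015-01 | 33 | 33 |
| A | 2015-02 | 10 | 43 |
| A | 2015-03 | 20 | 63 |
| B | 2015-01 | 30 | 30 |
| B | 2015-02 | 15 | 45 |
(1) Create the table and load the data (create a file on Linux containing the data above)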

 create table t_sale(username string,day string,msale int)
   row format delimited fields terminated by ",";
   load data local inpath '/root/sale.txt' into table t_sale;

(2) Aggregate the sales by month

 select  username,substr(day,1,7) as month_sale,sum(msale)
   from t_sale
   group by username,substr(day,1,7);

Result:
+----------+------------+-----+
| username | month_sale | _c2 |
+----------+------------+-----+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| A | 2015-03 | 20 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
| B | 2015-03 | 45 |
| C | 2015-01 | 60 |
| C | 2015-02 | 10 |
| C | 2015-03 | 30 |
+----------+------------+-----+
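
The monthly totals can be double-checked with a small Python sketch of the same grouping. It is an illustration only; the names rows and monthly are invented for the example, and the data is the sample listed at the start of this section.

```python
from collections import defaultdict

# Sample data: (username, day, msale)
rows = [
    ("A", "2015-01-08", 5), ("A", "2015-01-11", 15), ("B", "2015-01-12", 5),
    ("A", "2015-01-12", 8), ("B", "2015-01-13", 25), ("A", "2015-01-13", 5),
    ("C", "2015-01-09", 10), ("C", "2015-01-11", 20), ("A", "2015-02-10", 4),
    ("A", "2015-02-11", 6), ("C", "2015-01-12", 30), ("C", "2015-02-13", 10),
    ("B", "2015-02-10", 10), ("B", "2015-02-11", 5), ("A", "2015-03-20", 14),
    ("A", "2015-03-21", 6), ("B", "2015-03-11", 20), ("B", "2015-03-12", 25),
    ("C", "2015-03-10", 10), ("C", "2015-03-11", 20),
]

# GROUP BY username, substr(day, 1, 7) with sum(msale)
monthly = defaultdict(int)
for username, day, msale in rows:
    month = day[:7]  # same as substr(day, 1, 7): the "yyyy-MM" prefix
    monthly[(username, month)] += msale

for (username, month), total in sorted(monthly.items()):
    print(username, month, total)
```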
(3) Computing the running total up to each month requires a self-join

select t1.*,t2.* from
(select  username,substr(day,1,7) as month_sale,sum(msale) as cnt   -- self-join of the monthly totals
from t_sale
group by username,substr(day,1,7)) t1
join
(select  username,substr(day,1,7) as month_sale,sum(msale) as cnt
from t_sale
group by username,substr(day,1,7)) t2
on t1.username=t2.username;


Result:
+-------------+---------------+--------+-------------+---------------+--------+
| t1.username | t1.month_sale | t1.cnt | t2.username | t2.month_sale | t2.cnt |
+-------------+---------------+--------+-------------+---------------+--------+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-01 | 33 | A | 2015-03 | 20 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-03 | 20 |
| A | 2015-03 | 20 | A | 2015-01 | 33 |
| A | 2015-03 | 20 | A | 2015-02 | 10 |
| A | 2015-03 | 20 | A | 2015-03 | 20 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-01 | 30 | B | 2015-03 | 45 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-03 | 45 |
| B | 2015-03 | 45 | B | 2015-01 | 30 |
| B | 2015-03 | 45 | B | 2015-02 | 15 |
| B | 2015-03 | 45 | B | 2015-03 | 45 |
| C | 2015-01 | 60 | C | 2015-01 | 60 |
| C | 2015-01 | 60 | C | 2015-02 | 10 |
| C | 2015-01 | 60 | C | 2015-03 | 30 |
| C | 2015-02 | 10 | C | 2015-01 | 60 |
| C | 2015-02 | 10 | C | 2015-02 | 10 |
| C | 2015-02 | 10 | C | 2015-03 | 30 |
| C | 2015-03 | 30 | C | 2015-01 | 60 |
| C | 2015-03 | 30 | C | 2015-02 | 10 |
| C | 2015-03 | 30 | C | 2015-03 | 30 |
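+-------------+---------------+--------+-------------+---------------+--------+
(4) Keep only the rows where t1's month is on or after t2's month (these are the only rows needed for the running total; the rest can be discarded)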

select t1.*,t2.* from
(select  username,substr(day,1,7) as month_sale,sum(msale) as cnt
from t_sale
group by username,substr(day,1,7)) t1
join
(select  username,substr(day,1,7) as month_sale,sum(msale) as cnt
from t_sale
group by username,substr(day,1,7)) t2
on t1.username=t2.username
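where t1.month_sale>=t2.month_sale    -- keep pairs where t1's month is on or after t2's month
;

+-------------+---------------+--------+-------------+---------------+--------+
| t1.username | t1.month_sale | t1.cnt | t2.username | t2.month_sale | t2.cnt |
+-------------+---------------+--------+-------------+---------------+--------+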
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| A | 2015-03 | 20 | A | 2015-01 | 33 |
| A | 2015-03 | 20 | A | 2015-02 | 10 |
| A | 2015-03 | 20 | A | 2015-03 | 20 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
| B | 2015-03 | 45 | B | 2015-01 | 30 |
| B | 2015-03 | 45 | B | 2015-02 | 15 |
| B | 2015-03 | 45 | B | 2015-03 | 45 |
| C | 2015-01 | 60 | C | 2015-01 | 60 |
| C | 2015-02 | 10 | C | 2015-01 | 60 |
| C | 2015-02 | 10 | C | 2015-02 | 10 |
| C | 2015-03 | 30 | C | 2015-01 | 60 |
| C | 2015-03 | 30 | C | 2015-02 | 10 |
| C | 2015-03 | 30 | C | 2015-03 | 30 |
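+-------------+---------------+--------+-------------+---------------+--------+
(5) Aggregate: for each user and month, sum the t2 totals to get the running total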

select t1.username,min(t1.cnt),t1.month_sale,sum(t2.cnt) from
(select  username,substr(day,1,7) as month_sale,sum(msale) as cnt
from t_sale
group by username,substr(day,1,7)) t1
join
(select  username,substr(day,1,7) as month_sale,sum(msale) as cnt
from t_sale
group by username,substr(day,1,7)) t2
on t1.username=t2.username
where t1.month_sale>=t2.month_sale
group by t1.username,t1.month_sale
;

+-------------+-----+---------------+-----+
| t1.username | _c1 | t1.month_sale | _c3 |
+-------------+-----+---------------+-----+
| A | 33 | 2015-01 | 33 |
| A | 10 | 2015-02 | 43 |
| A | 20 | 2015-03 | 63 |
| B | 30 | 2015-01 | 30 |
| B | 15 | 2015-02 | 45 |
| B | 45 | 2015-03 | 90 |
| C | 60 | 2015-01 | 60 |
| C | 10 | 2015-02 | 70 |
| C | 30 | 2015-03 | 100 |
+-------------+-----+---------------+-----+
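
The self-join, filter and aggregation of steps (3) to (5) can be traced in a few lines of Python. This is a sketch of the logic only, starting from the monthly totals of step (2); the names monthly and report are invented for the example.

```python
# Monthly totals from step (2): (username, month) -> total
monthly = {
    ("A", "2015-01"): 33, ("A", "2015-02"): 10, ("A", "2015-03"): 20,
    ("B", "2015-01"): 30, ("B", "2015-02"): 15, ("B", "2015-03"): 45,
    ("C", "2015-01"): 60, ("C", "2015-02"): 10, ("C", "2015-03"): 30,
}

report = []
for (u1, m1), cnt1 in sorted(monthly.items()):
    # Join on username and keep pairs where t1's month >= t2's month,
    # then sum t2's totals: this is the running total up to month m1.
    # "yyyy-MM" strings compare correctly as plain strings.
    running = sum(
        cnt2 for (u2, m2), cnt2 in monthly.items()
        if u1 == u2 and m1 >= m2
    )
    report.append((u1, cnt1, m1, running))

for row in report:
    print(*row)
```

Each output row has the same column order as the Hive result: user, monthly total, month, running total. Hive also supports windowed aggregates of the form sum(...) over (partition by ... order by ...), which may express the same running total without a self-join.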

3. Requirement: find the shops that have sales records on three consecutive days
A,2017-10-11,300
A,2017-10-12,200
A,2017-10-13,100
A,2017-10-15,100
A,2017-10-16,300
A,2017-10-17,150
A,2017-10-18,340
A,2017-10-19,360

B,2017-10-11,400
B,2017-10-12,200
B,2017-10-15,600

C,2017-10-11,350
C,2017-10-13,250
C,2017-10-14,300
C,2017-10-15,400
C,2017-10-16,200
D,2017-10-13,500
E,2017-10-14,600
E,2017-10-15,500
D,2017-10-14,600

(1) Create the table and load the data

create table t_jd(shopid string,dt string,sale int)
row format delimited fields terminated by ',';
load data local inpath '/root/sale.dat' into table t_jd;

(2) Number the rows

select shopid,dt,sale,
row_number() over(partition by shopid order by dt) as rn -- number each shop's rows in date order
from t_jd;

+--------+----+------+----+
| shopid | dt | sale | rn |
+--------+----+------+----+
| A | 2017-10-11 | 300 | 1 |
| A | 2017-10-12 | 200 | 2 |
| A | 2017-10-13 | 100 | 3 |
| A | 2017-10-15 | 100 | 4 |
| A | 2017-10-16 | 300 | 5 |
| A | 2017-10-17 | 150 | 6 |
| A | 2017-10-18 | 340 | 7 |
| A | 2017-10-19 | 360 | 8 |
| B | 2017-10-11 | 400 | 1 |
| B | 2017-10-12 | 200 | 2 |
| B | 2017-10-15 | 600 | 3 |
| C | 2017-10-11 | 350 | 1 |
| C | 2017-10-13 | 250 | 2 |
| C | 2017-10-14 | 300 | 3 |
| C | 2017-10-15 | 400 | 4 |
| C | 2017-10-16 | 200 | 5 |
| D | 2017-10-13 | 500 | 1 |
| D | 2017-10-14 | 600 | 2 |
| E | 2017-10-14 | 600 | 1 |
| E | 2017-10-15 | 500 | 2 |
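+--------+----+------+----+

(3) Subtract the row number from each date; rows that fall on consecutive days end up with the same resulting date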

select shopid,dt,sale,rn,
date_sub(to_date(dt),rn) 
from
(select shopid,dt,sale,
row_number() over(partition by shopid order by dt) as rn 
from t_jd) tmp;

+--------+----+------+----+-----+
| shopid | dt | sale | rn | _c4 |
+--------+----+------+----+-----+
| A | 2017-10-11 | 300 | 1 | 2017-10-10 |
| A | 2017-10-12 | 200 | 2 | 2017-10-10 |
| A | 2017-10-13 | 100 | 3 | 2017-10-10 |
| A | 2017-10-15 | 100 | 4 | 2017-10-11 |
| A | 2017-10-16 | 300 | 5 | 2017-10-11 |
| A | 2017-10-17 | 150 | 6 | 2017-10-11 |
| A | 2017-10-18 | 340 | 7 | 2017-10-11 |
| A | 2017-10-19 | 360 | 8 | 2017-10-11 |
| B | 2017-10-11 | 400 | 1 | 2017-10-10 |
| B | 2017-10-12 | 200 | 2 | 2017-10-10 |
| B | 2017-10-15 | 600 | 3 | 2017-10-12 |
| C | 2017-10-11 | 350 | 1 | 2017-10-10 |
| C | 2017-10-13 | 250 | 2 | 2017-10-11 |
| C | 2017-10-14 | 300 | 3 | 2017-10-11 |
| C | 2017-10-15 | 400 | 4 | 2017-10-11 |
| C | 2017-10-16 | 200 | 5 | 2017-10-11 |
| D | 2017-10-13 | 500 | 1 | 2017-10-12 |
| D | 2017-10-14 | 600 | 2 | 2017-10-12 |
| E | 2017-10-14 | 600 | 1 | 2017-10-13 |
| E | 2017-10-15 | 500 | 2 | 2017-10-13 |
+--------+----+------+----+-----+
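
Why the subtraction works: within a run of consecutive days, the date and the row number both increase by one per row, so their difference stays constant; a gap in the dates shifts the difference. The following Python sketch reproduces the _c4 column for shop A. It is an illustration only, and the names dates and flags are invented for the example.

```python
from datetime import date, timedelta

# Shop A's sale dates in ascending order, as row_number() would see them
dates = [date(2017, 10, d) for d in (11, 12, 13, 15, 16, 17, 18, 19)]

# date_sub(to_date(dt), rn): subtract the 1-based row number in days
flags = [dt - timedelta(days=rn) for rn, dt in enumerate(dates, start=1)]

for dt, flag in zip(dates, flags):
    print(dt, flag)
```

The first three rows map to 2017-10-10 and the remaining five to 2017-10-11, which matches the table above.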

(4) Group by shop and resulting date, then count the rows in each group

select shopid,count(1) as cnt 
from
(select shopid,dt,sale,rn,
date_sub(to_date(dt),rn) as flag
from
(select shopid,dt,sale,
row_number() over(partition by shopid order by dt) as rn 
from t_jd) tmp) tmp2
group by shopid,flag
;
+--------+-----+
| shopid | cnt |
+--------+-----+
| A | 3 |
| A | 5 |
| B | 2 |
| B | 1 |
| C | 1 |
| C | 4 |
| D | 2 |
| E | 2 |
+--------+-----+
(5) Keep the groups with at least 3 consecutive days

select shopid from
(select shopid,count(1) as cnt 
from
(select shopid,dt,sale,rn,
date_sub(to_date(dt),rn) as flag
from
(select shopid,dt,sale,
row_number() over(partition by shopid order by dt) as rn 
from t_jd) tmp) tmp2
group by shopid,flag) tmp3
where tmp3.cnt>=3
;
+--------+
| shopid |
+--------+
| A |
| A |
| C |
+--------+
(6) Deduplicate

select distinct shopid from
(select shopid,count(1) as cnt 
from
(select shopid,dt,sale,rn,
date_sub(to_date(dt),rn) as flag
from
(select shopid,dt,sale,
row_number() over(partition by shopid order by dt) as rn 
from t_jd) tmp) tmp2
group by shopid,flag) tmp3
where tmp3.cnt>=3
;
+--------+
| shopid |
+--------+
| A |
| C |
+--------+
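
The whole pipeline of steps (2) to (6) can be traced end to end in Python. This is a sketch of the logic under the assumption that each shop has at most one row per day; the sale amount is left out because it does not affect the result, and the names rows, by_shop, run_lengths and shops are invented for the example.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

# Sample data reduced to (shopid, dt)
rows = [
    ("A", "2017-10-11"), ("A", "2017-10-12"), ("A", "2017-10-13"),
    ("A", "2017-10-15"), ("A", "2017-10-16"), ("A", "2017-10-17"),
    ("A", "2017-10-18"), ("A", "2017-10-19"),
    ("B", "2017-10-11"), ("B", "2017-10-12"), ("B", "2017-10-15"),
    ("C", "2017-10-11"), ("C", "2017-10-13"), ("C", "2017-10-14"),
    ("C", "2017-10-15"), ("C", "2017-10-16"),
    ("D", "2017-10-13"), ("E", "2017-10-14"), ("E", "2017-10-15"),
    ("D", "2017-10-14"),
]

# partition by shopid
by_shop = defaultdict(list)
for shopid, dt in rows:
    by_shop[shopid].append(date.fromisoformat(dt))

# row_number() over (order by dt), then date_sub(dt, rn) as flag,
# then count(1) grouped by (shopid, flag)
run_lengths = Counter()
for shopid, dates in by_shop.items():
    for rn, dt in enumerate(sorted(dates), start=1):
        flag = dt - timedelta(days=rn)
        run_lengths[(shopid, flag)] += 1

# where cnt >= 3, then select distinct shopid
shops = sorted({shopid for (shopid, _), cnt in run_lengths.items() if cnt >= 3})
print(shops)
```

If a shop could have several rows for the same day, the row numbers would no longer advance in step with the dates, so the data would need to be reduced to one row per shop and day first.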


Reposted from blog.csdn.net/qq_41166135/article/details/82227650