Classic Hive SQL Interview Questions

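1. Requirement:
    We have the following user visit data:
    userId visitDate visitCount
    u01 2017/1/21 5
    u02 2017/1/23 6
    u03 2017/1/22 8
    u04 2017/1/20 3
    u01 2017/1/23 6
    u01 2017/2/21 8
    u02 2017/1/23 6
    u01 2017/2/22 4
    Use SQL to compute, for each user, the number of visits in each month (subtotal) and the running cumulative total, as shown in the table below:
    userId month subtotal cumulative
    u01 2017-01 11 11
    u01 2017-02 12 23
    u02 2017-01 12 12
    u03 2017-01 8 8
    u04 2017-01 3 3
    Data preparation: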
CREATE TABLE test_sql.test1 ( 
        userId string, 
        visitDate string,
        visitCount INT )
    ROW format delimited FIELDS TERMINATED BY "\t";
    INSERT INTO TABLE test_sql.test1
    VALUES
        ( 'u01', '2017/1/21', 5 ),
        ( 'u02', '2017/1/23', 6 ),
        ( 'u03', '2017/1/22', 8 ),
        ( 'u04', '2017/1/20', 3 ),
        ( 'u01', '2017/1/23', 6 ),
        ( 'u01', '2017/2/21', 8 ),
        ( 'u02', '2017/1/23', 6 ),
        ( 'u01', '2017/2/22', 4 );

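Analysis:
1. Use a date function together with a regular-expression replacement to turn each visit date into the month it belongs to: date_format(regexp_replace(visitdate,'/','-'),'yyyy-MM')
2. Group by userid and visitmonth and sum the visit counts to get each user's monthly subtotal.
3. On top of that subquery, use a window function partitioned by userid and ordered by month to compute the running total:
sum(subTotal) over(partition by userid order by visitmonth)
Final implementation: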

select userid, visitmonth, subTotal,
       sum(subTotal) over(partition by userid order by visitmonth) as total
from
  (select userid,
          visitmonth,
          sum(visitcount) as subTotal
   from
     (select userid,
             date_format(regexp_replace(visitdate,'/','-'),'yyyy-MM') as visitmonth,
             visitcount
      from test1) as t1
   group by userid, visitmonth) as t2;
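
The month extraction above relies on date_format accepting a date string without zero padding, such as '2017-1-21', and that behaviour may differ between Hive versions. As a hedged alternative that is not part of the original answer, the month can be built with string functions only. The sketch below uses split and lpad and assumes visitdate always has the form year/month/day:

```sql
-- Build 'yyyy-MM' from 'yyyy/M/d' without any date parsing.
-- split(...)[0] is the year, split(...)[1] is the month, left-padded to 2 digits.
SELECT userid,
       concat(split(visitdate, '/')[0], '-',
              lpad(split(visitdate, '/')[1], 2, '0')) AS visitmonth,
       visitcount
FROM test_sql.test1;
```

For the row ('u01', '2017/1/21', 5) this yields visitmonth = '2017-01', so it can replace the innermost subquery without changing the rest of the statement.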

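2. Requirement:
There are 500,000 JD.com shops. Every time a customer visits any product in any shop, one access log record is generated.
The access log is stored in a table named Visit; the visitor's user id is user_id and the name of the visited shop is shop. Sample data: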

            u1  a
            u2  b
            u1  b
            u1  a
            u3  c
            u4  b
            u1  a
            u2  c
            u5  b
            u4  b
            u6  c
            u2  c
            u1  b
            u2  a
            u2  a
            u3  a
            u5  a
            u5  a
            u5  a

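Compute:
(1) The UV (number of distinct visitors) of each shop.
(2) The top 3 visitors by visit count for each shop. Output the shop name, visitor id, and visit count.
Data preparation: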

CREATE TABLE test_sql.test2 ( 
                         user_id string, 
                         shop string )
            ROW format delimited FIELDS TERMINATED BY '\t'; 
            INSERT INTO TABLE test_sql.test2 VALUES
            ( 'u1', 'a' ),
            ( 'u2', 'b' ),
            ( 'u1', 'b' ),
            ( 'u1', 'a' ),
            ( 'u3', 'c' ),
            ( 'u4', 'b' ),
            ( 'u1', 'a' ),
            ( 'u2', 'c' ),
            ( 'u5', 'b' ),
            ( 'u4', 'b' ),
            ( 'u6', 'c' ),
            ( 'u2', 'c' ),
            ( 'u1', 'b' ),
            ( 'u2', 'a' ),
            ( 'u2', 'a' ),
            ( 'u3', 'a' ),
            ( 'u5', 'a' ),
            ( 'u5', 'a' ),
            ( 'u5', 'a' );       

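Analysis:
(1) The visitor count must be de-duplicated, so count distinct user_id values.
(2) First use a subquery to compute how many times each user visited each shop, then apply a ranking function partitioned by shop and ordered by visit count in descending order, and keep the top 3 rows of each shop.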

(1)
SELECT shop,
       count(DISTINCT user_id) AS uv
FROM test_sql.test2
GROUP BY shop;

(2)
SELECT shop, user_id, visitNum, rankOrder
FROM
  (SELECT shop, user_id, visitNum,
          row_number() OVER (PARTITION BY shop ORDER BY visitNum DESC) AS rankOrder
   FROM
     (SELECT shop, user_id, count(user_id) AS visitNum
      FROM test_sql.test2
      GROUP BY shop, user_id) AS t1) AS t2
WHERE rankOrder <= 3;
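
row_number() assigns distinct numbers even to visitors with the same visit count, so when counts tie at the cut-off, which visitor makes the top 3 is arbitrary. If ties should all be kept, one possible variant (a sketch, not part of the original answer) swaps in dense_rank(), which keeps every visitor whose count is among the three highest distinct counts of the shop:

```sql
-- Variant of query (2): dense_rank() gives tied visit counts the same rank,
-- so a shop may return more than 3 rows when counts tie.
SELECT shop, user_id, visitNum, rankOrder
FROM
  (SELECT shop, user_id, visitNum,
          dense_rank() OVER (PARTITION BY shop ORDER BY visitNum DESC) AS rankOrder
   FROM
     (SELECT shop, user_id, count(*) AS visitNum
      FROM test_sql.test2
      GROUP BY shop, user_id) t1) t2
WHERE rankOrder <= 3;
```

Whether ties should be expanded like this or cut off at exactly three rows depends on how the interviewer defines "top 3".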

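3. Requirement:
A table STG.ORDER has the following fields: Date, Order_id, User_id, amount.
Sample row: 2017-01-01,10029028,1000003251,33.57.
Write SQL for the following statistics:
(1) The number of orders, the number of users, and the total transaction amount for each month of 2017.
(2) The number of new customers in November 2017 (customers whose first order was placed in November).
Data preparation: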

CREATE TABLE test_sql.test3 ( 
            dt string,
            order_id string, 
            user_id string, 
            amount DECIMAL ( 10, 2 ) )
ROW format delimited FIELDS TERMINATED BY '\t';
INSERT INTO TABLE test_sql.test3 VALUES ('2017-01-01','10029028','1000003251',33.57);
INSERT INTO TABLE test_sql.test3 VALUES ('2017-01-01','10029029','1000003251',33.57);
INSERT INTO TABLE test_sql.test3 VALUES ('2017-01-01','100290288','1000003252',33.57);
INSERT INTO TABLE test_sql.test3 VALUES ('2017-02-02','10029088','1000003251',33.57);
INSERT INTO TABLE test_sql.test3 VALUES ('2017-02-02','100290281','1000003251',33.57);
INSERT INTO TABLE test_sql.test3 VALUES ('2017-02-02','100290282','1000003253',33.57);
INSERT INTO TABLE test_sql.test3 VALUES ('2017-11-02','10290282','100003253',234);
INSERT INTO TABLE test_sql.test3 VALUES ('2018-11-02','10290284','100003243',234);

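Analysis:
(1) Convert the date to a year-month value, keep only 2017, group by month, and compute the distinct user count, the total amount, and the order count.
(2) Group by user_id and use min(dt) in the HAVING clause to find each user's first order date. A user is a new customer of November 2017 when that first order falls in 2017-11; the answer is the number of such users.

Implementation: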

-- (2) number of new customers in November 2017
select count(*) as new_user_cnt
from
(select user_id
from test_sql.test3
group by user_id
having date_format(min(dt),'yyyy-MM')='2017-11') t1;
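
The implementation above only covers part (2). For part (1), the steps in the analysis could be sketched as follows. This is an illustrative query rather than the original author's answer; the column aliases are made up, and it assumes dt is always a 'yyyy-MM-dd' string:

```sql
-- (1) Orders, distinct users, and total amount for each month of 2017.
SELECT date_format(dt, 'yyyy-MM') AS order_month,
       count(DISTINCT order_id)   AS order_cnt,
       count(DISTINCT user_id)    AS user_cnt,
       sum(amount)                AS total_amount
FROM test_sql.test3
WHERE year(dt) = 2017
GROUP BY date_format(dt, 'yyyy-MM');
```

count(DISTINCT order_id) is used so that a repeated log line for the same order would not be counted twice; if order_id is guaranteed unique, count(*) is enough.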

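A version that fails on older Hive releases (reason: before Hive 0.13, IN / NOT IN subqueries in the WHERE clause are not supported; an IN-style condition can be rewritten as a LEFT JOIN):

select * from test3 where dt>='2017-11-01' and user_id not in 
(select distinct user_id from test3 where dt<'2017-11-01')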
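
The LEFT JOIN rewrite mentioned above could look like the sketch below. It is an assumption about what the author had in mind, not their code. Note that it also adds an upper bound on dt, which the NOT IN version lacks, so that orders after November 2017 are excluded:

```sql
-- Anti-join: keep November 2017 orders whose user has no order before 2017-11-01.
SELECT t.*
FROM test_sql.test3 t
LEFT JOIN
  (SELECT DISTINCT user_id
   FROM test_sql.test3
   WHERE dt < '2017-11-01') old_users
ON t.user_id = old_users.user_id
WHERE t.dt >= '2017-11-01'
  AND t.dt <  '2017-12-01'
  AND old_users.user_id IS NULL;   -- no earlier order found for this user
```

Wrapping this in count(DISTINCT user_id) would give the new-customer count asked for in part (2).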

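4. Requirement:
There is a user file with 50 million records (user_id, name, age) and a movie-viewing log file with 200 million records (user_id, url). Rank the age groups by the number of movies watched.

Data preparation: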

CREATE TABLE test_sql.test4user
           (user_id string,
            name string,
            age int);

CREATE TABLE test_sql.test4log
                        (user_id string,
                        url string);
insert into test4user values 
('001','u1',10),
('002','u2',15),   
('003','u3',15),   
('004','u4',20),   
('005','u5',25),   
('006','u6',35),   
('007','u7',40),
('008','u8',45),  
('009','u9',50),  
('0010','u10',65);
insert into test4log values 
('001','url1'),
('002','url1'),   
('003','url2'),   
('004','url3'),   
('005','url3'),   
('006','url1'),   
('007','url5'),
('008','url7'),  
('009','url5'),  
('0010','url1');

Analysis:

  1. The outermost step is a sort.
  2. Before sorting, compute the number of views for each age group.
  3. To get that, first compute the number of views per user, then join to the user table to obtain the age.

-- sum the per-user view counts within each age group, then sort
select t2.age_phase, sum(t2.num) as num
from 
(select
case when age<=10 and age>0 then '0-10'
  WHEN age <= 20 AND age > 10 THEN '10-20'
  WHEN age >20 AND age <=30 THEN '20-30'
  WHEN age >30 AND age <=40 THEN '30-40'
  WHEN age >40 AND age <=50 THEN '40-50'
  WHEN age >50 AND age <=60 THEN '50-60'
  WHEN age >60 AND age <=70 THEN '60-70'
  ELSE 'more than 70' END as age_phase,
coalesce(t1.num,0) as num   -- users without any log rows count as 0 views
from
test4user  left join 
(select user_id,count(1) as num from test4log group by user_id) t1
on t1.user_id = test4user.user_id) t2
group by t2.age_phase
order by num desc;

Reference answer (the original author noted that the Chinese label '70以上' raised an error in their environment, so the ASCII label 'more than 70' is used here):

```sql
SELECT t2.age_phase,
  sum(t1.cnt) AS view_cnt
FROM
(SELECT user_id,
  count(*) cnt
FROM test_sql.test4log
GROUP BY user_id) t1
JOIN
(SELECT user_id,
  CASE WHEN age <= 10 AND age > 0 THEN '0-10' 
  WHEN age <= 20 AND age > 10 THEN '10-20'
  WHEN age >20 AND age <=30 THEN '20-30'
  WHEN age >30 AND age <=40 THEN '30-40'
  WHEN age >40 AND age <=50 THEN '40-50'
  WHEN age >50 AND age <=60 THEN '50-60'
  WHEN age >60 AND age <=70 THEN '60-70'
  ELSE 'more than 70' END as age_phase
FROM test_sql.test4user) t2 ON t1.user_id = t2.user_id 
GROUP BY t2.age_phase
ORDER BY view_cnt DESC;
```
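5. Requirement:
Write SQL that returns, for every user, the amount of their first purchase in October of this year.
Fields of table ordertable:
(buyer: userid, amount: money, purchase time: paymenttime (format: 2017-10-01), order id: orderid)
Data preparation: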

```sql
create table test6 (
userid string,
money decimal(10,2),
paymenttime string,
orderid string
)
row format delimited fields terminated by "\t";
insert into test6
values
('001',100,'2017-10-01','123'),
('001',200,'2017-10-02','124'),
('002',500,'2017-10-01','125'),
('001',100,'2017-11-01','126');
```

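Analysis:
1. Select all orders placed in October, using a date function:

date_format(paymenttime,'yyyy-MM')

2. Number the orders in that range with row_number() over a window partitioned by userid and ordered by paymenttime.
3. Keep the rows whose row_number is 1.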

select userid,paymenttime,money,orderid from (
select userid,paymenttime,money,orderid,
row_number() over(partition by userid order by paymenttime) as rowNum
from test6
where date_format(paymenttime,'yyyy-MM')='2017-10') t1 where rowNum =1;
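
The query above returns each user's earliest order within October. The requirement can also be read as "users whose first purchase ever happened in October". Under that reading (an interpretation, not something the original states), the ranking has to run over all orders and the month filter moves to the outer query, roughly like this:

```sql
-- Rank every order of a user, then keep the first one only if it falls in 2017-10.
SELECT userid, paymenttime, money, orderid
FROM
  (SELECT userid, paymenttime, money, orderid,
          row_number() OVER (PARTITION BY userid ORDER BY paymenttime) AS rowNum
   FROM test6) t1
WHERE rowNum = 1
  AND date_format(paymenttime, 'yyyy-MM') = '2017-10';
```

On the sample data both readings return the same rows, because neither user has an order before October 2017.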

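6. Requirement:
Given the log below, write code that computes the total number and the average age of all users and of active users. (An active user is one with visit records on two consecutive days.)
date user age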
2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19
Data preparation:

create table test5(
  dt string,
  user_id string,
  age int)
  row format delimited fields terminated by "\t";
insert into test5 
values
('2019-02-11','test_1',23),
('2019-02-11','test_2',19),
('2019-02-11','test_3',39),
('2019-02-11','test_1',23),
('2019-02-11','test_3',39),
('2019-02-11','test_1',23),
('2019-02-12','test_2',19),
('2019-02-13','test_1',23),
('2019-02-15','test_2',19),                                        
('2019-02-16','test_2',19);

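Analysis
Method 1:

  1. Build the base data: use the window function lag to compare each visit date with the same user's previous visit date, and use case when to mark the users who visited on consecutive days.
  2. Once the temporary table is built, compute the counts and averages from it.

Implementation: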

with tmp as 
(
-- one row per user; loginDiffGroup = 1 if the user has visits on two consecutive days
select distinct user_id,age,
max(loginDiff) over(partition by user_id) as loginDiffGroup
from 
(
select user_id,
age,
-- 1 when this visit is exactly one day after the user's previous visit
case datediff(dt,lag(dt) over(partition by user_id order by dt)) when 1 then 1
else 0 end as loginDiff
from test5) t1)

select count(*) as total_user_cnt,
avg(age) as total_user_avg_age,
sum(loginDiffGroup) as active_user_cnt,
sum(age*loginDiffGroup)/sum(loginDiffGroup) as active_user_avg_age
from tmp;

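Analysis, method 2:
1. Split the work in two: compute the count and average age of all users on one side, find the users with consecutive-day visits on the other, then UNION ALL the two results and aggregate them into one row.
2. The statistics for all users are straightforward: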
SELECT count(*) total_user_cnt,
                    cast(sum(age) /count(*) AS decimal(5,2)) total_user_avg_age,
                    0 two_days_cnt,
                    0 avg_age
   FROM
     (SELECT user_id,
             max(age) age
      FROM test5
      GROUP BY user_id) t5
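3. The statistics for users with consecutive-day visits:
    First identify runs of consecutive days, then group.
    (Subtract row_number from the date to get a flag that is constant within a run of consecutive days; then group by user and flag, and any group with count >= 2 is a run of at least two consecutive days.)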
SELECT 0 total_user_cnt,
          0 total_user_avg_age,
          count(*) AS two_days_cnt,
          cast(sum(age) / count(*) AS decimal(5,2)) AS avg_age
   FROM
     (SELECT user_id,
             max(age) age
      FROM
        (SELECT user_id,
                max(age) age
         FROM
           (SELECT user_id,
                   age,
                   date_sub(dt,rank) flag
            FROM
              (SELECT dt,
                      user_id,
                      max(age) age,
                      row_number() over(PARTITION BY user_id
                                        ORDER BY dt) rank
               FROM test5
               GROUP BY dt,
                        user_id) t1) t2
         GROUP BY user_id,
                  flag
         HAVING count(*) >=2) t3
      GROUP BY user_id) t4
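
To see why the date-minus-row_number flag identifies consecutive days, it can help to look at the intermediate rows on their own. The query below is only an inspection aid built from the innermost subqueries above; the alias rn is chosen here to avoid reusing the function name rank:

```sql
-- One row per (user, day); flag = dt minus the row number within the user.
-- Consecutive days share the same flag value.
SELECT user_id,
       dt,
       rn,
       date_sub(dt, rn) AS flag
FROM
  (SELECT dt,
          user_id,
          row_number() OVER (PARTITION BY user_id ORDER BY dt) AS rn
   FROM test5
   GROUP BY dt, user_id) t1;
```

For test_2 in the sample data, 2019-02-11 and 2019-02-12 both map to flag 2019-02-10, while 2019-02-15 and 2019-02-16 both map to 2019-02-12, so each pair forms a group of size 2. For test_1, 2019-02-11 and 2019-02-13 get different flags and no group reaches size 2.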

4. Finally, combine the two parts with UNION ALL:

SELECT sum(total_user_cnt) total_user_cnt,
       sum(total_user_avg_age) total_user_avg_age,
       sum(two_days_cnt) two_days_cnt,
       sum(avg_age) avg_age
FROM
  (SELECT 0 total_user_cnt,
          0 total_user_avg_age,
          count(*) AS two_days_cnt,
          cast(sum(age) / count(*) AS decimal(5,2)) AS avg_age
   FROM
     (SELECT user_id,
             max(age) age
      FROM
        (SELECT user_id,
                max(age) age
         FROM
           (SELECT user_id,
                   age,
                   date_sub(dt,rank) flag
            FROM
              (SELECT dt,
                      user_id,
                      max(age) age,
                      row_number() over(PARTITION BY user_id
                                        ORDER BY dt) rank
               FROM test5
               GROUP BY dt,
                        user_id) t1) t2
         GROUP BY user_id,
                  flag
         HAVING count(*) >=2) t3
      GROUP BY user_id) t4
   UNION ALL SELECT count(*) total_user_cnt,
                    cast(sum(age) /count(*) AS decimal(5,2)) total_user_avg_age,
                    0 two_days_cnt,
                    0 avg_age
   FROM
     (SELECT user_id,
             max(age) age
      FROM test5
      GROUP BY user_id) t5) t6

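7. Requirement:
A library management database has the following three data models:

Book (table name: BOOK)

| No. | Field name | Description | Type |
|---|---|---|---|
| 1 | BOOK_ID | General number | Text |
| 2 | SORT | Classification number | Text |
| 3 | BOOK_NAME | Title | Text |
| 4 | WRITER | Author | Text |
| 5 | OUTPUT | Publisher | Text |
| 6 | PRICE | Unit price | Numeric (2 decimal places) |

Reader (table name: READER)

| No. | Field name | Description | Type |
|---|---|---|---|
| 1 | READER_ID | Library card number | Text |
| 2 | COMPANY | Organization | Text |
| 3 | NAME | Name | Text |
| 4 | SEX | Gender | Text |
| 5 | GRADE | Job title | Text |
| 6 | ADDR | Address | Text |

Borrowing record (table name: BORROW_LOG)

| No. | Field name | Description | Type |
|---|---|---|---|
| 1 | READER_ID | Library card number | Text |
| 2 | BOOK_ID | General number | Text |
| 3 | BORROW_DATE | Borrow date | Date |

(1) Create the table structures of the three base tables (book, reader, borrowing record) of the library management database. Write the CREATE TABLE statements.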
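
The original gives no answer for (1). A minimal sketch in Hive could look like the following. The table names test7book, test7reader and test7borrowlog are taken from the queries used for the later sub-questions; the concrete types (string, decimal(10,2), date) are assumptions mapped from the "Text", "Numeric" and "Date" descriptions:

```sql
-- Book table; sort and output are quoted because they may clash with keywords.
CREATE TABLE test7book (
    book_id   string,
    `sort`    string,
    book_name string,
    writer    string,
    `output`  string,
    price     decimal(10,2)
);

-- Reader table.
CREATE TABLE test7reader (
    reader_id string,
    company   string,
    name      string,
    sex       string,
    grade     string,
    addr      string
);

-- Borrowing record table.
CREATE TABLE test7borrowlog (
    reader_id   string,
    book_id     string,
    borrow_date date
);
```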
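(2) Find the names (NAME) and organizations (COMPANY) of readers whose surname is Li (李).

select name,company from test7reader where name like '李%';

(3) Find the titles (BOOK_NAME) and unit prices (PRICE) of all books published by "高等教育出版社" (Higher Education Press), sorted by unit price in descending order.

select book_name,price from test7book where output = '高等教育出版社' order by price desc;

(4) Find the classification (SORT), publisher (OUTPUT), and unit price (PRICE) of books priced between 10 and 20 yuan, sorted by publisher (OUTPUT) and unit price (PRICE) in ascending order.

select `sort`,output,price from test7book where price between 10 and 20 order by output,price;

(5) Find the names (NAME) and organizations (COMPANY) of all readers who have borrowed books.

select distinct test7reader.name,test7reader.company from test7borrowlog
join test7reader on test7borrowlog.reader_id = test7reader.reader_id;

(6) Find the highest, lowest, and average unit price of books published by "科学出版社" (Science Press).

select max(price),min(price),avg(price) from test7book where output = '科学出版社';

(7) Find the names and organizations of readers who currently have at least 2 books borrowed (2 or more).

select r.name,r.company from test7reader r
join (select reader_id from test7borrowlog group by reader_id having count(distinct book_id)>=2) b
on r.reader_id = b.reader_id;

(8) For data safety, the "borrowing record" data must be backed up regularly. Using a single SQL statement, create under the backup user bak a table BORROW_LOG_BAK with exactly the same structure as the "borrowing record" table, and copy all existing data of "borrowing record" into BORROW_LOG_BAK.

    CREATE TABLE test_sql.borrow_log_bak AS
    SELECT *
    FROM test_sql.borrow_log;

(9) The data of the original Oracle database now has to be migrated to the Hive warehouse. Write the Hive CREATE TABLE statement for "book" (Hive implementation; hints: the column delimiter is |; the table data is imported from outside; the partitions are named month_part and day_part).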

CREATE TABLE book_hive ( 
    book_id string,
    SORT string, 
    book_name string,
    writer string, 
    OUTPUT string, 
    price DECIMAL ( 10, 2 ) )
    partitioned BY ( month_part string, day_part string )
    ROW format delimited FIELDS TERMINATED BY '|' stored AS textfile;

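(10) Hive has a table A. In monthly partition 201505 of table A, the user_dinner field of the user whose user_id is 20000 must be updated to bonc8920, while the user_dinner values of all other users stay unchanged. List the steps for performing the update. (Hive implementation; hint: Hive has no update syntax, so update the data by other means.)

Approach 1: Configure Hive to support transactions; this requires a bucketed table stored in ORC format.
Approach 2: First, select the rows to be updated and replace the field with the new value. Second, select the rows that do not need updating. Third, insert the results of both steps into a new table.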
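
The steps of approach 2 can be folded into one statement with a CASE expression. The sketch below rewrites the partition in place with INSERT OVERWRITE instead of writing to a new table, which is a variation on the steps described above. It rests on assumptions not given in the question: the partition column is called month_part, and user_id and user_dinner are the only non-partition columns of A (any other columns would have to be listed in the SELECT in table order):

```sql
-- Rewrite partition 201505: change user_dinner for user 20000, copy all other rows unchanged.
INSERT OVERWRITE TABLE A PARTITION (month_part = '201505')
SELECT user_id,
       CASE WHEN user_id = '20000' THEN 'bonc8920'
            ELSE user_dinner
       END AS user_dinner
FROM A
WHERE month_part = '201505';
```

Writing the result to a separate table first, as the original steps suggest, is the more cautious option when the partition cannot easily be restored.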

8. Requirement:
An online server access log has the following format (answer with SQL):
time interface ip_address
2016-11-09 14:22:05 /api/user/login 110.23.5.33
2016-11-09 14:23:10 /api/user/detail 57.3.2.16
2016-11-09 15:59:40 /api/user/login 200.6.5.166
… …
Find the top 10 IP addresses that accessed the /api/user/login interface during the 14:00 hour (14:00-15:00) on November 9.
Data preparation:


CREATE TABLE test_sql.test8(`date` string,
                interface string,
                ip string);
INSERT INTO TABLE test8 VALUES ('2016-11-09 11:22:05','/api/user/login','110.23.5.23');
INSERT INTO TABLE test8 VALUES ('2016-11-09 11:23:10','/api/user/detail','57.3.2.16');
INSERT INTO TABLE test8 VALUES ('2016-11-09 23:59:40','/api/user/login','200.6.5.166');
INSERT INTO TABLE test8 VALUES('2016-11-09 11:14:23','/api/user/login','136.79.47.70');
INSERT INTO TABLE test8 VALUES('2016-11-09 11:15:23','/api/user/detail','94.144.143.141');
INSERT INTO TABLE test8 VALUES('2016-11-09 11:16:23','/api/user/login','197.161.8.206');
INSERT INTO TABLE test8 VALUES('2016-11-09 12:14:23','/api/user/detail','240.227.107.145');
INSERT INTO TABLE test8 VALUES('2016-11-09 13:14:23','/api/user/login','79.130.122.205');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:14:23','/api/user/detail','65.228.251.189');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:15:23','/api/user/detail','245.23.122.44');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:17:23','/api/user/detail','22.74.142.137');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:19:23','/api/user/detail','54.93.212.87');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:20:23','/api/user/detail','218.15.167.248');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:24:23','/api/user/detail','20.117.19.75');
INSERT INTO TABLE test8 VALUES('2016-11-09 15:14:23','/api/user/login','183.162.66.97');
INSERT INTO TABLE test8 VALUES('2016-11-09 16:14:23','/api/user/login','108.181.245.147');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:17:23','/api/user/login','22.74.142.137');
INSERT INTO TABLE test8 VALUES('2016-11-09 14:19:23','/api/user/login','22.74.142.137');

Analysis:
The key is to normalize the timestamp with date_format and then compare it against the hour boundaries.
Solution:

select ip,count(*) as cnt from test8 
where date_format(`date`,'yyyy-MM-dd HH')>='2016-11-09 14'
and date_format(`date`,'yyyy-MM-dd HH') <'2016-11-09 15'
and interface = '/api/user/login'
group by ip
order by cnt desc 
limit 10;
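
Because the timestamps are stored as zero-padded 'yyyy-MM-dd HH:mm:ss' strings, they sort lexicographically in time order. Under that assumption, the same filter can be sketched as a plain string range comparison without calling date_format on every row:

```sql
-- Same result as the date_format version, using a half-open string range for 14:00-15:00.
SELECT ip, count(*) AS cnt
FROM test_sql.test8
WHERE `date` >= '2016-11-09 14:00:00'
  AND `date` <  '2016-11-09 15:00:00'
  AND interface = '/api/user/login'
GROUP BY ip
ORDER BY cnt DESC
LIMIT 10;
```

On the sample data the only login requests in that hour are the two from 22.74.142.137, so both versions should return a single row with cnt = 2.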

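9. Requirement:
There is a recharge log table credit_log with the following fields:

dist_id int 'region group id',
account string 'account',
money int 'recharge amount',
create_time string 'order time'

Write a SQL statement that finds, for each region group, the account with the largest recharge amount on January 2, 2019. Required output:
region group id, account, amount, recharge time
Analysis:
1. Use date_format to normalize the date for comparison.
2. Use window functions.
Result: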


with tmp as 
-- total recharge amount per account within each region group on 2019-01-02
(select distinct dist_id,account,sum(money) over(partition by dist_id,account) as summoney from credit_log
where date_format(create_time,'yyyy-MM-dd') = '2019-01-02')


select dist_id,account, summoney from (
select dist_id,account, summoney,
-- descending order so that rowNum = 1 is the largest total
row_number() over(partition by dist_id order by summoney desc) as rowNum
from tmp) t1 
where rowNum = 1;
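
The query above ranks accounts by their total recharge for the day, so it cannot return a single recharge time. If "largest recharge amount" is instead read as the largest single recharge record (an interpretation, not stated in the original), the required recharge time column can be returned directly, for example:

```sql
-- Largest single recharge per region group on 2019-01-02, with its timestamp.
SELECT dist_id, account, money, create_time
FROM
  (SELECT dist_id, account, money, create_time,
          row_number() OVER (PARTITION BY dist_id ORDER BY money DESC) AS rn
   FROM credit_log
   WHERE date_format(create_time, 'yyyy-MM-dd') = '2019-01-02') t1
WHERE rn = 1;
```

If two records tie for the largest amount in a region group, row_number() keeps only one of them; rank() would keep both.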

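10. Requirement:
Given the account table below, write a SQL statement that finds the top 10 accounts by money in each region group (top 10 per group).
dist_id string 'region group id',
account string 'account',
gold int 'gold coins'
Data preparation: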


create table test10(
  dest_id string,
  account string,
  gold int
)
row format delimited fields terminated by '\t';
INSERT INTO TABLE test10 VALUES ('1','77',18);
INSERT INTO TABLE test10 VALUES ('1','88',106);
INSERT INTO TABLE test10 VALUES ('1','99',10);
INSERT INTO TABLE test10 VALUES ('1','12',13);
INSERT INTO TABLE test10 VALUES ('1','13',14);
INSERT INTO TABLE test10 VALUES ('1','14',25);
INSERT INTO TABLE test10 VALUES ('1','15',36);
INSERT INTO TABLE test10 VALUES ('1','16',12);
INSERT INTO TABLE test10 VALUES ('1','17',158);
INSERT INTO TABLE test10 VALUES ('2','18',12);
INSERT INTO TABLE test10 VALUES ('2','19',44);
INSERT INTO TABLE test10 VALUES ('2','10',66);
INSERT INTO TABLE test10 VALUES ('2','45',80);
INSERT INTO TABLE test10 VALUES ('2','78',98); 

Analysis:
1. Build a temporary table with each account's total gold.
2. Use a ranking function to filter.
Result:

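with tmp as (
-- total gold per account within each region group
select distinct dest_id,account,sum(gold) over(partition by dest_id,account) as summoney from test10
)

select dest_id,account,summoney,rownum from 
-- rank by total gold, highest first, inside each region group
(select dest_id,account,summoney,row_number() over(partition by dest_id order by summoney desc)  as rownum
from tmp) t1
where t1.rownum<=10;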

Reposted from blog.csdn.net/weixin_38813363/article/details/109492352