深入理解 group by【思考点滴】

网上搜索 group by，结果一大堆，千篇一律 ……

现在从浅入深的讲一下 group by的应用

1、group by的作用

group by 简单来讲，就是分组，配合计算使用，计算方法如SUM、count、avg、max、min等…

应用举例 : 从交易表中查询出今天有交易的商户账户。

2、使用 group by 完成一个简单的应用

      查询order_amount表中，是否有指定的账户：使用group by 就能做到 : select account_id from order_amount group by account_id；

        同样distinct 也能做到：使用 distinct 也能做到 : select distinct account_id from order_amount ;

      如上使用 group by 和 distinct 的效率是一样的，如果是查找是否存在，建议使用 order by + limit 1，扫描到第一条数据就完成，效率最高

        select account_id from order_amount where account_id = xxx limit 1，

group by 和 distinct 使用上的差异

举例1：select count(distinct account_id) from order_amount ;

举例2：select count(account_id) from order_amount group by account_id;

感兴趣的话，可以自己运行一下。

差异：

group by : 先分组，再计算

distinct ：先去重，再计算，distinct 只是去重，不太适合条件和计算类查找

复杂应用，通过两个条件锁定一条记录

举例，获取一批商户，符合指定条件的最后一个订单详情(如：12点到15点之间，参与秒杀商户的最后一个订单)

方法1：

1、select * from (select * from order_amount where xxx yyy zzz) as sel group by xy

方法2：

2、select * from order_amount group by xy having id = (select max(id) from order_amount where xxx yyy zzz)

注 distinct 无法完成本类需求，只能静静的看着 group by 表演了。

方法1分析：涉及子查询，子查询空间和效率的问题都暴露出来了。

select * from (select * from order_amount where xxx yyy zzz) as sel group by xy

优点：

可以实现功能

缺点：

1、group by 中的，除了算法使用的字段和group by 以外的字段，其它字段的值是随机的，默认获取的是选择查询索引(where或者group by)的第一条符合分组的记录填充的。

2、当子查询的结果非常大的时候，数据库服务器的临时表空间会用完，因此多余的查询数据会丢失

3、子查询生成的临时表，没有索引可用，如果临时表数据很大，则主select语句的效率也很低

4、子查询结果很大的时候，生成临时表的时间也很长

如果子查询的数据超过1G【1G一般是mysql中默认的，子查询配置的表大小，数量差不多是500万条以上数据】，则后面的查询结构就丢失了。造成随机性数据丢失的问题。
所以一般数据量都不会踩到这个坑，踩到这个坑的都不是一般的数据量！

方法2分析：

select * from order_amount group by xy having id = (select max(id) from order_amount where xxx yyy zzz)

优点：

可以实现功能，合理的利用了having 语句，查询结果集很小，无临时表空间占满的问题

缺点：

1、效率偏低。

方法3：

select * from order_amount where id in (select max(id) from order_amount where xxx yyy zzz group by xy)

优点：

可以实现功能，查询结果集很小，无临时表空间占满的问题，效率应该比网友指路要好很多

缺点：

不能说没有缺点，暂时是最好的选择。

更复杂的需求

      step 1：先定位出唯一记录的ID或者索引信息
            需求1:12点到15点之间，最后完成，且最后创建的订单select max(concat(complete_time,create_time)) from order_amount where xxx yyy zzz group by xy
       需求2:12点到15点之间，最后完成，且最先创建的订单：select max(concat(complete_time,2000000000-create_time)) from order_amount where xxx yyy zzz group by xy
        需求3:12点到15点之间，最先完成，且最后创建的订单：select min(concat(complete_time,2000000000-create_time)) from order_amount where xxx yyy zzz group by xy
        需求4:12点到15点之间，最先完成，且最先创建的订单：select min(concat(complete_time,create_time)) from order_amount where xxx yyy zzz group by xy
      step 2：通过如上唯一信息，查询唯一数据

        如上只是个举例，总的来讲，还是通过 max/min(concat(xxx,yyy,bbb,...)) 等方式完成按需查找，找到符合条件的唯一记录，其中 xxx,yyy,bbb 可以是字段，也可以是一种运算，如2000000000-create_time，总的原则来讲，就是想通过max或者min搜索出想要的唯一信息。

        如果觉得数据量不是很大，则可以使用 select * from (select * from order_amount order by complete_time desc, create_time asc where xxx yyy zzz) as sel group by xy，zz，dd，通过内查询按要求排序，通过group by筛选出第一条记录。

group by 总结

1、group by 非计算列，非group by列，如何自行控制？

解决方法：子查询，子查询按需进行排序

select * from (select * from order_amount order by complete_time desc, create_time asc where xxx yyy zzz) as sel group by xy

2、group by 也是优先使用索引。

3、group by 一次可以完成多个函数，可以通过多个字段进行分组

select count(amount) as cnt, SUM(amount) as total_amount, avg(amount) as avg_amount, max(id) as max_id, min(id) as min_id, xy, za, hs from order_amount group by xy, za,hs

4、同时可以使用 with rollup再获取上级汇总

select count(amount) as cnt, SUM(amount) as total_amount, avg(amount) as avg_amount, max(id) as max_id, min(id) as min_id, xy, za, hs from order_amount group by xy, za,hs with rollup

5、group by 之后的结果也可以排序,并非select的条件，且不影响select的结果。

select count(amount) as cnt, SUM(amount) as total_amount, avg(amount) as avg_amount, max(id) as max_id, min(id) as min_id, xy, za, hs from order_amount group by xy asc, za desc,hs asc

6、使用 group by 的时候，难免会用到子查询，一定要严格审视子查询结果的大小和性能

深入理解 group by【思考点滴】

猜你喜欢