SQL常见问题整理(四)

7.统计技术

(1)计算描述性统计
整体特征:
观察的数量、总和及其范围(最小值和最大值)
集中趋势的度量,例如均值,中位数和众数
变化的度量,例如标准偏差和方差
统计信息:
中位数:
通过查询统计观察值的数量,根据这个数量,可以确定中位数的计算需要一个值还是两个值,以及在有序的观察值集合内它们排在第几个
通过包含order by子句的查询,对观察值进行排序,并利用limit子句提取中间的一个或两个值
.如果只有一个中间值,那么它就是中位数,如果中间值有两个,那么需要求它们的平均值
奇数个数:select score from t order by score limit 18,1
偶数个数:select score from t order by score limit 18,2
众数:select score,count(score) as frequency from testscore group by score order by frequency desc
合计函数:
均值:mean()
方差:
总体方差:VAR_POP()
抽样方差:VAR_SAMP()
标准差:
总体标准差:STDDEV_POP()
抽样标准差:STDDEV_SAMP ()
(2)分组描述性统计
group by
(3)生成频率分布
根据数据集的汇总信息,导出频率分布
频数分布:select score,count(score) as counts from testscore group by score
百分比代替计数生成相对频率分布:set @n=(select count(score) from testscore);select score,(count(score)100)/@n as percent from testscore group by score;
(4)统计缺失的值
统计集合中null值的数量
count(
)统计总行数,count(score)统计没有缺失的分数,两个值之差即为缺失分数的数量,并且通过与总数的差可以计算缺失分数的百分比:
select count() as ‘n(total)’, count(score) as ‘n(nonmissing)’, count()-count(score) as ‘n(missing)’,((count()-count(score))100/count() as ‘%missing’) from t;
使用sum(isnull(score))直接计算,如果参数为null,则isnull()函数返回1,否则返回0:select count(
) as ‘n(total)’,count(score) as ‘n(nonmissing)’,sum(isnull(score)) as ‘n(missing)’,(sum(isnull(score))100)/count() as ‘%missing’ from t;
(5)计算线性回归或相关系数
使用汇总函数来计算所需的统计结果
计算回归方程所需的各项,包括观察数量,每个变量的均值、总和、平方和,以及每个变量的乘积之和
select
@n:=count(score) as N,
@meanX:=avg(age) as ‘X mean’,
@sumX:=SUM(age) as ‘X sum’,
@sumXX:=SUM(ageage) as ‘X sum of squares’,
@meanY:=AVG(score) as ‘Y mean’,
@sumY:=SUM(score) as ‘Y sum’,
@sumYY:=SUM(score
score) as ‘Y sum of squares’,
@sumXY:=SUM(agescore) as 'XY sum’
from testscore\G
根据上述各项,计算回归的斜率和截距
set @b:=(@n*@sumXY-@sumX*@sumY)/(@n*@sumXY-@sumX);
set @a:=(@meanY-@b*@meanX);
select @b as slope,@a as intercept;
那么回归方程是:
select concat(‘Y=’,@b,‘X+’,@a) as ‘least-squares regression’;
计算相关系数:
select (@nsumXY-@sumX@sumY)/SQRT((@n*@sumXY-@sumX*@sumX)(@n@sumXY-@sumY*@sumY)) as correlation
(6)生成随机数
使用rand()函数,rand()函数可以产生0到1之间的随机数
select rand()
(7)随机排序
order by rand()
select name from t order by rand();
(8)随机选择数据
随机化值,然后选择第一个(如果需要多个值,则选择前几个):order by rand(),加入limit 1,就可以选择第一个值
(9)计算相邻两行之差
利用表的自连接,获取相邻两个行并计算它们之间的差值:
select t1.id as id1,t2.id as id2,
t2.date,
t1.ab as ab1,t2.ab as ab2,
t1.h as h1,t2.h as h2,
t2.ab-t1.ab as abdiff,
t2.h-t1.h as hdiff,
truncate(ifnull((t2.h-t1.h)/(t2.ab-t1.ab),0),3) as ba
from player_states as t1 inner join player_states as t2
on t1.id+1=t2.id
order by t1.id;
(10)查找累计总和与移动平均值
利用表的自连接生成每个测量点连续的观察值集合,然后使用合计函数处理每组值,以计算总和或平均值
select t1.stage,t1.km,sum(t2.km) as ‘cum.km’
from marathon as t1 inner join marathon as t2
on t1.stage>=t2.stage
group by t1.stage;
(11)指定排名
决定排名的方法,然后将值按所需顺序排列,并按照排名方法指定排名
简单地用行号给有序集合中的值指定排名:
set @rownum:=0;
select @rownum:=@rownum+1 as rank,score from t order by score desc;
这种排名没有考虑到相同名次(即多个相同的值)的可能性
仅在值改变时提升排名(相同分数排名相同):
set @rank=0,@prev_val=NULL;
select @rank:=if(@prev_val=score,@rank,@rank+1) as rank,@prev_val:=score as score from t order by score desc;
按行号对值进行排名,并在遇到相同值的情况下,将所有相同值的排名都设置为第一个值的行号:
set @rownum=0,@rank=0,@prev_val=NULL
select @rownum:=@rownum+1 as row,
@rank:=if(@prev_val<>score,@rownum,@rank) as rank,
@prev_val:=score as score
from t order by score desc;

猜你喜欢

转载自blog.csdn.net/weixin_47970003/article/details/123744977
今日推荐