Hive window functions and analytic functions
1. Analytic functions: NTILE, for tiers, quantiles, and other n-way splits
NTILE is a very powerful analytic function in Hive.
It distributes an ordered data set as evenly as possible across the specified number (num) of buckets and assigns the bucket number to each row. If the rows cannot be divided evenly, the lower-numbered buckets are filled first, and the row counts of any two buckets differ by at most 1.
The syntax is:
NTILE(num) OVER ([partition_clause] order_by_clause) AS your_bucket_num
With the bucket number in hand, you can then select the first or last 1/n of the data.
Example:
Given a table of users and their spending, compute the average spending of the top 50% of users by spending.
-- Sort the user spending table by payment in descending order and split it evenly into two parts
DROP TABLE IF EXISTS test_by_payment_ntile;
CREATE TABLE test_by_payment_ntile AS
SELECT
    nick,
    payment,
    NTILE(2) OVER (ORDER BY payment DESC) AS rn
FROM test_nick_payment;
-- Compute the average of each part to get the average spending of the top 50% and of the bottom 50%
SELECT
    'avg_payment' AS inf,
    t1.avg_payment_up_50 AS avg_payment_up_50,
    t2.avg_payment_down_50 AS avg_payment_down_50
FROM
    (SELECT
        AVG(payment) AS avg_payment_up_50
    FROM test_by_payment_ntile
    WHERE rn = 1
    ) t1
-- Each subquery returns exactly one row and neither exposes dp_id,
-- so the two rows are paired with a cross join instead of a join key
CROSS JOIN
    (SELECT
        AVG(payment) AS avg_payment_down_50
    FROM test_by_payment_ntile
    WHERE rn = 2
    ) t2;
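To see how NTILE handles a row count that does not divide evenly, here is a minimal sketch. The five inline rows are made up for illustration and are not from the tables above; it also assumes a Hive version that allows SELECT without a FROM clause inside a subquery.

```sql
-- Five hypothetical rows split into two buckets: the lower-numbered
-- bucket receives the extra row, so bucket 1 has 3 rows and bucket 2 has 2.
SELECT
    nick,
    payment,
    NTILE(2) OVER (ORDER BY payment DESC) AS rn
FROM (
    SELECT 'a' AS nick, 50 AS payment
    UNION ALL SELECT 'b', 40
    UNION ALL SELECT 'c', 30
    UNION ALL SELECT 'd', 20
    UNION ALL SELECT 'e', 10
) t;
-- a  50  1
-- b  40  1
-- c  30  1
-- d  20  2
-- e  10  2
```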
RANK, DENSE_RANK, ROW_NUMBER
These three within-group ranking functions will be familiar to anyone who knows SQL. The syntax looks like this, where R stands for any of the three functions:
R() OVER (PARTITION BY col1... ORDER BY col2... DESC/ASC)
select
class1,
score,
rank() over(partition by class1 order by score desc) rk1,
dense_rank() over(partition by class1 order by score desc) rk2,
row_number() over(partition by class1 order by score desc) rk3
from zyy_test1;
Differences:
RANK: rows with equal values get the same number, and the numbers that follow are skipped, so the sequence has gaps (1, 1, 3);
DENSE_RANK: rows with equal values get the same number, and the next number follows without a gap (1, 1, 2);
ROW_NUMBER: every row gets a different number, even when values are equal; the numbers are unique and consecutive (1, 2, 3).
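The differences are easiest to see on a tie. The following sketch uses three made-up rows shaped like zyy_test1 (the data is hypothetical), with two students sharing the top score.

```sql
-- Two rows tie at 90, one row has 80.
SELECT
    class1,
    score,
    RANK()       OVER (PARTITION BY class1 ORDER BY score DESC) AS rk1,
    DENSE_RANK() OVER (PARTITION BY class1 ORDER BY score DESC) AS rk2,
    ROW_NUMBER() OVER (PARTITION BY class1 ORDER BY score DESC) AS rk3
FROM (
    SELECT 'A' AS class1, 90 AS score
    UNION ALL SELECT 'A', 90
    UNION ALL SELECT 'A', 80
) t;
-- class1  score  rk1  rk2  rk3
-- A       90     1    1    1
-- A       90     1    1    2
-- A       80     3    2    3
```

Which of the two tied rows receives rk3 = 1 is not determined by the ORDER BY; add a tie-breaking column to the ordering if a stable numbering is needed.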
2. Window functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE
LAG, LEAD
LAG(col, n, DEFAULT) returns the value of col from the n-th row before the current row within the window
LEAD(col, n, DEFAULT) returns the value of col from the n-th row after the current row within the window, the opposite of LAG
-- After partitioning and sorting, look n rows back (LAG) or n rows ahead (LEAD)
-- If the third argument is omitted, rows with no such neighbor get NULL; otherwise they get the given default.
select
dp_id,
mt,
payment,
LAG(mt,2) over(partition by dp_id order by mt) mt_new
from test2;
-- LEAD with a default value: look 2 rows ahead within each dp_id, ordered by mt
-- Rows with no row 2 positions ahead get '1111-11' instead of NULL.
select
dp_id,
mt,
payment,
LEAD(mt,2,'1111-11') over(partition by dp_id order by mt) mt_new
from test2;
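A small worked example may help to check the direction of each shift. The four months below are made-up values for a single hypothetical dp_id, not rows from test2.

```sql
-- LAG looks 2 rows back (NULL when there is no such row);
-- LEAD looks 2 rows ahead (the default '1111-11' when there is no such row).
SELECT
    dp_id,
    mt,
    LAG(mt, 2)             OVER (PARTITION BY dp_id ORDER BY mt) AS mt_lag,
    LEAD(mt, 2, '1111-11') OVER (PARTITION BY dp_id ORDER BY mt) AS mt_lead
FROM (
    SELECT 'd1' AS dp_id, '2015-01' AS mt
    UNION ALL SELECT 'd1', '2015-02'
    UNION ALL SELECT 'd1', '2015-03'
    UNION ALL SELECT 'd1', '2015-04'
) t;
-- dp_id  mt       mt_lag   mt_lead
-- d1     2015-01  NULL     2015-03
-- d1     2015-02  NULL     2015-04
-- d1     2015-03  2015-01  1111-11
-- d1     2015-04  2015-02  1111-11
```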
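FIRST_VALUE, LAST_VALUE
FIRST_VALUE: after partitioning and sorting, the first value from the start of the partition up to the current row
LAST_VALUE: after partitioning and sorting, the last value from the start of the partition up to the current row
-- FIRST_VALUE gets the first value in the partition, looking from its start up to the current row
-- LAST_VALUE gets the last value in the partition up to the current row; because the default window frame ends at the current row, this is not the last value of the whole partition
-- FIRST_VALUE with a descending sort (DESC) gets the last value of the whole partition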
SELECT
    dp_id,
    mt,
    payment,
    FIRST_VALUE(payment) OVER (PARTITION BY dp_id ORDER BY mt) AS payment_g_first,
    LAST_VALUE(payment) OVER (PARTITION BY dp_id ORDER BY mt) AS payment_g_last,
    FIRST_VALUE(payment) OVER (PARTITION BY dp_id ORDER BY mt DESC) AS payment_g_last_global
FROM test2
ORDER BY dp_id, mt;
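Reversing the sort order is one way to get the last value of the whole partition. Another option, sketched here against the same test2 table, is to keep the ascending order and widen the window frame explicitly so that LAST_VALUE sees every row of the partition. This assumes a Hive version that supports explicit ROWS BETWEEN frames in the OVER clause.

```sql
-- The explicit frame covers the whole partition, so LAST_VALUE returns
-- the payment of the row with the largest mt for every row of that dp_id.
SELECT
    dp_id,
    mt,
    payment,
    LAST_VALUE(payment) OVER (
        PARTITION BY dp_id
        ORDER BY mt
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS payment_g_last_global
FROM test2
ORDER BY dp_id, mt;
```

Under these assumptions the result should match the payment_g_last_global column computed with FIRST_VALUE and a descending sort, provided mt is unique within each dp_id.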