Commonly used Hive SQL


Note: Several of the functions below take regular expressions, so pay attention to escape characters.

Whether a regular expression needs an extra escape character differs from language to language. For example, in Hive SQL's regexp_replace function, \d+ must be preceded by an extra backslash (written as \\d+). Other languages have their own rules (Java has its own string-literal escaping, for instance), and Presto SQL does not need the extra backslash.

☺ Idea: There is no need to memorize which language requires the extra escape and which does not. Just remember the regular expression itself, and when a statement fails or matches nothing, check the escaping first.

 regexp_replace(`date`, '\\d+ 小时前', '${DateUtil.addDays(dt, 1)}')
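A minimal sketch of the escaping difference, assuming a Hive session. The input string '3 hours ago' is made up for illustration and is not from the original data.

```sql
-- Hive: a backslash inside a string literal is itself an escape character,
-- so the regex \d+ has to be written as '\\d+'.
SELECT regexp_replace('3 hours ago', '\\d+', 'N');  -- N hours ago

-- Presto: string literals keep the backslash as-is, so the regex is written once.
-- SELECT regexp_replace('3 hours ago', '\d+', 'N');
```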

1. Parse JSON fields: get_json_object

(1) Syntax: get_json_object(string json_string, string path)

  • json_string: required. STRING type. A standard JSON object in the form {Key:Value, Key:Value,...}. A double quote (") inside the string must be escaped with two backslashes (\\); a single quote (') must be escaped with one backslash (\).
  • path: required. STRING type. Must begin with $.
  • $: denotes the root node.
  • . or ['']: denotes a child node. MaxCompute supports both notations for parsing JSON objects; when a JSON key itself contains a ., use [''] instead.
  • []: array subscript, starting from 0.
  • *: wildcard that returns the entire array. Escaping * is not supported.

(2) Example:

-- The JSON string data is as follows:
json_string:
{
    "store": {
        "fruit": [
            {"weight": 8, "type": "apple"},
            {"weight": 9, "type": "pear"}
        ],
        "bicycle": {"price": 19.95, "color": "red"}
    },
    "email": "amy@only_for_json_udf_test.net",
    "owner": "amy"
}

-- Get the owner field; returns amy.
  select get_json_object(json_string, '$.owner') from json_string;

-- Extract the first element of the store.fruit array; returns {"weight":8,"type":"apple"}.
  select get_json_object(json_string, '$.store.fruit[0]') from json_string;
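The same paths can be tried without a table by inlining the JSON as a literal. This is only a sketch: the column alias js and the output aliases are made up for illustration, and the JSON is a shortened copy of the sample above.

```sql
SELECT
    get_json_object(js, '$.owner')               AS owner,    -- amy
    get_json_object(js, '$.store.bicycle.price') AS price,    -- 19.95
    get_json_object(js, '$.store.fruit[1].type') AS fruit_2,  -- pear
    get_json_object(js, '$.non_exist_key')       AS missing   -- NULL
FROM (
    SELECT '{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}},"owner":"amy"}' AS js
) t;
```

A path that does not exist returns NULL rather than raising an error, so a misspelled key fails silently.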

2. Extract substrings

(1) Extract by regular expression: regexp_extract

  • Syntax: regexp_extract(string subject, string pattern, int index)
  • Splits the string subject according to the regular expression pattern and returns the part selected by index
  • index selects which part of the match is returned:
  • 0 returns the text matched by the entire regular expression
  • 1 returns the text matched by the first () group in the regular expression, and so on.
select regexp_extract('hitdecisiondlist','(i)(.*?)(e)',0); -- returns itde
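To make the role of index concrete, here is a small sketch on a made-up timestamp string (not from the original data). Note the doubled backslashes that Hive string literals require.

```sql
SELECT
    regexp_extract('2023-02-26 10:15:00', '(\\d{4})-(\\d{2})-(\\d{2})', 0) AS whole_match,  -- 2023-02-26
    regexp_extract('2023-02-26 10:15:00', '(\\d{4})-(\\d{2})-(\\d{2})', 1) AS year_part,    -- 2023
    regexp_extract('2023-02-26 10:15:00', '(\\d{4})-(\\d{2})-(\\d{2})', 2) AS month_part;   -- 02
```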

(2) Extract by character position: substr

  • Syntax: substr(string|binary A, int start) or substr(string|binary A, int start, int len)
substr(title,1,10)
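A short sketch of both forms on a literal, using the 'foobar' sample string familiar from the Hive documentation. Positions are 1-based.

```sql
SELECT
    substr('foobar', 4)    AS from_4,       -- bar
    substr('foobar', 4, 1) AS one_char,     -- b
    substr('foobar', 1, 3) AS first_three;  -- foo
```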

3. Replace characters in a string: regexp_replace

(1) Syntax:

regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)

(2) Function:

Replaces every part of the string INITIAL_STRING that matches the regular expression PATTERN with the string REPLACEMENT

(3) Example:

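-- Example: replace characters
regexp_replace(get_json_object(map_col,'$.title'), '\n|\t|\r', '') title, -- strip newlines, tabs and carriage returns so that stored rows do not break across lines
regexp_replace(get_json_object(map_col,'$.date'),'\/ ', '') `date`, -- remove the "/ " in front of the time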
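A sketch of two further uses on literals. The REPLACEMENT string follows Java regex conventions, so $1, $2, ... refer to capture groups. The input strings here are illustrative only.

```sql
SELECT
    regexp_replace('foobar', 'oo|ar', '')                               AS removed,      -- fb
    regexp_replace('2023/01/17', '(\\d+)/(\\d+)/(\\d+)', '$1-$2-$3')    AS regrouped;    -- 2023-01-17
```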

4. Concatenate strings: concat / concat_ws

(1) concat concatenates strings

  • Syntax: concat(string1, string2, ...)
-- Example: concatenate fields
concat('https://developer.unity.cn/projects/',get_json_object(map_col,'$.id')) url

(2) concat_ws concatenates strings with a separator

  • Syntax: concat_ws('separator', string1, string2, ...)
-- Example: concatenate fields with a separator
concat_ws('/','https://t.bilibili.com',dynamic_id) note_url
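One practical difference between the two functions is how they treat NULL. A minimal sketch on literals (the explicit cast only gives the NULL a string type):

```sql
SELECT
    concat('a', 'b', 'c')                          AS plain,      -- abc
    concat('a', cast(NULL AS string))              AS with_null,  -- NULL: one NULL argument makes the whole result NULL
    concat_ws('-', 'a', cast(NULL AS string), 'b') AS ws_null;    -- a-b: NULL arguments are skipped
```

When a JSON key may be missing, concat_ws is therefore the safer choice if a partial result is acceptable.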

5. Date and time formatting

(1) Specify the output format: date_format

-- Example: specify the output format of the time (default format: yyyy-MM-dd HH:mm:ss)
date_format(get_json_object(map_col,'$.create_time'),'yyyy-MM-dd HH:00:00')

Note: date_format does not recognize dates that use / as the separator. For example, select date_format('2023/01/17', 'y'); returns null.

Solution: First use the string replacement function to replace / with -, then call date_format to get the time in the required format

select date_format(regexp_replace('2023/01/17', '/', '-'), 'yyyy-MM-dd');

(2) Timestamp functions

unix_timestamp() returns the current Unix timestamp; unix_timestamp(string date) converts a time string into a Unix timestamp

  • unix_timestamp(string date): the input must be in the format 'yyyy-MM-dd HH:mm:ss'. If it does not match, null is returned.

unix_timestamp(string date, string pattern) converts a time string into a Unix timestamp using the specified format

  • select unix_timestamp('2023-1-6', 'yyyy-MM-dd');

from_unixtime(bigint unixtime[, string format]) converts a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) into a time string; the default format is yyyy-MM-dd HH:mm:ss
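The two functions are often chained to re-format a time string. The following is a sketch with made-up literals. The numeric timestamp values depend on the session time zone, so no expected values are shown for them.

```sql
SELECT
    unix_timestamp()                            AS now_ts,       -- current time in seconds
    unix_timestamp('2023-01-06 00:00:00')       AS default_fmt,  -- parsed with the default format
    unix_timestamp('2023-01-06', 'yyyy-MM-dd')  AS custom_fmt,   -- parsed with an explicit format
    -- parse a slash-separated date, then print it with dashes
    from_unixtime(unix_timestamp('2023/01/17', 'yyyy/MM/dd'), 'yyyy-MM-dd') AS reformatted;  -- 2023-01-17
```

This round trip is an alternative to the regexp_replace workaround for slash-separated dates shown earlier.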


6. Sorting / ranking window function: ROW_NUMBER

  • Syntax: ROW_NUMBER() OVER(PARTITION BY partition_field ORDER BY sort_field [ASC|DESC])
-- Example: partition by title, then number the rows in each partition by creation time in descending order
ROW_NUMBER() OVER(PARTITION BY get_json_object(map_col,'$.title') ORDER BY get_json_object(map_col,'$.create_time') DESC) AS rn
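A common use of this ranking is deduplication: keep only the newest row per title. The sketch below reuses the table and column names that appear elsewhere in this article (ods_crawler_table, map_col, dt). The partition date is only an example.

```sql
SELECT title, create_time
FROM (
    SELECT
        get_json_object(map_col, '$.title')       AS title,
        get_json_object(map_col, '$.create_time') AS create_time,
        ROW_NUMBER() OVER (
            PARTITION BY get_json_object(map_col, '$.title')
            ORDER BY get_json_object(map_col, '$.create_time') DESC
        ) AS rn
    FROM ods_crawler_table
    WHERE dt = '2023-02-26'
) t
WHERE rn = 1;  -- rn = 1 is the most recent row within each title
```

The filter has to sit in an outer query because a window function's result cannot be referenced in the WHERE clause of the same SELECT.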

7. Explode function explode + lateral view LATERAL VIEW

(1) Function:

explode turns one row into multiple rows (one per array element); LATERAL VIEW then joins those rows back to the source row they came from

(2) Example:

  • First, try explode on its own:
SELECT
    explode(split(regexp_replace(get_json_object(map_col,'$.genre'), '\\[|\\]', ''), ","))  genre 
FROM
	ods_crawler_table                         
WHERE
	dt = '2023-02-26'       
AND get_json_object(map_col,'$.code') = 'xxx'

▷ Querying explode(genre) together with other fields

  • In practice, the fields game_name and genre need to be queried together.
SELECT
    get_json_object(map_col,'$.game_name') game_name,
    explode(split(regexp_replace(get_json_object(map_col,'$.genre'), '\\[|\\]', ''), ","))  genre 
FROM
	ods_crawler_table                           
WHERE
	dt = '2023-02-26'       
AND get_json_object(map_col,'$.code') = 'xxx'	
Error: UDTF's are not supported outside the SELECT clause, nor nested in expressions
Analysis: After explode, the genre field expands into multiple rows (3 rows here), while game_name is still a single row, so the row counts do not match. Hive therefore does not allow other expressions in the SELECT list next to a UDTF such as explode.
Solution: Join the exploded rows back to the source table with a lateral view
ods_crawler_table -- the original table
LATERAL VIEW -- the join (essentially a Cartesian product of each source row with its exploded rows)
explode(split(regexp_replace(get_json_object(map_col,'$.genre'), '\\[|\\]', ''), ",")) v -- the exploded result acts as a table; v is the alias of that lateral view
as genre -- column alias for the output of explode(split(regexp_replace(get_json_object(map_col,'$.genre'), '\\[|\\]', ''), ","))


------------------------------------------------------------------------------------------------------------------------
SELECT
    get_json_object(map_col,'$.game_name') game_name,
    genre 
FROM
	ods_crawler_table      
LATERAL VIEW explode(split(regexp_replace(get_json_object(map_col,'$.genre'), '\\[|\\]', ''), ",")) v as genre		
WHERE
	dt = '2023-02-26'       
    AND get_json_object(map_col,'$.code') = 'xxx'
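The behaviour is easy to check on a single inline row. In this sketch the game name, the genre values and the aliases t and genre_json are all made up for illustration. Unlike the query above, the pattern also strips the double quotes around each array element.

```sql
SELECT t.game_name, v.genre
FROM (
    SELECT 'demo_game' AS game_name, '["rpg","action","puzzle"]' AS genre_json
) t
LATERAL VIEW explode(split(regexp_replace(t.genre_json, '\\[|\\]|"', ''), ',')) v AS genre;
-- returns three rows: (demo_game, rpg), (demo_game, action), (demo_game, puzzle)
```

If explode produces no rows for a source row, a plain LATERAL VIEW drops that row; LATERAL VIEW OUTER keeps it with NULL in the generated column.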


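8. Remove redundant JSON fields: json2map + map_remove + map_values

  • First convert the JSON into a map, then delete the unwanted keys with map_remove, and finally take out the remaining values with map_values. Note that json2map and map_remove are not among Hive's built-in functions, so they are presumably custom UDFs in the author's environment; map_values is built in.
-- Example: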
map_values(map_remove(json2map(map_col),'code','create_time')) AS datas

9. Conditionals and null checks: nvl, IF

  • nvl(valueExp1, valueExp2): if the first expression is not null, returns its value; otherwise returns the value of the second expression.
  • IF(condition, valueTrue, valueFalse): returns valueTrue when the condition is true, otherwise valueFalse.
nvl(IF(gap>120, null, gap), 0) gap
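How the two functions combine can be traced on two made-up gap values, one below and one above the 120 threshold. The inline subquery stands in for a real table.

```sql
SELECT
    gap,
    IF(gap > 120, NULL, gap)         AS capped,       -- values above 120 become NULL
    nvl(IF(gap > 120, NULL, gap), 0) AS gap_cleaned   -- and that NULL is then replaced by 0
FROM (
    SELECT 30 AS gap
    UNION ALL
    SELECT 150 AS gap
) t;
-- gap = 30  -> capped = 30,   gap_cleaned = 30
-- gap = 150 -> capped = NULL, gap_cleaned = 0
```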

10. Named subqueries, similar to temporary tables and views: with...as

(1) Function:

The with...as clause, also called subquery factoring, defines a SQL fragment that the rest of the SQL statement can reference by name. The statement that follows can query the fragment's result set, much like a view or a temporary table.

(2) Syntax:

with temp as (
    select xx_field from xx_table
)
select xx_field from temp;

(3) Essence:

The with...as subquery part is no different in efficiency from using the subquery directly, but this way of writing increases the readability of SQL.

(4) A small detail:

  • A with...as definition is single-use: it exists only for the one statement it is attached to. For example, if a "temporary table" temp1 is defined and the statement queries name from it, a second, separate statement can no longer query id from temp1.
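A minimal sketch of that scoping rule. The table temp1 and its single inline row are made up for illustration.

```sql
WITH temp1 AS (
    SELECT 1 AS id, 'amy' AS name
)
SELECT name FROM temp1;   -- works: temp1 is in scope for this statement

-- SELECT id FROM temp1;  -- run as a separate statement, this fails because temp1 no longer exists

-- Within one statement, however, temp1 can be referenced more than once:
WITH temp1 AS (
    SELECT 1 AS id, 'amy' AS name
)
SELECT a.name, b.id
FROM temp1 a
JOIN temp1 b ON a.id = b.id;
```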


11. Type conversion: cast

(1) Syntax:

cast(expr as <type>): converts the result of the expression expr to <type>

cast(field_name as target_type)

(2) Example:

  • Example 1: cast('1' as BIGINT) converts the string '1' to its integer representation

  • Example 2: Table tableA has a time field release_time: 2018-11-03 15:31:26

select cast(release_time as date) as release_time from tableA;
  • Result: release_time: 2018-11-03
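A few casts on literals, as a sketch. The timestamp string is the sample value above, cast to timestamp first so that the example does not depend on how a particular Hive version parses strings into dates.

```sql
SELECT
    cast('1' AS bigint)                                     AS str_to_int,  -- 1
    cast('abc' AS int)                                      AS bad_number,  -- NULL: a failed cast returns NULL, not an error
    cast(cast('2018-11-03 15:31:26' AS timestamp) AS date)  AS ts_to_date;  -- 2018-11-03
```

Because a failed cast silently yields NULL, it is worth checking for unexpected NULLs after converting crawled string fields.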





Origin blog.csdn.net/weixin_45630258/article/details/129305532