Shell programming | text processing tools: regular expressions, grep, sed, awk


To use these stream-processing tools, you must understand regular expressions. Regular expressions are a large topic that a single blog post cannot cover, so this is only a brief introduction to the most common syntax.

Regular expression

Metacharacter

Option   Description
\        Escape character
.        Match any single character
*        Match the preceding character zero or more times
^        Match the beginning of a line
$        Match the end of a line
[a-z]    Match any single character within the brackets

Extended metacharacters

Option   Description
+        Match the preceding regular expression one or more times
?        Match the preceding regular expression zero or one time

Combining these metacharacters produces a regular expression; this will be demonstrated in the use cases of the tools below.
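As a quick illustration (using a hypothetical file name, demo.txt), the basic and extended metacharacters above can be combined like this:

```shell
# Sample data (hypothetical file name)
printf 'hello\nhelp\nworld\nhell\n' > demo.txt

# ^hel anchors the match at the beginning of the line
grep '^hel' demo.txt          # hello, help, hell

# -E enables the extended metacharacters + and ?
# l+ means one or more "l"s, o? means an optional "o"
grep -E '^hel+o?$' demo.txt   # hello, hell
```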


The following introduces the three musketeers of stream text processing in Linux: the text filtering tool grep, the text editing tool sed, and the text report generator awk.

grep

grep is a powerful text search tool: it searches text using regular expressions and prints the matching lines. It is one of the most commonly used tools in Linux.

Syntax

grep [options] 'pattern' [file...]

Common options

-i: ignore case
-r: recursively read every file under a directory
-E: enable extended regular expressions
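A minimal sketch of these options in action (the file name greet.txt and the use of grep's standard -c counting flag are my additions, not from the original post):

```shell
# Sample data (hypothetical file name)
printf 'Hello\nworld\nHELLO\n' > greet.txt

# -i ignores case, so both Hello and HELLO match
grep -i 'hello' greet.txt

# -c counts matching lines instead of printing them
grep -ic 'hello' greet.txt    # 2
```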

sed

sed is a powerful stream text editor. It processes one line at a time: the current line is stored in a temporary buffer called the pattern space, the sed commands are applied to the contents of that buffer, the result is sent to standard output, and then the next line is read, until the end of the file.

Syntax

sed [options] 'command' [file...]

Common options

-e: add a sed command to be executed, which allows multiple commands to be combined
-n: print only the matching (or otherwise selected) lines
-r: enable extended regular expressions

Common commands

a: append, the text appears on the next line
i: insert, the text appears on the previous line
d: delete
s: search and replace
p: print the current pattern space
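The p command is most useful together with -n; a small sketch (the file name lines.txt is hypothetical):

```shell
printf 'one\ntwo\nthree\n' > lines.txt

# Without -n, sed prints every line by default; -n suppresses that,
# so only the line selected by the p command is printed
sed -n '2p' lines.txt    # two
```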

Usage demonstration: add to, delete from, and modify the following text, respectively:

hello
world
this
is
test
file
aaaaa
bbbbbbb
cccccccccc
123123124
world
hello

Append xxxxxxxxx on the line after each line starting with t:

cat test3 | sed '/^t/axxxxxxxxx'

Delete all lines ending in d:

cat test3 | sed '/d$/d'

Replace every b in the text with d:

cat test3 | sed 's/b/d/g'



awk

awk is a powerful text analysis tool. It reads a file line by line, splits each line into fields using whitespace as the default separator, and then processes the resulting fields.

In fact, awk is a programming language for processing text files, and gawk is a specific implementation of that language, so it supports arithmetic operations, conditional judgments, and flow-control syntax similar to the shell's.

The man manual confirms this: looking up awk jumps straight to the gawk page.

Syntax

awk [options] 'pattern1{action1} pattern2{action2} ...' [file...]

Note: only the lines that match a pattern will execute the corresponding action.

You can also specify actions to be executed at the beginning and at the end:

awk [options] 'BEGIN{actions run at the start} pattern1{action1} END{actions run at the end}' [file...]

Note: BEGIN is executed before any input lines are read; END is executed after all input has been processed.
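A minimal sketch of the BEGIN/END structure, summing a column of numbers (the file name nums.txt is hypothetical):

```shell
printf '3\n5\n7\n' > nums.txt

# BEGIN runs before the first line is read, END after the last one
awk 'BEGIN{sum=0} {sum+=$1} END{print "total:", sum}' nums.txt   # total: 15
```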

Common options

-F <separator>: specify the input field separator
-v: assign a value to a user-defined variable

Common system variables

FILENAME: the name of the current file
NR: the number of records read so far (i.e. the current line number)
NF: the number of fields in the current record (the number of columns after splitting)
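A quick sketch of NR and NF on two lines with different field counts (the file name fields.txt is hypothetical):

```shell
printf 'a b c\nd e\n' > fields.txt

# NR is the current line number, NF the number of fields on that line
awk '{print NR, NF}' fields.txt
# 1 3
# 2 2
```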

Since awk arrays, operators, conditional judgments, flow control, functions, and so on are very similar to the shell and C, they are not covered in detail here.

Usage demonstration: extract the IP address of ens38 from the ifconfig output

ens38: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.220.128  netmask 255.255.255.0  broadcast 192.168.220.255
        inet6 fe80::8e00:5dc0:711b:a9cf  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:4e:dd:e4  txqueuelen 1000  (Ethernet)
        RX packets 1513042  bytes 952654297 (908.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1444185  bytes 1229233776 (1.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 185821127  bytes 25600956325 (23.8 GiB)
        RX errors 0  dropped 5322  overruns 0  frame 0
        TX packets 185821127  bytes 25600956325 (23.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
ifconfig ens38 | grep inet\  | awk -F " " '{print$2}'

Usage demonstration: find the blank lines in the text (the command prints their line numbers)

the day is sunny the the

the sunny is is
awk '/^$/{print NR}' words



Other common tools

The following are other commonly used tools; since their usage is simple, only a brief introduction is given.

cut

The main function of cut is to extract sections from each line of its input.

Syntax

cut [options] [file...]

Common options

-f: field number, extract the given column(s)
-d <separator>: split columns on the specified separator

Usage demonstration: extract the names from the text

name age
alice 21
ryan 30
cat test | cut -d ' ' -f 1

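cut works just as well with other separators; a sketch on ':'-delimited records, in the style of /etc/passwd (the file name users.txt is hypothetical):

```shell
printf 'root:x:0:0\nalice:x:1000:1000\n' > users.txt

# -d ':' splits on colons, -f 1 keeps only the first field
cut -d ':' -f 1 users.txt
# root
# alice
```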


sort

sort sorts the lines of its text input.

Syntax

sort [options] [file...]

Common options

-t <separator>: specify the field separator to use when sorting
-n: sort by numeric value
-r: sort in reverse order
-k: specify the column to sort by

Usage demonstration: sort the numbers in the text

123123
42
647
453
6789
23
1
457
97312
cat nums | sort -n
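The -t and -k options combine naturally; a sketch sorting ':'-separated records by their numeric second field (the file name ages.txt is hypothetical):

```shell
printf 'alice:21\nryan:30\nbob:7\n' > ages.txt

# -t ':' sets the field separator, -k 2 sorts on the second field, -n numerically
sort -t ':' -k 2 -n ages.txt
# bob:7
# alice:21
# ryan:30
```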



uniq

The uniq command deduplicates adjacent lines in a data stream; it is usually used together with sort.

Syntax

uniq [options] [file...]

Common options

-c: prefix each line with the number of times it occurs
-d: print only the duplicated lines

Usage demonstration: count the number of occurrences of each word in the text

the
day
is
sunny
the
the
the
sunny
is
is

Since uniq only collapses adjacent duplicates, first sort the words so identical ones are next to each other, then count the repeats, and finally order by count:

cat test2 | sort | uniq -c | sort -n



tr

tr is used to translate, delete, and squeeze characters in its input.

Syntax

tr [options] SET1 [SET2]

Common options

-d: delete every character belonging to the first character set
-s: squeeze each run of repeated characters into a single character

Usage demonstration: convert lowercase to uppercase

the day is sunny the the
the sunny is is
cat words | tr 'a-z' 'A-Z'

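The -d and -s options from the list above can be sketched like this (the sample strings are my own):

```shell
# -d deletes every character in the given set (here, all digits)
printf 'a1b2c3\n' | tr -d '0-9'    # abc

# -s squeezes runs of the listed characters down to a single occurrence
printf 'aabb  cc\n' | tr -s 'abc ' # ab c
```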


Common interview questions

The following questions are taken from LeetCode.

Tenth line

leetcode-195. The tenth line

Given a text file file.txt, please print only the tenth line in this file.

Example:
Suppose file.txt has the following content:

Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10

Your script should display the tenth line:

Line 10

Description:

  1. If the file is less than ten lines, what should you output?
  2. There are at least three different solutions, please try as many methods as possible to solve the problem.

answer:

Solution 1: sed -n '10p' file.txt
Solution 2: awk 'NR==10{print $0}' file.txt
Solution 3: tail -n +10 file.txt | head -n 1

Problem solving ideas:

  1. Solution 1 uses sed's -n option to suppress the default output, so the p command prints only the tenth line
  2. Solution 2 uses the awk system variable NR to select the line by number
  3. Solution 3 first uses tail -n +10 (the + marks a starting line number) to get the content from the tenth line onward, then head -n 1 keeps only the first line of that output, which is the tenth line of the file
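Regarding note 1 of the problem, all three solutions simply print nothing when the file has fewer than ten lines; a quick sketch (the file name short.txt is hypothetical):

```shell
printf 'Line 1\nLine 2\n' > short.txt

# Both approaches produce no output on a 2-line file
sed -n '10p' short.txt
tail -n +10 short.txt | head -n 1
```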

Valid phone number

leetcode-193. Valid phone number

Given a text file file.txt containing a list of phone numbers (one phone number per line), write a bash script to output all valid phone numbers.

You can assume that a valid phone number must meet the following two formats: (xxx) xxx-xxxx or xxx-xxx-xxxx. (X represents a number)

You can also assume that there are no extra space characters before and after each line.

Example:
Suppose the content of file.txt is as follows:

987-123-4567
123 456 7890
(123) 456-7890

Your script should output the following valid phone numbers:

987-123-4567
(123) 456-7890

answer:

awk '/^([0-9]{3}-|\([0-9]{3}\) )[0-9]{3}-[0-9]{4}$/' file.txt

Problem solving ideas:

  1. Use awk directly, matching each line against a regular expression
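An equivalent solution (my addition, not from the original answer) uses grep -E with the same pattern:

```shell
# Sample input from the problem statement
printf '987-123-4567\n123 456 7890\n(123) 456-7890\n' > file.txt

# The same regex works with grep's extended syntax
grep -E '^([0-9]{3}-|\([0-9]{3}\) )[0-9]{3}-[0-9]{4}$' file.txt
# 987-123-4567
# (123) 456-7890
```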

Statistical word frequency

leetcode-192. Statistical word frequency

Write a bash script to count the frequency of each word in a text file words.txt.

For simplicity, you can assume:

words.txt contains only lowercase letters and ' '.
Each word consists of lowercase letters only.
Words are separated by one or more space characters

Example:

the day is sunny the the
the sunny is is

Your script should output (in descending order of word frequency):

the 4
is 3
sunny 2
day 1

answer:

cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{print $2,$1}'

Problem solving ideas:

  1. Since each word is separated by a space, and the result is counted by line, use the tr command to replace the space with a newline character, making each word one line
  2. To count occurrences, first use sort so that identical words become adjacent, then use uniq -c to count the repeats
  3. Use sort -nr to sort in descending order using the number of repetitions
  4. Use awk to process the text, put the word in the front and the number of times in the back

Transpose file

leetcode-194. Transpose file

Given a file file.txt, transpose its contents.
You can assume that the number of columns in each row is the same, and that the fields are separated by ' '.

Example:
Assume the content of file.txt is as follows:

name age
alice 21
ryan 30

Should output:

name alice ryan
age 21 30

answer:

awk '{
    for (i=1; i<=NF; i++)
    {
        if (NR==1)  
        {
            res[i]=$i
        }
        else
        {
            res[i]=res[i]" "$i
        }
    }
}
END{
    for(j=1; j<=NF; j++)
    {
        print res[j]
    }
}'  file.txt

Problem solving ideas:

  1. From the meaning of the question, we need to reverse the rows and columns, so we need an array to store the string of each row
  2. When the line number is 1, the word in column i becomes the start of the i-th output line
  3. When the line number is greater than 1, append the word in column i to the i-th array entry, which is the i-th row after transposition
  4. Note that since awk reads line by line, a single loop over the columns is enough to process each line


Origin blog.csdn.net/qq_35423154/article/details/109285753