Article Directory
To use these streaming tools you must understand regular expressions. Since regular expressions are a large topic that a single blog post cannot cover in full, here is only a brief introduction to their common syntax.
Regular expression
Metacharacter
Option | Description
---|---
\ | Escape character
. | Matches any single character
* | Matches the preceding character zero or more times
^ | Matches the beginning of a line
$ | Matches the end of a line
[a-z] | Matches any single character inside the brackets
Extended metacharacters
Option | Description
---|---
+ | Matches the preceding expression one or more times
? | Matches the preceding expression zero or one time
Combining these metacharacters produces a regular expression; they will be demonstrated in the use cases of the tools below.
The following introduces the three swordsmen of streaming text processing in Linux: the text filtering tool grep, the text editing tool sed, and the text report generator awk.
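As a quick taste before the tools themselves, here is how a few of these metacharacters combine (using grep -E; the sample strings are made up for illustration):

```shell
# ^ and $ anchor the match to the whole line,
# [a-z] and [0-9] are character classes,
# + means "one or more of the preceding"
printf 'abc123\n123abc\nabc\n' | grep -E '^[a-z]+[0-9]+$'
# matches only "abc123"
```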
grep
grep is a powerful text search tool: it searches text with regular expressions and prints the matching lines. It is one of the most commonly used tools in Linux.
Syntax
grep [options] 'pattern' (file)
Common options
-i: ignore case
-r: recursively read all files under a directory
-E: support extended regular expressions
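A minimal sketch of these options in action (the file path /tmp/grep_demo.txt and its contents are made up for illustration):

```shell
# create a small sample file
printf 'Hello world\nhello linux\ngoodbye\n' > /tmp/grep_demo.txt

# -i: match regardless of case
grep -i 'hello' /tmp/grep_demo.txt
# prints "Hello world" and "hello linux"

# -E: extended regular expressions, e.g. alternation with |
grep -E 'linux|goodbye' /tmp/grep_demo.txt
# prints "hello linux" and "goodbye"
```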
sed
sed is a powerful stream editing tool. It processes one line at a time: the current line is read into a temporary buffer (the pattern space), the sed commands are applied to that buffer, the processed result is written to standard output, and then the next line is read, until the end of the file.
Syntax
sed [options] 'command' (file)
Common options
-e: run the given sed command from the command line; can be used multiple times to apply several commands
-n: print only matching lines (can also be combined with commands that act on specific lines)
-r: support extended regular expressions
Common commands
a: append, the new text appears on the following line
i: insert, the new text appears on the preceding line
d: delete
s: search and replace
p: print the current pattern space
Usage demonstration: append to, delete from, and modify the following text
hello
world
this
is
test
file
aaaaa
bbbbbbb
cccccccccc
123123124
world
hello
Append xxxxxxxxx on the line after each line starting with t
cat test3 | sed '/^t/axxxxxxxxx'
Delete all lines ending in d
cat test3 | sed '/d$/d'
Replace all b in the text with d
cat test3 | sed 's/b/d/g'
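The i and p commands, which the demonstrations above skip, work the same way (GNU sed one-line syntax; the file name is illustrative):

```shell
printf 'this\nis\ntest\n' > /tmp/sed_demo.txt

# i: insert xxxxx on the line before each line starting with t
sed '/^t/ixxxxx' /tmp/sed_demo.txt

# -n with p: print only the matching lines
sed -n '/^t/p' /tmp/sed_demo.txt
# prints "this" and "test"
```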
awk
awk is a powerful text analysis tool. It reads a file line by line, splits each line into fields (with whitespace as the default separator), and then processes those fields.
In fact, awk is a programming language for processing text files, and gawk is a specific implementation of that language, so it supports arithmetic, conditional judgments, and flow-control syntax similar to the shell's.
On most Linux systems the awk man page is an alias for gawk, so man awk jumps straight to the gawk manual.
Syntax
awk [options] 'pattern1{action1} pattern2{action2}......' (file)
Note: only the lines that match a pattern execute the corresponding action.
You can also specify content to run at the beginning and at the end:
awk [options] 'BEGIN{run before any input} pattern1{action1} END{run after all input}' (file)
Note: BEGIN runs before any input line is read; END runs after all input has been processed.
Common options
-F <separator>: specify the input field separator
-v: assign a user-defined variable
Common built-in variables
FILENAME: name of the current input file
NR: number of records read so far (i.e. the current line number)
NF: number of fields in the current record (the number of columns after splitting)
Since awk's arrays, operators, conditionals, flow control, functions, and so on are very similar to the shell and C, they are not covered in detail here.
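A minimal sketch of BEGIN/END and the variables above (the sample file is made up for illustration):

```shell
printf 'alice 21\nryan 30\n' > /tmp/awk_demo.txt

# NR is the current line number, NF the field count of that line;
# BEGIN runs before the first line, END after the last
awk 'BEGIN{print "start"} {print NR, NF, $1} END{print "lines:", NR}' /tmp/awk_demo.txt
# prints:
# start
# 1 2 alice
# 2 2 ryan
# lines: 2
```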
Usage demonstration: extracting the IP address of ens38 from the ifconfig output
ens38: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.220.128 netmask 255.255.255.0 broadcast 192.168.220.255
inet6 fe80::8e00:5dc0:711b:a9cf prefixlen 64 scopeid 0x20<link>
ether 00:0c:29:4e:dd:e4 txqueuelen 1000 (Ethernet)
RX packets 1513042 bytes 952654297 (908.5 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1444185 bytes 1229233776 (1.1 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 185821127 bytes 25600956325 (23.8 GiB)
RX errors 0 dropped 5322 overruns 0 frame 0
TX packets 185821127 bytes 25600956325 (23.8 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ifconfig ens38 | grep 'inet ' | awk -F " " '{print $2}'
Usage example: print the line numbers of the blank lines in the text
the day is sunny the the
the sunny is is
awk '/^$/{print NR}' words
Other common tools
The following tools are also commonly used; since their usage is simple, only a brief introduction is given.
cut
The main function of cut is to cut fields out of data
Syntax
cut [options] (file)
Common options
-f: field number, extract the given column(s)
-d <separator>: split into columns on the specified separator
Usage example: extract the names from the text
name age
alice 21
ryan 30
cat test | cut -d ' ' -f 1
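-f also accepts several column numbers at once (a sketch on made-up data):

```shell
printf 'name age city\nalice 21 nyc\n' > /tmp/cut_demo.txt

# extract columns 1 and 3, splitting on spaces
cut -d ' ' -f 1,3 /tmp/cut_demo.txt
# prints:
# name city
# alice nyc
```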
sort
sort sorts the contents of text files
Syntax
sort [options] (file)
Common options
-t <separator>: specify the field separator used when sorting
-n: sort by numeric value
-r: sort in reverse order
-k: specify the column to sort by
Usage demonstration: sort the numbers in the text
123123
42
647
453
6789
23
1
457
97312
cat nums | sort -n
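The -t and -k options, which the demonstration above does not use, can be sketched on made-up colon-separated data:

```shell
printf 'alice:21\nryan:30\nbob:25\n' > /tmp/sort_demo.txt

# split on ':', then sort numerically by the second column
sort -t ':' -k 2 -n /tmp/sort_demo.txt
# prints:
# alice:21
# bob:25
# ryan:30
```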
uniq
The uniq command deduplicates adjacent lines in a data stream, so it is usually used together with sort
Syntax
uniq [options] (file)
Common options
-c: prefix each line with the number of times it occurs
-d: print only the lines that occur more than once
Use demonstration: count the number of occurrences of each word in the text
the
day
is
sunny
the
the
the
sunny
is
is
Sort first so that duplicate words become adjacent, then count them and sort by the count
cat test2 | sort | uniq -c | sort -n
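-d on its own keeps only the lines that repeat; like -c, it only sees adjacent duplicates (the data below is made up and already sorted):

```shell
# 'a' and 'c' appear on consecutive lines, 'b' does not repeat
printf 'a\na\nb\nc\nc\n' | uniq -d
# prints:
# a
# c
```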
tr
tr is used to delete, replace, and squeeze characters in text
Syntax
tr [options] (set1) (set2)
Common options
-d: delete every character that belongs to the first character set
-s: squeeze each run of a repeated character into a single character
Usage example: Convert lowercase to uppercase
the day is sunny the the
the sunny is is
cat words | tr 'a-z' 'A-Z'
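The -d and -s options can be sketched the same way:

```shell
# -d: delete every digit
echo 'hello123world' | tr -d '0-9'
# prints "helloworld"

# -s: squeeze runs of spaces into a single space
echo 'a   b    c' | tr -s ' '
# prints "a b c"
```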
Common interview questions
The following problems are taken from LeetCode
Tenth line
Given a text file file.txt, please print only the tenth line in this file.
Example:
Suppose file.txt has the following content:
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
Your script should display the tenth line:
Line 10
Notes:
- If the file contains fewer than ten lines, what should you output?
- There are at least three different solutions; try to solve the problem in as many ways as possible.
Answer:
Solution 1: sed -n '10p' file.txt
Solution 2: awk 'NR==10{print $0}' file.txt
Solution 3: tail -n +10 file.txt | head -n 1
Problem solving ideas:
- Solution 1 uses sed's -n option to suppress automatic printing, then the p command to output only the tenth line
- Solution 2 uses awk's built-in variable NR to select the tenth line by number
- Solution 3 uses tail -n +10 (the + sign gives the starting line number) to keep everything from the tenth line on, then head -n 1 keeps only the first of those lines, which is the tenth line of the file
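As for the note about files shorter than ten lines: all three solutions simply print nothing, which is easy to check (the file name is illustrative):

```shell
printf 'Line 1\nLine 2\n' > /tmp/short.txt

sed -n '10p' /tmp/short.txt              # no output
awk 'NR==10{print $0}' /tmp/short.txt    # no output
tail -n +10 /tmp/short.txt | head -n 1   # no output
```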
Valid phone number
leetcode-193. Valid phone number
Given a text file file.txt containing a list of phone numbers (one phone number per line), write a bash script to output all valid phone numbers.
You may assume that a valid phone number takes one of the following two formats: (xxx) xxx-xxxx or xxx-xxx-xxxx (where x represents a digit).
You can also assume that there are no extra space characters before and after each line.
Example:
Suppose the content of file.txt is as follows:
987-123-4567
123 456 7890
(123) 456-7890
Your script should output the following valid phone numbers:
987-123-4567
(123) 456-7890
Answer:
awk '/^([0-9]{3}-|\([0-9]{3}\) )[0-9]{3}-[0-9]{4}$/' file.txt
Problem solving ideas:
- Match the valid formats directly with a regular expression in awk
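The same regular expression also works with grep -E, a sketch equivalent to the awk answer (the file is created inline for illustration):

```shell
printf '987-123-4567\n123 456 7890\n(123) 456-7890\n' > /tmp/phone.txt

# the alternation covers both formats: xxx- or (xxx) followed by xxx-xxxx
grep -E '^([0-9]{3}-|\([0-9]{3}\) )[0-9]{3}-[0-9]{4}$' /tmp/phone.txt
# prints:
# 987-123-4567
# (123) 456-7890
```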
Statistical word frequency
leetcode-192. Statistical word frequency
Write a bash script to count the frequency of each word in a text file words.txt.
For simplicity, you may assume:
words.txt contains only lowercase letters and spaces (' ').
Each word consists of lowercase letters only.
Words are separated by one or more space characters.
Example:
the day is sunny the the
the sunny is is
Your script should output (in descending order of word frequency):
the 4
is 3
sunny 2
day 1
Answer:
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{print $2,$1}'
Problem solving ideas:
- Since the words are separated by spaces and the result is counted per line, use tr to replace each space with a newline, putting one word on each line
- To count occurrences, first sort the data so that identical words become adjacent, then use uniq with the -c option to count the repetitions
- Use sort -nr to sort in descending order by the repetition count
- Finally use awk to reorder the output, putting the word first and the count second
Transpose file
Given a file file.txt, transpose its contents.
You may assume that every row has the same number of columns and that fields are separated by the ' ' (space) character.
Example:
Assume the content of file.txt is as follows:
name age
alice 21
ryan 30
Should output:
name alice ryan
age 21 30
answer:
awk '{
    for (i = 1; i <= NF; i++)
    {
        if (NR == 1)
        {
            # first row: the i-th field starts the i-th output row
            res[i] = $i
        }
        else
        {
            # later rows: append the i-th field to the i-th output row
            res[i] = res[i] " " $i
        }
    }
}
END{
    # NF still holds the field count of the last row,
    # which equals the number of output rows
    for (j = 1; j <= NF; j++)
    {
        print res[j]
    }
}' file.txt
Problem solving ideas:
- The task is to swap rows and columns, so an array is used to store the string for each output row
- When the row number is 1, the field in column i becomes the beginning of the i-th output row
- When the row number is greater than 1, the field in column i is appended to the i-th array entry, i.e. the i-th row after transposition
- Note that since awk already reads line by line, a single loop over the columns is enough to process each row
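Running the same logic on the sample file confirms the transposition (a one-line form of the program above; the file path is illustrative):

```shell
printf 'name age\nalice 21\nryan 30\n' > /tmp/file.txt

# ternary form of the same row/column swap
awk '{for(i=1;i<=NF;i++) res[i]=(NR==1)? $i : res[i]" "$i}
     END{for(j=1;j<=NF;j++) print res[j]}' /tmp/file.txt
# prints:
# name alice ryan
# age 21 30
```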