01 - The three musketeers of shell text processing: grep

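Intro: Hello! I want to write about shell programming, aiming for about one post a day, so that after a month I will have made solid progress.
Let's start with the three musketeers of shell text processing: grep, sed, and awk. I have heard that if you don't know these commands, you can't really claim to know shell programming.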

1 What does grep mean?

grep: Global search REgular expression and Print out the line.
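It is a text search tool: it filters the target text against a user-specified pattern and prints the lines that the pattern matches.
Honestly, I think learning grep is really learning pattern matching, that is, regular expressions.
Let's start with a simple experiment: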

[root@hadoop1 hadoop]# cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
usbmuxd:x:113:113:usbmuxd user:/:/sbin/nologin
vcsa:x:69:69:virtual console memory owner:/dev:/sbin/nologin
rtkit:x:499:497:RealtimeKit:/proc:/sbin/nologin
avahi-autoipd:x:170:170:Avahi IPv4LL Stack:/var/lib/avahi-autoipd:/sbin/nologin
abrt:x:173:173::/etc/abrt:/sbin/nologin
haldaemon:x:68:68:HAL daemon:/:/sbin/nologin
gdm:x:42:42::/var/lib/gdm:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
apache:x:48:48:Apache:/var/www:/sbin/nologin
saslauth:x:498:76:"Saslauthd user":/var/empty/saslauth:/sbin/nologin
postfix:x:89:89::/var/spool/postfix:/sbin/nologin
pulse:x:497:496:PulseAudio System Daemon:/var/run/pulse:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
tcpdump:x:72:72::/:/sbin/nologin
itcast01:x:500:500:itcast01:/home/itcast01:/bin/bash
hadoop:x:501:501::/home/hadoop:/bin/bash
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
[root@hadoop1 hadoop]# cat /etc/passwd |grep root
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
[root@hadoop1 hadoop]# cat /etc/passwd |grep --color root
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
[root@hadoop1 hadoop]#
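
grep can also open the file itself, so the `cat ... |` step is optional. Here is a minimal sketch on made-up data; the temporary file and its three lines are assumptions for illustration, not the real /etc/passwd:

```shell
# Build a small sample file; the three lines mimic /etc/passwd entries.
sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' \
  'operator:x:11:0:operator:/root:/sbin/nologin' > "$sample"

# Pipe form, as in the session above.
via_pipe=$(cat "$sample" | grep root)

# Direct form: grep reads the file named on its command line.
direct=$(grep root "$sample")

printf '%s\n' "$direct"
rm -f "$sample"
```

Both forms should select the same lines; the direct form simply saves one process.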

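2 Regular expressions
grep itself is simple, but pattern matching is not; everything from here on is about regular expressions. Why learn them? In big-data work, tasks such as data cleaning and web crawling all rely on regular expressions, so they are well worth the effort.
Regular expression: a pattern written with a class of characters, some of which do not stand for their literal meaning but act as control or wildcard characters.
The same metacharacter can mean different things depending on the flavor, which gives two classes: basic regular expressions and extended regular expressions. Always be clear about which of the two you are writing.
Always wrap the pattern in single quotes ' ' so that the shell passes it to grep unchanged, for example grep --color 'r..t' /etc/passwd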

[root@hadoop1 hadoop]# grep --color 'r..t' /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
[root@hadoop1 hadoop]#
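
To see exactly which text a pattern such as 'r..t' picked up, the -o option prints only the matched part. A small sketch, assuming four made-up input lines:

```shell
# Hypothetical sample lines, chosen so that only some contain an "r..t" match.
lines=$(printf '%s\n' 'root' 'r12t' 'rt' 'robot')

# "." stands for exactly one character, so "r..t" needs
# precisely two characters between the r and the t.
matched=$(printf '%s\n' "$lines" | grep -o 'r..t')
printf '%s\n' "$matched"
```

"rt" has nothing between the two letters and "robot" has three characters between them, so neither is selected.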

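2.1 Character matching
.: match any single character
[]: match any single character in the given set
[[:digit:]], [0-9]
[[:lower:]], [a-z]
[[:upper:]], [A-Z]
[[:alpha:]], [a-zA-Z]
[[:alnum:]], [0-9a-zA-Z]
[[:space:]]: whitespace characters
[[:punct:]]: punctuation characters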

[root@hadoop1 shelltest]# grep --color 'abcdef[[:digit:]][[:digit:]][0-9]' test
abcdef123
[root@hadoop1 shelltest]# grep --color 'abcdef[[:digit:]]' test
abcdef123
[root@hadoop1 shelltest]# grep --color 'abc' test
abcdef123
[root@hadoop1 shelltest]# grep -o --color 'abc' test
abc
[root@hadoop1 shelltest]#
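
The character classes can be checked one at a time with -o. A minimal sketch; the input line is an assumption modeled on the "abcdef123" entry in the test file:

```shell
# Hypothetical input line with lowercase letters, digits and uppercase letters.
line='abcdef123 XYZ'

# One digit per output line.
digits=$(printf '%s\n' "$line" | grep -o '[[:digit:]]')

# A run of one or more uppercase letters.
uppers=$(printf '%s\n' "$line" | grep -o '[[:upper:]][[:upper:]]*')

printf '%s\n' "$digits" "$uppers"
```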

[^]: match any single character outside the given set

[root@hadoop1 shelltest]# grep --color 'abcdef[^[:digit:]]' test
[root@hadoop1 shelltest]#

[root@hadoop1 shelltest]# grep --color 'xielaoshi[^a-z]' test
xielaoshi121314
xiexiexielaoshi133
[root@hadoop1 shelltest]#
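
One detail worth noticing: a negated set still has to consume one character, so a line that ends right after "xielaoshi" does not match. A sketch with made-up lines:

```shell
# Hypothetical lines; only those with a non-lowercase character
# directly after "xielaoshi" should be selected.
input=$(printf '%s\n' 'xielaoshi121314' 'xielaoshi' 'xielaoshiabc' 'xielaoshi_x')

hits=$(printf '%s\n' "$input" | grep 'xielaoshi[^a-z]')
printf '%s\n' "$hits"
```

This is why the bare "xielaoshi" line from the test file is missing from the output above.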

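2.2 Repetition counts:
These limit how many times the immediately preceding character may occur. grep prints a line as soon as any part of it matches, so the strings listed under each example are candidates to test, not a list of matches. In basic regular expressions (grep's default) the syntax is:
*: match the preceding character any number of times: 0, 1, or more;
Example: grep 'x*y'
xy, xxy, xxxy, y
\?: match the preceding character 0 or 1 times (a GNU grep extension);
Example: grep 'x\?y'
xy, xxy, y, xxxxxy, aby
\+: match the preceding character at least once (a GNU grep extension);
\{m\}: match the preceding character exactly m times;
Example: grep 'x\{2\}y'
xy, xxy, y, xxxxxy, aby
\{m,n\}: match the preceding character at least m and at most n times;
Example: grep 'x\{2,4\}y'
xy, xxy, y, xxxxxxy, aby
grep 'x\{0,4\}y'
xy, xxy, y, xxxxxxxxxy, aby
grep 'x\{2,\}y'
xy, xxy, y, xxxxxy
.*: match any characters, of any length

Let's try it out: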

[root@hadoop1 shelltest]# vi test 
[root@hadoop1 shelltest]# cat test 
xielaoshi121314
xiexiexielaoshi133
xie123laoshi
xielaoshi
abcdef123

xy
xxy
xxxxy
y
aby
[root@hadoop1 shelltest]# grep --color 'x*y' test
xy
xxy
xxxxy
y
aby
[root@hadoop1 shelltest]#

[root@hadoop1 shelltest]# grep --color 'x\+y' test
xy
xxy
xxxxy
[root@hadoop1 shelltest]# grep --color 'x\{2\}y' test
xxy
xxxxy
[root@hadoop1 shelltest]# grep --color 'x\{0,4\}y' test
xy
xxy
xxxxy
y
aby
[root@hadoop1 shelltest]# grep --color 'x\{2,\}y' test
xxy
xxxxy
[root@hadoop1 shelltest]# grep --color 'x.*i' test
xielaoshi121314
xiexiexielaoshi133
xie123laoshi
xielaoshi
[root@hadoop1 shelltest]#

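If you are typing along, you may wonder what "\" means. It is the escape character: in basic regular expressions, a backslash in front of ?, +, {, }, ( and ) switches on their special meaning.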
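
Because grep matches substrings, whole-line output can hide what a repetition count really did. Adding -o shows the matched part only. A sketch using one candidate string; the variable names are mine, and \? relies on GNU grep:

```shell
# Same candidate string for every pattern, so the results are comparable.
s='xxxxy'

star=$(printf '%s\n' "$s" | grep -o 'x*y')          # zero or more x
opt=$(printf '%s\n' "$s" | grep -o 'x\?y')          # zero or one x
exact=$(printf '%s\n' "$s" | grep -o 'x\{2\}y')     # exactly two x
range=$(printf '%s\n' "$s" | grep -o 'x\{1,3\}y')   # one to three x

printf '%s\n' "$star" "$opt" "$exact" "$range"
```

Every pattern selects the line "xxxxy", but the matched text differs: the whole string for 'x*y', and progressively shorter tails for the bounded forms.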

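2.3 Position anchors:

^: start-of-line anchor
        written at the far left of the pattern
$: end-of-line anchor
        written at the far right of the pattern
^$: empty line
\<: start-of-word anchor (\b also matches here)
        placed to the left of the word pattern being searched for: \<char
\>: end-of-word anchor (\b also matches here)
        placed to the right of the word pattern being searched for: char\>
\<pattern\>: match a whole word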

[root@hadoop1 shelltest]# grep --color '\<r' /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
rtkit:x:499:497:RealtimeKit:/proc:/sbin/nologin
pulse:x:497:496:PulseAudio System Daemon:/var/run/pulse:/sbin/nologin
[root@hadoop1 shelltest]# grep --color '\<ha' /etc/passwd
halt:x:7:0:halt:/sbin:/sbin/halt
haldaemon:x:68:68:HAL daemon:/:/sbin/nologin
hadoop:x:501:501::/home/hadoop:/bin/bash
[root@hadoop1 shelltest]# grep --color 'tor\>' /etc/passwd
operator:x:11:0:operator:/root:/sbin/nologin
[root@hadoop1 shelltest]# grep --color '\<root\>' /etc/passwd
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
[root@hadoop1 shelltest]#
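
The session above only exercises the word anchors, so here is a sketch that also tries ^, $ and ^$. The four input lines are made up, -c (count matching lines) is a standard grep option not listed in this post, and \< \> rely on GNU grep:

```shell
# Hypothetical four-line input; the third line is empty.
input=$(printf '%s\n' 'root:x:0' 'chroot' '' 'root')

starts=$(printf '%s\n' "$input" | grep '^root')          # lines beginning with "root"
ends=$(printf '%s\n' "$input" | grep 'root$')            # lines ending with "root"
empty_count=$(printf '%s\n' "$input" | grep -c '^$')     # number of empty lines
whole_word=$(printf '%s\n' "$input" | grep '\<root\>')   # "root" as a whole word

printf '%s\n' "$starts" "$ends" "$empty_count" "$whole_word"
```

"chroot" ends with "root" but does not contain it as a whole word, which is the difference between 'root$' and '\<root\>'.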

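2.4 Grouping
\(\)

Back-reference: if the pattern uses \(\) to form a group and, while checking a line, the group matches some text, later parts of the pattern can refer to that text:
    \1, \2, \3
Counting left parentheses from left to right, \N refers to the text matched between the Nth left parenthesis and its matching right parenthesis.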

[root@hadoop1 shelltest]# cat test
xielaoshi121314
xiexiexielaoshi133
xie123laoshi
xielaoshi
abcdef123

xy
xxy
xxxxy
y
aby
abababy
by
bby
[root@hadoop1 shelltest]#

[root@hadoop1 shelltest]# grep --color 'ab*y' test
aby
abababy
[root@hadoop1 shelltest]# grep --color 'ab\{1\}y' test
aby
abababy
[root@hadoop1 shelltest]# vi test 
[root@hadoop1 shelltest]# grep --color 'ab\{1,\}' test
abcdef123
aby
abbbby
abababy
[root@hadoop1 shelltest]# grep --color '\(ab\)\{1,\}y' test
aby
abababy
[root@hadoop1 shelltest]#

Back-reference:

[root@hadoop1 shelltest]# grep --color '\(ab\)\{1,\}y\1' test
abababyab
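
A back-reference repeats the text the group matched, not the group's pattern. A small sketch with made-up lines that shows the difference:

```shell
# Hypothetical input lines.
input=$(printf '%s\n' 'abyab' 'abyba' 'hello hello' 'hello world')

# "ab", then y, then the same "ab" again.
same_tail=$(printf '%s\n' "$input" | grep '\(ab\)y\1')

# A lowercase word followed by a space and the very same word.
doubled=$(printf '%s\n' "$input" | grep '\([a-z][a-z]*\) \1')

printf '%s\n' "$same_tail" "$doubled"
```

"hello world" is rejected even though "world" fits [a-z][a-z]*, because \1 demands the literal text captured earlier.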

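3 grep options
-v: invert the match (print the lines that do not match)
-o: print only the matched text
-i: ignore case
-E: use extended regular expressions
-A #: also print # lines after each matching line
-B #: also print # lines before each matching line
-C #: also print # lines both before and after each matching line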
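
The first three options can be compared side by side. A sketch assuming three made-up input lines:

```shell
# Hypothetical input lines differing in case and content.
input=$(printf '%s\n' 'Root login' 'root login' 'guest login')

no_root=$(printf '%s\n' "$input" | grep -v 'root')      # lines without lowercase "root"
any_case=$(printf '%s\n' "$input" | grep -i 'root')     # "root" in any letter case
only_match=$(printf '%s\n' "$input" | grep -o 'root')   # just the matched text

printf '%s\n' "$no_root" "$any_case" "$only_match"
```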

[root@hadoop1 shelltest]# grep -A 2 'abababyab' test
abababyab
by
bby
[root@hadoop1 shelltest]# grep -B 2 'abababyab' test
aby
abbbby
abababyab
[root@hadoop1 shelltest]# grep -C 2 'abababyab' test
aby
abbbby
abababyab
by
bby
[root@hadoop1 shelltest]#

These three options are very useful when searching logs.
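
As a rough illustration of the log use case, the sketch below pulls the lines around an error. The log file, its timestamps and its messages are all invented for the example:

```shell
# Hypothetical five-line log written to a temporary file.
log=$(mktemp)
printf '%s\n' \
  '10:00 start job' \
  '10:01 load config' \
  '10:02 ERROR disk full' \
  '10:03 retry' \
  '10:04 stop job' > "$log"

before=$(grep -B 1 'ERROR' "$log")   # the match plus 1 line before it
after=$(grep -A 1 'ERROR' "$log")    # the match plus 1 line after it
around=$(grep -C 1 'ERROR' "$log")   # 1 line on each side of the match

printf '%s\n' "$around"
rm -f "$log"
```

Seeing what happened just before and just after an error is often enough to tell what triggered it.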

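4 egrep and extended regular expressions
Metacharacters of extended regular expressions:
Character matching:
.
[]
[^]
Repetition counts:
*
?: match the preceding character 0 or 1 times;
+: match the preceding character at least once;
{m}: match the preceding character exactly m times;
{m,n}: also {m,} and {0,n}
Anchors:
^
$
\<, \>, \b
Grouping:
()

            back-references are supported: \1, \2, ...
        Alternation:
            a|b: a or b
            ab|cd: ab or cd

    # grep -E 'pattern' file...
    # egrep 'pattern' file...

The main difference from basic regular expressions is that ?, +, {}, () and | are written without the leading backslash "\".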

[root@hadoop1 shelltest]# grep --color 'xie[1x]' test
xiexiexielaoshi133
xie123laoshi
[root@hadoop1 shelltest]# egrep --color 'xie1|x' test
xielaoshi121314
xiexiexielaoshi133
xie123laoshi
xielaoshi
xy
xxy
xxxxy
[root@hadoop1 shelltest]# grep --color 'xie1\|x' test
xielaoshi121314
xiexiexielaoshi133
xie123laoshi
xielaoshi
xy
xxy
xxxxy
[root@hadoop1 shelltest]# grep -E --color 'xie1|x' test
xielaoshi121314
xiexiexielaoshi133
xie123laoshi
xielaoshi
xy
xxy
xxxxy
[root@hadoop1 shelltest]# grep --color 'xie\(1\|x\)' test
xiexiexielaoshi133
xie123laoshi
[root@hadoop1 shelltest]#
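
The two flavors can be checked against each other directly. A sketch with made-up input; note that \| in a basic regular expression is a GNU grep extension:

```shell
# Hypothetical input lines.
input=$(printf '%s\n' 'xie123' 'xiexie' 'xiel' 'xxy')

# Grouping and alternation: escaped in basic, plain in extended.
bre=$(printf '%s\n' "$input" | grep 'xie\(1\|x\)')
ere=$(printf '%s\n' "$input" | grep -E 'xie(1|x)')

# Repetition count: escaped braces in basic, plain braces in extended.
bre_rep=$(printf '%s\n' "$input" | grep 'x\{2\}y')
ere_rep=$(printf '%s\n' "$input" | grep -E 'x{2}y')

printf '%s\n' "$ere" "$ere_rep"
```

Each pair should select the same lines; only the amount of backslash noise changes.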

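OK, I am getting a bit tired, so that is all for today. If you have read this post and want to learn more or get in touch, follow my WeChat official account: 五十年后
See you again!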