字符串之子字符串查找

子字符串查找

字符串的一种基本操作就是子字符串查找：给定一段长度为N的文本text和一个长度为M的模式字符串pattern，在文本中找到一个和该模式相符的子字符串。解决该问题的大部分算法都可以很容易地扩展为找出文本中和该模式相符的子字符串，统计该模式在文本中的出现次数，或者找出上下文（和该模式相符的子字符串周围的文字）的算法。现如今主要算法有暴力子字符串查找，KMP，BM字符串查找和RK指纹字符串查找。

今天主要整理和利用python实现暴力子字符串查找算法和KMP字符串查找算法。

设长度为N的文本为txt，长度为M的模式字符串为pat。

暴力子字符串查找算法

顾名思义，暴力简单，就是在文本中模式可能出现匹配的任何地方检查匹配是否存在。其使用i跟踪文本，使用j跟踪模式。首先遍历i，对每个i，将j重置为0并遍历模式字符串，直到出现不匹配字符或者模式结束。然后在模式字符串结束之前文本字符串就已经结束了，那么就没有找到匹配。其中，找到匹配返回匹配起始位置，否则返回N（len(txt)）。暴力子字符串查找算法的时间复杂度为O(M*N)。

代码实现

'''
Implementation of a violent substring lookup algorithm
Args:
    pat: pattern string
    txt: text
Return:
    match successfully, return i which is the location of pat in txt
    failed, return the length of txt
'''

def search1(pat, txt):
    M = len(pat)  # the length of pattern
    N = len(
    txt)  # the length of text
    for i in range(N-M+1):  # i is following txt
        m = 0  # the value of j
        for j in range(M):  # j is following pat
            if txt[i+j] != pat[j]:
                break
            m += 1
        if m == M:
            return i  # we find the match
    return N  # not found
    
def search2(pat, txt):
    M = len(pat)  # the length of pattern
    N = len(txt)  # the length of text
    i = 0  # i is following txt
    j = 0  # j is following pat
    while(i<N):  
        if txt[i] == pat[j]:
            j += 1
            i += 1
        else:
            i -= j-1
            j = 0
        if j == M:
            return i-M  # we find the match
    return N  # not found

前者使用了双层for循环，后者在此基础上对其做了一些改进，对指向模式的j的值和指向文本的i的值进行了回溯，如果字符不匹配，则将i指向本次匹配开始位置的下一个字符位置，将j指向模式的开头。

KMP子字符串查找算法之前沿——确定有限状态自动机

KMP算法命名是根据发明人Knuth,Morris和Pratt的名字命名，它的基本思想就是当出现不匹配时，我们就能知晓一部分文本的内容，因为在匹配失败之前这部分文本内容已经和模式字符串的部分内容相匹配。这样，我们就可以不用每次都将指针i退回到所有我们知晓的文本内容之前。KMP算法是暴力子字符串查找算法的优化，它的时间复杂度是O(M+N)。

在KMP算法中，我们不会回退文本指针i，而使用一个dfa[][]数组来记录匹配失败情况和匹配成功情况下模式指针j回退的具体位置。这个数组定义的是一个确定有限状态自动机。——转自工程师WWW的博客

图中显示的是和模式字符串 A B A B A C 对应的确定有限自动机，它是由状态和转换的方向组成。每个字符对应着一个状态，每个状态能够转换为字母表中的任意字符(我在这里采用ascii码值表示所有的字符，一共有256个)。字符对应的每个状态，只有一条转换是匹配成功的转换，其他都是非匹配转换。状态与字符的比较相对应，每个状态都表示一个模式字符的索引值。

我们举个例子，假设txt为“ABABCABABACA”，pat为“ABABAC”。如图是pat对应的确定有限状态自动机。下面是对有限自动机的转换过程的详细解释。

i = 0, j = 0, txt[i] = pat[j] = 'A', dfa[ord('A')][j] = 1，则 next[j] = 1, i++;
i = 1, j = 1, txt[i] = pat[j] = 'B', dfa[ord('B')][j] = 2, 则 next[j] = 2, i++;
i = 2, j = 2, txt[i] = pat[j] = 'A', dfa[ord('A')][j] = 3, 则 next[j] = 3, i++;
i = 3, j = 3, txt[i] = pat[j] = 'B', dfa[ord('B')][j] = 4, 则 next[j] = 4, i++;
i = 4, j = 4, txt[i] = 'C', pat[j] = 'A', dfa[ord('C')][j] = 0, 则 next[j] = 0, i++;
i = 5, j = 0, txt[i] = pat[j] = 'A', dfa[ord('A')][j] = 1, 则 next[j] = 1, i++;
i = 6, j = 1, txt[i] = pat[j] = 'B', dfa[ord('B')][j] = 2, 则 next[j] = 2, i++;
i = 7, j = 2, txt[i] = pat[j] = 'A', dfa[ord('A')][j] = 3, 则 next[j] = 3, i++;
i = 8, j = 3, txt[i] = pat[j] = 'B', dfa[ord('B')][j] = 4, 则 next[j] = 4, i++;
i = 9, j = 4, txt[i] = pat[j] = 'A', dfa[ord('A')][j] = 5, 则 next[j] = 5, i++；
i = 10, j = 5, txt[i] = pat[j] = 'C', dfa[ord('C')][j] = 6, 则 next[j] = 6, i++;
i = 11, j = 6, j == len(pat), 匹配转换，到达停止状态，则返回查找匹配成功的位置 i - j = 5(从0开始)。

如果自动机到达了状态M，我们就称之为确定有限状态自动机识别了该模式，也就是在文本中找到了和该模式字符串相匹配的子字符串。

在构造DFA上，我们一般遵循三个准则：

对于匹配失败的情况，我们将 dfa[][X] 复制 dfa[][j]。
对于匹配成功的情况，我们将 dfa[ord(pat[j])][j] 设为 j+1，也就是模式字符的下一个位置。
更新X的状态，X为匹配后的重启状态。

代码实现

'''
Implementation of a Knuth-Morris-Pratt substring lookup algorithm

ord() : Converts ASCII characters to corresponding values
'''
class KMP:
    def __init__(self, pat, txt, R = 256):
        self.__pat = pat
        self.__txt = txt
        self.R = R  # ASCII num
        self.dfa = [[0 for i in range(len(self.__pat))] for j in range(self.R)]  # DFA

    '''
    Args:
        pat: pattern
    Return:
        dfa
    '''
    def __initdfa(self):
        self.dfa[ord(self.__pat[0])][0] = 1
        X = 0  # restarting index
        for j in range(1,len(self.__pat)):  # count dfa[][j]
            for c in range(self.R):
                self.dfa[c][j] = self.dfa[c][X]
            self.dfa[ord(self.__pat[j])][j] = j+1
            X = self.dfa[ord(self.__pat[j])][X]  # update restarting index
            
    '''
    Args:
        txt: text
    Return:
        match successfully, return i-M which is the location of pat in txt
        failed, return the length of txt N
    '''
        
    def search(self):
        self.__initdfa()
        M = len(self.__pat)  # the length of pattern
        N = len(self.__txt)  # the length of text
        i = 0  # i is following txt
        j = 0  # j is following pat
        while(i<N):  
            j = self.dfa[ord(self.__txt[i])][j]
            i += 1
            if j == M:
                return i-M  # we find the match
        return N  # not found

KMP子字符串查找算法

在接下来的学习之后突然发现上面的基于确定有限状态自动机的KMP算法还不算是KMP算法，只是利用有限自动机进行字符串匹配。真正的KMP算法不用提前预先计算变迁函数（用以表示自动机），只需要用到辅助函数s[1,m], 这个前缀函数包含有模式与其自身的位移进行匹配的信息。这些信息可用于避免在朴素的字符串匹配算法中，对无用位移进行测试，也可以避免在字符串匹配自动机中对变迁函数的预先计算过程（也就是dfa[][]数组）。

数组s有m个元素，而数组dfa有m*256个元素，通过预先计算s而不是dfa，会使得子字符串查找中减少了一定的时间（时间会减少一个字母表的个数因子，这里是256）。

而这个数组s我们就叫做关于模式的前缀函数，它包含有模式与其自身的位移进行匹配的信息。我们还是按照之前的例子来进行解释。 txt 为“ABABCABABACA”， pat 为“ABABA”。N = 12 , M = 5。i 是 txt 的索引， j 是 pat 的索引，d 是当前模式字符串处于文本中的位置， dd 是下次匹配不成功时模式字符串应当在文本中的有效位置。

i = 0, j = 0, d = 0, txt[i] = pat[j] = 'A', 则 next[j] = 1, i++, dd = 0 + (1-s[0]) = 1；
i = 1, j = 1, d = 0, txt[i] = pat[j] = 'B', 则 next[j] = 2, i++, dd = 0 + (2-s[1]) = 2;
i = 2, j = 2, d = 0, txt[i] = pat[j] = 'A', 则 next[j] = 3, i++, dd = 0 + (3-s[2]) = 2;
i = 3, j = 3, d = 0, txt[i] = pat[j] = 'B', 则 next[j] = 4, i++, dd = 0 + (4-s[3]) = 2；
i = 4, j = 4, d = 0, txt[i] = 'C', pat[j] = 'A', txt[i] != pat[j] ,则 next->d = 2, next[j] = 2;
i = 4, j = 2, d = 2, txt[i] = 'C', pat[j] = 'A, txt[i] != pat[j], 则 next[j] = 0, i++, next->d = 5-0 = 5;
i = 5, j = 0, d = 5, txt[i] = pat[j] = 'A', 则 next[j] = 1, i++, dd = 6;
i = 6, j = 1, d = 5, txt[i] = pat[j] = 'B', 则 next[j] = 2, i++, dd = 7;
i = 7, j = 2, d = 5, txt[i] = pat[j] = 'A', 则 next[j] = 3, i++, dd = 7;
i = 8, j = 3, d = 5, txt[i] = pat[j] = 'B', 则 next[j] = 4, i++, dd = 7;
i = 9, j = 4, d = 5, txt[i] = pat[j] = 'A', 则 next[j] = M = 5, 达到终止状态, 匹配成功。返回d=5(从0开始)。

我们来解释一下 i = 3 的时候的dd值大小，其他的类似（解释第三个无他，幸运数字是3）。此时我们设已匹配到的模式字符串为“ABAB”, 长度为4，我们对其求真后缀的最长前缀为“AB”，长度为2。那么 s[4] = 2（这里的s[]数组索引从1开始容易对其定义）。于是我们可以得出一个结论：在模式字符串开始的位移d(=1)处有q(=4)个字符成功匹配，则下一个模式字符如果匹配不成功时模式字符串的有效位置为 dd = d + (q - s[q]) = 1+(4-2) = 3 。

前缀是除了最后一个字符之外此字符串的全部头部组合，比如“ABCD”，其前缀有A, AB, ABC。

后缀是除了第一个字符之外此字符串的全部尾部结合，比如“ABCD”，其后缀有BCD, CD, D。

真后缀的最长前缀就是字符串的前后缀的最长的公共的元素组合。

这样我们就得到了关于模式的前缀函数 s 的值，我在下面表格里有写出来。

j	0	1	2	3	4
pat[j]	A	B	A	B	A
s[j]	0	0	1	2	3

代码实现

'''
Implementation of a Knuth-Morris-Pratt substring lookup algorithm

'''
class KMP:
    def __init__(self, pat, txt):
        self.__pat = pat
        self.__txt = txt
        self.s = [0 for j in range(len(self.__pat))]  # s

    '''
    Find the maximum same prefix length for a string

    Args:
        pat: pattern
    Return:
        s
    '''
    def __inits(self):
        self.s[0] = 0  # Maximum prefix length of the first character of a template string is 0
        X = 0  # restarting index
        for j in range(1,len(self.__pat)):  # count s[j]
            while (X > 0 and self.__pat[j] != self.__pat[X]):
                X = self.s[X-1]  # update restarting index
            if self.__pat[j] == self.__pat[X]:
                X += 1
            self.s[j] = X

    '''
    Args:
        txt: text
        pat: pattern
    Return:
        match successfully, return d which is the location of pat in txt
        failed, return the length of txt N
    
    j      0 1 2 3 4
    pat[j] A B A B A
    s[j]   0 0 1 2 3
    '''        
    def search(self):
        #s = [0,0,1,2,3]
        self.__inits()
        M = len(self.__pat)  # the length of pattern
        N = len(self.__txt)  # the length of text
        i = 0  # i is following txt
        j = 0  # j is following pat
        d = 0  # current distance
        dd = 0  # willing distance
        while(i < N):  
            if self.__txt[i] == self.__pat[j]:
                dd = d + (j + 1 - self.s[j])
                print i,j,d,dd
                i += 1
                j += 1
            else:
                d = dd
                j = i - d
                if j < 0:
                    i += 1
                    j = 0
                dd = d + (j + 1 - self.s[j])
                print i,j,d,dd
            if j == M:
                return d  # we find the match
        return N  # not found