LeetCode problem categories: string matching with the KMP algorithm

1. This differs from two earlier categories in this series: sliding window (part 3: two sequences, fixed-length window) and string cover, anagrams and permutations in a sliding window (part 4: two sequences, variable-length window). Here we test whether strings match, meaning that the type, count and order of the elements at corresponding positions are exactly the same.
2. KMP string matching is also a special case of the two-pointer technique: one pointer points into the original string and the other into the matching string. The original string pointer traverses the elements of the original string. The matching string pointer marks the first unmatched position in the matching string (the position to compare in the next step).
3. The first difficulty is understanding how the original string pointer is updated. In KMP it simply traverses the original string in order. What is hard to see is why it never needs to roll back when a mismatch with the matching string occurs. An explanation from a solution author on the official LeetCode site seems quite reasonable: when the original string pointer moves from position i to position j, this not only means that the characters of the original string in index range [i, j) matched or failed to match the matching string, it also rules out every index in [i, j) as a possible "match starting point" in the original string.
4. The second difficulty is understanding how the matching string pointer is updated. There are two cases. When a mismatch occurs, the matching string pointer must keep rolling back. The rollback position depends only on the matching string: it is the longest equal prefix/suffix length of the part of the matching string ending at the previous position, so the next array has to be computed in advance. When a match occurs, the matching string pointer is incremented by 1 to point to the next element.
5. The third difficulty is computing the next array, which is the array of longest equal prefix/suffix lengths (the prefix does not include the last character and the suffix does not include the first character). This array depends only on the matching string, and computing it is itself a KMP process. It differs only in small details, such as starting the traversal at index 1.
6. The fallback position in detail: the fallback position is stored in the previous element of the next array, nextArr[j - 1], which is exactly the longest equal prefix/suffix length at the previous position of the matching string. Because array indices start at 0, this length is also the index of the current first unmatched position in the matching string (the position to compare in the next step).
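
The points above can be checked on a small input. The following is a minimal sketch, assuming a non-empty matching string; the helper names `build_next` and `trace_pattern_pointer` are illustrative and are not part of the solutions in this post. It prints the next array for "aabaaf" and the value of the matching string pointer after each character of the original string is consumed.

```python
from typing import List


def build_next(pattern: str) -> List[int]:
    """Longest equal prefix/suffix length for every prefix of pattern."""
    nxt = [0] * len(pattern)
    j = 0  # length of the longest equal prefix/suffix found so far
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = nxt[j - 1]  # fall back using the previous element
        if pattern[i] == pattern[j]:
            j += 1
        nxt[i] = j
    return nxt


def trace_pattern_pointer(text: str, pattern: str) -> List[int]:
    """Value of the matching string pointer j after each text character."""
    nxt = build_next(pattern)
    trace = []
    j = 0
    for ch in text:  # the original string pointer only moves forward
        while j > 0 and ch != pattern[j]:
            j = nxt[j - 1]
        if ch == pattern[j]:
            j += 1
        trace.append(j)
        if j == len(pattern):  # full match: fall back so the scan can go on
            j = nxt[j - 1]
    return trace


if __name__ == "__main__":
    print(build_next("aabaaf"))                          # [0, 1, 0, 1, 2, 0]
    print(trace_pattern_pointer("aabaabaaf", "aabaaf"))  # [1, 2, 3, 4, 5, 3, 4, 5, 6]
```

In the trace, the drop from 5 to 3 at the sixth character is the fallback: 'b' does not equal 'f', so j becomes nextArr[4] = 2, and then 'b' matches the element at index 2, giving 3. The original string pointer never moves backwards.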

28. Find the Index of the First Occurrence in a String

This is the basic string matching problem and provides the algorithm template for the problems that follow.

from typing import List
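'''
28. Find the Index of the First Occurrence in a String
Given two strings haystack and needle, find the index (0-based) of the first occurrence of needle in haystack.
If needle is not part of haystack, return -1.
Example 1:
    Input: haystack = "sadbutsad", needle = "sad"
    Output: 0
    Explanation: "sad" occurs at indices 0 and 6. The first occurrence is at index 0, so return 0.
Key point: string matching
Insight 1: KMP uses the equal "prefix" and "suffix" inside the already matched part to speed up the next match. The original string pointer
never backtracks (when it moves from position i to position j, this not only means that the characters of the original string in index range
[i, j) matched or failed to match the matching string, it also rules out every index in [i, j) as a "match starting point").
Insight 2: the next array is the array of longest equal prefix/suffix lengths (the prefix does not include the last character and the suffix
does not include the first character), so it depends only on the matching string. The fallback position is stored in the previous element
of the next array, which is exactly the longest equal prefix/suffix length at the previous position of the matching string (thanks to array indices starting at 0).
Approach: step 1, compute the next array (memorize this)
          step 2, run the KMP algorithm
'''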


class Solution:
    def strStr(self, haystack:str, needle:str) -> int:
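        # Case 1: the matching string is longer than the original string
        if len(haystack) < len(needle):
            return -1
        # Case 2: KMP algorithm
        nextArr = [0] * len(needle)
        self.getNext(nextArr, needle)
        # Two pointers, one per string: the original string pointer traverses, the matching string pointer marks the current first unmatched position
        j = 0
        for i in range(len(haystack)):
            # Mismatch: keep falling back and re-checking; if nothing ever matches, j returns to the start position 0
            while j > 0 and haystack[i] != needle[j]:
                j = nextArr[j - 1]
            # Match at the current position
            if haystack[i] == needle[j]:
                j += 1
            # Check whether the whole matching string has been matched
            if j == len(needle):
                return i - len(needle) + 1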
        return -1

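    def getNext(self, nextArr: List[int], needle: str):  # array of longest equal prefix/suffix lengths
        # i: traversal position in the string and the next array, also the end of the current suffix
        # j: end of the longest equal prefix + 1, i.e. the longest equal prefix/suffix length, also the current first unmatched position in the matching string
        j = 0
        for i in range(1, len(needle)):  # the next array is filled starting from index 1
            # Mismatch at the current position
            while j > 0 and needle[i] != needle[j]:
                j = nextArr[j - 1]
            # Match at the current position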
            if needle[i] == needle[j]:
                j += 1
            nextArr[i] = j


if __name__ == "__main__":
    obj = Solution()
    while True:
        try:
            in_line = input().strip().split(',')
            haystack = in_line[0].split('=')[1].strip()[1: -1]
            needle = in_line[1].split('=')[1].strip()[1: -1]
            print(obj.strStr(haystack, needle))
        except EOFError:
            break

Interview Question 17.17. Multi Search

This extends "28. Find the Index of the First Occurrence in a String". The starting positions of all matching substrings must be returned, so there may be several matches. After the matching string has been matched completely, its next comparison position falls back to the longest equal prefix/suffix length of its last element, nextArr[j - 1], so that overlapping matches are not missed.
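
To see why the fallback after a full match matters, here is a small sketch that compares it with simply resetting the pointer to 0. The function name `find_all` and the `allow_overlap` flag are hypothetical helpers for illustration and are not part of the solution below.

```python
from typing import List


def build_next(pattern: str) -> List[int]:
    """Longest equal prefix/suffix length for every prefix of pattern."""
    nxt = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = nxt[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        nxt[i] = j
    return nxt


def find_all(text: str, pattern: str, allow_overlap: bool = True) -> List[int]:
    """Start indices of every occurrence of pattern in text."""
    if not pattern:
        return []
    nxt = build_next(pattern)
    starts = []
    j = 0
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = nxt[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            starts.append(i - len(pattern) + 1)
            # Falling back keeps the longest prefix that may start the next match;
            # resetting to 0 throws that information away.
            j = nxt[j - 1] if allow_overlap else 0
    return starts


if __name__ == "__main__":
    print(find_all("aaaa", "aa"))                       # [0, 1, 2]
    print(find_all("aaaa", "aa", allow_overlap=False))  # [0, 2]
    print(find_all("mississippi", "issi"))              # [1, 4]
```

With the reset variant, the occurrence of "aa" starting at index 1 is lost, which is exactly the case the problem requires us to report.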

from typing import List
'''
Interview Question 17.17. Multi Search
Given a longer string big and an array smalls of shorter strings, design a method that searches big for each string in smalls.
Output positions, all positions in big where the strings of smalls occur, where positions[i] holds all positions where smalls[i] occurs.
Example 1:
    Input:
        big = "mississippi"
        smalls = ["is","ppi","hi","sis","i","ssippi"]
    Output: [[1,4],[8],[],[3],[1,4,7,10],[5]]
Key point: string matching
Approach: an extension of "28. Find the Index of the First Occurrence in a String"; note how the next comparison position of the matching string is updated after a full match
'''


class Solution:
    def multiSearch(self, big: str, smalls: List[str]) -> List[List[int]]:
        result = []
        for s in smalls:
            if s == "":  # when s is empty there is no need to call the function
                result.append([])
            else:
                result.append(self.strStr(big, s))
        return result

    def strStr(self, big: str, s: str) -> List[int]:
        result = []
        nextArr = [0] * len(s)
        self.getNext(nextArr, s)
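        j = 0  # marks the first unmatched position in the matching string, i.e. the position to compare next
        for i in range(len(big)):
            # Mismatch
            while j > 0 and big[i] != s[j]:
                j = nextArr[j - 1]
            # Match
            if big[i] == s[j]:
                j += 1
            if j == len(s):
                result.append(i - len(s) + 1)
                j = nextArr[j - 1]  # after a full match, update the next comparison position of the matching string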
        return result

    def getNext(self, nextArr: List[int], s: str):
        j = 0  # end of the longest equal prefix + 1; marks the first unmatched position, i.e. the position to compare next
        for i in range(1, len(s)):  # traversal starts at index 1
            # Mismatch
            while j > 0 and s[i] != s[j]:
                j = nextArr[j - 1]
            # Match
            if s[i] == s[j]:
                j += 1
            nextArr[i] = j

796. Rotate String

For string matching problems with a repeated structure, first construct a suitable original string, then solve the problem by string matching. In this problem, two copies of s are concatenated to form the original string, and we then check whether the matching string goal can be matched in it. "Interview Question 01.09. String Rotation" is exactly the same problem.
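
The reason s + s works is that the rotation by k characters is exactly the window of length len(s) starting at index k of s + s. A short sketch of that observation follows; the helper names are illustrative, and the check uses Python's built-in `in` substring test in place of the hand-written KMP pass used in the solution below.

```python
from typing import List


def rotations(s: str) -> List[str]:
    """All rotations of s, moving the leftmost character to the right k times."""
    return [s[k:] + s[:k] for k in range(len(s))]


def is_rotation(s: str, goal: str) -> bool:
    """True if goal is a rotation of s (built-in substring search, not KMP)."""
    return len(s) == len(goal) and goal in s + s


if __name__ == "__main__":
    s = "abcde"
    doubled = s + s
    # Every rotation appears as a length-len(s) window of the doubled string.
    for k, rotated in enumerate(rotations(s)):
        print(k, rotated, doubled[k:k + len(s)] == rotated)
    print(is_rotation("abcde", "cdeab"))  # True
    print(is_rotation("abcde", "abced"))  # False
```

The length check is still needed, because a shorter string such as "cd" also occurs in s + s without being a rotation of s.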

from typing import List
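'''
796. Rotate String
Given two strings s and goal, return true if s can become goal after some number of rotations.
A rotation of s moves the leftmost character of s to the rightmost position.
For example, if s = 'abcde', the result after one rotation is 'bcdea'.
Example 1:
    Input: s = "abcde", goal = "cdeab"
    Output: true
Key point: string matching
Approach (string matching): build the new string s + s and check whether goal occurs in it
'''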


class Solution:
    def rotateString(self, s: str, goal: str) -> bool:
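        # Case 1: the two strings have different lengths
        if len(s) != len(goal):
            return False
        # Case 2 (string matching): concatenate two copies of s and check whether goal can be matched
        newS = s * 2
        nextArr = [0] * len(goal)
        self.getNext(nextArr, goal)
        j = 0  # marks the first unmatched position in the matching string
        for i in range(len(newS)):
            # Mismatch
            while j > 0 and newS[i] != goal[j]:
                j = nextArr[j - 1]
            # Match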
            if newS[i] == goal[j]:
                j += 1
            if j == len(goal):
                return True
        return False

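    def getNext(self, nextArr: List[int], goal: str):  # array of longest equal prefix/suffix lengths
        j = 0  # end of the longest equal prefix + 1, i.e. the first unmatched position in the matching string
        for i in range(1, len(goal)):  # traversal starts at index 1
            # Mismatch
            while j > 0 and goal[i] != goal[j]:
                j = nextArr[j - 1]
            # Match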
            if goal[i] == goal[j]:
                j += 1
            nextArr[i] = j


if __name__ == "__main__":
    obj = Solution()
    while True:
        try:
            in_line = input().strip().split('=')
            s = in_line[1].strip().split(',')[0][1: -1]
            goal = in_line[2].strip()[1: -1]
            print(obj.rotateString(s, goal))
        except EOFError:
            break

459. Repeated Substring Pattern

1. Idea 1: an extension of "796. Rotate String", again a string matching problem with a repeated structure. First construct a suitable original string, then solve the problem by string matching: concatenate two copies of s, remove the first and last characters to form the original string, and then check whether the matching string s can be matched in it.
2. Idea 2: use a property specific to this problem. From the next array, take the longest equal prefix/suffix length at the last element of s, then check whether the length of the substring formed by the remaining characters divides the length of the original string (it follows from the arithmetic that this remaining substring is the repeating unit). Mind the detail: nextArr[-1] > 0.
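
A small worked sketch of Idea 2, with Idea 1 as a cross-check, may help. The function name `repeating_unit` is hypothetical, and the Idea 1 check in the usage part relies on Python's built-in `in` substring test instead of a KMP pass.

```python
from typing import List


def build_next(pattern: str) -> List[int]:
    """Longest equal prefix/suffix length for every prefix of pattern."""
    nxt = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = nxt[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        nxt[i] = j
    return nxt


def repeating_unit(s: str) -> str:
    """Shortest unit that builds s when repeated at least twice, or "" if none."""
    if not s:
        return ""
    border = build_next(s)[-1]   # longest equal prefix/suffix length of s
    unit_len = len(s) - border   # length of the remaining characters
    # border > 0 is required: with border == 0, unit_len == len(s) always divides len(s).
    if border > 0 and len(s) % unit_len == 0:
        return s[:unit_len]
    return ""


if __name__ == "__main__":
    for s in ["abab", "abcabcabc", "aba", "abac"]:
        by_next = repeating_unit(s)
        by_doubling = s in (s + s)[1:-1]  # Idea 1, using built-in search
        print(s, build_next(s), repr(by_next), by_doubling)
```

For "abac" the next array ends in 0, so without the nextArr[-1] > 0 check the test 4 % (4 - 0) == 0 would wrongly report a repeated pattern.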

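'''
459. Repeated Substring Pattern
Given a non-empty string s, check whether it can be constructed by repeating one of its substrings multiple times.
Example 1:
    Input: s = "abab"
    Output: true
    Explanation: it can be built by repeating the substring "ab" twice.
Key point: string matching
Approach 1 (string matching): build s + s with its first and last characters removed, then check whether s occurs in it
Approach 2: decide directly from nextArr; note the detail nextArr[-1] > 0, i.e. a longest equal prefix/suffix must exist in the first place
'''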
from typing import List


class Solution:
    def repeatedSubstringPattern(self, s: str) -> bool:
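        # # Approach 1 (string matching): build s + s with its first and last characters removed, then check whether s occurs in it
        # newS = s[1: len(s)] + s[: len(s) - 1]
        # nextArr = [0] * len(s)
        # self.getNext(nextArr, s)
        # # Two pointers, one per string: the original string pointer traverses, the matching string pointer marks the current first unmatched position
        # j = 0  # end of the longest equal prefix + 1 in the matching string
        # for i in range(len(newS)):
        #     # Mismatch: keep falling back and re-checking; if nothing ever matches, j returns to the start position 0
        #     while j > 0 and newS[i] != s[j]:
        #         j = nextArr[j - 1]
        #     # Match
        #     if newS[i] == s[j]:
        #         j += 1
        #     if j == len(s):
        #         return True
        # return False

        # Approach 2: decide directly from nextArr
        nextArr = [0] * len(s)
        self.getNext(nextArr, s)
        # Be sure to keep the nextArr[len(s) - 1] > 0 check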
        if nextArr[len(s) - 1] > 0 and len(s) % (len(s) - nextArr[len(s) - 1]) == 0:
            return True
        return False

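    def getNext(self, nextArr: List[int], s: str):  # array of longest equal prefix/suffix lengths
        # i: traversal position in the string and the next array, also the end of the current suffix
        # j: end of the longest equal prefix + 1, i.e. the longest equal prefix/suffix length, also the current first unmatched position in the matching string
        j = 0
        for i in range(1, len(s)):  # traversal starts at index 1
            # Mismatch
            while j > 0 and s[i] != s[j]:
                j = nextArr[j - 1]
            # Match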
            if s[i] == s[j]:
                j += 1
            nextArr[i] = j


if __name__ == "__main__":
    obj = Solution()
    while True:
        try:
            s = input().strip().split('=')[1].strip()[1: -1]
            print(obj.repeatedSubstringPattern(s))
        except EOFError:
            break

686. Repeated String Match

1. Idea 1: this is clearly a string matching problem with a repeated structure. First construct a suitable original string, then solve the problem by string matching: from the relationship between the string lengths, work out the minimum and maximum number of repetitions, build the original string with the maximum number, and run one string matching pass of the matching string over it. If it matches, the position where the match ends in the original string determines the number of repetitions. (I personally find Idea 1 more intuitive and easier to come up with; it is the same type of problem as the two above.)
2. Idea 2: when a match exists, its starting position must lie in the first copy of a; otherwise no number of extra copies of a will make the match succeed. The loop condition of the string matching therefore becomes whether the match start is still inside the first copy of the original string, i.e. i - j < len(haystack). As soon as the match start would fall in the second copy, return immediately.
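
Once the start index of the match inside the first copy of a is known, the number of repetitions is the smallest count whose total length covers index + len(b). The sketch below, with hypothetical function names, compares a three-branch computation with a single ceiling division and checks that they agree, assuming 0 <= index < len(a) and len(b) >= 1.

```python
def repetitions_branching(index: int, len_a: int, len_b: int) -> int:
    """Three-branch computation: fits in one copy, ends on a boundary, or spills over."""
    if index + len_b - 1 <= len_a - 1:
        return 1
    if (len_b + index) % len_a == 0:
        return (len_b + index) // len_a
    return (len_b + index) // len_a + 1


def repetitions_ceiling(index: int, len_a: int, len_b: int) -> int:
    """Same value as a single ceiling division: ceil((index + len_b) / len_a)."""
    return (index + len_b + len_a - 1) // len_a


if __name__ == "__main__":
    # a = "abcd", b = "cdabcdab": the match starts at index 2 of the first copy.
    print(repetitions_ceiling(2, 4, 8))  # 3
    # Exhaustive agreement check on small sizes.
    for len_a in range(1, 8):
        for index in range(len_a):
            for len_b in range(1, 30):
                assert repetitions_branching(index, len_a, len_b) == \
                    repetitions_ceiling(index, len_a, len_b)
    print("both computations agree")
```

The first branch is covered as well, because 1 <= index + len(b) <= len(a) makes the ceiling equal to 1.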

from typing import List
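'''
686. Repeated String Match
Given two strings a and b, find the minimum number of times a must be repeated so that b is a substring of the repeated a. If that is impossible, return -1.
Note: "abc" repeated 0 times is "", repeated 1 time is "abc", repeated 2 times is "abcabc".
Example 1:
    Input: a = "abcd", b = "cdabcdab"
    Output: 3
    Explanation: a repeated three times is "abcdabcdabcd", and b is a substring of it.
Key point: string matching
Approach 1: from the string lengths, work out the minimum and maximum number of repetitions, then run string matching of the matching string over the repeated original string
Approach 2: check whether the start of the match lies in the first copy of the original string: if not, no number of repetitions will match; if so, work out the number of repetitions
That is: when a match exists, it must start inside the first copy of a; otherwise more copies of a cannot produce a match
'''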


class Solution:
    def repeatedStringMatch(self, a: str, b: str) -> int:
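        # # Approach 1: from the string lengths, work out the minimum and maximum number of repetitions
        # result = 1
        # # Minimum number of repetitions: long enough to cover b
        # if len(b) % len(a) == 0:
        #     result = len(b) // len(a)
        # else:
        #     result = len(b) // len(a) + 1
        # # Maximum number of repetitions: minimum + 1; more copies are unnecessary, the outcome is the same as with this maximum
        # # If b matches, extra copies are redundant; if it does not, extra copies will not match either
        # result += 1
        # newA = a * result
        # nextArr = [0] * len(b)
        # self.getNext(nextArr, b)
        # j = 0  # end of the longest matched prefix + 1 in the matching string, also its current first unmatched position
        # for i in range(len(newA)):  # i traverses the original string
        #     # Mismatch: keep falling back and re-checking; if nothing ever matches, j returns to the start position 0
        #     while j > 0 and newA[i] != b[j]:
        #         j = nextArr[j - 1]
        #     if newA[i] == b[j]:
        #         j += 1
        #     if j == len(b):  # Matched: check whether the match ends beyond the last index of result - 1 copies
        #         if i <= (result - 1) * len(a) - 1:
        #             return result - 1
        #         else:
        #             return result
        # return -1

        # Approach 2: check whether the start of the match lies in the first copy of the original string: if not, no number of repetitions will match; if so, work out the number of repetitions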
        index = self.strStr(a, b)
        if index == -1:
            return -1
        elif index + len(b) - 1 <= len(a) - 1:
            return 1
        else:
            if (len(b) + index) % len(a) == 0:  # I could not see how the official solution unifies these two cases; both equal the ceiling of (len(b) + index) / len(a)
                return (len(b) + index) // len(a)
            else:
                return (len(b) + index) // len(a) + 1

    def strStr(self, haystack: str, needle: str) -> int:
        nextArr = [0] * len(needle)
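        self.getNext(nextArr, needle)  # build the next array
        # Run the string matching process
        i, j = 0, 0
        while i - j < len(haystack):  # i - j is the start of the current match attempt; < len(haystack) means it lies in the first copy of a
            # The loop ends as soon as the match attempt no longer starts in the first copy of a
            while j > 0 and haystack[i % len(haystack)] != needle[j]:  # mismatch: find the next comparison position for j
                j = nextArr[j - 1]
            if haystack[i % len(haystack)] == needle[j]:  # match: move on to the next character
                j += 1
            if j == len(needle):  # match succeeded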
                return i - len(needle) + 1
            i += 1
        return -1

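    def getNext(self, nextArr: List[int], b: str):  # array of longest equal prefix/suffix lengths
        # i: traversal position in the string and the next array, also the end of the current suffix
        # j: end of the longest equal prefix + 1, i.e. the longest equal prefix/suffix length, also the current first unmatched position in the matching string
        j = 0
        for i in range(1, len(b)):  # traversal starts at index 1
            # Mismatch: keep falling back until a match is found or j is back at the start
            while j > 0 and b[i] != b[j]:
                j = nextArr[j - 1]
            # Match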
            if b[i] == b[j]:
                j += 1
            nextArr[i] = j


if __name__ == "__main__":
    obj = Solution()
    while True:
        try:
            in_line = input().strip().split(',')
            a = in_line[0].split('=')[1].strip()[1: -1]
            b = in_line[1].split('=')[1].strip()[1: -1]
            print(obj.repeatedStringMatch(a, b))
        except EOFError:
            break

214. Shortest Palindrome

1. This is also a string matching problem on a constructed string, but the idea is harder to come up with.
2. The overall idea is to find the longest palindromic prefix, then add the remaining characters of the string to the front in reverse order.
3. Implementation: treat the reversed s as the original string and s as the matching string, and run one complete string matching pass. When the pass ends, i.e. when the reversed s has been read to its end, the matched suffix of the reversed s is exactly a prefix of s that reads the same in reverse, so it is a palindrome. At that moment the matching string pointer sits at the end of the longest palindromic prefix + 1 (its value is at least 1). The part of the matching string before the pointer is therefore exactly the longest palindromic prefix. I have to say this idea is really clever.
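
The pass in point 3 finds the longest prefix of s that is also a suffix of the reversed s. A commonly used equivalent formulation, which is not the code used in this post, reads that value from the next array of s + "#" + reversed s. The sketch below assumes the separator "#" never occurs in s (the problem input is lowercase letters) and cross-checks the result against a brute-force scan; all function names are illustrative.

```python
from typing import List


def build_next(pattern: str) -> List[int]:
    """Longest equal prefix/suffix length for every prefix of pattern."""
    nxt = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = nxt[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        nxt[i] = j
    return nxt


def longest_palindromic_prefix_kmp(s: str, sep: str = "#") -> int:
    """Length of the longest palindromic prefix; sep is assumed absent from s."""
    combined = s + sep + s[::-1]
    return build_next(combined)[-1]


def longest_palindromic_prefix_bruteforce(s: str) -> int:
    """Quadratic reference: try every prefix from longest to shortest."""
    for end in range(len(s), 0, -1):
        if s[:end] == s[:end][::-1]:
            return end
    return 0


def shortest_palindrome(s: str) -> str:
    k = longest_palindromic_prefix_kmp(s)
    return s[k:][::-1] + s  # prepend the reversed remainder


if __name__ == "__main__":
    print(shortest_palindrome("aacecaaa"))  # aaacecaaa
    print(shortest_palindrome("abcd"))      # dcbabcd
```

The separator stops the matched prefix from growing past len(s), which plays the same role as feeding only len(s) characters of the reversed string into the matching pass.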

from typing import List
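'''
214. Shortest Palindrome
Given a string s, you can convert it into a palindrome by adding characters in front of it. Find and return the shortest palindrome that can be obtained this way.
Example 1:
    Input: s = "aacecaaa"
    Output: "aaacecaaa"
Key point: string matching
Approach: find the longest palindromic prefix, then insert the remaining characters of the string at the front in reverse order
'''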


class Solution:
    def shortestPalindrome(self, s: str) -> str:
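        # Case 1: the string is empty
        if len(s) == 0:
            return s
        # Case 2: find the longest palindromic prefix: treat the reversed s as the original string and s as the matching string, and run one complete string matching pass
        # When the pass ends, i.e. the original string has reached its end, the matched suffix is exactly a palindromic prefix of s; the matching string pointer is then at the end of the longest palindromic prefix + 1 (at least 1)
        nextArr = [0] * len(s)
        self.getNext(nextArr, s)
        j = 0  # marks the first unmatched position in the matching string
        for i in range(len(s) - 1, -1, -1):
            # Mismatch
            while j > 0 and s[i] != s[j]:
                j = nextArr[j - 1]
            # Match
            if s[i] == s[j]:
                j += 1
        # When matching ends, j points to the end of the longest palindromic prefix + 1
        if j == len(s):  # s itself is a palindrome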
            add = ""
        else:
            add = s[j: len(s)]
        return add[::-1] + s

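    def getNext(self, nextArr: List[int], s: str):  # array of longest equal prefix/suffix lengths
        # i: traversal position in the string and the next array, also the end of the current suffix
        # j: end of the longest equal prefix + 1, i.e. the longest equal prefix/suffix length, also the current first unmatched position in the matching string
        j = 0
        for i in range(1, len(s)):  # traversal starts at index 1
            # Mismatch
            while j > 0 and s[i] != s[j]:
                j = nextArr[j - 1]
            # Match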
            if s[i] == s[j]:
                j += 1
            nextArr[i] = j


if __name__ == "__main__":
    obj = Solution()
    while True:
        try:
            in_line = input().strip().split('=')
            s = in_line[1].strip()[1: -1]
            print(obj.shortestPalindrome(s))
        except EOFError:
            break

Origin blog.csdn.net/qq_39975984/article/details/132643558