String learning (KMP algorithm)

For string problems, I only knew how to apply templates, and I got stuck as soon as a problem deviated from the template.
Clearly I had not understood the algorithms themselves, so over the past few days I have studied each algorithm and some of its common extensions in detail. I hope this broadens my thinking for solving problems in the future.

Detailed explanation of KMP algorithm

How a human would optimize string matching

We use the position pointers i (in the main string T) and j (in the pattern string P) to illustrate. Indices start at 0, so the first position is called position 0. Suppose the first three characters match and the comparison fails at i=3. Searching by hand, we would never move i back to position 1, because the main string has no A before the failed position (i=3) other than the first one. How do we know that only one A appears there? Because we already know the first three characters matched! (This is very important.) Moving back there is certain not to produce a match. This gives us an idea: i does not need to move, only j does.

The experts could not stand the inefficiency of brute-force matching, so three of them (Knuth, Morris, and Pratt) developed the KMP algorithm. The idea is the same as the one above: "Use the valid information from the partial match to keep the i pointer from backtracking and, by modifying the j pointer, move the pattern string as far as possible to a valid position."
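For contrast, here is a minimal C++ sketch of the brute-force approach mentioned above. The function name bruteForceFind is my own and does not come from the referenced posts. Note that i is moved back on every mismatch, which is exactly the step KMP avoids.

```cpp
#include <string>

// Brute-force matching: on every mismatch, i goes back to the character after
// the start of the current attempt and j goes back to 0.
// Returns the index of the first occurrence of p in t, or -1 if there is none.
int bruteForceFind(const std::string& t, const std::string& p) {
    int n = static_cast<int>(t.size());
    int m = static_cast<int>(p.size());
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (t[i] == p[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1;  // i backtracks here; KMP keeps i where it is
            j = 0;
        }
    }
    return j == m ? i - m : -1;
}
```

With this sketch, bruteForceFind("ABCABCD", "ABCD") returns 3, but only after i has been moved back from 3 to 1 following the first failed attempt.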

Therefore, the key to the whole of KMP is this: when a character fails to match the main string, where should the j pointer move?

Next, let's find out the movement law of j by ourselves:

In the first case, C and D do not match at position j. Where should j move? Clearly to position 1. Why? Because the A just before the mismatch is the same as the A at the start of the pattern.

The second case is similar. Here j can move to position 2, because the two characters just before the mismatch are the same as the first two characters of the pattern.

At this point a pattern emerges. When a match fails, the next position k that j moves to has this property: the first k characters of the pattern are the same as the k characters immediately before j.

Expressed as a formula:

P[0 ~ k-1] == P[j-k ~ j-1]

This is very important. Once you understand it, you can see why j can be moved directly to position k.

Because:

When T[i] != P[j]

we have T[i-j ~ i-1] == P[0 ~ j-1]

and therefore, taking the last k characters of both sides, T[i-k ~ i-1] == P[j-k ~ j-1]

From P[0 ~ k-1] == P[j-k ~ j-1]

it follows that T[i-k ~ i-1] == P[0 ~ k-1]

The property behind the KMP algorithm

The KMP algorithm uses this property of the pattern string to speed up matching. Many other explanations state the property like this: if the longest substring that is both a prefix and a suffix of the part of the pattern before position j has length k, then on the next attempt j can move to position k (positions are numbered from 0). We call this the maximum repeated substring interpretation.

In "aba", the prefix set is {a, ab}, the prefixes that remain after removing the last character 'a', and the suffix set is {a, ba}, the suffixes that remain after removing the first character 'a'. The longest substring the two sets share is a, so k=1.

Broken down into steps a computer can follow:

1) Take the prefix pre and write it as pre[0~m];

2) Take the suffix post and write it as post[0~n];

3) Start with the longest candidate: set the initial value of k to the length of pre, and compare the first k characters of pre with the last k characters of post:

If they are the same, that substring is the maximum repeated substring, and its length is k;

If they are not the same, set k=k-1: drop one character from the prefix candidate, align it with the tail of the suffix, and compare again.

This continues until a repeated substring is found, or until k reaches 0, which means there is none.
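The steps above can be sketched in C++ as follows. This only illustrates the definition by trying every candidate length; it is not how KMP itself computes the value, and the name longestRepeat is my own.

```cpp
#include <string>

// Length of the longest string that is both a proper prefix and a proper
// suffix of s, found by trying candidate lengths from longest to shortest.
int longestRepeat(const std::string& s) {
    int len = static_cast<int>(s.size());
    for (int k = len - 1; k > 0; --k) {
        // Compare the prefix s[0 ~ k-1] with the suffix s[len-k ~ len-1].
        if (s.compare(0, k, s, len - k, k) == 0) {
            return k;
        }
    }
    return 0;  // no repeated substring
}
```

For "aba" the candidates are tried in the order k=2 ("ab" against "ba", different) and k=1 ("a" against "a", equal), so the result is 1, in agreement with the example above.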

Finding the next array

Now for the main point: how do we find this k (or rather, these k values)? A mismatch can occur at any position of P, so we have to compute the k for every position j. We store them in an array next, where next[j] = k means: when T[i] != P[j], k is the next position of the j pointer.
There is another, equivalent definition that is very useful: because indices start at 0, the value of k is exactly the length of the maximum repeated substring of the part of the pattern before position j. Keep this definition of the next array in mind at all times; the explanation below is based on it.

Code example 1

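#include <string>

// Fills next[0 ~ t.length()-1]; next must have room for t.length() entries.
void Getnext(int next[], const std::string& t) {
   int n = static_cast<int>(t.length());
   if (n == 0) return;            // nothing to fill for an empty pattern
   int j = 0, k = -1;
   next[0] = -1;                  // a mismatch at position 0 means i must advance
   while (j < n - 1) {
      // k == -1: no shorter candidate is left, so next[j+1] = 0.
      // t[j] == t[k]: the repeated substring grows by one character.
      if (k == -1 || t[j] == t[k]) next[++j] = ++k;
      else k = next[k];           // fall back to a shorter repeated substring
   }
}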

First, consider j = 0: what if the mismatch happens there?

In that case j is already at the far left and cannot move back any further, so it is the i pointer that should move forward. That is why the code contains the initialization next[0] = -1.

What if j is 1?
Clearly the j pointer must move back to position 0, because that is the only position in front of it.

The next case is the most important. Compare the situation at position j with the situation at position j+1, where k = next[j].

We find a rule:

When P[k] == P[j],

we have next[j+1] == next[j] + 1

This holds because P[0 ~ k-1] == P[j-k ~ j-1] together with P[k] == P[j] gives P[0 ~ k] == P[j-k ~ j].

What if P[k] != P[j]?

In the code, this case is handled by the statement k = next[k]. Why does that work?

In the example, the prefix is [A, B, A, C] and the characters ending at position j are [A, B, A, B], so the longest suffix string [A, B, A, B] can no longer match. We may still find a shorter suffix, such as [A, B] or [B], that matches a prefix. The process is just like matching the pattern [A, B, A, C] against a main string: when C differs from the main string (that is, the mismatch is at position k), the pointer naturally moves to next[k].

Points to remember

1) The value of k is the length of the maximum repeated substring of the part of the pattern before position j.

2) The next array stores the k that corresponds to each position j.
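As a sanity check on these two points, the sketch below compares the next array against the definition directly. The names buildNext, longestRepeat and nextMatchesDefinition are my own; buildNext follows the same loop as code example 1 but returns a std::vector instead of filling a caller-supplied array, which is a choice made here for convenience.

```cpp
#include <string>
#include <vector>

// Brute-force length of the longest proper prefix of s that is also a suffix.
int longestRepeat(const std::string& s) {
    int len = static_cast<int>(s.size());
    for (int k = len - 1; k > 0; --k) {
        if (s.compare(0, k, s, len - k, k) == 0) return k;
    }
    return 0;
}

// Same loop as code example 1, returning the array by value.
std::vector<int> buildNext(const std::string& p) {
    int m = static_cast<int>(p.size());
    std::vector<int> next(m);
    if (m == 0) return next;
    int j = 0, k = -1;
    next[0] = -1;
    while (j < m - 1) {
        if (k == -1 || p[j] == p[k]) next[++j] = ++k;
        else k = next[k];
    }
    return next;
}

// True if, for every j >= 1, next[j] equals the length of the maximum
// repeated substring of p[0 ~ j-1].
bool nextMatchesDefinition(const std::string& p) {
    std::vector<int> next = buildNext(p);
    for (int j = 1; j < static_cast<int>(p.size()); ++j) {
        if (next[j] != longestRepeat(p.substr(0, j))) return false;
    }
    return true;
}
```

For the pattern "ABAB", buildNext gives [-1, 0, 0, 1], and nextMatchesDefinition confirms that each entry from position 1 onward is the repeated-substring length of the part before it.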

Optimization of next array solving algorithm

Finally, let's look at a flaw in the algorithm above, starting with an example.

Take a pattern whose next array, computed by the algorithm above, is [-1, 0, 0, 1]; ABAB is one such pattern. Suppose the mismatch happens at position 3, the second B.

Since next[3] = 1, the next step moves j to position 1.

It is not hard to see that this step is completely pointless: position 1 holds the same character, so if the later B did not match, the earlier B cannot match either. The same thing happens with the A at position 2.

Clearly, the cause of the problem is that P[j] == P[next[j]].

Modified code example

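#include <string>

// Fills next[0 ~ t.length()-1]; next must have room for t.length() entries.
void Getnext(int next[], const std::string& t)
{
   int n = static_cast<int>(t.length());
   if (n == 0) return;                 // nothing to fill for an empty pattern
   int j = 0, k = -1;
   next[0] = -1;
   while (j < n - 1) {
      if (k == -1 || t[j] == t[k]) {
         ++j;
         ++k;
         if (t[j] == t[k])             // same character: falling back to k would fail again, so skip it
            next[j] = next[k];
         else
            next[j] = k;
      }
      else k = next[k];
   }
}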
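To see the effect of the modification, the sketch below builds both versions of the array so they can be compared side by side. The names buildNext and buildNextOptimized are hypothetical, and both functions return a std::vector rather than filling a caller-supplied array; apart from that they follow the two loops shown above.

```cpp
#include <string>
#include <vector>

// Plain version: next[j] is the length of the maximum repeated substring
// of the part of the pattern before position j.
std::vector<int> buildNext(const std::string& p) {
    int m = static_cast<int>(p.size());
    std::vector<int> next(m);
    if (m == 0) return next;
    int j = 0, k = -1;
    next[0] = -1;
    while (j < m - 1) {
        if (k == -1 || p[j] == p[k]) next[++j] = ++k;
        else k = next[k];
    }
    return next;
}

// Modified version: when p[j] == p[k], falling back to k would compare the
// same character again, so the fallback of k is stored instead.
std::vector<int> buildNextOptimized(const std::string& p) {
    int m = static_cast<int>(p.size());
    std::vector<int> next(m);
    if (m == 0) return next;
    int j = 0, k = -1;
    next[0] = -1;
    while (j < m - 1) {
        if (k == -1 || p[j] == p[k]) {
            ++j;
            ++k;
            next[j] = (p[j] == p[k]) ? next[k] : k;
        } else {
            k = next[k];
        }
    }
    return next;
}
```

For the pattern "ABAB", buildNext gives [-1, 0, 0, 1] while buildNextOptimized gives [-1, 0, -1, 0]: a mismatch on the second B now falls back straight to position 0 instead of retrying the first B.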

KMP algorithm

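#include <string>
#include <vector>

// Returns the index of the first occurrence of t in s, or -1 if there is none.
// Uses Getnext as defined above.
int KMP(const std::string& s, const std::string& t)
{
   if (t.empty()) return 0;               // an empty pattern matches at position 0
   int n = static_cast<int>(s.length()), m = static_cast<int>(t.length());
   std::vector<int> next(m);
   int i = 0, j = 0;
   Getnext(next.data(), t);
   while (i < n && j < m) {
      if (j == -1 || s[i] == t[j]) {
         i++;
         j++;
      }
      else j = next[j];                   // j falls back
   }
   if (j >= m)
      return (i - m);                     // match found: return the position of the substring
   else
      return (-1);                        // not found
}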

Reference blog post: https://www.cnblogs.com/dusf/p/kmp.html
https://blog.csdn.net/dark_cy/article/details/88698736

Origin blog.csdn.net/Bluesky_lt/article/details/113386079