Data Structures and Algorithms: Longest Common Substring

The longest common substring problem is to find the longest string that is a substring of both S1 and S2. In other words, it asks for the longest run of consecutive characters that appears in both strings.

The simplest algorithm is to enumerate every pair of starting positions in S1 and S2 and, for each pair, extend the match for as long as the characters agree. For strings of length n the time complexity is O(n^3), which is too slow for long inputs.
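
The brute-force idea can be sketched in C as follows. This is only a minimal illustration: the function name lcs_brute_force and the start out-parameter are assumptions made for the example, not part of the original article.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Brute-force longest common substring (illustrative sketch).
 * Tries every pair of start positions (i in a, j in b) and extends the
 * match while the characters agree: O(n^3) for strings of length n.
 * Returns the match length and stores its start offset within a in *start. */
size_t lcs_brute_force(const char *a, const char *b, size_t *start)
{
    assert(a != NULL && b != NULL && start != NULL);
    size_t la = strlen(a), lb = strlen(b);
    size_t best = 0;
    *start = 0;
    for (size_t i = 0; i < la; i++) {
        for (size_t j = 0; j < lb; j++) {
            size_t k = 0;
            /* Extend the match starting at a[i] and b[j]. */
            while (i + k < la && j + k < lb && a[i + k] == b[j + k])
                k++;
            if (k > best) {
                best = k;
                *start = i;
            }
        }
    }
    return best;
}
```

For "abcmnopq" and "xyzmnopqa" this returns 5 with a start offset of 3, which corresponds to "mnopq".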

Dynamic programming is the usual approach. Let dp[i][j] be the length of the longest common substring that ends exactly at the i-th character of S1 and the j-th character of S2, with dp[i][j]=0 whenever i=0 or j=0. When S1[i]=S2[j], dp[i][j]=dp[i-1][j-1]+1; otherwise dp[i][j]=0. The answer is the maximum value of dp[i][j] over all i and j.
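
To make the recurrence concrete, here is a small sketch in C that fills the table for two short strings. The name fill_lcs_table and the bound MAXLEN are assumptions chosen for the illustration. The table is indexed from 1, so row 0 and column 0 act as the zero boundary.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 16  /* illustrative bound so the table fits in a fixed array */

/* Fill dp[i][j] (i, j counted from 1) with the length of the longest
 * common substring ending at s1[i-1] and s2[j-1]; return the table maximum.
 * Row 0 and column 0 stay 0 and serve as the boundary. */
int fill_lcs_table(const char *s1, const char *s2,
                   int dp[MAXLEN + 1][MAXLEN + 1])
{
    size_t n = strlen(s1), m = strlen(s2);
    assert(n <= MAXLEN && m <= MAXLEN);
    memset(dp, 0, sizeof(int) * (MAXLEN + 1) * (MAXLEN + 1));
    int best = 0;
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            if (s1[i - 1] == s2[j - 1]) {
                /* Characters match: extend the diagonal run by one. */
                dp[i][j] = dp[i - 1][j - 1] + 1;
                if (dp[i][j] > best)
                    best = dp[i][j];
            } else {
                /* Mismatch: no common substring ends at this pair. */
                dp[i][j] = 0;
            }
        }
    }
    return best;
}
```

For S1 = "abcd" and S2 = "bcde", the only nonzero entries are dp[2][1]=1, dp[3][2]=2 and dp[4][3]=3, so the maximum is 3, matching the common substring "bcd".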

The time complexity of this approach is O(n^2). Note, however, that because dp is a two-dimensional array, the space complexity is also O(n^2), so long strings can exhaust memory. Since each row of dp depends only on the previous row, the rolling array technique can replace the two-dimensional array with a one-dimensional one and reduce the space complexity to O(n).
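
One possible way to apply the rolling array idea in C is sketched below. The function name lcs_length_rolling is an assumption for the example, and it returns only the length. It keeps a single row and scans j from right to left, so that row[j-1] still holds the previous row's value when row[j] is updated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length of the longest common substring using one row of m + 1 ints
 * instead of a full n x m table (illustrative sketch).
 * Returns -1 if the allocation fails. */
int lcs_length_rolling(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2);
    int *row = calloc(m + 1, sizeof *row);  /* row[0] stays 0 as the boundary */
    if (row == NULL)
        return -1;
    int best = 0;
    for (size_t i = 1; i <= n; i++) {
        /* Right to left: row[j-1] has not been overwritten yet, so it
         * still equals dp[i-1][j-1] from the previous iteration of i. */
        for (size_t j = m; j >= 1; j--) {
            if (s1[i - 1] == s2[j - 1]) {
                row[j] = row[j - 1] + 1;
                if (row[j] > best)
                    best = row[j];
            } else {
                row[j] = 0;
            }
        }
    }
    free(row);
    return best;
}
```

The time complexity is unchanged, but the extra memory drops from one cell per pair of positions to a single row whose length depends only on the second string.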

1. C implementation of the longest common substring and detailed code explanation

The longest common substring is the longest run of consecutive characters that appears in both strings. For example, the longest common substring of "abcmnopq" and "xyzmnopqa" is "mnopq".

The following is example code that implements the longest common substring in C:

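#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 100  /* maximum string length, including the terminator */

/* Returns a newly allocated copy of the longest common substring of str1
 * and str2 (an empty string if there is none). Returns NULL if an input
 * does not fit in the N x N table or if malloc fails.
 * The caller must free the result. */
char *lcs(const char *str1, const char *str2)
{
    int mat[N][N] = {0};
    int len1 = (int)strlen(str1), len2 = (int)strlen(str2);
    int max_len = 0, max_end = 0;
    if (len1 >= N || len2 >= N) {
        return NULL;
    }
    for (int i = 0; i < len1; i++) {
        for (int j = 0; j < len2; j++) {
            if (str1[i] == str2[j]) {
                if (i == 0 || j == 0) {
                    /* First row or column: no diagonal predecessor. */
                    mat[i][j] = 1;
                } else {
                    mat[i][j] = mat[i-1][j-1] + 1;
                }
                if (mat[i][j] > max_len) {
                    max_len = mat[i][j];
                    max_end = i;  /* index in str1 where the match ends */
                }
            }
        }
    }
    char *lcs_str = (char*)malloc((max_len+1)*sizeof(char));
    if (lcs_str == NULL) {
        return NULL;
    }
    /* strncpy does not append a terminator here, so add it manually. */
    strncpy(lcs_str, str1+max_end-max_len+1, max_len);
    lcs_str[max_len] = '\0';
    return lcs_str;
}

int main()
{
    char str1[N], str2[N];
    printf("Input two strings:\n");
    /* The field width keeps each input within the N-byte buffers. */
    if (scanf("%99s %99s", str1, str2) != 2) {
        return 1;
    }
    char *lcs_str = lcs(str1, str2);
    if (lcs_str == NULL) {
        return 1;
    }
    printf("The longest common substring is: %s\n", lcs_str);
    free(lcs_str);
    return 0;
}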

Here is a detailed explanation of the code:

  1. The macro N specifies the maximum string length the program supports.
  2. The function lcs accepts two strings as parameters and returns the longest common substring.
  3. The two-dimensional array mat stores, for each pair of positions, the length of the common substring ending at str1[i] and str2[j].
  4. The variables max_len and max_end are initialized to record the length and the end position (in str1) of the longest common substring.
  5. The double loop traverses the two strings and compares their characters. When str1[i] and str2[j] are equal, the corresponding element of mat is set to the previous diagonal element plus one, that is, mat[i][j] = mat[i-1][j-1] + 1 (or 1 in the first row or column, where there is no diagonal predecessor). If the current common substring is longer than the maximum recorded so far, the maximum length and end position are updated.
  6. After the loops, malloc allocates the memory that stores the longest common substring. Because strncpy does not append a string terminator when it copies exactly max_len characters, the terminator has to be added manually.
  7. Finally, main uses the free function to release the allocated memory.

The time complexity of this algorithm is O(n^2), where n is the length of the longer of the two strings.

2. C++ implementation of the longest common substring and detailed code explanation

The longest common substring problem is to find the longest substring that is the same in two strings. This problem can be solved using dynamic programming algorithms. The specific implementation is as follows:

Suppose there are two strings s1 and s2, with lengths n and m respectively and characters indexed from 1. Let dp[i][j] be the length of the longest common substring that ends at s1[i] and s2[j]. The state transition equation is then:

When s1[i] == s2[j], dp[i][j] = dp[i-1][j-1] + 1; // The current characters are equal, so the common substring ending here grows by 1
When s1[i] != s2[j], dp[i][j] = 0; // The current characters are not equal, so no common substring ends here

Ultimately, to find the length of the longest common substring, you only need to traverse all the elements in the dp array and find the maximum value.

The following is the complete code implementation of the algorithm:

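#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    string s1, s2;
    cin >> s1 >> s2;
    int n = static_cast<int>(s1.size()), m = static_cast<int>(s2.size());

    // Pad with a leading character so the strings are indexed from 1,
    // which keeps the state transition free of boundary checks
    s1 = " " + s1;
    s2 = " " + s2;

    // dp[i][j]: length of the longest common substring ending at s1[i]
    // and s2[j]; row 0 and column 0 stay 0
    vector<vector<int>> dp(n + 1, vector<int>(m + 1, 0));

    int ans = 0;  // length of the longest common substring
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            if (s1[i] == s2[j])
            {
                dp[i][j] = dp[i-1][j-1] + 1;
                ans = max(ans, dp[i][j]);  // update the longest length found so far
            }
            else
                dp[i][j] = 0;  // characters differ, so no common substring ends here

    cout << ans << endl;

    return 0;
}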

The time complexity of this algorithm is O(nm), where n and m are the lengths of the two strings.

3. Java implementation of the longest common substring and detailed code explanation

The longest common substring is the longest substring common to two or more strings.

Java can use dynamic programming to implement the longest common substring. Specific steps are as follows:

  1. Construct a two-dimensional array dp, where dp[i][j] is the length of the longest common substring that ends at the i-th character of s1 and the j-th character of s2.

  2. Initialize the dp array to 0. With the 0-based indices used in the code, a match at i=0 or j=0 has no diagonal predecessor, so dp[i][j] is set to 1 there.

  3. Calculate the value of dp[i][j] by recurrence. When s1[i]=s2[j], dp[i][j]=dp[i-1][j-1]+1; otherwise, dp[i][j]=0.

  4. During the recursion process, record the index when the maximum value appears, and you can find the longest common substring.

The following is the Java code implementation:

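import java.util.Arrays;

public class LongestCommonSubstring {

    public static String longestCommonSubstring(String s1, String s2) {
        int m = s1.length(), n = s2.length();
        int[][] dp = new int[m][n];
        int maxLength = 0, endIndex = 0;

        // Initialize the dp array (Java already zero-fills new int arrays,
        // so this loop only makes the starting state explicit)
        for (int i = 0; i < m; i++) {
            Arrays.fill(dp[i], 0);
        }

        // Fill the dp array using the recurrence
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                if (s1.charAt(i) == s2.charAt(j)) {
                    if (i == 0 || j == 0) {
                        // First row or column: no diagonal predecessor
                        dp[i][j] = 1;
                    } else {
                        dp[i][j] = dp[i - 1][j - 1] + 1;
                    }
                }
                // Update the maximum and the index in s1 where it ends
                if (dp[i][j] > maxLength) {
                    maxLength = dp[i][j];
                    endIndex = i;
                }
            }
        }

        // No common character (or an empty input): return the empty string
        if (maxLength == 0) {
            return "";
        }
        return s1.substring(endIndex - maxLength + 1, endIndex + 1);
    }
}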

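In this code, the substring method extracts the result from s1, and the Arrays.fill method (from java.util.Arrays) fills the array with zeros. Java already zero-initializes new int arrays, so the fill only makes the initial state explicit.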

The above code is a way to implement the longest common substring using a dynamic programming algorithm with a time complexity of O(mn).

Origin blog.csdn.net/weixin_47225948/article/details/133124442