Edit Distance Explained

A few days ago I looked at a set of interview questions from Tencent. Most of the algorithm section was dynamic programming, and the last question asked for a function that computes edit distance, so today I am writing an article dedicated to this problem.

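I personally like the edit distance problem a lot: it looks very hard, yet the solution is surprisingly simple and elegant, and it is one of the few algorithms that is actually practical (yes, I admit that many algorithm problems are not). Let's look at the problem first: given two strings s1 and s2, compute the minimum number of operations needed to convert s1 into s2, where each operation inserts, deletes, or replaces one character.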

Why do I call this problem hard? Because it plainly is: it leaves you with no idea where to start, and it looks intimidating.

Why do I call it practical? Because I used this algorithm in daily life just a few days ago. In one of my earlier WeChat official account articles I had carelessly put a passage in the wrong place, and I decided to fix that part so the logic would read smoothly. But a published official account article can be changed by at most 20 characters, and only insertion, deletion, and replacement are supported (exactly the edit distance problem). So I used the algorithm to find an optimal plan, and the fix took only 16 steps.

A fancier application: a DNA sequence is made up of A, G, C, and T, so it can be treated like a string. Edit distance can measure the similarity of two DNA sequences. The smaller the edit distance, the more similar the two sequences are, and perhaps their owners were close relatives in ancient times.

Now, back to the point: below I explain in detail how to compute the edit distance. I believe you will get something out of this article.

First, the idea

The edit distance problem gives us two strings s1 and s2 and asks us to turn s1 into s2 using only three operations (insert, delete, replace), with the minimum number of operations. To be clear, the result is the same whether we turn s1 into s2 or the other way around, so the rest of this article uses turning s1 into s2 as the example.

The earlier article "Longest Common Subsequence" noted that dynamic programming problems on two strings are usually solved with two pointers i and j that start at the end of each string and then move toward the front step by step, shrinking the size of the problem.

Suppose s1 is "rad" and s2 is "apple". To turn s1 into s2, the algorithm proceeds like this:

[GIF from the original post: converting "rad" into "apple" step by step]

Keep the process in this GIF in mind, since it is how the edit distance gets computed. The key is choosing the right operation at each step, which is covered later.

From the GIF above you can see that there are not just three operations. There is actually a fourth: do nothing (skip). It applies when s1[i] and s2[j] are already the same character. To keep the edit distance minimal, no operation should be applied to them, so simply move i and j forward.

Another case is easy to handle: when j has finished s2 but i has not finished s1, the only option is to delete the remaining characters of s1 so that it shrinks to s2.

Similarly, if i finishes s1 while j has not finished s2, the only option is to insert the remaining characters of s2 into s1. As we will see shortly, these two cases are the algorithm's base cases.

Next, let's turn these ideas into code. Hold on tight, here we go.

Second, the code in detail

First, a recap of the ideas so far:

The base case is when i has finished s1 or j has finished s2; we can directly return the remaining length of the other string.

For each pair of characters s1[i] and s2[j], there are four possible actions:

if s1[i] == s2[j]:
    do nothing (skip)
    move i and j forward together
else:
    choose one of three:
        insert
        delete
        replace

With this framework, the problem is essentially solved. You may ask how to choose among the three options. Simple: try all of them and pick whichever operation yields the smallest edit distance in the end. This takes recursion, which is a little tricky to follow, so look at the code first:

def minDistance(s1, s2) -> int:

    def dp(i, j):
        # base case
        if i == -1: return j + 1
        if j == -1: return i + 1
        
        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)  # do nothing (skip)
        else:
            return min(
                dp(i, j - 1) + 1,    # insert
                dp(i - 1, j) + 1,    # delete
                dp(i - 1, j - 1) + 1 # replace
            )
    
    # i and j start at the last index of each string
    return dp(len(s1) - 1, len(s2) - 1)

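Let's go through this recursive code in detail. The base cases should need no explanation, so the focus is on the recursive part.

People say recursive code is easy to explain, and there is truth in that: once you understand the function's definition, the logic of the algorithm becomes clear. Here, the dp(i, j) function is defined like this: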

def dp(i, j) -> int
# returns the minimum edit distance between s1[0..i] and s2[0..j]

With this definition in mind, look at this piece of code first:

if s1[i] == s2[j]:
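    return dp(i - 1, j - 1)  # do nothing (skip)
# Explanation:
# the characters are already equal, so no operation is needed;
# the minimum edit distance between s1[0..i] and s2[0..j] equals
# the minimum edit distance between s1[0..i-1] and s2[0..j-1],
# that is, dp(i, j) equals dp(i-1, j-1)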

If s1[i] != s2[j], we have to recurse on all three operations, which takes a little more thought:

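dp(i, j - 1) + 1,    # insert
# Explanation:
# insert a character equal to s2[j] right after s1[i];
# s2[j] is now matched, so move j forward and keep comparing with i.
# Do not forget to add one to the operation count.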

dp(i - 1, j) + 1,    # delete
# Explanation:
# delete the character s1[i];
# move i forward and keep comparing with j.
# Add one to the operation count.

dp(i - 1, j - 1) + 1 # replace
# Explanation:
# replace s1[i] with s2[j], so the two now match;
# move both i and j forward and keep comparing.
# Add one to the operation count.
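
To check the recursion end to end, it can be run on the "rad" / "apple" pair from earlier. The sketch below restates the same logic in a self-contained form. The name `min_distance_recursive` is chosen here for illustration and does not appear in the original code.

```python
def min_distance_recursive(s1: str, s2: str) -> int:
    """Plain top-down recursion with no memo, same logic as minDistance above."""

    def dp(i: int, j: int) -> int:
        # Base cases: one string is used up, so the other string's
        # remaining characters must all be inserted (or deleted).
        if i == -1:
            return j + 1
        if j == -1:
            return i + 1

        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)      # skip
        return min(
            dp(i, j - 1) + 1,            # insert
            dp(i - 1, j) + 1,            # delete
            dp(i - 1, j - 1) + 1,        # replace
        )

    return dp(len(s1) - 1, len(s2) - 1)


print(min_distance_recursive("rad", "apple"))   # 5
print(min_distance_recursive("horse", "ros"))   # 3
```

The only character the two strings share is "a", so at least four characters of "apple" need an operation, and the "r" in front of "a" costs one more, which gives 5.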

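By now you should fully understand this short, sharp piece of code. One small problem remains: this is a brute-force solution with overlapping subproblems, so it needs dynamic programming techniques to optimize it.

How can you tell at a glance that there are overlapping subproblems? The earlier article "Dynamic Programming: Regular Expressions" covered this. Briefly, you abstract out the recursive skeleton of the algorithm: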

def dp(i, j):
    dp(i - 1, j - 1) #1
    dp(i, j - 1)     #2
    dp(i - 1, j)     #3

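How can the subproblem dp(i-1, j-1) be reached from the original problem dp(i, j)? There is more than one path, for example dp(i, j) -> #1 and dp(i, j) -> #2 -> #3. Once you find one repeated path, there must be a huge number of them, which means overlapping subproblems.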
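
The repeated paths can be made visible by counting how often each state is visited. The following is a small sketch under the assumption that we simply instrument the brute-force recursion; `count_dp_calls` is a hypothetical helper name.

```python
from collections import Counter


def count_dp_calls(s1: str, s2: str) -> Counter:
    """Run the brute-force recursion and count the visits to each (i, j)."""
    calls = Counter()

    def dp(i: int, j: int) -> int:
        calls[(i, j)] += 1
        if i == -1:
            return j + 1
        if j == -1:
            return i + 1
        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)
        return min(dp(i, j - 1) + 1, dp(i - 1, j) + 1, dp(i - 1, j - 1) + 1)

    dp(len(s1) - 1, len(s2) - 1)
    return calls


# No characters match, so every non-base call branches three ways.
calls = count_dp_calls("abc", "xyz")
print(calls[(2, 2)])   # 1: the original problem
print(calls[(1, 1)])   # 3: reached via #1, via #2 -> #3, and via #3 -> #2
```

The total number of calls is larger than the number of distinct states, which is exactly the waste a memo or DP table removes.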

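Third, dynamic programming optimization

As for overlapping subproblems, the earlier article "Dynamic Programming in Detail" covered them thoroughly: the optimization is either a memo or a DP table.

A memo is easy to add; the original code only needs a small change: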

def minDistance(s1, s2) -> int:

    memo = dict() # memo
    def dp(i, j):
        if (i, j) in memo: 
            return memo[(i, j)]
        ...
        
        if s1[i] == s2[j]:
            memo[(i, j)] = ...  
        else:
            memo[(i, j)] = ...
        return memo[(i, j)]
    
    return dp(len(s1) - 1, len(s2) - 1)
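
For reference, here is one way the elided parts might be filled in. It is a sketch that combines the memo skeleton with the earlier recursion, not necessarily the author's exact code, and the name `min_distance_memo` is an assumption.

```python
def min_distance_memo(s1: str, s2: str) -> int:
    memo = {}  # (i, j) -> minimum edit distance between s1[0..i] and s2[0..j]

    def dp(i: int, j: int) -> int:
        if (i, j) in memo:
            return memo[(i, j)]
        # Base cases are O(1), so they are returned without caching.
        if i == -1:
            return j + 1
        if j == -1:
            return i + 1

        if s1[i] == s2[j]:
            memo[(i, j)] = dp(i - 1, j - 1)
        else:
            memo[(i, j)] = min(
                dp(i, j - 1) + 1,      # insert
                dp(i - 1, j) + 1,      # delete
                dp(i - 1, j - 1) + 1,  # replace
            )
        return memo[(i, j)]

    return dp(len(s1) - 1, len(s2) - 1)


print(min_distance_memo("intention", "execution"))  # 5
```

Each (i, j) pair is now computed at most once. One caveat of the recursive form: for very long strings, the recursion depth can exceed Python's default recursion limit, which the DP table avoids.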

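The main focus here is the DP table solution.

First, pin down the meaning of the dp array. It is a two-dimensional array with one row per prefix of s1 and one column per prefix of s2, including the empty prefix.

After the groundwork of the recursive solution, this should be easy to understand. dp[..][0] and dp[0][..] correspond to the base cases, and dp[i][j] means much the same as the earlier dp function: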

def dp(i, j) -> int
# returns the minimum edit distance between s1[0..i] and s2[0..j]

dp[i+1][j+1]
# stores the minimum edit distance between s1[0..i] and s2[0..j]

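The base case of the dp function is i or j equal to -1, but array indices start at 0, so the dp array is shifted by one position.

Since the dp array means the same thing as the recursive dp function, the earlier ideas carry over directly to the code. The only difference is that the DP table is solved bottom-up, while the recursive solution works top-down.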

int minDistance(String s1, String s2) {
    int m = s1.length(), n = s2.length();
    int[][] dp = new int[m + 1][n + 1];
    // base case 
    for (int i = 1; i <= m; i++)
        dp[i][0] = i;
    for (int j = 1; j <= n; j++)
        dp[0][j] = j;
    // solve bottom-up
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            if (s1.charAt(i-1) == s2.charAt(j-1))
                dp[i][j] = dp[i - 1][j - 1];
            else               
                dp[i][j] = min(
                    dp[i - 1][j] + 1,
                    dp[i][j - 1] + 1,
                    dp[i-1][j-1] + 1
                );
    // dp[m][n] holds the minimum edit distance between all of s1 and all of s2
    return dp[m][n];
}

int min(int a, int b, int c) {
    return Math.min(a, Math.min(b, c));
}
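
For readers following the Python version from earlier, the same table can be sketched in Python. This is a direct port of the Java code above under the same index offset; the name `min_distance_table` is introduced here for illustration.

```python
def min_distance_table(s1: str, s2: str) -> int:
    m, n = len(s1), len(s2)
    # dp[i][j] = minimum edit distance between s1[:i] and s2[:j].
    # Row 0 and column 0 stand for the empty prefix, i.e. the
    # i == -1 and j == -1 base cases of the recursive version.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i                  # delete all i characters of s1
    for j in range(1, n + 1):
        dp[0][j] = j                  # insert all j characters of s2

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]      # skip
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],                # delete
                    dp[i][j - 1],                # insert
                    dp[i - 1][j - 1],            # replace
                )
    return dp[m][n]


print(min_distance_table("rad", "apple"))       # 5
print(min_distance_table("kitten", "sitting"))  # 3
```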

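Fourth, extensions

In general, dynamic programming problems on two strings are handled the way this article does it, by building a DP table. Why? Because the table makes the state transitions easy to see. In the edit distance DP table, for example, dp[i][j] is computed from its left, upper, and upper-left neighbors.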

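One more detail: since each dp[i][j] depends only on the three states next to it, the space complexity can be compressed to \(O(\min(M, N))\), where M and N are the lengths of the two strings. It is not hard, but the code becomes much less readable. You can try the optimization yourself.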
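
One possible way to do this compression, sketched under the assumption that keeping only the previous and current rows is acceptable (the article leaves the exact approach to the reader, and `min_distance_two_rows` is a hypothetical name):

```python
def min_distance_two_rows(s1: str, s2: str) -> int:
    # Put the shorter string along the row so each row has min(M, N) + 1
    # cells. Swapping is safe because edit distance is symmetric.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    n = len(s2)

    prev = list(range(n + 1))           # row for the empty prefix of s1
    for i in range(1, len(s1) + 1):
        curr = [i] + [0] * n            # curr[0]: delete all i characters
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1]   # upper-left neighbor
            else:
                # upper (delete), left (insert), upper-left (replace)
                curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[n]


print(min_distance_two_rows("rad", "apple"))  # 5
```

The trade-off is that the full table is gone, so the sequence of operations can no longer be recovered by walking back through it.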

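You may also ask: this only gives the minimum edit distance, so what are the actual operations? In the earlier example of fixing an official account article, knowing the minimum edit distance alone is certainly not enough; you also need to know exactly how to make the changes.

This is actually simple. Modify the code slightly so that the dp array carries some extra information: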

// int[][] dp;
Node[][] dp;

class Node {
    int val;
    int choice;
    // 0 means do nothing (skip)
    // 1 means insert
    // 2 means delete
    // 3 means replace
}

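The val field is the value the dp array held before, and the choice field records the operation. While making the optimal choice, record the operation as well, then work backward from the result to recover the actual operations.

The final result is dp[m][n], right? Its val holds the minimum edit distance and its choice holds the last operation. If that was an insertion, for instance, move one cell to the left, to dp[m][n-1].

Repeat this process to walk back step by step to the starting point dp[0][0]. This forms a path, and editing according to the operations along that path is the optimal plan.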
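
The backtracking described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it stores a (val, choice) tuple in each cell instead of a Node object, reuses the choice codes 0 to 3 from the Node class, and the name `edit_script` is an assumption.

```python
def edit_script(s1: str, s2: str):
    """Return (distance, operations), where the operations turn s1 into s2."""
    m, n = len(s1), len(s2)
    # Each cell is (val, choice): 0 = skip, 1 = insert, 2 = delete, 3 = replace.
    dp = [[(0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, 2)
    for j in range(1, n + 1):
        dp[0][j] = (j, 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = (dp[i - 1][j - 1][0], 0)
            else:
                # Tuples compare by val first, so min picks the cheapest option.
                dp[i][j] = min(
                    (dp[i][j - 1][0] + 1, 1),
                    (dp[i - 1][j][0] + 1, 2),
                    (dp[i - 1][j - 1][0] + 1, 3),
                )

    # Walk back from dp[m][n] to dp[0][0], following the recorded choices.
    # Operations are collected from the end of the strings toward the front.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        choice = dp[i][j][1]
        if choice == 0:
            i, j = i - 1, j - 1         # skip: move diagonally
        elif choice == 1:
            ops.append(f"insert {s2[j - 1]!r}")
            j -= 1                      # insert: move one cell left
        elif choice == 2:
            ops.append(f"delete {s1[i - 1]!r}")
            i -= 1                      # delete: move one cell up
        else:
            ops.append(f"replace {s1[i - 1]!r} with {s2[j - 1]!r}")
            i, j = i - 1, j - 1         # replace: move diagonally
    return dp[m][n][0], ops


distance, ops = edit_script("rad", "apple")
print(distance)   # 5
for op in ops:
    print(op)
```

The number of recorded operations always equals the distance, since every non-skip step on the path lowers val by exactly one.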

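That is everything about the edit distance algorithm. If this article helped you, you are welcome to follow my WeChat official account, labuladong, which is dedicated to explaining algorithm problems clearly.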


Origin www.cnblogs.com/labuladong/p/12320390.html