Data Structures - Strings (Sequence)

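A string here is the familiar string type from everyday development: a finite sequence of characters.
Taking the string thank as an example:
Prefixes (prefix): t, th, tha, than, thank
Proper prefixes (proper prefix): t, th, tha, than
Suffixes (suffix): thank, hank, ank, nk, k
Proper suffixes (proper suffix): hank, ank, nk, k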

String Matching Algorithms

The task is to find the position of a pattern string (Pattern) within a text string (Text).

String text = "Hello world";
String pattern = "or";
text.indexOf(pattern); // 7
text.indexOf("other"); // -1

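Several classic string matching algorithms:

Brute Force
KMP
Boyer-Moore
Rabin-Karp
Sunday

Throughout, tlen is the length of the text string text, and plen is the length of the pattern string pattern.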

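Brute Force

Compare character by character, sliding the pattern from left to right one position at a time until a match is found.
There are two common ways to implement brute force.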

Brute Force 1 - Process
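To make the process concrete, here is a minimal tracing sketch of the first approach. The class name BruteForce1Trace and the sample strings are illustrative assumptions, not part of the original; the method prints every comparison so that the backtracking of ti after a mismatch is visible.

```java
class BruteForce1Trace {
    // Same loop as brute force 1, with one trace line per character comparison.
    // Assumes non-null arguments (a sketch, not a hardened implementation).
    static int indexOf(String text, String pattern) {
        int tlen = text.length();
        int plen = pattern.length();
        if (tlen == 0 || plen == 0 || tlen < plen) return -1;
        int pi = 0, ti = 0;
        while (pi < plen && ti < tlen) {
            boolean same = text.charAt(ti) == pattern.charAt(pi);
            System.out.printf("ti=%d pi=%d  %c vs %c  %s%n",
                    ti, pi, text.charAt(ti), pattern.charAt(pi),
                    same ? "match" : "mismatch");
            if (same) {
                ti++;
                pi++;
            } else {
                ti -= pi - 1; // back to one position after this round's start
                pi = 0;
            }
        }
        return pi == plen ? ti - pi : -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample input chosen for illustration.
        System.out.println("index = " + indexOf("abcabd", "abd"));
    }
}
```

In the trace, the mismatch at ti=2, pi=2 sends ti back to 1 and pi back to 0, which is exactly the backtracking that KMP later avoids.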


Brute Force 1 - Implementation

public static int indexOf(String text, String pattern) {
    if (text == null || pattern == null) return -1;
    int tlen = text.length();
    int plen = pattern.length();
    if (tlen == 0 || plen == 0 || tlen < plen) return -1;
    // pi indexes the pattern, ti indexes the text
    int pi = 0, ti = 0;
    while (pi < plen && ti < tlen) {
        if (text.charAt(ti) == pattern.charAt(pi)) {
            ti++;
            pi++;
        } else {
            // mismatch: restart one position to the right of this round's start
            ti -= pi - 1;
            pi = 0;
        }
    }
    // ti - pi is where the matched round started
    return pi == plen ? ti - pi : -1;
}

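Brute Force 1 - Optimization

The brute-force implementation above can exit early at the right moment, which reduces the number of comparisons.
To do so, the loop condition on ti can change from ti < tlen to
ti - pi <= tlen - plen
ti - pi is the position in Text of the first character compared in the current round. Once it exceeds tlen - plen, the remaining text is shorter than the pattern, so no match is possible.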
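The effect of the tighter exit condition can be measured by counting comparisons. The sketch below is an illustration only: the class name ExitConditionDemo, the optimized flag and the sample strings are assumptions introduced here, and the method assumes non-empty inputs with tlen >= plen.

```java
class ExitConditionDemo {
    // Counts character comparisons made by brute force 1.
    // optimized == false uses "ti < tlen"; true uses "ti - pi <= tlen - plen".
    static int countComparisons(String text, String pattern, boolean optimized) {
        int tlen = text.length();
        int plen = pattern.length();
        int tmax = tlen - plen;
        int pi = 0, ti = 0, count = 0;
        while (pi < plen && (optimized ? ti - pi <= tmax : ti < tlen)) {
            count++;
            if (text.charAt(ti) == pattern.charAt(pi)) {
                ti++;
                pi++;
            } else {
                ti -= pi - 1;
                pi = 0;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "abcdefgab"; // hypothetical sample, length 9
        String pattern = "xyz";    // never matches
        System.out.println(countComparisons(text, pattern, false)); // 9
        System.out.println(countComparisons(text, pattern, true));  // 7
    }
}
```

With the original condition every text position starts a round; with the optimized one the last two rounds are skipped because fewer than plen characters remain.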

Brute Force 1 - Optimized Implementation

public static int indexOf(String text, String pattern) {
    if (text == null || pattern == null) return -1;
    int tlen = text.length();
    int plen = pattern.length();
    if (tlen == 0 || plen == 0 || tlen < plen) return -1;
    int pi = 0, ti = 0;
    // last text index at which a round may start
    int tmax = tlen - plen;
    while (pi < plen && ti - pi <= tmax) {
        if (text.charAt(ti) == pattern.charAt(pi)) {
            ti++;
            pi++;
        } else {
            // mismatch: restart one position to the right of this round's start
            ti -= pi - 1;
            pi = 0;
        }
    }
    return pi == plen ? ti - pi : -1;
}

Brute Force 2 - Process


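Here ti is the text position where the current round starts.
When a round fails, the next round begins with pi = 0 and ti++.
The match succeeds when pi == plen.

Brute Force 2 - Implementation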

public static int indexOf(String text, String pattern) {
    if (text == null || pattern == null) return -1;
    int tlen = text.length();
    int plen = pattern.length();
    if (tlen == 0 || plen == 0 || tlen < plen) return -1;
    int tmax = tlen - plen;
    // ti is the start of the current round in the text
    for (int ti = 0; ti <= tmax; ti++) {
        int pi = 0;
        for (; pi < plen; pi++) {
            if (text.charAt(ti + pi) != pattern.charAt(pi)) break;
        }
        // the inner loop ran to the end: every pattern character matched
        if (pi == plen) return ti;
    }
    return -1;
}

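Brute Force - Performance Analysis

n is the length of the text string and m is the length of the pattern string.
At most n - m + 1 rounds of comparison are needed.
Best case:
The first round matches completely, using m comparisons.
Time complexity: O(m)
Worst case (the larger the character set, the less likely it is to occur):
All n - m + 1 rounds are executed.
Every round fails only at the last character of the pattern (m - 1 successful comparisons, then 1 failure).
Time complexity: O(m * (n - m + 1)); since m is usually much smaller than n, this is O(nm).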
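The best and worst cases can be checked by counting comparisons in the second brute-force approach. This is a small sketch under assumptions: the class name BruteForceCost and the sample strings are made up for illustration, and inputs are assumed non-empty with n >= m.

```java
class BruteForceCost {
    // Counts character comparisons made by brute force 2 until it stops.
    static int countComparisons(String text, String pattern) {
        int tlen = text.length();
        int plen = pattern.length();
        int count = 0;
        for (int ti = 0; ti <= tlen - plen; ti++) {
            int pi = 0;
            for (; pi < plen; pi++) {
                count++;
                if (text.charAt(ti + pi) != pattern.charAt(pi)) break;
            }
            if (pi == plen) break; // match found, stop searching
        }
        return count;
    }

    public static void main(String[] args) {
        // Worst case: n = 10, m = 4, every round fails on the last character.
        System.out.println(countComparisons("aaaaaaaaaa", "aaab")); // 28 = 4 * (10 - 4 + 1)
        // Best case: the first round matches, m comparisons.
        System.out.println(countComparisons("aaabaaaaaa", "aaab")); // 4
    }
}
```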


KMP

KMP is short for Knuth-Morris-Pratt (named after its three inventors) and was published in 1977.

Brute Force vs KMP


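Compared with brute force, the elegance of KMP is that it makes full use of what has already been compared, so it can skip alignments that cannot possibly match.

KMP - Using the next Table

KMP builds a next table in advance from the contents of the pattern string (usually an array).

KMP - Core Principle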


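Suppose the text character d and the pattern character e mismatch. We would like to shift Pattern right by a large distance in one step and then compare d directly with another pattern character c.
This is only valid if A equals B, where A is the prefix of Pattern that ends just before c and B is the equally long part of the already matched region that ends just before e.
So KMP has to find such A and B in the substring to the left of the mismatched character e, which tells it how far to shift.
Shift distance: (length of the substring to the left of e) - (length of A), which is equivalent to (index of e) - (index of c).
Since (index of c) == next[index of e], the shift distance is (index of e) - next[index of e].
Summary:
If the mismatch happens at position pi, the shift distance is pi - next[pi], so the smaller next[pi] is, the farther the pattern moves.
next[pi] is the length of the longest common string shared by the proper prefixes and proper suffixes of the substring to the left of pi.

KMP - Longest Common Length of Proper Prefixes and Suffixes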
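The lengths in question can be computed straight from the definition. The following is a deliberately naive sketch, not the efficient construction KMP uses; the class name CommonLengths and the sample pattern ABCDABCE are assumptions chosen for illustration.

```java
class CommonLengths {
    // For every prefix pattern[0..i], finds the longest length len < i + 1
    // such that the first len characters equal the last len characters.
    static int[] maxCommonLengths(String pattern) {
        int[] result = new int[pattern.length()];
        for (int i = 0; i < pattern.length(); i++) {
            String sub = pattern.substring(0, i + 1);
            // try the longest proper suffix first
            for (int len = sub.length() - 1; len > 0; len--) {
                String suffix = sub.substring(sub.length() - len);
                if (sub.startsWith(suffix)) {
                    result[i] = len;
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample pattern.
        System.out.println(java.util.Arrays.toString(maxCommonLengths("ABCDABCE")));
        // [0, 0, 0, 0, 1, 2, 3, 0]
    }
}
```

For example, the prefix ABCDABC has proper prefixes A, AB, ABC, ... and proper suffixes C, BC, ABC, ...; the longest string in both sets is ABC, so its value is 3.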


KMP - Obtaining the next Table


Shift every maximum common length one position to the right and set the first element to -1; the result is the next table.
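The shift-by-one step is small enough to write out directly. In this sketch the class name NextFromLengths is hypothetical, and the input array is assumed to hold the maximum common lengths for the illustrative pattern ABCDABCE (0, 0, 0, 0, 1, 2, 3, 0).

```java
class NextFromLengths {
    // Moves every length one slot to the right and puts -1 in slot 0.
    // The last length is dropped because nothing is compared after it.
    static int[] toNext(int[] lengths) {
        int[] next = new int[lengths.length];
        next[0] = -1;
        System.arraycopy(lengths, 0, next, 1, lengths.length - 1);
        return next;
    }

    public static void main(String[] args) {
        int[] lengths = {0, 0, 0, 0, 1, 2, 3, 0}; // assumed input for ABCDABCE
        System.out.println(java.util.Arrays.toString(toNext(lengths)));
        // [-1, 0, 0, 0, 0, 1, 2, 3]
    }
}
```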

KMP - Main Algorithm Implementation

public static int indexOf(String text, String pattern) {
    if (text == null || pattern == null) return -1;
    int plen = pattern.length();
    int tlen = text.length();
    if (tlen == 0 || plen == 0 || tlen < plen) return -1;
    // next table built from the pattern
    int[] next = next(pattern);
    int pi = 0, ti = 0;
    int tmax = tlen - plen;
    while (pi < plen && ti - pi <= tmax) {
        // pi < 0 means next[0] == -1 was reached: move on to the next text character
        if (pi < 0 || text.charAt(ti) == pattern.charAt(pi)) {
            ti++;
            pi++;
        } else {
            // mismatch: ti stays where it is, only pi falls back
            pi = next[pi];
        }
    }
    return pi == plen ? ti - pi : -1;
}

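KMP - Why the "Maximum" Common Length?

Suppose the text string is AAAAABCDEF and the pattern string is AAAAB.
The mismatch happens at pi = 4, and the substring AAAA to its left has common prefix/suffix lengths 1, 2 and 3. Which of these values is the right one to assign to pi?
Assigning 3 to pi:
The pattern shifts right by 1 character, and the match then succeeds.
Assigning 1 to pi:
The pattern shifts right by 3 characters and misses the successful match.
The smaller the common length, the larger the shift and the less safe it is.
The larger the common length, the smaller the shift and the safer it is.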
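This can be tried out by running the KMP main loop with a hand-written table. The sketch below is an experiment, not part of the original: the class name FallbackDemo and the deliberately wrong table tooSmall are assumptions introduced to show what goes wrong when a non-maximal length is stored.

```java
class FallbackDemo {
    // KMP main loop that takes the fallback table as a parameter.
    static int search(String text, String pattern, int[] next) {
        int tlen = text.length();
        int plen = pattern.length();
        int pi = 0, ti = 0;
        int tmax = tlen - plen;
        while (pi < plen && ti - pi <= tmax) {
            if (pi < 0 || text.charAt(ti) == pattern.charAt(pi)) {
                ti++;
                pi++;
            } else {
                pi = next[pi];
            }
        }
        return pi == plen ? ti - pi : -1;
    }

    public static void main(String[] args) {
        String text = "AAAAABCDEF";
        String pattern = "AAAAB";
        int[] correct = {-1, 0, 1, 2, 3};  // maximum lengths: next[4] == 3
        int[] tooSmall = {-1, 0, 0, 0, 1}; // hypothetical table with next[4] == 1
        System.out.println(search(text, pattern, correct));  // 1
        System.out.println(search(text, pattern, tooSmall)); // -1, the match is skipped
    }
}
```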

KMP - How the next Table Is Built


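Suppose next[i] == n is known.

1. If Pattern[i] == Pattern[n], then next[i + 1] == n + 1.
2. If Pattern[i] != Pattern[n], let next[n] == k; if Pattern[i] == Pattern[k], then next[i + 1] == k + 1.
3. If Pattern[i] != Pattern[k], substitute k for n and repeat step 2.

KMP - next Table Implementation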

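public static int[] next(String pattern) {
    int len = pattern.length();
    int[] next = new int[len];
    int i = 0;
    // invariant: n == next[i]; the first element is always -1
    int n = next[i] = -1;
    int imax = len - 1;
    while (i < imax) {
        if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
            // the common prefix/suffix grows by one character
            next[++i] = ++n;
        } else {
            // fall back to the next shorter candidate
            n = next[n];
        }
    }
    return next;
}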

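KMP - A Weakness of the next Table

Suppose the text string is AAABAAAAB and the pattern string is AAAAB.
Here KMP is rather clumsy: after the mismatch at pi = 3 (text B against pattern A), pi falls back to 2, 1 and 0 in turn, and each time another A is compared with the same B and fails again.

KMP - Optimizing the next Table: Idea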


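Let d be the text character at the mismatch. If Pattern[i] != d, the pattern slides so that position next[i] (that is, n) is compared with d.
If Pattern[n] != d, the pattern slides so that position next[n] (that is, k) is compared with d.
If Pattern[i] == Pattern[n], then a mismatch at position i guarantees a mismatch at n as well, so the pattern always ends up sliding to position k to be compared with d.
Therefore next[i] can store next[n] (that is, k) directly.

KMP - Optimizing the next Table: Implementation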

public static int[] next(String pattern) {
    int len = pattern.length();
    int[] next = new int[len];
    int i = 0;
    int n = next[i] = -1;
    int imax = len - 1;
    while (i < imax) {
        if (n < 0 || pattern.charAt(i) == pattern.charAt(n)) {
            i++;
            n++;
            if (pattern.charAt(i) == pattern.charAt(n)) {
                // falling back to n would compare the same character again, so skip ahead
                next[i] = next[n];
            } else {
                next[i] = n;
            }
        } else {
            n = next[n];
        }
    }
    return next;
}

KMP - Effect of the next Table Optimization
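One way to see the effect is to count character comparisons with both tables on the earlier example (text AAABAAAAB, pattern AAAAB). The sketch below is self-contained for that purpose; the class and method names (NextTableComparison, nextBasic, nextOptimized, search) are hypothetical, and the counter only counts real character comparisons, not the pi < 0 steps.

```java
class NextTableComparison {
    // Basic table: next[i] is the maximum common length to the left of i.
    static int[] nextBasic(String p) {
        int[] next = new int[p.length()];
        int i = 0, n = next[0] = -1;
        while (i < p.length() - 1) {
            if (n < 0 || p.charAt(i) == p.charAt(n)) {
                next[++i] = ++n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Optimized table: skips fallbacks that would compare an equal character.
    static int[] nextOptimized(String p) {
        int[] next = new int[p.length()];
        int i = 0, n = next[0] = -1;
        while (i < p.length() - 1) {
            if (n < 0 || p.charAt(i) == p.charAt(n)) {
                i++;
                n++;
                next[i] = p.charAt(i) == p.charAt(n) ? next[n] : n;
            } else {
                n = next[n];
            }
        }
        return next;
    }

    // Returns {match index, number of character comparisons}.
    static int[] search(String text, String pattern, int[] next) {
        int tlen = text.length(), plen = pattern.length();
        int pi = 0, ti = 0, tmax = tlen - plen, comparisons = 0;
        while (pi < plen && ti - pi <= tmax) {
            if (pi < 0) { ti++; pi++; continue; } // no character compared here
            comparisons++;
            if (text.charAt(ti) == pattern.charAt(pi)) {
                ti++;
                pi++;
            } else {
                pi = next[pi];
            }
        }
        return new int[] { pi == plen ? ti - pi : -1, comparisons };
    }

    public static void main(String[] args) {
        String text = "AAABAAAAB", pattern = "AAAAB";
        int[] basic = nextBasic(pattern);         // [-1, 0, 1, 2, 3]
        int[] optimized = nextOptimized(pattern); // [-1, -1, -1, -1, 3]
        System.out.println(java.util.Arrays.toString(search(text, pattern, basic)));     // [4, 12]
        System.out.println(java.util.Arrays.toString(search(text, pattern, optimized))); // [4, 9]
    }
}
```

Both tables find the match at index 4; the optimized table saves the three repeated A-versus-B comparisons after the first mismatch.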


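KMP - Performance Analysis

KMP main logic:
Best-case time complexity: O(m)
Worst-case time complexity: O(n), with no more than 2n comparisons
Building the next table works much like the KMP main logic:
Time complexity: O(m)
KMP overall:
Best-case time complexity: O(m)
Worst-case time complexity: O(m + n)
Space complexity: O(m)

Brute Force vs KMP

Why is brute force inefficient?
When a character mismatches:
Brute force: ti backtracks to the left, and pi goes back to 0.
KMP: ti never has to backtrack, and pi does not necessarily go back to 0.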

Boyer-Moore

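The Boyer-Moore algorithm, BM for short, was invented by Robert S. Boyer and J Strother Moore in 1977.
Best-case time complexity: O(n / m); worst-case time complexity: O(n + m)
The algorithm compares starting from the tail of the pattern string (from back to front).
The number of characters BM shifts by is the larger of the values given by two rules:
Bad character rule (Bad Character, BC for short)
Good suffix rule (Good Suffix, GS for short)

Bad Character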


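When the character E in Pattern mismatches the character S in Text, S is called the "bad character".
If the bad character does not occur in the unmatched part of Pattern, move Pattern so that it starts right after the bad character.
Otherwise, align the rightmost occurrence of the bad character in the unmatched part of Pattern with the bad character in Text.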
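The bad character rule on its own already gives a working search. The sketch below uses only this rule and looks for the bad character by scanning the unmatched part of the pattern, which is simpler but slower than the precomputed BC table a full BM implementation would use. The class name BadCharacterSearch and the sample text are assumptions for illustration.

```java
class BadCharacterSearch {
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || n < m) return -1;
        int s = 0; // text index aligned with pattern[0]
        while (s <= n - m) {
            // compare from the tail of the pattern towards the front
            int j = m - 1;
            while (j >= 0 && pattern.charAt(j) == text.charAt(s + j)) j--;
            if (j < 0) return s; // every character matched
            char bad = text.charAt(s + j);
            // rightmost occurrence of the bad character in pattern[0..j-1]
            int k = j - 1;
            while (k >= 0 && pattern.charAt(k) != bad) k--;
            // k == -1 moves the pattern start to just after the bad character
            s += j - k;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        System.out.println(indexOf("HERE IS A SIMPLE EXAMPLE", "EXAMPLE")); // 17
    }
}
```

Because k is always smaller than j, the shift j - k is at least 1, so this variant never moves the pattern backwards.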

Good Suffix


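"MPLE" is a suffix that matched successfully; "E", "LE", "PLE" and "MPLE" are all "good suffixes".
If no substring of Pattern can be aligned with a good suffix, move Pattern so that it starts right after the good suffix.
Otherwise, find the matching substring in Pattern and align it with the good suffix in Text.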
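The good suffix shift can be computed by brute force from this description. The sketch below assumes the pattern is EXAMPLE (suggested by the suffix MPLE, but not stated in the text) and implements a simplified form of the rule: it tries each shift in increasing order and accepts the first one where the shifted pattern agrees with the good suffix wherever they overlap. The class name GoodSuffixShift and the parameter j (index of the mismatched pattern character) are hypothetical; real BM implementations precompute a GS table instead.

```java
class GoodSuffixShift {
    // pattern[j + 1 .. m - 1] is the good suffix; j is the mismatch position.
    static int shift(String pattern, int j) {
        int m = pattern.length();
        for (int s = 1; s < m; s++) {
            boolean ok = true;
            for (int k = j + 1; k < m && ok; k++) {
                int src = k - s; // pattern index that lands on position k after shifting
                if (src >= 0 && pattern.charAt(src) != pattern.charAt(k)) ok = false;
            }
            if (ok) return s;
        }
        return m; // nothing lines up: move past the good suffix entirely
    }

    public static void main(String[] args) {
        // Good suffix "MPLE" (mismatch at index 2): only the leading E lines up with the final E.
        System.out.println(shift("EXAMPLE", 2)); // 6
        // Good suffix "AB" occurs again at the start of ABCAB.
        System.out.println(shift("ABCAB", 2));   // 3
    }
}
```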

BM - Best Case


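Every alignment fails on its first comparison against a bad character that does not occur in Pattern, so Pattern jumps m positions each time.
Time complexity: O(n / m)

BM - Worst Case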


Time complexity: O(m + n)
The O(m) part is the cost of building the BC and GS tables.

Rabin-Karp

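The Rabin-Karp algorithm (or Karp-Rabin algorithm), RK for short, is a hash-based string matching algorithm.
It was invented by Richard M. Karp and Michael O. Rabin in 1987.
Rough idea:
Compare the hash of Pattern with the hash of every substring of Text that has the same length.
The hash of each substring can be computed from the hash of the previous substring in O(1) time.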
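A minimal sketch of the rolling-hash idea follows. The polynomial hash, the base 256 and the modulus 1000000007 are assumptions chosen for illustration (the text does not name a hash function), and the class name RabinKarpSketch is hypothetical. Because different strings can share a hash, the sketch verifies every hash hit with a direct comparison.

```java
class RabinKarpSketch {
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || n < m) return -1;
        final long base = 256;
        final long mod = 1_000_000_007L;
        // high = base^(m - 1) mod mod, the weight of a window's first character
        long high = 1;
        for (int i = 1; i < m; i++) high = high * base % mod;
        long ph = 0, th = 0; // hashes of the pattern and the current window
        for (int i = 0; i < m; i++) {
            ph = (ph * base + pattern.charAt(i)) % mod;
            th = (th * base + text.charAt(i)) % mod;
        }
        for (int s = 0; ; s++) {
            // equal hashes may be a collision, so confirm character by character
            if (ph == th && text.regionMatches(s, pattern, 0, m)) return s;
            if (s == n - m) return -1;
            // O(1) update: drop text[s], then append text[s + m]
            th = (th - text.charAt(s) * high % mod + mod) % mod;
            th = (th * base + text.charAt(s + m)) % mod;
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOf("Hello world", "or"));    // 7
        System.out.println(indexOf("Hello world", "other")); // -1
    }
}
```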

Sunday

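The Sunday algorithm was proposed by Daniel M. Sunday in 1990; its idea is very similar to BM.
It compares from front to back.
When a match fails, it looks at the character A in Text immediately after the substring currently being compared.
If A does not occur in Pattern, skip past it, that is, shift by Pattern length + 1.
Otherwise, align the rightmost A in Pattern with the A in Text.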
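The rules above translate almost line for line into code. This is a minimal sketch with a hypothetical class name SundaySketch; it finds the rightmost occurrence with String.lastIndexOf on every mismatch, whereas a tuned implementation would precompute a shift table.

```java
class SundaySketch {
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || n < m) return -1;
        int s = 0; // text index aligned with pattern[0]
        while (s <= n - m) {
            // compare from front to back
            int j = 0;
            while (j < m && text.charAt(s + j) == pattern.charAt(j)) j++;
            if (j == m) return s;
            if (s + m >= n) return -1; // no character after the current window
            char a = text.charAt(s + m); // the character just after the window
            int k = pattern.lastIndexOf(a); // -1 if a does not occur in pattern
            // k == -1 gives a shift of m + 1; otherwise pattern[k] lines up with a
            s += m - k;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("Hello world", "or"));    // 7
        System.out.println(indexOf("Hello world", "other")); // -1
    }
}
```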


Author: 玉树临风你卓哥
