KMP算法

2020-01-26 本文已影响0人 spraysss

KMP算法是一种字符串匹配算法，对于指定字符串str1和子串str2，返回字串在str1出现的位置索引，str1中不存在子串str2返回-1

前缀表

KMP 在匹配字符串根据前缀表减少不必要的回溯

KMP首先需要根据str2构造最长公共前后缀表
以 AABAABAAA为例
前缀

A 最长公共前后缀为0
AA 最长公共前后缀为1
AAB 最长公共前后缀为0
AABA 最长公共前后缀为1
AABAA 最长公共前后缀为2
AABAAB 最长公共前后缀为3
AABAABA 最长公共前后缀为4
AABAABAA 最长公共前后缀为5
AABAABAAA 最长公共前后缀为2

A	A	B	A	A	B	A	A	A
0	1	0	1	2	3	4	5	2

给出计算前缀表的代码：

    public static int[] computePrefixFunction(String str) {
        int[] next = new int[str.length()];
        next[0] = 0;
        for (int i = 1, j = 0; i < str.length(); i++) {
            //对不上
            while (j > 0 && str.charAt(i) != str.charAt(j)) {
                //hard to understand
                j = next[j - 1];

            }


            if (str.charAt(i) == str.charAt(j)) {
                j++;
            }
            next[i] = j;
        }
        return next;

    }

在这个代码while里的 j = next[j - 1]非常不好理解, j代表的是下一个需要比较的最长公共前后缀字符串的位置，我以构造AABAABAAA这个串的前缀表的最后两步做一些说明，在倒数第二步时算出了AABAABAA最长公共前后缀为5 ，此时的j=5，接下来i+1=8，指向了最后字符串中的最后一个字符A，此时j对应的字符为B和i对应的字符不相等，需要重新定位字符j。代码中的定位j的代码为j = next[j - 1]，这个实际上是倒数第二步匹配到的AABAA这个串的公共前后缀，其值等于2，关键是为什么要这么定位呢，这也是回溯不过是基于已知串的特性，也就是开头的两个绿色AA和结尾出的两个绿色AA相等，不需要比较，只需要比较接下来的黄色是否相等给出一张图，帮助理解：

图中的串有如下特性

$S1=S2=AABAA$ ，这个结论在倒数第二步由AABAABAA 最长公共前后缀为5 可以得出
$S1$ ， $S2$ 的子串 $S11=S12=S21=S22$ ,所以需要将j定位到黄色的B出，然后比较i，j对应的字符是否相等，不等的话继续按照此方法回溯。

kmpSearch

具体的搜索就是利用了前面提到的前缀表，思想基本一致，通过前缀表对齐str1 ,str2,直接上代码

import java.util.Arrays;

/****
 i
 0   1   2   3   4   5   6   7   8
 A   A   B   A   A   B   A   A   A
 0   1   0   1   2   3   4   5

 i
 0   1   2   3   4   5   6   7   8
 A   A   B   A   A   B   A   A   A
 j

 0   1   oo
 A   A   B   A   A   B
 A   A   B   A   A   A
 */

public class KMP {


    public static void main(String[] args) {
        String str1 = "BBC ABCDAB ABCDABCDABDE";
//        String str2 = "AABAABAAA";
        String str2 = "ABCDABD";
        int[] prefixTable = computePrefixFunction(str2);
        System.out.println(Arrays.toString(prefixTable));
        System.out.println(kmpSearch(str1, str2, prefixTable));

    }

    /**
     * @param str 字符串
     * @return 字符串的最大公共前后缀
     */
    public static int[] computePrefixFunction(String str) {
        int[] next = new int[str.length()];
        next[0] = 0;
        for (int i = 1, j = 0; i < str.length(); i++) {
            //对不上
            while (j > 0 && str.charAt(i) != str.charAt(j)) {
                //hard to understand
                j = next[j - 1];

            }


            if (str.charAt(i) == str.charAt(j)) {
                j++;
            }
            next[i] = j;
        }
        return next;

    }

    public static int kmpSearch(String str1, String str2, int[] prefixTable) {
        for (int i = 0, j = 0; i < str1.length(); i++) {
            //匹配不上时，查询前缀表
            while (j > 0 && str1.charAt(i) != str2.charAt(j)) {
                //根据前缀表对齐str2的 j下标，减少比较
                j = prefixTable[j - 1];
            }
            if (str1.charAt(i) == str2.charAt(j)) {
                j++;
            }
            if (j == str2.length()) {
                return i - j + 1;
            }

        }
        return -1;
    }


}

时间复杂度

如果str1的长度为n，str2的长度为m

computePrefixFunction时间复杂度为O(m)
kmpSearch时间复杂度为O(n)

KMP算法

前缀表

kmpSearch

时间复杂度

猜你喜欢

热点阅读