正则表达式中的“量词”入门介绍

2017-04-03 本文已影响91人 c6a092a61698

正则表达式中的量词可以用来指明某个字符串匹配的次数。将在以下描述“贪心量词”（Greedy）、“厌恶量词”（reluctant）、“占有量词”（possessive）这三种量词。（真的不知道怎么翻译）。乍一看量词X？(贪心量词)、X??(厌恶量词) 和X？+(占有量词)好像作用也差不多，因为它们的匹配规则都是匹配“X” 一次或者零次，即X出现一次或者一次都不出现。其实它们有着细微的差别，在本文中最后一部分会说明。

1.png

让我们用贪心量词来创建三种不同的正则表达式：a?、a*、a+、。看看如果用空字符串来测匹配会得到什么结果。

先给出以下测试代码（直接使用终端编译运行即可）：

public class RegexTestHarness {
    public static void main(String[] args){
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {

            Pattern pattern =
                    Pattern.compile(console.readLine("%nEnter your regex: "));

            Matcher matcher =
                    pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text" +
                                " \"%s\" starting at " +
                                "index %d and ending at index %d.%n",
                        matcher.group(),
                        matcher.start(),
                        matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

Enter your regex: a?
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+
Enter input string to search:
No match found.

零长度匹配

在上面的例子中，前两个例子可以匹配成功是因为表达式a?和a*允许字符串中不出现‘a’字符。你会看到开始和结束的下标都是0。空字符串""没有长度，因此这个正则在开始位置（即下标为0）即匹配成功。像这一类的匹配称之为“零长度匹配”。零长度匹配会在以下三种情况出现：
1.一个空字符串匹配。
2.和字符串的开端匹配，即下标为0的地方匹配。（开端即是空字符串）
3.和字符串结束的位置匹配。（结束即是空字符串）
4.任意两个字符之间,如"bc"，b和c之间即存在一个空字符串""。

用“foo”这个字符串作为例子，下标的位置对应关系为

Paste_Image.png

即index=0和index=3的地方会匹配。

零长度匹配是非常容易辨别出来，因为他们开始的位置和结束的位置是同一下标。

然我们再看几个列子，输入一个“a”字符。
Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

以上三个量词都能找到字符“a”，但是前两个例子在下标为1处匹配，也就是字符的结尾处。记住，匹配器查找到下标0和1之间的“a”，该程序会一直匹配到没有匹配为止。

接下来输入"ababaaaab",看下会得到什么输出。输出如下：
Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.

读者可以自己推敲为什么会得出以上结果。

如果要限制某个字符出现的次数，可以使用大括号"{}"。如：
匹配“aaa”
Enter your regex: a{3}
Enter input string to search: aa
No match found.

Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.

Enter your regex: a{3}
Enter input string to search: aaaa
I found the text "aaa" starting at index 0 and ending at index 3.

对于第三个实例，要注意的是，当匹配了前三个a，后面的匹配和前面3个a没有任何关系，正则会继续和“aaa”后面的内容继续尝试匹配。

被量词修饰的子表达式 如：
Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.

Enter your regex: dog{3}
Enter input string to search: dogdogdogdogdogdog
No match found.

对于第二个例子，正则表达式匹配的内容应该是"do",后面紧跟3个"g",因此第二个例子无法匹配。

再看多一个例子：
Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.

Enter your regex: abc{3}
Enter input string to search: abccabaaaccbbbc
No match found.

贪婪模式和厌恶模式和占有模式的区别
贪婪模式之所以被称为贪婪模式，是因为贪婪模式会尽可能的去匹配更多的内容，如果匹配不成功，将会进行回溯，直至匹配成功或者不成功。
看看下面例子：
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.
第一个例子采用贪婪模式，.部分和整个字符串"xfooxxxxxxfoo"匹配，接着正则中foo部分和字符串"xfooxxxxxxfoo"的剩余部分匹配，即空字串"",发现匹配不成功。开始回溯, .*与"xfooxxxxxxfo"匹配，正则中的foo部分和"xfooxxxxxxfo"剩余部分进行匹配，即"o"，发现不匹配，继续回溯。重复上诉过程，直到匹配成功。由于是贪婪模式，一旦成功，将不会继续匹配，匹配终止。

第二个例子采用的是厌恶模式（非贪婪模式），刚好和贪婪模式相反，一开始只会和字符串开始位置进行匹配，此例中，即和空字符串""匹配，匹配成功后，正则中的foo部分和字符串中的开头三个字符"xfo"匹配，发现匹配不成功。.*?开始和第一个字符匹配，即"x",匹配成功，接着正则中的foo和字符串中的"foo"匹配。至此整个正则第一次匹配成功。接着继续匹配，接下来的匹配内容为"xxxxxxfoo",采用相同的规则继续匹配,第二次匹配成功的字符串为"xxxxxxfoo"。直至整个字符串被消耗完毕才终止匹配。

第三个例子是占有模式。该模式只进行一次匹配。不进行回溯尝试，在次例中，.*+与"xfooxxxxxxfoo"匹配，正则中的foo和空字符串""匹配，匹配失败。将不进行回溯尝试。匹配结束。

以上内容大部分是翻译The Java™ Tutorials中关于正则的教程

正则表达式中的“量词”入门介绍

猜你喜欢

热点阅读