正则表达式中的“量词”入门介绍
正则表达式中的量词可以用来指明某个字符串匹配的次数。将在以下描述“贪心量词”(Greedy)、“厌恶量词”(reluctant)、“占有量词”(possessive)这三种量词。(真的不知道怎么翻译)。乍一看量词X?(贪心量词)、X??(厌恶量词) 和X?+(占有量词)好像作用也差不多,因为它们的匹配规则都是匹配“X” 一次或者零次,即X出现一次或者一次都不出现。其实它们有着细微的差别,在本文中最后一部分会说明。
1.png让我们用贪心量词来创建三种不同的正则表达式:a?、a*、a+、。看看如果用空字符串来测匹配会得到什么结果。
先给出以下测试代码(直接使用终端编译运行即可):
public class RegexTestHarness {
public static void main(String[] args){
Console console = System.console();
if (console == null) {
System.err.println("No console.");
System.exit(1);
}
while (true) {
Pattern pattern =
Pattern.compile(console.readLine("%nEnter your regex: "));
Matcher matcher =
pattern.matcher(console.readLine("Enter input string to search: "));
boolean found = false;
while (matcher.find()) {
console.format("I found the text" +
" \"%s\" starting at " +
"index %d and ending at index %d.%n",
matcher.group(),
matcher.start(),
matcher.end());
found = true;
}
if(!found){
console.format("No match found.%n");
}
}
}
}
Enter your regex: a?
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: a*
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: a+
Enter input string to search:
No match found.
零长度匹配
在上面的例子中,前两个例子可以匹配成功是因为表达式a?和a*允许字符串中不出现‘a’字符。你会看到开始和结束的下标都是0。空字符串""没有长度,因此这个正则在开始位置(即下标为0)即匹配成功。像这一类的匹配称之为“零长度匹配”。零长度匹配会在以下三种情况出现:
1.一个空字符串匹配。
2.和字符串的开端匹配,即下标为0的地方匹配。(开端即是空字符串)
3.和字符串结束的位置匹配。(结束即是空字符串)
4.任意两个字符之间,如"bc",b和c之间即存在一个空字符串""。
用“foo”这个字符串作为例子,下标的位置对应关系为
Paste_Image.png即index=0和index=3的地方会匹配。
零长度匹配是非常容易辨别出来,因为他们开始的位置和结束的位置是同一下标。
然我们再看几个列子,输入一个“a”字符。
Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
以上三个量词都能找到字符“a”,但是前两个例子在下标为1处匹配,也就是字符的结尾处。记住,匹配器查找到下标0和1之间的“a”,该程序会一直匹配到没有匹配为止。
接下来输入"ababaaaab",看下会得到什么输出。输出如下:
Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
读者可以自己推敲为什么会得出以上结果。
如果要限制某个字符出现的次数,可以使用大括号"{}"。如:
匹配“aaa”
Enter your regex: a{3}
Enter input string to search: aa
No match found.
Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.
Enter your regex: a{3}
Enter input string to search: aaaa
I found the text "aaa" starting at index 0 and ending at index 3.
对于第三个实例,要注意的是,当匹配了前三个a,后面的匹配和前面3个a没有任何关系,正则会继续和“aaa”后面的内容继续尝试匹配。
被量词修饰的子表达式 如:
Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.
Enter your regex: dog{3}
Enter input string to search: dogdogdogdogdogdog
No match found.
对于第二个例子,正则表达式匹配的内容应该是"do",后面紧跟3个"g",因此第二个例子无法匹配。
再看多一个例子:
Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.
Enter your regex: abc{3}
Enter input string to search: abccabaaaccbbbc
No match found.
贪婪模式和厌恶模式和占有模式的区别
贪婪模式之所以被称为贪婪模式,是因为贪婪模式会尽可能的去匹配更多的内容,如果匹配不成功,将会进行回溯,直至匹配成功或者不成功。
看看下面例子:
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Enter your regex: .+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.
第一个例子采用贪婪模式,.部分和整个字符串"xfooxxxxxxfoo"匹配,接着正则中foo部分和字符串"xfooxxxxxxfoo"的剩余部分匹配,即空字串"",发现匹配不成功。开始回溯, .*与"xfooxxxxxxfo"匹配,正则中的foo部分和"xfooxxxxxxfo"剩余部分进行匹配,即"o",发现不匹配,继续回溯。重复上诉过程,直到匹配成功。由于是贪婪模式,一旦成功,将不会继续匹配,匹配终止。
第二个例子采用的是厌恶模式(非贪婪模式),刚好和贪婪模式相反,一开始只会和字符串开始位置进行匹配,此例中,即和空字符串""匹配,匹配成功后,正则中的foo部分和字符串中的开头三个字符"xfo"匹配,发现匹配不成功。.*?开始和第一个字符匹配,即"x",匹配成功,接着正则中的foo和字符串中的"foo"匹配。至此整个正则第一次匹配成功。接着继续匹配,接下来的匹配内容为"xxxxxxfoo",采用相同的规则继续匹配,第二次匹配成功的字符串为"xxxxxxfoo"。直至整个字符串被消耗完毕才终止匹配。
第三个例子是占有模式。该模式只进行一次匹配。不进行回溯尝试,在次例中,.*+与"xfooxxxxxxfoo"匹配,正则中的foo和空字符串""匹配,匹配失败。将不进行回溯尝试。匹配结束。
以上内容大部分是翻译The Java™ Tutorials中关于正则的教程