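Python Regular Expressions

2018-12-28 obsession_me

Regular Expression Syntax

Dot `.`

In the default mode, the dot matches any single character except a newline. If the DOTALL flag is set, it matches any character, including a newline.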

import re

# "." matches exactly one character
s = "boy"
s2 = "booy"
s3 = "by"
pattern = re.compile("b.y")
flag = re.findall(pattern, s)  # ['boy']
flag2 = re.findall(pattern, s2)  # [] because "." matches exactly one character
flag3 = re.findall(pattern, s3)  # [] because "." must consume one character; it cannot match the empty string
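
The effect of the DOTALL flag mentioned above can be checked with a small sketch. The sample string `text` is made up for illustration.

```python
import re

text = "a\nb"

# By default "." refuses to match the newline between "a" and "b".
default_result = re.findall("a.b", text)

# With re.DOTALL (alias re.S) the newline is matched like any other character.
dotall_result = re.findall("a.b", text, flags=re.DOTALL)

print(default_result)  # []
print(dotall_result)   # ['a\nb']
```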

Caret `^`

By default, `^` matches only at the start of the string; in MULTILINE mode it also matches immediately after each newline.

# match at the start of the string
s = "boybay"
pattern = re.compile("^(b.y)")
flag = re.findall(pattern, s)  # ['boy']
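
To see what MULTILINE changes for `^`, here is a minimal sketch; the three-line sample string is an assumption chosen for illustration.

```python
import re

text = "boy\nbay\ntoy"

# Without MULTILINE, "^" anchors only at the very start of the string.
start_only = re.findall(r"^b.y", text)

# With MULTILINE, "^" also anchors right after every newline.
every_line = re.findall(r"^b.y", text, flags=re.MULTILINE)

print(start_only)  # ['boy']
print(every_line)  # ['boy', 'bay']
```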

Dollar sign `$`

By default, `$` matches at the end of the string or just before a newline that ends the string; in MULTILINE mode it also matches just before every newline.

# match at the end of the string
s = "foolyou\nfoolme\n"
s1 = "foolyo\nfoolme\n"  # note the difference from s
s2 = "fool\n"

pattern = re.compile("fool..$", flags=re.MULTILINE)
pattern1 = re.compile("fool..$")
pattern2 = re.compile("$", flags=re.MULTILINE)

result = re.findall(pattern, s)  # ['foolme']
result1 = re.findall(pattern1, s1)  # ['foolme']
result2 = re.findall(pattern2, s2)  # ['', ''], one before the newline and one after it

Asterisk `*`

Matches the preceding RE zero or more times.

# match zero or more repetitions
s = "a"
s1 = "ab"
s2 = "abb"
s3 = "abbb"

pattern = re.compile("ab*")

flag = re.findall(pattern, s)  # ["a"]
flag1 = re.findall(pattern, s1)  # ["ab"]
flag2 = re.findall(pattern, s2)  # ["abb"]
flag3 = re.findall(pattern, s3)  # ["abbb"]

Plus sign `+`

Matches the preceding RE one or more times.

s = "a"
s1 = "ab"
s2 = "abb"
s3 = "abbb"

pattern = re.compile("ab+")

flag = re.findall(pattern, s)  # [] because at least one "b" is required
flag1 = re.findall(pattern, s1)  # ["ab"]
flag2 = re.findall(pattern, s2)  # ["abb"]
flag3 = re.findall(pattern, s3)  # ["abbb"]

Question mark `?`

Matches the preceding RE zero times or once.

s = "a"
s1 = "ab"
s2 = "abb"
s3 = "abbb"

pattern = re.compile("ab?")

flag = re.findall(pattern, s)  # ["a"] because the "b" is optional
flag1 = re.findall(pattern, s1)  # ["ab"]
flag2 = re.findall(pattern, s2)  # ["ab"]
flag3 = re.findall(pattern, s3)  # ["ab"]

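`*?`, `+?`, `??`: greedy and non-greedy matching

Suppose the target string is `<a>mid<b>` and the pattern is `<.*>`. The result is `['<a>mid<b>']`. This behavior is called greedy. If we want just `<a>`, we need non-greedy matching: with the pattern `<.*?>` the result is `['<a>', '<b>']`.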

s = "<a>mid<b>"

pattern = re.compile("<.*>")
pattern2 = re.compile("<.*?>")

flag = re.findall(pattern, s)  # ['<a>mid<b>']
flag1 = re.findall(pattern2, s)  # ['<a>', '<b>']

Here `.*?` is non-greedy because the `?` that follows the quantifier makes it match as few characters as possible.
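
The same suffix applies to `+` and `?`. A minimal sketch comparing the greedy and non-greedy forms on one made-up string:

```python
import re

text = "abbb"

greedy_plus = re.findall("ab+", text)   # "+" takes as many "b" as it can
lazy_plus = re.findall("ab+?", text)    # "+?" stops after the first "b"
greedy_opt = re.findall("ab?", text)    # "?" takes the optional "b" when present
lazy_opt = re.findall("ab??", text)     # "??" prefers to skip the optional "b"

print(greedy_plus, lazy_plus)  # ['abbb'] ['ab']
print(greedy_opt, lazy_opt)    # ['ab'] ['a']
```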

`{m}`

Matches exactly m repetitions of the preceding RE; fewer repetitions cause the match to fail.

s = "abbbbc"
s1 = "abbbbbc"

pattern = re.compile("ab{5}")

flag = re.findall(pattern, s)  # []
flag1 = re.findall(pattern, s1)  # ['abbbbb']

`{m,n}`

Matches from m to n repetitions of the preceding RE, as many as possible.

s = "abbbbc"
s1 = "abbbbbc"

pattern = re.compile("ab{4,5}")

flag = re.findall(pattern, s)  # ['abbbb']
flag1 = re.findall(pattern, s1)  # ['abbbbb']

`{m,n}?`

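Matches from m to n repetitions of the preceding RE, but as few as possible. This is the non-greedy counterpart of `{m,n}`, in the same way that `*?` is the non-greedy counterpart of `*`.

s = "abbbbc"
s1 = "abbbbbc"

pattern = re.compile("ab{4,5}?")

flag = re.findall(pattern, s)  # ['abbbb']
flag1 = re.findall(pattern, s1)  # ['abbbb'], note the difference from {4,5}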

`\`

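The escape character. To stop Python from interpreting backslashes in the string literal itself, prefix the string with r, for example r"a\b".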
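
Why the r prefix matters can be shown with `\b`. This is a small sketch with an invented sample sentence: in a normal string literal Python turns `\b` into a backspace character before the regex engine ever sees it.

```python
import re

# In a normal string literal "\b" is one backspace character;
# in a raw string it stays backslash + "b", which the regex engine
# reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

plain_result = re.findall(plain, "a cat sat")
raw_result = re.findall(raw, "a cat sat")

print(len(plain), len(raw))  # 5 7
print(plain_result)          # []
print(raw_result)            # ['cat']
```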

`[]`

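Denotes a set of characters.

[abc] matches "a", "b", or "c".

[a-z] matches any lowercase letter from a to z, [0-5][0-9] matches the two-digit numbers from 00 to 59, and [0-9a-fA-F] matches a hexadecimal digit.

s = "a30b"
s1 = "a99b"
s2 = "aFb"

pattern = re.compile("a[0-5][0-9]b")
pattern2 = re.compile("a[0-9a-fA-F]b")  # matches a hexadecimal digit

flag = re.findall(pattern, s)  # ['a30b']
flag1 = re.findall(pattern, s1)  # []
flag2 = re.findall(pattern2, s2)  # ['aFb']

Some regex metacharacters lose their special meaning inside []. For example, [(.*?)] matches any one of the literal characters (, ., *, ?, and ).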
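
A quick check of that last point, as a minimal sketch with made-up input strings:

```python
import re

# Inside [] the characters ( . * ? ) are ordinary literals.
literal_set = re.findall(r"[(.*?)]", "f(x) = a.b * c?")

# Outside a set "." matches any character; inside a set it is just a dot.
dot_any = re.findall(r".", "a.b")
dot_literal = re.findall(r"[.]", "a.b")

print(literal_set)  # ['(', ')', '.', '*', '?']
print(dot_any)      # ['a', '.', 'b']
print(dot_literal)  # ['.']
```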

`|`

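A|B matches either A or B. The alternatives are tried from left to right, so once A matches, B is not tested.

s = "ab"
s1 = "aa"

pattern2 = re.compile("a[a-z]|[0-9]")

flag2 = re.findall(pattern2, s)  # ["ab"]
flag3 = re.findall(pattern2, s1)  # ["aa"]

A pattern such as "aa|b" is read as a choice between aa and b as wholes, because | binds more loosely than concatenation. When I entered a pattern such as "(a|b)", I was prompted that [ab] is more appropriate in that case, so pay attention to where | is used.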
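
The precedence of | can be sketched as follows. The sample string is an assumption; the non-capturing group `(?:...)` is used only to limit the scope of the alternation.

```python
import re

text = "aa b ab"

# "|" has the lowest precedence: this is "aa" OR "b".
whole_alternatives = re.findall(r"aa|b", text)

# A group limits the alternation: "a" followed by ("a" OR "b").
grouped = re.findall(r"a(?:a|b)", text)

# For single characters a character set says the same thing more directly.
with_set = re.findall(r"a[ab]", text)

print(whole_alternatives)  # ['aa', 'b', 'b']
print(grouped)             # ['aa', 'ab']
print(with_set)            # ['aa', 'ab']
```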

Parentheses ()

Marks a group; the text matched inside the group can be retrieved after the match.
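
A minimal sketch of how grouped text is retrieved; the input strings are invented for illustration.

```python
import re

m = re.search(r"(\d+)-(\d+)", "pages 10-25")
whole = m.group(0)          # the entire match
first, second = m.groups()  # the text captured by each group

# With more than one group, findall returns a tuple per match.
pairs = re.findall(r"(\w)=(\d)", "a=1, b=2")

print(whole)          # 10-25
print(first, second)  # 10 25
print(pairs)          # [('a', '1'), ('b', '2')]
```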

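Extension notation `(?...)`

The first character after the ? usually determines what the extension means.

`(?aiLmsux)`

This group matches no text; it only sets the matching flags. a stands for re.A (ASCII-only matching), i for re.I (ignore case), L for re.L (locale dependent), m for re.M (multi-line mode), s for re.S (the dot matches any character), u for re.U (Unicode matching), and x for re.X (verbose), which lets you write patterns that are easier to read.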
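
As a small sketch, an inline flag at the start of the pattern behaves like passing the corresponding flag argument. The sample text is an assumption.

```python
import re

text = "Python PYTHON python"

# "(?i)" inside the pattern and re.IGNORECASE as an argument do the same job.
inline = re.findall(r"(?i)python", text)
by_argument = re.findall(r"python", text, flags=re.IGNORECASE)

print(inline)       # ['Python', 'PYTHON', 'python']
print(by_argument)  # ['Python', 'PYTHON', 'python']
```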

# illustrate what re.VERBOSE does
# the next two regexes are equivalent
a = re.compile(r"\d+\.\d*")
b = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
# In re.X mode, whitespace in the pattern is ignored unless it is escaped, and
# everything from an unescaped "#" to the end of the line is treated as a comment.

(?:...)

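A non-capturing group: it groups the enclosed pattern, but the matched substring is not kept for later retrieval, unlike ().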
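
The difference shows up clearly with re.findall, which returns the group when one exists and the whole match otherwise. A minimal sketch on a made-up string:

```python
import re

text = "abab cd"

capturing = re.findall(r"(ab)+", text)        # returns what the group captured
non_capturing = re.findall(r"(?:ab)+", text)  # returns the whole match

print(capturing)      # ['ab']
print(non_capturing)  # ['abab']
```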

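(?aiLmsux-imsx:...)

Added in Python 3.6; since Python 3.7 the letters a, L, and u can also be used in the group.

The letters after - name the flags to switch off for the part of the pattern inside the group.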

# source https://blog.csdn.net/rubikchen/article/details/80471781
s = """hello,
world.
hello,
Python.
Happy!
"""

print(len(s))
rs = re.match(r"(?sm)(.*?\.)\n(?-s:(.*))(.*)", s)  # raw string, so "\." reaches the regex engine unchanged
print(rs.groups(), rs.span())
print("group0", rs.group(0))
print("group1", rs.group(1))
print("group2", rs.group(2))
print("group3", rs.group(3))

Let's look at this example carefully, starting with its output:

36
('hello,\nworld.', 'hello,', '\nPython.\nHappy!\n') (0, 36)
group0 hello,
world.
hello,
Python.
Happy!

group1 hello,
world.
group2 hello,
group3 
Python.
Happy!

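Now for the analysis. The pattern first sets the global flags to sm, that is DOTALL and MULTILINE, so (.*?\.) matches hello,\nworld. across the newline. Inside (?-s:(.*)) the DOTALL flag is switched off, so .* matches only up to the end of the line. After that group the local setting no longer applies and the global flags are back in effect: . can match newlines again, so the final (.*) matches all the remaining characters.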

`(?P<name>...)`

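Similar to a regular (...), but the group is also given a name. The name can be used in three contexts: inside the same pattern as (?P=name), on the match object as m.group('name'), and in the replacement string of re.sub() as \g<name>.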

# examples of how to use (?P<name>...)
p = re.compile(r'(?P<quote>[\'"]).*?(?P=quote)')
c = p.search('Paris in "the the" spring').group()
print(c)  # "the the"

`(?P=name)`

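The counterpart of the previous construct: a backreference that matches the same text as the earlier group called name.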
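
A minimal sketch of a named group being referenced from the pattern, from the match object, and from a replacement string. The second group name `body` is hypothetical, added only for this illustration.

```python
import re

# "quote" is referenced inside the pattern via (?P=quote);
# "body" is a hypothetical extra group for this sketch.
p = re.compile(r'(?P<quote>[\'"])(?P<body>.*?)(?P=quote)')
sentence = 'Paris in "the the" spring'

m = p.search(sentence)
quote_char = m.group("quote")  # access by name on the match object
body = m.group("body")

# In a replacement string the group is written as \g<name>.
swapped = p.sub(r"<\g<body>>", sentence)

print(quote_char)  # "
print(body)        # the the
print(swapped)     # Paris in <the the> spring
```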

`(?#...)`

A comment; the contents of the parentheses are ignored.
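
As a small sketch, adding `(?#...)` comments to a pattern does not change what it matches. The comment texts and the sample string are made up.

```python
import re

plain = re.compile(r"\d+\.\d*")
commented = re.compile(r"\d+(?#integer part)\.\d*(?#fraction)")

plain_result = plain.findall("pi is 3.14")
commented_result = commented.findall("pi is 3.14")

print(plain_result)      # ['3.14']
print(commented_result)  # ['3.14']
```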

`(?=...)`

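A positive lookahead assertion: the match succeeds only if ... matches next, but the looked-ahead text is not consumed. It acts like an "if ..." condition.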

target = "uozoyoPeng"
target2 = "uozoyoNotPeng"
pattern = re.compile(r"uozoyo(?=Peng)")


print(re.findall(pattern, target))  # ['uozoyo']
print(re.findall(pattern, target2))  # []

`(?!...)`

A negative lookahead assertion: the match succeeds only if ... does not match next. It acts like "if not ...".

target = "uozoyoPeng"
target2 = "uozoyoNotPeng"
pattern = re.compile(r"uozoyo(?!Peng)")


print(re.findall(pattern, target))  # []
print(re.findall(pattern, target2))  # ['uozoyo']

`(?<=...)`

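A positive lookbehind assertion: the match succeeds only if the current position is preceded by ...; the preceding text is not part of the match.

# example: extract the name from strings such as "/logo/东吴大学.png"
content = "/logo/东吴大学.png"  # sample input
pattern = re.compile(r"(?<=/logo/).*?(?=\.png)")
result = re.findall(pattern, content)  # ['东吴大学']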

`(?<!...)`

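A negative lookbehind assertion: the match succeeds only if the current position is not preceded by ...; nothing is consumed.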
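
A minimal sketch of a negative lookbehind, using an invented price string: it picks up numbers that are not preceded by a dollar sign. The digit in the set keeps the match from starting in the middle of a number.

```python
import re

text = "price: $30, quantity: 40"

# Digits that are preceded neither by "$" nor by another digit.
not_dollar = re.findall(r"(?<![$\d])\d+", text)

print(not_dollar)  # ['40']
```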

`\num`

Matches the same text that group number num matched earlier in the pattern.

p = re.compile(r'(.+) \1')
c = p.search('Paris in "the the" spring').group()
print(c)  # the the

`\A`

Matches only at the start of the string.

s = "peng jidong"
s1 = "jidong peng"

pattern = re.compile(r"\Apeng")

print(re.findall(pattern, s))  # ['peng']
print(re.findall(pattern, s1))  # []

`\b`

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters.

s = " foo "
s1 = "youfoo"

pattern = re.compile(r"\bfoo\b")

print(re.findall(pattern, s))  # ['foo']
print(re.findall(pattern, s1))  # []

`\B`

The opposite of \b: matches the empty string only where it is not at the beginning or end of a word.

s = "python"
s1 = "py"
s2 = "py."


pattern = re.compile(r"py\B")

print(re.findall(pattern, s))  # ['py']
print(re.findall(pattern, s1))  # []
print(re.findall(pattern, s2))  # []

`\d`

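Matches a decimal digit. For str patterns this covers Unicode decimal digits, not only [0-9].

s = "有十个人"  # "there are ten people", with the number written as a Chinese character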
s1 = "10 people"

pattern = re.compile(r"\d")

print(re.findall(pattern, s))  # []
print(re.findall(pattern, s1))  # ['1', '0']

`\D`

Matches any character that is not a digit; in ASCII mode this is equivalent to [^0-9].

`\s`

Matches whitespace characters; in ASCII mode this is [ \t\n\r\f\v].

s = "\t有\n十\r个人\n"

pattern = re.compile(r"\s.*?\s")

print(re.findall(pattern, s))  # ['\t有\n', '\r个人\n']

`\S`

Matches any character that is not whitespace; the opposite of \s.
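
The two negated classes \D and \S can be sketched on one made-up string:

```python
import re

text = "10 people\t3 cats"

non_digits = re.findall(r"\D+", text)  # runs of characters that are not digits
non_spaces = re.findall(r"\S+", text)  # runs of characters that are not whitespace

print(non_digits)  # [' people\t', ' cats']
print(non_spaces)  # ['10', 'people', '3', 'cats']
```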

`\w`

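Matches any word character, including the underscore. It is similar but not equivalent to [A-Za-z0-9_], because for str patterns "word" characters are drawn from the Unicode character set.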

s = "\t有\n十\r个人\n"

pattern = re.compile(r"\w.*?\w")

print(re.findall(pattern, s))  # ['十\r个']

`\W`

The opposite of \w.
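
A minimal sketch of \w and \W, including how the ASCII flag narrows \w. The sample string is an assumption.

```python
import re

text = "十 people_1!"

unicode_words = re.findall(r"\w+", text)                # Unicode word characters
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)  # only [A-Za-z0-9_]
non_words = re.findall(r"\W+", text)                    # everything \w rejects

print(unicode_words)  # ['十', 'people_1']
print(ascii_words)    # ['people_1']
print(non_words)      # [' ', '!']
```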

`\Z`

Matches only at the end of the string; the counterpart of \A.
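
\Z is stricter than `$`. A small sketch on a string that ends with a newline:

```python
import re

text = "foolme\n"

dollar = re.findall(r"me$", text)   # "$" also matches before a trailing newline
strict = re.findall(r"me\Z", text)  # "\Z" matches only at the very end

print(dollar)  # ['me']
print(strict)  # []
```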

Using Regular Expressions in Python

`re.compile(pattern, flags=0)`
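
A minimal sketch of re.compile: the compiled pattern object can be passed to the module-level functions, as the examples in this post do, or used through its own methods. The input string is made up.

```python
import re

pattern = re.compile(r"on.", flags=re.IGNORECASE)

# The compiled object exposes the same operations as methods.
by_method = pattern.findall("One on one")
by_function = re.findall(pattern, "One on one")

print(by_method)    # ['One', 'on ', 'one']
print(by_function)  # ['One', 'on ', 'one']
```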

`re.search(pattern, string, flags=0)`

Scans the string and returns a match object for the first match, or None if nothing matches.

s = "we can find the one you love"
pattern = re.compile(r"on.")
print(re.search(pattern, s))  # <re.Match object; span=(16, 19), match='one'>

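`re.match(pattern, string, flags=0)`

Matches only at the beginning of the string. Even with the re.M flag it matches only at the start of the string, not at the start of each line. Returns None if there is no match.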

s = """shall we
love together.
"""
s1 = "10 people"

pattern = re.compile(r"(?m).*")

print(re.match(pattern, s))  # <re.Match object; span=(0, 8), match='shall we'>
print(re.findall(pattern, s))  # ['shall we', '', 'love together.', '', '']

`re.fullmatch(pattern, string, flags=0)`

Returns a match object only if the whole string matches the pattern; otherwise returns None.

s = "we you"
s1 = "we"

pattern = re.compile(r"we")

print(re.fullmatch(pattern, s))  # None
print(re.fullmatch(pattern, s1))  # <re.Match object; span=(0, 2), match='we'>

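`re.split(pattern, string, maxsplit=0, flags=0)`

A regex-capable version of string splitting. When pattern contains no capturing (), the text matched by pattern does not appear in the result; when it does, the captured text is included. maxsplit is the maximum number of splits.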

s = "we, I, you, they, he, she"

print(re.split(r"\W+", s))  # ['we', 'I', 'you', 'they', 'he', 'she']
print(re.split(r"(\W+)", s))  # ['we', ', ', 'I', ', ', 'you', ', ', 'they', ', ', 'he', ', ', 'she']
print(re.split(r"\W+", s, maxsplit=1))  # ['we', 'I, you, they, he, she']

`re.findall(pattern, string, flags=0)`

Scans the string from left to right and returns all non-overlapping matches as a list.

`re.finditer(pattern, string, flags=0)`

Returns an iterator that yields match objects.
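
What the list contains depends on how many groups the pattern has. A minimal sketch with an invented input string:

```python
import re

text = "we=1, you=22"

no_group = re.findall(r"\w+=\d+", text)        # whole matches
one_group = re.findall(r"\w+=(\d+)", text)     # only the group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)  # one tuple per match

print(no_group)    # ['we=1', 'you=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('we', '1'), ('you', '22')]
```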

s = "we, I, you, they, he, she"

result = re.finditer(r"\w+", s)
for i in result:
    print(i)
"""
<re.Match object; span=(0, 2), match='we'>
<re.Match object; span=(4, 5), match='I'>
<re.Match object; span=(7, 10), match='you'>
<re.Match object; span=(12, 16), match='they'>
<re.Match object; span=(18, 20), match='he'>
<re.Match object; span=(22, 25), match='she'>
"""

`re.sub(pattern, repl, string, count=0, flags=0)`

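Here repl stands for replacement. It can be a string or a function; the function takes a single match object argument and returns the replacement string.

def dashrepl(matchobj):
    # a single dash becomes a space; a double dash becomes a single dash
    if matchobj.group(0) == '-':
        return ' '
    else:
        return '-'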


print(re.sub('-{1,2}', dashrepl, 'pro----gram-files'))  # pro--gram files

`re.subn(pattern, repl, string, count=0, flags=0)`

Like re.sub(), but returns a tuple (new_string, number_of_subs_made).

def dashrepl(matchobj):
    if matchobj.group(0) == '-':
        return ' '
    else:
        return '-'


print(re.subn('-{1,2}', dashrepl, 'pro----gram-files'))  # ('pro--gram files', 3)

`re.escape(pattern)`

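Escapes the special characters in pattern. This is especially useful when the string you want to match literally contains regex metacharacters, because it saves you from escaping them by hand.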

print(re.escape("python.exe"))  # python\.exe
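
A small sketch of why escaping matters when a literal string is used as a pattern. The haystack string is made up.

```python
import re

filename = "python.exe"
haystack = "python.exe pythonXexe"

# Unescaped, the "." in the file name matches any character.
unescaped = re.findall(filename, haystack)

# Escaped, the "." matches only a literal dot.
escaped = re.findall(re.escape(filename), haystack)

print(unescaped)  # ['python.exe', 'pythonXexe']
print(escaped)    # ['python.exe']
```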

`re.purge()`

Clears the regular expression cache.

Match Object
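
A match object, as returned by re.search() or re.match() above, carries the matched text and its position. A minimal sketch of the most common accessors, reusing the sentence from the re.search() example:

```python
import re

m = re.search(r"(\w+) (\w+)", "we can find the one you love")

whole = m.group(0)                 # the entire match
groups = m.groups()                # all captured groups as a tuple
position = m.span()                # (start, end) of the entire match
start, end = m.start(2), m.end(2)  # position of the second group

print(whole)       # we can
print(groups)      # ('we', 'can')
print(position)    # (0, 6)
print(start, end)  # 3 6
```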
