Python Cookbook study notes

2018-06-10 陆_志东

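Strings and text processing

Splitting a string on any of multiple delimiters

Problem 1: we need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent throughout the string.

Solution:
str.split() only handles simple cases, so here we use the re module.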

import re
line = "sadb,dsada  , dsaa,hehe;   haha  "
result = re.split(r"[;,\s]\s*",line)  
#  \s matches any whitespace character (space, tab, form feed, and so on); for ASCII text it is equivalent to [ \f\n\r\t\v]
print(result)
>> ['sadb', 'dsada', '', 'dsaa', 'hehe', 'haha', '']

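To drop the empty string at the end, strip the line first:

line = line.strip()  # removes leading and trailing whitespace, including newlines; lstrip() and rstrip() trim one side only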
result = re.split(r"[;,\s]\s*",line)
print(result)
>> ['sadb', 'dsada', '', 'dsaa', 'hehe', 'haha']
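
The empty string that remains in the middle of the result appears because "  , " is consumed as two separate delimiters. A minimal sketch of one way to avoid it, assuming empty fields are never meaningful in your data: treat any run of delimiter characters as a single separator.

```python
import re

line = "sadb,dsada  , dsaa,hehe;   haha  "
# [;,\s]+ treats any run of delimiters and whitespace as one separator,
# so no empty field is produced between adjacent delimiters
fields = re.split(r"[;,\s]+", line.strip())
print(fields)  # ['sadb', 'dsada', 'dsaa', 'hehe', 'haha']
```

If empty fields do carry meaning (for example in CSV-like data), this pattern would hide them, so the stricter pattern is the safer choice there.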

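Note:
The pattern above uses a character class [] rather than a group (). With (), which is a regex capture group,
the matched delimiter text also appears in the result. [] matches any single one of the characters listed inside it.

line = "sadb,dsada  , dsaa,hehe;   haha  "
line = line.strip()
result = re.split(r"(;|,|\s)\s*",line)  # capture group: the delimiters are kept in the result
print(result)
>> ['sadb', ',', 'dsada', ' ', '', ',', 'dsaa', ',', 'hehe', ';', 'haha']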

To group with () without capturing, use a non-capturing group (?:pattern)

line = "sadb,dsada  , dsaa,hehe;   haha  "
line = line.strip()
result = re.split(r"(;|,|\s)\s*",line)
print(result)
>> ['sadb', ',', 'dsada', ' ', '', ',', 'dsaa', ',', 'hehe', ';', 'haha']
result1 = re.split(r"(?:,|;|\s)\s*",line)
print(result1)
>> ['sadb', 'dsada', '', 'dsaa', 'hehe', 'haha']

If you only want to remove the extra whitespace and keep the delimiters, capture them and rebuild the string

line = "asdf  fjdk; afed, fjek,asdf,   foo"
res = re.split(r"(;|,|\s)\s*",line)
values = res[::2]
print(values,1)
>> ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo'] 1
delimiters = res[1::2] + [""]  # zip() stops at the shorter sequence, so pad the
# delimiter list with one extra element; otherwise the last value would be dropped
print(delimiters,2)
>> [' ', ';', ',', ',', ',', ''] 2
print([v+d for v,d in zip(values,delimiters)],3)
>> ['asdf ', 'fjdk;', 'afed,', 'fjek,', 'asdf,', 'foo'] 3
print(dict(zip(values,delimiters)),4)  # the duplicate key 'asdf' keeps only its last value
>> {'asdf': ',', 'fjdk': ';', 'afed': ',', 'fjek': ',', 'foo': ''} 4
res2 = "".join(v+d for v,d in zip(values,delimiters))
print(res2)
>> asdf fjdk;afed,fjek,asdf,foo

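For a Chinese string, re.sub() can remove all the whitespace. This does not work for English, because English words are separated by spaces.

text = "哈喽 你是？ 哈 哈"  # avoid naming a variable str, which shadows the built-in type
res = re.sub(r"\s+","",text)
print(res)
>> 哈喽你是？哈哈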

Matching text at the start or end of a string

A simple way is to use the str.startswith() or str.endswith() method.

filename = "test.txt"
res = filename.endswith(".txt")
print(res)
>> True
res = filename.startswith("haha")
print(res)
>> False
url = "http://www.baidu.com"
res = url.endswith(".cn")
print(res)
>> False

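To check several options at once, pass a tuple instead of a string.
Note: a list or set is not accepted; the argument must be a tuple.

import os
# Assume the current folder contains a directory test and the files test.py, test.js, test.txt
filenames = os.listdir(".")  # "." is the current folder; the result is a list in arbitrary order
print(filenames)
>> ['test', 'test.py', 'test.js', 'test.txt']
names = [name for name in filenames if name.endswith((".txt",".py"))]
print(names)
>> ['test.py', 'test.txt']
# Python built-ins any and all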
res = any(name.endswith(".py") for name in filenames)
print(res)
>> True
res = all(name.endswith(".py") for name in filenames)
print(res)
>> False
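
The tuple requirement noted above is easy to run into when the choices come from a list. The following is a small sketch (the variable names are only illustrative) showing the error and the usual fix of converting with tuple().

```python
choices = [".txt", ".py"]  # illustrative: suffixes that happen to be stored in a list

try:
    "test.py".endswith(choices)  # a list is rejected
    error = None
except TypeError as exc:
    error = str(exc)

# Converting the list to a tuple makes the same check work
result = "test.py".endswith(tuple(choices))
print(error is not None, result)  # True True
```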

As another example, read content according to the argument passed in: if it is a URL, read from the URL;
if it is a file name, read the file.

from urllib.request import urlopen
def read_body(arg1):
    if arg1.startswith(("http:","https:","ftp:")):
        return urlopen(arg1).read()
    else:
        with open(arg1,"r") as file:
            res = file.read()
        return res

Slicing also works, though it is less readable

file_name = "test.txt"
print(file_name[-4:] == ".txt")
>> True

You can also use a regular expression for the check

import re
url = "http://www.baidu.com"
res = re.match(r"^(http|https|ftp).*$",url)
print(res.group(0))
>> http://www.baidu.com
print(res.group(1))
>> http

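Note: regex metacharacters and shell wildcards differ slightly

Regex ? means the preceding item is optional (zero or one occurrence)
Shell ? matches any single character at that position
Regex * means the preceding item may repeat any number of times (including zero); it is usually combined as .*
Shell * matches any run of characters from that position on (including none)
For example, in the shell:  da*  matches anything that starts with da
  da*ad  matches anything that starts with da and ends with ad, whatever is in between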
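
The difference can be checked from Python itself. This is a sketch that uses the standard library's fnmatch module to stand in for shell-style wildcards; the sample strings are made up for illustration.

```python
import re
from fnmatch import fnmatchcase

# Shell-style: * stands for any run of characters
shell_star = fnmatchcase("daXYZad", "da*ad")       # True
# Regex: * repeats the preceding 'a', so "XYZ" cannot match
regex_miss = re.fullmatch(r"da*ad", "daXYZad")     # None
regex_star = re.fullmatch(r"da*ad", "daaaad")      # a match object

# Shell ? is exactly one character; regex ? makes the preceding item optional
shell_q = fnmatchcase("dab", "da?")                # True
regex_q = re.fullmatch(r"da?", "d")                # a match object ('a' is optional)

print(shell_star, regex_miss is None, regex_star is not None)
print(shell_q, regex_q is not None)
```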

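For plain text matching, use str.find(), str.startswith(), and str.endswith().
For more complex matching, use regular expressions (the re module).
If a pattern will be used many times, you can precompile it with re.compile().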

# Without compile
import re
text = "2018/06/12"
if re.match(r"\d+/\d+/\d+",text):
    print("yes")
else:
    print("no")
# With compile
rule = re.compile(r"\d+/\d+/\d+")
if rule.match(text):
    print("yes")
else:
    print("no")
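
Precompiling pays off when the same pattern is applied to many strings. A minimal sketch, assuming a small made-up list of lines: the compiled pattern is reused for each one, and the example also shows that match() only checks the start of the string while search() scans the whole string.

```python
import re

# Compile once, then reuse the pattern object for every line
date_rule = re.compile(r"(\d+)/(\d+)/(\d+)")
lines = ["2018/06/12", "no date here", "deadline 2018/07/01"]  # illustrative data

# match() succeeds only when the pattern matches at the start of the string
starts_with_date = [s for s in lines if date_rule.match(s)]

# search() finds the pattern anywhere; groups() returns the captured parts
found = []
for s in lines:
    m = date_rule.search(s)
    if m:
        found.append(m.groups())

print(starts_with_date)  # ['2018/06/12']
print(found)             # [('2018', '06', '12'), ('2018', '07', '01')]
```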

Case-insensitive search and replace with regular expressions

Add the re.IGNORECASE flag to the call

import re
text = "HAHA, Hello world"
res = re.findall("haha", text, re.IGNORECASE)
print(res)
>> ['HAHA']
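
The same flag works for replacement. A short sketch with re.sub(); note that the fourth positional parameter of re.sub() is count, so the flag is passed by keyword here.

```python
import re

text = "HAHA, Hello world"
# flags= must be given by keyword: the fourth positional argument is count
res = re.sub("haha", "hello", text, flags=re.IGNORECASE)
print(res)  # hello, Hello world
```

The replacement text is inserted exactly as written, so the original capitalization is lost.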

A special case of regex replacement

Make the case of the replacement text follow the case of the original text

import re
text = "HAHA, Haha haha  haHa"

def matchcase(word):
    def replace(m):
        text = m.group() # m is the match object that re.sub() passes in; m.group() returns the matched text
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

res = re.sub("haha", matchcase("hello"), text, flags=re.IGNORECASE)
print(res)
>> HELLO, Hello hello  hello
text = "hha"
print(re.search("h",text).group())
>> h

In a regular expression, adding the ? modifier after * switches it from greedy to non-greedy matching.
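
A small sketch of the difference, using a made-up string with two quoted parts: the greedy form runs to the last closing quote, while the non-greedy form stops at the first one.

```python
import re

text = 'say "yes" or "no"'  # illustrative input

greedy = re.findall(r'"(.*)"', text)   # .* grabs as much as possible
lazy = re.findall(r'"(.*?)"', text)    # .*? grabs as little as possible

print(greedy)  # ['yes" or "no']
print(lazy)    # ['yes', 'no']
```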

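Writing a regular expression that matches newlines

Use alternation (|) so that newlines are matched in addition to .

import re
text = """/* hah
       */"""
res = re.findall(r"/\*(.|\n)*\*/",text)  # capture group: findall returns only the group's last repetition
print(res)
>>[' ']
res = re.findall(r"/\*(?:.|\n)*\*/",text)  # non-capturing group: findall returns the whole match
print(res)
>>['/* hah\n       */']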
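
An alternative worth knowing is the re.DOTALL flag, which makes . match newlines as well. The sketch below applies it to the same comment text; the non-greedy .*? is an added choice here so that several comments in one string would be matched separately.

```python
import re

comment = "/* hah\n       */"

# With re.DOTALL, . also matches "\n", so no alternation is needed
pattern = re.compile(r"/\*(.*?)\*/", re.DOTALL)
res = pattern.findall(comment)
print(res)  # [' hah\n       ']
```

The flag is convenient for simple cases; the explicit (?:.|\n) form has the advantage of not depending on a flag when patterns are combined.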

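Handling special characters in strings

To remove whitespace at the start and end of a string, use strip() directly.
To remove other characters at the start and end, pass them to strip(), for example strip("-").
To change content in the middle of a string, use str.replace() or re.sub().

s = "hello      world"
s.replace(" ","")
>>>'helloworld'
import re
re.sub(r"\s+"," ",s)
>>>'hello world'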
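
The stripping variants described above can be sketched as follows (the sample strings are made up). The argument to strip() is a set of characters to remove, not a prefix or suffix string.

```python
s = "-----hello====="

both = s.strip("-=")      # removes any '-' or '=' from both ends
left = s.lstrip("-")      # left side only
right = s.rstrip("=")     # right side only

t = "  hello world \n"
plain = t.strip()         # default: whitespace, including the newline

print(both)          # hello
print(left)          # hello=====
print(right)         # -----hello
print(repr(plain))   # 'hello world'
```

Whitespace inside the string is left alone, which is why replace() or re.sub() is needed for the middle.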

These cleanup operations are usually combined with iterators

with open(filename,"r") as f:
    lines = (line.strip() for line in f)   # note: this is a generator expression, not a list comprehension
    for line in lines:
        pass

To remove a whole range of characters, or to strip diacritical marks, use the str.translate() method

s = "python\fis\tawesome\r\n"
remap = {
    ord("\t"):" ",
    ord("\f"):" ",
    ord("\r"):None    # None deletes the character
}
a = s.translate(remap)
a
>>> 'python is awesome\n'
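
For the diacritics case, one common approach (a sketch, not the only way) is to decompose the text with unicodedata.normalize() and then drop the combining characters. The sample string is made up for illustration.

```python
import unicodedata

s = "café naïve"

# NFD splits each accented letter into a base letter plus combining marks
decomposed = unicodedata.normalize("NFD", s)
# unicodedata.combining() is non-zero for combining marks, so filter them out
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(stripped)  # cafe naive
```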

Common string operations:
str.upper()
str.lower()
str.find()
str.replace()
str.translate()
re.search()
re.match()
re.sub()
re.findall()

Aligning strings:
Use the ljust(), rjust(), and center() methods

text = "hello world"
text.ljust(20)
>>>'hello world         '
text.rjust(20)
>>>'         hello world'
text.center(20)
>>>'    hello world     '
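
These methods also accept an optional fill character, and the built-in format() function can do the same alignment. A brief sketch; the fill characters are arbitrary choices for illustration.

```python
text = "hello world"

right = text.rjust(20, "=")          # pad on the left with '='
left = text.ljust(20, "-")           # pad on the right with '-'
middle = text.center(21, "*")        # pad both sides with '*'
same_right = format(text, "=>20")    # format spec: fill '=', align '>', width 20

print(right)       # =========hello world
print(left)        # hello world---------
print(middle)      # *****hello world*****
print(same_right)  # =========hello world
```

format() has the advantage of working with numbers and other types as well, not only strings.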