[Splitting sentences] split

2019-04-28  VanJordan

If you use .split() instead, the program will take into account multiple spaces, newlines, tabs and all other forms of whitespace. That should get you what you're looking for.

The tokenization method in fairseq

import re

SPACE_NORMALIZER = re.compile(r"\s+")


def tokenize_line(line):
    line = SPACE_NORMALIZER.sub(" ", line)
    line = line.strip()
    return line.split()

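In practice this is equivalent to a plain split() with no arguments, which already collapses runs of whitespace and ignores leading and trailing whitespace. Passing an explicit separator, as in split(" "), behaves differently: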

>>> teststr = "a   v w   ef sdv   \n   wef"
>>> print(teststr)
a   v w   ef sdv   
   wef
>>> teststr.split()
['a', 'v', 'w', 'ef', 'sdv', 'wef']
>>> teststr.split(" ")
['a', '', '', 'v', 'w', '', '', 'ef', 'sdv', '', '', '\n', '', '', 'wef']