TORCH04-01 TorchText for Text Dataset Processing

2020-04-07  杨强AT南京

  I used to get by with the dataset utilities under torch.utils.data in PyTorch, but its DataLoader requires samples of equal length, and every data source needs its own ad-hoc handling.
  PyTorch provides the torchtext.data module for text processing; combined with a Chinese word-segmentation tool, it covers most everyday text-processing needs.
  This post is an introduction to torchtext. It mainly covers the use of Field, Example, Dataset, and Vectors, and ends with a text-classification example built on an LSTM network. torchtext really is a powerful toolkit.


torchtext module structure

Structure of torchtext.data

Diagram of the typical TorchText workflow

A TorchText usage example

Environment setup

  1. Install torchtext
    • pip install torchtext
  2. Optional install 1 - tokenizer
    • pip install spacy
    • python -m spacy download en
  3. Optional install 2 - tokenizer
    • pip install sacremoses
  4. Install tokenizer - jieba (Chinese word segmentation)
    • pip install jieba

Data source

Data source file and data format

Defining fields with Field

Help documentation for the Field class

from torchtext.data import Field

Field?
Init signature:
Field(
    sequential=True,
    use_vocab=True,
    init_token=None,
    eos_token=None,
    fix_length=None,
    dtype=torch.int64,
    preprocessing=None,
    postprocessing=None,
    lower=False,
    tokenize=None,
    tokenizer_language='en',
    include_lengths=False,
    batch_first=False,
    pad_token='<pad>',
    unk_token='<unk>',
    pad_first=False,
    truncate_first=False,
    stop_words=None,
    is_target=False,
)
Docstring:
Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented
by tensors.  It holds a Vocab object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and
answer in a QA dataset), then they will have a shared vocabulary.

Attributes:
    sequential: Whether the datatype represents sequential data. If False,
        no tokenization is applied. Default: True.
    use_vocab: Whether to use a Vocab object. If False, the data in this
        field should already be numerical. Default: True.
    init_token: A token that will be prepended to every example using this
        field, or None for no initial token. Default: None.
    eos_token: A token that will be appended to every example using this
        field, or None for no end-of-sentence token. Default: None.
    fix_length: A fixed length that all examples using this field will be
        padded to, or None for flexible sequence lengths. Default: None.
    dtype: The torch.dtype class that represents a batch of examples
        of this kind of data. Default: torch.long.
    preprocessing: The Pipeline that will be applied to examples
        using this field after tokenizing but before numericalizing. Many
        Datasets replace this attribute with a custom preprocessor.
        Default: None.
    postprocessing: A Pipeline that will be applied to examples using
        this field after numericalizing but before the numbers are turned
        into a Tensor. The pipeline function takes the batch as a list, and
        the field's Vocab.
        Default: None.
    lower: Whether to lowercase the text in this field. Default: False.
    tokenize: The function used to tokenize strings using this field into
        sequential examples. If "spacy", the SpaCy tokenizer is
        used. If a non-serializable function is passed as an argument,
        the field will not be able to be serialized. Default: string.split.
    tokenizer_language: The language of the tokenizer to be constructed.
        Various languages currently supported only in SpaCy.
    include_lengths: Whether to return a tuple of a padded minibatch and
        a list containing the lengths of each examples, or just a padded
        minibatch. Default: False.
    batch_first: Whether to produce tensors with the batch dimension first.
        Default: False.
    pad_token: The string token used as padding. Default: "<pad>".
    unk_token: The string token used to represent OOV words. Default: "<unk>".
    pad_first: Do the padding of the sequence at the beginning. Default: False.
    truncate_first: Do the truncating of the sequence at the beginning. Default: False
    stop_words: Tokens to discard during the preprocessing step. Default: None
    is_target: Whether this field is a target variable.
        Affects iteration over batches. Default: False
File:           c:\program files\python36\lib\site-packages\torchtext\data\field.py
Type:           type
Subclasses:     ReversibleField, NestedField, LabelField, ShiftReduceField, ParsedTextField, BABI20Field

Notes on the Field class

CLASS torchtext.data.Field(
    sequential=True,         # Whether the data is sequential. Default True; if False, no tokenization is applied.
    use_vocab=True,          # Whether to use a Vocab object. Default True; if False, the data in this field must already be numerical.
    init_token=None,         # A token prepended to every example of this field, or None for no initial token.
    eos_token=None,          # A token appended to every example of this field, or None for no end-of-sentence token.
    fix_length=None,         # Pad all examples of this field to this fixed length, or None for flexible lengths.
    dtype=torch.int64,       # The torch.dtype of the resulting batch tensor.
    preprocessing=None,      # Pipeline applied after tokenization but before numericalization.
    postprocessing=None,     # Pipeline applied after numericalization but before the numbers are turned into a tensor.
    lower=False,             # Whether to lowercase the text.
    tokenize=None,           # Function that splits the text into a list of tokens; default is string.split. If "spacy", the SpaCy tokenizer is used.
    tokenizer_language='en', # Language of the tokenizer; languages other than 'en' are currently supported only via SpaCy.
    include_lengths=False,   # Whether to return a tuple of (padded minibatch, list of lengths) instead of just the padded minibatch.
    batch_first=False,       # Whether to put the batch dimension first (as required by network modules such as LSTM with batch_first=True).
    pad_token='<pad>',       # String token used for padding.
    unk_token='<unk>',       # String token used for OOV (out-of-vocabulary) words, i.e. words not in the vocabulary.
    pad_first=False,         # Where to pad: at the beginning (True) or at the end (False) of the sequence.
    truncate_first=False,    # Where to truncate over-long text: at the beginning (True) or at the end (False).
    stop_words=None,         # Tokens (stop words) to discard during the preprocessing step.
    is_target=False)         # Whether this field is the target/label field.
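  The same configuration can also be passed straight to the Field constructor instead of being assigned attribute by attribute afterwards. A minimal sketch (the variable names are just examples):

from torchtext.data import Field

# Equivalent to creating a default Field and then setting its attributes:
LABEL = Field(sequential=False, use_vocab=False)
TEXT = Field(sequential=True, use_vocab=True, lower=False, batch_first=True)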

Examples of building Field objects

  1. Build the default objects
    • The index column is not data we need, so no field is defined for it.
from torchtext.data import Field
fld_label = Field()
fld_text = Field()
  2. Set the basic attributes
# The label field is simple
fld_label.sequential = False     # defaults to True
fld_label.use_vocab = False      # defaults to True

# The feature (text) field
fld_text.sequential = True       # defaults to True
fld_text.use_vocab = True        # defaults to True
# Because sequential is True, the tokenize attribute must be specified

  3. Set the tokenize attribute to a word-segmentation function
    • Requirements for this function:
      1. Argument: the feature of one sample (the text field) as a string.
      2. Return: a list of the words after segmentation, so the field's data becomes a word list instead of a raw string.
import re
import jieba

# Keep only Chinese characters, ASCII letters and digits; everything else becomes a space.
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')

def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]

fld_text.tokenize = word_cut
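  A quick sanity check of the tokenizer (the sample sentence and its segmentation are only illustrative; jieba's actual output may differ slightly):

print(word_cut("今天天气很好,我们去公园散步。"))
# e.g. ['今天天气', '很好', '我们', '去', '公园', '散步']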

Building the dataset

Help documentation for Dataset

from torchtext.data import Dataset
Dataset?
Init signature: Dataset(examples, fields, filter_pred=None)
Docstring:
Defines a dataset composed of Examples along with its Fields.

Attributes:
    sort_key (callable): A key to use for sorting dataset examples for batching
        together examples with similar lengths to minimize padding.
    examples (list(Example)): The examples in this dataset.
    fields (dict[str, Field]): Contains the name of each column or field, together
        with the corresponding Field object. Two fields with the same Field object
        will have a shared vocabulary.
Init docstring:
Create a dataset from a list of Examples and Fields.

Arguments:
    examples: List of Examples.
    fields (List(tuple(str, Field))): The Fields to use in this tuple. The
        string is a field name, and the Field is the associated field.
    filter_pred (callable or None): Use only examples for which
        filter_pred(example) is True, or use all examples if None.
        Default is None.
File:           c:\program files\python36\lib\site-packages\torchtext\data\dataset.py
Type:           type
Subclasses:     TabularDataset, LanguageModelingDataset, SST, TranslationDataset, SequenceTaggingDataset, TREC, IMDB, BABI20

Notes on the Dataset class

  1. Constructor:
    Dataset(examples, fields, filter_pred=None)
          # examples: the data, a list of Example objects.
          # fields: the field list, a list of tuple(str, Field).
          # filter_pred: a callable used to filter the dataset; an example is kept only if the function returns True for it. If None, all examples are used (see the sketch after this list).
  2. Attributes:
    1. sort_key: a callable used to sort examples for batching.
    2. examples: list(Example).
    3. fields: dict[str, Field].
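  A minimal sketch of filter_pred with toy data (the sentences and field settings below are made up, purely to show that filtering happens at construction time):

from torchtext.data import Dataset, Example, Field

TEXT = Field(sequential=True, tokenize=str.split)
LABEL = Field(sequential=False, use_vocab=False)
toy_fields = [("text", TEXT), ("label", LABEL)]
toy_examples = [Example.fromlist(["short text", 0], toy_fields),
                Example.fromlist(["a much longer piece of text right here", 1], toy_fields)]

# Keep only examples whose tokenized text has fewer than 5 tokens.
toy_dataset = Dataset(toy_examples, toy_fields,
                      filter_pred=lambda ex: len(ex.text) < 5)
print(len(toy_dataset))   # expected: 1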

Building the fields for the dataset

from torchtext.data import Field
import re
import jieba

regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
# ----------------------------------------------------------------------
# 1. The Fields the dataset needs: fields (List(tuple(str, Field)))

fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False     # defaults to True
fld_label.use_vocab = False      # defaults to True

# The feature (text) field
fld_text.sequential = True       # defaults to True
fld_text.use_vocab = True        # defaults to True

# Because sequential is True, a tokenize function must be specified
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut

# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]   # two fields

fields
[('text', <torchtext.data.field.Field at 0x2a9d2336a20>),
 ('label', <torchtext.data.field.Field at 0x2a9d2336780>)]

Help documentation for Example

from torchtext.data import Example
help(Example)
Help on class Example in module torchtext.data.example:

class Example(builtins.object)
 |  Defines a single training or test example.
 |  
 |  Stores each column of the example as an attribute.
 |  
 |  Class methods defined here:
 |  
 |  fromCSV(data, fields, field_to_index=None) from builtins.type
 |  
 |  fromJSON(data, fields) from builtins.type
 |  
 |  fromdict(data, fields) from builtins.type
 |  
 |  fromlist(data, fields) from builtins.type
 |  
 |  fromtree(data, fields, subtrees=False) from builtins.type
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Building an Example object

from torchtext.data import Field
from torchtext.data import Example

# Use the fields built above; this also serves as a check that those fields are correct
one_example = Example.fromlist(["我是数据,很长的数据", 1], fields)     # 1 is the label
one_example
<torchtext.data.example.Example at 0x2a9d22e2ba8>
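  Each column of the Example is stored as an attribute named after its field, so the tokenized text and the label can be inspected directly (the segmentation shown is only indicative):

print(one_example.text)    # the word list produced by word_cut, e.g. ['我', '是', '数据', '很长', '的', '数据']
print(one_example.label)   # 1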

Building the list of Examples

import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
# ----------------------------------------------------------------------
# 2. The examples list (list(Example)) that the dataset needs:
# Read the tsv file with pandas; other approaches such as the csv module also work.
data = pd.read_csv("datasets/train.tsv", sep='\t')   # csv: Comma-Separated Values, tsv: Tab-Separated Values

examples = []
for txt, lab in zip(data["text"], data["label"]):
    one_example = Example.fromlist([txt, lab], fields)
    examples.append(one_example)
examples[0:5]    # show the first 5
[<torchtext.data.example.Example at 0x2a9d233c4a8>,
 <torchtext.data.example.Example at 0x2a9b5fee9e8>,
 <torchtext.data.example.Example at 0x2a9d8fae9e8>,
 <torchtext.data.example.Example at 0x2a9d8faea90>,
 <torchtext.data.example.Example at 0x2a9d8fae9b0>]

Constructing the Dataset

from torchtext.data import Dataset

# This Dataset differs from torch.utils.data's Dataset: the DataLoader in torch.utils.data requires aligned data, i.e. every record must have the same length.
dataset = Dataset(examples, fields)
dataset
<torchtext.data.dataset.Dataset at 0x2a9b7f1a780>

A closer look at the Dataset

help(dataset)
Help on Dataset in module torchtext.data.dataset object:

class Dataset(torch.utils.data.dataset.Dataset)
 |  Defines a dataset composed of Examples along with its Fields.
 |  
 |  Attributes:
 |      sort_key (callable): A key to use for sorting dataset examples for batching
 |          together examples with similar lengths to minimize padding.
 |      examples (list(Example)): The examples in this dataset.
 |      fields (dict[str, Field]): Contains the name of each column or field, together
 |          with the corresponding Field object. Two fields with the same Field object
 |          will have a shared vocabulary.
 |  
 |  Method resolution order:
 |      Dataset
 |      torch.utils.data.dataset.Dataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getattr__(self, attr)
 |  
 |  __getitem__(self, i)
 |  
 |  __init__(self, examples, fields, filter_pred=None)
 |      Create a dataset from a list of Examples and Fields.
 |      
 |      Arguments:
 |          examples: List of Examples.
 |          fields (List(tuple(str, Field))): The Fields to use in this tuple. The
 |              string is a field name, and the Field is the associated field.
 |          filter_pred (callable or None): Use only examples for which
 |              filter_pred(example) is True, or use all examples if None.
 |              Default is None.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  filter_examples(self, field_names)
 |      Remove unknown words from dataset examples with respect to given field.
 |      
 |      Arguments:
 |          field_names (list(str)): Within example only the parts with field names in
 |              field_names will have their unknown words deleted.
 |  
 |  split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
 |      Create train-test(-valid?) splits from the instance's examples.
 |      
 |      Arguments:
 |          split_ratio (float or List of floats): a number [0, 1] denoting the amount
 |              of data to be used for the training split (rest is used for test),
 |              or a list of numbers denoting the relative sizes of train, test and valid
 |              splits respectively. If the relative size for valid is missing, only the
 |              train-test split is returned. Default is 0.7 (for the train set).
 |          stratified (bool): whether the sampling should be stratified.
 |              Default is False.
 |          strata_field (str): name of the examples Field stratified over.
 |              Default is 'label' for the conventional label field.
 |          random_state (tuple): the random seed used for shuffling.
 |              A return value of `random.getstate()`.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if the splits are provided.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  download(root, check=None) from builtins.type
 |      Download and unzip an online archive (.zip, .gz, or .tgz).
 |      
 |      Arguments:
 |          root (str): Folder to download data to.
 |          check (str or None): Folder whose existence indicates
 |              that the dataset has already been downloaded, or
 |              None to check the existence of root/{cls.name}.
 |      
 |      Returns:
 |          str: Path to extracted dataset.
 |  
 |  splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs) from builtins.type
 |      Create Dataset objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          path (str): Common prefix of the splits' file paths, or None to use
 |              the result of cls.download(root).
 |          root (str): Root dataset storage directory. Default is '.data'.
 |          train (str): Suffix to add to path for the train set, or None for no
 |              train set. Default is None.
 |          validation (str): Suffix to add to path for the validation set, or None
 |              for no validation set. Default is None.
 |          test (str): Suffix to add to path for the test set, or None for no test
 |              set. Default is None.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              Dataset (sub)class being used.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if provided.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  sort_key = None
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __add__(self, other)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
  1. Traversing the dataset, method 1: by index
# Dataset access and traversal
for i in range(5):  # len(dataset)
    print(dataset[i])
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
<torchtext.data.example.Example object at 0x000002A9B5FEE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAEA90>
<torchtext.data.example.Example object at 0x000002A9D8FAE9B0>
  2. Traversing the dataset, method 2: as an iterator
# Dataset access and traversal
for one_ex in dataset:
    print(one_ex)
    break
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
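  The split method documented above can also carve this Dataset into train/validation subsets before batching. A minimal sketch (the ratio is arbitrary):

train_set, valid_set = dataset.split(split_ratio=0.8)
print(len(train_set), len(valid_set))   # roughly an 80/20 split of the examples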

Building batches of data

Help documentation for Iterator

from torchtext.data import Iterator
help(Iterator)
Help on class Iterator in module torchtext.data.iterator:

class Iterator(builtins.object)
 |  Defines an iterator that loads batches of data from a Dataset.
 |  
 |  Attributes:
 |      dataset: The Dataset object to load Examples from.
 |      batch_size: Batch size.
 |      batch_size_fn: Function of three arguments (new example to add, current
 |          count of examples in the batch, and current effective batch size)
 |          that returns the new effective batch size resulting from adding
 |          that example to a batch. This is useful for dynamic batching, where
 |          this function would add to the current effective batch size the
 |          number of tokens in the new example.
 |      sort_key: A key to use for sorting examples in order to batch together
 |          examples with similar lengths and minimize padding. The sort_key
 |          provided to the Iterator constructor overrides the sort_key
 |          attribute of the Dataset, or defers to it if None.
 |      train: Whether the iterator represents a train set.
 |      repeat: Whether to repeat the iterator for multiple epochs. Default: False.
 |      shuffle: Whether to shuffle examples between epochs.
 |      sort: Whether to sort examples according to self.sort_key.
 |          Note that shuffle and sort default to train and (not train).
 |      sort_within_batch: Whether to sort (in descending order according to
 |          self.sort_key) within each batch. If None, defaults to self.sort.
 |          If self.sort is True and this is False, the batch is left in the
 |          original (ascending) sorted order.
 |      device (str or `torch.device`): A string or instance of `torch.device`
 |          specifying which device the Variables are going to be created on.
 |          If left as default, the tensors will be created on cpu. Default: None.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  create_batches(self)
 |  
 |  data(self)
 |      Return the examples in the dataset in order, sorted, or shuffled.
 |  
 |  init_epoch(self)
 |      Set up the batch generator for a new epoch.
 |  
 |  load_state_dict(self, state_dict)
 |  
 |  state_dict(self)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  splits(datasets, batch_sizes=None, **kwargs) from builtins.type
 |      Create Iterator objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          datasets: Tuple of Dataset objects corresponding to the splits. The
 |              first such object should be the train set.
 |          batch_sizes: Tuple of batch sizes to use for the different splits,
 |              or None to use the same batch_size for all splits.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              iterator class being used.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  epoch

Building Iterator objects with the splits function

from torchtext.data import Iterator
print(len(dataset))
it_dataset, = Iterator.splits((dataset, ), batch_sizes=(100, )) 
it_dataset, len(it_dataset)
6300

(<torchtext.data.iterator.Iterator at 0x2a9d9c0b0b8>, 63)
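  As the help above notes, giving the iterator a sort_key lets it batch examples of similar length together and minimize padding. An optional variant (a sketch; the key assumes the "text" field defined earlier):

it_sorted, = Iterator.splits(
    (dataset, ),
    batch_sizes=(100, ),
    sort_key=lambda ex: len(ex.text),   # group examples with similar token counts
    sort_within_batch=True,
)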

Word vectors and building the vocabulary

Pretrained word vectors

from torchtext.vocab import Vectors
# Loading the pretrained vector file takes a moment.
vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
vectors
  0%|          | 0/259922 [00:00<?, ?it/s]Skipping token b'259922' with 1-dimensional vector [b'300']; likely a header
100%|██████████████████████████████████████████████████| 259922/259922 [00:30<00:00, 8568.81it/s]

<torchtext.vocab.Vectors at 0x2a9d9b11ac8>
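  The loaded Vectors object can be inspected before it is attached to any vocabulary: dim is the embedding size, and stoi maps words to row indices (the example word is an assumption and may not exist in this particular vector file):

print(vectors.dim)               # 300 for sgns.zhihu.word
print("北京" in vectors.stoi)     # True if the word has a pretrained vector
print(vectors["北京"].shape)      # torch.Size([300]); unknown words get a zero vector by default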

Building the vocabulary from the word vectors

# The text field uses the pretrained word vectors
fld_text.build_vocab(dataset, vectors=vectors)   # the vectors loaded above

# The labels are integers and need no word vectors.
fld_label.build_vocab(dataset)
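  After build_vocab, the Field holds a Vocab object with stoi/itos lookups, and vocab.vectors is the pretrained embedding matrix aligned to that vocabulary (the numbers below depend on the data and are only indicative):

print(len(fld_text.vocab))              # vocabulary size, e.g. 11361
print(fld_text.vocab.vectors.shape)     # torch.Size([vocab_size, 300])
print(fld_text.vocab.stoi['<pad>'])     # 1, which matches the padding value seen in the batch tensors below
print(fld_text.vocab.itos[:2])          # ['<unk>', '<pad>']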

Using the dataset

Iterating

for item  in  it_dataset:
    print(item)
[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

 .......

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 53x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

Fetching data

  1. Fetching the text
for item  in  it_dataset:
    print(item.text)    # item.label
tensor([[ 284, 2568,  115,  ...,   66,   62,   14],
        [1041,    2,  990,  ...,  848,   92,  158],
        [ 445,  369,   17,  ...,   19,  585, 1103],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
......
tensor([[  96,  548,  197,  ...,   45,   12,   47],
        [ 635, 1167,   62,  ..., 1036, 1306,   10],
        [9668,   14,   14,  ...,  357, 1329,   36],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
  2. Fetching the labels
for item  in  it_dataset:
    print(item.label)
tensor([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
        1, 0, 1, 0])

......
tensor([1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
        1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 1])

Applying TorchText to text classification

Dataset processing

Wrapping the processing in a function

import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
from torchtext.data import Dataset
from torchtext.data import Iterator
from torchtext.vocab import Vectors
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')

fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False     # defaults to True
fld_label.use_vocab = False      # defaults to True

# The feature (text) field
fld_text.sequential = True       # defaults to True
fld_text.use_vocab = True        # defaults to True
fld_text.batch_first = True

# Because sequential is True, a tokenize function must be specified
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut

# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]   # two fields

def load_data(data_file):
    data = pd.read_csv(data_file, sep='\t')   # csv: Comma-Separated Values, tsv: Tab-Separated Values

    examples = []
    for txt, lab in zip(data["text"], data["label"]):
        one_example = Example.fromlist([txt, lab], fields)
        examples.append(one_example)

    dataset = Dataset(examples, fields)

    it_dataset, = Iterator.splits((dataset, ), batch_sizes=(1000, ))    # too large a batch size can overflow GPU memory

    vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
    fld_text.build_vocab(dataset, vectors=vectors)     # the pretrained vectors from above
    # The labels are integers and need no word vectors.
    fld_label.build_vocab(dataset)

    return it_dataset

Loading the training and validation sets

it_train = load_data("datasets/train.tsv")
it_valid = load_data("datasets/valid.tsv")
it_train, it_valid
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\gaoke\AppData\Local\Temp\jieba.cache
Loading model cost 0.570 seconds.
Prefix dict has been built successfully.


(<torchtext.data.iterator.Iterator at 0x1c9406f3588>,
 <torchtext.data.iterator.Iterator at 0x1c9406f3588>)

The model

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: [batch, seq_len] (the text field was built with batch_first=True)
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.rnn(embedded)
        # Concatenate the final forward and backward hidden states of the top layer
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))

        return self.fc(hidden.squeeze(0))

Training

The core training function

import torch.nn.functional as F

def train(train_iter, valid_iter, model):
    # Training hyperparameters
    EPOCHES = 10
    CUDA = torch.cuda.is_available()   # set CUDA = False if GPU memory is insufficient
    if CUDA:
        model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(1, EPOCHES):
        for batch in train_iter:  # training set
            feature, target = batch.text, batch.label
            if CUDA:
                feature, target = feature.cuda(), target.cuda()
            optimizer.zero_grad()
            logits = model(feature)
            loss = F.cross_entropy(logits, target)
            loss.backward()
            optimizer.step()

        # Measure prediction accuracy on the validation set
        corrects = 0.0
        with torch.no_grad():
            sample_num = 0   # number of samples seen
            for item in valid_iter:
                feature, target = item.text, item.label
                if CUDA:
                    feature, target = feature.cuda(), target.cuda()
                logits = model(feature)
                corrects += (torch.max(logits, 1)[1].view(target.size()).data == target.data).sum()
                sample_num += len(feature)
            print(f"Epoch: {epoch:03d},\taccuracy: {corrects/sample_num}")

Preparing for training

# Parameters
vocabulary_size = len(fld_text.vocab)
embedding_dim = fld_text.vocab.vectors.size()[-1]
class_num = len(fld_label.vocab)
hidden_dim = 128
print(vocabulary_size, embedding_dim, hidden_dim, class_num)
# Build the network model
net = RNN(vocabulary_size, embedding_dim, hidden_dim, class_num)
11361 300 128 4
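  Note that nn.Embedding starts from a random initialization; the pretrained vectors attached to the vocabulary are not used automatically. An optional extra step (a sketch) is to copy them into the embedding layer before training:

# Shapes match: fld_text.vocab.vectors is [vocab_size, 300], the same as net.embedding.weight.
net.embedding.weight.data.copy_(fld_text.vocab.vectors)
# Optionally zero the padding row so <pad> tokens contribute nothing.
net.embedding.weight.data[fld_text.vocab.stoi['<pad>']] = 0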

Training and validation

print("Start training....")
train(it_train, it_valid, net)

# Save the model
torch.save(net.state_dict(), "rnn.model")
Start training....
Epoch: 001,  accuracy: 0.9114285707473755
Epoch: 002,  accuracy: 0.9372857213020325
Epoch: 003,  accuracy: 0.9451428651809692
Epoch: 004,  accuracy: 0.9494285583496094
Epoch: 005,  accuracy: 0.9472857117652893
Epoch: 006,  accuracy: 0.9490000009536743
Epoch: 007,  accuracy: 0.951714277267456
Epoch: 008,  accuracy: 0.953000009059906
Epoch: 009,  accuracy: 0.9485714435577393
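  To use the trained model for prediction, a new sentence goes through the same pipeline: tokenize with word_cut, map the tokens to indices via the text vocabulary, and feed a [1, seq_len] LongTensor to the network. A sketch (the sample sentence is made up):

def predict(model, sentence):
    model.eval()
    tokens = word_cut(sentence)
    ids = [fld_text.vocab.stoi[t] for t in tokens]   # unknown words map to <unk>
    x = torch.LongTensor(ids).unsqueeze(0)           # shape [1, seq_len] because batch_first=True
    x = x.to(next(model.parameters()).device)        # follow the model onto CPU/GPU
    with torch.no_grad():
        logits = model(x)
    return logits.argmax(-1).item()                  # predicted label index

print(predict(net, "这个产品的质量真的很不错"))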
