TORCH04-01 TorchText for Text Dataset Processing

2020-04-07  杨强AT南京

  I used to get by with the dataset utilities under torch.utils.data in PyTorch, but its DataLoader requires samples of equal length, and every data source needs its own ad-hoc handling.
  PyTorch provides the torchtext.data module for text processing; combined with a Chinese word-segmentation tool, it covers most everyday text-processing needs.
  This post is an introduction to torchtext. It mainly covers the use of Field, Example, Dataset, and Vectors, and ends with a text-classification example built on an LSTM network. torchtext really is a powerful toolkit.


torchtext module structure

Structure of torchtext.data

Diagram of the typical TorchText workflow

A TorchText usage example

Environment setup

  1. Install torchtext
    • pip install torchtext
  2. Optional install 1 - tokenizer
    • pip install spacy
    • python -m spacy download en
  3. Optional install 2 - tokenizer
    • pip install sacremoses
  4. Install tokenizer - jieba (Chinese word segmentation)
    • pip install jieba

Data source

Data source file and data format

Defining fields with Field

Help documentation for the Field class

from torchtext.data import Field

Field?
Init signature:
Field(
    sequential=True,
    use_vocab=True,
    init_token=None,
    eos_token=None,
    fix_length=None,
    dtype=torch.int64,
    preprocessing=None,
    postprocessing=None,
    lower=False,
    tokenize=None,
    tokenizer_language='en',
    include_lengths=False,
    batch_first=False,
    pad_token='<pad>',
    unk_token='<unk>',
    pad_first=False,
    truncate_first=False,
    stop_words=None,
    is_target=False,
)
Docstring:
Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented
by tensors.  It holds a Vocab object that defines the set of possible values
for elements of the field and their corresponding numerical representations.
The Field object also holds other parameters relating to how a datatype
should be numericalized, such as a tokenization method and the kind of
Tensor that should be produced.

If a Field is shared between two columns in a dataset (e.g., question and
answer in a QA dataset), then they will have a shared vocabulary.

Attributes:
    sequential: Whether the datatype represents sequential data. If False,
        no tokenization is applied. Default: True.
    use_vocab: Whether to use a Vocab object. If False, the data in this
        field should already be numerical. Default: True.
    init_token: A token that will be prepended to every example using this
        field, or None for no initial token. Default: None.
    eos_token: A token that will be appended to every example using this
        field, or None for no end-of-sentence token. Default: None.
    fix_length: A fixed length that all examples using this field will be
        padded to, or None for flexible sequence lengths. Default: None.
    dtype: The torch.dtype class that represents a batch of examples
        of this kind of data. Default: torch.long.
    preprocessing: The Pipeline that will be applied to examples
        using this field after tokenizing but before numericalizing. Many
        Datasets replace this attribute with a custom preprocessor.
        Default: None.
    postprocessing: A Pipeline that will be applied to examples using
        this field after numericalizing but before the numbers are turned
        into a Tensor. The pipeline function takes the batch as a list, and
        the field's Vocab.
        Default: None.
    lower: Whether to lowercase the text in this field. Default: False.
    tokenize: The function used to tokenize strings using this field into
        sequential examples. If "spacy", the SpaCy tokenizer is
        used. If a non-serializable function is passed as an argument,
        the field will not be able to be serialized. Default: string.split.
    tokenizer_language: The language of the tokenizer to be constructed.
        Various languages currently supported only in SpaCy.
    include_lengths: Whether to return a tuple of a padded minibatch and
        a list containing the lengths of each examples, or just a padded
        minibatch. Default: False.
    batch_first: Whether to produce tensors with the batch dimension first.
        Default: False.
    pad_token: The string token used as padding. Default: "<pad>".
    unk_token: The string token used to represent OOV words. Default: "<unk>".
    pad_first: Do the padding of the sequence at the beginning. Default: False.
    truncate_first: Do the truncating of the sequence at the beginning. Default: False
    stop_words: Tokens to discard during the preprocessing step. Default: None
    is_target: Whether this field is a target variable.
        Affects iteration over batches. Default: False
File:           c:\program files\python36\lib\site-packages\torchtext\data\field.py
Type:           type
Subclasses:     ReversibleField, NestedField, LabelField, ShiftReduceField, ParsedTextField, BABI20Field

Notes on the Field class

CLASS torchtext.data.Field(
    sequential=True,         # Whether the data is sequential. Default True; if False, no tokenization is applied.
    use_vocab=True,          # Whether to use a Vocab object. Default True; if False, the data in this field must already be numerical.
    init_token=None,         # A token prepended to every example of this field, or None for no initial token.
    eos_token=None,          # A token appended to every example of this field, or None for no end-of-sentence token.
    fix_length=None,         # Pad all examples of this field to this fixed length, or None for flexible lengths.
    dtype=torch.int64,       # The torch.dtype of the resulting batch tensor.
    preprocessing=None,      # Pipeline applied after tokenization but before numericalization.
    postprocessing=None,     # Pipeline applied after numericalization but before the numbers are turned into a tensor.
    lower=False,             # Whether to lowercase the text.
    tokenize=None,           # Function that splits the text into a list of tokens; default is string.split. If "spacy", the SpaCy tokenizer is used.
    tokenizer_language='en', # Language of the tokenizer; languages other than 'en' are currently supported only via SpaCy.
    include_lengths=False,   # Whether to return a tuple of (padded minibatch, list of lengths) instead of just the padded minibatch.
    batch_first=False,       # Whether to put the batch dimension first (as required by network modules such as LSTM with batch_first=True).
    pad_token='<pad>',       # String token used for padding.
    unk_token='<unk>',       # String token used for OOV (out-of-vocabulary) words, i.e. words not in the vocabulary.
    pad_first=False,         # Where to pad: at the beginning (True) or at the end (False) of the sequence.
    truncate_first=False,    # Where to truncate over-long text: at the beginning (True) or at the end (False).
    stop_words=None,         # Tokens (stop words) to discard during the preprocessing step.
    is_target=False)         # Whether this field is the target/label field.
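  The same configuration can also be passed straight to the Field constructor instead of being assigned attribute by attribute afterwards. A minimal sketch (the variable names are just examples):

from torchtext.data import Field

# Equivalent to creating a default Field and then setting its attributes:
LABEL = Field(sequential=False, use_vocab=False)
TEXT = Field(sequential=True, use_vocab=True, lower=False, batch_first=True)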

Examples of building Field objects

  1. Build the default objects
    • The index column is not data we need, so no field is defined for it.
from torchtext.data import Field
fld_label = Field()
fld_text = Field()
  2. Set the basic attributes
# The label field is simple
fld_label.sequential = False     # defaults to True
fld_label.use_vocab = False      # defaults to True

# The feature (text) field
fld_text.sequential = True       # defaults to True
fld_text.use_vocab = True        # defaults to True
# Because sequential is True, the tokenize attribute must be specified

  3. Set the tokenize attribute to a word-segmentation function
    • Requirements for this function:
      1. Argument: the feature of one sample (the text field) as a string.
      2. Return: a list of the words after segmentation, so the field's data becomes a word list instead of a raw string.
import re
import jieba

# Keep only Chinese characters, ASCII letters and digits; everything else becomes a space.
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')

def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]

fld_text.tokenize = word_cut
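  A quick sanity check of the tokenizer (the sample sentence and its segmentation are only illustrative; jieba's actual output may differ slightly):

print(word_cut("今天天气很好,我们去公园散步。"))
# e.g. ['今天天气', '很好', '我们', '去', '公园', '散步']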

Building the dataset

Help documentation for Dataset

from torchtext.data import Dataset
Dataset?
Init signature: Dataset(examples, fields, filter_pred=None)
Docstring:
Defines a dataset composed of Examples along with its Fields.

Attributes:
    sort_key (callable): A key to use for sorting dataset examples for batching
        together examples with similar lengths to minimize padding.
    examples (list(Example)): The examples in this dataset.
    fields (dict[str, Field]): Contains the name of each column or field, together
        with the corresponding Field object. Two fields with the same Field object
        will have a shared vocabulary.
Init docstring:
Create a dataset from a list of Examples and Fields.

Arguments:
    examples: List of Examples.
    fields (List(tuple(str, Field))): The Fields to use in this tuple. The
        string is a field name, and the Field is the associated field.
    filter_pred (callable or None): Use only examples for which
        filter_pred(example) is True, or use all examples if None.
        Default is None.
File:           c:\program files\python36\lib\site-packages\torchtext\data\dataset.py
Type:           type
Subclasses:     TabularDataset, LanguageModelingDataset, SST, TranslationDataset, SequenceTaggingDataset, TREC, IMDB, BABI20

Notes on the Dataset class

  1. Constructor:
    Dataset(examples, fields, filter_pred=None)
          # examples: the data, a list of Example objects.
          # fields: the field list, a list of tuple(str, Field).
          # filter_pred: a callable used to filter the dataset; an example is kept only if the function returns True for it. If None, all examples are used (see the sketch after this list).
  2. Attributes:
    1. sort_key: a callable used to sort examples for batching.
    2. examples: list(Example).
    3. fields: dict[str, Field].
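  A minimal sketch of filter_pred with toy data (the sentences and field settings below are made up, purely to show that filtering happens at construction time):

from torchtext.data import Dataset, Example, Field

TEXT = Field(sequential=True, tokenize=str.split)
LABEL = Field(sequential=False, use_vocab=False)
toy_fields = [("text", TEXT), ("label", LABEL)]
toy_examples = [Example.fromlist(["short text", 0], toy_fields),
                Example.fromlist(["a much longer piece of text right here", 1], toy_fields)]

# Keep only examples whose tokenized text has fewer than 5 tokens.
toy_dataset = Dataset(toy_examples, toy_fields,
                      filter_pred=lambda ex: len(ex.text) < 5)
print(len(toy_dataset))   # expected: 1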

Building the fields for the dataset

from torchtext.data import Field
import re
import jieba

regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')
# ----------------------------------------------------------------------
# 1. The Fields the dataset needs: fields (List(tuple(str, Field)))

fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False     # defaults to True
fld_label.use_vocab = False      # defaults to True

# The feature (text) field
fld_text.sequential = True       # defaults to True
fld_text.use_vocab = True        # defaults to True

# Because sequential is True, a tokenize function must be specified
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut

# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]   # two fields

fields
[('text', <torchtext.data.field.Field at 0x2a9d2336a20>),
 ('label', <torchtext.data.field.Field at 0x2a9d2336780>)]

Help documentation for Example

from torchtext.data import Example
help(Example)
Help on class Example in module torchtext.data.example:

class Example(builtins.object)
 |  Defines a single training or test example.
 |  
 |  Stores each column of the example as an attribute.
 |  
 |  Class methods defined here:
 |  
 |  fromCSV(data, fields, field_to_index=None) from builtins.type
 |  
 |  fromJSON(data, fields) from builtins.type
 |  
 |  fromdict(data, fields) from builtins.type
 |  
 |  fromlist(data, fields) from builtins.type
 |  
 |  fromtree(data, fields, subtrees=False) from builtins.type
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

Building an Example object

from torchtext.data import Field
from torchtext.data import Example

# Use the fields built above; this also serves as a check that those fields are correct
one_example = Example.fromlist(["我是数据,很长的数据", 1], fields)     # 1 is the label
one_example
<torchtext.data.example.Example at 0x2a9d22e2ba8>
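  Each column of the Example is stored as an attribute named after its field, so the tokenized text and the label can be inspected directly (the segmentation shown is only indicative):

print(one_example.text)    # the word list produced by word_cut, e.g. ['我', '是', '数据', '很长', '的', '数据']
print(one_example.label)   # 1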

Building the list of Examples

import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
# ----------------------------------------------------------------------
# 2. The examples list (list(Example)) that the dataset needs:
# Read the tsv file with pandas; other approaches such as the csv module also work.
data = pd.read_csv("datasets/train.tsv", sep='\t')   # csv: Comma-Separated Values, tsv: Tab-Separated Values

examples = []
for txt, lab in zip(data["text"], data["label"]):
    one_example = Example.fromlist([txt, lab], fields)
    examples.append(one_example)
examples[0:5]    # show the first 5
[<torchtext.data.example.Example at 0x2a9d233c4a8>,
 <torchtext.data.example.Example at 0x2a9b5fee9e8>,
 <torchtext.data.example.Example at 0x2a9d8fae9e8>,
 <torchtext.data.example.Example at 0x2a9d8faea90>,
 <torchtext.data.example.Example at 0x2a9d8fae9b0>]

Constructing the Dataset

from torchtext.data import Dataset

# This Dataset differs from torch.utils.data's Dataset: the DataLoader in torch.utils.data requires aligned data, i.e. every record must have the same length.
dataset = Dataset(examples, fields)
dataset
<torchtext.data.dataset.Dataset at 0x2a9b7f1a780>

A closer look at the Dataset

help(dataset)
Help on Dataset in module torchtext.data.dataset object:

class Dataset(torch.utils.data.dataset.Dataset)
 |  Defines a dataset composed of Examples along with its Fields.
 |  
 |  Attributes:
 |      sort_key (callable): A key to use for sorting dataset examples for batching
 |          together examples with similar lengths to minimize padding.
 |      examples (list(Example)): The examples in this dataset.
 |      fields (dict[str, Field]): Contains the name of each column or field, together
 |          with the corresponding Field object. Two fields with the same Field object
 |          will have a shared vocabulary.
 |  
 |  Method resolution order:
 |      Dataset
 |      torch.utils.data.dataset.Dataset
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getattr__(self, attr)
 |  
 |  __getitem__(self, i)
 |  
 |  __init__(self, examples, fields, filter_pred=None)
 |      Create a dataset from a list of Examples and Fields.
 |      
 |      Arguments:
 |          examples: List of Examples.
 |          fields (List(tuple(str, Field))): The Fields to use in this tuple. The
 |              string is a field name, and the Field is the associated field.
 |          filter_pred (callable or None): Use only examples for which
 |              filter_pred(example) is True, or use all examples if None.
 |              Default is None.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  filter_examples(self, field_names)
 |      Remove unknown words from dataset examples with respect to given field.
 |      
 |      Arguments:
 |          field_names (list(str)): Within example only the parts with field names in
 |              field_names will have their unknown words deleted.
 |  
 |  split(self, split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
 |      Create train-test(-valid?) splits from the instance's examples.
 |      
 |      Arguments:
 |          split_ratio (float or List of floats): a number [0, 1] denoting the amount
 |              of data to be used for the training split (rest is used for test),
 |              or a list of numbers denoting the relative sizes of train, test and valid
 |              splits respectively. If the relative size for valid is missing, only the
 |              train-test split is returned. Default is 0.7 (for the train set).
 |          stratified (bool): whether the sampling should be stratified.
 |              Default is False.
 |          strata_field (str): name of the examples Field stratified over.
 |              Default is 'label' for the conventional label field.
 |          random_state (tuple): the random seed used for shuffling.
 |              A return value of `random.getstate()`.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if the splits are provided.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  download(root, check=None) from builtins.type
 |      Download and unzip an online archive (.zip, .gz, or .tgz).
 |      
 |      Arguments:
 |          root (str): Folder to download data to.
 |          check (str or None): Folder whose existence indicates
 |              that the dataset has already been downloaded, or
 |              None to check the existence of root/{cls.name}.
 |      
 |      Returns:
 |          str: Path to extracted dataset.
 |  
 |  splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs) from builtins.type
 |      Create Dataset objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          path (str): Common prefix of the splits' file paths, or None to use
 |              the result of cls.download(root).
 |          root (str): Root dataset storage directory. Default is '.data'.
 |          train (str): Suffix to add to path for the train set, or None for no
 |              train set. Default is None.
 |          validation (str): Suffix to add to path for the validation set, or None
 |              for no validation set. Default is None.
 |          test (str): Suffix to add to path for the test set, or None for no test
 |              set. Default is None.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              Dataset (sub)class being used.
 |      
 |      Returns:
 |          Tuple[Dataset]: Datasets for train, validation, and
 |          test splits in that order, if provided.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  sort_key = None
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __add__(self, other)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from torch.utils.data.dataset.Dataset:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
  1. Traversing the dataset, method 1: by index
# Dataset access and traversal
for i in range(5):  # len(dataset)
    print(dataset[i])
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
<torchtext.data.example.Example object at 0x000002A9B5FEE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAE9E8>
<torchtext.data.example.Example object at 0x000002A9D8FAEA90>
<torchtext.data.example.Example object at 0x000002A9D8FAE9B0>
  2. Traversing the dataset, method 2: as an iterator
# Dataset access and traversal
for one_ex in dataset:
    print(one_ex)
    break
<torchtext.data.example.Example object at 0x000002A9D233C4A8>
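  The split method documented above can also carve this Dataset into train/validation subsets before batching. A minimal sketch (the ratio is arbitrary):

train_set, valid_set = dataset.split(split_ratio=0.8)
print(len(train_set), len(valid_set))   # roughly an 80/20 split of the examples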

Building batches of data

Help documentation for Iterator

from torchtext.data import Iterator
help(Iterator)
Help on class Iterator in module torchtext.data.iterator:

class Iterator(builtins.object)
 |  Defines an iterator that loads batches of data from a Dataset.
 |  
 |  Attributes:
 |      dataset: The Dataset object to load Examples from.
 |      batch_size: Batch size.
 |      batch_size_fn: Function of three arguments (new example to add, current
 |          count of examples in the batch, and current effective batch size)
 |          that returns the new effective batch size resulting from adding
 |          that example to a batch. This is useful for dynamic batching, where
 |          this function would add to the current effective batch size the
 |          number of tokens in the new example.
 |      sort_key: A key to use for sorting examples in order to batch together
 |          examples with similar lengths and minimize padding. The sort_key
 |          provided to the Iterator constructor overrides the sort_key
 |          attribute of the Dataset, or defers to it if None.
 |      train: Whether the iterator represents a train set.
 |      repeat: Whether to repeat the iterator for multiple epochs. Default: False.
 |      shuffle: Whether to shuffle examples between epochs.
 |      sort: Whether to sort examples according to self.sort_key.
 |          Note that shuffle and sort default to train and (not train).
 |      sort_within_batch: Whether to sort (in descending order according to
 |          self.sort_key) within each batch. If None, defaults to self.sort.
 |          If self.sort is True and this is False, the batch is left in the
 |          original (ascending) sorted order.
 |      device (str or `torch.device`): A string or instance of `torch.device`
 |          specifying which device the Variables are going to be created on.
 |          If left as default, the tensors will be created on cpu. Default: None.
 |  
 |  Methods defined here:
 |  
 |  __init__(self, dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |  
 |  __len__(self)
 |  
 |  create_batches(self)
 |  
 |  data(self)
 |      Return the examples in the dataset in order, sorted, or shuffled.
 |  
 |  init_epoch(self)
 |      Set up the batch generator for a new epoch.
 |  
 |  load_state_dict(self, state_dict)
 |  
 |  state_dict(self)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  splits(datasets, batch_sizes=None, **kwargs) from builtins.type
 |      Create Iterator objects for multiple splits of a dataset.
 |      
 |      Arguments:
 |          datasets: Tuple of Dataset objects corresponding to the splits. The
 |              first such object should be the train set.
 |          batch_sizes: Tuple of batch sizes to use for the different splits,
 |              or None to use the same batch_size for all splits.
 |          Remaining keyword arguments: Passed to the constructor of the
 |              iterator class being used.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  epoch

Building Iterator objects with the splits function

from torchtext.data import Iterator
print(len(dataset))
it_dataset, = Iterator.splits((dataset, ), batch_sizes=(100, )) 
it_dataset, len(it_dataset)
6300

(<torchtext.data.iterator.Iterator at 0x2a9d9c0b0b8>, 63)
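  As the help above notes, giving the iterator a sort_key lets it batch examples of similar length together and minimize padding. An optional variant (a sketch; the key assumes the "text" field defined earlier):

it_sorted, = Iterator.splits(
    (dataset, ),
    batch_sizes=(100, ),
    sort_key=lambda ex: len(ex.text),   # group examples with similar token counts
    sort_within_batch=True,
)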

Word vectors and building the vocabulary

Pretrained word vectors

from torchtext.vocab import Vectors
# Loading the pretrained vector file takes a moment.
vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
vectors
  0%|          | 0/259922 [00:00<?, ?it/s]Skipping token b'259922' with 1-dimensional vector [b'300']; likely a header
100%|██████████████████████████████████████████████████| 259922/259922 [00:30<00:00, 8568.81it/s]

<torchtext.vocab.Vectors at 0x2a9d9b11ac8>
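  The loaded Vectors object can be inspected before it is attached to any vocabulary: dim is the embedding size, and stoi maps words to row indices (the example word is an assumption and may not exist in this particular vector file):

print(vectors.dim)               # 300 for sgns.zhihu.word
print("北京" in vectors.stoi)     # True if the word has a pretrained vector
print(vectors["北京"].shape)      # torch.Size([300]); unknown words get a zero vector by default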

Building the vocabulary from the word vectors

# The text field uses the pretrained word vectors
fld_text.build_vocab(dataset, vectors=vectors)   # the vectors loaded above

# The labels are integers and need no word vectors.
fld_label.build_vocab(dataset)
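  After build_vocab, the Field holds a Vocab object with stoi/itos lookups, and vocab.vectors is the pretrained embedding matrix aligned to that vocabulary (the numbers below depend on the data and are only indicative):

print(len(fld_text.vocab))              # vocabulary size, e.g. 11361
print(fld_text.vocab.vectors.shape)     # torch.Size([vocab_size, 300])
print(fld_text.vocab.stoi['<pad>'])     # 1, which matches the padding value seen in the batch tensors below
print(fld_text.vocab.itos[:2])          # ['<unk>', '<pad>']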

Using the dataset

Iterating

for item  in  it_dataset:
    print(item)
[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

 .......

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 54x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 53x100]
    [.label]:[torch.LongTensor of size 100]

[torchtext.data.batch.Batch of size 100]
    [.text]:[torch.LongTensor of size 48x100]
    [.label]:[torch.LongTensor of size 100]

Fetching data

  1. Fetching the text
for item  in  it_dataset:
    print(item.text)    # item.label
tensor([[ 284, 2568,  115,  ...,   66,   62,   14],
        [1041,    2,  990,  ...,  848,   92,  158],
        [ 445,  369,   17,  ...,   19,  585, 1103],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
......
tensor([[  96,  548,  197,  ...,   45,   12,   47],
        [ 635, 1167,   62,  ..., 1036, 1306,   10],
        [9668,   14,   14,  ...,  357, 1329,   36],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
  2. Fetching the labels
for item  in  it_dataset:
    print(item.label)
tensor([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
        1, 0, 1, 0])

......
tensor([1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
        1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
        0, 1, 0, 1])

Applying TorchText to text classification

Dataset processing

Wrapping the processing in a function

import pandas as pd
from torchtext.data import Field
from torchtext.data import Example
from torchtext.data import Dataset
from torchtext.data import Iterator
from torchtext.vocab import Vectors
import re
import jieba
regex = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')

fld_label = Field()
fld_text = Field()
# The label field is simple
fld_label.sequential = False     # defaults to True
fld_label.use_vocab = False      # defaults to True

# The feature (text) field
fld_text.sequential = True       # defaults to True
fld_text.use_vocab = True        # defaults to True
fld_text.batch_first = True

# Because sequential is True, a tokenize function must be specified
def word_cut(text):
    text = regex.sub(' ', text)
    return [word for word in jieba.cut(text) if word.strip()]
fld_text.tokenize = word_cut

# Build the fields needed by Dataset
fields = [("text", fld_text), ("label", fld_label)]   # two fields

def load_data(data_file):
    data = pd.read_csv(data_file, sep='\t')   # csv: Comma-Separated Values, tsv: Tab-Separated Values

    examples = []
    for txt, lab in zip(data["text"], data["label"]):
        one_example = Example.fromlist([txt, lab], fields)
        examples.append(one_example)

    dataset = Dataset(examples, fields)

    it_dataset, = Iterator.splits((dataset, ), batch_sizes=(1000, ))    # too large a batch size can overflow GPU memory

    vectors = Vectors(name="sgns.zhihu.word", cache="datasets")
    fld_text.build_vocab(dataset, vectors=vectors)     # the pretrained vectors from above
    # The labels are integers and need no word vectors.
    fld_label.build_vocab(dataset)

    return it_dataset

Loading the training and validation sets

it_train = load_data("datasets/train.tsv")
it_valid = load_data("datasets/valid.tsv")
it_train, it_valid
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\gaoke\AppData\Local\Temp\jieba.cache
Loading model cost 0.570 seconds.
Prefix dict has been built successfully.


(<torchtext.data.iterator.Iterator at 0x1c9406f3588>,
 <torchtext.data.iterator.Iterator at 0x1c9406f3588>)

The model

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, bidirectional=True, dropout=0.2, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, batch_first=True, bidirectional=bidirectional)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text: [batch, seq_len] (the text field was built with batch_first=True)
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.rnn(embedded)
        # Concatenate the final forward and backward hidden states of the top layer
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))

        return self.fc(hidden.squeeze(0))

Training

The core training function

import torch.nn.functional as F

def train(train_iter, valid_iter, model):
    # Training hyperparameters
    EPOCHES = 10
    CUDA = torch.cuda.is_available()   # set CUDA = False if GPU memory is insufficient
    if CUDA:
        model.cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for epoch in range(1, EPOCHES):
        for batch in train_iter:  # training set
            feature, target = batch.text, batch.label
            if CUDA:
                feature, target = feature.cuda(), target.cuda()
            optimizer.zero_grad()
            logits = model(feature)
            loss = F.cross_entropy(logits, target)
            loss.backward()
            optimizer.step()

        # Measure prediction accuracy on the validation set
        corrects = 0.0
        with torch.no_grad():
            sample_num = 0   # number of samples seen
            for item in valid_iter:
                feature, target = item.text, item.label
                if CUDA:
                    feature, target = feature.cuda(), target.cuda()
                logits = model(feature)
                corrects += (torch.max(logits, 1)[1].view(target.size()).data == target.data).sum()
                sample_num += len(feature)
            print(f"Epoch: {epoch:03d},\taccuracy: {corrects/sample_num}")

Preparing for training

# Parameters
vocabulary_size = len(fld_text.vocab)
embedding_dim = fld_text.vocab.vectors.size()[-1]
class_num = len(fld_label.vocab)
hidden_dim = 128
print(vocabulary_size, embedding_dim, hidden_dim, class_num)
# Build the network model
net = RNN(vocabulary_size, embedding_dim, hidden_dim, class_num)
11361 300 128 4
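  Note that nn.Embedding starts from a random initialization; the pretrained vectors attached to the vocabulary are not used automatically. An optional extra step (a sketch) is to copy them into the embedding layer before training:

# Shapes match: fld_text.vocab.vectors is [vocab_size, 300], the same as net.embedding.weight.
net.embedding.weight.data.copy_(fld_text.vocab.vectors)
# Optionally zero the padding row so <pad> tokens contribute nothing.
net.embedding.weight.data[fld_text.vocab.stoi['<pad>']] = 0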

Training and validation

print("Start training....")
train(it_train, it_valid, net)

# Save the model
torch.save(net.state_dict(), "rnn.model")
Start training....
Epoch: 001,  accuracy: 0.9114285707473755
Epoch: 002,  accuracy: 0.9372857213020325
Epoch: 003,  accuracy: 0.9451428651809692
Epoch: 004,  accuracy: 0.9494285583496094
Epoch: 005,  accuracy: 0.9472857117652893
Epoch: 006,  accuracy: 0.9490000009536743
Epoch: 007,  accuracy: 0.951714277267456
Epoch: 008,  accuracy: 0.953000009059906
Epoch: 009,  accuracy: 0.9485714435577393
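  To use the trained model for prediction, a new sentence goes through the same pipeline: tokenize with word_cut, map the tokens to indices via the text vocabulary, and feed a [1, seq_len] LongTensor to the network. A sketch (the sample sentence is made up):

def predict(model, sentence):
    model.eval()
    tokens = word_cut(sentence)
    ids = [fld_text.vocab.stoi[t] for t in tokens]   # unknown words map to <unk>
    x = torch.LongTensor(ids).unsqueeze(0)           # shape [1, seq_len] because batch_first=True
    x = x.to(next(model.parameters()).device)        # follow the model onto CPU/GPU
    with torch.no_grad():
        logits = model(x)
    return logits.argmax(-1).item()                  # predicted label index

print(predict(net, "这个产品的质量真的很不错"))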
