《自然语言处理基于预训练模型的方法》笔记
写在前面
部分自己手敲代码:链接
封面图
1 绪论
预训练(Pre-train)
即首先在一个原任务上预先训练一个初始模型,然后在下游任务(目标任务)上继续对该模型进行精调(Fine-Tune),从而达到提高下游任务准确率的目的,本质上也是一种迁移学习(Transfer Learning)
2 自然语言处理基础
2.1 文本的表示
2.1.1 独热表示
One-hot Encoding无法使用余弦函数计算相似度,同时会造成数据稀疏(Data Sparsity)
2.1.2 词的分布式表示
分布式语义假设:词的含义可以由其上下文的分布进行表示
使得利用上下文共现频次构建的向量能够反映一定的词间相似性
2.1.2.1 上下文
可以使用词在句子中的一个固定窗口内的词作为其上下文,也可以使用所在的文档本身作为上下文
- 前者反映词的局部性质:具有相似词法、句法属性的词将会具有相似的向量表示
- 后者更多反映词代表的主题信息
2.1.2.2 共现频次作为词的向量表示的问题
- 高频词误导计算结果
- 高阶关系无法反映
- 仍有稀疏性问题
例子:”A“与”B“共现过,”B“与”C“共现过,”C“与”D“共现过,只能知道”A“与”C“都和”B“共现过,但”A“与”D“这种高阶关系没法知晓
2.1.2.3 奇异值分解
可以使用奇异值分解的做法解决共现频次无法反映词之间高阶关系的问题
奇异值分解后的U的每一行其实表示对应词的d维向量表示,由于U的各列相互正交,则可以认为词表示的每一维表达了该词的一种独立的”潜在语义“
分解后,上下文比较相近的词在向量空间中的距离也比较近,这种方法称为潜在语义分析(Latent Semantic Analysis,LSA)
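下面给出一个用奇异值分解做潜在语义分析的最小示例(示意性质,M为假设的小型“词-上下文”共现矩阵):
import numpy as np
# 假设的“词-上下文”共现矩阵,每行对应一个词(仅为示意)
words = ["漂亮", "美丽", "天空", "大海"]
M = np.array([
    [3., 1., 0., 0.],
    [2., 1., 0., 0.],
    [0., 0., 4., 2.],
    [0., 0., 3., 3.],
])
# 奇异值分解:M = U * diag(S) * Vt
U, S, Vt = np.linalg.svd(M, full_matrices=False)
# 取前d维作为词的低维稠密表示(潜在语义空间)
d = 2
word_vecs = U[:, :d] * S[:d]
# 用余弦相似度衡量词与词之间的相似性
def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
print(cos_sim(word_vecs[0], word_vecs[1]))  # “漂亮”与“美丽”:相似度较高
print(cos_sim(word_vecs[0], word_vecs[2]))  # “漂亮”与“天空”:相似度较低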
2.1.3 词嵌入表示
经常直接简称为词向量。利用自然语言文本中蕴含的自监督学习信号(即词与上下文的共现信息)预先训练词向量,往往会获得更好的效果
2.1.4 词袋表示
BOW(Bag-Of-Words),不考虑顺序,将文本中全部词所对应的向量表示(既可以是独热表示,也可以是分布式表示或词向量)相加,即构成了文本的向量表示。如果采用独热表示,文本向量的每一维就是相应的词在文本中出现的次数
2.2 自然语言处理任务
2.2.1 n-gram
句首加上<BOS>,句尾加上<EOS>
2.2.2 平滑
- 当n比较大或者测试句子中含有未登录词(Out-Of-Vocabulary,OOV)时,会出现零概率问题,可以使用加1平滑(Add-One Smoothing)
- 当训练集较小时,加1平滑会得到过高的概率估计,所以转为加δ平滑(0<δ<1)。例如对于bigram语言模型,平滑后的条件概率为(见下方示例代码):
$$P(w_t \mid w_{t-1})=\frac{C(w_{t-1}w_t)+\delta}{C(w_{t-1})+\delta|\mathbb{V}|}$$
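下面是bigram语言模型加δ平滑的一个简单代码示意(语料与δ的取值均为假设):
from collections import defaultdict
# 假设的小语料,已分词,句首、句尾分别加上<BOS>和<EOS>
corpus = [
    ["<BOS>", "我", "喜欢", "自然", "语言", "处理", "<EOS>"],
    ["<BOS>", "我", "喜欢", "机器", "学习", "<EOS>"],
]
unigram, bigram = defaultdict(int), defaultdict(int)
for sent in corpus:
    for i in range(len(sent) - 1):
        unigram[sent[i]] += 1
        bigram[(sent[i], sent[i + 1])] += 1
vocab = {w for sent in corpus for w in sent}
delta = 0.5  # 加delta平滑的超参数,0 < delta < 1
def prob(w_prev, w):
    # P(w | w_prev) = (C(w_prev, w) + delta) / (C(w_prev) + delta * |V|)
    return (bigram[(w_prev, w)] + delta) / (unigram[w_prev] + delta * len(vocab))
print(prob("我", "喜欢"))  # 训练中出现过的bigram
print(prob("我", "处理"))  # 未出现过的bigram,平滑后概率不为零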
2.2.3 语言模型性能评估
方法:
- 运用到具体的任务中,得到外部任务评价(计算代价高)
- 困惑度(Perplexity, PPL),内部评价
困惑度:
$$\operatorname{PPL}\left(\mathbb{D}^{\text{test}}\right)=P\left(w_{1} w_{2} \cdots w_{N}\right)^{-\frac{1}{N}}$$
直接按概率连乘计算会造成浮点数下溢,因此实际中通常转换到对数空间:
$$\operatorname{PPL}\left(\mathbb{D}^{\text{test}}\right)=2^{-\frac{1}{N} \sum_{i=1}^{N} \log _{2} P\left(w_{i} \mid w_{1: i-1}\right)}$$
困惑度越小,说明模型赋予单词序列的概率越大。不过,困惑度更低的语言模型并不总能在外部任务上得到更好的性能指标,但两者一般具有一定的正相关性。因此,困惑度是一种快速评价语言模型性能的指标
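在对数空间计算困惑度的一个简单示意如下(bigram_prob可替换为任意语言模型的条件概率函数,例如上一节示例中的prob,接口仅为假设):
import math
def perplexity(sentences, bigram_prob):
    log_prob_sum, n = 0.0, 0
    for sent in sentences:
        for i in range(1, len(sent)):
            log_prob_sum += math.log2(bigram_prob(sent[i - 1], sent[i]))
            n += 1
    # PPL = 2^(-平均对数概率)
    return 2 ** (-log_prob_sum / n)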
2.3 自然语言处理基础任务
2.3.1 中文分词
前向最大匹配分词,明显缺点是倾向于切分出较长的词,也会有切分歧义的问题
见附录代码2
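附一个前向最大匹配分词的简化示意(词典与示例句子为假设,与附录代码2并不相同):
def forward_max_match(sentence, word_dict, max_len=4):
    # 从左到右扫描,每次尽量匹配词典中最长的词
    result, i = [], 0
    while i < len(sentence):
        matched = sentence[i]  # 至少切出单字
        for j in range(min(len(sentence), i + max_len), i + 1, -1):
            if sentence[i:j] in word_dict:
                matched = sentence[i:j]
                break
        result.append(matched)
        i += len(matched)
    return result
word_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
print(forward_max_match("研究生命的起源", word_dict))
# 输出['研究生', '命', '的', '起源'],体现了倾向于切分出较长词所导致的切分歧义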
2.3.2 子词切分
以英语为代表的语言,如果仅按照天然的分隔符(空格)进行切分,会造成一定的数据稀疏问题,而且会导致词表过大、降低处理速度,所以有基于传统语言学规则的词形还原(Lemmatization)和词干提取(Stemming),但是其结果可能不是一个完整的词
2.3.2.1 子词切分算法
原理:都是使用尽量长且频次高的子词对单词进行切分,典型算法如字节对编码(Byte Pair Encoding,BPE)
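下面是BPE学习合并规则的一个简化示意(词频表为假设的小例子,真实实现还需处理词尾标记等细节):
from collections import Counter
def get_pair_stats(vocab):
    # 统计相邻子词对的出现频次
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs
def merge_pair(pair, vocab):
    # 将频次最高的子词对合并为一个新的子词
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}
# 假设的词频表,每个词先切分为字符序列
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for step in range(5):
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"第{step + 1}次合并: {best}")
print(vocab)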
2.3.3 词性标注
主要难点在于歧义性,一个词在不同的上下文中可能有不同的意思
2.3.4 句法分析
给定一个句子,分析句子的句法成分信息,辅助下游处理任务
句法结构表示法:
S表示起始符号,NP名词短语,VP动词短语,sub表示主语,obj表示宾语
例子
您转的这篇文章很无知。
您转这篇文章很无知。第一句话主语是“文章”,第二句话的主语是“转”这个动作
2.3.5 语义分析
词语的颗粒度考虑,一个词语具有多重语义,例如“打”,词义消歧(Word Sense Disambiguation,WSD)任务,可以使用语义词典确定,例如WordNet
2.4 自然语言处理应用任务
2.4.1 信息抽取
- NER
- 关系抽取(实体之间的语义关系,如夫妻、子女、工作单位和地理空间上的位置关系等二元关系)
- 事件抽取(识别人们感兴趣的事件以及事件所涉及的时间、地点和人物等关键信息)
2.4.2 情感分析
- 情感分类(识别文中蕴含的情感类型或者情感强度)
- 情感信息抽取(抽取文中情感元素,如评价词语、评价对象和评价搭配等)
2.4.3 问答系统(QA)
- 检索式问答系统
- 知识库问答系统
- 常问问题集问答系统
- 阅读理解式问答系统
2.4.4 机器翻译(MT)
”理性主义“:基于规则
”经验主义“:数据驱动
基于深度学习的MT也称为神经机器翻译(Neural Machine Translation,NMT)
2.4.5 对话系统
对话系统主要分为任务型对话系统和开放域对话系统,后者也被称为聊天机器人(Chatbot)
2.4.5.1 任务型对话系统
包含三个模块:NLU(自然语言理解)、DM(对话管理)、NLG(自然语言生成)
1)NLU通常包含话语的领域、槽值、意图
2)DM通常包含对话状态跟踪(Dialogue State Tracking)、对话策略优化(Dialogue Policy Optimization)
对话状态一般表示为语义槽和值的列表。例如,对于以下用户话语:
U:帮我订一张明天去北京的机票
通过对其NLU结果进行对话状态跟踪,可以得到当前对话状态:【到达地=北京;出发时间=明天;出发地=NULL;数量=1】
获取到当前对话状态后,进行策略优化,即选择下一步采用什么样的策略,也叫动作,比如可以询问出发地等
NLG通常通过写模板即可实现
2.5 基本问题
2.5.1 文本分类
2.5.2 结构预测
2.5.2.1 序列标注(Sequence Labeling)
CRF既考虑了每个词属于某个标签的概率(发射概率),又考虑了标签之间的相互关系(转移概率)
2.5.2.2 序列分割
人名(PER)、地名(LOC)、机构名(ORG)
输入:”我爱北京天安门“,分词:”我 爱 北京 天安门“,NER结果:”北京天安门=LOC“
2.5.2.3 图结构生成
输入的是自然语言,输出结果是一个以图表示的结构,算法有两大类:基于图和基于转移
2.5.3 序列到序列问题(Seq2Seq)
也称为编码器-解码器(Encoder-Decoder)模型
2.6 评价指标
2.6.1 准确率(Accuracy)
最简单直观的评价指标,常被用于文本分类、词性标注等问题
2.6.2 F值
针对某一类别的评价
$$F=\frac{\left(\beta^{2}+1\right) P R}{\beta^{2} P+R}$$
式中,β是加权调和参数;P是精确率(Precision);R是召回率(Recall)。当权重β=1时,表示精确率和召回率同样重要,也称F1值:
$$F_1=\frac{2 P R}{P+R}$$
例如,在某个命名实体识别的例子中,“正确识别的命名实体数目”为1(“哈尔滨”),“识别出的命名实体总数”为2(“张”和“哈尔滨”),“测试文本中命名实体的总数”为2(“张三”和“哈尔滨”),那么此时精确率和召回率均为1/2=0.5,最终的F1=0.5;与基于词计算的准确率(0.875)相比,该值更为合理
2.6.3 其他评价
BLEU值是最常用的机器翻译自动评价指标
2.7 习题
3 基础工具集与常用数据集
3.1 NLTK工具集
3.1.1 语料库和词典资源
3.1.1.1 停用词
英文中的“a”、“the”、“of”、“to”等
from nltk.corpus import stopwords
stopwords.words('english')
3.1.1.2 常用语料库
NLTK提供了多种语料库(文本数据集),如图书、电影评论和聊天记录等,它们可以被分为两类,即未标注语料库(又称生语料库或生文本,Raw text)和人工标注语料库( Annotated corpus )
- 未标注语料库
比如说小说的原文等
- 人工标注语料库
3.1.1.3 常用词典
WordNet
普林斯顿大学构建的英文语义词典(也称作辞典,Thesaurus),其主要特色是定义了同义词集合(Synset),每个同义词集合由具有相同意义的词义组成,为每一个同义词集合提供了简短的释义(Gloss),不同同义词集合之间还具有一定的语义关系
from nltk.corpus import wordnet
syns = wordnet.synsets("bank")
syns[0].name()
syns[1].definition()
SentiWordNet
基于WordNet标注的同义词集合的情感倾向性词典,有褒义、贬义、中性三个情感值
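NLTK中可以通过sentiwordnet语料库访问SentiWordNet,简单示例如下(需先用nltk.download下载sentiwordnet与wordnet数据):
from nltk.corpus import sentiwordnet as swn
# 'good.a.01' 表示 "good" 作为形容词(a)的第1个词义
breakdown = swn.senti_synset('good.a.01')
print(breakdown.pos_score())  # 褒义得分
print(breakdown.neg_score())  # 贬义得分
print(breakdown.obj_score())  # 中性(客观)得分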
3.1.2 NLP工具集
3.1.2.1 分句
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
text = gutenberg.raw('austen-emma.txt')
sentences = sent_tokenize(text)
print(sentences[100])
# 结果:
# Mr. Knightley loves to find fault with me, you know-- \nin a joke--it is all a joke.
3.1.2.2 标记解析
一个句子是由若干标记(Token)按顺序构成的,其中标记既可以是一个词,也可以是标点符号等,这些标记是自然语言处理最基本的输入单元。
将句子分割为标记的过程叫作标记解析(Tokenization)。英文中的单词之间通常使用空格进行分割。不过标点符号通常和前面的单词连在一起,因此标记解析的一项主要工作是将标点符号和前面的单词进行拆分。和分句一样,也无法使用简单的规则进行标记解析,仍以符号"."为例,它既可作为句号,也可以作为标记的一部分,如不能简单地将"Mr."分成两个标记。同样,NLTK提供了标记解析功能,也称作标记解析器(Tokenizer)
from nltk.tokenize import word_tokenize
word_tokenize(sentences[100])
# 结果:
# ['Mr.','Knightley','loves','to','find','fault','with','me',',','you','know','--','in','a','joke','--','it','is','all','a','joke','.']
3.1.2.3 词性标注
from nltk import pos_tag
print(pos_tag(word_tokenize("They sat by the fire.")))
print(pos_tag(word_tokenize("They fire a gun.")))
# 结果:
# [('They', 'PRP'), ('sat', 'VBP'), ('by', 'IN'), ('the', 'DT'), ('fire', 'NN'), ('.', '.')]
# [('They', 'PRP'), ('fire', 'VBP'), ('a', 'DT'), ('gun', 'NN'), ('.', '.')]
3.1.2.4 其他工具
命名实体识别、组块分析、句法分析等
3.2 LTP工具集(哈工大)
中文分词、词性标注、命名实体识别、依存句法分析和语义角色标注等,具体查API文档
3.3 Pytorch
# 1.创建
torch.empty(2, 3) # 未初始化
torch.randn(2, 3) # 标准正态
torch.zeros(2, 3, dtype=torch.long) # 张量为整数类型
torch.zeros(2, 3, dtype=torch.double) # 双精度浮点数
torch.tensor([[1.0, 3.8, 2.1], [8.6, 4.0, 2.4]]) # 通过列表创建
torch.arange(1, 4) # 生成从1到3的整数序列(不含4)
# 2.GPU
torch.rand(2, 3).to("cuda")
# 3.加减乘除都是元素运算
x = torch.tensor([1, 2, 3], dtype=torch.double)
y = torch.tensor([4, 5, 6], dtype=torch.double)
print(x * y)
# tensor([ 4., 10., 18.], dtype=torch.float64)
# 4.点积
x.dot(y)
# 5.所有元素求平均
x.mean()
# 6.按维度求平均
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.double)
print(x.mean(dim=0))
print(x.mean(dim=1))
# tensor([2.5000, 3.5000, 4.5000], dtype=torch.float64)
# tensor([2., 5.], dtype=torch.float64)
# 7.拼接(按列和行)
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.double)
y = torch.tensor([[7, 8, 9], [10, 11, 12]], dtype=torch.double)
torch.cat((x, y), dim=0)
# tensor([[ 1., 2., 3.],
# [ 4., 5., 6.],
# [ 7., 8., 9.],
# [10., 11., 12.]], dtype=torch.float64)
# 8.梯度
x = torch.tensor([2.], requires_grad=True)
y = torch.tensor([3.], requires_grad=True)
z = (x + y) * (y - 2)
print(z)
z.backward()
print(x.grad, y.grad)
# tensor([5.], grad_fn=<MulBackward0>)
# tensor([1.]) tensor([6.])
# 9.调整形状
view和reshape的区别是:view要求张量在内存中是连续的(可以用is_contiguous()判断是否连续),reshape则没有此要求(必要时会返回拷贝),其他用法一样
transpose交换维度(一次只能交换两个维度),permute可以一次交换多个维度
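# 补充一个小例子(示意):view/reshape与transpose/permute的区别
import torch
x = torch.arange(12).view(3, 4)   # 形状(3, 4)
y = x.transpose(0, 1)             # 交换两个维度,形状(4, 3),内存不再连续
print(y.is_contiguous())          # False,此时不能直接调用view
print(y.reshape(12))              # reshape可以处理非连续张量(必要时返回拷贝)
print(y.contiguous().view(12))    # 或者先contiguous()再view
z = torch.rand(2, 3, 4)
print(z.permute(2, 0, 1).shape)   # permute可一次交换多个维度,得到形状(4, 2, 3)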
# 10.升维和降维
a = torch.tensor([1, 2, 3, 4])
b = a.unsqueeze(dim=0)
print(b, b.shape)
c = b.squeeze() # 去掉所有形状中为1的维
print(c, c.shape)
# tensor([[1, 2, 3, 4]]) torch.Size([1, 4])
# tensor([1, 2, 3, 4]) torch.Size([4])
3.4 语料处理
# 删除空的成对符号
def remove_empty_paired_punc(in_str):
return in_str.replace('()', '').replace('《》', '').replace('【】', '').replace('[]', '')
# 删除多余的html标签
def remove_html_tags(in_str):
html_pattern = re.compile(r'<[^>]+>', re.S)
return html_pattern.sub('', in_str)
# 删除不可见控制字符
def remove_control_chars(in_str):
control_chars = ''.join(map(chr, list(range(0, 32)) + list(range(127, 160))))
control_chars = re.compile('[%s]' % re.escape(control_chars))
return control_chars.sub('', in_str)
3.5 数据集
- Common Crawl
- HuggingFace Datasets(超多数据集)
使用HuggingFace Datasets之前,pip安装datasets,其提供数据集以及评价方法
3.6 习题
4 自然语言处理中的神经网络基础
4.1 多层感知器模型
4.1.1 感知机
将输入表示为特征向量x,这一过程称为特征提取(Feature Extraction)
4.1.2 线性回归
和感知机类似,线性回归的输出为 $y=\boldsymbol{w}\cdot\boldsymbol{x}+b$,不过 $y$ 是连续的实数值而非离散的类别
4.1.3 逻辑回归
$y=\sigma(\boldsymbol{w}\cdot\boldsymbol{x}+b)$,其中 $\sigma(z)=\frac{1}{1+\mathrm{e}^{-z}}$ 为Sigmoid函数;逻辑回归虽然名为回归模型,但常用于分类问题
4.1.4 softmax回归
例如手写数字识别(共10个类别),对每个类别 $i$ 计算分数 $z_i=\boldsymbol{w}_i\cdot\boldsymbol{x}+b_i$,结果为
$$y_i=\frac{\exp\left(z_i\right)}{\sum_{j}\exp\left(z_j\right)}$$
使用矩阵表示为 $\boldsymbol{y}=\operatorname{softmax}(\boldsymbol{W}\boldsymbol{x}+\boldsymbol{b})$
4.1.5 多层感知机(Multi-layer Perceptron,MLP)
多层感知机是解决线性不可分问题的方案,它堆叠多层线性分类器,并在隐含层加入了非线性激活函数
4.1.6 模型实现
4.1.6.1 nn
from torch import nn
linear = nn.Linear(32, 2) # 输入特征数目为32维,输出特征数目为2维
inputs = torch.rand(3, 32) # 创建一个形状为(3,32)的随机张量,3为batch批次大小
outputs = linear(inputs)
print(outputs)
# 输出:
# tensor([[ 0.2488, -0.3663],
# [ 0.4467, -0.5097],
# [ 0.4149, -0.7504]], grad_fn=<AddmmBackward>)
# 输出为(3,2),即(batch,输出维度)
4.1.6.2 激活函数
from torch.nn import functional as F
# 对于每个元素进行sigmoid
activation = F.sigmoid(outputs)
print(activation)
# 结果:
# tensor([[0.6142, 0.5029],
# [0.5550, 0.4738],
# [0.6094, 0.4907]], grad_fn=<SigmoidBackward>)
# 沿着第2维(行方向)进行softmax,即对于每批次中的各样例分别进行softmax
activation = F.softmax(outputs, dim=1)
print(activation)
# 结果:
# tensor([[0.6115, 0.3885],
# [0.5808, 0.4192],
# [0.6182, 0.3818]], grad_fn=<SoftmaxBackward>)
activation = F.relu(outputs)
print(activation)
# 结果:
# tensor([[0.4649, 0.0115],
# [0.2210, 0.0000],
# [0.4447, 0.0000]], grad_fn=<ReluBackward0>)
4.1.6.3 多层感知机
import torch
from torch import nn
from torch.nn import functional as F
# 多层感知机
class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, num_class):
super(MLP, self).__init__()
# 线性变换:输入 -> 隐层
self.linear1 = nn.Linear(input_dim, hidden_dim)
# ReLU
self.activate = F.relu
# 线性变换:隐层 -> 输出
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs):
hidden = self.linear1(inputs)
activation = self.activate(hidden)
outputs = self.linear2(activation)
probs = F.softmax(outputs, dim=1) # 获得每个输入属于某个类别的概率
return probs
mlp = MLP(input_dim=4, hidden_dim=5, num_class=2)
# 3个输入batch,4为每个输入的维度
inputs = torch.rand(3, 4)
probs = mlp(inputs)
print(probs)
# 结果:
# tensor([[0.3465, 0.6535],
# [0.3692, 0.6308],
# [0.4319, 0.5681]], grad_fn=<SoftmaxBackward>)
4.2 CNN
4.2.1 模型结构
若卷积步长为1,最后输出的边长为
$$n+2p-f+1$$
其中,n为输入边长,p为padding(补齐)的大小,f为卷积核宽度
卷积神经网络本质上也是一种前馈神经网络
4.2.2 模型实现
4.2.2.1 卷积
Conv1d、Conv2d、Conv3d,自然语言处理中常用的一维卷积
简单来说,2d先横着扫再竖着扫,1d只能竖着扫,3d是三维立体扫
代码实现:注意PyTorch的Conv1d要求输入形状为(batch, in_channels, length),即需要参与卷积的通道维(这里是embedding维度)必须位于倒数第2维,因此传参前要先转置,把embedding维度换到倒数第2维,卷积完成后一般再转置回来(这一点确实绕了一圈)
import torch
from torch.nn import Conv1d
inputs = torch.ones(2, 7, 5)
conv1 = Conv1d(in_channels=5, out_channels=3, kernel_size=2)
inputs = inputs.permute(0, 2, 1)
outputs = conv1(inputs)
outputs = outputs.permute(0, 2, 1)
print(outputs, outputs.shape)
# 结果:
# tensor([[[ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679]],
# [[ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679],
# [ 0.4823, -0.3208, -0.2679]]], grad_fn=<PermuteBackward> torch.Size([2, 6, 3]))
4.2.2.2 卷积、池化、全连接
import torch
from torch.nn import Conv1d
# 输入批次大小为2,即有两个序列,每个序列长度为6,输入的维度为5
inputs = torch.rand(2, 5, 6)
print("inputs = ", inputs, inputs.shape)
# class torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)
# in_channels 词向量维度
# out_channels 卷积产生的通道
# kernel_size 卷积核尺寸,卷积大小实际为 kernel_size*in_channels
# 定义一个一维卷积,输入通道为5,输出通道为2,卷积核宽度为4
conv1 = Conv1d(in_channels=5, out_channels=2, kernel_size=4)
# 卷积核的权值是随机初始化的
print("conv1.weight = ", conv1.weight, conv1.weight.shape)
# 再定义一个一维卷积,输入通道为5,输出通道为2,卷积核宽度为3
conv2 = Conv1d(in_channels=5, out_channels=2, kernel_size=3)
outputs1 = conv1(inputs)
outputs2 = conv2(inputs)
# 输出1为2个序列,两个序列长度为3,大小为2
print("outputs1 = ", outputs1, outputs1.shape)
# 输出2为2个序列,两个序列长度为4,大小为2
print("outputs2 = ", outputs2, outputs2.shape)
# inputs = tensor([[[0.5801, 0.6436, 0.1947, 0.6487, 0.8968, 0.3009],
# [0.8895, 0.0390, 0.5899, 0.1805, 0.1035, 0.9368],
# [0.1585, 0.8440, 0.8345, 0.0849, 0.4730, 0.5783],
# [0.3659, 0.2716, 0.4990, 0.6657, 0.2565, 0.9945],
# [0.6403, 0.2125, 0.6234, 0.1210, 0.3517, 0.6784]],
# [[0.0855, 0.1844, 0.3558, 0.1458, 0.9264, 0.9538],
# [0.1427, 0.9598, 0.2031, 0.2354, 0.5456, 0.6808],
# [0.8981, 0.6998, 0.1424, 0.7445, 0.3664, 0.9132],
# [0.9393, 0.6905, 0.1617, 0.7266, 0.6220, 0.0726],
# [0.6940, 0.1242, 0.0561, 0.3435, 0.1775, 0.8076]]]) torch.Size([2, 5, 6])
# conv1.weight = Parameter containing:
# tensor([[[ 0.1562, -0.1094, -0.0228, 0.1879],
# [-0.0304, 0.1720, 0.0392, 0.0476],
# [ 0.0479, 0.0050, -0.0942, 0.0502],
# [-0.0905, -0.1414, 0.0421, 0.0708],
# [ 0.0671, 0.2107, 0.1556, 0.1809]],
# [[ 0.0453, 0.0267, 0.0821, 0.0792],
# [ 0.0428, 0.1096, 0.0132, 0.1285],
# [-0.0082, 0.2208, 0.2189, 0.1461],
# [ 0.0550, -0.0019, -0.0607, -0.1238],
# [ 0.0730, 0.1778, -0.0817, 0.2204]]], requires_grad=True) torch.Size([2, 5, 4])
# outputs1 = tensor([[[0.2778, 0.5726, 0.2568],
# [0.6502, 0.7603, 0.6844]],
# ...
# [-0.0940, -0.2529, -0.2081, 0.0786]],
# [[-0.0102, -0.0118, 0.0119, -0.1874],
# [-0.5899, -0.0979, -0.1233, -0.1664]]], grad_fn=<SqueezeBackward1>) torch.Size([2, 2, 4])
from torch.nn import MaxPool1d
# 输出序列长度3
pool1 = MaxPool1d(3)
# 输出序列长度4
pool2 = MaxPool1d(4)
outputs_pool1 = pool1(outputs1)
outputs_pool2 = pool2(outputs2)
print(outputs_pool1)
print(outputs_pool2)
# 由于outputs_pool1和outputs_pool2是两个独立的张量,需要cat拼接起来,删除最后一个维度,将2行1列的矩阵变成1个向量
outputs_pool_squeeze1 = outputs_pool1.squeeze(dim=2)
print(outputs_pool_squeeze1)
outputs_pool_squeeze2 = outputs_pool2.squeeze(dim=2)
print(outputs_pool_squeeze2)
outputs_pool = torch.cat([outputs_pool_squeeze1, outputs_pool_squeeze2], dim=1)
print(outputs_pool)
# tensor([[[0.5726],
# [0.7603]],
# [[0.4595],
# [0.9858]]], grad_fn=<SqueezeBackward1>)
# tensor([[[-0.0104],
# [ 0.0786]],
# [[ 0.0119],
# [-0.0979]]], grad_fn=<SqueezeBackward1>)
# tensor([[0.5726, 0.7603],
# [0.4595, 0.9858]], grad_fn=<SqueezeBackward1>)
# tensor([[-0.0104, 0.0786],
# [ 0.0119, -0.0979]], grad_fn=<SqueezeBackward1>)
# tensor([[ 0.5726, 0.7603, -0.0104, 0.0786],
# [ 0.4595, 0.9858, 0.0119, -0.0979]], grad_fn=<CatBackward>)
from torch.nn import Linear
linear = Linear(4, 2)
outputs_linear = linear(outputs_pool)
print(outputs_linear)
# tensor([[-0.0555, -0.0656],
# [-0.0428, -0.0303]], grad_fn=<AddmmBackward>)
4.2.3 TextCNN网络结构
class TextCNN(nn.Module):
def __init__(self, config):
super(TextCNN, self).__init__()
self.is_training = True
self.dropout_rate = config.dropout_rate
self.num_class = config.num_class
self.use_element = config.use_element
self.config = config
self.embedding = nn.Embedding(num_embeddings=config.vocab_size,
embedding_dim=config.embedding_size)
self.convs = nn.ModuleList([
nn.Sequential(nn.Conv1d(in_channels=config.embedding_size,
out_channels=config.feature_size,
kernel_size=h),
# nn.BatchNorm1d(num_features=config.feature_size),
nn.ReLU(),
nn.MaxPool1d(kernel_size=config.max_text_len-h+1))
for h in config.window_sizes
])
self.fc = nn.Linear(in_features=config.feature_size*len(config.window_sizes),
out_features=config.num_class)
if os.path.exists(config.embedding_path) and config.is_training and config.is_pretrain:
print("Loading pretrain embedding...")
self.embedding.weight.data.copy_(torch.from_numpy(np.load(config.embedding_path)))
def forward(self, x):
embed_x = self.embedding(x)
#print('embed size 1',embed_x.size()) # 32*35*256
# batch_size x text_len x embedding_size -> batch_size x embedding_size x text_len
embed_x = embed_x.permute(0, 2, 1)
#print('embed size 2',embed_x.size()) # 32*256*35
out = [conv(embed_x) for conv in self.convs] #out[i]:batch_size x feature_size*1
#for o in out:
# print('o',o.size()) # 32*100*1
out = torch.cat(out, dim=1) # 对应第二个维度(行)拼接起来,比如说5*2*1,5*3*1的拼接变成5*5*1
#print(out.size(1)) # 32*400*1
out = out.view(-1, out.size(1))
#print(out.size()) # 32*400
if not self.use_element:
out = F.dropout(input=out, p=self.dropout_rate)
out = self.fc(out)
return out
4.3 RNN
4.3.1 RNN和HMM的区别
4.3.2 模型实现
from torch.nn import RNN
# 每个时刻输入大小为4,隐含层大小为5
rnn = RNN(input_size=4, hidden_size=5, batch_first=True)
# 输入批次大小为2,即有2个序列,序列长度为3,输入大小为4
inputs = torch.rand(2, 3, 4)
# 得到输出和更新之后的隐藏状态
outputs, hn = rnn(inputs)
print(outputs)
print(hn)
print(outputs.shape, hn.shape)
# tensor([[[-0.1413, 0.1952, -0.2586, -0.4585, -0.4973],
# [-0.3413, 0.3166, -0.2132, -0.5002, -0.2506],
# [-0.0390, 0.1016, -0.1492, -0.4582, -0.0017]],
# [[ 0.1747, 0.2208, -0.1599, -0.4487, -0.1219],
# [-0.1236, 0.1097, -0.2268, -0.4487, -0.0603],
# [ 0.0973, 0.3031, -0.1482, -0.4647, 0.0809]]],
# grad_fn=<TransposeBackward1>)
# tensor([[[-0.0390, 0.1016, -0.1492, -0.4582, -0.0017],
# [ 0.0973, 0.3031, -0.1482, -0.4647, 0.0809]]],
# grad_fn=<StackBackward>)
# torch.Size([2, 3, 5]) torch.Size([1, 2, 5])
import torch
from torch.autograd import Variable
from torch import nn
# 首先建立一个简单的循环神经网络:输入维度为20, 输出维度是50, 两层的单向网络
basic_rnn = nn.RNN(input_size=20, hidden_size=50, num_layers=2)
"""
通过 weight_ih_l0 访问第一层中的 w_{ih}:因为输入 x_{t} 是20维,隐含层是50维,所以 w_{ih} 是一个 50×20 的矩阵;第二层网络的对应参数可以用 weight_ih_l1 访问。对于 w_{hh},可以用 weight_hh_l0 来访问,而 b_{ih} 则可以通过 bias_ih_l0 来访问。当然也可以对它们进行自定义的初始化,只需记得它们是 Variable,取出它们的 data 再进行初始化即可。
"""
print(basic_rnn.weight_ih_l0.size(), basic_rnn.weight_ih_l1.size(), basic_rnn.weight_hh_l0.size())
# 随机初始化输入和隐藏状态
toy_input = Variable(torch.randn(3, 1, 20))
h_0 = Variable(torch.randn(2*1, 1, 50))
print(toy_input[0].size())
# 将输入和隐藏状态传入网络,得到输出和更新之后的隐藏状态,输出维度是(3, 1, 50)。
toy_output, h_n = basic_rnn(toy_input, h_0)
print(toy_output[-1])
print(h_n)
print(h_n[1])
# torch.Size([50, 20]) torch.Size([50, 50]) torch.Size([50, 50])
# torch.Size([1, 20])
# tensor([[-0.5984, -0.3677, 0.0775, 0.2553, 0.1232, -0.1161, -0.2288, 0.1609,
# -0.1241, -0.3501, -0.3164, 0.3403, 0.0332, 0.2511, 0.0951, 0.2445,
# 0.0558, -0.0419, -0.1222, 0.0901, -0.2851, 0.1737, 0.0637, -0.3362,
# -0.1706, 0.2050, -0.3277, -0.2112, -0.4245, 0.0265, -0.0052, -0.4551,
# -0.3270, -0.1220, -0.1531, -0.0151, 0.2504, 0.5659, 0.4878, -0.0656,
# -0.7775, 0.4294, 0.2054, 0.0318, 0.4798, -0.1439, 0.3873, 0.1039,
# 0.1654, -0.5765]], grad_fn=<SelectBackward>)
# tensor([[[ 0.2338, 0.1578, 0.7547, 0.0439, -0.6009, 0.1042, -0.4840,
# -0.1806, -0.2075, -0.2174, 0.2023, 0.3301, -0.1899, 0.1618,
# 0.0790, 0.1213, 0.0053, -0.2586, 0.6376, 0.0315, 0.6949,
# 0.3184, -0.4901, -0.0852, 0.4542, 0.1393, -0.0074, -0.8129,
# -0.1013, 0.0852, 0.2550, -0.4294, 0.2316, 0.0662, 0.0465,
# -0.1976, -0.6093, 0.4097, 0.3909, -0.1091, -0.3569, 0.0366,
# 0.0665, 0.5302, -0.1765, -0.3919, -0.0308, 0.0061, 0.1447,
# 0.2676]],
# [[-0.5984, -0.3677, 0.0775, 0.2553, 0.1232, -0.1161, -0.2288,
# 0.1609, -0.1241, -0.3501, -0.3164, 0.3403, 0.0332, 0.2511,
# 0.0951, 0.2445, 0.0558, -0.0419, -0.1222, 0.0901, -0.2851,
# 0.1737, 0.0637, -0.3362, -0.1706, 0.2050, -0.3277, -0.2112,
# -0.4245, 0.0265, -0.0052, -0.4551, -0.3270, -0.1220, -0.1531,
# -0.0151, 0.2504, 0.5659, 0.4878, -0.0656, -0.7775, 0.4294,
# 0.2054, 0.0318, 0.4798, -0.1439, 0.3873, 0.1039, 0.1654,
# ...
# -0.1706, 0.2050, -0.3277, -0.2112, -0.4245, 0.0265, -0.0052, -0.4551,
# -0.3270, -0.1220, -0.1531, -0.0151, 0.2504, 0.5659, 0.4878, -0.0656,
# -0.7775, 0.4294, 0.2054, 0.0318, 0.4798, -0.1439, 0.3873, 0.1039,
# 0.1654, -0.5765]], grad_fn=<SelectBackward>)
初始化时,还可以设置其他网络参数,bidirectional=True、num_layers等
4.3.3 LSTM
from torch.nn import LSTM
lstm = LSTM(input_size=4, hidden_size=5, batch_first=True)
inputs = torch.rand(2, 3, 4)
# outputs为输出序列的隐含层,hn为最后一个时刻的隐含层,cn为最后一个时刻的记忆细胞
outputs, (hn, cn) = lstm(inputs)
# 输出两个序列,每个序列长度为3,大小为5
print(outputs)
print(hn)
print(cn)
# 输出隐含层序列和最后一个时刻隐含层以及记忆细胞的形状
print(outputs.shape, hn.shape, cn.shape)
# tensor([[[-0.1102, 0.0568, 0.0929, 0.0579, -0.1300],
# [-0.2051, 0.0829, 0.0245, 0.0202, -0.2124],
# [-0.2509, 0.0854, 0.0882, -0.0272, -0.2385]],
# [[-0.1302, 0.0804, 0.0200, 0.0543, -0.1033],
# [-0.2794, 0.0736, 0.0247, -0.0406, -0.2233],
# [-0.2913, 0.1044, 0.0407, 0.0044, -0.2345]]],
# grad_fn=<TransposeBackward0>)
# tensor([[[-0.2509, 0.0854, 0.0882, -0.0272, -0.2385],
# [-0.2913, 0.1044, 0.0407, 0.0044, -0.2345]]],
# grad_fn=<StackBackward>)
# tensor([[[-0.3215, 0.2153, 0.1180, -0.0568, -0.4162],
# [-0.3982, 0.2704, 0.0568, 0.0097, -0.3959]]],
# grad_fn=<StackBackward>)
# torch.Size([2, 3, 5]) torch.Size([1, 2, 5]) torch.Size([1, 2, 5])
4.4 注意力模型
seq2seq这样的模型有一个基本假设:原始序列的最后一个隐含状态(一个向量)包含了该序列的全部信息。这一假设显然不合理,当序列比较长时更难以成立,因此引入了注意力模型
4.4.1 注意力机制
$$
\operatorname{attn}(\boldsymbol{q}, \boldsymbol{k})=\begin{cases}
\boldsymbol{w}^{\top} \tanh (\boldsymbol{W}[\boldsymbol{q} ; \boldsymbol{k}]) & \text{多层感知器} \\
\boldsymbol{q}^{\top} \boldsymbol{W} \boldsymbol{k} & \text{双线性} \\
\boldsymbol{q}^{\top} \boldsymbol{k} & \text{点积} \\
\dfrac{\boldsymbol{q}^{\top} \boldsymbol{k}}{\sqrt{d}} & \text{缩放点积(避免因向量维度}d\text{过大导致点积结果过大)}
\end{cases}
$$
4.4.2 自注意力模型
具体地,假设输入为 $n$ 个向量组成的序列 $\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_n$,输出为每个向量对应的新的向量表示 $\boldsymbol{y}_1,\boldsymbol{y}_2,\ldots,\boldsymbol{y}_n$,其中所有向量的大小均为 $d$。那么,$\boldsymbol{y}_i$ 的计算公式为
$$\boldsymbol{y}_i=\sum_{j=1}^{n}\alpha_{ij}\boldsymbol{x}_j$$
式中,$j$ 是整个序列的索引值;$\alpha_{ij}$ 是 $\boldsymbol{x}_i$ 与 $\boldsymbol{x}_j$ 之间的注意力(权重),其通过attn函数计算,然后再经过softmax函数进行归一化后获得。直观上的含义是,如果 $\boldsymbol{x}_i$ 与 $\boldsymbol{x}_j$ 越相关,则它们计算得到的注意力值就越大,那么 $\boldsymbol{x}_j$ 对 $\boldsymbol{x}_i$ 对应的新的表示 $\boldsymbol{y}_i$ 的贡献就越大
通过自注意力机制,可以直接计算两个距离较远的时刻之间的关系。而在循环神经网络中,由于信息是沿着时刻逐层传递的,因此当两个相关性较大的时刻距离较远时,会产生较大的信息损失。虽然引入了门控机制模型,如LSTM等,但治标不治本。因此,基于自注意力机制的自注意力模型已经逐步取代循环神经网络,成为自然语言处理的标准模型
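下面用PyTorch写一个缩放点积自注意力的最小示意(单头、不含掩码与多头机制,类名与参数名为假设):
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super(SelfAttention, self).__init__()
        # 将输入分别映射为查询(Q)、键(K)、值(V)三种角色
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
    def forward(self, x):  # x: (batch, seq_len, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        d = q.shape[-1]
        # 缩放点积注意力:softmax(QK^T / sqrt(d)) V
        scores = torch.matmul(q, k.transpose(-2, -1)) / (d ** 0.5)
        attn = F.softmax(scores, dim=-1)  # (batch, seq_len, seq_len)
        return torch.matmul(attn, v)      # (batch, seq_len, d_model)
x = torch.rand(2, 5, 16)
print(SelfAttention(16)(x).shape)  # torch.Size([2, 5, 16])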
4.4.3 Transformer
4.4.3.1 融入位置信息
两种方式,位置嵌入(Position Embedding)和位置编码(Position Encodings)
- 位置嵌入与词嵌入类似
- 位置编码是将位置索引值通过函数映射到一个d维向量(实现示意见下方代码)
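正弦位置编码的一个实现示意如下(函数名为假设,公式即原始Transformer论文中的sin/cos编码):
import math
import torch
def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (max_len, d_model),使用时直接加到对应位置的词向量上
print(positional_encoding(50, 16).shape)  # torch.Size([50, 16])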
4.4.3.2 Transformer块(Block)
包含自注意力、层归一化(Layer Normalization)、残差(Residual Connections)
4.4.3.3 自注意力计算结果互斥
自注意力结果需要经过softmax归一化,导致即使一个输入和多个其他输入相关,也无法同时为这些输入赋予较大的注意力值,即自注意力结果之间是互斥的,无法同时关注多个输入,所以使用多头自注意力(Multi-head Self-attention)模型
4.4.3.4 Transformer模型的优缺点
优点:与循环神经网络相比,Transformer 能够直接建模输入序列单元之间更长距离的依赖关系,从而使得 Transformer 对于长序列建模的能力更强。另外,在 Transformer 的编码阶段,由于可以利用GPU 等多核计算设备并行地计算 Transformer 块内部的自注意力模型,而循环神经网络需要逐个计算,因此 Transformer 具有更高的训练速度。
缺点:不过,与循环神经网络相比,Transformer 的一个明显的缺点是参数量过于庞大。每一层的 Transformer 块大部分参数集中在自注意力模型中输入向量的三个角色映射矩阵、多头机制导致相应参数的倍增和引入非线性的多层感知器等。
更主要的是,还需要堆叠多层Transformer 块,从而参数量又扩大多倍。最终导致一个实用的Transformer模型含有巨大的参数量。巨大的参数量导致 Transformer模型非常不容易训练,尤其是当训练数据较小时
4.5 神经网络模型的训练
4.5.1 损失函数
均方误差MSE(Mean Squared Error)和交叉熵损失CE(Cross-Entropy)
4.5.2 小批次梯度下降
import torch
from torch import nn, optim
from torch.nn import functional as F
class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, num_class):
super(MLP, self).__init__()
self.linear1 = nn.Linear(input_dim, hidden_dim)
self.activate = F.relu
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs):
hidden = self.linear1(inputs)
activation = self.activate(hidden)
outputs = self.linear2(activation)
log_probs = F.log_softmax(outputs, dim=1)
return log_probs
# 异或问题的4个输入
x_train = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
# 每个输入对应的输出类别
y_train = torch.tensor([0, 1, 1, 0])
# 创建多层感知器模型,输入层大小为2,隐含层大小为5,输出层大小为2(即有两个类别)
model = MLP(input_dim=2, hidden_dim=5, num_class=2)
criterion = nn.NLLLoss() # 当使用log_softmax输出时,需要调用负对数似然损失(Negative Log Likelihood,NLL)
optimizer = optim.SGD(model.parameters(), lr=0.05)
for epoch in range(500):
y_pred = model(x_train)
loss = criterion(y_pred, y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("Parameters:")
for name, param in model.named_parameters():
print (name, param.data)
y_pred = model(x_train)
print("y_pred = ", y_pred)
print("Predicted results:", y_pred.argmax(axis=1))
# 结果:
# Parameters:
# linear1.weight tensor([[-0.4509, -0.5591],
# [-1.2904, 1.2947],
# [ 0.8418, 0.8424],
# [-0.4408, -0.1356],
# [ 1.2886, -1.2879]])
# linear1.bias tensor([ 4.5582e-01, -2.5727e-03, -8.4167e-01, -1.7634e-03, -1.5244e-04])
# linear2.weight tensor([[ 0.5994, -1.4792, 1.0836, -0.2860, -1.0873],
# [-0.2534, 0.9911, -0.7348, 0.0413, 1.3398]])
# linear2.bias tensor([ 0.7375, -0.1796])
# y_pred = tensor([[-0.2398, -1.5455],
# [-2.3716, -0.0980],
# [-2.3101, -0.1045],
# [-0.0833, -2.5269]], grad_fn=<LogSoftmaxBackward>)
# Predicted results: tensor([0, 1, 1, 0])
注:
- 可以将nn.Linear理解为由input_dim个输入神经元和output_dim个输出神经元构成的两层全连接网络
- argmax(axis=1)函数:找到第2个维度方向上最大值的索引(对于二维张量,就是对每一行沿列方向取最大值的位置)
- 可以将输出层的(log_)softmax去掉,改用CrossEntropyLoss作为损失函数,其在计算损失时会自动进行softmax计算;这样在模型预测时可以提高速度,因为无须进行softmax运算,直接将输出分数最高的类别作为预测结果即可(见下方示例)
- 除了SGD,还有Adam、Adagrad等优化器,它们是对原始梯度下降的改进,改进思路包括动态调整学习率、对梯度进行累积等
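针对上面第3条,可以用下面的小例子验证CrossEntropyLoss与log_softmax+NLLLoss的等价性(数据为随机示意):
import torch
from torch import nn
from torch.nn import functional as F
logits = torch.randn(4, 3)  # 4个样本、3个类别的原始输出分数
targets = torch.tensor([0, 2, 1, 2])
loss1 = nn.CrossEntropyLoss()(logits, targets)
loss2 = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(loss1, loss2)  # 两者数值相同
print(logits.argmax(dim=1))  # 预测时无须softmax,直接取分数最大的类别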
4.6 情感分类实战
4.6.1 词表映射
from collections import defaultdict
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回词表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找输入标记对应的索引值,如果该标记不存在,则返回标记<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
注:@classmethod表示的是类方法
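上面Vocab类的用法示意如下(语料为假设的小例子):
text = [["我", "喜欢", "自然", "语言", "处理"], ["我", "喜欢", "机器", "学习"]]
vocab = Vocab.build(text)
print(len(vocab))  # 词表大小(含<unk>)
ids = vocab.convert_tokens_to_ids(["我", "喜欢", "深度", "学习"])
print(ids)  # 未登录词"深度"被映射为<unk>的索引0
print(vocab.convert_ids_to_tokens(ids))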
4.6.2 词向量层
# 词表大小为8,向量维度为3
embedding = nn.Embedding(8, 3)
input = torch.tensor([[0, 1, 2, 1], [4, 6, 6, 7]], dtype=torch.long) # torch.long = torch.int64
output = embedding(input)
output
# 即在原始输入后增加了一个长度为3的维
# 结果:
# tensor([[[ 0.1747, 0.7580, 0.3107],
# [ 0.1595, 0.9152, 0.2757],
# [ 1.0136, -0.5204, 1.0620],
# [ 0.1595, 0.9152, 0.2757]],
# [[-0.9784, -0.3794, 1.2752],
# [-0.4441, -0.2990, 1.0913],
# [-0.4441, -0.2990, 1.0913],
# [ 2.0153, -1.0434, -0.9038]]], grad_fn=<EmbeddingBackward>)
4.6.3 融入词向量的MLP
import torch
from torch import nn
from torch.nn import functional as F
class MLP(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class):
super(MLP, self).__init__()
# 词向量层
self.embedding = nn.Embedding(vocab_size, embedding_dim)
# 线性变换:词向量层 -> 隐含层
self.linear1 = nn.Linear(embedding_dim, hidden_dim)
self.activate = F.relu
# 线性变换:激活层 -> 输出层
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs):
embeddings = self.embedding(inputs)
# 将序列中多个Embedding进行聚合(求平均)
embedding = embeddings.mean(dim=1)
hidden = self.activate(self.linear1(embedding))
outputs = self.linear2(hidden)
# 获得每个序列属于某一个类别概率的对数值
probs = F.log_softmax(outputs, dim=1)
return probs
mlp = MLP(vocab_size=8, embedding_dim=3, hidden_dim=5, num_class=2)
inputs = torch.tensor([[0, 1, 2, 1], [4, 6, 6, 7]], dtype=torch.long)
outputs = mlp(inputs)
print(outputs)
# 结果:
# tensor([[-0.6600, -0.7275],
# [-0.6108, -0.7828]], grad_fn=<LogSoftmaxBackward>)
4.6.4 文本长度统一
input1 = torch.tensor([0, 1, 2, 1], dtype=torch.long)
input2 = torch.tensor([2, 1, 3, 7, 5], dtype=torch.long)
input3 = torch.tensor([6, 4, 2], dtype=torch.long)
input4 = torch.tensor([1, 3, 4, 3, 5, 7], dtype=torch.long)
inputs = [input1, input2, input3, input4]
offsets = [0] + [i.shape[0] for i in inputs]
print(offsets)
# cumsum累加,即0+4=4,4+5=9,9+3=12
offsets = torch.tensor(offsets[: -1]).cumsum(dim=0)
print(offsets)
inputs = torch.cat(inputs)
print(inputs)
embeddingbag = nn.EmbeddingBag(num_embeddings=8, embedding_dim=3)
embeddings = embeddingbag(inputs, offsets)
print(embeddings)
# 结果:
# [0, 4, 5, 3, 6]
# tensor([ 0, 4, 9, 12])
# tensor([0, 1, 2, 1, 2, 1, 3, 7, 5, 6, 4, 2, 1, 3, 4, 3, 5, 7])
# tensor([[-0.6750, 0.8048, -0.1771],
# [ 0.2023, -0.1735, 0.2372],
# [ 0.4699, -0.2902, 0.3136],
# [ 0.2327, -0.2667, 0.0326]], grad_fn=<EmbeddingBagBackward>)
4.6.5 数据处理
def load_sentence_polarity():
from nltk.corpus import sentence_polarity
vocab = Vocab.build(sentence_polarity.sents())
train_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][: 4000] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][: 4000]
test_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][4000: ] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][4000: ]
return train_data, test_data, vocab
train_data, test_data, vocab = load_sentence_polarity()
4.6.5.1 构建DataLoader对象
from torch.utils.data import DataLoader, Dataset
data_loader = DataLoader(
dataset,
batch_size=64,
collate_fn=collate_fn,
shuffle=True
)
# dataset为Dataset类的一个对象,用于存储数据
class BowDataset(Dataset):
def __init__(self, data):
# data为原始的数据,如使用load_sentence_polarity函数生成的训练数据
self.data = data
def __len__(self):
return len(self.data)
def __getitem__(self, i):
# 返回下标为i的样例
return self.data[i]
# collate_fn参数指向一个函数,用于对一个批次的样本进行整理,如将其转换为张量等
def collate_fn(examples):
# 从独立样本集合中构建各批次的输入输出
inputs = [torch.tensor(ex[0]) for ex in examples]
targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
offsets = [0] + [i.shape[0] for i in inputs]
offsets = torch.tensor(offsets[: -1]).cumsum(dim=0)
inputs = torch.cat(inputs)
return inputs, offsets, targets
4.6.6 MLP的训练和测试
# tqdm 进度条
from tqdm.auto import tqdm
import torch
from torch import nn, optim
from torch.nn import functional as F
class MLP(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class):
super(MLP, self).__init__()
# 词向量层
self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim)
# 线性变换:词向量层 -> 隐含层
self.linear1 = nn.Linear(embedding_dim, hidden_dim)
self.activate = F.relu
# 线性变换:激活层 -> 输出层
self.linear2 = nn.Linear(hidden_dim, num_class)
def forward(self, inputs, offsets):
embedding = self.embedding(inputs, offsets)
hidden = self.activate(self.linear1(embedding))
outputs = self.linear2(hidden)
# 获得每个序列属于某一个类别概率的对数值
probs = F.log_softmax(outputs, dim=1)
return probs
embedding_dim = 128
hidden_dim = 256
num_class = 2
batch_size = 32
num_epoch = 5
# 加载数据
train_data, test_data, vocab = load_sentence_polarity()
train_data = BowDataset(train_data)
test_data = BowDataset(test_data)
train_data_loader = DataLoader(train_data, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_data_loader = DataLoader(test_data, batch_size=1, collate_fn=collate_fn, shuffle=False)
# 加载模型
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MLP(len(vocab), embedding_dim, hidden_dim, num_class)
model.to(device)
# 训练
nll_loss = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(train_data_loader, desc=f"Training Epoch {epoch + 1}"):
inputs, offsets, targets = [x.to(device) for x in batch]
log_probs = model(inputs, offsets)
loss = nll_loss(log_probs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss:{total_loss:.2f}")
# 测试
acc = 0
for batch in tqdm(test_data_loader, desc=f"Testing"):
inputs, offsets, targets = [x.to(device) for x in batch]
with torch.no_grad():
output = model(inputs, offsets)
acc += (output.argmax(dim=1) == targets).sum().item()
print(f"Acc: {acc / len(test_data_loader):.2f}")
# 结果:
# Training Epoch 1: 100%|██████████| 250/250 [00:03<00:00, 64.04it/s]
# Training Epoch 2: 100%|██████████| 250/250 [00:04<00:00, 55.40it/s]
# Training Epoch 3: 100%|██████████| 250/250 [00:03<00:00, 82.54it/s]
# Training Epoch 4: 100%|██████████| 250/250 [00:03<00:00, 73.36it/s]
# Training Epoch 5: 100%|██████████| 250/250 [00:03<00:00, 72.61it/s]
# Testing: 33%|███▎ | 879/2662 [00:00<00:00, 4420.03it/s]
# Loss:45.66
# Testing: 100%|██████████| 2662/2662 [00:00<00:00, 4633.54it/s]
# Acc: 0.73
4.6.7 基于CNN的情感分类
复习一下conv1d
由于MLP采用词袋模型表示文本,只考虑文本中的词语信息,忽略了词组信息;卷积操作可以提取词组信息,例如卷积核宽度为2时,就可以提取“不 喜欢”这样的二元特征
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import defaultdict
# 进度条
from tqdm.auto import tqdm
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回词表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找输入标记对应的索引值,如果该标记不存在,则返回标记<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
class CnnDataset(Dataset):
def __init__(self, data):
self.data = data
def __len__(self):
return len(self.data)
def __getitem__(self, i):
return self.data[i]
def collate_fn(examples):
inputs = [torch.tensor(ex[0]) for ex in examples]
targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
# 对batch内的样本进行padding,使其具有相同长度
inputs = pad_sequence(inputs, batch_first=True)
return inputs, targets
class CNN(nn.Module):
def __init__(self, vocab_size, embedding_dim, filter_size, num_filter, num_class):
super(CNN, self).__init__()
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.conv1d = nn.Conv1d(embedding_dim, num_filter, filter_size, padding=1)
self.activate = F.relu
self.linear = nn.Linear(num_filter, num_class)
def forward(self, inputs): # inputs: (32, 47) ,32个长度为47的序列
embedding = self.embedding(inputs) # embedding: (32, 47, 128),相当于在原有输入上增加了一个词向量维度
convolution = self.activate(self.conv1d(embedding.permute(0, 2, 1))) # convolution: (32, 100, 47)
pooling = F.max_pool1d(convolution, kernel_size=convolution.shape[2]) # pooling: (32, 100, 1)
pooling_squeeze = pooling.squeeze(dim=2) # pooling_squeeze: (32, 100)
outputs = self.linear(pooling_squeeze) # outputs: (32, 2)
log_probs = F.log_softmax(outputs, dim=1) # log_probs: (32, 2)
return log_probs
def load_sentence_polarity():
from nltk.corpus import sentence_polarity
vocab = Vocab.build(sentence_polarity.sents())
train_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][: 4000] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][: 4000]
test_data = [(vocab.convert_tokens_to_ids(sentence), 0) for sentence in sentence_polarity.sents(categories='pos')][4000: ] \
+ [(vocab.convert_tokens_to_ids(sentence), 1) for sentence in sentence_polarity.sents(categories='neg')][4000: ]
return train_data, test_data, vocab
#超参数设置
embedding_dim = 128
hidden_dim = 256
num_class = 2
batch_size = 32
num_epoch = 5
filter_size = 3
num_filter = 100
#加载数据
train_data, test_data, vocab = load_sentence_polarity()
train_dataset = CnnDataset(train_data)
test_dataset = CnnDataset(test_data)
train_data_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_data_loader = DataLoader(test_dataset, batch_size=1, collate_fn=collate_fn, shuffle=False)
#加载模型
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CNN(len(vocab), embedding_dim, filter_size, num_filter, num_class)
model.to(device) #将模型加载到CPU或GPU设备
#训练过程
nll_loss = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001) #使用Adam优化器
model.train()
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(train_data_loader, desc=f"Training Epoch {epoch + 1}"):
inputs, targets = [x.to(device) for x in batch]
log_probs = model(inputs)
loss = nll_loss(log_probs, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss: {total_loss:.2f}")
#测试过程
acc = 0
for batch in tqdm(test_data_loader, desc=f"Testing"):
inputs, targets = [x.to(device) for x in batch]
with torch.no_grad():
output = model(inputs)
acc += (output.argmax(dim=1) == targets).sum().item()
#输出在测试集上的准确率
print(f"Acc: {acc / len(test_data_loader):.2f}")
# 结果:
# Training Epoch 1: 100%|██████████| 250/250 [00:06<00:00, 36.27it/s]
# Loss: 165.55
# Training Epoch 2: 100%|██████████| 250/250 [00:08<00:00, 31.13it/s]
# Loss: 122.83
# Training Epoch 3: 100%|██████████| 250/250 [00:06<00:00, 36.45it/s]
# Loss: 76.39
# Training Epoch 4: 100%|██████████| 250/250 [00:06<00:00, 41.66it/s]
# Loss: 33.92
# Training Epoch 5: 100%|██████████| 250/250 [00:06<00:00, 39.79it/s]
# Loss: 12.04
# Testing: 100%|██████████| 2662/2662 [00:00<00:00, 2924.88it/s]
#
# Acc: 0.72
4.6.8 基于Transformer的情感分类
!!!
<p style="color:red">需要重新研读</p>
!!!
4.7 词性标注实战
!!!
<p style="color:red">需要重新研读</p>
!!!
4.8 习题
5 静态词向量预训练模型
5.1 神经网络语言模型
N-gram语言模型存在明显的缺点:
- 容易受数据稀疏的影响,一般需要平滑处理
- 无法对超过N的上下文依赖关系进行建模
所以,基于神经网络的语言模型(如RNN、Transformer等)几乎替代了N-gram语言模型
5.1.1 预训练任务
监督信号来自于数据自身,这种学习方式称为自监督学习(Self-supervised Learning)
5.1.1.1 前馈神经网络语言模型
(1)输入层
由当前时刻 $t$ 的前 $n-1$ 个历史词 $w_{t-n+1:t-1}$ 构成,可以用独热编码表示,也可以用词在词表中的位置下标表示
(2)词向量层
将每个词映射为低维、稠密的实数向量;$\boldsymbol{x}=[\boldsymbol{v}_{w_{t-n+1}};\cdots;\boldsymbol{v}_{w_{t-1}}]$ 表示历史序列词向量拼接后的结果,词向量矩阵为 $\boldsymbol{E}\in\mathbb{R}^{d\times|\mathbb{V}|}$
(3)隐含层
$\boldsymbol{W}^{\text{hid}}\in\mathbb{R}^{m\times(n-1)d}$ 为输入层到隐含层之间的线性变换矩阵,$\boldsymbol{b}^{\text{hid}}\in\mathbb{R}^{m}$ 为偏置,隐含层可以表示为:
$$\boldsymbol{h}=f\left(\boldsymbol{W}^{\text{hid}}\boldsymbol{x}+\boldsymbol{b}^{\text{hid}}\right)$$
其中 $f$ 为激活函数(如tanh)
(4)输出层
$$\boldsymbol{y}=\operatorname{softmax}\left(\boldsymbol{W}^{\text{out}}\boldsymbol{h}+\boldsymbol{b}^{\text{out}}\right)$$
其中 $\boldsymbol{W}^{\text{out}}\in\mathbb{R}^{|\mathbb{V}|\times m}$,所以语言模型 $P(w_t\mid w_{t-n+1:t-1})$ 即为 $\boldsymbol{y}$ 中 $w_t$ 对应维度上的概率值
参数量为:
词向量参数($|\mathbb{V}|\times d$)+隐含层($(n-1)d\times m+m$)+输出层($m\times|\mathbb{V}|$)+输出层偏置($|\mathbb{V}|$)
由于 $m$ 和 $d$ 是常数,模型的自由参数数量随词表大小呈线性增长,且历史词数 $n$ 的增大并不会显著增加参数的数量
注:语言模型训练完成后的矩阵E为预训练得到的静态词向量
5.1.1.2 循环神经网络语言模型
RNN可以处理不定长依赖,“他喜欢吃苹果”(“吃”),“他感冒了,于是下班之后去了医院”(“感冒”和“医院”)
(1)输入层
由当前时刻 $t$ 之前的全部历史词序列 $w_{1:t-1}$ 构成,可以用独热编码表示,也可以用位置下标表示
(2)词向量层
$t$ 时刻的输入由前一个词 $w_{t-1}$ 的词向量 $\boldsymbol{v}_{w_{t-1}}$ 和 $t-1$ 时刻的隐含状态 $\boldsymbol{h}_{t-1}$ 组成
(3)隐含层
$$\boldsymbol{h}_t=\tanh\left(\boldsymbol{W}^{\text{hid}}[\boldsymbol{v}_{w_{t-1}};\boldsymbol{h}_{t-1}]+\boldsymbol{b}^{\text{hid}}\right)$$
其中 $\boldsymbol{W}^{\text{hid}}$ 可拆分为 $\boldsymbol{U}$ 和 $\boldsymbol{V}$,分别是 $\boldsymbol{v}_{w_{t-1}}$、$\boldsymbol{h}_{t-1}$ 与隐含层之间的权值矩阵,公式也常常将两者区分开写作 $\boldsymbol{h}_t=\tanh(\boldsymbol{U}\boldsymbol{v}_{w_{t-1}}+\boldsymbol{V}\boldsymbol{h}_{t-1}+\boldsymbol{b}^{\text{hid}})$
(4)输出层
$$\boldsymbol{y}_t=\operatorname{softmax}\left(\boldsymbol{W}^{\text{out}}\boldsymbol{h}_t+\boldsymbol{b}^{\text{out}}\right)$$
当序列较长时,RNN的训练存在梯度消失或梯度爆炸的问题。以前的做法是在反向传播过程中按长度进行截断,从而保证训练的有效性;现在一般用LSTM或Transformer替代普通RNN
5.1.2 模型实现
5.1.2.1 前馈神经网络语言模型
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence
from collections import defaultdict
# 进度条
from tqdm.auto import tqdm
import nltk
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
nltk.download('reuters')
nltk.download('punkt')
# from zipfile import ZipFile
# file_loc = '/root/nltk_data/corpora/reuters.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
# file_loc = '/root/nltk_data/corpora/punkt.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回词表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找输入标记对应的索引值,如果该标记不存在,则返回标记<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
def get_loader(dataset, batch_size, shuffle=True):
data_loader = DataLoader(
dataset,
batch_size=batch_size,
collate_fn=dataset.collate_fn,
shuffle=shuffle
)
return data_loader
# 读取Reuters语料库
def load_reuters():
from nltk.corpus import reuters
text = reuters.sents()
text = [[word.lower() for word in sentence]for sentence in text]
vocab = Vocab.build(text, reserved_tokens=[PAD_TOKEN, BOS_TOKEN, EOS_TOKEN])
corpus = [vocab.convert_tokens_to_ids(sentence) for sentence in text]
return corpus, vocab
# 保存词向量
def save_pretrained(vocab, embeds, save_path):
"""
Save pretrained token vectors in a unified format, where the first line
specifies the `number_of_tokens` and `embedding_dim` followed with all
token vectors, one token per line.
"""
with open(save_path, "w") as writer:
writer.write(f"{embeds.shape[0]} {embeds.shape[1]}\n")
for idx, token in enumerate(vocab.idx_to_token):
vec = " ".join(["{:.4f}".format(x) for x in embeds[idx]])
writer.write(f"{token} {vec}\n")
print(f"Pretrained embeddings saved to: {save_path}")
# Dataset类
class NGramDataset(Dataset):
def __init__(self, corpus, vocab, context_size=2):
self.data = []
self.bos = vocab[BOS_TOKEN] # 句首标记id
self.eos = vocab[EOS_TOKEN] # 句尾标记id
for sentence in tqdm(corpus, desc="Data Construction"):
# 插入句首、句尾标记符
sentence = [self.bos] + sentence + [self.eos]
# 如句子长度小于预定义的上下文大小、则跳过
if len(sentence) < context_size:
continue
for i in range(context_size, len(sentence)):
# 模型输入:长度为context_size的上下文
context = sentence[i-context_size: i]
# 当前词
target = sentence[i]
# 每个训练样本由(context, target)构成
self.data.append((context, target))
def __len__(self):
return len(self.data)
def __getitem__(self, i):
return self.data[i]
def collate_fn(self, examples):
# 从独立样本集合中构建批次的输入输出,并转换为PyTorch张量类型
inputs = torch.tensor([ex[0] for ex in examples], dtype=torch.long)
targets = torch.tensor([ex[1] for ex in examples], dtype=torch.long)
return (inputs, targets)
# 模型
class FeedForwardNNLM(nn.Module):
def __init__(self, vocab_size, embedding_dim, context_size, hidden_dim):
super(FeedForwardNNLM, self).__init__()
# 词向量层
self.embeddings = nn.Embedding(vocab_size, embedding_dim)
self.linear1 = nn.Linear(context_size * embedding_dim, hidden_dim)
self.linear2 = nn.Linear(hidden_dim, vocab_size)
self.activate = F.relu
def forward(self, inputs):
embeds = self.embeddings(inputs).view((inputs.shape[0], -1))
hidden = self.activate(self.linear1(embeds))
output = self.linear2(hidden)
log_probs = F.log_softmax(output, dim=1)
return log_probs
# 训练
embedding_dim = 128
hidden_dim = 256
batch_size = 1024
context_size = 3
num_epoch = 10
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
corpus, vocab = load_reuters()
dataset = NGramDataset(corpus, vocab, context_size)
data_loader = get_loader(dataset, batch_size)
nll_loss = nn.NLLLoss()
model = FeedForwardNNLM(len(vocab), embedding_dim, context_size, hidden_dim)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()
total_losses = []
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
inputs, targets = [x.to(device) for x in batch]
optimizer.zero_grad()
log_probs = model(inputs)
loss = nll_loss(log_probs, targets)
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss: {total_loss:.2f}")
total_losses.append(total_loss)
save_pretrained(vocab, model.embeddings.weight.data, "/home/ffnnlm.vec")
# 结果
# [nltk_data] Downloading package reuters to /root/nltk_data...
# [nltk_data] Package reuters is already up-to-date!
# [nltk_data] Downloading package punkt to /root/nltk_data...
# [nltk_data] Package punkt is already up-to-date!
# Data Construction: 100%
# 54716/54716 [00:03<00:00, 19224.51it/s]
# Training Epoch 0: 100%
# 1628/1628 [00:35<00:00, 34.02it/s]
# Loss: 8310.34
# Training Epoch 1: 100%
# 1628/1628 [00:36<00:00, 44.29it/s]
# Loss: 6934.16
# Training Epoch 2: 100%
# 1628/1628 [00:36<00:00, 44.31it/s]
# Loss: 6342.58
# Training Epoch 3: 100%
# 1628/1628 [00:37<00:00, 42.65it/s]
# Loss: 5939.16
# Training Epoch 4: 100%
# 1628/1628 [00:37<00:00, 42.70it/s]
# Loss: 5666.03
# Training Epoch 5: 100%
# 1628/1628 [00:38<00:00, 42.76it/s]
# Loss: 5477.37
# Training Epoch 6: 100%
# 1628/1628 [00:38<00:00, 42.18it/s]
# Loss: 5333.53
# Training Epoch 7: 100%
# 1628/1628 [00:38<00:00, 42.44it/s]
# Loss: 5214.55
# Training Epoch 8: 100%
# 1628/1628 [00:38<00:00, 42.16it/s]
# Loss: 5111.15
# Training Epoch 9: 100%
# 1628/1628 [00:38<00:00, 42.21it/s]
# Loss: 5021.05
# Pretrained embeddings saved to: /home/ffnnlm.vec
5.1.2.2 循环神经网络语言模型
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import nltk
from collections import defaultdict
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from tqdm.auto import tqdm
BOS_TOKEN = "<bos>"
EOS_TOKEN = "<eos>"
PAD_TOKEN = "<pad>"
nltk.download('reuters')
nltk.download('punkt')
# from zipfile import ZipFile
# file_loc = '/root/nltk_data/corpora/reuters.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
# file_loc = '/root/nltk_data/corpora/punkt.zip'
# with ZipFile(file_loc, 'r') as z:
# z.extractall('/root/nltk_data/corpora/')
class Vocab:
def __init__(self, tokens=None):
self.idx_to_token = list()
self.token_to_idx = dict()
if tokens is not None:
if "<unk>" not in tokens:
tokens = tokens + ["<unk>"]
for token in tokens:
self.idx_to_token.append(token)
self.token_to_idx[token] = len(self.idx_to_token) - 1
self.unk = self.token_to_idx['<unk>']
@classmethod
def build(cls, text, min_freq=1, reserved_tokens=None):
token_freqs = defaultdict(int)
for sentence in text:
for token in sentence:
token_freqs[token] += 1
uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
uniq_tokens += [token for token, freq in token_freqs.items() if freq >= min_freq and token != "<unk>"]
return cls(uniq_tokens)
def __len__(self):
# 返回词表的大小
return len(self.idx_to_token)
def __getitem__(self, token):
# 查找输入标记对应的索引值,如果该标记不存在,则返回标记<unk>的索引值(0)
return self.token_to_idx.get(token, self.unk)
def convert_tokens_to_ids(self, tokens):
return [self[token] for token in tokens]
def convert_ids_to_tokens(self, indices):
return [self.idx_to_token[index] for index in indices]
def get_loader(dataset, batch_size, shuffle=True):
data_loader = DataLoader(
dataset,
batch_size=batch_size,
collate_fn=dataset.collate_fn,
shuffle=shuffle
)
return data_loader
# 读取Reuters语料库
def load_reuters():
from nltk.corpus import reuters
text = reuters.sents()
text = [[word.lower() for word in sentence]for sentence in text]
vocab = Vocab.build(text, reserved_tokens=[PAD_TOKEN, BOS_TOKEN, EOS_TOKEN])
corpus = [vocab.convert_tokens_to_ids(sentence) for sentence in text]
return corpus, vocab
# 保存词向量
def save_pretrained(vocab, embeds, save_path):
"""
Save pretrained token vectors in a unified format, where the first line
specifies the `number_of_tokens` and `embedding_dim` followed with all
token vectors, one token per line.
"""
with open(save_path, "w") as writer:
writer.write(f"{embeds.shape[0]} {embeds.shape[1]}\n")
for idx, token in enumerate(vocab.idx_to_token):
vec = " ".join(["{:.4f}".format(x) for x in embeds[idx]])
writer.write(f"{token} {vec}\n")
print(f"Pretrained embeddings saved to: {save_path}")
class RnnlmDataset(Dataset):
def __init__(self, corpus, vocab):
self.data = []
self.bos = vocab[BOS_TOKEN]
self.eos = vocab[EOS_TOKEN]
self.pad = vocab[PAD_TOKEN]
for sentence in tqdm(corpus, desc="Dataset Construction"):
# 模型输入:BOS_TOKEN, w_1, w_2, ..., w_n
input = [self.bos] + sentence
# 模型输出:w_1, w_2, ..., w_n, EOS_TOKEN
target = sentence + [self.eos]
self.data.append((input, target))
def __len__(self):
return len(self.data)
def __getitem__(self, i):
return self.data[i]
def collate_fn(self, examples):
# 从独立样本集合中构建batch输入输出
inputs = [torch.tensor(ex[0]) for ex in examples]
targets = [torch.tensor(ex[1]) for ex in examples]
# 对batch内的样本进行padding,使其具有相同长度
inputs = pad_sequence(inputs, batch_first=True, padding_value=self.pad)
targets = pad_sequence(targets, batch_first=True, padding_value=self.pad)
return (inputs, targets)
class RNNLM(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim):
super(RNNLM, self).__init__()
# 词嵌入层
self.embeddings = nn.Embedding(vocab_size, embedding_dim)
# 循环神经网络:这里使用LSTM
self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
# 输出层
self.output = nn.Linear(hidden_dim, vocab_size)
def forward(self, inputs):
embeds = self.embeddings(inputs)
# 计算每一时刻的隐含层表示
hidden, _ = self.rnn(embeds)
output = self.output(hidden)
log_probs = F.log_softmax(output, dim=2)
return log_probs
embedding_dim = 64
context_size = 2
hidden_dim = 128
batch_size = 1024
num_epoch = 10
# 读取文本数据,构建RNNLM训练数据集
corpus, vocab = load_reuters()
dataset = RnnlmDataset(corpus, vocab)
data_loader = get_loader(dataset, batch_size)
# 负对数似然损失函数,忽略pad_token处的损失
nll_loss = nn.NLLLoss(ignore_index=dataset.pad)
# 构建RNNLM,并加载至device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = RNNLM(len(vocab), embedding_dim, hidden_dim)
model.to(device)
# 使用Adam优化器
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()
for epoch in range(num_epoch):
total_loss = 0
for batch in tqdm(data_loader, desc=f"Training Epoch {epoch}"):
inputs, targets = [x.to(device) for x in batch]
optimizer.zero_grad()
log_probs = model(inputs)
loss = nll_loss(log_probs.view(-1, log_probs.shape[-1]), targets.view(-1))
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Loss: {total_loss:.2f}")
save_pretrained(vocab, model.embeddings.weight.data, "/home/rnnlm.vec")
5.2 Word2vec词向量
5.2.1 CBOW
用周围词预测中心词
(1)输入层
窗口为5,输入层由4个维度为词表长度的独热表示向量构成
(2)词向量层
输入层中每个词 $w$ 的独热表示向量经由矩阵 $\boldsymbol{E}\in\mathbb{R}^{d\times|\mathbb{V}|}$ 映射至词向量空间:
$$\boldsymbol{v}_{w}=\boldsymbol{E}\boldsymbol{e}_{w}$$
$w$ 对应的词向量即为矩阵 $\boldsymbol{E}$ 中相应位置的列向量,$\boldsymbol{E}$ 则为由所有词向量构成的矩阵或查找表。令 $\mathcal{C}_t=\{w_{t-k},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+k}\}$ 表示 $w_t$ 的上下文单词集合,对 $\mathcal{C}_t$ 中所有词向量取平均,就得到了 $w_t$ 的上下文表示:
$$\boldsymbol{v}_{\mathcal{C}_t}=\frac{1}{|\mathcal{C}_t|}\sum_{w\in\mathcal{C}_t}\boldsymbol{v}_{w}$$
(3)输出层
令 $\boldsymbol{E}'\in\mathbb{R}^{|\mathbb{V}|\times d}$ 为隐含层到输出层的权值矩阵,记 $\boldsymbol{v}'_{w_t}$ 为 $\boldsymbol{E}'$ 中与 $w_t$ 对应的行向量,那么输出 $w_t$ 的概率可由下式计算:
$$P(w_t\mid\mathcal{C}_t)=\frac{\exp\left(\boldsymbol{v}_{\mathcal{C}_t}\cdot\boldsymbol{v}'_{w_t}\right)}{\sum_{w'\in\mathbb{V}}\exp\left(\boldsymbol{v}_{\mathcal{C}_t}\cdot\boldsymbol{v}'_{w'}\right)}$$
在CBOW模型的参数中,矩阵 $\boldsymbol{E}$(上下文矩阵)和 $\boldsymbol{E}'$(中心词矩阵)均可作为词向量矩阵,它们分别描述了词表中的词在作为条件上下文或目标词时的不同性质。在实际中,通常只用 $\boldsymbol{E}$ 就能够满足应用需求,但是在某些任务中,对两者进行组合得到的向量可能会取得更好的表现
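CBOW模型的核心前向计算可以用如下PyTorch代码示意(类名与超参数为假设,训练部分省略,与书中完整实现不完全相同):
import torch
import torch.nn as nn
import torch.nn.functional as F
class CbowModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CbowModel, self).__init__()
        # E:上下文(输入)词向量矩阵
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # E':中心词(输出)词向量矩阵,由输出层权重充当
        self.output = nn.Linear(embedding_dim, vocab_size, bias=False)
    def forward(self, contexts):            # contexts: (batch, 2 * window)
        embeds = self.embeddings(contexts)  # (batch, 2 * window, embedding_dim)
        hidden = embeds.mean(dim=1)         # 对上下文词向量取平均
        output = self.output(hidden)        # (batch, vocab_size)
        return F.log_softmax(output, dim=1)
model = CbowModel(vocab_size=1000, embedding_dim=64)
contexts = torch.randint(0, 1000, (8, 4))  # 窗口为5时,每个样本有4个上下文词
print(model(contexts).shape)               # torch.Size([8, 1000])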
5.2.2 Skip-gram模型
中心词预测周围
过程:根据当前时刻的中心词 $w_t$,预测其窗口内的上下文词 $w_{t+j}$:
$$P(w_{t+j}\mid w_t)=\frac{\exp\left(\boldsymbol{v}_{w_t}\cdot\boldsymbol{v}'_{w_{t+j}}\right)}{\sum_{w'\in\mathbb{V}}\exp\left(\boldsymbol{v}_{w_t}\cdot\boldsymbol{v}'_{w'}\right)}$$
式中,$-k\leqslant j\leqslant k$ 且 $j\neq 0$。
与CBOW模型类似,Skip-gram模型中的权值矩阵 $\boldsymbol{E}$(中心词矩阵)与 $\boldsymbol{E}'$(上下文矩阵)均可作为词向量矩阵使用。
5.2.3 参数估计
与神经网络语言模型类似,可以通过优化分类损失对CBOW模型和Skip-gram模型进行训练,需要估计的参数为 $\boldsymbol{\theta}=\{\boldsymbol{E},\boldsymbol{E}'\}$。例如,给定一段长为 $T$ 的词序列 $w_1 w_2\cdots w_T$
5.2.3.1 CBOW模型的负对数似然损失函数为:
$$\mathcal{L}(\boldsymbol{\theta})=-\sum_{t=1}^{T}\log P\left(w_t\mid\mathcal{C}_t\right)$$
式中,$\mathcal{C}_t=\{w_{t-k},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+k}\}$
5.2.3.2 Skip-gram模型的负对数似然损失函数为:
$$\mathcal{L}(\boldsymbol{\theta})=-\sum_{t=1}^{T}\sum_{-k\leqslant j\leqslant k,\,j\neq 0}\log P\left(w_{t+j}\mid w_t\right)$$
5.2.4 负采样
负采样(Negative Sampling)构造了一个新的有监督学习问题:给定两个单词,比如orange和juice,预测它们是否是一对“上下文词-目标词”(context-target),即这两个词是否会在一句话中相邻出现,这是一个二分类问题
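以Skip-gram为例,对于一个正样本 $(w_t, w_{t+j})$ 以及按噪声分布采样得到的 $K$ 个负样本 $\tilde{w}_1,\ldots,\tilde{w}_K$,负采样形式的损失函数通常写作(常见写法,供参考):
$$-\log\sigma\left(\boldsymbol{v}_{w_t}\cdot\boldsymbol{v}'_{w_{t+j}}\right)-\sum_{i=1}^{K}\log\sigma\left(-\boldsymbol{v}_{w_t}\cdot\boldsymbol{v}'_{\tilde{w}_i}\right)$$
式中,$\sigma$ 为Sigmoid函数;负样本一般按一元语言模型分布的3/4次方进行采样。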