Text Classification
Working through a binary text-classification problem.
- Start with labeled text, one label per review, as shown below:
image.png
- Find the correlation between the labels and the text; this correlation is what the model will later learn. A good first step is to build some understanding of the text itself. The simplest idea is to count word frequencies and see which words lean toward one class or the other.
The simplest way to count word frequencies in Python:
from collections import Counter
import numpy as np
# Build one counter per class, using the classes shown above
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()
# Split each review into words and add them to the counter for its class
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
# Use the counters' most_common() method to inspect the most frequent words
positive_counts.most_common()
negative_counts.most_common()
But as the figure below shows, words like "the" are common in both positive and negative reviews:
image.png
The point of counting word frequencies, though, is to find words that appear noticeably more often in positive reviews than in negative ones (and vice versa). To do that, we compute the ratio of positive to negative usage for each word:
pos_neg_ratios = Counter()

# Calculate the ratios of positive and negative uses of the most common words
# Consider words to be "common" if they've been used more than 100 times
for term,cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio
image.png
From the figure above we can read off the following:
- Words we expect to see more often in positive reviews, like "amazing", have a ratio greater than 1. The more a word leans positive, the farther its ratio sits above 1.
- Words we expect to see more often in negative reviews, like "terrible", have a ratio less than 1. The more a word leans negative, the closer its ratio sits to 0.
- Neutral words that convey no real sentiment, like "the", appear in all kinds of reviews, so their ratio sits very close to 1.
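As a quick sanity check, the ratio computation can be run on made-up counts. The numbers below are purely hypothetical, chosen only to illustrate the three cases above:

```python
from collections import Counter

# Hypothetical word counts per class (not real data)
positive_counts = Counter({"amazing": 100, "terrible": 5, "the": 1000})
negative_counts = Counter({"amazing": 10, "terrible": 120, "the": 980})

pos_neg_ratios = Counter()
for term in positive_counts:
    # the +1 in the denominator guards against words that
    # never appear in negative reviews
    pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)

print(pos_neg_ratios["amazing"])   # well above 1
print(pos_neg_ratios["terrible"])  # close to 0
print(pos_neg_ratios["the"])       # close to 1
```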
Comparing these raw ratios directly has a problem:
- Positive words have ratios spread across (1, ∞) while negative words are squeezed into (0, 1), so the magnitudes are not comparable: there is no direct way to tell whether one word conveys positive sentiment as strongly as another conveys negative sentiment. We therefore want to center all the values around 0.
To achieve this, transform every ratio with the logarithm:
# Convert ratios to logs
for word,ratio in pos_neg_ratios.most_common():
    pos_neg_ratios[word] = np.log(ratio)
print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))
image.png
Now the neutral words sit near 0: "the" is close to zero but slightly positive, so it may appear a bit more in positive reviews than in negative ones. "amazing" has a value above 1, clearly a positive signal, while "terrible" has a similar score in the opposite direction, below -1.
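The symmetry the log transform buys can be verified directly. The specific ratios 4 and 1/4 below are just illustrative:

```python
import numpy as np

# A word used 4x more in positive reviews and a word used 4x more in
# negative reviews end up equidistant from 0 after the log transform
print(np.log(4.0))    # ~1.386
print(np.log(0.25))   # ~-1.386
print(np.log(1.0))    # 0.0, i.e. a perfectly neutral word lands exactly at 0
```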
Converting Text into Numbers
We have just verified the theory that individual words, rather than entire reviews, best predict the positive/negative label. Now we convert the dataset into numbers so a neural network can exploit exactly this: feed the words in as input, let the network search for correlations, and have it make the correct positive/negative prediction. The simplest starting point is to count each word and pass those counts in as input values; these values should correlate with what we want to predict. Since a network cannot directly predict the strings "positive" and "negative", we represent the labels numerically.
image.png
image.png
The number 1 represents POSITIVE and the number 0 represents NEGATIVE. We encode both outcomes in a single output neuron because the two classes are mutually exclusive; this prevents the trained network from concluding that a review is simultaneously positive and negative.
Building the Input/Output Data
- Build the set of all words that appear across the reviews:
vocab = set(list(total_counts.keys()))
vocab_size = len(vocab)
print(vocab_size)
----------------------------
74074
image.png
- layer_0 is the input layer
- layer_1 is a hidden layer
- layer_2 is the output layer
- Create a numpy array, layer_0, initialized to all zeros:
layer_0 = np.zeros((1,vocab_size))
layer_0.shape
-----------------------------------
(1, 74074)
- layer_0 contains one entry for every word in the vocabulary, as shown above. We need to know each word's index, so we create a lookup table that stores the index of every word:
# Create a dictionary of words in the vocabulary mapped to index positions
# (to be used in layer_0)
word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i
# display the map of words to indices
word2index
image.png
- The first function to build takes a review, extracts its words, counts them, and puts the result into a vector. That vector must have a fixed length equal to the vocabulary size:
def update_input_layer(review):
    """ Modify the global layer_0 to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
        review(string) - the string of the review
    Returns:
        None
    """
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    # count how many times each word is used in the given review and store the results in layer_0
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1
update_input_layer(reviews[0])
layer_0
-----------------------------------------
array([[18., 0., 0., ..., 0., 0., 0.]])
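The same counting idea can be checked as a standalone sketch, free of the global `layer_0` state. The tiny vocabulary and review below are made up for illustration:

```python
import numpy as np

vocab = ["the", "movie", "was", "amazing"]
word2index = {word: i for i, word in enumerate(vocab)}

def review_to_vector(review):
    # one slot per vocabulary word, holding that word's count in the review
    vector = np.zeros((1, len(vocab)))
    for word in review.split(" "):
        if word in word2index:
            vector[0][word2index[word]] += 1
    return vector

print(review_to_vector("the movie was amazing the"))  # [[2. 1. 1. 1.]]
```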
- Build a second function to map POSITIVE and NEGATIVE onto 1 and 0:
def get_target_for_label(label):
    """Convert a label to `0` or `1`.
    Args:
        label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
        `0` or `1`.
    """
    if label == 'POSITIVE':
        return 1
    else:
        return 0
Building a Neural Network
We use a three-layer network but remove the nonlinearity from the hidden layer. With the functions created above we can quickly generate training data: each review/label pair is converted into the input and output vectors we need. What remains is to build a class that preprocesses the data and makes predictions.
import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        """
        # Assign a seed to our random number generator to ensure we get
        # reproducible results during development
        np.random.seed(1)

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels)

        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)
    def pre_process_data(self, reviews, labels):
        review_vocab = set()
        # populate review_vocab with all of the words in the given reviews,
        # splitting each review into individual words with split(" ")
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)

        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)

        label_vocab = set()
        # populate label_vocab with all of the given labels; no splitting is
        # needed because each label is a single word
        for label in labels:
            label_vocab.add(label)

        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)

        # Store the sizes of the review and label vocabularies; the number of
        # unique words determines the number of inputs, and the number of
        # unique labels determines the number of outputs
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)

        # Create a dictionary of words in the vocabulary mapped to index
        # positions, assigning a fixed index to every input word
        self.word2index = {}
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i

        # Create a dictionary of labels mapped to index positions,
        # assigning a fixed index to every output label
        self.label2index = {}
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
    # Initialize the number of input nodes (i.e. features), hidden nodes,
    # and output nodes, along with the learning rate
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Store the number of nodes in input, hidden, and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize the weights between the input layer and the hidden layer
        # as a matrix of zeros; its rows are determined by the number of
        # features and its columns by the number of hidden nodes
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))

        # Initialize the weights between the hidden layer and the output layer
        # as random values drawn from a normal distribution with standard
        # deviation output_nodes**-0.5
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, (self.hidden_nodes, self.output_nodes))

        # Create the input layer, a two-dimensional matrix with shape
        # 1 x input_nodes, with all values initialized to zero;
        # it will be updated from each review later
        self.layer_0 = np.zeros((1,input_nodes))
    # Given a review, update the input layer: for every word that appears in
    # the review, update that word's slot in the feature vector
    def update_input_layer(self,review):
        """ Modify self.layer_0 to represent the vector form of review.
        The element at a given index of layer_0 should represent
        how many times the given word occurs in the review.
        Args:
            review(string) - the string of the review
        Returns:
            None
        """
        # clear out previous state, reset the layer to be all 0s
        self.layer_0 *= 0
        for word in review.split(' '):
            if word in self.word2index.keys():
                self.layer_0[0][self.word2index[word]] += 1

    # Map an output label to the corresponding number
    def get_target_for_label(self,label):
        """
        Convert a label to `0` or `1`.
        Args:
            label(string) - Either "POSITIVE" or "NEGATIVE".
        Returns:
            `0` or `1`.
        """
        if label == 'POSITIVE':
            return 1
        else:
            return 0

    # The sigmoid activation function
    def sigmoid(self,x):
        return 1/(1+np.exp(-x))

    # The derivative of the sigmoid activation function,
    # where "output" is the original output from the sigmoid function
    def sigmoid_output_2_derivative(self,output):
        return output*(1-output)
    def train(self, training_reviews, training_labels):
        # make sure we have a matching number of reviews and labels
        assert(len(training_reviews) == len(training_labels))

        # Keep track of correct predictions to display accuracy during training
        correct_so_far = 0

        # Remember when we started for printing time statistics
        start = time.time()

        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            # Get the next review and its correct label
            review = training_reviews[i]
            label = training_labels[i]

            ### Forward pass ###
            # Use the given review to update the input layer, then calculate
            # values for the hidden layer, and finally the output layer.
            # No activation function is used on the hidden layer; the output
            # layer uses the sigmoid activation function.
            # Input layer
            self.update_input_layer(review)
            # Hidden layer
            layer_1 = self.layer_0.dot(self.weights_0_1)
            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

            ### Backward pass ###
            # Output error: the difference between the actual output
            # and the desired target
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)

            # Backpropagated error: errors propagated to the hidden layer
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T)
            # hidden layer gradients; no nonlinearity, so it's the same as the error
            layer_1_delta = layer_1_error

            # Update the weights with a gradient descent step
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate

            # Keep track of correct predictions: the prediction counts as
            # correct when the absolute value of the output error is below 0.5
            if(layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            elif(layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1

            # For debug purposes, print out our prediction accuracy and speed
            # throughout the training process.
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the testing_labels to calculate the accuracy of those predictions.
        """
        # keep track of how many correct predictions we make
        correct = 0

        # we'll time how many predictions per second we make
        start = time.time()

        # Loop through each of the given reviews and call run to predict its label
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1

            # For debug purposes, print out our prediction accuracy and speed
            # throughout the prediction process.
            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        # Run a forward pass through the network, like in the train function.
        # Note: the review passed in for prediction might come from anywhere,
        # so convert it to lower case before using it.
        # Input layer
        self.update_input_layer(review.lower())
        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)
        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))

        # Return POSITIVE for output values greater than or equal to 0.5;
        # return NEGATIVE for all other values
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
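Stripped of the class plumbing, a single forward pass reduces to two matrix products. The sketch below (with made-up layer sizes) also shows why the untrained network should score around 50%: with weights_0_1 initialized to zeros, the sigmoid output starts at exactly 0.5 for every review.

```python
import numpy as np

np.random.seed(1)
vocab_size, hidden_nodes, output_nodes = 20, 10, 1

layer_0 = np.zeros((1, vocab_size))
layer_0[0][3] = 2  # pretend word #3 occurred twice in some review

weights_0_1 = np.zeros((vocab_size, hidden_nodes))
weights_1_2 = np.random.normal(0.0, output_nodes ** -0.5,
                               (hidden_nodes, output_nodes))

layer_1 = layer_0.dot(weights_0_1)                     # hidden layer: no activation
layer_2 = 1 / (1 + np.exp(-layer_1.dot(weights_1_2)))  # output layer: sigmoid

print(layer_1.shape, layer_2.shape)  # (1, 10) (1, 1)
print(layer_2[0][0])                 # 0.5, so an untrained network just guesses
```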
Run the following cell to create a SentimentNetwork that will train on all but the last 1000 reviews (we're saving those for testing). Here we use a learning rate of 0.1.
mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)
Run the following cell to test the network's performance against the last 1000 reviews (the ones we held out from our training set). We have not trained the model yet, so the results should be about 50%, since it is just guessing between the two possible labels.
mlp.test(reviews[-1000:],labels[-1000:])
image.png
Run the following cell to actually train the network. During training, it will repeatedly display the model's accuracy so you can see how well it's doing.
mlp.train(reviews[:-1000],labels[:-1000])
image.png
- Rebuild the network with a smaller learning rate, 0.01, and then train the new network.
- Rebuild the network with an even smaller learning rate, 0.001, and then train the new network.
image.png