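Machine Learning

Text Classification

2019-02-09  Read by 32 people  IntoTheVoid

Tackling a binary text classification problem

The simplest way to count word frequencies in Python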

from collections import Counter
import numpy as np
# Create a counter for each class (POSITIVE and NEGATIVE), plus one for all words combined
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()
# Split each review on spaces and add its words to the counter for the review's class
for i in range(len(reviews)):
    if(labels[i] == 'POSITIVE'):
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
# Use the counter's most_common() method to view the most frequent words
positive_counts.most_common()
negative_counts.most_common()
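
To see what the three counters end up holding, here is a minimal sketch on made-up data. The `toy_reviews` and `toy_labels` lists are hypothetical stand-ins for `reviews` and `labels`, whose loading this post does not show.

```python
from collections import Counter

# Hypothetical toy data standing in for the real reviews/labels lists
toy_reviews = ["the movie was amazing", "the plot was terrible", "amazing cast"]
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

toy_positive = Counter()
toy_negative = Counter()
toy_total = Counter()

for review, label in zip(toy_reviews, toy_labels):
    words = review.split(" ")
    # Route the words to the counter that matches the review's class
    target = toy_positive if label == "POSITIVE" else toy_negative
    target.update(words)
    toy_total.update(words)

print(toy_positive.most_common(1))  # [('amazing', 2)]
print(toy_total["the"])             # 2
```

`Counter.update` with a list of words adds one per occurrence, so it does the same job as the explicit `+= 1` loops above.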

However, as the figure below shows, a word like "the" counts as a common word in both positive and negative reviews.

image.png

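What we actually want from the word counts is the set of words that appear clearly more often in positive reviews than in negative ones. To find them, we compute the ratio of each word's usage between positive and negative reviews.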

pos_neg_ratios = Counter()

# Calculate the ratios of positive and negative uses of the most common words
# Consider words to be "common" if they've been used more than 100 times
for term,cnt in list(total_counts.most_common()):
    if(cnt > 100):
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term]+1)
        pos_neg_ratios[term] = pos_neg_ratio
image.png

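From the figure above we can read the ratios as follows: since each ratio is the positive count divided by the negative count plus 1, a word used mostly in positive reviews has a ratio well above 1, a neutral word has a ratio close to 1, and a word used mostly in negative reviews has a ratio between 0 and 1.

Comparing these raw ratios directly has its own problems, though: the neutral point sits at 1 rather than 0, and positive words spread over everything above 1 while negative words are squeezed between 0 and 1, so equally strong positive and negative words do not get values of similar magnitude.

To fix this, we transform every ratio with the logarithm function.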

# Convert ratios to logs
for word,ratio in pos_neg_ratios.most_common():
    pos_neg_ratios[word] = np.log(ratio)

print("Pos-to-neg ratio for 'the' = {}".format(pos_neg_ratios["the"]))
print("Pos-to-neg ratio for 'amazing' = {}".format(pos_neg_ratios["amazing"]))
print("Pos-to-neg ratio for 'terrible' = {}".format(pos_neg_ratios["terrible"]))
image.png

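Now neutral words have values close to 0. "the" is close to zero but slightly positive, so it is probably used a little more in positive reviews than in negative ones. "amazing" has a value above 1, clearly a positive sentiment, while "terrible" has a similar score in the opposite direction, so it is below -1.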
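
The effect of the logarithm is easiest to see on a few round numbers. The sketch below uses hypothetical counts (not taken from the real dataset) chosen so that the ratios come out as 4, 1 and 0.25.

```python
import math

# Hypothetical counts per word: (uses in positive reviews, uses in negative reviews)
toy_counts = {"amazing": (400, 99), "the": (100, 99), "terrible": (100, 399)}

log_ratios = {}
for word, (pos, neg) in toy_counts.items():
    ratio = pos / float(neg + 1)          # same +1 in the denominator as above
    log_ratios[word] = math.log(ratio)
    print(word, ratio, round(log_ratios[word], 2))

# amazing 4.0 1.39
# the 1.0 0.0
# terrible 0.25 -1.39
```

On the raw scale 4 and 0.25 look very different in size, yet one is just the reciprocal of the other. After the log they become 1.39 and -1.39, and the neutral ratio 1 lands exactly on 0.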

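Converting text to numbers

We have now validated our theory that individual words predict the positive/negative label better than the review taken as a whole. Next we turn the dataset into numbers so that a neural network can look for this correlation: the words go in as input, and the network learns to make the correct positive/negative prediction. The simplest way to start is to count each word and feed those counts to the network as input values, since these values should correlate with the result we want to predict. The network also cannot predict the labels "positive" and "negative" directly, so we represent them as numbers too.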

image.png image.png

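The number 1 stands for positive and 0 for negative. Both outcomes are decided by a single output neuron because positive and negative are mutually exclusive; this prevents the trained network from concluding that a review is both positive and negative.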

Building the input/output data
vocab = set(list(total_counts.keys()))
vocab_size = len(vocab)
print(vocab_size)
----------------------------
74074
image.png
layer_0 = np.zeros((1,vocab_size))
layer_0.shape
-----------------------------------
(1, 74074)
# Create a dictionary of words in the vocabulary mapped to index positions
# (to be used in layer_0)
word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i
    
# display the map of words to indices
word2index
image.png
def update_input_layer(review):
    """ Modify the global layer_0 to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
        review(string) - the string of the review
    Returns:
        None
    """
     
    global layer_0
    
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    
    # count how many times each word is used in the given review and store the results in layer_0 
    for word in review.split(" "):
        layer_0[0][word2index[word]] += 1

update_input_layer(reviews[0])
layer_0
-----------------------------------------
array([[18.,  0.,  0., ...,  0.,  0.,  0.]])
def get_target_for_label(label):
    """Convert a label to `0` or `1`.
    Args:
        label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
        `0` or `1`.
    """
    # Map POSITIVE to 1 and any other label to 0
    if label ==  'POSITIVE':
        return 1
    else:
        return 0
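
Putting the two helpers together, one review and its label become one count vector and one 0/1 target. The sketch below is a self-contained illustration on a hypothetical three-word vocabulary; `toy_vocab`, `toy_word2index` and `encode` are names invented here, not part of the original notebook.

```python
import numpy as np

# Hypothetical three-word vocabulary; sorting makes the index order deterministic
toy_vocab = sorted({"amazing", "terrible", "the"})
toy_word2index = {word: i for i, word in enumerate(toy_vocab)}

def encode(review, label):
    """Return (count vector of shape (1, vocab_size), 0/1 target)."""
    vector = np.zeros((1, len(toy_vocab)))
    for word in review.split(" "):
        if word in toy_word2index:          # skip out-of-vocabulary words
            vector[0][toy_word2index[word]] += 1
    target = 1 if label == "POSITIVE" else 0
    return vector, target

x, y = encode("the amazing the film", "POSITIVE")
print(x, y)  # [[1. 0. 2.]] 1
```

Two details differ from the code above on purpose. Iterating over a `set` of strings gives no guaranteed order between runs, so this sketch sorts the vocabulary first. It also skips words that are not in the vocabulary, whereas the global `update_input_layer` above would raise a `KeyError` for them.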

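Building a neural network

We use a three-layer neural network with no nonlinearity in the hidden layer. The functions created earlier let us generate training data quickly: each review and label passed in is converted into the input vector and output target we need. What remains is to write the functions that preprocess the data and make predictions.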

import time
import sys
import numpy as np

# Encapsulate our neural network in a class
class SentimentNetwork:
    def __init__(self, reviews, labels, hidden_nodes = 10, learning_rate = 0.1):
        """Create a SentimentNetwork with the given settings
        Args:
            reviews(list) - List of reviews used for training
            labels(list) - List of POSITIVE/NEGATIVE labels associated with the given reviews
            hidden_nodes(int) - Number of nodes to create in the hidden layer
            learning_rate(float) - Learning rate to use while training
        
        """
        # Assign a seed to our random number generator to ensure we get
        # reproducible results during development 
        np.random.seed(1)

        # process the reviews and their associated labels so that everything
        # is ready for training
        self.pre_process_data(reviews, labels)
        
        # Build the network to have the number of hidden nodes and the learning rate that
        # were passed into this initializer. Make the same number of input nodes as
        # there are vocabulary words and create a single output node.
        self.init_network(len(self.review_vocab),hidden_nodes, 1, learning_rate)

    def pre_process_data(self, reviews, labels):
        
        review_vocab = set()
        # TODO: populate review_vocab with all of the words in the given reviews
        #       Remember to split reviews into individual words 
        #       using "split(' ')" instead of "split()".
        # Collect every word that occurs in the given reviews into a set
        for review in reviews:
            for word in review.split(" "):
                review_vocab.add(word)
        # Convert the vocabulary set to a list so we can access words via indices
        self.review_vocab = list(review_vocab)
        
        label_vocab = set()
        # TODO: populate label_vocab with all of the words in the given labels.
        #       There is no need to split the labels because each one is a single word.
        # Collect the distinct label values into a set
        for label in labels:
            label_vocab.add(label)
        # Convert the label vocabulary set to a list so we can access labels via indices
        self.label_vocab = list(label_vocab)
        
        # Store the sizes of the review and label vocabularies.
        # The number of inputs is the number of unique words in the reviews;
        # the number of distinct outputs is the number of unique labels
        self.review_vocab_size = len(self.review_vocab)
        self.label_vocab_size = len(self.label_vocab)
        
        # Create a dictionary of words in the vocabulary mapped to index positions
        # Assign a fixed index to every input word
        self.word2index = {}
        # TODO: populate self.word2index with indices for all the words in self.review_vocab
        #       like you saw earlier in the notebook
        for i, word in enumerate(self.review_vocab):
            self.word2index[word] = i
        # Create a dictionary of labels mapped to index positions
        # Assign a fixed index to every output label
        self.label2index = {}
        # TODO: do the same thing you did for self.word2index and self.review_vocab, 
        #       but for self.label2index and self.label_vocab instead
        for i, label in enumerate(self.label_vocab):
            self.label2index[label] = i
    # Set the number of input nodes (one per feature), hidden nodes and output nodes, and the learning rate
    def init_network(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        # Store the number of nodes in input, hidden, and output layers.
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes

        # Store the learning rate
        self.learning_rate = learning_rate

        # Initialize weights
        # Input-to-hidden weight matrix: rows = number of features (input nodes), columns = number of hidden nodes
        # TODO: initialize self.weights_0_1 as a matrix of zeros. These are the weights between
        #       the input layer and the hidden layer.
        self.weights_0_1 = np.zeros((self.input_nodes, self.hidden_nodes))
        # TODO: initialize self.weights_1_2 as a matrix of random values. 
        #       These are the weights between the hidden layer and the output layer.
        self.weights_1_2 = np.random.normal(0.0, self.output_nodes**-0.5, (self.hidden_nodes, self.output_nodes))
        # Initialize every input-layer feature value to 0; they are updated later for each review
        # TODO: Create the input layer, a two-dimensional matrix with shape 
        #       1 x input_nodes, with all values initialized to zero
        self.layer_0 = np.zeros((1,input_nodes))
    
    # Given one review, update the input features: add each word's occurrences to its slot in the count vector
    def update_input_layer(self,review):
        """ Modify self.layer_0 to represent the vector form of review.
            The element at a given index of layer_0 should represent
            how many times the given word occurs in the review.
        Args:
            review(string) - the string of the review
        Returns:
            None
        """
        self.layer_0 *= 0
        for word in review.split(' '):
            if word in self.word2index:
                self.layer_0[0][self.word2index[word]] += 1
    # Map the label to its numeric target
    def get_target_for_label(self,label):
        """
        Convert a label to `0` or `1`.
        Args:
            label(string) - Either "POSITIVE" or "NEGATIVE".
        Returns:
            `0` or `1`.
        """ 
        if label == 'POSITIVE':
            return 1
        else:
            return 0
    # Activation function
    def sigmoid(self,x):
        # TODO: Return the result of calculating the sigmoid activation function
        #       shown in the lectures
        return 1/(1+np.exp(-x))
    # Derivative of the activation function
    def sigmoid_output_2_derivative(self,output):
        # TODO: Return the derivative of the sigmoid activation function, 
        #       where "output" is the original output from the sigmoid function 
        return output*(1-output)

    def train(self, training_reviews, training_labels):
        
        # make sure we have a matching number of reviews and labels
        assert(len(training_reviews) == len(training_labels))
        
        # Keep track of correct predictions to display accuracy during training 
        correct_so_far = 0
        
        # Remember when we started for printing time statistics
        start = time.time()

        # loop through all the given reviews and run a forward and backward pass,
        # updating weights for every item
        for i in range(len(training_reviews)):
            
            # TODO: Get the next review and its correct label
            review = training_reviews[i]
            label = training_labels[i]
            # TODO: Implement the forward pass through the network. 
            #       That means use the given review to update the input layer, 
            #       then calculate values for the hidden layer,
            #       and finally calculate the output layer.
            # 
            #       Do not use an activation function for the hidden layer,
            #       but use the sigmoid activation function for the output layer.
            # Input layer
            self.update_input_layer(review)
            # Hidden layer (no activation function)
            layer_1 = self.layer_0.dot(self.weights_0_1)
            # Output layer
            layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
            # TODO: Implement the back propagation pass here. 
            #       That means calculate the error for the forward pass's prediction
            #       and update the weights in the network according to their
            #       contributions toward the error, as calculated via the
            #       gradient descent and back propagation algorithms you 
            #       learned in class
            # Output error: difference between the actual output and the desired target
            layer_2_error = layer_2 - self.get_target_for_label(label)
            layer_2_delta = layer_2_error * self.sigmoid_output_2_derivative(layer_2)
            
            # Backpropagated error
            layer_1_error = layer_2_delta.dot(self.weights_1_2.T) # errors propagated to the hidden layer
            layer_1_delta = layer_1_error # hidden layer gradients - no nonlinearity so it's the same as the error

            # Update the weights
            self.weights_1_2 -= layer_1.T.dot(layer_2_delta) * self.learning_rate # update hidden-to-output weights with gradient descent step
            self.weights_0_1 -= self.layer_0.T.dot(layer_1_delta) * self.learning_rate # update input-to-hidden weights with gradient descent step
            
            # TODO: Keep track of correct predictions. To determine if the prediction was
            #       correct, check that the absolute value of the output error 
            #       is less than 0.5. If so, add one to the correct_so_far count.
            if (layer_2 < 0.5 and label == 'NEGATIVE'):
                correct_so_far += 1
            elif (layer_2 >= 0.5 and label == 'POSITIVE'):
                correct_so_far += 1
                
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the training process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(training_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct_so_far) + " #Trained:" + str(i+1) \
                             + " Training Accuracy:" + str(correct_so_far * 100 / float(i+1))[:4] + "%")
            if(i % 2500 == 0):
                print("")
    
    def test(self, testing_reviews, testing_labels):
        """
        Attempts to predict the labels for the given testing_reviews,
        and uses the testing_labels to calculate the accuracy of those predictions.
        """
        
        # keep track of how many correct predictions we make
        correct = 0

        # we'll time how many predictions per second we make
        start = time.time()

        # Loop through each of the given reviews and call run to predict
        # its label. 
        for i in range(len(testing_reviews)):
            pred = self.run(testing_reviews[i])
            if(pred == testing_labels[i]):
                correct += 1
            
            # For debug purposes, print out our prediction accuracy and speed 
            # throughout the prediction process. 

            elapsed_time = float(time.time() - start)
            reviews_per_second = i / elapsed_time if elapsed_time > 0 else 0
            
            sys.stdout.write("\rProgress:" + str(100 * i/float(len(testing_reviews)))[:4] \
                             + "% Speed(reviews/sec):" + str(reviews_per_second)[0:5] \
                             + " #Correct:" + str(correct) + " #Tested:" + str(i+1) \
                             + " Testing Accuracy:" + str(correct * 100 / float(i+1))[:4] + "%")
    
    def run(self, review):
        """
        Returns a POSITIVE or NEGATIVE prediction for the given review.
        """
        # TODO: Run a forward pass through the network, like you did in the
        #       "train" function. That means use the given review to 
        #       update the input layer, then calculate values for the hidden layer,
        #       and finally calculate the output layer.
        #
        #       Note: The review passed into this function for prediction 
        #             might come from anywhere, so you should convert it 
        #             to lower case prior to using it.
        # Input Layer
        self.update_input_layer(review.lower())

        # Hidden layer
        layer_1 = self.layer_0.dot(self.weights_0_1)

        # Output layer
        layer_2 = self.sigmoid(layer_1.dot(self.weights_1_2))
        
        # Return POSITIVE for values greater than or equal to 0.5 in the output layer;
        # return NEGATIVE for other values
        if(layer_2[0] >= 0.5):
            return "POSITIVE"
        else:
            return "NEGATIVE"
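
The weight updates in `train` can be read as one gradient descent step on the squared error 0.5 * (output - target)^2 for a single review. A finite-difference check is one way to convince yourself of that. The sketch below is standalone and works on made-up numbers: the tiny shapes, the random input-to-hidden weights and the `loss` helper are assumptions for illustration, not part of the class above.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(w01, w12, x, target):
    """Squared error 0.5 * (output - target)^2 for one example."""
    out = sigmoid(x.dot(w01).dot(w12))
    return 0.5 * ((out - target) ** 2).item()

rng = np.random.RandomState(1)
x = np.array([[2.0, 0.0, 1.0]])        # toy word-count vector, shape (1, 3)
w01 = rng.normal(0.0, 0.5, (3, 2))     # input -> hidden (random here so the check is non-trivial)
w12 = rng.normal(0.0, 0.5, (2, 1))     # hidden -> output
target = 1

# Analytic gradients, written the same way as in train()
layer_1 = x.dot(w01)
layer_2 = sigmoid(layer_1.dot(w12))
layer_2_delta = (layer_2 - target) * layer_2 * (1 - layer_2)
grad_w12 = layer_1.T.dot(layer_2_delta)
grad_w01 = x.T.dot(layer_2_delta.dot(w12.T))

# Numerical gradient of a single input-to-hidden weight via central differences
eps = 1e-6
w_plus, w_minus = w01.copy(), w01.copy()
w_plus[0, 1] += eps
w_minus[0, 1] -= eps
numeric = (loss(w_plus, w12, x, target) - loss(w_minus, w12, x, target)) / (2 * eps)

print(abs(numeric - grad_w01[0, 1]) < 1e-6)  # True
```

Because the hidden layer has no activation function, the error reaching it is used unchanged as its gradient, which is why `layer_1_delta` equals `layer_1_error` in `train`.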
        

Run the following cell to create a SentimentNetwork that will train on all but the last 1000 reviews (we're saving those for testing). Here we use a learning rate of 0.1.

mlp = SentimentNetwork(reviews[:-1000],labels[:-1000], learning_rate=0.1)
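
The negative slice indices are what keep the last 1000 reviews out of training. A minimal sketch of the same slicing on a hypothetical five-item list, holding out the last 2 items instead of the last 1000:

```python
# Hypothetical list of 5 items; hold out the last 2
items = ["r0", "r1", "r2", "r3", "r4"]
holdout = 2

train_part = items[:-holdout]   # everything except the last 2
test_part = items[-holdout:]    # only the last 2

print(train_part, test_part)    # ['r0', 'r1', 'r2'] ['r3', 'r4']
```

Given the `train` and `test` signatures defined above, the matching calls would presumably be `mlp.train(reviews[:-1000], labels[:-1000])` and `mlp.test(reviews[-1000:], labels[-1000:])`. Note that the vocabulary is built only from the reviews passed to the constructor, so words that appear only in the held-out reviews are skipped by `update_input_layer`.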

Run the following cell to test the network's performance against the last 1000 reviews (the ones we held out from our training set). We have not trained the model yet, so the results should be about 50% as it will just be guessing and there are only two possible values to choose from.

image.png
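
The roughly 50% figure follows from the initial weights rather than from random guessing. `init_network` sets the input-to-hidden weights to zero, so the hidden layer is all zeros and the output is sigmoid(0) for any review. A minimal standalone sketch with toy shapes (the three-word input vector is hypothetical):

```python
import numpy as np

layer_0 = np.array([[3.0, 0.0, 1.0]])     # any toy word-count vector
weights_0_1 = np.zeros((3, 2))            # zero-initialized, as in init_network
weights_1_2 = np.random.RandomState(1).normal(0.0, 1.0, (2, 1))

layer_1 = layer_0.dot(weights_0_1)                       # all zeros
layer_2 = 1 / (1 + np.exp(-layer_1.dot(weights_1_2)))    # sigmoid(0)

print(layer_2)  # [[0.5]]
```

Since `run` returns POSITIVE for any output of at least 0.5, the untrained network labels every review POSITIVE. Its accuracy is then simply the share of positive reviews in the held-out set, which is about 50% assuming that set is balanced.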

Run the following cell to actually train the network. During training, it will display the model's accuracy repeatedly as it trains so you can see how well it's doing.

image.png image.png