
Deep Learning for Text and Sequences: Classifying IMDB Movie Reviews with Keras

2018-06-29

【Application scenarios】
Text and sequence data come up in many deep learning applications; sentiment classification of movie reviews, the task in this post, is one of them.

【Working with text data】
A deep learning model never consumes raw text directly: the text first has to be converted into numeric tensors, a step called vectorization. There are several ways to do this; the sections below walk through one-hot encoding at the word and character level, the hashing trick, and finally word embeddings.

Two terms come up constantly in text processing: a token is the unit a text is broken into (a word, a character, or an n-gram), and tokenization is the process of splitting text into tokens.

【Two ways to encode tokens】

Word-level one-hot encoding

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build a word index: {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6, 'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}
# i.e. collect every unique word in the samples and give it an integer index -- a tiny vocabulary of 10 entries.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# Only the first max_length words of each sample are vectorized
max_length = 10
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))

# results has shape (2, 10, 11): samples x word positions x (max index + 1)
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
print(results)
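As a quick sanity check (a minimal sketch reusing the results and token_index defined above), you can confirm the tensor's shape and recover the word indices of the first sample:

print(results.shape)  # (2, 10, 11)
# argmax over the last axis recovers each word's index (unused positions come back as 0)
print(results[0].argmax(axis=-1))  # [1 2 3 4 5 6 0 0 0 0]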

Character-level one-hot encoding

import numpy as np
import string
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# string.printable is a predefined character set: digits, ASCII letters, punctuation and whitespace (100 characters in total)
characters = string.printable
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample):
        # Look up the character's index directly instead of scanning the whole dict
        index = token_index.get(character)
        results[i, j, index] = 1.


print(results)
(Screenshot of part of the output omitted; for these two samples results has shape (2, 50, 101).)

Keras can also do word-level one-hot encoding for you, and its Tokenizer wraps all of this up very nicely.


from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Create a tokenizer configured to keep only the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)

# Turn each sample into a list of integer word indices
sequences = tokenizer.texts_to_sequences(samples)
# Or get the binary one-hot matrix representation directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
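To see what the Tokenizer actually produced, here is a minimal sketch reusing the variables above (the reverse lookup dict is built here only for illustration):

print(sequences)              # e.g. [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
print(one_hot_results.shape)  # (2, 1000)

# Invert word_index so indices can be mapped back to words
reverse_word_index = {index: word for word, index in word_index.items()}
print(' '.join(reverse_word_index[i] for i in sequences[0]))  # 'the cat sat on the mat'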

A variation on one-hot encoding is the one-hot hashing trick. Instead of explicitly assigning an index to every token, each word is hashed into a fixed-size index space. This is useful when the number of distinct tokens in the vocabulary is too large to handle explicitly. The main drawback is hash collisions: two different words can map to the same hash value, and the network then has no way to tell those words apart. The workaround is to make the dimensionality of the hashing space much larger than the total number of distinct tokens being hashed.

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Store the words as vectors of size 1,000; with many more than 1,000 distinct words,
# hash collisions become frequent and hurt the encoding
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into an integer index between 0 and dimensionality - 1
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1

print(results)
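To get a feel for collisions, a minimal sketch reusing samples and dimensionality from above; it simply compares the number of distinct words with the number of distinct hash buckets they land in:

words = set(word for sample in samples for word in sample.split())
buckets = {abs(hash(word)) % dimensionality for word in words}
# If these two numbers differ, at least two words collided (very unlikely for a handful of words in 1,000 buckets)
print(len(words), len(buckets))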
【Classifying IMDB reviews with a pretrained GloVe embedding】

The rest of this post puts the pieces together: read the raw IMDB reviews, vectorize them with a Tokenizer, load pretrained GloVe word vectors into an Embedding layer, and train a small classifier on top.

import os
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
import matplotlib.pyplot as plt


# settings
max_len = 100                # cut reviews off after 100 words
training_samples = 200       # train on only 200 samples
validation_samples = 10000   # validate on 10,000 samples
max_words = 10000            # consider only the top 10,000 words in the dataset
embedding_dim = 100          # use the 100-dimensional GloVe vectors


def process_data():
    '''
    Read the raw IMDB training data and collect the review texts and labels (neg = 0, pos = 1).
    :return: labels, texts
    '''
    imdb_dir = 'D:\\text2sequences\\aclImdb\\aclImdb'
    train_dir = os.path.join(imdb_dir, 'train')

    labels = []
    texts = []

    for label_type in ['pos', 'neg']:
        dir_name = os.path.join(train_dir, label_type)
        for fname in os.listdir(dir_name):
            if fname[-4:] == '.txt':
                f = open(os.path.join(dir_name, fname), 'r', encoding='UTF-8')
                texts.append(f.read())
                f.close()
                if label_type == 'neg':
                    labels.append(0)
                else:
                    labels.append(1)

    return labels, texts
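# Quick usage check (a sketch; the counts assume the full aclImdb train split):
#   labels, texts = process_data()
#   print(len(texts), len(labels))   # -> 25000 25000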


def tokenize_data():
    '''
    Vectorize the texts and split them into a training set and a validation set.
    :return: x_train, y_train, x_val, y_val and the tokenizer's word_index
    '''

    labels, texts = process_data()
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(texts=texts)
    sequences = tokenizer.texts_to_sequences(texts=texts)
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    data = pad_sequences(sequences, maxlen=max_len)

    labels = np.asarray(labels)
    print('Shape of data tensor:', data.shape)
    print('Shape of label tensor:', labels.shape)
    # The samples are ordered by label (all pos first, then all neg), so shuffle them
    indices = np.arange(data.shape[0])
    np.random.shuffle(indices)
    data = data[indices]
    labels = labels[indices]

    x_train = data[:training_samples]
    y_train = labels[:training_samples]
    x_val = data[training_samples: training_samples + validation_samples]
    y_val = labels[training_samples: training_samples + validation_samples]

    return x_train, y_train, x_val, y_val, word_index
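# Note: pad_sequences (used above) left-pads short sequences with zeros and keeps only
# the last maxlen entries of long ones (default padding='pre', truncating='pre').
# A toy example with made-up sequences:
#   pad_sequences([[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]], maxlen=5)
#   -> [[ 0  0  1  2  3]
#       [ 6  7  8  9 10]]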


def parse_word_embedding(word_index):
    '''
    Parse the precomputed GloVe word vectors and build an embedding matrix for our vocabulary.
    :return: embedding_matrix of shape (max_words, embedding_dim)
    '''
    glove_dir = 'D:\\text2sequences\\glove.6B'

    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'), 'r', encoding='UTF-8')
    for line in f:
        values = line.split()
        word = values[0]
        # the rest of the line is the word's 100-dimensional vector
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

    f.close()
    print('Found %s word vectors.' % len(embeddings_index))

    # Rows for words not found in GloVe (and row 0) are left as all-zeros
    embedding_matrix = np.zeros((max_words, embedding_dim))

    for word, i in word_index.items():
        if i < max_words:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector

    return embedding_matrix
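# Optional sanity check (a sketch using only the returned matrix and numpy): rows that
# stayed all-zero belong to words with no GloVe vector (plus row 0), e.g.
#   matrix = parse_word_embedding(word_index)
#   print(int((np.abs(matrix).sum(axis=1) > 0).sum()), 'rows received a GloVe vector')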


def train_model():
    '''
    Build and train the classifier with a frozen, GloVe-initialized Embedding layer.
    :return: the training History (loss and accuracy per epoch)
    '''
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=max_len))
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.summary()

    # Load the pretrained GloVe vectors into the Embedding layer and freeze it
    x_train, y_train, x_val, y_val, word_index = tokenize_data()
    embedding_matrix = parse_word_embedding(word_index)
    model.layers[0].set_weights([embedding_matrix])
    model.layers[0].trainable = False

    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])

    history = model.fit(x_train, y_train,
                        epochs=10,
                        batch_size=32,
                        validation_data=(x_val, y_val))
    model.save('pre_trained_glove_model.h5')

    return history
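# Equivalent alternative (a hedged sketch, not what this script does): the GloVe weights
# can also be passed when the layer is constructed, which likewise keeps them frozen:
#   Embedding(max_words, embedding_dim, input_length=max_len,
#             weights=[embedding_matrix], trainable=False)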


def plot_results():
    '''
    Plot training and validation accuracy and loss.
    '''
    history = train_model()
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.figure()
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()


if __name__ == '__main__':
    plot_results()

Evaluation and analysis
The training curves show that the model with the pretrained embedding layer overfits badly: validation accuracy hovers around 0.5. The training set is simply too small, so performance depends heavily on which 200 samples happen to be drawn (and they are drawn at random). As a comparison, let's modify this baseline to train the same model without the pretrained embeddings and see what happens.


def train_model():
    '''
    Same model, but the Embedding layer is trained from scratch (no GloVe).
    :return: the training History (loss and accuracy per epoch)
    '''
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=max_len))
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.summary()

    # This time, do NOT load GloVe: the Embedding layer is trained along with the rest of the model
    x_train, y_train, x_val, y_val, word_index = tokenize_data()

    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])

    history = model.fit(x_train, y_train,
                        epochs=10,
                        batch_size=32,
                        validation_data=(x_val, y_val))
    model.save('no_pretrained_glove_model.h5')

    return history

【References】

【Code】

【Acknowledgements】

Thanks to every friend who has read this far. If you have any questions, leave a comment below or send me an email; let's learn from and share with each other. And if you're just passing by, leave a like 😀
