TensorFlow Basic Text Classification
2018-09-25
dalalaa
Import Libraries
import tensorflow as tf
from tensorflow import keras
import numpy as np
print(tf.__version__)
1.10.0
Import the Data
Load the dataset. As before, we use the approach typical for users in mainland China: download the file manually first, then load it from a local path.
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(path = 'H:/tf_project/imdb.npz',num_words=10000)
A .npz file can also be loaded directly with np.load(); the result is a dict-like object, which can be converted to a real dictionary with dict().
A .npy file loaded with np.load() is already a plain array.
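For reference, here is a minimal sketch of inspecting the downloaded archive directly with np.load(). The key names shown ('x_train', etc.) are an assumption based on the Keras-hosted imdb.npz, and allow_pickle=True may be required on newer NumPy versions:
import numpy as np
# NpzFile is dict-like; dict() materializes it into a plain dictionary.
# allow_pickle=True is needed on newer NumPy because the reviews are
# stored as object arrays (lists of word indices).
with np.load('H:/tf_project/imdb.npz', allow_pickle=True) as data:
    arrays = dict(data)
    print(list(arrays.keys()))      # e.g. ['x_test', 'x_train', 'y_train', 'y_test']
    print(arrays['x_train'].shape)  # (25000,) -- each entry is a list of word indices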
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
Training entries: 25000, labels: 25000
print(train_data[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
len(train_data[0]), len(train_data[1])
(218, 189)
Convert the integer arrays back into words
# A dictionary mapping words to an integer index
# Or download it directly from:
# https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
word_index = imdb.get_word_index(r"H:\tf_project\imdb_word_index.json")
word_index
{'fawn': 34701,
'tsukino': 52006,
'nunnery': 52007,
'sonja': 16816,
'vani': 63951,
'woods': 1408,
...}
# Shift every word index up by 3 to make room for the special tokens below
word_index = {k:(v+3) for k,v in word_index.items()}
# Add the special tokens
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2 # unknown
word_index["<UNUSED>"] = 3
# Swap keys and values so indices can be mapped back to words
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
Decode the training data back into text
decode_review(train_data[0])
"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"
The reviews (arrays of integers) must be converted into tensors before they can be fed into the network. This can be done in two ways:
- One-hot (multi-hot) encode them into vectors of 0s and 1s. For example, the list [3, 5] becomes a 10000-dimensional vector that is all zeros except at indices 3 and 5, which are ones. This approach is memory-hungry (see the sketch after this list).
- Pad the arrays so they all have the same length and then feed them into the network. This is the approach used below.
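A minimal sketch of the multi-hot alternative described above (illustrative only; the rest of this post uses padding instead):
import numpy as np

def multi_hot_encode(sequences, dimension=10000):
    # One row per review; set the positions of the word indices that
    # appear in that review to 1 and leave everything else at 0.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

# [3, 5] -> a 10000-d vector with ones only at indices 3 and 5
print(multi_hot_encode([[3, 5]])[0][:8])  # [0. 0. 0. 1. 0. 1. 0. 0.]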
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
len(train_data[0]), len(train_data[1])
(256, 256)
print(train_data[0])
[ 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941 4
173 36 256 5 25 100 43 838 112 50 670 2 9 35 480
284 5 150 4 172 112 167 2 336 385 39 4 172 4536 1111
17 546 38 13 447 4 192 50 16 6 147 2025 19 14 22
4 1920 4613 469 4 22 71 87 12 16 43 530 38 76 15
13 1247 4 22 17 515 17 12 16 626 18 2 5 62 386
12 8 316 8 106 5 4 2223 5244 16 480 66 3785 33 4
130 12 16 38 619 5 25 124 51 36 135 48 25 1415 33
6 22 12 215 28 77 52 5 14 407 16 82 2 8 4
107 117 5952 15 256 4 2 7 3766 5 723 36 71 43 530
476 26 400 317 46 7 4 2 1029 13 104 88 4 381 15
297 98 32 2071 56 26 141 6 194 7486 18 4 226 22 21
134 476 26 480 5 144 30 5535 18 51 36 28 224 92 25
104 4 226 65 16 38 1334 88 12 16 283 5 16 4472 113
103 32 15 16 5345 19 178 32 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
Build the Model
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 16) 160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16) 0
_________________________________________________________________
dense (Dense) (None, 16) 272
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
- The first layer is an Embedding layer.
- The second layer is a GlobalAveragePooling1D layer (see the sketch after this list).
- The third and fourth layers are fully connected (Dense) layers.
- The output layer has a single node and uses a sigmoid activation to constrain the output to the range 0-1.
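A small NumPy sketch of what GlobalAveragePooling1D does: it averages over the time-step axis, turning the (batch, steps, features) output of the Embedding layer into a fixed-size (batch, features) vector (toy numbers below; the real model uses 16 features and 256 steps):
import numpy as np

# One "review" with 4 time steps and 3 embedding features
embedded = np.array([[[1., 2., 3.],
                      [3., 2., 1.],
                      [2., 2., 2.],
                      [2., 2., 2.]]])

# Averaging over axis 1 (the time steps) gives one vector per review
print(embedded.mean(axis=1))  # [[2. 2. 2.]]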
Loss Function and Optimizer
model.compile(optimizer=tf.train.AdamOptimizer(),
loss='binary_crossentropy',
metrics=['accuracy'])
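For intuition, a tiny sketch (not part of the original post) of what binary_crossentropy computes for a batch of predictions:
import numpy as np

def binary_crossentropy(y_true, y_pred):
    # -[y*log(p) + (1-y)*log(1-p)], averaged over the batch;
    # predictions are clipped to avoid log(0), as Keras also does.
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_crossentropy(np.array([1., 0.]), np.array([0.9, 0.2])))  # ~0.164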
Create a Validation Set
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
Train the Model
Here history is the return value of fit(); it records how the metrics changed over the course of training.
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
Train on 15000 samples, validate on 10000 samples
Epoch 1/40
15000/15000 [==============================] - 4s 249us/step - loss: 0.7391 - acc: 0.5035 - val_loss: 0.7010 - val_acc: 0.4947
Epoch 2/40
15000/15000 [==============================] - 1s 52us/step - loss: 0.6931 - acc: 0.5251 - val_loss: 0.6912 - val_acc: 0.5338
Epoch 3/40
15000/15000 [==============================] - 1s 51us/step - loss: 0.6903 - acc: 0.5801 - val_loss: 0.6897 - val_acc: 0.5656
Epoch 4/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.6884 - acc: 0.6543 - val_loss: 0.6879 - val_acc: 0.6747
Epoch 5/40
15000/15000 [==============================] - 1s 48us/step - loss: 0.6864 - acc: 0.6421 - val_loss: 0.6860 - val_acc: 0.7004
Epoch 6/40
15000/15000 [==============================] - 1s 51us/step - loss: 0.6841 - acc: 0.7283 - val_loss: 0.6837 - val_acc: 0.7259
Epoch 7/40
15000/15000 [==============================] - 1s 53us/step - loss: 0.6810 - acc: 0.7203 - val_loss: 0.6805 - val_acc: 0.6978
Epoch 8/40
15000/15000 [==============================] - 1s 53us/step - loss: 0.6769 - acc: 0.7057 - val_loss: 0.6759 - val_acc: 0.6885
Epoch 9/40
15000/15000 [==============================] - 1s 51us/step - loss: 0.6707 - acc: 0.7150 - val_loss: 0.6695 - val_acc: 0.7142
Epoch 10/40
15000/15000 [==============================] - 1s 56us/step - loss: 0.6628 - acc: 0.7443 - val_loss: 0.6610 - val_acc: 0.7356
Epoch 11/40
15000/15000 [==============================] - 1s 51us/step - loss: 0.6529 - acc: 0.7487 - val_loss: 0.6503 - val_acc: 0.7497
Epoch 12/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.6387 - acc: 0.7843 - val_loss: 0.6345 - val_acc: 0.7720
Epoch 13/40
15000/15000 [==============================] - 1s 56us/step - loss: 0.6182 - acc: 0.7861 - val_loss: 0.6157 - val_acc: 0.7727
Epoch 14/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.5933 - acc: 0.7986 - val_loss: 0.5889 - val_acc: 0.7900
Epoch 15/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.5614 - acc: 0.8103 - val_loss: 0.5584 - val_acc: 0.7956
Epoch 16/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.5295 - acc: 0.8157 - val_loss: 0.5293 - val_acc: 0.8052
Epoch 17/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.4963 - acc: 0.8327 - val_loss: 0.5008 - val_acc: 0.8192
Epoch 18/40
15000/15000 [==============================] - 1s 52us/step - loss: 0.4647 - acc: 0.8423 - val_loss: 0.4726 - val_acc: 0.8273
Epoch 19/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.4349 - acc: 0.8519 - val_loss: 0.4471 - val_acc: 0.8363
Epoch 20/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.4076 - acc: 0.8607 - val_loss: 0.4243 - val_acc: 0.8434
Epoch 21/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.3829 - acc: 0.8707 - val_loss: 0.4043 - val_acc: 0.8489
Epoch 22/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.3612 - acc: 0.8773 - val_loss: 0.3872 - val_acc: 0.8547
Epoch 23/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.3424 - acc: 0.8833 - val_loss: 0.3729 - val_acc: 0.8587
Epoch 24/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.3256 - acc: 0.8885 - val_loss: 0.3605 - val_acc: 0.8643
Epoch 25/40
15000/15000 [==============================] - 1s 48us/step - loss: 0.3111 - acc: 0.8935 - val_loss: 0.3500 - val_acc: 0.8673
Epoch 26/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.2980 - acc: 0.8960 - val_loss: 0.3415 - val_acc: 0.8698
Epoch 27/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.2868 - acc: 0.8989 - val_loss: 0.3338 - val_acc: 0.8711
Epoch 28/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.2758 - acc: 0.9039 - val_loss: 0.3268 - val_acc: 0.8746
Epoch 29/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.2666 - acc: 0.9058 - val_loss: 0.3218 - val_acc: 0.8751
Epoch 30/40
15000/15000 [==============================] - 1s 53us/step - loss: 0.2588 - acc: 0.9079 - val_loss: 0.3164 - val_acc: 0.8768
Epoch 31/40
15000/15000 [==============================] - 1s 51us/step - loss: 0.2498 - acc: 0.9125 - val_loss: 0.3124 - val_acc: 0.8769
Epoch 32/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.2431 - acc: 0.9135 - val_loss: 0.3086 - val_acc: 0.8793
Epoch 33/40
15000/15000 [==============================] - 1s 50us/step - loss: 0.2352 - acc: 0.9170 - val_loss: 0.3052 - val_acc: 0.8805
Epoch 34/40
15000/15000 [==============================] - 1s 47us/step - loss: 0.2288 - acc: 0.9183 - val_loss: 0.3030 - val_acc: 0.8807
Epoch 35/40
15000/15000 [==============================] - 1s 51us/step - loss: 0.2231 - acc: 0.9195 - val_loss: 0.2998 - val_acc: 0.8802
Epoch 36/40
15000/15000 [==============================] - 1s 51us/step - loss: 0.2166 - acc: 0.9220 - val_loss: 0.2975 - val_acc: 0.8825
Epoch 37/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.2111 - acc: 0.9247 - val_loss: 0.2956 - val_acc: 0.8831
Epoch 38/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.2058 - acc: 0.9259 - val_loss: 0.2940 - val_acc: 0.8834
Epoch 39/40
15000/15000 [==============================] - 1s 49us/step - loss: 0.2003 - acc: 0.9294 - val_loss: 0.2922 - val_acc: 0.8846
Epoch 40/40
15000/15000 [==============================] - 1s 52us/step - loss: 0.1953 - acc: 0.9307 - val_loss: 0.2908 - val_acc: 0.8848
Evaluate the Model
results = model.evaluate(test_data, test_labels)
print(results)
25000/25000 [==============================] - 2s 86us/step
[0.3060342230606079, 0.87492000000000003]
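As a quick sanity check (a sketch, not part of the original post), the trained model can also score individual reviews with model.predict(); the single sigmoid output is read as the probability of a positive review:
# Raw sigmoid outputs for the first few test reviews (values vary per run);
# probabilities above 0.5 are interpreted as "positive".
predictions = model.predict(test_data[:3])
for prob, label in zip(predictions, test_labels[:3]):
    print("predicted: %.3f, actual: %d" % (prob[0], label))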
Visualize the Results
history_dict = history.history
history_dict.keys()
dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
Loss
import matplotlib.pyplot as plt
%matplotlib inline
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
(Figure: training and validation loss)
In the plot above, the validation loss stops improving much after roughly 20 epochs, and the model starts to overfit.
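One common way to curb this (a sketch, not used in the run above) is Keras' EarlyStopping callback, which halts training once val_loss stops improving; for a fair comparison the model would need to be rebuilt and recompiled before retraining:
# Stop when val_loss has not improved for 2 consecutive epochs
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    callbacks=[early_stop],
                    verbose=1)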
Accuracy
plt.clf() # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
(Figure: training and validation accuracy)