TensorFlow in Practice: Generating Tang Poetry with an LSTM
Preface
A few days ago Typhoon Hato (天鸽) severely disrupted daily life and production across Guangdong. During the typhoon, cars blown over by the wind crushed their owners, waves surged into residents' homes, and a crane more than ten stories tall was toppled on a construction site. Faced with the power of nature, I am deeply aware of how small and fragile we humans are; however much our technology improves, we are only borrowing nature's power to improve our lives. As humans, we should stand in awe of nature.
Overview
Back to the point: today I want to show how a simple LSTM model can learn to generate Tang poetry.
LSTM
LSTM was introduced in an earlier article: Tensorflow[基础篇]——lstm的理解与实现. LSTM is a deep learning network for processing sequences. What distinguishes it from a plain RNN is that LSTM is well suited to handling and predicting important events separated by very long gaps and delays in a time series; in other words, it can remember context that is far away from the word being predicted and use it as important evidence for the prediction. If you want to dig deeper, that article covers the formulas, papers, and implementation details of LSTM: Tensorflow[基础篇]——lstm的理解与实现. I will not repeat them here.
IDEA
From the training corpus (about 30,000 Tang poems) we build a dictionary (word -> ID) and a reversed dictionary (ID -> word) over every character. Using the dictionary, each poem is turned into a vector of IDs, and the ID vector is then turned into "word" vectors via embedding_lookup. train_label is simply train_data shifted one position to the left: if "[大漠孤烟直]" is the train_data, the corresponding train_label is "大漠孤烟直]]", i.e. "[" -> "大", "大" -> "漠", "漠" -> "孤", "孤" -> "烟", "烟" -> "直", "直" -> "]", "]" -> "]", so input and target correspond one to one in order. This is one of the defining characteristics of an RNN.
Here "[" and "]" are the start and end markers, used to mark the beginning and end of a generated poem.
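For intuition, here is a minimal sketch of the data/label construction; the dictionary below is a toy one built from a single poem, whereas the real one comes from build_dataset() shown later:

# A toy sketch of how a poem becomes train_data / train_label ID vectors.
poem = "[大漠孤烟直]"
dictionary = {ch: i for i, ch in enumerate(sorted(set(poem)))}   # word -> ID (toy dictionary)
reversed_dictionary = {i: ch for ch, i in dictionary.items()}    # ID -> word

train_data = [dictionary[ch] for ch in poem]
train_label = train_data[1:] + train_data[-1:]   # shift left by one; the last target repeats ']'
print(train_data)
print(train_label)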
For the LSTM model we use TensorFlow's tf.nn.rnn_cell.BasicLSTMCell to build the basic LSTM cell; you can of course use one of the LSTM variants (GRU, etc.) instead. Finally, sequence_loss_by_example is used to obtain the loss that we train against.
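For intuition, with the all-ones weights used later, sequence_loss_by_example boils down to the per-position cross-entropy of the softmax prediction against the target word ID; a rough numpy sketch (not the TF implementation) of that quantity:

import numpy as np

def per_position_cross_entropy(logits, targets):
    # logits: [batch*time, vocab_size], targets: [batch*time] word IDs.
    # With all-ones weights this is what sequence_loss_by_example returns per
    # position; the training cost is then the mean over all positions.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(targets)), targets])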
Detail
1. reader.py
# -*-coding:utf-8-*-#
import collections

def get_poetrys(poetry_file="./poetry.txt"):
    poetrys = []
    with open(poetry_file, 'r', encoding='utf-8') as f:
        for index, line in enumerate(f):
            try:
                title, content = line.strip().split(":")
                content = content.replace(" ", "")
                # skip poems that contain annotation or bracket characters
                if '_' in content or '(' in content or \
                        '（' in content or '《' in content or \
                        '[' in content:
                    continue
                # skip poems that are too short or too long
                if len(content) < 5 or len(content) > 79:
                    continue
                # wrap every poem with the start marker '[' and the end marker ']'
                content = '[' + content + ']'
                poetrys.append(content)
            except Exception as e:
                pass
    return poetrys

def build_dataset():
    poetrys = get_poetrys()
    poetrys = sorted(poetrys, key=lambda line: len(line))
    print("Total number of poems:", len(poetrys))
    words = []
    for poetry in poetrys:
        words += [word for word in poetry]
    counter = collections.Counter(words)
    # sort by frequency, from high to low
    counter_pairs = sorted(counter.items(), key=lambda x: -x[1])
    # unzip the counter and keep the (unique) words
    words, _ = zip(*counter_pairs)
    # append the space character, which is later used for padding
    words = words + (" ", )
    # word -> id
    dictionary = dict(zip(words, range(len(words))))
    # id -> word
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    poetry_vectors = [[dictionary[word] for word in poetry] for poetry in poetrys]
    return dictionary, poetry_vectors, reversed_dictionary
This is a very simple preprocessing of all the poems: build_dataset() turns them into the dictionary, the poem vectors, and the reversed dictionary, i.e. the preprocessed poem collection.
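Assuming poetry.txt is in place, a quick sanity check of the preprocessing could look like this (the printed values depend on the corpus):

import reader

dictionary, poetry_vectors, reversed_dictionary = reader.build_dataset()
print(len(dictionary))                                              # vocabulary size
print(poetry_vectors[0])                                            # the shortest poem as word IDs
print(''.join(reversed_dictionary[i] for i in poetry_vectors[0]))   # mapped back to characters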
2. BatchGenerator.py
# -*-coding:utf-8-*-#
import numpy as np
import reader

class BatchGenerator(object):
    """docstring for BatchGenerator"""
    def __init__(self, data, batch_size, empty_key):
        self._batch_size = batch_size
        self._offset = 0
        self._batch = []
        self._data_size = len(data)
        self._batch_num = self._data_size // self._batch_size
        self._data = data
        self._generate_batch(empty_key)

    def _generate_batch(self, empty_key):
        for index in range(self._batch_num):
            start = index * self._batch_size
            end = start + self._batch_size
            # length of the longest poem in the current batch
            length = max(map(len, self._data[start: end]))
            # build the batch; poems shorter than the max length are padded with spaces
            batch_data = np.full((self._batch_size, length), empty_key, np.int32)
            for row in range(self._batch_size):
                poetry = self._data[start + row]
                batch_data[row, :len(poetry)] = poetry
            self._batch.append(batch_data)

    def next(self):
        x_data = self._batch[self._offset]
        y_data = np.copy(x_data)
        # labels are the inputs shifted left by one position
        y_data[:, : -1] = x_data[:, 1: ]
        self._offset = (self._offset + 1) % self._batch_num
        return x_data, y_data
This module builds the batch data; after creating the BatchGenerator, call its next() method to get the next batch. One thing worth noticing is that we only take the length of the longest poem within each batch as the number of columns, i.e. shape=(batch_size, max_length), so max_length differs from batch to batch. Why not compute a single max_length for all batches up front? If every batch shared the same max_length, it would have to be the length of the longest poem in the whole corpus, and the batches made of shorter poems would then contain a lot of meaningless padding spaces, which would hurt both the training result and the training speed.
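As a quick illustration (the concrete shapes below are made up), two consecutive batches will usually have different second dimensions because each one is padded only to its own longest poem:

from BatchGenerator import BatchGenerator
import reader

dictionary, poetry_vectors, _ = reader.build_dataset()
generator = BatchGenerator(poetry_vectors, batch_size=64, empty_key=dictionary.get(' '))
x1, y1 = generator.next()
x2, y2 = generator.next()
print(x1.shape, x2.shape)   # e.g. (64, 18) and (64, 24): max_length varies per batch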
Reference (Zhihu): what is the difference between TensorFlow's dynamic_rnn and rnn?
3. GeneratePoetryModel.py
# -*-coding:utf-8-*-#
import tensorflow as tf

class GeneratePoetryModel(object):
    """docstring for GeneratePoetryModel"""
    def __init__(self, X, batch_size, input_size, output_size, model='lstm', rnn_size=128, num_layers=2):
        self._model = model
        self._num_unit = rnn_size          # number of LSTM units
        self._num_layers = num_layers      # number of LSTM layers
        self._input_size = input_size      # input dimension of the final fully connected layer
        self._output_size = output_size    # output dimension of the final fully connected layer
        self._model_layers = self._get_layer()  # the stacked LSTM hidden layers
        self._initial_state = self._model_layers.zero_state(batch_size, tf.float32)  # initial state
        with tf.variable_scope('rnnlm'):
            n = (self._num_unit + self._output_size) * 0.5
            scale = tf.sqrt(3 / n)
            # parameters of the fully connected (softmax) layer
            softmax_w = tf.get_variable(
                "softmax_w",
                [self._num_unit, self._output_size],
                initializer=tf.random_uniform_initializer(-scale, scale))
            softmax_b = tf.get_variable(
                "softmax_b",
                [self._output_size],
                initializer=tf.random_uniform_initializer(-scale, scale))
            with tf.device("/cpu:0"):
                embedding = tf.get_variable("embedding", [self._input_size, self._num_unit])
                inputs = tf.nn.embedding_lookup(embedding, X)
        # run the LSTM hidden layers
        outputs, last_state = tf.nn.dynamic_rnn(self._model_layers, inputs, initial_state=self._initial_state, scope="rnnlm")
        self._outputs = tf.reshape(outputs, [-1, self._num_unit])
        self._last_state = last_state
        # output of the fully connected layer
        self._logists = tf.matmul(self._outputs, softmax_w) + softmax_b
        # prediction probabilities
        self._probs = tf.nn.softmax(self._logists)

    def _get_cell(self):
        if self._model == 'rnn':
            return tf.nn.rnn_cell.BasicRNNCell(self._num_unit)
        elif self._model == 'gru':
            return tf.nn.rnn_cell.GRUCell(self._num_unit)
        # default: LSTM
        return tf.nn.rnn_cell.BasicLSTMCell(self._num_unit, state_is_tuple=True)

    def _get_layer(self):
        # build an independent cell for every layer
        cells = [self._get_cell() for _ in range(self._num_layers)]
        return tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)

    def results(self):
        """
        Return the network outputs and the tensors needed outside the model.
        """
        return self._logists, self._last_state, self._probs, self._initial_state
We define the LSTM model here precisely so that we can use the tf.nn.rnn_cell.BasicLSTMCell mentioned above; if you want one of the other variants, change the model argument. An interesting method here is tf.nn.dynamic_rnn. Unlike tf.nn.rnn, it can run on inputs whose shapes differ from call to call. Consider our situation: when training in batches, the batches do not all have the same shape, because each batch produced by the BatchGenerator has shape=(batch_size, max_length) and max_length varies. tf.nn.dynamic_rnn solves exactly this problem, whereas tf.nn.rnn requires every input to have the same shape, which is the opposite of what we need.
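A minimal sketch of why tf.nn.dynamic_rnn is enough here: the time dimension of the placeholder is left as None, so batches with different max_length can be fed through the same graph (the concrete lengths below are just examples):

import numpy as np
import tensorflow as tf

cell = tf.nn.rnn_cell.BasicLSTMCell(128, state_is_tuple=True)
inputs = tf.placeholder(tf.float32, [64, None, 128])     # [batch, time, depth], time left unspecified
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # two batches with different sequence lengths run through the same op
    o1 = sess.run(outputs, {inputs: np.zeros((64, 18, 128), np.float32)})
    o2 = sess.run(outputs, {inputs: np.zeros((64, 24, 128), np.float32)})
    print(o1.shape, o2.shape)   # (64, 18, 128) and (64, 24, 128)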
4. learning_poetry.py
# -*-coding:utf-8-*-#
import datetime
import tensorflow as tf
import os
import sys
import reader
from BatchGenerator import BatchGenerator
from GeneratePoetryModel import GeneratePoetryModel

dictionary, poetry_vectors, _ = reader.build_dataset()
empty_key = dictionary.get(' ')
batch_size = 64
batch_generator = BatchGenerator(poetry_vectors, batch_size, empty_key)
# x_data, y_data = batch_generator.next()
input_size = output_size = len(dictionary) + 1

train_data = tf.placeholder(tf.int32, [batch_size, None])
train_label = tf.placeholder(tf.int32, [batch_size, None])
model = GeneratePoetryModel(X=train_data, batch_size=batch_size, input_size=input_size, output_size=output_size)
logists, last_state, _, _ = model.results()
targets = tf.reshape(train_label, [-1])
# per-position cross-entropy, weighted equally for every position
loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([logists], [targets], [tf.ones_like(targets, dtype=tf.float32)])
cost = tf.reduce_mean(loss)
global_step = tf.Variable(0, trainable=False)
# decay the learning rate by 10% after every full pass over the batches
learning_rate = tf.train.exponential_decay(0.01, global_step, batch_generator._batch_num, 0.9, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
# clip the gradients by global norm to avoid exploding gradients
gradients, v = zip(*optimizer.compute_gradients(cost))
gradients, _ = tf.clip_by_global_norm(gradients, 5)
optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    saver = tf.train.Saver(tf.global_variables())
    print("training...")
    model_dir = "./model/"
    if not os.path.exists(model_dir):
        os.mkdir(model_dir)
        print("create the directory: %s" % model_dir)
    # epoch with the smallest mean cost so far
    best_cost_epoch = 0
    # smallest mean cost so far
    best_cost = float('Inf')
    start_time = datetime.datetime.now()
    for epoch in range(141):
        epoch_start_time = datetime.datetime.now()
        epoch_mean_cost = 0
        for batch in range(batch_generator._batch_num):
            x_data, y_data = batch_generator.next()
            _, _, c, lr, gs = session.run(
                [optimizer, last_state, cost, learning_rate, global_step],
                feed_dict={train_data: x_data, train_label: y_data})
            epoch_mean_cost += c
            print("current epoch %d, current batch is %d, mean cost : %2.8f, learning rate: %2.8f, global step : %d"
                  % (epoch, batch, c, lr, gs))
        epoch_mean_cost = epoch_mean_cost / batch_generator._batch_num
        print("=" * 80)
        if epoch != 0:
            print("\nthe best cost : %2.8f, the best epoch index : %d, current epoch cost : %2.8f. \n"
                  % (best_cost, best_cost_epoch, epoch_mean_cost))
        if best_cost > epoch_mean_cost:
            print("the best epoch will change from %d to %d" % (best_cost_epoch, epoch))
            best_cost = epoch_mean_cost
            best_cost_epoch = epoch
            saver.save(session, model_dir + 'poetry.module-best')
        if epoch % 7 == 0:
            saver.save(session, model_dir + 'poetry.module', global_step=epoch)
        end_time = datetime.datetime.now()
        timedelta = end_time - epoch_start_time
        print("the epoch training spends %d days, %d hours, %d minutes, %d seconds.\n"
              % (timedelta.days, timedelta.seconds // 3600, (timedelta.seconds % 3600) // 60, timedelta.seconds % 60))
        print("=" * 80)
        print("\n")
    timedelta = end_time - start_time
    print("*" * 80)
    print("\nThe training spends %d days, %d hours, %d minutes, %d seconds"
          % (timedelta.days, timedelta.seconds // 3600, (timedelta.seconds % 3600) // 60, timedelta.seconds % 60))
Finally, we add the optimization step and we are more or less done. The optimizer here clips the gradients by global norm and decays the learning rate step by step. With that, the shameless training journey can begin.
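For reference, with staircase=True the schedule above amounts to dropping the learning rate by 10% once per epoch, since global_step advances by one per batch and decay_steps is set to the number of batches per epoch; a small sketch of the formula:

def decayed_learning_rate(global_step, batch_num, base_rate=0.01, decay_rate=0.9):
    # staircase form of tf.train.exponential_decay:
    # lr = base_rate * decay_rate ** floor(global_step / decay_steps)
    return base_rate * decay_rate ** (global_step // batch_num)

print(decayed_learning_rate(0, 100), decayed_learning_rate(250, 100))   # 0.01 0.0081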
5. create_poetry.py
# -*-coding:utf-8-*-#
import numpy as np
import tensorflow as tf
import reader
from GeneratePoetryModel import GeneratePoetryModel

dictionary, _, reversed_dictionary = reader.build_dataset()

def to_word(weights):
    """
    Turn the output weights into a probability distribution and pick a word by
    random sampling, similar to the selection step of a genetic algorithm
    (admittedly not very rigorous).
    """
    t = np.cumsum(weights)
    s = np.sum(weights)
    sample = int(np.searchsorted(t, np.random.rand(1) * s))
    return reversed_dictionary[sample]

# the input is a single word; the next word is predicted from the previous one
input_data = tf.placeholder(tf.int32, [1, None])
# input and output dimensions equal the vocabulary size (plus one)
input_size = output_size = len(reversed_dictionary) + 1
# build the model with batch size 1
model = GeneratePoetryModel(X=input_data, batch_size=1, input_size=input_size, output_size=output_size)
# fetch the model's output tensors
_, last_state, probs, initial_state = model.results()

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    saver = tf.train.Saver(tf.global_variables())
    print("generate...")
    saver.restore(session, './model/poetry.module-140')
    # the start marker is '['
    x = np.array([list(map(dictionary.get, '['))])
    # run the initial (zero) state
    state_ = session.run(initial_state)
    word = poem = '['
    # the end marker is ']'
    while word != ']':
        # feed the previous state and output back in
        probs_, state_ = session.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
        word = to_word(probs_)
        poem += word
        # look up the id of the sampled word
        x = np.zeros((1, 1))
        x[0, 0] = dictionary[word]
    print(poem)
Here "[" is used as the start marker and "]" as the end marker, which lets us tell whether a generated poem is complete.
Experimental Results
[者外临江冷,门谈上水空。远人长忆欲,月色杳冠年。招海幂明匆,星辕藏似金。声凉草不醒,窗静塘称琴。]
者外临江冷,门谈上水空。远人长忆欲,月色杳冠年。招海幂明匆,星辕藏似金。声凉草不醒,窗静塘称琴。[玉低回上道,鹤发不离群。张屋犹飞绕,兼言潜草边。浮舟仍已尽,会数种如群。王子怜桃李,闲空昔浦书。无同在天韶,何处似戎衣。]
玉低回上道,鹤发不离群。张屋犹飞绕,兼言潜草边。浮舟仍已尽,会数种如群。王子怜桃李,闲空昔浦书。无同在天韶,何处似戎衣。[水平西望使人苏,此地何曾肯相思。独念千端饮在器,路来闲墨网来难。一行分管潮中去,朝更还似盘抹热。愁闻几醉遗名处,长短逢离火上风。]
水平西望使人苏,此地何曾肯相思。独念千端饮在器,路来闲墨网来难。一行分管潮中去,朝更还似盘抹热。愁闻几醉遗名处,长短逢离火上风。
Conclusion
This experiment is only half finished. It still has quite a few flaws, but I think it is worth writing down. Here are some of my own ideas for improvement:
- When generating a poem, the process can get stuck in a very long loop because the end marker ']' is never produced, which suggests the model has not been trained thoroughly enough.
- I do not consider this complete: the loss stops decreasing once it reaches a certain level, and I do not know whether it has reached the global minimum.
- The embedding used for training is randomly initialized here; I think initializing it with word2vec would work better.
- Tang poems should show a certain degree of parallel structure, so combining the model with attention might improve that.
- The generated poems are fairly meaningless; if generation could be conditioned on a topic, the results would probably look better.
References
- http://blog.topspeedsnail.com/archives/10542
- https://www.zhihu.com/question/52200883
- https://www.zhihu.com/question/41631631
- http://blog.csdn.net/u014595019/article/details/52759104
Thanks
Alright, thanks for reading. It's almost bedtime. Good night and sweet dreams!
Finally, here is my source code: https://github.com/Salon-sai/learning-tensorflow/tree/master/lesson5