(Translation) TensorFlow Wide & Deep Learning Tutorial
# TensorFlow Wide & Deep Learning Tutorial
In the previous TensorFlow Linear Model Tutorial, we trained a logistic regression model to predict the probability that an individual has an annual income of over 50,000 dollars, using the Census Income Dataset. TensorFlow is great for training deep neural networks too, and you might be wondering which one to choose. Well, why not both? Would it be possible to combine the strengths of both in one model?
In this tutorial, we'll show how to use the TF.Learn API to jointly train a wide linear model and a deep feed-forward neural network. This approach combines the strengths of memorization and generalization. It's useful for generic large-scale regression and classification problems with sparse input features (e.g., categorical features with a large number of possible feature values). If you're interested in learning more about how Wide & Deep Learning works, please check out our [research paper](http://arxiv.org/abs/1606.07792).
![][01]
The figure above compares a wide model (logistic regression with sparse features and transformations), a deep model (feed-forward neural network with an embedding layer and several hidden layers), and a Wide & Deep model (joint training of both). At a high level, there are only 3 steps to configure a wide, deep, or Wide & Deep model using the TF.Learn API:
- Select features for the wide part: choose the sparse base columns and crossed columns you want to use.
- Select features for the deep part: choose the continuous columns, the embedding dimension for each categorical column, and the hidden layer sizes.
- Put them all together in a Wide & Deep model (DNNLinearCombinedClassifier).

And that's it! Let's go through a simple example.
# Setup
To try the code for this tutorial:
- Install TensorFlow if you haven't already.
- Download the tutorial code.
- Install the pandas data analysis library. tf.learn doesn't require pandas, but it does support it, and this tutorial uses pandas.

To install pandas, first get pip:
Ubuntu/Linux 64-bit:
```
$ sudo apt-get install python-pip python-dev
```
Mac OS X:
```
$ sudo easy_install pip
$ sudo easy_install --upgrade six
```
Then use **pip** to install pandas:
```
$ sudo pip install pandas
```
If you have trouble installing pandas, consult the instructions on the pandas site.
Execute the tutorial code with the following command to train the Wide & Deep model described in this tutorial:
```
$ python wide_n_deep_tutorial.py --model_type=wide_n_deep
```
Read on to find out how this code builds its Wide & Deep model.
# Define Base Feature Columns
First, let's define the base categorical and continuous feature columns that we'll use. These base columns will be the building blocks used by both the wide part and the deep part of the model.
```
import tensorflow as tf

# Categorical base columns.
gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["Female", "Male"])
race = tf.contrib.layers.sparse_column_with_keys(column_name="race", keys=[
    "Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)

# Continuous base columns.
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")
```
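
To make the bucketized `age_buckets` column more concrete, here is a minimal plain-Python sketch of what bucketizing a continuous value means. It is not TensorFlow's implementation; the helper name `bucketize` is hypothetical, and it assumes the usual convention that a value equal to a boundary falls into the higher bucket.

```python
import bisect

# Same boundaries as the age_buckets column above.
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucketize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    Buckets are (-inf, b0), [b0, b1), ..., [b_last, +inf), so a list of
    n boundaries yields n + 1 buckets. (Hypothetical helper for illustration.)
    """
    return bisect.bisect_right(boundaries, value)


for age in (17, 18, 24, 25, 70):
    print(age, "->", bucketize(age, AGE_BOUNDARIES))
```

With ten boundaries there are eleven buckets, so a linear model can learn a separate weight for each age range instead of a single weight for the raw age.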
# The Wide Model: Linear Model with Crossed Feature Columns
The wide model is a linear model with a wide set of sparse and crossed feature columns:
```
wide_columns = [
    gender, native_country, education, occupation, workclass, relationship, age_buckets,
    tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4)),
    tf.contrib.layers.crossed_column([native_country, occupation], hash_bucket_size=int(1e4)),
    tf.contrib.layers.crossed_column([age_buckets, education, occupation], hash_bucket_size=int(1e6))]
```
Wide models with crossed feature columns can memorize sparse interactions between features effectively. That said, one limitation of crossed feature columns is that they do not generalize to feature combinations that have not appeared in the training data. Let's add a deep model with embeddings to fix that.
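
The memorization behavior described above can be sketched in plain Python. This is only an illustration of the idea of hashing a feature combination into a bucket: the helper names are hypothetical, `zlib.crc32` is a stand-in for whatever hash function TensorFlow uses internally, and the weight value is made up.

```python
import zlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket ID in [0, hash_bucket_size)."""
    # crc32 is only a stand-in hash chosen for this sketch.
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size


def crossed_bucket(values, hash_bucket_size):
    """Hash the combination of several feature values into one bucket ID."""
    return hash_bucket("_X_".join(values), hash_bucket_size)


seen = crossed_bucket(["Bachelors", "Exec-managerial"], int(1e4))
unseen = crossed_bucket(["Bachelors", "Sales"], int(1e4))

# A linear model keeps one weight per bucket. Only buckets that occurred in
# the training data receive updates; the value below is hypothetical.
weights = {seen: 0.7}
print(weights.get(seen, 0.0), weights.get(unseen, 0.0))
```

Barring a hash collision, the combination that never appeared in training contributes nothing to the prediction, which is the limitation the deep part is meant to address.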
# The Deep Model: Neural Network with Embeddings
The deep model is a feed-forward neural network, as shown in the previous figure. Each of the sparse, high-dimensional categorical features is first converted into a low-dimensional, dense, real-valued vector, often referred to as an embedding vector. These low-dimensional dense embedding vectors are concatenated with the continuous features, and then fed into the hidden layers of a neural network in the forward pass. The embedding values are initialized randomly, and are trained along with all other model parameters to minimize the training loss. If you're interested in learning more about embeddings, check out the TensorFlow tutorial on [Vector Representations of Words](https://www.tensorflow.org/versions/r0.12/tutorials/word2vec/index.html), or [Word Embedding](https://en.wikipedia.org/wiki/Word_embedding) on Wikipedia.
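
As a rough illustration of the lookup-and-concatenate step described above, here is a plain-Python sketch. It is not how TensorFlow implements embeddings; the vocabulary, the dimension of 4, and the helper `deep_input` are assumptions made for the example, and no training takes place.

```python
import random

random.seed(0)

# Hypothetical vocabulary for the workclass feature, for illustration only.
VOCAB = ["Private", "Self-emp-inc", "State-gov"]
DIMENSION = 4

# One randomly initialized dense vector per category. During training these
# values would be updated together with the other model parameters.
embedding_table = {
    value: [random.uniform(-0.1, 0.1) for _ in range(DIMENSION)]
    for value in VOCAB
}


def deep_input(workclass, continuous_features):
    """Concatenate the embedding of `workclass` with the continuous features."""
    return embedding_table[workclass] + list(continuous_features)


x = deep_input("Private", [39.0, 13.0])  # age, education_num
print(len(x))  # 4 embedding values + 2 continuous values
```

The resulting vector is what the first hidden layer would receive for this example.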
We'll configure the embeddings for the categorical columns using embedding_column, and concatenate them with the continuous columns:
```
deep_columns = [
    tf.contrib.layers.embedding_column(workclass, dimension=8),
    tf.contrib.layers.embedding_column(education, dimension=8),
    tf.contrib.layers.embedding_column(gender, dimension=8),
    tf.contrib.layers.embedding_column(relationship, dimension=8),
    tf.contrib.layers.embedding_column(native_country, dimension=8),
    tf.contrib.layers.embedding_column(occupation, dimension=8),
    age, education_num, capital_gain, capital_loss, hours_per_week]
```
The higher the dimension of the embedding, the more degrees of freedom the model has to learn the representations of the features. For simplicity, we set the dimension to 8 for all feature columns here. Empirically, a more informed choice for the number of dimensions is to start with a value on the order of ![][log] or ![][k4], where ![][n] is the number of unique features in a feature column and ![][k] is a small constant (usually smaller than 10).
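
The two starting points mentioned above are easy to compute. The sketch below is only a convenience for experimenting with the rule of thumb; the function name and the default `k=3` are assumptions, not values from the tutorial.

```python
import math


def embedding_dim_candidates(n, k=3):
    """Return the two suggested starting points: log2(n) and k * n ** (1/4).

    `n` is the number of unique values in the feature column and `k` is a
    small constant (the default of 3 is an arbitrary choice for this sketch).
    """
    return math.log2(n), k * n ** 0.25


log_dim, root_dim = embedding_dim_candidates(1000)
print(round(log_dim, 2), round(root_dim, 2))
```

In practice you would round the chosen value to an integer and treat it as a first guess to tune, not as a fixed rule.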
Through dense embeddings, deep models can generalize better and make predictions on feature pairs that were previously unseen in the training data. However, it is difficult to learn effective low-dimensional representations for feature columns when the underlying interaction matrix between two feature columns is sparse and high-rank. In such cases, the interaction between most feature pairs should be zero except for a few, but dense embeddings will lead to nonzero predictions for all feature pairs, and thus can over-generalize. On the other hand, linear models with crossed features can memorize these “exception rules” effectively with fewer model parameters.
Now, let's see how to jointly train wide and deep models and allow them to complement each other's strengths and weaknesses.
# Combining Wide and Deep Models into One
The wide and deep models are combined by summing their final output log odds as the prediction, then feeding the prediction to a logistic loss function. All the graph definition and variable allocation has already been handled for you under the hood, so you simply need to create a DNNLinearCombinedClassifier:
```
import tempfile

model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])
```
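
To make the combination rule concrete (summing the log odds of both parts, then applying a logistic loss), here is a minimal sketch in plain Python. It is not the classifier's actual implementation; the function names and the example logits are hypothetical.

```python
import math


def sigmoid(z):
    """Convert log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def joint_prediction(wide_logit, deep_logit):
    """Sum the log odds of the wide and deep parts and squash the result."""
    return sigmoid(wide_logit + deep_logit)


def logistic_loss(label, probability):
    """Cross-entropy loss for a binary label in {0, 1}."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))


# Hypothetical outputs of the two parts for one training example.
p = joint_prediction(wide_logit=0.8, deep_logit=-0.3)
print(p, logistic_loss(1, p))
```

Because a single loss is computed on the summed output, its gradient reaches the parameters of both parts at once, which is what distinguishes joint training from training two models separately and averaging them.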
# Training and Evaluating the Model
Before we train the model, let's read in the Census dataset as we did in the [TensorFlow Linear Model tutorial](https://www.tensorflow.org/versions/master/tutorials/wide/) ([Chinese translation](http://www.jianshu.com/p/6868fc1f65d0)). The code for input data processing is provided here again for your convenience:
```
import pandas as pd
from six.moves import urllib  # provides urllib.request on both Python 2 and 3

# Define the column names for the data sets.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket"]
LABEL_COLUMN = 'label'
CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss",
                      "hours_per_week"]

# Download the training and test data to temporary files.
# Alternatively, you can download them yourself and change train_file and
# test_file to your own paths.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.request.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.request.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

# Read the training and test data sets into pandas DataFrames.
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)

def input_fn(df):
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    # Note: `shape` is the argument name in r0.12; later releases call it `dense_shape`.
    categorical_cols = {k: tf.SparseTensor(
        indices=[[i, 0] for i in range(df[k].size)],
        values=df[k].values,
        shape=[df[k].size, 1])
                        for k in CATEGORICAL_COLUMNS}
    # Merges the two dictionaries into one (works on both Python 2 and 3).
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    # Converts the label column into a constant Tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # Returns the feature columns and the label.
    return feature_cols, label

def train_input_fn():
    return input_fn(df_train)

def eval_input_fn():
    return input_fn(df_test)
```
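
One detail in the code above is that the label is derived with a substring test (`'>50K' in x`) rather than an equality test. The sketch below shows why that matters, under the assumption that labels in the test file carry a trailing period (for example `>50K.`) while those in the training file do not; the helper name `to_label` is hypothetical.

```python
def to_label(income_bracket):
    """Return 1 if the income bracket is above 50K, else 0."""
    return int(">50K" in income_bracket)


# Label strings with and without a trailing period map to the same class.
for raw in (">50K", "<=50K", ">50K.", "<=50K."):
    print(raw, "->", to_label(raw))
```

An equality check against `">50K"` would label every row with a trailing period as 0, so the substring test is the more forgiving choice here.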
After reading in the data, you can train and evaluate the model:
```
m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
```
The first line of the output should be something like **accuracy: 0.84429705**. We can see that accuracy improved from about 83.6% with a wide-only linear model to about 84.4% with a Wide & Deep model. If you'd like to see a working end-to-end example, you can download our [example code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/wide_n_deep_tutorial.py).
Note that this tutorial is just a quick example on a small dataset to get you familiar with the API. Wide & Deep Learning will be even more powerful if you try it on a large dataset with many sparse feature columns that have a large number of possible feature values. Again, feel free to take a look at our [research paper](http://arxiv.org/abs/1606.07792) for more ideas about how to apply Wide & Deep Learning to real-world large-scale machine learning problems.
> Original: https://www.tensorflow.org/versions/master/tutorials/wide_and_deep/index.html
[01]:https://www.tensorflow.org/versions/master/images/wide_n_deep.svg
[n]:http://latex.codecogs.com/png.latex?n
[k]:http://latex.codecogs.com/png.latex?k
[log]:http://latex.codecogs.com/png.latex?\log_2(n)
[k4]:http://latex.codecogs.com/png.latex?k\sqrt[4]n