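TF2 Keras (6): Preprocessing Layers
These are my study notes on the official documentation.
The preprocessing layers covered here are native Keras components. Everything they do can also be done with other tools (pandas, numpy, sklearn), and plenty of code for that is available online. The main advantage of using preprocessing layers is that the finished model carries its own preprocessing. This makes it possible to build an end-to-end model and spares callers as much work as possible: users of the model can feed it raw strings or raw images directly.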
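Available preprocessing layers:
Core preprocessing layers
- TextVectorization: turns raw text into vectors
- Normalization: normalizes numerical features (feature-wise, to zero mean and unit variance)
Structured data preprocessing layers
- CategoryEncoding layer: applies one-hot, multi-hot, or TF-IDF encoding to categories that have already been converted to indices; often combined with StringLookup or IntegerLookup
- Hashing layer: hashes feature values into a fixed number of bins
- Discretization layer: buckets a numerical feature into ranges, turning it into a categorical feature
- StringLookup layer: maps string categories to indices
- IntegerLookup layer: maps integer categories to indices
- CategoryCrossing layer: crosses several columns to produce a new feature
Image preprocessing layers
- Resizing layer: resizes images to a target size
- Rescaling layer: rescales pixel values, for example from [0, 255] to [0, 1]
- CenterCrop layer: crops the central region of an image to a target height and width
Image data augmentation layers
- RandomCrop layer: random cropping
- RandomFlip layer: random flipping
- RandomTranslation layer: random translation (shifting the image horizontally or vertically)
- RandomRotation layer: random rotation
- RandomZoom layer: random zooming
- RandomHeight layer: randomly varies the image height during training
- RandomWidth layer: randomly varies the image width during training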
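To make the idea behind the Discretization layer concrete, here is a minimal pure-Python sketch of bucketing by boundaries. The function name `discretize` and the sample boundaries are made up for illustration, and the sketch assumes that a value equal to a boundary falls into the upper bin. It is not the Keras implementation.

```python
import bisect

def discretize(values, boundaries):
    """Return the bin index of each value for sorted bin boundaries.

    With boundaries [0.0, 1.0, 2.0] the bins are
    (-inf, 0), [0, 1), [1, 2), [2, +inf), indexed 0..3.
    """
    return [bisect.bisect_right(boundaries, v) for v in values]

# Hypothetical sample data and boundaries
bins = discretize([-1.5, 0.0, 0.5, 1.0, 3.7], [0.0, 1.0, 2.0])
print(bins)
```

The resulting bin indices can then be treated like any other integer category, for example by one-hot encoding them.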
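Stateful layers and adapt()
The state of the following layers has to be computed from the data. Before using them to process data, call adapt() so that they can learn from it.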
- TextVectorization
- Normalization
- StringLookup
- IntegerLookup
- CategoryEncoding
- Discretization
import numpy as np
import tensorflow as tf
# Note: from TF 2.6 on, these layers are also available directly under tf.keras.layers
from tensorflow.keras.layers.experimental import preprocessing
data = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7],])
layer = preprocessing.Normalization()
# Learn the per-feature mean and variance from the data
layer.adapt(data)
normalized_data = layer(data)
print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))
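What adapt() learns for a Normalization layer can be sketched in plain Python: a mean and a variance per feature column, which are then used to shift and scale the inputs. The helper names `adapt` and `normalize` and the `eps` guard are assumptions made for this sketch; this is an illustration of the idea, not the Keras source.

```python
import math

def adapt(rows):
    """Compute the per-column mean and (population) variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
    ]
    return means, variances

def normalize(rows, means, variances, eps=1e-7):
    """Apply (x - mean) / sqrt(variance) column by column."""
    return [
        [
            (v - m) / max(math.sqrt(var), eps)
            for v, m, var in zip(row, means, variances)
        ]
        for row in rows
    ]

data = [[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7]]
means, variances = adapt(data)       # the "state" learned from the data
normalized = normalize(data, means, variances)
print(means)
print(normalized)
```

After this step every column has a mean of about 0 and a standard deviation of about 1, which is what the two print statements in the Keras example check.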
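Should the preprocessing layer go inside the model?
Option 1: make the layer part of the model.
- The preprocessing can then be accelerated on the GPU.
- If a GPU is available, this benefits Normalization and the image-related preprocessing layers the most.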
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)
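Option 2: apply the layer to the tf.data.Dataset instead (for example with Dataset.map).
- The preprocessing runs on the CPU.
- The results can be buffered, or even cached to a file.
- It runs asynchronously.
- This works better for TextVectorization and for structured-data preprocessing.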
Image data augmentation
from tensorflow import keras
from tensorflow.keras import layers
# Create a data augmentation stage with horizontal flipping, rotations, zooms
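data_augmentation = keras.Sequential(
    [
        preprocessing.RandomFlip("horizontal"),
        preprocessing.RandomRotation(0.1),
        preprocessing.RandomZoom(0.1),
    ]
)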
# Create a model that includes the augmentation stage
input_shape = (32, 32, 3)
classes = 10
inputs = keras.Input(shape=input_shape)
# Augment images
x = data_augmentation(inputs)
# Rescale image values to [0, 1]
x = preprocessing.Rescaling(1.0 / 255)(x)
# Add the rest of the model
outputs = keras.applications.ResNet50(
    weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)
Normalizing numerical features
# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10
# Create a Normalization layer and set its internal state using the training data
normalizer = preprocessing.Normalization()
normalizer.adapt(x_train)
# Create a model that includes the normalization layer
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)
# Train the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)
Encoding string categorical features via one-hot encoding
# Define some toy data
data = tf.constant(["a", "b", "c", "b", "c", "a"])
# Use StringLookup to build an index of the feature values
indexer = preprocessing.StringLookup()
indexer.adapt(data)
# Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder = preprocessing.CategoryEncoding(output_mode="binary")
encoder.adapt(indexer(data))
# Convert new test data (which includes unknown feature values)
test_data = tf.constant(["a", "b", "c", "d", "e", ""])
encoded_data = encoder(indexer(test_data))
Encoding integer categorical features via one-hot encoding
# Define some toy data
data = tf.constant([10, 20, 20, 10, 30, 0])
# Use IntegerLookup to build an index of the feature values
indexer = preprocessing.IntegerLookup()
indexer.adapt(data)
# Use CategoryEncoding to encode the integer indices to a one-hot vector
encoder = preprocessing.CategoryEncoding(output_mode="binary")
encoder.adapt(indexer(data))
# Convert new test data (which includes unknown feature values)
test_data = tf.constant([10, 10, 20, 50, 60, 0])
encoded_data = encoder(indexer(test_data))
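The lookup-then-encode pattern used in both examples can be sketched in plain Python. The helper names are made up, and the index convention is an assumption of this sketch: index 0 is reserved for unknown values and known values are numbered in order of first appearance. The actual index assignment in Keras differs and depends on the version.

```python
def build_vocabulary(values):
    """Assign an index to each distinct value; index 0 means 'unknown'."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab

def lookup(values, vocab):
    """Map values to indices, sending unseen values to index 0."""
    return [vocab.get(value, 0) for value in values]

def one_hot(indices, depth):
    """Turn each index into a 0/1 vector of length `depth`."""
    return [[1 if i == index else 0 for i in range(depth)] for index in indices]

# "adapt": learn the vocabulary from the training values
vocab = build_vocabulary(["a", "b", "c", "b", "c", "a"])
# apply it to new data that contains unknown values
indices = lookup(["a", "b", "c", "d", "e", ""], vocab)
encoded = one_hot(indices, depth=len(vocab) + 1)
print(indices)
print(encoded)
```

The same functions work unchanged for the integer example, since the vocabulary is just a dictionary keyed by the raw values.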
Applying the hashing trick to an integer categorical feature
If you have a categorical feature that can take many different values (on the order of 10e3 or higher), where each value only appears a few times in the data, it becomes impractical and ineffective to index and one-hot encode the feature values. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable, and removes the need for explicit indexing.
# Sample data: 10,000 random integers with values between 0 and 100,000
data = np.random.randint(0, 100000, size=(10000, 1))
# Use the Hashing layer to hash the values to the range [0, 64), i.e. bins 0 to 63
hasher = preprocessing.Hashing(num_bins=64, salt=1337)
# Use the CategoryEncoding layer to one-hot encode the hashed values
encoder = preprocessing.CategoryEncoding(max_tokens=64, output_mode="binary")
encoded_data = encoder(hasher(data))
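The hashing trick itself needs no vocabulary at all, which the following standard-library sketch tries to show. It uses SHA-256 from `hashlib` as a stand-in hash function. That choice, the way the salt is mixed in, and the helper names are assumptions of this sketch, so the bin numbers will not match what the Keras Hashing layer produces.

```python
import hashlib

def hash_bucket(value, num_bins, salt=""):
    """Deterministically map any value to a bin in [0, num_bins)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins

def one_hot(index, depth):
    """Return a 0/1 vector of length `depth` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(depth)]

values = [7, 42, 99999, 42]          # hypothetical raw feature values
buckets = [hash_bucket(v, num_bins=64, salt="1337") for v in values]
encoded = [one_hot(b, 64) for b in buckets]
print(buckets)
```

Equal inputs always land in the same bin, while distinct inputs may collide. That collision risk is the price paid for a fixed-size feature space.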
Encoding text as a sequence of token indices
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "int" output_mode
text_vectorizer = preprocessing.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(data)
# You can retrieve the vocabulary we indexed via get_vocabulary()
vocab = text_vectorizer.get_vocabulary()
print("Vocabulary:", vocab)
# Create an Embedding + LSTM model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = layers.Embedding(input_dim=len(vocab), output_dim=64)(x)
outputs = layers.LSTM(1)(x)
model = keras.Model(inputs, outputs)
# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
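As a rough picture of what the "int" output mode does, the sketch below lowercases the text, strips punctuation, splits on whitespace, and maps each token to a vocabulary index, with index 0 reserved for padding and index 1 for unknown tokens. The helper names, the use of `string.punctuation`, and the alphabetical tie-breaking between equally frequent tokens are assumptions of this sketch, not a description of the Keras internals.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(corpus):
    """Index 0 is padding, index 1 is unknown, then tokens by frequency."""
    counts = Counter(tok for line in corpus for tok in standardize(line).split())
    # Most frequent first; ties broken alphabetically (an assumption)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return ["", "[UNK]"] + ordered

def vectorize(text, vocab):
    """Map each token to its index, or to 1 if it is not in the vocabulary."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text).split()]

corpus = [
    "The Brain is wider than the Sky",
    "For put them side by side",
    "The one the other will contain",
    "With ease and You beside",
]
vocab = build_vocab(corpus)
ids = vectorize("The Brain is deeper than the sea", vocab)
print(vocab[:4])
print(ids)
```

The words "deeper" and "sea" never occur in the corpus, so both map to the unknown index 1.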
Encoding text as a dense matrix of ngrams with multi-hot encoding
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "binary" output_mode (multi-hot)
# and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="binary", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(data)
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)
# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)
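The multi-hot n-gram encoding can be sketched without TensorFlow as follows. The sketch collects all unigrams and bigrams of the corpus into a vocabulary and marks which of them occur in a new sentence. The helper names and the choice of index 0 for unknown n-grams are assumptions made for illustration.

```python
def ngrams(tokens, max_n):
    """All n-grams of length 1..max_n, each joined with spaces."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocab(corpus, max_n):
    """Index every n-gram of the corpus; index 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for line in corpus:
        for gram in ngrams(line.lower().split(), max_n):
            vocab.setdefault(gram, len(vocab))
    return vocab

def multi_hot(text, vocab, max_n):
    """Return a 0/1 vector with a 1 for every n-gram present in the text."""
    vector = [0] * len(vocab)
    for gram in ngrams(text.lower().split(), max_n):
        vector[vocab.get(gram, 0)] = 1
    return vector

corpus = [
    "The Brain is wider than the Sky",
    "For put them side by side",
    "The one the other will contain",
    "With ease and You beside",
]
vocab = build_vocab(corpus, max_n=2)
vector = multi_hot("The Brain is deeper than the sea", vocab, max_n=2)
print(len(vector), sum(vector))
```

The output vector has one position per known n-gram, so its length is fixed by the corpus rather than by the length of the input sentence. This is why it can be fed straight into a Dense layer.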
Encoding text as a dense matrix of ngrams with TF-IDF weighting
# Define some text data to adapt the layer
data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = preprocessing.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(data)
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)
# Create a Dense model
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
# Call the model on test data (which includes unknown tokens)
test_data = tf.constant(["The Brain is deeper than the sea"])
test_output = model(test_data)
print("Model output:", test_output)
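TF-IDF weighting replaces the 0/1 entries of the multi-hot vector with a score that grows with how often a term occurs in the input and shrinks with how many training documents contain it. The sketch below uses unigrams only and one common smoothed IDF formula, log(1 + N / (1 + df)). Both the formula and the helper names are assumptions of this sketch, and the numbers should not be expected to match the Keras output.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(1 + n_docs / (1 + df)) for tok, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Weight each known token by (count in this text) * (its IDF)."""
    counts = Counter(tokens)
    return {tok: count * idf[tok] for tok, count in counts.items() if tok in idf}

corpus = [
    "The Brain is wider than the Sky",
    "For put them side by side",
    "The one the other will contain",
    "With ease and You beside",
]
docs = [line.lower().split() for line in corpus]
idf = idf_weights(docs)              # the state learned during "adapt"
weights = tfidf("the brain is deeper than the sea".split(), idf)
print(weights)
```

In this sketch "the" appears in two of the four documents and therefore gets a lower IDF than "brain", which appears in only one. Tokens that were never seen during adaptation are simply dropped here.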