Feature Engineering
2021-05-03
纵春水东流
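I. Contents
1. One-hot encoding
(1) Word-level one-hot encoding
(2) Character-level one-hot encoding
(3) Word-level one-hot encoding with Keras
(4) Word-level one-hot encoding with the hashing trick
2. Word embeddings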
II. Code
1. One-hot encoding
(1) Word-level one-hot encoding
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # Tokenize on whitespace; case and punctuation are kept as-is.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each word; index 0 is left unused.
            token_index[word] = len(token_index) + 1

# Only the first max_length words of each sample are encoded.
max_length = 10
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
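To check that the encoding does what is intended, it helps to look at the tensor shape and decode one sample back into words. The following is a minimal sketch, not part of the original post: the helper names `one_hot_words` and `decode` are made up for illustration, and the tokenization is the same whitespace split as above.

```python
import numpy as np

def one_hot_words(samples, max_length):
    """Return (tensor, token_index) for whitespace-tokenized samples."""
    token_index = {}
    for sample in samples:
        for word in sample.split():
            # Index 0 is left unused, as in the code above.
            token_index.setdefault(word, len(token_index) + 1)
    tensor = np.zeros((len(samples), max_length, len(token_index) + 1))
    for i, sample in enumerate(samples):
        for j, word in enumerate(sample.split()[:max_length]):
            tensor[i, j, token_index[word]] = 1.0
    return tensor, token_index

def decode(sample_matrix, token_index):
    """Map each non-empty one-hot row back to its word."""
    index_to_word = {index: word for word, index in token_index.items()}
    return [index_to_word[int(row.argmax())] for row in sample_matrix if row.any()]

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
tensor, token_index = one_hot_words(samples, max_length=10)
print(tensor.shape)                    # (2, 10, 11)
print(decode(tensor[0], token_index))  # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
```

Note that 'The' and 'the' receive different indices and 'mat.' keeps its period, because a plain `split()` does no normalization.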
(2) Character-level one-hot encoding
import string
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # All printable ASCII characters
# Map each character to a unique index starting at 1 (index 0 is left unused).
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # Truncate to max_length so that long samples cannot overflow the tensor.
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
(3) Word-level one-hot encoding with Keras
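from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Create a tokenizer that keeps only the 1,000 most frequent words.
tokenizer = Tokenizer(num_words=1000)
# Build the word index from the samples.
tokenizer.fit_on_texts(samples)
# Turn each string into a list of integer word indices.
sequences = tokenizer.texts_to_sequences(samples)
# Get one binary vector per sample (1 where a word occurs in that sample).
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
# Recover the computed word index.
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))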
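With its default settings, the Keras `Tokenizer` lowercases the text and filters out most punctuation before splitting, so its vocabulary differs from the one produced by the plain `split()` in (1). The sketch below approximates that normalization in pure Python to show the effect on vocabulary size. The helper `build_index` is hypothetical, and unlike Keras it numbers words in order of first appearance rather than by frequency.

```python
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

def build_index(samples, normalize):
    """Build a word -> index mapping, optionally lowercasing and dropping punctuation."""
    # Replace every punctuation character with a space before splitting.
    punct_to_space = str.maketrans({char: ' ' for char in string.punctuation})
    token_index = {}
    for sample in samples:
        text = sample.lower().translate(punct_to_space) if normalize else sample
        for word in text.split():
            token_index.setdefault(word, len(token_index) + 1)
    return token_index

raw_index = build_index(samples, normalize=False)
clean_index = build_index(samples, normalize=True)
print(len(raw_index))    # 10: 'The' and 'the' are distinct, 'mat.' keeps its period
print(len(clean_index))  # 9
```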
(4) Word-level one-hot encoding with the hashing trick
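import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# Words are stored as vectors of size 1000. No explicit word index is kept,
# so distinct words can collide once the vocabulary approaches this size.
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a pseudo-random integer index between 0 and 999.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.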
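One caveat with the built-in `hash()`: in Python 3, string hashes are salted per process by default, so the same word can map to a different index on each run unless `PYTHONHASHSEED` is fixed. If the encoding must be reproducible (for example, when a model is trained in one process and used in another), a deterministic hash can be swapped in. The sketch below uses SHA-256 from `hashlib`; the function name `stable_index` and the choice of SHA-256 are assumptions made for illustration, not something the original post uses.

```python
import hashlib

def stable_index(word, dimensionality=1000):
    """Map a word to an index in [0, dimensionality) that is identical in every run."""
    digest = hashlib.sha256(word.encode('utf-8')).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], 'big') % dimensionality

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
for sample in samples:
    print([stable_index(word) for word in sample.split()])
```

The returned index can replace `abs(hash(word)) % dimensionality` in the loop above without any other change.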
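2. Word embeddings
There are two ways to obtain word embeddings: learn them jointly with the main task you are solving, or load pretrained embeddings that come from another source.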
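Whichever way the embeddings are obtained, they end up as a matrix with one row per word index, and applying them amounts to a row lookup. The NumPy sketch below illustrates this with toy sizes and random stand-in weights (all values are assumptions chosen for illustration), and shows that the lookup gives the same result as multiplying one-hot vectors by the matrix.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3  # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
# Random stand-in for learned or pretrained weights.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([2, 5, 2])    # a toy sequence of word indices

# Dense lookup: pick one row per word index.
looked_up = embedding_matrix[word_ids]

# Equivalent one-hot formulation: (seq_len, vocab_size) @ (vocab_size, embedding_dim).
one_hot = np.eye(vocab_size)[word_ids]
projected = one_hot @ embedding_matrix

print(looked_up.shape)                    # (3, 3)
print(np.allclose(looked_up, projected))  # True
```

The lookup form avoids materializing the sparse one-hot vectors, which is why embeddings are far more compact than the encodings in section 1.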
(1) Learning word embeddings while training on the task
from keras.datasets import imdb
from keras import preprocessing
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

# Number of most frequent words to keep as features.
max_features = 10000
# Cut each review off after this many words.
maxlen = 20

# Load the data as lists of integer word indices.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# Turn the lists into 2D integer tensors of shape (samples, maxlen).
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
# The Embedding layer outputs a tensor of shape (samples, maxlen, 8).
model.add(Embedding(max_features, 8, input_length=maxlen))
# Flatten it to shape (samples, maxlen * 8) for the classifier.
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
(2) Using a GloVe word embedding file
Omitted.
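For orientation only, here is a rough sketch of how a GloVe text file is commonly turned into an embedding matrix. Each line of such a file holds a word followed by its vector components, separated by spaces. The helper names, the toy vectors, and the `word_index` below are all made up for illustration; a real run would pass an opened GloVe file instead of the in-memory stand-in.

```python
import io

def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ... vn') into a {word: [floats]} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, max_words, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(max_words)]
    for word, i in word_index.items():
        if i < max_words and word in vectors:
            matrix[i] = vectors[word]
    return matrix

# Toy stand-in for a real GloVe file; the numbers are invented.
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove(toy_file)
word_index = {'the': 1, 'cat': 2, 'dog': 3}  # hypothetical index, e.g. from a Tokenizer
matrix = build_embedding_matrix(word_index, vectors, max_words=4, dim=3)
print(matrix[2])  # [0.4, 0.5, 0.6]
print(matrix[3])  # [0.0, 0.0, 0.0] because 'dog' has no pretrained vector here
```

A matrix built this way, converted to a NumPy array, would typically be loaded into the `Embedding` layer's weights and frozen so that training does not overwrite the pretrained vectors.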