Feature Engineering


I. Contents
1. One-hot encoding
(1) Word-level one-hot encoding
(2) Character-level one-hot encoding
(3) Word-level one-hot encoding with Keras
(4) Word-level one-hot encoding with the hashing trick
2. Word embeddings

II. Code
1. One-hot encoding
(1) Word-level one-hot encoding

import numpy as np
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index mapping each unique word to an integer (index 0 is not used)
token_index = {}
for sample in samples:
  for word in sample.split():
    if word not in token_index:
      token_index[word] = len(token_index) + 1

max_length = 10
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
  for j, word in list(enumerate(sample.split()))[:max_length]:
    index = token_index.get(word)
    results[i, j, index] = 1.
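
A quick sanity check of the tensor we just built (a sketch; the shape follows from the 10 distinct tokens in the two samples, plus the unused index 0):

print(results.shape)  # (2, 10, 11): 2 samples, 10 timesteps, vocabulary size 10 plus unused index 0

# Recover a word from its one-hot vector by inverting the token index
reverse_index = {v: k for k, v in token_index.items()}
print(reverse_index[int(np.argmax(results[0, 1]))])  # 'cat'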

(2) Character-level one-hot encoding

import string
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # All printable ASCII characters
token_index = dict(zip(characters, range(1, len(characters) + 1)))  # Map each character to an integer index

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
  for j, character in enumerate(sample):
    index = token_index.get(character)
    results[i, j, index] = 1.

(3) Word-level one-hot encoding with Keras

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
tokenizer = Tokenizer(num_words=1000)  # Only consider the 1,000 most common words
tokenizer.fit_on_texts(samples)  # Build the word index

sequences = tokenizer.texts_to_sequences(samples)  # Turn strings into lists of integer indices

one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')  # Get one-hot binary vectors directly
word_index = tokenizer.word_index

print('Found %s unique tokens.' % len(word_index))
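
To see what the tokenizer actually produced, you can print the intermediate results. The values shown in the comments are what I would expect with Keras defaults (which lowercase text and strip punctuation); treat the exact indices as indicative:

print(sequences)
# e.g. [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]] -- 'the' is most frequent, so it gets index 1
print(one_hot_results.shape)  # (2, 1000): one binary vocabulary-sized row per sample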

(4) Word-level one-hot encoding with the hashing trick

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
  for j, word in list(enumerate(sample.split()))[:max_length]:
    index = abs(hash(word)) % dimensionality  # Hash the word into an index between 0 and 999; collisions become likely if the vocabulary is much larger than dimensionality
    results[i, j, index] = 1.
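
One caveat with the snippet above: Python's built-in hash() is salted per process for strings (unless PYTHONHASHSEED is fixed), so the indices are not reproducible across runs. A minimal deterministic variant using hashlib (the helper name stable_index is my own):

import hashlib
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10

def stable_index(word, dim):
  # MD5 yields the same digest in every process, unlike the salted built-in hash()
  return int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16) % dim

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
  for j, word in list(enumerate(sample.split()))[:max_length]:
    results[i, j, stable_index(word, dimensionality)] = 1.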

2. Word embeddings
Word embeddings can be obtained in two ways: learn them jointly with the main task you care about (the embedding starts random and is trained together with the network's other weights), or load pretrained word embeddings that were computed elsewhere.
(1) Word embeddings learned while training on the task

from keras.datasets import imdb
from keras import preprocessing
max_features = 10000  # Number of words to consider as features
maxlen = 20  # Cut texts off after this many words

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding
model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))  # Learn an 8-dimensional embedding for each of 10,000 tokens
model.add(Flatten())  # Flatten the (maxlen, 8) embeddings into a single vector
model.add(Dense(1, activation='sigmoid'))  # Binary sentiment classifier
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
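
After training, the curves recorded by fit() can be read back from history; a minimal sketch (with metrics=['acc'], older Keras versions store the keys as 'acc' and 'val_acc'):

acc = history.history['acc']
val_acc = history.history['val_acc']
print('final training acc: %.3f, validation acc: %.3f' % (acc[-1], val_acc[-1]))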

(2) Using a pretrained GloVe word-embedding file
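
The body of this section is missing in the original. Below is a minimal sketch of the standard recipe: parse a GloVe file into a {word: vector} dict, build an embedding matrix aligned with the tokenizer's word_index, and load it into a frozen Embedding layer. The file path glove.6B.100d.txt is an assumption (the files are available from https://nlp.stanford.edu/projects/glove/), and word_index / model stand in for a tokenizer index like the one built earlier and a model whose first layer is Embedding(max_words, embedding_dim, input_length=maxlen):

import numpy as np

# Parse the GloVe file into {word: vector}; the path is an assumption
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
  for line in f:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Build an embedding matrix aligned with the tokenizer's word_index
embedding_dim = 100
max_words = 10000
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
  if i < max_words:
    vector = embeddings_index.get(word)
    if vector is not None:  # Words absent from GloVe stay all-zeros
      embedding_matrix[i] = vector

# Load the pretrained weights into the model's Embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False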
