NLP in TensorFlow: Data Preprocessing

2019-08-07 poteman

Explore the BBC news archive. The main topics are Tokenizer and pad_sequences.

Import the required packages

import csv
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Download the data and define the stopwords

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
    -O /tmp/bbc-text.csv
stopwords = [ "a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

Preprocess the data: remove the stopwords and collect the labels and text

sentences = []
labels = []
with open("/tmp/bbc-text.csv", 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        labels.append(row[0])  # column 0: label
        sentence = row[1]      # column 1: text
        for word in stopwords:
            # match the stopword only as a whole, space-delimited word
            token = " " + word + " "
            sentence = sentence.replace(token, " ")
            sentence = sentence.replace("  ", " ")
        sentences.append(sentence)

print(len(sentences))
print(sentences[0])
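
One thing to keep in mind with the replace-based loop above: each stopword is matched as " word " with a space on both sides, so a stopword at the very start or end of a text is left in place. Below is a minimal sketch of a split-based alternative for comparison. The shortened stopword list, the sample text, and the helper names `remove_stopwords` and `remove_stopwords_by_replace` are assumptions for illustration, not part of the original code.

```python
# Hypothetical, shortened stopword list for illustration only.
demo_stopwords = ["a", "is", "the", "this", "of"]

def remove_stopwords(text, stopwords):
    """Drop every whitespace-delimited word that appears in `stopwords`."""
    stop = set(stopwords)
    return " ".join(w for w in text.split() if w not in stop)

def remove_stopwords_by_replace(text, stopwords):
    """Same approach as the loop above: replace ' word ' with one space."""
    for word in stopwords:
        text = text.replace(" " + word + " ", " ")
        text = text.replace("  ", " ")
    return text

sample = "this is a review of the match"
split_result = remove_stopwords(sample, demo_stopwords)
replace_result = remove_stopwords_by_replace(sample, demo_stopwords)
print(split_result)    # leading "this" is removed
print(replace_result)  # leading "this" survives
```

The split-based version also collapses repeated whitespace as a side effect, which may or may not be what you want.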

Tokenize the words

# words not seen during fit_on_texts are mapped to the OOV token
tokenizer = Tokenizer(num_words=100000, oov_token='OOV')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index  # maps each word to an integer index
print(len(word_index))
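
To make the index assignment concrete, here is a simplified pure-Python sketch of how a word index of this kind is built: index 1 is reserved for the OOV token, and the remaining words are numbered by descending frequency. It is an illustration only, not the Keras implementation. `build_word_index` and `to_sequence` are hypothetical helpers, the two-sentence corpus is made up, and the punctuation filtering that Tokenizer performs is omitted.

```python
from collections import Counter

def build_word_index(texts, oov_token="OOV"):
    """Number words by descending frequency; index 1 is the OOV token."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    ordered = [oov_token] + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(ordered, start=1)}

def to_sequence(text, word_index, oov_token="OOV"):
    """Map each word to its index; unseen words map to the OOV index."""
    oov = word_index[oov_token]
    return [word_index.get(w, oov) for w in text.lower().split()]

# Made-up corpus for illustration
corpus = ["tv future hands viewers", "tv viewers want more tv"]
demo_word_index = build_word_index(corpus)
demo_sequence = to_sequence("viewers love tv", demo_word_index)
print(demo_word_index)
print(demo_sequence)  # "love" was never seen, so it maps to index 1
```

Note that in Keras, num_words does not shrink word_index itself; it only limits which indices texts_to_sequences emits.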

Padding: make all sequences the same length

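sequences = tokenizer.texts_to_sequences(sentences)
# maxlen=None pads every sequence to the length of the longest one,
# so truncating has no effect here; padding='post' appends the zeros
padded = pad_sequences(sequences, maxlen=None, padding='post', truncating='pre')
print(padded[0])
print(padded.shape)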
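
The effect of padding='post' and truncating='pre' can be sketched in plain Python. This is a simplified illustration of the behavior under those two settings, not the Keras implementation; `pad_post` is a hypothetical helper and the input sequences are made up.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Pad with `value` at the end; drop items from the front if too long."""
    if maxlen is None:
        # default: use the length of the longest sequence
        maxlen = max(len(s) for s in sequences)
    result = []
    for seq in sequences:
        # 'pre' truncation keeps the last `maxlen` items
        trimmed = seq[len(seq) - maxlen:] if len(seq) > maxlen else list(seq)
        # 'post' padding appends the fill value
        result.append(trimmed + [value] * (maxlen - len(trimmed)))
    return result

# Made-up sequences of different lengths
demo = [[5, 3], [7, 2, 9, 4], [1]]
padded_default = pad_post(demo)
padded_short = pad_post(demo, maxlen=3)
print(padded_default)
print(padded_short)
```

With maxlen=3 the four-item sequence loses its first item, while the shorter ones are filled with zeros at the end.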

Tokenize the labels in the same way

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
label_seq = label_tokenizer.texts_to_sequences(labels)
label_word_index = label_tokenizer.word_index

print(label_seq)
print(label_word_index)
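
Because texts_to_sequences returns one list per input text, label_seq is a list of single-item lists, and the indices start at 1 rather than 0. A minimal sketch of flattening such a result and mapping the indices back to label names follows. The values below are hypothetical stand-ins shaped like label_seq and label_word_index, not actual output from the dataset.

```python
# Hypothetical values, shaped like label_seq and label_word_index
demo_label_seq = [[2], [1], [3], [1]]
demo_label_word_index = {"sport": 1, "tech": 2, "politics": 3}

# Flatten the single-item lists into plain integers
flat_labels = [seq[0] for seq in demo_label_seq]

# Invert the mapping to recover label names from indices
index_to_label = {i: w for w, i in demo_label_word_index.items()}
decoded = [index_to_label[i] for i in flat_labels]

print(flat_labels)
print(decoded)
```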

