text-processing

2021-08-06  Wilbur_

Notes from: https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63

The basics of NLP for text

  1. Sentence Tokenization
  2. Word Tokenization
  3. Text Lemmatization and Stemming
  4. Stop Words
  5. Regex
  6. Bag-of-words
  7. TF-IDF
  1. Sentence tokenization
    Sentence tokenization is the problem of dividing a string into its component sentences.
    To apply sentence tokenization with NLTK, we can use the nltk.sent_tokenize function:
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()
  2. Word tokenization
    Word tokenization is the problem of dividing a string into its component words. We can use the nltk.word_tokenize function:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()
  3. Text lemmatization and stemming
    Documents can contain different forms of a word such as drive, drives, driving.

The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Python code:

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word and pos (part of speech)
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)

Output:

Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive
  4. Stop words
    Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise.
    Stop words usually refer to the most common words in a language, such as "and", "the", and "a".
    NLTK has a predefined list of stop words covering these most common words.
from nltk.corpus import stopwords
print(stopwords.words("english"))
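
As a minimal sketch (reusing the backgammon sentence from above), the stop word list can be used to filter a tokenized sentence:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

# Keep only the tokens that are not in the stop word list
words = word_tokenize(sentence)
without_stop_words = [word for word in words if word.lower() not in stop_words]
print(without_stop_words)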
  5. Regex
    A regular expression (regex) is a more powerful and flexible tool for tokenizing text exactly the way you want; a sketch follows below.
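
As a minimal sketch, Python's re module can tokenize a sentence by keeping only runs of word characters (the pattern and example sentence are illustrative):

import re

sentence = "Sentence tokenization is the problem of dividing a string into its component sentences."
pattern = r"[\w']+"

# Return every maximal run of word characters (and apostrophes) as a token
print(re.findall(pattern, sentence))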

  6. Bag-of-words
    Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers.

The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

To use this model, we need to:

  1. Design a vocabulary of known words
  2. Choose a measure of the presence of known words

Any information about the order or structure of words is discarded. That's why it's called a bag of words. The model only captures whether a known word occurs in a document, not where in the document it occurs.

The intuition is that similar documents have similar content, and from the content alone we can learn something about the meaning of a document.

We can use the CountVectorizer class from the sklearn library to design our vocabulary.

Creating the document vectors
Next, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model. The simplest scoring method is to mark the presence of a word with 1 and its absence with 0.

Python code using the CountVectorizer class:

# Import the libraries we need
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 1. Define the data (example documents, chosen to match the comment below about the dropped "I" and "s" tokens)
documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it.",
             "Nice one. I love it."]

# Step 2. Design the Vocabulary
# The default token pattern removes tokens of a single character. That's why we don't have the "I" and "s" tokens in the output
count_vectorizer = CountVectorizer()

# Step 3. Create the Bag-of-Words Model
bag_of_words = count_vectorizer.fit_transform(documents)

# Show the Bag-of-Words Model as a pandas DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is the current name
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)

The complexity of the bag-of-words model comes in deciding how to design the vocabulary of known words and how to score the presence of known words.

Designing the Vocabulary
When the vocabulary size increases, the vector representation of the documents also grows. A naive vocabulary wastes a lot of space, because most entries of each vector are zeros. We can therefore apply text cleaning techniques before we create our bag-of-words model, such as lowercasing, removing punctuation, removing stop words, and reducing words to their base form; a sketch follows below.
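
As a minimal sketch (my own illustration, not from the article), a simple cleaning step before vectorizing could lowercase the text, strip punctuation, and drop stop words:

import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Lowercase, keep only alphabetic tokens, then drop stop words
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(word for word in words if word not in stop_words)

print(clean_text("It is a two player game where each player has fifteen checkers."))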

Another more complex way to create a vocabulary is to use grouped words. This changes the scope of the vocabulary and allows the bag-of-words model to get more details about the document. This approach is called n-grams.

An n-gram is a sequence of a number of items (words, letters, numbers, digits, etc.). In the context of text corpora, n-grams typically refer to sequences of words: a unigram is one word, a bigram is a sequence of two words, and so on.

The bag-of-bigrams approach is more powerful than plain bag-of-words because it captures some local word order; a sketch follows below.
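
As a minimal sketch, CountVectorizer's ngram_range parameter builds a bag-of-bigrams (the example documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

documents = ["I like this movie, it's funny.",
             "I hate this movie."]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep both unigrams and bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bag_of_bigrams = bigram_vectorizer.fit_transform(documents)

print(pd.DataFrame(bag_of_bigrams.toarray(),
                   columns=bigram_vectorizer.get_feature_names_out()))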

Scoring words
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We have already seen one very simple approach: 1 if the word is present, 0 if it is absent; a sketch follows below.
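
As a minimal sketch, CountVectorizer's binary parameter switches from raw counts to this presence/absence scoring (the example documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

documents = ["I like this movie, it's funny.",
             "I hate this movie. I hate it."]

# binary=True marks presence (1) or absence (0) instead of counting occurrences
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(documents).toarray())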

  7. TF-IDF
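    TF-IDF (term frequency-inverse document frequency) scores a word by how often it appears in a document, discounted by how common the word is across all documents, so frequent but uninformative words get low weights. As a minimal sketch, scikit-learn's TfidfVectorizer computes these scores (the example documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it."]

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(documents)

# Each row is a document, each column a vocabulary word, each cell a TF-IDF weight
print(pd.DataFrame(tfidf.toarray(),
                   columns=tfidf_vectorizer.get_feature_names_out()))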
