text-processing

2021-08-06  Wilbur_

Notes from: https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63

The basics of NLP for text

  1. Sentence Tokenization
  2. Word Tokenization
  3. Text Lemmatization and Stemming
  4. Stop Words
  5. Regex
  6. Bag-of-words
  7. TF-IDF
  1. Sentence tokenization
    Sentence tokenization is the problem of dividing a string into its component sentences.
    To apply sentence tokenization with NLTK, we can use the nltk.sent_tokenize function:
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()
  2. Word tokenization
    Word tokenization is the problem of dividing a string into its component words. We can use the nltk.word_tokenize function:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()
  3. Text lemmatization and stemming
    Documents can contain different forms of a word such as drive, drives, driving.

The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

Python code:

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word and pos (part of speech)
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)

Output:

Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive
  4. Stop words
    Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise.
    Stop words usually refer to the most common words in a language, such as "and", "the", and "a".
    NLTK has a predefined list of stop words covering these most common words.
from nltk.corpus import stopwords
print(stopwords.words("english"))
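
As a minimal sketch (reusing the backgammon sentence from above), the stop word list can be used to filter a tokenized sentence:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

# Keep only the tokens that are not in the stop word list
words = word_tokenize(sentence)
without_stop_words = [word for word in words if word.lower() not in stop_words]
print(without_stop_words)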
  5. Regex
    A regular expression (regex) is a more powerful and flexible tool for tokenizing text exactly the way you want; a sketch follows below.
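
As a minimal sketch, Python's re module can tokenize a sentence by keeping only runs of word characters (the pattern and example sentence are illustrative):

import re

sentence = "Sentence tokenization is the problem of dividing a string into its component sentences."
pattern = r"[\w']+"

# Return every maximal run of word characters (and apostrophes) as a token
print(re.findall(pattern, sentence))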

  6. Bag-of-words
    Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers.

The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

To use this model, we need to:

  1. Design a vocabulary of known words
  2. Choose a measure of the presence of known words

Any information about the order or structure of words is discarded. That's why it's called a bag of words. The model only captures whether a known word occurs in a document, not where in the document it occurs.

The intuition is that similar documents have similar content, and from the content alone we can learn something about the meaning of a document.

We can use the CountVectorizer class from the sklearn library to design our vocabulary.

Creating the document vectors
Next, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model. The simplest scoring method is to mark the presence of a word with 1 and its absence with 0.

Python code using the CountVectorizer class:

# Import the libraries we need
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 1. Define the data (example documents, chosen to match the comment below about the dropped "I" and "s" tokens)
documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it.",
             "Nice one. I love it."]

# Step 2. Design the Vocabulary
# The default token pattern removes tokens of a single character. That's why we don't have the "I" and "s" tokens in the output
count_vectorizer = CountVectorizer()

# Step 3. Create the Bag-of-Words Model
bag_of_words = count_vectorizer.fit_transform(documents)

# Show the Bag-of-Words Model as a pandas DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is the current name
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)

The complexity of the bag-of-words model comes in deciding how to design the vocabulary of known words and how to score the presence of known words.

Designing the Vocabulary
When the vocabulary size increases, the vector representation of the documents also grows. A naive vocabulary wastes a lot of space, because most entries of each vector are zeros. We can therefore apply text cleaning techniques before we create our bag-of-words model, such as lowercasing, removing punctuation, removing stop words, and reducing words to their base form; a sketch follows below.
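
As a minimal sketch (my own illustration, not from the article), a simple cleaning step before vectorizing could lowercase the text, strip punctuation, and drop stop words:

import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Lowercase, keep only alphabetic tokens, then drop stop words
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(word for word in words if word not in stop_words)

print(clean_text("It is a two player game where each player has fifteen checkers."))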

Another more complex way to create a vocabulary is to use grouped words. This changes the scope of the vocabulary and allows the bag-of-words model to get more details about the document. This approach is called n-grams.

An n-gram is a sequence of a number of items (words, letters, numbers, digits, etc.). In the context of text corpora, n-grams typically refer to sequences of words: a unigram is one word, a bigram is a sequence of two words, and so on.

The bag-of-bigrams approach is more powerful than plain bag-of-words because it captures some local word order; a sketch follows below.
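
As a minimal sketch, CountVectorizer's ngram_range parameter builds a bag-of-bigrams (the example documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

documents = ["I like this movie, it's funny.",
             "I hate this movie."]

# ngram_range=(2, 2) keeps only bigrams; (1, 2) would keep both unigrams and bigrams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bag_of_bigrams = bigram_vectorizer.fit_transform(documents)

print(pd.DataFrame(bag_of_bigrams.toarray(),
                   columns=bigram_vectorizer.get_feature_names_out()))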

Scoring words
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We have already seen one very simple approach: 1 if the word is present, 0 if it is absent; a sketch follows below.
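
As a minimal sketch, CountVectorizer's binary parameter switches from raw counts to this presence/absence scoring (the example documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

documents = ["I like this movie, it's funny.",
             "I hate this movie. I hate it."]

# binary=True marks presence (1) or absence (0) instead of counting occurrences
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(documents).toarray())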

  7. TF-IDF
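    TF-IDF (term frequency-inverse document frequency) scores a word by how often it appears in a document, discounted by how common the word is across all documents, so frequent but uninformative words get low weights. As a minimal sketch, scikit-learn's TfidfVectorizer computes these scores (the example documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it."]

tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(documents)

# Each row is a document, each column a vocabulary word, each cell a TF-IDF weight
print(pd.DataFrame(tfidf.toarray(),
                   columns=tfidf_vectorizer.get_feature_names_out()))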
