2. NLP with Probabilistic Models

2020-12-03 本文已影响0人 Kevin不会创作

Autocorrect
Part of Speech Tagging and Hidden Markov Models
Autocomplete and Language Models

Autocorrect

What is autocorrect?

Autocorrect is an application that changes misspelled words into the correct ones.
How it works?
1. Identify a misspelled word
2. Find strings n edit distance away
3. Filter candidates
4. Calculate word probabilities
Identify a misspelled word

if word not in vocab:
    misspelled = True

Find strings n edit distance away
- Edit: an operation performed on a string to change it
- Edit distance counts the number of edit operations
- n is usually one to three edits
- Insert (add a letter)
  to: top, two
- Delete (remove a letter)
  hat: ha, at, ht
- Switch (swap 2 adjacent letters)
  eta: eat, tea
- Replace (change 1 letter to another)
  jaw: jar, paw
Filter candidates

Compare these strings to a known dictionary or vocabulary. If the string does not appear in the dictionary, remove it from the list of candidates.
Calculate word probabilities

$P(w)=\frac{C(w)}{V}$
- $P(w)$ Probability of a word
- $C(w)$ Number of times the word appears
- $V$ Total size of the corpus

Minimum edit distance

How to evaluate similarity between 2 strings?
Minimum number of edits needed to transform 1 string into the other
Applications
- Spelling correction
- Document similarity
- Machine translation
- DNA sequencing
Edits (Edit cost)
- Insert (1)
- Delete (1)
- Replace (2)
Example

Source: play
Target: stay
- p $\rightarrow$ s : replace
- l $\rightarrow$ t : replace
- Edit distance = 2 + 2 = 4

Part of Speech Tagging and Hidden Markov Models

What is part of speech tagging?

Part of speech refers to the category of words or the lexical terms in the language. Examples of these lexical terms in the English language would be noun, verb, adjective, adverb, pronoun, preposition, etc. The process of assigning these tags to the words of a sentence or your corpus is referred to as parts of speech tagging, or POS tagging for short.

Markov Chains

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

You can use a table called transition matrix to store the states and transition probabilities. Each row in the matrix represents transition probabilities of one state to all other states. Notice that for all the odds going transition probabilities from a given state, the sum of these transition probabilities should always be one. To assign a part of speech tag to the first word in the sentence, you can introduce what is known as an initial state by you include these probabilities in the table.

Transition Matrix

Hidden Markov Model

The name hidden Markov model implies that states are hidden or not directly observable. Going back to the Markov model that has the states for the parts of speech, such as noun, verb, or other, you can now think of these as hidden states because these are not directly observable from the text data.

The hidden Markov model also has additional probabilities known as emission probabilities. These describe the transition from the hidden states of your hidden Markov model, which are parts of speech seen here as circles for noun, verb, and other, to the observables or the words of your corpus.

Emission Matrix

Calculation

Calculate Transition Probabilities
1. Count occurrences of tag pairs $C(t_{i-1},t_i)$
2. Calculate probabilities using the counts
  $P(t_i|t_{i-1})=\frac{C(t_{i-1},t_i)}{\sum_{j=1}^{N}C(t_{i-1},t_{j})}$
Populating the Transition Matrix
Transition Matrix
To avoid a division by zero, you can change your formula slightly
$P(t_i|t_{i-1})=\frac{C(t_{i-1},t_i)+\epsilon}{\sum_{j=1}^{N}C(t_{i-1},t_{j})+N*{\epsilon}}$
Populating the Emission Matrix
Emission Matrix
$P(w_i|t_i)=\frac{C(t_{i},w_i)+\epsilon}{\sum_{j=1}^{V}C(t_{i},w_{j})+N*{\epsilon}}=\frac{C(t_{i},w_i)+\epsilon}{C(t_i)+N*{\epsilon}}$

Viterbi Algorithm

If you're given a sentence like "Why not learn something?", what is the most likely sequence of parts of speech tags given the sentence in your model? The sequence can be computed using the Viterbi algorithm.

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMM).

Viterbi Algorithm

Initialization

Given your transition and emission probabilities, you first populates and then use the auxiliary matrices C and D. The matrix C holds the intermediate optimal probabilities and matrix D, the indices of the visited states.

In the initialization step, the first column of each of our matrices, C and D is populated. The first column of C represents probability of the transitions from the start states by in the graph to the first tag $t_i$ and word $W_1$ .
Initialization
In the D matrix, you store the labels that represent the different states you're traversing when finding the most likely sequence of parts of speech tags for the given sequence of words, $W_1$ all the way to $W_k$ .
Initialization
Forward Pass
Forward
- $b_{i,cindex(w_2)}$ is the emission probability from tag $t_1$ toward $w_2$
- $a_{k,i}$ is the transition probability from the parts of speech tag $t_k$ to the current tag $t_1$
- $c_{k,j-1}$ is the probability for the preceding path you've traversed
Backward Pass

You'll retrieve the most likely sequence of parts of speech tags for your given sequence of words using Backward pass.
Backward Backward

Autocomplete and Language Models

What language model can do?
- Autocomplete a sentence
  - Estimate probability of word sequences
  - Estimate probability of a word following a sequence of words
- Speech recognition
  Example: P(I saw a van) > P(eyes awe of an)
- Spelling correction
  Example: P(entered the shop to buy) > P(entered the ship to buy)
- Augmentative communication
  Predict most likely word from menu for people unable to physically talk or sign

N-grams Language Model

What is N-gram?

An N-gram is a sequence of N words.
- Example
  
  Corpus: I am happy because I am learning
  Unigrams: {I, am, happy, because, learning}
  Bigrams: {I am, am happy, happy because, ...}
  Trigrams: {I am happy, am happy because, ...}
Probability
- Unigram probability
  $P(w)=\frac{C(w)}{m}$
- Bigram probability
  $P(y|x)=\frac{C(xy)}{\sum_{w}C(xw)}=\frac{C(xy)}{C(x)}$
- Trigram probability
  $P(w_3|w_1^2)=\frac{C(w_1^2{w_3})}{C(w_1^2)}$
- N-gram probability
  $P(w_N|w_1^{N-1})=\frac{C(w_1^{N-1}w_N)}{C(w_1^{N-{1}})}$
Sequence Probability

2. NLP with Probabilistic Models

Table of Contents

Autocorrect

Minimum edit distance

Part of Speech Tagging and Hidden Markov Models

Markov Chains

Hidden Markov Model

Calculation

Viterbi Algorithm

Autocomplete and Language Models

N-grams Language Model

猜你喜欢

热点阅读