BMI598: Natural Language Processing

2018-11-05

Author: Zongwei Zhou | 周纵苇
Weibo: @MrGiovanni
Email: zongweiz@asu.edu

1. Token Features


1.1 token features

1.2 context features

1.3 sentence features

1.4 section features

1.5 document features
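As a sketch of what token- and context-level features might look like in practice (the feature names and choices here are illustrative, not from the course):

```python
def token_features(tokens, i):
    """Token-level features for tokens[i], plus simple context features
    drawn from the neighboring tokens. Feature names are illustrative."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),                 # normalized surface form
        "is_capitalized": tok[:1].isupper(),  # orthographic cue
        "is_digit": tok.isdigit(),
        "suffix3": tok[-3:],                  # crude morphological cue
        # context features: the previous and next token
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

token_features("The patient denies chest pain".split(), 3)
```

Sentence-, section-, and document-level features would be computed analogously over larger windows of the text.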

1.6 normalization

The difference between Stemming and Lemmatization
Stemming: rule-based

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('wolves')
# the rule strips the suffix 'es', leaving a non-word
u'wolv'

Lemmatization: dictionary-based (looks the word up in WordNet)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('wolves')
# the dictionary lookup returns the correct lemma
u'wolf'

2. Word Embedding


2.1 tf-idf

The feature vector length equals the size of the whole vocabulary.
Key idea: counting.
See the worked example at https://en.wikipedia.org/wiki/Tf%E2%80%93idf
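The counting idea can be sketched directly: raw term frequency times log inverse document frequency, one dimension per vocabulary word (the toy corpus below is illustrative):

```python
import math

# Toy corpus; tf-idf turns each document into a vector of length |vocabulary|.
docs = [
    "this is a sample".split(),
    "this is another another example".split(),
]
vocab = sorted({w for d in docs for w in d})

def tf_idf(term, doc, docs):
    tf = doc.count(term)                        # raw count of the term in this document
    df = sum(1 for d in docs if term in d)      # number of documents containing the term
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

# Each document becomes a vector with one dimension per vocabulary word.
vectors = [[tf_idf(w, d, docs) for w in vocab] for d in docs]
```

Note that a word appearing in every document (like "this" here) gets weight 0: it carries no discriminative information.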

2.2 word2vec

The feature length is fixed and usually small (a few hundred dimensions).

Start with V random 300-dimensional vectors as initial embeddings
Use logistic regression, the second most basic classifier in machine learning after naïve Bayes

Pre-trained models are available for download
https://code.google.com/archive/p/word2vec/
You can use gensim (in python) to access the models
http://nlp.stanford.edu/projects/glove/

Brilliant insight: Use running text as implicitly supervised training data!

Setup
Let's represent words as vectors of some length (say 300), randomly initialized.
So we start with 300 * V random parameters, where V is the number of words in the vocabulary.
Over the entire training set, we'd like to adjust those word vectors such that we maximize the similarity of (word, context) pairs that occur together and minimize the similarity of pairs that do not.

Learning the classifier
Iterative process.
We’ll start with 0 or random weights
Then adjust the word weights to make the observed (word, context) pairs more likely and the sampled negative pairs less likely
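The iterative process above (skip-gram with negative sampling) can be sketched in numpy. The 300 dimensions and random initialization follow the notes; everything else (toy vocabulary size, learning rate, word indices) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 100, 300                           # toy vocabulary size; 300 dims as in the notes
W = rng.normal(scale=0.1, size=(V, dim))    # target-word vectors, randomly initialized
C = rng.normal(scale=0.1, size=(V, dim))    # context-word vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(target, context, negatives, lr=0.1):
    """One SGD step: logistic regression on the dot product, with
    label 1 for the observed (target, context) pair and 0 for negatives."""
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        grad = sigmoid(W[target] @ C[idx]) - label   # dLoss/dScore for logistic loss
        g_w = grad * C[idx]
        g_c = grad * W[target]
        W[target] -= lr * g_w
        C[idx] -= lr * g_c

# Repeated updates make the observed pair more likely, the negatives less likely.
for _ in range(300):
    sgns_step(3, 7, negatives=[11, 42])
```

After training, W (or W + C) is kept as the word embedding matrix; the classifier itself is discarded.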

3. Sentence vectors


Distributed Representations of Sentences and Documents

PV-DM (Distributed Memory): the paragraph vector is averaged or concatenated with the context word vectors to predict the next word, so it acts as a memory of what the paragraph is about.

What about the unseen paragraphs? At inference time, hold the word vectors and softmax weights fixed and learn a new paragraph vector by gradient descent.

PV-DBOW (Distributed Bag of Words): the paragraph vector alone is used to predict words randomly sampled from the paragraph, ignoring word order.

When predicting the sentiment of a sentence, use the paragraph vector instead of single word embeddings.

4. Neural Network


\sigma(z)=\frac{1}{1+e^{-z}}
\mathrm{softmax}(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{d}e^{z_j}},\quad 1\leq i\leq d

import numpy as np
z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
# subtracting max(z) before exponentiating avoids overflow and leaves the result unchanged
softmax = lambda z: np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))
softmax(z)
array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054, 0.06426166, 0.1746813 ])

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

5. Highlight summary

