A Survey of Text Classification
Continuously updated
Introduction
1. Definition
What is text classification? Put simply, it is the task of assigning a piece of text to one or more predefined categories. It falls within the scope of document classification.
Input:
a document d
a fixed set of classes C = {c1, c2, ... , cn}
Output:
a predicted class ci from C
2. Some simple applications
- spam detection
- authorship attribution
- age/gender identification
- sentiment analysis
- assigning subject categories, topics or genres
......
Traditional methods
1. Naive Bayes
Two assumptions:
- Bag-of-words assumption: position doesn't matter
- Conditional independence: the feature probabilities P(xi | c) are assumed independent given the class c
to compute these probabilities:
add-one smoothing to prevent zero probabilities (you can add a constant other than one as well):
to deal with unknown (unseen) words:
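The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the formulas from the original notes: it assumes a multinomial Naive Bayes model, estimates probabilities from raw counts, applies add-alpha smoothing (alpha = 1 gives add-one), and reserves one extra vocabulary slot for unknown words. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, alpha=1.0):
    """docs: list of (tokens, label) pairs. Returns the fitted model as a dict."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "priors": {c: n / len(docs) for c, n in class_counts.items()},
        "word_counts": word_counts,
        "totals": {c: sum(word_counts[c].values()) for c in class_counts},
        "vocab": vocab,
        "alpha": alpha,
    }


def log_likelihood(model, word, label):
    """log P(word | label) with add-alpha smoothing."""
    a = model["alpha"]
    count = model["word_counts"][label][word]  # 0 for unseen words
    # Assumption: "+ 1" reserves one vocabulary slot for unknown words.
    denom = model["totals"][label] + a * (len(model["vocab"]) + 1)
    return math.log((count + a) / denom)


def predict(model, tokens):
    """Return the class with the highest log posterior score."""
    scores = {
        c: math.log(p) + sum(log_likelihood(model, w, c) for w in tokens)
        for c, p in model["priors"].items()
    }
    return max(scores, key=scores.get)


# Toy corpus (hypothetical): three documents of class "c", one of class "j".
docs = [
    (["Chinese", "Beijing", "Chinese"], "c"),
    (["Chinese", "Chinese", "Shanghai"], "c"),
    (["Chinese", "Macao"], "c"),
    (["Tokyo", "Japan", "Chinese"], "j"),
]
model = train_naive_bayes(docs)
print(predict(model, ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.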
main features:
- very fast, low storage requirements
- robust to irrelevant features
- good in domains with many equally important features
- optimal if the independence assumptions hold
- lacks accuracy in general
2. SVM
cost function of SVM:
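The formula itself is missing from these notes. As a hedged sketch, the code below assumes the formulation used in the Coursera machine learning course listed in the references: labels in {0, 1}, hinge-style penalties cost1 and cost0, a weight C on the data term, and an L2 regularizer that skips the bias. The function name `svm_cost` and the data layout are hypothetical.

```python
def svm_cost(theta, X, y, C=1.0):
    """SVM objective: C * sum of hinge penalties + 0.5 * ||theta[1:]||^2.

    theta[0] is the bias term, X holds feature rows without the bias
    feature, and y holds labels in {0, 1}.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = theta[0] + sum(t * v for t, v in zip(theta[1:], x_i))
        cost1 = max(0.0, 1.0 - z)  # penalty when the true label is 1
        cost0 = max(0.0, 1.0 + z)  # penalty when the true label is 0
        total += y_i * cost1 + (1 - y_i) * cost0
    regularizer = 0.5 * sum(t * t for t in theta[1:])  # bias is not regularized
    return C * total + regularizer
```

With a very large C the data term dominates, so the optimizer is pushed toward parameters that classify every training example with a margin of at least 1.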
SVM decision boundary:
when C is very large:
about kernel:
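The notes do not say which kernel is meant. As one common choice, and purely as an assumption, here is a sketch of the Gaussian (RBF) kernel, which measures the similarity between an example and a landmark point. The names `gaussian_kernel`, `landmark` and `sigma` are illustrative.

```python
import math


def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity in (0, 1]: 1 when x equals the landmark, near 0 when far away."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

A larger `sigma` makes the similarity fall off more slowly with distance, which gives a smoother decision boundary.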
So far, SVMs appear to apply only to two-class classification. Multi-class problems are usually handled by combining several binary SVMs, for example in a one-vs-rest scheme.
Comparison with logistic regression:
To apply an SVM or logistic regression to text classification, all you need to do is obtain labeled data and find a suitable way to represent the texts as vectors (one-hot representations, word2vec, doc2vec, ...).
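As an illustration of the simplest representation, the sketch below turns tokenized texts into count vectors, which are the sum of the one-hot vectors of their words. It assumes whitespace-style pre-tokenized input and drops words that are not in the vocabulary; the helper names are hypothetical.

```python
def build_vocab(tokenized_docs):
    """Map each distinct token to a column index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_count_vector(tokens, vocab):
    """Sum of one-hot vectors: entry i counts occurrences of vocabulary word i."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # out-of-vocabulary tokens are ignored
            vec[idx] += 1
    return vec


train_docs = [["good", "movie"], ["bad", "movie", "bad"]]
vocab = build_vocab(train_docs)
features = [to_count_vector(doc, vocab) for doc in train_docs]
```

The resulting vectors can be fed directly to any linear classifier, including an SVM or logistic regression.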
Neural network methods
1. CNN
(1) The paper "Convolutional Neural Networks for Sentence Classification", which appeared at EMNLP 2014
(2) The paper "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"
The model uses multiple filters to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.
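The forward pass described above can be sketched in plain Python. This is a simplified illustration rather than the paper's implementation: it assumes one feature per filter obtained by max-over-time pooling, a ReLU nonlinearity, and sentences at least as long as the widest filter (real implementations pad). All function and parameter names are hypothetical.

```python
import math


def conv_max_pool(sentence, filt, bias=0.0):
    """Slide one filter over the sentence and keep its strongest response.

    sentence: list of word vectors; filt: list of h weight vectors, so the
    filter covers a window of h consecutive words.
    """
    h = len(filt)
    feature_map = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        z = bias + sum(
            w * x
            for f_row, x_row in zip(filt, window)
            for w, x in zip(f_row, x_row)
        )
        feature_map.append(max(0.0, z))  # ReLU (an assumption of this sketch)
    return max(feature_map)  # max-over-time pooling


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_forward(sentence, filters, W, b):
    """Return a probability distribution over labels for one sentence."""
    features = [conv_max_pool(sentence, f) for f in filters]  # penultimate layer
    logits = [
        bias + sum(w * x for w, x in zip(row, features))
        for row, bias in zip(W, b)
    ]
    return softmax(logits)
```

Because each filter contributes a single pooled value, the size of the penultimate layer equals the number of filters regardless of sentence length.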
For regularization we employ dropout on the penultimate layer with a constraint on l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out (setting to zero) a proportion of the hidden units during training.
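Both regularizers are easy to sketch. The code below is a minimal illustration under stated assumptions: dropout zeroes each unit independently with probability `p` at training time (test-time rescaling is omitted), and the norm constraint rescales a weight vector back to norm `s` whenever it exceeds `s`. The default values and function names are illustrative.

```python
import math
import random


def dropout(features, p=0.5, rng=random):
    """Zero each unit with probability p (training time only)."""
    return [0.0 if rng.random() < p else x for x in features]


def clip_l2_norm(weights, s=3.0):
    """Rescale the weight vector to have l2-norm s if its norm exceeds s."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > s:
        return [w * s / norm for w in weights]
    return list(weights)
```

In training, `clip_l2_norm` would typically be applied to each weight vector after every gradient step.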
Pre-trained Word Vectors
We use the publicly available word2vec vectors that were trained on 100 billion words from Google News.
Results
There is a simplified implementation using TensorFlow on GitHub: https://github.com/dennybritz/cnn-text-classification-tf
2. RNN
The paper "Hierarchical Attention Networks for Document Classification", which appeared at NAACL 2016
In this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture.
- It is observed that different words and sentences in a document are differentially informative.
- Moreover, the importance of words and sentences is highly context dependent, i.e. the same word or sentence may be differentially important in different contexts.
Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision, which can be of value in applications and analysis.
Hierarchical Attention Network
If you want to learn more about attention mechanisms: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
The model uses a GRU-based sequence encoder.
1. Word Encoder:
2. Word Attention:
3. Sentence Encoder:
4. Sentence Attention:
5. Document Classification:
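Because the document vector v is a high-level representation of document d, it can be used as features for document classification: p = softmax(W_c v + b_c)
The training loss is the negative log-likelihood of the correct labels, L = -Σ_d log p_dj, where j is the label of document d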
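The attention step shared by the word and sentence levels, and the loss for a single document, can be sketched as follows. This is an illustration under assumptions, not the paper's code: the encoder outputs are taken as given (the GRU is not implemented), each hidden state is projected through a tanh layer, scored against a learned context vector, normalized with a softmax, and used to form a weighted sum. All names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, W, b, context):
    """Return (attention weights, weighted sum of the hidden states).

    hidden_states: encoder outputs h_t; W, b: projection layer;
    context: the learned context vector that scores each projected state.
    """
    scores = []
    for h in hidden_states:
        u = [
            math.tanh(bias + sum(w * x for w, x in zip(row, h)))
            for row, bias in zip(W, b)
        ]
        scores.append(sum(a * c for a, c in zip(u, context)))
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(a * h[k] for a, h in zip(alphas, hidden_states))
        for k in range(dim)
    ]
    return alphas, pooled


def negative_log_likelihood(probs, label):
    """Loss contribution of one document whose true class index is label."""
    return -math.log(probs[label])
```

Applying `attention_pool` to word-level states yields a sentence vector, and applying it again to sentence-level states yields the document vector; the attention weights show which words and sentences the classifier relied on.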
Results
There is a simplified implementation written in Python on GitHub: https://github.com/richliao/textClassifier
References
https://www.cs.cmu.edu/%7Ediyiy/docs/naacl16.pdf
https://www.coursera.org/learn/machine-learning/home/
https://www.youtube.com/playlist?list=PL6397E4B26D00A269