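This article uses two simple examples to show how to make predictions with NLTK. The first predicts a person's gender from a given name. The second determines how much of a review is positive and how much is negative.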

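The data comes straight from the names corpus that ships with the NLTK resources. To look at the underlying files, open the nltk_data --> corpora --> names folder. The code that loads the data is: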

from nltk.corpus import names
# Load the names and label each one with its gender
# (note: this rebinds `names` from the corpus reader to a plain list)
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])


Entries in the resulting list look like this:

[(u'Aaron', 'male'), (u'Abbey', 'male'), (u'Abbie', 'male')]
[(u'Zorana', 'female'), (u'Zorina', 'female'), (u'Zorine', 'female')]

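Here we take the last letter of a name as its only feature. This choice may be inaccurate or even unreasonable, but that does not matter, because the goal is to see how features are extracted in NLTK.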

featuresets = [(gender_features(n), g) for (n,g) in names]

The gender_features(n) function is defined as follows:

def gender_features(word): 
    return {'last_letter': word[-1]}
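
To make the feature format concrete, the short sketch below applies the same last-letter extractor to a few names and prints the resulting (feature dict, label) pairs. The list sample_names is made up for illustration and is not read from the NLTK corpus.

```python
def gender_features(word):
    # Same extractor as in the article: the only feature is the final letter.
    return {'last_letter': word[-1]}

# Hypothetical labeled names, standing in for the corpus data.
sample_names = [('Aaron', 'male'), ('Zorina', 'female'), ('Abbie', 'male')]

# Each training example is a (feature dict, label) pair.
sample_featuresets = [(gender_features(n), g) for (n, g) in sample_names]

for features, label in sample_featuresets:
    print(features, label)
```

A classifier trained on such pairs never sees the name itself, only the feature dictionary, so two names ending in the same letter are indistinguishable to it.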


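import nltk
# Train
train_set = featuresets
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Predict (the name is an arbitrary example input)
print(classifier.classify(gender_features('Frank')))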
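
As a rough idea of what a Naive Bayes classifier computes when there is only one feature, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function names, the toy data, and the add-one smoothing are all assumptions made for this illustration, and NLTK uses its own probability estimator.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (last_letter, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    letter_counts = defaultdict(Counter)
    for letter, label in examples:
        letter_counts[label][letter] += 1
    letters = {letter for letter, _ in examples}
    return label_counts, letter_counts, letters

def classify(model, letter):
    label_counts, letter_counts, letters = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        prior = count / total  # P(label)
        # P(letter | label) with add-one smoothing (an assumption of this sketch);
        # the extra +1 in the denominator reserves mass for unseen letters.
        likelihood = (letter_counts[label][letter] + 1) / (count + len(letters) + 1)
        scores[label] = prior * likelihood
    # Pick the label with the highest prior * likelihood.
    return max(scores, key=scores.get)

# Hypothetical toy data: (last letter, gender) pairs.
toy = [('n', 'male'), ('k', 'male'), ('d', 'male'),
       ('a', 'female'), ('a', 'female'), ('e', 'female')]
model = train_naive_bayes(toy)
print(classify(model, 'a'))
print(classify(model, 'k'))
```

With several features, the per-feature likelihoods would be multiplied together under the assumption that the features are independent given the label, which is where the "naive" in the name comes from.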

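For training we simply used the Naive Bayes classifier that NLTK provides. NLTK offers many other models as well; Naive Bayes is used here only for demonstration. The complete code is shown below: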

import nltk
from nltk.corpus import names

def gender_features(word):
    # Use the last letter of the name as its only feature
    return {'last_letter': word[-1]}

# Load the names and label each one with its gender
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# Train
train_set = [(gender_features(n), g) for (n, g) in labeled_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Predict (the name is an arbitrary example input)
print(classifier.classify(gender_features('Frank')))
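
To check how well such a classifier does, one option is to hold some names back and score the predictions on them. The sketch below does this on a tiny hand-written dataset so that it runs without downloading the corpus; the names in train_names and test_names are assumptions chosen for illustration, and with the real corpus the split would normally be made after shuffling the full list.

```python
import nltk

def gender_features(word):
    return {'last_letter': word[-1]}

# Hypothetical toy data; the real names corpus is far larger.
train_names = [('John', 'male'), ('Mark', 'male'), ('Peter', 'male'),
               ('Frank', 'male'), ('David', 'male'),
               ('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
               ('Laura', 'female'), ('Emma', 'female')]
test_names = [('Derek', 'male'), ('Sophia', 'female')]

train_set = [(gender_features(n), g) for (n, g) in train_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Fraction of held-out names that receive the correct label.
print(nltk.classify.accuracy(classifier, test_set))

# Estimated probability of each label for a single name.
dist = classifier.prob_classify(gender_features('Sophia'))
for label in dist.samples():
    print(label, round(dist.prob(label), 3))
```

Scoring on names the classifier was not trained on gives a more honest picture than scoring on the training set itself.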


For the second example, we start from three small word lists: positive, negative, and neutral.

positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']


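def word_feats(words):
    # Map every word in the list to True, giving one feature per word
    return dict([(word, True) for word in words])

# Wrap each vocabulary word in a list; a bare string would be split into characters
positive_features = [(word_feats([pos]), 'pos') for pos in positive_vocab]
negative_features = [(word_feats([neg]), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats([neu]), 'neu') for neu in neutral_vocab]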
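
One detail of word_feats is easy to miss: it iterates over its argument, so it behaves differently for a list of words and for a single string. The small sketch below shows both cases; it only restates the function and needs no NLTK.

```python
def word_feats(words):
    return dict([(word, True) for word in words])

# A list of words gives one feature per word.
print(word_feats(['awesome']))

# A bare string is iterated character by character,
# so the features end up being individual letters.
print(word_feats('awesome'))
```

If whole words are meant to be the features, each word should be passed inside a list (or the sentence should be passed as a list of tokens).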


train_set = negative_features + positive_features + neutral_features


from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_set)


from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Map every word in the list to True, giving one feature per word
    return dict([(word, True) for word in words])

positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

# Wrap each vocabulary word in a list; a bare string would be split into characters
positive_features = [(word_feats([pos]), 'pos') for pos in positive_vocab]
negative_features = [(word_feats([neg]), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats([neu]), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set)

# Predict: classify each word, then report the share of each class
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats([word]))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))
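
The counting step can also be wrapped in a small helper that works with any word-level classifier. The sketch below is one possible way to do that, not part of NLTK: the names sentiment_ratios, toy_lexicon, and toy_classify are hypothetical, and the lookup table stands in for a trained classifier so that the example runs on its own. It also trims common punctuation from each token, so that "movie," is treated as "movie".

```python
def sentiment_ratios(sentence, classify):
    """Return the share of words labeled 'pos' and 'neg' by `classify`."""
    # Trim a few common punctuation marks; emoticons such as ':)' are left intact.
    words = [w.strip(',.!?;') for w in sentence.lower().split()]
    words = [w for w in words if w]
    if not words:
        return {'pos': 0.0, 'neg': 0.0}
    labels = [classify(w) for w in words]
    return {'pos': labels.count('pos') / len(words),
            'neg': labels.count('neg') / len(words)}

# Hypothetical stand-in for a trained classifier: a plain lookup table.
toy_lexicon = {'awesome': 'pos', 'good': 'pos', 'bad': 'neg'}

def toy_classify(word):
    return toy_lexicon.get(word, 'neu')

print(sentiment_ratios("Awesome movie, I liked it", toy_classify))
```

With a trained NLTK classifier, the second argument could be something like lambda w: classifier.classify(word_feats([w])), assuming classifier and word_feats are defined as in the article.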


