
Chinese Sentiment Analysis with WordNet

2017-03-23  xieyan0811

1. Analysis

Chinese sentiment analysis can be done with Cilin (《同义词词林》), which has a top-level category (class G) for psychological activity, but it is far coarser than WordNet. This article therefore uses an nltk + wordnet approach, assembled from the following pieces:

1) Chinese word segmentation: jieba

2) Chinese-to-English mapping: the Chinese Open Wordnet, downloadable from:

http://compling.hss.ntu.edu.sg/cow/

3) Sentiment scoring: WordNet's SentiWordNet component (a quick sanity check of these pieces appears after this list)

4) Stop words: the list from the page below, plus common punctuation marks:

http://blog.csdn.net/u010533386/article/details/51458591
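Before wiring these together, it helps to sanity-check the pieces. Below is a minimal sketch, assuming the NLTK wordnet and sentiwordnet corpora have already been downloaded via nltk.download(); the sample sentence and the synset name good.a.01 are just illustrations, not part of the original script:

    # -*- coding: utf-8 -*-
    from __future__ import print_function
    import jieba
    from nltk.corpus import sentiwordnet as swn

    # 1) segmentation: jieba splits a Chinese sentence into words
    print("/".join(jieba.cut(u"今天天气很好")))

    # 3) sentiment: SentiWordNet scores one synset of "good"
    good = swn.senti_synset('good.a.01')
    print(good.pos_score(), good.neg_score(), good.obj_score())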

2. Code

# encoding=utf-8
import sys
import codecs
import jieba
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

reload(sys)
sys.setdefaultencoding('utf8')

# Segment a Chinese text file with jieba, dropping stop words and whitespace.
def doSeg(filename):
    f = open(filename, 'r+')
    file_list = f.read()
    f.close()
    seg_list = jieba.cut(file_list)
    stopwords = []
    for word in open("./stop_words.txt", "r"):
        stopwords.append(word.strip())
    ll = []
    for seg in seg_list:
        if (seg.encode("utf-8") not in stopwords and seg != ' '
                and seg != '' and seg != "\n" and seg != "\n\n"):
            ll.append(seg)
    return ll

# Load the Chinese Open Wordnet as a set of (synset-id, lemma) pairs.
def loadWordNet():
    f = codecs.open("./cow-not-full.txt", "rb", "utf-8")
    known = set()
    for l in f:
        if l.startswith('#') or not l.strip():
            continue
        row = l.strip().split("\t")
        if len(row) == 3:
            (synset, lemma, status) = row
        elif len(row) == 2:
            (synset, lemma) = row
            status = 'Y'
        else:
            print "illformed line: ", l.strip()
            continue
        if status in ['Y', 'O']:
            if not (synset.strip(), lemma.strip()) in known:
                known.add((synset.strip(), lemma.strip()))
    return known

# Collect every synset id whose lemma matches the given Chinese word.
def findWordNet(known, key):
    ll = []
    for kk in known:
        if kk[1] == key:
            ll.append(kk[0])
    return ll

# Turn an id such as "05563770-n" (offset + POS letter) into an NLTK Synset.
def id2ss(ID):
    return wn._synset_from_pos_and_offset(str(ID[-1:]), int(ID[:8]))

# Fetch the SentiWordNet record for a synset.
def getSenti(word):
    return swn.senti_synset(word.name())

if __name__ == '__main__':
    known = loadWordNet()
    words = doSeg(sys.argv[1])
    n = 0
    p = 0
    for word in words:
        ll = findWordNet(known, word)
        if len(ll) != 0:
            # average the scores over all senses of the word
            n1 = 0.0
            p1 = 0.0
            for wid in ll:
                desc = id2ss(wid)
                swninfo = getSenti(desc)
                p1 = p1 + swninfo.pos_score()
                n1 = n1 + swninfo.neg_score()
            if p1 != 0.0 or n1 != 0.0:
                print word, '-> n ', (n1 / len(ll)), ", p ", (p1 / len(ll))
            p = p + p1 / len(ll)
            n = n + n1 / len(ll)
    print "n", n, ", p", p

3. Open Issues

1) jieba's segments do not map one-to-one onto Chinese WordNet entries

Although jieba can load a custom dictionary, some of the words it produces have no corresponding sense in WordNet, for example "太后" (empress dowager) and "童子" (young boy), as well as compounds such as "很早以前" (long ago) and "黄山" (Huangshan). Most are nouns, and handling them properly would require further "learning".

The temporary workaround is to treat them as proper nouns, which contribute no sentiment; the helper sketched below lists them for review.
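A small helper along these lines (reportUnknown is an illustrative name, not part of the original script; it assumes doSeg() and loadWordNet() from the listing above) collects the segmented words that have no Chinese Open Wordnet entry, so they can be reviewed or added to a jieba user dictionary:

    # Words jieba produced that have no (synset, lemma) entry in COW.
    def reportUnknown(known, words):
        lemmas = set(kk[1] for kk in known)
        return [w for w in words if w not in lemmas]

    # e.g.: unknown = reportUnknown(loadWordNet(), doSeg(sys.argv[1]))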

2) Polysemy and synonymy: one word, many senses; one sense, many words

Whether for sentiment analysis or semantic analysis, in Chinese or in English, the mapping between words and senses has to be resolved.

The temporary workaround is to look up all senses of a word and average their sentiment scores. jieba can also tag each word's part of speech, which gives a further hint, as the sketch after this paragraph shows.
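Here is one way that hint could be used. POS_MAP and sentiForWord are illustrative names, not part of the original script; id2ss() and getSenti() come from the listing above. Since COW synset ids end in the WordNet POS letter (e.g. "05563770-n"), jieba's tag can filter the candidate senses before averaging:

    # -*- coding: utf-8 -*-
    import jieba.posseg as pseg

    # map jieba tag prefixes to WordNet POS letters (n/v/a/r)
    POS_MAP = {'n': 'n', 'v': 'v', 'a': 'a', 'd': 'r'}

    def sentiForWord(known, word, flag):
        ids = [kk[0] for kk in known if kk[1] == word]
        wnpos = POS_MAP.get(flag[:1])
        if wnpos is not None:
            filtered = [i for i in ids if i.endswith(wnpos)]
            if filtered:
                ids = filtered      # keep only senses matching the tag
        if not ids:
            return None             # no sense found: treat as neutral
        p = sum(getSenti(id2ss(i)).pos_score() for i in ids) / len(ids)
        n = sum(getSenti(id2ss(i)).neg_score() for i in ids) / len(ids)
        return (p, n)

    # pseg.cut() yields pairs with .word and .flag, e.g. (u"感人", "a")
    for pair in pseg.cut(u"这部电影很感人"):
        print pair.word, pair.flag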

3) Semantics

Semantics is the most fundamental problem. It requires analyzing sentence structure on one hand and the content itself on the other; long articles in particular often "suppress first, praise later" (先抑后扬) or set up contrasts, which makes the overall sentiment hard to judge.

4. References:

1) Learning lexical scales: WordNet and SentiWordNet

http://compprag.christopherpotts.net/wordnet.html

2)SentiWordNet Interface

http://www.nltk.org/howto/sentiwordnet.html
