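Extracting features from big-V Weibo accounts (a simple crawler plus data analysis)
The idea for this post came while studying the chapter on finding independent features in *Programming Collective Intelligence*. What would come out if the different news sources were replaced by the posts of different big-V Weibo accounts (verified accounts with large followings)?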
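## 1. Content acquisition
### 1.1 Simulating a Weibo login
Each big V's original posts take the place of a news source. The crawl uses the WAP version of Weibo: compared with weibo.com, weibo.cn is simpler, and logging in is less complicated. The first step is to simulate a login; the code (Python 2.7) is as follows: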
```python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup as bs

weiboUrl = 'http://weibo.cn/pub/'
# The login link is the <a> inside <div class="ut"> on the landing page.
loginUrl = bs(requests.get(weiboUrl).content).find("div", {"class": "ut"}).find("a")['href']
origInfo = bs(requests.get(loginUrl).content)
loginForm = origInfo.find("form")
loginInfo = loginForm['action']
loginpostUrl = 'http://login.weibo.cn/login/' + loginInfo

headers = {
    'Host': 'login.weibo.cn',
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; BOIE9;ZHCN)',
    'Referer': 'http://login.weibo.cn/login/?ns=1&revalid=2&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%D0%C2%C0%CB%CE%A2%B2%A9&vt=',
}
# The hidden fields and the name of the password field are read from the form itself.
postData = {
    'mobile': 'my Weibo account',
    loginForm.find("input", {"type": "password"})['name']: 'my Weibo password',
    'remember': 'on',
    'backURL': loginForm.find("input", {"name": "backURL"})['value'],
    'backTitle': loginForm.find("input", {"name": "backTitle"})['value'],
    'tryCount': loginForm.find("input", {"name": "tryCount"})['value'],
    'vk': loginForm.find("input", {"name": "vk"})['value'],
    'submit': loginForm.find("input", {"name": "submit"})['value'],
}

s = requests.Session()
req = s.post(loginpostUrl, data=postData, headers=headers)
```
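That completes the simulated login. From here on, any URL can be crawled with calls of the form `s.get(url, cookies=req.cookies)`.
Next, a tuple and a dictionary specify the Weibo IDs of the big-V accounts to crawl: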
```python
name_tuple = (u'谷大', u'magasa', u'猴姆', u'中国国家天文', u'耳帝', u'freshboy',
              u'松鼠可学会', u'人人影视', u'博物杂志', u'noisey音乐', u'网易公开课',
              u'LifeTime', u'DK在北京', u'电影扒客', u'lonelyplanet')
name_dic = {u'谷大': u'ichthy', u'magasa': u'magasafilm', u'猴姆': u'houson100037',
            u'中国国家天文': u'64807699', u'耳帝': u'eargod', u'freshboy': u'freshboy',
            u'松鼠可学会': u'songshuhui', u'人人影视': u'yyets', u'博物杂志': u'bowu',
            u'noisey音乐': u'noisey', u'网易公开课': u'163open', u'LifeTime': u'usinvester',
            u'DK在北京': u'dkinbj', u'电影扒客': u'2315579285',
            u'lonelyplanet': u'lonelyplanet'}
```
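The tuple is there to fix the order of the accounts. Iterating over the dictionary's keys directly gives an arbitrary order in Python 2.7, which would make it much harder to tell later which account each row of features belongs to.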
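As an alternative to keeping a separate tuple, the display names and IDs could live in a single `collections.OrderedDict`, which remembers insertion order. This is only a sketch using three of the accounts above; the variable names `accounts`, `ordered_names` and `ordered_ids` are illustrative and do not appear in the original script.

```python
# -*- coding: utf-8 -*-
from collections import OrderedDict

# Display name -> Weibo ID, in the order the matrix rows should follow.
accounts = OrderedDict([
    (u'谷大', u'ichthy'),
    (u'magasa', u'magasafilm'),
    (u'耳帝', u'eargod'),
])

ordered_names = list(accounts.keys())    # same order on every run
ordered_ids = list(accounts.values())
```

On Python 3.7 and later a plain `dict` also preserves insertion order, so the extra tuple is mainly a Python 2 concern.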
Now loop over these accounts and fetch their posts:
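```python
import jieba  # used for word segmentation in section 1.2

clean_words = set()  # placeholder: words to exclude from the counts, fill in as needed

def get_weibos():
    s = requests.Session()
    req = s.post(loginpostUrl, data=postData, headers=headers)
    all_words = {}      # word -> number of accounts whose posts contain it
    store_weibos = {}   # account -> {word: count}
    for person in name_tuple:
        person_id = name_dic[person]
        store_weibos.setdefault(person, {})
        for index in range(1, 7):  # pages 1 to 6 of each account (range(1, 7) stops before 7)
            index_added = str(index)
            # filter=1 in the URL restricts the listing to original posts
            person_url = 'http://weibo.cn/' + person_id + '?filter=1&page=' + index_added + '&vt=4'
            req2 = s.get(person_url, cookies=req.cookies)
            soup1 = bs(req2.text)
            for op in soup1.find_all('div'):
                # Inspecting the page shows that post bodies are the <div> tags
                # that carry an id and whose class is 'c'.
                if u'class' in op.attrs and u'id' in op.attrs and u'c' in op.attrs[u'class']:
                    op_weibo = op.span.text
```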
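The page URLs requested in the loop can also be generated up front, which makes them easy to inspect before any request is sent. The helper below is a sketch and not part of the original script: `page_urls` is a hypothetical name, and the query parameters are simply the ones used above.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2.7


def page_urls(person_id, pages=6):
    """Return the weibo.cn URLs for pages 1..pages of one account's original posts."""
    urls = []
    for page in range(1, pages + 1):
        # A list of pairs keeps the parameters in a fixed order.
        query = urlencode([('filter', 1), ('page', page), ('vt', 4)])
        urls.append('http://weibo.cn/%s?%s' % (person_id, query))
    return urls


first_two = page_urls('ichthy', pages=2)
```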
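### 1.2 Chinese word segmentation
The `op_weibo` obtained above is the text of a single post. Unlike English, Chinese cannot be split into words on whitespace, so jieba (https://github.com/fxsjy/jieba) is used for segmentation.
Continuing the code from the previous section: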
```python
                    # (still inside the innermost `if` of get_weibos)
                    op_weibo = op.span.text
                    # jieba.cut yields the segmented words; drop one-character tokens
                    # (mostly punctuation) and anything listed in clean_words.
                    for word in jieba.cut(op_weibo, cut_all=False):
                        if len(word) > 1 and word not in clean_words:
                            # store_weibos maps each account to a dict of word -> count
                            store_weibos[person].setdefault(word, 0)
                            store_weibos[person][word] += 1
        # Once per account: count how many accounts use each word.
        for word_1 in store_weibos[person].keys():
            all_words.setdefault(word_1, 0)
            all_words[word_1] += 1
        print('get %s already' % person)
    # allwords keeps the words used by more than 3 accounts but by fewer than 90% of them
    allwords = [w for w, n in all_words.items() if n > 3 and n < len(name_dic) * 0.9]
    # l1 has one row per account, as long as allwords, holding each word's count
    # for that account: this is the [person-word] matrix
    l1 = [[store_weibos[person].get(word, 0) for word in allwords] for person in name_tuple]
    return all_words, store_weibos, allwords, l1
```
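To make the filtering and the matrix construction concrete, here is a small self-contained sketch on made-up counts. The accounts, the words and the thresholds (`min_people`, `max_fraction`) are illustrative assumptions, not data from the crawl, and `build_matrix` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical per-account word counts, shaped like store_weibos.
toy_counts = {
    'A': {'film': 3, 'music': 1, 'today': 2},
    'B': {'film': 1, 'today': 4},
    'C': {'stars': 5, 'today': 1},
    'D': {'film': 2, 'stars': 1, 'today': 1},
}
toy_order = ('A', 'B', 'C', 'D')  # fixed row order, like name_tuple


def build_matrix(counts, order, min_people=1, max_fraction=0.9):
    # Document frequency: in how many accounts does each word appear?
    doc_freq = Counter()
    for person in order:
        doc_freq.update(counts[person].keys())
    # Keep words used by more than min_people accounts and by fewer than
    # max_fraction of all accounts (sorted only to make the columns reproducible).
    vocab = sorted(w for w, n in doc_freq.items()
                   if n > min_people and n < len(order) * max_fraction)
    rows = [[counts[person].get(w, 0) for w in vocab] for person in order]
    return vocab, rows


vocab, rows = build_matrix(toy_counts, toy_order)
```

In this toy data "today" is dropped because every account uses it and "music" because only one does, leaving "film" and "stars" as the columns.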
## 2. Feature extraction: matrix factorization
The code below comes mainly from *Programming Collective Intelligence* (http://www.amazon.cn/集体智慧编程-西格兰/dp/B001NPDVP2).
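```python
from numpy import array, matrix, shape, transpose

def difcost(a, b):
    # Cost function for the factorization: sum of squared element-wise differences
    dif = 0
    for i in range(shape(a)[0]):
        for j in range(shape(a)[1]):
            dif += pow(a[i, j] - b[i, j], 2)
    return dif
```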
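The double loop is easy to follow but slow for large matrices. Assuming NumPy is available (the code in this section already relies on it), the same sum of squared differences can be computed in one vectorized expression. `difcost_vectorized` is a name introduced for this sketch only.

```python
import numpy as np


def difcost_vectorized(a, b):
    """Sum of squared element-wise differences between two equally shaped matrices."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sum(diff ** 2))


a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[1.0, 0.0], [1.0, 4.0]])
cost = difcost_vectorized(a, b)  # (2 - 0)^2 + (3 - 1)^2 = 8.0
```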
Next, factorize the matrix: the [person-word] matrix is split into the product [person-feature] × [feature-word].
```python
import random

def factorize(v, pc=10, iter=50):
    ic = shape(v)[0]  # v is ic*fc
    fc = shape(v)[1]
    w = matrix([[random.random() for j in range(pc)] for i in range(ic)])  # ic*pc weight matrix
    h = matrix([[random.random() for j in range(fc)] for i in range(pc)])  # pc*fc feature matrix
    # Find w and h such that v is approximately w*h (matrix factorization)
    for i in range(iter):
        wh = w * h
        cost = difcost(wh, v)
        # Print the cost every 10 iterations
        if i % 10 == 0:
            print(cost)
        # Stop early if the factorization is exact
        if cost == 0:
            break
        # Update the feature matrix
        hn = (transpose(w) * v)
        hd = (transpose(w) * w * h)
        h = matrix(array(h) * array(hn) / array(hd))
        # Update the weight matrix
        wn = (v * transpose(h))
        wd = (w * h * transpose(h))
        w = matrix(array(w) * array(wn) / array(wd))
    return w, h
```
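Written out, the loop applies what are commonly called the multiplicative update rules for non-negative matrix factorization under a squared-error cost. The symbols here are chosen for this note: $V$ is the [person-word] matrix (`v`), $W$ the weight matrix (`w`) and $H$ the feature matrix (`h`). The fractions are taken element by element, matching the `array(...)` conversions in the code, and the update of $W$ uses the already updated $H$:

```latex
\begin{aligned}
\mathrm{cost}(V, WH) &= \sum_{i,j} \bigl( V_{ij} - (WH)_{ij} \bigr)^2 \\
H_{kj} &\leftarrow H_{kj} \, \frac{(W^{\top} V)_{kj}}{(W^{\top} W H)_{kj}} \\
W_{ik} &\leftarrow W_{ik} \, \frac{(V H^{\top})_{ik}}{(W H H^{\top})_{ik}}
\end{aligned}
```

Because every factor on the right-hand side is non-negative when $V$ and the random starting matrices are, $W$ and $H$ stay non-negative throughout, which is what lets the rows of $H$ be read as word-group features.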
Display the results by feature:
```python
def showfeatures(w, h, titles, wordvec, out='features.txt'):
    outfile = open(out, 'w')
    pc, wc = shape(h)  # h is the feature matrix
    toppatterns = [[] for x in range(len(titles))]
    patternnames = []
    # pc is the number of features
    for i in range(pc):
        slist = []
        # wc is the number of words
        for j in range(wc):
            slist.append((h[i, j], wordvec[j]))
        # Sort by weight h[i, j], largest first, to get the most strongly associated words
        slist.sort()
        slist.reverse()
        n = [s[1] for s in slist[0:6]]
        outfile.write(str(n) + '\n')
        patternnames.append(n)
        # w[j, i] is the weight of feature i for article (here: account) j
        flist = []
        for j in range(len(titles)):
            flist.append((w[j, i], titles[j]))
            toppatterns[j].append((w[j, i], i, titles[j]))
        flist.sort()
        flist.reverse()
        # Write the three accounts with the largest weight for this feature
        for f in flist[0:3]:
            outfile.write(str(f) + '\n')
        outfile.write('\n')
    outfile.close()
    return toppatterns, patternnames
```
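The ranking step inside `showfeatures` boils down to sorting (weight, word) pairs. A minimal sketch on a made-up feature row follows; `top_words` is a hypothetical helper and not part of the original module.

```python
def top_words(feature_row, wordvec, k=6):
    """Return the k words with the largest weights in one row of the feature matrix."""
    ranked = sorted(zip(feature_row, wordvec), reverse=True)
    return [word for weight, word in ranked[:k]]


weights_row = [0.1, 0.9, 0.0, 0.5]
words = ['film', 'music', 'stars', 'travel']
best = top_words(weights_row, words, k=2)  # ['music', 'travel']
```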
Display the results by article (here, by account):
```python
def showarticles(titles, toppatterns, patternnames, out='articles.txt'):
    outfile = open(out, 'w')
    for j in range(len(titles)):
        outfile.write(titles[j].encode('utf8') + '\n')
        # Sort this account's (weight, feature index, title) tuples, largest weight first
        toppatterns[j].sort()
        toppatterns[j].reverse()
        # Top 3 features by w[article, feature]
        for i in range(3):
            a = ''
            for word in patternnames[toppatterns[j][i][1]]:
                a = a + ' ' + word.encode('utf8')
            # Write w[article, feature] followed by the words of that feature
            outfile.write(str(toppatterns[j][i][0]) + ' ' + a + '\n')
        outfile.write('\n')
    outfile.close()
```
## 3. Results
```python
import weibo_feature  # the module that contains the code above
a, b, c, d = weibo_feature.get_weibos()
```
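Here `d` is the person-by-word list we need. After it is converted to the matrix `m_d`, the `factorize` function splits it into the weight matrix `weights` and the feature matrix `feat`. `c` is the vocabulary, that is, the words from all accounts that passed the filter and are counted. With these pieces the results can be displayed.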
```python
from numpy import matrix

m_d = matrix(d)
weights, feat = weibo_feature.factorize(m_d)
topp, pn = weibo_feature.showfeatures(weights, feat, weibo_feature.name_tuple, c)
weibo_feature.showarticles(weibo_feature.name_tuple, topp, pn)
```
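Before running on crawled data, the factorization can be sanity-checked on a tiny hand-made matrix. The sketch below re-implements the same multiplicative updates with plain NumPy arrays instead of calling the module, so that it runs on its own. The function name `nmf`, the small constant `eps` that guards against division by zero, the fixed seed and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def nmf(v, pc=2, iterations=200, seed=0, eps=1e-9):
    """Factorize a non-negative v (ic x fc) into w (ic x pc) and h (pc x fc)."""
    rng = np.random.RandomState(seed)
    ic, fc = v.shape
    w = rng.rand(ic, pc)
    h = rng.rand(pc, fc)
    costs = []
    for _ in range(iterations):
        # Squared-error cost of the current approximation.
        costs.append(float(np.sum((v - w.dot(h)) ** 2)))
        # Multiplicative updates; eps avoids dividing by zero.
        h = h * w.T.dot(v) / (w.T.dot(w).dot(h) + eps)
        w = w * v.dot(h.T) / (w.dot(h).dot(h.T) + eps)
    return w, h, costs


# Two groups of "accounts" that use two disjoint groups of "words".
toy = np.array([[4.0, 3.0, 0.0, 0.0],
                [2.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 5.0, 2.0],
                [0.0, 0.0, 3.0, 1.0]])
w, h, costs = nmf(toy)
```

Comparing `costs[0]` with `costs[-1]` shows whether the updates are reducing the error, and the rows of `h` can be inspected to see which columns each feature has picked up.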
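The output files show the following: `showfeatures` lists which accounts each feature is associated with, while `showarticles` is more intuitive and lists the features most relevant to each account.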
![1.jpg](https://img.haomeiwen.com/i743445/979ccc28ae037792.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
![2.jpg](http://upload-images.jianshu.io/upload_images/743445-a7ee693a4ed2c2ab.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
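The two figures above show, respectively, the weights of accounts for each word-group feature and the weights of word-group features for each account.
**Which big-V accounts a feature corresponds to is clear at a glance, and so are the three most relevant features for each account. Judging only from the output and everyday experience, the factorization and its weights look fairly reasonable.**
*Known issue: it would be better to add part-of-speech tagging so that conjunctions and similar words can be excluded.*
*Only a small amount of data was crawled and it was not stored in a database; this is just a demo.*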
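One possible way to address the part-of-speech issue: jieba also provides a `jieba.posseg` module whose `cut` yields words together with a part-of-speech flag. The sketch below does not call jieba. It only shows the filtering step on hand-written (word, flag) pairs, and it assumes that `c` is the flag used for conjunctions; both the flag and the helper name `content_words` should be treated as assumptions to check against the jieba documentation.

```python
# -*- coding: utf-8 -*-
# Flags treated as words to skip (an assumption; extend as needed).
SKIP_FLAGS = ('c',)


def content_words(tagged, skip_flags=SKIP_FLAGS):
    """Keep words longer than one character whose flag is not in skip_flags."""
    return [word for word, flag in tagged
            if len(word) > 1 and flag not in skip_flags]


# Hand-written stand-in for the (word, flag) pairs a tagger would produce.
tagged = [(u'电影', 'n'), (u'但是', 'c'), (u'音乐', 'n'), (u'或者', 'c'), (u'推荐', 'v')]
kept = content_words(tagged)
```

The filtered list could then be counted in place of the raw `jieba.cut` output in `get_weibos`.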