Extracting Features from Big V Weibo Posts (a Simple Crawler Plus Data Analysis)

2015-09-09  TheMarcMa

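The idea for this post came while studying the "Finding Independent Features" chapter of Programming Collective Intelligence (《集体智慧编程》): what would the result look like if the different news sources were replaced with the posts of different Weibo Big Vs (verified, high-follower accounts)?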

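1. Fetching the content

1.1 Simulating a Weibo login

The original posts of each Big V stand in for the news sources. The crawl runs against the WAP version of Weibo: compared with weibo.com, weibo.cn has simpler pages and a less complicated login. The first step is to simulate a login, with the following code (Python 2.7):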

```python
import requests
from bs4 import BeautifulSoup as bs  # assumed import: the original listing never defines the alias `bs`

weiboUrl = 'http://weibo.cn/pub/'
# Find the login link on the landing page, then load the login form
loginUrl = bs(requests.get(weiboUrl).content).find("div", {"class": "ut"}).find("a")['href']
origInfo = bs(requests.get(loginUrl).content)
loginForm = origInfo.find("form")
loginInfo = loginForm['action']
loginpostUrl = 'http://login.weibo.cn/login/' + loginInfo
headers = {
    'Host': 'login.weibo.cn',
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; BOIE9;ZHCN)',
    'Referer': 'http://login.weibo.cn/login/?ns=1&revalid=2&backURL=http%3A%2F%2Fweibo.cn%2F&backTitle=%D0%C2%C0%CB%CE%A2%B2%A9&vt=',
}
# Read the password field name and the hidden field values from the form itself
postData = {
    'mobile': 'my Weibo account',
    loginForm.find("input", {"type": "password"})['name']: 'my Weibo password',
    'remember': 'on',
    'backURL': loginForm.find("input", {"name": "backURL"})['value'],
    'backTitle': loginForm.find("input", {"name": "backTitle"})['value'],
    'tryCount': loginForm.find("input", {"name": "tryCount"})['value'],
    'vk': loginForm.find("input", {"name": "vk"})['value'],
    'submit': loginForm.find("input", {"name": "submit"})['value'],
}

s = requests.Session()
req = s.post(loginpostUrl, data=postData, headers=headers)
```
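That completes the simulated login. From here on, any URL can be crawled with calls of the form s.get(url,cookies=req.cookies).
Next, a tuple and a dict define the Weibo IDs of the Big Vs to crawl: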

```python
name_tuple = (u'谷大', u'magasa', u'猴姆', u'中国国家天文', u'耳帝',
              u'freshboy', u'松鼠可学会', u'人人影视', u'博物杂志', u'noisey音乐',
              u'网易公开课', u'LifeTime', u'DK在北京', u'电影扒客', u'lonelyplanet')
name_dic = {
    u'谷大': u'ichthy', u'magasa': u'magasafilm', u'猴姆': u'houson100037',
    u'中国国家天文': u'64807699', u'耳帝': u'eargod', u'freshboy': u'freshboy',
    u'松鼠可学会': u'songshuhui', u'人人影视': u'yyets', u'博物杂志': u'bowu',
    u'noisey音乐': u'noisey', u'网易公开课': u'163open', u'LifeTime': u'usinvester',
    u'DK在北京': u'dkinbj', u'电影扒客': u'2315579285', u'lonelyplanet': u'lonelyplanet',
}
```
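The tuple exists to fix the order of the Big Vs. Iterating over the dict's keys directly yields an arbitrary order, which would make it much harder to trace each feature back to its source later on.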
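As a minimal sketch of that point: `names`, `ids` and `counts` below are shortened, hypothetical stand-ins for `name_tuple`, `name_dic` and the crawled data, the numbers are made up, and the snippet is written for Python 3 rather than the Python 2.7 used in the listings.

```python
# Hypothetical, shortened stand-ins for name_tuple and name_dic.
names = ('magasa', 'freshboy', 'lonelyplanet')
ids = {'lonelyplanet': 'lonelyplanet', 'magasa': 'magasafilm', 'freshboy': 'freshboy'}

# Made-up per-person word counts, keyed by display name.
counts = {'freshboy': 5, 'lonelyplanet': 2, 'magasa': 9}

# Walking the tuple gives the same row order on every run,
# so row i of any later matrix always belongs to names[i].
rows = [(name, ids[name], counts[name]) for name in names]

for index, row in enumerate(rows):
    print(index, row)
```

Since Python 3.7 a dict keeps insertion order, but under Python 2.7 the key order is arbitrary, so an explicit sequence is the safer way to pin down row order.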

Next, iterate over these accounts:

```python
import jieba

clean_words = set()  # words to filter out; empty placeholder, fill in as needed

def get_weibos():
    s = requests.Session()
    req = s.post(loginpostUrl, data=postData, headers=headers)
    all_words = {}
    store_weibos = {}
    for person in name_tuple:
        person_id = name_dic[person]
        store_weibos.setdefault(person, {})
        for index in range(1, 7):  # fetch pages 1 through 6 for each person
            index_added = str(index)
            # filter=1 in the URL restricts the listing to original posts
            person_url = 'http://weibo.cn/' + person_id + '?filter=1&page=' + index_added + '&vt=4'
            req2 = s.get(person_url, cookies=req.cookies)
            soup1 = bs(req2.text)
            for op in soup1.find_all('div'):
                # From inspecting the page: a div with an id and class 'c' holds a post
                if u'class' in op.attrs and u'id' in op.attrs and u'c' in op.attrs[u'class']:
                    op_weibo = op.span.text
```
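The page URLs in the loop follow a fixed pattern, which can be sketched on its own. The helper name `person_page_url` is hypothetical, the query parameters are copied from the URL in the listing, and the snippet targets Python 3 (`urllib.parse`), not the article's Python 2.7.

```python
from urllib.parse import urlencode

def person_page_url(person_id, page):
    """Build the weibo.cn URL for one page of a user's original posts."""
    # filter=1 restricts the listing to original posts; vt=4 is copied from the article's URL.
    query = urlencode([('filter', 1), ('page', page), ('vt', 4)])
    return 'http://weibo.cn/' + person_id + '?' + query

# range(1, 7) yields pages 1 through 6.
urls = [person_page_url('bowu', page) for page in range(1, 7)]
print(urls[0])
```

Passing a list of pairs to `urlencode` keeps the parameters in the same order as the hand-built string.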

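1.2 Chinese word segmentation

The op_weibo obtained above is the text of a single post. Unlike English, Chinese cannot be split into words on whitespace, so jieba (结巴分词, https://github.com/fxsjy/jieba) is used here.
Continuing the code from the previous section: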

```python
                    op_weibo = op.span.text
                    # jieba.cut returns the segmented words; drop one-character tokens
                    # (mostly symbols and very short words) and anything in clean_words
                    for word in jieba.cut(op_weibo, cut_all=False):
                        if len(word) > 1 and word not in clean_words:
                            # store_weibos maps each person to a dict of word -> count
                            store_weibos[person].setdefault(word, 0)
                            store_weibos[person][word] += 1
        # all_words counts, for each word, how many people used it
        for word_1 in store_weibos[person].keys():
            all_words.setdefault(word_1, 0)
            all_words[word_1] += 1
        print('get %s already' % person)
    # allwords keeps the words used by more than 3 people and by fewer than 90% of the people
    allwords = [w for w, n in all_words.items() if n > 3 and n < len(name_dic.keys()) * 0.9]
    # l1 has one row per person, as long as allwords, holding that person's count
    # for each word: the [person-words] matrix
    l1 = [[store_weibos[person].get(word, 0) for word in allwords] for person in name_tuple]
    return all_words, store_weibos, allwords, l1
```
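The counting and filtering steps can be tried without crawling or jieba. The following is a sketch on made-up, already tokenized posts. The function name `build_matrix` and its parameters are hypothetical, the thresholds are lowered to suit four people, the vocabulary is sorted only to make the output reproducible (the article does not sort it), and the code is written for Python 3.

```python
# Toy, pre-tokenized posts (made-up data); in the article the tokens come from jieba.cut.
posts = {
    'A': [['film', 'review', 'film'], ['director']],
    'B': [['film', 'star', 'planet']],
    'C': [['planet', 'star', 'film']],
    'D': [['travel', 'planet', 'film']],
}
people = ('A', 'B', 'C', 'D')  # fixed row order

def build_matrix(posts, people, min_people, max_fraction):
    # Per-person word counts.
    per_person = {}
    for person in people:
        counts = per_person.setdefault(person, {})
        for tokens in posts[person]:
            for word in tokens:
                counts[word] = counts.get(word, 0) + 1
    # For each word, the number of people who used it at least once.
    people_using = {}
    for person in people:
        for word in per_person[person]:
            people_using[word] = people_using.get(word, 0) + 1
    # Keep words that are neither too rare nor too common.
    vocab = sorted(w for w, n in people_using.items()
                   if n > min_people and n < len(people) * max_fraction)
    matrix = [[per_person[p].get(w, 0) for w in vocab] for p in people]
    return vocab, matrix

vocab, matrix = build_matrix(posts, people, min_people=1, max_fraction=0.9)
print(vocab)
print(matrix)
```

Here "film" is dropped because all four people use it, and "review", "director" and "travel" are dropped because only one person uses each, which leaves "planet" and "star" as the columns.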

2. Feature extraction: matrix factorization

The code below comes mainly from Programming Collective Intelligence (《集体智慧编程》, http://www.amazon.cn/集体智慧编程-西格兰/dp/B001NPDVP2).

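```python
from numpy import array, matrix, shape, transpose

def difcost(a, b):
    """Cost function for the factorization: sum of squared differences of a and b."""
    dif = 0
    for i in range(shape(a)[0]):
        for j in range(shape(a)[1]):
            dif += pow(a[i, j] - b[i, j], 2)
    return dif
```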
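Written out, the cost is the sum of squared element-wise differences between two matrices of the same shape. The symbols $A$, $B$, $m$ and $n$ are notation introduced here for illustration, not the article's:

$$
\mathrm{difcost}(A, B) = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( A_{ij} - B_{ij} \right)^2
$$

This is the squared Frobenius norm of $A - B$, and it is zero only when the two matrices are identical.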

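Factorize the matrix, decomposing the [person-word] matrix into [person-feature] * [feature-word]:

```python
import random

def factorize(v, pc=10, iter=50):
    ic = shape(v)[0]  # v is an ic x fc matrix
    fc = shape(v)[1]
    # Initialize both factors with random values
    w = matrix([[random.random() for j in range(pc)] for i in range(ic)])  # ic x pc weight matrix
    h = matrix([[random.random() for j in range(fc)] for i in range(pc)])  # pc x fc feature matrix

    # Find w and h such that w*h approximates v (matrix factorization)
    for i in range(iter):
        wh = w * h
        cost = difcost(wh, v)
        # Print the cost every 10 iterations
        if i % 10 == 0:
            print(cost)
        # Stop once the factorization is exact
        if cost == 0:
            break
        # Update the feature matrix
        hn = (transpose(w) * v)
        hd = (transpose(w) * w * h)
        h = matrix(array(h) * array(hn) / array(hd))
        # Update the weight matrix
        wn = (v * transpose(h))
        wd = (w * h * transpose(h))
        w = matrix(array(w) * array(wn) / array(wd))

    return w, h
```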
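To watch the same multiplicative updates at work without numpy, here is a dependency-free sketch on a tiny made-up matrix. It is not the article's code: the helper names (`matmul`, `scale`, `nmf`), the fixed seed, and the small `eps` added to each denominator to avoid division by zero are assumptions of this sketch, and it is written for Python 3.

```python
import random

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def cost(a, b):
    # Sum of squared element-wise differences, as in difcost.
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def scale(m, num, den, eps=1e-9):
    # Element-wise m * num / den; eps guards against division by zero.
    return [[m[i][j] * num[i][j] / (den[i][j] + eps)
             for j in range(len(m[0]))] for i in range(len(m))]

def nmf(v, pc, steps, seed=0):
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(pc)] for _ in range(rows)]
    h = [[rng.random() for _ in range(cols)] for _ in range(pc)]
    history = [cost(matmul(w, h), v)]
    for _ in range(steps):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        h = scale(h, matmul(wt, v), matmul(matmul(wt, w), h))
        # w <- w * (v h^T) / (w h h^T), element-wise, using the updated h
        ht = transpose(h)
        w = scale(w, matmul(v, ht), matmul(matmul(w, h), ht))
        history.append(cost(matmul(w, h), v))
    return w, h, history

v = [[2, 0, 1], [4, 0, 2], [0, 3, 0], [0, 6, 0]]  # made-up counts
w, h, history = nmf(v, pc=2, steps=50)
print(history[0], history[-1])
```

Because every update only multiplies by non-negative ratios, `w` and `h` stay non-negative throughout, which is what lets each row of `h` be read as a weighted group of words.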

```python
# Display by feature
def showfeatures(w, h, titles, wordvec, out='features.txt'):
    outfile = open(out, 'w')
    pc, wc = shape(h)  # h is the feature matrix: pc features by wc words
    toppatterns = [[] for x in range(len(titles))]
    patternnames = []

    for i in range(pc):
        # Pair every word with its weight in feature i
        slist = []
        for j in range(wc):
            slist.append((h[i, j], wordvec[j]))
        # Sort by weight h[i, j], largest first, to get the most strongly associated words
        slist.sort()
        slist.reverse()
        n = [s[1] for s in slist[0:6]]
        outfile.write(str(n) + '\n')
        patternnames.append(n)

        # w[j, i] is the weight of feature i for person (article) j
        flist = []
        for j in range(len(titles)):
            flist.append((w[j, i], titles[j]))
            toppatterns[j].append((w[j, i], i, titles[j]))

        # Write the three people most strongly associated with this feature
        flist.sort()
        flist.reverse()
        for f in flist[0:3]:
            outfile.write(str(f) + '\n')
        outfile.write('\n')

    outfile.close()
    return toppatterns, patternnames
```
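The core of the per-feature ranking is "pair each word with its weight, sort descending, keep the first few". A compact sketch of just that step, with a hypothetical helper `top_words` and made-up weights, in Python 3:

```python
def top_words(h_row, wordvec, k):
    """Return the k words with the largest weights in one row of the feature matrix."""
    ranked = sorted(zip(h_row, wordvec), reverse=True)
    return [word for weight, word in ranked[:k]]

# Made-up weights for one feature over a five-word vocabulary.
wordvec = ['film', 'star', 'planet', 'travel', 'music']
h_row = [0.1, 2.5, 1.7, 0.0, 0.9]

print(top_words(h_row, wordvec, 3))
```

`sorted(..., reverse=True)` plays the role of the `sort()` followed by `reverse()` in the listing.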

Display by article (here, by person):

```python
def showarticles(titles, toppatterns, patternnames, out='articles.txt'):
    outfile = open(out, 'w')

    for j in range(len(titles)):
        outfile.write(titles[j].encode('utf8') + '\n')
        # Sort this person's (weight, feature index, title) tuples, largest weight first
        toppatterns[j].sort()
        toppatterns[j].reverse()
        # Write the top 3 features: the weight w[article, feature], then the feature's words
        for i in range(3):
            words = patternnames[toppatterns[j][i][1]]
            a = ' '.join(word.encode('utf8') for word in words)
            outfile.write(str(toppatterns[j][i][0]) + ' ' + a + '\n')
        outfile.write('\n')

    outfile.close()
```
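The per-person view only needs the largest few weights in one row of the weight matrix, so a full sort is optional. Below is a sketch using `heapq.nlargest` from the standard library. The helper `top_features` and the data are made up for illustration, and the code is Python 3.

```python
import heapq

def top_features(w_row, patternnames, k):
    """Return (weight, words) for the k features weighted most heavily in one person's row."""
    best = heapq.nlargest(k, enumerate(w_row), key=lambda pair: pair[1])
    return [(weight, patternnames[index]) for index, weight in best]

# Made-up data: three features, each named by its top words.
patternnames = [['film', 'director'], ['planet', 'star'], ['travel', 'hotel']]
w_row = [0.2, 3.1, 1.4]

for weight, words in top_features(w_row, patternnames, 2):
    print(weight, ' '.join(words))
```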

3. Results

```python
import weibo_feature  # the module name
a, b, c, d = weibo_feature.get_weibos()
```
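Here d is the person-by-word list we need. After converting it to the matrix m_d, the factorize function decomposes it into the weight matrix weights and the feature matrix feat. c is the full vocabulary of words that take part in the counts across all people. With this data in hand, the results can be displayed.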

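```python
from numpy import matrix

m_d = matrix(d)
weights, feat = weibo_feature.factorize(m_d)
# Assumes name_tuple is defined in the same weibo_feature module
topp, pn = weibo_feature.showfeatures(weights, feat, weibo_feature.name_tuple, c)
weibo_feature.showarticles(weibo_feature.name_tuple, topp, pn)
```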

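The output files show the following: showfeatures lists which people each feature is associated with, while showarticles, which is easier to read, lists the features most strongly associated with each person.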



![1.jpg](https://img.haomeiwen.com/i743445/979ccc28ae037792.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

![2.jpg](http://upload-images.jianshu.io/upload_images/743445-a7ee693a4ed2c2ab.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
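The two images above show, respectively, the weights of people for each word-group feature, and the weights of word-group features for each person.
**It is immediately clear which Big Vs each feature corresponds to, and likewise which three features are most relevant to each Big V. Judging only by the results and everyday experience, the decomposition and its weights look fairly reasonable.**
*Remaining issues: it would be better to add part-of-speech checks and filter out words such as conjunctions.*
*Only a small amount of data was crawled and it was not stored in a database; this is just a demo.*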
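One possible shape for the part-of-speech filter is sketched below, under stated assumptions. The input is hand-tagged toy data so the snippet runs without jieba. With jieba installed, similar (word, flag) pairs could come from `jieba.posseg` (for example `[(p.word, p.flag) for p in pseg.cut(text)]` after `import jieba.posseg as pseg`). Treating flags that start with `c`, `p` or `u` as conjunctions, prepositions and auxiliaries is an assumption about that tag set, and the helper `keep_content_words` is hypothetical. The code is Python 3.

```python
# Flag prefixes to drop. Which tags to exclude is an assumption of this sketch:
# 'c' conjunction, 'p' preposition, 'u' auxiliary (including sub-tags such as 'uj').
DROP_PREFIXES = ('c', 'p', 'u')

def keep_content_words(tagged, drop_prefixes=DROP_PREFIXES, min_len=2):
    """Keep words that are long enough and whose part-of-speech flag is not excluded."""
    return [word for word, flag in tagged
            if len(word) >= min_len and not flag.startswith(drop_prefixes)]

# Hand-tagged toy input (word, flag); not real jieba output.
tagged = [(u'电影', 'n'), (u'但是', 'c'), (u'好看', 'a'), (u'因为', 'c'), (u'的', 'uj')]

print(keep_content_words(tagged))
```

The `min_len=2` default mirrors the `len(word)>1` check already applied during segmentation.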