Learning Python Web Crawling (2): Displaying Wiki Page Data
2017-06-18
语落心生
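Once the URL to request has been decided, the next step is to inspect the HTML structure of the page.
On a Wikipedia page the article content sits in the element with id mw-content-text. Since only the entry links inside its p tags are needed, the lookup path is mw-content-text -> p[0].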
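The lookup path mw-content-text -> p[0] can be tried offline first. The following is a minimal sketch, assuming a heavily simplified page; SAMPLE_HTML and its contents are hypothetical and are not taken from Wikipedia.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for an article page.
SAMPLE_HTML = """
<html><body>
<h1>Example entry</h1>
<div id="mw-content-text">
  <p>First paragraph with a <a href="/wiki/Link_one">link</a>.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: locate the content container by its id.
content = soup.find(id="mw-content-text")

# Step 2: take the first p tag inside it (index 0 of the result list).
first_paragraph = content.find_all("p")[0]
first_text = first_paragraph.get_text()
title = soup.h1.get_text()
```

find_all is the current name of the findAll method used in the rest of this post; both select the same elements.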
The link we want has the following structure.
All a tags for entry links sit under the element with id mp-tfa.
The find hierarchy is mp-tfa -> a -> a href.
Collecting the data
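The crawler in this section follows only links whose href starts with /wiki/. The sketch below shows what that pattern lets through. The hrefs list is made up for illustration, and ENTRY_LINK is a hypothetical stricter variant that is not used in this post; it also rejects hrefs containing a colon, such as file pages.

```python
import re

# Pattern used by the crawler: the href must start with /wiki/.
WIKI_LINK = re.compile("^(/wiki/)")

# Hypothetical stricter variant: additionally rejects hrefs that contain a colon.
ENTRY_LINK = re.compile("^/wiki/[^:]*$")

# Made-up hrefs of the kinds that can appear on a wiki page.
hrefs = [
    "/wiki/Python_(programming_language)",
    "/wiki/File:Example.jpg",
    "/w/index.php?title=Example&action=edit",
    "//en.wikipedia.org/wiki/Example",
    "#cite_note-1",
]

followed = [h for h in hrefs if WIKI_LINK.search(h)]
entries_only = [h for h in hrefs if ENTRY_LINK.search(h)]
```

With the original pattern the crawler also follows the file page; whether that is acceptable depends on what the crawl is meant to collect.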
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # hrefs that have already been visited

def getlinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Lookup path: mp-tfa -> a -> href
        print(bsObj.find(id="mp-tfa").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")
    # Follow every link whose href starts with /wiki/
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A page that has not been seen yet
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getlinks(newPage)

getlinks("")
[Figure: console output]
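The output shows that an exception is thrown right after the a tag is found.
Re-checking the hierarchy of the link gives the revised path mp-tfa -> p -> b -> a href.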
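Why the longer path can select a different link is easiest to see on a small sample. The sketch below assumes the mp-tfa box contains an image link before the bold entry link. That layout is a guess made for illustration, and SAMPLE_HTML is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an image link comes before the bold entry link.
SAMPLE_HTML = """
<div id="mp-tfa">
  <div><a href="/wiki/File:Example.jpg"><img src="example.jpg"></a></div>
  <p><b><a href="/wiki/Example_entry">Example entry</a></b> is an entry.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box = soup.find(id="mp-tfa")

# mp-tfa -> a: find() returns the first a tag anywhere below the box.
first_any = box.find("a").attrs["href"]

# mp-tfa -> p -> b -> a: walks down to the bold link inside the paragraph.
first_bold = box.find("p").find("b").find("a").attrs["href"]
```

On this sample the short path returns the image link, while the longer path returns the entry link.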
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # hrefs that have already been visited

def getlinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Revised lookup path: mp-tfa -> p -> b -> a -> href
        print(bsObj.find(id="mp-tfa", style="padding:2px 5px").find("p").find("b").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")
    # Follow every link whose href starts with /wiki/
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A page that has not been seen yet
                newPage = link.attrs['href']
                print("--------\n" + newPage)
                pages.add(newPage)
                getlinks(newPage)

getlinks("")
[Figure: console output]
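The cause is that the earlier analysis only looked at the Main_Page page. Parsing the pages reached after following a link shows that they have no mp-tfa element.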
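The mechanism behind the error can be reproduced without any network access. BeautifulSoup's find() returns None when nothing matches, so chaining another .find() onto that result raises the AttributeError that the except clause catches. In the sketch below, describe_failure is a hypothetical helper and None stands in for the result of looking up mp-tfa on a page that lacks it.

```python
def describe_failure(lookup_result):
    """Mimic chaining .find("a") onto the value returned by an earlier find()."""
    try:
        return lookup_result.find("a")
    except AttributeError as exc:
        # Same exception type that the crawler's except clause handles.
        return "missing: " + str(exc)

# None stands in for bsObj.find(id="mp-tfa") on a page without that element.
outcome = describe_failure(None)
```

The same reasoning applies to every link in a chained lookup: if any step returns None, the next attribute access fails.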
Change the lookup path to mw-content-text -> p -> a href.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()  # hrefs that have already been visited

def getlinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, 'html.parser')
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Revised lookup path: mw-content-text -> p -> a -> href
        print(bsObj.find(id="mw-content-text").find("p").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes")
    # Follow every link whose href starts with /wiki/
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # A page that has not been seen yet
                newPage = link.attrs['href']
                print("--------\n" + newPage)
                pages.add(newPage)
                getlinks(newPage)

getlinks("")
The entry links are retrieved successfully.
[Figure: console output]
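One limitation of getlinks is that it calls itself for every new link, so the call stack grows with each page visited, and CPython's default recursion limit is 1000 frames. A possible alternative, sketched below, is an iterative crawl driven by a queue. Note that this visits pages breadth-first rather than in the depth-first order of the recursive version. The names crawl, fetch_links, max_pages and FAKE_SITE are all hypothetical, and the in-memory link graph stands in for real HTTP requests.

```python
from collections import deque

def crawl(start, fetch_links, max_pages=50):
    """Breadth-first crawl without recursion; returns pages in visit order."""
    seen = {start}            # every href ever queued, to avoid repeats
    queue = deque([start])    # pages waiting to be visited
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical in-memory link graph standing in for real pages.
FAKE_SITE = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/D"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": [],
}

visited = crawl("/wiki/A", lambda page: FAKE_SITE.get(page, []))
limited = crawl("/wiki/A", lambda page: FAKE_SITE.get(page, []), max_pages=2)
```

In a real crawl, fetch_links would download the page and return its filtered hrefs, and max_pages gives the run a definite stopping point.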