Python Crawler Study, Day 4: Extracting Content with lxml + XPath
2019-05-13
光小月
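Contents
- Python Crawler Study, Day 1
- Python Crawler Study, Day 2: Regular Expressions
- Python Crawler Study, Day 3: BeautifulSoup
- Python Crawler Study, Day 4: Extracting Content with lxml + XPath
- Python Crawler Study, Day 5: selenium
- Python Crawler Study, Day 6: IP Pool
- Python Crawler Study, Day 7: Hands-on Project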
1. A brief introduction to XPath
http://www.w3school.com.cn/xpath/index.asp
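As a quick taste of the syntax covered in the tutorial linked above, here is a minimal sketch of the XPath forms used later in this post. The HTML fragment is made up for illustration only; it is not taken from any real page.

```python
from lxml import etree

# Hypothetical HTML fragment, used only to illustrate the syntax.
html = """
<html><body>
  <div class="auth"><a href="/user/1">alice</a></div>
  <div class="auth"><a href="/user/2">bob</a></div>
  <p id="intro">Hello <b>XPath</b></p>
</body></html>
"""

tree = etree.HTML(html)

# //tag[@attr="value"] selects matching elements anywhere in the document;
# a trailing /text() returns their direct text nodes as strings.
names = tree.xpath('//div[@class="auth"]/a/text()')

# /@attr returns attribute values instead of elements.
links = tree.xpath('//div[@class="auth"]/a/@href')

# string(...) concatenates all descendant text of the first matching node.
intro = tree.xpath('string(//p[@id="intro"])')

print(names)
print(links)
print(intro)
```

Note that /text() only returns the text directly inside an element, while string() also includes text from nested tags such as the <b> above.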
2. Using XPath to extract reply content from the DXY (丁香园) forum:
Example
import requests
from lxml import etree


def run():
    # Send a browser-like User-Agent so the request looks like a normal visit.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
    }
    url = 'http://www.dxy.cn/bbs/thread/626626#626626'
    res = requests.get(url, headers=headers)
    tree = etree.HTML(res.text)

    # Author names, post timestamps, and post bodies, each as a flat list.
    names = tree.xpath('//div[@class="auth"]/a/text()')
    create_times = tree.xpath('//div[@class="post-info"]/span/text()')
    # Drop the two entries after the first one (presumably extra span text
    # in the first post) so the list lines up with names.
    del create_times[1:3]
    contents = tree.xpath('//td[@class="postbody"]/text()')
    for content in contents:
        print(content.strip())

    result = []
    # zip stops at the shortest list, which avoids an IndexError
    # when the three lists differ in length.
    for name, create_time, content in zip(names, create_times, contents):
        post = {'name': name.strip(), 'create_time': create_time.strip(), 'content': content.strip()}
        print(post)
        print('*' * 80)
        result.append(post)
    return result


if __name__ == '__main__':
    run()
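One caveat with collecting three flat lists: a post body that contains a <br> yields several text nodes, so the lists can drift out of alignment. A possible variation, sketched below, is to select each post's container first and then run relative XPath (starting with a dot) inside it. The sample markup and the "post" wrapper class are assumptions made up for this sketch; only the auth, post-info, and postbody class names come from the example above, and the real page structure may differ.

```python
from lxml import etree

# Hypothetical markup: the "post" wrapper class is an assumption for this sketch.
SAMPLE = """
<div class="post">
  <div class="auth"><a>alice</a></div>
  <div class="post-info"><span>2019-05-01 10:00</span><span>#1</span></div>
  <table><tr><td class="postbody">first line<br/>second line</td></tr></table>
</div>
<div class="post">
  <div class="auth"><a>bob</a></div>
  <div class="post-info"><span>2019-05-02 09:30</span><span>#2</span></div>
  <table><tr><td class="postbody">only one line</td></tr></table>
</div>
"""


def parse_posts(html):
    """Extract one dict per post so the fields cannot get out of step."""
    tree = etree.HTML(html)
    posts = []
    for post in tree.xpath('//div[@class="post"]'):
        # A leading dot makes the expression relative to the current post.
        name = post.xpath('string(.//div[@class="auth"]/a)').strip()
        created = post.xpath('string(.//div[@class="post-info"]/span[1])').strip()
        body = post.xpath('.//td[@class="postbody"]')
        # itertext() walks every text node in the cell, including text after <br>.
        content = ' '.join(
            t.strip() for t in body[0].itertext() if t.strip()
        ) if body else ''
        posts.append({'name': name, 'create_time': created, 'content': content})
    return posts


posts = parse_posts(SAMPLE)
for p in posts:
    print(p)
```

With this layout each dict is built from a single container, so a missing or multi-part field affects only that post.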
Result
PS: If you found this useful, decent, passable, or at least not too bad, please consider following me. Thanks in advance!