Python Summary: Web Scraping in Practice
2020-05-21
1ace156a39cd
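For the underlying principles, see the previous post.
Tools
XPath Helper, a Chrome extension (get it from the Chrome Web Store)
Scraping the news on the ifeng (Phoenix) homepage
Using the extension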
To extract every item instead of a single one, just adjust the XPath expression.
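As a rough illustration of what "adjusting the XPath" can mean: a path copied for a single element usually carries a position index such as `li[1]`, and dropping that index matches every sibling item. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, on a made-up snippet. The markup and titles are assumptions, not the real ifeng page.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for a news list (not the real ifeng page).
SNIPPET = """
<div id="root">
  <ul>
    <li><a title="First headline" href="/c/1">First</a></li>
    <li><a title="Second headline" href="/c/2">Second</a></li>
    <li><a title="Third headline" href="/c/3">Third</a></li>
  </ul>
</div>
"""

root = ET.fromstring(SNIPPET)

# With a position index, only the first <li> is matched.
single = [a.get("title") for a in root.findall("./ul/li[1]/a")]

# Without the index, every <li> under the <ul> is matched.
all_titles = [a.get("title") for a in root.findall("./ul/li/a")]

print(single)
print(all_titles)
```

With lxml the same idea applies, and appending `/@title` to the path returns the attribute values directly.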
How do we use this in Python?
The code is as follows:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import json

import requests
from lxml import etree
from lxml.html import tostring  # serializes an element node to a string


def getNews():
    url = 'https://news.ifeng.com/'
    response = requests.get(url=url)
    html = response.content.decode('utf-8')
    news_tree = etree.HTML(html)
    # xpath() returns a list; if 20 items match, its len() is 20
    base = '//*[@id="root"]/div[6]/div[1]/div[5]/ul/li'
    titles = news_tree.xpath(base + '/a/@title')
    hrefs = news_tree.xpath(base + '/a/@href')
    imgs = news_tree.xpath(base + '/a/img/@src')
    times = news_tree.xpath(base + '/div/div/time')
    tags = news_tree.xpath(base + '/div/div/span')
    # Walk the results in step, put each item's fields in a dict,
    # collect the dicts in a list and return it as JSON
    array = []
    for title, link, img, time_el, tag_el in zip(titles, hrefs, imgs, times, tags):
        dic = {'title': title, 'href': link, 'img': img,
               'time': time_el.text, 'tag': tag_el.text}
        array.append(dic)
    return json.dumps(array, ensure_ascii=False)


if __name__ == "__main__":
    jsonstring = getNews()
    print(jsonstring)
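One caveat with collecting titles, links, images, times and tags as separate lists and pairing them by position: if a single item has no image, the lists go out of step. A possible alternative, sketched here under assumptions, is to select each `<li>` first and then query inside it. The example uses the standard library's `xml.etree.ElementTree` on invented markup, and the helper name `parse_items` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Invented markup: the second item deliberately has no <img>.
SNIPPET = """
<ul>
  <li><a title="A" href="//example.com/a"><img src="//example.com/a.png" /></a>
      <div><div><time>10:00</time><span>Source A</span></div></div></li>
  <li><a title="B" href="//example.com/b"></a>
      <div><div><time>10:05</time><span>Source B</span></div></div></li>
</ul>
"""


def parse_items(ul):
    """Build one dict per <li>, tolerating missing child elements."""
    items = []
    for li in ul.findall("./li"):
        link = li.find("./a")
        img = li.find("./a/img")
        time_el = li.find("./div/div/time")
        tag_el = li.find("./div/div/span")
        items.append({
            "title": link.get("title") if link is not None else None,
            "href": link.get("href") if link is not None else None,
            "img": img.get("src") if img is not None else None,
            "time": time_el.text if time_el is not None else None,
            "tag": tag_el.text if tag_el is not None else None,
        })
    return items


items = parse_items(ET.fromstring(SNIPPET))
print(json.dumps(items, ensure_ascii=False, indent=2))
```

The same per-item pattern works with lxml by calling `xpath()` on each `li` element with a relative path.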
The printed output looks like this:
[{
"title": "绿地回应被举报高管贪腐问题:调查中 不会姑息",
"href": "//news.ifeng.com/c/7weTelvvWbY",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/CDB7AA8A2B55483B843DAF99CE559E11_w698_h392.png",
"time": "今天 12:05",
"tag": "中国新闻网"
}, {
"title": "美国抗议者在白宫外放装尸袋办“葬礼” 问责政府抗疫不力",
"href": "//news.ifeng.com/c/7weTH6IwesH",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/0370BFA6C72EAB721A55DB02731CED811930349E_w698_h392.png",
"time": "今天 12:05",
"tag": "环球网"
}, {
"title": "张文宏:各地有偶发病例是大概率事件,应长期保持适当社交距离",
"href": "//news.ifeng.com/c/7weRdCzXkJc",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/1F2D720F73E54AF8956B39DB212606C6_w690_h387.jpg",
"time": "今天 11:37",
"tag": "张文宏医生"
}, {
"title": "又美又有才,难道她就是特朗普的“完美”发言人?",
"href": "//news.ifeng.com/c/7weRLg43Viq",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/2CB1E314289C482395DE1CD313E0CCD2_w698_h392.jpg",
"time": "今天 11:33",
"tag": "冰汝看美国"
}, {
"title": "美国传染病专家福奇两周未接受采访,美媒怀疑其被禁声",
"href": "//news.ifeng.com/c/7weOmUniq6O",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/9DE3F1B1A36F4832BA4DF6D12267D80C_w698_h392.jpg",
"time": "今天 11:15",
"tag": "澎湃新闻"
}, {
"title": "酒驾致广东援鄂医生王烁殉职案开庭 被告曾以涉嫌交通肇事罪被批捕",
"href": "//news.ifeng.com/c/7wePraJu7kG",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/3968E83E17854629AC6BDEE647F8C3B4_w698_h392.png",
"time": "今天 11:10",
"tag": "南方都市报"
}, {
"title": "全国政协会议将为抗疫牺牲烈士和逝世同胞默哀一分钟",
"href": "//news.ifeng.com/c/7wePxtxWRZA",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/4F5C7FED0DA045EE96DDD311B4542436_w533_h299.jpg",
"time": "今天 11:09",
"tag": "工人日报"
}, {
"title": "全国人大代表姚劲波:降低公积金缴存比例,减轻企业经营负担",
"href": "//news.ifeng.com/c/7wePkyXwfho",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/740C78C4AE2878A548CAFB829EA511B7B5405646_w698_h392.jpg",
"time": "今天 11:08",
"tag": "澎湃新闻网"
}, {
"title": "人民日报:把“黑暴”赶出香港,得从根上拔除“毒瘤”",
"href": "//news.ifeng.com/c/7wePQZK5wUS",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/07991C78F2EA42DB85E525BE4E847C6F_w600_h336.jpg",
"time": "今天 11:05",
"tag": "人民日报"
}, {
"title": "华为美国高管:美国断供我们能挺过去,不过大量美国人会失业",
"href": "//news.ifeng.com/c/7wePHsPV6UC",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/01D0AD2B338D469286850CC5CD8F19AE_w569_h319.jpg",
"time": "今天 11:04",
"tag": "环球网"
}, {
"title": "人大代表建议:取消生育三孩以上的处罚政策 国家给予育儿补贴",
"href": "//news.ifeng.com/c/7weNTNgpLOi",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/630A01F7A7A78464A6D06536A6A6873858EFD058_w698_h392.jpg",
"time": "今天 11:00",
"tag": "新京报"
}, {
"title": "疯狂的头盔:我10天赚了800万",
"href": "//news.ifeng.com/c/7weOSjP6hN2",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/9CCED3FB9DFF4554B5C9F9FC4599B608_w512_h287.jpg",
"time": "今天 10:53",
"tag": "纵相新闻"
}, {
"title": "美国加州联邦参议员提议案 谴责“中国病毒”等词汇指称新冠",
"href": "//news.ifeng.com/c/7weNkE8jFA0",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/FF274567E4C0492496E65C1ECF119A87_w698_h392.jpg",
"time": "今天 10:44",
"tag": "中国日报网"
}, {
"title": "王学坤委员:建议建立农民退休制度 让65岁以上农民“洗脚上田,老有所养”",
"href": "//news.ifeng.com/c/7weNbtSo7v6",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/75DF69024398483AAA24F657C0EC764F_w602_h338.png",
"time": "今天 10:39",
"tag": "最高人民检察院"
}, {
"title": "特朗普叫嚣“中国有个疯子”,评论区翻车",
"href": "//news.ifeng.com/c/7weN4fqF7BI",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/4969C19002F340678BCC48B9266B1D2C_w698_h392.jpg",
"time": "今天 10:32",
"tag": "观察者网"
}, {
"title": "军报头版评论:“蓬佩奥们”边喊抓贼边做贼,下场注定可悲",
"href": "//news.ifeng.com/c/7weMvlBWShM",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/636D31AEB733412F96581882FCEFC64E_w698_h392.png",
"time": "今天 10:31",
"tag": "解放军报"
}, {
"title": "特殊时期的中国两会 外媒都在关注这些",
"href": "//news.ifeng.com/c/7weMghINhz6",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/D4270B52C0C14B72909202DBABECE1B6_w698_h392.jpg",
"time": "今天 10:28",
"tag": "央视新闻客户端"
}, {
"title": "雷军建议:进一步降低民营企业进入卫星互联网门槛",
"href": "//news.ifeng.com/c/7weL0NllooO",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/AB6E6070DCCBE7465220469B2578CD910EA67390_w698_h392.jpg",
"time": "今天 10:20",
"tag": "澎湃新闻"
}, {
"title": "北京15座王府14座被占,政协委员:应设腾退协调机构",
"href": "//news.ifeng.com/c/7weKz9vBGm5",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/ucms/2020_21/429BCCAD3FC016669563C909F36859F71B506DE0_w698_h392.jpg",
"time": "今天 10:20",
"tag": "新京报"
}, {
"title": "荷兰政府:水貂可能将新冠病毒传给人 清查所有养殖场",
"href": "//news.ifeng.com/c/7weKeI1Yr6D",
"img": "//d.ifengimg.com/w144_h80_q70/x0.ifengimg.com/thmaterial/2020_21/09FBF81BFE594527AAF2C36D2ED4EEDF_w519_h291.jpg",
"time": "今天 10:16",
"tag": "观察者网"
}]
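The `href` and `img` values in this output start with `//` (protocol-relative URLs), and requests expects a URL with a scheme, so they need normalizing before being fetched. A minimal sketch using the standard library's `urljoin`; the helper name `absolutize` is hypothetical, and the base URL is the homepage used above.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.ifeng.com/"


def absolutize(url, base=BASE_URL):
    """Resolve a relative or protocol-relative URL against the base page."""
    return urljoin(base, url)


print(absolutize("//news.ifeng.com/c/7weTelvvWbY"))
print(absolutize("/c/7weTelvvWbY"))
print(absolutize("https://news.ifeng.com/c/7weTelvvWbY"))
```

All three calls resolve to the same absolute `https://` address, so the helper is safe to apply whether or not a link already has a scheme.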
What if you also need the article details?
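Option 1: return the details directly in the list. That is, inside getNews(), first collect the links in hrefs, then iterate over them and fetch each href again with lxml. This is not friendly when the result is returned directly to a client: the JSON payload becomes too large, and the wait is too long.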
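To make the trade-off in option 1 concrete, here is a rough sketch. The names `attach_details` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real per-article request so that the example runs offline.

```python
import json


def attach_details(items, fetch_detail):
    """Return copies of the list items with the article body embedded."""
    enriched = []
    for item in items:
        detail = fetch_detail(item["href"])  # one extra request per item
        enriched.append({**item, "content": detail})
    return enriched


def fake_fetch(url):
    # Stand-in for a real HTTP request; returns a long dummy HTML string.
    return "<p>" + "x" * 2000 + "</p>"


items = [{"title": f"News {i}", "href": f"//example.com/{i}"} for i in range(20)]
enriched = attach_details(items, fake_fetch)

list_size = len(json.dumps(items))
full_size = len(json.dumps(enriched))
print(list_size, full_size)
```

With 20 items this means 20 extra requests before the list can be returned, and the payload grows by the size of every article body.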
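Option 2: write a separate scraping function that takes the URL of the corresponding page.
The code for fetching the detail data is as follows: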
def getNewsContent(url):
    response = requests.get(url=url)
    html = response.content.decode('utf-8')
    news_content_tree = etree.HTML(html)
    # This XPath is expected to match exactly one detail element, so take the first result
    content = news_content_tree.xpath(
        '//*[@id="root"]/div/div[3]/div[1]/div[1]/div[3]')[0]
    # encoding='unicode' makes tostring return a str; by default it returns bytes,
    # and str(bytes) would wrap the result in b'...'
    content_html_text = tostring(content, encoding='unicode')
    return content_html_text
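A short illustration of why converting serialized bytes with `str()` and then trimming the `b'...'` wrapper is fragile: `str()` on a bytes object yields its repr, in which non-ASCII bytes appear as escape sequences. Decoding is the safer route. The sample string is made up.

```python
raw = "<p>凤凰新闻</p>".encode("utf-8")  # serializers often return bytes like this

as_repr = str(raw)             # the repr of the bytes object, wrapped in b'...'
trimmed = as_repr[2:-1]        # wrapper removed, but escapes such as \xe5 remain
decoded = raw.decode("utf-8")  # the actual text

print(as_repr[:2])
print(trimmed == decoded)
print(decoded)
```

lxml's `tostring` also accepts `encoding='unicode'`, which returns a str directly and avoids the conversion step.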
The printed data looks like this:
<div class="main_content-LcrEruCc"><div><div class="text-3zQ3cZD4"><p>近日,21岁的冼嘉豪因暴动罪被香港法院判刑4年,他在求情信中说:“没有一天不后悔”。2019年6月至2020年4月15日,8001人被捕,1365人被起诉,566人被控暴动罪。个体的悲剧还在持续上演,数字的揪心让人持久难平,一场“修例风波”造就的暴力旋涡,已让多少香港年轻人命运脱轨、前途毁弃。</p><p>曾经拥有的东西因为参与非法暴力活动而丧失,一直拥有的生活因为暴力破坏而止步,狮子山下的纷乱伤害了多少逐梦路上的人。回望香港“修例风波”,正是因为反中乱港分子鼓吹暴力、煽惑暴力,被洗脑的年轻人迷信暴力、使用暴力,香港才结出了孩子有家难回、有梦难圆,市民有工难开、无工可开的苦果,让繁荣稳定的香港陷入危机困境。</p><p><img src="https://x0.ifengimg.com/ucms/2020_21/A1688E829DE205EEBC309384E3783FE8BA15437D_w1080_h1920.jpg"></p><p>这是香港市民想要的吗?最基本的安全被剥夺,出行怕有人又去砸地铁,营业怕黑衣人又来打砸,饭桌上有不同政见也不敢轻易发表,校园里竟成了“兵工厂”;人被贴上标签,店被贴上标签,被起底、被排斥、被攻击,在所谓“私了”和“装修”之下,黑色恐怖的利刃戳进市民的心,让人普遍变得焦虑、恐惧。因为暴徒,个人这小家被黑暗包裹,因为暴力,香港这个大家已满目疮痍,怎能不让人心痛、不让人愤慨,不让人期盼香港重归祥和安定!</p><p>在“修例风波”中,人们已经看尽暴力的危害、暴徒的凶残。特区政府警务处处长邓炳强此前表示,香港正面临本土恐怖主义的威胁,威胁到香港市民的人身安全,也在对国家安全造成冲击。反暴力,是因为暴力已渗透进香港市民的日常生活,危险近在咫尺;是因为暴力还有延续、扩散和升级的可能,要摧毁家园;是因为暴力不止,暴徒将更加猖狂,反中乱港分子将更加嚣张,香港要葬送掉一代代人辛苦建立的基业,辉煌篇章被恐怖主义湮灭。</p><p>通过香港警方严正执法,香港暴徒的气焰已被压制;由于香港市民拥护止暴制乱,香港暴力的土壤正被逐步铲除。但发生在香港的暴力并未绝迹,蠢蠢欲动的暴徒还在伺机而动。5月份前后,人们又看到了暴徒投掷的燃烧弹,看到了暴徒寄出的恐吓邮件。香港市民需要强化共识,一起向暴力说不;香港警方需要再接再厉,不给暴徒任何喘息之机。更需从根本上想办法,根治“黑暴”这个毒瘤。只有让暴徒、暴力成过街老鼠、众矢之的,纵暴、施暴的人付出沉重的代价,香港才有岁月静好,市民才能安心生活。</p></div><span></span><div class="end-37GBinZ_"></div></div></div>