使用Newspaper框架抓取新闻
2019-01-21 本文已影响30人
SeanCheney
Newspaper框架是Python爬虫框架中在GitHub上点赞排名第三的爬虫框架,适合抓取新闻网页。
推荐安装Python3版本:pip3 install newspaper3k
(pip install newspaper是Python2版本)
- 基本使用方法
url = 'https://www.washingtonpost.com/powerpost/trump-to-make-new-offer-to-democrats-as-government-shutdown-drags-on/2019/01/19/2cde029e-1bf3-11e9-9ebf-c5fed1b7a081_story.html?utm_term=.4db5c2055c6d'
# 创建文章对象
article = Article(url)
# 下载网页
article.download()
# 打印html文档
print(article.html)
# 网页解析
article.parse()
# 标题
print(article.title)
# # 作者
print(article.authors)
# 发布日期
print(article.publish_date)
# 正文
print(article.text)
# 配图
print(article.top_image)
# 视频
print(article.movies)
# 自然语言处理
article.nlp()
# 关键词
print(article.keywords)
# 文章摘要
print(article.summary)
- 整体抓取首页
import newspaper
# 构建新闻源
washingtonpost_paper = newspaper.build('https://www.washingtonpost.com')
# 所有文章的url
for article in washingtonpost_paper.articles:
print(article.url)
# 文章分裂
for category in washingtonpost_paper.category_urls():
print(category)
- Requests和Newspaper结合解析正文
import requests
from newspaper import fulltext
html = requests.get('https://www.washingtonpost.com/business/economy/2019/01/17/19662748-1a84-11e9-9ebf-c5fed1b7a081_story.html?utm_term=.26198c91916f').text
text = fulltext(html)
print(text)
- Google Trends信息
import newspaper
# Google的新闻热点
print(newspaper.hot())
# 流行网站
print(newspaper.popular_urls())
- 多任务
import newspaper
from newspaper import news_pool
# 创建并行任务
slate_paper = newspaper.build('http://slate.com')
tc_paper = newspaper.build('http://techcrunch.com')
espn_paper = newspaper.build('http://espn.com')
papers = [slate_paper, tc_paper, espn_paper]
news_pool.set(papers, threads_per_source=2) # (3*2) = 6 共6个线程
news_pool.join()
print(slate_paper.articles[10].html)