Scraping School of Public Administration News with Scrapy
2017-05-21
安小宇
Target: news updates and article content from the School of Public Administration, Sichuan University
Crawling rules: elements are located with CSS selectors
Collection process
Activate and enter the virtual environment
(screenshot: 1.png)
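A minimal sketch of the activation step, assuming a virtualenv was created for the project (the environment name `venv` and its location are assumptions; substitute your own):

```shell
# Activate the virtual environment (name "venv" is illustrative)
source venv/bin/activate
```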
Create the project
(screenshot: 2.png)
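A sketch of the project-creation command; the project name `ggnews` is inferred from the `from ggnews.items import GgnewsItem` import used later:

```shell
# Create a new Scrapy project named ggnews and enter it
scrapy startproject ggnews
cd ggnews
```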
Modify items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GgnewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()
    img = scrapy.Field()
Write the spider
import scrapy

from ggnews.items import GgnewsItem


class GgnewsSpider(scrapy.Spider):
    name = "spidernews"

    start_urls = [
        'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1',
    ]

    def parse(self, response):
        # Follow every article link on the listing page
        for href in response.css('div.pb30.mb30 div.right_info.p20.bgf9 ul.index_news_ul.dn li a.fl::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse2)

        # Read the current page number from the pager, then request the next page
        next_page = response.css('div.w100p div.px_box.w1000.auto.ovh.cf div.pb30.mb30 div.mobile_pager.dn li.c::text').extract_first()
        if next_page is not None:
            next_url = int(next_page) + 1
            next_urls = '?c=special&sid=1&page=%s' % next_url
            print(next_urls)
            next_urls = response.urljoin(next_urls)
            yield scrapy.Request(next_urls, callback=self.parse)

    def parse2(self, response):
        items = []
        for new in response.css('div.w1000.auto.cf div.w780.pb30.mb30.fr div.right_info.p20'):
            item = GgnewsItem()
            # No trailing commas here -- they would wrap each value in a tuple
            item['title'] = new.css('div.detail_zy_title h1::text').extract_first()
            item['time'] = new.css('div.detail_zy_title p::text').extract_first()
            item['content'] = new.css('div.detail_zy_c.pb30.mb30 p span::text').extract()
            item['img'] = new.css('div.detail_zy_c.pb30.mb30 p.MsoNormal img::attr(src)').extract()
            items.append(item)
        return items
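The pagination step in `parse` takes the pager's current page number, increments it, and joins the resulting query string against the response URL. The same logic can be sketched with the standard library alone (the base URL and page text are hardcoded here for illustration):

```python
from urllib.parse import urljoin


def next_page_url(base_url, current_page_text):
    """Build the next listing-page URL from the pager's current page number."""
    next_page = int(current_page_text) + 1
    # A query-only relative reference replaces the query string of base_url
    relative = '?c=special&sid=1&page=%s' % next_page
    return urljoin(base_url, relative)


print(next_page_url('http://ggglxy.scu.edu.cn/index.php?c=special&sid=1', '1'))
# -> http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=2
```

Because the relative reference starts with `?`, `urljoin` keeps the path (`/index.php`) and swaps only the query string, which is exactly what `response.urljoin` does inside the spider.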
Move the spider file into the spiders folder
(screenshots: 3.png, 4.png)
Run the spider
scrapy crawl spidernews -o spidernews.xml
(The first few runs kept failing with ImportError: No module named items. A web search showed the cause: a .py file in the spiders directory must not share the project's name, so the spider file was renamed.)
(screenshot: 5.png)
scrapy crawl spidernews -o spidernews.json
(screenshot: 7.png)
The resulting data
(screenshots: 6.png, 8.png, 9.png, 10.png)