Recommender System 1: Creating a Simple Spider with Scrapy
2018-06-19
崔业康
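Create the project
Change into the directory where the project files will be stored.
Create the project by running scrapy startproject zhihuscrapy
Create the spider
In the spiders directory, create the file zhihu_spider.py
The file contains the following code: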
import scrapy


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = [
        "https://zhuanlan.zhihu.com/p/38198729",
        "https://zhuanlan.zhihu.com/p/38235624",
    ]

    def parse(self, response):
        # Each page has a single <head>; read the <title> text from it.
        for sel in response.xpath('//head'):
            title = sel.xpath('title/text()').extract()
            # link and desc reuse the title XPath as placeholders.
            link = sel.xpath('title/text()').extract()
            desc = sel.xpath('title/text()').extract()
            print(title, link, desc)
Set the request headers
Add the following to settings.py:
# User-Agent request header
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Disable cookies
COOKIES_ENABLED = False
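USER_AGENT only covers one header. If the site also expects other headers, Scrapy's DEFAULT_REQUEST_HEADERS setting can carry them. The fragment below is a sketch for settings.py; the Accept-Language value is an assumption chosen for a Chinese-language site, not something the original configuration sets.

```python
# Extra headers sent with every request (goes in settings.py).
DEFAULT_REQUEST_HEADERS = {
    # Same Accept value that Scrapy ships as its default.
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    # Assumed preference: Chinese first, English as a fallback.
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}
```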
Start the crawl
Return to the project directory and run:
scrapy crawl zhihu
Improved code
import scrapy

from zhihuscrapy.items import ZhihuscrapyItem


class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = [
        "https://zhuanlan.zhihu.com/p/38198729",
        "https://zhuanlan.zhihu.com/p/38235624",
    ]

    def parse(self, response):
        # "UserLink-link" is a CSS class, so the selector needs a leading dot.
        # If the class sits on the <a> itself, use "a.UserLink-link::attr(href)".
        for href in response.css(".UserLink-link > a::attr(href)"):
            # Resolve the (possibly relative) href against the page URL.
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//head'):
            item = ZhihuscrapyItem()
            # link and desc reuse the title XPath as placeholders.
            item['title'] = sel.xpath('title/text()').extract()
            item['link'] = sel.xpath('title/text()').extract()
            item['desc'] = sel.xpath('title/text()').extract()
            yield item
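response.urljoin takes a single argument because the page URL is already known to the response. For a plain response it behaves like urllib.parse.urljoin with response.url as the base; for HTML responses Scrapy also honours a <base> tag if the page has one. The sketch below uses only the standard library, and the href values are hypothetical examples rather than links taken from the pages above.

```python
from urllib.parse import urljoin

# Page the links were found on (one of the start URLs).
page_url = "https://zhuanlan.zhihu.com/p/38198729"

# Hypothetical href values in the three common shapes.
root_relative = urljoin(page_url, "/people/example-user")
protocol_relative = urljoin(page_url, "//www.zhihu.com/people/example-user")
sibling = urljoin(page_url, "38235624")

print(root_relative)      # https://zhuanlan.zhihu.com/people/example-user
print(protocol_relative)  # https://www.zhihu.com/people/example-user
print(sibling)            # https://zhuanlan.zhihu.com/p/38235624
```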
Run the spider and export the output
scrapy crawl zhihu -o items.json
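The exported items.json is a JSON array with one object per item, and each field holds a list because extract() returns lists. Here is a small sketch for reading it back with the standard library. The sample data is made up so the snippet runs on its own, and load_items and first_titles are illustrative helper names, not part of Scrapy.

```python
import json
import os
import tempfile


def load_items(path):
    """Load the JSON array written by `scrapy crawl zhihu -o items.json`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def first_titles(items):
    """Return the first title string of each item, skipping empty ones."""
    return [item["title"][0] for item in items if item.get("title")]


# Made-up sample standing in for a real items.json (not actual crawl output).
sample = [
    {"title": ["Example article"], "link": ["Example article"], "desc": ["Example article"]},
    {"title": [], "link": [], "desc": []},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "items.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    titles = first_titles(load_items(path))

print(titles)
```

Against a real crawl, pass the path of the exported file to load_items. By default the JSON exporter may write Chinese text as \uXXXX escapes; json.load decodes those back to normal characters.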
Reference: Scrapy Spider (1): Zhihu (Scrapy爬虫(1)-知乎)
Reference: Scrapy Tutorial (Scrapy入门教程)