25. Crawling Alibaba Job Listings with pyspider

2019-04-02  starrymusic

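Alibaba recruitment crawler

Take a web page like this:

I want to scrape its job postings. The summary rows on the list page are nowhere near enough, so I also want to scrape each posting's detail page, which looks like this: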

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2019-03-21 17:17:52
# Project: alijob
import pymongo
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    client = pymongo.MongoClient(host="localhost", port=27017)
    db = client['ali']
    crawl_config = {
    }

    def __init__(self):
        self.total_page = 10
        self.baseurl = 'http://job.alibaba.com/zhaopin/positionList.htm?spm=a2obv.11410899.0.0.55ef6c61NoDzQM#page/'

    @every(minutes=24 * 60)
    def on_start(self):
        # Queue list pages 1 through total_page (inclusive); the original
        # "while page < total_page" loop stopped one page short.
        # fetch_type="js" renders each list page with PhantomJS.
        for page in range(1, self.total_page + 1):
            self.crawl(self.baseurl + str(page), callback=self.index_page, validate_cert=False, fetch_type="js")

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('td > span > a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        url = response.url
        title = response.doc('title').text()
        description = response.doc('.detail-content').text()
        return {
            "url": url,
            "title": title,
            "description": description
        }
    
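    def on_result(self, result):
        # Callbacks such as index_page return None; only store real results.
        if result:
            self.save_to_mongo(result)

    def save_to_mongo(self, result):
        # insert_one replaces the deprecated Collection.insert (removed in PyMongo 4).
        if self.db['ali'].insert_one(result).acknowledged:
            print("save to mongodb success", result)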
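
If a detail page is ever crawled more than once (for example after the project is re-run), a plain insert stores a second copy of the same posting. A minimal sketch of one way to avoid that is to upsert by URL instead. The helper name save_result is hypothetical and not part of pyspider or the original project; it assumes the collection argument is a PyMongo Collection such as the db['ali'] used above.

```python
def save_result(collection, result):
    """Insert or refresh one scraped job posting, keyed by its URL.

    `collection` is assumed to be a pymongo Collection; `save_result`
    is a hypothetical helper name used only for this illustration.
    """
    if not result or "url" not in result:
        return False
    # update_one(..., upsert=True) updates the document with this URL,
    # or creates it when no such document exists yet.
    collection.update_one(
        {"url": result["url"]},
        {"$set": dict(result)},
        upsert=True,
    )
    return True
```

Inside the handler this could be called from on_result in place of save_to_mongo, for example save_result(self.db['ali'], result).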

If you start pyspider again the next day with the pyspider all command after using it for the first time, you may see an error like this:

phantomjs fetch running on port 25555

In that case, close cmd.exe and start pyspider again with the plain pyspider command, without the all argument.
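
The article does not say what causes this message. One possibility, offered here only as an assumption, is that a process from the earlier session is still listening on port 25555. The following sketch uses only the Python standard library to check whether anything accepts connections on a local port; the function name port_in_use is made up for this example.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 25555 is the port named in the message above.
    print("port 25555 in use:", port_in_use(25555))
```

If the check reports the port as busy before pyspider is started, a leftover process is a plausible suspect, though this sketch cannot tell which program owns the port.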

For details, see: https://github.com/hfxjd9527/alijob
