Using the WebMagic crawler framework with Spring Boot

2016-10-03 水花一现

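1. The WebMagic crawler framework

WebMagic is a simple, flexible crawler framework. With it you can quickly build a crawler that is efficient and easy to maintain.

1.1 Official site

The official documentation is clearly written, so I recommend reading it directly; you can also read on below. The addresses are:

Official site: http://webmagic.io

Chinese documentation: http://webmagic.io/docs/zh/

English: http://webmagic.io/docs/en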

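2. Integrating WebMagic with Spring Boot

The integration has three main modules: the crawling module Processor; the persistence module Pipeline, which writes the crawled data to the database; and the scheduled task module Scheduled, which is responsible for crawling site data on a schedule.

2.1 Maven dependencies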

<dependency> 
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.5.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.5.3</version>
</dependency>

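2.2 Crawling module `Processor`

This Processor crawls the Jianshu (简书) home page. It parses the page, extracts each article's title and link, and puts them into WebMagic's Page; the persistence module later takes them out and adds them to the database. The code is as follows: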

package com.shang.spray.common.processor;

import com.shang.spray.entity.News;
import com.shang.spray.entity.Sources;
import com.shang.spray.pipeline.NewsPipeline;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

import java.util.List;

/**
 * info: crawler for the Jianshu home page
 * Created by shang on 16/9/9.
 */
public class JianShuProcessor implements PageProcessor {

    private Site site = Site.me()
            .setDomain("jianshu.com")
            .setSleepTime(100)
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");

    // Home page URL; it is also used as the regex that identifies the list page
    public static final String LIST_URL = "http://www.jianshu.com";

    @Override
    public void process(Page page) {
        if (page.getUrl().regex(LIST_URL).match()) {
            // Each <li> in the article list holds one article
            List<Selectable> items = page.getHtml().xpath("//ul[@class='article-list thumbnails']/li").nodes();
            for (Selectable s : items) {
                String title=s.xpath("//div/h4/a/text()").toString();
                String link=s.xpath("//div/h4").links().toString();
                News news=new News();
                news.setTitle(title);
                news.setInfo(title);
                news.setLink(link);
                news.setSources(new Sources(5));
                page.putField("news"+title, news);
            }
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

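    // Standalone entry point for a quick test. Note: a NewsPipeline created with `new`
    // is not managed by Spring, so its @Autowired NewsRepository stays null and saving fails.
    public static void main(String[] args) {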
        Spider spider=Spider.create(new JianShuProcessor());
        spider.addUrl("http://www.jianshu.com");
        spider.addPipeline(new NewsPipeline());
        spider.thread(5);
        spider.setExitWhenComplete(true);
        spider.start();
    }
}
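
The `process` method above decides whether a page is the list page by matching its URL against a regular expression. As a small aside in plain `java.util.regex` (not WebMagic's own selector classes, and assuming substring-style matching), the sketch below shows why a raw URL used as a pattern is looser than it looks: `.` matches any character, and a substring search also accepts article URLs. `UrlPatternSketch`, `looseMatch`, and `strictMatch` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class UrlPatternSketch {
    static final String HOME = "http://www.jianshu.com";

    // Uses the URL directly as a regex and searches for it anywhere in the input.
    static boolean looseMatch(String url) {
        return Pattern.compile(HOME).matcher(url).find();
    }

    // Quotes the URL so dots are literal, and requires the whole input to match
    // (an optional trailing slash is allowed).
    static boolean strictMatch(String url) {
        return Pattern.compile(Pattern.quote(HOME) + "/?").matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(looseMatch("http://www.jianshu.com/p/abc"));  // true
        System.out.println(strictMatch("http://www.jianshu.com/p/abc")); // false
        System.out.println(strictMatch("http://www.jianshu.com/"));      // true
    }
}
```

Whether the stricter form is worth it depends on which URLs the spider is given; with a single start URL and no followed links, as here, the loose form behaves the same.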

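2.3 Persistence module `Pipeline`

The persistence module works together with a Spring Data repository (NewsRepository) to save the data. It implements WebMagic's Pipeline interface; its process method reads the data produced by the crawling module and then calls the repository's save method. The code is as follows: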

package com.shang.spray.pipeline;

import com.shang.spray.entity.News;
import com.shang.spray.entity.Sources;
import com.shang.spray.repository.NewsRepository;
import org.apache.commons.lang3.StringUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.jpa.domain.Specification;
import org.springframework.stereotype.Repository;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import javax.persistence.criteria.CriteriaBuilder;
import javax.persistence.criteria.CriteriaQuery;
import javax.persistence.criteria.Predicate;
import javax.persistence.criteria.Root;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;

/**
 * info: news
 * Created by shang on 16/8/22.
 */
@Repository
public class NewsPipeline implements Pipeline {

    @Autowired
    protected NewsRepository newsRepository;

    @Override
    public void process(ResultItems resultItems, Task task) {
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            if (entry.getKey().contains("news")) {
                News news=(News) entry.getValue();
                Specification<News> specification=new Specification<News>() {
                    @Override
                    public Predicate toPredicate(Root<News> root, CriteriaQuery<?> criteriaQuery, CriteriaBuilder criteriaBuilder) {
                        return criteriaBuilder.and(criteriaBuilder.equal(root.get("link"),news.getLink()));
                    }
                };
                if (newsRepository.findOne(specification) == null) { // check whether the link already exists
                    news.setAuthor("水花");
                    news.setTypeId(1);
                    news.setSort(1);
                    news.setStatus(1);
                    news.setExplicitLink(true);
                    news.setCreateDate(new Date());
                    news.setModifyDate(new Date());
                    newsRepository.save(news);
                }
            }

        }
    }
}
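
The pipeline above does two things: it keeps only result entries whose key contains "news", and it saves an item only when no stored item has the same link. The following is a minimal in-memory sketch of that logic, with a `LinkedHashMap` standing in for the database. It does not use WebMagic or Spring Data; `DedupPipelineSketch`, `Item`, and the method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DedupPipelineSketch {
    // Hypothetical stand-in for the News entity: just a title and a link.
    static final class Item {
        final String title;
        final String link;

        Item(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    // Stand-in for the database table, keyed by link.
    private final Map<String, Item> store = new LinkedHashMap<>();

    // Saves every "news" entry whose link is not stored yet; returns how many were saved.
    int process(Map<String, Object> fields) {
        int saved = 0;
        for (Map.Entry<String, Object> entry : fields.entrySet()) {
            if (!entry.getKey().contains("news") || !(entry.getValue() instanceof Item)) {
                continue; // not a news entry
            }
            Item item = (Item) entry.getValue();
            // putIfAbsent returns null only when the link was not present before
            if (store.putIfAbsent(item.link, item) == null) {
                saved++;
            }
        }
        return saved;
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        DedupPipelineSketch pipeline = new DedupPipelineSketch();
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("newsHello", new Item("Hello", "http://example.com/a"));
        System.out.println(pipeline.process(fields)); // 1
        System.out.println(pipeline.process(fields)); // 0, the link is already stored
    }
}
```

In the real pipeline the lookup is a database query per item, so an index (or a unique constraint) on the link column would likely help as the table grows; that is a suggestion, not something the original project is known to do.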

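2.4 Scheduled task module `Scheduled`

Use Spring's built-in scheduling annotation @Scheduled(cron = "0 0 0/2 * * ?"), which runs the crawl every two hours, starting at 0:00 each day. The scheduled method invokes WebMagic's crawling module Processor. The code is as follows: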

package com.shang.spray.common.scheduled;

import com.shang.spray.common.processor.DevelopersProcessor;
import com.shang.spray.common.processor.JianShuProcessor;
import com.shang.spray.common.processor.ZhiHuProcessor;
import com.shang.spray.entity.Config;
import com.shang.spray.pipeline.NewsPipeline;
import com.shang.spray.service.ConfigService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.jpa.domain.Specification;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

import javax.persistence.criteria.CriteriaBuilder;
import javax.persistence.criteria.CriteriaQuery;
import javax.persistence.criteria.Predicate;
import javax.persistence.criteria.Root;


/**
 * info: scheduled news tasks
 * Created by shang on 16/8/22.
 */
@Component
public class NewsScheduled {
    @Autowired
    private NewsPipeline newsPipeline;

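    /**
     * Jianshu
     */
    @Scheduled(cron = "0 0 0/2 * * ?") // every 2 hours, starting at 0:00
    public void jianShuScheduled() {
        System.out.println("---- Starting the Jianshu scheduled task");
        Spider spider = Spider.create(new JianShuProcessor());
        spider.addUrl("http://www.jianshu.com");
        spider.addPipeline(newsPipeline);
        spider.thread(5);
        spider.setExitWhenComplete(true);
        // run() blocks until the crawl finishes; calling start() and then stop()
        // right away could stop the spider before it has processed any page
        spider.run();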
    }

}
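
To make the hour field of the cron expression concrete, here is a small plain-Java sketch that expands a `start/step` field such as `0/2` into the hours it matches. It is only an illustration of the notation, not Spring's cron parser; `CronHourSketch` and `expandHours` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class CronHourSketch {
    // Expands an hour field of the form "start/step" into the hours (0-23) it matches.
    static List<Integer> expandHours(String field) {
        String[] parts = field.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected start/step, got: " + field);
        }
        int start = Integer.parseInt(parts[0]);
        int step = Integer.parseInt(parts[1]);
        if (start < 0 || start > 23 || step <= 0) {
            throw new IllegalArgumentException("invalid hour field: " + field);
        }
        List<Integer> hours = new ArrayList<>();
        for (int hour = start; hour < 24; hour += step) {
            hours.add(hour);
        }
        return hours;
    }

    public static void main(String[] args) {
        // prints [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
        System.out.println(expandHours("0/2"));
    }
}
```

With seconds and minutes both fixed at 0, the task in this section therefore fires twelve times a day, on the even hours.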

2.5 Enabling scheduled tasks in Spring Boot

Enable scheduling by adding the @EnableScheduling annotation to the Spring Boot Application class. The code is as follows:

package com.shang.spray;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.builder.SpringApplicationBuilder;
import org.springframework.boot.context.web.SpringBootServletInitializer;
import org.springframework.scheduling.annotation.EnableScheduling;

/**
 * info:
 * Created by shang on 16/7/8.
 */
// @SpringBootApplication already combines @Configuration, @EnableAutoConfiguration and @ComponentScan
@SpringBootApplication
@EnableScheduling
public class SprayApplication extends SpringBootServletInitializer{

    @Override
    protected SpringApplicationBuilder configure(SpringApplicationBuilder application) {
        return application.sources(SprayApplication.class);
    }

    public static void main(String[] args) throws Exception {
        SpringApplication.run(SprayApplication.class, args);
    }
}

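3. Closing remarks

WebMagic is the crawler framework I used to crawl website data in the 水花一现 project. I chose it after comparing it with several other crawler frameworks: it is simple and easy to learn, yet powerful. I only used its basic features here; many of its more powerful features went unused. If you are interested, take a look at the official documentation.

If you need the code, you can pull it from my GitHub; it has been used in the 水花一现 project.
You are welcome to follow the 水花一现 project.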

水花一现 app download: https://www.pgyer.com/0qj6

Blog: http://www.shuihua.me

WeChat official account: 水花一现, shuihuayixian

Email: shangjing105@163.com

GitHub: https://github.com/shangjing105

QQ: 787019494
