SpringBoot整合WebMagic实现爬虫(简单入门含gi

2021-10-07 本文已影响0人小小的人_e5f6

SpringBoot整合WebMagic

前言

为什么我要整合WebMagic ?

WebMagic是一个简单灵活的Java爬虫框架。基于WebMagic，你可以快速开发出一个高效、易维护的爬虫。

因为部分网站它不支持外链图片上传,而我已经把我的图片资源上传了,所以我需要把所有的资源进行获取整合再在部分网站重新上传;

举个🌰:

图片上传失败例子

红框里的就是上传失败的。

很无奈,人家不支持外链地址,那没有办法,自己重新上传吧!

所以我通过WebMagic进行爬取我放置在其他网站的博客再进行上传。

如下🌰:

csdn博客文章

爬取后获得的资源：

爬取(120436445)图片资源

文档迭代

version	功能	详细描述
0.0.1	通过整合WebMagic实现了通用列表(多页面)及详细内容(单页面)的通用爬取接口	通用列表(多页面是指列表页面中的详情页),单页就是指的详情页

项目技术版本(技术选型)

jdk - version - 1.8

maven - version - 3.8.1

SpringBoot - version - 2.2.2.RELEASE

Swagger - version - 2.7.0

remark - version - 1.0.0

webmagic - version - 0.7.3

其他参考pom

项目地址

gitee地址: https://gitee.com/zjydzyjs/spring-boot-use-case-collection/tree/master/webmagic

前置知识

WebMagic 需自主学习

[webmagic官网](http://webmagic.io/)

[webmagic详情文档](http://webmagic.io/docs/zh/)

这里说一下我的感慨,webmagic的作者是真厉害,他再官方上也说了,该框架是他自己业余开发,这个精神值得佩服,在此感谢作者([Yihua Huang](https://github.com/code4craft));

假设你已经学习了该项目，可以继续学习了。

WebMagic 简单测试

文档中的第一个用例:

文档中的第一个用例

你可以再idea中进行搜索(快速按两次 shift),输入GithubRepoPageProcessor类就可以发现这个作者已经写好的用例了,我们可以测试一下。

package us.codecraft.webmagic.processor.example;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000).setTimeOut(10000);

    public GithubRepoPageProcessor() {
    }

    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-])").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            page.setSkip(true);
        }

        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    public Site getSite() {
        return this.site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl(new String[]{"https://github.com/code4craft"}).thread(5).run();
    }
}

你直接通过通过以下用例进行调用就可以测试了,在类{@link com.blacktea.webmagic.demo.WebMagicTest}里,

@Test
void testDemo1(){
    Spider.create(new GithubRepoPageProcessor())
            .addUrl(new String[]{"https://github.com/code4craft"})
            .thread(5).run();
}

运行得到错误内容:

10:34:13.307 [main] DEBUG us.codecraft.webmagic.scheduler.QueueScheduler - push to queue https://github.com/code4craft
10:34:13.730 [main] INFO us.codecraft.webmagic.Spider - Spider github.com started!
10:34:13.762 [pool-1-thread-1] DEBUG org.apache.http.client.protocol.RequestAddCookies - CookieSpec selected: standard
10:34:13.766 [pool-1-thread-1] DEBUG org.apache.http.client.protocol.RequestAuthCache - Auth cache not set in the context
10:34:13.766 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection request: [route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 5]
10:34:13.774 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection leased: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 1 of 100; total allocated: 1 of 5]
10:34:13.775 [pool-1-thread-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Opening connection {s}->https://github.com:443
10:34:13.778 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.DefaultHttpClientConnectionOperator - Connecting to github.com/140.82.112.4:443
10:34:13.778 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Connecting socket to github.com/140.82.112.4:443 with timeout 10000
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Enabled protocols: [TLSv1]
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Enabled cipher suites:[TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDH_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_DSS_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_DSS_WITH_AES_128_CBC_SHA, TLS_EMPTY_RENEGOTIATION_INFO_SCSV]
10:34:14.052 [pool-1-thread-1] DEBUG org.apache.http.conn.ssl.SSLConnectionSocketFactory - Starting handshake
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.DefaultManagedHttpClientConnection - http-outgoing-0: Shutdown connection
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.execchain.MainClientExec - Connection discarded
10:34:14.314 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 5]
10:34:14.316 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLHandshakeException: Received fatal alert: protocol_version
    at sun.security.ssl.Alert.createSSLException(Alert.java:131)
    at sun.security.ssl.Alert.createSSLException(Alert.java:117)
    at sun.security.ssl.TransportContext.fatal(TransportContext.java:340)
    at sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
    at sun.security.ssl.TransportContext.dispatch(TransportContext.java:186)
    at sun.security.ssl.SSLTransport.decode(SSLTransport.java:154)
    at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1279)
    at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1188)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:401)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:373)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
    at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
    at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
    at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
    at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
10:34:15.324 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.

那第一个例子就出现这个问题,那得解决呀,是吧?

于是我就在github的issues找到了该问题！

(protocol_version)bug解决

作者说是会在0.7.4中解决,你可能会说那我们换版本就好了,抱歉,作者可能是工作太忙了,所以0.7.4迟迟不出,那没办法,那咱们就自己改吧。

手动解决 protocol_version bug

先找到作者说的**HttpClientGenerator**类中的方法*buildSSLConnectionSocketFactory*，

private SSLConnectionSocketFactory buildSSLConnectionSocketFactory() {
    try {
        return new SSLConnectionSocketFactory(this.createIgnoreVerifySSL());
    } catch (KeyManagementException var2) {
        this.logger.error("ssl connection fail", var2);
    } catch (NoSuchAlgorithmException var3) {
        this.logger.error("ssl connection fail", var3);
    }

    return SSLConnectionSocketFactory.getSocketFactory();
}

将里面的内容改造就行,不过因为是源码,如果你不想重新打包的话,就需要自己去重写引用HttpClientGenerator

如何知道在那里使用上HttpClientGenerator的?

官网说了,WebMagic的结构分为`Downloader`、`PageProcessor`、`Scheduler`、`Pipeline`四大组件，

那我们怎么知道这四大组件是怎么相互关联的呢?

可以从之前的那个**GithubRepoPageProcessor**类的执行方法中获得

 Spider.create(new GithubRepoPageProcessor())
        .addUrl(new String[]{"https://github.com/code4craft"})
        .thread(5)
        .run();

从上可以知道,执行一个爬取任务,需要通过**Spider**去启动,那我们查看一下

package us.codecraft.webmagic;
......

public class Spider implements Runnable, Task {
    protected Downloader downloader;
    protected List<Pipeline> pipelines = new ArrayList();
    protected PageProcessor pageProcessor;
    protected List<Request> startRequests;
    protected Site site;
    protected String uuid;
    protected Scheduler scheduler = new QueueScheduler();
    protected Logger logger = LoggerFactory.getLogger(this.getClass());
    protected CountableThreadPool threadPool;
    protected ExecutorService executorService;
    protected int threadNum = 1;
    protected AtomicInteger stat = new AtomicInteger(0);
    protected boolean exitWhenComplete = true;
    protected static final int STAT_INIT = 0;
    protected static final int STAT_RUNNING = 1;
    protected static final int STAT_STOPPED = 2;
    protected boolean spawnUrl = true;
    protected boolean destroyWhenExit = true;
    private ReentrantLock newUrlLock = new ReentrantLock();
    private Condition newUrlCondition;
    private List<SpiderListener> spiderListeners;
    private final AtomicLong pageCount;
    private Date startTime;
    private int emptySleepTime;

 ...... 省略
     
     
    /** @deprecated */
    public Spider downloader(Downloader downloader) {
        return this.setDownloader(downloader);
    }

    public Spider setDownloader(Downloader downloader) {
        this.checkIfRunning();
        this.downloader = downloader;
        return this;
    }
     
     public void run() {
        this.checkRunningStat();
        this.initComponent();
        this.logger.info("Spider {} started!", this.getUUID());
        ......
        this.logger.info("Spider {} closed! {} pages downloaded.", this.getUUID(), this.pageCount.get());
     }
    
    protected void initComponent() {
        if (this.downloader == null) {
            // 在这里也看到了默认使用 HttpClientDownloader
            this.downloader = new HttpClientDownloader();
        }

        if (this.pipelines.isEmpty()) {
            this.pipelines.add(new ConsolePipeline());
        }

        this.downloader.setThread(this.threadNum);
        if (this.threadPool == null || this.threadPool.isShutdown()) {
            if (this.executorService != null && !this.executorService.isShutdown()) {
                this.threadPool = new CountableThreadPool(this.threadNum, this.executorService);
            } else {
                this.threadPool = new CountableThreadPool(this.threadNum);
            }
        }

        if (this.startRequests != null) {
            Iterator var1 = this.startRequests.iterator();

            while(var1.hasNext()) {
                Request request = (Request)var1.next();
                this.addRequest(request);
            }

            this.startRequests.clear();
        }

        this.startTime = new Date();
    }
    
}

从上面可以发现四个都有,我们点击进Downloader类

可以发现

Downloader接口实现

三个实现类(红框内容,my那个是我自己实现的)!

然后你点进HttpClientDownloader类

package us.codecraft.webmagic.downloader;

......
@ThreadSafe
public class HttpClientDownloader extends AbstractDownloader {
    private Logger logger = LoggerFactory.getLogger(this.getClass());
    private final Map<String, CloseableHttpClient> httpClients = new HashMap();
    // 在这里就发现使用HttpClientGenerator
    private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
    private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
    private ProxyProvider proxyProvider;
    private boolean responseHeader = true;
    ......省略
}

从以上我们知道了Spider类中有属性Download接口,该接口有实现类HttpClientDownloader，当你在执行Spider.run()时,run()会执行initComponent()，如果你没有设置Spider的download属性,就会自动的选择HttpClientDownloader作为实现类,而HttpClientDownloader中又使用了HttpClientGenerator作为参数,那我们既然要改HttpClientGenerator类中的方法buildSSLConnectionSocketFactory（）,那就得重写HttpClientGenerator和Download

实现，得到MyHttpClientGenerator和MyHttpClientDownloader。

怎么使用新得Download实现?

1.再在使用Spider得时候setDownloader(MyHttpClientDownloader),这样就可以了。

Spider.create(pageProcessor)
        .setDownloader(new MyHttpClientDownloader());

2.或者增加一个类用于继承Spider类,重写initComponent()方法中得

    protected void initComponent() {
        if (this.downloader == null) {
            // 修改这里为 this.downloader = new MyHttpClientDownloader();
            this.downloader = new HttpClientDownloader();
        }
        
        if (this.pipelines.isEmpty()) {
        this.pipelines.add(new ConsolePipeline());
    }

    this.downloader.setThread(this.threadNum);
    if (this.threadPool == null || this.threadPool.isShutdown()) {
        if (this.executorService != null && !this.executorService.isShutdown()) {
            this.threadPool = new CountableThreadPool(this.threadNum, this.executorService);
        } else {
            this.threadPool = new CountableThreadPool(this.threadNum);
        }
    }

    if (this.startRequests != null) {
        Iterator var1 = this.startRequests.iterator();

        while(var1.hasNext()) {
            Request request = (Request)var1.next();
            this.addRequest(request);
        }

        this.startRequests.clear();
    }

    this.startTime = new Date();
}

再次执行testDemo1()

    @Test
    void testDemo1(){
        // 出现错误 -> javax.net.ssl.SSLHandshakeException: Received fatal alert: protocol_version
//        Spider.create(new GithubRepoPageProcessor())
//                .addUrl(new String[]{"https://github.com/code4craft"})
//                .thread(5).run();

        //使用MyHttpClientDownloader解决  protocol_version,但是可能出现github请求超时,这个与github访问有关系,我访问不了
        Spider.create(new GithubRepoPageProcessor())
                .addUrl(new String[]{"https://github.com/code4craft"})
                .setDownloader(new MyHttpClientDownloader())
                .thread(5).run();
    }

官网其他用例

package us.codecraft.webmagic.processor;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;

public interface PageProcessor {
    void process(Page var1);

    Site getSite();
}

你可以通过查找PageProcessor接口得实现,得到官网用例,

PageProcessor实现

本文实现

#### 功能

当前版本通过webmagic实现三个[接口](https://gitee.com/zjydzyjs/spring-boot-use-case-collection/blob/master/webmagic/src/main/java/com/blacktea/webmagic/demo/controller/ProTestController.java)，可实现三个通用功能。

接口请求报文

名称	必填	类型	说明
fileName	true	String	表示下载后得文件名称(要求带文件类型后缀)；
itemPageUrl	false	String	目标页面地址；（该项与helpUrl、targetUrl是分开的,只有进行文章详情爬取的时候才需要itemPageUrl）
helpUrl	false	String	提供需要目标地址的页面url,例如CSDN的列表页;
targetUrl	false	String	找寻目标地址的匹配url,暂时仅支持 regx 匹配;
fields	true	List<Field>	匹配规则集合，长度必须大于0
key	true	String (Field.key)	fields[index].key属性,用于标记该条规则得别名,在导出zip时,会自动将该key作为一级或二级(item一级,list二级)目录名称作为zip内文件分区
value	true	String (Field.value)	fields[index].value属性,用于在页面中进行匹配得内容
type	true	String (Field.type)	fields[index].type属性,与Field.value进行组合使用,当前支持 ['XPath', 'Regex', 'Css', 'JsonPath']，是对webmagi{@link us.codecraft.webmagic.model.annotation.Type}得复用，源码下得matchVal()方法
fileType	true	String (Field.fileType)	fields[index].fileType属性,是对进行匹配后得到结果数据按照类型进行处理,源码下得getDataMap()方法

zip分区说明

为了区分匹配后获取的资源与避免文件名称重复导致文件被覆盖,我这里做了目录分区

1. 当你用`itemPageUrl` 时,文件结构为 

    xxx.zip/key/fileType;

    2. 当你用`helpUrl and targetUrl`时,文件结构为

    第几个页面/key/fileType;

匹配结果调试

WebMagicUtil.getSelectableValue()

private static Object getSelectableValue(Field field,Page page, boolean all){
    Selectable selectable = WebMagicUtil.matchVal(field, page);
    log.debug("当前field.key={},匹配的结果为:{}",field.getKey(),JSON.toJSONString(selectable.all()));
    if (Validator.isNotNull(selectable)){
        // selectable get()/all(),看导出单个具体文件(md),还是压缩文件(zip)
        if (all){
           return selectable.all();
        }else {
           return selectable.get();
        }
    }
    return null;
}

该方法会打印出符合的匹配结果的,默认打印全部结果。

1. 通用详情页导出md

1.1 swagger下载md文件

使用swagger-ui进行md文件下载,地址 -> http://localhost:8000/pro/item/export/md

post,application/json

swagger-md下载请求参数

1.2 实际入参json(CSDN示例)

{
    "fileName":"md下载测试.md",
    "itemPageUrl":"https://blog.csdn.net/weixin_43917143/article/details/120436445",
    "fields":[
        {"key":"md","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"},
        {"key":"md1","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"}
    ]
}

多个key我会把多个key匹配到的内容一起写入到一个md,只要匹配到符合的内容在一个以上就会进行额外拼接 "# md分割-下一篇\n\n"，再拼接下一篇。

1.3 效果

md下载swagger

md文件最终效果

因为我这里爬取的匹配结果为2,所以我就会再拼接一条 "# md分割-下一篇\n\n"。

注意: 这里不是因为 fields.size= 2,所以才导出2个的,是因为你所配置的规则匹配后的内容结果为2。

有可能你配置的规则没有匹配到内容

或者

配置一个规则却可以出现多个拼接

多个拼接(多个根据WebMagicUtil.csdnProcess()的all参数决定)也是有可能的!

all 为

false,即为符合当前这条field规则的最多只取一条,也就是一个匹配结果[0]拼接index=0,

true,取出所有,也就是一个匹配结果[size],循环多个拼接。

2. 通用详情页导出zip

###### 2.1 swagger下载zip

使用swagger-ui进行zip文件下载,[地址](http://localhost:8000/swagger-ui.html#!/ProTestController/exportCSDNMdUsingPOST) -> http://localhost:8000/pro/item/export/zip

post,application/json

2.2 json

CSDN示例

{
    "fileName":"img下载测试.zip",
    "itemPageUrl":"https://blog.csdn.net/weixin_43917143/article/details/120436445",
    "fields":[
          {"key":"images","value":"//*[@id=\"article_content\"]//*//img","type":"XPath","fileType":"PNG"},
          {"key":"icons","value":"//div[@class=\"mouse-box\"]//img","type":"XPath","fileType":"PNG"}
    ]
}

这里建议使用postman进行测试,postman可以将最终结果文件流保存为zip文件

2.3 效果

详情zip下载效果

3. 通用列表页获取详情页进行zip导出

3.1下载地址

http://localhost:8000/pro/list/export/zip

post,application/json

3.2 json

CSDN示例:

{
    "fileName":"xxx.zip",
    "helpUrl":"https://blog.csdn.net/weixin_43917143",
    "targetUrl":"(https://blog\\.csdn\\.net/weixin_43917143/article/details/\\w+)",
    "fields":[
        {"key":"md","value":"//*[@id=\"article_content\"]/[@id=\"content_views\"]","type":"XPath","fileType":"MD"},
           {"key":"images","value":"//*[@id=\"article_content\"]//*//img","type":"XPath","fileType":"PNG"},
            {"key":"icons","value":"//div[@class=\"mouse-box\"]//img","type":"XPath","fileType":"PNG"}
    ]
}

简书示例:

{
    "fileName":"简书-xxx.zip",
    "helpUrl":"https://www.jianshu.com/u/7dcee71ee038",
    "targetUrl":"(https://www\\.jianshu\\.com/p/\\w+)",
    "fields":[
        {"key":"md","value":"//div[@class=\"_gp-ck\"]/section[@class=\"ouvjEz\"][1]","type":"XPath","fileType":"MD"}
    ]
}

3.3 效果

CSDN效果:

列表zip下载效果

SpringBoot整合WebMagic实现爬虫(简单入门含gi