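wechat_spider: a WeChat crawler

2018-05-29, by 锋_5bdc

A Node-based WeChat crawler that uses a man-in-the-middle proxy to collect WeChat article data in bulk, including read counts, like counts and comments.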

Getting started

Prerequisites

Install Node, version newer than 8.8.1
Install MongoDB, version newer than 3.4.6
Install Redis
Install the global Node modules nodemon and pm2
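
The Node version requirement above can be checked with a few lines of script. This is a minimal sketch and not part of the project; the helper name compareVersions is an assumption made for illustration.

```javascript
// Compare two dotted version strings numerically, e.g. "8.10.0" vs "8.8.1".
// Returns a negative number, zero, or a positive number (like a sort comparator).
// compareVersions is a hypothetical helper, not something the project ships.
function compareVersions(a, b) {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i += 1) {
    // Missing or non-numeric parts are treated as 0.
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

const MIN_NODE = '8.8.1'; // minimum stated in the prerequisites
const current = process.versions.node; // version string without the leading "v"

if (compareVersions(current, MIN_NODE) > 0) {
  console.log(`Node ${current} is newer than ${MIN_NODE}`);
} else {
  console.log(`Node ${current} is too old; a version newer than ${MIN_NODE} is needed`);
}
```

A plain string comparison would rank "8.10.0" below "8.8.1", which is why the parts are compared as numbers.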

Installation

git clone https://github.com/lqqyt2423/wechat_spider.git
cd wechat_spider
npm install

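This project is built on the proxy module AnyProxy. To decrypt WeChat's HTTPS requests, the AnyProxy certificate must be installed on both the computer and the phone; see AnyProxy's certificate installation guide.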
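
To give a feel for how a man-in-the-middle proxy sees the traffic, here is a rough sketch of an AnyProxy 4.x rule module. It is not the project's actual rule file: the hook names (beforeDealHttpsRequest, beforeSendResponse) come from AnyProxy's rule interface, while TARGET_HOST and the logging are assumptions made for illustration.

```javascript
// Illustrative AnyProxy (4.x) rule module; NOT the rule used by wechat_spider.
// TARGET_HOST is an assumed host for WeChat article traffic.
const TARGET_HOST = 'mp.weixin.qq.com';

const rule = {
  summary: 'log WeChat article responses (illustrative sketch)',

  // Ask AnyProxy to decrypt HTTPS only for the host of interest.
  // For CONNECT requests, requestDetail.host looks like "mp.weixin.qq.com:443".
  *beforeDealHttpsRequest(requestDetail) {
    return requestDetail.host.split(':')[0] === TARGET_HOST;
  },

  // Runs after the remote server has answered and before the proxy
  // replies to the phone, so the decrypted response body is visible here.
  *beforeSendResponse(requestDetail, responseDetail) {
    if (requestDetail.url.includes(TARGET_HOST)) {
      const { statusCode, body } = responseDetail.response;
      console.log(requestDetail.url, statusCode, body.length);
    }
    return null; // null means: pass the response through unchanged
  },
};

module.exports = rule;
```

Decryption in the first hook only works once the phone trusts the AnyProxy root certificate, which is why the certificate has to be installed on both devices.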

Usage

Administrator@PC-201805221036 MINGW64 /e/ufutx_project/wechat_spider (master)
$ npm start

> wechat_spider@1.1.0 start E:\ufutx_project\wechat_spider
> nodemon index.js --ignore client/

[nodemon] 1.17.5
[nodemon] to restart at any time, enter `rs`
[nodemon] watching: *.*
[nodemon] starting `node index.js`
请配置代理:  xx.xx.xx.xx:8101
可视化界面: http://localhost:8104

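Make sure the computer and the phone are connected to the same Wi-Fi network. After npm start, the command line prints a line like 请配置代理: xx.xx.xx.xx:8101 ("please configure the proxy"). Set the phone's proxy to this IP and port (this is the last step of the AnyProxy certificate installation: setting the proxy).
Open the visual interface in a browser: http://localhost:8104
Open any official account and view its article history; the crawler then fetches the data automatically. Watch the command-line output on the computer to check whether the data is being saved to MongoDB.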
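
If it is unclear which address the phone should use, the computer's LAN addresses can be listed with Node's built-in os module. This is a small sketch independent of the project; lanIPv4Addresses is a hypothetical helper name, and 8101 is simply the port from the sample output.

```javascript
const os = require('os');

// Collect the non-loopback IPv4 addresses of this machine.
// lanIPv4Addresses is a hypothetical helper, not part of wechat_spider.
function lanIPv4Addresses() {
  const result = [];
  const interfaces = os.networkInterfaces();
  for (const name of Object.keys(interfaces)) {
    for (const iface of interfaces[name] || []) {
      // `family` is the string 'IPv4' on most Node versions;
      // a few 18.x releases reported the number 4 instead.
      const isV4 = iface.family === 'IPv4' || iface.family === 4;
      if (isV4 && !iface.internal) {
        result.push({ name, address: iface.address });
      }
    }
  }
  return result;
}

const PROXY_PORT = 8101; // port shown in the sample npm start output
for (const { name, address } of lanIPv4Addresses()) {
  console.log(`${name}: set the phone's proxy to ${address}:${PROXY_PORT}`);
}
```

On a machine with several adapters (for example a VPN or virtual machine bridge), pick the address that belongs to the Wi-Fi network the phone is on.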

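Custom configuration

Examples of the configuration options currently supported:
* Whether to jump automatically between article pages or history pages
* The time interval between jumps
* The crawl range, based on article publish time
* Whether to save the article body
* Whether to save article comments

To customize, edit index.js, config.js and targetBiz.json. The comments in those files explain each option in detail.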
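
As a rough picture of what such options amount to, the sketch below models them as a plain object and shows how a publish-time range could be applied. Every key name here (autoJump, jumpIntervalSeconds, minPublishAt and so on) is hypothetical; the real names are the ones documented in the project's config.js.

```javascript
// Hypothetical option names for illustration only; consult config.js
// in the repository for the real keys and their meaning.
const exampleConfig = {
  autoJump: true, // follow article / history pages automatically
  jumpIntervalSeconds: 2, // pause between automatic jumps
  minPublishAt: new Date('2018-01-01T00:00:00Z'), // skip older posts
  maxPublishAt: new Date('2018-06-01T00:00:00Z'), // skip newer posts
  savePostContent: true, // keep the article body
  savePostComments: true, // keep the comments
};

// Decide whether a post falls inside the configured publish-time window.
function inCrawlRange(publishAt, config) {
  const t = publishAt.getTime();
  return t >= config.minPublishAt.getTime() && t <= config.maxPublishAt.getTime();
}

console.log(inCrawlRange(new Date('2018-05-29T00:00:00Z'), exampleConfig)); // true
console.log(inCrawlRange(new Date('2017-12-31T00:00:00Z'), exampleConfig)); // false
```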

The front-end pages are written in React. To modify them, edit the code in the client folder.

MongoDB data

Database: wechat_spider

Collections:

posts - article data
profiles - official account data
comments - comment data
categories - user-defined official account categories

Exporting data from MongoDB

mongoexport --db wechat_spider --collection posts --type=csv --fields title,link,publishAt,readNum,likeNum,msgBiz,msgMid,msgIdx,sourceUrl,cover,digest,isFail --out ~/Desktop/posts.csv

The command above exports the data to posts.csv on the desktop.
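
When only some of the fields are needed, the long --fields list is easier to manage from a script. The following sketch only builds the argument list for mongoexport; buildExportArgs is a hypothetical helper, and the field names are a subset of those in the command above.

```javascript
// Build the argument list for mongoexport from an array of field names.
// buildExportArgs is a hypothetical helper, not part of wechat_spider.
function buildExportArgs({ db, collection, fields, out }) {
  return [
    '--db', db,
    '--collection', collection,
    '--type=csv',
    '--fields', fields.join(','),
    '--out', out,
  ];
}

const args = buildExportArgs({
  db: 'wechat_spider',
  collection: 'posts',
  fields: ['title', 'link', 'publishAt', 'readNum', 'likeNum'],
  out: 'posts.csv',
});

// Print the equivalent shell command for inspection.
console.log(['mongoexport', ...args].join(' '));
```

The same array could be passed to child_process.execFile('mongoexport', args, callback) to run the export from Node, provided mongoexport is on the PATH.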
