WeChat Crawler in Practice

2020-08-16  misspass

https://nuozhilin.site/2020/02/20/2020-02-20-weixin-crawler-practice/
https://www.lizenghai.com/archives/31687.html

Proxy

macOS

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">brew install mitmproxy

mitmdump

Proxy server listening at http://*:8080
</pre>

mitmproxy is a free, open-source, programmable HTTPS proxy. AnyProxy is an alternative.
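
To illustrate what "programmable" means here, the following is a minimal sketch of a mitmproxy addon that tallies intercepted responses per host. It is not part of the crawler used later in this post. The class name `HostCounter` and the file name `host_counter.py` are made up for illustration, and the assumption that official-account pages are served from `mp.weixin.qq.com` should be checked against your own traffic.

```python
from collections import Counter


class HostCounter:
    """Minimal mitmproxy addon sketch: tally intercepted responses per host."""

    def __init__(self):
        self.seen = Counter()

    def response(self, flow):
        # mitmproxy calls this hook once for every completed HTTP(S) response.
        host = flow.request.pretty_host
        self.seen[host] += 1
        # Assumption: official-account pages are served from this host.
        if host == "mp.weixin.qq.com":
            print(f"[wechat] {flow.request.method} {flow.request.pretty_url}")


# mitmproxy looks for a module-level list named `addons` in the script.
addons = [HostCounter()]
```

Saved as `host_counter.py`, it could be loaded with `mitmdump -s host_counter.py`.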

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">Double-click ~/.mitmproxy/mitmproxy-ca-cert.pem

Set the mitmproxy certificate to "Always Trust"
</pre>

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">System Preferences => Network => Advanced => Proxies

Web Proxy (HTTP) => 127.0.0.1:8080
Secure Web Proxy (HTTPS) => 127.0.0.1:8080
</pre>
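
Before pointing the system at 127.0.0.1:8080, it can help to confirm that something is actually listening there. The helper below is a small sketch using only the standard library; the function name `proxy_is_listening` is illustrative, and a successful TCP connect only shows that the port is open, not that the listener is mitmproxy.

```python
import socket


def proxy_is_listening(host="127.0.0.1", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if proxy_is_listening():
        print("something is listening on 127.0.0.1:8080")
    else:
        print("nothing is listening on 127.0.0.1:8080")
```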

Android

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">Copy the macOS certificate ~/.mitmproxy/mitmproxy-ca-cert.pem to the phone
</pre>

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">MIUI 11 => Settings => Encryption & credentials => Install from SD card
</pre>

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">Network => Proxy => macos_ip:8080
</pre>

The Mac and the Android phone must be on the same network.
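
The `macos_ip` value entered on the phone is the Mac's address on that shared network. One way to look it up is sketched below. This is a best-effort guess rather than a definitive method: it asks the operating system which local address it would use for outbound traffic, and the function name `lan_ip` is illustrative.

```python
import ipaddress
import socket


def lan_ip():
    """Best-effort guess of this machine's LAN IPv4 address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only selects a route.
        # 192.0.2.1 is a documentation address, used here only for routing.
        s.connect(("192.0.2.1", 80))
        return s.getsockname()[0]
    except OSError:
        # No usable route (for example, the machine is offline).
        return "127.0.0.1"
    finally:
        s.close()


if __name__ == "__main__":
    ip = lan_ip()
    if ipaddress.ip_address(ip).is_loopback:
        print("no LAN address found; is the Mac connected to the network?")
    else:
        print(f"enter {ip}:8080 as the proxy on the phone")
```

On a machine with several interfaces (VPN, Ethernet plus Wi-Fi) the result may not be the address the phone can reach, so treat it as a starting point.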

Crawler

Database

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">docker run --name mysql-weixin -p 3306:3306 -e MYSQL_ROOT_PASSWORD=123456 -d mysql:5.7.17

docker exec -i mysql-weixin mysql -uroot -p123456 <<< "CREATE DATABASE IF NOT EXISTS wechat DEFAULT CHARSET utf8mb4 COLLATE utf8mb4_general_ci;"
</pre>
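
The database is created with `utf8mb4` rather than MySQL's legacy `utf8` charset because the latter stores at most three bytes per character, while emoji and some rarer characters need four bytes in UTF-8. The sketch below shows the difference on a made-up article title; the helper name `max_utf8_width` and the sample strings are illustrative only.

```python
def max_utf8_width(text):
    """Return the largest number of UTF-8 bytes used by any character in text."""
    return max(len(ch.encode("utf-8")) for ch in text)


if __name__ == "__main__":
    plain = "新品发布"        # hypothetical title, CJK only
    with_emoji = "新品发布 🚜"  # same title with an emoji appended
    for title in (plain, with_emoji):
        width = max_utf8_width(title)
        verdict = "needs utf8mb4" if width == 4 else "fits in 3-byte utf8"
        print(f"{title!r}: widest character is {width} bytes, {verdict}")
```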

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">docker run --name redis-weixin -p 6379:6379 -d redis
</pre>

Crawler

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">wget https://zbkj-service.oss-cn-beijing.aliyuncs.com/wechat/wechat_spider.zip

unzip wechat_spider.zip && rm -rf __MACOSX

cd wechat_spider

chmod +x wechat-spider-mac

# Quit the mitmdump instance started earlier

./wechat-spider-mac
</pre>

The crawler service reads its database settings from config.yaml.

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">docker exec -i mysql-weixin mysql -uroot -p123456 <<< "USE wechat; INSERT INTO wechat_account_task (__biz) VALUES('MzIyNzk1MTU2OQ==');"
</pre>

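The __biz value above is the unique ID of the official account "机械指挥官" on the WeChat platform. The WeChat app on the phone must also follow this account; otherwise the crawl will be incomplete.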
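
If you need the `__biz` of another account, one common way is to read it from the query string of one of that account's article or history links. The sketch below assumes such a link carries a `__biz` parameter; the example URL is illustrative, not captured traffic, and the function names are made up. It also shows that this particular `__biz` is simply a Base64-encoded number.

```python
import base64
from urllib.parse import parse_qs, urlsplit


def biz_from_url(url):
    """Return the __biz query parameter of a WeChat link, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("__biz")
    return values[0] if values else None


def biz_to_number(biz):
    """Decode a Base64 __biz value into the text it encodes."""
    return base64.b64decode(biz).decode("ascii")


if __name__ == "__main__":
    # Illustrative URL shape; real links carry more parameters.
    url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzIyNzk1MTU2OQ==&scene=124"
    biz = biz_from_url(url)
    print(biz)                 # MzIyNzk1MTU2OQ==
    print(biz_to_number(biz))  # 3227951569
```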

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">MIUI 11 => WeChat => Contacts => Official Accounts =>

"机械指挥官" => News (新闻资讯) => "机械指挥官" (message history)
</pre>

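Add the official account to be crawled first, then open the "message history" page of any official account. Only that page visit triggers the crawl.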

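The crawler now starts working, and ./logs/wechat_spider.log shows entries like the ones below. The first reports that the bottom of the article list was reached and the account is fully crawled; the second reports that the crawl reached the previously recorded publish time and is complete.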

<pre style="overflow: auto; font-family: consolas, Menlo, "PingFang SC", "Microsoft YaHei", monospace; font-size: 13px; margin: 0px; padding: 1px; color: rgb(234, 234, 234); background: rgb(0, 0, 0); line-height: 1.6; border: none;">MainThread|2020-02-20 14:48:17,877|deal_data.py|deal_article_list|line:290|INFO| 抓取到列表底部 无更多文章,公众号 MzIyNzk1MTU2OQ== 抓取完毕
MainThread|2020-02-20 15:00:40,828|deal_data.py|__parse_article_list|line:153|INFO| 采集到上次发布时间 公众号 MzIyNzk1MTU2OQ== 采集完成
</pre>
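
The log lines are pipe-delimited, which makes them easy to post-process, for example to check which accounts have finished. The parser below is a sketch inferred only from the two sample lines above; the field names are my own labels, and the crawler may emit other line shapes that it does not handle.

```python
import re
from datetime import datetime

SAMPLE = (
    "MainThread|2020-02-20 14:48:17,877|deal_data.py|deal_article_list|"
    "line:290|INFO| 抓取到列表底部 无更多文章,公众号 MzIyNzk1MTU2OQ== 抓取完毕"
)


def parse_log_line(line):
    """Split one log line into named fields (layout inferred from samples)."""
    # Split on the first six pipes only, so the message may contain "|".
    thread, stamp, module, func, lineno, level, message = line.split("|", 6)
    # The account ID follows the word "公众号" (official account) in both samples.
    biz = re.search(r"公众号 (\S+)", message)
    return {
        "thread": thread,
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"),
        "module": module,
        "function": func,
        "line": int(lineno.split(":", 1)[1]),
        "level": level,
        "message": message.strip(),
        "biz": biz.group(1) if biz else None,
    }


if __name__ == "__main__":
    record = parse_log_line(SAMPLE)
    print(record["time"], record["level"], record["biz"])
```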
