Web Spider

2018-06-04, 方寸拾光

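Recently my girlfriend's company once again had to grind through compiling some statistics. This time, though, I gave up on copying the data from the website into Excel one entry at a time, because that is a huge waste of time and effort. Fortunately I had enough time to look into an alternative. Below is the crawler code I wrote.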

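The code relies on two main tools. One is Cheerio, a jQuery-like library for Node (it supports selector operations that are almost identical to those on the front end). The other is Node-xlsx, which writes the crawled data into an Excel file and is also very simple to use.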

The main file is tencent-spider.js:

const http = require('http'),
  fs = require('fs'),
  cheerio = require('cheerio'),
  xlsx = require('node-xlsx');

// Build a single-sheet workbook from the rows and write it to disk
const writeXlsx = datas => {
  const buffer = xlsx.build([
    {
      name: 'Tencent Video Reading',
      data: datas
    }
  ]);
  fs.writeFileSync('./harvest/tencent/1.xlsx', buffer, { flag: 'w' }); // write the Excel file
};

// Extract the title, view count and date of each video in the list and save them to Excel
const savedContent = $ => {
  const dataArr = [];
  dataArr.push(['标题', '阅读量', '时间']); // header row: title, view count, date
  $('.figures_list li').each(function (index, item) {
    const title = $(this).find('strong a').text(),
      reading = $(this).find('.figure_info .info_inner').text(),
      time = $(this).find('.figure_info .figure_info_time').text();
    console.log(title);
    const data = [title, reading, time];
    dataArr.push(data); // data is appended row by row, not column by column
  });
  writeXlsx(dataArr);
};

const startRequest = x => {
  // Send a GET request to the server with the http module
  http.get(x, function (res) {
    let html = ''; // holds the full HTML of the requested page
    res.setEncoding('utf-8'); // avoid garbled Chinese characters
    // Listen for the data event; each call delivers one chunk
    res.on('data', function (chunk) {
      html += chunk;
    });
    // Listen for the end event; once the whole page has arrived, run the callback
    res.on('end', function () {
      const $ = cheerio.load(html); // parse the HTML with cheerio
      savedContent($); // extract the rows and save them
    });
  }).on('error', function (err) {
    console.log(err);
  });
};

startRequest("http://v.qq.com/vplus/wevideo/videos"); // start the crawler
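
As the comment in `savedContent` notes, the sheet data is a list of rows. A minimal, dependency-free sketch of that row-oriented shape follows. `cleanText` and `toSheetRows` are hypothetical helpers that are not part of the original script, and the sample record is made up. The sketch also collapses the stray whitespace that text pulled out of nested markup often carries, which is an assumption about the page rather than something the original code handles.

```javascript
// Hypothetical helper (not in the original script): collapse runs of
// whitespace and trim both ends of a scraped text value.
const cleanText = value => String(value).replace(/\s+/g, ' ').trim();

// Hypothetical helper: build a row-major table with one header row
// followed by one array per record, each padded to the header's width.
const toSheetRows = (header, records) => [
  header,
  ...records.map(record =>
    header.map((_, i) => cleanText(record[i] == null ? '' : record[i]))
  )
];

// Made-up sample record, for illustration only.
const rows = toSheetRows(
  ['Title', 'Views', 'Date'],
  [['  Sample\n video ', ' 1,024 ', '2018-06-01']]
);
console.log(rows);
```

An array shaped like `rows` is what the script passes as `data` to `xlsx.build`, so a helper of this kind could sit between the scraping loop and `writeXlsx`.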

## Explore Spider :watermelon:

This repo collects crawlers that pull data from different websites.

## Tech Stack :strawberry:

- [Node](https://nodejs.org)
- [Cheerio](https://cheerio.js.org)
- [Node-xlsx](https://github.com/mgcrea/node-xlsx)

## Installation :green_apple:

```
$ git clone https://github.com/JimmieMax/explore-spider.git
$ cd explore-spider
$ npm install
```

## Run tasks :banana:

```
# tencent spider http://v.qq.com/vplus/wevideo/videos
node tencent-spider
```
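
The Tencent spider writes its result to `./harvest/tencent/1.xlsx`, and `fs.writeFileSync` fails with `ENOENT` when the parent directory is missing. If the `harvest/tencent` folder is not already present in your checkout, a small wrapper along these lines could create it first. This is only a sketch: `writeFileEnsuringDir` is a hypothetical helper, not part of the repo, and `recursive: true` assumes Node 10.12 or later.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of the repo): create the parent directory
// of `filePath` if it does not exist yet, then write `buffer` to the file.
// The `recursive` option of mkdirSync requires Node 10.12 or later.
const writeFileEnsuringDir = (filePath, buffer) => {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, buffer);
  return filePath;
};

// Possible use inside writeXlsx, in place of the bare fs.writeFileSync call:
// writeFileEnsuringDir('./harvest/tencent/1.xlsx', buffer);
```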

## Authors :cherries:

Jimmie Max
