Node.js crawler: flight ticket query study notes (2).md

2017-04-04 ccminn

2017.3.20 - 2017.3.31

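Note index

Deduplicating records in MongoDB
Generating an array of dates
Writing sequential query code
https://segmentfault.com/q/1010000005615722/a-1020000005615887
The insertMany function
.insertMany([
{key:"1",key1:"value1"},
{key:"2",key1:"value1"},
{key:"3",key1:"value1"},
……
]);
Front-end page design
Chart description (to be added)
Learning regular expressions (to be added)
Summary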

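Learning resources

CAPTCHA recognition
Introduction to Promises/A+
An in-depth explanation of promises

For sites that encrypt their data, send POST/GET requests by simulating a real browser
Simulating a browser without a GUI in Python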

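Detailed notes

1. Avoiding duplicates when saving data to MongoDB

Method 1: use the update() method.
Calling update() with the upsert flag creates a new document when no document matches the query.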

// Save to the database
var newRoute = new Route({
    'departCity': departCityName,
    'departCode': departCityCode,
    'arrivalCity': arrivalCityName,
    'arrivalCode': arrivalCityCode,
    'expired': false,
});
newRoute.save(function (err, data) {
    if (err) {
        console.error(err);
    } else {
        console.log('record a new route from ' + data.departCity + ' to ' + data.arrivalCity + ' successful!');
    }
});
// Problems:
// 1. The route is saved without checking whether the same route already exists, so records may be duplicated.
// 2. I tried .update({}, {upsert: true}), but without a {$set: {}} update document the upsert did not work.
// 3. With findOne() plus .save(): the save sits inside six nested for loops, and because the calls are asynchronous,
//    the object handed to the callback was always the same record.
// So I used async's concurrency limit instead: async.mapLimit([], Num, function(), callback)

// Improved version
async.mapLimit(_routes, 1, function (_route, callback) {
    Route.find({
        departCode: _route.departCode,
        arrivalCode: _route.arrivalCode,
    }, function (err, doc) {
        if (err) {
            // Pass the query error on instead of reading doc.length on undefined
            return callback(err);
        }
        if (doc.length > 0) {
            console.log('duplicate record');
            return callback(null, 'one');
        }
        _route.save(function (err, data) {
            if (err) {
                console.error(err);
                console.log('save error');
            } else {
                console.log('record a new route from ' + data.departCity + ' to ' + data.arrivalCity + ' successful!');
            }
            // Signal completion only after the save has finished, so the next lookup can see this record
            callback(null, 'one');
        });
    });
}, function (err, result) {
    if (err) {
        console.error(err);
    }
    console.log(result);
    console.log('all routes have been recorded!');
});
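
The upsert idea from Method 1 can be sketched as follows. This is only a minimal sketch, not the code used in the project: `buildRouteUpsert` is a hypothetical helper, the city names and codes are made-up sample values, and the commented usage assumes the `Route` model exposes Mongoose's `updateOne(filter, update, options)`. The filter identifies a route by its two codes, and `$setOnInsert` supplies the remaining fields only when a new document is created, so an existing route is left untouched.

```javascript
// Hypothetical helper: builds the three arguments for an upsert call.
function buildRouteUpsert(route) {
  return {
    // A route counts as a duplicate when both codes match.
    filter: { departCode: route.departCode, arrivalCode: route.arrivalCode },
    // $setOnInsert is applied only when the upsert inserts a new document.
    update: {
      $setOnInsert: {
        departCity: route.departCity,
        arrivalCity: route.arrivalCity,
        expired: false,
      },
    },
    options: { upsert: true },
  };
}

// Sample values for illustration only.
const spec = buildRouteUpsert({
  departCity: 'Beijing', departCode: 'BJS',
  arrivalCity: 'Shanghai', arrivalCode: 'SHA',
});
console.log(JSON.stringify(spec.filter));

// Assumed usage with a Mongoose model named Route (not executed here):
// Route.updateOne(spec.filter, spec.update, spec.options).then(function () { /* done */ });
```

Because the update document uses an operator, this form may also avoid the problem noted earlier, where update() without `$set` did not upsert as expected.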

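Method 2: work on the database itself and add a compound unique index
db.collection.ensureIndex({key1: 1, key2: 1, key3: 1}, {unique: true});
(Since MongoDB 3.0, ensureIndex() is deprecated in favor of createIndex(), which takes the same arguments.)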

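2. Generating a calendar for the next three months

Converting date arrays
A roundup of format helpers for date formatting
Notes on usage:
this.getMonth() must be incremented by 1 to get the actual month, because months are zero-based in JavaScript.
Getting tomorrow's date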

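// Returns the date `day` days after today as a formatted string (day = 1 gives tomorrow).
// Date.prototype.Format is not built in; it comes from the format helper referenced above.
// Check whether that helper expects 'MM' or 'mm' for the month before relying on this pattern.
function theDayAfter(day) {
    var today = new Date();
    var targetDay_milliseconds = today.getTime() + 1000 * 24 * 60 * 60 * day;
    var targetDay = new Date();
    targetDay.setTime(targetDay_milliseconds);
    return targetDay.Format('mm/dd/yyyy');
}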
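
Building on the single-day helper, a whole date array can be sketched like this. It is a minimal sketch under stated assumptions: `formatDate` and `buildDateList` are hypothetical names, and the formatting is done with built-in Date methods rather than the custom Format helper, so the block runs on its own.

```javascript
// Format a Date as MM/dd/yyyy using only built-in methods.
function formatDate(date) {
  const pad = (n) => String(n).padStart(2, '0');
  // getMonth() is zero-based, so add 1 to get the calendar month.
  return pad(date.getMonth() + 1) + '/' + pad(date.getDate()) + '/' + date.getFullYear();
}

// Build an array of formatted dates: `start` and the days after it, `count` entries in total.
function buildDateList(start, count) {
  const dates = [];
  for (let i = 0; i < count; i++) {
    // Constructing from year/month/day lets Date handle month and year rollover.
    const d = new Date(start.getFullYear(), start.getMonth(), start.getDate() + i);
    dates.push(formatDate(d));
  }
  return dates;
}

// Sample run starting on 30 March 2017 (month index 2 is March).
const sample = buildDateList(new Date(2017, 2, 30), 4);
console.log(sample);
```

For roughly three months of dates, something like `buildDateList(new Date(), 90)` would do; the exact count is a choice, not something fixed by these notes.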

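3. Simulating a real browser to deal with encryption on sites such as Ctrip

The goal of the simulation is to behave like a person using a browser: besides sending requests in the background, the crawler also runs the page's front-end JavaScript, which brings out information that only exists after the page has loaded. (For example, encrypted values such as searchKey may arrive in a response to one of the page's GET requests, or may be computed by some JavaScript on the page.)
Reference: thinking about crawlers from the anti-crawler's point of view
The crawler has to execute the JavaScript to obtain the dynamic key, so an environment that can run it (a simulated real browser) becomes the key to getting past the protection.
Types of tools used in the experiment:
Simulating browser click events and text-input events
selenium: simulates browser events
webdriver: the driver for the browser
phantomjs: a browser without a graphical interface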

How these dependencies relate to each other: WebDriver and the Selenium-Server

WebDriver and the Selenium-Server
You may, or may not, need the Selenium Server, depending on how you intend to use Selenium-WebDriver. If your browser and tests will all run on the same machine, and your tests only use the WebDriver API, then you do not need to run the Selenium-Server; WebDriver will run the browser directly.
There are some reasons though to use the Selenium-Server with Selenium-WebDriver.
You are using Selenium-Grid to distribute your tests over multiple machines or virtual machines (VMs).
You want to connect to a remote machine that has a particular browser version that is not on your current machine.
You are not using the Java bindings (i.e. Python, C#, or Ruby) and would like to use HtmlUnit Driver

[Chinese documentation](https://wizardforcel.gitbooks.io/selenium-doc/content/official-site/selenium-web-driver.html) (selenium webdriver = selenium 2)
Another API reference for selenium-webdriver

An official example

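var webdriver = require('selenium-webdriver');
// Quoted from the official example; in practice the Builder also needs forBrowser(...), as the driver example at the end of these notes shows.
var driver = new webdriver.Builder().build();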
driver.get('http://www.google.com');

var element = driver.findElement(webdriver.By.name('q'));
element.sendKeys('Cheese!');
element.submit();

driver.getTitle().then(function(title) {
  console.log('Page title is: ' + title);
});

driver.wait(function() {
  return driver.getTitle().then(function(title) {
    return title.toLowerCase().lastIndexOf('cheese!', 0) === 0;
  });
}, 3000);

driver.getTitle().then(function(title) {
  console.log('Page title is: ' + title);
});

driver.quit();

Summary of the selenium-webdriver API:

Loading a page:

driver.get("http://www.google.com");

Locating DOM elements:
By id:

var element = driver.findElement(By.id('coolestWidgetEvah'));

By class name:

driver.findElements(By.className("cheese")).then(cheeses => console.log(cheeses.length));

By tag name:

var frame = driver.findElement(By.tagName('iframe'));

By name attribute:

var cheese = driver.findElement(By.name('cheese'));

By link text:

<a href="http://www.google.com/search?q=cheese">cheese</a>
var cheese = driver.findElement(By.linkText('cheese'));

By partial link text:

<a href="http://www.google.com/search?q=cheese">search for cheese</a>
var cheese = driver.findElement(By.partialLinkText('cheese'));

By CSS selector:

<div id="food"><span class="dairy">milk</span><span class="dairy aged">cheese</span></div>
var cheese = driver.findElement(By.css('#food span.dairy.aged'));

By XPath:

<input type="text" name="example" />
<INPUT type="text" name="other" />
driver.findElements(By.xpath("//input")).then(cheeses => console.log(cheeses.length));

Executing JavaScript (the official docs have no JavaScript version, so this is only my own guess):

driver.executeScript("");

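Getting an element's text:

var element = driver.findElement(By.id('elementID'));
element.getText().then(text => console.log(`Text is ${text}`));

Filling in a form field:

element.sendKeys("");

Click event:

element.click();

Form submit event:

element.submit();

Switching windows (the argument is a window name or handle, not a URL):

driver.switchTo().window('windowName');

Switching frames:

driver.switchTo().frame('frameName');

The difference between submit() and click() in selenium
click() simply performs a single click on the element, whereas submit() is generally used on elements inside a form tag.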

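Difficulties

How to wait, after sending the POST request, until the page has fully loaded before scraping its elements
The approach I have come up with so far: check whether the first flight entry can be located; if it can, scrape it, otherwise keep waiting
How selenium determines that the current page has finished loading
To try: webdriverIO, a package built on top of selenium 2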
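
The "scrape once the first flight can be located, otherwise wait" idea can be sketched as a generic polling loop. This is a minimal sketch, not tested against a real page: `waitFor` is a hypothetical helper, and `firstFlightLocated` is a stand-in that simply turns true on the third poll. In real use the check would wrap a selenium-webdriver lookup such as `driver.findElements(...)`; the library's own `driver.wait(until.elementLocated(locator), timeout)` is another option worth trying.

```javascript
// Hypothetical helper: polls `check` until it yields a truthy value or the timeout passes.
function waitFor(check, timeoutMs, intervalMs) {
  const deadline = Date.now() + timeoutMs;
  return new Promise((resolve, reject) => {
    function attempt() {
      Promise.resolve()
        .then(check) // `check` may return a plain value or a promise
        .then((result) => {
          if (result) {
            resolve(result);
          } else if (Date.now() >= deadline) {
            reject(new Error('Timed out waiting for condition'));
          } else {
            setTimeout(attempt, intervalMs);
          }
        })
        .catch(reject);
    }
    attempt();
  });
}

// Stand-in for "is the first flight row present yet?": becomes true on the third poll.
let polls = 0;
const firstFlightLocated = () => {
  polls += 1;
  return polls >= 3;
};

const waitDemo = waitFor(firstFlightLocated, 1000, 10).then(() => {
  console.log('condition met after ' + polls + ' polls');
  return polls;
});
```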

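Errors I ran into:

async.mapLimit([list], limitNumber, function1(), function2())
function1 is called once for each item in the list, with at most limitNumber calls running at the same time. I forgot to call callback inside function1, which made processing stop after the first limitNumber items.
function2 is called once the whole list has been processed and should not be left out.
for loops and async should not be nested inside one another
Watch where callback is placed; putting it outside the function body by mistake can overflow the call stack
RangeError: Maximum call stack size exceeded

Why do too many recursive calls overflow the stack? Function arguments are passed through stack space, so every call consumes part of the thread's stack. In a recursion, the functions can only return one by one after the final termination point is reached; until then, the stack space they occupy is not released. If the recursion goes too deep, stack usage can exceed the thread's maximum, causing a stack overflow and an abnormal exit.
It may also be that the recursion has no exit condition.
Fix: check that async is being called correctly.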
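
One plausible reason a plain for loop misbehaves with asynchronous callbacks, and why the save in section 1 kept seeing the same record, is that `var` is function-scoped: every callback created in the loop shares one variable, and by the time the callbacks run the loop has already finished. The original loop is not shown in these notes, so this is an assumption; the sketch below only illustrates the effect with hypothetical function names and made-up route labels.

```javascript
// Callbacks created in a loop with `var` all share a single variable.
function collectWithVar(items) {
  var callbacks = [];
  for (var i = 0; i < items.length; i++) {
    var item = items[i];
    callbacks.push(function () { return item; });
  }
  // The callbacks run only after the loop has ended, like async database callbacks would.
  return callbacks.map(function (cb) { return cb(); });
}

// With `let`/`const`, each iteration gets its own binding.
function collectWithLet(items) {
  const callbacks = [];
  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    callbacks.push(() => item);
  }
  return callbacks.map((cb) => cb());
}

console.log(collectWithVar(['A-B', 'B-C', 'C-D'])); // every callback sees the last item
console.log(collectWithLet(['A-B', 'B-C', 'C-D'])); // each callback sees its own item
```

Passing each item as a function argument, as the iteratee of async.mapLimit does, gives the same per-item isolation as `let`.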

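The types of the data being stored must match the schema
flightPrice: a type conflict between Number and String.
When you run into an error like this:
{ [Error: socket hang up] code: 'ECONNRESET', response: undefined }
Cause: the server closed the connection (for example after crashing and restarting), so the client has to reconnect
Fix: space out the requests, using setTimeout()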
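
Spacing out requests with setTimeout() can be sketched as below. It is a minimal sketch under stated assumptions: `sleep` and `runWithDelay` are hypothetical helpers, `fakeRequest` is a stand-in that only records when it was called instead of making a real HTTP request, and the 20 ms delay is just a small demo value.

```javascript
// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `task` for each item in order, pausing `delayMs` between consecutive calls.
async function runWithDelay(items, delayMs, task) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    if (i > 0) {
      await sleep(delayMs);
    }
    results.push(await task(items[i]));
  }
  return results;
}

// Stand-in for a real request: records the call time and returns a label.
const callTimes = [];
const fakeRequest = async (date) => {
  callTimes.push(Date.now());
  return 'queried ' + date;
};

const delayDemo = runWithDelay(['04/01/2017', '04/02/2017', '04/03/2017'], 20, fakeRequest);
delayDemo.then((results) => console.log(results));
```

A real crawler would likely use a much longer delay; a suitable value depends on the target site and is not something these notes determine.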

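Installing webdriverio
Official installation guide
Blog installation guide
Follow the official guide. If running the jar produces the error below, the JDK is too old (class file version 52.0 corresponds to Java 8); choosing the "2.40.0" version from the blog guide fixes it.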

$ java -jar selenium-server-standalone-3.0.1.jar
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/openqa/grid/selenium/GridLauncherV3 : Unsupported major.minor version 52.0

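7. Driving the Chrome browser with selenium-webdriver
The official API docs also have an omission: forBrowser('chrome') is missing
Be sure to download the driver that matches your browser
On macOS, install it to /usr/bin
Recent macOS versions enable SIP (System Integrity Protection), so even with sudo you cannot modify /usr/bin (or a few other important directories). You have to disable SIP first and then move the browser driver into /usr/bin. (/usr/local/bin is not protected by SIP, so it may be an alternative that avoids disabling it.)
Also make sure /usr/bin is added to the PATH in ~/.bash_profile
Then source the file so that the change takes effect immediately
Example driver code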

var webdriver = require('selenium-webdriver');
var phantomjs = require('phantomjs-prebuilt');
var By = webdriver.By;
var cheerio = require('cheerio');
// The official tutorial omits the forBrowser('') step
var driver = new webdriver.Builder().forBrowser('phantomjs').build();
driver.get('http://www.baidu.com');

var element = driver.findElement(By.name('wd'));
element.sendKeys('Cheese!');
element.submit();

// readyState lives in the browser, so ask the browser for it instead of the parsed HTML
driver.executeScript('return document.readyState;').then(function (state) {
    console.log('readystate: ' + state);
});

driver.getPageSource().then(function (res) {
    var $ = cheerio.load(res);
    var button = $('#su');
    console.log('su=');
    console.log(button.val());
});

driver.wait(function () {
    // Wait until the page title starts with "cheese!"
    return driver.getTitle().then(function (title) {
        return title.toLowerCase().lastIndexOf('cheese!', 0) === 0;
    });
}, 1000);

driver.getPageSource().then(function (res) {
    // console.log("page content is : " + res);
    var $ = cheerio.load(res);
    var nums = $('#container .nums');
    console.log('nums = ');
    console.log(nums.text());
});

driver.quit();