Crawler (5): Dynamic Content with Firefox (Zhihu, Login Required)

2018-01-13 马梦里

A browser-driven crawler can run embedded js scripts in the page.

https://zhuanlan.zhihu.com/p/25214682

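Login methods

Log in with a cookie sent in the request headers; this requires logging in manually beforehand to obtain the cookie.
Filling in the login form automatically also works.
Use a Firefox profile, which stores the browser's saved data for the sites you visit (cookies, user names, passwords, and so on):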
https://support.mozilla.org/en-US/kb/profiles-where-firefox-stores-user-data?redirectlocale=en-US&redirectslug=Profiles
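
The first option, a cookie sent in the request headers, might look roughly like the sketch below. It uses only the standard library's urllib.request instead of a browser. The helper name build_logged_in_request, the User-Agent value, and the cookie string 'name=value' are placeholders for illustration, not values from the original post.

```python
from urllib.request import Request


def build_logged_in_request(url, cookie_string):
    """Build (but do not send) a GET request that carries a login cookie."""
    headers = {
        # Cookie string copied from the browser's developer tools
        # after logging in manually.
        'Cookie': cookie_string,
        # A browser-like User-Agent; this value is only an example.
        'User-Agent': 'Mozilla/5.0',
    }
    return Request(url, headers=headers)


# 'name=value' is a placeholder, not a real Zhihu cookie.
request = build_logged_in_request('https://www.zhihu.com', 'name=value')
print(request.get_header('Cookie'))
```

Passing the request to urllib.request.urlopen() would actually send it. Whether the site accepts the session depends on the cookie still being valid.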

Figure: Firefox profile

Figure: profile path
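
The config module imported below is not shown in the post; presumably config.profile holds the path of the profile folder. As a rough sketch of how that path could be located, the helpers below look in the directories where Firefox usually keeps profiles. The function names are hypothetical, and treating every sub-folder whose name contains a dot (for example xxxxxxxx.default) as a profile is a heuristic assumption.

```python
import os
import platform


def firefox_profiles_root(system=None, home=None, appdata=None):
    """Return the directory where Firefox usually keeps its profiles."""
    system = system or platform.system()
    home = home or os.path.expanduser('~')
    if system == 'Windows':
        appdata = appdata or os.environ.get('APPDATA', '')
        return os.path.join(appdata, 'Mozilla', 'Firefox', 'Profiles')
    if system == 'Darwin':  # macOS
        return os.path.join(home, 'Library', 'Application Support',
                            'Firefox', 'Profiles')
    return os.path.join(home, '.mozilla', 'firefox')  # Linux and others


def list_profiles(root):
    """Return the sorted full paths of folders under root that look like profiles."""
    if not os.path.isdir(root):
        return []
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        # Heuristic: profile folders are named like 'xxxxxxxx.default'.
        if '.' in name and os.path.isdir(os.path.join(root, name))
    )


print(list_profiles(firefox_profiles_root()))
```

One of the printed paths could then be assigned to profile in config.py.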

import config
import platform
import os
from splinter import Browser

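def add_chrome_webdriver():
    # Despite the name, this only appends ./library (where the driver
    # executable is expected to live) to PATH; it is not Chrome-specific.
    print(platform.system())
    working_path = os.getcwd()
    library = 'library'
    path = os.path.join(working_path, library)
    # No trailing separator: an empty PATH entry would mean the current directory.
    os.environ['PATH'] += '{}{}'.format(os.pathsep, path)
    print(os.environ['PATH'])

def scroll_to_end(browser):
    # Run a js snippet in the page to jump to the bottom.
    browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')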

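add_chrome_webdriver() adds the driver directory to the PATH environment variable.
scroll_to_end() scrolls the browser to the bottom of the page; this is how a js event is injected.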
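
A slightly more defensive take on the PATH step is sketched below: it skips the directory if it is already listed and then checks whether the driver can be found. The helper name add_to_path is hypothetical, and looking for an executable called geckodriver (the driver Selenium uses for Firefox) is an assumption about what the library folder contains.

```python
import os
import shutil


def add_to_path(directory, environ=os.environ):
    """Append directory to PATH unless it is already listed."""
    entries = [p for p in environ.get('PATH', '').split(os.pathsep) if p]
    if directory not in entries:
        entries.append(directory)
    environ['PATH'] = os.pathsep.join(entries)
    return environ['PATH']


driver_dir = os.path.join(os.getcwd(), 'library')
add_to_path(driver_dir)
# which() returns None when no such executable is found on PATH.
print(shutil.which('geckodriver'))
```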

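def start_crawler():
    # Path of the Firefox profile that holds the logged-in session.
    option = config.profile

    with Browser(profile=option) as browser:
        url = "https://www.zhihu.com"
        browser.visit(url)
        browser.reload()

        print(browser.html)
        scroll_to_end(browser)
        found = False

        # Keep scrolling until an item dated one day ago shows up.
        # Note: this never terminates if the text does not appear.
        while not found:
            print('loop')
            # '1 天前' ("1 day ago") must match the text shown on the page.
            found = browser.is_text_present('1 天前')
            if found:
                print('Got the feed items from the last day')
                break
            else:
                scroll_to_end(browser)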

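The profile argument of Browser() specifies the Firefox profile to use.
visit() opens the url, then reload() reloads the page (the same as a refresh) so that it comes up logged in.
The loop scrolls down repeatedly and checks for the target text after each scroll.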
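
The loop in start_crawler() runs forever if the text never appears, and it does not pause for new items to load. A bounded variant could look like the sketch below. The helper name scroll_until and the max_scrolls and pause parameters are assumptions for illustration; the browser argument only needs the is_text_present() and execute_script() methods already used above.

```python
import time


def scroll_until(browser, text, max_scrolls=50, pause=1.0):
    """Scroll down until text appears; return True if found, else False."""
    for _ in range(max_scrolls):
        if browser.is_text_present(text):
            return True
        browser.execute_script(
            'window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the page time to load more items
    # One last check after the final scroll.
    return browser.is_text_present(text)
```

Inside the with block, the while loop could then be replaced by a single call such as scroll_until(browser, '1 天前'), with the return value telling the caller whether the text was found before the limit was reached.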


def main():
    add_chrome_webdriver()
    start_crawler()

if __name__ == '__main__':
    main()
