Scraping dynamic web pages on Ubuntu with selenium
Installation
First, install selenium:
pip install selenium
Downloading the Chrome browser driver:
All chromedriver versions can be downloaded from:
http://chromedriver.storage.googleapis.com/index.html
The Chrome browser version and the chromedriver version must correspond to each other:
(Session info: chrome=57)
(Driver info: chromedriver=2.25 , platform=Linux 4.2.0-42-generic x86_64)
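Lines in this form appear in Selenium exception messages, so the two version numbers can be pulled out and logged when something goes wrong. A minimal sketch using only the standard library; parse_versions is a hypothetical helper name, not part of selenium, and it only extracts the numbers without judging whether they are compatible.

```python
import re

def parse_versions(message):
    """Extract the Chrome and chromedriver versions from a Selenium message."""
    chrome = re.search(r"chrome=([\d.]+)", message)
    driver = re.search(r"chromedriver=([\d.]+)", message)
    return (chrome.group(1) if chrome else None,
            driver.group(1) if driver else None)

# The two lines quoted above, used as sample input
message = """(Session info: chrome=57)
(Driver info: chromedriver=2.25 , platform=Linux 4.2.0-42-generic x86_64)"""

chrome_version, driver_version = parse_versions(message)
print("Chrome: %s, chromedriver: %s" % (chrome_version, driver_version))
```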
Environment setup:
Move the downloaded chromedriver into the Chrome installation directory, /opt/google/chrome/
Set permissions on the driver: chmod 777 chromedriver
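Before starting the browser it can help to confirm that the driver file really is where it was moved to and that the permission change took effect. A small sketch, assuming the path from the step above; check_driver is a hypothetical helper written for this post, not a selenium function.

```python
import os

def check_driver(path):
    """Return None if path looks usable as a driver binary, else a short reason."""
    if not os.path.isfile(path):
        return "not found: " + path
    if not os.access(path, os.X_OK):
        return "not executable: " + path
    return None

# Path taken from the setup step above
problem = check_driver("/opt/google/chrome/chromedriver")
print(problem or "chromedriver looks usable")
```

If the function reports "not executable", rerun the chmod step on the file.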
Set the Chrome driver path in the program (this code must be included). It resolves the following error:
[ Error ] selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
import os
from selenium import webdriver

def chromeDriver():
    # Point selenium at the chromedriver binary, then launch Chrome
    chromedriver = '/opt/google/chrome/chromedriver'
    os.environ['webdriver.chrome.driver'] = chromedriver
    return webdriver.Chrome(chromedriver)
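The error message asks for chromedriver to be on PATH, so another option is to add the driver's directory to PATH from inside the program instead of passing the full path. A minimal sketch, assuming the driver sits in /opt/google/chrome as described above; add_to_path is a hypothetical helper, not part of selenium.

```python
import os

def add_to_path(directory):
    """Prepend directory to PATH for this process unless it is already listed."""
    parts = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in parts:
        kept = [p for p in parts if p]  # drop empty entries
        os.environ["PATH"] = os.pathsep.join([directory] + kept)
    return os.environ["PATH"]

# Directory that holds the chromedriver binary in this post
add_to_path("/opt/google/chrome")
```

The change only affects the current Python process and the programs it starts. With the directory on PATH, selenium can look up the chromedriver executable by name.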
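A simple example
This example opens the Baidu home page. Here selenium drives a real, visible browser window: every time a URL is requested, Chrome pops up and loads the page, and while the window is open you cannot do anything else on the desktop, which is tedious. The windowless approach described later in this post solves this problem.
[The Chrome driver setup code above must be included]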
import os
import time
from selenium import webdriver

def chromeDriver():
    chromedriver = '/opt/google/chrome/chromedriver'
    os.environ['webdriver.chrome.driver'] = chromedriver
    return webdriver.Chrome(chromedriver)

if __name__ == '__main__':
    driver = chromeDriver()
    # url = http/https + domain name
    driver.get('https://www.baidu.com')
    time.sleep(1)
    # Get the HTML source after the page has been rendered
    print(driver.page_source)
    driver.quit()
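The fixed time.sleep(1) in this example is a guess: on a slow page it may be too short, and on a fast one it wastes time. A common alternative is to poll for a condition until it holds or a timeout expires. The sketch below is a plain standard-library stand-in for that idea; wait_until and its parameters are hypothetical names chosen for this post. selenium ships its own version of this pattern as WebDriverWait in selenium.webdriver.support.ui.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout expires."""
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)
```

With a driver from chromeDriver(), one possible use is wait_until(lambda: driver.execute_script("return document.readyState") == "complete") before reading driver.page_source.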
* Running selenium without opening a browser window *
Install the required packages:
sudo apt-get install xvfb
sudo pip install pyvirtualdisplay
#coding:utf-8
import os
import time
from pyvirtualdisplay import Display
from selenium import webdriver

def chromeDriver():
    chromedriver = '/opt/google/chrome/chromedriver'
    os.environ['webdriver.chrome.driver'] = chromedriver
    return webdriver.Chrome(chromedriver)

if __name__ == '__main__':
    url = "http://www.baidu.com"
    '''
    Two ways to hide the browser window. Display() can be called without
    arguments, but passing them explicitly is better:
    [1] with Display(backend="xvfb", size=(1024, 768)):
    [2] display = Display(visible=0, size=(1024, 768))
        display.start()
    '''
    with Display(backend="xvfb", size=(1440, 900)):
        driver = chromeDriver()
        # Maximize the browser window
        driver.maximize_window()
        driver.get(url)
        # Print a notice once the browser has been opened
        print('Opening Google browser...')
        # Save a screenshot of the page as test.png in the current working directory
        driver.get_screenshot_as_file("test.png")
        time.sleep(3)
        # Get the HTML source after the page has been rendered
        print(driver.page_source)
        driver.quit()
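One weakness of the script above is that driver.quit() is skipped if any earlier call raises, which can leave a Chrome process running on the virtual display. A small sketch of one way to guard against that with a context manager; managed is a hypothetical helper name, and it assumes only that the object passed in has a quit() method, as selenium drivers do.

```python
from contextlib import contextmanager

@contextmanager
def managed(driver):
    """Yield driver and always call driver.quit() afterwards, even on error."""
    try:
        yield driver
    finally:
        driver.quit()
```

Inside the Display block this would be used as "with managed(chromeDriver()) as driver:", with the get, screenshot and page_source calls indented beneath it and the explicit driver.quit() dropped.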