Misc / Today's scraping notes [Selenium-related, look things up as needed]

2019-12-12  yumiii_
  1. Garbled content when scraping a page with requests
    First, run document.inputEncoding in the browser's console to check the site's encoding.
    Then set response.encoding = ... on the response object you get back from the request, as in the sketch below.
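    A minimal sketch of the two steps (the URL and the gbk encoding are the ones used later in this post):

import requests

# document.inputEncoding in the browser console reported "gbk" for this site
resp = requests.get("http://www.tjcn.org/tjgbsy/nd/35338.html")
resp.encoding = "gbk"      # set it before touching resp.text
print(resp.text[:200])     # now decodes correctly instead of coming out garbled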

  2. Maybe the requests were too fast, or the interval between them too short; either way I don't yet know how to tune for this. My code originally made the request with urllib.request.urlopen, and for some reason switching to requests.get made the problem go away. ???

urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

Stack Overflow's explanation of the cause, and its fixes, were as follows (does anyone know how to adapt the first, "easy" snippet? 😢):
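For the record, a sketch of the swap that made the error go away for me (same URL as above):

import requests
import urllib.request

url = "http://www.tjcn.org/tjgbsy/nd/35338.html"

# The original call, which intermittently raised URLError (Errno 8):
# html = urllib.request.urlopen(url).read().decode("gbk")

# The replacement that worked:
resp = requests.get(url)
resp.encoding = "gbk"
html = resp.text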


  3. Headless browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_opt = Options()      # create the options object
# chrome_opt.set_headless()
chrome_opt.add_argument("--headless")
chrome_opt.add_argument('--disable-gpu')    # goes together with headless mode
chrome_opt.add_argument('--window-size=1366,768')   # set the window size; it does affect the page
driver = webdriver.Chrome(executable_path="....manually redacted....", chrome_options=chrome_opt)

Like an idiot, I had defined a browser at the very top of the file, and then couldn't work out why a window still opened even after adding headless mode. Hopeless. It took ages to track down.
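A minimal sketch of the trap, as I reconstruct it: any driver created without the options object still opens a visible window, regardless of what you configure afterwards:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

driver = webdriver.Chrome()             # the one defined "at the top": no options, so a window opens

chrome_opt = Options()
chrome_opt.add_argument("--headless")   # this only affects drivers created *with* chrome_opt
headless_driver = webdriver.Chrome(chrome_options=chrome_opt)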

  4. Today's new skill: killing a process from inside Python

import signal
import os

pid = driver.service.process.pid        # process id of the chromedriver service
try:
    os.kill(int(pid), signal.SIGTERM)
    print("Killed chrome using process")
except ProcessLookupError:
    pass                                # already gone
  5. Deleting one particular child element: clear()


for e in bs.select("ul .res-desc"):
    tag = e.find("cite")
    tag.clear()                  # empty the <cite> tag so get_text() skips it
    desc_string += e.get_text()
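A self-contained version of the same idea (the HTML here is made up for illustration):

from bs4 import BeautifulSoup

html = '<ul><li class="res-desc">useful description <cite>www.example.com</cite></li></ul>'
bs = BeautifulSoup(html, "html.parser")

desc_string = ""
for e in bs.select("ul .res-desc"):
    tag = e.find("cite")
    tag.clear()                  # <cite> is now empty, so only the useful text remains
    desc_string += e.get_text()

print(desc_string)               # -> "useful description "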

  6. Updating webdriver and blocking browser updates
After a while away you'll find Chrome has upgraded itself again and the webdriver no longer matches it.
Blocking updates on a Mac:

cd ~/Library/Google
sudo chown root:wheel GoogleSoftwareUpdate

Then restart Chrome; the update status under Help will now show that it can no longer update, and you're done.

Getting the matching webdriver:
First check the current Chrome version via chrome://version, then find the corresponding driver at: http://chromedriver.storage.googleapis.com/index.html
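If you'd rather compare the two versions from a script, something like this should work (macOS install path assumed; both binaries accept --version):

import subprocess

# Installed Chrome version (default macOS location, assumed)
chrome = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
print(subprocess.check_output([chrome, "--version"], text=True).strip())

# chromedriver version; the major versions of the two must match
print(subprocess.check_output(["./chromedriver", "--version"], text=True).strip())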

  7. The full code:

#!/usr/bin/env python3
# encoding: utf-8
__author__ = 'yumihuang'
# project name: HelloWorld
# time: 2019-12-12

import os
import signal
import urllib.request

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_opt = Options()      # create the options object
# chrome_opt.set_headless()
chrome_opt.add_argument("--headless")
chrome_opt.add_argument('--disable-gpu')    # goes together with headless mode
chrome_opt.add_argument('--window-size=1366,768')   # set the window size; it does affect the page

# html=requests.get("http://www.tjcn.org/tjgbsy/nd/35338.html")
# print(html.text.encode("gbk"))
def get_all_links(url):
    '''
    :param url: the index URL to start from
    :return: a dict mapping each city name to its page URL
    '''
    html = urllib.request.urlopen(url).read().decode("gbk")
    soup = BeautifulSoup(html,"html.parser")
    sy_soup=soup.find("div",class_="sy")
    all_city={}
    for trs in sy_soup.find_all("tr"):
        tds = trs.find_all("td")
        # skip the first <td> of each row and collect the links from the rest
        all_a = BeautifulSoup(str(tds[1:]), "html.parser").find_all("a")
        for item in all_a:
            now_url = item.get("href")
            city_name = item.get_text()
            all_city[city_name] = now_url
    return all_city

def write_to_file(strings, filename):
    # append the scraped text to the city's file
    with open(filename, "a") as f:
        f.write(strings)

def parse_data(url, city_name):
    basic_url = "http://www.tjcn.org"
    url = basic_url + url
    print(url)
    now_html = requests.get(url)
    now_html.encoding = "gbk"    # see tip 1: match the site's declared encoding
    now_html = now_html.text
    after_bs_parse = BeautifulSoup(now_html, "html.parser")
    web_title = after_bs_parse.title.get_text().replace("2017年国民经济和社会发展统计公报 - 中国统计信息网", "")
    assert web_title == city_name    # sanity check: this page really is that city's report
    # the last <b> inside the pager holds the total page count
    pages = after_bs_parse.select("p.pageLink a[title='Page'] b")[-1].get_text()
    print(pages)
    contents_all = after_bs_parse.select("td.xwnr")[0].get_text()
    for page in range(2, int(pages) + 1):
        content = get_sub_content(url, str(page))
        contents_all += content
    write_to_file(contents_all, "city/" + city_name)





def get_sub_content(url, page):
    '''
    :param page: the page number, as a string
    :return: the text content of that paginated sub-page
    '''
    sub_url = url.replace(".html", "_" + page + ".html")

    driver = webdriver.Chrome(executable_path="/Users/yumi/Documents/Code/HelloWorld/HelloWorld/chromedriver",
                              chrome_options=chrome_opt)
    pid = driver.service.process.pid    # remember the pid so we can kill it later (tip 4)

    driver.implicitly_wait(10)
    driver.get(sub_url)

    html = driver.page_source
    contents = ""    # default, so the return below works even if the selector finds nothing
    try:
        contents = BeautifulSoup(html, "html.parser").select("td.xwnr")[0].get_text()
    except IndexError:
        print("not found!!!::" + sub_url)
    driver.quit()
    try:
        os.kill(int(pid), signal.SIGTERM)
        print("Killed chrome using process")
    except ProcessLookupError:
        pass
    return contents

url_2017 = "http://www.tjcn.org/tjgbsy/nd/35338.html"
city = get_all_links(url_2017)
print(city)

for k, v in city.items():
    try:
        parse_data(v, k)
    except Exception:
        print(k + " went wrong!!!")

# get_sub_content("http://www.tjcn.org/tjgb/09sh/35333.html", "2")
# parse_data("/tjgb/09sh/35333.html", "上海市")

Summary: no money in this one any more, alas. But the data cleaning is the messier part anyway, and for now I'm not willing to wade into that swamp. Being bad at it is the original sin.

  8. Scrolling to find an element
    Back when I interned doing test automation, the framework wrapped a "scroll until the element is found" helper, and I wondered at the time how that was implemented. Turns out Selenium has it built in!
    A test snippet:
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options

from bs4 import BeautifulSoup
chrome_opt = Options()      # create the options object
# chrome_opt.set_headless()
# chrome_opt.add_argument("--headless")
chrome_opt.add_argument('--disable-gpu')    # goes together with headless mode
chrome_opt.add_argument('--window-size=1366,768')   # set the window size; it does affect the page
driver = webdriver.Chrome(executable_path="/Users/yumi/Desktop/持续学习/codelearn/chromedriver", chrome_options=chrome_opt)
driver.get("https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html")
# more_button = driver.find_element_by_class_name("btn-know")
# if more_button is not None:
#     more_button.click()
# else:
#     pass

html = driver.page_source
html_bs = BeautifulSoup(html,"lxml")
print(html_bs)
button= driver.find_element_by_class_name("moreBtn")
driver.execute_script('arguments[0].scrollIntoView();', button)
time.sleep(5)
driver.execute_script("arguments[0].click();",button)

To explain: execute_script runs a piece of JavaScript in the page.
For the scrolling itself, the scrollIntoView() line is all you need.
The JS click on the last line is a workaround: I originally used button.click() to click the element once it was found, but that raised the error below:

selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <span class="moreBtn goBtn">...</span> is not clickable at point (515, 637). Other element would receive the click: <div class="reader-tools-page xllDownloadLayerHit_left">...</div>

Stack Overflow offered the two workarounds below, though I don't understand why this happens in the first place QAQ

Honestly, having tried both, only the first one works for me; the second ended up clicking some other element. (Note that the second needs from selenium.webdriver.common.action_chains import ActionChains before use.)
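For reference, a sketch of the two workarounds as I understand them; the first is exactly what the snippet above does:

from selenium.webdriver.common.action_chains import ActionChains

# Workaround 1 (worked for me): click via JavaScript, which ignores the overlapping element
driver.execute_script("arguments[0].click();", button)

# Workaround 2 (clicked the wrong element in my test): move the mouse to the element, then click
ActionChains(driver).move_to_element(button).click().perform()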

  9. requests.session() for session persistence
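The title above is all the original note contains; a minimal sketch of what a Session buys you (URLs reused from earlier in this post):

import requests

s = requests.Session()
# cookies set by this response are stored on the session...
s.get("http://www.tjcn.org/tjgbsy/nd/35338.html")
# ...and sent automatically with every later request made through `s`
r = s.get("http://www.tjcn.org/tjgb/09sh/35333.html")
print(s.cookies.get_dict())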
