Misc / Today's crawler notes [Selenium-related, grab-and-go reference]
1. Garbled page content when scraping with requests
First, run document.inputEncoding in the browser's console to see which encoding the site actually uses, then set response.encoding = ... on the response object returned by requests.
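A minimal sketch of that fix, assuming the console reported GBK (the URL is the one used later in these notes):

import requests

resp = requests.get("http://www.tjcn.org/tjgbsy/nd/35338.html")
resp.encoding = "gbk"   # the value reported by document.inputEncoding in the console
print(resp.text)        # now decoded with the right encoding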
2. urllib.error.URLError
Maybe the requests were too fast, or the interval between them too short, but for now I don't know what to tune when this happens. My code originally made the request with urllib.urlopen, and for whatever reason switching to requests.get fixed it. ???
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
Stack Overflow's explanation of the cause and its suggested fix (does anyone know how to fix the first, "easy" snippet? 😢):
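For reference, a minimal sketch of the swap described above (urllib.request.urlopen replaced by requests.get, plus a short pause between consecutive requests; the delay value is only illustrative):

import time
import requests

def fetch(url):
    # requests.get instead of urllib.request.urlopen(url).read()
    resp = requests.get(url)
    resp.encoding = "gbk"   # match the site's encoding, see note 1
    time.sleep(1)           # illustrative pause between consecutive requests
    return resp.text

html = fetch("http://www.tjcn.org/tjgbsy/nd/35338.html")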
3. Headless browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_opt = Options()  # create the options object
# chrome_opt.set_headless()
chrome_opt.add_argument("--headless")
chrome_opt.add_argument('--disable-gpu')  # goes together with headless mode
chrome_opt.add_argument('--window-size=1366,768')  # set the window size; it does make a difference
driver = webdriver.Chrome(executable_path="....redacted....", chrome_options=chrome_opt)
Like an idiot, I had defined a browser at the very top of the file, and then couldn't figure out why a browser window still opened even with headless mode turned on. Hopeless. It took ages to track down.
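In other words, the options only apply to a driver that is constructed with them; a driver created earlier (or without the options) ignores them. A minimal illustrative sketch of the pitfall:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

driver = webdriver.Chrome()            # defined at the top, no options -> still opens a visible window

chrome_opt = Options()
chrome_opt.add_argument("--headless")
# the flag above has no effect on the driver that already exists;
# only a driver constructed with these options actually runs headless:
headless_driver = webdriver.Chrome(chrome_options=chrome_opt)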
4. New skill learned today: killing a process from Python
import signal
import os

pid = driver.service.process.pid
try:
    os.kill(int(pid), signal.SIGTERM)
    print("Killed chrome using process")
except ProcessLookupError as ex:
    pass
5. Removing one particular child element — clear()
for e in bs.select("ul .res-desc"):   # bs is a BeautifulSoup object built earlier
    tag = e.find("cite")
    tag.clear()                        # empty the <cite> tag so its text is dropped
    desc_string += e.get_text()
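A self-contained sketch of what clear() is doing there (the HTML sample is made up): it empties the chosen tag, so its text no longer shows up in get_text().

from bs4 import BeautifulSoup

html = '<ul><li class="res-desc">useful text <cite>noise to drop</cite></li></ul>'  # made-up sample
bs = BeautifulSoup(html, "html.parser")

desc_string = ""
for e in bs.select("ul .res-desc"):
    e.find("cite").clear()        # empty the <cite> tag in place
    desc_string += e.get_text()

print(desc_string)                # -> "useful text " (the cite text is gone)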
6. Updating the webdriver and stopping Chrome from auto-updating
If you leave it alone for a while, you'll find Chrome has upgraded itself again and the webdriver no longer matches.
Disabling auto-update on macOS:
cd ~/Library/Google
sudo chown root:wheel GoogleSoftwareUpdate
Then restart Chrome and check the Help/About menu to confirm it no longer updates itself, and you're done.
About downloading a matching webdriver:
First open chrome://version
to check the current Chrome version, then download the matching driver from http://chromedriver.storage.googleapis.com/index.html
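A quick way to double-check that the two versions line up (a sketch, assuming chromedriver is on PATH; Selenium exposes both versions through driver.capabilities):

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
caps = driver.capabilities
print("Chrome:", caps.get("browserVersion") or caps.get("version"))
print("chromedriver:", caps.get("chrome", {}).get("chromedriverVersion", "unknown"))
driver.quit()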
7. Full code:
#!/usr/bin/env python3
# encoding: utf-8
__author__ = 'yumihuang'
# project name: HelloWorld
# time: 2019-12-12
from selenium import webdriver
import time
import os
import signal
import requests
import urllib.request
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options

chrome_opt = Options()  # create the options object
# chrome_opt.set_headless()
chrome_opt.add_argument("--headless")
chrome_opt.add_argument('--disable-gpu')  # goes together with headless mode
chrome_opt.add_argument('--window-size=1366,768')  # set the window size; it does make a difference

# html = requests.get("http://www.tjcn.org/tjgbsy/nd/35338.html")
# print(html.text.encode("gbk"))

def get_all_links(url):
    '''
    :param url: index page URL
    :return: a dict mapping every city name to its article URL
    '''
    html = urllib.request.urlopen(url).read().decode("gbk")
    soup = BeautifulSoup(html, "html.parser")
    sy_soup = soup.find("div", class_="sy")
    all_city = {}
    for trs in sy_soup.find_all("tr"):
        tds = trs.find_all("td")
        # skip the first column and collect every link in the remaining cells
        all_a = BeautifulSoup(str(tds[1:]), "html.parser").find_all("a")
        for item in all_a:
            now_url = item.get("href")
            city_name = item.get_text()
            all_city[city_name] = now_url
    return all_city

def write_to_file(strings, filename):
    with open(filename, "a") as f:
        for line in strings:
            f.write(line)

def parse_data(url, city_name):
    basic_url = "http://www.tjcn.org"
    url = basic_url + url
    print(url)
    now_html = requests.get(url)
    now_html.encoding = "gbk"   # the site is GBK-encoded (see note 1)
    now_html = now_html.text
    # now_html = urllib.request.urlopen(url).read().decode("gbk")
    after_bs_parse = BeautifulSoup(now_html, "html.parser")
    web_title = after_bs_parse.title.get_text().replace("2017年国民经济和社会发展统计公报 - 中国统计信息网", "")
    assert web_title == city_name
    # the last <b> inside the page links holds the total number of pages
    pages = after_bs_parse.select("p.pageLink a[title='Page'] b")[-1].get_text()
    print(pages)
    contents_all = BeautifulSoup(now_html, "html.parser").select("td.xwnr")[0].get_text()
    for page in range(2, int(pages) + 1):
        content = get_sub_content(url, str(page))
        contents_all += content
    write_to_file(contents_all, "city/" + city_name)

def get_sub_content(url, page):
    '''
    :param page: page number as a string
    :return: text content of that pagination page
    '''
    sub_url = url.replace(".html", "_" + page + ".html")
    driver = webdriver.Chrome(executable_path="/Users/yumi/Documents/Code/HelloWorld/HelloWorld/chromedriver", chrome_options=chrome_opt)
    pid = driver.service.process.pid
    driver.implicitly_wait(10)
    driver.get(sub_url)
    html = driver.page_source
    contents = ""
    try:
        contents = BeautifulSoup(html, "html.parser").select("td.xwnr")[0].get_text()
    except IndexError:
        print("not found!!!::" + sub_url)
    driver.quit()
    try:
        # make sure no chrome/chromedriver process is left behind (see note 4)
        os.kill(int(pid), signal.SIGTERM)
        print("Killed chrome using process")
    except ProcessLookupError:
        pass
    return contents

# test()
url_2017 = "http://www.tjcn.org/tjgbsy/nd/35338.html"
city = get_all_links(url_2017)
print(city)
for k, v in city.items():
    try:
        parse_data(v, k)
    except Exception:
        print(k + ": something went wrong!!!")
# get_sub_content("http://www.tjcn.org","/tjgb/09sh/35333.html","2")
# parse_data("/tjgb/09sh/35333.html","上海市")
Summary: so in the end there's no money in it for me, sigh. But the data cleaning is the really messy part, and for now I'd rather not wade into that swamp. Being a noob is the original sin.
8. Scrolling to find an element
Back when I was doing automation during my internship, the framework had a wrapped "scroll until the element is found" helper, and I kept wondering how that was implemented. Turns out Selenium has it built in!
A quick test snippet:
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

chrome_opt = Options()  # create the options object
# chrome_opt.set_headless()
# chrome_opt.add_argument("--headless")
chrome_opt.add_argument('--disable-gpu')  # goes together with headless mode
chrome_opt.add_argument('--window-size=1366,768')  # set the window size; it does make a difference
driver = webdriver.Chrome(executable_path="/Users/yumi/Desktop/持续学习/codelearn/chromedriver", chrome_options=chrome_opt)
driver.get("https://wenku.baidu.com/view/aa31a84bcf84b9d528ea7a2c.html")
# more_button = driver.find_element_by_class_name("btn-know")
# if more_button is not None:
#     more_button.click()
html = driver.page_source
html_bs = BeautifulSoup(html, "lxml")
print(html_bs)
button = driver.find_element_by_class_name("moreBtn")
driver.execute_script('arguments[0].scrollIntoView();', button)  # scroll the element into view
time.sleep(5)
driver.execute_script("arguments[0].click();", button)           # click via JavaScript
A quick explanation: execute_script runs a piece of JavaScript in the page.
For scrolling, that single scrollIntoView() line is all you need.
The JavaScript click on the last line is there because clicking the element directly went wrong: I originally used button.click(), but it threw the error below:
selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element <span class="moreBtn goBtn">...</span> is not clickable at point (515, 637). Other element would receive the click: <div class="reader-tools-page xllDownloadLayerHit_left">...</div>
Stack Overflow offered the two workarounds below, though I don't understand why this happens in the first place QAQ.
Honestly, when I tried them only the first one worked; the second ended up clicking some other element (and the second approach needs
from selenium.webdriver.common.action_chains import ActionChains
before you can use it).
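The first workaround is the JavaScript click already shown in the snippet above (execute_script with arguments[0].click()); the second, judging from the ActionChains import it requires, is presumably along these lines (a sketch, not the exact Stack Overflow answer):

from selenium.webdriver.common.action_chains import ActionChains

# move the mouse to the element first, then click at its current position
ActionChains(driver).move_to_element(button).click(button).perform()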
9. requests.session() for keeping a session alive
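A minimal sketch of the idea (the URLs are illustrative): a Session object carries cookies and connection state across requests, so for example a login performed with one call still applies to the next.

import requests

session = requests.Session()
# cookies set by this response are stored on the session object...
session.post("http://example.com/login", data={"user": "u", "password": "p"})  # illustrative endpoint
# ...and sent automatically with every later request made through the same session
profile = session.get("http://example.com/profile")
print(profile.status_code)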