
Python Web Scraping from Beginner to Giving Up, Part 17: Common Anti-Scraping Techniques

2019-08-16

The main idea behind defeating anti-scraping measures is to imitate the browser as closely as possible: whatever the browser does, do the same in code.

For example, the browser first requests url1 and keeps the returned cookie locally, then requests url2 with that cookie attached; the code can reproduce exactly the same sequence.
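A minimal sketch of that cookie flow using requests.Session (the URLs here are placeholders, not from the original post):

import requests

session = requests.Session()            # a Session keeps cookies between requests

# first request: the server sets a cookie, which the session stores
session.get('https://example.com/url1')

# second request: the stored cookie is sent automatically, just like a browser
r = session.get('https://example.com/url2')
print(session.cookies.get_dict())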

1 Anti-scraping via headers fields

The request headers contain many fields, and the target server may inspect any of them to decide whether the request comes from a crawler.

The following snippet uses a regular expression to add the quotes to a copied request header automatically:
import re

# header block copied straight from the browser's developer tools
headers_str = '''
Host: www.baidu.com
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36
'''

pattern = r'^(.*?): (.*)$'
for line in headers_str.splitlines():
    if line.strip():                            # skip blank lines
        # rewrite "Key: value" as 'Key':'value',
        print(re.sub(pattern, r"'\1':'\2',", line))

The output is shown below; it saves adding the quotes by hand:

'Host':'www.baidu.com',
'Sec-Fetch-Mode':'navigate',
'Sec-Fetch-Site':'none',
'Sec-Fetch-User':'?1',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
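The resulting lines can then be pasted into a headers dict and sent with the request; a short sketch using a few of the fields above:

import requests

headers = {
    'Host': 'www.baidu.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}

# send the copied browser headers so the request looks like a normal page view
r = requests.get('https://www.baidu.com/', headers=headers)
print(r.status_code)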

1.1 Anti-scraping via the User-Agent field in headers

Install the library:

pip3 install fake_useragent

Code for a random User-Agent:

from fake_useragent import UserAgent

ua = UserAgent()
print(ua.random)
print(ua.random)
print(ua.random)

Output:

Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0
Mozilla/5.0 (X11; CrOS i686 3912.101.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36
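In practice the random User-Agent goes into the request headers; a small sketch that uses httpbin to echo back what was sent:

import requests
from fake_useragent import UserAgent

ua = UserAgent()

# use a different, random User-Agent for each request
headers = {'User-Agent': ua.random}
r = requests.get('https://httpbin.org/user-agent', headers=headers)
print(r.json())   # httpbin echoes the User-Agent it received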
1.2 Anti-scraping via the Referer field or other fields
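Some sites check that the Referer points to one of their own pages; a small sketch of supplying it (the URLs are placeholders, not from the original post):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    # pretend we arrived from the site's own list page (placeholder URL)
    'Referer': 'https://example.com/list',
}
r = requests.get('https://example.com/detail/1', headers=headers)
print(r.status_code)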
1.3 Anti-scraping via cookies
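When access requires a logged-in cookie, a common trick is to copy the Cookie string from the browser's developer tools and turn it into a dict; a sketch with a made-up cookie value:

import requests

# Cookie string copied from the browser's developer tools (made-up value)
cookie_str = 'sessionid=abc123; uid=42'
cookies = dict(item.split('=', 1) for item in cookie_str.split('; '))

r = requests.get('https://example.com/profile', cookies=cookies)
print(r.status_code)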
2 Anti-scraping via JavaScript

An ordinary crawler cannot execute JavaScript by default, so it cannot obtain the results produced after the JS runs; many servers therefore use JavaScript techniques as an anti-scraping measure (a browser-driven workaround is sketched after the subsections below).

2.1 Anti-scraping via JS-driven redirects
2.2 Request parameters generated by JS
2.3 Data encrypted by JS
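A common way around all three JS-based cases is to let a real browser engine run the page, for example with the selenium library. A minimal sketch, assuming Chrome and a matching chromedriver are installed; the URL is a placeholder:

from selenium import webdriver

# headless Chrome executes the page's JavaScript just like a normal browser
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://example.com/js-rendered-page')   # placeholder URL
html = driver.page_source        # HTML after the scripts have run
print(html[:200])
driver.quit()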
3 Anti-scraping via CAPTCHAs
4 Anti-scraping via IP address
# 66ip proxy:
http://www.66ip.cn/6.html
# Xici proxy:
https://www.xicidaili.com/
# Kuaidaili free proxies:
https://www.kuaidaili.com/free/

Fetching Xici high-anonymity proxies with multiple coroutines:

from gevent import monkey
monkey.patch_all()
from fake_useragent import UserAgent
import re, requests, time, gevent, json


# return a random User-Agent string
def gen_ua():
    ua = UserAgent()
    return ua.random


# scrape high-anonymity proxies from one Xici listing page
def get_ip(url):
    headers = {'User-Agent': gen_ua()}
    html = requests.get(url, headers=headers).text
    time.sleep(1)
    # the page marks high-anonymity rows with the text "高匿"
    pattern = r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*?(\d{2,5}).*?高匿.*?(HTTPS?)'
    tds = re.findall(pattern, html, re.S)
    for td in tds:
        k = 'http' if td[2] == 'HTTP' else 'https'
        v = '{}:{}'.format(td[0], td[1])
        yield {k: v}


# check whether each proxy works and save the good ones
def check_ip(ips):
    for ip in ips:
        try:
            r = requests.get('https://httpbin.org/ip', proxies=ip, timeout=3)
            r.raise_for_status()
            # '119.129' is the author's real IP prefix; if it is absent,
            # the proxy really hides the original address
            if '119.129' not in r.text:
                print(ip, 'usable')
                ip_str = json.dumps(ip)
                with open('ip.txt', 'a', encoding='utf-8') as f:
                    f.write(ip_str + '\n')
        except:
            pass


# crawl one listing URL: collect proxies, then validate them
def crawler(url):
    ips = get_ip(url)
    check_ip(ips)


if __name__ == '__main__':
    urls = [
        'https://www.xicidaili.com/nn/',
        'https://www.xicidaili.com/wn/',
        'https://www.xicidaili.com/wt/'
    ]
    # pages 1 and 2 of each listing
    url_list = [url + str(i + 1) for url in urls for i in range(2)]
    tasks_list = [gevent.spawn(crawler, url) for url in url_list]
    gevent.joinall(tasks_list)

Note:

IP address lookup: if the address can be looked up, the proxy is alive.

import requests
import re


# look up where the proxy's exit IP is located
def check_ip(proxies):
    try:
        r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=4)
        origin = r.json()["origin"].split(',')[0]
        url = 'http://www.ip138.com/ips138.asp?ip=' + origin
        headers = {'User-Agent': 'Mozilla/5.0'}
        res = requests.get(url, headers=headers)
        res.encoding = res.apparent_encoding
        # ip138 labels its answer with the text "本站数据:"
        result = re.findall(r'<li>本站数据:(.*?)</li>', res.text, re.S)[0]
        print(result)
    except:
        print('Proxy connection failed! Retry or switch to another IP.')


if __name__ == '__main__':
    proxies = {'https': '117.80.4.174:808'}
    check_ip(proxies)



Using a proxy IP:

import requests, json


# read saved proxies from ip.txt, one JSON dict per line
def read_ip():
    with open('ip.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        ip = line.strip('\n')
        yield json.loads(ip)


# test the proxies and use the first one that works
def run():
    res = None
    for ip in read_ip():
        try:
            r = requests.get('https://httpbin.org/ip', proxies=ip, timeout=3)
            r.raise_for_status()
            # '119.129' is the author's real IP prefix; skip proxies that leak it
            if '119.129' not in r.text:
                try:
                    res = requests.get('http://www.baidu.com', proxies=ip, timeout=3)
                    res.raise_for_status()
                    break
                except:
                    res = None
        except:
            pass
    return res


if __name__ == "__main__":
    res = run()
    if res is not None:
        print(res.url)
    else:
        print('No working proxy found.')
 
5 Anti-scraping based on user behavior

Summary:

There are many anti-scraping techniques, but in general, faithfully imitating the browser's behavior will get you past them.


