Python Crawler Study - Day 6 - IP Pool

2019-05-15  光小月

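Contents

  1. Python Crawler Study - Day 1
  2. Python Crawler Study - Day 2 - Regular Expressions
  3. Python Crawler Study - Day 3 - BeautifulSoup
  4. Python Crawler Study - Day 4 - Extracting Content with lxml + XPath
  5. Python Crawler Study - Day 5 - selenium
  6. Python Crawler Study - Day 6 - IP Pool
  7. Python Crawler Study - Day 7 - Hands-on Project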

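Learning about IPs

  1. Learn what an IP address is, why IPs get blocked, and how to deal with a blocked IP.

  2. Scrape the Xici proxy list (西刺代理) and build your own proxy pool.

  3. Direct link to Xici: https://www.xicidaili.com/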

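1. Why IPs get blocked, and how to deal with it

Websites use anti-crawling mechanisms to keep their content from being scraped. When a single IP address sends a large number of similar requests, the site blocks that IP, and access only becomes possible again after some time has passed.
Common anti-crawling strategies:

0. Checking browser headers such as User-Agent
1. IP bans
2. Image CAPTCHAs
3. Slider CAPTCHAs
4. JavaScript tracking of interaction trajectories
5. Certificate-based encryption
6. AI-based detection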
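
To make the IP-ban mechanism concrete, here is a minimal sketch of how a server might count requests per IP and block an address for a while. The class name `IpRateLimiter` and its thresholds are hypothetical, chosen only for illustration; real sites use their own rules.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Hypothetical server-side check: ban an IP that sends too many requests."""

    def __init__(self, max_requests=5, window_seconds=10.0, ban_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self._banned_until = {}          # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Return True if the request from `ip` should be served."""
        now = time.monotonic() if now is None else now
        # Still inside a ban period: reject immediately.
        if self._banned_until.get(ip, 0.0) > now:
            return False
        hits = self._hits[ip]
        # Forget requests that have fallen out of the counting window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        # Too many requests inside the window: start a ban.
        if len(hits) > self.max_requests:
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False
        return True
```

Seen from the crawler's side, this is why spreading requests over several IPs or slowing them down helps: each address stays under the per-IP threshold.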

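2. How to deal with a blocked IP

1. Set up proxy IPs and rotate between them
2. Leave a time interval between requests
3. Set the User-Agent dynamically
4. Disable cookies
5. Set a download delay
6. Use Google Cache
7. Use a pool of IP addresses (proxy IPs, VPNs, etc.)
8. Use Crawlera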
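
Several of these measures (proxy rotation, a request interval, a changing User-Agent) can be combined in one small helper. The sketch below only prepares the arguments for each request; the proxy addresses, the second User-Agent string, and the name `make_request_planner` are placeholders made up for illustration.

```python
import itertools
import random

# Placeholder values for illustration only; substitute real ones.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0',
]
PROXIES = ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']


def make_request_planner(proxies, user_agents, min_delay=1.0, max_delay=3.0, rng=random):
    """Return a function that produces (delay, kwargs) for each successive request."""
    proxy_cycle = itertools.cycle(proxies)  # round-robin over the proxies

    def next_request():
        proxy = next(proxy_cycle)
        delay = rng.uniform(min_delay, max_delay)  # random pause before the request
        kwargs = {
            'headers': {'User-Agent': rng.choice(user_agents)},
            # requests expects lowercase scheme names as keys here.
            'proxies': {'http': proxy, 'https': proxy},
            'timeout': 10,
        }
        return delay, kwargs

    return next_request


plan = make_request_planner(PROXIES, USER_AGENTS)
delay, kwargs = plan()
# In a crawler: time.sleep(delay), then requests.get(url, **kwargs)
```

Calling `requests.get` directly, rather than through a `requests.Session`, also means cookies are not carried over from one request to the next.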

Reference: https://desmonday.github.io/2019/03/06/python%E7%88%AC%E8%99%AB%E5%AD%A6%E4%B9%A0-day6-IP%E4%BB%A3%E7%90%86/

3. Getting proxy IP addresses

Site: https://www.xicidaili.com/

Example

import requests
from bs4 import BeautifulSoup as bs

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
HEADERS = {'User-Agent': USER_AGENT}
PROTOCOLS = ('http', 'https')


def get_html(url):
    """Download a page and return its text, or None if the request fails."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print(f'error, could not open page {url}: {e}')
        return None


def get_proxy_ip(html):
    """Parse the #ip_list table into a list of 'protocol://ip:port' strings."""
    proxy_ip_list = []
    table = bs(html, 'html.parser').find(id='ip_list')
    if table is None:
        return proxy_ip_list
    for row in table.find_all('tr'):
        cells = row.select('td')
        # Header rows have no <td>; data rows hold IP, port and protocol
        # in columns 1, 2 and 5.
        if len(cells) > 5:
            ip = cells[1].text.strip()
            port = cells[2].text.strip()
            protocol = cells[5].text.strip().lower()
            if protocol in PROTOCOLS:
                proxy_ip_list.append(f'{protocol}://{ip}:{port}')
    return proxy_ip_list


def check_proxy_availability(proxy):
    """Return True if www.baidu.com can be fetched through the proxy."""
    scheme, _, address = proxy.partition('://')
    scheme = scheme.lower()
    # Test with a URL of the same scheme, otherwise the proxy is never used.
    url = f'{scheme}://www.baidu.com'
    # requests looks proxies up by lowercase scheme; uppercase keys such as
    # 'HTTP' are ignored and the request goes out directly.
    # Assumption: the listed protocol says what traffic the proxy carries,
    # while the connection to the proxy itself is plain HTTP.
    proxies = {scheme: f'http://{address}'}
    try:
        r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=10)
        r.raise_for_status()  # raises for 4xx/5xx responses
        print(f'Valid IP, {proxy}')
        return True
    except requests.RequestException:
        print(f'Invalid IP, {proxy}')
        return False


if __name__ == '__main__':
    url = 'https://www.xicidaili.com/'
    html = get_html(url)
    ips = get_proxy_ip(html) if html else []
    print(ips)
    use_ip_list = []
    for ip in ips:
        if check_proxy_availability(ip):
            use_ip_list.append(ip)
    print('Valid proxy IPs')
    print(use_ip_list)
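
The loop in the script checks proxies one at a time, and each check may wait up to its 10-second timeout. As a possible speed-up, the checks can run in a thread pool. This is a sketch under the assumption that the check is any function taking a proxy string and returning True or False, such as the availability check in the script; the name `filter_working` is made up here.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_working(proxies, check, max_workers=20):
    """Run `check` on every proxy concurrently and keep those that pass.

    `check` must return a truthy value for a usable proxy and must not raise.
    """
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        results = list(executor.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

Threads suit this job because each check spends almost all of its time waiting on the network.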

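Result

The output below was recorded from a run of the scraping script that keyed the proxies dict with uppercase 'HTTP'/'HTTPS'. requests matches proxies by lowercase scheme, so those keys were ignored, every request went out directly, and every scraped entry was reported as valid. The literal %s appears because print received the format string and the value as separate arguments.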

['HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTPS://122.137.4.230:8118', 'HTTP://163.204.244.150:9999', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTP://112.85.171.8:9999', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTP://115.225.49.188:8118', 'HTTPS://119.162.37.165:8118', 'HTTP://123.139.28.36:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://221.225.147.58:8118', 'HTTP://222.186.45.145:57273', 'HTTP://121.238.82.201:8118', 'HTTP://114.221.20.254:8118', 'HTTPS://58.250.23.210:1080', 'HTTPS://180.160.139.246:8118', 'HTTPS://112.80.41.86:8888', 'HTTP://218.64.69.79:8080', 'HTTPS://59.38.61.164:9797', 'HTTPS://182.18.13.149:53281', 'HTTP://113.251.221.143:8118', 'HTTP://117.90.5.64:9000', 'HTTPS://125.32.80.52:8080', 'HTTPS://58.247.127.145:53281', 'HTTPS://175.23.40.250:8080', 'HTTP://211.162.70.229:3128', 'HTTPS://182.149.157.168:8118', 'HTTPS://218.22.7.62:53281', 'HTTP://121.79.131.58:8080', 'HTTP://14.115.106.178:808', 'HTTPS://122.137.4.230:8118', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTPS://221.225.147.58:8118', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTPS://119.162.37.165:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://123.163.96.141:9999', 'HTTPS://222.137.4.96:8118', 'HTTPS://115.53.20.2:9999', 'HTTPS://124.94.199.204:9999', 'HTTPS://112.87.71.206:9999', 'HTTPS://120.83.105.77:9999', 'HTTPS://112.85.151.97:9999', 'HTTPS://58.250.23.210:1080', 'HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTP://163.204.244.150:9999', 
'HTTP://112.85.171.8:9999', 'HTTP://115.225.49.188:8118', 'HTTP://222.186.45.145:57273', 'HTTP://123.139.28.36:8118', 'HTTP://112.85.169.44:9999', 'HTTP://121.238.82.201:8118', 'HTTP://1.198.72.48:9999', 'HTTP://112.85.129.140:9999', 'HTTP://171.37.157.61:8123', 'HTTP://119.162.150.192:8118', 'HTTP://114.221.20.254:8118', 'HTTP://49.86.176.110:9999', 'HTTP://114.230.69.201:9999', 'HTTP://112.255.118.2:8118']
Valid IP, %s HTTP://120.83.98.192:9999
Valid IP, %s HTTP://115.239.25.244:9999
Valid IP, %s HTTP://120.83.111.221:9999
Valid IP, %s HTTP://1.198.72.153:9999
Valid IP, %s HTTP://180.126.169.6:8118
Valid IP, %s HTTPS://122.137.4.230:8118
Valid IP, %s HTTP://163.204.244.150:9999
Valid IP, %s HTTPS://112.85.170.172:9999
Valid IP, %s HTTPS://112.85.131.34:9999
Valid IP, %s HTTP://112.85.171.8:9999
Valid IP, %s HTTPS://112.85.164.213:9999
Valid IP, %s HTTPS://117.95.98.120:42704
Valid IP, %s HTTP://115.225.49.188:8118
Valid IP, %s HTTPS://119.162.37.165:8118
Valid IP, %s HTTP://123.139.28.36:8118
Valid IP, %s HTTPS://171.80.3.169:9999
Valid IP, %s HTTPS://112.85.164.161:9999
Valid IP, %s HTTPS://115.53.16.222:9999
Valid IP, %s HTTPS://36.99.212.233:9999
Valid IP, %s HTTPS://113.121.46.78:9999
Valid IP, %s HTTPS://221.225.147.58:8118
Valid IP, %s HTTP://222.186.45.145:57273
Valid IP, %s HTTP://121.238.82.201:8118
Valid IP, %s HTTP://114.221.20.254:8118
Valid IP, %s HTTPS://58.250.23.210:1080
Valid IP, %s HTTPS://180.160.139.246:8118
Valid IP, %s HTTPS://112.80.41.86:8888
Valid IP, %s HTTP://218.64.69.79:8080
Valid IP, %s HTTPS://59.38.61.164:9797
Valid IP, %s HTTPS://182.18.13.149:53281
Valid IP, %s HTTP://113.251.221.143:8118
Valid IP, %s HTTP://117.90.5.64:9000
Valid IP, %s HTTPS://125.32.80.52:8080
Valid IP, %s HTTPS://58.247.127.145:53281
Valid IP, %s HTTPS://175.23.40.250:8080
Valid IP, %s HTTP://211.162.70.229:3128
Valid IP, %s HTTPS://182.149.157.168:8118
Valid IP, %s HTTPS://218.22.7.62:53281
Valid IP, %s HTTP://121.79.131.58:8080
Valid IP, %s HTTP://14.115.106.178:808
Valid IP, %s HTTPS://122.137.4.230:8118
Valid IP, %s HTTPS://112.85.170.172:9999
Valid IP, %s HTTPS://112.85.131.34:9999
Valid IP, %s HTTPS://221.225.147.58:8118
Valid IP, %s HTTPS://112.85.164.213:9999
Valid IP, %s HTTPS://117.95.98.120:42704
Valid IP, %s HTTPS://119.162.37.165:8118
Valid IP, %s HTTPS://171.80.3.169:9999
Valid IP, %s HTTPS://112.85.164.161:9999
Valid IP, %s HTTPS://115.53.16.222:9999
Valid IP, %s HTTPS://36.99.212.233:9999
Valid IP, %s HTTPS://113.121.46.78:9999
Valid IP, %s HTTPS://123.163.96.141:9999
Valid IP, %s HTTPS://222.137.4.96:8118
Valid IP, %s HTTPS://115.53.20.2:9999
Valid IP, %s HTTPS://124.94.199.204:9999
Valid IP, %s HTTPS://112.87.71.206:9999
Valid IP, %s HTTPS://120.83.105.77:9999
Valid IP, %s HTTPS://112.85.151.97:9999
Valid IP, %s HTTPS://58.250.23.210:1080
Valid IP, %s HTTP://120.83.98.192:9999
Valid IP, %s HTTP://115.239.25.244:9999
Valid IP, %s HTTP://120.83.111.221:9999
Valid IP, %s HTTP://1.198.72.153:9999
Valid IP, %s HTTP://180.126.169.6:8118
Valid IP, %s HTTP://163.204.244.150:9999
Valid IP, %s HTTP://112.85.171.8:9999
Valid IP, %s HTTP://115.225.49.188:8118
Valid IP, %s HTTP://222.186.45.145:57273
Valid IP, %s HTTP://123.139.28.36:8118
Valid IP, %s HTTP://112.85.169.44:9999
Valid IP, %s HTTP://121.238.82.201:8118
Valid IP, %s HTTP://1.198.72.48:9999
Valid IP, %s HTTP://112.85.129.140:9999
Valid IP, %s HTTP://171.37.157.61:8123
Valid IP, %s HTTP://119.162.150.192:8118
Valid IP, %s HTTP://114.221.20.254:8118
Valid IP, %s HTTP://49.86.176.110:9999
Valid IP, %s HTTP://114.230.69.201:9999
Valid IP, %s HTTP://112.255.118.2:8118
Valid proxy IPs
['HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTPS://122.137.4.230:8118', 'HTTP://163.204.244.150:9999', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTP://112.85.171.8:9999', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTP://115.225.49.188:8118', 'HTTPS://119.162.37.165:8118', 'HTTP://123.139.28.36:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://221.225.147.58:8118', 'HTTP://222.186.45.145:57273', 'HTTP://121.238.82.201:8118', 'HTTP://114.221.20.254:8118', 'HTTPS://58.250.23.210:1080', 'HTTPS://180.160.139.246:8118', 'HTTPS://112.80.41.86:8888', 'HTTP://218.64.69.79:8080', 'HTTPS://59.38.61.164:9797', 'HTTPS://182.18.13.149:53281', 'HTTP://113.251.221.143:8118', 'HTTP://117.90.5.64:9000', 'HTTPS://125.32.80.52:8080', 'HTTPS://58.247.127.145:53281', 'HTTPS://175.23.40.250:8080', 'HTTP://211.162.70.229:3128', 'HTTPS://182.149.157.168:8118', 'HTTPS://218.22.7.62:53281', 'HTTP://121.79.131.58:8080', 'HTTP://14.115.106.178:808', 'HTTPS://122.137.4.230:8118', 'HTTPS://112.85.170.172:9999', 'HTTPS://112.85.131.34:9999', 'HTTPS://221.225.147.58:8118', 'HTTPS://112.85.164.213:9999', 'HTTPS://117.95.98.120:42704', 'HTTPS://119.162.37.165:8118', 'HTTPS://171.80.3.169:9999', 'HTTPS://112.85.164.161:9999', 'HTTPS://115.53.16.222:9999', 'HTTPS://36.99.212.233:9999', 'HTTPS://113.121.46.78:9999', 'HTTPS://123.163.96.141:9999', 'HTTPS://222.137.4.96:8118', 'HTTPS://115.53.20.2:9999', 'HTTPS://124.94.199.204:9999', 'HTTPS://112.87.71.206:9999', 'HTTPS://120.83.105.77:9999', 'HTTPS://112.85.151.97:9999', 'HTTPS://58.250.23.210:1080', 'HTTP://120.83.98.192:9999', 'HTTP://115.239.25.244:9999', 'HTTP://120.83.111.221:9999', 'HTTP://1.198.72.153:9999', 'HTTP://180.126.169.6:8118', 'HTTP://163.204.244.150:9999', 
'HTTP://112.85.171.8:9999', 'HTTP://115.225.49.188:8118', 'HTTP://222.186.45.145:57273', 'HTTP://123.139.28.36:8118', 'HTTP://112.85.169.44:9999', 'HTTP://121.238.82.201:8118', 'HTTP://1.198.72.48:9999', 'HTTP://112.85.129.140:9999', 'HTTP://171.37.157.61:8123', 'HTTP://119.162.150.192:8118', 'HTTP://114.221.20.254:8118', 'HTTP://49.86.176.110:9999', 'HTTP://114.230.69.201:9999', 'HTTP://112.255.118.2:8118']
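
The list above repeats many entries (for example, 'HTTPS://122.137.4.230:8118' appears twice), and a plain list gives no way to retire a proxy that stops working. One way to turn such a list into an actual pool is sketched below. The `ProxyPool` class, its method names, and the `max_failures` threshold are assumptions made for illustration, not part of the original script.

```python
import random


class ProxyPool:
    """A small in-memory pool: hand out proxies and retire ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        # dict.fromkeys drops duplicates while keeping first-seen order;
        # the value is the number of consecutive failures per proxy.
        self._failures = dict.fromkeys(proxies, 0)
        self.max_failures = max_failures

    def __len__(self):
        return len(self._failures)

    def get(self):
        """Return a random proxy, or None when the pool is empty."""
        if not self._failures:
            return None
        return random.choice(list(self._failures))

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        if proxy in self._failures:
            self._failures[proxy] = 0

    def report_failure(self, proxy):
        """Count a failure and drop the proxy once it reaches max_failures."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self.max_failures:
            del self._failures[proxy]
```

A crawler would call `get()` before each request and then `report_success` or `report_failure` depending on the outcome, so dead proxies drop out of rotation over time.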

References:

  1. https://blog.csdn.net/weixin_43720396/article/details/88218204
  2. https://desmonday.github.io/2019/03/06/python%E7%88%AC%E8%99%AB%E5%AD%A6%E4%B9%A0-day6-IP%E4%BB%A3%E7%90%86/

PS: If you found this acceptable, decent, passable, or even just not too bad, feel free to follow me. Thanks in advance!
