2019-03-06

2019-03-12 拉一曲扯淡

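While scraping data from Taobao and Anjuke recently, I ran into all sorts of problems, and eventually realized that setting up a proxy pool could not wait any longer.

So today I'm starting a series: from crawling proxy IPs, to validating them, to a proxy pool in which every proxy is usable.

Honestly, without proxies nothing goes smoothly.

Step 1: crawl all the free proxy IPs listed on Kuaidaili (快代理) as preparation for the proxy pool. (Free proxies are actually quite unstable, but for someone still learning like me they should be good enough.) The crawled IPs are stored in Redis, because the proxy pool will later take all of its IPs from Redis.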

# -*- coding: utf-8 -*-
"""
@author: Administrator
@file: seleniumStu_2.py
@time: 2019/02/{DAY}

Purpose: crawl all the proxy IPs listed on Kuaidaili, in preparation for a proxy pool.
"""
import logging
import random
import time

import requests
# from pyquery import PyQuery as pq
from lxml import etree
from redis import StrictRedis, ConnectionPool

LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
logging.basicConfig(level=logging.DEBUG, format=LOG_FORMAT)

pool = ConnectionPool.from_url('redis://localhost:6379/0')
redis = StrictRedis(connection_pool=pool)
redis_key = 'proxy_key'


def get_response(url):
    """Fetch a page and return its HTML text, or None on any failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    try:
        # Send a browser-like User-Agent and pick a random timeout of 3 to 10 seconds.
        response = requests.get(url, headers=headers,
                                timeout=random.choice(range(3, 11)))
        response.encoding = 'UTF-8'
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        # requests raises its own exception hierarchy (timeouts included).
        logging.warning('request failed: %s', url)
    return None


def parse_response(html):
    """Push every proxy found on the page into Redis as scheme://ip:port."""
    if not html:
        return
    r = etree.HTML(html)
    for i in r.xpath('//tbody/tr'):
        ip = i.xpath('td[1]/text()')
        port = i.xpath('td[2]/text()')
        scheme = i.xpath('td[4]/text()')
        if not (ip and port and scheme):
            continue  # skip rows that lack one of the three cells
        redis.rpush(redis_key, scheme[0].lower() + '://' + ip[0] + ':' + port[0])


def main():
    for page in range(1, 3626):
        print('Parsing page %s ...' % page)
        url = 'https://www.kuaidaili.com/free/inha/' + str(page) + '/'
        time.sleep(5)
        html = get_response(url)
        parse_response(html)


if __name__ == '__main__':
    main()
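The parsing step joins three table cells into a string of the form scheme://ip:port without checking them. As a rough sketch of how that step could be made stricter, the helper below validates each field with the standard library before formatting. The name format_proxy is hypothetical and not part of the script above, and restricting the scheme to http and https and the address to IPv4 are assumptions about what the listing contains.

```python
import ipaddress


def format_proxy(scheme, ip, port):
    """Return 'scheme://ip:port', or None if any field looks invalid.

    Hypothetical helper for illustration; not part of the original script.
    """
    scheme = scheme.strip().lower()
    ip = ip.strip()
    # Assumption: only HTTP and HTTPS proxies are of interest.
    if scheme not in ("http", "https"):
        return None
    try:
        ipaddress.IPv4Address(ip)         # raises a ValueError subclass if malformed
        port_number = int(port.strip())   # raises ValueError if not a number
    except ValueError:
        return None
    if not 1 <= port_number <= 65535:
        return None
    return "%s://%s:%d" % (scheme, ip, port_number)
```

Inside the row loop, a None result would simply mean the row is skipped instead of being pushed to Redis.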

After running for a while it had already collected more than 4,200 entries, so the speed is pretty good.
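A count like this can be checked by querying the Redis list directly, which is also how a later stage of the pool would read proxies back. The sketch below is only an illustration: proxy_count and random_proxy are hypothetical names, the client argument is assumed to be a redis-py client such as the one created in the script, and the key is passed in explicitly because it must match whichever key the crawler wrote to.

```python
import random


def proxy_count(client, key):
    """Number of entries currently stored in the Redis list `key`."""
    return client.llen(key)


def random_proxy(client, key):
    """Return one randomly chosen entry from the list as str, or None if empty."""
    size = client.llen(key)
    if size == 0:
        return None
    raw = client.lindex(key, random.randrange(size))
    if raw is None:
        # The list shrank between the two calls.
        return None
    # redis-py returns bytes unless the client was created with decode_responses=True.
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw
```

Because the crawler appends with rpush, re-running it stores duplicates in the list. A Redis set would avoid that, at the cost of losing insertion order.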

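For today I'm only posting the crawling code, because I'm working through this step by step on my own and only publish what I have tested.

I haven't figured out proxy validation yet. Once I have, I'll write that article too. I hope this helps.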
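For readers who want something to experiment with in the meantime, one common approach to validation is to fetch a known URL through each proxy and treat anything other than a clean HTTP 200 as a failure. This is not the validation method of this series, which has not been written yet. It is a minimal sketch using only the standard library, in which the function name proxy_works, the default test URL http://httpbin.org/ip and the 5-second timeout are all assumptions.

```python
import http.client
import urllib.request


def proxy_works(proxy_url, test_url="http://httpbin.org/ip", timeout=5):
    """Return True if `test_url` answers with HTTP 200 when fetched via `proxy_url`.

    Illustrative sketch; the default test URL and timeout are assumptions.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException, ValueError):
        # Covers refused connections, timeouts, HTTP error statuses
        # and malformed responses or URLs.
        return False
```

Free proxies tend to fail often, so a real pool would probably re-run a check like this periodically instead of trusting a single result.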
