Scraping proxy IPs from Xici (西刺)

2018-03-10  两分与桥

I wrote a small crawler that scrapes proxy IPs from Xici. I had meant to also add proxy testing and then use the proxies to crawl other sites, but the proxy IPs kept turning out to be invalid and I don't know why, so I'll finish that part next time (a rough sketch of such a check is included after the script).

# -*- coding: UTF-8 -*-
import requests
import re
import random  # used to pick a random element from the scraped list
from requests.exceptions import RequestException


def get_one_page(url):
    """Download one listing page; return its HTML, or None on failure."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3278.0 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

length sets how many listing pages of IPs to scrape; you can set length to whatever size you like.

def get_url_list(url, length=3):
    print('proxy ip download the first %d pages:' % length)
    urllist = []
    for i in range(1, length + 1):  # pages 1..length inclusive
        urllist.append(url[:-1] + str(i))
    return urllist
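
For the Xici listing URL used later in main(), the helper just swaps out the trailing page number, so for example:

urls = get_url_list('http://www.xicidaili.com/nn/1', length=3)
# the function first prints: proxy ip download the first 3 pages:
print(urls)
# ['http://www.xicidaili.com/nn/1', 'http://www.xicidaili.com/nn/2', 'http://www.xicidaili.com/nn/3']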

Use a regular expression to extract the IP address, port, and region from each row.
HTTPS proxies are filtered out (the pattern only matches rows whose protocol column reads HTTP).

def extractip(html):
    # group 1: the IP address, group 2: the port, group 3: the region;
    # the trailing <td>HTTP</td> keeps only plain-HTTP proxies
    pattern = re.compile(r'<td>((?<![\.\d])(?:\d{1,3}\.){3}\d{1,3}(?![\.\d]))'
        r'</td>.*?<td>(.*?)</td>.*?href.*?>(.*?)</a>.*?<td>HTTP</td>', re.S)
    items = pattern.findall(html)
    return items
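
As a quick sanity check, running extractip over a hand-written table row (a synthetic fragment modeled on Xici's markup, not real scraped data) shows what the three capture groups yield:

sample_row = '''<tr>
<td>110.73.0.95</td>
<td>8123</td>
<td><a href="/guangxi">Guangxi</a></td>
<td>高匿</td>
<td>HTTP</td>
</tr>'''
print(extractip(sample_row))
# [('110.73.0.95', '8123', 'Guangxi')]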

Randomly pick one IP from the scraped list:

def rangeip(listip):
    # pick one proxy at random; random.choice returns the tuple itself
    return random.choice(listip)


def main():
    biglistip = []
    getipurl = 'http://www.xicidaili.com/nn/1'
    urllist = get_url_list(getipurl)
    for url in urllist:
        html = get_one_page(url)
        if html is None:  # skip pages that failed to download
            continue
        biglistip.extend(extractip(html))
    print(rangeip(biglistip))


if __name__ == '__main__':
    main()
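
The proxy testing mentioned at the top never made it into the script. As a starting point, here is a minimal sketch of such a check; check_proxy is a hypothetical helper (not part of the original script), and it assumes http://httpbin.org/ip as a throwaway test endpoint:

def check_proxy(ip, port, timeout=5):
    # hypothetical helper: route a simple request through the proxy and
    # treat any error or non-200 response as a dead proxy
    proxies = {'http': 'http://%s:%s' % (ip, port)}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except RequestException:
        return False

With that in place, main() could filter biglistip down to proxies that actually answer before picking one, e.g. alive = [p for p in biglistip if check_proxy(p[0], p[1])].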