Scraping proxy IPs from Xici (西刺)
2018-03-10
两分与桥
I wrote a small crawler that scrapes proxy IPs from Xici. I had meant to add proxy testing and then use the proxies to crawl other sites, but the scraped IPs kept failing and I have not figured out why; I will finish that part next time (see the sketch of such a check at the end of this post).
# -*- coding: UTF-8 -*-
import requests
import re
import random  # used to pick a random element from a list
from multiprocessing import Pool  # imported but never used (no process pool in the end)
from requests.exceptions import RequestException

def get_one_page(url):
    # Download one listing page, sending a desktop-browser User-Agent
    # so the site does not reject the request outright.
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3278.0 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None
The length argument determines how many of the first listing pages to scrape; set it as large as you like.
def get_url_list(url, length=3):
    # Build one URL per listing page by replacing the trailing page
    # number of the seed URL.
    print('proxy ip download the first %d pages:' % length)
    urllist = []
    for i in range(1, length + 1):  # the original range(1, length) skipped the last page
        urllist.append(url[:-1] + str(i))
    return urllist
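
For instance, with the seed URL used in main() below, the helper produces one URL per page:

urls = get_url_list('http://www.xicidaili.com/nn/1', length=3)
# ['http://www.xicidaili.com/nn/1',
#  'http://www.xicidaili.com/nn/2',
#  'http://www.xicidaili.com/nn/3']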
Extract the IP address, port, and location with a regular expression; rows marked HTTPS never match the pattern, so HTTPS proxies are filtered out.
def extractip(html):
    # Returns one (ip, port, location) tuple per table row; anchoring on
    # <td>HTTP</td> at the end is what drops the HTTPS rows.
    pattern = re.compile(r'<td>((?<![\.\d])(?:\d{1,3}\.){3}\d{1,3}(?![\.\d]))'
                         r'</td>.*?<td>(.*?)</td>.*?href.*?>(.*?)</a>.*?<td>HTTP</td>', re.S)
    items = re.findall(pattern, html)
    return items
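
To see what the pattern captures, here is a run against a simplified stand-in for one row of Xici's table (the real markup has more cells, but the pattern only relies on the parts shown here):

sample = ('<tr><td>123.45.67.89</td><td>8080</td>'
          '<td><a href="/jiangsu">Jiangsu</a></td><td>HTTP</td></tr>')
print(extractip(sample))
# [('123.45.67.89', '8080', 'Jiangsu')]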
Pick one proxy at random from the list of scraped IPs.
def rangeip(listip):
    # random.sample returns a one-element list such as [(ip, port, location)];
    # random.choice(listip) would return the bare tuple instead.
    return random.sample(listip, 1)
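
To actually send a request through the chosen proxy, the tuple has to be turned into the proxies mapping that requests expects. A minimal sketch, assuming biglistip is the list built in main() below:

ip, port, location = rangeip(biglistip)[0]
proxies = {'http': 'http://%s:%s' % (ip, port)}
response = requests.get('http://www.baidu.com', proxies=proxies, timeout=5)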
def main():
    biglistip = []
    getipurl = 'http://www.xicidaili.com/nn/1'
    urllist = get_url_list(getipurl)
    for url in urllist:
        html = get_one_page(url)
        if html is None:  # skip pages that failed to download
            continue
        items = extractip(html)
        for item in items:
            biglistip.append(item)
    print(rangeip(biglistip))

if __name__ == '__main__':
    main()
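
As for the proxy testing mentioned at the top, the missing piece is a check that a scraped proxy actually answers. Here is a minimal sketch of such a check; the function name, probe URL, and timeout are my own choices, not part of the original code:

def check_proxy(item):
    # item is one (ip, port, location) tuple from extractip().
    proxies = {'http': 'http://%s:%s' % (item[0], item[1])}
    try:
        # Any plain-HTTP page works as a probe; a short timeout weeds out
        # dead proxies quickly. Free proxies expire fast, which is one
        # plausible reason the scraped IPs kept failing.
        r = requests.get('http://www.baidu.com', proxies=proxies, timeout=5)
        return r.status_code == 200
    except RequestException:
        return False

# keep only the proxies that pass the probe
# goodips = [item for item in biglistip if check_proxy(item)]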