Some notes on proxy IPs

2019-01-22  silencefun

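A crawler still needs an IP pool behind it.
A quick search turns up plenty of free proxies, but they have to be filtered to find the ones that actually work.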

1. Getting free proxy IPs

https://www.xicidaili.com/nn/


http://www.66ip.cn/nmtq.php?getnum=1


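The first one is an HTML page that has to be parsed.
The second one lets you choose how many IPs to return (the getnum parameter).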

2. Validation

Check whether a normal request succeeds when sent through the proxy.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

/**
 * Tests whether a proxy IP is usable by fetching a page through it.
 *
 * @param ip   proxy host
 * @param port proxy port
 */
public static void createIPAddress(String ip, int port) {
    Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(ip, port)); // HTTP proxy
    String s;
    try {
        URL url = new URL("http://www.baidu.com");
        URLConnection conn = url.openConnection(proxy);
        conn.setConnectTimeout(1000);
        // Added: without a read timeout a stalled proxy can block indefinitely.
        conn.setReadTimeout(3000);
        s = convertStreamToString(conn.getInputStream());
    } catch (IOException e) {
        e.printStackTrace();
        System.err.println("ip " + ip + " is not available"); // bad IP
        return;
    }

    if (s.contains("baidu")) { // working IP
        System.err.println(ip + ":" + port + " is ok");
        CrawlerUtis.appendLog("C:\\Users\\21555\\Desktop\\ip_enable.txt",
                ip + " " + port + "\r\n");
    }
}


public static String convertStreamToString(InputStream is) {
    if (is == null)
        return "";
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the reader and the underlying stream
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(is, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append("\r\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sb.toString();
}

3. Parsing

3.1 Parsing the xicidaili site

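import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static List<String> AnalyIppool() {
    List<String> rows = new ArrayList<>();
    try {
        URL url = new URL("https://www.xicidaili.com/nn/");
        URLConnection connection = url.openConnection();
        // A User-Agent header has to be sent
        connection.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible, MSIE 7.0, Windows NT 5.1, TencentTraveler 4.0)");
        connection.setRequestProperty("Accept-Charset", "UTF-8"); // preferred response encoding
        connection.connect();
        InputStream in = connection.getInputStream();

        Document document = Jsoup.parse(convertStreamToString(in));

        // Note: this only matches rows that carry the "odd" class
        Elements ss = document.getElementsByClass("odd");
        for (Element element : ss) {
            rows.add(element.text());
            AnalyIpAndcheck(element.text());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return rows; // text of every row that was checked
}

private static void AnalyIpAndcheck(String iporign) {
    // The row text is expected to start with "<ip> <port> ..."
    String[] ipp = iporign.split(" ");
    if (ipp.length < 2) {
        return;
    }
    try {
        createIPAddress(ipp[0], Integer.parseInt(ipp[1]));
    } catch (NumberFormatException e) {
        // Skip rows whose second field is not a port number
    }
}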

3.2 The second source returns the data directly from an API
http://www.66ip.cn/nmtq.php?getnum=2000
Request many IPs at once, read the response one line at a time, and use a thread pool to run the checks.
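As a rough illustration of the "read one entry per line" step, here is a minimal sketch that pulls ip:port pairs out of a response body. The class name ProxyLineParser and the sample text are hypothetical, and the exact layout of the 66ip response is an assumption; the regular expression simply picks up anything that looks like a.b.c.d:port, so it does not depend on the separators between entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProxyLineParser {
    // Matches "a.b.c.d:port". Octets are not range-checked in this sketch.
    private static final Pattern IP_PORT =
            Pattern.compile("(\\d{1,3}(?:\\.\\d{1,3}){3}):(\\d{1,5})");

    /** Returns every "ip:port" pair found in the text, in order of appearance. */
    public static List<String> parse(String body) {
        List<String> result = new ArrayList<>();
        Matcher m = IP_PORT.matcher(body);
        while (m.find()) {
            int port = Integer.parseInt(m.group(2));
            if (port > 0 && port <= 65535) { // drop impossible port numbers
                result.add(m.group(1) + ":" + port);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the real response may be laid out differently.
        String sample = "1.2.3.4:8080<br />\r\n5.6.7.8:3128<br />\r\nnot a proxy\r\n";
        System.out.println(parse(sample)); // prints [1.2.3.4:8080, 5.6.7.8:3128]
    }
}
```

Each returned string can then be split on ":" and passed to createIPAddress.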

Key code:

    private static ThreadPoolExecutor executor = new ThreadPoolExecutor(5, 30, 300, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(3),
                new ThreadPoolExecutor.CallerRunsPolicy());
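With these settings the pool starts up to 5 core threads, then queues up to 3 tasks, then grows to at most 30 threads. Once all of that is saturated, CallerRunsPolicy makes the submitting thread run the task itself, which throttles the submit loop instead of dropping work. The sketch below shows one way to submit a batch and wait for it to finish. The class BoundedPoolDemo, the method runAll, and the one-minute wait are hypothetical choices for illustration, and the check callback stands in for createIPAddress.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class BoundedPoolDemo {
    /** Runs check on every entry and returns how many runs have finished. */
    public static int runAll(List<String> entries, Consumer<String> check) {
        // Same parameters as the executor declared above.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                5, 30, 300, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(3),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger done = new AtomicInteger();
        for (String entry : entries) {
            executor.execute(() -> {
                try {
                    check.accept(entry);
                } catch (RuntimeException e) {
                    // One failed check must not stop the rest; with
                    // CallerRunsPolicy it could otherwise abort the submit loop.
                } finally {
                    done.incrementAndGet();
                }
            });
        }
        executor.shutdown(); // accept no new tasks, let queued ones finish
        try {
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            entries.add("10.0.0." + i + ":8080"); // made-up entries
        }
        int finished = runAll(entries, e -> { /* the proxy check would go here */ });
        System.out.println(finished + " checks finished");
    }
}
```

Calling shutdown() followed by awaitTermination() lets the program know when every check is done, which matters if the results are read back from a file afterwards.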

Pass in the parsed list:

public static List<String> AnalyIppool2(String path) {
    // Load the "ip:port" entries into a list. This version reads a local file;
    // the HTTP request could be issued here directly instead.
    List<String> list = CrawlerUtis.filter(path);

    for (String string : list) {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                String[] ip = string.split(":");
                createIPAddress(ip[0], Integer.parseInt(ip[1]));
            }
        });
    }
    return list;
}