Out of wallpapers? Write an image crawler with Jsoup!

2016-10-23, 阿菜的博客

1. Jsoup

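Jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. (Source: Baidu Baike)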
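
To make that description concrete, here is a minimal sketch that parses an inline HTML string and extracts links with a CSS selector. The HTML snippet, the gallery class, and the JsoupSketch class name are made up for illustration. The jsoup jar is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class JsoupSketch {
    // Hypothetical HTML, used only for this illustration
    static final String HTML =
            "<div class=\"gallery\">"
          + "<a href=\"/a.html\" title=\"Lake\">Lake</a>"
          + "<a href=\"/b.html\" title=\"Hill\">Hill</a>"
          + "</div><a href=\"/other.html\">Other</a>";

    // Return the href of every link nested inside an element with class "gallery"
    static List<String> galleryLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element a : doc.select(".gallery a[href]")) {
            hrefs.add(a.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        System.out.println(galleryLinks(HTML));
    }
}
```

The link outside the gallery container is not selected, which is the same container-then-links pattern the crawler in section 2.2 relies on.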

2. Design / Code

2.1 Target site

The target site is http://www.16sucai.com/tupian/gqfj/3.html, a landscape wallpaper site.

[Figure: the target site]

Each listing page holds 18 album-style links, and the URLs of the listing pages differ only in the page number.
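
Because only the page number changes, the listing URLs can be generated rather than discovered. The sketch below assumes every page follows the pattern of the 3.html URL shown above, which the article only confirms for pages 3 and 4. ListingPages and its method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class ListingPages {
    // Common prefix of the listing URLs, taken from the target site above
    static final String BASE = "http://www.16sucai.com/tupian/gqfj/";

    // Build the listing URL for a single page number
    static String pageUrl(int page) {
        return BASE + page + ".html";
    }

    // Build the listing URLs for pages first..last (both inclusive)
    static List<String> pageUrls(int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int p = first; p <= last; p++) {
            urls.add(pageUrl(p));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls(3, 4)) {
            System.out.println(url);
        }
    }
}
```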

The crawler opens each album and then downloads the images on that album page.
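
Album links on a listing page are usually site-relative paths, so they have to be turned into absolute URLs before they can be requested. One way to do that with the standard library is java.net.URI.resolve, sketched below. The href value /2016/10/12345.html is a hypothetical example, not a real album on the site, and AlbumUrl is an illustrative name.

```java
import java.net.URI;

class AlbumUrl {
    // Resolve an href found on pageUrl into an absolute URL
    static String absolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String listing = "http://www.16sucai.com/tupian/gqfj/3.html";
        // Hypothetical site-relative album link
        System.out.println(absolute(listing, "/2016/10/12345.html"));
        // A link relative to the current directory also resolves correctly
        System.out.println(absolute(listing, "4.html"));
    }
}
```

Compared with concatenating the host name and the href by hand, this also handles hrefs that are already absolute.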

2.2 Code

Main program:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import util.Util;

import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException {
        // Create the root download directory first
        Util.makeDir(Util.picDir);
        // As a test, crawl the wallpapers on listing pages 3 and 4
        for (int i = 3; i < 5; i++) {
            // Fetch and parse the listing page with Jsoup
            Document doc = Jsoup.connect("http://www.16sucai.com/tupian/gqfj/" + i + ".html").get();
            // Select the containers with class vector_listbox
            Elements elementClass = doc.select(".vector_listbox");
            // Inside the containers, select the <a> links that lead to albums
            Elements elements = elementClass.select("a[href~=/[0-9]{4}/[0-9]{2}/.*html]");
            System.out.println(elements.size());
            // Each album link appears twice (on the image and on the text), so take every second element
            for (int j = 0; j < elements.size() / 2; j++) {
                Element e = elements.get(2 * j);
                // Use the link's title attribute as the album folder name
                String filePath = Util.picDir + "//" + e.attr("title");
                Util.makeDir(filePath);
                // Then request the album page
                System.out.println(e.attr("href"));
                Document docInner = Jsoup.connect("http://www.16sucai.com" + e.attr("href")).get();
                // Collect the image URLs on the album page
                Elements elementsClass = docInner.select(".endtext");
                Elements elementsInner = elementsClass.select("img[src^=http://file]");
                System.out.println(elementsInner.size());
                // Download each image
                for (Element eInner : elementsInner) {
                    String picUrl = eInner.attr("src");
                    // Use the last path segment of the URL, without the leading "/", as the file name
                    Util.downloadPic(picUrl, picUrl.substring(picUrl.lastIndexOf("/") + 1), filePath);
                }
            }
        }
    }
}
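
The main program handles the duplicated album links by taking every second element, which silently depends on the links always coming in pairs. A less fragile alternative, sketched below with only the standard library, is to filter hrefs with the same regular expression and drop repeats while preserving order. Jsoup's [attr~=regex] selector matches when the pattern is found anywhere in the attribute value, so the sketch uses Matcher.find() rather than matches(). The sample hrefs and the AlbumLinkFilter name are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class AlbumLinkFilter {
    // Same regular expression as in the a[href~=...] selector of the main program
    static final Pattern ALBUM = Pattern.compile("/[0-9]{4}/[0-9]{2}/.*html");

    // Keep hrefs that contain a match, dropping repeats but preserving first-seen order
    static List<String> uniqueAlbumLinks(List<String> hrefs) {
        Set<String> seen = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (ALBUM.matcher(href).find()) {
                seen.add(href);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Hypothetical hrefs: one album linked twice, one pagination link, one more album
        List<String> hrefs = Arrays.asList(
                "/2016/10/111.html", "/2016/10/111.html",
                "/tupian/gqfj/4.html", "/2016/09/222.html");
        System.out.println(uniqueAlbumLinks(hrefs));
    }
}
```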

Utility class:

// Main imports util.Util, so this class lives in package util
package util;

import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

/**
 * Created by JJS on 2016/10/23.
 */
public class Util {
    // Root directory for the downloaded images
    public static final String picDir = "F://imgs";

    // Create the directory (and any missing parents) if it does not exist yet
    public static void makeDir(String dir) {
        File f = new File(dir);
        if (!f.exists()) {
            f.mkdirs();
        }
    }

    // Download one image from src and save it as dir/fileName
    public static void downloadPic(String src, String fileName, String dir) {
        // Build the URL object
        URL url = null;
        try {
            url = new URL(src);
        } catch (MalformedURLException e) {
            e.printStackTrace();
            return; // nothing to download from a malformed URL
        }
        // Open the URL connection
        URLConnection uri = null;
        try {
            uri = url.openConnection();
        } catch (IOException e) {
            e.printStackTrace();
            return; // without a connection there is no stream to read
        }
        // Get the input stream
        InputStream is = null;
        try {
            is = uri.getInputStream();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // is must be checked for null: if the image URL returns 404, getInputStream()
        // throws and is stays null, and using it unchecked would abort the program
        if (is != null) {
            // Open the output stream
            OutputStream os = null;
            try {
                os = new FileOutputStream(new File(dir, fileName));
                // Copy the image data
                byte[] buf = new byte[1024];
                int l = 0;
                while ((l = is.read(buf)) != -1) {
                    os.write(buf, 0, l);
                }
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                // Close both streams as soon as the download ends
                if (os != null) {
                    try {
                        os.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                try {
                    is.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
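
For comparison, here is a shorter sketch of the same download step using try-with-resources and java.nio.file.Files.copy, available since Java 7. It is an alternative under stated assumptions, not the author's code: DownloadSketch and its download method are illustrative names, and the method returns a boolean instead of printing stack traces. A malformed URL, a dead link such as a 404, and a write error all surface as IOException and are handled in one place.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class DownloadSketch {
    // Download src into dir/fileName; returns true on success, false on any I/O failure
    static boolean download(String src, String fileName, Path dir) {
        try {
            Files.createDirectories(dir);
            // try-with-resources closes the input stream even if the copy fails
            try (InputStream in = new URL(src).openStream()) {
                Files.copy(in, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
            }
            return true;
        } catch (IOException e) {
            // Covers malformed URLs, dead links, and write errors
            System.err.println("Skipping " + src + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate with a local file: URL so that no network access is needed
        Path tmp = Files.createTempDirectory("download-sketch");
        Path source = Files.write(tmp.resolve("source.bin"), new byte[] {1, 2, 3});
        boolean ok = download(source.toUri().toString(), "copy.bin", tmp.resolve("out"));
        System.out.println("downloaded: " + ok);
    }
}
```

Returning a flag lets the caller count failed images and keep going instead of stopping at the first broken link.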

3. Notes

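Use crawlers with care so that your IP does not get banned.
Use Jsoup selectors sensibly; each site has its own structure and needs its own selectors.
Make sure downloaded files and folders do not end up with the same name.
Close the output stream as soon as an image has been downloaded, and do it in a finally block.
After is = uri.getInputStream() runs, check is for null. The image link may be dead, in which case the call throws and is stays null; using it unchecked would end the program with an exception.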
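
The note about duplicate names can be sketched as a small helper. The main program uses the album title as a folder name and the last URL segment as a file name, so two things can go wrong: a title may contain characters that Windows forbids in file names, and two images may share a name. The sketch below replaces forbidden characters and appends a counter on collision. SafeNames, sanitize, and uniquePath are hypothetical names, and the "-1", "-2" suffix scheme is one possible choice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SafeNames {
    // Replace characters that Windows forbids in file names, then trim whitespace
    static String sanitize(String name) {
        String cleaned = name.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }

    // Return dir/name, or dir/name-1, dir/name-2, ... if that path already exists
    static Path uniquePath(Path dir, String name) {
        String safe = sanitize(name);
        int dot = safe.lastIndexOf('.');
        String stem = dot > 0 ? safe.substring(0, dot) : safe;
        String ext = dot > 0 ? safe.substring(dot) : "";
        Path candidate = dir.resolve(safe);
        for (int n = 1; Files.exists(candidate); n++) {
            candidate = dir.resolve(stem + "-" + n + ext);
        }
        return candidate;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("safe-names");
        Files.createFile(dir.resolve("x.jpg"));
        // x.jpg is taken, so the helper proposes a different name in the same folder
        System.out.println(uniquePath(dir, "x.jpg").getFileName());
        System.out.println(sanitize("a/b:c?.jpg"));
    }
}
```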

4. Results

[Figure: crawl results]
