Hadoop Learning: A Search Engine Case Study Using a Web Crawler, Word Segmentation, and an Inverted Index
Reposted from: http://blog.csdn.net/tanggao1314/article/details/51382382
This project works as follows: write a web crawler that scrapes news headlines from sites such as Sohu, Baidu, and NetEase, then upload each headline and its link to HDFS as a separate file (one file per headline/link pair). Each headline is then segmented into words, and an inverted index is built from the tokens to provide basic search-engine functionality.
My Java has been rusty for over a year and I could not follow some parts of the original article, so I modified the original code in a few places.
First, write the web crawler yourself.
The download utility class DownLoadTool.java
Its job is to download a page's HTML source.
package spider;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class DownLoadTool {
    public String downLoadUrl(final String addr) {
        StringBuffer sb = new StringBuffer();
        try {
            URL url;
            if (!addr.startsWith("http://")) {
                url = new URL("http://" + addr);
            } else {
                url = new URL(addr);
            }
            // open an HTTP connection
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            // connection timeout in milliseconds
            con.setConnectTimeout(5000);
            con.connect();
            if (con.getResponseCode() == 200) { // status code 200 means success
                BufferedInputStream bis = new BufferedInputStream(con.getInputStream());
                Scanner sc = new Scanner(bis, "utf-8");
                while (sc.hasNextLine()) {
                    sb.append(sc.nextLine());
                }
                sc.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}
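As a quick smoke test (my addition, not from the original article), the downloader can be called on its own; the hypothetical class below just fetches one portal's front page and prints how many characters of HTML came back.

package spider;

// Hypothetical smoke test for DownLoadTool: fetch one page and report how much HTML was returned.
public class DownLoadToolDemo {
    public static void main(String[] args) {
        DownLoadTool dlt = new DownLoadTool();
        String html = dlt.downLoadUrl("http://news.baidu.com/");
        System.out.println("downloaded " + html.length() + " characters");
    }
}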
The link-matching class ArticleDownLoad.java
It extracts the link and the title text from <a></a> tags.
A typical <a></a> tag looks like this:
<a href="http://news.ifeng.com/a/20171120/53406355_0.shtml?_zbs_baidu_news" mon="ct=1&a=2&c=top&pn=1"target="_blank"> 国资划转社保:让社保可持续的重大举措 </a>
I made a small change to the regular expression from the original article.
The regular expression <a+[^<>]*\\s+href=\"?(http[^<>\"]*)\"[^>]*>([^<]*)</a>
captures the link and the title text. Here is the code:
package spider;

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArticleDownLoad {
    static String ARTICLE_URL = "<a+[^<>]*\\s+href=\"?(http[^<>\"]*)\"[^>]*>([^<]*)</a>";

    public Set<String> getLink(String html) {
        Set<String> result = new HashSet<String>();
        // compile the regular expression into a Pattern
        // CASE_INSENSITIVE: by default, case-insensitive matching applies only to US-ASCII characters
        Pattern p = Pattern.compile(ARTICLE_URL, Pattern.CASE_INSENSITIVE);
        // create a matcher for the page source
        Matcher matcher = p.matcher(html);
        while (matcher.find()) {
            // skip short section labels such as "时政要闻", "台湾", "港澳"
            if (matcher.group(2).length() < 7) {
                continue;
            }
            result.add(matcher.group(1) + "\t" + matcher.group(2));
        }
        return result;
    }
}
The extra length check added here filters out short section labels such as "时政要闻", "台湾", and "港澳"; real article titles are generally longer than 7 characters.
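As a quick check (my addition, not from the original article), getLink() can be run against the sample anchor quoted above; the hypothetical class below prints whatever the regular expression captures.

package spider;

import java.util.Set;

// Hypothetical test of ArticleDownLoad.getLink() against the sample anchor quoted earlier.
public class ArticleDownLoadDemo {
    public static void main(String[] args) {
        String html = "<a href=\"http://news.ifeng.com/a/20171120/53406355_0.shtml?_zbs_baidu_news\""
                + " mon=\"ct=1&a=2&c=top&pn=1\" target=\"_blank\"> 国资划转社保:让社保可持续的重大举措 </a>";
        Set<String> links = new ArticleDownLoad().getLink(html);
        for (String s : links) {
            // expected form: link + Tab + title
            System.out.println(s);
        }
    }
}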
The crawler class Spider.java
package spider;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Spider {
    private DownLoadTool dlt = new DownLoadTool();
    private ArticleDownLoad adl = new ArticleDownLoad();

    public void crawling(String url) {
        try {
            Configuration conf = new Configuration();
            URI uri = new URI("hdfs://192.168.118.129:9000");
            FileSystem hdfs = FileSystem.get(uri, conf);
            String html = dlt.downLoadUrl(url);
            Set<String> allneed = adl.getLink(html);
            for (String addr : allneed) {
                // one HDFS file per link, named with the current timestamp
                Path p1 = new Path("/mySpider/spider/" + System.currentTimeMillis());
                FSDataOutputStream dos = hdfs.create(p1);
                String value = addr + "\n";
                // write the "link \t title" line into the file and close it so the data is flushed
                dos.write(value.getBytes());
                dos.close();
            }
        } catch (IllegalArgumentException | URISyntaxException | IOException e) {
            e.printStackTrace();
        }
    }
}
Test the code:
public static void main(String[] args) {
    String[] url = {
            "http://news.baidu.com/",
            "http://news.sohu.com/",
            "http://news.sina.com.cn/",
            "http://news.163.com/",
            "http://www.people.com.cn/",
            "http://news.cctv.com/"};
    for (String news : url) {
        Spider spider = new Spider();
        spider.crawling(news);
    }
}
Running results:
The crawled content has been uploaded to HDFS.
As you can see, each file holds one line of the form "link + Tab + title".
Word segmentation and building the inverted index
Word segmentation uses the IK Analyzer package; see the original post for the download link.
I split the original article's AriticleInvertedIndex class into Map, Red, and MR classes. The Combiner class seemed redundant to me (or perhaps I simply failed to see its point), so I left it out.
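Before the MapReduce code, here is a minimal sketch (my addition, not from the original article) of what IK Analyzer does with a headline; it uses only the IKAnalyzer/TokenStream calls that also appear in the Map class below and prints one token per line.

package index.inverted;

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

// Hypothetical standalone demo: tokenize one headline with IK Analyzer in smart mode
// (the same "new IKAnalyzer(true)" used in the Map class) and print each token.
public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer(true);
        TokenStream ts = analyzer.tokenStream("field",
                new StringReader("国资划转社保:让社保可持续的重大举措"));
        CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // one token per line
        }
        ts.end();
        ts.close();
    }
}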
The Map class
package index.inverted;

import java.io.IOException;
import java.io.StringReader;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class Map extends Mapper<Text, Text, Text, Text> {
    private Text OutKey = new Text();
    private Text OutValue = new Text();

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // key:   http://news.ifeng.com/a/20171120/53406355_0.shtml?_zbs_baidu_news
        // value: 国资划转社保:让社保可持续的重大举措
        @SuppressWarnings("resource")
        Analyzer analyzer = new IKAnalyzer(true);
        TokenStream ts = analyzer.tokenStream("field", new StringReader(value.toString().trim()));
        CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
        try {
            ts.reset();
            while (ts.incrementToken()) {
                // emit <token, link> for every token in the title
                OutKey.set(term.toString());
                OutValue.set(key);
                context.write(OutKey, OutValue);
            }
            ts.end();
        } finally {
            ts.close();
        }
    }
}
The Red class
package index.inverted;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Red extends Reducer<Text, Text, Text, Text> {
    private Text OutKey = new Text();
    private Text OutValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // key:   社保
        // value: http://news.ifeng.com/a/20171120/53406355_0.shtml?_zbs_baidu_news
        int sum = 0;
        String listString = new String();
        // count how many distinct links contain this term, skipping duplicates
        for (Text value : values) {
            if (listString.indexOf(value.toString()) == -1) {
                listString = value.toString() + ";\t" + listString;
                sum++;
            }
        }
        OutKey.set(key + ":" + String.valueOf(sum));
        OutValue.set(listString);
        context.write(OutKey, OutValue);
    }
}
The MR class (job driver)
package index.inverted;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import index.inverted.Map;
import index.inverted.Red;

public class MR {
    private static String hdfsPath = "hdfs://192.168.118.129:9000";
    private static String inPath = "/mySpider/spider/";
    private static String outPath = "/mySpider/outIndex/";

    public int run() {
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "index.inverted");
            job.setJarByClass(MR.class);
            // each crawled file holds one "link \t title" line;
            // KeyValueTextInputFormat splits it at the first Tab into key (link) and value (title)
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setMapperClass(Map.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setReducerClass(Red.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(hdfsPath + inPath));
            FileOutputFormat.setOutputPath(job, new Path(hdfsPath + outPath + System.currentTimeMillis() + "/"));
            return job.waitForCompletion(true) ? 1 : -1;
        } catch (IllegalStateException | IllegalArgumentException | ClassNotFoundException
                | IOException | InterruptedException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
The test class:
package test;

import index.inverted.MR;
import spider.Spider;

public class test {
    public static void main(String[] args) {
        String[] url = {
                "http://news.baidu.com/",
                "http://news.sohu.com/",
                "http://news.sina.com.cn/",
                "http://news.163.com/",
                "http://www.people.com.cn/",
                "http://news.cctv.com/"};
        for (String news : url) {
            Spider spider = new Spider();
            spider.crawling(news);
        }
        MR mr = new MR();
        if (mr.run() == 1) {
            System.out.println("success");
        } else {
            System.out.println("fail");
        }
    }
}
Let's look at the running results:
As you can see, every output line has the form "term:count + Tab + link; link; …".
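To actually use this as a search engine, one could scan the reducer output for a query term. The sketch below is my addition, not part of the original article; the part-r-00000 file name and the timestamp directory are assumptions, so substitute whatever directory the job actually wrote.

package test;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical lookup over the inverted-index output: print every line whose
// "term:count" prefix starts with the query term.
public class Search {
    public static void main(String[] args) throws Exception {
        String term = "社保";
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(new URI("hdfs://192.168.118.129:9000"), conf);
        // hypothetical output path; use the timestamp directory the MR job actually created
        Path indexFile = new Path("/mySpider/outIndex/1511500000000/part-r-00000");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(hdfs.open(indexFile), "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // output lines look like: 词语:次数 \t link1;  link2; ...
                if (line.startsWith(term + ":")) {
                    System.out.println(line);
                }
            }
        }
    }
}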