Sorting by Key in MapReduce: Example 1

2018-06-21 博弈史密斯

In Hadoop, sorting is at the heart of MapReduce: both MapTask and ReduceTask sort data by key. This is default framework behavior and happens whether or not your business logic needs it.

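The example below computes each account's total net profit (total income - total expenses).

Requirements:

The data in trade_info looks like this. Assume there are several files, each holding similar records,
so that records for one account can come from different map tasks and have to be merged in a reducer: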

zhangsan@163.com    6000    0   2014-02-20
lisi@163.com        2000    0   2014-02-20
lisi@163.com           0  100   2014-02-20
zhangsan@163.com    3000    0   2014-02-20
wangwu@126.com      9000    0   2014-02-20
wangwu@126.com         0  200   2014-02-20

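The columns are account, income, expenses, and date, separated by tabs (the code splits each line on "\t").

We need to compute each account's total income, total expenses, and total net profit (total income - total expenses), and sort the results. The code below sorts by total income and breaks ties by total expenses, which for this data gives the same order as sorting by net profit.

Analysis:

We want output like this (account, total income, total expenses, net profit):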

zhangsan@163.com    9000      0   9000
wangwu@126.com      9000    200   8800
lisi@163.com        2000    100   1900

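First we compute each account's total income, total expenses, and net profit; then we sort by total income.
We wrap account, income, expenses, and net profit in a bean.

map phase
Read each line, wrap the fields in a bean, and emit the account as the key and the bean as the value.
reduce phase
Compute each account's total income, total expenses, and total surplus (net profit).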
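
The aggregation done by this first job can be sketched in plain Java, without Hadoop, to make the map and reduce roles concrete. This is only an in-memory illustration: the class name `SumSketch` and its `summarize` method are made up for this example, and a `TreeMap` stands in for the shuffle grouping records by account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the first job (not Hadoop code).
class SumSketch {

    // Returns one "account \t income \t expenses \t surplus" line per account,
    // ordered by account name.
    static List<String> summarize(List<String> lines) {
        // totals[0] = total income, totals[1] = total expenses.
        Map<String, double[]> totals = new TreeMap<>();
        for (String line : lines) {
            // Map side: parse the tab-separated line; the date field is ignored.
            String[] fields = line.split("\t");
            // Grouping and reduce side: accumulate the sums per account.
            double[] t = totals.computeIfAbsent(fields[0], a -> new double[2]);
            t[0] += Double.parseDouble(fields[1]);
            t[1] += Double.parseDouble(fields[2]);
        }
        // Emit one summary line per account.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, double[]> e : totals.entrySet()) {
            double income = e.getValue()[0];
            double expenses = e.getValue()[1];
            out.add(e.getKey() + "\t" + income + "\t" + expenses + "\t" + (income - expenses));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "zhangsan@163.com\t6000\t0\t2014-02-20",
                "lisi@163.com\t2000\t0\t2014-02-20",
                "lisi@163.com\t0\t100\t2014-02-20",
                "zhangsan@163.com\t3000\t0\t2014-02-20",
                "wangwu@126.com\t9000\t0\t2014-02-20",
                "wangwu@126.com\t0\t200\t2014-02-20");
        for (String line : summarize(input)) {
            System.out.println(line);
        }
    }
}
```

The result is ordered by account name, not by the totals. The real job behaves the same way, because its shuffle sorts by the account key.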

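At this point one MapReduce job is not enough: the shuffle in this job sorts by account, and the totals we want to sort by only exist once the reduce phase has finished. So we add a second job.
This is the idea of chaining MapReduce jobs: when one job cannot do everything, run several in sequence, each reading the previous one's output, until the final result is reached.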

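The second MapReduce job:

map phase:
Read the output file of the first job's reducer, wrap each line in a bean, and emit the bean as the map output key.

Why design it this way?
Once the map phase finishes, the shuffle automatically sorts records by the map output key,
so we can let the shuffle do the sorting, provided the bean implements compareTo.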
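
To see how a key's `compareTo` drives the order, here is a small plain-Java sketch with no Hadoop dependency. `ProfitKey` is a hypothetical stand-in for the bean, and `Collections.sort` plays the role of the shuffle's sort. The ordering used (higher income first, then lower expenses, then account name) is an assumption chosen to match the desired output shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a sortable key (not the article's InfoBean).
class ProfitKey implements Comparable<ProfitKey> {
    final String account;
    final double income;
    final double expenses;

    ProfitKey(String account, double income, double expenses) {
        this.account = account;
        this.income = income;
        this.expenses = expenses;
    }

    @Override
    public int compareTo(ProfitKey o) {
        // Higher income sorts first (descending).
        int byIncome = Double.compare(o.income, income);
        if (byIncome != 0) {
            return byIncome;
        }
        // Same income: lower expenses sort first (ascending).
        int byExpenses = Double.compare(expenses, o.expenses);
        if (byExpenses != 0) {
            return byExpenses;
        }
        // Same totals: order by account so distinct accounts never compare as equal.
        return account.compareTo(o.account);
    }

    // Sorts a copy of the keys and returns the account names in sorted order.
    static List<String> sortedAccounts(List<ProfitKey> keys) {
        List<ProfitKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<String> names = new ArrayList<>();
        for (ProfitKey k : copy) {
            names.add(k.account);
        }
        return names;
    }

    public static void main(String[] args) {
        List<ProfitKey> keys = Arrays.asList(
                new ProfitKey("lisi@163.com", 2000, 100),
                new ProfitKey("wangwu@126.com", 9000, 200),
                new ProfitKey("zhangsan@163.com", 9000, 0));
        // prints [zhangsan@163.com, wangwu@126.com, lisi@163.com]
        System.out.println(sortedAccounts(keys));
    }
}
```

The final tie-breaker on the account matters in a real job too: keys that compare as equal are handed to a single reduce call, so two different accounts with identical totals would otherwise be merged.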

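The map output value is no longer needed here, so we emit NullWritable in its place.

reduce phase:
Take the account field from the map output key and use it as the reduce output key.
Use the map output key (the bean) as the reduce output value.

With these two MapReduce jobs we get the final result.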

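The code:

InfoBean must implement the WritableComparable interface; its compareTo method defines the sort order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class InfoBean implements WritableComparable<InfoBean> {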

    private String account;
    private double income;
    private double expenses;
    private double surplus;
    
    public void set(String account,double income,double expenses){
        this.account = account;
        this.income = income;
        this.expenses = expenses;
        this.surplus = income - expenses;
    }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(account);
        out.writeDouble(income);
        out.writeDouble(expenses);
        out.writeDouble(surplus);
        
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.account = in.readUTF();
        this.income = in.readDouble();
        this.expenses = in.readDouble();
        this.surplus = in.readDouble();
    }

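    @Override
    public int compareTo(InfoBean o) {
        // Higher income first, to match the desired output.
        int byIncome = Double.compare(o.getIncome(), this.income);
        if (byIncome != 0) {
            return byIncome;
        }
        // Same income: lower expenses first.
        int byExpenses = Double.compare(this.expenses, o.getExpenses());
        if (byExpenses != 0) {
            return byExpenses;
        }
        // Same totals: fall back to the account, so 0 is returned only for the
        // same account and distinct accounts are never grouped together.
        return this.account.compareTo(o.getAccount());
    }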

    @Override
    public String toString() {
        return  income + "\t" + expenses + "\t" + surplus;
    }
    public String getAccount() {
        return account;
    }

    public void setAccount(String account) {
        this.account = account;
    }

    public double getIncome() {
        return income;
    }

    public void setIncome(double income) {
        this.income = income;
    }

    public double getExpenses() {
        return expenses;
    }

    public void setExpenses(double expenses) {
        this.expenses = expenses;
    }

    public double getSurplus() {
        return surplus;
    }

    public void setSurplus(double surplus) {
        this.surplus = surplus;
    }
}
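
The `write` and `readFields` methods must handle the fields in the same order, because the serialized form carries no field names. The round trip can be checked with plain `java.io` streams. The sketch below uses a hypothetical `RoundTripSketch` class that copies only the serialization logic, so it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper that mirrors the bean's field order (not Hadoop code).
class RoundTripSketch {

    // Serializes the fields in the same order as InfoBean.write.
    static byte[] write(String account, double income, double expenses) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(account);
            out.writeDouble(income);
            out.writeDouble(expenses);
            out.writeDouble(income - expenses);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in the same order and formats them as one line.
    static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String account = in.readUTF();
            double income = in.readDouble();
            double expenses = in.readDouble();
            double surplus = in.readDouble();
            return account + "\t" + income + "\t" + expenses + "\t" + surplus;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write("wangwu@126.com", 9000, 200);
        System.out.println(read(data));
    }
}
```

If the read order differed from the write order, for example reading a double before the string, the values would be misread or the stream would fail partway through.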

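SumStep aggregates the data in the beans:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumStep {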

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(SumStep.class);
        
        job.setMapperClass(SumMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(InfoBean.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBean.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        // Report job success or failure through the process exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class SumMapper extends Mapper<LongWritable, Text, Text, InfoBean>{

        private InfoBean bean = new InfoBean();
        private Text k = new Text();
        
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
                
            // Split the tab-separated line.
            String line = value.toString();
            String[] fields = line.split("\t");
            // Keep the fields we need; the date (fields[3]) is ignored.
            String account = fields[0];
            double income = Double.parseDouble(fields[1]);
            double expenses = Double.parseDouble(fields[2]);
            k.set(account);
            bean.set(account, income, expenses);
            context.write(k, bean);
        }
    }
    
    public static class SumReducer extends Reducer<Text, InfoBean, Text, InfoBean>{

        private InfoBean bean = new InfoBean();
        
        @Override
        protected void reduce(Text key, Iterable<InfoBean> v2s, Context context)
                throws IOException, InterruptedException {
            
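            double in_sum = 0;
            double out_sum = 0;
            // Named "v" so the loop variable does not shadow the "bean" field.
            for(InfoBean v : v2s){
                in_sum += v.getIncome();
                out_sum += v.getExpenses();
            }
            // The account travels in the key, so the bean's account stays empty;
            // set() also recomputes surplus = income - expenses.
            bean.set("", in_sum, out_sum);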
            context.write(key, bean);
        }
    }
}

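SortStep uses the bean as the mapper's output key and relies on MapReduce's built-in sorting:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortStep {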

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(SortStep.class);
        
        job.setMapperClass(SortMapper.class);
        job.setMapOutputKeyClass(InfoBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        
        job.setReducerClass(SortReducer.class);
        // Use a single reducer so the output is one globally sorted file.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(InfoBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        // Report job success or failure through the process exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
    
    public static class SortMapper extends Mapper<LongWritable, Text, InfoBean, NullWritable>{

        private InfoBean k = new InfoBean();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            // Each input line is first-job output: account, income, expenses, surplus (tab-separated).
            String line = value.toString();
            String[] fields = line.split("\t");
            k.set(fields[0], Double.parseDouble(fields[1]), Double.parseDouble(fields[2]));
            
            context.write(k, NullWritable.get());
        }
    }
    
    public static class SortReducer extends Reducer<InfoBean, NullWritable, Text, InfoBean>{

        private Text k = new Text();
        
        @Override
        protected void reduce(InfoBean key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {

            // Keys arrive already sorted; emit the account as key and the bean as value.
            k.set(key.getAccount());
            
            context.write(k, key);
        }   
    }
}

And with that, we are done.
