Find median using mapreduce
今天在刷面经的时候看到了这个小问题:
Is there a fast algorithm to run on MapReduce framework to find the Median from a huge Integer set?
查阅一些资料后确定:
对于sort或find median这类问题,所有数据最后都需要pass到1个reducer上。
思路一:
用两个map reduce job完成:
- Calculate frequencies of values in your dataset (Word Count job, basically)
- Identity mapper + a reducer which calculates median based on < value - frequency> pairs
step1类似于wordcount,mapper将每个value变为<value,1>的pair后,经过shuffle,value相似的pair分配到同一个recuder上,reducer再统计该reducer上每个value出现的次数。
step1结束后可以,每个value都有一个<value,freq>pair,其中freq表示该value出现的次数。mapper将这些pair都放在一个reducer上后排序,排序完成即可找出median。
思路二:
若已知数据的rough distribution,还有一个可以scale当前算法的改进方法。
- Use a custom partitioner that distributes the keys by range buckets (0-99 go to reducer 0, 100-199 to reducer 2, and so on).
- This will however require some secondary job to examine the reducer outputs and perform the final median calculation (knowing for example the number of keys in each reducer, you can calculate which reducer output will contain the median, and at which offset)
根据已知的rough distribution,可以自定义一个custom partition,在shuffle阶段将value分配到不同reducer,之后根据每个reducer output的size(即有多少key在其中),确定median在哪个reducer的output里以及median在reducer中的offset。
参考:
http://stackoverflow.com/questions/6968215/find-median-of-a-large-integer-set-using-mapreduce
http://stackoverflow.com/questions/10109514/computing-median-in-map-reduce