The difference between group by and distinct for deduplication
2018-08-23
scottzcw
sql1: select count(distinct sellno) from xxx;
sql2: select count(sellno) from
(select sellno from xxx
group by sellno) t;
sql1 execution:
Stage-Stage-1: Map: 396 Reduce: 1 Cumulative CPU: 7915.67 sec HDFS Read: 119072894175 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 2 hours 11 minutes 55 seconds 670 msec
sql2 execution:
Stage-Stage-1: Map: 396 Reduce: 457 Cumulative CPU: 10056.7 sec HDFS Read: 119074266583 HDFS Write: 53469 SUCCESS
Stage-Stage-2: Map: 177 Reduce: 1 Cumulative CPU: 280.22 sec HDFS Read: 472596 HDFS Write: 10 SUCCESS
Total MapReduce CPU Time Spent: 0 days 2 hours 52 minutes 16 seconds 920 msec
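Summary: count(distinct) shuffles all of the data to a single reducer (Reduce: 1 for sql1), whereas group by spreads the deduplication across many reducers on multiple machines (Reduce: 457 in Stage-1 of sql2), leaving only the final count over the already-deduplicated keys to a single reducer. In these logs the group by version uses more cumulative CPU (2 hours 52 minutes versus 2 hours 11 minutes) because of the extra stage, but the heavy work runs in parallel instead of queuing on one reducer, which typically means a shorter elapsed time on large inputs. In that sense group by is the more efficient way to deduplicate here.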
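The difference in reducer load can be illustrated with a small Python simulation. This is only a toy sketch, not Hive's actual execution plan: the function names, the integer sellno values, and the modulo partitioner are assumptions made for illustration, and map-side partial aggregation is ignored.

```python
from collections import defaultdict


def count_distinct_single_reducer(rows):
    """Model of sql1: every row is sent to one reducer, which deduplicates."""
    reducer_load = {0: 0}
    seen = set()
    for key in rows:
        reducer_load[0] += 1  # the only reducer receives every row
        seen.add(key)
    return len(seen), reducer_load


def count_via_group_by(rows, num_reducers):
    """Model of sql2: stage 1 deduplicates per partition, stage 2 counts."""
    reducer_load = defaultdict(int)
    groups = defaultdict(set)
    for key in rows:
        # Hypothetical partitioner: equal keys always land on the same reducer.
        r = key % num_reducers
        reducer_load[r] += 1
        groups[r].add(key)
    # Stage 2: a single reducer only has to count the deduplicated keys.
    total = sum(len(keys) for keys in groups.values())
    return total, dict(reducer_load)


if __name__ == "__main__":
    # 10,000 rows containing 1,000 distinct sellno values (made-up data).
    rows = [i % 1000 for i in range(10000)]
    print(count_distinct_single_reducer(rows))
    print(count_via_group_by(rows, num_reducers=4))
```

Both functions return the same distinct count, but in the first one a single reducer handles every row, while in the second the rows are divided among the reducers and the final stage only sees one entry per distinct key.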