Spark WordCount and Sorting

2019-11-06  喵星人ZC

1. Read the file

scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").collect
res1: Array[String] = Array(spark       hadoop  hadoop, hive    hbase   hbase, hive     hadoop  hadoop, hive    hadoop  hadoop)
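
For reference, the output above implies that wordcount.txt contains four tab-separated lines, reconstructed here (the separators are tab characters, shown as whitespace):

spark	hadoop	hadoop
hive	hbase	hbase
hive	hadoop	hadoop
hive	hadoop	hadoop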

2. Flatten the data, splitting each line on the tab character

scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).collect
res2: Array[String] = Array(spark, hadoop, hadoop, hive, hbase, hbase, hive, hadoop, hadoop, hive, hadoop, hadoop)

3. Map each word to a count of 1

scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).map((_,1)).collect
res3: Array[(String, Int)] = Array((spark,1), (hadoop,1), (hadoop,1), (hive,1), (hbase,1), (hbase,1), (hive,1), (hadoop,1), (hadoop,1), (hive,1), (hadoop,1), (hadoop,1))

4. Aggregate identical keys (sum the counts for each word)

scala> val result = sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).map((_,1)).reduceByKey(_+_)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[20] at reduceByKey at <console>:24

scala> result.collect
res8: Array[(String, Int)] = Array((hive,3), (spark,1), (hadoop,6), (hbase,2))
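
As a sanity check, steps 1 through 4 can also be chained into a single expression. A minimal sketch of the same pipeline, using the same file path as above:

scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect

This should return the same Array((hive,3), (spark,1), (hadoop,6), (hbase,2)) as res8 above.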

5. Sort by word count, descending

Step 1: swap each (word, count) pair to (count, word)

scala> result.map(x => (x._2,x._1)).collect
res5: Array[(Int, String)] = Array((3,hive), (1,spark), (6,hadoop), (2,hbase))

Step 2: sort by key in descending order

scala> result.map(x => (x._2,x._1)).sortByKey(false).collect
res11: Array[(Int, String)] = Array((6,hadoop), (3,hive), (2,hbase), (1,spark))

Step 3: swap the key and value back, giving the (word, count) format we want

scala> result.map(x => (x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).collect
res12: Array[(String, Int)] = Array((hadoop,6), (hive,3), (hbase,2), (spark,1))
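
Tip: the swap-sort-swap dance can be replaced with a single sortBy call, which orders the RDD by an arbitrary key function. A minimal sketch (sortBy is part of the standard RDD API, but this exact line is not from the original session):

scala> result.sortBy(_._2, false).collect

This should give the same descending Array((hadoop,6), (hive,3), (hbase,2), (spark,1)) as above.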