learn spark

2016-09-07 Codlife

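Source: Spark source code
1: Default number of tasks for Spark input data
Answer: it depends on the case:
RDD:
hadoopFile: the task count comes from the computed input splits; the call is passed a parallelism-derived argument (minPartitions, which defaults to min(defaultParallelism, 2))
sc.parallelize(): defaults to spark.default.parallelism, which, when not set explicitly, falls back to:
Local mode: number of cores on the local machine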


Mesos fine grained mode: 8


Others: total number of cores on all executor nodes or 2, whichever is larger
This case also covers YARN, because YarnSchedulerBackend extends CoarseGrainedSchedulerBackend.


Dataset (used heavily in Spark 2.0):
ExecutedCommandExec
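
The RDD fallback rules listed above can be sketched as a small lookup. This is a plain-Python model for illustration only, not Spark's actual code: the function name default_parallelism, the mode strings, and the configured parameter (standing in for an explicitly set spark.default.parallelism) are all assumptions made up for this example.

```python
from typing import Optional


def default_parallelism(mode: str, total_cores: int,
                        configured: Optional[int] = None) -> int:
    """Hypothetical model of the default used by sc.parallelize().

    mode        -- "local", "mesos-fine-grained", or anything else
                   (treated as a coarse-grained backend such as YARN)
    total_cores -- local cores, or total cores across all executors
    configured  -- value of spark.default.parallelism, if set
    """
    if configured is not None:
        # An explicit setting always wins over the fallbacks.
        return configured
    if mode == "local":
        return total_cores
    if mode == "mesos-fine-grained":
        return 8
    # Others: total executor cores or 2, whichever is larger.
    return max(total_cores, 2)


if __name__ == "__main__":
    for mode, cores in [("local", 4), ("mesos-fine-grained", 16), ("yarn", 1)]:
        print(mode, cores, default_parallelism(mode, cores))
```

On a real cluster, the value in effect can be read from sc.defaultParallelism instead of being derived by hand.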

2: Use groupByKey with caution; it can cause OOM

Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
It’s recommended to use PairRDDFunctions.aggregateByKey
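
To see why the note above matters, here is a minimal plain-Python sketch of the two strategies. It does not use Spark; group_by_key and aggregate_by_key are hypothetical helpers that only mimic the shape of the corresponding PairRDDFunctions methods, and each inner list stands in for one partition. Grouping keeps every value of a key in memory, while aggregating keeps a single accumulator per key.

```python
from collections import defaultdict


def group_by_key(partitions):
    """Collect every value per key: memory grows with the values per key."""
    groups = defaultdict(list)
    for part in partitions:
        for key, value in part:
            groups[key].append(value)
    return dict(groups)


def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Keep one accumulator per key: memory grows with the number of keys.

    zero must be immutable here (e.g. an int), since it is reused for
    every key in this simplified model.
    """
    merged = {}
    for part in partitions:
        local = {}
        for key, value in part:
            # seq_op folds one value into the per-partition accumulator.
            local[key] = seq_op(local.get(key, zero), value)
        for key, acc in local.items():
            # comb_op merges accumulators coming from different partitions.
            merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged


if __name__ == "__main__":
    data = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
    grouped_sums = {k: sum(vs) for k, vs in group_by_key(data).items()}
    direct_sums = aggregate_by_key(data, 0,
                                   lambda acc, v: acc + v,
                                   lambda x, y: x + y)
    print(grouped_sums == direct_sums)
```

Both paths give the same per-key sums; the difference is that the grouped version has to hold the full value list of the largest key at once.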
