coalease 和 repartition的区别

2018-06-08 本文已影响0人 pcqlegend

coalesce 英文翻译是联合合并

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
      : RDD[T] = withScope

官方解释：

返回一个被reduce 成 numPartitions(入参是numPartitions) 数量的partition的新的RDD
这将生成一个窄依赖。举个例子，如果你要将一个RDD从1000个分区转化到100个分区，并不会产生shuffle操作，100个新partition的任何一个需要当前的10个分区。
但是如果你要进行的是一个比较极端的coalesce,比如设置numPartitions = 1,与你希望的不同，这将会导致计算发生在少数的几个节点上（比如 numPartitions = 1的话就只有一个节点）。为了避免这个情况，你可以设置参数传入 shuffle = true，这将会增加一个shuffle的操作，这就意味着，当前的分区将会并行的计算。
注意：设置shuffle = true, 可以合并成一个数量很多的partition，比如你有大量的分区，比如100，但是有一些分区的数据量非常大。调用 coalesce(1000, shuffle = true) 则会使用默认的hash分区器，将所有的分区分发成1000个分区。

/**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

repartition

创建指定数量分区的RDD，可以增加或者减少这个RDD的并行度，内部就是对数据进行shuffle操作。如果你想要减少这个分区的分区数，你可以考虑使用coalesce 方法，它可以避免一个shuffle操作。

所以 repartition 就是shuffle为true的coalesce

coalease 和 repartition的区别

coalesce 英文翻译是联合合并

官方解释：

repartition

猜你喜欢

热点阅读

coalease 和 repartition的区别

coalesce 英文翻译是联合 合并

官方解释：

repartition

猜你喜欢

热点阅读

coalesce 英文翻译是联合合并