
The Spark bypass sort shuffle write flow

2019-03-13  LittleMagic

#1 - The o.a.s.shuffle.sort.BypassMergeSortShuffleWriter.write() method


  public void write(Iterator<Product2<K, V>> records) throws IOException {
    assert (partitionWriters == null);

    if (!records.hasNext()) {
      partitionLengths = new long[numPartitions];
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, null);
      mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
      return;
    }

    final SerializerInstance serInstance = serializer.newInstance();
    final long openStartTime = System.nanoTime();

    partitionWriters = new DiskBlockObjectWriter[numPartitions];
    partitionWriterSegments = new FileSegment[numPartitions];

    for (int i = 0; i < numPartitions; i++) {
      final Tuple2<TempShuffleBlockId, File> tempShuffleBlockIdPlusFile =
        blockManager.diskBlockManager().createTempShuffleBlock();
      final File file = tempShuffleBlockIdPlusFile._2();
      final BlockId blockId = tempShuffleBlockIdPlusFile._1();
      partitionWriters[i] =
        blockManager.getDiskWriter(blockId, file, serInstance, fileBufferSize, writeMetrics);
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, and can take a long time in aggregate when we open many files, so should be
    // included in the shuffle write time.
    writeMetrics.incWriteTime(System.nanoTime() - openStartTime);

    while (records.hasNext()) {
      final Product2<K, V> record = records.next();
      final K key = record._1();
      partitionWriters[partitioner.getPartition(key)].write(key, record._2());
    }

    for (int i = 0; i < numPartitions; i++) {
      final DiskBlockObjectWriter writer = partitionWriters[i];
      partitionWriterSegments[i] = writer.commitAndGet();
      writer.close();
    }

    File output = shuffleBlockResolver.getDataFile(shuffleId, mapId);
    File tmp = Utils.tempFileWith(output);
    try {
      //【#2 - merge the many per-partition files written above into the temporary file】
      partitionLengths = writePartitionedFile(tmp);
      shuffleBlockResolver.writeIndexFileAndCommit(shuffleId, mapId, partitionLengths, tmp);
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logger.error("Error while deleting temp file {}", tmp.getAbsolutePath());
      }
    }
    mapStatus = MapStatus$.MODULE$.apply(blockManager.shuffleServerId(), partitionLengths);
  }

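The conditions that trigger the bypass mechanism were covered in an earlier post. As the code shows, the shuffle write flow is greatly simplified under bypass. There is no intermediate buffer like PartitionedAppendOnlyMap (because there is no map-side pre-aggregation), and the data is not sorted at all. The writer simply writes one intermediate data file per partition and then merges them. Since the partition count cannot exceed the threshold spark.shuffle.sort.bypassMergeThreshold, the number of files stays bounded. In practice this approach borrows a good deal from hash shuffle.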
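To make the flow above concrete, here is a minimal, self-contained sketch in plain Java. It is not Spark code: the class name BypassWriteSketch, the use of plain strings as records, and the hash partitioning are all assumptions made for illustration. It only mirrors the shape of the bypass path: one temp file per partition, no buffering map, no sorting, then concatenation in partition order.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical stand-in for the bypass write path; none of these names come from Spark.
final class BypassWriteSketch {

  /** Routes each record to its partition's temp file, then concatenates the files in order. */
  static long[] write(List<String> records, int numPartitions, Path output) throws IOException {
    long[] lengths = new long[numPartitions];
    if (records.isEmpty()) {
      return lengths; // empty input: every partition has length 0
    }

    // Step 1: open one temp file and one writer per partition.
    Path[] files = new Path[numPartitions];
    OutputStream[] writers = new OutputStream[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      files[i] = Files.createTempFile("temp_shuffle_", ".part");
      writers[i] = new BufferedOutputStream(Files.newOutputStream(files[i]));
    }

    // Step 2: write records straight to their partition, in arrival order (no sorting).
    for (String record : records) {
      int partition = Math.floorMod(record.hashCode(), numPartitions);
      writers[partition].write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }
    for (OutputStream writer : writers) {
      writer.close();
    }

    // Step 3: concatenate the per-partition files and remember each partition's length.
    try (OutputStream out = Files.newOutputStream(output)) {
      for (int i = 0; i < numPartitions; i++) {
        lengths[i] = Files.copy(files[i], out);
        Files.delete(files[i]);
      }
    }
    return lengths;
  }
}
```

For the records "a", "b", "c", "d" and two partitions, this sketch puts "b" and "d" in partition 0 and "a" and "c" in partition 1, so the merged file holds partition 0's bytes followed by partition 1's. Error handling is deliberately thinner than in the real writer.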

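"Bypass" means a side path or branch line, which fits the way this writer skips both the buffering and the sorting. Although it may produce more intermediate files along the way than the regular sort shuffle, it wins on small data volume and simple logic, so with a sensibly chosen threshold it is also fast.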
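As a reminder of when this path is taken, the check can be sketched as below. This is my paraphrase of the decision as I understand it from Spark 2.x (no map-side combine, and a partition count that does not exceed spark.shuffle.sort.bypassMergeThreshold, whose default is 200). The class and method names here are hypothetical, not Spark's API.

```java
// Hypothetical mirror of the bypass decision; the names are illustrative only.
final class BypassCondition {

  // Default of spark.shuffle.sort.bypassMergeThreshold.
  static final int DEFAULT_BYPASS_MERGE_THRESHOLD = 200;

  /** Returns true when a shuffle would be eligible for the bypass write path. */
  static boolean shouldBypass(boolean mapSideCombine, int numPartitions, int threshold) {
    if (mapSideCombine) {
      // Map-side pre-aggregation needs the buffering map, so bypass is ruled out.
      return false;
    }
    // Each map task opens one temp file per partition, so the count must stay small.
    return numPartitions <= threshold;
  }
}
```

The threshold matters because every map task holds numPartitions open writers at once. Raising it widens the set of shuffles that skip sorting, at the cost of more simultaneous files and buffers per task.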


#2 - The o.a.s.shuffle.sort.BypassMergeSortShuffleWriter.writePartitionedFile() method

  /**
   * Concatenate all of the per-partition files into a single combined file.
   *
   * @return array of lengths, in bytes, of each partition of the file (used by map output tracker).
   */
  private long[] writePartitionedFile(File outputFile) throws IOException {
    // Track location of the partition starts in the output file
    final long[] lengths = new long[numPartitions];
    if (partitionWriters == null) {
      // We were passed an empty iterator
      return lengths;
    }

    final FileOutputStream out = new FileOutputStream(outputFile, true);
    final long writeStartTime = System.nanoTime();
    boolean threwException = true;
    try {
      for (int i = 0; i < numPartitions; i++) {
        final File file = partitionWriterSegments[i].file();
        if (file.exists()) {
          final FileInputStream in = new FileInputStream(file);
          boolean copyThrewException = true;
          try {
            //【transferToEnabled is the spark.file.transferTo setting, default true: copy via NIO FileChannel.transferTo (zero-copy where the OS supports it); when false, fall back to a traditional blocking stream copy】
            lengths[i] = Utils.copyStream(in, out, false, transferToEnabled);
            copyThrewException = false;
          } finally {
            Closeables.close(in, copyThrewException);
          }
          if (!file.delete()) {
            logger.error("Unable to delete file for partition {}", i);
          }
        }
      }
      threwException = false;
    } finally {
      Closeables.close(out, threwException);
      writeMetrics.incWriteTime(System.nanoTime() - writeStartTime);
    }
    partitionWriters = null;
    return lengths;
  }

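The two copy modes selected by transferToEnabled can be sketched with the JDK alone. This is an illustration under stated assumptions, not Spark's Utils.copyStream: the class ConcatSketch and its method names are hypothetical, and the sketch assumes the source files are not modified while being copied.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;

// Hypothetical helper; it only imitates the shape of writePartitionedFile().
final class ConcatSketch {

  /** NIO path: FileChannel.transferTo lets the OS move bytes without a user-space buffer. */
  static long copyWithTransferTo(File src, FileOutputStream out) throws IOException {
    try (FileInputStream in = new FileInputStream(src)) {
      FileChannel inChannel = in.getChannel();
      FileChannel outChannel = out.getChannel();
      long size = inChannel.size();
      long copied = 0;
      // transferTo may move fewer bytes than requested, so loop until done.
      while (copied < size) {
        copied += inChannel.transferTo(copied, size - copied, outChannel);
      }
      return copied;
    }
  }

  /** Traditional path: read into a byte buffer, then write it out. */
  static long copyWithStreams(File src, OutputStream out) throws IOException {
    try (InputStream in = new FileInputStream(src)) {
      byte[] buffer = new byte[8192];
      long copied = 0;
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
        copied += n;
      }
      return copied;
    }
  }

  /** Appends every existing part file to the output and returns per-part byte counts. */
  static long[] concatenate(File[] parts, File output, boolean transferToEnabled)
      throws IOException {
    long[] lengths = new long[parts.length];
    try (FileOutputStream out = new FileOutputStream(output, true)) {
      for (int i = 0; i < parts.length; i++) {
        if (!parts[i].exists()) {
          continue; // a missing part keeps length 0
        }
        lengths[i] = transferToEnabled
            ? copyWithTransferTo(parts[i], out)
            : copyWithStreams(parts[i], out);
      }
    }
    return lengths;
  }
}
```

Both modes produce the same bytes and the same lengths array. The difference is only in how the bytes travel, which is why the setting can be flipped without affecting the shuffle output.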

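The lengths array returned by writePartitionedFile() is what writeIndexFileAndCommit() receives. As far as I know, the index file stores cumulative offsets (numPartitions + 1 values, starting at 0), so that a reducer can locate partition i between offsets i and i + 1. The conversion is sketched below. The class IndexOffsets is a hypothetical name of my own, and the sketch leaves out the actual file writing.

```java
// Hypothetical helper showing how per-partition lengths become index offsets.
final class IndexOffsets {

  /** Turns lengths [l0, l1, ...] into offsets [0, l0, l0 + l1, ...]. */
  static long[] toOffsets(long[] lengths) {
    long[] offsets = new long[lengths.length + 1];
    for (int i = 0; i < lengths.length; i++) {
      offsets[i + 1] = offsets[i] + lengths[i];
    }
    return offsets;
  }
}
```

For lengths {4, 0, 6} this gives offsets {0, 4, 4, 10}: the empty partition 1 appears as two equal neighboring offsets.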
[Figure: simplified diagram of the bypass shuffle write flow]
