Tackling Algorithms Head-On: Arrays.sort() in Java (Part 1)

2020-02-07 VincentPeng

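Lately I have been reading up on sorting algorithms: from the O(n²) ones (bubble, insertion, selection) to the O(n log n) ones (merge sort, quicksort), and on to bucket sort, counting sort, and radix sort. Each has its strengths and weaknesses, so how does the JDK implement this kind of low-level algorithm? After going through a few blog posts, here is a summary of what I learned. One thing is clear up front: Arrays.sort() in Java does not rely on a single algorithm; it picks whichever one it considers suitable for the state of the data.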

In this article

1. The flow Arrays.sort() follows when sorting data
2. The sorting algorithms involved in sort (insertion sort, quicksort, merge sort, and so on)

I. The Arrays.sort() sorting flow in Java

The method looks like this:

    // Using int[] sorting as the only example
    public static void sort(int[] a) {
        // The package-private algorithm invoked here is DualPivotQuicksort (dual-pivot quicksort)
        DualPivotQuicksort.sort(a, 0, a.length - 1, null, 0, 0);
    }

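From the call we can see that it delegates to DualPivotQuicksort.sort().
So does Arrays.sort simply always use dual-pivot quicksort?
The answer is no. I'll put the flowchart I worked out first and then go through how each case is decided.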

[Figure: Arrays.sort processing flow (flowchart)]

First, when right - left is below QUICKSORT_THRESHOLD (286), execution goes straight into private static void sort(int[] a, int left, int right, boolean leftmost), which we will take apart later.

      // Use Quicksort on small arrays
        if (right - left < QUICKSORT_THRESHOLD) {
            sort(a, left, right, true);
            return;
        }

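When the range is larger than that, the code checks how disordered the data is. There are four possible outcomes:
① A stretch of consecutive equal values exhausts the MAX_RUN_LENGTH (33) countdown: private static void sort(int[] a, int left, int right, boolean leftmost) is called directly;
② The number of runs reaches MAX_RUN_COUNT (67): private static void sort(int[] a, int left, int right, boolean leftmost) is also called directly. Here count is the number of slots of run[] in use, and each slot marks one stretch of data that is already ordered, either ascending or descending;
③ The number of runs stays below 67: the loop finishes and the merge phase follows;
④ The data is already fully sorted: the loop finishes and sorting ends right after it.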

       int[] run = new int[MAX_RUN_COUNT + 1];
       int count = 0; run[0] = left;
// Check if the array is nearly sorted
       for (int k = left; k < right; run[count] = k) {
           if (a[k] < a[k + 1]) { // ascending
               while (++k <= right && a[k - 1] <= a[k]);
           } else if (a[k] > a[k + 1]) { // descending
               while (++k <= right && a[k - 1] >= a[k]);
               for (int lo = run[count] - 1, hi = k; ++lo < --hi; ) {
                   int t = a[lo]; a[lo] = a[hi]; a[hi] = t;
               }
           } else { // equal
               for (int m = MAX_RUN_LENGTH; ++k <= right && a[k - 1] == a[k]; ) {
                   if (--m == 0) {
                       sort(a, left, right, true);
                       return;
                   }
               }
           }

           /*
            * The array is not highly structured,
            * use Quicksort instead of merge sort.
            */
           if (++count == MAX_RUN_COUNT) {
               sort(a, left, right, true);
               return;
           }
       }

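Let's look at this check code.

int[] run: the array of run boundaries; run[i] holds the start index of the i-th sorted run
k: the cursor that marks the current element
count: the number of runs recorded in run[] so far
int m: countdown of how many more consecutive equal elements are tolerated
This code shows that both how ordered the array is and how similar its elements are affect which algorithm gets used.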
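To make the role of run[] concrete, here is a minimal sketch that records the same boundaries for a small array. It is a simplified illustration rather than the JDK code: the class and method names are made up, it does not reverse descending runs in place, and it leaves out the MAX_RUN_LENGTH and MAX_RUN_COUNT bail-outs.

```java
import java.util.ArrayList;
import java.util.List;

final class RunBoundariesSketch {

    /**
     * Returns the start index of every monotone run in a, followed by the
     * index at which the scan stopped (the same values the JDK keeps in run[]).
     */
    static List<Integer> runBoundaries(int[] a) {
        List<Integer> bounds = new ArrayList<>();
        bounds.add(0);                        // mirrors run[0] = left
        int right = a.length - 1;
        int k = 0;
        while (k < right) {
            if (a[k] < a[k + 1]) {            // ascending run
                while (++k <= right && a[k - 1] <= a[k]) { }
            } else if (a[k] > a[k + 1]) {     // descending run (the JDK also reverses it)
                while (++k <= right && a[k - 1] >= a[k]) { }
            } else {                          // equal values (the JDK caps this length)
                while (++k <= right && a[k - 1] == a[k]) { }
            }
            bounds.add(k);                    // mirrors run[count] = k
        }
        return bounds;
    }

    public static void main(String[] args) {
        System.out.println(runBoundaries(new int[] {1, 2, 3, 2, 1, 5, 6})); // [0, 3, 5, 7]
        System.out.println(runBoundaries(new int[] {3, 2, 1, 4}));          // [0, 3]
    }
}
```

In the second call the last boundary equals the last index (3), which is the "last run contains one element" situation that the JDK handles as a special case after the scan.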

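If the code above finishes without hitting any of its return statements, the code below runs. It either sets things up for the merge phase or returns immediately because the data is already fully sorted.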

        // Check special cases
        // Implementation note: variable "right" is increased by 1.
        if (run[count] == right++) { // The last run contains one element
            run[++count] = right;
        } else if (count == 1) { // The array is already sorted
            return;
        }

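This code either returns immediately because the array is fully sorted, or, when the last run consists of a single element, records that run before moving on.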

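The code that follows is the core of the (optimized) merge sort in Arrays.sort. I won't expand on it here; the basic sorting algorithms involved will be covered in separate articles, and this one only traces the dispatch logic.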
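The top-level dispatch described so far can be condensed into one small method. This is a sketch under stated assumptions, not the JDK implementation: DispatchSketch, Path and choose are hypothetical names, the method only reports which branch would be taken instead of sorting, and QUICKSORT_FAMILY stands for the call to sort(a, left, right, true).

```java
final class DispatchSketch {
    static final int QUICKSORT_THRESHOLD = 286;
    static final int MAX_RUN_COUNT = 67;
    static final int MAX_RUN_LENGTH = 33;

    enum Path { QUICKSORT_FAMILY, ALREADY_SORTED, MERGE_SORT }

    /** Reports which branch the top-level sort would take for the whole array. */
    static Path choose(int[] a) {
        int right = a.length - 1;             // left is 0 in this sketch
        if (right < QUICKSORT_THRESHOLD) {
            return Path.QUICKSORT_FAMILY;     // small range: no run scan at all
        }
        int count = 0;
        int k = 0;
        while (k < right) {
            if (a[k] < a[k + 1]) {            // ascending run
                while (++k <= right && a[k - 1] <= a[k]) { }
            } else if (a[k] > a[k + 1]) {     // descending run (reversed by the real code)
                while (++k <= right && a[k - 1] >= a[k]) { }
            } else {                          // equal values
                int m = MAX_RUN_LENGTH;
                while (++k <= right && a[k - 1] == a[k]) {
                    if (--m == 0) {
                        return Path.QUICKSORT_FAMILY;   // too many equal values in a row
                    }
                }
            }
            if (++count == MAX_RUN_COUNT) {
                return Path.QUICKSORT_FAMILY;           // not structured enough
            }
        }
        // k now plays the role of run[count]; k == right means a one-element last run.
        if (k != right && count == 1) {
            return Path.ALREADY_SORTED;       // a single run covers the whole array
        }
        return Path.MERGE_SORT;
    }
}
```

For example, a 1000-element ascending array yields ALREADY_SORTED, 1000 zeros yield QUICKSORT_FAMILY, and two ascending halves glued together yield MERGE_SORT.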

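The private static void sort(int[] a, int left, int right, boolean leftmost) method called earlier in turn ends up as dual-pivot quicksort, single-pivot quicksort, traditional insertion sort, or pair insertion sort, depending on the data.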

// leftmost indicates whether this range is the leftmost part of the array, i.e. whether it starts at the beginning
private static void sort(int[] a, int left, int right, boolean leftmost)

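The first thing that catches the eye is a group of insertion sorts. I call it a group because the leftmost part is sorted with traditional insertion sort, while any other (non-leftmost) part is sorted with pair insertion sort.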

 // Use insertion sort on tiny arrays
        if (length < INSERTION_SORT_THRESHOLD) {
            if (leftmost) {
                /*
                 * Traditional (without sentinel) insertion sort,
                 * optimized for server VM, is used in case of
                 * the leftmost part.
                 */
                for (int i = left, j = i; i < right; j = ++i) {
                    int ai = a[i + 1];
                    while (ai < a[j]) {
                        a[j + 1] = a[j];
                        if (j-- == left) {
                            break;
                        }
                    }
                    a[j + 1] = ai;
                }
            } else {
                /*
                 * Skip the longest ascending sequence.
                 */
                do {
                    if (left >= right) {
                        return;
                    }
                } while (a[++left] >= a[left - 1]);

                /*
                 * Every element from adjoining part plays the role
                 * of sentinel, therefore this allows us to avoid the
                 * left range check on each iteration. Moreover, we use
                 * the more optimized algorithm, so called pair insertion
                 * sort, which is faster (in the context of Quicksort)
                 * than traditional implementation of insertion sort.
                 */
                for (int k = left; ++left <= right; k = ++left) {
                    int a1 = a[k], a2 = a[left];

                    if (a1 < a2) {
                        a2 = a1; a1 = a[left];
                    }
                    while (a1 < a[--k]) {
                        a[k + 2] = a[k];
                    }
                    a[++k + 1] = a1;

                    while (a2 < a[--k]) {
                        a[k + 1] = a[k];
                    }
                    a[k + 1] = a2;
                }
                int last = a[right];

                while (last < a[--right]) {
                    a[right + 1] = a[right];
                }
                a[right + 1] = last;
            }
            return;
        }
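The pair insertion branch is hard to read because it relies on the elements to the left of the range acting as sentinels. The sketch below shows the same idea (insert two elements per pass, the larger one first) with explicit bounds checks so that it runs on its own. PairInsertionSketch and pairInsertionSort are illustrative names, and this is a simplified variant rather than the JDK code.

```java
final class PairInsertionSketch {

    /** Sorts a[left..right] (inclusive) by inserting two elements per pass. */
    static void pairInsertionSort(int[] a, int left, int right) {
        int i = left + 1;                     // a[left..i-1] is the sorted prefix
        for (; i < right; i += 2) {
            int big = a[i];
            int small = a[i + 1];
            if (big < small) {                // make sure big >= small
                int t = big; big = small; small = t;
            }
            int k = i - 1;
            // Shift everything greater than big two slots to the right.
            while (k >= left && a[k] > big) {
                a[k + 2] = a[k];
                k--;
            }
            a[k + 2] = big;
            // Continue from the same position: shift by one slot for small.
            while (k >= left && a[k] > small) {
                a[k + 1] = a[k];
                k--;
            }
            a[k + 1] = small;
        }
        if (i == right) {                     // odd element left over at the end
            int last = a[right];
            int k = right - 1;
            while (k >= left && a[k] > last) {
                a[k + 1] = a[k];
                k--;
            }
            a[k + 1] = last;
        }
    }
}
```

Because small is never greater than big, the second scan can resume where the first one stopped instead of starting over, which is where the saving over inserting the two elements independently comes from.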

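When the size and shape of the data call for quicksort, five evenly spaced elements are sampled from the range and sorted. If no two adjacent samples are equal, dual-pivot quicksort is used; otherwise single-pivot quicksort is used.
Abbreviated code: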

 // Inexpensive approximation of length / 7
        int seventh = (length >> 3) + (length >> 6) + 1;

        /*
         * Sort five evenly spaced elements around (and including) the
         * center element in the range. These elements will be used for
         * pivot selection as described below. The choice for spacing
         * these elements was empirically determined to work well on
         * a wide variety of inputs.
         */
        int e3 = (left + right) >>> 1; // The midpoint
        int e2 = e3 - seventh;
        int e1 = e2 - seventh;
        int e4 = e3 + seventh;
        int e5 = e4 + seventh;
        // Sort a[e1] .. a[e5] in place (omitted here)

        // ============================

        if (a[e1] != a[e2] && a[e2] != a[e3] && a[e3] != a[e4] && a[e4] != a[e5]) {
            /*
             * Use the second and fourth of the five sorted elements as pivots.
             * These values are inexpensive approximations of the first and
             * second terciles of the array. Note that pivot1 <= pivot2.
             */
            int pivot1 = a[e2];
            int pivot2 = a[e4];
            // Dual-pivot partitioning (omitted here)
        } else {
            int pivot = a[e3];
            // Single-pivot partitioning (omitted here)
        }
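To see the sampling with concrete numbers, the following sketch computes the five indices and reports which partitioning scheme the check above would pick. The names are illustrative, and the value 47 for INSERTION_SORT_THRESHOLD is taken from the JDK 7/8 source rather than from the excerpts quoted here; it is used only as a guard so the indices stay inside the range.

```java
final class PivotSampleSketch {
    static final int INSERTION_SORT_THRESHOLD = 47; // JDK 7/8 value, assumed here

    /** Returns the sample indices e1..e5 for the inclusive range [left, right]. */
    static int[] sampleIndices(int left, int right) {
        int length = right - left + 1;
        if (length < INSERTION_SORT_THRESHOLD) {
            throw new IllegalArgumentException("range too small for pivot sampling");
        }
        int seventh = (length >> 3) + (length >> 6) + 1;
        int e3 = (left + right) >>> 1;        // midpoint
        int e2 = e3 - seventh;
        int e1 = e2 - seventh;
        int e4 = e3 + seventh;
        int e5 = e4 + seventh;
        return new int[] {e1, e2, e3, e4, e5};
    }

    /** Sorts the five samples in place, then checks that adjacent samples differ. */
    static boolean wouldUseDualPivot(int[] a, int left, int right) {
        int[] e = sampleIndices(left, right);
        for (int i = 1; i < e.length; i++) {  // insertion sort over the five positions
            for (int j = i; j > 0 && a[e[j]] < a[e[j - 1]]; j--) {
                int t = a[e[j]];
                a[e[j]] = a[e[j - 1]];
                a[e[j - 1]] = t;
            }
        }
        for (int i = 1; i < e.length; i++) {
            if (a[e[i - 1]] == a[e[i]]) {
                return false;                 // equal neighbours: single pivot
            }
        }
        return true;                          // all distinct: dual pivot
    }
}
```

For the range [0, 99], seventh is 12 + 1 + 1 = 14, so the samples sit at indices 21, 35, 49, 63 and 77.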

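One step here is quite clever: computing a value close to 1/7 of the array length. Normally we might write int minSeventh = length/7 to get that directly, but the author used bit operations instead. I could not figure out why until I came across an answer on Zhihu.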

// Inexpensive approximation of length / 7
int seventh = (length >> 3) + (length >> 6) + 1;

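Two ideas are at work here: successive approximation, and replacing division with bit shifts.

First, the right shift "m >> n" can be treated roughly as m divided by 2 to the power n (rounded down for non-negative m):

length >> 3 equals length/(2 to the power 3) = length/8;
length >> 6 equals length/(2 to the power 6) = length/64

So the expression becomes:

int seventh = length/8 + length/64 + 1

What we want is length/7. Once we have length/8, adding the difference
length/7 - length/8 gives length/7.

length/7 = length/8 + (length/7 - length/8)
length/7 - length/8 = length/56.

The shift result closest to length/56 is length/64.
Putting it together, the approximation of length/7 can be written as

int seventh = length/8 + length/64 + 1

As for the trailing +1, it compensates for what the two shifts lose to truncation. The conclusion I am borrowing from others is that adding 1 gives better accuracy than adding nothing.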
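A quick way to check the approximation is to print it next to the exact integer quotient for a few lengths. The class and method names below are made up for this illustration.

```java
final class SeventhSketch {

    /** The shift-based approximation of length / 7 used for pivot sampling. */
    static int approxSeventh(int length) {
        return (length >> 3) + (length >> 6) + 1;
    }

    public static void main(String[] args) {
        int[] lengths = {47, 286, 1000, 10_000};
        for (int length : lengths) {
            System.out.printf("length=%d approx=%d exact=%d%n",
                    length, approxSeventh(length), length / 7);
        }
    }
}
```

For 47 and 286 the two values agree (6 and 40). For 1000 the approximation gives 141 against 142, and for 10000 it gives 1407 against 1428: length/8 + length/64 is 9/64 of the length, slightly less than 1/7, so the gap grows slowly with the length. That is harmless here, since the value only determines how far apart the five samples are.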

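If we set aside the individual sorting algorithms, that pretty much completes the analysis of the Arrays.sort() flow. The implementation of each algorithm will be analyzed separately in later posts.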

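II. Sorting algorithms involved in Arrays.sort

Traditional insertion sort
Pair insertion sort
Single-pivot quicksort
Dual-pivot quicksort
Optimized merge sort, in which I can faintly see the shadow of counting sort (just my humble opinion)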
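Of the algorithms in this list, dual-pivot quicksort is probably the least familiar, so here is a minimal textbook-style sketch of the idea: two pivots p <= q split the range into three parts (less than p, between p and q, greater than q). It is not the JDK implementation, which picks its pivots from the five sampled elements, treats equal keys specially, and hands small ranges to insertion sort. The names are illustrative.

```java
final class DualPivotSketch {

    /** Sorts a[lo..hi] (inclusive) using the first and last elements as pivots. */
    static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return;
        }
        if (a[lo] > a[hi]) {
            swap(a, lo, hi);                  // ensure p <= q
        }
        int p = a[lo];
        int q = a[hi];
        int lt = lo + 1;                      // a[lo+1..lt-1] <  p
        int gt = hi - 1;                      // a[gt+1..hi-1] >  q
        int i = lo + 1;                       // a[lt..i-1] is between p and q
        while (i <= gt) {
            if (a[i] < p) {
                swap(a, i, lt);
                i++;
                lt++;
            } else if (a[i] > q) {
                swap(a, i, gt);               // re-examine the element swapped in
                gt--;
            } else {
                i++;
            }
        }
        lt--;
        gt++;
        swap(a, lo, lt);                      // move p to its final position
        swap(a, hi, gt);                      // move q to its final position
        sort(a, lo, lt - 1);
        sort(a, lt + 1, gt - 1);
        sort(a, gt + 1, hi);
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}
```

Calling DualPivotSketch.sort(data, 0, data.length - 1) sorts the whole array in place.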

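Appendix: references

Zhihu question: "JDK源码DualPivotQuicksort类中利用移位求除7近似值？" (using shifts to approximate division by 7 in the JDK DualPivotQuicksort class)
Blog post: "Java SDK中的sort算法小议 - 03 双轴快排" (notes on the sort algorithms in the Java SDK, part 03: dual-pivot quicksort)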
