5亿整数的大文件，怎么排？

2020-01-13 本文已影响0人 adminmane

本人免费整理了Java高级资料，涵盖了Java、Redis、MongoDB、MySQL、Zookeeper、Spring Cloud、Dubbo高并发分布式等教程，一共30G，需要自己领取。
传送门：https://mp.weixin.qq.com/s/osB-BOl6W-ZLTSttTkqMPQ

问题

给你1个文件bigdata，大小4663M，5亿个数，文件中的数据随机,如下一行一个整数：

<pre class="" style="margin: 0px; padding: 8px 0px 6px; max-width: 100%; box-sizing: border-box !important; word-wrap: break-word !important; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: 0.544px; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background: rgb(241, 239, 238); border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); text-align: start; font-size: 10px; line-height: 12px; font-family: consolas, menlo, courier, monospace, &quot;Microsoft Yahei&quot; !important; border-width: 1px !important; border-style: solid !important; border-color: rgb(226, 226, 226) !important;">

6196302
3557681
6121580
2039345
2095006
1746773
7934312
2016371
7123302
8790171
2966901
...
7005375

</pre>

现在要对这个文件进行排序，怎么搞？

内部排序

先尝试内排，选2种排序方式：

<pre class="" style="margin: 0px; padding: 8px 0px 6px; max-width: 100%; box-sizing: border-box !important; word-wrap: break-word !important; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: 0.544px; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background: rgb(241, 239, 238); border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); text-align: start; font-size: 10px; line-height: 12px; font-family: consolas, menlo, courier, monospace, &quot;Microsoft Yahei&quot; !important; border-width: 1px !important; border-style: solid !important; border-color: rgb(226, 226, 226) !important;">

1.  `private  final  int cutoff =  8;`

3.  `public  <T>  void perform(Comparable<T>[] a)  {`

4.  `perform(a,0,a.length -  1);`

5.  `}`

7.  `private  <T>  int median3(Comparable<T>[] a,int x,int y,int z)  {`

8.  `if(lessThan(a[x],a[y]))  {`

9.  `if(lessThan(a[y],a[z]))  {`

10.  `return y;`

11.  `}`

12.  `else  if(lessThan(a[x],a[z]))  {`

13.  `return z;`

14.  `}else  {`

15.  `return x;`

16.  `}`

17.  `}else  {`

18.  `if(lessThan(a[z],a[y])){`

19.  `return y;`

20.  `}else  if(lessThan(a[z],a[x]))  {`

21.  `return z;`

22.  `}else  {`

23.  `return x;`

24.  `}`

25.  `}`

26.  `}`

28.  `private  <T>  void perform(Comparable<T>[] a,int low,int high)  {`

29.  `int n = high - low +  1;`

30.  `//当序列非常小，用插入排序`

31.  `if(n <= cutoff)  {`

32.  `InsertionSort insertionSort =  SortFactory.createInsertionSort();`

33.  `insertionSort.perform(a,low,high);`

34.  `//当序列中小时，使用median3`

35.  `}else  if(n <=  100)  {`

36.  `int m = median3(a,low,low +  (n >>>  1),high);`

37.  `exchange(a,m,low);`

38.  `//当序列比较大时，使用ninther`

39.  `}else  {`

40.  `int gap = n >>>  3;`

41.  `int m = low +  (n >>>  1);`

42.  `int m1 = median3(a,low,low + gap,low +  (gap <<  1));`

43.  `int m2 = median3(a,m - gap,m,m + gap);`

44.  `int m3 = median3(a,high -  (gap <<  1),high - gap,high);`

45.  `int ninther = median3(a,m1,m2,m3);`

46.  `exchange(a,ninther,low);`

47.  `}`

49.  `if(high <= low)`

50.  `return;`

51.  `//lessThan`

52.  `int lt = low;`

53.  `//greaterThan`

54.  `int gt = high;`

55.  `//中心点`

56.  `Comparable<T> pivot = a[low];`

57.  `int i = low +  1;`

59.  `/*`

60.  `* 不变式：`

61.  `*   a[low..lt-1] 小于pivot -> 前部(first)`

62.  `*   a[lt..i-1] 等于 pivot -> 中部(middle)`

63.  `*   a[gt+1..n-1] 大于 pivot -> 后部(final)`

64.  `*`

65.  `*   a[i..gt] 待考察区域`

66.  `*/`

68.  `while  (i <= gt)  {`

69.  `if(lessThan(a[i],pivot))  {`

70.  `//i-> ,lt ->`

71.  `exchange(a,lt++,i++);`

72.  `}else  if(lessThan(pivot,a[i]))  {`

73.  `exchange(a,i,gt--);`

74.  `}else{`

75.  `i++;`

76.  `}`

77.  `}`

79.  `// a[low..lt-1] < v = a[lt..gt] < a[gt+1..high].`

80.  `perform(a,low,lt -  1);`

81.  `perform(a,gt +  1,high);`

82.  `}`

</pre>

归并排序：

<pre class="" style="margin: 0px; padding: 8px 0px 6px; max-width: 100%; box-sizing: border-box !important; word-wrap: break-word !important; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: 0.544px; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background: rgb(241, 239, 238); border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); text-align: start; font-size: 10px; line-height: 12px; font-family: consolas, menlo, courier, monospace, &quot;Microsoft Yahei&quot; !important; border-width: 1px !important; border-style: solid !important; border-color: rgb(226, 226, 226) !important;">

1.  `/**`

2.  `* 小于等于这个值的时候，交给插入排序`

3.  `*/`

4.  `private  final  int cutoff =  8;`

6.  `/**`

7.  `* 对给定的元素序列进行排序`

8.  `*`

9.  `* @param a 给定元素序列`

10.  `*/`

11.  `@Override`

12.  `public  <T>  void perform(Comparable<T>[] a)  {`

13.  `Comparable<T>[] b = a.clone();`

14.  `perform(b, a,  0, a.length -  1);`

15.  `}`

17.  `private  <T>  void perform(Comparable<T>[] src,Comparable<T>[] dest,int low,int high)  {`

18.  `if(low >= high)`

19.  `return;`

21.  `//小于等于cutoff的时候,交给插入排序`

22.  `if(high - low <= cutoff)  {`

23.  `SortFactory.createInsertionSort().perform(dest,low,high);`

24.  `return;`

25.  `}`

27.  `int mid = low +  ((high - low)  >>>  1);`

28.  `perform(dest,src,low,mid);`

29.  `perform(dest,src,mid +  1,high);`

31.  `//考虑局部有序 src[mid] <= src[mid+1]`

32.  `if(lessThanOrEqual(src[mid],src[mid+1]))  {`

33.  `System.arraycopy(src,low,dest,low,high - low +  1);`

34.  `}`

36.  `//src[low .. mid] + src[mid+1 .. high] -> dest[low .. high]`

37.  `merge(src,dest,low,mid,high);`

38.  `}`

40.  `private  <T>  void merge(Comparable<T>[] src,Comparable<T>[] dest,int low,int mid,int high)  {`

42.  `for(int i = low,v = low,w = mid +  1; i <= high; i++)  {`

43.  `if(w > high || v <= mid && lessThanOrEqual(src[v],src[w]))  {`

44.  `dest[i]  = src[v++];`

45.  `}else  {`

46.  `dest[i]  = src[w++];`

47.  `}`

48.  `}`

49.  `}`

</pre>

数据太多，递归太深 ->栈溢出？加大Xss？数据太多，数组太长 -> OOM？加大Xmx?

耐心不足，没跑出来.而且要将这么大的文件读入内存，在堆中维护这么大个数据量，还有内排中不断的拷贝，对栈和堆都是很大的压力，不具备通用性。

sort命令来跑

跑了多久呢？24分钟.

为什么这么慢？

粗略的看下我们的资源：

内存 jvm-heap/stack，native-heap/stack,page-cache，block-buffer 外存 swap + 磁盘数据量很大，函数调用很多，系统调用很多，内核/用户缓冲区拷贝很多，脏页回写很多，io-wait很高，io很繁忙，堆栈数据不断交换至swap，线程切换很多，每个环节的锁也很多. 总之，内存吃紧，问磁盘要空间，脏数据持久化过多导致cache频繁失效，引发大量回写，回写线程高，导致cpu大量时间用于上下文切换，一切，都很糟糕，所以24分钟不细看了，无法忍受.

位图法

<pre class="" style="margin: 0px; padding: 8px 0px 6px; max-width: 100%; box-sizing: border-box !important; word-wrap: break-word !important; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: 0.544px; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background: rgb(241, 239, 238); border-radius: 0px; overflow-y: auto; color: rgb(80, 97, 109); text-align: start; font-size: 10px; line-height: 12px; font-family: consolas, menlo, courier, monospace, &quot;Microsoft Yahei&quot; !important; border-width: 1px !important; border-style: solid !important; border-color: rgb(226, 226, 226) !important;">

1.  `private  BitSet bits;`

3.  `public  void perform(`

4.  `String largeFileName,`

5.  `int total,`

6.  `String destLargeFileName,`

7.  `Castor<Integer> castor,`

8.  `int readerBufferSize,`

9.  `int writerBufferSize,`

10.  `boolean asc)  throws  IOException  {`

12.  `System.out.println("BitmapSort Started.");`

13.  `long start =  System.currentTimeMillis();`

14.  `bits =  new  BitSet(total);`

15.  `InputPart<Integer> largeIn =  PartFactory.createCharBufferedInputPart(largeFileName, readerBufferSize);`

16.  `OutputPart<Integer> largeOut =  PartFactory.createCharBufferedOutputPart(destLargeFileName, writerBufferSize);`

17.  `largeOut.delete();`

19.  `Integer data;`

20.  `int off =  0;`

21.  `try  {`

22.  `while  (true)  {`

23.  `data = largeIn.read();`

24.  `if  (data ==  null)`

25.  `break;`

26.  `int v = data;`

27.  `set(v);`

28.  `off++;`

29.  `}`

30.  `largeIn.close();`

31.  `int size = bits.size();`

32.  `System.out.println(String.format("lines : %d ,bits : %d", off, size));`

34.  `if(asc)  {`

35.  `for  (int i =  0; i < size; i++)  {`

36.  `if  (get(i))  {`

37.  `largeOut.write(i);`

38.  `}`

39.  `}`

40.  `}else  {`

41.  `for  (int i = size -  1; i >=  0; i--)  {`

42.  `if  (get(i))  {`

43.  `largeOut.write(i);`

44.  `}`

45.  `}`

46.  `}`

48.  `largeOut.close();`

49.  `long stop =  System.currentTimeMillis();`

50.  `long elapsed = stop - start;`

51.  `System.out.println(String.format("BitmapSort Completed.elapsed : %dms",elapsed));`

52.  `}finally  {`

53.  `largeIn.close();`

54.  `largeOut.close();`

55.  `}`

56.  `}`

58.  `private  void set(int i)  {`

59.  `bits.set(i);`

60.  `}`

62.  `private  boolean get(int v)  {`

63.  `return bits.get(v);`

64.  `}`

</pre>

nice!跑了190秒，3分来钟. 以核心内存4663M/32大小的空间跑出这么个结果，而且大量时间在用于I/O，不错.

问题是，如果这个时候突然内存条坏了1、2根，或者只有极少的内存空间怎么搞？

外部排序

该外部排序上场了. 外部排序干嘛的？

内存极少的情况下，利用分治策略，利用外存保存中间结果，再用多路归并来排序;

map-reduce的嫡系.

![image](https://img.haomeiwen.com/i20407517/26f3397ebc9dac7a?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

![image](https://img.haomeiwen.com/i20407517/85e4552f639e93e5?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

1.分

内存中维护一个极小的核心缓冲区memBuffer，将大文件bigdata按行读入，搜集到memBuffer满或者大文件读完时，对memBuffer中的数据调用内排进行排序，排序后将有序结果写入磁盘文件bigdata.xxx.part.sorted. 循环利用memBuffer直到大文件处理完毕，得到n个有序的磁盘文件：

![image](https://img.haomeiwen.com/i20407517/60aad7a5ddbebd7f?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

2.合

现在有了n个有序的小文件，怎么合并成1个有序的大文件？把所有小文件读入内存，然后内排？(⊙o⊙)… no!

利用如下原理进行归并排序：

![image](https://img.haomeiwen.com/i20407517/e284034e0a324086?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

我们举个简单的例子：

文件1：3,6,9 文件2：2,4,8 文件3：1,5,7

第一回合：文件1的最小值：3 , 排在文件1的第1行文件2的最小值：2，排在文件2的第1行文件3的最小值：1，排在文件3的第1行那么，这3个文件中的最小值是：min(1,2,3) = 1 也就是说，最终大文件的当前最小值，是文件1、2、3的当前最小值的最小值，绕么？上面拿出了最小值1，写入大文件.

第二回合：文件1的最小值：3 , 排在文件1的第1行文件2的最小值：2，排在文件2的第1行文件3的最小值：5，排在文件3的第2行那么，这3个文件中的最小值是：min(5,2,3) = 2 将2写入大文件.

也就是说，最小值属于哪个文件，那么就从哪个文件当中取下一行数据.（因为小文件内部有序，下一行数据代表了它当前的最小值）

最终的时间，跑了771秒，13分钟左右.

5亿整数的大文件，怎么排？

问题

内部排序

归并排序：

sort命令来跑

位图法

外部排序

1.分

2.合

猜你喜欢

热点阅读