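Some interview questions manage to test a candidate's grasp of algorithms, practical engineering judgment, and even creative thinking all at once. Those are good questions: they discriminate well because they admit both a workable general solution and an optimal one. I recently came across one such question and want to share it here.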
1 The problem
Original text:
There is an array of 10000000 different int numbers. Find out its largest 100 elements. The implementation should be optimized for executing speed.
Restated:
Given an int array of 10 million elements, all distinct, how do you find its 100 largest elements as fast as possible?
2 Analysis and solutions
(All of the algorithms below are implemented in Java.)
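The first idea that comes to mind is sorting. Sort the array once with any general-purpose sorting algorithm, then take the 100 largest elements from the end. The time complexity is O(n log n).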
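As a point of reference, the sort-then-take approach can be sketched in a few lines. This is only a minimal illustration, not code from the original solutions: the class name `SortBaseline` and the method `topK` are hypothetical, and the array is copied first on the assumption that the caller wants its input left unchanged.

```java
import java.util.Arrays;

class SortBaseline {

    /** Returns the k largest values of arr in descending order; arr itself is not modified. */
    static int[] topK(int[] arr, int k) {
        int[] sorted = arr.clone();   // copy so the caller's array keeps its order
        Arrays.sort(sorted);          // ascending sort, O(n log n)
        int count = Math.max(0, Math.min(k, sorted.length));
        int[] top = new int[count];
        for (int i = 0; i < count; i++) {
            top[i] = sorted[sorted.length - 1 - i]; // walk back from the largest element
        }
        return top;
    }

    public static void main(String[] args) {
        int[] sample = {42, 7, 19, 88, 3, 61};
        System.out.println(Arrays.toString(topK(sample, 3))); // prints [88, 61, 42]
    }
}
```

The whole array is sorted even though only 100 of its 10 million elements are needed, which is the waste the later approaches try to avoid.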
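2.1 Heap approach
I solved the problem with a heap, which needs very little memory: a min-heap capped at 100 elements is maintained throughout. The top of the heap, items[0], is the smallest of the current 100 candidates. Each new element is added to the heap, and as soon as the heap holds more than 100 elements the top is removed. That evicts the smallest candidate and restores the heap property, so items[0] again holds the minimum. After one pass over the array, the 100 elements left in the heap are the result. Each heap operation costs O(log k) with k = 100, so the total cost is O(n log k), which for a fixed k grows linearly with n and is better than the O(n log n) of a full sort. The advantages are low memory use and a good time complexity. Average running time was 400 ms.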
The implementation is as follows:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Implementation of finding top 100 elements out of a huge int array. <br>
 *
 * There is an array of 10000000 different int numbers. Find out its largest 100
 * elements. The implementation should be optimized for executing speed. <br>
 *
 * Note: This is the third version of implementation, this time I make the best out
 * of the heap sort algorithm by using a minimum heap. The heap maintains the top biggest
 * numbers that guarantees the minimum number is removed every time a new number is added
 * to the heap. It saves memory usage to the limit by just using an array which size is 101
 * and a few temp elements. However, the performance is not as good as the bit map way but
 * better than the multiple thread way.
 *
 * @author zhangxu04
 */
public class FindTopElements3 {

    private static final int ARRAY_LENGTH = 10000000; // big array length

    public static void main(String[] args) {
        FindTopElements3 fte = new FindTopElements3();

        // Get an array which is not in order and whose elements are not duplicated
        int[] array = getShuffledArray(ARRAY_LENGTH);

        // Find the top 100 elements and print them in ascending order
        // (the min-heap releases its smallest element first)
        long start = System.currentTimeMillis();
        fte.findTop100(array);
        long end = System.currentTimeMillis();
        System.out.println("Costs " + (end - start) + "ms");
    }

    public void findTop100(int[] arr) {
        MinimumHeap heap = new MinimumHeap(100);
        for (int i : arr) { // plain int avoids boxing every element
            heap.add(i);
            if (heap.size() > 100) {
                heap.deleteTop(); // evict the smallest of the 101 candidates
            }
        }
        for (int i = 0; i < 100; i++) {
            System.out.println(heap.deleteTop());
        }
    }

    /**
     * Get shuffled int array
     *
     * @return array not in order and elements are not duplicate
     */
    private static int[] getShuffledArray(int len) {
        System.out.println("Start to generate test array... this may take several seconds.");
        List<Integer> list = new ArrayList<Integer>(len);
        for (int i = 0; i < len; i++) {
            list.add(i);
        }
        Collections.shuffle(list);

        int[] ret = new int[len];
        for (int i = 0; i < len; i++) {
            ret[i] = list.get(i);
        }
        return ret;
    }

}

class MinimumHeap {

    int[] items;
    int size;

    public MinimumHeap(int capacity) {
        // One spare slot: the caller adds first and evicts the top afterwards
        items = new int[capacity + 1];
        size = 0;
    }

    // Move the element at index up until its parent is no larger than it
    void shiftUp(int index) {
        int intent = items[index];
        while (index > 0) {
            int pindex = (index - 1) / 2;
            int parent = items[pindex];
            if (intent < parent) {
                items[index] = parent;
                index = pindex;
            } else {
                break;
            }
        }
        items[index] = intent;
    }

    // Move the element at index down until both children are no smaller than it
    void shiftDown(int index) {
        int intent = items[index];
        int leftIndex = 2 * index + 1;
        while (leftIndex < size) {
            int minChild = items[leftIndex];
            int minIndex = leftIndex;

            int rightIndex = leftIndex + 1;
            if (rightIndex < size) {
                int rightChild = items[rightIndex];
                if (rightChild < minChild) {
                    minChild = rightChild;
                    minIndex = rightIndex;
                }
            }

            if (minChild < intent) {
                items[index] = minChild;
                index = minIndex;
                leftIndex = index * 2 + 1;
            } else {
                break;
            }
        }
        items[index] = intent;
    }

    public void add(int item) {
        items[size++] = item;
        shiftUp(size - 1);
    }

    // Remove and return the smallest element (returns 0 when the heap is empty)
    public int deleteTop() {
        if (size < 1) {
            return 0;
        }
        int minItem = items[0];
        int lastItem = items[size - 1];
        size--;
        if (size < 1) {
            return lastItem;
        }
        items[0] = lastItem;
        shiftDown(0);
        return minItem;
    }

    public boolean isEmpty() {
        return size < 1;
    }

    public int size() {
        return size;
    }

    /**
     * MinimumHeap main test
     * @param args
     */
    public static void main(String[] args) {
        MinimumHeap heap = new MinimumHeap(7);
        heap.add(2);
        heap.add(3);
        heap.add(5);
        heap.add(1);
        heap.add(4);
        heap.add(7);
        heap.add(6);

        heap.deleteTop();
        heap.deleteTop();
        while (!heap.isEmpty()) {
            System.out.println(heap.deleteTop());
        }
    }
}
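For comparison, the same bounded min-heap idea can be sketched with the JDK's `java.util.PriorityQueue` instead of a hand-written heap. This is an illustrative variant rather than the author's implementation: the names `BoundedHeapTopK` and `topK` are hypothetical, and it compares each value with the heap top before inserting, so most elements are rejected with a single comparison. It boxes values as `Integer`, so it is likely to use more memory than the int[]-based heap above.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class BoundedHeapTopK {

    /** Returns the k largest values of arr in descending order. */
    static int[] topK(int[] arr, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // PriorityQueue orders Integers naturally, so peek() is the smallest candidate
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);
        for (int value : arr) {
            if (heap.size() < k) {
                heap.add(value);              // still filling the first k slots
            } else if (value > heap.peek()) {
                heap.poll();                  // evict the smallest candidate
                heap.add(value);              // admit the larger newcomer
            }
        }
        int[] top = new int[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll();             // smallest comes out first, so fill from the back
        }
        return top;
    }

    public static void main(String[] args) {
        int[] sample = {5, 1, 9, 3, 7, 8};
        System.out.println(Arrays.toString(topK(sample, 3))); // prints [9, 8, 7]
    }
}
```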
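Digging into the problem statement, two points give us clues for optimization:
1. The elements are all distinct.
2. Speed is what matters; nothing is said about system resources or memory.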
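2.2 Multithreaded divide and conquer
Following clue #2, we can write a multithreaded version based on divide and conquer. The 10-million-element array is split into 1,000 sub-arrays of 10,000 elements each. Worker threads sort each sub-array with the JDK's efficient built-in sort, void java.util.Arrays.sort(int[] a), and return its top 100. The main thread collects the top 100 of every sub-array, merges them, and sorts once more to get the final result.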
The implementation is as follows:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Implementation of finding top 100 elements out of a huge int array. <br>
 *
 * There is an array of 10000000 different int numbers.
 * Find out its largest 100 elements.
 * The implementation should be optimized for executing speed.
 *
 * @author zhangxu04
 */
public class FindTopElements {

    private static final int ARRAY_LENGTH = 10000000; // big array length

    private static final int ELEMENT_NUM_PER_GROUP = 10000; // split big array into sub-arrays, this represents sub-array length

    private static final int TOP_ELEMENTS_NUM = 100; // top elements number

    private ExecutorService executorService;

    private CompletionService<int[]> completionService;

    public FindTopElements() {
        int MAX_THREAD_COUNT = 50;
        executorService = Executors.newFixedThreadPool(MAX_THREAD_COUNT);
        completionService = new ExecutorCompletionService<int[]>(executorService);
    }

    /**
     * Start from here :-)
     * @param args
     */
    public static void main(String[] args) {
        FindTopElements findTopElements = new FindTopElements();

        // Get an array which is not in order and whose elements are not duplicated
        int[] array = getShuffledArray(ARRAY_LENGTH);

        // Find top 100 elements and print them by desc order in the console
        long start = System.currentTimeMillis();
        findTopElements.findTop100(array);
        long end = System.currentTimeMillis();
        System.out.println("Costs " + (end - start) + "ms");
    }

    /**
     * Leveraging concurrent components of JDK, we can deal with small parts of the huge array concurrently.
     * The huge array is split into several sub-arrays which are submitted to a thread pool one by one.
     * By using <code>CompletionService</code>, we can take completed results out of the pool as soon as possible,
     * which avoids the blocking issue of getting results through a list of future tasks when using
     * <code>ExecutorService</code> and <code>Future</code> directly. Moreover, we can optimize the performance of
     * this piece of code by processing the completed results once we get them, so the overall sort invocation will
     * not be delayed to the final moment.
     */
    private void findTop100(int[] arr) {
        System.out.println("Start to compute.");
        int groupNum = (ARRAY_LENGTH / ELEMENT_NUM_PER_GROUP);
        System.out.println("Split " + ARRAY_LENGTH + " elements into " + groupNum + " groups");

        // Copy each group into its own array and hand it to the pool
        for (int i = 0; i < groupNum; i++) {
            int[] toBeSortArray = new int[ELEMENT_NUM_PER_GROUP];
            System.arraycopy(arr, i * ELEMENT_NUM_PER_GROUP, toBeSortArray, 0, ELEMENT_NUM_PER_GROUP);
            completionService.submit(new FindTop100(toBeSortArray));
        }

        try {
            // Gather the top 100 of every group, in completion order
            int[] overallArray = new int[TOP_ELEMENTS_NUM * groupNum];
            for (int i = 0; i < groupNum; i++) {
                System.arraycopy(completionService.take().get(), 0, overallArray, i * TOP_ELEMENTS_NUM, TOP_ELEMENTS_NUM);
            }
            Arrays.sort(overallArray);
            for (int i = 1; i <= TOP_ELEMENTS_NUM; i++) {
                System.out.println(overallArray[TOP_ELEMENTS_NUM * groupNum - i]);
            }
            System.out.println("Finish to output result.");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            executorService.shutdown(); // release the worker threads even on failure
        }
    }

    /**
     * Callable of finding top 100 elements <br>
     * The steps are as below:
     * 1) Sort the array with Arrays.sort
     * 2) Take the last 100 elements in reverse order and put them into a new array
     * 3) Return the new array
     */
    private class FindTop100 implements Callable<int[]> {

        private int[] array;

        public FindTop100(int[] array) {
            this.array = array;
        }

        @Override
        public int[] call() throws Exception {
            int len = array.length;
            Arrays.sort(array);
            int[] result = new int[TOP_ELEMENTS_NUM];
            int index = 0;
            for (int i = 1; i <= TOP_ELEMENTS_NUM; i++) {
                result[index++] = array[len - i];
            }
            return result;
        }
    }

    /**
     * Get shuffled int array
     *
     * @return array not in order and elements are not duplicate
     */
    private static int[] getShuffledArray(int len) {
        System.out.println("Start to generate test array... this may take several seconds.");
        List<Integer> list = new ArrayList<Integer>(len);
        for (int i = 0; i < len; i++) {
            list.add(i);
        }
        Collections.shuffle(list);

        int[] ret = new int[len];
        for (int i = 0; i < len; i++) {
            ret[i] = list.get(i);
        }
        return ret;
    }

}
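On analysis, the strength of this solution is that it makes full use of system resources and applies divide and conquer, spreading the sorting work evenly across the worker threads. Its weakness is memory: the code relies heavily on System.arraycopy to copy sub-arrays, and it uses so much memory that the JVM heap size had to be raised with -Xmx before it would run properly. Average running time was 250 ms.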
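One way to reduce that copying, offered here only as a sketch under stated assumptions and not as the author's method, is to give each worker an index range of the shared array instead of a copied sub-array, and to let each worker keep a small bounded heap rather than sort its whole range. The class name `RangeParallelTopK`, its methods, and the `workers` parameter are hypothetical; the workers only read the array, so no synchronization on it is needed. No timing is claimed for this variant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RangeParallelTopK {

    /** k largest values of arr[from, to), in no particular order. */
    static int[] topKOfRange(int[] arr, int from, int to, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int i = from; i < to; i++) {
            if (heap.size() < k) {
                heap.add(arr[i]);
            } else if (arr[i] > heap.peek()) {
                heap.poll();
                heap.add(arr[i]);
            }
        }
        int[] out = new int[heap.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = heap.poll();
        }
        return out;
    }

    /** k largest values of arr in descending order, computed by `workers` threads. */
    static int[] topK(int[] arr, int k, int workers) {
        if (k <= 0 || workers <= 0) {
            return new int[0];
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (arr.length + workers - 1) / workers; // ceiling division
            List<Future<int[]>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(arr.length, w * chunk);
                final int to = Math.min(arr.length, from + chunk);
                Callable<int[]> task = () -> topKOfRange(arr, from, to, k);
                parts.add(pool.submit(task)); // each task reads its own range, no copy
            }
            // Merge: at most workers * k candidates remain, so a plain sort is cheap
            List<int[]> results = new ArrayList<>();
            int total = 0;
            for (Future<int[]> part : parts) {
                int[] r = part.get();
                results.add(r);
                total += r.length;
            }
            int[] merged = new int[total];
            int pos = 0;
            for (int[] r : results) {
                System.arraycopy(r, 0, merged, pos, r.length);
                pos += r.length;
            }
            Arrays.sort(merged);
            int count = Math.min(k, merged.length);
            int[] top = new int[count];
            for (int i = 0; i < count; i++) {
                top[i] = merged[merged.length - 1 - i];
            }
            return top;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("top-k computation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] sample = {5, 1, 9, 3, 7, 8, 2, 6};
        System.out.println(Arrays.toString(topK(sample, 3, 4))); // prints [9, 8, 7]
    }
}
```

Sizing the pool from `Runtime.getRuntime().availableProcessors()` would be a natural choice for `workers`, since the work is CPU-bound.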
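2.3 Bitmap approach
This is the more inventive approach. Clue #1 says there are no duplicate elements, so the elements can be stored in a bitmap. An int takes 4 bytes, or 32 bits, so one int can record the presence of 32 different numbers.
Maintain a bitmap array x of (array length / 32) + 1 ints, which is large enough here because the test values run from 0 to 9,999,999. Traverse the given array and record each number in x. For example, if the first element is 62, then: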
index = 62 / 32 = 1
bit_index = 62 mod 32 = 30
so we set x[1] = x[1] | (1 << 30). Bitwise OR is used so that numbers recorded earlier in the same int are not lost.
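The bit-setting step can be tried in isolation. The following is a minimal sketch in which the class and method names (`BitmapDemo`, `set`, `contains`) are made up for illustration. The mask for bit position p is `1 << p`, so recording 62 ORs `1 << 30` into `x[1]`, and recording 33 afterwards ORs `1 << 1` into the same int without disturbing bit 30.

```java
class BitmapDemo {

    /** Marks a non-negative number as present in the bitmap. */
    static void set(int[] bitmap, int number) {
        bitmap[number / 32] |= (1 << (number % 32)); // OR keeps bits set earlier
    }

    /** Returns true if the number was marked as present. */
    static boolean contains(int[] bitmap, int number) {
        // >>> is an unsigned shift, so bit 31 is handled like any other bit
        return ((bitmap[number / 32] >>> (number % 32)) & 1) == 1;
    }

    public static void main(String[] args) {
        int[] x = new int[4];          // room for the numbers 0..127
        set(x, 62);                    // index 1, bit 30
        set(x, 33);                    // index 1, bit 1
        System.out.println(Integer.toHexString(x[1])); // prints 40000002
        System.out.println(contains(x, 62));           // prints true
        System.out.println(contains(x, 30));           // prints false
    }
}
```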
The implementation is as follows:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Implementation of finding top 100 elements out of a huge int array. <br>
 *
 * There is an array of 10000000 different int numbers. Find out its largest 100
 * elements. The implementation should be optimized for executing speed. <br>
 *
 * Note: This is the second version of implementation, the previous one using
 * thread pool provided by JDK concurrent toolkit is not efficient enough, the
 * second version is an enhanced one based on bit map algorithm, which is estimated to
 * be at least 3 times faster and to consume less memory.
 *
 * @author zhangxu04
 */
public class FindTopElements2 {

    private static final int ARRAY_LENGTH = 10000000; // big array length

    public static void main(String[] args) {
        FindTopElements2 fte = new FindTopElements2(ARRAY_LENGTH + 1);

        // Get an array which is not in order and whose elements are not duplicated
        int[] array = getShuffledArray(ARRAY_LENGTH);

        // Find top 100 elements and print them by desc order in the console
        long start = System.currentTimeMillis();
        fte.findTop100(array);
        long end = System.currentTimeMillis();
        System.out.println("Costs " + (end - start) + "ms");
    }

    private final int[] bitmap;
    private final int size;

    public FindTopElements2(final int size) {
        this.size = size;
        // One int per 32 numbers, rounded up
        int len = ((size % 32) == 0) ? size / 32 : size / 32 + 1;
        this.bitmap = new int[len];
    }

    // Which int of the bitmap holds this number
    private static int index(final int number) {
        return number / 32;
    }

    // Which bit inside that int represents this number
    private static int position(final int number) {
        return number % 32;
    }

    private void adjustBitMap(final int index, final int position) {
        // OR in the single bit so bits set earlier are kept
        int bit = bitmap[index] | (1 << position);
        bitmap[index] = bit;
    }

    public void add(int[] numArr) {
        for (int i = 0; i < numArr.length; i++) {
            add(numArr[i]);
        }
    }

    // Assumes 0 <= number < size; a negative number would give a negative index
    public void add(int number) {
        adjustBitMap(index(number), position(number));
    }

    public boolean getIndex(final int index) {
        if (index < 0 || index >= size) {
            return false;
        }
        int bit = (bitmap[index(index)] >>> position(index)) & 0x0001;
        return (bit == 1);
    }

    private void findTop100(int[] arr) {
        System.out.println("Start to compute.");
        add(arr);

        // Scan from the highest bit downwards and stop after 100 hits
        int[] result = new int[100];
        int index = 0;
        outer:
        for (int i = bitmap.length - 1; i >= 0; i--) {
            for (int j = 31; j >= 0; j--) {
                if (((bitmap[i] >>> j) & 0x0001) == 1) {
                    result[index++] = (i * 32) + j;
                    if (index == result.length) {
                        break outer;
                    }
                }
            }
        }

        for (int j = 0; j < result.length; j++) {
            System.out.println(result[j]);
        }
        System.out.println("Finish to output result.");
    }

    /**
     * Get shuffled int array
     *
     * @return array not in order and elements are not duplicate
     */
    private static int[] getShuffledArray(int len) {
        System.out.println("Start to generate test array... this may take several seconds.");
        List<Integer> list = new ArrayList<Integer>(len);
        for (int i = 0; i < len; i++) {
            list.add(i);
        }
        Collections.shuffle(list);

        int[] ret = new int[len];
        for (int i = 0; i < len; i++) {
            ret[i] = list.get(i);
        }
        return ret;
    }

}
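This algorithm runs in O(N) time, which is ideal, and the average running time drops to about 50 ms, more than 10 times faster than the sorting approach. Its drawback is that the length of the bitmap array depends on the largest value in the given array, not on the number of elements. If the values are fairly evenly distributed and the maximum is relatively small, memory use stays well under control. The code as written also assumes non-negative values, since a negative number would produce a negative array index.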
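The same idea can also be sketched with the JDK's `java.util.BitSet`, which grows to fit the largest value it has seen and offers `previousSetBit` for scanning downwards from the top. This is an illustrative alternative, not the implementation measured above, and no timing is claimed for it; the names `BitSetTopK` and `topK` are hypothetical. It makes the dependence on the maximum value explicit: the memory used is roughly (largest value / 8) bytes, so it is only sensible when the values are non-negative and the maximum is moderate.

```java
import java.util.Arrays;
import java.util.BitSet;

class BitSetTopK {

    /** Returns the k largest distinct values of arr in descending order. */
    static int[] topK(int[] arr, int k) {
        BitSet seen = new BitSet();
        for (int value : arr) {
            if (value < 0) {
                throw new IllegalArgumentException("negative values are not supported: " + value);
            }
            seen.set(value); // bit index = the value itself
        }
        int count = Math.max(0, Math.min(k, seen.cardinality()));
        int[] top = new int[count];
        int bit = seen.length() - 1; // index of the highest set bit, or -1 if empty
        for (int i = 0; i < count; i++) {
            top[i] = bit;
            bit = seen.previousSetBit(bit - 1); // next smaller value, or -1 if none
        }
        return top;
    }

    public static void main(String[] args) {
        int[] sample = {5, 1, 9, 3, 7, 8};
        System.out.println(Arrays.toString(topK(sample, 3))); // prints [9, 8, 7]
    }
}
```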
3 Summary
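The problem above shows that a practical problem can be attacked in several ways. It can be solved as a pure algorithm exercise, even one we implement by hand: the hand-written heap above is very frugal with memory, and on that point the JDK's built-in quicksort would certainly do no better than our implementation.
A common tendency today when handling large data sets is to make full use of system resources and exploit multi-core, large-memory compute servers to improve efficiency. Java already has very usable multithreading APIs and the utility classes in the concurrent package, so being able to use these tools effectively to speed up our programs matters too. Finally, a problem usually has some detail that points to the solution best suited to the scenario. The bitmap approach is an example: it gave the best performance, and the extra memory it consumes is well within what a modern server can handle, so it is a creative approach worth having.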