Multicore processing for client-side Java applications
Hardware companies have abandoned the race for single-CPU speed and are focusing instead on multicore processors. Although many algorithms can be easily parallelized, most client-side Java code is still written for single-CPU systems. In this article, Kirill Grouchnikov shows how to fine-tune a core JDK array-sorting algorithm for a processing-speed improvement of as much as 35%.
The consistent increase in single-core CPU speed that programmers could rely on for years is no longer available. This has been true on the server side for at least a decade, and now it's the reality for client-side programming as well. Quite a few tasks lend themselves to parallelization using the concurrency utilities in the JDK (introduced in JDK 5.0), letting client-side applications take advantage of newer multicore hardware. Using a core JDK array-sorting algorithm as an example, this article walks through the implementation details, highlighting points that are specific to parallel algorithms. Even a simple implementation results in a significant speed-up -- around 35% -- on a dual-core machine.
The demise of Moore's Law
Up until a year or two ago, API providers could simply rely on an empirical observation made in 1965 by Gordon Moore, the co-founder of Intel. Moore's Law states that the number of transistors on an integrated circuit at minimum cost doubles every 18 to 24 months. For software developers, this meant that you could write a program, run it on the simplest CPU available at that time, and be pretty sure that the same program would run twice as fast on the simplest CPU available two years later. As long as the operating system was backward compatible, you didn't even need to recompile the program.
However, in the last couple of years, hardware manufacturers have started hitting production barriers that make cramming more computing power into a single chip very costly. The solution most of them have adopted -- first for the server market and now for the consumer (client) market -- is to put multiple cores on the same chip, without increasing each core's processing speed. Translated into the software world, this means that you don't get "free rides" any more. If your program has a simple sequential flow, it no longer can enjoy advances in the underlying hardware. This is true for both the programs that you're writing and the core language libraries.
Collection sorting in the single-core world
Existing Java collection-sorting routines are no exception. They perform no faster on newer multicore machines than on single-core machines. This might seem acceptable on smaller inputs, but the input domains for most real-world problems don't stand still. Moreover, developers and users rightly expect their programs to run faster on newer hardware.
A look at the generic algorithms that operate on collections, and their existing Java implementations, will clarify the problem -- and light the way to a solution.
Algorithm complexity
Most of the generic core algorithms that operate on collections are available in the Arrays and Collections classes. Using the APIs exposed by these two classes, you can sort, search, and fill lists and arrays. Because most of the APIs operate on the contents of the entire collection, the running time is proportional to the collection size.
For some methods (such as sorting), the running time is even longer. You can't sort an arbitrary collection in a number of steps proportional to the collection's size (known as linear complexity). If the collection size is N, the best comparison-based algorithm sorts it in a number of steps proportional to N*log(N); this lower bound has been proven theoretically. (As a side note, the quadratic complexity of the straightforward bubble sort means that the number of steps is proportional to N*N, which is highly undesirable for large collections.)
What does this mean to users of these core APIs? When you call Arrays.sort or Collections.sort, the running time doesn't grow linearly with the collection size. On my machine, calling Arrays.sort on an array of 200,000 strings takes 490ms on average. An array of 400,000 strings is sorted in 1290ms -- an increase in running time by a factor of about 2.6.
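For comparison, the theoretical N*log(N) growth predicts a slightly smaller factor when the input doubles; a back-of-the-envelope sketch (the gap to the measured 2.6 comes from constant-factor and memory effects):

```java
public class SortGrowth {
    public static void main(String[] args) {
        double n1 = 200000.0, n2 = 400000.0;
        // Expected ratio of running times under N*log(N) complexity.
        double expected = (n2 * Math.log(n2)) / (n1 * Math.log(n1));
        System.out.printf("expected factor: %.2f%n", expected); // ~2.11
        // Measured factor from the timings quoted above.
        System.out.printf("measured factor: %.2f%n", 1290.0 / 490.0); // ~2.63
    }
}
```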
Java implementations of sorting algorithms
A look at the implementation of the various sorting algorithms in the core Arrays and Collections classes is especially easy now that they are licensed as open source. All the sort routines in the Arrays class that operate on primitive types (such as byte, char, and double) use quicksort, which is generally recognized as the fastest sorting algorithm for most inputs. (In some corner cases its performance can degrade to quadratic.) The sort API that operates on an array of Objects is a mergesort, which is simpler to implement and also provides N*log(N) performance. Finally, Collections.sort(List) delegates the actual sorting to Arrays.sort(Object[]).
Taking a closer look at the mergesort in the Arrays class, you can see that it is indeed the classic mergesort with a few optimization tweaks for corner cases. If the array range is small (fewer than seven elements in the current implementation), it falls back to an insertion sort. Otherwise, the array is split in half, and the same method is called recursively on both halves. After both halves are sorted, the code "merges" them (hence the name).
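The structure just described can be sketched as follows. This is a simplified version operating on int[], not the actual JDK source; the real implementation adds offset handling and operates on Object[] with comparators:

```java
import java.util.Arrays;

public class SimpleMergeSort {
    private static final int THRESHOLD = 7;

    public static void sort(int[] a) {
        int[] aux = a.clone();
        mergeSort(aux, a, 0, a.length);
    }

    // Sorts dest[low..high); src holds an identical copy used as scratch space.
    private static void mergeSort(int[] src, int[] dest, int low, int high) {
        // Small ranges: fall back to insertion sort, cheaper than recursing.
        if (high - low < THRESHOLD) {
            for (int i = low; i < high; i++)
                for (int j = i; j > low && dest[j - 1] > dest[j]; j--) {
                    int t = dest[j]; dest[j] = dest[j - 1]; dest[j - 1] = t;
                }
            return;
        }
        int mid = (low + high) >>> 1;
        // Recursively sort both halves, swapping the roles of src and dest.
        mergeSort(dest, src, low, mid);
        mergeSort(dest, src, mid, high);
        // Merge the two sorted halves from src back into dest.
        for (int i = low, p = low, q = mid; i < high; i++) {
            if (q >= high || (p < mid && src[p] <= src[q]))
                dest[i] = src[p++];
            else
                dest[i] = src[q++];
        }
    }

    public static void main(String[] args) {
        int[] data = {9, 3, 7, 1, 8, 2, 6, 0, 5, 4};
        sort(data);
        System.out.println(Arrays.toString(data)); // [0, 1, 2, ..., 9]
    }
}
```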
Going back to the hardware advances, you can easily see that increasing the CPU speed (number of operations per second) by a factor of two results in a matching improvement in the algorithm running time. This is the case because the mergesort is a sequential recursive algorithm that does exactly the same sequence of steps, provided the same input.
Core sorting in the multicore world
What's the problem with the current mergesort implementation when it runs on a multicore machine? The answer is simple: it doesn't exploit the inherent concurrency of the recursive implementation. When the array is split in half, the sorting of the second half begins only after the sorting of the first half is done. But the concurrency utilities in java.util.concurrent (available since JDK 5.0) come to the rescue, letting you perform these subtasks in parallel without any communication between them. At the end, when both tasks are done, the sorted halves still need to be merged together. This still results in N*log(N) performance, but with a lower constant factor. To illustrate how much lower, I'll show you a simple implementation of a mergesort that takes advantage of a multicore environment. This takes you into the world of parallel, or concurrent, programming.
Parallel programming to the rescue
Here is a guided walkthrough of a concurrent alternative to Arrays.mergeSort. The code examples are based on the core JDK implementation (and as such are licensed under the GPL). The first step, shown in Listing 1, is to check the number of available processors (cores). If you have only one processor, parallelizing the implementation gains you nothing; you would only pay the overhead of creating threads and of thread-context switches while the array halves are being sorted.
Listing 1. Special case for small input and one processor
if ((a.length < 7) || (Runtime.getRuntime().availableProcessors() == 1)) {
    mergeSort(aux, a, 0, a.length, 0);
    return;
}
Now you need to decide how to split up the work among the available processors. This decision is specific to the task at hand. To simplify the implementation, I'll split the work between only two processors. It's easy to see that the final merge stage can begin only when both halves are sorted. Listing 2 uses two of the concurrency utilities available since JDK 5.0:
- The CountDownLatch class lets you wait for a specific number of tasks to be completed (two in this case).
- Executors.newFixedThreadPool creates a pool of two threads on which the array halves are sorted.
Listing 2. Splitting the sort into two independently and concurrently processed halves
final CountDownLatch doneSignal = new CountDownLatch(2);
ExecutorService e = Executors.newFixedThreadPool(2);
class WorkerRunnable implements Runnable {
    int start;
    int end;

    WorkerRunnable(int start, int end) {
        this.start = start;
        this.end = end;
    }

    public void run() {
        mergeSort(aux, a, start, end, 0);
        doneSignal.countDown();
    }
}
int mid = a.length >> 1;
e.execute(new WorkerRunnable(0, mid));
e.execute(new WorkerRunnable(mid, a.length));
try {
    doneSignal.await(); // wait for both halves to finish
} catch (InterruptedException ie) {
    Thread.currentThread().interrupt(); // restore the interrupt status
}
e.shutdown();
After CountDownLatch.await returns, you know that both subtasks have been completed. At this point, you can merge the sorted halves, as shown in Listing 3. (This code is taken from Arrays.mergeSort, with the relevant variables renamed.)
Listing 3. Merging the two sorted halves
System.arraycopy(a, 0, aux, 0, a.length);
// merge the two sorted halves
for (int i = 0, p = 0, q = mid; i < a.length; i++) {
    if (q >= a.length || (p < mid && ((Comparable) aux[p]).compareTo(aux[q]) <= 0))
        a[i] = aux[p++];
    else
        a[i] = aux[q++];
}
Performance results on multicore machine
The source archive accompanying this article contains the following Java classes:
- CoreSort tests the core implementation of Arrays.sort().
- ConcurrentSort is a simple parallelized implementation of the same method that takes advantage of dual-core CPUs.
- TestConcurrentSort tests the parallelized sort implementation on different inputs.
The following results were taken on a dual-core Intel processor, each core running at 1.66GHz. The same input array of a million random strings was sorted by the core Arrays.sort method and by the parallelized version. The average running time of the core implementation was 3350ms, while the average running time of the parallelized version was 2180ms (a 35% improvement).
In the ideal case, the performance improvement on two cores would be exactly 50%. However, you can rarely achieve this, for a few reasons. First, creating and running multiple threads incurs some penalty. Second, most problems involve some kind of "merging" stage, in which the results of the subproblems are combined into the result for the original problem. Finally, some problems require communication among the subtasks.
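Amdahl's law captures this formally: if a fraction p of the work can be parallelized across n cores, the overall speedup is 1 / ((1 - p) + p / n). A quick sketch with illustrative numbers (the 0.8 fraction is an assumption for the sake of the example, not a measurement from this article):

```java
public class AmdahlDemo {
    public static void main(String[] args) {
        // speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction.
        double p = 0.8; // assumed parallelizable fraction (the merge stage stays serial)
        int n = 2;      // dual-core machine
        double speedup = 1.0 / ((1.0 - p) + p / n);
        System.out.printf("speedup: %.2fx%n", speedup); // 1.67x on two cores
        // Even with infinitely many cores, the serial part caps the gain:
        System.out.printf("limit:   %.2fx%n", 1.0 / (1.0 - p)); // 5.00x
    }
}
```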
Is it that easy?
So, is it just that easy? Yes and no. It is very simple to break up problems that are inherently concurrent, and you can do it in just a few lines of code using the java.util.concurrent APIs. And if your implementation is written to scale to an arbitrary number of available processors, your code can once again enjoy Moore's Law (at least as long as the hardware vendors keep putting more and more identical cores on their chips).
On the other hand, you are now faced with a multitude of concerns that are specific to concurrent programming. First, you need to decide whether to split your problem among all processors, some of them, or only a fixed number (as in this article's example). Splitting the work among all processors might result in the best average completion time, but it is more complicated to do, and in some cases (when one core is heavily used by some external process), a specific subtask can take much longer to complete than the others. Splitting the task among a fixed number of cores is simpler but gives you only a constant performance improvement, no matter how many cores are available.
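The arithmetic for splitting work among an arbitrary number of cores can be sketched as follows (the chunkRanges helper is a hypothetical name for illustration, not a JDK API):

```java
import java.util.Arrays;

public class RangeSplitter {
    // Splits [0, length) into 'parts' near-equal [start, end) ranges.
    static int[][] chunkRanges(int length, int parts) {
        int[][] ranges = new int[parts][2];
        int base = length / parts, rem = length % parts, start = 0;
        for (int i = 0; i < parts; i++) {
            int size = base + (i < rem ? 1 : 0); // spread the remainder evenly
            ranges[i][0] = start;
            ranges[i][1] = start + size;
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // In real code the part count would come from
        // Runtime.getRuntime().availableProcessors().
        System.out.println(Arrays.deepToString(chunkRanges(10, 4)));
        // [[0, 3], [3, 6], [6, 8], [8, 10]]
    }
}
```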
Second, your testing becomes much more complicated. You need to test your code on different configurations, including single-core, dual-core, and quad-core machines. Some bugs become very hard to reproduce, because the code no longer runs sequentially.
Third, the code itself is more complicated, and not only for the task splitting. You also need to think about system resources; your method might be called from multiple threads, and if you spawn yet more threads for every such request, you can easily bring the whole system to a crawl.
The last consideration, but certainly not the least, is that some problems, even if they appear parallelizable, don't easily lend themselves to a parallel implementation. Consider, for example, ArrayList.indexOf (which is also used in ArrayList.contains). The implementation is very simple: it scans the entire list from the beginning and returns the index of the first element that matches the passed parameter. The scan can indeed be split easily among any number of available processors, each one searching its own subrange. But the main difference between this method and Arrays.sort is the merging stage. The indexOf method needs to return the index of the first matching element. If the subtask that operates on the first list slice finds a matching element, all the other subtasks are now using the processors for no reason.
ArrayList.indexOf, unlike Arrays.sort, doesn't need the full results of all the subtasks. If more than one subtask returns a nonnegative result, the implementation should take the smallest one. Suppose you split the search into four subtasks, and the third subtask returns a matching index. You can't just stop the remaining subtasks and return it; you must wait until the first two subtasks complete in order to decide what the result is. In general, when subtask N returns a matching result, you can stop all the tasks with an index higher than N but must wait for all the tasks with an index lower than N.
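That ordered-cancellation idea can be sketched like this. This is illustrative code under the assumptions above, not how ArrayList actually works; ParallelIndexOf and its helper names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexOf {
    public static int indexOf(final List<?> list, final Object target, int parts)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
        int chunk = (list.size() + parts - 1) / parts;
        for (int i = 0; i < parts; i++) {
            final int start = i * chunk;
            final int end = Math.min(start + chunk, list.size());
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    // Scan one slice; -1 means "not found in this slice".
                    for (int j = start; j < end; j++)
                        if (target.equals(list.get(j))) return j;
                    return -1;
                }
            }));
        }
        int result = -1;
        // Collect results in slice order: the first hit wins, and every
        // later slice can be cancelled as soon as a hit is found.
        for (int i = 0; i < futures.size(); i++) {
            if (result >= 0) { futures.get(i).cancel(true); continue; }
            result = futures.get(i).get();
        }
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> data = new ArrayList<String>();
        for (int i = 0; i < 1000; i++) data.add("item" + i);
        System.out.println(indexOf(data, "item437", 4)); // 437
    }
}
```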
In conclusion
Multicore machines are the reality for client-side development. Failure to adapt will result in software that does not scale well on modern hardware. By identifying routines that lend themselves to parallelizing, your code can continue enjoying the advances of multicore machines. However, this means switching your mindset from the sequential model to the concurrent model. The concurrent model doesn't only yield performance improvements; it also comes with its own set of pitfalls and best practices. If you want to achieve competitive performance and write software that scales with the hardware, you need to dip your toes in the concurrency pool.
From the ITPUB blog: http://blog.itpub.net/71047/viewspace-996790/. Please credit the source when republishing.