Multicore processing for client-side Java applications
Hardware companies have abandoned the race for single-CPU speed and are focusing instead on multicore processors. Although many algorithms can be easily parallelized, most client-side Java code is still written for single-CPU systems. In this article, Kirill Grouchnikov shows how to fine-tune a core JDK array-sorting algorithm for a processing-speed improvement of as much as 35%.
The consistent increase in single-core CPU speed that programmers could rely on for years is no longer available. This has been true on the server side for at least a decade, and now it's the reality for client-side programming as well. Quite a few tasks lend themselves to parallelization using the concurrency utilities in the JDK (introduced in JDK 5.0), letting client-side applications take advantage of newer multicore hardware. Using a core JDK array-sorting algorithm as an example, this article walks through the implementation details, highlighting points that are specific to parallel algorithms. Even a simple implementation results in a significant speed-up -- around 35% -- on a dual-core machine.
The demise of Moore's Law
Up until a year or two ago, API providers could simply rely on an empirical observation made in 1965 by Gordon Moore, the co-founder of Intel. Moore's Law states that the number of transistors on an integrated circuit at minimum cost doubles every 18 to 24 months. For software developers, this meant that you could write a program, run it on the simplest CPU available at that time, and be pretty sure that the same program would run twice as fast on the simplest CPU available two years later. As long as the operating system was backward compatible, you didn't even need to recompile the program.
However, in the last couple of years, hardware manufacturers have started hitting production barriers that make cramming more computing power into a single chip very costly. The solution most of them have adopted -- first for the server market and now for the consumer (client) market -- is to put multiple cores on the same chip, without increasing each core's processing speed. Translated into the software world, this means that you don't get "free rides" any more. If your program has a simple sequential flow, it no longer can enjoy advances in the underlying hardware. This is true for both the programs that you're writing and the core language libraries.
Collection sorting in the single-core world
Existing Java collection-sorting routines are no exception. They perform no faster on newer multicore machines than on single-core machines. This might seem acceptable on smaller inputs, but the input domains for most real-world problems don't stand still. Moreover, developers and users rightly expect their programs to run faster on newer hardware.
A look at the generic algorithms that operate on collections, and their existing Java implementations, will clarify the problem -- and light the way to a solution.
Algorithm complexity
Most of the generic core algorithms that operate on collections are available in the Arrays and Collections classes. Using the APIs exposed by these two classes, you can sort, search, and fill lists and arrays. Because most of the APIs operate on the contents of the entire collection, the running time is proportional to the collection size.
For some methods (such as sorting), the running time is even longer. You can't sort an arbitrary collection in a number of steps proportional to the collection's size (known as linear complexity). If the collection size is N, the best comparison-based algorithm sorts it in a number of steps proportional to N*log(N); this lower bound has been proven theoretically. (As a side note, the quadratic complexity of the straightforward bubble sort means that the number of steps is proportional to N*N, which is highly undesirable for large collections.)
What does this mean to users of these core APIs? When you call Arrays.sort or Collections.sort, the running time doesn't grow linearly with the collection size. On my machine, calling Arrays.sort on an array of 200,000 strings takes 490ms on average. An array of 400,000 strings is sorted in 1290ms -- an increase in running time by a factor of about 2.6.
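For comparison, the theoretical N*log(N) growth predicts a slightly smaller factor when the input doubles; a back-of-the-envelope sketch (the gap to the measured 2.6 comes from constant-factor and memory effects):

```java
public class SortGrowth {
    public static void main(String[] args) {
        double n1 = 200000.0, n2 = 400000.0;
        // Expected ratio of running times under N*log(N) complexity.
        double expected = (n2 * Math.log(n2)) / (n1 * Math.log(n1));
        System.out.printf("expected factor: %.2f%n", expected); // ~2.11
        // Measured factor from the timings quoted above.
        System.out.printf("measured factor: %.2f%n", 1290.0 / 490.0); // ~2.63
    }
}
```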
Java implementations of sorting algorithms
A look at the implementation of the various sorting algorithms in the core Arrays and Collections classes is especially easy now that they are licensed as open source. All the sort routines in the Arrays class that operate on primitive types (such as byte, char, and double) use quicksort, which is generally recognized as the fastest sorting algorithm for most inputs. (In some corner cases its performance can degrade to quadratic.) The sort API that operates on an array of Objects is a mergesort, which is simpler to implement and also provides N*log(N) performance. Finally, Collections.sort(List) delegates the actual sorting to Arrays.sort(Object[]).
Taking a closer look at the mergesort in the Arrays class, you can see that it is indeed the classic mergesort with a few optimization tweaks for corner cases. If the array range is small (fewer than seven elements in the current implementation), it falls back to an insertion sort. Otherwise, the array is split in half, and the same method is called recursively on both halves. After both halves are sorted, the code "merges" them (hence the name).
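The structure just described can be sketched as follows. This is a simplified version operating on int[], not the actual JDK source; the real implementation adds offset handling and operates on Object[] with comparators:

```java
import java.util.Arrays;

public class SimpleMergeSort {
    private static final int THRESHOLD = 7;

    public static void sort(int[] a) {
        int[] aux = a.clone();
        mergeSort(aux, a, 0, a.length);
    }

    // Sorts dest[low..high); src holds an identical copy used as scratch space.
    private static void mergeSort(int[] src, int[] dest, int low, int high) {
        // Small ranges: fall back to insertion sort, cheaper than recursing.
        if (high - low < THRESHOLD) {
            for (int i = low; i < high; i++)
                for (int j = i; j > low && dest[j - 1] > dest[j]; j--) {
                    int t = dest[j]; dest[j] = dest[j - 1]; dest[j - 1] = t;
                }
            return;
        }
        int mid = (low + high) >>> 1;
        // Recursively sort both halves, swapping the roles of src and dest.
        mergeSort(dest, src, low, mid);
        mergeSort(dest, src, mid, high);
        // Merge the two sorted halves from src back into dest.
        for (int i = low, p = low, q = mid; i < high; i++) {
            if (q >= high || (p < mid && src[p] <= src[q]))
                dest[i] = src[p++];
            else
                dest[i] = src[q++];
        }
    }

    public static void main(String[] args) {
        int[] data = {9, 3, 7, 1, 8, 2, 6, 0, 5, 4};
        sort(data);
        System.out.println(Arrays.toString(data)); // [0, 1, 2, ..., 9]
    }
}
```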
Going back to the hardware advances, you can easily see that increasing the CPU speed (number of operations per second) by a factor of two results in a matching improvement in the algorithm running time. This is the case because the mergesort is a sequential recursive algorithm that does exactly the same sequence of steps, provided the same input.
Core sorting in the multicore world
What's the problem with the current mergesort implementation when it runs on a multicore machine? The answer is simple: it doesn't exploit the inherent concurrency of the recursive implementation. When the array is split in half, the sorting of the second half begins only after the sorting of the first half is done. But the concurrency utilities in java.util.concurrent (available since JDK 5.0) come to the rescue, letting you perform these subtasks in parallel without any communication between them. At the end, when both tasks are done, the sorted halves still need to be merged together. This still results in N*log(N) performance, but with a lower constant factor. To illustrate how much lower, I'll show you a simple implementation of a mergesort that takes advantage of a multicore environment. This takes you into the world of parallel, or concurrent, programming.
Parallel programming to the rescue
Here is a guided walkthrough of a concurrent alternative to Arrays.mergeSort. The code examples are based on the core JDK implementation (and as such are licensed under the GPL). The first step, shown in Listing 1, is to check the number of available processors (cores). If you have only one processor, parallelizing the implementation gains you nothing; you would only pay the overhead of creating threads and of thread-context switches while the array halves are being sorted.
Listing 1. Special case for small input and one processor
if ((a.length < 7) || (Runtime.getRuntime().availableProcessors() == 1)) {
    mergeSort(aux, a, 0, a.length, 0);
    return;
}
Now you need to decide how to split up the work among the available processors. This decision is specific to the task at hand. To simplify the implementation, I'll split the work between only two processors. It's easy to see that the final merge stage can begin only when both halves are sorted. Listing 2 uses two of the concurrency utilities available since JDK 5.0:
- The CountDownLatch class lets you wait for a specific number of tasks to be completed (two in this case).
- Executors.newFixedThreadPool creates a pool of two threads on which the array halves are sorted.
Listing 2. Splitting the sort into two independently and concurrently processed halves
final CountDownLatch doneSignal = new CountDownLatch(2);
ExecutorService e = Executors.newFixedThreadPool(2);
class WorkerRunnable implements Runnable {
    int start;
    int end;

    WorkerRunnable(int start, int end) {
        this.start = start;
        this.end = end;
    }

    public void run() {
        mergeSort(aux, a, start, end, 0);
        doneSignal.countDown();
    }
}
int mid = a.length >> 1;
e.execute(new WorkerRunnable(0, mid));
e.execute(new WorkerRunnable(mid, a.length));
try {
    doneSignal.await(); // wait for both halves to finish
} catch (InterruptedException ie) {
    Thread.currentThread().interrupt(); // restore the interrupt status
}
e.shutdown();
After CountDownLatch.await returns, you know that both subtasks have been completed. At this point, you can merge the sorted halves, as shown in Listing 3. (This code is taken from Arrays.mergeSort, with the relevant variables renamed.)
Listing 3. Merging the two sorted halves
System.arraycopy(a, 0, aux, 0, a.length);
// merge the two sorted halves
for (int i = 0, p = 0, q = mid; i < a.length; i++) {
    if (q >= a.length || (p < mid && ((Comparable) aux[p]).compareTo(aux[q]) <= 0))
        a[i] = aux[p++];
    else
        a[i] = aux[q++];
}
Performance results on multicore machine
The source archive accompanying this article contains the following Java classes:
- CoreSort tests the core implementation of Arrays.sort().
- ConcurrentSort is a simple parallelized implementation of the same method that takes advantage of dual-core CPUs.
- TestConcurrentSort tests the parallelized sort implementation on different inputs.
The following results were taken on a dual-core Intel processor, each core running at 1.66GHz. The same input array of a million random strings was sorted by the core Arrays.sort method and by the parallelized version. The average running time of the core implementation was 3350ms, while the average running time of the parallelized version was 2180ms (a 35% improvement).
In the ideal case, the performance improvement on two cores would be exactly 50%. However, you can rarely achieve this, for a few reasons. First, creating and running multiple threads incurs some penalty. Second, most problems involve some kind of "merging" stage, in which the results of the subproblems are combined into the result for the original problem. Finally, some problems require communication among the subtasks.
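Amdahl's law captures this formally: if a fraction p of the work can be parallelized across n cores, the overall speedup is 1 / ((1 - p) + p / n). A quick sketch with illustrative numbers (the 0.8 fraction is an assumption for the sake of the example, not a measurement from this article):

```java
public class AmdahlDemo {
    public static void main(String[] args) {
        // speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction.
        double p = 0.8; // assumed parallelizable fraction (the merge stage stays serial)
        int n = 2;      // dual-core machine
        double speedup = 1.0 / ((1.0 - p) + p / n);
        System.out.printf("speedup: %.2fx%n", speedup); // 1.67x on two cores
        // Even with infinitely many cores, the serial part caps the gain:
        System.out.printf("limit:   %.2fx%n", 1.0 / (1.0 - p)); // 5.00x
    }
}
```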
Is it that easy?
So, is it just that easy? Yes and no. It is very simple to break up problems that are inherently concurrent, and you can do it in just a few lines of code using the java.util.concurrent APIs. And if your implementation is written to scale to an arbitrary number of available processors, your code can once again enjoy Moore's Law (at least as long as the hardware vendors keep putting more and more identical cores on their chips).
On the other hand, you are now faced with a multitude of concerns that are specific to concurrent programming. First, you need to decide whether to split your problem among all processors, some of them, or only a fixed number (as in this article's example). Splitting the work among all processors might result in the best average completion time, but it is more complicated to do, and in some cases (when one core is heavily used by some external process), a specific subtask can take much longer to complete than the others. Splitting the task among a fixed number of cores is simpler but gives you only a constant performance improvement, no matter how many cores are available.
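The arithmetic for splitting work among an arbitrary number of cores can be sketched as follows (the chunkRanges helper is a hypothetical name for illustration, not a JDK API):

```java
import java.util.Arrays;

public class RangeSplitter {
    // Splits [0, length) into 'parts' near-equal [start, end) ranges.
    static int[][] chunkRanges(int length, int parts) {
        int[][] ranges = new int[parts][2];
        int base = length / parts, rem = length % parts, start = 0;
        for (int i = 0; i < parts; i++) {
            int size = base + (i < rem ? 1 : 0); // spread the remainder evenly
            ranges[i][0] = start;
            ranges[i][1] = start + size;
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // In real code the part count would come from
        // Runtime.getRuntime().availableProcessors().
        System.out.println(Arrays.deepToString(chunkRanges(10, 4)));
        // [[0, 3], [3, 6], [6, 8], [8, 10]]
    }
}
```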
Second, your testing becomes much more complicated. You need to test your code on different configurations, including single-core, dual-core, and quad-core machines. Some bugs become very hard to reproduce, because the code no longer runs sequentially.
Third, the code itself is more complicated, and not only for the task splitting. You also need to think about system resources; your method might be called from multiple threads, and if you spawn yet more threads for every such request, you can easily bring the whole system to a crawl.
The last consideration, but certainly not the least, is that some problems, even if they appear parallelizable, don't easily lend themselves to a parallel implementation. Consider, for example, ArrayList.indexOf (which is also used in ArrayList.contains). The implementation is very simple: it scans the entire list from the beginning and returns the index of the first element that matches the passed parameter. The scan can indeed be split easily among any number of available processors, each one searching its own subrange. But the main difference between this method and Arrays.sort is the merging stage. The indexOf method needs to return the index of the first matching element. If the subtask that operates on the first list slice finds a matching element, all the other subtasks are now using the processors for no reason.
ArrayList.indexOf, unlike Arrays.sort, doesn't need the full results of all the subtasks. If more than one subtask returns a nonnegative result, the implementation should take the smallest one. Suppose you split the search into four subtasks, and the third subtask returns a matching index. You can't just stop the remaining subtasks and return it; you must wait until the first two subtasks complete in order to decide what the result is. In general, when subtask N returns a matching result, you can stop all the tasks with an index higher than N but must wait for all the tasks with an index lower than N.
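That ordered-cancellation idea can be sketched like this. This is illustrative code under the assumptions above, not how ArrayList actually works; ParallelIndexOf and its helper names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexOf {
    public static int indexOf(final List<?> list, final Object target, int parts)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
        int chunk = (list.size() + parts - 1) / parts;
        for (int i = 0; i < parts; i++) {
            final int start = i * chunk;
            final int end = Math.min(start + chunk, list.size());
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    // Scan one slice; -1 means "not found in this slice".
                    for (int j = start; j < end; j++)
                        if (target.equals(list.get(j))) return j;
                    return -1;
                }
            }));
        }
        int result = -1;
        // Collect results in slice order: the first hit wins, and every
        // later slice can be cancelled as soon as a hit is found.
        for (int i = 0; i < futures.size(); i++) {
            if (result >= 0) { futures.get(i).cancel(true); continue; }
            result = futures.get(i).get();
        }
        pool.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> data = new ArrayList<String>();
        for (int i = 0; i < 1000; i++) data.add("item" + i);
        System.out.println(indexOf(data, "item437", 4)); // 437
    }
}
```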
In conclusion
Multicore machines are the reality for client-side development. Failure to adapt will result in software that does not scale well on modern hardware. By identifying routines that lend themselves to parallelizing, your code can continue enjoying the advances of multicore machines. However, this means switching your mindset from the sequential model to the concurrent model. The concurrent model doesn't only yield performance improvements; it also comes with its own set of pitfalls and best practices. If you want to achieve competitive performance and write software that scales with the hardware, you need to dip your toes in the concurrency pool.
From the ITPUB blog: http://blog.itpub.net/71047/viewspace-996790/. Please credit the source when republishing.