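JVM Reuse in Hadoop (R2)

Hadoop has a parameter, mapred.job.reuse.jvm.num.tasks, whose default value is 1. It sets the maximum number of tasks (all belonging to the same job) that a single JVM may run in sequence. With the default of 1, every task gets its own JVM.

For example, I configured each slave node to run at most 8 map tasks and 8 reduce tasks at the same time. During the map phase, a slave node therefore starts up to 8 JVMs for map tasks, as shown below: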

root@slave1:~# jps
28291 Child
28290 Child
28281 Child
28293 Child
28277 Child
1487 DataNode
28298 Child
28273 Child
28272 Child
1636 TaskTracker
28799 Jps

root@slave1:~# ps -e | grep java
1487 ? 00:53:26 java	DataNode
1636 ? 00:12:42 java	TaskTracker
28272 ? 00:00:35 java	Child
28273 ? 00:00:35 java	Child
28277 ? 00:00:36 java	Child
28281 ? 00:00:36 java	Child
28290 ? 00:00:36 java	Child
28291 ? 00:00:37 java	Child
28293 ? 00:00:36 java	Child
28298 ? 00:00:36 java	Child

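The first two are the permanent daemon processes (DataNode and TaskTracker); the eight Child processes are the task JVMs.

Starting a new JVM for each task takes about 1 second. For jobs whose tasks run for a long time (say, a minute or more) this hardly matters, but when the tasks are all very short, repeatedly starting and stopping JVMs becomes a noticeable overhead.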
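To make the trade-off concrete, here is a minimal sketch that estimates what share of a task's wall-clock time goes to JVM startup when every task gets a fresh JVM. The class and method names are hypothetical, and the 1-second startup cost is simply the rough figure quoted above, not a measurement.

```java
public class JvmStartupOverhead {
    // Fraction of a task's wall-clock time spent starting its JVM,
    // assuming one fresh JVM per task (reuse limit of 1).
    static double overheadFraction(double startupSeconds, double taskSeconds) {
        return startupSeconds / (startupSeconds + taskSeconds);
    }

    public static void main(String[] args) {
        double startup = 1.0; // assumed: about 1 second, as stated in the text
        for (double task : new double[] {60.0, 10.0, 2.0}) {
            System.out.printf("task=%.0fs overhead=%.1f%%%n",
                    task, 100 * overheadFraction(startup, task));
        }
    }
}
```

Under this assumption a 60-second task loses under 2% of its time to startup, while a 2-second task loses about a third.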

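To improve performance with JVM reuse, set mapred.job.reuse.jvm.num.tasks to a value greater than 1. Tasks of the same job that run one after another can then share a JVM: the second wave of map tasks reuses the JVMs from the first wave, instead of those JVMs being shut down after the first wave and new ones started for the second.

How many tasks can one JVM run in sequence before it is shut down? That limit is the value of mapred.job.reuse.jvm.num.tasks. If it is set to -1, any number of tasks from the same job can run one after another in a single JVM.

JVM reuse does not apply across jobs: tasks from different jobs always run in different JVMs.

Note:

JVM reuse does not mean that two or more tasks of the same job run in one JVM at the same time. They queue up and run one after another.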
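The effect of the limit can be sketched with a small calculation. This is an idealized model, not Hadoop code: it assumes one task slot runs the tasks of a single job back to back, with no other job interleaved, and the class and method names are hypothetical.

```java
public class JvmReuseCount {
    // Number of JVMs one slot launches to run `tasks` tasks of a single job
    // in sequence, given the value of mapred.job.reuse.jvm.num.tasks.
    static long jvmsLaunched(long tasks, int reuseLimit) {
        if (tasks < 0 || reuseLimit == 0 || reuseLimit < -1) {
            throw new IllegalArgumentException("invalid task count or reuse limit");
        }
        if (tasks == 0) {
            return 0;
        }
        if (reuseLimit == -1) {
            return 1; // no limit: one JVM serves every task
        }
        return (tasks + reuseLimit - 1) / reuseLimit; // ceiling division
    }

    public static void main(String[] args) {
        long tasks = 20; // assumed number of tasks that pass through one slot
        for (int limit : new int[] {1, 5, -1}) {
            System.out.println("limit=" + limit
                    + " jvms=" + jvmsLaunched(tasks, limit));
        }
    }
}
```

With 20 tasks per slot, the default limit of 1 launches 20 JVMs, a limit of 5 launches 4, and -1 launches a single JVM.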
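The sequential behaviour can be illustrated with a toy model in which a single-threaded executor stands in for one reused JVM. This is only an analogy, assumed for illustration: Hadoop does not implement reuse this way, and all names here are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SequentialSlotModel {
    // Runs `taskCount` toy tasks on one single-threaded executor (standing in
    // for one reused JVM) and returns the highest number of tasks that were
    // ever running at the same moment.
    static int maxConcurrentTasks(int taskCount) {
        ExecutorService reusedJvm = Executors.newSingleThreadExecutor();
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            reusedJvm.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5); // pretend to do some work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        reusedJvm.shutdown();
        try {
            reusedJvm.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + maxConcurrentTasks(4));
    }
}
```

In this model, a tasktracker with 8 map slots would correspond to 8 such executors: up to 8 tasks run at once overall, but never more than one per JVM.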

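The maximum number of tasks a tasktracker can run at the same time is determined by mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, and both must be set in mapred-site.xml. These are settings of the tasktracker daemon, read when it starts, so other approaches have no effect: neither passing -Dmapred.tasktracker.map.tasks.maximum=number on the command line on the JobClient side nor calling conf.set("mapred.tasktracker.map.tasks.maximum","number") changes them.

Appendix: the description in Hadoop: The Definitive Guide: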

Task JVM Reuse

Hadoop runs tasks in their own Java Virtual Machine to isolate them from other running tasks. The overhead of starting a new JVM for each task can take around a second, which for jobs that run for a minute or so is insignificant. However, jobs that have a large number of very short-lived tasks (these are usually map tasks), or that have lengthy initialization, can see performance gains when the JVM is reused for subsequent tasks.

With task JVM reuse enabled, tasks do not run concurrently in a single JVM. The JVM runs tasks sequentially. Tasktrackers can, however, run more than one task at a time, but this is always done in separate JVMs. The properties for controlling the tasktracker's number of map task slots and reduce task slots are discussed in “Memory” on page 269.

The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks: it specifies the maximum number of tasks to run for a given job for each JVM launched; the default is 1 (see Table 6-4). Tasks from different jobs are always run in separate JVMs. If the property is set to -1, there is no limit to the number of tasks from the same job that may share a JVM. The method setNumTasksToExecutePerJvm() on JobConf can also be used to configure this property.

Tasks that are CPU-bound may also benefit from task JVM reuse by taking advantage of runtime optimizations applied by the HotSpot JVM. After running for a while, the HotSpot JVM builds up enough information to detect performance-critical sections in the code and dynamically translates the Java byte codes of these hot spots into native machine code. This works well for long-running processes, but JVMs that run for seconds or a few minutes may not gain the full benefit of HotSpot. In these cases, it is worth enabling task JVM reuse.

Another place where a shared JVM is useful is for sharing state between the tasks of a job. By storing reference data in a static field, tasks get rapid access to the shared data.
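The static-field idea can be sketched in plain Java. The example below is an illustration under assumptions, not Hadoop code: the lookup table, its contents, and every name are hypothetical, and runTask stands in for whatever a real task would do when it starts.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedReferenceData {
    // Static state lives as long as the JVM, so it survives from one task
    // to the next when the JVM is reused.
    private static Map<String, String> lookup; // hypothetical reference table
    private static int loadCount = 0;

    // Loads the table on first use only; later callers get the cached copy.
    static synchronized Map<String, String> getLookup() {
        if (lookup == null) {
            lookup = new HashMap<>();
            lookup.put("us", "United States"); // stand-in for an expensive load
            lookup.put("cn", "China");
            loadCount++;
        }
        return lookup;
    }

    static synchronized int getLoadCount() {
        return loadCount;
    }

    // Stand-in for per-task work that needs the reference data.
    static String runTask(String key) {
        return getLookup().get(key);
    }

    public static void main(String[] args) {
        runTask("us"); // first task in this JVM pays for the load
        runTask("cn"); // second task reuses the cached table
        System.out.println("loads performed: " + getLoadCount());
    }
}
```

Without JVM reuse, each task would start in a fresh JVM with an empty static field and repeat the load.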

Reposted from: http://jerrylead.iteye.com/blog/1195335

From the ITPUB blog, link: http://blog.itpub.net/26613085/viewspace-1098382/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
