MapReduce: number of mappers/reducers

Posted by Ready! on 2015-05-01

It's the other way round. The number of mappers is determined by the number of splits. In practice, creating the splits is the job of the InputFormat you are using, so you cannot know the number of mappers until the splits have been computed. Splits are also not always created based on the HDFS block size; that depends entirely on the logic inside the getSplits() method of your InputFormat.

To understand this better, assume you are processing data stored in MySQL using MR. Since there is no concept of blocks in this case, the theory that splits are always created based on the HDFS block size fails. So how are splits created then? One possibility is to create splits based on ranges of rows in your MySQL table (this is what DBInputFormat, an input format for reading data from a relational database, does). Suppose you have 100 rows. Then you might have 5 splits of 20 rows each.
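As a rough illustration of the row-range idea, the sketch below divides a row count into contiguous (offset, length) ranges. The function name `row_range_splits` is hypothetical and this is not Hadoop's actual code; it assumes the last split absorbs any leftover rows.

```python
def row_range_splits(total_rows, num_splits):
    """Divide total_rows into num_splits contiguous (offset, length) ranges."""
    chunk = total_rows // num_splits
    splits = []
    for i in range(num_splits):
        offset = i * chunk
        # Assumption: the last split takes whatever rows remain.
        length = total_rows - offset if i == num_splits - 1 else chunk
        splits.append((offset, length))
    return splits


# The 100-row example from the text: 5 splits of 20 rows each.
print(row_range_splits(100, 5))
# [(0, 20), (20, 20), (40, 20), (60, 20), (80, 20)]
```

Each range would then be handed to one mapper, so the number of ranges is the number of mappers.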

Only for InputFormats based on FileInputFormat (an InputFormat for handling data stored in files) are the splits created based on the total size, in bytes, of the input files. However, the FileSystem block size of the input files is treated as an upper bound for input splits. If you have a file smaller than the HDFS block size, you'll get only 1 mapper for that file. If you want different behavior, you can set mapred.min.split.size, which raises the lower bound on the split size. But again, this depends solely on the getSplits() of your InputFormat.
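A minimal sketch of how the block size and the minimum split size interact, assuming the split size is the block size clamped between a minimum and a maximum. The function names are hypothetical, and the split count is simplified: it ignores empty files and any tolerance the framework applies to the last split.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Clamp the block size between the minimum and maximum split sizes."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Approximate number of splits (and so mappers) for one non-empty file."""
    return (file_size + split_size - 1) // split_size  # ceiling division


default_size = compute_split_size(64 * MB)
print(count_splits(500 * MB, default_size))  # 8
print(count_splits(10 * MB, default_size))   # 1

# Raising the minimum split size above the block size yields fewer, larger splits.
bigger = compute_split_size(64 * MB, min_size=128 * MB)
print(count_splits(500 * MB, bigger))        # 4
```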

There is a fundamental difference between an MR split and an HDFS block, and people often confuse the two. A block is a physical piece of data, while a split is just a logical piece that is going to be fed to a mapper. A split does not contain the input data; it is just a reference to the data. So what is a split? A split basically has 2 things: a length in bytes and a set of storage locations, which are just hostname strings.
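To make the "reference, not data" point concrete, here is a small sketch of what a file-based split might carry. The class name, path, and hostnames are hypothetical; the `path` and `start` fields are an assumption about what a file split needs beyond the length and locations mentioned above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileSplitSketch:
    path: str                   # which file the split refers to (assumed field)
    start: int                  # byte offset where the split begins (assumed field)
    length: int                 # number of bytes covered by the split
    locations: Tuple[str, ...]  # hostnames that hold the data locally


split = FileSplitSketch(
    path="/data/input/part-00000",                         # hypothetical path
    start=0,
    length=64 * 1024 * 1024,
    locations=("node1.example.com", "node2.example.com"),  # hypothetical hosts
)
print(split.length, split.locations)
```

Note that the object holds no input bytes at all, only enough information for a mapper to find and read them.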

Coming back to your question: Hadoop allows far more than 200 mappers. Having said that, it doesn't make much sense to have 200 mappers for just 500 MB of data. Always remember that Hadoop is designed for very large data sets. Sending just 2.5 MB of data to each mapper would be overkill. And yes, if there are no free CPU slots, some mappers may only run after the current mappers complete. But the MR framework is quite intelligent and tries its best to avoid this kind of situation. If the machine that holds the data to be processed doesn't have any free CPU slots, the data will be moved to a nearby node where free slots are available and processed there.

HTH

How many mappers/reducers should be set when configuring Hadoop cluster?

There is no formula. It depends on how many cores and how much memory you have. In general, the number of mappers plus the number of reducers should not exceed the number of cores. Keep in mind that the machine is also running the TaskTracker and DataNode daemons. One general suggestion is to have more mappers than reducers. If I were you, I would run one of my typical jobs with a reasonable amount of data to try it out.

For a normal 7200 rpm disk, 2-3 mappers is a good number. For your system, with 48 GB of memory and 16 CPU threads, I/O will likely be the bottleneck. I suggest getting multiple disks for each node and setting them up as JBOD.

Quoting from "Hadoop: The Definitive Guide, 3rd edition", page 306:

Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization.

The amount of oversubscription depends on the CPU utilization of jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks (counting both map and reduce tasks) than processors.

A processor in the quote above is equivalent to one logical core.
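The rule of thumb in the quote can be sketched as simple arithmetic. The function names, the default factor of 1.5, and the 2:1 map-to-reduce ratio are all assumptions chosen for illustration; the source only says the factor lies between one and two and that mappers usually outnumber reducers.

```python
def task_slot_range(logical_cores):
    """Total task slots suggested by the rule: between 1x and 2x the cores."""
    return logical_cores, 2 * logical_cores


def split_slots(logical_cores, factor=1.5, maps_per_reduce=2):
    """Divide the total slots between map and reduce tasks (assumed ratio)."""
    if not 1 <= factor <= 2:
        raise ValueError("factor should be between 1 and 2")
    total = round(logical_cores * factor)
    reduce_slots = total // (maps_per_reduce + 1)
    map_slots = total - reduce_slots
    return map_slots, reduce_slots


# A node with 16 logical cores, as in the system described earlier.
print(task_slot_range(16))  # (16, 32)
print(split_slots(16))      # (16, 8)
```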

But this is just theory, and since each use case is most likely different from the next, some tests need to be performed. Still, this number can be a good starting point for testing.

The number of mappers is decided in accordance with the data locality principle: Hadoop tries its best to run map tasks on nodes where the data is present locally, to reduce network traffic and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to a given map task available on a single node. Since HDFS only guarantees that data up to its block size (64 MB) is present on one node, it is advisable to make the split size equal to the HDFS block size so that the map task can take advantage of this data locality. Therefore, 64 MB of data per mapper. If some mappers run for only a very short time, try to reduce the number of mappers so that each one runs longer, for a minute or so.

The number of reducers should be slightly less than the number of reduce slots in the cluster (slots are preconfigured through the JobTracker/TaskTracker properties when setting up the cluster) so that all the reducers finish in one wave and make full use of the cluster resources.

mapred.tasktracker.reduce.tasks.maximum

mapred.tasktracker.map.tasks.maximum

in mapred-site.xml

These settings apply to all jobs. To set the number of tasks for a specific job, you can use

mapred.reduce.tasks

mapred.map.tasks
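The advice to keep the reducer count slightly below the number of reduce slots can be sketched as follows. The function name is hypothetical, and the 95% factor is an assumption chosen to mean "slightly less"; the node and slot counts are made-up example values.

```python
def suggested_reducers(nodes, reduce_slots_per_node, percent=95):
    """Pick a reducer count slightly below the cluster's total reduce slots."""
    total_slots = nodes * reduce_slots_per_node
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(1, total_slots * percent // 100)


# Example: 10 nodes, each with mapred.tasktracker.reduce.tasks.maximum = 4.
reducers = suggested_reducers(nodes=10, reduce_slots_per_node=4)

# Per-job override expressed as a property name/value pair.
job_conf = {"mapred.reduce.tasks": str(reducers)}
print(job_conf)  # {'mapred.reduce.tasks': '38'}
```

With 40 slots and 38 reducers, all reducers can start at once and finish in a single wave, leaving a little headroom for failed or speculative tasks.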

Liyin Tang added a comment - 13/Nov/10 01:16

I just finished converting the common join into a map join based on the file size. There are 3 flags to control this optimization.

1) set hive.auto.convert.join = true; This enables the optimization. By default, this flag is currently disabled so as not to break any existing test cases. I also added 25 additional test cases, auto_join0.q - auto_join25.q, which cover this optimization code.

2) set hive.hashtable.max.memory.usage = 0.9; This means that if the memory usage of the local task exceeds 90% of its heap size, the local task will abort by itself. The Driver will know that the local work failed and will not submit the MapJoinTask (a map-only MapRedTask) to Hadoop; instead, it will submit the original CommonJoinTask to Hadoop to run.

3) set hive.smalltable.filesize = 25000000L; This means that if the total file size of the small table is less than 25M, the map join task will be run. If not, the original common join task is run.
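The interaction of the flags described above can be paraphrased as a small decision function. This is only a sketch of the behavior described in the comment, not Hive's code; the function and parameter names are hypothetical, and `local_heap_usage` stands for the fraction of heap the local task ends up using.

```python
def choose_join(small_table_bytes, local_heap_usage,
                auto_convert=True,
                size_threshold=25_000_000,
                max_memory_usage=0.9):
    """Return which join would run under the three flags described above."""
    if not auto_convert:
        return "common join"      # optimization disabled
    if small_table_bytes >= size_threshold:
        return "common join"      # small table is too large for a map join
    if local_heap_usage > max_memory_usage:
        return "common join"      # local task aborts, fall back
    return "map join"


print(choose_join(10_000_000, 0.5))   # map join
print(choose_join(30_000_000, 0.5))   # common join
print(choose_join(10_000_000, 0.95))  # common join
```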
