How Hadoop's GroupComparator Takes Effect (Source Code Analysis)

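Goal: understand how the GroupComparator we configure affects the key and the Iterable<value> that are passed into the reduce function.
Below is a reduce function from a job that has a GroupComparator configured. The concrete effect is that our custom GroupComparator decides which values are grouped together and passed to a single reduce call.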

public static class DividendGrowthReducer extends Reducer<Stock, DoubleWritable, NullWritable, DividendChange> {
    private NullWritable outputKey = NullWritable.get();
    private DividendChange outputValue = new DividendChange();

    @Override
    protected void reduce(Stock key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double previousDividend = 0.0;
        for (DoubleWritable dividend : values) {
            double currentDividend = dividend.get();
            double growth = currentDividend - previousDividend;
            // Emit a record only when the dividend actually changed.
            if (Math.abs(growth) > 0.000001) {
                // key.getDate() is read inside the loop, once per value.
                outputValue.setSymbol(key.getSymbol());
                outputValue.setDate(key.getDate());
                outputValue.setChange(growth);
                context.write(outputKey, outputValue);
                previousDividend = currentDividend;
            }
        }
    }
}

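First, we trace upward to find out who calls the reduce function we wrote: the run method of the Reducer class. The code below shows that run calls reduce once for each key that context.nextKey() advances to.
Everything passed into reduce here is an object reference.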

    /**
     * Advanced application writers can use the
     * {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
     * control how the reduce task works.
     */
    public void run(Context context) throws IOException, InterruptedException {
        .....
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
            .....
        }
        .....
    }
}

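Looking back at the reduce function we wrote, this means the key changes along with the values as we iterate over them: reduce only holds a reference to the key object, and the framework updates that object during iteration.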
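
The observation that the key changes while the values are being iterated can be illustrated with a small JDK-only simulation. This is a simplified sketch, not Hadoop's code: MutableKey, GroupValues and KeyReuseSketch are hypothetical names invented here, and the "deserialization in place" is modeled by simply overwriting the fields of one shared key object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class KeyReuseSketch {
    // Hypothetical mutable key, standing in for a Writable that is refilled in place.
    static final class MutableKey {
        String symbol;
        String date;
    }

    // Iterates over the values of one group. Every next() call overwrites the
    // single shared key object before returning the matching value.
    static final class GroupValues implements Iterator<Double> {
        private final MutableKey sharedKey;
        private final String symbol;
        private final List<String> dates;
        private final List<Double> values;
        private int position = 0;

        GroupValues(MutableKey sharedKey, String symbol,
                    List<String> dates, List<Double> values) {
            this.sharedKey = sharedKey;
            this.symbol = symbol;
            this.dates = dates;
            this.values = values;
            if (!values.isEmpty()) {
                load(0); // the first key is available before iteration starts
            }
        }

        private void load(int index) {
            sharedKey.symbol = symbol;
            sharedKey.date = dates.get(index);
        }

        @Override
        public boolean hasNext() {
            return position < values.size();
        }

        @Override
        public Double next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            load(position);
            return values.get(position++);
        }
    }

    // Records which date the caller sees on the key after each next() call.
    static List<String> datesSeenDuringIteration() {
        MutableKey key = new MutableKey();
        Iterator<Double> values = new GroupValues(key, "AAA",
                List.of("2010-01-01", "2010-04-01"), List.of(0.10, 0.12));
        List<String> seen = new ArrayList<>();
        while (values.hasNext()) {
            values.next();
            seen.add(key.date); // same object, different content each time
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(datesSeenDuringIteration()); // [2010-01-01, 2010-04-01]
    }
}
```

The caller never receives a new key object, yet key.date differs from one value to the next. That is the behavior the reduce function above relies on when it calls key.getDate() inside the loop.
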
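So we continue by tracing the next method of the iterator obtained from context.getValues(). Here context is a ReduceContext (an interface, ReduceContext.java); the corresponding implementation class is ReduceContextImpl (ReduceContextImpl.java).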

protected class ValueIterable implements Iterable<VALUEIN> {
    private ValueIterator iterator = new ValueIterator();

    @Override
    public Iterator<VALUEIN> iterator() {
        return iterator;
    }
}

/**
 * Iterate through the values for the current key, reusing the same value
 * object, which is stored in the context.
 * @return the series of values associated with the current key. All of the
 * objects returned directly and indirectly from this method are reused.
 */
public Iterable<VALUEIN> getValues() throws IOException, InterruptedException {
    return iterable;
}

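getValues simply returns an iterable, so we keep tracing this iterable, which is of type ValueIterable. It is now clear that iterating over the Iterable in the reduce function actually calls the next method of the ValueIterator that ValueIterable hands out. Let's look at the implementation of next.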

@Override
public VALUEIN next() {
    ………………
    nextKeyValue();
    return value;
    ………………
}

Tracing further into the nextKeyValue() method, we finally find a comparator. This is the GroupingComparator we configured.

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    ……………………………………
    if (hasMore) {
        nextKey = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                           currentRawKey.getLength(),
                                           nextKey.getData(),
                                           nextKey.getPosition(),
                                           nextKey.getLength() - nextKey.getPosition()
                                           ) == 0;
    } else {
        nextKeyIsSame = false;
    }
    inputValueCounter.increment(1);
    return true;
}
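
The nextKeyIsSame check above is what cuts the sorted record stream into groups: a group ends as soon as the comparator reports that the next key is different. The following JDK-only sketch imitates that logic on deserialized objects. StockKey and GroupingSketch are hypothetical stand-ins invented for this example, and an ordinary java.util.Comparator is assumed in place of Hadoop's byte-level comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the article's Stock key: a symbol plus a date string.
final class StockKey {
    final String symbol;
    final String date;

    StockKey(String symbol, String date) {
        this.symbol = symbol;
        this.date = date;
    }
}

class GroupingSketch {
    // Splits an already sorted record stream into groups. A group is closed
    // when there is no next record or the grouping comparator says the next
    // key is not equal to the current one (compare(...) != 0).
    static List<List<Double>> group(List<StockKey> keys, List<Double> values,
                                    Comparator<StockKey> grouping) {
        List<List<Double>> groups = new ArrayList<>();
        List<Double> current = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            current.add(values.get(i));
            boolean hasMore = i + 1 < keys.size();
            boolean nextKeyIsSame = hasMore
                    && grouping.compare(keys.get(i), keys.get(i + 1)) == 0;
            if (!nextKeyIsSame) {
                groups.add(current);
                current = new ArrayList<>();
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Records arrive sorted by symbol, then date.
        List<StockKey> keys = List.of(
                new StockKey("AAA", "2010-01-01"),
                new StockKey("AAA", "2010-04-01"),
                new StockKey("BBB", "2010-01-01"));
        List<Double> dividends = List.of(0.10, 0.12, 0.50);

        Comparator<StockKey> bySymbol =
                Comparator.comparing((StockKey k) -> k.symbol);
        Comparator<StockKey> bySymbolAndDate =
                bySymbol.thenComparing((StockKey k) -> k.date);

        // Grouping on the symbol only: both AAA records share one group.
        System.out.println(group(keys, dividends, bySymbol).size());        // 2
        // Grouping on the full key: every record is its own group.
        System.out.println(group(keys, dividends, bySymbolAndDate).size()); // 3
    }
}
```

With the symbol-only comparator, the two AAA dividends land in the same group, which corresponds to one reduce call that sees both values in sorted order.
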

To prove that this is the GroupingComparator we configured, we trace back to the caller that constructs ReduceContextImpl: the run method of ReduceTask.

@Override
@SuppressWarnings("unchecked")
public void run(JobConf job, final TaskUmbilicalProtocol umbilical){
    ………………………………
    RawComparator comparator = job.getOutputValueGroupingComparator();
    runNewReducer(job, umbilical, reporter, rIter, comparator,
                  keyClass, valueClass);
}
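
The comparator fetched here is a RawComparator, and nextKeyValue() calls it with raw key bytes (buffer, offset, length) rather than with deserialized Stock objects. As a rough illustration of what a byte-level grouping comparison could look like, the sketch below assumes a hypothetical wire format in which a key is written as writeUTF(symbol) followed by writeUTF(date). RawGroupingSketch and its methods are invented for this example and use only the JDK, not Hadoop's classes; the real layout depends on how Stock serializes itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class RawGroupingSketch {
    // Assumed wire format: writeUTF(symbol) then writeUTF(date). writeUTF
    // emits a 2-byte big-endian length followed by the encoded characters.
    static byte[] serialize(String symbol, String date) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(symbol);
            out.writeUTF(date);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Same parameter shape as the compare(...) call in nextKeyValue().
    // Only the symbol field is inspected; the date bytes that follow are
    // ignored, so l1 and l2 are not needed in this sketch.
    static int compareSymbols(byte[] b1, int s1, int l1,
                              byte[] b2, int s2, int l2) {
        int len1 = ((b1[s1] & 0xff) << 8) | (b1[s1 + 1] & 0xff);
        int len2 = ((b2[s2] & 0xff) << 8) | (b2[s2 + 1] & 0xff);
        return Arrays.compareUnsigned(b1, s1 + 2, s1 + 2 + len1,
                                      b2, s2 + 2, s2 + 2 + len2);
    }

    public static void main(String[] args) {
        byte[] a = serialize("AAA", "2010-01-01");
        byte[] b = serialize("AAA", "2010-04-01");
        byte[] c = serialize("BBB", "2010-01-01");

        // Same symbol, different date: treated as the same group.
        System.out.println(compareSymbols(a, 0, a.length, b, 0, b.length) == 0); // true
        // Different symbol: a new group starts.
        System.out.println(compareSymbols(a, 0, a.length, c, 0, c.length) == 0); // false
    }
}
```

Comparing on the serialized form avoids deserializing every key just to decide whether the next record belongs to the current group.
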

The code of runNewReducer is shown below as well.

void runNewReducer(JobConf job,
                   final TaskUmbilicalProtocol umbilical,
                   final TaskReporter reporter,
                   RawKeyValueIterator rIter,
                   RawComparator<INKEY> comparator,
                   Class<INKEY> keyClass,
                   Class<INVALUE> valueClass
                   ) {
    org.apache.hadoop.mapreduce.Reducer.Context
        reducerContext = createReduceContext(reducer, job, getTaskID(),
                                             rIter, reduceInputKeyCounter,
                                             reduceInputValueCounter,
                                             trackedRW,
                                             committer,
                                             reporter, comparator, keyClass,
                                             valueClass);

That concludes the source code analysis of how a custom GroupingComparator takes effect.

Source: ITPUB blog, link: http://blog.itpub.net/30066956/viewspace-2095520/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
