How Hadoop's GroupComparator Works (Source Code Analysis)

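Posted by self_control on 2016-05-07
Goal: understand how the GroupComparator we configure affects the key and the Iterable<value> that are passed into the reduce function.
Here is a reduce function from a job configured with a GroupComparator. The concrete effect is that our custom GroupComparator decides which values are grouped together and handed to a single reduce call.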

    public static class DividendGrowthReducer extends Reducer<Stock, DoubleWritable, NullWritable, DividendChange> {
        private NullWritable outputKey = NullWritable.get();
        private DividendChange outputValue = new DividendChange();

        @Override
        protected void reduce(Stock key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double previousDividend = 0.0;
            for (DoubleWritable dividend : values) {
                double currentDividend = dividend.get();
                double growth = currentDividend - previousDividend;
                if (Math.abs(growth) > 0.000001) {
                    outputValue.setSymbol(key.getSymbol());
                    outputValue.setDate(key.getDate());
                    outputValue.setChange(growth);
                    context.write(outputKey, outputValue);
                    previousDividend = currentDividend;
                }
            }
        }
    }
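To make the grouping effect concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It assumes the keys sort by symbol and then date, and that the grouping comparator looks at the symbol only. GroupingSketch, StockKey, SORT, GROUP and the sample data are hypothetical names invented for this illustration; they are not the classes used by the job above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    // Hypothetical stand-in for the Stock key: a symbol plus a date.
    static final class StockKey {
        final String symbol;
        final String date;

        StockKey(String symbol, String date) {
            this.symbol = symbol;
            this.date = date;
        }
    }

    // Assumed sort order: symbol first, then date.
    static final Comparator<StockKey> SORT =
        Comparator.comparing((StockKey k) -> k.symbol).thenComparing(k -> k.date);

    // Assumed grouping order: symbol only, so the date is ignored.
    static final Comparator<StockKey> GROUP =
        Comparator.comparing((StockKey k) -> k.symbol);

    // Sorts the keys, then starts a new group whenever the grouping
    // comparator reports that two adjacent keys differ.
    // Each group corresponds to one reduce() call.
    static List<List<StockKey>> group(List<StockKey> keys, Comparator<StockKey> grouping) {
        List<StockKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<StockKey>> groups = new ArrayList<>();
        StockKey previous = null;
        for (StockKey key : sorted) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<StockKey> keys = Arrays.asList(
            new StockKey("IBM", "2010-02-08"),
            new StockKey("AAPL", "2012-08-09"),
            new StockKey("IBM", "2010-05-06"));
        System.out.println(group(keys, GROUP).size()); // prints 2: AAPL and IBM
        System.out.println(group(keys, SORT).size());  // prints 3: every key is its own group
    }
}
```

With the symbol-only comparator, both IBM records land in the same group, so a single reduce call would see both dividends in date order. Grouping by the full key would instead produce one reduce call per record.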
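First, we trace upward to find out who calls the reduce function we wrote: it is the run method of the Reducer class. The code below shows that run calls reduce once for each key.
Everything passed into reduce here is an object reference.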

      /**
       * Advanced application writers can use the
       * {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
       * control how the reduce task works.
       */
      public void run(Context context) throws IOException, InterruptedException {
        // ...
        while (context.nextKey()) {
          reduce(context.getCurrentKey(), context.getValues(), context);
          // ...
        }
        // ...
      }
    }
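Looking back at the reduce function we wrote, the key changes along with the values as we iterate over them.
So we keep tracing, this time into the next method of the iterator obtained from context.getValues. Here context is a ReduceContext (an interface, ReduceContext.java), and the implementing class is ReduceContextImpl (ReduceContextImpl.java).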

      protected class ValueIterable implements Iterable<VALUEIN> {
        private ValueIterator iterator = new ValueIterator();
        @Override
        public Iterator<VALUEIN> iterator() {
          return iterator;
        }
      }

      /**
       * Iterate through the values for the current key, reusing the same value
       * object, which is stored in the context.
       * @return the series of values associated with the current key. All of the
       * objects returned directly and indirectly from this method are reused.
       */
      public
      Iterable<VALUEIN> getValues() throws IOException, InterruptedException {
        return iterable;
      }
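getValues simply returns an iterable. That iterable is of type ValueIterable, so now it is clear: when the reduce function iterates over the Iterable, it is actually calling the next method of the ValueIterator that ValueIterable hands out. Here is the implementation of next.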

        @Override
        public VALUEIN next() {
          // ...
          nextKeyValue();
          return value;
          // ...
        }
Tracing further into nextKeyValue(), we finally find a comparator. This is the GroupingComparator we configured.

      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        // ...
        if (hasMore) {
          nextKey = input.getKey();
          nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                             currentRawKey.getLength(),
                                             nextKey.getData(),
                                             nextKey.getPosition(),
                                             nextKey.getLength() - nextKey.getPosition()
                                             ) == 0;
        } else {
          nextKeyIsSame = false;
        }
        inputValueCounter.increment(1);
        return true;
      }
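The pieces traced so far (run, getValues, next and nextKeyValue) can be put together in a small plain-Java simulation. This is only a sketch under stated assumptions, not the Hadoop implementation: ReduceContextSketch, its Key class and the sample records are hypothetical, records are string arrays rather than serialized bytes, and the comparator groups by symbol only. It shows why the key seen inside a single reduce call changes while the values are being iterated.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReduceContextSketch {
    // Hypothetical mutable key, overwritten for every record
    // in the same way Hadoop reuses its key object.
    static final class Key {
        String symbol;
        String date;
    }

    private final String[][] input;              // sorted records: {symbol, date, value}
    private final Comparator<String[]> grouping; // stands in for the grouping comparator
    private final Key key = new Key();           // the single reused key object
    private String value;
    private int next = 0;                        // index of the next unread record
    private boolean firstValue = false;
    private boolean nextKeyIsSame = false;

    ReduceContextSketch(String[][] sortedInput, Comparator<String[]> grouping) {
        this.input = sortedInput;
        this.grouping = grouping;
    }

    // Skips any unread values of the current group, then loads the next key.
    boolean nextKey() {
        while (next < input.length && nextKeyIsSame) {
            nextKeyValue();
        }
        return next < input.length && nextKeyValue();
    }

    // Loads one record into the reused key/value and peeks at the following key.
    boolean nextKeyValue() {
        firstValue = !nextKeyIsSame;
        String[] current = input[next++];
        key.symbol = current[0];
        key.date = current[1];
        value = current[2];
        nextKeyIsSame = next < input.length && grouping.compare(current, input[next]) == 0;
        return true;
    }

    Iterable<String> getValues() {
        return () -> new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return firstValue || nextKeyIsSame;
            }

            @Override
            public String next() {
                if (firstValue) {       // the first value was already loaded by nextKey()
                    firstValue = false;
                    return value;
                }
                if (!nextKeyIsSame) {
                    throw new NoSuchElementException();
                }
                nextKeyValue();         // advancing the value also overwrites the key
                return value;
            }
        };
    }

    // Plays the role of Reducer.run(): one inner list per reduce() call.
    static List<List<String>> run(String[][] sortedInput) {
        Comparator<String[]> bySymbol = Comparator.comparing((String[] r) -> r[0]);
        ReduceContextSketch context = new ReduceContextSketch(sortedInput, bySymbol);
        List<List<String>> calls = new ArrayList<>();
        while (context.nextKey()) {
            List<String> seen = new ArrayList<>();
            for (String v : context.getValues()) {
                seen.add(context.key.symbol + "@" + context.key.date + "=" + v);
            }
            calls.add(seen);
        }
        return calls;
    }

    public static void main(String[] args) {
        String[][] sorted = {
            {"AAPL", "2012-08-09", "2.65"},
            {"IBM", "2010-02-08", "0.55"},
            {"IBM", "2010-05-06", "0.65"}};
        System.out.println(run(sorted));
    }
}
```

Running main prints `[[AAPL@2012-08-09=2.65], [IBM@2010-02-08=0.55, IBM@2010-05-06=0.65]]`. The second inner list is a single simulated reduce call in which the key's date moves from 2010-02-08 to 2010-05-06 as the values advance, because the same key object is refilled on every step.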
To prove that this really is the GroupingComparator we configured, we trace the caller of the ReduceContextImpl constructor back to the run method of ReduceTask.

      @Override
      @SuppressWarnings("unchecked")
      public void run(JobConf job, final TaskUmbilicalProtocol umbilical){
        // ...
        RawComparator comparator = job.getOutputValueGroupingComparator();
        runNewReducer(job, umbilical, reporter, rIter, comparator,
                        keyClass, valueClass);
      }
The code of runNewReducer is shown below as well.

    void runNewReducer(JobConf job,
                         final TaskUmbilicalProtocol umbilical,
                         final TaskReporter reporter,
                         RawKeyValueIterator rIter,
                         RawComparator<INKEY> comparator,
                         Class<INKEY> keyClass,
                         Class<INVALUE> valueClass
                         ) {

        org.apache.hadoop.mapreduce.Reducer.Context
             reducerContext = createReduceContext(reducer, job, getTaskID(),
                                                   rIter, reduceInputKeyCounter,
                                                   reduceInputValueCounter,
                                                   trackedRW,
                                                   committer,
                                                   reporter, comparator, keyClass,
                                                   valueClass);
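The comparator handed down here is a RawComparator, which is why the compare call in nextKeyValue takes a byte array, an offset and a length for each key instead of deserialized objects. The sketch below mimics that parameter shape in plain Java. It assumes a hypothetical key layout of two UTF strings (symbol, then date) written with DataOutputStream; the real Stock class may serialize differently, and RawGroupingSketch is an illustrative name, not a Hadoop class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RawGroupingSketch {
    // Hypothetical serialization of a (symbol, date) key: two UTF strings.
    static byte[] serialize(String symbol, String date) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(symbol);
            out.writeUTF(date);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads only the leading symbol from the key stored in b[start, start + length).
    static String readSymbol(byte[] b, int start, int length) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(b, start, length));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same parameter shape as the compare call in nextKeyValue:
    // (bytes, offset, length) for each of the two keys.
    // Returns 0 when both keys carry the same symbol, whatever their dates.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return readSymbol(b1, s1, l1).compareTo(readSymbol(b2, s2, l2));
    }

    public static void main(String[] args) {
        byte[] ibmFeb = serialize("IBM", "2010-02-08");
        byte[] ibmMay = serialize("IBM", "2010-05-06");
        byte[] aapl = serialize("AAPL", "2012-08-09");
        // Same symbol, different dates: treated as the same group.
        System.out.println(compare(ibmFeb, 0, ibmFeb.length, ibmMay, 0, ibmMay.length) == 0); // prints true
        // Different symbols: a new group starts.
        System.out.println(compare(aapl, 0, aapl.length, ibmFeb, 0, ibmFeb.length) == 0);     // prints false
    }
}
```

A result of 0 is what sets nextKeyIsSame to true in the traced code, so any two adjacent keys for which the comparator returns 0 end up in the same reduce call.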
That concludes the code analysis of how a custom GroupingComparator takes effect.






Source: ITPUB Blog, link: http://blog.itpub.net/30066956/viewspace-2095520/. If you reprint this article, please credit the source; otherwise legal liability will be pursued.
