Hadoop TeraSort Algorithm: Clearing Up When the 2-Level Trie Is Built

Posted by HamaWhite on 2014-03-06

Tags: Hadoop, algorithms

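Preface:

Recently I needed to partition large graph data with Metis or ParMetis, which requires the input undirected graph to be sorted by vertex ID. That led me to sort the graph with Hadoop's TeraSort algorithm.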

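Note:

The question this post answers is: who calls the configure(JobConf job) method of TotalOrderPartitioner, the class nested inside TeraSort, and how?

This is a matter of detail. Put kindly, it is "getting to the bottom of things"; put unkindly, it is "splitting hairs". Still, I believe that within the limits of our ability and time we should understand every detail and do our work thoroughly.

My QQ: 530422429. Corrections and discussion are welcome.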

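Main text:

After studying the TeraSort source code, I had a basic grasp of its ideas and algorithm. The TotalOrderPartitioner class implements the Partitioner and JobConfigurable interfaces, providing getPartition() and configure(). The configure() method is as follows: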

public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.getLocal(job);
        Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }
    }

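As the code shows, each MapTask reads the split points from the distributed cache and calls buildTrie() to build a 2-level trie. The MapTask then reads records from its split one by one and uses the trie to look up the reduce task number for each record. The trie must therefore be built before map() is first called. The question is: who calls configure(JobConf job), and how?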
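
To make the lookup step concrete: the trie acts as a prefix index over the sorted split points, but the decision it reaches is the same one a plain search over those split points would reach. The sketch below is not TeraSort's code. It swaps the trie for a binary search, uses String keys instead of Text, and assumes (for this sketch only) that a key equal to a split point goes to the lower partition. The class name SplitPointLookup is hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration; this is not how TeraSort itself is written.
class SplitPointLookup {

    // Returns the partition (reduce task) index for key, given sorted split points.
    // With n split points there are n + 1 partitions, numbered 0..n.
    static int getPartition(String key, String[] sortedSplitPoints) {
        int pos = Arrays.binarySearch(sortedSplitPoints, key);
        // When the key is absent, binarySearch returns -(insertionPoint) - 1.
        // Assumption of this sketch: a key equal to a split point stays in the
        // lower partition, so an exact match at index pos maps to partition pos.
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "n", "t"};
        for (String key : new String[] {"a", "h", "n", "z"}) {
            System.out.println(key + " -> partition " + getPartition(key, splitPoints));
        }
    }
}
```

With three split points there are four partitions, so such a job would run four reduce tasks.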

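1. Open the Call Hierarchy view for configure(JobConf job) to inspect its callers; the result is shown in the figure below. Surprisingly, there are no callers at all. So how does the MapTask actually build the 2-level trie? Puzzled, I kept digging.

2. When the map output is written, it is partitioned, which must call getPartition() on the TotalOrderPartitioner object. So I looked for the point where the TotalOrderPartitioner object is constructed. My guess was that configure(JobConf job) is called directly, right after construction. Because the TeraSort job does not set a mapper, Hadoop's default IdentityMapper is used; it does no processing on the input and writes each key-value pair straight to the output. The IdentityMapper class is as follows: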

/** Implements the identity function, mapping inputs directly to outputs. 
 */
public class IdentityMapper<K, V>
    extends MapReduceBase implements Mapper<K, V, K, V> {

  /** The identify function.  Input key/value pair is written directly to output.*/
  public void map(K key, V val,
                  OutputCollector<K, V> output, Reporter reporter)
    throws IOException {
    output.collect(key, val);
  }
}

So I inspected the call hierarchy of the map() method above, as shown in the figure below:

In MapTask's runOldMapper(…) method, the core fragment is:

 try {
      runner.run(in,new OldOutputCollector(collector, conf), reporter);
      collector.flush();
    }

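The argument list of runner.run(…) creates an OldOutputCollector object, so step into its constructor. As shown below, and just as guessed, the Partitioner (TotalOrderPartitioner) object is constructed here.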

    @SuppressWarnings("unchecked")
    OldOutputCollector(MapOutputCollector<K,V> collector, JobConf conf) {
      numPartitions = conf.getNumReduceTasks();
      if (numPartitions > 0) {
          partitioner = (Partitioner<K,V>)
          ReflectionUtils.newInstance(conf.getPartitionerClass(), conf);
      } else {
        partitioner = new Partitioner<K,V>() {
          @Override
          public void configure(JobConf job) { }
          @Override
          public int getPartition(K key, V value, int numPartitions) {
            return -1;
          }
        };
      }
      this.collector = collector;
    }

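As you can see, the TotalOrderPartitioner object is created with Hadoop's reflection utility class ReflectionUtils (note: Hadoop generally creates its pluggable objects this way), but there is no visible call to configure(JobConf job) here.

3. Frustrated and puzzled, I stepped into the newInstance(…) method of ReflectionUtils, shown below: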

  @SuppressWarnings("unchecked")
  public static <T> T newInstance(Class<T> theClass, Configuration conf) {
    T result;
    try {
      Constructor<T> meth = (Constructor<T>) CONSTRUCTOR_CACHE.get(theClass);
      if (meth == null) {
        meth = theClass.getDeclaredConstructor(EMPTY_ARRAY);
        meth.setAccessible(true);
        CONSTRUCTOR_CACHE.put(theClass, meth);
      }
      result = meth.newInstance();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    setConf(result, conf);
    return result;
  }

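Reading the code, once the object result has been created through Java reflection, configure(JobConf job) still has not been called. However, setConf(result, conf) is called next, so with nothing else to go on I stepped into that method. The code is: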

  public static void setConf(Object theObject, Configuration conf) {
    if (conf != null) {
      if (theObject instanceof Configurable) {
        ((Configurable) theObject).setConf(conf);
      }
      setJobConf(theObject, conf);
    }
  }

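The argument theObject is an instance of the Partitioner and JobConfigurable interfaces, not of Configurable, so the instanceof branch is skipped. Execution then reaches setJobConf(theObject, conf), which is called whenever conf is non-null.

4. Now look at the code of setJobConf(theObject, conf). It obtains a Method object from the Class object of the JobConfigurable interface, then dynamically invokes configure(…) with the instance theObject and the argument conf. At this point the fog finally lifted and the puzzle was solved.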

private static void setJobConf(Object theObject, Configuration conf) {
    //If JobConf and JobConfigurable are in classpath, AND
    //theObject is of type JobConfigurable AND
    //conf is of type JobConf then
    //invoke configure on theObject
    try {
      Class<?> jobConfClass = 
        conf.getClassByName("org.apache.hadoop.mapred.JobConf");
      Class<?> jobConfigurableClass = 
        conf.getClassByName("org.apache.hadoop.mapred.JobConfigurable");
       if (jobConfClass.isAssignableFrom(conf.getClass()) &&
            jobConfigurableClass.isAssignableFrom(theObject.getClass())) {
       Method configureMethod = 
          jobConfigurableClass.getMethod("configure", jobConfClass);
       configureMethod.invoke(theObject, conf);
      }
    } catch (ClassNotFoundException e) {
      //JobConf/JobConfigurable not in classpath. no need to configure
    } catch (Exception e) {
      throw new RuntimeException("Error in configuring object", e);
    }
  }
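
The chain traced in steps 2 to 4 can be reproduced without Hadoop. The following is a minimal sketch, not Hadoop's implementation: MiniConf, MiniJobConfigurable, MiniPartitioner and MiniReflectionUtils are hypothetical stand-ins for JobConf, JobConfigurable, TotalOrderPartitioner and ReflectionUtils, and the constructor cache and the Configurable branch are left out.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

// Hypothetical stand-in for JobConf; not a Hadoop class.
class MiniConf {
    final String partitionFile;
    MiniConf(String partitionFile) { this.partitionFile = partitionFile; }
}

// Hypothetical stand-in for JobConfigurable.
interface MiniJobConfigurable {
    void configure(MiniConf conf);
}

// Plays the role of TotalOrderPartitioner: configure() must run before use.
class MiniPartitioner implements MiniJobConfigurable {
    String loadedFrom; // stays null until configure() is called

    @Override
    public void configure(MiniConf conf) {
        loadedFrom = conf.partitionFile;
    }
}

class MiniReflectionUtils {
    // Same shape as newInstance(): construct reflectively, then configure.
    static <T> T newInstance(Class<T> theClass, MiniConf conf) {
        try {
            Constructor<T> ctor = theClass.getDeclaredConstructor();
            ctor.setAccessible(true);
            T result = ctor.newInstance();
            setJobConf(result, conf);
            return result;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Same shape as setJobConf(): look up configure() on the interface and
    // invoke it only when the object actually implements that interface.
    static void setJobConf(Object theObject, MiniConf conf) {
        Class<?> configurable = MiniJobConfigurable.class;
        if (conf == null || !configurable.isAssignableFrom(theObject.getClass())) {
            return;
        }
        try {
            Method configure = configurable.getMethod("configure", MiniConf.class);
            configure.invoke(theObject, conf);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Error in configuring object", e);
        }
    }

    public static void main(String[] args) {
        MiniPartitioner p = newInstance(MiniPartitioner.class, new MiniConf("partitions.txt"));
        System.out.println("configured from: " + p.loadedFrom);
    }
}
```

The object returned by newInstance has already had configure() run on it, even though no source line calls configure() by name. That is presumably why the Call Hierarchy view in step 1 found no callers.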

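5. The truth above was reached by following the call chain and reasoning step by step. In a word: hard-won.

Lesson learned: when the callers of a method cannot be found, consider that it may be invoked through Java reflection.

Here is a simpler way to obtain the call chain of configure(…): print the call stack from inside the method, as follows: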

 public void configure(JobConf job) {
      try {
        FileSystem fs = FileSystem.getLocal(job);
        Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
        Exception e=new Exception("this is a log");
        e.printStackTrace();
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }
    }

Analyzing the resulting log gives the call chain, which matches the analysis above, as shown in the figure below.
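
The same idea also works without reading console output: a Throwable's stack trace can be captured and inspected in code. The sketch below uses a hypothetical class, StackTraceProbe, that has nothing to do with Hadoop. It invokes configure() through reflection and checks whether a java.lang.reflect.Method frame appears in the captured trace, which is the kind of frame one would expect to find in the log described above.

```java
import java.lang.reflect.Method;

// Hypothetical demo class; only the stack-trace technique is the point here.
class StackTraceProbe {
    static StackTraceElement[] captured = new StackTraceElement[0];

    // The method whose caller we want to discover.
    public void configure(String name) {
        // A Throwable records the call chain at the point where it is created.
        captured = new Throwable("who called configure()?").getStackTrace();
    }

    // True if any captured frame belongs to java.lang.reflect.Method.
    static boolean calledViaReflection() {
        for (StackTraceElement frame : captured) {
            if (frame.getClassName().equals("java.lang.reflect.Method")) {
                return true;
            }
        }
        return false;
    }

    // Calls configure() reflectively, the way setJobConf() does.
    static void invokeReflectively() {
        try {
            Method m = StackTraceProbe.class.getMethod("configure", String.class);
            m.invoke(new StackTraceProbe(), "demo");
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        invokeReflectively();
        for (StackTraceElement frame : captured) {
            System.out.println("  at " + frame);
        }
        System.out.println("reflection frame present: " + calledViaReflection());
    }
}
```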

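End of main text.

Appendix: a question about Method in Java reflection.

Previously, when I used Java reflection to run a method by name, I always obtained the Method object from class A and then invoked it on an instance of class A. A closer look at setJobConf(theObject, conf), however, shows that it obtains the Method object from the Class object of the JobConfigurable interface. The test code below confirms that this approach works.

Definition of the JobConfigurable interface: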

package com.test;

/**
 * 
 * @author baisong
 *
 */
public interface JobConfigurable {
	//for test,the parameter type is String
	void configure(String name);
}

Definition of JobConfigurableImpl:

package com.test;

/**
 * 
 * @author baisong
 *
 */
public class JobConfigurableImpl implements JobConfigurable{
	@Override
	public void configure(String name) {
		System.out.println("Your name is "+name);
	}
}

The test class, written to mimic the Hadoop source code:

package com.test;

import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

/**
 * we only know the name of JobConfigurable.class and JobConfigurableImpl.class
 *
 * Note:we can't make the JobConfigurable and JobConfigurableImpl object directly.
 * 
 * @author baisong
 *
 */
public class MapTaskTest {
	
	public static void main(String[] args) throws Exception {
		// make the JobConfigurableImpl object
		Constructor<?> meth=JobConfigurableImpl.class.getDeclaredConstructor();
		Object obj=meth.newInstance();
		
		Class<?> JobConfigurableClass=Class.forName("com.test.JobConfigurable");
		Method configureMethod=JobConfigurableClass.getMethod("configure", String.class);
		
		// invoke configure on the obj
		configureMethod.invoke(obj, "BaiSong");
	}
}

The output is:

Your name is BaiSong

Conclusion: a Method object can be obtained from an interface and invoked on an instance of an implementing class.
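
One caveat is worth seeing in code: a Method obtained from an interface can only be invoked on objects that implement that interface; otherwise Method.invoke throws IllegalArgumentException. This is presumably why setJobConf checks isAssignableFrom before invoking. A small sketch with hypothetical names (Greeter, EnglishGreeter, InterfaceMethodDemo):

```java
import java.lang.reflect.Method;

// Hypothetical interface and implementation, used only for this demo.
interface Greeter {
    void greet(String name);
}

class EnglishGreeter implements Greeter {
    static String last; // records the most recent greeting

    @Override
    public void greet(String name) {
        last = "Hello, " + name;
    }
}

class InterfaceMethodDemo {
    // Invokes Greeter.greet on target through a Method taken from the interface.
    // Returns false when target does not implement Greeter.
    static boolean tryGreet(Object target, String name) {
        try {
            Method greet = Greeter.class.getMethod("greet", String.class);
            greet.invoke(target, name);
            return true;
        } catch (IllegalArgumentException e) {
            // target is not an instance of the interface declaring the method
            return false;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tryGreet(new EnglishGreeter(), "BaiSong"));
        System.out.println(tryGreet("not a greeter", "BaiSong"));
    }
}
```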
