We ran into quite a few problems while testing Hive 0.14.0 on Tez:
1. When testing with cdh5.2.0 + hive0.14.0 + tez-0.5.0, the first problem we hit was the following:

java.lang.NoSuchMethodError: org.apache.tez.dag.api.client.Progress.getFailedTaskAttemptCount()I
        at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.printStatusInPlace(TezJobMonitor.java:613)
        at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.monitorExecution(TezJobMonitor.java:311)
        at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:167)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:994)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:247)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

From the stack trace we can see that the error is thrown after the Tez job has been submitted. In org.apache.hadoop.hive.ql.exec.tez.TezTask, after the job is submitted via the submit method, a TezJobMonitor object is instantiated to track the progress of the Tez job:

// submit will send the job to the cluster and start executing
client = submit(jobConf, dag, scratchDir, appJarLr, session,
additionalLr, inputOutputJars, inputOutputLocalResources);
// finally monitor will print progress until the job is done
TezJobMonitor monitor = new TezJobMonitor();
rc = monitor.monitorExecution(client, ctx.getHiveTxnManager(), conf, dag);

In the TezJobMonitor.monitorExecution method:

boolean isProfileEnabled = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_SUMMARY); // hive.tez.exec.print.summary, default false
boolean inPlaceUpdates = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_INPLACE_PROGRESS); // hive.tez.exec.inplace.progress, default true
boolean wideTerminal = false;
boolean isTerminal = inPlaceUpdates == true ? isUnixTerminal() : false;
// we need at least 80 chars wide terminal to display in-place updates properly
if (isTerminal) {
  if (getTerminalWidth() >= MIN_TERMINAL_WIDTH) {
    wideTerminal = true;
  }
}
boolean inPlaceEligible = false;
if (inPlaceUpdates && isTerminal && wideTerminal && !console.getIsSilent()) {
  inPlaceEligible = true;
}
// then it enters a while loop that checks the job status and calls printStatusInPlace or printStatus (printStatus eventually calls getReport)
......
case RUNNING:
  if (!running) {
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.TEZ_SUBMIT_TO_RUNNING);
    console.printInfo("Status: Running (" + dagClient.getExecutionContext() + ")
");
    startTime = System.currentTimeMillis();
    running = true;
  }
  if (inPlaceEligible) {
    printStatusInPlace(progressMap, startTime, false, dagClient);
    // log the progress report to log file as well
    lastReport = logStatus(progressMap, lastReport, console);
  } else {
    lastReport = printStatus(progressMap, lastReport, console);
  }
  break;

For example, in the printStatusInPlace method:

SortedSet<String> keys = new TreeSet<String>(progressMap.keySet());
int idx = 0;
int maxKeys = keys.size();
for (String s : keys) {
   idx++;
   Progress progress = progressMap.get(s);
   final int complete = progress.getSucceededTaskCount();
   final int total = progress.getTotalTaskCount();
   final int running = progress.getRunningTaskCount();
   final int failed = progress.getFailedTaskAttemptCount(); // calls Progress.getFailedTaskAttemptCount() to get the number of failed task attempts
   final int pending = progress.getTotalTaskCount() - progress.getSucceededTaskCount() -
       progress.getRunningTaskCount();
   final int killed = progress.getKilledTaskCount();

In Tez 0.5.0, the org.apache.tez.dag.api.client.Progress class has no getFailedTaskAttemptCount method; it was only added in Tez 0.5.2. So to use Hive 0.14.0, you need Tez 0.5.2 or later.
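Before upgrading, a quick way to confirm which API the installed Tez exposes is a small reflective probe (an illustrative sketch, not part of Hive or Tez; the class name ProgressApiCheck is made up), run with the Tez jars on the classpath:

// Illustrative probe: does the Tez Progress class on this classpath already
// provide getFailedTaskAttemptCount(), i.e. is the installed Tez new enough
// for Hive 0.14.0's TezJobMonitor?
public class ProgressApiCheck {
  public static void main(String[] args) throws Exception {
    Class<?> progress = Class.forName("org.apache.tez.dag.api.client.Progress");
    try {
      progress.getMethod("getFailedTaskAttemptCount");
      System.out.println("getFailedTaskAttemptCount() present -- Tez 0.5.2+ API");
    } catch (NoSuchMethodException e) {
      System.out.println("getFailedTaskAttemptCount() missing -- Tez 0.5.0/0.5.1 API");
    }
  }
}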

2. After upgrading to hive0.14.0 + tez-0.5.2, we hit the following error:

15/01/13 14:09:21 INFO client.TezClient: The url to track the Tez Session: http://xxxx:8042/proxy/application_1416818587155_0049/
Exception in thread "main" java.lang.RuntimeException: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:457)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
        at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:599)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)
        at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122)
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454)
        ... 7 more

As the stack trace shows, this is caused by an exception during session initialization; the exception is thrown from the TezSessionState.open method:

....
  try {
    session.waitTillReady();
  } catch(InterruptedException ie) {
    //ignore
  }

Here session is an instance of TezClient. In the TezClient.waitTillReady method:

public synchronized void waitTillReady() throws IOException, TezException, InterruptedException {
  if (!isSession) {
    // nothing to wait for in non-session mode
    return;
  }
  verifySessionStateForSubmission();
  while (true) {
    TezAppMasterStatus status = getAppMasterStatus(); // getAppMasterStatus() returned TezAppMasterStatus.SHUTDOWN here
    if (status.equals(TezAppMasterStatus.SHUTDOWN)) {
      throw new SessionNotRunning("TezSession has already shutdown");
    }
    if (status.equals(TezAppMasterStatus.READY)) {
      return;
    }
    Thread.sleep(SLEEP_FOR_READY);
  }
}
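For reference, here is a minimal standalone sketch of driving a session-mode TezClient through the same path (this is not Hive's code; the class name SessionProbe is made up). With session mode enabled, waitTillReady() polls getAppMasterStatus() and throws SessionNotRunning as soon as the AM reports SHUTDOWN:

import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.SessionNotRunning;
import org.apache.tez.dag.api.TezConfiguration;

public class SessionProbe {
  public static void main(String[] args) throws Exception {
    TezConfiguration tezConf = new TezConfiguration();
    // true = session mode, the same mode Hive uses for its Tez sessions
    TezClient client = TezClient.create("session-probe", tezConf, true);
    client.start();            // asks YARN to launch the Tez AM
    try {
      client.waitTillReady();  // loops on getAppMasterStatus() until READY
      System.out.println("Tez AM is ready");
    } catch (SessionNotRunning e) {
      // the "TezSession has already shutdown" case: the AM exited before reaching READY
      System.err.println("AM never became ready: " + e.getMessage());
    } finally {
      client.stop();
    }
  }
}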

The TezClient here was created in session mode, and getAppMasterStatus returned TezAppMasterStatus.SHUTDOWN, which is why waitTillReady threw the exception: the Tez AM did not start up properly. Checking the NodeManager logs revealed the following error:

2015-01-13 16:27:58,162 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1416818587155_0060_01_000001 and exit code: 1
ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:196)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

So the container that launches the AM failed on startup. Checking the corresponding container log:

2015-01-13 17:34:59,731 FATAL [main] app.DAGAppMaster: Error starting DAGAppMaster
java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.lang.Class.getDeclaredConstructors0(Native Method)
        at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
        at java.lang.Class.getConstructor0(Class.java:2699)
        at java.lang.Class.getConstructor(Class.java:1657)
        at org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62)
        at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36)
        at org.apache.hadoop.yarn.api.records.ApplicationId.newInstance(ApplicationId.java:49)
        at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
        at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
        at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1794)

This looks like a protobuf compatibility problem. cdh5.2.0 uses protobuf-java-2.5.0.jar by default, hive0.14.0 also uses protobuf-java-2.5.0.jar, and tez 0.5.2 is compiled against pb 2.5.0 as well, so in theory there should be no pb compatibility issue. The suspicion was that a 2.4.0a protobuf was being loaded when the Tez AM started, so we needed to look at the launch command and find the actual classpath.
To see the shell scripts that launch the AM, we modified the org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor class to add a Thread.sleep, recompiled the cdh5.2.0 package (this mainly requires Java 7, version range [1.7.0,1.7.1000}]; skip the native build: mvn package -DskipTests -Pdist -Dtar -e -X), and replaced ./share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.5.0-cdh5.2.0.jar for the test.
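The change itself is just a delay; a minimal sketch of what it might look like (the exact insertion point inside DefaultContainerExecutor.launchContainer in the CDH 5.2.0 source may differ, and the sleep length is arbitrary):

// Inside DefaultContainerExecutor#launchContainer, before the wrapper script is
// executed: pause so that the generated default_container_executor.sh /
// launch_container.sh stay on disk long enough to be inspected.
try {
  Thread.sleep(120 * 1000L);  // hypothetical 2-minute delay
} catch (InterruptedException ie) {
  Thread.currentThread().interrupt();
}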
The shell invocation chain is as follows:

default_container_executor.sh-->default_container_executor_session.sh-->launch_container.sh

And in the launch_container.sh script:

export HADOOP_COMMON_HOME="/home/vipshop/platform/hadoop-2.5.0-cdh5.2.0"  # set the relevant environment variables first
export CLASSPATH="$PWD:$PWD/*:$HADOOP_CONF_DIR:" # CLASSPATH is reset here
export HADOOP_TOKEN_FILE_LOCATION="/home/vipshop/hard_disk/7/yarn/local/usercache/hdfs/appcache/application_1416818587155_0075/container_1416818587155_0075_01_000001/container_tokens"
....
ln -sf "/home/vipshop/hard_disk/10/yarn/local/filecache/42/hadoop-yarn-api-2.5.0.jar" "hadoop-yarn-api-2.5.0.jar"  # symlink the relevant jars into the container's local directory
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
.....
exec /bin/bash -c "$JAVA_HOME/bin/java -Xmx819m -server \
  -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN \
  -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA \
  -XX:+UseParallelGC -Dlog4j.configuration=tez-container-log4j.properties \
  -Dyarn.app.container.log.dir=/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001 \
  -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' \
  org.apache.tez.dag.app.DAGAppMaster --session \
  1>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stdout \
  2>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stderr"
# finally it runs java org.apache.tez.dag.app.DAGAppMaster, i.e. the main method of
# org.apache.tez.dag.app.DAGAppMaster, which starts the DAGAppMaster

CLASSPATH is the directory the script runs in (the container's working directory), for example here:

CLASSPATH='/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1416818587155_0079/container_1416818587155_0079_01_000001:/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/application_1416818587155_0079/container_1416818587155_0079_01_000001/*:/home/vipshop/conf:'

Searching the script's working directory for jars that bundle protobuf, we found a hive-solr jar that embeds protobuf, and its pb version turned out to be 2.4.0a:

for i in `find . -name "*.jar"`; do echo $i `jar -tvf $i|grep GeneratedMessage|wc -l`; done | awk '{if($2>0) print}'
./protobuf-java-2.5.0.jar 31  //2.5.0
./hive-exec-0.14.0-dfffe4217f40bd764977b741ad970a562e07fb99992f0180620bd13f68a2577b.jar 31 //2.5.0
./hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar  //2.4.0a
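A complementary runtime check (a hedged sketch, not part of the original investigation; the class name PbSourceCheck is made up) is to ask a JVM started with the same classpath the container uses which jar actually supplies protobuf's GeneratedMessage:

public class PbSourceCheck {
  public static void main(String[] args) {
    // prints the jar (code source) the classloader resolved GeneratedMessage from
    Class<?> gm = com.google.protobuf.GeneratedMessage.class;
    System.out.println(gm.getProtectionDomain().getCodeSource().getLocation());
  }
}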

So when the container started, the classloader picked up the 2.4.0a protobuf classes, which is what made the container launch fail. After rebuilding that jar against protobuf 2.5.0, Hive on Tez ran normally.