Various problems encountered when starting Hadoop

Posted by Joyven on 2016-09-07

1. HDFS initialized but not `healthy` yet, waiting…

This message appears in the JobTracker's log file when Hadoop starts. HDFS has a problem that keeps the DataNodes from coming up. The only fix that worked here was to delete everything under the paths the NameNode manages and then re-run namenode -format. The places to clear are the tmp path holding temporary data, the data path, and the name path; after deleting all of them and reformatting, the problem was gone.
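A sketch of that cleanup as shell commands. The paths below are placeholders, not the article's actual configuration; substitute the directories set by dfs.name.dir, dfs.data.dir, and hadoop.tmp.dir in your own config files.

```shell
# Stop the cluster before touching storage directories
./stop-all.sh

# Placeholder paths -- substitute the values from hdfs-site.xml
# (dfs.name.dir, dfs.data.dir) and core-site.xml (hadoop.tmp.dir)
rm -rf /opt/hadoop/tmp/*        # temporary data
rm -rf /opt/hadoop/dfs/data/*   # DataNode block storage
rm -rf /opt/hadoop/dfs/name/*   # NameNode metadata

# Re-format the NameNode -- this destroys all HDFS data
./hadoop namenode -format
./start-all.sh
```

Be aware this wipes the entire filesystem contents; it is a last resort, not a routine restart procedure.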

2. `Name node is in safe mode` when running a Hadoop program

This exception is usually printed straight to the IDE console. The main cause is that DataNodes keep losing data, so the NameNode forces itself into safe mode, in which data can only be read, not written. The fix is simple: run a command that makes the NameNode leave safe mode.

./hadoop dfsadmin -safemode leave
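On Hadoop 2.x the `hadoop dfsadmin` entry point is deprecated in favor of `hdfs dfsadmin`; the equivalent commands are:

```shell
hdfs dfsadmin -safemode get     # report whether safe mode is on
hdfs dfsadmin -safemode leave   # force the NameNode out of safe mode
```

Checking with `get` first is worthwhile: if the NameNode is in safe mode because DataNodes genuinely have not reported enough blocks, forcing it out only hides the underlying problem.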

3. java.io.FileNotFoundException: /data/dfs/namesecondary/in_use.lock (Permission denied):

2016-09-07 10:18:42,902 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: SecondaryNameNode metrics system started
2016-09-07 10:18:43,053 FATAL org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Failed to start secondary namenode
java.io.FileNotFoundException: /data/dfs/namesecondary/in_use.lock (Permission denied)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:706)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:678)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:499)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.recoverCreate(SecondaryNameNode.java:962)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:243)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:192)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:671)
2016-09-07 10:18:43,056 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-09-07 10:18:43,057 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at joyven/192.168.2.35
************************************************************/

This shows up in two scenarios:

1) Hadoop had been running normally, but was then started once by a different user than usual. That leaves an in_use.lock file behind in the configured directory; delete the lock file and restart.

2) The user who formatted Hadoop is not the same as the user starting it now. Restarting Hadoop as the user who did the formatting also resolves the error.
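For the ownership case, an alternative to switching users is to hand the directory back to the account that normally runs Hadoop. A sketch, assuming that account is named `hadoop` (a placeholder; substitute yours) and using the checkpoint path from the log above:

```shell
# Give the checkpoint directory (and everything under it) back to the
# user that runs the SecondaryNameNode; "hadoop" is a placeholder name
sudo chown -R hadoop:hadoop /data/dfs/namesecondary

# Or, for the stale-lock case, just remove the lock file and restart
sudo rm /data/dfs/namesecondary/in_use.lock
```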

4. hadoop /tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

After starting the cluster, the NameNode came up, but none of the DataNodes on the slave nodes did. The NameNode log shows the error:

INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call addBlock(/opt/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_502181644) from 127.0.0.1:2278: error: java.io.IOException: File /opt/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1  
java.io.IOException: File /opt/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)   
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)   
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)   
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)   
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)   
    at java.lang.reflect.Method.invoke(Method.java:597)   
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)   
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)   
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)   
at java.security.AccessController.doPrivileged(Native Method)   
at javax.security.auth.Subject.doAs(Subject.java:396)   
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

The exact cause is not fully clear. It can appear when the firewall is left enabled, and also after the whole system crashes abnormally and is restarted. The fix used here was to reformat the master and the slaves at the same time.
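Before resorting to a full reformat (which destroys all HDFS data), two less drastic checks are worth trying. This is a sketch under assumptions: the firewall service name and the storage paths vary by distribution and configuration.

```shell
# 1) Make sure the firewall is not blocking DataNode traffic
#    (service name varies: iptables on older RHEL/CentOS, ufw on Ubuntu)
sudo service iptables stop

# 2) On Hadoop 1.x this symptom is often a namespaceID mismatch between
#    NameNode and DataNodes, caused by reformatting only the NameNode.
#    Compare the IDs; paths are placeholders for dfs.name.dir/dfs.data.dir:
grep namespaceID /opt/hadoop/dfs/name/current/VERSION
grep namespaceID /opt/hadoop/dfs/data/current/VERSION
```

If the two IDs differ, either reformat everything, or copy the NameNode's namespaceID into the DataNode's VERSION file and restart that DataNode, which avoids wiping the data directories.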

5. ERROR mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.

java.lang.OutOfMemoryError: unable to create new native thread

While a job was running, computation suddenly stopped. The TaskTracker log on the compute node showed the error above. The cause turned out to be that the job exceeded the system's per-user limit on the number of processes/threads a process may create, so the JVM could not spawn a new native thread. Raise the limit by adding the following to /etc/security/limits.conf:

hadoop soft nproc 10000
hadoop hard nproc 64000
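You can verify the limits actually in effect before and after the change (log out and back in for limits.conf to take effect, and run these as the user that owns the TaskTracker process):

```shell
# Show the limits in effect for the current shell
ulimit -u   # max user processes/threads ("nproc")
ulimit -n   # max open file descriptors ("nofile")
```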

6. NameNode exception

2013-08-20 14:10:08,946 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /var/lib/hadoop/cache/hadoop/dfs/name  
2013-08-20 14:10:08,947 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.  
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /var/lib/hadoop/cache/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.  
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)  
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:104)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:427)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:388)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:277)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1298)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1307)  
2013-08-20 14:10:08,948 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /var/lib/hadoop/cache/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.  
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)  
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:104)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:427)  
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:388)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:277)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:497)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1298)  
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1307)  
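The message itself points at the cause: the configured dfs.name.dir is missing or unreadable by the user starting the NameNode. A few checks worth making, using the path from the log above ("hadoop" as the owning user is an assumption; substitute your own):

```shell
# Does the directory exist, and who owns it?
ls -ld /var/lib/hadoop/cache/hadoop/dfs/name

# If it is missing (e.g. the cache directory was cleaned), recreate it,
# hand it to the Hadoop user, and re-format the NameNode:
sudo mkdir -p /var/lib/hadoop/cache/hadoop/dfs/name
sudo chown -R hadoop:hadoop /var/lib/hadoop
hadoop namenode -format
```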

7. NameNode fails to start (or SecondaryNameNode fails to start)

The NameNode log shows that the port is already in use:

2016-09-07 10:18:08,547 INFO org.apache.hadoop.http.HttpServer2: HttpServer.start() threw a non Bind IOException
java.net.BindException: Port in use: 0.0.0.0:50070
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919)
    at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:856)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:752)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:638)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:914)
    ... 8 more
2016-09-07 10:18:08,550 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2016-09-07 10:18:08,550 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2016-09-07 10:18:08,550 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2016-09-07 10:18:08,551 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.net.BindException: Port in use: 0.0.0.0:50070
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:919)
    at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:856)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:142)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:752)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:638)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:914)
    ... 8 more
2016-09-07 10:18:08,552 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-09-07 10:18:08,553 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at joyven/192.168.2.35
************************************************************/

Solution:
Since the log tells you which port is occupied, either change the port or kill the process holding it.
Killing the process holding a port on macOS:

sudo lsof -i:<port> -P
sudo kill -9 <PID>

Taking port 50070 as the example:

sudo lsof -i:50070 -P

Console output:

COMMAND  PID USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
java    6501 root  189u  IPv4 0x782e003217773193      0t0  TCP *:50070 (LISTEN)

Then kill it:

sudo kill -9 6501
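Since `lsof -t` prints only the PID, the two steps can also be combined into one line (the port is whatever the log reported as in use):

```shell
# -t: terse output (PIDs only), fed straight into kill
sudo kill -9 "$(sudo lsof -ti:50070)"
```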

8. Summary of why NameNode, SecondaryNameNode, and DataNode fail to start

Most NameNode and SecondaryNameNode errors come from configuration, e.g. in core-site.xml and hdfs-site.xml, typically involving IPs and ports.
Check the log files under logs/ whose names end in namenode.log and secondarynamenode.log; the error is usually stated plainly.
DataNode startup errors are generally caused by a wrong data storage path or a botched format. The main configuration file is core-site.xml, whose hadoop.tmp.dir property sets the data storage path.
NameNode format command:

hadoop namenode -format
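For reference, a minimal core-site.xml fragment setting the data path mentioned above. The value shown is a placeholder; use the directory you actually want the NameNode and DataNode defaults to live under.

```xml
<!-- core-site.xml: hadoop.tmp.dir is the base directory from which the
     default dfs.name.dir / dfs.data.dir locations are derived -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/hadoop/tmp</value>
</property>
```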

Note: parts of this article draw on other write-ups of Hadoop startup problems; please respect the original authors' work!
