Troubleshooting peer connection timeouts on an Ethereum public-chain node

Posted by 看見月亮的人 on 2020-12-08

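At 8 p.m. on April 1, 2020, Zabbix raised an alert: no block synchronization had been detected on the Ethereum public-chain node for three minutes.

I logged in to the server right away and checked the node's synchronization status.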

# docker logs -f public-eth --tail 10

INFO [04-01|20:17:37.735] Deep froze chain segment                 blocks=44   elapsed=345.845ms number=9695993      hash=3833c6…28c205
WARN [04-01|20:18:10.932] Synchronisation failed, dropping peer    peer=63cbc31e7027052b err=timeout
WARN [04-01|20:25:26.171] Synchronisation failed, dropping peer    peer=4d27a5ef8b885210 err=timeout
WARN [04-01|20:26:21.815] Synchronisation failed, dropping peer    peer=6a224bc2c8c3b02c err=timeout

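The logs showed that the node had fallen about 50 blocks behind the latest block height, because synchronization with its connected peers kept timing out.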
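To see at a glance which errors dominate, the warnings can be tallied by their `err=` value. The following is only a sketch: the function name `countSyncFailures` and the regular expression are my own assumptions, based on the log format shown above, and other geth versions may format these lines differently.

```javascript
// Sample lines copied from the log output above.
var sampleLog = [
  "INFO [04-01|20:17:37.735] Deep froze chain segment                 blocks=44   elapsed=345.845ms number=9695993      hash=3833c6…28c205",
  "WARN [04-01|20:18:10.932] Synchronisation failed, dropping peer    peer=63cbc31e7027052b err=timeout",
  "WARN [04-01|20:25:26.171] Synchronisation failed, dropping peer    peer=4d27a5ef8b885210 err=timeout",
  "WARN [04-01|20:26:21.815] Synchronisation failed, dropping peer    peer=6a224bc2c8c3b02c err=timeout"
];

// Count "Synchronisation failed" warnings, grouped by the err= value.
// (Hypothetical helper; the pattern is inferred from the lines above.)
function countSyncFailures(lines) {
  var pattern = /Synchronisation failed, dropping peer\s+peer=(\S+)\s+err=(.+)$/;
  var counts = {};
  for (var i = 0; i < lines.length; i++) {
    var match = pattern.exec(lines[i]);
    if (!match) continue;
    // Strip surrounding quotes, e.g. err="retrieved hash chain is invalid".
    var reason = match[2].trim().replace(/^"|"$/g, "");
    counts[reason] = (counts[reason] || 0) + 1;
  }
  return counts;
}

console.log(JSON.stringify(countSyncFailures(sampleLog)));
```

Run against a longer log, a tally like this shows whether the node is dropping peers mostly for timeouts or for other reasons such as an invalid hash chain.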

So I attached to the geth console and inspected the connected peers.

> admin.peers  // list the connected peers

{
    caps: ["eth/63", "eth/64", "eth/65", "les/2", "les/3", "shh/6"],
    enode: "enode://0e1806acd33408d618070c3e0a33e692af2a641493701a65f2f7c9d7f9de076a7d6087e4228f711428aa654155a41f2e4850491df2ab213703eb6ae13f10fa32@18.218.89.155:30303",
    id: "2bd5a38260608099ba57f4236645718e4fdbfb5df9566ae4059e2b38d4321dfa",
    name: "Geth/bluebear/v1.9.12-stable-b6f1c8dc/linux-amd64/go1.13.8",
    network: {
      inbound: false,
      localAddress: "172.17.0.2:36444",
      remoteAddress: "18.218.89.155:30303",
      static: false,
      trusted: false
    },
    protocols: {
      eth: {
        difficulty: 1.4790370433997981e+22,
        head: "0x4afbea3912db5ab6e0c5042498c2020dc0558f349b6505e4d5d23de087a0c0ae",
        version: 64
      }
    }
}
......

> net.peerCount   // number of connected peers
7

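I tested network connectivity with telnet against the IP and port returned for each peer, and checked the number of connected peers. Neither check turned up a problem.

Suspecting a network problem on the server itself, I logged in to the standby environment of the Ethereum public-chain node.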
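Collecting the IP and port of every peer by hand gets tedious once there are more than a few. As a sketch, a small helper can pull them out of the `admin.peers` result. `peerEndpoints` is a hypothetical name of my own, and the sample object is trimmed to the fields the helper reads.

```javascript
// Extract host and port from each peer's network.remoteAddress.
// (Hypothetical helper; an IPv6 host keeps its square brackets.)
function peerEndpoints(peers) {
  var endpoints = [];
  for (var i = 0; i < peers.length; i++) {
    var addr = peers[i].network && peers[i].network.remoteAddress;
    if (!addr) continue;
    var idx = addr.lastIndexOf(":");
    if (idx < 0) continue;
    endpoints.push({
      host: addr.slice(0, idx),
      port: parseInt(addr.slice(idx + 1), 10)
    });
  }
  return endpoints;
}

// Trimmed copy of the peer entry shown above.
var samplePeers = [
  {
    name: "Geth/bluebear/v1.9.12-stable-b6f1c8dc/linux-amd64/go1.13.8",
    network: { remoteAddress: "18.218.89.155:30303" }
  }
];

var endpoints = peerEndpoints(samplePeers);
for (var i = 0; i < endpoints.length; i++) {
  // Print one telnet command per peer for manual testing.
  console.log("telnet " + endpoints[i].host + " " + endpoints[i].port);
}
```

In the geth console the same function could be called as `peerEndpoints(admin.peers)`, assuming the console accepts the pasted definition.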

# docker logs -f public-eth --tail 1000
INFO [03-29|09:19:19.278] Regenerated local transaction journal    transactions=4  accounts=1
WARN [03-29|09:25:23.724] Synchronisation failed, dropping peer    peer=4b9ad7c0ab94dc10 err="retrieved hash chain is invalid"
WARN [03-29|09:38:02.704] Checkpoint challenge timed out, dropping id=e8191234978143d6 conn=dyndial addr=3.1.27.148:30303      type=Geth/source/linux/go1.10.4
WARN [03-29|09:40:02.373] Synchronisation failed, dropping peer    peer=0be2160d6093d7a3 err=timeout
WARN [03-29|09:42:43.915] Checkpoint challenge timed out, dropping id=b7fe283b8834fc21 conn=dyndial addr=188.35.22.31:30307    type=Geth/source/linux/go1.11.2
WARN [03-29|09:48:31.330] Synchronisation failed, dropping peer    peer=209fadad4df29ee7 err=timeout
WARN [03-29|09:50:49.572] Synchronisation failed, dropping peer    peer=a46a49badbd2205b err=timeout
WARN [03-29|09:51:25.420] Synchronisation failed, dropping peer    peer=a41421cb9772370b err=timeout

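It turned out that the standby node had stopped syncing long before (because it is a standby node, no Zabbix alert had been configured for it).

On the standby node I used the same methods to check the connected peers and the peer count, and found nothing abnormal. I then queried the block height with the following commands and found the node was alarmingly far behind: currentBlock trailed highestBlock by roughly 47,000 blocks.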

> eth.syncing
{
  currentBlock: 9739351,
  highestBlock: 9786347,
  knownStates: 399147124,
  pulledStates: 399147124,
  startingBlock: 9725124
}
> eth.blockNumber
9739410
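The gap can be computed directly from the `eth.syncing` result instead of being read off by eye. This is a minimal sketch; `syncLag` is a hypothetical helper name, and it relies on `eth.syncing` returning `false` when the node is not syncing.

```javascript
// Number of blocks between the local head and the highest block
// the node has heard about from its peers.
function syncLag(status) {
  // eth.syncing is `false` when no sync is in progress.
  if (!status) return 0;
  return status.highestBlock - status.currentBlock;
}

// Values copied from the eth.syncing output above.
var sampleStatus = {
  currentBlock: 9739351,
  highestBlock: 9786347,
  knownStates: 399147124,
  pulledStates: 399147124,
  startingBlock: 9725124
};

console.log("blocks behind: " + syncLag(sampleStatus));
```

Note that `highestBlock` is only what the connected peers have reported, so a node with stale peers can look closer to the tip than it really is. A result of 0 from this helper does not by itself prove that the node is caught up.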

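Nothing abnormal showed up on the standby node, so I restarted the service. (I only did this because it is a standby node. In production, restarting or stopping a service must be done with great care.)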

# docker restart public-eth ;docker logs -f public-eth --tail 10
public-eth
......
INFO [04-01|20:34:16.546] Setting new local account                address=0x52df2CE99891c31314c9f2f97dE1eBf401806571
INFO [04-01|20:34:16.546] Loaded local transaction journal         transactions=11 dropped=0
INFO [04-01|20:34:16.546] Regenerated local transaction journal    transactions=11 accounts=1
WARN [04-01|20:34:16.546] Switch sync mode from fast sync to full sync 
INFO [04-01|20:34:16.743] New local node record                    seq=31 id=4b66feada6a11d9d ip=127.0.0.1 udp=30303 tcp=30303
INFO [04-01|20:34:16.744] Started P2P networking                   self=enode://f82c0b9f10906785fed6e1ee2f86c165da6dc43155f0799bb63e713d677786b2d4b56124389b4e32f7c268f79877a87eed9ffd47f9e25d52976108aee39810a9@127.0.0.1:30303
INFO [04-01|20:34:16.747] IPC endpoint opened                      url=/root/.ethereum/geth.ipc
INFO [04-01|20:34:16.748] HTTP endpoint opened                     url=http://0.0.0.0:8545      cors= vhosts=localhost
INFO [04-01|20:34:26.744] Block synchronisation started 
INFO [04-01|20:34:31.518] New local node record                    seq=32 id=4b66feada6a11d9d ip=47.244.12.0 udp=58724 tcp=30303
INFO [04-01|20:34:36.316] Importing heavy sidechain segment        blocks=2048 start=9722054 end=9724101
INFO [04-01|20:34:48.324] Imported new chain segment               blocks=1    txs=86 mgas=9.983 elapsed=12.006s mgasps=0.832 number=9722054 hash=695f2b…b44224 age=1w2d21h dirty=1.08MiB
INFO [04-01|20:34:56.746] Imported new chain segment               blocks=9    txs=1378 mgas=87.184 elapsed=8.421s  mgasps=10.353 number=9722063 hash=75ea98…258dcd age=1w2d21h dirty=12.45MiB
INFO [04-01|20:35:05.076] Imported new chain segment               blocks=17   txs=2583 mgas=157.750 elapsed=8.329s  mgasps=18.938 number=9722080 hash=c7ff1b…6e622b age=1w2d21h dirty=31.53MiB
INFO [04-01|20:35:13.222] Imported new chain segment               blocks=20   txs=1967 mgas=151.433 elapsed=8.146s  mgasps=18.590 number=9722100 hash=19fbd6…b3d86b age=1w2d21h dirty=49.69MiB

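After the restart, the standby node began syncing again and started to recover.

When the problem first appeared, I had also searched Baidu for the message: Synchronisation failed, dropping peer

The answer given online was:

If the log stays stuck at this point, geth has not connected to any valid peer. Running the following command in the console shows a peer count of 0: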

> net.peerCount
0

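For this warning, simply wait. If there is no response for a long time, restart the node so that it looks for new peers. You can also add peers manually. The nodes provided by 星火計劃 are listed below and can be tried: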

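My guess was that the restart had refreshed the node's peer list, and that the earlier timeouts and stalled sync had been caused by bad peers.

That led to the following approach.

When peer timeout errors appear, you can add the bootnodes listed in the Ethereum source code to help the node find usable peers.

The bootnodes are: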

var MainnetBootnodes = []string{
	// Ethereum Foundation Go Bootnodes
	"enode://d860a01f9722d78051619d1e2351aba3f43f943f6f00718d1b9baa4101932a1f5011f16bb2b1bb35db20d6fe28fa0bf09636d26a87d31de9ec6203eeedb1f666@18.138.108.67:30303",   // bootnode-aws-ap-southeast-1-001
	"enode://22a8232c3abc76a16ae9d6c3b164f98775fe226f0917b0ca871128a74a8e9630b458460865bab457221f1d448dd9791d24c4e5d88786180ac185df813a68d4de@3.209.45.79:30303",     // bootnode-aws-us-east-1-001
	"enode://ca6de62fce278f96aea6ec5a2daadb877e51651247cb96ee310a318def462913b653963c155a0ef6c7d50048bba6e6cea881130857413d9f50a621546b590758@34.255.23.113:30303",   // bootnode-aws-eu-west-1-001
	"enode://279944d8dcd428dffaa7436f25ca0ca43ae19e7bcf94a8fb7d1641651f92d121e972ac2e8f381414b80cc8e5555811c2ec6e1a99bb009b3f53c4c69923e11bd8@35.158.244.151:30303",  // bootnode-aws-eu-central-1-001
	"enode://8499da03c47d637b20eee24eec3c356c9a2e6148d6fe25ca195c7949ab8ec2c03e3556126b0d7ed644675e78c4318b08691b7b57de10e5f0d40d05b09238fa0a@52.187.207.27:30303",   // bootnode-azure-australiaeast-001
	"enode://103858bdb88756c71f15e9b5e09b56dc1be52f0a5021d46301dbbfb7e130029cc9d0d6f73f693bc29b665770fff7da4d34f3c6379fe12721b5d7a0bcb5ca1fc1@191.234.162.198:30303", // bootnode-azure-brazilsouth-001
	"enode://715171f50508aba88aecd1250af392a45a330af91d7b90701c436b618c86aaa1589c9184561907bebbb56439b8f8787bc01f49a7c77276c58c1b09822d75e8e8@52.231.165.108:30303",  // bootnode-azure-koreasouth-001
	"enode://5d6d7cd20d6da4bb83a1d28cadb5d409b64edf314c0335df658c1a54e32c7c4a7ab7823d57c39b6a757556e68ff1df17c748b698544a55cb488b52479a92b60f@104.42.217.25:30303",   // bootnode-azure-westus-001
}

They can be added like this (examples):

admin.addPeer("enode://d860a01f9722d78051619d1e2351aba3f43f943f6f00718d1b9baa4101932a1f5011f16bb2b1bb35db20d6fe28fa0bf09636d26a87d31de9ec6203eeedb1f666@18.138.108.67:30303")

admin.addPeer("enode://22a8232c3abc76a16ae9d6c3b164f98775fe226f0917b0ca871128a74a8e9630b458460865bab457221f1d448dd9791d24c4e5d88786180ac185df813a68d4de@3.209.45.79:30303")

admin.addPeer("enode://ca6de62fce278f96aea6ec5a2daadb877e51651247cb96ee310a318def462913b653963c155a0ef6c7d50048bba6e6cea881130857413d9f50a621546b590758@34.255.23.113:30303")

admin.addPeer("enode://5d6d7cd20d6da4bb83a1d28cadb5d409b64edf314c0335df658c1a54e32c7c4a7ab7823d57c39b6a757556e68ff1df17c748b698544a55cb488b52479a92b60f@104.42.217.25:30303")

admin.addPeer("enode://715171f50508aba88aecd1250af392a45a330af91d7b90701c436b618c86aaa1589c9184561907bebbb56439b8f8787bc01f49a7c77276c58c1b09822d75e8e8@52.231.165.108:30303")

admin.addPeer("enode://a0c97b58f2d3ea039cf09d8d9255c6d635f605f8702fafaafeda173ad0ae81c717c9f0ec155be615868c5eb027f8e28b2e9cab31ddddbb1b3e664c13eccb649a@159.69.56.17:30303?discport=1062")
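Typing one `admin.addPeer` call per bootnode can be replaced by a loop. The sketch below assumes a hypothetical helper `addPeers` that takes the `admin` object as a parameter, so that it also runs outside geth with a stand-in object. Only two of the bootnodes listed above are included to keep it short.

```javascript
// Two of the mainnet bootnodes listed above.
var bootnodes = [
  "enode://d860a01f9722d78051619d1e2351aba3f43f943f6f00718d1b9baa4101932a1f5011f16bb2b1bb35db20d6fe28fa0bf09636d26a87d31de9ec6203eeedb1f666@18.138.108.67:30303",
  "enode://22a8232c3abc76a16ae9d6c3b164f98775fe226f0917b0ca871128a74a8e9630b458460865bab457221f1d448dd9791d24c4e5d88786180ac185df813a68d4de@3.209.45.79:30303"
];

// Call addPeer once per enode URL and record what each call returned.
// (Hypothetical helper; `adminApi` is the console's `admin` object.)
function addPeers(adminApi, enodes) {
  var results = [];
  for (var i = 0; i < enodes.length; i++) {
    results.push({ enode: enodes[i], accepted: adminApi.addPeer(enodes[i]) });
  }
  return results;
}

// Stand-in for `admin`, used only so the sketch runs outside geth.
var fakeAdmin = {
  addPeer: function (url) {
    return url.indexOf("enode://") === 0;
  }
};

console.log(addPeers(fakeAdmin, bootnodes).length + " peers submitted");
```

In the geth console the call would be `addPeers(admin, bootnodes)`. A `true` result only means the request to dial the peer was accepted, so it is still worth checking `admin.peers` afterwards to see whether the connection was actually established.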

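Later I found that the production environment had recovered on its own and synced to the latest block, which made me think the network might have played a part as well.

Less than three minutes later, however, production started reporting peer timeouts again and fell out of sync. When I looked at the standby node at that point, it showed nothing abnormal.

Because the production and standby servers for the Ethereum public-chain node are both located in Hong Kong, a network problem could be ruled out.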

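When I looked at the production node's peers again, I noticed one running geth 1.9.3. Geth 1.9.9 is the release for Ethereum's Muir Glacier hard fork, so a peer running anything older cannot have followed the chain past the fork block, 9,200,000, and is a bad peer.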

> admin.peers

{
    caps: ["eth/63"],
    enode: "enode://945b152c088c887d06f02eafba2d88559fbfc460510ca98c4d919312cf2cf4fa6c4307fe02bff4cde36a236948448f24620c0f67fce3444e444c0dee302a81e6@209.250.230.142:30303",
    id: "3b95c0d3e14e5021e214c81a83c346eb997ec194fb1bd2e7a79b23d0d92a6198",
    name: "Geth/v1.9.3-stable-cfbb969d/linux-amd64/go1.11.5",
    network: {
      inbound: false,
      localAddress: "172.17.0.2:37444",
      remoteAddress: "209.250.230.142:30303",
      static: false,
      trusted: false
    },
    protocols: {
      eth: {
        difficulty: 1.3520836652877566e+22,
        head: "0xab865ff4c05af71a415426ce40bf7656194bcbea3ed0b0b2f0c64767b30f2982",
        version: 63
      }
    }
}
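Spotting an old version by scrolling through `admin.peers` is easy to get wrong, so the check can be sketched as a filter on each peer's `name` field. The helper names below are hypothetical, and the regular expression is an assumption based on the two name formats seen in this post (`Geth/v1.9.3-...` and `Geth/bluebear/v1.9.12-...`). Peers whose name does not match, such as other clients, are skipped rather than flagged.

```javascript
// Parse "Geth/v1.9.3-stable-..." or "Geth/<identity>/v1.9.12-stable-..."
// into [major, minor, patch]; returns null for anything else.
function parseGethVersion(name) {
  var m = /^Geth\/(?:[^\/]+\/)?v(\d+)\.(\d+)\.(\d+)/.exec(name || "");
  if (!m) return null;
  return [parseInt(m[1], 10), parseInt(m[2], 10), parseInt(m[3], 10)];
}

// Numeric comparison, so that 1.9.12 is correctly newer than 1.9.9.
function isOlder(version, minimum) {
  for (var i = 0; i < 3; i++) {
    if (version[i] !== minimum[i]) return version[i] < minimum[i];
  }
  return false;
}

// Return the geth peers whose version is below `minimum`.
function outdatedGethPeers(peers, minimum) {
  var result = [];
  for (var i = 0; i < peers.length; i++) {
    var version = parseGethVersion(peers[i].name);
    if (version && isOlder(version, minimum)) result.push(peers[i]);
  }
  return result;
}

// Trimmed copies of the two peer entries shown in this post.
var versionSamplePeers = [
  {
    id: "2bd5a38260608099ba57f4236645718e4fdbfb5df9566ae4059e2b38d4321dfa",
    name: "Geth/bluebear/v1.9.12-stable-b6f1c8dc/linux-amd64/go1.13.8"
  },
  {
    id: "3b95c0d3e14e5021e214c81a83c346eb997ec194fb1bd2e7a79b23d0d92a6198",
    name: "Geth/v1.9.3-stable-cfbb969d/linux-amd64/go1.11.5"
  }
];

var outdated = outdatedGethPeers(versionSamplePeers, [1, 9, 9]);
for (var i = 0; i < outdated.length; i++) {
  console.log(outdated[i].id + "  " + outdated[i].name);
}
```

The version parts are compared as numbers on purpose: a plain string comparison would rank "1.9.12" below "1.9.9" and wrongly flag a healthy peer.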

So I removed it with the following command.

> admin.removePeer("enode://945b152c088c887d06f02eafba2d88559fbfc460510ca98c4d919312cf2cf4fa6c4307fe02bff4cde36a236948448f24620c0f67fce3444e444c0dee302a81e6@209.250.230.142:30303")
true

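After the removal, the production node began syncing normally.

Conclusion: when an Ethereum production node stops syncing and reports the following error:

Synchronisation failed, dropping peer    peer=6a224bc2c8c3b02c err=timeout

one possible cause is that the node has connected to a peer running an older version (a release from before an Ethereum hard fork). In that case the peer can be removed with the following command: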

admin.removePeer("enode://945b152c088c887d06f02eafba2d88559fbfc460510ca98c4d919312cf2cf4fa6c4307fe02bff4cde36a236948448f24620c0f67fce3444e444c0dee302a81e6@209.250.230.142:30303")
  • Add a peer
admin.addPeer("enode://3af83ae28fc90838c334369ed2bf8071065062b851e5845e5eb07bd2efc5ba68f9d77865bea3ea09d3cc866bded716c258b0bca002696a69463fba7fdefb51df@128.230.208.74:30303")
true
  • Add a trusted peer
admin.addTrustedPeer("enode://3af83ae28fc90838c334369ed2bf8071065062b851e5845e5eb07bd2efc5ba68f9d77865bea3ea09d3cc866bded716c258b0bca002696a69463fba7fdefb51df@128.230.208.74:30303")
true
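A version string is not the only hint that a peer is stuck on an old chain. The two peer entries shown in this post also differ in the `difficulty` they report (about 1.48e+22 for the healthy peer against 1.35e+22 for the 1.9.3 peer). As a rough heuristic of my own, not something the original investigation used, peers can be flagged when their reported total difficulty falls well below the best one seen. The helper names and the 0.99 threshold are assumptions.

```javascript
// Read the total difficulty a peer reports, or null if unavailable
// (for example while the eth handshake is still in progress).
function peerTotalDifficulty(peer) {
  var eth = peer.protocols && peer.protocols.eth;
  return eth && typeof eth === "object" && typeof eth.difficulty === "number"
    ? eth.difficulty
    : null;
}

// Flag peers whose total difficulty is below minRatio times the best
// value reported by any peer. (Hypothetical heuristic.)
function laggingPeers(peers, minRatio) {
  var best = 0;
  var i, td;
  for (i = 0; i < peers.length; i++) {
    td = peerTotalDifficulty(peers[i]);
    if (td !== null && td > best) best = td;
  }
  var lagging = [];
  for (i = 0; i < peers.length; i++) {
    td = peerTotalDifficulty(peers[i]);
    if (td !== null && td < best * minRatio) lagging.push(peers[i]);
  }
  return lagging;
}

// Trimmed copies of the two peer entries shown in this post.
var difficultySamplePeers = [
  {
    name: "Geth/bluebear/v1.9.12-stable-b6f1c8dc/linux-amd64/go1.13.8",
    protocols: { eth: { difficulty: 1.4790370433997981e+22, version: 64 } }
  },
  {
    name: "Geth/v1.9.3-stable-cfbb969d/linux-amd64/go1.11.5",
    protocols: { eth: { difficulty: 1.3520836652877566e+22, version: 63 } }
  }
];

var flagged = laggingPeers(difficultySamplePeers, 0.99);
for (var i = 0; i < flagged.length; i++) {
  console.log("lagging: " + flagged[i].name);
}
```

A flagged peer may simply still be syncing, so the list is best treated as a set of candidates to inspect before calling `admin.removePeer`, not as something to remove automatically.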
