A ZooKeeper Migration Failure

Posted by 五柳-先生 on 2015-11-25

A while ago a colleague migrated the ZooKeeper ensemble that serves both Kafka and Flume. Afterwards, every Flume producer and consumer reported the same errors:

07 Jan 2014 01:19:32,571 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.startConnect:1058)  - Opening socket connection to server xxx:2181
07 Jan 2014 01:19:32,572 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.primeConnection:947)  - Socket connection established to xxx:2181, initiating session
07 Jan 2014 01:19:32,573 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.run:1183)  - Unable to read additional data from server sessionid 0x142f42b91871911, likely server has closed socket, closing socket connection and attempting reconnect
07 Jan 2014 01:19:32,845 INFO  [conf-file-poller-0-SendThread(xxx:2181)] (org.apache.zookeeper.ClientCnxn$SendThread.startConnect:1058)  - Opening socket connection to server xxx:2181

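The clients kept retrying, failing to connect, and retrying again. My colleague said network connectivity had been verified long before, so I checked the server-side log and found: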

2014-01-06 23:59:59,987 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /xxx:45282
2014-01-06 23:59:59,987 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@793] - Connection request from old client xxx:45282; will be dropped if server is in r-o mode
2014-01-06 23:59:59,987 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@812] - Refusing session request for client xxx:45282 as it has seen zxid 0x60fd15564 our last zxid is 0x10000000f client must try another server
2014-01-06 23:59:59,987 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client xxx:45282 (no session established for client)
2014-01-06 23:59:59,989 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from xxx:45285

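So Flume still holds the zxid it last saw on the old cluster (0x60fd15564), while the new cluster's zxid has effectively started over from zero (its last zxid is only 0x10000000f). The server therefore refuses the session and throws an exception: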

// Refuse the handshake when the client has already seen a newer zxid
// than this server has processed.
if (connReq.getLastZxidSeen() > zkDb.dataTree.lastProcessedZxid) {
    String msg = "Refusing session request for client "
        + cnxn.getRemoteSocketAddress()
        + " as it has seen zxid 0x"
        + Long.toHexString(connReq.getLastZxidSeen())
        + " our last zxid is 0x"
        + Long.toHexString(getZKDatabase().getDataTreeLastProcessedZxid())
        + " client must try another server";
    LOG.info(msg);
    // Closes the connection; the client then reconnects and tries again.
    throw new CloseRequestException(msg);
}
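
To make the comparison concrete, here is a minimal standalone sketch that applies the same check to the two zxids from the server log. It is not ZooKeeper code: the class `ZxidDemo` and its helper methods are hypothetical names made up for illustration. The only outside fact it relies on is that a zxid is a 64-bit value whose upper 32 bits hold the leader epoch and whose lower 32 bits hold a counter.

```java
// Illustrative only: ZxidDemo and its methods are hypothetical helpers,
// not part of the ZooKeeper API.
class ZxidDemo {
    // Upper 32 bits of a zxid: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Lower 32 bits of a zxid: the per-epoch transaction counter.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Same comparison as the server-side check shown above.
    static boolean refuses(long clientLastZxidSeen, long serverLastProcessedZxid) {
        return clientLastZxidSeen > serverLastProcessedZxid;
    }

    public static void main(String[] args) {
        long client = 0x60fd15564L; // zxid Flume saw on the old cluster (from the log)
        long server = 0x10000000fL; // last zxid of the new cluster (from the log)

        System.out.printf("client: epoch=%d counter=0x%x%n", epoch(client), counter(client));
        System.out.printf("server: epoch=%d counter=0x%x%n", epoch(server), counter(server));
        System.out.println("refused: " + refuses(client, server));
    }
}
```

Decoded this way, the client is at epoch 6 and the new cluster at epoch 1, so the client's zxid is larger and the handshake is refused no matter how many times it is retried.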

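I then asked my colleague how the migration had been carried out. He started a brand-new cluster, shut down the old one, and at the same time brought up an haproxy on one of the old cluster's IP:2181 addresses to forward traffic to the new cluster, assuming this would make the migration transparent =.=. In fact it triggered ZK bug-832, which makes the client retry the connection endlessly; only restarting Flume resolves it.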

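The correct way to migrate is to join the new cluster's servers to the old cluster, then update the Flume configuration, wait a while (Flume reconfigures itself automatically), and only then shut down the old servers. Done this way, the problem is not triggered.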
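
The difference between the two migration paths can be sketched with a toy model. Everything here is an assumption made for illustration: `MigrationModel`, `Server`, and `Client` are hypothetical classes, not ZooKeeper or Flume APIs. The model also assumes that a server which joins the old ensemble syncs up to the old cluster's zxid, and that a restarted client starts with a last-seen zxid of 0.

```java
// Toy model, not ZooKeeper code: all class and method names are hypothetical.
class MigrationModel {
    /** Stand-in for a ZooKeeper server; tracks only its last processed zxid. */
    static final class Server {
        final long lastProcessedZxid;

        Server(long lastProcessedZxid) {
            this.lastProcessedZxid = lastProcessedZxid;
        }

        /** Accepts the handshake unless the client has seen a newer zxid. */
        boolean accepts(long clientLastZxidSeen) {
            return clientLastZxidSeen <= lastProcessedZxid;
        }
    }

    /** Stand-in for a long-running client such as Flume. */
    static final class Client {
        long lastZxidSeen;

        Client(long lastZxidSeen) {
            this.lastZxidSeen = lastZxidSeen;
        }

        /** Returns the attempt number that succeeded, or -1 if every attempt was refused. */
        int connect(Server server, int maxAttempts) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (server.accepts(lastZxidSeen)) {
                    return attempt;
                }
                // A refused handshake leaves lastZxidSeen unchanged,
                // so the next attempt fails in exactly the same way.
            }
            return -1;
        }

        /** Assumption: a restarted process has not seen any zxid yet. */
        void restart() {
            lastZxidSeen = 0L;
        }
    }

    public static void main(String[] args) {
        long oldClusterZxid = 0x60fd15564L;           // from the server log
        Server freshCluster = new Server(0x10000000fL); // new cluster behind haproxy
        Server joinedServer = new Server(oldClusterZxid); // assumed synced with the old ensemble
        Client flume = new Client(oldClusterZxid);

        System.out.println(flume.connect(freshCluster, 5)); // prints -1
        System.out.println(flume.connect(joinedServer, 5)); // prints 1
        flume.restart();
        System.out.println(flume.connect(freshCluster, 5)); // prints 1
    }
}
```

In this model the fresh cluster refuses the running client on every attempt, a server that joined the old ensemble accepts it immediately, and a restarted client is accepted by the fresh cluster, which matches the behavior described above.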

This article comes from the blog “MIKE老畢的海賊船”. Please keep this attribution when reposting: http://boylook.blog.51cto.com/7934327/1365364
