Pitfall Log: Solutions to Common CEPH Cluster Problems

By 一往無前,未來可期, published 2020-11-06



1: ceph commands report the error: .handle_connect_reply connect got BADAUTHORIZER

1.1: Error Details

  • Checking the OSD status (ceph osd status) turned up the following errors:

  • [root@ct ~(keystone_admin)]# ceph osd status
    2020-03-12 18:09:43.363 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:43.564 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:43.965 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:44.767 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:46.370 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:49.574 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    ...and these messages keep repeating endlessly
    

1.2: Solution

  • Restarting the ceph-osd service at first had no effect; the error only cleared after restarting all Ceph services with systemctl restart ceph.target, as sketched below.
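
Concretely, the two attempts look like this; a minimal console sketch in the same style as the logs above (the systemd target names are the standard ones installed with the Ceph packages):

[root@ct ~(keystone_admin)]# systemctl restart ceph-osd.target	'//restarting only the OSD daemons did not clear the error'
[root@ct ~(keystone_admin)]# systemctl restart ceph.target	'//restarting every Ceph daemon (mon, mgr, osd, ...) did'
[root@ct ~(keystone_admin)]# ceph osd status	'//verify the BADAUTHORIZER messages are gone'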

1.3: Problem solved!

2: The OSD on one CEPH node keeps failing to come up

2.1: Error Details

  • While checking the cluster's health, the OSD service on one node was found to be down; ceph osd status showed that it was node c1's OSD that had not come up (a check you can run on c1 itself is sketched after the table):

  • [root@ct ~(keystone_admin)]# ceph osd status
    +----+------+-------+-------+--------+---------+--------+---------+----------------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |     state      |
    +----+------+-------+-------+--------+---------+--------+---------+----------------+
    | 0  |  ct  | 14.4G | 1009G |    0   |     0   |    0   |     6   |   exists,up    |
    | 1  |      |    0  |    0  |    0   |     0   |    0   |     0   | autoout,exists |
    | 2  |  c2  | 14.4G | 1009G |    0   |     0   |    1   |    48   |   exists,up    |
    +----+------+-------+-------+--------+---------+--------+---------+----------------+
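
To confirm the failure on c1 itself, you can inspect the OSD's systemd unit and its recent logs. A minimal sketch, assuming the down OSD has id 1 as the table suggests:

[root@c1 ~]# systemctl status ceph-osd@1	'//check the systemd unit of osd.1 on c1'
[root@c1 ~]# journalctl -u ceph-osd@1 -n 50 --no-pager	'//the last log lines usually show why it failed'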
    

2.2: Solution

  • Checking the health status again finally revealed the problem: time synchronization on node c1 had broken (a way to quantify the skew is sketched after the output):
[root@ct ~(keystone_admin)]# ceph -s
  cluster:
    id:     8c9d2d27-492b-48a4-beb6-7de453cf45d6
    health: HEALTH_WARN
            Degraded data redundancy: 2127/6381 objects degraded (33.333%), 133 pgs degraded, 192 pgs undersized
            clock skew detected on mon.c1	'//shows the clock on node c1 is off'
 
  services:
    mon: 3 daemons, quorum ct,c1,c2
    mgr: ct(active), standbys: c2, c1
    osd: 3 osds: 2 up, 2 in
 
  data:
    pools:   3 pools, 192 pgs
    objects: 2.13 k objects, 13 GiB
    usage:   29 GiB used, 2.0 TiB / 2.0 TiB avail
    pgs:     2127/6381 objects degraded (33.333%)
             133 active+undersized+degraded
             59  active+undersized
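
The skew can also be quantified per monitor; a supplementary sketch, assuming a Luminous-or-later release (which the mgr output above suggests):

[root@ct ~(keystone_admin)]# ceph time-sync-status	'//reports the clock offset of every monitor'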

  • Re-sync the time on node c1 and restart the related service; that is all it takes (a note on making the sync persistent follows the commands):
[root@c1 ~]# ntpdate ct	'//sync the time from ct'
12 Mar 18:23:27 ntpdate[37287]: step time server 192.168.11.100 offset -28799.645303 sec
[root@c1 ~]# date	'//check that the time now matches'
Thu Mar 12 18:23:33 CST 2020
[root@c1 ~]# systemctl restart ceph-osd.target	'//restart the OSD service'
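
Note that ntpdate is only a one-shot correction, so the skew can return after a reboot. A minimal sketch for keeping c1 in sync long-term, assuming ntpd is installed and ct (192.168.11.100) stays the time source:

[root@c1 ~]# echo 'server 192.168.11.100 iburst' >> /etc/ntp.conf	'//hypothetical config line: use ct as the NTP source'
[root@c1 ~]# systemctl enable ntpd	'//start ntpd at boot'
[root@c1 ~]# systemctl start ntpd	'//begin continuous syncing now'
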
  • Checking the health status once more confirms the problem is resolved:
[root@ct ~(keystone_admin)]# ceph -s
  cluster:
    id:     8c9d2d27-492b-48a4-beb6-7de453cf45d6
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ct,c1,c2
    mgr: ct(active), standbys: c2
    osd: 3 osds: 3 up, 3 in
 
  data:
    pools:   3 pools, 192 pgs
    objects: 2.13 k objects, 13 GiB
    usage:   43 GiB used, 3.0 TiB / 3.0 TiB avail
    pgs:     192 active+clean

2.3: Problem successfully solved!
