Pitfall Log -- Common CEPH Cluster Problems and How to Fix Them


1: Running ceph commands errors with handle_connect_reply connect got BADAUTHORIZER

1.1: Error details

  • I checked the OSD status with ceph osd status and hit the following error (a triage sketch follows the log):

  • [root@ct ~(keystone_admin)]# ceph osd status
    2020-03-12 18:09:43.363 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:43.564 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:43.965 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:44.767 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:46.370 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    2020-03-12 18:09:49.574 7f2e96572700  0 -- 192.168.11.100:0/3068442569 >> 192.168.11.100:6804/1625 conn(0x7f2e80005580 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=1).handle_connect_reply connect got BADAUTHORIZER
    ...this message keeps repeating indefinitely
    
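
  • A minimal triage sketch before restarting anything (hostnames ct, c1 and c2 are this cluster's nodes; the commands are standard Ceph/systemd tooling). BADAUTHORIZER is a cephx authentication failure, and stale tickets or clock skew between nodes are common triggers:

    systemctl status ceph-osd.target        # are the OSD daemons actually running?
    ceph health detail                      # look for auth or clock-skew warnings
    for h in ct c1 c2; do ssh $h date; done # compare wall clocks across the nodes
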

1.2: Solution

  • Restarting only the ceph-osd service did not help; the whole Ceph stack has to be restarted with systemctl restart ceph.target (see the sketch below).
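
  • A hedged sketch of the restart-and-verify step, assuming the systemd-managed Ceph deployment used throughout this post:

    systemctl restart ceph.target   # bounce the mon, mgr and osd daemons together
    ceph osd status                 # should print the status table instead of BADAUTHORIZER
    ceph -s                         # confirm overall cluster health
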

1.3: Problem solved!

2: The OSD on one CEPH node keeps failing to come up

2.1: Error details

  • While checking the cluster health I found that the OSD service on one node was down; ceph osd status showed that the daemon on node c1 was not running (a sketch for mapping the OSD back to its host follows the table):

  • [root@ct ~(keystone_admin)]# ceph osd status
    +----+------+-------+-------+--------+---------+--------+---------+----------------+
    | id | host |  used | avail | wr ops | wr data | rd ops | rd data |     state      |
    +----+------+-------+-------+--------+---------+--------+---------+----------------+
    | 0  |  ct  | 14.4G | 1009G |    0   |     0   |    0   |     6   |   exists,up    |
    | 1  |      |    0  |    0  |    0   |     0   |    0   |     0   | autoout,exists |
    | 2  |  c2  | 14.4G | 1009G |    0   |     0   |    1   |    48   |   exists,up    |
    +----+------+-------+-------+--------+---------+--------+---------+----------------+
    
    
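  • Note that ceph osd status leaves the host column blank for the dead daemon, so it helps to map osd.1 back to its node first. A small sketch (the osd.1-on-c1 mapping is specific to this cluster):

    ceph osd tree                        # the CRUSH tree shows osd.1 under its host bucket (c1 here)
    ssh c1 systemctl status ceph-osd@1   # check the matching systemd unit on that node
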

2.2: Solution

  • Checking the health status again finally revealed the cause: time synchronization on node c1 was broken (a sketch for quantifying the skew follows the output):

  • [root@ct ~(keystone_admin)]# ceph -s
      cluster:
        id:     8c9d2d27-492b-48a4-beb6-7de453cf45d6
        health: HEALTH_WARN
                Degraded data redundancy: 2127/6381 objects degraded (33.333%), 133 pgs degraded, 192 pgs undersized
            clock skew detected on mon.c1	'//node c1's clock is out of sync'
     
      services:
        mon: 3 daemons, quorum ct,c1,c2
        mgr: ct(active), standbys: c2, c1
        osd: 3 osds: 2 up, 2 in
     
      data:
        pools:   3 pools, 192 pgs
        objects: 2.13 k objects, 13 GiB
        usage:   29 GiB used, 2.0 TiB / 2.0 TiB avail
        pgs:     2127/6381 objects degraded (33.333%)
                 133 active+undersized+degraded
                 59  active+undersized
     
    
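  • Before fixing anything, the skew can be quantified. A sketch using standard tooling (ntpq assumes the ntp suite this post already uses; ceph time-sync-status is a stock mon command):

    ceph time-sync-status   # per-monitor clock offsets as seen by the mon leader
    ssh c1 ntpq -p          # NTP peer state on the skewed node
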
    
  • Re-sync the time on node c1 and restart the affected services (a sketch for making the sync persistent follows the commands):

  • [root@c1 ~]# ntpdate ct	'//sync the time from ct'
    12 Mar 18:23:27 ntpdate[37287]: step time server 192.168.11.100 offset -28799.645303 sec
    [root@c1 ~]# date	'//check that the clocks now match'
    Thu Mar 12 18:23:33 CST 2020
    [root@c1 ~]# systemctl restart ceph-osd.target	'//restart the osd service'
    
    
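  • Note that ntpdate is a one-shot correction, and the offset above (-28799 s, roughly 8 hours) looks like a timezone-sized drift that will likely come back after a reboot. A hedged sketch of making c1 sync continuously against ct, assuming the ntpd setup this deployment appears to use:

    echo "server ct iburst" >> /etc/ntp.conf   # point c1's ntpd at the ct node
    systemctl enable --now ntpd                # start ntpd now and on every boot
    ntpq -p                                    # ct should appear as a peer once synced
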
  • Check the health status once more; the problem is resolved:

    [root@ct ~(keystone_admin)]# ceph -s
      cluster:
        id:     8c9d2d27-492b-48a4-beb6-7de453cf45d6
        health: HEALTH_OK
     
      services:
        mon: 3 daemons, quorum ct,c1,c2
        mgr: ct(active), standbys: c2
        osd: 3 osds: 3 up, 3 in
     
      data:
        pools:   3 pools, 192 pgs
        objects: 2.13 k objects, 13 GiB
        usage:   43 GiB used, 3.0 TiB / 3.0 TiB avail
        pgs:     192 active+clean
    
    

2.3: Problem successfully solved!


Reposted from blog.csdn.net/CN_TangZheng/article/details/104825088