health: HEALTH_WARN

西门良才

2023-12-01

The too few PGs per OSD warning

ceph -s
  cluster:
    id:     da54ea6a-111a-434c-b78e-adad1ac66abb
    health: HEALTH_WARN
            too few PGs per OSD (10 < min 30)
 
  services:
    mon: 3 daemons, quorum master1,master2,master3
    mgr: master1(active), standbys: master2
    osd: 3 osds: 3 up, 3 in
 
  data:
    pools:   1 pools, 10 pgs
    objects: 17  objects, 24 MiB
    usage:   3.0 GiB used, 24 GiB / 27 GiB avail
    pgs:     10 active+clean

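The output above warns that the number of PGs per OSD is below the minimum of 30. The pool has only 10 PGs. Each OSD holds roughly pg_num × replica count / OSD count PG copies, so with 2 replicas on 3 OSDs that is 10 × 2 / 3 ≈ 6 per OSD (the 10 reported in the warning is what 3 replicas would give: 10 × 3 / 3). Either way the figure is far below the minimum of 30, which triggers the warning.
If you store and operate on data while the cluster is in this state, it can hang and stop responding to I/O, and large numbers of OSDs may go down.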
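
The per-OSD figure can be checked with a little integer arithmetic. This is only a rough sketch of the estimate described above, not the exact calculation Ceph performs internally; the function name pgs_per_osd and its argument order are made up for illustration.

```shell
# Estimate how many PG copies land on each OSD.
# Usage: pgs_per_osd <pg_num> <replica_count> <osd_count>
# Hypothetical helper; integer division rounds down.
pgs_per_osd() {
    echo $(( $1 * $2 / $3 ))
}

pgs_per_osd 10 2 3   # 2 replicas on 3 OSDs, prints 6
pgs_per_osd 10 3 3   # 3 replicas on 3 OSDs, prints 10
```

Both results are well under 30, so the warning appears with either replica count.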

Increase the PG count of the default pool rbd:
ceph osd pool set rbd pg_num 50
At this point ceph -s reports that pg_num is greater than pgp_num, so pgp_num has to be raised as well:
ceph osd pool set rbd pgp_num 50
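
To see why 50 is enough, the smallest pg_num that reaches the threshold can be worked out by hand. The sketch below assumes the same rough estimate as before (PG copies spread evenly over the OSDs); min_pg_num and next_pow2 are hypothetical helper names, and rounding up to a power of two follows the general Ceph recommendation rather than anything this cluster requires.

```shell
# Smallest pg_num that brings every OSD up to the warning threshold.
# Usage: min_pg_num <min_pgs_per_osd> <osd_count> <replica_count>
min_pg_num() {
    # ceiling division: (a + b - 1) / b
    echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# Round a count up to the next power of two.
# Usage: next_pow2 <n>
next_pow2() {
    p=1
    while [ "$p" -lt "$1" ]; do
        p=$(( p * 2 ))
    done
    echo "$p"
}

min_pg_num 30 3 2                    # prints 45
next_pow2 "$(min_pg_num 30 3 2)"     # prints 64
```

With 2 replicas on 3 OSDs anything from 45 upwards clears the warning, so the value 50 used here works; 64 would be the nearest power of two.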
Check again:

ceph -s
  cluster:
    id:     da54ea6a-111a-434c-b78e-adad1ac66abb
    health: HEALTH_WARN
            application not enabled on 1 pool(s)

The application not enabled warning

[root@master1 ~]# ceph health detail
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
    application not enabled on pool 'rbd'
    use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
[root@master1 ~]# ceph osd pool application enable rbd rbd
enabled application 'rbd' on pool 'rbd'

Following the hint above, pick the application that matches how the pool is used: cephfs, rbd or rgw. I am using rbd block storage here.
Check:

[root@master1 ~]# ceph -s
  cluster:
    id:     da54ea6a-111a-434c-b78e-adad1ac66abb
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum master1,master2,master3
    mgr: master1(active), standbys: master2
    osd: 3 osds: 3 up, 3 in
 
  data:
    pools:   1 pools, 50 pgs
    objects: 17  objects, 24 MiB
    usage:   3.0 GiB used, 24 GiB / 27 GiB avail
    pgs:     50 active+clean

HEALTH_WARN application not enabled on pool '.rgw.root'
When creating the rgw object store, the following warning appears:

[root@master1 ceph-cluster]# ceph health detail
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
    application not enabled on pool '.rgw.root'
    use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
[root@master1 ceph-cluster]# ceph osd pool application enable .rgw.root rgw
enabled application 'rgw' on pool '.rgw.root'
Run ceph -s to confirm the cluster is healthy again.
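
When several pools are affected, the pool names can be pulled out of the health output instead of being read off by eye. This is a small sketch that assumes the message format shown in the output above; pools_without_app is a hypothetical helper name, and it only parses text fed to it on stdin.

```shell
# Print the names of pools that "ceph health detail" reports as having
# no application enabled. Reads the command output on stdin.
pools_without_app() {
    sed -n "s/^ *application not enabled on pool '\(.*\)'\$/\1/p"
}

# Sample input copied from the output shown above.
printf '%s\n' \
    "HEALTH_WARN application not enabled on 1 pool(s)" \
    "POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)" \
    "    application not enabled on pool '.rgw.root'" \
    | pools_without_app
```

On a live cluster the input would come from ceph health detail piped into the function. Which application to enable for each pool still has to be decided by hand, as in the rbd and rgw cases above.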

Handling the too many PGs per OSD warning
Since the L (Luminous) release, this limit is set by mon_max_pg_per_osd, and its default value was lowered from 300 to 200.
Edit ceph.conf:

[root@master1 ceph-cluster]# cat ceph.conf 
[global]
......
mon_max_pg_per_osd = 1000
Append the line above under the [global] section.
Push the file to every node:
ceph-deploy --overwrite-conf config push master1 master2 master3
Restart the mgr and the mon on each node (run each command on the node named in the unit):
systemctl restart ceph-mgr@master1
systemctl restart ceph-mon@master1
systemctl restart ceph-mon@master2
systemctl restart ceph-mon@master3
Run ceph -s to confirm the cluster is healthy again.
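
If the warning does not clear, it is worth confirming that the pushed file really contains the new value on each node. The sketch below reads a key from the [global] section of an ini-style file; conf_get_global is a hypothetical helper, it only matches the underscore spelling of the key, and it inspects the file on disk, not the value the running daemons are using.

```shell
# Print the value of a key from the [global] section of an ini-style file.
# Usage: conf_get_global <file> <key>
conf_get_global() {
    awk -F '=' -v key="$2" '
        /^\[/ { in_global = ($0 ~ /^\[global\]/) }   # track the current section
        in_global {
            k = $1; v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", k)           # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)           # and around the value
            if (k == key) print v
        }
    ' "$1"
}

# Demo on a throwaway file; on a node the path would normally be /etc/ceph/ceph.conf.
tmp=$(mktemp)
printf '[global]\nmon_max_pg_per_osd = 1000\n[osd]\nmon_max_pg_per_osd = 5\n' > "$tmp"
conf_get_global "$tmp" mon_max_pg_per_osd   # prints 1000
rm -f "$tmp"
```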

If ceph.conf has not been modified, most parameters keep their default values, and in that case creating an rgw user also fails with this error:

rgw_init_ioctx ERROR: librados::Rados::pool_create returned (34) Numerical result out of range (this can be due to a pool or placement group misconfiguration, e.g. pg_num < pgp_num or mon_max_pg_per_osd exceeded)

Once mon_max_pg_per_osd has been raised, the command succeeds:

[root@master1 ceph-cluster]# radosgw-admin user create --uid=radosgw --display-name='radosgw'
{
    "user_id": "radosgw",
    "display_name": "radosgw",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,      # 1000 is the default per-user bucket limit, not mon_max_pg_per_osd
    "auid": 0,
    "subusers": [],
......

The fix is the same as above: raise the default in the configuration file, then restart.

Handling ceph ERROR problems

ERROR reported while installing ceph

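When ceph reports an ERROR during installation, there are usually two causes:
1. a dependency package required by the installation is missing;
2. a problem with the repo source, so packages cannot be downloaded.
To resolve either one, refer to the ceph installation documentation.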
No data was received after 300 seconds, disconnecting...
This is a network timeout. Work around it by installing ceph with yum directly on each node:
yum -y install ceph
Error: over-write
This usually means ceph.conf was modified but the change has not taken effect. To fix it:
ceph-deploy --overwrite-conf config push node1 node2 node3 node4
or
ceph-deploy --overwrite-conf mon create node1 node2 node3 node4
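Error: [Errno 2] No such file or directory
If this appears during installation, ceph was probably uninstalled earlier without being cleaned up completely.
After uninstalling ceph, the contents of the following directories need to be deleted: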

rm -rf /etc/ceph/*
rm -rf /var/lib/ceph/*
rm -rf /var/log/ceph/*
rm -rf /var/run/ceph/*

After ceph has been in use for a while, uninstalling and reinstalling it keeps the old data around, so these data directories must be cleared first.
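
Because these rm -rf commands are destructive, it can help to wrap them so that nothing is deleted by accident. The following is a minimal sketch covering the four directories listed above; purge_ceph_dirs and its --yes switch are hypothetical names, and the real deletion needs root.

```shell
# Clear leftover ceph state. By default this only prints what would be
# removed; pass --yes to actually delete. Hypothetical helper.
purge_ceph_dirs() {
    for dir in /etc/ceph /var/lib/ceph /var/log/ceph /var/run/ceph; do
        if [ "${1:-}" = "--yes" ]; then
            # ${dir:?} aborts if dir is ever empty, so this can never become "rm -rf /*"
            rm -rf "${dir:?}"/*
        else
            echo "would remove: $dir/*"
        fi
    done
}

purge_ceph_dirs    # dry run: lists the four paths, deletes nothing
```

Making the dry run the default means a mistyped invocation only prints paths.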
