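Before I knew it, a month of my internship had gone by. My main job so far has been setting up a Nagios + Centreon monitoring platform. Doing it hands-on went fairly quickly, and although there were plenty of bugs along the way, the setup went reasonably smoothly. After that I started writing my own disk-monitoring script.
First, why write a disk-monitoring script myself? To be honest, I am not entirely sure, since Nagios-plugins already ships a check_disk script. My mentor probably wanted to give me some practice, and also wanted a script that fits our actual environment better.
The hardware I had to deal with: three servers making up a test cloud platform, two of them with RAID cards, two of them with SSDs, plus a number of HDDs. Yes, that is all, but for a newbie like me it was plenty to keep me busy.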
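For hosts with a RAID card, MegaCli is a good choice. Download and install MegaCli, then get started:
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL --- show RAID (logical drive) info
/opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL --- show RAID adapter info
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL --- show physical drive info
Play around with these and look at what they print. There is a lot of output, so just skim it. If the host has no RAID card, running MegaCli on it prints only Exit Code: 0x00
The one I mainly use is the third command, /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
Then I filter out the fields I need: /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Device Id|Error|Media Type'
Device Id: just an ID, needed later when monitoring SSD life
Error: the Error Count values are the error information we want to watch; 0 means no errors, anything non-zero is cause for concern
Media Type: the drive type. I mainly use it to find which Device Id the SSD in the host corresponds to, because I do not know of any other way to map a Device Id to a drive or partition. Here is my output: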
[root@cloud-13 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Device Id|Error|Media Type'
Device Id: 0
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 1
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 2
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 3
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 4
Media Error Count: 0
Other Error Count: 0
Media Type: Solid State Device
With this, all that is left is to write code that checks the number after each Error Count, and that gives us the monitoring.
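As a rough sketch of what "write code that checks the number after each Error Count" could look like, the filtered output can be grouped into one record per drive. This is written for Python 3 and is not the script from the end of this post; the function names `parse_pdlist`, `ssd_device_ids` and `worst_error_count` are made up for illustration, and it assumes every drive record starts with a `Device Id` line as in the output above.

```python
def parse_pdlist(text):
    """Group the filtered MegaCli -PDList lines into one dict per drive."""
    drives = []
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == 'Device Id':
            # Assumption: a "Device Id" line opens the record of the next drive
            drives.append({'Device Id': int(value)})
        elif drives and key in ('Media Error Count', 'Other Error Count'):
            drives[-1][key] = int(value)
        elif drives and key == 'Media Type':
            drives[-1][key] = value
    return drives


def ssd_device_ids(drives):
    """Device Ids whose Media Type is reported as a solid state device."""
    return [d['Device Id'] for d in drives
            if d.get('Media Type') == 'Solid State Device']


def worst_error_count(drives):
    """Largest Media/Other Error Count over all drives (0 if there are none)."""
    counts = [d.get(k, 0) for d in drives
              for k in ('Media Error Count', 'Other Error Count')]
    return max(counts) if counts else 0


if __name__ == '__main__':
    # Two of the records shown above, pasted in as sample input
    sample = """Device Id: 0
Media Error Count: 0
Other Error Count: 0
Media Type: Hard Disk Device
Device Id: 4
Media Error Count: 0
Other Error Count: 0
Media Type: Solid State Device"""
    drives = parse_pdlist(sample)
    print(ssd_device_ids(drives), worst_error_count(drives))
```

Keeping the counts attached to their Device Id makes it easy to say which drive is misbehaving, and the same records answer the question of which Device Id is the SSD.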
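I mentioned SSD life above, so let me cover it here as well. smartctl can report SSD life (along with many other results; SSD life is only one of them), but on a host with a RAID card it needs the Device Id obtained above.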
[root@cloud-13 ~]# smartctl -a -d megaraid,4 /dev/sdc1
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
/dev/sdc1 [megaraid_disk_04] [SAT]: Device open changed type from 'megaraid' to 'sat'
Smartctl open device: /dev/sdc1 [megaraid_disk_04] [SAT] failed: SATA device detected,
MegaRAID SAT layer is reportedly buggy, use '-d sat+megaraid,N' to try anyhow
On my host it asks me to add sat, so I did as I was told:
[root@cloud-13 ~]# smartctl -a -d sat+megaraid,4 /dev/sdc1
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: OCZ INTREPID 3600
Serial Number: A21N8061423000004
LU WWN Device Id: 5 e83a97 100006dc5
Firmware Version: 1.4.6.0
User Capacity: 800,166,076,416 bytes [800 GB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ACS-2 (revision not indicated)
Local Time is: Tue Aug 25 15:20:02 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 100 100 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 3964
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 28
100 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 2547072
171 Unknown_Attribute 0x0000 090 000 000 Old_age Offline - 12030
174 Unknown_Attribute 0x0000 071 100 000 Old_age Offline - 20
184 End-to-End_Error 0x0000 009 100 000 Old_age Offline - 1282
187 Reported_Uncorrect 0x0000 100 100 000 Old_age Offline - 0
190 Airflow_Temperature_Cel 0x0000 048 054 000 Old_age Offline - 48
195 Hardware_ECC_Recovered 0x0000 000 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 000 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 000 100 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0000 100 100 000 Old_age Offline - 3562
199 UDMA_CRC_Error_Count 0x0000 100 100 000 Old_age Offline - 3443
202 Data_Address_Mark_Errs 0x0000 100 100 000 Old_age Offline - 2061332509
205 Thermal_Asperity_Rate 0x0000 100 100 000 Old_age Offline - 3000
206 Flying_Height 0x0000 000 100 000 Old_age Offline - 0
207 Spin_High_Current 0x0000 002 100 000 Old_age Offline - 64
208 Spin_Buzz 0x0000 000 100 000 Old_age Offline - 9
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
211 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
212 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
213 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
214 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
221 G-Sense_Error_Rate 0x0000 100 100 000 Old_age Offline - 0
222 Loaded_Hours 0x0000 100 100 000 Old_age Offline - 0
230 Head_Amplitude 0x0000 001 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 5792
251 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 22849
SMART Error Log not supported
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Device does not support Selective Self Tests/Logging
Then just grab the line below. The 100 means 100% of the life remains, that is, no wear at all; it is a new drive, after all.
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
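A small sketch of grabbing that line in code, written for Python 3. It assumes the life percentage is the normalized VALUE column (the fourth field of the attribute row); the function name `wearout_value` is hypothetical.

```python
def wearout_value(smartctl_output):
    """Return the VALUE column of Media_Wearout_Indicator, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == 'Media_Wearout_Indicator':
            # Assumption: remaining life is the normalized VALUE, not RAW_VALUE
            return int(fields[3])
    return None


if __name__ == '__main__':
    row = '233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100'
    print(wearout_value(row))  # prints 100
```

Splitting on arbitrary whitespace rather than on single spaces keeps the field positions stable no matter how smartctl pads the columns.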
I mostly followed the two blog posts below, which explain things in detail:
http://blog.yufeng.info/archives/1096
http://www.woxihuan.com/117417/1336095005082619.shtml
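For hosts without a RAID card, smartctl on its own works well for checking whether a disk has errors:
# smartctl -a /dev/sdx
This shows all information; sdx stands for the disk device on your own machine.
Since I only need the Error counter log, I can use this instead:
# smartctl -l error /dev/sdc
which lists only the Error counter log: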
[root@cloud-11 ~]# smartctl -l error /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 0 20680 755.998 0
write: 0 0 0 0 8177 1356.647 0
verify: 0 0 0 0 760 61.354 0
Non-medium error count: 0
Look at the error columns: 0 means there is no problem. The code only has to extract these values.
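One way to extract those columns, sketched in Python 3 under the assumption that the read, write and verify rows always have the eight fields shown above. The names `parse_error_counter_log` and `worst_error_count` are made up for this example.

```python
def parse_error_counter_log(text):
    """Pick the error columns out of the read/write/verify rows."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[0] in ('read:', 'write:', 'verify:'):
            rows[fields[0].rstrip(':')] = {
                'ecc_fast': int(fields[1]),
                'ecc_delayed': int(fields[2]),
                'rereads_rewrites': int(fields[3]),
                'total_corrected': int(fields[4]),
                # fields[5] (invocations) and fields[6] (gigabytes) are not errors
                'total_uncorrected': int(fields[7]),
            }
    return rows


def worst_error_count(rows):
    """Largest value in any error column, or 0 when no rows were found."""
    return max((v for row in rows.values() for v in row.values()), default=0)


if __name__ == '__main__':
    # The three rows from the output above
    log = """read: 0 0 0 0 20680 755.998 0
write: 0 0 0 0 8177 1356.647 0
verify: 0 0 0 0 760 61.354 0"""
    print(worst_error_count(parse_error_counter_log(log)))
```

The correction-algorithm invocations and gigabytes-processed columns are skipped on purpose, since they grow during normal operation and say nothing about errors.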
On this host without a RAID card, smartctl shows no Error counter log for the SSD:
[root@cloud-11 ~]# smartctl -a /dev/sdb
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.123.2.openstack.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: OCZ INTREPID 3600
Serial Number: A21N8061423000020
LU WWN Device Id: 5 e83a97 100006dd5
Firmware Version: 1.4.6.0
User Capacity: 800,166,076,416 bytes [800 GB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ACS-2 (revision not indicated)
Local Time is: Tue Aug 25 15:34:29 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 25) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 100 100 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 5116
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 12
100 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 4009824
171 Unknown_Attribute 0x0000 090 000 000 Old_age Offline - 12041
174 Unknown_Attribute 0x0000 066 100 000 Old_age Offline - 8
184 End-to-End_Error 0x0000 009 100 000 Old_age Offline - 1271
187 Reported_Uncorrect 0x0000 100 100 000 Old_age Offline - 0
190 Airflow_Temperature_Cel 0x0000 045 063 000 Old_age Offline - 45
195 Hardware_ECC_Recovered 0x0000 000 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 000 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 000 100 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0000 100 100 000 Old_age Offline - 2732
199 UDMA_CRC_Error_Count 0x0000 100 100 000 Old_age Offline - 2458
202 Data_Address_Mark_Errs 0x0000 100 100 000 Old_age Offline - 2371926836
205 Thermal_Asperity_Rate 0x0000 100 100 000 Old_age Offline - 3000
206 Flying_Height 0x0000 000 100 000 Old_age Offline - 0
207 Spin_High_Current 0x0000 003 100 000 Old_age Offline - 90
208 Spin_Buzz 0x0000 000 100 000 Old_age Offline - 14
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 9175
211 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
212 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
213 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
214 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
221 G-Sense_Error_Rate 0x0000 100 100 000 Old_age Offline - 0
222 Loaded_Hours 0x0000 100 100 000 Old_age Offline - 0
230 Head_Amplitude 0x0000 001 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 7079
251 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 20961
SMART Error Log not supported
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Aborted by host 90% 0 -
# 2 Short offline Aborted by host 90% 0 -
Device does not support Selective Self Tests/Logging
But it does report the SSD life:
233 Media_Wearout_Indicator 0x0000 100 000 000 Old_age Offline - 100
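I searched for a long time but still found no way to detect errors on this SSD without RAID; all I can do is monitor its life. If anyone knows a way, please let me know.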
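That more or less completes it. The overall approach is:
Using the checking tools:
for disks that are not behind a RAID card, use smartctl -a /dev/sdX and watch whether the values in the Error counter log columns increase;
for disks behind a RAID card, use MegaCli to watch the Error Count.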
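The approach above can be sketched as two small decisions: which command to run, and how to turn the collected error counts into a Nagios exit code. This Python 3 sketch uses the thresholds from the script at the end of this post (0 is OK, 1 to 4 is WARNING, 5 or more is CRITICAL); the function names `pick_command` and `classify` and the `/dev/sdX` default are assumptions for illustration.

```python
# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

MEGACLI = '/opt/MegaRAID/MegaCli/MegaCli64'


def pick_command(has_raid, disk='/dev/sdX'):
    """Choose the tool: MegaCli behind a RAID card, smartctl otherwise."""
    if has_raid:
        return [MEGACLI, '-PDList', '-aALL']
    return ['smartctl', '-l', 'error', disk]


def classify(error_counts, critical_at=5):
    """Map a list of error counts to a Nagios status."""
    if not error_counts:
        return UNKNOWN  # nothing could be parsed
    worst = max(error_counts)
    if worst == 0:
        return OK
    return CRITICAL if worst >= critical_at else WARNING


if __name__ == '__main__':
    print(pick_command(False, '/dev/sdc'))
    print(classify([0, 0, 3]))
```

Separating the parsing from the classification means the same thresholds apply whether the counts came from MegaCli or from smartctl.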
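Finally, I looked into ioerr_cnt. The operating system was Red Hat 5.x (I do not remember the exact version). You can use df -h to see how the disks are partitioned.
Every disk has this file under its directory, and it holds a single value: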
# cat /sys/block/sdb/device/ioerr_cnt
0x1494
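Judging by the name, ioerr_cnt should be a count of I/O errors, so its value would be the number of I/O errors that have occurred. 0x1494 (5268 in decimal) is not a small number. Does it mean the disk is faulty?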
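The file holds a hexadecimal string, so it has to be converted before it can be compared with anything. A minimal Python 3 sketch, with the helper names `parse_hex_counter` and `read_ioerr_cnt` invented for this example and the sysfs path taken from the command above:

```python
def parse_hex_counter(text):
    """Convert a sysfs counter string such as '0x1494' to an integer."""
    return int(text.strip(), 16)


def read_ioerr_cnt(disk):
    """Read ioerr_cnt for a disk name such as 'sdb'."""
    with open('/sys/block/%s/device/ioerr_cnt' % disk) as f:
        return parse_hex_counter(f.read())


if __name__ == '__main__':
    print(parse_hex_counter('0x1494'))  # prints 5268
```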
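My mentor then found a discussion of this question in the Red Hat community and showed it to me. If you are interested, you can look it up there yourself; I am not in a position to provide it here: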
[Troubleshooting] How do I determine which io are causing ioerr_cnt to increase?
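That article offers a way to determine which I/O caused the error. But even once you have found it, does it have anything to do with the health of the disk? If only some process hit an I/O error, and the cause lies with that process itself, then it has nothing to do with the disk at all.
I checked ioerr_cnt on all 9 disks across my three hosts and found that only one disk had an ioerr_cnt of 0, yet smartctl and MegaCli reported 0 errors for all of them.
In the end I decided to drop the ioerr_cnt check, since it cannot be tied reliably to disk health, and to treat MegaCli and smartctl as the standard.
Written down like this it does not look like much, but I spent nearly a week on the research, plus several more days writing the code. It is all implemented in Python, and since I had not touched Python for a long time, I spent a lot of time looking up how to use various functions. Still, I learned a great deal. I used to be somewhat in awe of Nagios plugin scripts (some of them look like gibberish when opened), but now I find them quite simple: the main thing is to pick the right tool, and most of the rest is string processing. Python is a good thing.
Here is the final code. It is quite simple, nothing fancy: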
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Description:
# This application is used to check the physical disks by using the MegaCLI and smartctl tools.
#
# Author: Jiang Chuan <806692341@qq.com>
#
# Note: this is Python 2 code (it uses the commands module and print statements).
import commands
import sys
import argparse

SMARTCTL = 'smartctl'
ListError = '-l error'
DISK = '/dev/sdc'
LSPCI = 'lspci | grep -i raid'
MEGACLI = '/opt/MegaRAID/MegaCli/MegaCli64'
PDLIST = '-PDList -aALL'
DEVICE = '|grep \'Device Id\''
ERROR = '|grep Error'

# nagios exit code
STATUS_OK = 0
STATUS_WARNING = 1
STATUS_ERROR = 2
STATUS_UNKNOWN = 3

def check_smartctl():
    (status, output) = commands.getstatusoutput('%s %s %s' % (SMARTCTL, ListError, DISK))
    line = output.split('\n')
    if status != 0:
        # Show the fourth output line when there is one, else the whole output
        detail = line[3] if len(line) > 3 else output
        print 'UNKNOWN|Something unexpected happened:' + detail
        return STATUS_UNKNOWN
    str_read = ''
    str_write = ''
    str_verify = ''
    for item in line:
        # Rows of the Error counter log start with read:, write: or verify:
        row = item.strip()
        if row.startswith('read:'):
            str_read = row
        elif row.startswith('write:'):
            str_write = row
        elif row.startswith('verify:'):
            str_verify = row
    if str_read != '' and str_write != '' and str_verify != '':
        error_list = [max_error(str_read), max_error(str_write), max_error(str_verify)]
        if max(error_list) >= 5:
            print 'ERROR|There are too many errors:' + str(error_list) + ' >= 5'
            return STATUS_ERROR
        elif max(error_list) == 0:
            print 'OK'
            return STATUS_OK
        else:
            print 'WARNING|There are some errors that need handling:' + str(error_list) + ' < 5'
            return STATUS_WARNING
    else:
        print 'UNKNOWN|We can not get the error count, please check'
        return STATUS_UNKNOWN
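
def max_error(row):
    # Fields after the row label: ECC fast, ECC delayed, rereads/rewrites,
    # total corrected, invocations, gigabytes processed, total uncorrected.
    # split() with no argument drops the empty strings between columns.
    words = row.split()
    lis = [int(words[1]), int(words[2]), int(words[3]), int(words[4]), int(words[7])]
    return max(lis)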

def check_lsi():
    (status, output) = commands.getstatusoutput('%s' % (LSPCI))
    if status != 0:
        # grep also exits non-zero when lspci lists no RAID controller at all
        print 'UNKNOWN|lspci failed or found no RAID controller'
        return STATUS_UNKNOWN
    else:
        if output.find('LSI') >= 0:
            return STATUS_OK
        else:
            print 'ERROR|There is no LSI RAID controller in the lspci output'
            return STATUS_ERROR

def check_MegaCli():
    lsi_status = check_lsi()
    if lsi_status != STATUS_OK:
        # check_lsi() has already printed the reason
        return lsi_status
    device_id = get_device_id()
    error_count = get_error_count()
    # Some judgement, maybe useless
    if len(device_id) < 1 or len(error_count) < 1:
        print 'ERROR|There is some error because one of the device_id and error_count is 0'
        return STATUS_ERROR
    elif len(device_id) * 2 != len(error_count):
        print 'ERROR|There is some error because the num of error_count does not equal to double device_id'
        return STATUS_ERROR
    else:
        worst = max(error_count)
        if worst == 0:
            print 'OK'
            return STATUS_OK
        # error_count holds two entries (media, other) per device, in device order
        bad_device = device_id[error_count.index(worst) // 2]
        if worst >= 5:
            print 'ERROR|There is ' + str(worst) + ' error in device ' + str(bad_device)
            return STATUS_ERROR
        else:
            print 'WARNING|There is ' + str(worst) + ' error in device ' + str(bad_device)
            return STATUS_WARNING
    # Just for testing, print the error and the device_id
    # i = 0
    # while i < len(device_id):
    #     print 'Device_Id ' + str(device_id[i]) + ':'
    #     print 'Media Error Count :' + str(error_count[2*i])
    #     print 'Other Error Count :' + str(error_count[2*i+1])
    #     i = i + 1

def get_device_id():
    (status, output) = commands.getstatusoutput('%s %s %s' % (MEGACLI, PDLIST, DEVICE))
    if status != 0:
        print 'ERROR|Error for get device id'
        # Return an empty list so callers can still use len() and "in"
        return []
    device_id = []
    for item in output.split('\n'):
        # The number is the last whitespace-separated field of each line
        device_id.append(int(item.split()[-1]))
    return device_id

def get_error_count():
    (status, output) = commands.getstatusoutput('%s %s %s' % (MEGACLI, PDLIST, ERROR))
    if status != 0:
        print 'ERROR|Error for get MegaCli error count'
        # Return an empty list so callers can still use len() and max()
        return []
    error_count = []
    for item in output.split('\n'):
        error_count.append(int(item.split()[-1]))
    return error_count

def report_ssd_life(status, output):
    # Shared by check_ssd() and check_ssd_no_id(); output is the grep'ed
    # Media_Wearout_Indicator line
    if status != 0:
        print 'UNKNOWN|Something unexpected happened while checking the SSD life.'
        return STATUS_UNKNOWN
    # Columns: ID# NAME FLAG VALUE WORST THRESH ...
    # Assumption: the remaining life is the normalized VALUE column
    life = int(output.split()[3])
    if life >= 50:
        print 'OK|The life of the SSD is ' + str(life) + '% left'
        return STATUS_OK
    elif life >= 20:
        print 'WARNING|The life of the SSD is ' + str(life) + '% < 50%'
        return STATUS_WARNING
    else:
        print 'CRITICAL|The life of the SSD is ' + str(life) + '% < 20%'
        return STATUS_ERROR

def check_ssd(device_id, disk):
    (status, output) = commands.getstatusoutput('%s %s%s %s %s' % (SMARTCTL, '-a -d sat+megaraid,', device_id, disk, '|grep Media_Wearout_Indicator'))
    return report_ssd_life(status, output)

def check_ssd_no_id(disk):
    (status, output) = commands.getstatusoutput('%s %s %s %s' % (SMARTCTL, '-a ', disk, '|grep Media_Wearout_Indicator'))
    return report_ssd_life(status, output)

def init_option():
    parser = argparse.ArgumentParser(description="DISK nagios plugin.")
    parser.add_argument('-r', '--raid', help='raid or not(y/n)')
    parser.add_argument('-s', '--ssd', help='ssd or not(y/n), need device_id(0,1,2) and disk(/dev/sdc)')
    parser.add_argument('-i', '--device', help='Device Id(0,1,2), which is needed in check_ssd')
    parser.add_argument('-d', '--disk', help='DISK(/dev/sdx),which is needed in check_ssd')
    return parser

def main():
    parser = init_option()
    args = parser.parse_args()
    # Only an explicit "-s y" asks for the SSD life check
    want_ssd = (args.ssd == 'y')
    if args.raid == 'y':
        if not want_ssd:
            return check_MegaCli()
        if not args.device or not args.disk:
            print 'Error|Check ssd needs device id and disk'
            return STATUS_ERROR
        # Make sure the given Device Id is one that MegaCli reports
        device_id = get_device_id()
        if int(args.device) in device_id:
            return check_ssd(args.device, args.disk)
        print 'Error|Device Id ' + str(args.device) + ' was not found by MegaCli'
        return STATUS_ERROR
    else:
        if not want_ssd:
            return check_smartctl()
        # For the ssd doesn't need device id(no MegaCli)
        if not args.disk:
            print 'Error|Check the life of SSD with no ID must assign the DISK(/dev/sdx)'
            return STATUS_ERROR
        return check_ssd_no_id(args.disk)

if __name__ == '__main__':
    sys.exit(main())
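# usage: check_disk_health_v2.py [-h] [-r RAID] [-s SSD] [-i DEVICE] [-d DISK]
# To monitor a machine's disks you have to specify the following for each machine, because nothing is detected automatically:
# Does it have RAID?
#   yes: check the SSD?
#     yes: check_ssd()
#     no: check_MegaCli()
#   no: check the SSD?
#     yes: check_ssd_no_id()
#     no: check_smartctl()
#
# All the parameters have to be given by hand, which is a bit of a hassle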