Two monitoring tools for automated operations, noted here for later reference. First, a quick record of each:
god
Written in Ruby; reportedly used at Xiaomi.
http://godrb.com/
monit
Written in C; currently being trialed on our IDOL system.
https://mmonit.com/monit/documentation/monit.html
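Below are the monit configuration file and the script it runs through exec.
The configuration file lives in /monit/etc/monit.d, one file per service, named <service process name>.monit: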
check process content-hist-b with pidfile /app/cfg/content-hist-b/content.pid
    start program = "/app/etc/init.d/content-hist-b start"
        as uid idol and gid users
    stop program = "/app/etc/init.d/content-hist-b stop"
        as uid idol and gid users
    if failed host 127.0.0.1 port 9500 then restart
    if 3 restarts within 3 cycles then exec "/usr/local/monit/bin/restart.sh"
        as uid idol and gid users
ver1:
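#!/bin/bash
# ver1 runs as idol (the exec line above uses "as uid idol"), so sysctl needs sudo;
# the password is piped to sudo -S
Condir=/app/cfg/content-hist-b
Startfile=/app/etc/init.d/content-hist-b
Pid=content-hist-b
# Flush dirty pages first, then drop the page cache
sync
echo "D&Gby1900d129" | sudo -S /usr/sbin/sysctl -w vm.drop_caches=3
# pgrep -f prints the PIDs of every process whose command line contains $Pid
count=$(pgrep -f "$Pid") && echo "$count"
if [ -n "$count" ]; then
    sleep 1
    echo "Process is busy"
    kill -9 $count    # unquoted on purpose: one argument per PID
    sleep 60
else
    echo "Process is stopped"
fi
# In both cases: clean up, then restart the service
cd "$Condir" && sh clean.sh && "$Startfile" restart
exit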
ver2:
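#!/bin/bash
# ver2 runs as root, so sysctl needs no sudo; the restart itself is done as idol via su
Condir=/app/cfg/content-hist-b
Startfile=/app/etc/init.d/content-hist-b
Pid=content-hist-b
# Flush dirty pages first, then drop the page cache
sync
sysctl -w vm.drop_caches=3
# pgrep -f prints the PIDs of every process whose command line contains $Pid
count=$(pgrep -f "$Pid") && echo "$count"
if [ -n "$count" ]; then
    sleep 1
    echo "Process is busy"
    kill -9 $count    # unquoted on purpose: one argument per PID
    sleep 30
else
    echo "Process is stopped"
fi
# In both cases: clean up, then restart the service as idol
su - idol -c "cd $Condir && sh clean.sh && $Startfile restart"
exit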
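Both versions sleep a fixed 60 or 30 seconds after kill -9 and assume the process is gone by then. A minimal sketch of an alternative, assuming bash and the procps pgrep: the hypothetical function kill_and_wait kills every matching PID and then polls until none remain, giving up after a timeout. The fixed sleep may also be covering other shutdown work the service needs, so treat this as a starting point rather than a drop-in replacement.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts):
# kill every process whose full command line matches a pattern,
# then poll once per second until they are gone or the timeout expires.
# Returns 0 when nothing matches (anymore), 1 if processes survive the timeout.
kill_and_wait() {
    local pattern=$1 timeout=${2:-60} pids waited=0
    pids=$(pgrep -f "$pattern") || return 0   # no matching process: nothing to do
    echo "Killing: $pids"
    kill -9 $pids 2>/dev/null                  # unquoted on purpose: one argument per PID
    while pgrep -f "$pattern" >/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1                           # still alive after the timeout
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In ver2 it would replace the kill/sleep pair, e.g. kill_and_wait "$Pid" 30 || echo "Process did not exit".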
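In practice, though, this setup did not feel very reliable, so I also wrote a shell script that checks whether the monitored port is reachable and restarts the process automatically:
portdd.sh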
#!/bin/bash
# Runs as root in an endless loop and checks the port once an hour
Condir=/app/cfg/content-shortterm-9000
Startfile=/app/etc/init.d/content-shortterm-9000
Pid=content-shortterm-9000
DATAFILE=/app/data/content-shortterm-9000
while :; do
    # nc -v writes its "succeeded!" message to stderr, so parsing stdout with awk
    # never matches; rely on the exit status of nc -z (0 = connection accepted) instead
    if nc -z -w 10 192.168.5.137 9000 >/dev/null 2>&1; then
        sleep 1
        echo "process is ok"
    else
        sleep 1
        echo "Process is unresponsive"
        count=$(pgrep -f "$Pid")
        [ -n "$count" ] && kill -9 $count    # only kill if the process still exists
        sleep 60
        # Timestamp taken at restart time, so repeated restarts do not collide
        DATE=$(date +%Y-%m-%d-%H:%M)
        mv "$DATAFILE" "$DATAFILE$DATE"
        su idol -c "cd $Condir && sh clean.sh && $Startfile restart"
        echo "process restart complete"
    fi
    sync
    sysctl -w vm.drop_caches=3
    sleep 3600
done
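The loop above depends on nc being installed. A minimal sketch of a dependency-free check, assuming bash was built with /dev/tcp redirection support (usually enabled on Linux distributions) and that GNU coreutils timeout is available: the hypothetical function port_open succeeds only if a TCP connection to host:port can be opened within the given number of seconds.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script):
# succeed if a TCP connection to host:port opens within $3 seconds (default 10).
# The redirection to /dev/tcp/HOST/PORT is handled by bash itself;
# timeout keeps an unreachable host from blocking the monitoring loop.
port_open() {
    local host=$1 port=$2 secs=${3:-10}
    timeout "$secs" bash -c ": >/dev/tcp/$host/$port" 2>/dev/null
}
```

In portdd.sh it would replace the nc test, e.g. if port_open 192.168.5.137 9000 10; then ... Like nc -z, it only looks at the exit status, so it does not depend on the wording of any tool's verbose output.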
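The script above uses netcat to probe the port; nmap works as well, e.g. "nmap 192.168.155.249 -p 22 | grep 22". For application processes that do not listen on any port, you can instead use stat to check how long ago the process's log file was last updated, flag processes whose logs have gone quiet for too long, and act on them.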
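For the log-file check just described, a minimal sketch assuming GNU coreutils (stat -c %Y prints a file's modification time in seconds since the epoch): the hypothetical functions log_age and log_is_stale report how many seconds a log has been silent and whether that exceeds a threshold.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original scripts).
# log_age FILE: print seconds since FILE was last modified (GNU stat).
log_age() {
    local now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$1") || return 1
    echo $(( now - mtime ))
}

# log_is_stale FILE MAX_IDLE: succeed if FILE has not been written for
# more than MAX_IDLE seconds; a missing file also counts as stale.
log_is_stale() {
    local file=$1 max_idle=$2 age
    age=$(log_age "$file" 2>/dev/null) || return 0
    (( age > max_idle ))
}
```

A monitoring loop could then call, for example, log_is_stale /path/to/app.log 1800 and run the same kill/clean/restart sequence as portdd.sh when it succeeds (the log path and the 1800-second threshold are placeholders).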