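On AWS EMR (Elastic MapReduce), you can add a Bootstrap Action that installs Ganglia on a Hadoop cluster to monitor how the cluster is running. After launching all the instances, EMR downloads, installs, configures and starts Ganglia for you, with no setup work on your part, which is very convenient. The price of that convenience is flexibility. In most cases the Ganglia environment that EMR configures is good enough and covers the vast majority of needs. But as soon as an unusual requirement comes up and you have to configure Ganglia yourself, things get awkward.
I recently ran into exactly such a requirement: my MapReduce jobs are short, only a minute or two, so I need Ganglia to sample system performance at much shorter intervals. By default Ganglia samples every 15 seconds, so a one-minute job yields only 4 data points, which is useless. I therefore need a 1-second interval. The Bootstrap script that EMR uses to start Ganglia takes no parameters, so this cannot be done through EMR directly. After a good deal of digging I finally found a way. Using "change the monitoring interval to 1 second" as the example, this post describes how to change the Ganglia configuration on EMR.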
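The arithmetic behind that requirement can be sketched in a few lines of Ruby. The helper name samples_collected is made up for this illustration; it simply divides the job length by the sampling interval.

```ruby
# Hypothetical helper (not part of Ganglia or EMR): number of data points
# collected during a job, given the sampling interval in seconds.
def samples_collected(job_seconds, interval_seconds)
  job_seconds / interval_seconds
end

puts samples_collected(60, 15)  # prints 4  (default 15-second interval)
puts samples_collected(60, 1)   # prints 60 (1-second interval)
```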
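First, how EMR installs, configures and starts Ganglia on a Hadoop cluster. Enabling Ganglia on EMR is simple: add an action to the Bootstrap Actions whose script location is s3://elasticmapreduce/bootstrap-actions/install-ganglia. The EMR documentation covers this in more detail. This install-ganglia is a Ruby program; once an instance has started, EMR downloads install-ganglia from S3 and runs it on the instance. So let's look at what install-ganglia actually contains. It can be downloaded from https://s3.amazonaws.com/elasticmapreduce/bootstrap-actions/install-ganglia. The file is very simple: based on the Hadoop version you selected, it picks the matching version of a second script, ganglia-installer, and then runs this line: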
executor.run("hadoop fs -copyToLocal s3://#{BUCKET_NAME}/bootstrap-actions/ganglia/#{version_num}/ganglia-installer .")
Now let's look at what the ganglia-installer script does. It first downloads and unpacks Ganglia:
def download_and_unzip_ganglia
run("mkdir -p ~/source")
run("cd ~/source && wget http://#{BUCKET_NAME}.s3.amazonaws.com/bootstrap-actions/ganglia/2.0/#{GANGLIA}.tar.gz")
run("cd ~/source && tar xvfz #{GANGLIA}.tar.gz")
end
Next it configures and starts gmond:
def configure_gmond
run("sudo ldconfig")
run("sudo gmond --default_config > ~/gmond.conf")
run("sudo mv ~/gmond.conf /etc/gmond.conf")
run("sudo perl -pi -e 's/name = \"unspecified\"/name = \"AMZN-EMR\"/g' /etc/gmond.conf")
run("sudo perl -pi -e 's/owner = \"unspecified\"/name = \"AMZN-EMR\"/g' /etc/gmond.conf")
run("sudo perl -pi -e 's/send_metadata_interval = 0/send_metadata_interval = 10/g' /etc/gmond.conf")
if $instance_info['isMaster'].to_s == 'false' then
command = <<-COMMAND
sudo sed -i -e "s|\\( *mcast_join *=.*\\)|#\\1|" \
-e "s|\\( *bind *=.*\\)|#\\1|" \
-e "s|\\( *location *=.*\\)| location = \"master-node\"|" \
-e "s|\\(udp_send_channel {\\)|\\1\\n host=#{$master_dns}|" \
/etc/gmond.conf
COMMAND
$e.run(command)
else
command = <<-COMMAND
sudo sed -i -e "s|\\( *mcast_join *=.*\\)|#\\1|" \
-e "s|\\( *bind *=.*\\)|#\\1|" \
-e "s|\\(udp_send_channel {\\)|\\1\\n host=#{$ip}|" \
/etc/gmond.conf
COMMAND
$e.run(command)
end
$e.run("sudo gmond")
end
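To make the sed expressions in configure_gmond easier to follow, here is a rough Ruby rendering of what the non-master branch does to the text of gmond.conf. It is a sketch using String#gsub rather than the installer's own code; rewrite_worker_gmond_conf, the sample config and the host name master.example.internal are all invented for this illustration.

```ruby
# Hypothetical helper: mimic the worker-node sed edits from configure_gmond.
def rewrite_worker_gmond_conf(conf, master_dns)
  conf.gsub(/( *mcast_join *=.*)/, '#\1')   # comment out multicast joins
      .gsub(/( *bind *=.*)/, '#\1')         # comment out bind addresses
      .gsub(/( *location *=.*)/) { ' location = "master-node"' }
      .gsub(/(udp_send_channel \{)/) { "#{$1}\n host=#{master_dns}" }
end

# Shortened sample resembling a default gmond.conf (an assumption).
SAMPLE_WORKER_CONF = <<~CONF
  host {
    location = "unspecified"
  }
  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
    bind = 239.2.11.71
  }
CONF

puts rewrite_worker_gmond_conf(SAMPLE_WORKER_CONF, "master.example.internal")
```

The net effect is that multicast is switched off and every worker sends its metrics by unicast to the master node.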
It also configures gmetad:
def configure_gmetad
ganglia_log_dir = "/mnt/var/log/ganglia/rrds/"
ganglia_templates_dir = "/mnt/var/log/ganglia/dwoo/"
run("sudo cp #{GANGLIA_HOME}/gmetad/gmetad.conf /etc/")
run("sudo mkdir -p #{ganglia_log_dir}")
run("sudo chown -R nobody #{ganglia_log_dir}")
run("sudo sed -i -e 's$# rrd_rootdir .*$rrd_rootdir #{ganglia_log_dir}$g' /etc/gmetad.conf")
run("sudo mkdir -p #{ganglia_templates_dir}")
run("sudo chown -R nobody #{ganglia_templates_dir}")
run("sudo chmod -R 777 #{ganglia_templates_dir}")
#Setup pushing rrds to S3
parsed = JSON.parse(File.read("/etc/instance-controller/logs.json"))
newEntry = Hash["fileGlob", "/mnt/var/log/ganglia/rrds/AMZN-EMR/(.*)/(.*)", "s3Path", "node/$instance-id/ganglia/$0/$1", "delayPush", true]
parsed["logFileTypes"][1]["logFilePatterns"].push(newEntry)
run("sudo mv /etc/instance-controller/logs.json /etc/instance-controller/logs.json.bak")
File.open("/tmp/logs.json" , "w") do |fil|
fil.puts(JSON.generate(parsed))
end
$e.run("sudo mv /tmp/logs.json /etc/instance-controller/")
end
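Our example is to make Ganglia collect system metrics once per second, so let's see how Ganglia has to be configured for that. This turned out not to be simple and took me quite a while, because I had never used Ganglia before and learned it on the spot for this requirement. This post is about configuring Ganglia on EMR rather than about Ganglia configuration itself, so I'll keep this part brief. For the basics, see http://sourceforge.net/apps/trac/ganglia/wiki/Ganglia%203.1.x%20Installation%20and%20Configuration#gmond_configuration and http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_readme. Ganglia consists of two daemons, gmond and gmetad. Put simply, gmond runs on every machine in the cluster; it periodically collects that machine's performance metrics and sends them out to the other machines in the cluster. gmetad polls the gmonds at a regular interval, aggregates what they report, and stores the result in RRD format, which is the monitoring data we ultimately get from Ganglia.
gmond's configuration file, gmond.conf, contains many collection_group blocks, for example: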
collection_group {
collect_every = 20
time_threshold = 90
/* CPU status */
metric {
name = "cpu_user"
value_threshold = "1.0"
title = "CPU User"
}
metric {
name = "cpu_system"
value_threshold = "1.0"
title = "CPU System"
}
metric {
name = "cpu_idle"
value_threshold = "5.0"
title = "CPU Idle"
}
metric {
name = "cpu_nice"
value_threshold = "1.0"
title = "CPU Nice"
}
metric {
name = "cpu_aidle"
value_threshold = "5.0"
title = "CPU aidle"
}
metric {
name = "cpu_wio"
value_threshold = "1.0"
title = "CPU wio"
}
/* The next two metrics are optional if you want more detail...
... since they are accounted for in cpu_system.
metric {
name = "cpu_intr"
value_threshold = "1.0"
title = "CPU intr"
}
metric {
name = "cpu_sintr"
value_threshold = "1.0"
title = "CPU sintr"
}
*/
}
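Before changing anything, it can help to see which intervals a gmond.conf currently uses. The following is a minimal sketch, assuming the file follows the layout shown above; interval_settings is a hypothetical helper and the sample text is a shortened copy of the block above.

```ruby
# Hypothetical helper: collect every collect_every / time_threshold value
# that appears in the text of a gmond.conf.
def interval_settings(conf_text)
  settings = Hash.new { |hash, key| hash[key] = [] }
  conf_text.each_line do |line|
    match = line.match(/^\s*(collect_every|time_threshold)\s*=\s*(\d+)/)
    settings[match[1]] << match[2].to_i if match
  end
  settings
end

SAMPLE_GMOND_CONF = <<~CONF
  collection_group {
    collect_every = 20
    time_threshold = 90
    metric {
      name = "cpu_user"
      value_threshold = "1.0"
      title = "CPU User"
    }
  }
CONF

p interval_settings(SAMPLE_GMOND_CONF)
```

On a real node the text would come from File.read("/etc/gmond.conf"), the path the installer writes to.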
In gmetad's configuration file, gmetad.conf, the part we care about is this:
# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another gmetad),
# we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port addreses2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is asssumed.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8
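data_source "my cluster" localhost
The default line above omits the polling interval, so gmetad polls every 15 seconds. To poll every second, it has to become:
data_source "my cluster" 1 localhost
The Round-Robin Archive settings that follow in the same file matter as well: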
#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
# "RRA:AVERAGE:0.5:5760:374"
#
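To see why the RRA defaults matter once polling drops to one second, it helps to work out how much time each archive covers. The sketch below assumes that gmetad creates its RRD files with a step equal to the polling interval; that is my reading of how the defaults line up with roughly an hour, a day, a week, a month and a year at 15 seconds, not something the config file states. rra_coverage_seconds is a hypothetical helper.

```ruby
# Hypothetical helper: seconds of history covered by one RRA definition of the
# form "RRA:CF:xff:steps:rows", assuming each primary data point spans
# step_seconds (taken here to be the gmetad polling interval).
def rra_coverage_seconds(rra, step_seconds)
  _, _cf, _xff, steps, rows = rra.split(":")
  step_seconds * steps.to_i * rows.to_i
end

DEFAULT_RRAS = [
  "RRA:AVERAGE:0.5:1:244",
  "RRA:AVERAGE:0.5:24:244",
  "RRA:AVERAGE:0.5:168:244",
  "RRA:AVERAGE:0.5:672:244",
  "RRA:AVERAGE:0.5:5760:374"
].freeze

[15, 1].each do |step|
  puts "step = #{step}s"
  DEFAULT_RRAS.each do |rra|
    hours = rra_coverage_seconds(rra, step) / 3600.0
    puts format("  %-26s covers %.1f hours", rra, hours)
  end
end
```

Under that assumption, the first archive holds about one hour of full-resolution data at a 15-second step, but the same 244 rows cover only about four minutes at a 1-second step, so the row counts need to grow along with the shorter interval.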
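To sum up, we need to make the following changes to the ganglia-installer Ruby script:
1. In configure_gmond, after the run statements at the top (and before the final $e.run("sudo gmond") that starts gmond), add this line: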
run("sudo perl -pi -e 's/collect_every *=.*/collect_every = 1/g;s/time_threshold *=.*/time_threshold = 1/g;s/value_threshold *=.*/value_threshold = 0/g' /etc/gmond.conf")
2. In configure_gmetad, after the last run statement, add:
run("sudo sed -i -e 's/data_source \"my cluster\" localhost/data_source \"my cluster\" 1 localhost/g' /etc/gmetad.conf")
3. Also in configure_gmetad, after the last run statement, add:
run("sudo sed -i -e's/# RRAs \"RRA:.*/RRAs \"RRA:AVERAGE:0.5:1:3600\" \"RRA:AVERAGE:0.5:24:3600\" \"RRA:AVERAGE:0.5:168:3600\" \"RRA:AVERAGE:0.5:672:3600\"/g' /etc/gmetad.conf")
This simply overwrites the commented-out RRAs line, which is crude; a more elaborate sed expression could do it more neatly.
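Before baking these edits into the installer, the substitutions can be tried out on sample text. The sketch below redoes the three edits with Ruby's String#gsub instead of perl and sed, so it approximates the commands above rather than running them; tune_gmond, tune_gmetad and the sample snippets are invented for this illustration.

```ruby
# Hypothetical helper mirroring the perl one-liner from step 1.
def tune_gmond(conf)
  conf.gsub(/collect_every *=.*/, "collect_every = 1")
      .gsub(/time_threshold *=.*/, "time_threshold = 1")
      .gsub(/value_threshold *=.*/, "value_threshold = 0")
end

# Hypothetical helper mirroring the two sed commands from steps 2 and 3.
def tune_gmetad(conf)
  rras = %w[1 24 168 672].map { |steps| %("RRA:AVERAGE:0.5:#{steps}:3600") }.join(" ")
  conf.gsub('data_source "my cluster" localhost', 'data_source "my cluster" 1 localhost')
      .gsub(/# RRAs "RRA:.*/) { "RRAs #{rras}" }
end

SAMPLE_GMOND = <<~CONF
  collection_group {
    collect_every = 20
    time_threshold = 90
    metric {
      name = "cpu_user"
      value_threshold = "1.0"
    }
  }
CONF

# Single-quoted heredoc so the trailing backslash stays in the text.
SAMPLE_GMETAD = <<~'CONF'
  data_source "my cluster" localhost
  # RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" \
  #      "RRA:AVERAGE:0.5:5760:374"
CONF

puts tune_gmond(SAMPLE_GMOND)
puts tune_gmetad(SAMPLE_GMETAD)
```

Note that the continuation line of the original RRAs comment is left in place as a comment, which matches what the sed command does.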
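Next, upload the modified ganglia-installer to an S3 bucket of your own (mybucket in this example). Then, in install-ganglia, find this line:
executor.run("hadoop fs -copyToLocal s3://#{BUCKET_NAME}/bootstrap-actions/ganglia/#{version_num}/ganglia-installer .")
and change it to: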
executor.run("hadoop fs -copyToLocal s3://mybucket/ganglia-installer .")
Then upload this install-ganglia to mybucket on S3 as well. When starting an EMR job, specify s3://mybucket/install-ganglia as the bootstrap action; the job will then run our modified Ruby scripts and install and configure Ganglia the way we want.