Changing the Ganglia Configuration on AWS EMR: Monitoring Once per Second

赵华彩

2023-12-01

Preface

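In AWS EMR (Elastic MapReduce), you can install Ganglia on a Hadoop cluster by adding a Bootstrap Action, and then use it to monitor the cluster. After EMR launches all the instances, it downloads, installs, configures and starts Ganglia for you. You do not have to set anything up yourself, which is very convenient. The price of that convenience is flexibility. In most cases the Ganglia environment that EMR configures automatically is good enough and covers the vast majority of needs. But as soon as you hit an unusual requirement and have to configure Ganglia yourself, things get awkward. I recently ran into exactly such an "odd" requirement: my MapReduce jobs are short, only one or two minutes, so I need Ganglia to sample system performance at a much finer interval. By default Ganglia's monitoring interval is 15 seconds, so for a one-minute job it collects only 4 data points, which is useless. I therefore need to change the interval to 1 second. The Bootstrap script that EMR uses to start Ganglia takes no parameters, so this cannot be done through EMR directly. After a lot of digging I finally found a way. Using "change the monitoring interval to 1 second" as the example, this article describes how to change the Ganglia configuration on EMR.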

How EMR configures and starts Ganglia

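First, let's look at how EMR installs, configures and starts Ganglia on a Hadoop cluster. Enabling Ganglia in EMR is simple: add an action to the Bootstrap Actions whose script location is s3://elasticmapreduce/bootstrap-actions/install-ganglia. See the EMR documentation for more details. install-ganglia is a Ruby program. After an instance starts, EMR downloads install-ganglia from S3 and runs it inside the instance. So let's first see what install-ganglia actually contains. The file at https://s3.amazonaws.com/elasticmapreduce/bootstrap-actions/install-ganglia can be downloaded directly. Its content is very simple: based on the Hadoop version you chose, it picks the matching version of a second script, and then runs this line: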

executor.run("hadoop fs -copyToLocal s3://#{BUCKET_NAME}/bootstrap-actions/ganglia/#{version_num}/ganglia-installer .")

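In other words, it downloads another Ruby script that does the real work of installing Ganglia. Let's download that script too. Suppose the Hadoop version we chose is 0.20.205. Reading install-ganglia tells us that the corresponding version is 2.0, so the script is at https://s3.amazonaws.com/elasticmapreduce/bootstrap-actions/ganglia/2.0/ganglia-installer and, having read it, we can confirm that ganglia-installer is the script we are after.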
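The path layout above can be captured in a few lines of Ruby. This is only a sketch: the helper name ganglia_installer_url is made up for illustration, and the bucket name is taken from the URLs quoted in this article rather than from the script's own constants.

```ruby
# Hypothetical helper (not part of install-ganglia): builds the HTTPS URL
# of ganglia-installer for a given installer version, following the path
# layout quoted in the article.
BUCKET_NAME = "elasticmapreduce" # assumed from the URLs shown above

def ganglia_installer_url(version_num, bucket = BUCKET_NAME)
  "https://s3.amazonaws.com/#{bucket}/bootstrap-actions/ganglia/#{version_num}/ganglia-installer"
end

# Hadoop 0.20.205 maps to installer version 2.0 according to the article.
puts ganglia_installer_url("2.0")
```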

Let's see what this script does. First it downloads and unpacks Ganglia:

  def download_and_unzip_ganglia
    run("mkdir -p ~/source")
    run("cd ~/source && wget http://#{BUCKET_NAME}.s3.amazonaws.com/bootstrap-actions/ganglia/2.0/#{GANGLIA}.tar.gz")
    run("cd ~/source && tar xvfz #{GANGLIA}.tar.gz")
  end

Then comes a series of configuration steps. Two of them matter to us. Configuring gmond:

  def configure_gmond
    run("sudo ldconfig")
    run("sudo gmond --default_config > ~/gmond.conf")
    run("sudo mv ~/gmond.conf /etc/gmond.conf")
    run("sudo perl -pi -e 's/name = \"unspecified\"/name = \"AMZN-EMR\"/g' /etc/gmond.conf")
    run("sudo perl -pi -e 's/owner = \"unspecified\"/name = \"AMZN-EMR\"/g' /etc/gmond.conf")
    run("sudo perl -pi -e 's/send_metadata_interval = 0/send_metadata_interval = 10/g' /etc/gmond.conf")

    if $instance_info['isMaster'].to_s == 'false' then
      command = <<-COMMAND
      sudo sed -i -e "s|\\( *mcast_join *=.*\\)|#\\1|" \
             -e "s|\\( *bind *=.*\\)|#\\1|" \
             -e "s|\\( *location *=.*\\)|  location = \"master-node\"|" \
             -e "s|\\(udp_send_channel {\\)|\\1\\n  host=#{$master_dns}|" \
             /etc/gmond.conf
      COMMAND
      $e.run(command)
    else
      command = <<-COMMAND
      sudo sed -i -e "s|\\( *mcast_join *=.*\\)|#\\1|"  \
             -e "s|\\( *bind *=.*\\)|#\\1|" \
             -e "s|\\(udp_send_channel {\\)|\\1\\n  host=#{$ip}|" \
             /etc/gmond.conf
      COMMAND
      $e.run(command)
    end
    $e.run("sudo gmond")
  end

and configuring gmetad:

  def configure_gmetad
    ganglia_log_dir = "/mnt/var/log/ganglia/rrds/"
    ganglia_templates_dir = "/mnt/var/log/ganglia/dwoo/"
    run("sudo cp #{GANGLIA_HOME}/gmetad/gmetad.conf /etc/")
    run("sudo mkdir -p #{ganglia_log_dir}")
    run("sudo chown -R nobody #{ganglia_log_dir}")
    run("sudo sed -i -e 's$# rrd_rootdir .*$rrd_rootdir #{ganglia_log_dir}$g' /etc/gmetad.conf")
    run("sudo mkdir -p #{ganglia_templates_dir}")
    run("sudo chown -R nobody #{ganglia_templates_dir}")
    run("sudo chmod -R 777 #{ganglia_templates_dir}")
  
    #Setup pushing rrds to S3
    parsed = JSON.parse(File.read("/etc/instance-controller/logs.json"))
    newEntry = Hash["fileGlob", "/mnt/var/log/ganglia/rrds/AMZN-EMR/(.*)/(.*)", "s3Path", "node/$instance-id/ganglia/$0/$1", "delayPush", true]
    parsed["logFileTypes"][1]["logFilePatterns"].push(newEntry)
    run("sudo mv /etc/instance-controller/logs.json /etc/instance-controller/logs.json.bak")
    File.open("/tmp/logs.json" , "w") do |fil|
    fil.puts(JSON.generate(parsed))
    end
    $e.run("sudo mv /tmp/logs.json /etc/instance-controller/")
     
  end

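In essence, the script configures Ganglia by rewriting the configuration files gmond.conf and gmetad.conf in place, using the Linux sed command (and perl -pi in a few places). That gives us our approach: modify ganglia-installer ourselves, insert a few substitution commands that configure Ganglia the way we need, upload the file to S3, and have EMR call our modified script to install Ganglia at startup.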

Configuring Ganglia's monitoring interval

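Our example is to make Ganglia collect system metrics once per second, so let's see how Ganglia has to be configured to achieve that. This turned out not to be a simple question, and it cost me a fair amount of time: I had never used Ganglia before and learned it on the spot for this "monitor once per second" requirement. The focus of this article is how to configure Ganglia on EMR, not how to configure Ganglia itself, so I will keep this part brief. Start with the basics: http://sourceforge.net/apps/trac/ganglia/wiki/Ganglia%203.1.x%20Installation%20and%20Configuration#gmond_configuration and http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_readme. Ganglia consists of two daemons, gmond and gmetad. Put simply, gmond runs on every machine in the cluster; it periodically collects the performance metrics of its own machine and sends them to the other machines in the cluster. gmetad receives and aggregates what the gmond daemons send, and at regular intervals stores the data it has gathered in RRD format, which is the monitoring data we ultimately get out of Ganglia.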

gmond's configuration file, gmond.conf, contains many collection_group sections, for example:

collection_group {
  collect_every = 20
  time_threshold = 90
  /* CPU status */
  metric {
    name = "cpu_user"
    value_threshold = "1.0"
    title = "CPU User"
  }
  metric {
    name = "cpu_system"
    value_threshold = "1.0"
    title = "CPU System"
  }
  metric {
    name = "cpu_idle"
    value_threshold = "5.0"
    title = "CPU Idle"
  }
  metric {
    name = "cpu_nice"
    value_threshold = "1.0"
    title = "CPU Nice"
  }
  metric {
    name = "cpu_aidle"
    value_threshold = "5.0"
    title = "CPU aidle"
  }
  metric {
    name = "cpu_wio"
    value_threshold = "1.0"
    title = "CPU wio"
  }
  /* The next two metrics are optional if you want more detail...
     ... since they are accounted for in cpu_system.
  metric {
    name = "cpu_intr"
    value_threshold = "1.0"
    title = "CPU intr"
  }
  metric {
    name = "cpu_sintr"
    value_threshold = "1.0"
    title = "CPU sintr"
  }
  */
}

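Roughly, this says that the metrics in this group (all CPU-related) are collected every 20 seconds (collect_every). If any one of them changes by more than its value_threshold, the whole group is broadcast. In addition, every 90 seconds (time_threshold) the group is broadcast whether or not any value_threshold has been exceeded. So we need to set both collect_every and time_threshold to 1.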
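The broadcast rule just described can be written down as a tiny Ruby predicate. This is a simplified model for reasoning about the settings, not gmond's actual code: the function name broadcast? and its parameters are invented here, and it assumes that "exceeding value_threshold" refers to the change in a metric since the last broadcast.

```ruby
# Simplified model of the rule described above (NOT gmond's real logic).
# change:  how much the metric moved since the last broadcast (assumption)
# elapsed: seconds since the group was last broadcast
def broadcast?(change:, value_threshold:, elapsed:, time_threshold:)
  change.abs > value_threshold || elapsed >= time_threshold
end

# Default cpu_user settings: a small change shortly after the last send
# does not trigger a broadcast.
puts broadcast?(change: 0.4, value_threshold: 1.0, elapsed: 20, time_threshold: 90)

# With time_threshold = 1, the same small change is sent after one second.
puts broadcast?(change: 0.4, value_threshold: 1.0, elapsed: 1, time_threshold: 1)
```

Under this model, lowering collect_every alone would not be enough: a metric that barely moves would still wait for time_threshold, which is why both values are set to 1.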

In gmetad's configuration file, gmetad.conf, the part we care about is:

# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another gmetad),
# we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port addreses2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is asssumed.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8

data_source "my cluster" localhost

Because the default data_source line omits [polling interval], gmetad records data every 15 seconds. We need to change the data_source line to:

data_source "my cluster" 1 localhost

There is one more section of gmetad.conf to pay attention to:

#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
#      "RRA:AVERAGE:0.5:5760:374"
#

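This configures the RRAs (round-robin archives). Roughly: to save space, Ganglia keeps older data only at coarser resolution. Each entry has the form RRA:AVERAGE:0.5:steps:rows, meaning that every "steps" consecutive samples are averaged into one row and at most "rows" rows are kept; once an archive is full, its oldest row is overwritten. The archives are filled in parallel. With the defaults above, the first archive keeps every sample, 244 in total; the next three keep averages over 24, 168 and 672 samples (244 rows each), and the last keeps averages over 5760 samples (374 rows). Data older than the longest archive is gone. So it is best to change the first RRA and raise its limit of 244 rows. Otherwise our per-second data survives for only 244 seconds, and anything older is available only as 24-second averages.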
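To make the retention arithmetic concrete, here is a small Ruby sketch that computes how much history one RRA covers. The helper name rra_retention_seconds is made up for illustration, and the calculation assumes that the RRD step equals gmetad's polling interval (1 second after our change, 15 seconds by default).

```ruby
# Hypothetical helper: parses "RRA:CF:xff:steps:rows" and returns the
# number of seconds of history the archive covers for a given RRD step.
# Assumption: the RRD step equals gmetad's polling interval.
def rra_retention_seconds(rra, step_seconds)
  _tag, _cf, _xff, steps, rows = rra.split(":")
  step_seconds * Integer(steps) * Integer(rows)
end

puts rra_retention_seconds("RRA:AVERAGE:0.5:1:244", 1)    # 244 (about 4 minutes)
puts rra_retention_seconds("RRA:AVERAGE:0.5:1:3600", 1)   # 3600 (one hour)
puts rra_retention_seconds("RRA:AVERAGE:0.5:24:244", 15)  # 87840 (about a day)
```

With 3600 rows in the first archive, one hour of per-second data is retained, which is plenty for a job that runs for a minute or two.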

To sum up, we need to make the following changes to the ganglia-installer Ruby script:

1. In configure_gmond, after the last of the run statements (and therefore before gmond is started), add this line:

run("sudo perl -pi -e 's/collect_every *=.*/collect_every = 1/g;s/time_threshold *=.*/time_threshold = 1/g;s/value_threshold *=.*/value_threshold = 0/g' /etc/gmond.conf")

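This uses perl's in-place substitution because the other run statements in this function do the same, so it looks tidier; sed would work just as well. It sets every monitored metric to a 1-second interval, which is crude but simple. You could also write more elaborate substitutions that target specific metrics; I won't go into that here.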
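Before baking a substitution like this into the installer, it can help to try it on a sample of the configuration. The sketch below is a Ruby equivalent of the three perl substitutions, applied to a string instead of /etc/gmond.conf. The function name speed_up_gmond and the shortened sample are made up for illustration.

```ruby
# Ruby equivalent of the three perl substitutions, applied to a string so
# the result can be inspected without touching /etc/gmond.conf.
# (Hypothetical helper, not part of ganglia-installer.)
def speed_up_gmond(conf)
  conf.gsub(/collect_every *=.*/, "collect_every = 1")
      .gsub(/time_threshold *=.*/, "time_threshold = 1")
      .gsub(/value_threshold *=.*/, "value_threshold = 0")
end

# Shortened sample of a collection_group from gmond.conf.
SAMPLE_GMOND = <<~CONF
  collection_group {
    collect_every = 20
    time_threshold = 90
    metric {
      name = "cpu_user"
      value_threshold = "1.0"
      title = "CPU User"
    }
  }
CONF

puts speed_up_gmond(SAMPLE_GMOND)
```

Because "." does not match a newline, each substitution stays within its own line, just as it does with perl -p.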

2. In configure_gmetad, after the last run statement, add:

run("sudo sed -i -e 's/data_source \"my cluster\" localhost/data_source \"my cluster\" 1 localhost/g' /etc/gmetad.conf")

3. Also in configure_gmetad, after the last run statement, add:

run("sudo sed -i -e 's/# RRAs \"RRA:.*/RRAs \"RRA:AVERAGE:0.5:1:3600\" \"RRA:AVERAGE:0.5:24:3600\" \"RRA:AVERAGE:0.5:168:3600\" \"RRA:AVERAGE:0.5:672:3600\"/g' /etc/gmetad.conf")

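Here we simply overwrote the commented-out line, which is again rather crude (the continuation line below it stays behind as a harmless comment). A more elaborate sed expression could do this more neatly.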
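The two gmetad.conf edits can be dry-run in the same way. The following Ruby sketch mirrors the two sed expressions on a string. The function name tune_gmetad and the trimmed sample are illustrative only.

```ruby
# Ruby equivalent of the two sed substitutions for gmetad.conf, applied
# to a string for inspection. (Hypothetical helper.)
NEW_RRAS = 'RRAs "RRA:AVERAGE:0.5:1:3600" "RRA:AVERAGE:0.5:24:3600" ' \
           '"RRA:AVERAGE:0.5:168:3600" "RRA:AVERAGE:0.5:672:3600"'

def tune_gmetad(conf)
  conf.gsub('data_source "my cluster" localhost',
            'data_source "my cluster" 1 localhost')
      .gsub(/# RRAs "RRA:.*/, NEW_RRAS)
end

# Trimmed sample; the quoted heredoc keeps the trailing backslash literal.
SAMPLE_GMETAD = <<~'CONF'
  data_source "my cluster" localhost
  # RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
  #      "RRA:AVERAGE:0.5:5760:374"
CONF

puts tune_gmetad(SAMPLE_GMETAD)
```

The ".*" in the second pattern also swallows the trailing backslash, so the new RRAs line does not continue onto the leftover comment line.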

Starting Ganglia with our own script

With the changes above, ganglia-installer is done. Now upload it to S3. Suppose you have a bucket named mybucket; upload ganglia-installer to the root of mybucket.

Next we also need to modify the simple install-ganglia script so that it calls our modified ganglia-installer. Find this line in install-ganglia:

executor.run("hadoop fs -copyToLocal s3://#{BUCKET_NAME}/bootstrap-actions/ganglia/#{version_num}/ganglia-installer .")

and change it to:

executor.run("hadoop fs -copyToLocal s3://mybucket/ganglia-installer .")

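Then upload this install-ganglia to mybucket on S3 as well. When starting the EMR job, specify s3://mybucket/install-ganglia as the location of a bootstrap action. The job will then run our modified Ruby scripts and install and configure Ganglia the way we want.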
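If you prefer not to edit install-ganglia by hand, the one-line change can be scripted. This is a minimal sketch: the function name point_to_custom_installer is made up, and it only rewrites the S3 path inside the line quoted above.

```ruby
# The line from install-ganglia, kept in single quotes so that the
# #{...} placeholders stay literal text.
ORIGINAL_LINE = 'executor.run("hadoop fs -copyToLocal s3://#{BUCKET_NAME}/bootstrap-actions/ganglia/#{version_num}/ganglia-installer .")'

# Hypothetical helper: points the copyToLocal command at our own bucket.
def point_to_custom_installer(script, bucket)
  script.gsub('s3://#{BUCKET_NAME}/bootstrap-actions/ganglia/#{version_num}/ganglia-installer',
              "s3://#{bucket}/ganglia-installer")
end

puts point_to_custom_installer(ORIGINAL_LINE, "mybucket")
```

In practice the same function could be applied to the full text of the downloaded install-ganglia file and the result written back before uploading it.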

Other workable approaches

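The method above edits Ganglia's configuration files with sed, but other approaches work too. For example, you could prepare the gmond and gmetad configuration files in advance, upload them to S3, and have ganglia-installer download the prepared files. Or you could configure Ganglia completely, package it, put the package on S3, and have ganglia-installer download your own preconfigured Ganglia instead of the one Amazon provides, and so on.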
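As a rough sketch of the first alternative, the function below builds the shell commands that ganglia-installer could pass to run in order to fetch prepared files. Everything here is hypothetical: the function name, the bucket layout and the /tmp staging path are assumptions, and the commands are only printed, not executed.

```ruby
# Hypothetical sketch: builds (but does not run) the shell commands that
# would fetch pre-edited config files from S3 and move them into /etc.
# Bucket layout and file locations are assumptions for illustration.
def fetch_config_commands(bucket, files = ["gmond.conf", "gmetad.conf"])
  files.flat_map do |name|
    ["hadoop fs -copyToLocal s3://#{bucket}/#{name} /tmp/#{name}",
     "sudo mv /tmp/#{name} /etc/#{name}"]
  end
end

fetch_config_commands("mybucket").each { |cmd| puts cmd }
```

Note that configure_gmond writes a different host into gmond.conf on master and non-master nodes, so a prepared gmond.conf would still need those per-node sed edits applied after it is downloaded.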
