mtail Programming Guide

欧阳鸿哲

2023-12-01

Introduction

This guide describes some common patterns for writing mtail programs.

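Renaming variables

mtail only allows C-style identifier names in program text. If a different name is needed, an exported variable can be renamed as it is presented to the collection system: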

counter connection_time_total as "connection-time_total"

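Reusing regular expressions

If the same pattern is used repeatedly, define it as a constant so that the spelling does not have to be checked at every occurrence.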

# Define some pattern constants for reuse in the patterns below.
const IP /\d+(\.\d+){3}/
const MATCH_IP /(?P<ip>/ + IP + /)/

...

    # Duplicate lease
    /uid lease / + MATCH_IP + / for client .* is duplicate on / {
        duplicate_lease++
    }

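Parsing log line timestamps

mtail attributes a timestamp to each event.

If the log line contains no timestamp and the mtail program does not explicitly parse one, mtail uses the current system time as the event time.

Many log files include the timestamp of the event as reported by the logging program. To parse it, use the strptime function with a Go time.Parse layout string.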

/^(?P<date>\w+\s+\d+\s+\d+:\d+:\d+)\s+[\w\.-]+\s+sftp-server/ {
    strptime($date, "Jan _2 15:04:05")
}

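Do not try to break the timestamp into its component parts (such as year, month, and day). Keep it in the format in which it appears in the log file and change the strptime format string to match.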

/^/ +
/(?P<date>\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}) / +
/.*/ +
/$/ {
    strptime($date, "2006/01/02 15:04:05")
}

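If no timestamp parsing is done, the reported event timestamp may add some latency to the measurement of when the event really occurred. Between the program logging the event and mtail reading it there are many moving parts: the log writer, some system calls (perhaps), some disk I/O, some more system calls, some more disk I/O, and then mtail's virtual machine execution. The delay is usually negligible, but it is worth knowing about if users notice an offset between what mtail reports and when the event actually happened. For this reason, always use the log file's timestamp when one is available.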

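Reusing common timestamp parsing

The decorator syntax was designed with common timestamp parsing in mind. It allows the code that extracts the timestamp from a log line to be reused, which makes the rest of the program text more readable and therefore easier to maintain.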

# The `syslog' decorator defines a procedure.  When a block of mtail code is
# "decorated", it is called before entering the block.  The block is entered
# when the keyword `next' is reached.
def syslog {
    /(?P<date>(?P<legacy_date>\w+\s+\d+\s+\d+:\d+:\d+)|(?P<rfc3339_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{2}:\d{2}))/ +
        /\s+(?:\w+@)?(?P<hostname>[\w\.-]+)\s+(?P<application>[\w\.-]+)(?:\[(?P<pid>\d+)\])?:\s+(?P<message>.*)/ {
        # If the legacy_date regexp matched, try this format.
        len($legacy_date) > 0 {
            strptime($legacy_date, "Jan _2 15:04:05")
        }
        # If the RFC3339 style matched, parse it this way.
        len($rfc3339_date) > 0 {
            strptime($rfc3339_date, "2006-01-02T15:04:05-07:00")
        }
        # Call into the decorated block
        next
    }
}

This decorator can then be applied to any block later in the program.

@syslog {
/foo/ {
  ...
}

/bar/ {
}
} # end @syslog decorator

Both the foo and bar pattern actions will have the syslog timestamp parsed before they are called.

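Timestamps with strange characters

Go's time.Parse does not handle underscores in the format string well, which becomes a problem when the timestamp being parsed contains underscores. Go treats an underscore as a placeholder for an optional digit.

To work around this, rewrite the timestamp with subst() before parsing it: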

/(\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}) / {
  strptime(subst("_", " ", $1), "2006-01-02 15:04:05")
}

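Conditional structures

The /pattern/ { action } idiom is the usual conditional control flow structure in mtail programs.
If the pattern matches, the actions in the block are executed. If it does not match, the block is skipped.
The else keyword lets the program perform an action when the pattern does not match.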

/pattern/ {
  action
} else {
  alternative
}

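In the example above, the "alternative" block is executed if the pattern does not match the current line.
The otherwise keyword can be used to build a control flow structure similar to a C switch statement. Inside a containing block, otherwise marks a block that should be executed only if no other pattern in the same scope has matched.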

{
/pattern1/ { action1 }
/pattern2/ { action2 }
otherwise { action3 }
}

In this example, "action3" is executed if neither pattern1 nor pattern2 matches the current line.
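
The behaviour of otherwise can be pictured with a small simulation outside mtail. The following Python sketch is illustrative only: run_block and the rule list are hypothetical names, and it assumes that every matching pattern in a scope runs its action and that otherwise fires only when none of them matched.

```python
import re


def run_block(line, rules, otherwise):
    """Run the action of every matching rule; run `otherwise` only if none matched."""
    matched = False
    for pattern, action in rules:
        if re.search(pattern, line):
            action(line)
            matched = True
    if not matched:
        otherwise(line)


hits = []
rules = [
    (r"pattern1", lambda line: hits.append("action1")),
    (r"pattern2", lambda line: hits.append("action2")),
]
run_block("a line with pattern2", rules, lambda line: hits.append("action3"))
run_block("an unrelated line", rules, lambda line: hits.append("action3"))
print(hits)
```

The first line triggers only action2, and the second line, which matches neither pattern, falls through to action3.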

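Explicit matching

The /pattern/ { action } form above implicitly matches against the current input log line.
To match against another string variable, use the =~ operator, or !~ to negate the match, like so: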

  $1 =~ /GET/ {
    ...
  }

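Storing intermediate state

Hidden metrics are metrics that can be used for internal state and are never exported outside mtail. For example, to compute the time between a pair of log lines, a hidden metric can record the timestamp at which the pair began.

Note that the timestamp builtin requires the program to have set the log line timestamp with strptime or settime before it is called.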

hidden gauge connection_time by pid
...

  # Connection starts
  /connect from \S+ \(\d+\.\d+\.\d+\.\d+\)/ {
    connections_total++

    # Record the start time of the connection, using the log timestamp.
    connection_time[$pid] = timestamp()
  }

...

  # Connection summary when session closed
  /sent (?P<sent>\d+) bytes  received (?P<received>\d+) bytes  total size \d+/ {
    # Sum total bytes across all sessions for this process
    bytes_total["sent"] += $sent
    bytes_total["received"] += $received
    
    # Count total time spent with connections open, according to the log timestamp.
    connection_time_total += timestamp() - connection_time[$pid]

    # Delete the datum referenced in this dimensional metric.  We assume that
    # this will never happen again, and hint to the VM that we can garbage
    # collect the memory used.
    del connection_time[$pid]
  }

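In this example, the connection timestamp is recorded in the hidden variable connection_time, keyed by the "pid" of the connection. Later, when the end of the connection is logged, the difference between the current log timestamp and the start timestamp is computed and added to the total connection time.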

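In this example, the average connection time can be computed in the collection system by dividing the time spent (connection_time_total) by the number of connections (connections_total). For example, in Prometheus one might write:

connection_time_10s_moving_avg =
  rate(connection_time_total[10s])
    / on (job)
  rate(connections_total[10s])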

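Note also that the del keyword signals to mtail that the connection_time value is no longer needed. mtail then removes the datum referenced by that label from the metric, which keeps mtail's memory usage under control and speeds up label set lookups (by reducing the search space).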

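Alternatively, the statement del connection_time[$pid] after 72h would do the same, but only if connection_time[$pid] has not changed for 72 hours. This form is more convenient when the connection close event is lossy or difficult to determine.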
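
The bookkeeping described above (record a start time per key, add the elapsed time on close, and drop the entry either explicitly or after a period without updates) can be sketched in Python. This is a simplified model, not mtail's implementation: the class and method names are hypothetical, and expiry is checked by an explicit call rather than by a background process.

```python
class ConnectionTimes:
    """Toy model of a hidden per-pid gauge plus a running total."""

    def __init__(self, expiry_seconds=72 * 3600):
        self.started = {}  # pid -> start timestamp (plays the role of the hidden gauge)
        self.expiry_seconds = expiry_seconds
        self.connection_time_total = 0.0

    def connect(self, pid, timestamp):
        self.started[pid] = timestamp

    def close(self, pid, timestamp):
        # Like `del connection_time[$pid]`: the entry is removed once it has been used.
        began = self.started.pop(pid, None)
        if began is not None:  # assumption: ignore a close with no recorded start
            self.connection_time_total += timestamp - began

    def expire(self, now):
        # In the spirit of `del ... after 72h`: drop entries not updated recently.
        stale = [pid for pid, ts in self.started.items()
                 if now - ts >= self.expiry_seconds]
        for pid in stale:
            del self.started[pid]


tracker = ConnectionTimes()
tracker.connect("100", 10.0)
tracker.connect("200", 20.0)
tracker.close("100", 25.0)        # adds 15 seconds of connection time
tracker.expire(20.0 + 72 * 3600)  # pid 200 never closed, so it is expired
```

After these calls the total is 15 seconds and no per-pid entries remain, which is the memory-bounding effect that del is meant to provide.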

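Computing moving averages

mtail deliberately does not implement complex mathematical functions. It aims to process log lines as fast as possible. Many other products, such as Prometheus and Riemann, already perform complex mathematical functions on time series data, so mtail pushes this responsibility to them. (Do One Thing, and Do It Pretty Good.)

If you still want to compute a moving average in mtail, note first that mtail has no history available, only point-in-time data. You can update the average with a weight to make it an exponential moving average (EMA):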

gauge average

/some (\d+) match/ {
  # A smoothing constant of 2/(N + 1) averages over roughly the last N observations.
  # The constant used here is 0.9, which weights the newest observation heavily.
  average = 0.9 * $1 + 0.1 * average
}

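However, this does not account for matches that arrive irregularly (the time interval between them is not constant). Unfortunately, the formula for that case needs the exp() function (e^N), as described here: http://stackoverflow.com/questions/1023860/exponential-moving-average-sampled-at-varying-times. It is better to defer this computation to the collection system.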
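
For reference, one common formulation of an EMA over irregularly spaced samples derives the weight from the elapsed time, alpha = 1 - exp(-dt / tau). The Python sketch below shows that calculation as it might be done outside mtail; ema_update and the time constant tau are illustrative names, not part of mtail or of any particular collection system.

```python
import math


def ema_update(average, value, dt, tau):
    """Blend `value` into `average`.

    dt is the time in seconds since the previous sample and tau is the
    averaging time constant in seconds. A longer gap gives the new
    sample more weight.
    """
    alpha = 1.0 - math.exp(-dt / tau)
    return average + alpha * (value - average)


# A sample arriving exactly one time constant after the previous one
# moves the average about 63% of the way towards the new value.
after_one_tau = ema_update(0.0, 10.0, dt=60.0, tau=60.0)
# A sample arriving with no elapsed time leaves the average unchanged.
no_elapsed_time = ema_update(5.0, 10.0, dt=0.0, tau=60.0)
```

With a constant dt this reduces to the fixed-weight update shown earlier, with the weight determined by dt / tau.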

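Histograms

Histograms are preferred over averages in many monitoring howtos, blogs, talks, and rants, because they give operators better visibility into the behaviour of a system.
mtail supports histograms as a first-class metric type; they are created with a list of bucket boundaries: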

histogram foo buckets 1, 2, 4, 8

This creates a new histogram foo with buckets for the ranges [0-1), [1-2), [2-4), [4-8), and from 8 to positive infinity.

Note: the 0-n and m-+Inf buckets are created automatically.
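
To make the bucket layout concrete, the following Python sketch maps a value to its range for a given list of boundaries. It follows the half-open ranges exactly as written above; how mtail itself treats a value that falls exactly on a boundary, or a negative value, is not stated here, so that part of the sketch is an assumption. The function name bucket_for is hypothetical.

```python
import bisect


def bucket_for(value, bounds):
    """Return the (low, high) range containing value, using ranges [low, high)."""
    # The 0 and +Inf edges are added automatically around the explicit bounds.
    edges = [0.0] + list(bounds) + [float("inf")]
    index = bisect.bisect_right(edges, value) - 1
    # Assumption: values outside the edges are clamped into the first or last bucket.
    index = max(0, min(index, len(edges) - 2))
    return edges[index], edges[index + 1]


bounds = [1, 2, 4, 8]
below_first = bucket_for(0.5, bounds)   # falls in [0, 1)
on_boundary = bucket_for(1, bounds)     # falls in [1, 2)
above_last = bucket_for(100, bounds)    # falls in [8, +Inf)
```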

Labels can also be placed on a histogram:

histogram apache_http_request_time_seconds buckets 0.005, 0.01, 0.025, 0.05 by server_port, handler, request_method, request_status, request_protocol

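At present, every bucket boundary (other than 0 and positive infinity) must be named explicitly; there is no shorthand for creating a geometric progression.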

Assigning to the histogram records an observation:

  ###
  # HTTP Requests with histogram buckets.
  #
  apache_http_request_time_seconds[$server_port][$handler][$request_method][$request_status][$request_protocol] = $time_us / 1000000

In tools like Prometheus, these can be aggregated to compute percentiles of response latency.

apache_http_request_time:rate10s = rate(apache_http_request_time_seconds_bucket[10s])
apache_http_request_time_count:rate10s = rate(apache_http_request_time_seconds_count[10s])


apache_http_request_time:percentiles =
  apache_http_request_time:rate10s
    / on (job, server_port, handler, request_method, request_status, request_protocol)
  apache_http_request_time_count:rate10s
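
To turn bucket counts into a single percentile value, a collection system typically interpolates within the bucket that contains the target rank. The Python sketch below illustrates that idea for cumulative bucket counts of the kind exported in Prometheus _bucket series. It is an approximation for illustration, not the code of any particular tool, and the function name and sample counts are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative bucket counts.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # No finite width or no observations to interpolate over.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + fraction * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the bucket boundaries used above.
observed = [(0.005, 10), (0.01, 60), (0.025, 90), (0.05, 100), (math.inf, 100)]
p50 = estimate_quantile(0.5, observed)
p95 = estimate_quantile(0.95, observed)
```

With these sample counts the median lands inside the 0.005 to 0.01 bucket and the 95th percentile inside the 0.025 to 0.05 bucket. The accuracy of such estimates depends on how well the chosen boundaries fit the data.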

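Avoiding unnecessary log processing

This means skipping the processing of a particular log file within a program. For example, if the /var/log/zcm/ directory holds two log files and each of the two programs under /etc/mtail only needs to parse its own log file, add a guard like this to the program: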

  getfilename() !~ /nms-metrics/ {
    stop
  }

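This checks whether the input filename looks like /var/log/xxx/nms-metrics.log and, if it does not, attempts no further pattern matching on the log line. In other words, only files whose path contains the keyword nms-metrics are processed.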

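Adding labels to metrics

To record which log file a match came from, use getfilename() as a label value. To add a custom label, define a regular expression with a named capture group: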

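# Declare the counter with the two labels used below.
# (This declaration is assumed; the original listing does not show it.)
counter mtail_node_keyword_total by filename, alarmword

# Pattern constant for the keyword to match.
const PREFIX /^kernel hrtimer interrupt took$/
/^/ +
/(?P<alarmword>([0-9a-zA-Z:*]+\s*)+)/ +
/$/ {
  PREFIX {
    mtail_node_keyword_total[getfilename()][$alarmword]++
  }
}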

If a log line matches the kernel anomaly keyword kernel hrtimer interrupt took, the exported metric carries the alarmword label, for example:

# TYPE mtail_node_keyword_total counter
mtail_node_keyword_total{alarmword="kernel hrtimer interrupt took",filename="/var/log/zcm/nms-monitor.log"} 1
