When packetbeat captures and parses traffic directly off a network interface, it is easy to observe that once the interface traffic gets high enough (i.e. message concurrency is large), the message counts shown on the Kibana frontend come up short: events are being lost. The loss rate is tied to the traffic rate on the interface: the higher the concurrency, the higher the drop rate. This appears to be a consequence of packetbeat's own pipeline mechanism: when the pipeline buffer is full, newly arriving events are dropped by default. In my experiments I have not found a way to avoid drops entirely; the best I could do was keep the message concurrency low, which keeps the drop rate low. This post, however, is mainly about the event-loss problem when packetbeat parses a pcap file instead.
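For the live-capture case, if you only need to reduce (not eliminate) drops, one knob worth trying is enlarging the memory queue so that short traffic bursts have somewhere to go. The sketch below is just an illustration of that idea, not a tested fix; the value 65536 is an arbitrary example, and in my experiments a bigger queue only lowers the drop rate under sustained high traffic, it does not remove it:

queue:
  mem:
    # A larger buffer absorbs short bursts; events are still dropped
    # once the queue stays full under sustained load.
    events: 65536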
When packetbeat captures packets directly from an interface, it runs as a blocking, long-lived process; when packetbeat is pointed at a pcap file, it does not: it exits as soon as the file has been parsed. As a result, if the publishing buffer is configured with a non-zero minimum batch size, packetbeat can finish parsing the whole pcap file while the buffer still holds events that have not been published yet. Packetbeat then stops, the unpublished events are discarded and never sent, and the message counts on the Kibana frontend come up short.
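For reference, this is the pcap-replay mode being described; the file name below is just a placeholder. Run this way, packetbeat parses the capture and then exits on its own (flags as in the packetbeat versions I used):

packetbeat -I some-capture.pcap -e

Here -I makes packetbeat read packet data from the given pcap file instead of a live interface, and -e logs to stderr so you can watch it finish and exit.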
The fix is to adjust packetbeat's configuration so that the minimum number of events required before publishing is 0. With this setting, any event sitting in the buffer is published immediately, so by the time packetbeat stops there are no unpublished events left in the buffer.
Here is the relevant configuration:
# ================================== General ===================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
# If this options is not defined, the hostname is used.
#name:

# The tags of the shipper are included in their own field with each
# transaction published. Tags make it easy to group servers by different
# logical properties.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output. Fields can be scalar values, arrays, dictionaries, or any nested
# combination of these.
#fields:
#  env: staging

# If this option is set to true, the custom fields are stored as top-level
# fields in the output document instead of being grouped under a fields
# sub-dictionary. Default is false.
#fields_under_root: false

# Internal queue configuration for buffering events to be published.
queue:
  # Queue type by name (default 'mem')
  # The memory queue will present all available events (up to the outputs
  # bulk_max_size) to the output, the moment the output is ready to server
  # another batch of events.
  mem:
    # Max number of events the queue can buffer.
    events: 4096

    # Hints the minimum number of events stored in the queue,
    # before providing a batch of events to the outputs.
    # The default value is set to 2048.
    # A value of 0 ensures events are immediately available
    # to be sent to the outputs.
    flush.min_events: 0

    # Maximum duration after which events are available to the outputs,
    # if the number of events stored in the queue is < `flush.min_events`.
    flush.timeout: 1s

  # The disk queue stores incoming events on disk until the output is
  # ready for them. This allows a higher event limit than the memory-only
  # queue and lets pending events persist through a restart.
  #disk:
    # The directory path to store the queue's data.
    #path: "${path.data}/diskqueue"

    # The maximum space the queue should occupy on disk. Depending on
    # input settings, events that exceed this limit are delayed or discarded.
    #max_size: 10GB

    # The maximum size of a single queue data file. Data in the queue is
    # stored in smaller segments that are deleted after all their events
    # have been processed.
    #segment_size: 1GB

    # The number of events to read from disk to memory while waiting for
    # the output to request them.
    #read_ahead: 512

    # The number of events to accept from inputs while waiting for them
    # to be written to disk. If event data arrives faster than it
    # can be written to disk, this setting prevents it from overflowing
    # main memory.
    #write_ahead: 2048

    # The duration to wait before retrying when the queue encounters a disk
    # write error.
    #retry_interval: 1s

    # The maximum length of time to wait before retrying on a disk write
    # error. If the queue encounters repeated errors, it will double the
    # length of its retry interval each time, up to this maximum.
    #max_retry_interval: 30s

# Sets the maximum number of CPUs that can be executing simultaneously. The
# default is the number of logical CPUs available in the system.
#max_procs:
But!!! The downside of this configuration is just as obvious: there is no buffering at all, so every single event is published on its own. If the output is Elasticsearch, for example, every event triggers a separate interaction between packetbeat and ES, and CPU usage rises accordingly.
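If that per-event overhead becomes a problem, one direction to explore (I have not benchmarked it) is to keep flush.min_events at 0, so nothing is left behind when packetbeat exits after a pcap, but tune the Elasticsearch output's own batching so that whenever several events happen to be waiting in the queue they go out in one bulk request. The options below are standard output.elasticsearch settings; the host and the numbers are placeholders to adapt to your setup:

output.elasticsearch:
  hosts: ["localhost:9200"]   # adjust to your cluster
  # Maximum number of events to bundle into a single bulk request.
  bulk_max_size: 2048
  # Number of concurrent workers publishing to Elasticsearch.
  worker: 2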