Question:

Tailing a file with flume-ng

逑衡
2023-03-14

I am trying to understand how to tail a file with flume-ng so that I can push the data into HDFS. In the first instance I set up a simple conf file:

tail1.sources = source1
tail1.sinks = sink1
tail1.channels = channel1

tail1.sources.source1.type = exec
tail1.sources.source1.command = tail -F /var/log/apache2/access.log
tail1.sources.source1.channels = channel1

tail1.sinks.sink1.type = logger

tail1.channels.channel1.type = memory
tail1.channels.channel1.capacity = 1000
tail1.channels.channel1.transactionCapacity = 100

tail1.sources.source1.channels = channel1
tail1.sinks.sink1.channel = channel1

This is a test, and my expectation is that I will see the output on the console. I run it with the following command:

flume-ng agent --conf-file tail1.conf -n tail1 -Dflume.root.logger=DEBUG,INFO,console

I get the following output:

12/12/05 11:01:07 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 1
12/12/05 11:01:07 INFO node.FlumeNode: Flume node starting - tail1
12/12/05 11:01:07 INFO nodemanager.DefaultLogicalNodeManager: Node manager starting
12/12/05 11:01:07 INFO lifecycle.LifecycleSupervisor: Starting lifecycle supervisor 8
12/12/05 11:01:07 INFO properties.PropertiesFileConfigurationProvider: Configuration provider starting
12/12/05 11:01:07 INFO properties.PropertiesFileConfigurationProvider: Reloading configuration file:tail1.conf
12/12/05 11:01:07 INFO conf.FlumeConfiguration: Processing:sink1
12/12/05 11:01:07 INFO conf.FlumeConfiguration: Processing:sink1
12/12/05 11:01:07 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [tail1]
12/12/05 11:01:07 INFO properties.PropertiesFileConfigurationProvider: Creating channels
12/12/05 11:01:08 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: CHANNEL, name: channel1, registered successfully.
12/12/05 11:01:08 INFO properties.PropertiesFileConfigurationProvider: created channel channel1
12/12/05 11:01:08 INFO sink.DefaultSinkFactory: Creating instance of sink: sink1, type: logger
12/12/05 11:01:08 INFO nodemanager.DefaultLogicalNodeManager: Starting new configuration:{ sourceRunners:{source1=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource@1839aa9 }} sinkRunners:{sink1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@11f0c98 counterGroup:{ name:null counters:{} } }} channels:{channel1=org.apache.flume.channel.MemoryChannel@1740f55} }
12/12/05 11:01:08 INFO nodemanager.DefaultLogicalNodeManager: Starting Channel channel1
12/12/05 11:01:08 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: channel1 started
12/12/05 11:01:08 INFO nodemanager.DefaultLogicalNodeManager: Starting Sink sink1
12/12/05 11:01:08 INFO nodemanager.DefaultLogicalNodeManager: Starting Source source1
12/12/05 11:01:08 INFO source.ExecSource: Exec source starting with command:tail -F /var/log/apache2/access.log

However, nothing further happens.

I have another session open with the following command:

tail -F /var/log/apache2/access.log

where I can see the writes hitting the file:

192.168.1.81 - - [05/Dec/2012:10:58:07 +0000] "GET / HTTP/1.1" 200 483 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"
192.168.1.81 - - [05/Dec/2012:10:58:07 +0000] "GET /favicon.ico HTTP/1.1" 404 502 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"
192.168.1.81 - - [05/Dec/2012:10:58:21 +0000] "GET / HTTP/1.1" 304 209 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"
192.168.1.81 - - [05/Dec/2012:10:58:22 +0000] "GET /favicon.ico HTTP/1.1" 404 502 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"

Can you help? I am thoroughly confused.

3 answers in total

阮选
2023-03-14

It is because /var/log/apache2/access.log is not growing enough for Flume to have any file lines to print. So just try something like the following, and you will find output in the console:

for i in {1..100}; do echo "tail log test$i" >> /var/log/apache2/access.log; done
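
If events are flowing, the logger sink prints each appended line to the agent's console as an event dump. A rough sketch of what to expect (the timestamp and hex bytes here are illustrative, not copied from a real run):

12/12/05 11:05:32 INFO sink.LoggerSink: Event: { headers:{} body: 74 61 69 6C 20 6C 6F 67 20 74 65 73 74 31       tail log test1 }
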
柯天宇
2023-03-14

Two possible reasons. First, once the command is issued, the source starts up and the sink has to register and start itself as well. I could not find those two lines in the log you showed; hopefully you just did not paste them. Normally it should look like this:

apache@hadoop:/hadoop/projects/apache-flume-1.4.0-SNAPSHOT-bin$ bin/flume-ng agent -n agent1 -c /conf -f conf/agent1.conf
Info: Including Hadoop libraries found via (/hadoop/projects/hadoop-1.0.4/bin/hadoop) for HDFS access
Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

Info: Excluding /hadoop/projects/hadoop-1.0.4/libexec/../lib/slf4j-api-1.4.3.jar from classpath
Info: Excluding /hadoop/projects/hadoop-1.0.4/libexec/../lib/slf4j-log4j12-1.4.3.jar from classpath
+ exec /usr/lib/jvm/java-7-oracle/bin/java -Xmx20m -cp '/conf:/hadoop/projects/apache-flume-1.4.0-SNAPSHOT-bin/lib/*:/hadoop/projects/hadoop-1.0.4/libexec/../conf:/usr/lib/jvm/java-7-oracle/lib/tools.jar:/hadoop/projects/hadoop-1.0.4/libexec/..:/hadoop/projects/hadoop-1.0.4/libexec/../hadoop-core-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/asm-3.2.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/aspectjrt-1.6.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/aspectjtools-1.6.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-beanutils-1.7.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-beanutils-core-1.8.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-cli-1.2.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-codec-1.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-collections-3.2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-configuration-1.6.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-daemon-1.0.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-digester-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-el-1.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-httpclient-3.0.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-io-2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-lang-2.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-logging-1.1.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-logging-api-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-math-2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/commons-net-1.4.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/core-3.1.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/guava-13.0.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/hadoop-capacity-scheduler-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/hadoop-fairscheduler-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/hadoop-thriftfs-1.0.4.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/hsqldb-1.8.0.10.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jackson-core-asl-1.8.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jasper-compiler-5.5.12.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jasper-runtime-5.5.12.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jdeb-0.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jersey-core-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jersey-json-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jersey-server-1.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jets3t-0.6.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jetty-6.1.26.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jetty-util-6.1.26.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jsch-0.1.42.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/junit-4.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/kfs-0.2.2.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/log4j-1.2.15.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/mockito-all-1.8.5.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/oro-2.0.8.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/protobuf-java-2.3.0.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/servlet-api-2.5-20081211.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/xmlenc-0.52.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/zookeeper-3.4.3.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-2.1.jar:/hadoop/projects/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-api-2.1.jar' 
-Djava.library.path=:/hadoop/projects/hadoop-1.0.4/libexec/../lib/native/Linux-amd64-64 org.apache.flume.node.Application -n agent1 -f conf/agent1.conf
12/12/15 02:55:29 INFO node.PollingPropertiesFileConfigurationProvider: Configuration provider starting
12/12/15 02:55:29 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:conf/agent1.conf
12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
12/12/15 02:55:29 INFO conf.FlumeConfiguration: Processing:HDFS
12/12/15 02:55:29 INFO conf.FlumeConfiguration: Added sinks: HDFS Agent: agent1
12/12/15 02:55:29 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration  for agents: [agent1]
12/12/15 02:55:29 INFO node.AbstractConfigurationProvider: Creating channels
12/12/15 02:55:29 INFO channel.DefaultChannelFactory: Creating instance of channel MemoryChannel-2 type memory
12/12/15 02:55:29 INFO node.AbstractConfigurationProvider: Created channel MemoryChannel-2
12/12/15 02:55:29 INFO source.DefaultSourceFactory: Creating instance of source tail, type exec
12/12/15 02:55:29 INFO sink.DefaultSinkFactory: Creating instance of sink: HDFS, type: hdfs
12/12/15 02:55:30 INFO hdfs.HDFSEventSink: Hadoop Security enabled: false
12/12/15 02:55:30 INFO node.Application: Starting new configuration:{ sourceRunners:{tail=EventDrivenSourceRunner: { source:org.apache.flume.source.ExecSource{name:tail,state:IDLE} }} sinkRunners:{HDFS=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@137efe53 counterGroup:{ name:null counters:{} } }} channels:{MemoryChannel-2=org.apache.flume.channel.MemoryChannel{name: MemoryChannel-2}} }
12/12/15 02:55:30 INFO node.Application: Starting Channel MemoryChannel-2
12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: CHANNEL, name: MemoryChannel-2, registered successfully.
12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Component type: CHANNEL, name: MemoryChannel-2 started
12/12/15 02:55:30 INFO node.Application: Starting Sink HDFS
12/12/15 02:55:30 INFO node.Application: Starting Source tail
12/12/15 02:55:30 INFO source.ExecSource: Exec source starting with command:tail -F /var/log/apache2/access.log.1
12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Monitoried counter group for type: SINK, name: HDFS, registered successfully.
12/12/15 02:55:30 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: HDFS started

See the last 2 lines.

Second, the agent will not push any data until new content appears in the file, here "/var/log/apache2/access.log". Manually copy something into the file, or restart Apache and make some requests, and then check the contents of your Flume target directory in HDFS.
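
For example, a minimal sketch for forcing some new events through and then checking the result (the HDFS directory below is just an assumed example; use whatever path your hdfs sink actually writes to):

for i in {1..20}; do echo "manual test line $i" >> /var/log/apache2/access.log; done
# for a logger sink, watch the agent console; for an hdfs sink, list the target directory:
hadoop fs -ls /flume/logtest/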

雷飞虎
2023-03-14

Your configuration file looks fine. I used it on CDH4 and it worked as you expected; all I did was change the location of the log file being tailed. I saw the output on the console. In my case, new log data was being written continuously to the file I was tailing. The timestamps in your data suggest that was not the case in your example.

Here is a more complete conf example, closer to what I think you are trying to do. It tails a file and writes a new HDFS file every 10 minutes or every 10K records. Change agent1.sources.source1.command to your tail command, and change agent1.sinks.sink1.hdfs.path and agent1.sinks.sink1.hdfs.filePrefix to match your HDFS configuration.

# A single-node Flume configuration
# uses exec and tail and will write a file every 10K records or every 10 min
# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -f /home/cloudera/LogCreator/fortune_log.log

# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://localhost/flume/logtest/
agent1.sinks.sink1.hdfs.filePrefix = LogCreateTest
# Number of seconds to wait before rolling current file (0 = never roll based on time interval)
agent1.sinks.sink1.hdfs.rollInterval = 600
# File size to trigger roll, in bytes (0: never roll based on file size) 
agent1.sinks.sink1.hdfs.rollSize = 0
#Number of events written to file before it rolled (0 = never roll based on number of events) 
agent1.sinks.sink1.hdfs.rollCount = 10000
# number of events written to file before it flushed to HDFS 
agent1.sinks.sink1.hdfs.batchSize = 10000 
agent1.sinks.sink1.hdfs.txnEventMax = 40000
# -- Compression codec. one of following : gzip, bzip2, lzo, snappy
# hdfs.codeC = gzip
#format: currently SequenceFile, DataStream or CompressedStream
#(1)DataStream will not compress output file and please don't set codeC
#(2)CompressedStream requires set hdfs.codeC with an available codeC
agent1.sinks.sink1.hdfs.fileType = DataStream 
agent1.sinks.sink1.hdfs.maxOpenFiles=50
# -- "Text" or "Writable"
#hdfs.writeFormat
agent1.sinks.sink1.hdfs.appendTimeout = 10000
agent1.sinks.sink1.hdfs.callTimeout = 10000
# Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
agent1.sinks.sink1.hdfs.threadsPoolSize=100 
# Number of threads per HDFS sink for scheduling timed file rolling
agent1.sinks.sink1.hdfs.rollTimerPoolSize = 1 
# hdfs.kerberosPrincipal Kerberos user principal for accessing secure HDFS
# hdfs.kerberosKeytab Kerberos keytab for accessing secure HDFS
# hdfs.round false Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)
# hdfs.roundValue 1 Rounded down to the highest multiple of this (in the unit configured using
# hdfs.roundUnit), less than current time.
# hdfs.roundUnit second The unit of the round down value - second, minute or hour.
# serializer TEXT Other possible options include AVRO_EVENT or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
# serializer.*


# Use a channel which buffers events to a file
# -- The component type name, needs to be FILE.
agent1.channels.channel1.type = FILE 
# checkpointDir ~/.flume/file-channel/checkpoint The directory where checkpoint file will be stored
# dataDirs ~/.flume/file-channel/data The directory where log files will be stored
# The maximum size of transaction supported by the channel
agent1.channels.channel1.transactionCapacity = 1000000 
# Amount of time (in millis) between checkpoints
agent1.channels.channel1.checkpointInterval = 30000
# Max size (in bytes) of a single log file 
agent1.channels.channel1.maxFileSize = 2146435071
# Maximum capacity of the channel 
agent1.channels.channel1.capacity = 10000000
#keep-alive 3 Amount of time (in sec) to wait for a put operation
#write-timeout 3 Amount of time (in sec) to wait for a write operation

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
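
To run this agent, a sketch of the launch and verification commands (the --conf directory and the HDFS path are assumptions; adjust them to your installation):

flume-ng agent --conf conf --conf-file agent1.conf --name agent1 -Dflume.root.logger=INFO,console

# after some data has been tailed, the rolled files should appear under the sink path
hadoop fs -ls /flume/logtest/
hadoop fs -cat /flume/logtest/LogCreateTest*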