Question:

Apache Tika 1.16 TXTParser fails to detect character encoding in an sbt assembly build

益光亮
2023-03-14

I am building a project from Eclipse with sbt assembly. I have a very large and complicated build.sbt file, because I have a lot of dependency conflicts.

With the PDF, OOXML, and OpenDocument parsers from Tika 1.16, everything works fine for pdf, pptx, odt, and docx files. However, when I try to parse a txt file (UTF-8 encoded) with the TXTParser, I get the following error:

org.apache.tika.exception.TikaException: Failed to detect the character encoding of a document
    at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:77)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:108)
    at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:114)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:79)

from this line of Scala code:

val content = theParser.parse(stream.open(), chandler, meta, pContext)

where stream is a PortableDataStream, chandler is a new BodyContentHandler, meta is a new Metadata, and pContext is a new ParseContext.
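For context, the setup described above corresponds roughly to the following sketch. The file name and the plain FileInputStream are placeholders standing in for stream.open() on the PortableDataStream; the parse signature shown is the Tika 1.16 one, parse(InputStream, ContentHandler, Metadata, ParseContext):

```scala
import java.io.FileInputStream
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.txt.TXTParser
import org.apache.tika.sax.BodyContentHandler

val theParser = new TXTParser()
val chandler  = new BodyContentHandler()
val meta      = new Metadata()
val pContext  = new ParseContext()

// Stand-in for stream.open() on the PortableDataStream.
val in = new FileInputStream("example.txt")
try theParser.parse(in, chandler, meta, pContext)
finally in.close()

// The extracted plain text ends up in the content handler.
val content = chandler.toString
```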

If I use an AutoDetectParser instead, I get the following error:

org.apache.jena.shared.SyntaxError: unknown
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:73)
    at org.apache.jena.rdf.model.impl.NTripleReader.read(NTripleReader.java:58)
    at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:305)

from this line of Scala code:

val response = model.read(stream, null, "N-TRIPLES")

where stream is an InputStream.

I believe this happens because Tika's response is empty (so it is the same underlying problem).
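If the Jena input really is built from Tika's output, one hypothetical guard (not from the question; chandler and model are the objects from the snippets above) is to skip the RDF read whenever Tika returned no text, so the N-Triples parser never sees an empty document:

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

// Text extracted by Tika via the BodyContentHandler shown earlier.
val extracted: String = chandler.toString

if (extracted.trim.nonEmpty) {
  // Only hand Jena a stream when there is actual content to parse.
  val ntStream = new ByteArrayInputStream(extracted.getBytes(StandardCharsets.UTF_8))
  model.read(ntStream, null, "N-TRIPLES")
} else {
  // Log and move on instead of letting the N-Triples reader fail on emptiness.
  println("Tika returned no content; skipping RDF read")
}
```

This does not fix the encoding detection itself, but it separates the Tika failure from the Jena SyntaxError.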

I am fairly sure this is a dependency issue somewhere in my overly complicated build.sbt file, but after hours of trying I definitely need help.

On the positive side, everything works perfectly if no txt files are input, so this may well be my last problem!

Finally, this is the build.sbt file that I build with sbt clean assembly:

scalaVersion := "2.11.8"
version      := "1.0.0"
name := "crawldocs"
conflictManager := ConflictManager.strict
mainClass in assembly := Some("com.addlesee.crawling.CrawlHiccup")
libraryDependencies ++= Seq(
  "org.apache.tika" % "tika-core" % "1.16",
  "org.apache.tika" % "tika-parsers" % "1.16" excludeAll(
    ExclusionRule(organization = "*", name = "guava")
  ),
  "com.blazegraph" % "bigdata-core" % "2.0.0" excludeAll(
    ExclusionRule(organization = "*", name = "collection-0.7"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "commons-logging"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "httpmime"),
    ExclusionRule(organization = "*", name = "jackson-annotations"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-cmds"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "jena-tdb"),
    ExclusionRule(organization = "*", name = "jsonld-java"),
    ExclusionRule(organization = "*", name = "libthrift"),
    ExclusionRule(organization = "*", name = "log4j"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "xercesImpl"),
    ExclusionRule(organization = "*", name = "xml-apis")
  ),
  "org.scalaj" %% "scalaj-http" % "2.3.0",
  "org.apache.jena" % "apache-jena" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.jena" % "apache-jena-libs" % "3.4.0" excludeAll(
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-csv"),
    ExclusionRule(organization = "*", name = "commons-lang3"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "httpclient-cache"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "jackson-databind"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "jena-rdfconnection"),
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.noggit" % "noggit" % "0.6",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2" excludeAll(
    ExclusionRule(organization = "*", name = "slf4j-api")
  ),
  "org.apache.spark" % "spark-core_2.11" % "2.2.0" excludeAll(
    ExclusionRule(organization = "*", name = "breeze_2.11"),
    ExclusionRule(organization = "*", name = "hadoop-hdfs"),
    ExclusionRule(organization = "*", name = "hadoop-annotations"),
    ExclusionRule(organization = "*", name = "hadoop-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-app"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-common"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-core"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-jobclient"),
    ExclusionRule(organization = "*", name = "hadoop-mapreduce-client-shuffle"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-api"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-client"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-common"),
    ExclusionRule(organization = "*", name = "hadoop-yarn-server-web-proxy"),
    ExclusionRule(organization = "*", name = "activation"),
    ExclusionRule(organization = "*", name = "hive-exec"),
    ExclusionRule(organization = "*", name = "scala-compiler"),
    ExclusionRule(organization = "*", name = "spire_2.11"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "guava"),
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "jackson-core"),
    ExclusionRule(organization = "*", name = "httpcore"),
    ExclusionRule(organization = "*", name = "bcprov-jdk15on"),
    ExclusionRule(organization = "*", name = "jul-to-slf4j"),
    ExclusionRule(organization = "*", name = "jcl-over-slf4j"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "curator-framework")
  ),
  "org.scala-lang" % "scala-xml" % "2.11.0-M4",
  "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jettison"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "netty")
  ),
  "org.apache.hadoop" % "hadoop-common" % "2.7.3" excludeAll(
    ExclusionRule(organization = "*", name = "commons-codec"),
    ExclusionRule(organization = "*", name = "commons-cli"),
    ExclusionRule(organization = "*", name = "slf4j-api"),
    ExclusionRule(organization = "*", name = "commons-math3"),
    ExclusionRule(organization = "*", name = "commons-io"),
    ExclusionRule(organization = "*", name = "jets3t"),
    ExclusionRule(organization = "*", name = "gson"),
    ExclusionRule(organization = "*", name = "avro"),
    ExclusionRule(organization = "*", name = "httpclient"),
    ExclusionRule(organization = "*", name = "zookeeper"),
    ExclusionRule(organization = "*", name = "commons-compress"),
    ExclusionRule(organization = "*", name = "slf4j-log4j12"),
    ExclusionRule(organization = "*", name = "commons-net"),
    ExclusionRule(organization = "*", name = "curator-recipes"),
    ExclusionRule(organization = "*", name = "jsr305")
  )
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

2 Answers

公良子轩
2023-03-14

Finally fixed...

I added a case x if x.contains("EncodingDetector") line:

assemblyMergeStrategy in assembly := {
  case x if x.contains("EncodingDetector") => MergeStrategy.deduplicate
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
芮瑾瑜
2023-03-14

The code above invokes the old N-Triples parser, which only exists for legacy reasons. The old reader is ASCII-only; UTF-8 breaks it.

Either you are not using apache-jena-libs (type=pom), or, when repackaging the JARs, you have not handled META-INF/services, which is where Java's ServiceLoader looks for its files. Jena uses these for initialization. You must combine the META-INF/services/* files of the same name by concatenating them.

Details: https://jena.apache.org/documentation/notes/jena-repack.html
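One way to do that concatenation in an sbt-assembly build is a dedicated merge case for META-INF/services, placed before the catch-all cases. This is a sketch using sbt-assembly's built-in strategies; filterDistinctLines concatenates same-named files while dropping duplicate lines:

```scala
// In build.sbt: keep ServiceLoader registration files instead of discarding
// META-INF wholesale, merging same-named files line by line.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", xs @ _*)        => MergeStrategy.discard
  case x                                    => MergeStrategy.first
}
```

MergeStrategy.concat would also work here if exact duplicate lines are acceptable in the merged service files.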
