I have successfully crawled a couple of websites and created two segments with Nutch. I have also installed and started the Solr service.
But when I try to index the crawled data into Solr, it does not work.
I tried this command:
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*
Output:
The input path at crawldb is not a segment... skipping
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:55:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
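As far as I understand the Nutch 1.x usage (this is an assumption on my part and may differ between versions), the generic index command only takes the crawldb/linkdb/segment paths as arguments and reads the Solr URL from the configuration, so the URL would have to be passed as a -D property rather than as the first argument; otherwise Hadoop treats it as an input path, which would explain the "No FileSystem for scheme: http" error above. Roughly:

# assumed usage: Solr URL passed as a -D property instead of a positional argument
bin/nutch index -D solr.server.url=http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*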
And this command:
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*
Output:
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:54:07
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
Before doing this, I copied the nutch/conf/schema.xml file into /nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_config/conf and renamed it to managed-schema, as suggested.
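In shell terms, that step was essentially (paths exactly as described above):

# copy Nutch's schema into the Solr configset and rename it to managed-schema
cp nutch/conf/schema.xml /nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_config/conf/managed-schema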
What mistake might I be making? Thanks in advance!
EDIT
Here is my log:
...........................
...........................
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214143435
2016-12-15 10:15:48,378 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214144230
2016-12-15 10:15:49,120 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-15 10:15:49,122 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-15 10:15:49,180 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-15 10:15:49,181 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-15 10:15:49,406 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-12-15 10:15:50,930 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: content dest: content
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: title dest: title
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: host dest: host
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: segment dest: segment
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: boost dest: boost
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: digest dest: digest
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,414 WARN mapred.LocalJobRunner - job_local1333791357_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
............................
.............................
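For what it is worth, the 404 that Solr returns for /solr/update in the log above would normally mean the URL does not point at a core; with Solr 5.x one usually creates a core first and points the indexer at that core's URL. A rough sketch only (the core name "nutch" here is just an example, not something from my setup):

# create a core named "nutch" (example name) and index into it
/nutch/solr-5.4.1/bin/solr create -c nutch
bin/nutch solrindex http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*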
The problem was a version incompatibility between Solr, Nutch and HBase. This article was helpful for me.