A Step-by-Step Guide to Building and Running a Complete Open-Source Search Engine
I. Required Software and Versions
1. CentOS Linux 7
2. Hadoop 1.2.1
3. HBase 0.94.27
4. Nutch 2.3
5. Solr 4.9.1
Reference download links:
http://isoredirect.centos.org/centos/7/isos/x86_64/CentOS-7-x86_64-DVD-1503-01.iso
https://www.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
http://mirror.bit.edu.cn/apache/hbase/hbase-0.94.27/hbase-0.94.27.tar.gz
http://www.apache.org/dyn/closer.cgi/nutch/2.3/apache-nutch-2.3-src.tar.gz
http://archive.apache.org/dist/lucene/solr/4.9.1/solr-4.9.1.tgz
II. Preparing the System Environment
1. Install the Linux operating system (steps omitted)
2. Create a dedicated hadoop user: useradd hadoop
3. Set its password: passwd hadoop
4. Grant administrator privileges: run vi /etc/sudoers and add the line hadoop ALL=(ALL) ALL
5. Make sure localhost is mapped to 127.0.0.1: vi /etc/hosts
6. Set up passwordless SSH login (search the web for detailed steps) until ssh localhost works without a password prompt
7. Make sure Java 1.7 is installed and the JAVA_HOME environment variable is configured, e.g.:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75-2.5.4.2.el7_0.x86_64/
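The passwordless-login setup in step 6 can be sketched as follows (a minimal single-machine sketch; it assumes OpenSSH is installed and uses the default ~/.ssh/id_rsa key path):

```shell
# Create the key directory and an RSA key pair without a passphrase,
# skipping generation if a key already exists
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Authorize our own public key for local logins
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# sshd refuses keys kept with loose permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```

After this, ssh localhost should log in without prompting for a password.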
III. Installing Hadoop 1.2.1
1. Download hadoop-1.2.1.tar.gz and extract it to /usr/local/hadoop, then fix the directory ownership:
sudo chown -R hadoop:hadoop hadoop
2. Create the /data/hadoop-data directory
3. Edit ./conf/core-site.xml as follows:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-data/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
4. Edit the JAVA_HOME setting in conf/hadoop-env.sh as follows:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75-2.5.4.2.el7_0.x86_64/
5. Edit ./conf/hdfs-site.xml as follows:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
6. Edit ./conf/mapred-site.xml as follows:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
7. Configure the environment variables:
export HADOOP_PREFIX=/usr/local/hadoop/
export PATH=${HADOOP_PREFIX}/bin/:${PATH}
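These exports only last for the current shell session; to make them survive logins, they can be appended to the hadoop user's ~/.bashrc (a sketch; the grep guard keeps repeated runs from duplicating the lines):

```shell
# Append the Hadoop environment variables to ~/.bashrc exactly once
if ! grep -q 'HADOOP_PREFIX' ~/.bashrc 2>/dev/null; then
  cat >> ~/.bashrc <<'EOF'
export HADOOP_PREFIX=/usr/local/hadoop/
export PATH=${HADOOP_PREFIX}/bin/:${PATH}
EOF
fi
```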
8. Format the NameNode:
hadoop namenode -format
9. Start Hadoop:
start-all.sh
10. Check that Hadoop started correctly:
hadoop fs -ls /
Expected output:
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2015-07-31 16:53 /data
hadoop job -list
Expected output:
0 jobs currently running
JobId State StartTime UserName Priority SchedulingInfo
11. TaskTracker status page: http://localhost:50060/tasktracker.jsp
12. JobTracker status page: http://localhost:50030/jobtracker.jsp
13. DataNode status page: http://localhost:50075
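The status pages above can also be probed from the command line (a sketch; port 50070 is the NameNode UI's default in Hadoop 1.x, added here alongside the three pages listed):

```shell
# Probe each Hadoop web UI port; a DOWN line means that daemon is not
# serving its status page. Results are also written to a report file.
: > /tmp/hadoop-ports.txt
for port in 50030 50060 50070 50075; do
  if curl -s -o /dev/null --max-time 2 "http://localhost:${port}/"; then
    echo "port ${port}: up" >> /tmp/hadoop-ports.txt
  else
    echo "port ${port}: DOWN" >> /tmp/hadoop-ports.txt
  fi
done
cat /tmp/hadoop-ports.txt
```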
IV. Installing HBase 0.94.27
1. Download hbase-0.94.27.tar.gz and extract it to /usr/local/hbase, then fix the directory ownership: sudo chown -R hadoop:hadoop hbase
2. Create the directory /data/hbase/zookeeper/
3. Edit the JAVA_HOME setting in ./conf/hbase-env.sh as follows:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.75-2.5.4.2.el7_0.x86_64/
and enable the bundled ZooKeeper:
export HBASE_MANAGES_ZK=true
4. Edit ./conf/hbase-site.xml as follows:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/data/hbase/zookeeper</value>
  </property>
</configuration>
5. Delete ./lib/hadoop-core-1.0.4.jar and copy in the matching jar from Hadoop:
cp /usr/local/hadoop/hadoop-core-1.2.1.jar ./lib/
6. Start HBase:
./bin/start-hbase.sh
7. Check that HBase started correctly: run ./bin/hbase shell to open the HBase shell, then execute list; the result should look like:
hbase(main):002:0> list
TABLE
0 row(s) in 0.0170 seconds
hbase(main):003:0>
8. HBase Master status page: http://localhost:60010/master-status
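Beyond list, a quick write/read round trip confirms that HBase can actually persist data to HDFS (a sketch; the table name testtable and column family cf are arbitrary examples, not part of the original setup):

```shell
# Create a table, write one cell, read it back, then drop the table
./bin/hbase shell <<'EOF'
create 'testtable', 'cf'
put 'testtable', 'row1', 'cf:greeting', 'hello'
get 'testtable', 'row1'
disable 'testtable'
drop 'testtable'
EOF
```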
V. Installing Nutch 2.3
1. Download apache-nutch-2.3-src.tar.gz and extract it to /usr/local/nutch, then fix the directory ownership: sudo chown -R hadoop:hadoop nutch
2. Add the following line to ./conf/gora.properties:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
3. Edit ./conf/nutch-site.xml as follows:
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|msexcel|msword|mspowerpoint|pdf)|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|nutch-extensionpoints</value>
  </property>
</configuration>
4. Edit ./ivy/ivy.xml:
Change the hadoop-core and hadoop-test dependency versions from 1.2.0 to 1.2.1
Uncomment the gora-hbase dependency:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
5. Build: ant runtime (this may take a long time, since it downloads all dependencies)
VI. Installing Solr 4.9.1
1. Download solr-4.9.1.tgz and extract it to /usr/local/solr, then fix the directory ownership:
sudo chown -R hadoop:hadoop solr
2. Enter /usr/local/solr/example and overwrite Solr's schema with the one shipped by Nutch:
cp /usr/local/nutch/runtime/local/conf/schema.xml solr/collection1/conf/schema.xml
3. Start Solr:
java -jar start.jar
4. Open http://localhost:8983/solr/#/collection1/query to view the Solr admin page
VII. Starting the Crawl and Testing Search
1. Enter /usr/local/nutch/runtime/local, create a myUrls directory, and inside it create a seed.txt file containing the seed URLs, e.g.:
http://nutch.apache.org/
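Step 1 can be done in two commands (run from /usr/local/nutch/runtime/local; the single seed URL matches the example above):

```shell
# Create the seed directory and write one seed URL per line
mkdir -p myUrls
cat > myUrls/seed.txt <<'EOF'
http://nutch.apache.org/
EOF
```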
2. Run:
./bin/crawl ./myUrls/ TestCrawl http://localhost:8983/solr 2
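The crawl script bundles the individual Nutch 2.x batch phases; if a round fails, it can help to run them one at a time (a sketch; the -topN value of 50 is an arbitrary example, while the crawl id and Solr URL match the command above):

```shell
# One crawl round, expanded into its phases
./bin/nutch inject ./myUrls/ -crawlId TestCrawl    # load seed URLs into HBase
./bin/nutch generate -topN 50 -crawlId TestCrawl   # pick a batch of URLs to fetch
./bin/nutch fetch -all -crawlId TestCrawl          # download the pages
./bin/nutch parse -all -crawlId TestCrawl          # extract text and outlinks
./bin/nutch updatedb -all -crawlId TestCrawl       # merge discovered links back
./bin/nutch solrindex http://localhost:8983/solr -all -crawlId TestCrawl  # index into Solr
```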
3. Query: search at http://localhost:8983/solr/#/collection1/query to see the indexed results
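The same query can also be issued over plain HTTP, which is handy for scripting (a sketch; it assumes the default collection1 core and that the crawl above indexed some documents):

```shell
# Ask Solr for up to 5 documents matching "nutch", returned as JSON
curl -s 'http://localhost:8983/solr/collection1/select?q=nutch&rows=5&wt=json'
```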