crawler4j
Continuing with Chapter 4 of Programming Collective Intelligence (PCI): implementing a search engine.
I may have bitten off a little more than I should have for one exercise. I decided that rather than use the conventional relational database structure used in the book, I had been wanting to have a look at Neo4j for a while, so now was the time. Just to say up front: this is not necessarily an ideal use case for a graph database, but how hard could it be to kill 3 birds with 1 stone?
Trying to reset my SQL Server / Oracle way of thinking took a little longer than expected, but fortunately there are some great resources around Neo4j.
Since I only wanted to run this as a small exercise, I decided on the in-memory implementation rather than running it as a service on my machine. In hindsight this was probably a mistake, as the tooling and web interface would have helped me visualize my data graph much sooner.
Because you can only have 1 writable in-memory instance, I made a double-checked-locking singleton factory to create and clear the database.
package net.briandupreez.pci.chapter4;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CreateDBFactory {

    // volatile is required for double-checked locking to work correctly
    private static volatile GraphDatabaseService graphDb = null;
    public static final String RESOURCES_CRAWL_DB = "resources/crawl/db";

    public static GraphDatabaseService createInMemoryDB() {
        if (null == graphDb) {
            synchronized (GraphDatabaseService.class) {
                if (null == graphDb) {
                    final Map<String, String> config = new HashMap<>();
                    config.put("neostore.nodestore.db.mapped_memory", "50M");
                    config.put("string_block_size", "60");
                    config.put("array_block_size", "300");
                    graphDb = new GraphDatabaseFactory()
                            .newEmbeddedDatabaseBuilder(RESOURCES_CRAWL_DB)
                            .setConfig(config)
                            .newGraphDatabase();
                    registerShutdownHook(graphDb);
                }
            }
        }
        return graphDb;
    }

    private static void registerShutdownHook(final GraphDatabaseService graphDb) {
        // make sure the database shuts down cleanly when the JVM exits
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });
    }

    public static void clearDb() {
        try {
            if (graphDb != null) {
                graphDb.shutdown();
                graphDb = null;
            }
            FileUtils.deleteRecursively(new File(RESOURCES_CRAWL_DB));
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }
}
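Using the factory then follows the usual Neo4j 1.9 transaction idiom; a minimal sketch (the property key and value here are just examples):

final GraphDatabaseService db = CreateDBFactory.createInMemoryDB();
final Transaction tx = db.beginTx();
try {
    final Node node = db.createNode();
    node.setProperty("example", "value"); // any property key/value
    tx.success(); // mark the transaction as successful
} finally {
    tx.finish(); // 1.9-era API; later Neo4j versions replaced this with close()
}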
Then, using Crawler4j, I created a graph of all the URLs starting from my blog, their relationships to other URLs, and all the words those URLs contain along with each word's index.
package net.briandupreez.pci.chapter4;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Neo4JWebCrawler extends WebCrawler {

    private final GraphDatabaseService graphDb;

    /**
     * Constructor.
     */
    public Neo4JWebCrawler() {
        this.graphDb = CreateDBFactory.createInMemoryDB();
    }

    @Override
    public boolean shouldVisit(final WebURL url) {
        final String href = url.getURL().toLowerCase();
        return !NodeConstants.FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(final Page page) {
        final String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
        final Index<Node> nodeIndex = graphDb.index().forNodes(NodeConstants.PAGE_INDEX);

        if (page.getParseData() instanceof HtmlParseData) {
            final HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            final String text = htmlParseData.getText();
            //String html = htmlParseData.getHtml();
            final List<WebURL> links = htmlParseData.getOutgoingUrls();

            final Transaction tx = graphDb.beginTx();
            try {
                final Node pageNode = graphDb.createNode();
                pageNode.setProperty(NodeConstants.URL, url);
                nodeIndex.add(pageNode, NodeConstants.URL, url);

                // create a node per word, linked to the page that contains it
                final List<String> words = cleanAndSplitString(text);
                int index = 0;
                for (final String word : words) {
                    final Node wordNode = graphDb.createNode();
                    wordNode.setProperty(NodeConstants.WORD, word);
                    wordNode.setProperty(NodeConstants.INDEX, index++);
                    final Relationship relationship = pageNode.createRelationshipTo(wordNode, RelationshipTypes.CONTAINS);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                }

                // record outgoing links as LINK_TO relationships
                for (final WebURL webURL : links) {
                    System.out.println("Linking to " + webURL);
                    final Node linkNode = graphDb.createNode();
                    linkNode.setProperty(NodeConstants.URL, webURL.getURL());
                    final Relationship relationship = pageNode.createRelationshipTo(linkNode, RelationshipTypes.LINK_TO);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                    relationship.setProperty(NodeConstants.DESTINATION, webURL.getURL());
                }

                tx.success();
            } finally {
                tx.finish();
            }
        }
    }

    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            final String[] dic = input.toLowerCase().replaceAll("\\p{Punct}", "").replaceAll("\\p{Digit}", "").split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();
    }
}
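The crawler class only does the per-page work; crawler4j still needs a controller to actually start the crawl. A minimal sketch of the 3.5-era bootstrap, where the storage folder, seed URL, depth and crawler count are placeholder values rather than anything from the original setup:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlStarter {
    public static void main(final String[] args) throws Exception {
        final CrawlConfig crawlConfig = new CrawlConfig();
        crawlConfig.setCrawlStorageFolder("resources/crawl/root"); // crawler4j's own intermediate storage
        crawlConfig.setMaxDepthOfCrawling(2); // keep the exercise small

        final PageFetcher pageFetcher = new PageFetcher(crawlConfig);
        final RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        final RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        final CrawlController controller = new CrawlController(crawlConfig, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.briandupreez.net/"); // placeholder seed: the blog's root
        controller.start(Neo4JWebCrawler.class, 2); // run 2 crawler instances
    }
}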
Once the data was collected, I could query it and perform the functions of a search engine. To do this I decided on Java Futures, as that is another thing I had only read about but not yet implemented. In my daily work environment we use the Weblogic / CommonJ work managers in the application server for the same kind of task.
final ExecutorService executorService = Executors.newFixedThreadPool(4);
final String[] searchTerms = {"java", "spring"};
List<Callable<TaskResponse>> tasks = new ArrayList<>();
tasks.add(new WordFrequencyTask(searchTerms));
tasks.add(new DocumentLocationTask(searchTerms));
tasks.add(new PageRankTask(searchTerms));
tasks.add(new NeuralNetworkTask(searchTerms));
final List<Future<TaskResponse>> results = executorService.invokeAll(tasks);
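Draining the futures is then straightforward; a minimal sketch, assuming the normalized scores of the individual tasks are simply summed per URL (TaskResponse and its resultMap field appear in the PageRank task further down):

final Map<String, Double> combined = new HashMap<>();
for (final Future<TaskResponse> future : results) {
    final TaskResponse taskResponse = future.get(); // blocks until that task completes; throws if the task failed
    for (final Map.Entry<String, Double> entry : taskResponse.resultMap.entrySet()) {
        final Double current = combined.get(entry.getKey());
        combined.put(entry.getKey(), current == null ? entry.getValue() : current + entry.getValue());
    }
}
executorService.shutdown(); // no more tasks coming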
I then went on to create a task for each of the following: word frequency counting, document location, PageRank, and a neural network (with fake input / training data), to rank the pages returned based on the search terms. All the code is in my public GitHub blog repo.
Disclaimer: the neural network task either doesn't have enough data to be effective, or I haven't implemented the data normalization correctly, so it is not very useful at the moment; I will return to it once I have completed my journey through the rest of the PCI book.
The one task worth sharing is the PageRank task: I quickly read up on some of the theory, decided I wasn't quite that clever, and went looking for a library that had implemented it. I found GraphStream, a great open source project that does a whole lot more than just PageRank; go check out their video.
This made the PageRank task for this exercise very easy to implement.
package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.NodeConstants;
import net.briandupreez.pci.chapter4.NormalizationFunctions;
import org.graphstream.algorithm.PageRank;
import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.SingleGraph;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;

public class PageRankTask extends SearchTask implements Callable<TaskResponse> {

    public PageRankTask(final String... terms) {
        super(terms);
    }

    @Override
    protected ExecutionResult executeQuery(final String... words) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        final StringBuilder bob = new StringBuilder("START page=node(*) MATCH (page)-[:CONTAINS]->words ");
        bob.append(", (page)-[:LINK_TO]->related ");
        bob.append("WHERE words.word in [");
        bob.append(formatArray(words));
        bob.append("] ");
        bob.append("RETURN DISTINCT page, related");
        return engine.execute(bob.toString());
    }

    public TaskResponse call() {
        final ExecutionResult result = executeQuery(searchTerms);
        final Map<String, Double> returnMap = convertToUrlTotalWords(result);

        final TaskResponse response = new TaskResponse();
        response.taskClazz = this.getClass();
        response.resultMap = NormalizationFunctions.normalizeMap(returnMap, true);
        return response;
    }

    private Map<String, Double> convertToUrlTotalWords(final ExecutionResult result) {
        final Map<String, Double> uniqueUrls = new HashMap<>();
        final Graph g = new SingleGraph("rank", false, true);
        final Iterator<Node> pageIterator = result.columnAs("related");
        while (pageIterator.hasNext()) {
            final Node node = pageIterator.next();
            final Iterator<Relationship> relationshipIterator = node.getRelationships().iterator();
            while (relationshipIterator.hasNext()) {
                final Relationship relationship = relationshipIterator.next();
                final String source = relationship.getProperty(NodeConstants.SOURCE).toString();
                uniqueUrls.put(source, 0.0);
                final String destination = relationship.getProperty(NodeConstants.DESTINATION).toString();
                g.addEdge(String.valueOf(node.getId()), source, destination, true);
            }
        }
        computeAndSetPageRankScores(uniqueUrls, g);
        return uniqueUrls;
    }

    /**
     * Compute score.
     *
     * @param uniqueUrls urls
     * @param graph      the graph of all links
     */
    private void computeAndSetPageRankScores(final Map<String, Double> uniqueUrls, final Graph graph) {
        final PageRank pr = new PageRank();
        pr.init(graph);
        pr.compute();
        for (final Map.Entry<String, Double> entry : uniqueUrls.entrySet()) {
            final double score = 100 * pr.getRank(graph.getNode(entry.getKey()));
            entry.setValue(score);
        }
    }
}
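For reference, assuming formatArray simply produces a quoted, comma-separated list (an assumption, as its source isn't shown here), the query built above for the search terms "java" and "spring" comes out as the following 1.9-era START/MATCH Cypher:

START page=node(*)
MATCH (page)-[:CONTAINS]->words, (page)-[:LINK_TO]->related
WHERE words.word in ['java', 'spring']
RETURN DISTINCT page, related

It pulls back every page containing any of the search terms together with the pages it links to; that link structure is what feeds the GraphStream graph.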
Somewhere in between I found a great implementation on Stack Overflow for sorting a map by its values.
package net.briandupreez.pci.chapter4;

import java.util.*;

public class MapUtil {

    /**
     * Sort a map based on values.
     * The values must be Comparable.
     *
     * @param map       the map to be sorted
     * @param ascending in ascending order, or descending if false
     * @param <K>       key generic
     * @param <V>       value generic
     * @return sorted list
     */
    public static <K, V extends Comparable<? super V>> List<Map.Entry<K, V>> entriesSortedByValues(final Map<K, V> map, final boolean ascending) {
        final List<Map.Entry<K, V>> sortedEntries = new ArrayList<>(map.entrySet());
        Collections.sort(sortedEntries,
                new Comparator<Map.Entry<K, V>>() {
                    @Override
                    public int compare(final Map.Entry<K, V> e1, final Map.Entry<K, V> e2) {
                        if (ascending) {
                            return e1.getValue().compareTo(e2.getValue());
                        } else {
                            return e2.getValue().compareTo(e1.getValue());
                        }
                    }
                });
        return sortedEntries;
    }
}
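As a quick usage example, ranking the combined scores from the futures sketch above in descending order:

final List<Map.Entry<String, Double>> ranked = MapUtil.entriesSortedByValues(combined, false);
for (final Map.Entry<String, Double> entry : ranked) {
    System.out.println(entry.getValue() + "\t" + entry.getKey());
}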
The Maven dependencies used to implement all of this:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0.1</version>
</dependency>
<dependency>
    <groupId>org.encog</groupId>
    <artifactId>encog-core</artifactId>
    <version>3.2.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>3.5</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>1.9</version>
</dependency>
<dependency>
    <groupId>org.graphstream</groupId>
    <artifactId>gs-algo</artifactId>
    <version>1.1.2</version>
</dependency>
Now on to Chapter 5 of PCI… Optimization.