I'm still something of a newbie myself and have only just started looking at web crawlers, so I'm trying to learn by reading the source code of someone else's crawler framework. If anything here is wrong, please bear with me and point it out.
Following the earlier post that walked through crawler4j's robotstxt package, today let's look at the crawler package and the exception package.
The crawler package mainly contains the following classes:
1.Configurable: the abstract configuration class. It is abstract and holds nothing but a reference to a CrawlConfig.
2.CrawlConfig: the concrete configuration class for a crawl. It has a lot of parameters; here I'll only introduce the main configurable ones.
resumableCrawling: controls whether crawls that have been stopped can later be resumed. (Turning it on makes crawling slower.)
maxDepthOfCrawling: the maximum crawl depth. If the first page is at depth 0, the pages reached from it are at depth 1, and so on; once a page is at the maximum depth, the URLs found on it are no longer added to the URL queue.
maxPagesToFetch: the maximum number of pages to fetch.
politenessDelay: the delay between two consecutive requests.
includeBinaryContentInCrawling and processBinaryContentInCrawling: whether to include and process binary content such as images.
userAgentString: the crawler's name, sent as the User-Agent header.
proxyHost and proxyPort: the proxy server address and port. (You can read up on proxies yourself; in short, the crawler sends its HTTP request to the proxy server first, and if the proxy already has an up-to-date result it returns it directly, otherwise the proxy forwards the request to the web server and returns the response.)
There are other parameters I won't go through one by one (some HTTP connection and timeout settings, plus a few I haven't figured out yet, such as onlineTldListUpdate); a small configuration sketch follows below.
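To make the parameters above concrete, here is a minimal sketch of building a CrawlConfig (edu.uci.ics.crawler4j.crawler.CrawlConfig) with the setters that correspond to them; the storage folder and the values are just placeholders:
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j");      // placeholder folder for intermediate crawl data
config.setMaxDepthOfCrawling(3);                     // stop following links below depth 3
config.setMaxPagesToFetch(1000);                     // fetch at most 1000 pages
config.setPolitenessDelay(200);                      // wait 200 ms between two requests
config.setResumableCrawling(false);                  // no resume support, so crawling stays fast
config.setIncludeBinaryContentInCrawling(false);     // skip images and other binary content
config.setUserAgentString("crawler4j-demo");         // crawler name sent in the User-Agent header
// config.setProxyHost("proxy.example.com");         // optional proxy settings
// config.setProxyPort(8080);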
3.WebCrawler: the crawler class itself; it implements Runnable, so let's start with the run method:
public void run() {
    onStart();
    while (true) {
        List<WebURL> assignedURLs = new ArrayList<>(50);
        isWaitingForNewURLs = true;
        frontier.getNextURLs(50, assignedURLs);
        isWaitingForNewURLs = false;
        if (assignedURLs.isEmpty()) {
            if (frontier.isFinished()) {
                return;
            }
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                logger.error("Error occurred", e);
            }
        } else {
            for (WebURL curURL : assignedURLs) {
                if (myController.isShuttingDown()) {
                    logger.info("Exiting because of controller shutdown.");
                    return;
                }
                if (curURL != null) {
                    curURL = handleUrlBeforeProcess(curURL);
                    processPage(curURL);
                    frontier.setProcessed(curURL);
                }
            }
        }
    }
}
onStart() is empty by default, but we can override it to do our own setup before the crawler starts. After that, the crawler keeps asking the global URL queue (the Frontier) for a batch of up to 50 URLs; if the batch is empty and the Frontier reports that it is finished, the crawler exits, otherwise it sleeps for a while and tries again. If there are URLs and the controller is not shutting down, each URL is processed, and the Frontier is finally told via setProcessed that the URL has been handled.
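Since onStart() is only an empty hook (its counterpart onBeforeExit() shows up again in the controller further down), overriding it is straightforward. A minimal sketch, assuming a hypothetical subclass called MyCrawler:
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class MyCrawler extends WebCrawler {

    @Override
    public void onStart() {
        // runs once in each crawler thread, before the fetch loop above begins
        logger.info("Crawler {} is starting", getMyId());
    }

    @Override
    public void onBeforeExit() {
        // runs once per crawler, just before the controller shuts everything down
        logger.info("Crawler {} is exiting", getMyId());
    }
}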
Next, let's look at the page-processing method (processPage). Here is the main logic:
fetchResult = pageFetcher.fetchPage(curURL);  // fetch the page
Page page = new Page(curURL);                 // wrap the result in a Page
page.setFetchResponseHeaders(fetchResult.getResponseHeaders());
page.setStatusCode(statusCode);
// what follows is the branch for status code 200
if (!curURL.getURL().equals(fetchResult.getFetchedUrl())) {
    // we were redirected to a different URL
    if (docIdServer.isSeenBefore(fetchResult.getFetchedUrl())) {
        throw new RedirectException(Level.DEBUG, "Redirect page: " + curURL + " has already been seen");
    }
    curURL.setURL(fetchResult.getFetchedUrl());
    curURL.setDocid(docIdServer.getNewDocID(fetchResult.getFetchedUrl()));
}
parser.parse(page, curURL.getURL());          // parse the content into the Page
ParseData parseData = page.getParseData();
List<WebURL> toSchedule = new ArrayList<>();
for (WebURL webURL : parseData.getOutgoingUrls()) {
    int newdocid = docIdServer.getDocId(webURL.getURL());
    if (newdocid > 0) {
        // this is not the first time this URL is seen, so its depth is set to -1
        webURL.setDepth((short) -1);
        webURL.setDocid(newdocid);
    } else {
        // a new URL: give it a depth and, if it passes shouldVisit, queue it
        webURL.setDocid(-1);
        webURL.setDepth((short) (curURL.getDepth() + 1));
        // shouldVisit can be overridden to restrict which pages get crawled
        if (shouldVisit(page, webURL)) {
            webURL.setDocid(docIdServer.getNewDocID(webURL.getURL()));
            toSchedule.add(webURL);
        }
    }
}
// hand the newly discovered URLs to the global URL queue
frontier.scheduleAll(toSchedule);
// override visit() to process the fetched HTML
visit(page);
Many details are omitted here, but that is roughly how an HTTP request with status code 200 gets handled.
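In day-to-day use, the two methods a crawler subclass almost always overrides are shouldVisit (called in the loop above to decide whether an outgoing link gets scheduled) and visit (called with the finished Page). Continuing the hypothetical MyCrawler from before; the domain and the file-extension filter are purely illustrative:
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    // illustrative filter: skip common static and binary resources
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|zip|pdf))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // only follow links that stay on the example domain and are not binary files
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // called once for every successfully fetched and parsed page
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            logger.info("Visited {} ({} outgoing links, {} chars of text)",
                    page.getWebURL().getURL(),
                    html.getOutgoingUrls().size(),
                    html.getText().length());
        }
    }
}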
4.Page: represents a single page and stores the information associated with it (URL, status code, response headers, content, and the parse result).
5.CrawlController: the crawler controller. This is the master controller that starts the crawlers and keeps an eye on their state. Its constructor needs a CrawlConfig, a PageFetcher and a RobotstxtServer. Seeds (the pages the crawl starts from) are added with addSeed(String), and more than one can be added. The crawl is then started with the start method, which takes the Class object of a WebCrawler subclass and the number of crawlers to launch (a small usage sketch follows after the code below). Let's look at this start method:
for (int i = 1; i <= numberOfCrawlers; i++) {
    // create and start each crawler thread
    T crawler = crawlerFactory.newInstance();
    Thread thread = new Thread(crawler, "Crawler " + i);
    crawler.setThread(thread);
    crawler.init(i, this);
    thread.start();
    crawlers.add(crawler);
    threads.add(thread);
    logger.info("Crawler {} started", i);
}
// next, the controller starts a monitor thread that watches the crawler threads
Thread monitorThread = new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            synchronized (waitingLock) {
                while (true) {
                    sleep(10);
                    boolean someoneIsWorking = false;
                    for (int i = 0; i < threads.size(); i++) {
                        Thread thread = threads.get(i);
                        if (!thread.isAlive()) {
                            if (!shuttingDown) {
                                // this crawler thread died unexpectedly, so recreate it
                                logger.info("Thread {} was dead, I'll recreate it", i);
                                T crawler = crawlerFactory.newInstance();
                                thread = new Thread(crawler, "Crawler " + (i + 1));
                                threads.remove(i);
                                threads.add(i, thread);
                                crawler.setThread(thread);
                                crawler.init(i + 1, controller);
                                thread.start();
                                crawlers.remove(i);
                                crawlers.add(i, crawler);
                            }
                        } else if (crawlers.get(i).isNotWaitingForNewURLs()) {
                            someoneIsWorking = true;
                        }
                    }
                    boolean shut_on_empty = config.isShutdownOnEmptyQueue();
                    // consider shutting down only when no crawler is working
                    // and shutdownOnEmptyQueue is enabled
                    if (!someoneIsWorking && shut_on_empty) {
                        // Make sure again that none of the threads are alive.
                        logger.info("It looks like no thread is working, waiting for 10 seconds to make sure...");
                        sleep(10);
                        someoneIsWorking = false;
                        // check every crawler thread a second time
                        for (int i = 0; i < threads.size(); i++) {
                            Thread thread = threads.get(i);
                            if (thread.isAlive() && crawlers.get(i).isNotWaitingForNewURLs()) {
                                someoneIsWorking = true;
                            }
                        }
                        if (!someoneIsWorking) {
                            if (!shuttingDown) {
                                // are there still pages in the queue waiting to be crawled?
                                long queueLength = frontier.getQueueLength();
                                if (queueLength > 0) {
                                    continue;
                                }
                                logger.info("No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...");
                                sleep(10);
                                // check the queue once more; the double check guards against a false "finished" state
                                queueLength = frontier.getQueueLength();
                                if (queueLength > 0) {
                                    continue;
                                }
                            }
                            // all crawlers are done, so shut down the services
                            logger.info("All of the crawlers are stopped. Finishing the process...");
                            frontier.finish();
                            for (T crawler : crawlers) {
                                crawler.onBeforeExit();
                                crawlersLocalData.add(crawler.getMyLocalData());
                            }
                            logger.info("Waiting for 10 seconds before final clean up...");
                            sleep(10);
                            frontier.close();
                            docIdServer.close();
                            pageFetcher.shutDown();
                            finished = true;
                            waitingLock.notifyAll();
                            env.close();
                            return;
                        }
                    }
                }
            }
        } catch (Exception e) {
            logger.error("Unexpected Error", e);
        }
    }
});
monitorThread.start();
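To tie the classes together, here is a minimal sketch of how a crawl is typically wired up through CrawlController; the storage folder, seed URL, and crawler count are placeholders, and MyCrawler is the hypothetical WebCrawler subclass sketched earlier:
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");   // placeholder storage folder

        // the controller needs a CrawlConfig, a PageFetcher and a RobotstxtServer
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // seeds are the pages the crawl starts from; more than one can be added
        controller.addSeed("https://www.example.com/");   // placeholder seed

        // start blocks until the monitor thread shown above decides the crawl is finished
        controller.start(MyCrawler.class, 4);             // 4 crawler threads
    }
}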
There are still plenty of details I haven't gone into. I can't help admiring the author; just reading the code is impressive enough. Still, I hope I've managed to learn something from it.