转载请注明出处:http://blog.csdn.net/xiaojimanman/article/details/46785223
http://www.llwjy.com/blogdetail/efda32f346445dd8423a942aa4c8c2cd.html
个人博客站已经上线了,网址 www.llwjy.com ~欢迎各位吐槽~-------------------------------------------------------------------------------------------------
首先和大家说一生抱歉,由于最近经常在外面出差,博客断更了很长时间,后面不出意外的话,博客会恢复更新。
在上次的博客中已经介绍了纵横小说的数据库表结构,这里需要说明的是,我在设计数据表的时候,取消了数据表之间的外键,至于为什么这样做这里就不再多说,感兴趣的可以自行百度下。下面我们就开始今天的介绍:
模版类
在介绍数据库的操作之前,我们首先看一下定义的模版(javabean),这里定义了四个模版分别为抓取入口信息模版、小说简介页模版、小说章节列表模版、小说阅读页模版,类中只有一些简单的set和put方法,下面就看下具体的代码实现:
1.CrawlListInfo
/**
*@Description:
*/
package com.lulei.crawl.novel.zongheng.model;
public class CrawlListInfo {
private String url;
private String info;
private int frequency;
public String getUrl() {
return url;
}
public void setUrl(String url) {
this.url = url;
}
public String getInfo() {
return info;
}
public void setInfo(String info) {
this.info = info;
}
public int getFrequency() {
return frequency;
}
public void setFrequency(int frequency) {
this.frequency = frequency;
}
}
2.NovelIntroModel
/**
*@Description:
*/
package com.lulei.crawl.novel.zongheng.model;
public class NovelIntroModel {
private String md5Id;
private String name;
private String author;
private String description;
private String type;
private String lastChapter;
private String chapterlisturl;
private int wordCount;
private String keyWords;
private int chapterCount;
public String getMd5Id() {
return md5Id;
}
public void setMd5Id(String md5Id) {
this.md5Id = md5Id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getAuthor() {
return author;
}
public void setAuthor(String author) {
this.author = author;
}
public String getDescription() {
return description;
}
public void setDescription(String description) {
this.description = description;
}
public String getType() {
return type;
}
public void setType(String type) {
this.type = type;
}
public String getLastChapter() {
return lastChapter;
}
public void setLastChapter(String lastChapter) {
this.lastChapter = lastChapter;
}
public String getChapterlisturl() {
return chapterlisturl;
}
public void setChapterlisturl(String chapterlisturl) {
this.chapterlisturl = chapterlisturl;
}
public int getWordCount() {
return wordCount;
}
public void setWordCount(int wordCount) {
this.wordCount = wordCount;
}
public String getKeyWords() {
return keyWords;
}
public void setKeyWords(String keyWords) {
this.keyWords = keyWords;
}
public int getChapterCount() {
return chapterCount;
}
public void setChapterCount(int chapterCount) {
this.chapterCount = chapterCount;
}
}
3.NovelChapterModel
/**
*@Description:
*/
package com.lulei.crawl.novel.zongheng.model;
public class NovelChapterModel {
private String url;
private int chapterId;
private long time;
public String getUrl() {
return url;
}
public void setUrl(String url) {
this.url = url;
}
public int getChapterId() {
return chapterId;
}
public void setChapterId(int chapterId) {
this.chapterId = chapterId;
}
public long getTime() {
return time;
}
public void setTime(long time) {
this.time = time;
}
}
4.NovelReadModel
/**
*@Description:
*/
package com.lulei.crawl.novel.zongheng.model;
public class NovelReadModel extends NovelChapterModel {
private String title;
private int wordCount;
private String content;
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public int getWordCount() {
return wordCount;
}
public void setWordCount(int wordCount) {
this.wordCount = wordCount;
}
public String getContent() {
return content;
}
public void setContent(String content) {
this.content = content;
}
}
这里的数据库操作我们使用博客《基于lucene的案例开发:数据库连接池》介绍的数据库连接池,在采集这个业务过程中,主要是插入和查询操作,当然还有记录的状态值的更新操作,下面我们就每一个操作介绍一个方法,方面大家理解如何使用我们自己的数据库连接池操作来完成数据库的增删改查操作。
1.数据表查询:随机获取一条记录
我们之后的爬虫希望可以做成分布式的采集,因此这里我们在获取简介页的URL时候,我们可以每次获取一个随机值,这样在线程之间出现同时采集一个URL的情况就会大大降低,至于Mysql中的随机我们可以使用 order by rand() limit n 来获取前n条记录,其他的数据库实现方式稍微有点差异。
/**
* @param state
* @return
* @Author:lulei
* @Description: 随机获取一个简介url
*/
public String getRandIntroPageUrl(int state) {
DBServer dbServer = new DBServer(POOLNAME);
try {
String sql = "select * from novelinfo where state = '" + state + "' order by rand() limit 1";
ResultSet rs = dbServer.select(sql);
while (rs.next()) {
return rs.getString("url");
}
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
return null;
}
在这个方法中,我们直接使用DBServer中的select(String sql)方法即可执行对应的sql语句,他的返回值就是查询的结果集。
2.数据表更新:修改简介页的抓取状态
在简介页一次采集完成之后或者更新列表页检测到该简介页有更新的时候,我们需要对小说的简介页的抓取状态进行修改,标识这个简介页已经完成采集或需要采集,我们直接使用DBServer中的update(String sql)方法即可执行对应的sql语句。
/**
* @param md5Id
* @param state
* @Author:lulei
* @Description: 修改简介页的抓取状态
*/
public void updateInfoState(String md5Id, int state) {
DBServer dbServer = new DBServer(POOLNAME);
try {
String sql = "update novelinfo set state = '" + state + "' where id = '" + md5Id + "'";
dbServer.update(sql);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
3.数据表插入:保存小说阅读页信息
在完成小说阅读页数据解析之后,我们需要将解析后的数据持久化到数据库中,这里我们可以使用DBServer中的insert(String table, String columns, HashMap<Integer, Object> params)方法即可执行相关的插入操作。
/**
* @param novel
* @Author:lulei
* @Description: 保存小说阅读页信息
*/
public void saveNovelRead(NovelReadModel novel) {
if (novel == null) {
return;
}
DBServer dbServer = new DBServer(POOLNAME);
try {
HashMap<Integer, Object> params = new HashMap<Integer, Object>();
int i = 1;
String md5Id = ParseMD5.parseStrToMd5L32(novel.getUrl());
//如果已经存在,则直接返回
if (haveReadUrl(md5Id)) {
return;
}
long now = System.currentTimeMillis();
params.put(i++, md5Id);
params.put(i++, novel.getUrl());
params.put(i++, novel.getTitle());
params.put(i++, novel.getWordCount());
params.put(i++, novel.getChapterId());
params.put(i++, novel.getContent());
params.put(i++, novel.getTime());
params.put(i++, now);
params.put(i++, now);
dbServer.insert("novelchapterdetail", "id,url,title,wordcount,chapterid,content,chaptertime,createtime,updatetime", params);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
完整的纵横小说数据库操作类代码如下:
/**
*@Description: 纵横中文小说数据库操作
*/
package com.lulei.db.novel.zongheng;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import com.lulei.crawl.novel.zongheng.model.CrawlListInfo;
import com.lulei.crawl.novel.zongheng.model.NovelChapterModel;
import com.lulei.crawl.novel.zongheng.model.NovelIntroModel;
import com.lulei.crawl.novel.zongheng.model.NovelReadModel;
import com.lulei.db.manager.DBServer;
import com.lulei.util.ParseMD5;
public class ZonghengDb {
private static final String POOLNAME = "proxool.test";
/**
* @param urls
* @Author:lulei
* @Description: 保存更新列表采集到的URL
*/
public void saveInfoUrls(List<String> urls) {
if (urls == null || urls.size() < 1) {
return;
}
for (String url : urls) {
String md5Id = ParseMD5.parseStrToMd5L32(url);
if (haveInfoUrl(md5Id)) {
updateInfoState(md5Id, 1);
} else {
insertInfoUrl(md5Id, url);
}
}
}
/**
* @param state
* @return
* @Author:lulei
* @Description: 随机获取一个简介url
*/
public String getRandIntroPageUrl(int state) {
DBServer dbServer = new DBServer(POOLNAME);
try {
String sql = "select * from novelinfo where state = '" + state + "' order by rand() limit 1";
ResultSet rs = dbServer.select(sql);
while (rs.next()) {
return rs.getString("url");
}
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
return null;
}
/**
* @param state
* @return
* @Author:lulei
* @Description: 随机获取一个章节信息
*/
public NovelChapterModel getRandReadPageUrl(int state) {
DBServer dbServer = new DBServer(POOLNAME);
try {
String sql = "select * from novelchapter where state = '" + state + "' order by rand() limit 1";
ResultSet rs = dbServer.select(sql);
while (rs.next()) {
NovelChapterModel chapter = new NovelChapterModel();
chapter.setChapterId(rs.getInt("chapterid"));
chapter.setTime(rs.getLong("chaptertime"));
chapter.setUrl(rs.getString("url"));
return chapter;
}
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
return null;
}
/**
* @param novel
* @Author:lulei
* @Description: 保存小说阅读页信息
*/
public void saveNovelRead(NovelReadModel novel) {
if (novel == null) {
return;
}
DBServer dbServer = new DBServer(POOLNAME);
try {
HashMap<Integer, Object> params = new HashMap<Integer, Object>();
int i = 1;
String md5Id = ParseMD5.parseStrToMd5L32(novel.getUrl());
//如果已经存在,则直接返回
if (haveReadUrl(md5Id)) {
return;
}
long now = System.currentTimeMillis();
params.put(i++, md5Id);
params.put(i++, novel.getUrl());
params.put(i++, novel.getTitle());
params.put(i++, novel.getWordCount());
params.put(i++, novel.getChapterId());
params.put(i++, novel.getContent());
params.put(i++, novel.getTime());
params.put(i++, now);
params.put(i++, now);
dbServer.insert("novelchapterdetail", "id,url,title,wordcount,chapterid,content,chaptertime,createtime,updatetime", params);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
/**
* @return
* @Author:lulei
* @Description: 获取监控的更新列表页
*/
public List<CrawlListInfo> getCrawlListInfos(){
List<CrawlListInfo> infos = new ArrayList<CrawlListInfo>();
DBServer dbServer = new DBServer(POOLNAME);
try {
String sql = "select * from crawllist where state = '1'";
ResultSet rs = dbServer.select(sql);
while (rs.next()) {
CrawlListInfo info = new CrawlListInfo();
infos.add(info);
info.setFrequency(rs.getInt("frequency"));
info.setInfo(rs.getString("info"));
info.setUrl(rs.getString("url"));
}
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
return infos;
}
/**
* @param bean
* @Author:lulei
* @Description: 更新简介页记录
*/
public void updateInfo(NovelIntroModel bean) {
if (bean == null) {
return;
}
DBServer dbServer = new DBServer(POOLNAME);
try {
HashMap<Integer, Object> params = new HashMap<Integer, Object>();
int i = 1;
params.put(i++, bean.getName());
params.put(i++, bean.getAuthor());
params.put(i++, bean.getDescription());
params.put(i++, bean.getType());
params.put(i++, bean.getLastChapter());
params.put(i++, bean.getChapterCount());
params.put(i++, bean.getChapterlisturl());
params.put(i++, bean.getWordCount());
params.put(i++, bean.getKeyWords());
long now = System.currentTimeMillis();
params.put(i++, now);
params.put(i++, "0");
String columns = "name, author, description, type, lastchapter, chaptercount, chapterlisturl, wordcount, keywords, updatetime, state";
String condition = "where id = '" + bean.getMd5Id() + "'";
dbServer.update("novelinfo", columns, condition, params);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
/**
* @param chapters
* @Author:lulei
* @Description: 保存章节列表信息
*/
public void saveChapters(List<String[]> chapters) {
if (chapters == null) {
return;
}
DBServer dbServer = new DBServer(POOLNAME);
try {
for (int i = 0; i < chapters.size(); i++) {
String[] chapter = chapters.get(i);
if (chapter.length != 4) {
continue;
}
//name、wordcount、time、url
String md5Id = ParseMD5.parseStrToMd5L32(chapter[3]);
if (!haveChapterUrl(md5Id)) {
insertChapterUrl(chapter, i);
}
}
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
/**
* @param md5Id
* @param state
* @Author:lulei
* @Description: 修改简介页的抓取状态
*/
public void updateInfoState(String md5Id, int state) {
DBServer dbServer = new DBServer(POOLNAME);
try {
String sql = "update novelinfo set state = '" + state + "' where id = '" + md5Id + "'";
dbServer.update(sql);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
/**
* @param md5Id
* @param state
* @Author:lulei
* @Description: 更新章节列表采集状态
*/
public void updateChapterState(String md5Id, int state) {
DBServer dbServer = new DBServer(POOLNAME);
try {
String sql = "update novelchapter set state = '" + state + "' where id = '" + md5Id + "'";
dbServer.update(sql);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
/**
* @param md5Id
* @param url
* @Author:lulei
* @Description: 新增一个抓取简介页
*/
private void insertInfoUrl(String md5Id, String url) {
DBServer dbServer = new DBServer(POOLNAME);
try {
HashMap<Integer, Object> params = new HashMap<Integer, Object>();
int i = 1;
params.put(i++, md5Id);
params.put(i++, url);
long now = System.currentTimeMillis();
params.put(i++, now);
params.put(i++, now);
params.put(i++, "1");
dbServer.insert("novelinfo", "id, url, createtime, updatetime, state", params);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
/**
* @param md5Id
* @return
* @Author:lulei
* @Description: 判断简介页是否存在
*/
private boolean haveInfoUrl(String md5Id) {
DBServer dbServer = new DBServer(POOLNAME);
try {
ResultSet rs = dbServer.select("select sum(1) as count from novelinfo where id = '" + md5Id + "'");
if (rs.next()) {
int count = rs.getInt("count");
return count > 0;
}
return false;
} catch (Exception e) {
e.printStackTrace();
return true;
} finally{
dbServer.close();
}
}
/**
* @param md5Id
* @return
* @Author:lulei
* @Description: 判断阅读页信息是否存在
*/
private boolean haveReadUrl(String md5Id) {
DBServer dbServer = new DBServer(POOLNAME);
try {
ResultSet rs = dbServer.select("select sum(1) as count from novelchapterdetail where id = '" + md5Id + "'");
if (rs.next()) {
int count = rs.getInt("count");
return count > 0;
}
return false;
} catch (Exception e) {
e.printStackTrace();
return true;
} finally{
dbServer.close();
}
}
/**
* @param chapter
* @param chapterId
* @Author:lulei
* @Description: 插入章节列表页信息
*/
private void insertChapterUrl(String[] chapter, int chapterId) {
//name、wordcount、time、url
DBServer dbServer = new DBServer(POOLNAME);
try {
HashMap<Integer, Object> params = new HashMap<Integer, Object>();
int i = 1;
params.put(i++, ParseMD5.parseStrToMd5L32(chapter[3]));
params.put(i++, chapter[3]);
params.put(i++, chapter[0]);
params.put(i++, chapter[1]);
params.put(i++, chapterId);
params.put(i++, chapter[2]);
long now = System.currentTimeMillis();
params.put(i++, now);
params.put(i++, "1");
dbServer.insert("novelchapter", "id, url, title, wordcount, chapterid, chaptertime, createtime, state", params);
} catch (Exception e) {
e.printStackTrace();
} finally{
dbServer.close();
}
}
/**
* @param md5Id
* @return
* @Author:lulei
* @Description: 是否存在章节信息
*/
private boolean haveChapterUrl(String md5Id) {
DBServer dbServer = new DBServer(POOLNAME);
try {
ResultSet rs = dbServer.select("select sum(1) as count from novelchapter where id = '" + md5Id + "'");
if (rs.next()) {
int count = rs.getInt("count");
return count > 0;
}
return false;
} catch (Exception e) {
e.printStackTrace();
return true;
} finally{
dbServer.close();
}
}
public static void main(String[] args) {
// TODO Auto-generated method stub
}
}
对于上面的代码还希望大家可以认真的阅读下,里面有一些简单的去重操作;在下一篇博客中我们将会介绍如何基于这写数据库操作来实现分布式采集。
----------------------------------------------------------------------------------------------------
ps:最近发现其他网站可能会对博客转载,上面并没有源链接,如想查看更多关于 基于lucene的案例开发 请点击这里。或访问网址http://blog.csdn.net/xiaojimanman/article/category/2841877 或 http://www.llwjy.com/blogtype/lucene.html
-------------------------------------------------------------------------------------------------
小福利
-------------------------------------------------------------------------------------------------
个人在极客学院上《Lucene案例开发》课程已经上线了(目前上线到第二课),欢迎大家吐槽~