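Some friends said they don't know how to set up the replace and find/filter rules, so I'll go through the sites one at a time. I don't have much time, so I'll post one per day. This time it's 雯雯文学.
First, you need to filter out the site's ads. The filter entries are in the FilterPattern of <PubContentText>; use them as a reference. There may be ads I haven't noticed, so click through a few of the site's inner pages and look for them. www.518cqdl.com
This rule works with the 易读 collector. I don't know whether it works with 关关.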
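To make the filtering step easier to test outside the collector, here is a minimal Python sketch. It assumes that each line of the FilterPattern is treated as its own regular expression and that every match is deleted from the chapter text; the collector's real matching behaviour may differ. The helper names `apply_filters` and `leftover_domains` and the sample HTML are made up for illustration.

```python
import re

# Filter entries copied from the <PubContentText> FilterPattern in this rule.
# Assumption: one regex per line, every match is removed from the text.
FILTER_LINES = [
    r"河溪小说",
    r"手机站-m.518cqd.com",
    r"www.518cqdL.com",
    r"m.518cqdL.com",
    r"<script.+?</script>|<div.+?>|</div>|<p>|</p>",
    r"【<b>(.|\n)*?</B>】♂",
]


def apply_filters(text, patterns=FILTER_LINES):
    """Delete every match of every filter pattern, one pattern after another."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


def leftover_domains(text):
    """List domain-like strings still present, as candidates for new filter lines."""
    return re.findall(r"(?:[a-z0-9-]+\.)+(?:com|net|cn)\b", text, re.IGNORECASE)


# Hypothetical chapter fragment, not taken from the real site.
sample = "<p>Chapter 1</p>www.518cqdL.com<script>ad()</script><p>Body text</p>"

cleaned = apply_filters(sample)
print(cleaned)
print(leftover_domains(cleaned))
```

Running `leftover_domains` on a few filtered chapters is one way to spot ad links that the current filter list misses.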
<?xml version="1.0" encoding="UTF-8"?>
<RuleConfigInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<NovelIntro>
<RegexName>NovelIntro</RegexName>
<Pattern>&lt;meta property="og:description" content="((.|\n)*?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelIntro>
<PubContentText>
<RegexName>PubContentText</RegexName>
<Pattern>&lt;div id="content"&gt;((.|\n)*?)&lt;/div&gt;</Pattern>
<Method/>
<FilterPattern>河溪小说
手机站-m.518cqd.com
www.518cqdL.com
m.518cqdL.com
&lt;script.+?&lt;/script&gt;|&lt;div.+?&gt;|&lt;/div&gt;|&lt;p&gt;|&lt;/p&gt;
【&lt;b&gt;(.|\n)*?&lt;/B&gt;】♂</FilterPattern>
<Options/>
</PubContentText>
<NovelSearchUrl>
<RegexName>NovelSearchUrl</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</NovelSearchUrl>
<NovelList_GetNovelKey>
<RegexName>NovelList_GetNovelKey</RegexName>
<Pattern>&lt;span class="s2"&gt;&lt;a href="/info/.+?/(.+?).html"&gt;.+?&lt;/a&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelList_GetNovelKey>
<NovelListUrl>
<RegexName>NovelListUrl</RegexName>
<Pattern>https://www.518cqdL.com/list/1.html
https://www.518cqdL.com/list/2.html
https://www.518cqdL.com/list/3.html
https://www.518cqdL.com/list/4.html
https://www.518cqdL.com/list/5.html
https://www.518cqdL.com/list/6.html
https://www.518cqdL.com/list/7.html
https://www.518cqdL.com/list/8.html
https://www.518cqdL.com/list/9.html
https://www.518cqdL.com/list/10.html</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelListUrl>
<PubChapterRegion>
<RegexName>PubChapterRegion</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</PubChapterRegion>
<NovelName>
<RegexName>NovelName</RegexName>
<Pattern>&lt;meta property="og:title" content="(.+?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelName>
<NovelSearch_GetNovelName>
<RegexName>NovelSearch_GetNovelName</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</NovelSearch_GetNovelName>
<NovelList_GetNovelKey2>
<RegexName>NovelList_GetNovelKey2</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</NovelList_GetNovelKey2>
<LagerSort>
<RegexName>LagerSort</RegexName>
<Pattern>&lt;meta property="og:novel:category" content="(.+?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</LagerSort>
<SmallSort>
<RegexName>SmallSort</RegexName>
<Pattern>&lt;meta property="og:novel:category" content="(.+?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</SmallSort>
<GetSiteUrl>
<RegexName>GetSiteUrl</RegexName>
<Pattern>https://www.518cqdL.com</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</GetSiteUrl>
<TestSearchNovelName>
<RegexName>TestSearchNovelName</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</TestSearchNovelName>
<NovelDegree>
<RegexName>NovelDegree</RegexName>
<Pattern>&lt;meta property="og:novel:status" content="(.+?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelDegree>
<PubContentText_FT2JT>
<RegexName>PubContentText_FT2JT</RegexName>
<Pattern>false</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</PubContentText_FT2JT>
<NovelAuthor>
<RegexName>NovelAuthor</RegexName>
<Pattern>&lt;meta property="og:novel:author" content="(.+?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelAuthor>
<NovelInfo_GetNovelPubKey>
<RegexName>NovelInfo_GetNovelPubKey</RegexName>
<Pattern>&lt;meta property="og:novel:read_url" content="(.+?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelInfo_GetNovelPubKey>
<PubContentText_ASCII>
<RegexName>PubContentText_ASCII</RegexName>
<Pattern>false</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</PubContentText_ASCII>
<NovelCover>
<RegexName>NovelCover</RegexName>
<Pattern>&lt;meta property="og:image" content="(.+?)"/&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelCover>
<RuleVersion>
<RegexName>RuleVersion</RegexName>
<Pattern>2</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</RuleVersion>
<PubContentText_BJ2QJ>
<RegexName>PubContentText_BJ2QJ</RegexName>
<Pattern>false</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</PubContentText_BJ2QJ>
<NovelInfoExtra>
<RegexName>NovelInfoExtra</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</NovelInfoExtra>
<PubIndexUrl>
<RegexName>PubIndexUrl</RegexName>
<Pattern>{NovelPubKey}</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</PubIndexUrl>
<NovelDefaultCoverUrl>
<RegexName>NovelDefaultCoverUrl</RegexName>
<Pattern>https://www.518cqdL.com/cover/nocover.jpg</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelDefaultCoverUrl>
<PubContentUrl2>
<RegexName>PubContentUrl2</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</PubContentUrl2>
<PubContentUrl>
<RegexName>PubContentUrl</RegexName>
<Pattern>{ChapterKey}</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</PubContentUrl>
<GetSiteName>
<RegexName>GetSiteName</RegexName>
<Pattern>518cqdL.com</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</GetSiteName>
<PubChapterName>
<RegexName>PubChapterName</RegexName>
<Pattern>&lt;a href=".+?" title=".+?"&gt;(.+?)&lt;/a&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</PubChapterName>
<GetSiteCharset>
<RegexName>GetSiteCharset</RegexName>
<Pattern>utf8</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</GetSiteCharset>
<PubChapter_GetChapterKey>
<RegexName>PubChapter_GetChapterKey</RegexName>
<Pattern>&lt;a href="(.+?)" title=".+?"&gt;.+?&lt;/a&gt;</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</PubChapter_GetChapterKey>
<NovelSearch_GetNovelKey>
<RegexName>NovelSearch_GetNovelKey</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</NovelSearch_GetNovelKey>
<NovelKeyword>
<RegexName>NovelKeyword</RegexName>
<Pattern/>
<Method/>
<FilterPattern/>
<Options/>
</NovelKeyword>
<NovelUrl>
<RegexName>NovelUrl</RegexName>
<Pattern>https://www.518cqdL.com/info/10/{NovelKey}.html</Pattern>
<Method/>
<FilterPattern/>
<Options/>
</NovelUrl>
</RuleConfigInfo>
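As a rough illustration of how the list and info-page patterns in the rule fit together, the Python sketch below pulls novel keys from a list page with the NovelList_GetNovelKey pattern, builds info-page URLs from the NovelUrl template, and reads the title with the NovelName pattern. It assumes the collector simply substitutes the captured key for `{NovelKey}`; the function names and the sample HTML are hypothetical.

```python
import re

# Patterns and URL template copied from the rule above.
NOVEL_KEY_PATTERN = r'<span class="s2"><a href="/info/.+?/(.+?).html">.+?</a>'
NOVEL_URL_TEMPLATE = "https://www.518cqdL.com/info/10/{NovelKey}.html"
NOVEL_NAME_PATTERN = r'<meta property="og:title" content="(.+?)"/>'


def novel_urls(list_html):
    """Return an info-page URL for every novel key found on a list page."""
    keys = re.findall(NOVEL_KEY_PATTERN, list_html)
    # Assumption: the placeholder is replaced literally by the captured key.
    return [NOVEL_URL_TEMPLATE.replace("{NovelKey}", key) for key in keys]


def novel_name(info_html):
    """Return the title captured by the NovelName pattern, or None if absent."""
    match = re.search(NOVEL_NAME_PATTERN, info_html)
    return match.group(1) if match else None


# Hypothetical page fragments, not taken from the real site.
list_html = '<li><span class="s2"><a href="/info/10/10234.html">Sample Novel</a></span></li>'
info_html = '<head><meta property="og:title" content="Sample Novel"/></head>'

print(novel_urls(list_html))
print(novel_name(info_html))
```

If a pattern returns nothing on a saved copy of the real page, the page markup has probably changed and the rule needs adjusting.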
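I didn't look at the filtering very closely. If you need this collection rule, browse more of the site's chapter pages and check what ad text it inserts.
There aren't many 易读 sites. I searched and found these: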
www.vgango.com
www.dosrojos.com
www.aavpccv.com
www.infected-mushroom.net
www.peoLpLe.com
www.hexaworLd.net
www.athomechecking.com
www.888cqdL.cn
www.666cqdL.cn
www.178cqdL.cn
www.next-bet.com
www.cosender.com
www.vivaLuta.com
www.sandyall.com
All of these sites can reuse this rule; just change the filters and the domain.
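The domain swap can be sketched in a few lines of Python. This assumes the rule is handled as plain text and that replacing the bare domain (without the `www.` or `m.` prefix) is enough; `retarget_rule` is a hypothetical helper name, and the snippet is a shortened stand-in for the full rule file.

```python
import re


def retarget_rule(rule_xml, old_domain, new_domain):
    """Replace old_domain with new_domain everywhere, ignoring letter case."""
    pattern = re.compile(re.escape(old_domain), re.IGNORECASE)
    # A function replacement keeps new_domain literal (no backslash processing).
    return pattern.sub(lambda _match: new_domain, rule_xml)


# Shortened stand-in for the rule file, for illustration only.
snippet = (
    "<Pattern>https://www.518cqdL.com/list/1.html</Pattern>\n"
    "<FilterPattern>m.518cqdL.com</FilterPattern>"
)

print(retarget_rule(snippet, "518cqdL.com", "888cqdL.cn"))
```

The ad text in FilterPattern still needs a manual pass afterwards, since each site may insert different ads.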