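Requirement:
Extract the text content of a web page with boilerpipe. Fetching the page through the URL's openStream can produce garbled text (mojibake); the fix is to fetch the response body as raw bytes with jsoup and hand those bytes to boilerpipe.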
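One common cause of this kind of garbling is decoding the response bytes with a charset that does not match the page's actual encoding. The following is a minimal sketch of that failure mode using only the JDK. The class name CharsetMismatchDemo and the sample string are assumptions made for illustration, and the actual cause on a given site may differ (for example, a missing or wrong charset in the Content-Type header).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of boilerpipe or jsoup.
class CharsetMismatchDemo {

    // Decode raw response bytes with the given charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        // Bytes as a server might send them for a UTF-8 encoded page.
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        // Wrong guess (ISO-8859-1): each byte becomes one unrelated character.
        System.out.println(decode(body, StandardCharsets.ISO_8859_1));
        // Matching charset: the original text is recovered.
        System.out.println(decode(body, StandardCharsets.UTF_8));
    }
}
```

Keeping the body as a byte[] until it reaches the HTML parser avoids committing to a charset too early, which is the idea behind the jsoup-based approach here.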
Implementation:
Jar dependency (boilerpipe; jsoup must also be on the classpath for the fetch step):
<dependency>
    <groupId>com.syncthemall</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.2.2</version>
</dependency>
Extraction implementation:
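import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

private String extractContent(String url) throws Exception {
    // getEmptyConnection(url) is a helper not shown here; it is expected to return a jsoup Connection.
    // Fetch the body as raw bytes so the HTML parser, not the fetch code, decides how to decode them.
    InputStream inputStream = new ByteArrayInputStream(getEmptyConnection(url).execute().bodyAsBytes());
    TextDocument doc = new BoilerpipeSAXInput(new InputSource(inputStream)).getTextDocument();
    // Classify the text blocks, then return only the ones marked as content.
    BoilerpipeExtractor extractor = CommonExtractors.DEFAULT_EXTRACTOR;
    extractor.process(doc);
    return doc.getContent();
}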
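If the text is still garbled after switching to raw bytes, one option is to tell the SAX InputSource which encoding to use instead of relying on detection. The sketch below assumes the charset name is already known, for example from jsoup's Response.charset(). The class InputSources and the method fromBytes are hypothetical names, and whether the underlying HTML parser honors the hint should be verified for the boilerpipe version in use.

```java
import java.io.ByteArrayInputStream;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
class InputSources {

    // Wrap raw response bytes in an InputSource, optionally with an explicit encoding hint.
    // charsetName may be null or empty, in which case the parser is left to detect the encoding.
    static InputSource fromBytes(byte[] body, String charsetName) {
        InputSource source = new InputSource(new ByteArrayInputStream(body));
        if (charsetName != null && !charsetName.isEmpty()) {
            source.setEncoding(charsetName);
        }
        return source;
    }
}
```

The result can be passed to new BoilerpipeSAXInput(...) in place of the plain new InputSource(inputStream).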