
Fixing garbled text when extracting page content with boilerpipe

丁晋

2023-12-01

Requirement:

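Extract the text content of a page with boilerpipe. When the page is fetched through the URL's openStream, the extracted text can come out garbled. The fix is to fetch the response body as a byte stream with jsoup and pass those bytes to boilerpipe.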
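To illustrate why keeping the raw bytes helps, here is a minimal sketch using only the Java standard library. It is not what boilerpipe or its HTML parser does internally; the class name CharsetSniffDemo and the method decode are hypothetical, and the charset sniffing is deliberately simplified. It shows that decoding bytes with the wrong charset garbles non-ASCII text, while decoding the same bytes with the charset the page declares preserves it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffDemo {
    // Simplified pattern for a charset declaration such as <meta charset="UTF-8">.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w-]*)", Pattern.CASE_INSENSITIVE);

    /** Decodes raw page bytes with the charset the page declares, else the fallback. */
    public static String decode(byte[] body, Charset fallback) {
        // ISO-8859-1 maps each byte to one char, so the ASCII markup can be scanned safely.
        String probe = new String(body, 0, Math.min(body.length, 1024),
                StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(probe);
        Charset cs = fallback;
        if (m.find() && Charset.isSupported(m.group(1))) {
            cs = Charset.forName(m.group(1));
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is two Chinese characters, written as escapes so the
        // source file compiles the same way under any compiler encoding.
        String text = "\u4e2d\u6587";
        String html = "<html><head><meta charset=\"UTF-8\"></head><body>"
                + text + "</body></html>";
        byte[] body = html.getBytes(StandardCharsets.UTF_8);

        // Decoding early with the wrong charset loses the original characters.
        String wrong = new String(body, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.contains(text)); // prints false

        // Keeping the bytes and honoring the declared charset preserves them.
        String right = decode(body, StandardCharsets.ISO_8859_1);
        System.out.println(right.contains(text)); // prints true
    }
}
```

The point of the sketch is the ordering: the bytes stay untouched until the declared encoding is known, which is the same reason for handing a byte stream rather than pre-decoded text to the parser.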

Implementation:

JAR dependency:

<dependency>
	<groupId>com.syncthemall</groupId>
	<artifactId>boilerpipe</artifactId>
	<version>1.2.2</version>
</dependency>

Extraction code:

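import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

private String extractContent(String url) throws Exception {
	// getEmptyConnection(url) is a helper not shown here; presumably it returns a jsoup Connection.
	// Fetch the body as raw bytes so it is not decoded with the wrong charset before parsing.
	InputStream inputStream = new ByteArrayInputStream(getEmptyConnection(
			url).execute().bodyAsBytes());

	TextDocument doc = new BoilerpipeSAXInput(new InputSource(inputStream))
			.getTextDocument();

	BoilerpipeExtractor extractor = CommonExtractors.DEFAULT_EXTRACTOR;
	extractor.process(doc);
	return doc.getContent();
}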
