Handling special characters when converting HTML to PDF with the open-source library openhtmltopdf

长孙正卿

2023-12-01

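Overview

Converting HTML to PDF is a routine requirement, and also a rather frustrating one: all sorts of seemingly unreasonable problems can come up. This article looks at how to convert HTML to PDF correctly when the HTML contains special characters. For other related caveats, see my other articles.

Basic `html` to `pdf` conversion

This section shows how to convert HTML to PDF with the open-source library openhtmltopdf. For more detailed documentation, see the openhtmltopdf user manual.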

import java.io.FileOutputStream;
import java.io.OutputStream;
import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;

public class SimpleUsage 
{
    public static void main(String[] args) throws Exception { 
        try (OutputStream os = new FileOutputStream("/Users/me/Documents/pdf/out.pdf")) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFastMode();
            builder.withUri("file:///Users/me/Documents/pdf/in.htm");
            builder.toStream(os);
            builder.run();
        }
    }
}

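The problem

The code above converts HTML to PDF, but the conversion fails if the HTML contains a special character such as an unescaped &. By default the input is parsed as a strict XML document, which must be well-formed: special characters have to be written as escaped entities (for example &amp; for &), otherwise the parser cannot handle them and reports an error.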
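To make the failure mode concrete, here is a minimal sketch that feeds the same markup, once with a raw & and once with the escaped form, to the JDK's own XML parser. It does not call openhtmltopdf at all; the class name `StrictXmlDemo` and the helper `isWellFormed` are hypothetical names chosen for illustration, and the sketch only shows how a strict XML parser in general treats an unescaped ampersand.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it uses only the JDK's XML parser, not openhtmltopdf.
class StrictXmlDemo {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        parser.setErrorHandler(new DefaultHandler());
        try {
            parser.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // A raw '&' that does not start an entity is a fatal well-formedness error.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<p>Tom & Jerry</p>"));     // prints false
        System.out.println(isWellFormed("<p>Tom &amp; Jerry</p>")); // prints true
    }
}
```

Real-world HTML often contains such characters in text or in query strings inside links, which is why a more lenient HTML parser is useful before the PDF step.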

Solution

Step 1: Parse the HTML into a standard `w3c DOM`. The code is as follows.

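	import java.io.File;
	import java.io.IOException;
	import java.net.URL;
	import org.jsoup.Jsoup;
	import org.jsoup.helper.W3CDom;

	// Static so that it can be called from the static getPdfByHtml method in step 2.
	public static org.w3c.dom.Document html5ParseDocument(String urlStr, int timeoutMs) throws IOException 
	{
		URL url = new URL(urlStr);
		org.jsoup.nodes.Document doc;
		
		if (url.getProtocol().equalsIgnoreCase("file")) {
			// Local file: read it from disk as UTF-8.
			doc = Jsoup.parse(new File(url.getPath()), "UTF-8");
		}
		else {
			// Remote document: fetch it with the given timeout in milliseconds.
			doc = Jsoup.parse(url, timeoutMs);	
		}
		// Should reuse W3CDom instance if converting multiple documents.
		return new W3CDom().fromJsoup(doc);
	}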

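**Note:** html5ParseDocument() takes two parameters. The first is the path or URL of the HTML file; for a local file, prefix the absolute path with file://, for example file:///user/temp/test.html. The second is the timeout, in milliseconds, for fetching and parsing remote HTML; it has no special requirements.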
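If counting slashes by hand feels error-prone, one option is to let the JDK build the file URL. The sketch below is only an illustration under that assumption; `FileUrlDemo` and `toFileUrl` are hypothetical names that do not appear in the original code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for building the first argument of html5ParseDocument.
class FileUrlDemo {

    // Converts a local path to a file URL string; relative paths are made absolute first.
    static String toFileUrl(String localPath) {
        Path absolute = Paths.get(localPath).toAbsolutePath();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // On a Unix-like system this yields a URL of the form file:///user/temp/test.html
        System.out.println(toFileUrl("/user/temp/test.html"));
    }
}
```

The returned string can then be passed as the first argument in place of a manually concatenated "file://" + path.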

Step 2: Convert the `w3c DOM` to `pdf`

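    // Assumes a class-level `logger` (for example an SLF4J-style Logger) and the
    // html5ParseDocument method from step 1 in the same class.
    public static void getPdfByHtml(String htmlPath, String pdfPath) {
        // try-with-resources flushes and closes the stream even if rendering fails,
        // and avoids a NullPointerException when the stream could not be opened.
        try (OutputStream os = new FileOutputStream(pdfPath)) {
            PdfRendererBuilder builder = new PdfRendererBuilder();
            builder.useFastMode();
            // htmlPath must be an absolute path so that "file://" + htmlPath is a valid file URL.
            org.w3c.dom.Document document = html5ParseDocument("file://" + htmlPath, 2000);
            builder.withW3cDocument(document, "file:///");
            builder.toStream(os);
            builder.run();
        } catch (FileNotFoundException e) {
            logger.error("File not found (HTML input or PDF output path)", e);
        } catch (IOException e) {
            logger.error("Failed to convert HTML to PDF", e);
        } catch (Exception e) {
            logger.error("Unexpected error while converting HTML to PDF", e);
        }
    }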

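Note: in builder.withW3cDocument(document, "file:///"), the first argument is the document instance we obtained in step 1, which conforms to the w3c DOM standard, so there is not much to say about it. The second argument is the baseUri, which describes the root directory of the HTML file. Suppose the HTML references a local image: /Documents/temple/Downloads/histogram_8a754ed63131432f8a720a4db6e7ab1a_ldzc.png. In that case the root must be at least /, that is, the baseUri must be set to "file:///". More generally, this root must be a common root of every resource referenced in the HTML; otherwise some resource files may not be found.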
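The effect of the baseUri can be illustrated with standard URI resolution from the JDK. This is a sketch of the general rule only; openhtmltopdf's own resolver may differ in details. `BaseUriDemo`, `resolvedPath`, and the file names `chart.png` and `img/logo.png` are hypothetical.

```java
import java.net.URI;

// Hypothetical demo of how a resource reference combines with a base URI.
class BaseUriDemo {

    // Resolves a reference found in the HTML against the base URI and returns the resulting path.
    static String resolvedPath(String baseUri, String reference) {
        return URI.create(baseUri).resolve(reference).getPath();
    }

    public static void main(String[] args) {
        // A reference starting with "/" is resolved from the file system root.
        System.out.println(resolvedPath("file:///", "/Documents/temple/Downloads/chart.png"));
        // prints /Documents/temple/Downloads/chart.png

        // A relative reference is resolved against the directory named by the base URI.
        System.out.println(resolvedPath("file:///Users/me/site/", "img/logo.png"));
        // prints /Users/me/site/img/logo.png
    }
}
```

In other words, when the HTML uses relative references, the baseUri should point at the directory those references are relative to, including the trailing slash.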
