Question:

Using Jsoup to replace line breaks in child elements with `<br />`

严繁
2023-03-14

I am replacing all line breaks inside the pre elements with `<br />`:

String body = "<p>This is the output:</p>\n<pre class=\"lang-xml prettyprint prettyprinted\">\n<code><span class=\"dec\">&lt;!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\"&gt;</span><span class=\"pln\">\n</span><span class=\"tag\">&lt;HTML&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;HEAD&gt;</span><span class=\"pln\">\n        </span><span class=\"tag\">&lt;META</span><span class=\"pln\"> </span><span class=\"atn\">http-equiv</span><span class=\"pun\">=</span><span class=\"atv\">\"Content-Type\"</span><span class=\"pln\"> </span><span class=\"atn\">content</span><span class=\"pun\">=</span><span class=\"atv\">\"text/html; charset=iso-8859-1\"</span><span class=\"tag\">&gt;</span><span class=\"pln\">\n        </span><span class=\"tag\">&lt;TITLE&gt;</span><span class=\"pln\">GeteBayOfficialTime</span><span class=\"tag\">&lt;/TITLE&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;/HEAD&gt;</span><span class=\"pln\">\n    </span><span class=\"tag\">&lt;BODY&gt;</span><span class=\"pln\">\n\n* About to connect() to api.ebay.com port 443 (#0)\n*   Trying 66.135.211.100... * Timeout\n*   Trying 66.135.211.140... * Timeout\n*   Trying 66.211.179.150... * Timeout\n*   Trying 66.211.179.180... * Timeout\n*   Trying 66.135.211.101... * Timeout\n*   Trying 66.211.179.148... * Timeout\n* connect() timed out!\n* Closing connection #0\n</span><span class=\"tag\">&lt;P&gt;</span><span class=\"pln\">Error sending request</span></code></pre>";
log.info("printing before creating a Jsoup Doc " + body);
Document bodyDom = Jsoup.parse(body);
log.info("printing after creating a Jsoup Doc " + bodyDom.html());

Elements preTags = bodyDom.getElementsByTag("pre");

for (Element pre : preTags) {
    pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));
    log.info("Pre element with linebreaks replaced -" + pre);
}

body = bodyDom.html();

Here are the logs. The HTML source seems to lose its line breaks as soon as I parse it into a Jsoup document:

**2013-12-10 10:14:59 INFO  FormattingTest:166** - printing before creating a Jsoup Doc <p>This is the output:</p>
<pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec">&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"&gt;</span><span class="pln">
</span><span class="tag">&lt;HTML&gt;</span><span class="pln">
    </span><span class="tag">&lt;HEAD&gt;</span><span class="pln">
        </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">"Content-Type"</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">"text/html; charset=iso-8859-1"</span><span class="tag">&gt;</span><span class="pln">
        </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln">
    </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln">
    </span><span class="tag">&lt;BODY&gt;</span><span class="pln">

* About to connect() to api.ebay.com port 443 (#0)
*   Trying 66.135.211.100... * Timeout
*   Trying 66.135.211.140... * Timeout
*   Trying 66.211.179.150... * Timeout
*   Trying 66.211.179.180... * Timeout
*   Trying 66.135.211.101... * Timeout
*   Trying 66.211.179.148... * Timeout
* connect() timed out!
* Closing connection #0
</span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>


**2013-12-10 10:14:59 INFO  FormattingTest:168** - printing after creating a Jsoup Doc <html>
 <head></head>
 <body>
  <p>This is the output:</p> 
  <pre class="lang-xml prettyprint prettyprinted">
<code><span class="dec">&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.01 Transitional//EN&quot; &quot;http://www.w3.org/TR/html4/loose.dtd&quot;&gt;</span><span class="pln"> </span><span class="tag">&lt;HTML&gt;</span><span class="pln"> </span><span class="tag">&lt;HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">&quot;Content-Type&quot;</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">&quot;text/html; charset=iso-8859-1&quot;</span><span class="tag">&gt;</span><span class="pln"> </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln"> </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;BODY&gt;</span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>
 </body>
</html>
2013-12-10 10:14:59 INFO  FormattingTest:174 - Pre element with linebreaks replaced -  <pre class="lang-xml prettyprint prettyprinted"><code><span class="dec">&lt;!DOCTYPE HTML PUBLIC &quot;-//W3C//DTD HTML 4.01 Transitional//EN&quot; &quot;http://www.w3.org/TR/html4/loose.dtd&quot;&gt;</span><span class="pln"> </span><span class="tag">&lt;HTML&gt;</span><span class="pln"> </span><span class="tag">&lt;HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;META</span><span class="pln"> </span><span class="atn">http-equiv</span><span class="pun">=</span><span class="atv">&quot;Content-Type&quot;</span><span class="pln"> </span><span class="atn">content</span><span class="pun">=</span><span class="atv">&quot;text/html; charset=iso-8859-1&quot;</span><span class="tag">&gt;</span><span class="pln"> </span><span class="tag">&lt;TITLE&gt;</span><span class="pln">GeteBayOfficialTime</span><span class="tag">&lt;/TITLE&gt;</span><span class="pln"> </span><span class="tag">&lt;/HEAD&gt;</span><span class="pln"> </span><span class="tag">&lt;BODY&gt;</span><span class="pln"> * About to connect() to api.ebay.com port 443 (#0) * Trying 66.135.211.100... * Timeout * Trying 66.135.211.140... * Timeout * Trying 66.211.179.150... * Timeout * Trying 66.211.179.180... * Timeout * Trying 66.135.211.101... * Timeout * Trying 66.211.179.148... * Timeout * connect() timed out! * Closing connection #0 </span><span class="tag">&lt;P&gt;</span><span class="pln">Error sending request</span></code></pre>

I am not sure what is going wrong. Another HTML source, "\nResponse :\nsome text\n\ndsjkhskjdh sdjhasjkdas\n",

is converted correctly to:


Response :
some text

dsjkhskjdh sdjhasjkdas

I don't know why the first sample is not!


1 answer

诸葛亮
2023-03-14

The problem is that when you do this:

    Jsoup.parse("\nText\nNex").html();

you get the text with its line breaks collapsed into spaces:

    Text Nex

Based on related questions, you can do the following:

    Document bodyDom = Jsoup.parse(body.replaceAll("(\\r\\n|\\n)", "<br />"));

This replaces the line breaks before the document is parsed.
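The snippets in this answer write the pattern both as `"(\\r\\n|\\n)"` and as `"(\r\n|\n)"`. In Java the two forms should behave the same: in the first the regex engine interprets the escapes, in the second the compiler places the raw characters into the pattern. A minimal sketch to check this; the class and method names (`LineBreakDemo`, `toBrEscaped`, `toBrLiteral`) are made up for illustration:

```java
final class LineBreakDemo {
    // Regex escapes: the regex engine interprets \r and \n.
    static String toBrEscaped(String text) {
        return text.replaceAll("(\\r\\n|\\n)", "<br />");
    }

    // Java string escapes: the compiler inserts the raw CR and LF characters.
    static String toBrLiteral(String text) {
        return text.replaceAll("(\r\n|\n)", "<br />");
    }

    public static void main(String[] args) {
        String sample = "first\r\nsecond\nthird";
        // Both calls print: first<br />second<br />third
        System.out.println(toBrEscaped(sample));
        System.out.println(toBrLiteral(sample));
    }
}
```

Listing `\r\n` before `\n` in the alternation means a Windows line ending becomes a single `<br />` rather than leaving a stray carriage return behind.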

To replace line breaks only between the pre tags, extract those blocks with a regular expression and replace within them:

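    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Match each <pre>...</pre> block; DOTALL lets '.' span line breaks.
    Pattern preP = Pattern.compile("<pre.*?>.+?</pre>", Pattern.DOTALL
            | Pattern.CASE_INSENSITIVE);
    Matcher m = preP.matcher(body);
    while (m.find()) {
        String toReplace = m.group();
        // Replace line breaks only inside the matched block.
        String replaced = toReplace.replaceAll("(\r\n|\n)", "<br />");
        body = body.replace(toReplace, replaced);
    }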
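The same extraction can also be written as a single pass with `Matcher.appendReplacement`, which avoids searching `body` again for every match. This is only a sketch under the same assumption as the loop above, namely well-formed, non-nested pre blocks; the names `PreLineBreaks` and `replaceInPre` are illustrative and not part of the original code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreLineBreaks {
    // Reluctant quantifiers stop each match at the first closing </pre>.
    private static final Pattern PRE_BLOCK = Pattern.compile(
            "<pre.*?>.+?</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String replaceInPre(String html) {
        Matcher m = PRE_BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replaced = m.group().replaceAll("\\r\\n|\\n", "<br />");
            // quoteReplacement stops '$' or '\' in the HTML from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        // Copy whatever follows the last <pre> block unchanged.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a\nb</p>\n<pre>x\ny</pre>";
        System.out.println(replaceInPre(html));
    }
}
```

Text outside the pre blocks keeps its line breaks; only the matched blocks are rewritten.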

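`.+?` is a reluctant (non-greedy) quantifier, so each match stops at the first occurrence of `</pre>`. You can try to process HTML with regular expressions, but it cannot be done reliably in general; see this answer for a better explanation. I recommend the next option instead.

You can see an example of the regular expression here.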

From the second answer, you can use:

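    // prettyPrint(false) keeps the original line breaks in the cleaned output.
    Document.OutputSettings outputSettings = new Document.OutputSettings()
            .prettyPrint(false);
    // Note: newer jsoup releases name the Whitelist class Safelist.
    body = Jsoup.clean(body, "", Whitelist.relaxed(), outputSettings);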

Then, as in the original code:

    pre.html(pre.html().replaceAll("(\r\n|\n)", "<br />"));

Turning prettyPrint off makes the clean method leave the line breaks in place, and the parser then handles them correctly.

Cheers
