Question:

Parsing text inside blockquote tags with Jsoup

麹学文

2023-03-14

I'm trying to parse Javadocs with Jsoup, but I'm having trouble extracting the text wrapped in blockquote tags.

Here is a sample of the HTML I'm trying to parse:

<P>
The <code>String</code> class represents character strings. All
 string literals in Java programs, such as <code>"abc"</code>, are
 implemented as instances of this class.
 <p>
 Strings are constant; their values cannot be changed after they
 are created. String buffers support mutable strings.
 Because String objects are immutable they can be shared. For example:
 <p><blockquote><pre>
     String str = "abc";
 </pre></blockquote><p>
 is equivalent to:
 <p><blockquote><pre>
     char data[] = {'a', 'b', 'c'};
     String str = new String(data);
 </pre></blockquote><p>
 Here are some more examples of how strings can be used:
 <p><blockquote><pre>
     System.out.println("abc");
     String cde = "cde";
     System.out.println("abc" + cde);
     String c = "abc".substring(2,3);
     String d = cde.substring(1, 2);
 </pre></blockquote>
 <p>

I'm using this code to parse the text contained in the p tags:

        Document doc = Jsoup.parse(new File("/home/facetoe/ebooks/Java/docs/api/java/lang/String.html"), "UTF-8");
        Elements para = doc.getElementsByTag("P");

        for ( Element element : para ) {
            System.out.println(element);
        }

However, no matter what I try, the text contained in the blockquote tags disappears.

Here is a sample of the output I get:

<p> The <code>String</code> class represents character strings. All string literals in Java programs, such as <code>&quot;abc&quot;</code>, are implemented as instances of this class. </p>
<p> Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable they can be shared. For example: </p>
<p></p>
<p> is equivalent to: </p>
<p></p>
<p> Here are some more examples of how strings can be used: </p>
<p></p>
<p> The class <code>String</code> includes methods for examining individual characters of the sequence, for comparing strings, for searching strings, for extracting substrings, and for creating a copy of a string with all characters translated to uppercase or to lowercase. Case mapping is based on the Unicode Standard version specified by the <a href="../../java/lang/Character.html" title="class in java.lang"><code>Character</code></a> class. </p>
<p> The Java language provides special support for the string concatenation operator (&nbsp;+&nbsp;), and for conversion of other objects to strings. String concatenation is implemented through the <code>StringBuilder</code>(or <code>StringBuffer</code>) class and its <code>append</code> method. String conversions are implemented through the method <code>toString</code>, defined by <code>Object</code> and inherited by all classes in Java. For additional information on string concatenation and conversion, see Gosling, Joy, and Steele, <i>The Java Language Specification</i>. </p>
<p> Unless otherwise noted, passing a <tt>null</tt> argument to a constructor or method in this class will cause a <a href="../../java/lang/NullPointerException.html" title="class in java.lang"><code>NullPointerException</code></a> to be thrown. </p>
<p>A <code>String</code> represents a string in the UTF-16 format in which <em>supplementary characters</em> are represented by <em>surrogate pairs</em> (see the section <a href="Character.html#unicode">Unicode Character Representations</a> in the <code>Character</code> class for more information). Index values refer to <code>char</code> code units, so a supplementary character uses two positions in a <code>String</code>. </p>
<p>The <code>String</code> class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., <code>char</code> values). </p>
<p> </p>

It's as if Jsoup simply discards anything wrapped in blockquote tags. Does anyone know how to keep these tags and extract the text from them?

3 answers

宰父涵忍

2023-03-14

You're not closing your p tags.

唐渊

2023-03-14

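Looking at the Jsoup documentation for the parse method, it seems they use a whitelist mechanism to decide what is safe and what is not. Maybe you need to set up a whitelist before parsing? That seems to apply only to the clean method, though, so it may be something else.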

郭盛

2023-03-14

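The reason is that Jsoup builds the DOM so that the blockquote elements sit outside the paragraphs. You can see this by printing the doc object. I believe a blockquote element automatically terminates the preceding p element (the closing p tag is not required). If you load the HTML in a modern browser and inspect the elements, you can observe the same thing.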
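To check that structure without opening a browser, a minimal sketch along these lines may help. It assumes a shortened copy of the sample HTML passed in as a string; the class name `BlockquoteParentDemo` and the helper `blockquoteParentTag` are made up for illustration and are not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BlockquoteParentDemo {

    // Hypothetical helper: returns the tag name of the parent of the
    // first <blockquote> found in the parsed HTML.
    static String blockquoteParentTag(String html) {
        Document doc = Jsoup.parse(html);
        Element quote = doc.selectFirst("blockquote");
        return quote.parent().tagName();
    }

    public static void main(String[] args) {
        // Shortened version of the Javadoc fragment from the question.
        String html = "<p>For example:\n"
                + "<p><blockquote><pre>\n"
                + "    String str = \"abc\";\n"
                + "</pre></blockquote><p>\n"
                + "is equivalent to:";

        // If the blockquote were nested in a paragraph this would be "p".
        System.out.println(blockquoteParentTag(html));

        // Printing the body shows the blockquote as a sibling of the paragraphs.
        System.out.println(Jsoup.parse(html).body());
    }
}
```

If the parent comes back as the body rather than a p element, the blockquote is a sibling of the paragraphs, which is why a loop over p elements never reaches it.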

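See also the HTML 4.01 specification: "The P element represents a paragraph. It cannot contain block-level elements (including P itself)." I'm sure there is something similar in HTML5.

So by iterating only over the paragraphs, you miss the blockquotes, which are not contained in them.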
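One possible way around this, offered as a sketch rather than the only fix, is to select the paragraphs and the blockquotes together with a selector group, so the prose and the code samples come back in document order. The class name `ParagraphsAndQuotes` and the helper `extractText` are hypothetical, and the HTML is a shortened inline copy of the sample instead of the file from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphsAndQuotes {

    // Hypothetical helper: collects the text of every <p> and <blockquote>,
    // in document order, skipping elements that contain no text.
    static List<String> extractText(String html) {
        Document doc = Jsoup.parse(html);
        List<String> blocks = new ArrayList<>();
        // "p, blockquote" is a selector group: it matches either tag.
        for (Element el : doc.select("p, blockquote")) {
            String text = el.text();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<p>For example:"
                + "<p><blockquote><pre>String str = \"abc\";</pre></blockquote>"
                + "<p>is equivalent to:";

        for (String block : extractText(html)) {
            System.out.println(block);
        }
    }
}
```

Skipping empty elements drops the empty paragraphs that appear in the question's output. To get only the code samples, a narrower selector such as `blockquote pre` would be an alternative.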
