Question:

Extracting HTML data with Jsoup

孙胜泫
2023-03-14

I have a table with columns such as ID and TEXT, where TEXT is a CLOB column that holds data in HTML format.

Sample data:

<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm<o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<SPAN style="mso-spacerun: yes">  </SPAN>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<SPAN style="mso-spacerun: yes">  </SPAN>The following items represent the scope and visit focus areas:<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program<o:p></o:p></SPAN></P>
<html>
 <head></head>
 <body>
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am
    <!--?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /-->
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<span style="mso-spacerun: yes"> </span>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<span style="mso-spacerun: yes"> </span>The following items represent the scope and visit focus areas:
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program
    <o:p></o:p></span></p> 
 </body>
</html>

When I use Jsoup.parse(AUDIT_SCOPE_LOB).text(), I get the following data:

Start: 8:30 am End: 4 pm The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas: 1. SOP Program 2. Training Program 3. Calibration/Preventive Maintenance Program

I know very little Java. Could someone share Java code that uses Jsoup to extract the data and return the output below?

Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program

This is only sample data. My real data also contains HTML tags that are not shown here.

2 answers

单于智
2023-03-14

org.jsoup.nodes.Element.toString() returns org.jsoup.nodes.Element.outerHtml()

Gets the outer HTML of this node.

org.jsoup.nodes.Element.text()

Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.

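So calling toString() on the whole sample returns the same markup you are seeing in your output. Likewise, calling text() returns all of the text, without tags, as a single string. What you need, however, is a separate string for each paragraph of text.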
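To make the difference concrete, here is a minimal sketch that runs both calls on a tiny two-paragraph snippet. The class name TextVsOuterHtml and the snippet are made up for illustration and stand in for the CLOB content; only Jsoup.parse, outerHtml() and text() come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TextVsOuterHtml {
    public static void main(String[] args) {
        // Hypothetical snippet standing in for the HTML stored in the CLOB column.
        String html = "<p>Start: 8:30 am</p><p>End: 4 pm</p>";
        Document doc = Jsoup.parse(html);

        // outerHtml() keeps the markup; jsoup also adds the html/head/body wrappers.
        System.out.println(doc.outerHtml());

        // text() drops the markup and joins all paragraphs into a single line.
        System.out.println(doc.text());
    }
}
```

The second println shows why text() on the whole document is not enough here: the paragraph boundaries are gone, so the text has to be read paragraph by paragraph.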

Some of your paragraph tags are empty. To get the output in your example, first check that each paragraph actually has text.

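// Assumes AUDIT_SCOPE_LOB is a String holding the HTML read from the CLOB column.
// For a String, the second argument of Jsoup.parse is a base URI, not a charset, so it is omitted.
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);

// Print each paragraph on its own line, skipping the empty spacer paragraphs.
for (Element p : doc.select("p")) {
    if (p.hasText()) {
        System.out.println(p.text());
    }
}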

Output

Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program

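See the jsoup CSS selector documentation for more examples of how to parse the data. For example, to pull out just the numbered list, you can select on the class name and retrieve the second span in each list paragraph.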

for (Element span : doc.select("p.MsoNormal > span:nth-child(2)")) 
     System.out.println(span.ownText());

Output

SOP Program
Training Program
Calibration/Preventive Maintenance Program

柳项明
2023-03-14

Since the information is divided into <p> elements, you can select each one and print its text:

Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);
Elements el = doc.select("p");
for (Element e : el) {
    System.out.println(e.text());
}
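
Note that this loop prints a blank line for every empty spacer paragraph. If the goal is to return the text rather than print it, one possible sketch collects the non-empty paragraphs into a single newline-separated String. The class ParagraphLines, the method toLines and the sample HTML are hypothetical names chosen for illustration; the jsoup calls (Jsoup.parse, select, hasText, text) are the same ones used in the answers.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParagraphLines {

    // Hypothetical helper: one line per non-empty <p>, joined with '\n'.
    static String toLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element p : doc.select("p")) {
            if (p.hasText()) {        // skip paragraphs that hold only whitespace
                lines.add(p.text());  // text() normalizes and trims whitespace
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // Reduced version of the sample data: two list items and a spacer paragraph.
        String html = "<p><span>1.<span>   </span></span><span>SOP Program</span></p>"
                    + "<p><span> </span></p>"
                    + "<p><span>2.<span>   </span></span><span>Training Program</span></p>";
        System.out.println(toLines(html));
    }
}
```

Returning a String keeps the helper usable from code that writes the result back to the database or to a file, instead of tying it to System.out.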
