I have a table with columns such as ID and TEXT, where TEXT is a CLOB column that holds data in HTML format.
Sample data:
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm<o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<SPAN style="mso-spacerun: yes"> </SPAN>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<SPAN style="mso-spacerun: yes"> </SPAN>The following items represent the scope and visit focus areas:<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<SPAN style="FONT: 7pt 'Times New Roman'"> </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<SPAN style="FONT: 7pt 'Times New Roman'"> </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<SPAN style="FONT: 7pt 'Times New Roman'"> </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program<o:p></o:p></SPAN></P>
<html>
<head></head>
<body>
<p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am
<!--?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /-->
<o:p></o:p></span></p>
<p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm
<o:p></o:p></span></p>
<p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<span style="mso-spacerun: yes"> </span>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<span style="mso-spacerun: yes"> </span>The following items represent the scope and visit focus areas:
<o:p></o:p></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program
<o:p></o:p></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">
<o:p></o:p></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program
<o:p></o:p></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">
<o:p></o:p></span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program
<o:p></o:p></span></p>
</body>
</html>
When I call Jsoup.parse(AUDIT_SCOPE_LOB).text(), I get the following data:
Start: 8:30 am End: 4 pm The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas: 1. SOP Program 2. Training Program 3. Calibration/Preventive Maintenance Program
I know very little about Java. Can I get Java code that uses jsoup to extract the data and return the output below?
Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program
Note that this is only sample data. My real data also contains HTML tags that are not shown here.
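org.jsoup.nodes.Element.toString()
Returns org.jsoup.nodes.Element.outerHtml()
Gets the outer HTML of this node.
org.jsoup.nodes.Element.text()
Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.
So calling toString() on your entire sample returns the full HTML markup, tags included. Likewise, calling text() on the whole document returns all the text, without tags, as a single string. What you need, however, is a separate string for each paragraph of text.
Some of your paragraph tags are empty. To get the output shown in your example, first check that each paragraph actually contains text.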
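// AUDIT_SCOPE_LOB is assumed to be a String holding the HTML.
// The second String argument of Jsoup.parse is a base URI, not a charset, so it is omitted.
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);
// Print one line per paragraph, skipping paragraphs that contain no text.
for (Element p : doc.select("p"))
    if (p.hasText())
        System.out.println(p.text());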
Output
Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program
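Since the question asks for code that returns the text rather than printing it, the same loop can be wrapped in a method that builds a single newline-separated string. This is only a minimal sketch: the class name ParagraphText, the helper paragraphsToLines, and the shortened HTML in main are made up for illustration, and it assumes the CLOB has already been read into a String and that jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphText {

    // Hypothetical helper: returns the text of every non-empty <p>, one per line.
    static String paragraphsToLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element p : doc.select("p")) {
            if (p.hasText()) {          // skip paragraphs that hold only whitespace
                lines.add(p.text());    // text() strips tags and normalizes whitespace
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample data in the question.
        String html = "<P><SPAN>Start: 8:30 am<o:p></o:p></SPAN></P>"
                + "<P><SPAN>End: 4 pm<o:p></o:p></SPAN></P>"
                + "<P><SPAN> <o:p></o:p></SPAN></P>"
                + "<P><SPAN>1.<SPAN> </SPAN></SPAN><SPAN>SOP Program<o:p></o:p></SPAN></P>";
        System.out.println(paragraphsToLines(html));
    }
}
```

Returning a String instead of printing makes it easier to write the result back to the database or pass it to other code.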
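See the jsoup CSS selector documentation for more examples of how to parse the data. For example, to extract the items of the ordered list, you can select on the class name and retrieve the second span in each paragraph.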
for (Element span : doc.select("p.MsoNormal > span:nth-child(2)"))
    System.out.println(span.ownText());
Output
SOP Program
Training Program
Calibration/Preventive Maintenance Program
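To make the difference between text() and ownText() concrete, here is a small self-contained sketch built around the selector above. The class name ListLabels, the helper labels, and the trimmed-down HTML are hypothetical and used only for illustration; it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ListLabels {

    // Hypothetical helper: collects the label of each numbered item,
    // i.e. the text directly inside the second <span> of each p.MsoNormal.
    static List<String> labels(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element span : doc.select("p.MsoNormal > span:nth-child(2)")) {
            result.add(span.ownText()); // ownText() ignores text inside child elements
        }
        return result;
    }

    public static void main(String[] args) {
        // Trimmed-down stand-in for the list paragraphs in the sample data.
        String html = "<p class=MsoNormal><span>1.<span> </span></span><span>SOP Program<o:p></o:p></span></p>"
                + "<p class=MsoNormal><span> <o:p></o:p></span></p>"
                + "<p class=MsoNormal><span>2.<span> </span></span><span>Training Program<o:p></o:p></span></p>";
        Element first = Jsoup.parse(html).select("p.MsoNormal").first();
        System.out.println(first.text());  // whole paragraph, number included
        System.out.println(labels(html));  // labels only, numbers left out
    }
}
```

The empty spacer paragraphs have only one span, so the nth-child(2) selector never matches them and no hasText() check is needed here.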
Since the information is divided into paragraphs, you can select them and print the text of each one:
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);
Elements el = doc.select("p");
for (Element e : el) {
    System.out.println(e.text());
}