Question:

How to get the image source and description from HTML data using Jsoup

施英哲
2023-03-14

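I am trying to parse an Atom feed, using the ROME API to extract its entries. Each entry has a content element that contains the article's image and description. The feed URL is: https://news.google.com/news/section?output=atom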

 <entry>
<id>tag:news.google.com,2005:cluster=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222</id>
<title type="html">'Not Just GST Stuck In Parliament. Matter of Sorrow': PM Narendra Modi - NDTV</title>
<updated>2015-12-10T06:03:54Z</updated>
<link rel="alternate" type="text/html" href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=in&amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52779006372283&amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222" hreflang="en"/>
<content type="html">&lt;table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;">&lt;tr>&lt;td width="80" align="center" valign="top">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;img src="//t3.gstatic.com/images?q=tbn:ANd9GcSNi4SJFo9q9PXKPOjJkiUlfk2GFRzRoBlwK6UsiSQ8np66JDvgQiYTdN4Fknntb7bVjdR-NuM" alt="" border="1" width="80" height="80">&lt;br>&lt;font size="-2">NDTV&lt;/font>&lt;/a>&lt;/font>&lt;/td>&lt;td valign="top" class="j">&lt;font style="font-size:85%;font-family:arial,sans-serif">&lt;br>&lt;div style="padding-top:0.8em;">&lt;img alt="" height="1" width="1">&lt;/div>&lt;div class="lh">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNE53SQd2skoJLxBTVlYWHdgDBCl7Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.ndtv.com/india-news/not-just-gst-stuck-in-parliament-matter-of-sorrow-pm-narendra-modi-1253222">&lt;b>&amp;#39;Not Just GST Stuck In Parliament. Matter of Sorrow&amp;#39;: PM &lt;b>Narendra Modi&lt;/b>&lt;/b>&lt;/a>&lt;br>&lt;font size="-1">&lt;b>&lt;font color="#6f6f6f">NDTV&lt;/font>&lt;/b>&lt;/font>&lt;br>&lt;font size="-1">With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister &lt;b>Narendra Modi&lt;/b> today said it was a &amp;quot;matter of sorrow&amp;quot; that Parliament was not running. 
&amp;quot;It is not only GST, but many pro-poor steps are stuck in&amp;nbsp;...&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNEVhO7UtISsITzRIFwxTVFwK8BTDQ&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.india.com/news/india/narendra-modis-stern-message-to-congress-democracy-cannot-run-on-whims-of-some-773082/">&lt;b>Narendra Modi&amp;#39;s&lt;/b> stern message to Congress: Democracy cannot run on whims of some&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>India.com&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNGkBqqpn2OhEI6w68lLCIXMDppu-Q&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.mid-day.com/articles/jagran-forum-catch-pm-narendra-modi-other-leaders-live/16757192">Jagran Forum: Catch PM &lt;b>Narendra Modi&lt;/b>, other leaders live&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Mid-Day&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1">&lt;a href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNHPkB8Wy_-cDqqZrdfcn1cVUKP-Kg&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.oneindia.com/india/democracy-cant-be-restricted-to-elections-only-narendra-modi-1951641.html">Democracy can&amp;#39;t be restricted to elections only, says &lt;b>Narendra Modi&lt;/b>&lt;/a>&lt;font size="-1" color="#6f6f6f">&lt;nobr>Oneindia&lt;/nobr>&lt;/font>&lt;/font>&lt;br>&lt;font size="-1" class="p">&lt;a 
href="http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=in&amp;amp;usg=AFQjCNFhxDKEsImpQqu0GccMt4MCiPydVw&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779006372283&amp;amp;ei=ACdpVoDJO9Sj4ALYkL94&amp;amp;url=http://www.abplive.in/india-news/everyone-must-feel-he-or-she-is-working-for-indias-progress-says-narendra-modi-258229">&lt;nobr>ABP Live&lt;/nobr>&lt;/a>&lt;/font>&lt;br>&lt;font class="p" size="-1">&lt;a class="p" href="http://news.google.com/news/more?ncl=dac7xEJd70rfdkM8gcjOwSJn8BK9M&amp;amp;authuser=0&amp;amp;ned=in">&lt;nobr>&lt;b>all 29 news articles&amp;nbsp;&amp;raquo;&lt;/b>&lt;/nobr>&lt;/a>&lt;/font>&lt;/div>&lt;/font>&lt;/td>&lt;/tr>&lt;/table></content>
</entry>

For the image, I tried the following Jsoup code:

Elements img = doc.getElementsByTag("img");
for (Element el : img) {
    System.out.println("Image Found!");
    System.out.println("src attribute is : " + el.attr("src"));
}

But it returns nothing. I also don't know how to go about extracting the description:

&lt;br>&lt;font size="-1">NEW DELHI: Putting the Ufa process back on track India and Pakistan on Wednesday signaled process of reducing tensions by announcing Comprehensive Bilateral Dialogue to be led by Foreign Secretaries and prepared the ground for a visit by Prime&amp;nbsp;...&lt;/font>

Please help me with this.

1 answer

拓拔嘉运
2023-03-14

Try this code. Note that the Atom feed is fetched directly with Jsoup.

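import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the Atom feed; get() throws IOException if the request fails.
Document news = Jsoup.connect("http://news.google.com/news/section?output=atom&ned=in&q=narendra%20modi").get();

int i = 0;
for (Element entryContent : news.select("entry > content")) {
    System.out.format("\n## ENTRY %d\n", ++i);
    // The content body is escaped HTML, so text() yields markup that is parsed a second time.
    for (Element el : Jsoup.parse(entryContent.text()).select("img[src], tr td.j font[size]:nth-of-type(2)")) {

        String elementTagName = el.tagName();

        if (elementTagName.equalsIgnoreCase("img")) {
            // Thumbnail image of the entry.
            System.out.println("src attribute is : " + el.attr("src"));
        } else if (elementTagName.equalsIgnoreCase("font")) {
            // Second sized <font> in the td.j cell holds the description text.
            System.out.println("description is : " + el.text());
        } else {
            System.out.println("Unexpected element >> " + el.html());
        }
    }
}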

Output:

## ENTRY 1
src attribute is : //t0.gstatic.com/images?q=tbn:ANd9GcSLee4ulBtCEOMSuDuLHCAjDZwmlaVaXJVdC09133QbK3X1OpZH3s1RBplznEadxqV5memM0dh3
description is : With repeated disruptions stalling legislation including the GST or Goods and Services Tax, Prime Minister Narendra Modi today said it was a "matter of sorrow" that Parliament was not running. "It is not only GST, but many pro-poor steps are stuck in ...

## ENTRY 2
src attribute is : //t1.gstatic.com/images?q=tbn:ANd9GcQdJPtLOBi9F2Ktov11_x5kqHC4inID47xKD3we_ZC5rHP1Lps96sYHs_N0pBO9WkDj5KKuEa8
description is : Prime Minister Narendra Modi topped the charts of Facebook under the most-viewed

(...)
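
The src values printed above start with //, which means they are protocol-relative and carry no scheme. If the images are to be downloaded, each reference first has to be resolved against an absolute base URL. Below is a minimal sketch using java.net.URI, assuming the feed URL from the question is an acceptable base; the class and method names are hypothetical. With Jsoup itself, passing a base URI to Jsoup.parse(html, baseUri) and reading el.absUrl("src") should give a similar result.

```java
import java.net.URI;

public class ProtocolRelativeUrlDemo {

    // Resolves a reference (possibly protocol-relative, e.g. "//host/path")
    // against the absolute URL of the document it was found in.
    static String resolve(String baseUrl, String reference) {
        return URI.create(baseUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // Assumption: the feed URL from the question serves as the base.
        String base = "https://news.google.com/news/section?output=atom";
        // src value taken from the sample entry in the question.
        String src = "//t3.gstatic.com/images?q=tbn:ANd9GcSNi4SJFo9q9PXKPOjJkiUlfk2GFRzRoBlwK6UsiSQ8np66JDvgQiYTdN4Fknntb7bVjdR-NuM";
        System.out.println(resolve(base, src));
    }
}
```

The scheme is taken from the base URL, while host, path, and query come from the src value; a reference that is already absolute is returned unchanged.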

Tested on Jsoup 1.8.3.
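
One likely reason the getElementsByTag("img") call in the question finds nothing: the body of the content element is escaped markup (&lt;img ...>), so to a parser it is plain text rather than child elements, and it has to be parsed a second time. That second pass is what Jsoup.parse(entryContent.text()) does in the answer's code. The sketch below illustrates the effect with the JDK's built-in XML parser instead of Jsoup, purely so that it runs without extra dependencies; the class name, the shortened entry, and the example.invalid image URL are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EscapedContentDemo {

    // Hypothetical, heavily shortened stand-in for one feed entry.
    static final String ENTRY =
        "<entry><content type=\"html\">"
        + "&lt;table>&lt;tr>&lt;td>&lt;img src=\"//example.invalid/a.png\">&lt;/td>&lt;/tr>&lt;/table>"
        + "</content></entry>";

    // Parses the XML once, the way a feed parser sees it.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Number of real <img> elements in the parsed feed tree.
    static int imgElementCount(String xml) {
        return parse(xml).getElementsByTagName("img").getLength();
    }

    // Unescaped text of the first <content> element.
    static String contentText(String xml) {
        return parse(xml).getElementsByTagName("content").item(0).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("img elements in the feed tree: " + imgElementCount(ENTRY));
        System.out.println("content as text: " + contentText(ENTRY));
    }
}
```

Running main should report zero img elements in the feed tree, while the content text holds the unescaped <img ...> markup, ready to be handed to an HTML parser such as Jsoup.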
