Question:

How do I extract text data from multiple tags using response.xpath in Scrapy?

丁念
2023-03-14

I am stuck on one problem: I want to use XPath in Scrapy to extract the text from the following HTML.

<div class="block fix-text job-description">
   <p>We’re looking for an experienced <strong>Events Manager</strong> to develop and deliver our events and exhibitions programme, available to start as soon as possible. You’ll be leading a team of two to create and implement an events strategy that supports our corporate objectives. You’ll be working closely with our campaigns, marketing and projects teams to make sure we connect with our audiences and achieve event objectives.</p>
   <p>In this role, you’ll be working within a dynamic team in a fast-paced environment, with the potential opportunity to be part of the recruitment process to build your own team. Your experience as an events manager will have a strong marketing or digital marketing focus, ideally within a regulatory or third sector context.</p>
   <p>You’ll be managing high profile events across our diverse organisation, from workshops and online webinars to our national flagship conference. It’s an exciting role with the opportunity to help shape our current digital transformation and strengthen our brand, so we’re looking for creativity and innovation. You’ll also be working with senior colleagues and stakeholders, for whom you’ll prepare detailed briefings. In addition, you:</p>
   <ul>
      <li>Can demonstrate your extensive experience of creating and managing high profile events and conferences</li>
      <li>Have experience in delivering complex events programmes integrated into campaigns and marketing communications</li>
      <li>Have experience of audience research and insight</li>
      <li>Have excellent budget management and negotiation skills</li>
      <li>Are an outstanding communicator, both verbal and written</li>
      <li>Have strong people management skills with the ability to motivate and develop a team remotely</li>
   </ul>
   <p>This role is the opportunity to work within one of the largest healthcare regulators within the UK, shaping change within healthcare. As part of your salary and benefits package, you’ll receive:</p>
   <ul>
      <li>A good pension (15% employer contribution)</li>
      <li>25 days’ holiday a year (option to buy &amp; sell)</li>
      <li>Private Medical Insurance (PMI) &amp; Health screens</li>
      <li>Interest free ticket loans</li>
      <li>Exclusive discounts</li>
      <li>Employee assistance programme</li>
      <li>Childcare vouchers</li>
      <li>Cycle to work scheme</li>
      <li>Flexi-working</li>
      <li>The option to work from home up to 2 days a week.</li>
   </ul>
   <p>The General Medical Council (GMC) helps to protect patients and improve medical education and practice in the UK by setting standards for medical students and doctors. We support them in achieving (and exceeding) those standards and take action when they’re not met.</p>
   <p>A registered charity, we value diversity and inclusion because our differences make us stronger. So, our processes are fair, objective, transparent and free from discrimination.</p>
   <p><strong>Employment status: 12-month Fixed Term Contract</strong></p>
   <p><strong>Closing date: Midnight on Sunday 1st July 2018, late applications will not be accepted.</strong></p>
   <p><strong>Assessment date: Interviews &amp; Assessments will take place on Wednesday 11th July 2018</strong></p>
</div>

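How can I extract the text from the HTML above? I tried the following XPath expressions:

  • '//*[@class="job-description"]'

  • //*[@id="main"]/div/div/div[1]/div[1]/div/div[2]/div[2]//text()

  • '//div[@class="job-description"]/p/descendant-or-self::text()'

None of them returned any output. Please tell me how to scrape this information, given that the element with this class contains multiple <p> tags and <ul> tags.

At this point I do not know how to get the text.

Thanks in advance.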

2 answers

    勾海超
    2023-03-14

    I solved the problem as follows.

    I simply used this XPath: //*[contains(@class,"job-description")]/descendant::text()

    Thanks for your comment, @Lars Marius Garshol.

    巢承安
    2023-03-14

    It is not entirely clear what you want, but it sounds like you want an XPath query that returns all the text nodes. You can do that with:

    /descendant::text()
    