Question:

Extracting information from an HTML web page with Python lxml

平庆

2023-03-14

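I am trying to write a Python script that scrapes specific pieces of information from a web page, but my limited knowledge is not enough to get there. I need to extract 7-8 pieces of information. The tags look like this: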

<a class="ui-magnifier-glass" href="here goes the link that i want to extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>

<a href="link to extract" title="title to extract" rel="category tag" data-spm-anchor-id="0.0.0.0">or maybe this word instead of title</a>

I started with this code:

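import requests
from lxml import html

url = input('url : ')

# Download the page and parse it into an element tree
page = requests.get(url)
tree = html.fromstring(page.text)
# Each xpath() call returns a list of matching text nodes
productname = tree.xpath('//h1[@class="product-name"]/text()')
price = tree.xpath('//span[@id="sku-discount-price"]/text()')
print('\n' + productname[0])
print('\n' + price[0])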

1 answer

齐航

2023-03-14

You can use the lxml and csv modules to do what you want. lxml supports XPath expressions for selecting the elements you need.

from lxml import etree
from io import StringIO
from csv import DictWriter

f = StringIO('''
    <html><body>
    <a class="ui-magnifier-glass"
       href="here goes the link that i want to extract"
       data-spm-anchor-id="0.0.0.0"
       style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
    ></a>
    <a href="link to extract"
       title="title to extract"
       rel="category tag"
       data-spm-anchor-id="0.0.0.0"
    >or maybe this word instead of title</a>
    </body></html>
''')
# Use the HTML parser so that real pages, which are rarely well-formed XML, also parse
doc = etree.parse(f, etree.HTMLParser())

data = []
# Get all links with data-spm-anchor-id="0.0.0.0"
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')

# Iterate through each matched <a> element
for elem in r:
    # You can access the attributes with get
    link = elem.get('href')
    title = elem.get('title')
    # and the text inside the tag is accessible with text (None for an empty tag)
    text = elem.text

    data.append({
        'link': link,
        'title': title,
        'text': text
    })

# The csv module expects files opened with newline='' in Python 3
with open('file.csv', 'w', newline='') as csvfile:
    fieldnames = ['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for row in data:
        writer.writerow(row)
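
If you would rather target the two kinds of tags from the question separately, a minimal sketch along the following lines may help. It assumes the `class="ui-magnifier-glass"` and `rel="category tag"` attributes are stable enough to select on; the function name `extract_links` and the example.com URLs are made up for illustration and are not taken from the real page.

```python
from lxml import html

# Hypothetical stand-in for the downloaded page source
SAMPLE = '''
<html><body>
<a class="ui-magnifier-glass" href="http://example.com/image"
   data-spm-anchor-id="0.0.0.0"></a>
<a href="http://example.com/category" title="Some title"
   rel="category tag" data-spm-anchor-id="0.0.0.0">Some text</a>
</body></html>
'''


def extract_links(page_source):
    """Return one dict per matching <a> tag found in page_source."""
    tree = html.fromstring(page_source)
    rows = []

    # An XPath ending in /@href returns the attribute values directly
    for href in tree.xpath('//a[@class="ui-magnifier-glass"]/@href'):
        rows.append({'link': href, 'title': None, 'text': None})

    # Selecting the element itself gives access to several attributes at once
    for elem in tree.xpath('//a[@rel="category tag"]'):
        rows.append({
            'link': elem.get('href'),
            'title': elem.get('title'),
            # text_content() gathers all text inside the tag, including nested tags
            'text': elem.text_content().strip(),
        })
    return rows


rows = extract_links(SAMPLE)
for row in rows:
    print(row)
```

For a live page you would pass in the downloaded source instead of `SAMPLE`, for example `requests.get(url).text` as in the question. The resulting list of dicts has the same keys as `data` above, so it can be written out with the same `DictWriter` code.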
