Requests-HTML: A Python Library for Parsing HTML

吕霍英

2023-12-01

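"HTML Parsing for Humans" is the tagline chosen by the library's author (kennethreitz), and it captures the library's user-friendly design. Requests-HTML is a more recent, even more user-friendly library from the same author.

Note: requests-html currently supports only Python 3.6 and above.

First, install the module: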

pip install requests-html

Fetch the home page:

>>> from requests_html import HTMLSession

>>> session = HTMLSession()

>>> data = session.get('http://www.baidu.com')

Get all links:

print(data.html.links)

The result looks like this:

{'http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8','http://tieba.baidu.com', 'http://home.baidu.com','http://www.baidu.com/duty/','http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=','http://xueshu.baidu.com', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=','http://e.baidu.com/?refer=888',…}

# There are many links, so only some of them are shown here.
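To make the shape of that result easier to read, here is a minimal sketch of link collection that uses only the standard library's `html.parser`. It is not how requests-html is implemented; the `LinkCollector` class, the `collect_links` helper, and the sample markup are all made up for illustration. It shows why the result is a set: duplicate links collapse into one entry, and the order is not fixed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag into a set."""

    def __init__(self):
        super().__init__()
        self.links = set()  # a set drops duplicates and has no fixed order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value:
                self.links.add(value.strip())


def collect_links(markup):
    """Return the set of href values found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Made-up sample markup; the real Baidu page is much larger.
sample = """
<a href="http://news.baidu.com">News</a>
<a href="/more/">More</a>
<a href="http://news.baidu.com">News again</a>
<a>No href here</a>
"""

print(collect_links(sample))
```

The repeated link appears only once in the printed set, and the anchor without an `href` is skipped.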

Get all absolute URLs:

print(data.html.absolute_links)

The result looks like this:

{'http://ir.baidu.com','http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8','http://news.baidu.com', 'http://xueshu.baidu.com', 'http://tieba.baidu.com','https://www.baidu.com/more/',…}

{'https://www.csdn.net/nav/iot','http://blog.csdn.net/sfM06sqVW55DFt1', ... ,}

# There are many links, so only some of them are shown here.
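The difference between a link as written in the page and its absolute form comes down to resolving each `href` against the page's base URL. The sketch below illustrates that step with the standard library's `urljoin`. It is an illustration of the idea rather than requests-html's own code, and the `to_absolute` helper, the base URL, and the sample `href` values are assumptions chosen for the example.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url and return the results as a set."""
    return {urljoin(base_url, href) for href in hrefs}


# Assumed base URL and sample hrefs, for illustration only.
base_url = "https://www.baidu.com/"
hrefs = [
    "/more/",                   # root-relative path
    "duty/",                    # path relative to the base
    "//tieba.baidu.com",        # scheme-relative URL
    "http://xueshu.baidu.com",  # already absolute, left unchanged
]

absolute = to_absolute(base_url, hrefs)
for url in sorted(absolute):
    print(url)
```

Links that are already absolute pass through untouched; only relative ones pick up the scheme and host of the base URL.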

Use CSS selectors (jQuery-style, via PyQuery):

>>> element = data.html.find('#su', first=True)

>>> print(element.text)

Use XPath:

element = data.html.xpath('//input[@id="su"]')

Get an element's text:

element = data.html.find('a[name="tj_trnews"]')[0]
text = element.text

Get an element's attributes:

>>> attrsr = element.attrs['name']

>>> print(attrsr)
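For readers new to XPath, the following offline sketch shows what the attribute-based lookups above select. It uses the standard library's `xml.etree.ElementTree` as a stand-in, not requests-html; ElementTree supports only a subset of XPath and exposes attributes as `attrib` rather than `attrs`. The snippet itself is invented: only the `su` id and the `tj_trnews` name come from the examples above, while the text and the `value` attribute are placeholders.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet; the real page markup will differ.
snippet = """
<div>
  <a name="tj_trnews" href="http://news.baidu.com">News</a>
  <input id="su" type="submit" value="Search" />
</div>
""".strip()

root = ET.fromstring(snippet)

# [@attr='value'] keeps only elements whose attribute matches exactly.
button = root.find(".//input[@id='su']")
link = root.find(".//a[@name='tj_trnews']")

print(link.text)               # text content of the <a> element
print(link.attrib["name"])     # attribute lookup by key
print(button.attrib["value"])  # an <input> carries its label in "value", not in text
```

An `<input>` element has no text of its own, so reading its text yields nothing useful; its visible label lives in the `value` attribute.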

Convert HTML to Markdown:

>>> print(about.markdown)

* [About](/about/)

* [Applications](/about/apps/)

* [Quotes](/about/quotes/)

* [Getting Started](/about/gettingstarted/)

* [Help](/about/help/)

* [Python Brochure](http://brochure.getpython.info/)

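# This is the author's own example; the output fetched from CSDN was not as clear as this one.

Finally, here is the link to the original author's project on GitHub: https://github.com/kennethreitz/requests-html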
