Readability is a tool for extracting and curating the primary readable content of a webpage.
Check out The Documentation for full and detailed guides.
If available in Hex, the package can be installed as:
mix.exs
def deps do
[{:readability, "~> 0.9"}]
end
def application do
[applications: [:readability]]
end
Note: Readability requires Elixir 1.3 or higher.
url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url)
summary.title
#=> "Why I’m betting on Elixir"
summary.authors
#=> ["Ken Mazaika"]
summary.article_html
#=>
# <div><div><p id=\"3476\"><strong><em>Background: </em></strong><em>I’ve spent...
# ...
# ...button!</em></h3></div></div>
summary.article_text
#=>
# Background: I’ve spent the past 6 years building web applications in Ruby and.....
# ...
# ... value in this article, it would mean a lot to me if you hit the recommend button!
### Extract the title.
Readability.title(html)
### Extract authors.
Readability.authors(html)
### Extract the primary content with transformed html.
html
|> Readability.article
|> Readability.readable_html
### Extract only text from the primary content.
html
|> Readability.article
|> Readability.readable_text
### Extract the primary images with Floki.
html
|> Readability.article
|> Floki.find("img")
|> Floki.attribute("src")
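The `src` values returned this way may be relative to the page they came from. A minimal sketch of resolving them against the page URL, using only `URI.merge/2` from the Elixir standard library, could look like the following. `ImageUrls`, `absolutize/2`, and the example URLs are hypothetical names chosen for illustration; they are not part of Readability or Floki.

```elixir
defmodule ImageUrls do
  @moduledoc """
  Hypothetical helper (not part of Readability or Floki): resolves image
  `src` values against the URL of the page they were extracted from.
  """

  # `srcs` is a list of strings, such as the list returned by
  # Floki.attribute(tree, "src"). `page_url` must be an absolute URL,
  # otherwise URI.merge/2 raises an ArgumentError.
  def absolutize(srcs, page_url) when is_list(srcs) and is_binary(page_url) do
    Enum.map(srcs, fn src ->
      page_url
      |> URI.merge(src)
      |> URI.to_string()
    end)
  end
end

# Example input: one root-relative path, one relative path, and one URL
# that is already absolute.
srcs = ["/img/cover.png", "figures/chart.png", "https://cdn.example.com/photo.jpg"]

srcs
|> ImageUrls.absolutize("https://example.com/posts/article.html")
|> Enum.each(&IO.puts/1)
```

Sources that are already absolute pass through unchanged, so the helper can be applied to the whole list without filtering first.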
If the result is different from your expectations, you can add options to customize it.
url = "https://medium.com/@kenmazaika/why-im-betting-on-elixir-7c8f847b58"
summary = Readability.summarize(url, [clean_conditionally: false])
You can find other algorithm and regex options in readability.ex.
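Options are passed as a keyword list, so only the keys you supply need to differ from the defaults. The sketch below illustrates that layering with `Keyword.merge/2`; it is an assumption about how such options are commonly resolved in Elixir, not a copy of the library's code. `OptionsSketch` and `:example_flag` are made up for illustration, and the default shown for `:clean_conditionally` is assumed.

```elixir
defmodule OptionsSketch do
  # Illustrative defaults only; the real list lives in readability.ex.
  # :clean_conditionally is the option used in the example above (its
  # default of true is an assumption); :example_flag is a placeholder.
  @defaults [clean_conditionally: true, example_flag: :on]

  # Caller-supplied options override the defaults key by key.
  def resolve(opts \\ []) when is_list(opts) do
    Keyword.merge(@defaults, opts)
  end
end

opts = OptionsSketch.resolve(clean_conditionally: false)

IO.inspect(Keyword.get(opts, :clean_conditionally), label: "clean_conditionally")
IO.inspect(Keyword.get(opts, :example_flag), label: "example_flag")
```

Keys that are not overridden keep their default values, which is why a single-option call such as the one above is enough to change one behavior.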
To run the test suite:
$ mix test
img#src and a#href
Check out the main features milestone and features of related projects below.
Contributing
NOTE: Be sure to merge the latest from "upstream" before making a pull request!
This code is under the Apache License 2.0. See http://www.apache.org/licenses/LICENSE-2.0.