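Introduction: Although I mostly use PHP to generate Markdown, Python's library ecosystem really is remarkably rich. PHP's Composer also took its cues from Python's package management, which is what gives PHP the same building-block feel.
Python is often used to convert Markdown to HTML. Today we look at a library that goes the other way and converts HTML to Markdown.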
html2text
Installation
1. Using pip
pip install html2text  # use pip3 for Python 3
2. Installing from source
If you are using Python 3, append a 3 to python in the commands below.
git clone --depth 1 https://github.com/Alir3z4/html2text.git
cd html2text
python setup.py build
python setup.py install
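Usage
import html2text

# The tags of the original sample were lost; <strong> is reconstructed from the output below.
html = """
<strong>hello</strong> https://xxx.com
"""
md = html2text.html2text(html)
print(md)
Output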
**hello** https://xxx.com
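Advanced usage
Ignoring links (the a tag)
import html2text

text_maker = html2text.HTML2Text()
text_maker.ignore_links = True   # keep the link text, drop the URL
text_maker.bypass_tables = False
# The original sample markup was lost; this input is reconstructed from the output below.
html = '<strong>hello</strong> https://xxx.com<br><a href="https://xxx.com">link</a>'
text = text_maker.handle(html)
print(text)
Output
**hello** https://xxx.com
link
With ignore_links = False, the output becomes
**hello** https://xxx.com
[link](https://xxx.com)
As you can see, with ignore_links enabled only the link text is kept, while with it disabled the a tag is rendered in Markdown link syntax.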
Other options
UNICODE_SNOB for using unicode
ESCAPE_SNOB for escaping every special character
LINKS_EACH_PARAGRAPH for putting links after every paragraph
BODY_WIDTH for wrapping long lines
SKIP_INTERNAL_LINKS to skip #local-anchor things
INLINE_LINKS for formatting images and links
PROTECT_LINKS protect from line breaks
GOOGLE_LIST_INDENT number of pixels to indent nested lists
IGNORE_ANCHORS
IGNORE_IMAGES
IMAGES_AS_HTML always generate HTML tags for images; preserves height, width, alt if possible.
IMAGES_TO_ALT
IMAGES_WITH_SIZE
IGNORE_EMPHASIS
BYPASS_TABLES format tables in HTML rather than Markdown
IGNORE_TABLES ignore table-related tags (table, th, td, tr) while keeping rows
SINGLE_LINE_BREAK to use a single line break rather than two
UNIFIABLE is a dictionary which maps unicode abbreviations to ASCII values
RE_SPACE for finding space-only lines
RE_ORDERED_LIST_MATCHER for matching ordered lists in MD
RE_UNORDERED_LIST_MATCHER for matching unordered lists in MD
RE_MD_CHARS_MATCHER for matching Md \,[,],( and )
RE_MD_CHARS_MATCHER_ALL for matching `,*,_,{,},[,],(,),#,!
RE_MD_DOT_MATCHER for matching lines starting with 1.
RE_MD_PLUS_MATCHER for matching lines starting with +
RE_MD_DASH_MATCHER for matching lines starting with (-)
RE_SLASH_CHARS a string of slash escapeable characters
RE_MD_BACKSLASH_MATCHER to match \char
USE_AUTOMATIC_LINKS to convert <a href='http://xyz'>http://xyz</a> to <http://xyz>
MARK_CODE to wrap 'pre' blocks with [code]…[/code] tags
WRAP_LINKS to decide if links have to be wrapped during text wrapping (implies INLINE_LINKS = False)
WRAP_LIST_ITEMS to decide if list items have to be wrapped during text wrapping
DECODE_ERRORS to handle decoding errors. 'strict', 'ignore', 'replace' are the acceptable values.
DEFAULT_IMAGE_ALT takes a string as value and is used whenever an image tag is missing an alt value. The default for this is an empty string '' to avoid backward breakage
OPEN_QUOTE is the character used to open a quote when replacing the <q> tag. It defaults to ".
CLOSE_QUOTE is the character used to close a quote when replacing the <q> tag. It defaults to ".
markdown @tsingchan