When crawling online articles such as news stories and blog posts, I want to save them as Markdown files rather than in a database. Tomd converts HTML back into Markdown, provided the HTML is the kind that Markdown itself produces. If an HTML structure cannot be expressed in Markdown, tomd cannot convert it correctly. Tomd is a Python tool.
pip install tomd
Input
import tomd
tomd.Tomd('<h1>h1</h1>').markdown
# or
tomd.convert('<h1>h1</h1>')
Output
# h1
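Since the stated goal is to keep crawled articles as Markdown files instead of database rows, the following is a minimal sketch of that step. The name `save_as_markdown` and its parameters are hypothetical and not part of tomd; `convert` is assumed to be any callable that maps an HTML string to a Markdown string, such as `tomd.convert` shown above.

```python
from pathlib import Path
from typing import Callable


def save_as_markdown(html: str, path: Path, convert: Callable[[str], str]) -> Path:
    """Convert one HTML document and write the result to a .md file.

    `convert` is a stand-in for the converter (assumption: something
    like tomd.convert, which takes HTML text and returns Markdown text).
    """
    markdown = convert(html)
    # Create the target directory if the crawler has not made it yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write with an explicit encoding so non-ASCII articles survive intact.
    path.write_text(markdown, encoding="utf-8")
    return path
```

With tomd installed, a call might look like `save_as_markdown(html, Path("articles/post.md"), tomd.convert)`. Passing the converter in as an argument keeps the file handling testable without the library.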
from tomd import Tomd
html="""
<h1>h1</h1>
<h2>h2</h2>
<h3>h3</h3>
<h4>h4</h4>
<h5>h5</h5>
<h6>h6</h6>
<p>paragraph
<a href="https://github.com">link</a>
<img src="https://github.com" class="dsad">img</img>
</p>
<ul>
<li>1</li>
<li>2</li>
<li>3</li>
</ul>
<ol>
<li>1</li>
<li>2</li>
<li>3</li>
</ol>
<blockquote>blockquote</blockquote>
<p><code>inline code</code></p>
<pre><code>block code</code></pre>
<p>
<del>del</del>
<b>bold</b>
<i>italic</i>
<b><i>bold italic</i></b>
</p>
<hr/>
<table>
<thead>
<tr>
<th>th1</th>
<th>th2</th>
</tr>
</thead>
<tbody>
<tr>
<td>td</td>
<td>td</td>
</tr>
<tr>
<td>td</td>
<td>td</td>
</tr></tbody></table>
"""
Tomd(html).markdown
Output
# h1
## h2
### h3
#### h4
##### h5
###### h6
paragraph
[link](https://github.com)
![img](https://github.com)
- 1
- 2
- 3
1. 1
1. 2
1. 3
> blockquote
`inline code`
block code
~~del~~
**bold**
*italic*
***bold italic***
---
|th1|th2
|------
|td|td
|td|td
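Because tomd cannot correctly convert HTML that Markdown has no way to express, it may help to check a page before converting it. The sketch below uses only the standard library's `html.parser` and reports tags that do not appear in the example above. The `EXAMPLE_TAGS` set and the `unexpected_tags` helper are assumptions made for illustration: the set is copied from the example, not from tomd's source, so it says nothing definitive about what tomd actually supports.

```python
from html.parser import HTMLParser
from typing import Set

# Assumption: these are simply the tags used in the example above,
# not a list taken from tomd itself.
EXAMPLE_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6",
    "p", "a", "img",
    "ul", "ol", "li",
    "blockquote", "code", "pre",
    "del", "b", "i", "hr",
    "table", "thead", "tbody", "tr", "th", "td",
}


class TagCollector(HTMLParser):
    """Collect every tag name that is not in EXAMPLE_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.unexpected: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag not in EXAMPLE_TAGS:
            self.unexpected.add(tag)


def unexpected_tags(html: str) -> Set[str]:
    """Return the tags in `html` that the example above does not cover."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.unexpected
```

An empty result means the page uses only tags seen in the example; a non-empty result is a hint that the converted Markdown should be reviewed by hand.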
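Python crawler: using the tomd library to convert HTML into Markdown documents. Encoding problems are a real headache! Note: before writing any Python, set the encoding in two places: the file encoding at the top of the file, and the system default encoding after the imports. tomd still cannot handle very complex structures perfectly, but it is already quite good for a converter written in fewer than 200 lines of code. tomd source: https://github.com/gaojiul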
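The encoding advice above (a file-level encoding declaration plus a changed system default encoding) appears to describe Python 2 practice. In Python 3, strings are already Unicode, so a common alternative is to decode the crawled bytes explicitly before conversion. The sketch below assumes the crawler has raw bytes and, optionally, an encoding name declared by the server; `decode_html` and its parameters are hypothetical names, not part of tomd.

```python
from typing import Optional


def decode_html(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode crawled bytes to text before handing them to a converter.

    Tries the declared encoding first (if any), then UTF-8, and finally
    falls back to UTF-8 with replacement characters so it never raises.
    """
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    return raw.decode("utf-8", errors="replace")
```

The decoded string can then be passed to `Tomd(...)` or `tomd.convert(...)`, and the resulting Markdown written out with an explicit `encoding="utf-8"`.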
Installation: pip install tomd. Usage: from tomd import Tomd, then Tomd('<h1>h1</h1>').markdown returns a Markdown-formatted string. GitHub: https://github.com/gaojiuli/tomd