html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
Simple usage follows this pattern:
import html5lib
with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)
or:
import html5lib
document = html5lib.parse("<p>Hello World!")
By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
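As a rough illustration of what that default result looks like, the sketch below parses the same fragment and looks up the paragraph. It assumes the default behaviour of placing HTML elements in the XHTML namespace; the XHTML constant and the variable names are only for this example.

```python
import html5lib

# Namespace prefix that the default tree attaches to HTML element tags
# (assumption: namespaceHTMLElements is left at its default of True).
XHTML = "{http://www.w3.org/1999/xhtml}"

document = html5lib.parse("<p>Hello World!")

# The returned object is the root <html> element of the tree.
root_tag = document.tag

# Lookups must include the namespace when it is enabled.
paragraph = document.find(".//" + XHTML + "p")
print(root_tag, paragraph.text)

# Passing namespaceHTMLElements=False yields plain tag names instead.
plain = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)
plain_paragraph = plain.find(".//p")
print(plain.tag, plain_paragraph.text)
```

Disabling the namespace can make short scripts easier to read, at the cost of no longer telling HTML elements apart from embedded SVG or MathML elements that share a name.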
Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:
import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
When using html5lib with urllib2 (Python 2), the charset from the HTTP response should be passed into html5lib as follows:
from contextlib import closing
from urllib2 import urlopen
import html5lib
with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, encoding=f.info().getparam("charset"))
When using html5lib with urllib.request (Python 3), the charset from the HTTP response should be passed into html5lib as follows:
from urllib.request import urlopen
import html5lib
with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
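The Python 3 snippet above relies on get_content_charset(), a method that the headers object returned by f.info() inherits from email.message.Message. The sketch below shows how that lookup behaves without touching the network. It uses only the standard library, and charset_from_content_type is a hypothetical helper written for this illustration, not part of html5lib or urllib.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Hypothetical helper: return the charset parameter of a
    Content-Type header value, or None if it has none."""
    headers = Message()
    headers["Content-Type"] = content_type
    # Same method that is called on f.info() in the snippet above.
    return headers.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/html")
print(with_charset, without_charset)
```

The charset comes back lowercased, and a header without a charset parameter gives None. In that case html5lib should be left to fall back on its own encoding detection.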
To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:
import html5lib
with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
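To see what strict mode changes in practice, here is a small sketch that contrasts a default parser with a strict one on the same input. It assumes that the exception raised is html5lib.html5parser.ParseError and that a non-strict parser records problems in its errors attribute. The input is deliberately missing a doctype, which counts as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

SOURCE = "<p>Hello World!"  # no doctype, so at least one parse error

# A default (lenient) parser recovers and records errors instead of raising.
lenient = html5lib.HTMLParser()
lenient.parse(SOURCE)
error_count = len(lenient.errors)

# A strict parser raises on the first parse error it meets.
strict = html5lib.HTMLParser(strict=True)
try:
    strict.parse(SOURCE)
    raised = False
except ParseError as exc:
    raised = True
    print("strict parser rejected the input:", exc)

print(error_count, raised)
```

Strict mode suits validation-style checks, while the default mode mirrors how browsers recover from malformed markup.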
When you’re instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:
import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")
More documentation is available at http://html5lib.readthedocs.org/.
html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:
$ pip install html5lib
The following third-party libraries may be used for additional functionality:
Please report any bugs on the issue tracker.
Unit tests require the nose library and can be run using the nosetests command in the root directory; ordereddict is required under Python 2.6. All should pass.
Test data are contained in a separate html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized:
$ git submodule init
$ git submodule update
If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.
There’s a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.