DiDOM - simple and fast HTML parser.
To install DiDOM, run the command:
composer require imangazaliev/didom
use DiDom\Document;
$document = new Document('http://www.news.com/', true);
$posts = $document->find('.post');
foreach ($posts as $post) {
    echo $post->text(), "\n";
}
DiDOM can load HTML in several ways:
// the first parameter is a string with HTML
$document = new Document($html);
// file path
$document = new Document('page.html', true);
// or URL
$document = new Document('http://www.example.com/', true);
The second parameter specifies whether the first parameter is a file path or URL to load rather than an HTML string. It defaults to false.
__construct($string = null, $isFile = false, $encoding = 'UTF-8', $type = Document::TYPE_HTML)
- $string: an HTML or XML string, or a file path.
- $isFile: indicates that the first parameter is a path to a file.
- $encoding: the document encoding.
- $type: the document type (HTML: Document::TYPE_HTML, XML: Document::TYPE_XML).
$document = new Document();
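As a minimal sketch of loading markup into an empty document after construction, the example below uses loadHtml() for a string and loadHtmlFile() for a file. The autoloader path, the sample markup, and the temporary file are assumptions made for illustration.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical sample markup.
$html = '<html><body><p class="post">Hello</p></body></html>';

// Create an empty document, then load markup from a string.
$document = new Document();
$document->loadHtml($html);

// Load the same markup from a file; a temporary file stands in for a real page.
$path = tempnam(sys_get_temp_dir(), 'didom');
file_put_contents($path, $html);

$fromFile = new Document();
$fromFile->loadHtmlFile($path);
unlink($path);

$fromString = $document->first('.post')->text();
$fromDisk = $fromFile->first('.post')->text();

echo $fromString, "\n";
echo $fromDisk, "\n";
```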
There are two methods available for loading XML: loadXml() and loadXmlFile().
These methods accept additional options:
$document->loadXml($xml, LIBXML_PARSEHUGE);
$document->loadXmlFile($url, LIBXML_PARSEHUGE);
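The following sketch shows loadXml() with a libxml option and a search over the result. The XML snippet and the autoloader path are assumptions, not part of the library.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical XML used only for this example.
$xml = '<catalog><book>First</book><book>Second</book></catalog>';

$document = new Document();
// LIBXML_PARSEHUGE is passed through to libxml, as in the calls above.
$document->loadXml($xml, LIBXML_PARSEHUGE);

// Collect the text of every <book> element.
$titles = [];
foreach ($document->find('book') as $book) {
    $titles[] = $book->text();
}

echo implode(', ', $titles), "\n";
```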
DiDOM accepts a CSS selector or an XPath expression for searching. Pass the expression as the first parameter and specify its type in the second one (the default type is Query::TYPE_CSS):
use DiDom\Document;
use DiDom\Query;
// CSS selector
$posts = $document->find('.post');
// XPath
$posts = $document->find("//div[contains(@class, 'post')]", Query::TYPE_XPATH);
If elements matching the expression are found, the method returns an array of DiDom\Element instances; otherwise it returns an empty array. You can also get an array of DOMElement objects by passing false as the third parameter.
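To make the return types concrete, here is a small sketch that compares the default result with the result when false is passed as the third parameter. The sample markup and the autoloader path are assumptions.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;
use DiDom\Element;
use DiDom\Query;

// Hypothetical markup with two matching elements.
$document = new Document('<div class="post">One</div><div class="post">Two</div>');

// Default: results are wrapped in DiDom\Element.
$wrapped = $document->find('.post');

// Third parameter false: results are plain DOMElement objects.
$raw = $document->find('.post', Query::TYPE_CSS, false);

// No match: an empty array.
$missing = $document->find('.comment');

var_dump($wrapped[0] instanceof Element);
var_dump($raw[0] instanceof DOMElement);
var_dump($missing === []);
```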
$posts = $document('.post');
Warning: using this method is undesirable because it may be removed in the future.
$posts = $document->xpath("//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]");
You can do search inside an element:
echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();
To check whether an element exists, use the has() method:
if ($document->has('.post')) {
    // code
}
If you need to check whether an element exists and then get it:
if ($document->has('.post')) {
    $elements = $document->find('.post');
    // code
}
it is faster to do this instead:
if (count($elements = $document->find('.post')) > 0) {
    // code
}
because the first version runs two queries.
The find(), first(), xpath(), has(), and count() methods are available on Element too.
echo $document->find('nav')[0]->first('ul.menu')->xpath('//li')[0]->text();
If you change, replace, or remove an element that was found inside another element, the document will not be changed. This happens because the find() method of the Element class (and, likewise, the first() and xpath() methods) creates a new document to search in.
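To search for elements in the source document, use the findInDocument() and firstInDocument() methods:
// nothing will happen
$document->first('head')->first('title')->remove();
// but this will do
$document->first('head')->firstInDocument('title')->remove();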
Warning: the findInDocument() and firstInDocument() methods work only for elements that belong to a document, and for elements created via new Element(...). If an element does not belong to a document, a LogicException is thrown.
DiDom supports search by:
// all links
// any element with id = "foo" and "bar" class
// any element with attribute "name"
// the same as
// input field with the name "foo"
// any element that has an attribute starting with "data-" and the value "foo"
// all links starting with https
// all images with the extension png
// all links containing the string "example.com"
// text of the links with "foo" class
// address and title of all the fields with "bar" class
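A few of the selector forms described by the comments above can be sketched as follows. The markup, the autoloader path, and the variable names are assumptions, and only a subset of the selectors is shown.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical markup chosen so that each selector has something to match.
$html = '<a id="foo" class="bar" href="https://example.com/a">First</a>'
      . '<a class="foo" href="http://example.org/b">Second</a>'
      . '<input name="foo">';

$document = new Document($html);

$links       = $document->find('a');                    // by tag
$idAndClass  = $document->find('#foo.bar');             // by id and class
$withName    = $document->find('[name]');               // attribute is present
$namedInput  = $document->find('input[name=foo]');      // attribute equals a value
$httpsLinks  = $document->find('a[href^=https]');       // attribute starts with
$exampleCom  = $document->find('a[href*=example.com]'); // attribute contains
$fooTexts    = $document->find('a.foo::text');          // text of matching links

echo count($links), ' ', count($idAndClass), ' ', count($withName), "\n";
echo implode(', ', $fooTexts), "\n";
```

With the ::text pseudo-element the results are strings rather than Element instances, which is why they can be joined directly.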
$posts = $document->find('.post');
echo $posts[0]->html();
$html = (string) $posts[0];
$html = $document->format()->html();
An element does not have a format() method, so to output the formatted HTML of an element, first convert it to a document:
$html = $element->toDocument()->format()->html();
$innerHtml = $element->innerHtml();
A document does not have an innerHtml() method, so to get the inner HTML of a document, first convert it to an element:
$innerHtml = $document->toElement()->innerHtml();
echo $document->xml();
echo $document->first('book')->xml();
$posts = $document->find('.post');
echo $posts[0]->text();
use DiDom\Element;
$element = new Element('span', 'Hello');
// Outputs "<span>Hello</span>"
echo $element->html();
The first parameter is the tag name, the second is the element's value (optional), and the third is an array of attributes (optional).
An example of creating an element with attributes:
$attributes = ['name' => 'description', 'placeholder' => 'Enter description of item'];
$element = new Element('textarea', 'Text', $attributes);
An element can also be created from an instance of the DOMElement class:
use DiDom\Element;
use DOMElement;
$domElement = new DOMElement('span', 'Hello');
$element = new Element($domElement);
$document = new Document($html);
$element = $document->createElement('span', 'Hello');
$document = new Document($html);
$input = $document->find('input[name=email]')[0];
$document = new Document($html);
$item = $document->find('ul.menu > li')[1];
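Once an element has been found, its neighbours can be reached from it. The sketch below uses parent(), previousSibling(), and nextSibling(); the menu markup and the autoloader path are assumptions.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical menu without whitespace between items,
// so siblings are the neighbouring <li> elements.
$html = '<ul class="menu"><li>Home</li><li>Blog</li><li>About</li></ul>';

$document = new Document($html);

// Second item in the list.
$item = $document->find('ul.menu > li')[1];

$parentClass = $item->parent()->getAttribute('class');
$previous = $item->previousSibling()->text();
$next = $item->nextSibling()->text();

echo $parentClass, "\n";
echo $previous, ' < ', $item->text(), ' > ', $next, "\n";
```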
$html = '<div>Foo<span>Bar</span><!--Baz--></div>';
$document = new Document($html);
$div = $document->first('div');
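// element node (DOMElement)
// string(3) "Bar"
var_dump($div->child(1)->text());
// text node (DOMText)
// string(3) "Foo"
var_dump($div->firstChild()->text());
// comment node (DOMComment)
// string(3) "Baz"
var_dump($div->lastChild()->text());
// array(3) { ... }
var_dump($div->children());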
$document = new Document($html);
$element = $document->find('input[name=email]')[0];
$document2 = $element->getDocument();
// bool(true)
var_dump($document->is($document2));
$element->setAttribute('name', 'username');
$element->attr('name', 'username');
$element->name = 'username';
$username = $element->getAttribute('value');
$username = $element->attr('value');
$username = $element->name;
Returns null if the attribute is not found.
if ($element->hasAttribute('name')) {
    // code
}
if (isset($element->name)) {
    // code
}
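The three ways of writing, reading, and checking attributes can be combined in one short sketch. The element and attribute values here are hypothetical, as is the autoloader path.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Element;

// Hypothetical element used only for this example.
$element = new Element('input');

// Three equivalent ways to set an attribute.
$element->setAttribute('name', 'username');
$element->attr('type', 'text');
$element->placeholder = 'Your name';

// Three equivalent ways to read one.
$name = $element->getAttribute('name');
$type = $element->attr('type');
$placeholder = $element->placeholder;

// A missing attribute yields null and fails both existence checks.
$missing = $element->getAttribute('value');
$hasName = $element->hasAttribute('name');
$hasValue = isset($element->value);

var_dump($name, $type, $placeholder, $missing, $hasName, $hasValue);
```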
$element = new Element('span', 'hello');
$element2 = new Element('span', 'hello');
// bool(true)
var_dump($element->is($element));
// bool(false)
var_dump($element->is($element2));
$list = new Element('ul');
$item = new Element('li', 'Item 1');
$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];
$list = new Element('ul');
$item = new Element('li', 'Item 1');
$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];
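A list and its items like the ones above could be put together with appendChild(), which is assumed here to accept either a single element or an array of elements. The autoloader path is also an assumption.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Element;

$list = new Element('ul');

// Append a single element.
$list->appendChild(new Element('li', 'Item 1'));

// Append several elements at once.
$list->appendChild([
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
]);

// Read the items back to confirm the order.
$texts = [];
foreach ($list->find('li') as $li) {
    $texts[] = $li->text();
}

echo implode(', ', $texts), "\n";
```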
$element = new Element('span', 'hello');
Warning: you can replace only those elements that were found directly in the document:
// nothing will happen
$document->first('head')->first('title')->replace($title);
// but this will do
$document->first('head title')->replace($title);
More about this in section Search for elements.
Warning: you can remove only those elements that were found directly in the document:
// nothing will happen
$document->first('head')->first('title')->remove();
// but this will do
$document->first('head title')->remove();
More about this in section Search for elements.
The cache is an array of XPath expressions that were converted from CSS selectors.
use DiDom\Query;
$xpath = Query::compile('h2');
$compiled = Query::getCompiled();
// array('h2' => '//h2')
Query::setCompiled(['h2' => '//h2']);
By default, whitespace preservation is disabled. You can enable the preserveWhiteSpace option before loading the document:
$document = new Document();
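A minimal sketch of turning the option on before loading, assuming the preserveWhiteSpace() method on the document is what enables it. The indented XML and the autoloader path are made up for the example.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical XML with indentation between elements.
$xml = "<list>\n  <item>One</item>\n  <item>Two</item>\n</list>";

$document = new Document();

// Enable whitespace preservation first, then load the markup.
$document->preserveWhiteSpace();
$document->loadXml($xml);

$itemCount = $document->count('item');

echo $document->xml(), "\n";
```

The order matters: the option is set on the empty document so that it is already in effect when the markup is parsed.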
The count() method counts children that match the selector:
// prints the number of links in the document
echo $document->count('a');
// prints the number of items in the list
echo $document->first('ul')->count('li');
Returns true if the node matches the selector:
// strict match
// returns true if the element is a div with the id "content" and nothing else
// if the element has any other attributes the method returns false
$element->matches('div#content', true);
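The difference between the default and the strict match can be sketched with two elements, one bare and one carrying an extra class. The markup and the autoloader path are assumptions.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical markup: the same id is reused only to keep the example short.
$document = new Document('<div id="content">A</div><div id="content" class="main">B</div>');

$divs = $document->find('div');
$plain = $divs[0];
$withClass = $divs[1];

// Default match: extra attributes on the element are ignored.
$looseWithClass = $withClass->matches('div#content');

// Strict match: the element must have exactly what the selector names.
$strictPlain = $plain->matches('div#content', true);
$strictWithClass = $withClass->matches('div#content', true);

var_dump($looseWithClass, $strictPlain, $strictWithClass);
```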
Checks whether a node is an element (DOMElement):
Checks whether a node is a text node (DOMText):
Checks whether a node is a comment (DOMComment):
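These three checks can be sketched with the isElementNode(), isTextNode(), and isCommentNode() methods, applied to the children of a small element. The markup and the autoloader path are assumptions.

```php
<?php
// Assumption: DiDOM was installed with Composer and the autoloader lives here.
require 'vendor/autoload.php';

use DiDom\Document;

// Hypothetical markup with one text node, one element, and one comment.
$document = new Document('<div>Foo<span>Bar</span><!--Baz--></div>');
$div = $document->first('div');

// Classify each child node by its type.
$types = [];
foreach ($div->children() as $child) {
    if ($child->isElementNode()) {
        $types[] = 'element';
    } elseif ($child->isTextNode()) {
        $types[] = 'text';
    } elseif ($child->isCommentNode()) {
        $types[] = 'comment';
    }
}

echo implode(', ', $types), "\n";
```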
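From time to time, developers need to scrape web pages to get information from a website. For example, suppose you are working on a personal project and need to get geographic information about the capitals of different countries from Wikipedia. Entering this by hand would take a lot of time. However, you can do it very quickly by scraping the Wikipedia page with PHP. You can also parse the HTML automatically to extract specific information instead of going through the whole markup by hand. In this tutorial, we will learn about DiDOM, a fast and easy-to-use HTML parser. We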