Scrapy: three ways to get tag attributes (and text) with selector.css and selector.xpath (Scrapy 1.6)

冯招
2023-12-01

 

        from scrapy.selector import Selector

        text = '''<ul>
        <li class="toctree-l1"><a class="reference internal" href="intro/overview.html">Scrapy at a glance</a></li>
        <li class="toctree-l1"><a class="reference internal" href="intro/install.html">Installation guide</a></li>
        <li class="toctree-l1"><a class="reference internal" href="intro/tutorial.html">Scrapy Tutorial</a></li>
        <li class="toctree-l1"><a class="reference internal" href="intro/examples.html">Examples</a></li>
        </ul>
        <p class="caption"><span class="caption-text">Basic concepts</span></p>
        <ul>
        <li class="toctree-l1"><a class="reference internal" href="topics/commands.html">Command line tool</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/spiders.html">Spiders</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/selectors.html">Selectors</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/items.html">Items</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/loaders.html">Item Loaders</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/shell.html">Scrapy shell</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/item-pipeline.html">Item Pipeline</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/feed-exports.html">Feed exports</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/request-response.html">Requests and Responses</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/link-extractors.html">Link Extractors</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/settings.html">Settings</a></li>
        <li class="toctree-l1"><a class="reference internal" href="topics/exceptions.html">Exceptions</a></li>
        </ul>

        '''

        sel = Selector(text=text)

        ul = sel.css("ul")

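        # CSS: `tag::text` selects the text node, `tag::attr(name)` selects an attribute value
        link_text = ul.css('li a::text').getall()
        href = ul.css('li a::attr(href)').getall()

        # XPath: `tag/@name` selects an attribute value
        href_xpath = ul.xpath('./li/a/@href').getall()

        # Python: `.attrib` exposes an element's attributes as a dict, so this yields a list of dicts
        python_attr = [a.attrib for a in ul.css('li a')]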

 
