Scrapy Study Notes: Selectors (Part 1)

张炳

2023-12-01

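When scraping web pages, the most common task is extracting data from the HTML source. Several libraries can do this. BeautifulSoup is a very popular web scraping library among Python programmers; it builds a Python object from the structure of the HTML and copes well with bad markup, but it has one drawback: it is slow. lxml is an XML parsing library (it also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)
Scrapy comes with its own mechanism for extracting data. These objects are called selectors because they "select" certain parts of the HTML document specified by XPath or CSS expressions. XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.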

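Scrapy Selectors are a thin wrapper around the parsel library. The purpose of the wrapper is to provide better integration with Scrapy Response objects. parsel is a stand-alone web scraping library that can be used without Scrapy. It uses lxml under the hood and implements a simple API on top of the lxml API. This means Scrapy selectors are very similar to lxml in speed and parsing accuracy.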

"""
XPath selectors based on lxml
"""
from parsel import Selector as _ParselSelector
from scrapy.utils.trackref import object_ref
from scrapy.utils.python import to_bytes
from scrapy.http import HtmlResponse, XmlResponse

__all__ = ['Selector', 'SelectorList']

def _st(response, st):
    if st is None:
        return 'xml' if isinstance(response, XmlResponse) else 'html'
    return st


def _response_from_text(text, st):
    rt = XmlResponse if st == 'xml' else HtmlResponse
    return rt(url='about:blank', encoding='utf-8',
              body=to_bytes(text, 'utf-8'))


class SelectorList(_ParselSelector.selectorlist_cls, object_ref):
    """
    The :class:`SelectorList` class is a subclass of the builtin ``list``
    class, which provides a few additional methods.
    """


class Selector(_ParselSelector, object_ref):
    """
    An instance of :class:`Selector` is a wrapper over response to select
    certain parts of its content.

    ``response`` is an :class:`~scrapy.http.HtmlResponse` or an
    :class:`~scrapy.http.XmlResponse` object that will be used for selecting
    and extracting data.

    ``text`` is a unicode string or utf-8 encoded text for cases when a
    ``response`` isn't available. Using ``text`` and ``response`` together is
    undefined behavior.

    ``type`` defines the selector type, it can be ``"html"``, ``"xml"``
    or ``None`` (default).

    If ``type`` is ``None``, the selector automatically chooses the best type
    based on ``response`` type (see below), or defaults to ``"html"`` in case it
    is used together with ``text``.

    If ``type`` is ``None`` and a ``response`` is passed, the selector type is
    inferred from the response type as follows:

    * ``"html"`` for :class:`~scrapy.http.HtmlResponse` type
    * ``"xml"`` for :class:`~scrapy.http.XmlResponse` type
    * ``"html"`` for anything else

    Otherwise, if ``type`` is set, the selector type will be forced and no
    detection will occur.
    """

    __slots__ = ['response']
    selectorlist_cls = SelectorList

    def __init__(self, response=None, text=None, type=None, root=None, **kwargs):
        if not(response is None or text is None):
            raise ValueError('%s.__init__() received both response and text'
                             % self.__class__.__name__)
        # Resolve the selector type ("html" or "xml")
        st = _st(response, type or self._default_type)
        # Normalize the input: raw text is wrapped in a response object
        if text is not None:
            response = _response_from_text(text, st)

        if response is not None:
            text = response.text
            kwargs.setdefault('base_url', response.url)

        self.response = response
        super(Selector, self).__init__(text=text, type=st, root=root, **kwargs)

Selector objects scrapy.selector.Selector

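An instance of Selector is a wrapper over a response that selects certain parts of its content. response is an HtmlResponse or XmlResponse object used for selecting and extracting data. text is a unicode string or utf-8 encoded text for cases when a response isn't available. Using text and response together is undefined behavior; in other words, pass only one of response and text (the constructor shown above raises ValueError when both are given). type defines the selector type; it can be "html", "xml" or None (default).
If type is None and a response is passed, the selector type is inferred from the response type: "html" for HtmlResponse, "xml" for XmlResponse, and "html" for anything else. Otherwise, if type is set, the selector type is forced and no detection occurs.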

xpath(query, namespaces=None, **kwargs)
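Find nodes matching the XPath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too. query is a string containing the XPath query to apply. namespaces is an optional prefix: namespace-uri mapping (dict) for prefixes additional to those registered with register_namespace(prefix, uri). Contrary to register_namespace(), these prefixes are not saved for future calls. Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.: selector.xpath('//a[@href=$url]', url="http://www.example.com")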

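css(query) Apply the given CSS selector and return a SelectorList instance. query is a string containing the CSS selector to apply. Behind the scenes, the CSS query is translated into an XPath query using the cssselect library, and the .xpath() method is then run.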

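get() Serialize and return the matched node as a single unicode string. Percent-encoded content is unquoted.

attrib Return the attributes dictionary for the underlying element.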

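re(regex, replace_entities=True) Apply the given regex and return a list of unicode strings with the matches. regex can be either a compiled regular expression or a string, which is compiled to a regular expression using re.compile(regex). By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.

re_first(regex, default=None, replace_entities=True) Apply the given regex and return the first unicode string that matches. If there is no match, return the default value (None if the argument is not provided). By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.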

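register_namespace(prefix, uri) Register the given namespace to be used in this selector. Without registering namespaces, you cannot select or extract data from non-standard namespaces.

remove_namespaces() Remove all namespaces, allowing the document to be traversed using namespace-less XPaths.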

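__bool__() Return True if any real content is selected, or False otherwise. In other words, the boolean value of a Selector is given by the contents it selects.

getall() Serialize and return the matched node in a 1-element list of unicode strings. This method was added to Selector for consistency; it is more useful with SelectorList.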

SelectorList objects scrapy.selector.SelectorList

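The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.
xpath(xpath, namespaces=None, **kwargs) Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.
The xpath argument is the same as query in Selector.xpath(). namespaces is an optional prefix: namespace-uri mapping (dict) for prefixes additional to those registered with register_namespace(prefix, uri). Contrary to register_namespace(), these prefixes are not saved for future calls.

Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.: selector.xpath('//a[@href=$url]', url="http://www.example.com")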

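css(query) Call the .css() method for each element in this list and return their results flattened as another SelectorList. query is the same argument as the one in Selector.css().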

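getall() Call the .get() method for each element in this list and return their results flattened, as a list of unicode strings.

get(default=None) Return the result of .get() for the first element in this list. If the list is empty, return the default value.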

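re(regex, replace_entities=True) Call the .re() method for each element in this list and return their results flattened, as a list of unicode strings. By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.

re_first(regex, default=None, replace_entities=True)
Call the .re() method for the first element in this list and return the result as a unicode string. If the list is empty or the regex doesn't match anything, return the default value (None if the argument is not provided). By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.

attrib Return the attributes dictionary for the first element. If the list is empty, return an empty dict.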

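Usage examples

from scrapy.http import HtmlResponse, XmlResponse
from scrapy.selector import Selector

# Sample responses for illustration only; in a spider they come from the downloader.
html_response = HtmlResponse(url="http://example.com", encoding="utf-8",
                             body=b'<h1>Title</h1><p class="intro">Hello</p>')
xml_response = XmlResponse(url="http://example.com/feed.xml", encoding="utf-8",
                           body=b'<feed xmlns:g="http://base.google.com/ns/1.0">'
                                b'<product><g:price>9.99</g:price></product></feed>')

sel = Selector(html_response)
sel.xpath("//h1")  # Select all <h1> elements from an HTML response body, returning a list of Selector objects
sel.xpath("//h1").getall()         # this includes the h1 tag
sel.xpath("//h1/text()").getall()  # this excludes the h1 tag
for node in sel.xpath("//p"):
    print(node.attrib['class'])    # raises KeyError for a <p> without a class attribute

sel = Selector(xml_response)
sel.xpath("//product")
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()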
