Return a lowercase ASCII string from a (possibly encoded) string fetched using urllib2 or BeautifulSoup

汝吕恭

2023-03-14

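Question:

I am using urllib2 to fetch data from web pages. The content of all the pages is in English, so handling non-English text is not an issue. However, the pages are encoded, and they sometimes contain HTML entities such as £ or the copyright symbol.

I want to check whether certain parts of a page contain particular keywords, and the check must be case-insensitive (for obvious reasons).

What is the best way to convert the returned page content to all lowercase?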

import urllib2

def get_page_content_as_lower_case(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    temp = page.read()

    return str(temp).lower()  # this doesn't work because the page contains utf-8 data

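[[Update]]

I don't have to use urllib2 to fetch the data. In fact I may use BeautifulSoup instead, since I need to retrieve data from specific elements in the page, for which BS is a much better choice. I have changed the title to reflect this.

However, the problem remains that the fetched data is in some non-ASCII encoding (supposedly UTF-8). I did check one of the pages, and its encoding was iso-8859-1.

Since I am only concerned with English, I want to know how to obtain a lowercase ASCII string version of the data retrieved from the page, so that I can run a plain (case-sensitive) test on the lowercased text to see whether a keyword occurs in it.

I assume that restricting myself to English only (from English-language websites) reduces the choice of encodings. I don't know much about encodings, but I assume the valid choices are:

ASCII
iso-8859-1
utf-8

Is this a valid assumption? If so, perhaps there is a way to write a "robust" function that accepts an encoded string containing English text and returns a lowercase ASCII string version of it?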

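Answer:

BeautifulSoup stores data as Unicode internally, so you don't need to perform character encoding manipulations manually.

To find keywords (case-insensitively) in the text (not in attribute values or tag names):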
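For readers who still want the literal lowercase ASCII string that the question asks for, without going through BeautifulSoup, here is a rough sketch. It is written for Python 3 (unlike the Python 2 snippets in this post) and uses only the standard library. The helper names decode_page and to_lower_ascii are made up for illustration. It assumes the page is either UTF-8 or ISO-8859-1, as guessed in the question, and it silently drops every character that has no ASCII equivalent, which may or may not be acceptable.

```python
import html
import unicodedata


def decode_page(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode raw bytes by trying each candidate encoding in order.

    Assumption: the page uses one of the listed encodings. ISO-8859-1
    accepts any byte sequence, so it works as a last-resort fallback.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passes a list without a catch-all encoding.
    return raw.decode("ascii", errors="replace")


def to_lower_ascii(raw):
    """Return a lowercase, ASCII-only version of the page bytes."""
    # Turn HTML entities such as &pound; or &copy; into real characters.
    text = html.unescape(decode_page(raw))
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not ASCII (combining marks, symbols, ...).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


if __name__ == "__main__":
    page_bytes = b"Caf\xc3\xa9 &pound;5 &copy; POST"
    print("post" in to_lower_ascii(page_bytes))
```

A keyword test then becomes a plain substring check on the returned string. Note that this searches the whole page, markup included, so it does not distinguish text from tag names or attribute values.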

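#!/usr/bin/env python
import urllib2
from contextlib import closing

import regex # pip install regex
from BeautifulSoup import BeautifulSoup

URL = 'http://example.com/' # placeholder, replace with the page to search

with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)
    # (?fi): case-insensitive match with full case folding;
    # \L<keywords>: match any item of the named list
    print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                                  keywords=['your', 'keywords', 'go', 'here']))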

Example (Unicode words by @tchrist)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex
from BeautifulSoup import BeautifulSoup, Comment

html = u'''<div attr="PoSt in attribute should not be found">
<!-- it must not find post inside a comment either -->
<ol> <li> tag names must not match
<li> Post will be found
<li> the same with post
<li> and poﬆ
<li> and poﬅ
<li> this is ignored
</ol>
</div>'''

soup = BeautifulSoup(html)

# remove comments
comments = soup.findAll(text=lambda t: isinstance(t, Comment))
for comment in comments:
    comment.extract()

# find text with keywords (case-insensitive)
print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>', opts=['post', 'li'])))
# compare it with '.lower()'
print '.lower():'
print ''.join(soup(text=lambda t: any(k in t.lower() for k in ['post', 'li'])))
# or exact match
print 'exact match:'
print ''.join(soup(text=' the same with post\n'))

Output

 Post will be found
 the same with post
 and poﬆ
 and poﬅ

.lower():
 Post will be found
 the same with post

exact match:
 the same with post
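
The difference between the first two outputs comes from full case folding: with the (?fi) flags the regex module folds the ligatures U+FB06 and U+FB05 to "st", whereas .lower() leaves them untouched. In Python 3 a similar effect seems achievable without third-party modules through str.casefold(). A minimal sketch, written for Python 3 rather than the Python 2 used above, where the helper name contains_keyword is hypothetical:

```python
def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text, ignoring case.

    Both sides go through str.casefold(), which applies full Unicode
    case folding, so the ligatures U+FB05 and U+FB06 compare equal to "st".
    """
    folded = text.casefold()
    return any(keyword.casefold() in folded for keyword in keywords)


if __name__ == "__main__":
    samples = [
        "Post will be found",
        "and po\ufb06",      # "po" + LATIN SMALL LIGATURE ST
        "and po\ufb05",      # "po" + LATIN SMALL LIGATURE LONG S T
        "this is ignored",
    ]
    print([contains_keyword(s, ["post"]) for s in samples])
    # prints [True, True, True, False]
```

This only covers the matching step. Restricting the search to text nodes, as the BeautifulSoup example does, still requires parsing the page.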
