Python HTML parsing with Beautiful Soup and filtering stop words

咸臻

2023-03-14

Question:

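I am parsing specific information from a website into a file. Right now, my program looks at a web page, finds the right HTML tag, and parses out the right contents. Now I want to filter these "results" further.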

For example, on the site: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

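I am parsing the ingredients, which are located in the <div class="ingredients"> tag. The parser does this job nicely, but I want to process the results further.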

When I run this parser, it removes numbers, symbols, commas, and slashes (\ or /) but leaves all the text. When I run it on the website, I get results like:

cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika
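
To illustrate the stripping step described above, here is a minimal Python 3 sketch of how `str.strip` behaves when given a set of characters. The sample strings and the helper name `strip_ends` are assumptions for illustration and are not taken from the recipe page. The character set is the one used in the question's code; note that it omits `0`, so a quantity such as `10` would leave a `0` behind, and that `strip` only trims the two ends of the string.

```python
# Characters the question's parser strips from both ends of each item.
# '0' is absent from the original set; it is kept that way here on purpose.
STRIP_CHARS = '123456789.,/\\ '


def strip_ends(text):
    """Remove any of STRIP_CHARS from the start and end of text only."""
    return text.strip(STRIP_CHARS)


if __name__ == '__main__':
    # Hypothetical ingredient lines, not scraped from the site.
    for item in ['1/4 cup olive oil', '1/2 cup chicken broth', '1 tablespoon paprika']:
        print(strip_ends(item))
```

Anything in the middle of the string, including punctuation between words, is left as it was.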

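Now I want to process this further by removing stop words such as "cup", "cloves", "minced", "tablespoon", and others. How exactly would I do this? This code is written in Python, which I am not very good at; I am only using this parser to gather information that I could enter by hand, but I would rather not.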

Any detailed help on how to do this would be appreciated! My code is below.

Code:

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

Answer:

import urllib2
import BeautifulSoup
import string

badwords = set([
    'cup','cups',
    'clove','cloves',
    'tsp','teaspoon','teaspoons',
    'tbsp','tablespoon','tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # strip digits and punctuation from both ends (strip() leaves the middle untouched)
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()

The result is:

olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste

I don't know why it left the comma in there; s.strip(string.punctuation) should have taken care of that.
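
A likely explanation for the leftover comma: `str.strip` only removes characters from the start and end of the whole string, so a comma sitting between two words survives, and once the following stop word is dropped it ends up attached to the last remaining word. One possible fix, sketched below in Python 3, is to strip digits and punctuation from each word separately. The function name `clean_ingredient`, the lowercase comparison, and the sample input lines are assumptions added for illustration; the stop-word list is the one from the answer.

```python
import string

# Stop words from the answer above.
BADWORDS = {
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
}

# Characters to trim from both ends of each individual word.
TRIM_CHARS = string.digits + string.punctuation


def clean_ingredient(text):
    """Drop quantities, edge punctuation, and stop words from one line."""
    kept = []
    for token in text.split():
        # Trim per word, so "garlic," becomes "garlic" and "1/4" becomes "".
        word = token.strip(TRIM_CHARS)
        # Skip tokens that were only digits/punctuation, and stop words.
        if word and word.lower() not in BADWORDS:
            kept.append(word)
    return ' '.join(kept)


if __name__ == '__main__':
    # Hypothetical input lines, written to resemble the recipe's items.
    for line in ['1/4 cup olive oil', '2 cloves garlic, minced', '1 tablespoon paprika']:
        print(clean_ingredient(line))
```

In the scraper above, this function could stand in for `cleanIngred` without other changes, since it takes and returns a single string. Punctuation inside a word (for example a hyphen in the middle) is still kept, which is probably what you want for ingredient names.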
