I am trying to collect reviews for several hotels on Tripadvisor, and I was able to collect 150 observations: 150 reviews from 30 hotels.
However, when I try to add a new hotel_name column and run the crawl, the hotel name is not repeated for each review, and the number of observations drops to the number of hotels, i.e. 30. How can I copy the hotel name onto every review row?
Here is the code I am using:
import scrapy
from ..items import ReviewItem
import re

class TripAdvisorReview(scrapy.Spider):
    name = "tripadvisor"
    start_urls = ["https://www.tripadvisor.co.uk/Hotels-g186217-England-Hotels.html"]

    def parse(self, response):
        for href in response.css("div.listing_title a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

    def parse_hotel(self, response):
        for info in response.css('div.page'):
            items = ReviewItem()

            hotel_names = info.css('._1mTlpMC3::text').extract()
            hotel_names = [hotel_name.strip() for hotel_name in hotel_names]

            reviewer_names = info.css('._1r_My98y::text').extract()
            reviewer_names = [reviewer_name.strip() for reviewer_name in reviewer_names]

            reviewer_contributions = info.css('._3fPsSAYi:nth-child(1) ._1fk70GUn , ._1TuWwpYf+ ._3fPsSAYi ._1fk70GUn').css('::text').extract()
            reviewer_contributions = [reviewer_contribution.strip() for reviewer_contribution in reviewer_contributions]

            review_dates = info.xpath('//div[@class = "_2fxQ4TOx"]/span[contains(text(),"wrote a review")]/text()').extract()
            review_dates = [review_date.strip() for review_date in review_dates]

            review_stars = info.css('div.nf9vGX55 .ui_bubble_rating').xpath("@class").extract()
            review_stars = [review_star.strip() for review_star in review_stars]

            review_texts = info.css('#component_15 .cPQsENeY').css('::text').extract()
            review_texts = [review_text.strip() for review_text in review_texts]

            #helpful_vote = info.css('._3kbymg8R::text').extract()

            result = zip(hotel_names, reviewer_names, review_dates, review_texts, review_stars, reviewer_contributions)

            for hotel_name, reviewer_name, review_date, review_text, review_star, reviewer_contribution in result:
                items['hotel_name'] = hotel_name
                items['reviewer_name'] = reviewer_name
                items['reviewer_contribution'] = reviewer_contribution
                items['review_date'] = review_date
                items['review_star'] = review_star
                items['review_text'] = review_text
                #items['helpful_vote'] = helpful_vote
                yield items
Your problem is that hotel_names has only one value while the other lists have five values each. Check it:
print('hotel_names:', len(hotel_names))
print('reviewer_names:', len(reviewer_names))
print('review_dates:', len(review_dates))
print('review_stars:', len(review_stars))
print('review_texts:', len(review_texts))
print('reviewer_contributions:', len(reviewer_contributions))
But zip() uses the length of the shortest list to create items, so it creates only one item.
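A quick standalone illustration of that behaviour (a minimal sketch, separate from the spider):

# zip() stops at the shortest input, so a single hotel name
# collapses five reviews into one item
hotel_names = ['Hotel A']
reviewer_names = ['Ann', 'Bob', 'Cat', 'Dan', 'Eve']

print(list(zip(hotel_names, reviewer_names)))
# [('Hotel A', 'Ann')] - the other four reviewers are dropped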
You should use zip() without the hotel names, and then add hotel_names[0] to every item.
# without `hotel_names`
all_reviews = zip(reviewer_names, review_dates, review_texts, review_stars, reviewer_contributions)

hotel_name = hotel_names[0]  # <-- manually get first hotel

# without `hotel_name`
for reviewer_name, review_date, review_text, review_star, reviewer_contribution in all_reviews:
    #items = ReviewItem()
    items = dict()

    items['hotel_name'] = hotel_name  # <-- manually add first hotel
    items['reviewer_name'] = reviewer_name
    items['reviewer_contribution'] = reviewer_contribution
    items['review_date'] = review_date
    items['review_star'] = review_star
    items['review_text'] = review_text
    #items['helpful_vote'] = helpful_vote

    yield items
BTW: there is another problem - review_texts usually has more than 5 items (e.g. 11 items), which means you are using the wrong method to get this text. When I check the CSV I see that it treats `...` as a separate review. You have to change that.
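One quick workaround (my own suggestion, not part of the original code) would be to filter those truncation markers out of review_texts inside parse_hotel before zipping; note the lists may still not line up perfectly, and the per-review loop in the EDIT below is the cleaner fix:

# drop the '...' fragments that Tripadvisor adds for truncated reviews
review_texts = [review_text for review_text in review_texts if review_text != '...']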
Minimal working code. You can put all of the code in one file and run it as python script.py without creating a project, so everyone can test it.
import scrapy
#from ..items import ReviewItem

class TripAdvisorReview(scrapy.Spider):
    name = "tripadvisor"
    start_urls = ["https://www.tripadvisor.co.uk/Hotels-g186217-England-Hotels.html"]

    def parse(self, response):
        for href in response.css("div.listing_title a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

    def parse_hotel(self, response):
        for info in response.css('div.page'):
            hotel_names = info.css('._1mTlpMC3::text').extract()
            hotel_names = [hotel_name.strip() for hotel_name in hotel_names]

            reviewer_names = info.css('._1r_My98y::text').extract()
            reviewer_names = [reviewer_name.strip() for reviewer_name in reviewer_names]

            reviewer_contributions = info.css('._3fPsSAYi:nth-child(1) ._1fk70GUn , ._1TuWwpYf+ ._3fPsSAYi ._1fk70GUn').css('::text').extract()
            reviewer_contributions = [reviewer_contribution.strip() for reviewer_contribution in reviewer_contributions]

            review_dates = info.xpath('//div[@class = "_2fxQ4TOx"]/span[contains(text(),"wrote a review")]/text()').extract()
            review_dates = [review_date.strip() for review_date in review_dates]

            review_stars = info.css('div.nf9vGX55 .ui_bubble_rating').xpath("@class").extract()
            review_stars = [review_star.strip() for review_star in review_stars]

            review_texts = info.css('#component_15 .cPQsENeY').css('::text').extract()
            review_texts = [review_text.strip() for review_text in review_texts]

            #helpful_vote = info.css('._3kbymg8R::text').extract()

            print('hotel_names:', len(hotel_names))
            print('reviewer_names:', len(reviewer_names))
            print('review_dates:', len(review_dates))
            print('review_stars:', len(review_stars))
            print('review_texts:', len(review_texts))
            print('reviewer_contributions:', len(reviewer_contributions))
            print('----')

            # without `hotel_names`
            all_reviews = zip(reviewer_names, review_dates, review_texts, review_stars, reviewer_contributions)

            hotel_name = hotel_names[0]  # <-- manually get first hotel

            # without `hotel_name`
            for reviewer_name, review_date, review_text, review_star, reviewer_contribution in all_reviews:
                #items = ReviewItem()
                items = dict()

                items['hotel_name'] = hotel_name  # <-- manually add first hotel
                items['reviewer_name'] = reviewer_name
                items['reviewer_contribution'] = reviewer_contribution
                items['review_date'] = review_date
                items['review_star'] = review_star
                items['review_text'] = review_text
                #items['helpful_vote'] = helpful_vote

                yield items


# --- run without project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv',
})

c.crawl(TripAdvisorReview)
c.start()
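Run it with python script.py - it prints the list lengths for every hotel page and saves the scraped rows in output.csv in the working directory.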
EDIT: My version without zip(). First I find all the reviews, and then I use a for-loop to work with each review separately. This way I can control the text and skip the `...` - put simply, I take only the first text in each review.
import scrapy
#from ..items import ReviewItem

class TripAdvisorReview(scrapy.Spider):
    name = "tripadvisor"
    start_urls = ["https://www.tripadvisor.co.uk/Hotels-g186217-England-Hotels.html"]

    def parse(self, response):
        for href in response.css("div.listing_title a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

    def parse_hotel(self, response):
        hotel_name = response.css('#HEADING::text').extract_first().strip()
        print('hotel_name:', hotel_name)

        for review in response.xpath('.//div[@data-test-target="HR_CC_CARD"]'):
            name = review.css('._1r_My98y::text').extract_first().strip()
            contribution = review.css('._3fPsSAYi:nth-child(1) ._1fk70GUn , ._1TuWwpYf+ ._3fPsSAYi ._1fk70GUn').css('::text').extract_first().strip()
            date = review.xpath('.//div[@class="_2fxQ4TOx"]/span[contains(text(),"wrote a review")]/text()').extract_first().strip().replace('wrote a review ', '')
            stars = review.css('div.nf9vGX55 .ui_bubble_rating').xpath("@class").extract_first().strip().replace('ui_bubble_rating bubble_', '')
            text = review.xpath('.//div[@class="cPQsENeY"]//span/text()').extract_first().strip()

            #items = ReviewItem()
            items = dict()

            items['hotel_name'] = hotel_name  # <-- manually add first hotel
            items['reviewer_name'] = name
            items['reviewer_contribution'] = contribution
            items['review_date'] = date
            items['review_star'] = stars
            items['review_text'] = text

            yield items


# --- run without project and save in `output.csv` ---
from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv',
})

c.crawl(TripAdvisorReview)
c.start()
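The final .replace('ui_bubble_rating bubble_', '') keeps only the numeric suffix of the class, so the CSV holds values like 40 or 50 (assuming Tripadvisor still encodes the rating as ten times the bubble count in that class name).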