Scraping Netease BUFF skin listings with Scrapy: continuing to optimize the previous version.
During later testing I found that pause-and-resume crawling no longer worked. A closer look showed the cause was the URLs we build ourselves (a pit of our own digging). The relevant code, in arms.py, is:
base_data['biglable_link'] = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(base_data['value'], int(time.time() * 1000))
# and
next_url = 'https://buff.163.com/api/market/goods?game=csgo&page_num={}&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(page, base_data['value'], int(time.time() * 1000))
In the previous article I wondered whether the server actually validates the timestamp in the URL, since each request carrying a fresh timestamp returned a different data page. I have now tried it, and leaving the timestamp out makes little difference. The real problem is that the timestamp means every request URL embeds the current time and therefore changes on every run, while resumable crawling works by deduplicating URLs, so the ever-changing URLs break the resume feature.
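A quick way to see the problem: Scrapy's dupefilter (and scrapy-redis's RFPDupeFilter) deduplicates by request fingerprint, a hash computed over the URL, so two requests that differ only in the `_` timestamp count as two different requests and are never filtered. A minimal sketch (the `category_group` value here is just a made-up example):

import time

import scrapy
from scrapy.utils.request import request_fingerprint

# The category_group value is only an example; any stable query string works.
base = ('https://buff.163.com/api/market/goods?game=csgo&page_num=1'
        '&category_group=knife&use_suggestion=0&trigger=undefined_trigger')

r1 = scrapy.Request(base + '&_={}'.format(int(time.time() * 1000)))
time.sleep(0.01)  # make sure the millisecond timestamp changes
r2 = scrapy.Request(base + '&_={}'.format(int(time.time() * 1000)))

print(request_fingerprint(r1) == request_fingerprint(r2))  # False: "same" page, two fingerprints
print(request_fingerprint(scrapy.Request(base)) == request_fingerprint(scrapy.Request(base)))  # True once the timestamp is gone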
The fix is as follows: modify arms.py and settings.py, remove the timestamp from the URL, and change the pagination rule. The old pagination was broken because a newly lpush-ed request got popped again straight away.
import json

import jsonpath
import scrapy

from ..items import BuffItem  # adjust to the import path of your project's items module


class ArmsSpider(scrapy.Spider):
    name = 'arms'
    allowed_domains = ['buff.163.com']
    start_urls = ['https://buff.163.com/market/csgo']

    # Logged-in cookies copied from the browser; replace them with your own session.
    cookie = '_ntes_nnid=2168b19b62d64bb37f40162a1fd999cf,1656839072318; _ntes_nuid=2168b19b62d64bb37f40162a1fd999cf; Device-Id=zteGfLiffEYmzr7pzqXn; _ga=GA1.2.1822956190.1656920597; vinfo_n_f_l_n3=4f2cffc01c7d98e1.1.0.1657365123345.0.1657365133193; hb_MA-8E16-605C3AFFE11F_source=www.baidu.com; hb_MA-AC55-420C68F83864_source=www.baidu.com; __root_domain_v=.163.com; _qddaz=QD.110858392929324; P_INFO=18958675241|1658722544|1|netease_buff|00&99|null&null&null#gux&450300#10#0|&0||18958675241; remember_me=U1095406721|LG3tz94sUOGVVIXZQjo8lJ1AwzVQbaMk; session=1-AUb3OJyXgFPRIXA0K2S1FiSg4UJqurpwwPEb4RrAolCS2038696921; Locale-Supported=zh-Hans; game=csgo; Hm_lvt_eaa57ca47dacb4ad4f5a257001a3457c=1656920596,1658582225,1658721676,1658818477; Hm_lpvt_eaa57ca47dacb4ad4f5a257001a3457c=1658819822; csrf_token=Ijc0OWRkYzY5YTY1ZGE3MzU3MmIzOWFmNTM3NGJiNGEzMmY3MDlkNjQi.FcEmeA.YCnFQ4ERqsIOrs47b6Jtu6e7IKk'
    # Parse the cookie string once; split on the first '=' only and strip the spaces after each ';'.
    cookies = {data.split('=', 1)[0].strip(): data.split('=', 1)[1] for data in cookie.split(';')}

    def parse(self, response):
        # One div per big category in the type selector at the top of the market page.
        node_list = response.xpath('//*[@class="h1z1-selType type_csgo"]/div')
        for node in node_list:
            base_data = {}
            base_data['biglabel'] = node.xpath('.//p/text()').get()
            base_data['value'] = node.xpath('.//p/@value').get()
            # No timestamp parameter any more, so the URL stays stable and can be deduplicated.
            base_data['biglable_link'] = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group={}&use_suggestion=0&trigger=undefined_trigger'.format(base_data['value'])
            yield scrapy.Request(
                url=base_data['biglable_link'],
                callback=self.parse_img,
                meta={'base_data': base_data},
                cookies=self.cookies,
            )

    def parse_img(self, response):
        base_data = response.meta['base_data']
        json_data = json.loads(response.text)
        # Pull the item fields out of the JSON payload as parallel lists.
        id = jsonpath.jsonpath(json_data, '$..items[*].id')
        name = jsonpath.jsonpath(json_data, '$..items[*].name')
        market_name = jsonpath.jsonpath(json_data, '$..items[*].market_hash_name')
        price = jsonpath.jsonpath(json_data, '$..items[*].sell_min_price')
        exterior_wear = jsonpath.jsonpath(json_data, '$..info.tags.exterior.localized_name')
        quality = jsonpath.jsonpath(json_data, '$..info.tags.quality.localized_name')
        rarity = jsonpath.jsonpath(json_data, '$..info.tags.rarity.localized_name')
        type = jsonpath.jsonpath(json_data, '$..info.tags.type.localized_name')
        weapon_type = jsonpath.jsonpath(json_data, '$..info.tags.weapon.localized_name')
        for i in range(len(id)):
            item = BuffItem()
            item['biglabel'] = base_data['biglabel']
            item['biglabel_link'] = base_data['biglable_link']
            item['id'] = id[i]
            item['name'] = name[i]
            item['market_name'] = market_name[i]
            item['price'] = price[i]
            # Some categories are missing tag fields; fall back to an empty string.
            item['exterior_wear'] = exterior_wear[i] if exterior_wear else ''
            item['quality'] = quality[i] if quality else ''
            item['rarity'] = rarity[i] if rarity else ''
            item['type'] = type[i] if type else ''
            item['weapon_type'] = weapon_type[i] if weapon_type else ''
            yield item
        # Queue every page of this category; pages that were already crawled
        # (including page 1) are dropped by the dupefilter.
        pages = jsonpath.jsonpath(json_data, '$.data.total_page')[0]
        for page in range(1, pages + 1):
            next_url = 'https://buff.163.com/api/market/goods?game=csgo&page_num={}&category_group={}&use_suggestion=0&trigger=undefined_trigger'.format(page, base_data['value'])
            yield scrapy.Request(
                url=next_url,
                callback=self.parse_img,
                # Copy base_data so every page keeps its own link, instead of all
                # items pointing at the last URL that was assigned.
                meta={'base_data': {**base_data, 'biglable_link': next_url}},
                cookies=self.cookies,
            )
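The settings.py side is not shown above. What matters for resuming is that the scheduler queue and the dupefilter live in Redis and survive a shutdown; a minimal sketch of the usual scrapy-redis switches looks like the following (the Redis address is a placeholder, adjust it to your own setup):

# settings.py -- minimal scrapy-redis persistence sketch (values are placeholders)
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'                # keep the request queue in Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'    # dedup by request fingerprint in Redis
SCHEDULER_PERSIST = True                                      # do not flush the queue/fingerprints on close
REDIS_URL = 'redis://127.0.0.1:6379'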
After we pause, the spider resumes by pulling tasks from the requests queue and keeps working through the URLs left in it instead of starting from scratch.
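To convince yourself that the state really survives a pause, you can peek at Redis directly. A small check with redis-py, assuming scrapy-redis's default key names for a spider called arms ('arms:requests' and 'arms:dupefilter'); the queue's data type depends on SCHEDULER_QUEUE_CLASS:

import redis

r = redis.Redis(host='127.0.0.1', port=6379)  # placeholder address

queue_key = 'arms:requests'
key_type = r.type(queue_key)  # b'zset' for the default PriorityQueue, b'list' for Fifo/LifoQueue
pending = r.zcard(queue_key) if key_type == b'zset' else r.llen(queue_key)
print('pending requests :', pending)                      # what the resumed run will consume
print('seen fingerprints:', r.scard('arms:dupefilter'))   # URLs that will not be re-crawled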