
Optimizing the BUFF CSGO item scraper

宋宏儒
2023-12-01


This post continues optimizing the previous version of the Scrapy spider that collects item information from NetEase BUFF.

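In later testing I found that resumable crawling (stopping the spider and picking up where it left off) did not work. The cause turned out to be the URLs we built ourselves (I dug the pit and then fell into it). The relevant code, from arms.py, is below.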

base_data['biglable_link'] = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(base_data['value'], int(time.time() * 1000))

# and
next_url = 'https://buff.163.com/api/market/goods?game=csgo&page_num={}&category_group={}&use_suggestion=0&trigger=undefined_trigger&_={}'.format(page, base_data['value'], int(time.time() * 1000))

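In the previous post I wondered whether the server validates the timestamp in the URL, because every request carrying a new timestamp seemed to return a different page of data. I just tested this and found that leaving the timestamp out makes little difference. Keeping it, however, meant that every request URL carried the current timestamp and was therefore different each time. Resumable crawling works by deduplicating URLs, so the ever-changing URLs are what broke it.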
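
To make the effect concrete, here is a minimal sketch using only the standard library. It is not Scrapy's actual fingerprint routine: the `fingerprint` and `strip_param` helpers, the SHA-1 choice, and the `category_group=knife` value are all assumptions made for illustration. Two URLs that differ only in the `_` parameter hash differently, and hash identically once that parameter is dropped.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

API = "https://buff.163.com/api/market/goods"


def strip_param(url, name="_"):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def fingerprint(url):
    """Hypothetical stand-in for a dedup key: SHA-1 of the URL string."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Same page requested twice, one second apart (illustrative values).
first = API + "?game=csgo&page_num=1&category_group=knife&_=1658819822000"
second = API + "?game=csgo&page_num=1&category_group=knife&_=1658819823000"

same_with_timestamp = fingerprint(first) == fingerprint(second)
same_without_timestamp = fingerprint(strip_param(first)) == fingerprint(strip_param(second))
print(same_with_timestamp, same_without_timestamp)  # False True
```

With the timestamp in place, a URL-based duplicate filter treats every request as new, so nothing recorded in a previous run can ever match.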

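To fix this, modify arms.py and settings.py as follows: remove the timestamp from the URLs and change the pagination rule. The previous pagination was flawed, because each new request was popped right after being lpush-ed.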

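import json

import jsonpath
import scrapy

# Assumption: BuffItem is defined in the project's items.py, one level above the spiders package.
from ..items import BuffItem


class ArmsSpider(scrapy.Spider):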
    name = 'arms'
    allowed_domains = ['buff.163.com']
    start_urls = ['https://buff.163.com/market/csgo']

    def parse(self, response):
        node_list = response.xpath('//*[@class="h1z1-selType type_csgo"]/div')
        for node in node_list:
            base_data = {}
            base_data['biglabel'] = node.xpath('.//p/text()').get()
            base_data['value'] = node.xpath('.//p/@value').get()
            base_data['biglable_link'] = 'https://buff.163.com/api/market/goods?game=csgo&page_num=1&category_group={}&use_suggestion=0&trigger=undefined_trigger'.format(base_data['value'])
            cookie = '_ntes_nnid=2168b19b62d64bb37f40162a1fd999cf,1656839072318; _ntes_nuid=2168b19b62d64bb37f40162a1fd999cf; Device-Id=zteGfLiffEYmzr7pzqXn; _ga=GA1.2.1822956190.1656920597; vinfo_n_f_l_n3=4f2cffc01c7d98e1.1.0.1657365123345.0.1657365133193; hb_MA-8E16-605C3AFFE11F_source=www.baidu.com; hb_MA-AC55-420C68F83864_source=www.baidu.com; __root_domain_v=.163.com; _qddaz=QD.110858392929324; P_INFO=18958675241|1658722544|1|netease_buff|00&99|null&null&null#gux&450300#10#0|&0||18958675241; remember_me=U1095406721|LG3tz94sUOGVVIXZQjo8lJ1AwzVQbaMk; session=1-AUb3OJyXgFPRIXA0K2S1FiSg4UJqurpwwPEb4RrAolCS2038696921; Locale-Supported=zh-Hans; game=csgo; Hm_lvt_eaa57ca47dacb4ad4f5a257001a3457c=1656920596,1658582225,1658721676,1658818477; Hm_lpvt_eaa57ca47dacb4ad4f5a257001a3457c=1658819822; csrf_token=Ijc0OWRkYzY5YTY1ZGE3MzU3MmIzOWFmNTM3NGJiNGEzMmY3MDlkNjQi.FcEmeA.YCnFQ4ERqsIOrs47b6Jtu6e7IKk'
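            # Split pairs on '; ' and each pair on the first '=' only, so keys have no leading space and values may contain '='.
            cookies = dict(pair.split('=', 1) for pair in cookie.split('; '))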
            yield scrapy.Request(
                url=base_data['biglable_link'],
                callback=self.parse_img,
                meta={'base_data': base_data},
                cookies=cookies,
            )

    def parse_img(self, response):
        base_data = response.meta['base_data']
        json_data = json.loads(response.text)
        id = jsonpath.jsonpath(json_data, '$..items[*].id')
        name = jsonpath.jsonpath(json_data, '$..items[*].name')
        market_name = jsonpath.jsonpath(json_data, '$..items[*].market_hash_name')
        price = jsonpath.jsonpath(json_data, '$..items[*].sell_min_price')
        exterior_wear = jsonpath.jsonpath(json_data, '$..info.tags.exterior.localized_name')
        quality = jsonpath.jsonpath(json_data, '$..info.tags.quality.localized_name')
        rarity = jsonpath.jsonpath(json_data, '$..info.tags.rarity.localized_name')
        type = jsonpath.jsonpath(json_data, '$..info.tags.type.localized_name')
        weapon_type = jsonpath.jsonpath(json_data, '$..info.tags.weapon.localized_name')
        for i in range(len(id)):
            item = BuffItem()
            item['biglabel'] = base_data['biglabel']
            item['biglabel_link'] = base_data['biglable_link']
            item['id'] = id[i]
            item['name'] = name[i]
            item['market_name'] = market_name[i]
            item['price'] = price[i]
            if not exterior_wear:
                item['exterior_wear'] = ''
            else:
                item['exterior_wear'] = exterior_wear[i]
            if not quality:
                item['quality'] = ''
            else:
                item['quality'] = quality[i]
            if not rarity:
                item['rarity'] = ''
            else:
                item['rarity'] = rarity[i]
            if not type:
                item['type'] = ''
            else:
                item['type'] = type[i]
            if not weapon_type:
                item['weapon_type'] = ''
            else:
                item['weapon_type'] = weapon_type[i]
            yield item

        pages = jsonpath.jsonpath(json_data, '$.data.total_page')[0]
        for page in range(1, pages+1):
            next_url = 'https://buff.163.com/api/market/goods?game=csgo&page_num={}&category_group={}&use_suggestion=0&trigger=undefined_trigger'.format(page, base_data['value'])
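            # Copy the dict so each request carries its own link instead of sharing one mutable object.
            base_data = dict(base_data, biglable_link=next_url)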
            cookie = '_ntes_nnid=2168b19b62d64bb37f40162a1fd999cf,1656839072318; _ntes_nuid=2168b19b62d64bb37f40162a1fd999cf; Device-Id=zteGfLiffEYmzr7pzqXn; _ga=GA1.2.1822956190.1656920597; vinfo_n_f_l_n3=4f2cffc01c7d98e1.1.0.1657365123345.0.1657365133193; hb_MA-8E16-605C3AFFE11F_source=www.baidu.com; hb_MA-AC55-420C68F83864_source=www.baidu.com; __root_domain_v=.163.com; _qddaz=QD.110858392929324; P_INFO=18958675241|1658722544|1|netease_buff|00&99|null&null&null#gux&450300#10#0|&0||18958675241; remember_me=U1095406721|LG3tz94sUOGVVIXZQjo8lJ1AwzVQbaMk; session=1-AUb3OJyXgFPRIXA0K2S1FiSg4UJqurpwwPEb4RrAolCS2038696921; Locale-Supported=zh-Hans; game=csgo; Hm_lvt_eaa57ca47dacb4ad4f5a257001a3457c=1656920596,1658582225,1658721676,1658818477; Hm_lpvt_eaa57ca47dacb4ad4f5a257001a3457c=1658819822; csrf_token=Ijc0OWRkYzY5YTY1ZGE3MzU3MmIzOWFmNTM3NGJiNGEzMmY3MDlkNjQi.FcEmeA.YCnFQ4ERqsIOrs47b6Jtu6e7IKk'
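            # Split pairs on '; ' and each pair on the first '=' only, so keys have no leading space and values may contain '='.
            cookies = dict(pair.split('=', 1) for pair in cookie.split('; '))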
            yield scrapy.Request(
                url=next_url,
                callback=self.parse_img,
                meta={'base_data': base_data},
                cookies=cookies,
                )
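
The paging rule can also be looked at in isolation. The sketch below is a variant under stated assumptions, not the spider's own code: the hypothetical helper `page_urls` builds timestamp-free URLs for pages 2 through `total_page`, assuming page 1 has already been fetched by the first request, and `knife` is only an illustrative category value.

```python
from urllib.parse import urlencode

API = "https://buff.163.com/api/market/goods"


def page_urls(category_group, total_page):
    """Build stable (timestamp-free) URLs for pages 2..total_page of one category."""
    urls = []
    for page in range(2, total_page + 1):
        query = urlencode({
            "game": "csgo",
            "page_num": page,
            "category_group": category_group,
            "use_suggestion": 0,
            "trigger": "undefined_trigger",
        })
        urls.append(f"{API}?{query}")
    return urls


urls = page_urls("knife", 3)
for url in urls:
    print(url)
```

Because the same inputs always produce the same URLs, a URL-based duplicate filter can recognise pages that were already requested in an earlier run.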

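After we pause and restart, the spider takes tasks from the request queue and keeps consuming the URLs stored there, instead of starting over from the beginning.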
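
The post does not show the settings.py changes. The fragment below is a sketch of what they might look like, assuming the request queue is provided by scrapy-redis (the lpush/pop wording suggests a Redis list, but that is an inference) and assuming a local Redis instance at the URL shown.

```python
# settings.py (fragment) -- assumes the scrapy-redis package is installed.

# Store the request queue in Redis instead of in memory.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep seen-request fingerprints in Redis so they survive a restart.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Do not clear the queue and fingerprint set when the spider closes.
SCHEDULER_PERSIST = True

# Assumption: Redis runs locally on the default port.
REDIS_URL = "redis://localhost:6379"
```

With `SCHEDULER_PERSIST` enabled, the pending requests and the fingerprint set stay in Redis after the spider stops, which is what lets a later run continue from the queue.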

 

 
