
Flask Series: Novel Website Development, Part 2 (Scraping the Data)

薄欣怿
2023-12-01

The previous post got the database ready; now it is time to fill it. The data comes from a novel website, scraped with the requests library across multiple threads, a little over 50,000 records in total. Let's get started!

Framework analysis
1. Site structure: crawl the mobile version of the site.
3. Pagination: __ pages in total, 50 books per page, __ books in all.
4. Categories: unknown for now.
5. Crawl the listings page by page. A book link looks like <a href="/book_34351/">; extract it with the regex <a href="/(.+?)/">.
6. Save all extracted book paths to a text file.
7. Read the text file back and build each book URL: url = 'http://m.xs52.org/' + the saved path.
8. Open each book page.
9. Extract the title, author, synopsis, category, cover image, and link:
    1. Title: <p><strong>十年一品温如言(全集)</strong></p>, regex: <p><strong>(.+?)</strong></p>
    2. Author: <p>作者:<a href="/modules/article/waps.php?searchtype=author&searchkey=书海沧生">书海沧生</a></p>, regex: ">(.+?)</a></p>, take match [0]
    3. Synopsis: <div class="intro">超级畅销书作家书海沧生口口相传之作,承载千万读者珍贵回忆的青春经典!第一年,她贪图美色,爱上他,一盆水泼出百年的冤家。第二年,他做了爸爸,她做了妈妈。孩子姓言,母温氏。历数十年之期,他们有了百年的家。</div>, regex: <div class="intro">(.+?)</div>
    4. Category: <p>类别:<a href="/wapsort/3_1.html">都市小说</a></p>, regex: ">(.+?)</a></p>, take match [1]
    5. Cover image: <td><img src="http://www.xs52.org/files/article/image/34/34351/34351s.jpg", regex: <td><img src="http://www.xs52.org/(.+?).jpg"
    6. Link: the book page URL itself.
10. Crawl the chapter text. Only finished books are crawled, so updates do not need to be handled.
11. Work out the chapter URLs: take the book path from the text file, e.g. book = book_34351, and split off the numeric code: book_code = str(book).split('_')[1]
12. Build the chapter-list path: chapters = 'chapters_' + book_code
13. Build the chapter-list URL: chapters_url = 'http://m.xs52.org/' + chapters
14. Open the chapter list in reverse order: url = 'http://m.xs52.org/chapters_rev_' + book_code
15. Find the topmost (newest) chapter: <li class="even"><a href="/book_34351/20856965.html">, regex: <li class="even"><a href="/(.+?).html">
16. From the step-15 result, book_34351/20856965, take the chapter id: str().split('/')[1]
17. Extract the total page count from "page-book-turn">第1/3页[30章/页] < with the regex "page-book-turn">第1/(.+?)页\[30章/页\] < (the square brackets must be escaped in the pattern)
18. Step 17 gives 3, so the total chapter count is page_num = int(page) * 30
19. Take the step-16 result, x = 20856965, and decrement it once per chapter, running for as many iterations as computed in step 18; after each decrement,
20. build the chapter URL: url = 'http://m.xs52.org/book_' + book_code + '/' + str(x) + '.html'
21. Extract the chapter content:
    1. Chapter title: <h1 class="nr_title" id="nr_title">第36章 镜头下生日快乐</h1>, regex: <h1 class="nr_title" id="nr_title">(.+?)</h1>
    2. Chapter body: <div id="nr1"></div>, regex: <div id="nr1">(.+?)</div>
22. Join the title and body and append each chapter's result to a txt file (a sketch of this chapter crawler follows this list).
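
Steps 11 through 22 describe the chapter crawler, which is the second program mentioned further down and is not listed in full in this post. Below is a minimal sketch of that part, following the URL patterns and regexes above; the fetch helper and the output path are placeholders, and consecutive chapter ids are assumed, which is what the decrement approach relies on.

import re
import requests

BASE = 'http://m.xs52.org/'

def fetch(url):
    # placeholder helper: download a page and decode it with the detected charset
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding
    return r.text

def crawl_book_chapters(book, out_path):
    """Crawl every chapter of one finished book, e.g. book = 'book_34351' (steps 11-22)."""
    book_code = str(book).split('_')[1]                     # step 11: '34351'
    # step 14: the reversed chapter list shows the newest chapter first
    rev_html = fetch(BASE + 'chapters_rev_' + book_code)
    # steps 15/16: id of the newest chapter, e.g. 'book_34351/20856965' -> 20856965
    top = re.findall(r'<li class="even"><a href="/(.+?)\.html">', rev_html)[0]
    last_id = int(top.split('/')[1])
    # steps 17/18: total pages of the chapter list, 30 chapters per page
    pages = re.findall(r'"page-book-turn">第1/(.+?)页\[30章/页\] <', rev_html)[0]
    chapter_total = int(pages) * 30
    with open(out_path, 'a', encoding='utf-8') as f:
        # steps 19/20: walk the chapter ids downwards from the newest one,
        # assuming the ids are consecutive
        for offset in range(chapter_total):
            html = fetch('%s%s/%d.html' % (BASE, book, last_id - offset))
            # step 21: chapter title and body (re.S lets the body span newlines)
            title = re.findall(r'<h1 class="nr_title" id="nr_title">(.+?)</h1>', html)
            body = re.findall(r'<div id="nr1">(.+?)</div>', html, re.S)
            if not title or not body:
                continue  # skip pages that fail to parse
            # step 22: append title + text to the book's txt file
            f.write(title[0] + '\n' + body[0] + '\n\n')

Because the walk starts from the newest chapter, the txt file ends up in reverse reading order; reversing the loop, or the file afterwards, puts the chapters back in order.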

The list above is the first-pass analysis. The code I actually wrote does not match it exactly, though the overall approach is the same. In practice the crawl was split into two programs: one scrapes the book metadata and writes it into the database, the other scrapes the chapter text and saves it to the server's disk. Eight threads run at once, one per category, which makes it reasonably fast. The metadata crawler is shown below.

import re
import threading

import pymysql
import requests


def request(url):
    """Download a page and return its decoded HTML."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    }
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding  # let requests guess the real charset
    return r.text
def parse(html):
    """Extract every book path on a listing page and scrape its detail page."""
    books = re.findall('<a href="/(.+?)/"><img', html)
    for book in books:
        url = 'http://m.xs52.org/' + book
        page = request(url)
        try:
            book_content(con=page, url=url, book_code=book)
        except Exception:
            # retry once: refetch the page and parse again
            page = request(url)
            book_content(con=page, url=url, book_code=book)
def book_content(con, url, book_code):
    """Parse one book page and insert the book's metadata into MySQL."""
    try:
        book_name = re.findall('<p><strong>(.+?)</strong></p>', con)[0]
    except IndexError:
        book_name = ''
    links = re.findall('">(.+?)</a></p>', con)
    try:
        book_author = links[0]   # first match is the author
    except IndexError:
        book_author = ''
    try:
        book_abstract = re.findall('<div class="intro">(.+?)</div>', con)[0]
    except IndexError:
        book_abstract = ''
    book_class = links[1]        # second match is the category
    try:
        book_img = re.findall('<td><img src="http://www.xs52.org/(.+?)"', con)[0]
    except IndexError:
        book_img = ''
    book_link = url
    book_img = 'http://www.xs52.org/' + str(book_img)
    connect = pymysql.connect(host='localhost', port=3306, user='book', password='book', db='book',
                              cursorclass=pymysql.cursors.DictCursor)
    cursor = connect.cursor()
    # parameterised insert, so quotes in titles or blurbs cannot break the SQL
    sql = "insert into book values(null, %s, %s, %s, %s, %s, %s)"
    cursor.execute(sql, (book_name, book_author, book_abstract, book_class, book_img, book_link))
    connect.commit()
    connect.close()
    print('Finished scraping %s' % book_name)

def a():
    """Category 2: 36 listing pages."""
    for i in range(1, 37):
        url = 'http://m.xs52.org/wapsort/2_' + str(i) + '.html'
        print('Category 2: crawling page %s' % i)
        parse(request(url))


def b():
    """Category 4: 71 listing pages."""
    for i in range(1, 72):
        url = 'http://m.xs52.org/wapsort/4_' + str(i) + '.html'
        print('Category 4: crawling page %s' % i)
        parse(request(url))


def c():
    """Category 1: 80 listing pages."""
    for i in range(1, 81):
        url = 'http://m.xs52.org/wapsort/1_' + str(i) + '.html'
        print('Category 1: crawling page %s' % i)
        parse(request(url))


def d():
    """Category 3: 700 listing pages."""
    for i in range(1, 701):
        url = 'http://m.xs52.org/wapsort/3_' + str(i) + '.html'
        print('Category 3: crawling page %s' % i)
        parse(request(url))


def e():
    """Category 5: 17 listing pages."""
    for i in range(1, 18):
        url = 'http://m.xs52.org/wapsort/5_' + str(i) + '.html'
        print('Category 5: crawling page %s' % i)
        parse(request(url))


def f():
    """Category 6: 42 listing pages."""
    for i in range(1, 43):
        url = 'http://m.xs52.org/wapsort/6_' + str(i) + '.html'
        print('Category 6: crawling page %s' % i)
        parse(request(url))


def g():
    """Category 7: 14 listing pages."""
    for i in range(1, 15):
        url = 'http://m.xs52.org/wapsort/7_' + str(i) + '.html'
        print('Category 7: crawling page %s' % i)
        parse(request(url))


def h():
    """Category 8: 63 listing pages."""
    for i in range(1, 64):
        url = 'http://m.xs52.org/wapsort/8_' + str(i) + '.html'
        print('Category 8: crawling page %s' % i)
        parse(request(url))

if __name__ == '__main__':
    # launch one crawler thread per category and wait for them all to finish
    threads = [threading.Thread(target=fn) for fn in (a, b, c, d, e, f, g, h)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
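
Since the eight functions a() through h() differ only in the category id and the number of listing pages, the same work could also be written as one parameterised worker. A possible compact variant, using the same page counts as above and reusing the request() and parse() functions from this script:

CATEGORIES = {1: 80, 2: 36, 3: 700, 4: 71, 5: 17, 6: 42, 7: 14, 8: 63}  # sort id -> listing pages

def crawl_category(sort_id, last_page):
    # crawl every listing page of one category
    for page in range(1, last_page + 1):
        url = 'http://m.xs52.org/wapsort/%d_%d.html' % (sort_id, page)
        print('Category %d: crawling page %d' % (sort_id, page))
        parse(request(url))

if __name__ == '__main__':
    threads = [threading.Thread(target=crawl_category, args=(sort_id, pages))
               for sort_id, pages in CATEGORIES.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()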
