Feature-extraction methods such as TF-IDF can be used here. In the original arXiv data every paper already carries category labels, and those categories are filled in by the authors themselves. In this task we can classify papers using their titles and abstracts:
Approach 1: TF-IDF + machine learning classifier
Extract features from the text directly with TF-IDF and feed them to a conventional classifier; SVM, LR, XGBoost, etc. are all reasonable choices.
Approach 2: FastText
FastText is the entry-level word-vector option; with the FastText tool released by Facebook, a classifier can be built very quickly (see the sketch after this list).
Approach 3: Word2Vec + deep learning classifier
Word2Vec is the more advanced word-vector option, combined with a deep learning classifier. The network architecture can be TextCNN, TextRNN, or BiLSTM.
Approach 4: BERT embeddings
BERT is the high-end option, with the strongest modeling and learning capability.
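As a point of comparison, here is a minimal sketch of Approach 2 (FastText). It assumes the data DataFrame built in the preprocessing steps below (columns 'text' and 'categories_big'), that the fasttext package is installed, and the file name fasttext_train.txt is only an illustrative choice:
import fasttext
# fastText supervised mode expects one sample per line: "__label__cat1 __label__cat2 <text>"
with open('fasttext_train.txt', 'w') as f:
    for text, cats in zip(data['text'], data['categories_big']):
        labels = ' '.join('__label__' + c for c in set(cats))
        f.write(labels + ' ' + text + '\n')
# the 'ova' (one-vs-all) loss handles the multi-label setting
model = fasttext.train_supervised(input='fasttext_train.txt', epoch=5, wordNgrams=2, loss='ova')
print(model.predict(data['text'].iloc[0], k=-1, threshold=0.5))  # all labels scoring above 0.5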
import os  # file-path handling
import pandas as pd  # data processing and analysis
import matplotlib.pyplot as plt
import json
from bs4 import BeautifulSoup
import seaborn as sns
import requests
import re
First, read in the required fields:
os.chdir("D:\数据分析\Datawhale项目")
data = []  # list that will hold the parsed records
with open("arxiv-metadata-oai-2019.json", 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        d = {'title': d['title'], 'categories': d['categories'], 'abstract': d['abstract']}
        data.append(d)
data = pd.DataFrame(data)
pd.set_option('display.max_colwidth', None)  # show full column contents without truncation
data.shape
(170618, 3)
To simplify processing, concatenate the title and abstract and classify on the combined text.
data['text'] = data['title'] + data['abstract']
data['text'] = data['text'].apply(lambda x: x.replace('\n', ''))  # strip the newline characters from the text column
data['text'] = data['text'].apply(lambda x: x.lower())  # lowercase the text column
data = data.drop(['abstract', 'title'], axis=1)
Since the original papers can belong to multiple categories, the category field also needs some processing:
# multiple categories, including subcategories
data['categories'] = data['categories'].apply(lambda x: x.split(' '))
# multiple categories, keeping only the top-level category (subcategories stripped)
data['categories_big'] = data['categories'].apply(lambda x: [xx.split('.')[0] for xx in x])
Because each paper can carry several labels, the categories need multi-label encoding:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_big'].iloc[:])
data_label
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 1, 0, 0],
[0, 1, 0, ..., 1, 0, 0],
[0, 0, 0, ..., 1, 0, 0]])
mlb.classes_
array(['acc-phys', 'adap-org', 'alg-geom', 'astro-ph', 'chao-dyn',
'chem-ph', 'cmp-lg', 'comp-gas', 'cond-mat', 'cs', 'dg-ga', 'econ',
'eess', 'funct-an', 'gr-qc', 'hep-ex', 'hep-lat', 'hep-ph',
'hep-th', 'math', 'math-ph', 'mtrl-th', 'nlin', 'nucl-ex',
'nucl-th', 'patt-sol', 'physics', 'q-alg', 'q-bio', 'q-fin',
'quant-ph', 'solv-int', 'stat', 'supr-con'], dtype=object)
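For reference, the binarizer can also map an encoded row back to the original category names; a small usage sketch of MultiLabelBinarizer.inverse_transform:
mlb.inverse_transform(data_label[:1])  # expects a 2-D array, returns a list of tuples of category names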
Approach 1 uses TF-IDF to extract features, capped at 4,000 features:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
data_tfidf.shape
(170618, 4000)
print(data_tfidf)
(0, 3995) 0.07021652059189036
(0, 4) 0.03998477620695627
(0, 3981) 0.050156978103458774
(0, 1863) 0.046353653530492214
(0, 974) 0.061007097797520574
(0, 3669) 0.028541269073439877
(0, 544) 0.034463432304852167
(0, 3518) 0.06798071079095705
(0, 1760) 0.07215101651579754
(0, 3765) 0.03753666692814393
(0, 1495) 0.01322037355147394
(0, 782) 0.04521877067042028
(0, 1141) 0.05311346808892409
(0, 2815) 0.05752833069747911
(0, 3365) 0.033433042545714706
(0, 2063) 0.04861540933342187
(0, 1796) 0.06540334209658531
(0, 2564) 0.023306744143985307
(0, 263) 0.055637945987728185
(0, 2448) 0.028629546305411327
(0, 2216) 0.07532839597857001
(0, 3469) 0.04264499771346989
(0, 2778) 0.042077240901834755
(0, 142) 0.06254837376110277
(0, 3912) 0.03917142194878494
: :
(170616, 3617) 0.061893376510652305
(170616, 297) 0.03337147925518496
(170616, 3651) 0.05605176602928342
(170616, 2508) 0.027875761199047826
(170616, 222) 0.04062933458664977
(170616, 1808) 0.020844270614692607
(170616, 503) 0.03087776676147839
(170616, 1963) 0.02465323265530774
(170616, 2495) 0.09640842718156238
(170616, 3608) 0.09609529936833933
(170616, 1309) 0.08016642082194166
(170617, 1580) 0.36393899239590916
(170617, 1073) 0.24569139492370443
(170617, 1070) 0.2295246987929196
(170617, 273) 0.23630419647977724
(170617, 711) 0.3969271888729948
(170617, 3356) 0.3915144503119474
(170617, 1265) 0.40269236333876046
(170617, 3371) 0.3230261444843898
(170617, 799) 0.25885026950450557
(170617, 3676) 0.0579294453455891
(170617, 1808) 0.11124168713971093
(170617, 1963) 0.06578467639187029
(170617, 2495) 0.10290248377991636
(170617, 3608) 0.15385239558914604
Since this is multi-label classification, sklearn's multi-output wrapper can be used to wrap a base classifier:
# split into training and validation sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data_tfidf,data_label,test_size = 0.2,random_state =1)
# build the multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train,y_train)
Evaluate the model's accuracy (for multi-label targets, accuracy_score reports the exact-match / subset accuracy):
from sklearn.metrics import accuracy_score
accuracy_score(y_test,clf.predict(x_test))
0.5063884655960614
from sklearn.metrics import classification_report
print(classification_report(y_test,clf.predict(x_test)))
precision recall f1-score support
0 0.00 0.00 0.00 0
1 0.00 0.00 0.00 1
2 0.00 0.00 0.00 0
3 0.92 0.84 0.88 3625
4 0.00 0.00 0.00 4
5 0.00 0.00 0.00 0
6 0.00 0.00 0.00 1
7 0.00 0.00 0.00 0
8 0.78 0.74 0.76 3801
9 0.84 0.88 0.86 10715
10 0.00 0.00 0.00 0
11 0.00 0.00 0.00 186
12 0.46 0.37 0.41 1621
13 0.00 0.00 0.00 1
14 0.76 0.54 0.63 1096
15 0.62 0.78 0.69 1078
16 0.89 0.17 0.29 242
17 0.53 0.64 0.58 1451
18 0.73 0.50 0.59 1400
19 0.88 0.83 0.85 10243
20 0.45 0.08 0.13 934
21 0.00 0.00 0.00 1
22 1.00 0.02 0.04 414
23 0.51 0.63 0.56 517
24 0.36 0.29 0.32 539
25 0.00 0.00 0.00 1
26 0.61 0.39 0.47 3891
27 0.00 0.00 0.00 0
28 0.82 0.06 0.11 676
29 0.83 0.10 0.18 297
30 0.82 0.37 0.51 1714
31 0.00 0.00 0.00 4
32 0.57 0.61 0.59 3398
33 0.00 0.00 0.00 0
micro avg 0.77 0.68 0.72 47851
macro avg 0.39 0.26 0.28 47851
weighted avg 0.76 0.68 0.70 47851
samples avg 0.74 0.75 0.71 47851
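The row indices 0-33 in the report are simply positions in mlb.classes_; optionally, passing target_names makes the report show the arXiv category names instead:
print(classification_report(y_test, clf.predict(x_test), target_names=mlb.classes_))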
4.5.2 Approach 2
Approach 2 uses a deep learning model: the words are mapped to embeddings and the network is trained end to end. First split the dataset on the raw text:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['text'].iloc[:], data_label,
test_size = 0.2,random_state = 1)
Tokenize and encode the text, then pad/truncate to a fixed length:
# parameter
max_features= 500
max_len= 150
embed_size=100
batch_size = 128
epochs = 5
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
tokens = Tokenizer(num_words = max_features)
tokens.fit_on_texts(list(x_train)+list(x_test))
x_sub_train = tokens.texts_to_sequences(x_train)
x_sub_test = tokens.texts_to_sequences(x_test)
x_sub_train=sequence.pad_sequences(x_sub_train, maxlen=max_len)
x_sub_test=sequence.pad_sequences(x_sub_test, maxlen=max_len)
Define the model and train it:
# Bidirectional GRU + CNN model
# Keras Layers:
from keras.layers import Dense, Input, LSTM, Bidirectional, Activation, Conv1D, GRU
from keras.layers import Dropout, Embedding, GlobalMaxPooling1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, concatenate, SpatialDropout1D
# Keras Callback Functions:
from keras.callbacks import Callback
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras.models import Model
from keras.optimizers import Adam
sequence_input = Input(shape=(max_len, ))
x = Embedding(max_features, embed_size,trainable = False)(sequence_input)
x = SpatialDropout1D(0.2)(x)
x = Bidirectional(GRU(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool])
preds = Dense(34, activation="sigmoid")(x)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',optimizer=Adam(lr=1e-3),metrics=['accuracy'])
model.fit(x_sub_train, y_train, batch_size=batch_size, epochs=epochs)
Epoch 1/5
370/1067 [=========>....................] - ETA: 22:29 - loss: 0.1394 - accuracy: 0.3263
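Note that the accuracy reported by Keras during training is computed directly on the 34 sigmoid outputs, so it is not the same metric as the exact-match (subset) accuracy used for Approach 1. A minimal evaluation sketch, assuming training has finished: threshold the predicted probabilities at 0.5 and reuse the sklearn metrics from above.
from sklearn.metrics import accuracy_score, classification_report
# Threshold the sigmoid outputs to obtain 0/1 multi-label predictions.
y_pred = (model.predict(x_sub_test) > 0.5).astype(int)
print(accuracy_score(y_test, y_pred))  # exact-match (subset) accuracy
print(classification_report(y_test, y_pred, target_names=mlb.classes_))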