My second hands-on machine learning write-up:
Task & data source: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Reference notebook: https://www.kaggle.com/ivangavrilove88/stroke-fe-smote-technique-17-models#Introduction
Overview: the task is to build a good model from the given data to predict whether a person will have a stroke. The final result should contain two columns: 1) id 2) stroke.
Environment: my favorite, the Kaggle notebook (highly recommended); a local Jupyter Notebook works just as well, but try the Kaggle one and you'll see why I like it.
This task is a great entry-level example. If you are interested, follow along and type the code out yourself to quickly get a feel for the practical side of machine learning (the theory still needs studying, of course). In this project you will learn EDA (exploratory data analysis), how to construct new features and pick the "best" ones, how to preprocess your data, and how to improve your model, including checking for overfitting or underfitting.
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Data handling and visualization
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
import missingno as msno
import pandas_profiling as pdp
# Render matplotlib figures inline, below the corresponding cell
%matplotlib inline
style.use('fivethirtyeight')
sns.set(style='whitegrid',color_codes=True)
# Models
# Linear models come first; plenty of other model families follow below
from sklearn.linear_model import LinearRegression, LogisticRegression, Perceptron, RidgeClassifier, SGDClassifier, LassoCV
from sklearn.svm import SVC, LinearSVC, SVR
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
import xgboost as xgb
from xgboost import XGBClassifier
import lightgbm as lgb
from lightgbm import LGBMClassifier
# Feature selection
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel, SelectKBest, RFE, chi2
# Model selection
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit, learning_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_predict as cvp
# Data preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, RobustScaler
# Evaluation metrics
from sklearn.metrics import mean_squared_log_error, mean_squared_error, r2_score, mean_absolute_error, explained_variance_score # for regression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score # for classification
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
Loading the data and taking a first look at its characteristics
# Load the data
df = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
# Count missing values per column
print(df.isnull().sum())
# Column dtypes and non-null counts
print(df.info())
# Number of distinct values per feature, e.g. gender takes two values (Male, Female), so it reports 2
print(df.nunique())
# Summary statistics for each feature (mean, std, quartiles, ...)
print(df.describe())
# First five rows
df.head()
# Number of rows and columns
df.shape
# Drop exact duplicate rows (check the pandas docs if you want the details)
df = df.drop_duplicates()
df.shape
Handling missing values:
# Here we simply drop every row that contains a missing value
df.dropna(inplace = True)
df.info()
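A gentler alternative (my own aside, not from the reference notebook): as far as I can tell the missing values in this dataset are confined to the bmi column, so instead of throwing those rows away you could impute them, for example with the median, and keep the extra training examples.
# Alternative sketch: median-impute bmi instead of dropping rows
# (check df.isnull().sum() first to confirm bmi is the only column with NaNs)
# df['bmi'] = df['bmi'].fillna(df['bmi'].median())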
# For a categorical column, unique() lists all of its distinct values
for i in ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']:
    print(df[i].unique())
countplot_cols = ['heart_disease', 'hypertension', 'gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
boxplot_cols = ['age','avg_glucose_level', 'bmi']
Below we draw count plots to show, for each categorical feature, how stroke is distributed across that feature's values.
# A quick reminder of what enumerate does:
# >>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']
# >>> list(enumerate(seasons))
# [(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
for i, column in enumerate(countplot_cols):
    sns.countplot(x=column, hue='stroke', data=df)
    plt.show()
From the box plots we can see how the stroke cases are distributed: roughly which age range, which average glucose level, and which BMI they tend to occur at.
for u, column in enumerate(boxplot_cols):
    sns.boxplot(x='stroke', y=column, data=df)
    plt.show()
#df = df.drop(df[df.smoking_status == 'Unknown'].index)
df = df.drop(df[df.gender == 'Other'].index)
df.info()
The following is a very common preprocessing step: converting text categories into numbers, because machine learning models expect numeric input.
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['ever_married'] = le.fit_transform(df['ever_married'])
df['work_type'] = le.fit_transform(df['work_type'])
df['Residence_type'] = le.fit_transform(df['Residence_type'])
df['smoking_status'] = le.fit_transform(df['smoking_status'])
# Take a look at what changed here, if you like
# df.head()
# Drop the id column; compare df.head() before and after to see the effect
df = df.drop('id', axis = 1)
print('Encoding was successful')
df.head()
#replace_values = {'Unknown': 'never smoked','formerly smoked': 'smokes'}
#df = df.replace({'smoking_status': replace_values})
#print('Replace was successfully')
Create new features and select among them. These new features are usually mathematical transformations of one or more existing features, such as logs, roots, powers, and combinations.
def feature_creation(df):
    df['age1'] = np.log(df['age'])
    df['age2'] = np.sqrt(df['age'])
    df['age3'] = df['age']**3
    df['bmi1'] = np.log(df['bmi'])
    df['bmi2'] = np.sqrt(df['bmi'])
    df['bmi3'] = df['bmi']**3
    df['avg_glucose_level1'] = np.log(df['avg_glucose_level'])
    df['avg_glucose_level2'] = np.sqrt(df['avg_glucose_level'])
    df['avg_glucose_level3'] = df['avg_glucose_level']**3
    for i in ['gender', 'age1', 'age2', 'age3', 'hypertension', 'heart_disease', 'ever_married', 'work_type']:
        for j in ['Residence_type', 'avg_glucose_level1', 'avg_glucose_level2', 'avg_glucose_level3', 'bmi1', 'bmi2', 'bmi3', 'smoking_status']:
            df[i+'_'+j] = df[i].astype('str')+'_'+df[j].astype('str')
    return df
df = feature_creation(df)
features = df.columns.values.tolist()
# features
# inspect the current list of feature names
df.head()
df.shape
We now have 84 features! Clearly some of them are barely related to the target, so we need to filter out the weakly related ones. The approach here differs a bit from the usual one (perhaps), but later on we still screen features with the Pearson correlation coefficient.
# Put the names of all non-numeric columns into this (initially empty) list
categorical_columns = []
numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
features = df.columns.values.tolist()
for col in features:
    if df[col].dtype in numerics: continue
    categorical_columns.append(col)
categorical_columns
Encoding the categorical features
for col in categorical_columns:
    if col in df.columns:
        #le = LabelEncoder()
        # print(list(df[col].astype(str).values))
        # astype converts the dtype; here object values are cast to str before encoding
        le.fit(list(df[col].astype(str).values))
        df[col] = le.transform(list(df[col].astype(str).values))
print('Encoding was successful')
# Number of features we ultimately want to keep
num_features_opt = 40
# Upper limit on how many features a single selector may return
num_features_max = 50
features_best = []
X_train = df.drop('stroke',axis = 1).copy()
y_train = df.stroke.copy()
Here the real feature screening begins, starting with the Pearson correlation coefficient: we look for pairs of highly correlated (collinear) features and drop one feature from each pair.
# Correlation threshold: feature pairs above this value are flagged as collinear
threshold = 0.9
# Small helper used only to color the correlation table
def highlight(value):
    if value > threshold:
        color = 'background-color: pink'
    else:
        color = 'background-color: green'
    return color
# Absolute correlation matrix, rounded to two decimals
corr_matrix = df.corr().abs().round(2)
# Keep only the upper triangle of the matrix (k=1 excludes the diagonal),
# so each pair of features is considered exactly once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper.style.format("{:.2f}").applymap(highlight)
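To make that masking step less mysterious, here is a tiny standalone illustration (a hypothetical 3-feature case, not part of the original notebook):
# Minimal illustration of the upper-triangle mask
mask = np.triu(np.ones((3, 3)), k=1).astype(bool)
print(mask)
# [[False  True  True]
#  [False False  True]
#  [False False False]]
# DataFrame.where(mask) keeps the values where the mask is True and sets the rest to NaN,
# so every pair of features is compared exactly once (above the diagonal).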
# corr_matrix.shape
# Collect every feature that is correlated above the threshold with some other feature
collinear_features = [column for column in upper.columns if any(upper[column] > threshold)]
# Drop those collinear features
features_filtered = df.drop(columns = collinear_features)
print('The number of features that passed the collinearity threshold: ', features_filtered.shape[1])
# Record the surviving columns as the first candidate "best"-feature list
features_best.append(features_filtered.columns.tolist())
Feature selection with a linear support vector machine
Feature selection is usually approached from two angles: statistical scoring of individual features (as with SelectKBest further below) and model-based selection, where a fitted estimator tells us which features it actually relies on.
SelectFromModel is a meta-transformer for the second approach: it can wrap any estimator that exposes coefficients or feature importances, and features whose weight falls below the threshold are considered unimportant and removed.
lsvc = LinearSVC(C=0.1, penalty="l1", dual=False).fit(X_train, y_train)
model = SelectFromModel(lsvc, prefit = True)
X_new = model.transform(X_train)
# X_new.shape
# Comparing shapes, the number of features drops from 84 to 58 after this selection
# (the exact count can differ between runs, since fitting is not always perfectly identical)
X_selected_df = pd.DataFrame(X_new, columns=[X_train.columns[i] for i in range(len(X_train.columns)) if model.get_support()[i]])
features_best.append(X_selected_df.columns.tolist())
Feature selection with LassoCV: essentially the same as above, just a different underlying model.
lasso = LassoCV(cv=3).fit(X_train, y_train)
model = SelectFromModel(lasso, prefit=True)
X_new = model.transform(X_train)
X_selected_df = pd.DataFrame(X_new, columns=[X_train.columns[i] for i in range(len(X_train.columns)) if model.get_support()[i]])
# add features
features_best.append(X_selected_df.columns.tolist())
Feature selection with SelectKBest and the chi-squared test
I'm still getting comfortable with SelectKBest, but the short version: it scores every feature with a statistical test (here the chi-squared statistic, which needs non-negative inputs, hence the abs() below) and keeps the k highest-scoring features; with k='all' we keep everything and simply rank the scores.
bestfeatures = SelectKBest(score_func = chi2, k='all')
fit = bestfeatures.fit(abs(X_train),y_train)
dfscores = pd.DataFrame(fit.scores_)
# print(dfscores)
dfcolumns = pd.DataFrame(X_train.columns)
# Concatenate the two DataFrames for a nicer side-by-side view
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
# Name the columns of this two-column table
featureScores.columns = ['Feature','Score']
# featureScores
# nlargest(n, column) keeps the n rows with the largest values in `column`
# (e.g. n=10 for a top-10 view); ['Feature'] then pulls out just the feature names
features_best.append(featureScores.nlargest(num_features_max,'Score')['Feature'].tolist())
print(featureScores.nlargest(len(dfcolumns),'Score'))
Feature selection with RFE (recursive feature elimination) and logistic regression
get_support(indices=False) returns a boolean array over all features: True for the columns that were selected, False otherwise. With indices=True it instead returns an array of the integer indices of the selected columns.
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=num_features_max, step=10, verbose=5)
rfe_selector.fit(X_train, y_train)
rfe_support = rfe_selector.get_support()
# rfe_support
rfe_feature = X_train.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')
features_best.append(rfe_feature)
Feature selection with SelectFromModel and a random forest; the result below differs slightly from run to run.
embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=200), threshold='1.25*median')
embeded_rf_selector.fit(X_train, y_train)
embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X_train.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')
Feature selection with a variance threshold: features whose variance falls below the threshold barely change across samples and therefore carry little information, so they are dropped.
selector = VarianceThreshold(threshold=10)
np.shape(selector.fit_transform(df))
features_best.append(list(np.array(df.columns)[selector.get_support(indices=False)]))
Choosing the final set of best features
# Show best features
features_best
# Union: every feature that appears in at least one of the candidate best-feature lists
main_cols_max = features_best[0]
for i in range(len(features_best)-1):
    main_cols_max = list(set(main_cols_max) | set(features_best[i+1]))
print(main_cols_max)
print('Cols:', len(main_cols_max))
# Count how often each feature appears across all the candidate best-feature lists
main_cols = []
main_cols_opt = {feature_name : 0 for feature_name in df.columns.tolist()}
for i in range(len(features_best)):
    for feature_name in features_best[i]:
        main_cols_opt[feature_name] += 1
df_main_cols_opt = pd.DataFrame.from_dict(main_cols_opt, orient='index', columns=['Num'])
df_main_cols_opt.sort_values(by=['Num'], ascending=False).head(num_features_opt)
# Keep only the num_features_opt most frequently selected features
main_cols = df_main_cols_opt.nlargest(num_features_opt, 'Num').index.tolist()
if not 'stroke' in main_cols:
    main_cols.append('stroke')
print(main_cols)
print("Quantity:", len(main_cols))
Now that we have built and screened the best features, the next step is to prepare suitable models.
# First, take a look at the best features we selected
df[main_cols].head()
Prepare the inputs (dropping the stroke target, just like before) and the output:
X = df[main_cols].drop('stroke', axis = 1)
y = df[main_cols].stroke
Scaling: RobustScaler centers each feature on its median and scales by the interquartile range, so it is less sensitive to the outliers in features such as avg_glucose_level and bmi.
rs = RobustScaler()
X_rs = pd.DataFrame(rs.fit_transform(X), columns = X.columns)
X_rs
The next step is a very important one: split the data into a training set and a test set, here with sklearn's built-in train_test_split. After the split, the code below applies a long list of over- and under-sampling techniques from imbalanced-learn (SMOTE, ADASYN, SMOTE+Tomek, NearMiss, and so on), because stroke cases are a small minority and an untouched training set would let the models mostly learn the majority class.
#train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_rs, y, test_size=0.2, random_state=42)
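One optional tweak (my suggestion, not from the reference notebook): because stroke cases are rare, an unstratified split can leave the test set with very few positives. Passing stratify=y keeps the class ratio identical in both splits.
# Stratified variant of the split above (a sketch; uncomment to use it instead)
# X_train, X_test, y_train, y_test = train_test_split(
#     X_rs, y, test_size=0.2, random_state=42, stratify=y)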
from imblearn.over_sampling import SMOTE
smt = SMOTE()
X_train_sm, y_train_sm = smt.fit_resample(X_train, y_train)
print(y_train_sm.value_counts())
from imblearn.over_sampling import ADASYN
ada = ADASYN()
X_train_ada, y_train_ada = ada.fit_resample(X_train, y_train)
print(y_train_ada.value_counts())
from imblearn.combine import SMOTETomek
smtom = SMOTETomek()
X_train_smtom, y_train_smtom = smtom.fit_resample(X_train, y_train)
print(y_train_smtom.value_counts())
from imblearn.combine import SMOTEENN
smenn = SMOTEENN()
X_train_smenn, y_train_smenn = smenn.fit_resample(X_train, y_train)
print(y_train_smenn.value_counts())
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state = 42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
print(y_train_rus.value_counts())
from imblearn.under_sampling import NearMiss
NMv1 = NearMiss(version = 1)
X_train_NMv1, y_train_NMv1 = NMv1.fit_resample(X_train, y_train)
print(y_train_NMv1.value_counts())
NMv2 = NearMiss(version = 2)
X_train_NMv2, y_train_NMv2 = NMv2.fit_resample(X_train, y_train)
print(y_train_NMv2.value_counts())
NMv3 = NearMiss(version = 3)
X_train_NMv3, y_train_NMv3 = NMv3.fit_resample(X_train, y_train)
print(y_train_NMv3.value_counts())
from imblearn.under_sampling import CondensedNearestNeighbour
CNN = CondensedNearestNeighbour()
X_train_CNN, y_train_CNN = CNN.fit_resample(X_train, y_train)
print(y_train_CNN.value_counts())
from imblearn.under_sampling import OneSidedSelection
OSS = OneSidedSelection()
X_train_OSS, y_train_OSS = OSS.fit_resample(X_train, y_train)
print(y_train_OSS.value_counts())
from imblearn.under_sampling import NeighbourhoodCleaningRule
NCR = NeighbourhoodCleaningRule()
X_train_NCR, y_train_NCR = NCR.fit_resample(X_train, y_train)
print(y_train_NCR.value_counts())
from imblearn.under_sampling import EditedNearestNeighbours
ENN = EditedNearestNeighbours()
X_train_ENN, y_train_ENN = ENN.fit_resample(X_train, y_train)
print(y_train_ENN.value_counts())
from imblearn.under_sampling import InstanceHardnessThreshold
IHT = InstanceHardnessThreshold()
X_train_IHT, y_train_IHT = IHT.fit_resample(X_train, y_train)
print(y_train_IHT.value_counts())
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
RENN = RepeatedEditedNearestNeighbours()
X_train_RENN, y_train_RENN = RENN.fit_resample(X_train, y_train)
print(y_train_RENN.value_counts())
from imblearn.under_sampling import AllKNN
ALLKNN = AllKNN()
X_train_ALLKNN, y_train_ALLKNN = ALLKNN.fit_resample(X_train, y_train)
print(y_train_ALLKNN.value_counts())
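Resampling is not the only way to handle the imbalance. Here is a minimal cost-sensitive sketch (my own aside; it relies only on sklearn's standard class_weight option and the plain training split): errors on the minority class are weighted more heavily instead of changing the data.
# Cost-sensitive baseline: weight minority-class errors more heavily instead of resampling
clf_weighted = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
clf_weighted.fit(X_train, y_train)
print(classification_report(y_test, clf_weighted.predict(X_test)))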
Now the modeling begins!
def evaluate_model(clf, X_train, X_test, y_train, y_test, model_name, sample_type):
    print('--------------------------------------------')
    print('Model ', model_name)
    print('Data Type ', sample_type)
    y_pred = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    #accuracy_train = accuracy_score(y_train, y_pred_train)
    accuracy_test = accuracy_score(y_test, y_pred)
    #f1_train = f1_score(y_train, y_pred_train, average = 'binary')
    #recall_train = recall_score(y_train, y_pred_train, average = 'binary')
    #precision_train = precision_score(y_train, y_pred_train, average = 'binary')
    f1_test = f1_score(y_test, y_pred, average='binary')
    recall_test = recall_score(y_test, y_pred, average='binary')
    precision_test = precision_score(y_test, y_pred, average='binary')
    print('TRAIN:', classification_report(y_train, y_pred_train))
    #print('TRAIN Accuracy:', accuracy_score(y_train, y_pred_train))
    #print("TRAIN: F1 Score ", f1_train)
    #print("TRAIN: Recall ", recall_train)
    #print("TRAIN: Precision ", precision_train)
    print('==================================================================')
    print('TEST:', classification_report(y_test, y_pred))
    #print('TEST Accuracy:', accuracy_score(y_test, y_pred))
    #print("TEST: F1 Score ", f1_test)
    #print("TEST: Recall ", recall_test)
    #print("TEST: Precision ", precision_test)
    return [model_name, sample_type,
            f1_test,
            precision_test,
            recall_test,
            accuracy_test]
17 models!!!
models = {
'Decision Trees': DecisionTreeClassifier(random_state=42),
'Random Forest':RandomForestClassifier(random_state=42),
'Linear SVC':LinearSVC(random_state=42),
'AdaBoost Classifier':AdaBoostClassifier(random_state=42),
'Stochastic Gradient Descent':SGDClassifier(random_state=42),
'XGBoost': xgb.XGBClassifier(random_state=42),
'LightGBM': lgb.LGBMClassifier(random_state=42),
'KNN': KNeighborsClassifier(),
'Logistic Regression': LogisticRegression(random_state=42),
'Support Vector Machines': SVC(random_state=42),
'MLP Classifier': MLPClassifier(random_state=42),
'Gradient Boosting Classifier': GradientBoostingClassifier(random_state=42),
'Ridge Classifier': RidgeClassifier(),
'Bagging Classifier': BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5),
'Extra Trees Classifier': ExtraTreesClassifier(random_state=42),
'Naive Bayes': GaussianNB(),
'Gaussian Process Classification': GaussianProcessClassifier(random_state=42)
}
# 15 resampling techniques, plus the original (unbalanced) training data
sampled_data = {
'ACTUAL':[X_train, y_train],
'SMOTE':[X_train_sm, y_train_sm],
'ADASYN':[X_train_ada, y_train_ada],
'SMOTE_TOMEK':[X_train_smtom, y_train_smtom],
'SMOTE_ENN':[X_train_smenn, y_train_smenn],
'Random Under Sampling': [X_train_rus, y_train_rus],
'Near Miss1': [X_train_NMv1, y_train_NMv1],
'Near Miss2': [X_train_NMv2, y_train_NMv2],
'Near Miss3': [X_train_NMv3, y_train_NMv3],
'Condensed Nearest Neighbour': [X_train_CNN, y_train_CNN],
'One Sided Selection': [X_train_OSS, y_train_OSS],
'Neighbourhood Cleaning Rule' : [X_train_NCR, y_train_NCR],
'Edited Nearest Neighbours': [X_train_ENN, y_train_ENN],
'Instance Hardness Threshold': [X_train_IHT, y_train_IHT],
'Repeated Edited Nearest Neighbours': [X_train_RENN, y_train_RENN],
'AllKNN': [X_train_ALLKNN, y_train_ALLKNN]
}
Below we train every combination of model and data type (resampling technique):
%%time
output = []
for model_k, model_clf in models.items():
    for data_type, data in sampled_data.items():
        model_clf.fit(data[0], data[1])
        output.append(evaluate_model(model_clf, X_train, X_test, y_train, y_test, model_k, data_type))
With all of those results in hand, we can draw conclusions!
result = pd.DataFrame(output, columns=['Model', 'DataType',
'F1',
'Precision',
'Recall',
'Accuracy'])
pd.set_option('display.max_rows', None)
result = result[result['F1']!=0]
result.sort_values(by="F1", ascending=False)
result.shape
# Take a look at our results: after dropping rows with F1 == 0, 248 (model, data type) combinations remain
Let's pick out the top five results, sorted by each metric in turn:
result.sort_values(by = 'F1', ascending=False).head(5)
result.sort_values(by = 'Precision', ascending=False).head(5)
result.sort_values(by = 'Recall', ascending=False).head(5)
That wraps up the project. Some parts may be hard to grasp for beginners, but the first step is simply to understand what the overall workflow of a machine learning project looks like; once the big picture is clear, the next project will go much more smoothly.
We examined 17 models combined with the imbalance (resampling) techniques and compared their predictions.
We get a good recall value, but a relatively low precision.
What next? I think we should run GridSearchCV across all the imbalance techniques and models, and try to push the precision metric higher.
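As a concrete starting point for that, here is a minimal GridSearchCV sketch (my own illustration, not from the reference notebook; the parameter grid is a rough guess rather than a tuned recipe). It tunes a random forest on the SMOTE-resampled training data, optimizes for F1, and then reports the test-set metrics.
# Hyperparameter search sketch: random forest + SMOTE data, scored by F1
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='f1', cv=5, n_jobs=-1)
grid.fit(X_train_sm, y_train_sm)
print(grid.best_params_, grid.best_score_)
print(classification_report(y_test, grid.best_estimator_.predict(X_test)))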