Machine Learning -- Heart Disease Prediction 1

微生城
2023-12-01

Source Information:

( a ) Creators:
– 1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
– 2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
– 3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
– 4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
Robert Detrano, M.D., Ph.D.
( b ) Donor: David W. Aha (aha@ics.uci.edu) (714) 856-8779
( c ) Date: July, 1988

Data Introduction:

According to the introduction of the heart disease database, the databases contain 76 raw attributes, but only 14 of them are actually used. So I would like to use these 14 features to build the model first, and deal with the raw 76-attribute data later (if I have time).

The following is the information for the 14 features:
Attribute Information: ( Only 14 used )

  -- 1. #3  (age)       
  -- 2. #4  (sex)       
  -- 3. #9  (cp)        
  -- 4. #10 (trestbps)  
  -- 5. #12 (chol)      
  -- 6. #16 (fbs)       
  -- 7. #19 (restecg)   
  -- 8. #32 (thalach)   
  -- 9. #38 (exang)     
  -- 10. #40 (oldpeak)   
  -- 11. #41 (slope)     
  -- 12. #44 (ca)        
  -- 13. #51 (thal)      
  -- 14. #58 (num)       (the predicted attribute)


--> 3 age:  age in years. 
--> 4 sex: sex (1 = male; 0 = female) 
--> 9 cp: chest pain type
    -- Value 1: typical angina
    -- Value 2: atypical angina
    -- Value 3: non-anginal pain
    -- Value 4: asymptomatic
--> 10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)
--> 12 chol: serum cholesterol in mg/dl
--> 16 fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
--> 19 restecg: resting electrocardiographic results
    -- Value 0: normal
    -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                elevation or depression of > 0.05 mV)
    -- Value 2: showing probable or definite left ventricular hypertrophy
                by Estes' criteria
--> 32 thalach: maximum heart rate achieved
--> 38 exang: exercise induced angina (1 = yes; 0 = no)
--> 40 oldpeak: ST depression induced by exercise relative to rest
--> 41 slope: the slope of the peak exercise ST segment
    -- Value 1: upsloping
    -- Value 2: flat
    -- Value 3: downsloping
--> 44 ca: number of major vessels (0-3) colored by fluoroscopy
--> 51 thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
--> 58 num: diagnosis of heart disease (angiographic disease status)
    -- Value 0: < 50% diameter narrowing
    -- Value 1: > 50% diameter narrowing
    (in any major vessel: attributes 59 through 68 are vessels)
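
For reference, these attribute names map directly onto the columns of the processed Cleveland file from UCI. Below is a minimal sketch of loading that file, assuming it is named processed.cleveland.data and uses '?' for missing values as in the UCI distribution; the CSV files used later already come with a binary label, so this is only for illustration.

import pandas as pd

# Column names follow the 14-attribute list above
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

raw = pd.read_csv("processed.cleveland.data", header=None, names=cols,
                  na_values="?")   # '?' marks missing entries (e.g. in ca/thal)

# num > 0 indicates > 50% diameter narrowing, so binarize it as the target
raw["target"] = (raw["num"] > 0).astype(int)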

Data Preprocessing

Before dealing with the data, we need to load it.


import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn import svm, tree
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier



## load data
trainSet = pd.read_csv("clevelandtrain.csv")
testSet = pd.read_csv("clevelandtest.csv")

xtrain = (trainSet.drop(["heartdisease::category|0|1"], axis=1)).iloc[:,:].values  # (152, 13)
ytrain = trainSet["heartdisease::category|0|1"].iloc[:].values                     # (152,)

xtest = (testSet.drop(["heartdisease::category|0|1"], axis=1)).iloc[:,:].values    # (145, 13)
ytest = testSet["heartdisease::category|0|1"].iloc[:].values                       # (145,)


From the above description, we can see that #9 (cp), #19 (restecg), #41 (slope), and #51 (thal) are all categorical integer features, so we will encode them as one-hot numeric arrays.

# one-hot-encoder: #9 (cp), #19 (restecg),  #41 (slope), #51 (thal)

xtrain_pre = trainSet.drop(["cp", "restecg", "slope", "thal", "heartdisease::category|0|1"], axis=1).iloc[:,:].values # (152, 9)
xtrain_cp = trainSet["cp"].iloc[:].values
xtrain_restecg = trainSet["restecg"].iloc[:].values
xtrain_slope = trainSet["slope"].iloc[:].values
xtrain_thal = trainSet["thal"].iloc[:].values

# note: in scikit-learn >= 1.2 the 'sparse' argument is renamed to 'sparse_output'
ohe1 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')
ohe2 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')
ohe3 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')
ohe4 = OneHotEncoder(sparse=False, categories='auto', handle_unknown='ignore')

xtrain_cp = ohe1.fit_transform(xtrain_cp.reshape(-1,1))                    # (152, 4)
xtrain_restecg = ohe2.fit_transform(xtrain_restecg.reshape(-1,1))          # (152, 3)
xtrain_slope = ohe3.fit_transform(xtrain_slope.reshape(-1,1))              # (152, 3)
xtrain_thal = ohe4.fit_transform(xtrain_thal.reshape(-1,1))                # (152, 3)


xTrain = np.hstack((xtrain_pre, xtrain_cp, xtrain_restecg, xtrain_slope, xtrain_thal))   # (152, 22)
yTrain = ytrain                                                                          # (152,)



xtest_pre = testSet.drop(["cp", "restecg", "slope", "thal", "heartdisease::category|0|1"], axis=1).iloc[:,:].values   # (145, 9)
xtest_cp = testSet["cp"].iloc[:].values
xtest_restecg = testSet["restecg"].iloc[:].values
xtest_slope = testSet["slope"].iloc[:].values
xtest_thal = testSet["thal"].iloc[:].values

xtest_cp = ohe1.transform(xtest_cp.reshape(-1,1))                 # (145, 4)
xtest_restecg = ohe2.transform(xtest_restecg.reshape(-1,1))       # (145, 3)
xtest_slope = ohe3.transform(xtest_slope.reshape(-1,1))           # (145, 3)
xtest_thal = ohe4.transform(xtest_thal.reshape(-1,1))             # (145, 3)

xTest = np.hstack((xtest_pre, xtest_cp, xtest_restecg, xtest_slope, xtest_thal))   # (145, 22)
yTest = ytest       
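
As a side note, the same one-hot preprocessing can be written more compactly with scikit-learn's ColumnTransformer. The following is only a sketch of an equivalent alternative; the encoded columns come first in its output, so the column order differs from the manual hstack above, which does not affect the models.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categoricalCols = ["cp", "restecg", "slope", "thal"]
targetCol = "heartdisease::category|0|1"

# One-hot encode the categorical columns, pass the remaining columns through unchanged
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categoricalCols)],
    remainder="passthrough",
    sparse_threshold=0.0,  # always return a dense array
)

xTrainAlt = encoder.fit_transform(trainSet.drop([targetCol], axis=1))  # (152, 22)
xTestAlt = encoder.transform(testSet.drop([targetCol], axis=1))        # (145, 22)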

Build Model

First of all, we will build the SVM model and use cross-validation (GridSearchCV) to find the best parameters for the RBF kernel.
After some testing, I found that the RBF kernel gives the best performance.

svc = svm.SVC()
parameters_kernel = ['rbf']
parameters_C = np.linspace(100,1000, num=10)
parameters_gamma = np.linspace(1e-3,1e-4, num=10)

parameters = {'kernel': parameters_kernel, 'C':parameters_C, 'gamma':parameters_gamma}

# parameters = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
#               {'kernel': ['poly'], 'C': [1, 10, 100, 1000], 'degree': [3]}
#              ]

clf = GridSearchCV(estimator=svc, param_grid=parameters, cv=5)
clf.fit(xTrain,yTrain)

print("Best Parameters:", clf.best_params_)
# print("Best Estimators:\n", clf.best_estimator_)
print("Best Scores:", clf.best_score_)

svcBest = clf.best_estimator_
svcScore = svcBest.score(xTest, yTest)

print("Test Scores:",svcScore)

The following is the output of the SVM (SVC):

Best Parameters: {'C': 300.0, 'gamma': 0.0001, 'kernel': 'rbf'}
Best Scores: 0.7236842105263158
Test Scores: 0.7862068965517242

From the above analysis, we have obtained the best parameters for the SVM with RBF kernel.
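
Besides the single best combination, GridSearchCV also stores the cross-validation score of every parameter combination it tried. A minimal sketch of inspecting them, assuming the fitted clf from above:

import pandas as pd

# Turn the full grid-search log into a table and look at the top combinations
cvResults = pd.DataFrame(clf.cv_results_)
cols = ["param_C", "param_gamma", "mean_test_score", "rank_test_score"]
print(cvResults[cols].sort_values("rank_test_score").head())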

Now I build four models: SVM with RBF kernel, SVM with polynomial kernel, bagged decision trees (bagging), and AdaBoost. In order to compare their results, I print their test scores together.


# SVM with RBF kernel, using the best parameters found by the grid search
svcRBF = SVC(C=300.0, gamma=0.0001, kernel='rbf', probability=True)
svcRBF.fit(xTrain, yTrain)
svcRBFScore = svcRBF.score(xTest, yTest)  # test accuracy
print("the test score of svcRBFScore: " + str(svcRBFScore))

# SVM with polynomial kernel
svcPoly = SVC(C=1.0, degree=8.666666, coef0=1.0, gamma='scale', max_iter=-1, kernel='poly', probability=True)
svcPoly.fit(xTrain, yTrain)
svcPolyScore = svcPoly.score(xTest, yTest)  # test accuracy
print("the test score of svcPolyScore: " + str(svcPolyScore))


# Decision-tree ensembles: bagging and AdaBoost
decisionTree = tree.DecisionTreeClassifier()
decisionTreeBagging = BaggingClassifier(decisionTree, max_samples=0.7, max_features=1.0)
decisionTreeAda = AdaBoostClassifier(decisionTree, n_estimators=10, random_state=np.random.RandomState(1))

decisionTreeBagging.fit(xTrain, yTrain)
decisionTreeAda.fit(xTrain, yTrain)
Bagging_score = decisionTreeBagging.score(xTest, yTest)
AdaBoost_score = decisionTreeAda.score(xTest, yTest)

print("the test score of Bagging:", Bagging_score)
print("the test score of Adaboost:", AdaBoost_score)


the test score of svcRBFScore: 0.7862068965517242
the test score of svcPolyScore: 0.696551724137931
the test score of Bagging: 0.8
the test score of Adaboost: 0.7103448275862069

For now, bagging (bagged decision trees) seems to have the best performance. Later, I will try a statistical test to give a more detailed comparison.
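
One common way to compare two classifiers on the same test set is McNemar's test on their disagreements. Below is a minimal sketch of that idea, assuming the fitted svcRBF and decisionTreeBagging models from above and that statsmodels is installed; it is only an illustration, not part of the analysis yet.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

predSVM = svcRBF.predict(xTest)
predBag = decisionTreeBagging.predict(xTest)

correctSVM = (predSVM == yTest)
correctBag = (predBag == yTest)

# 2x2 table: rows = SVM correct/incorrect, columns = bagging correct/incorrect
table = np.array([
    [np.sum(correctSVM & correctBag),  np.sum(correctSVM & ~correctBag)],
    [np.sum(~correctSVM & correctBag), np.sum(~correctSVM & ~correctBag)],
])

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print("McNemar statistic:", result.statistic)
print("p-value:", result.pvalue)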
