1. Use a decision tree classifier (default) to train and test the malicious URL dataset. (2 pt)
2. Explore how the tree depth number can affect the accuracy score (3 pt)
3. Use random forest to repeat step 2. What’s your observation? (3 pt)
4. Try both the decision tree and random forests with the credit card fraud dataset. What’s your observation on accuracy score change? (2 pt)
creditcard.csv
dataset.csv
hw2_Kyle Wang.ipynb
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score
urls = pd.read_csv("dataset.csv")
X = urls.iloc[:, 1:30]
y = urls['Result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Accuracy: 0.9484429512856958
# Sweep max_depth over 10, 20, 30, ..., 100
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
max_depth: 10
Accuracy: 0.9381056984106474
max_depth: 20
Accuracy: 0.9488305982685101
max_depth: 30
Accuracy: 0.9477968729810053
max_depth: 40
Accuracy: 0.9505104018607056
max_depth: 50
Accuracy: 0.9477968729810053
max_depth: 60
Accuracy: 0.9481845199638196
max_depth: 70
Accuracy: 0.9489598139294483
max_depth: 80
Accuracy: 0.9461170693888099
max_depth: 90
Accuracy: 0.9497351078950769
max_depth: 100
Accuracy: 0.9490890295903863
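At max_depth = 10 the accuracy is consistently the lowest. It then rises and levels off: every deeper setting falls between about 0.946 and 0.951 in the run above.
Across repeated runs, max_depth = 60 most often gave the best score, although in the run shown the maximum (0.9505) is at max_depth = 40. The classifier is created without a fixed random_state, so scores shift slightly from run to run.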
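One likely reason the accuracy levels off is that a tree stops growing once its leaves are pure, so any max_depth above that natural depth no longer constrains the fit. The sketch below compares the cap with the depth actually reached, as reported by get_depth(). It assumes synthetic data from make_classification in place of dataset.csv; the sample count, feature count, and the helper name depth_sweep are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def depth_sweep(X_train, X_test, y_train, y_test, caps):
    """Fit one tree per max_depth cap; return (cap, actual depth, accuracy) rows."""
    rows = []
    for cap in caps:
        # random_state is fixed so repeated runs build the same trees
        clf = DecisionTreeClassifier(max_depth=cap, random_state=0)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        rows.append((cap, clf.get_depth(), acc))
    return rows


# Synthetic stand-in for dataset.csv (assumption: 2000 rows, 29 features)
X, y = make_classification(n_samples=2000, n_features=29, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, random_state=1
)

results = depth_sweep(X_train, X_test, y_train, y_test, range(10, 101, 10))
for cap, depth, acc in results:
    print(f"max_depth={cap:3d}  actual depth={depth:2d}  accuracy={acc:.4f}")
```

If the actual depth stops increasing well below the cap, the larger max_depth values are all fitting the same unconstrained tree, and any remaining differences between them in the notebook come from run-to-run randomness.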
from sklearn.ensemble import RandomForestClassifier
# Sweep max_depth over 10, 20, 30, ..., 100
for depth in range(10, 101, 10):
    print(depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
10
Accuracy: 0.9483137356247577
20
Accuracy: 0.9594262824654348
30
Accuracy: 0.9591678511435586
40
Accuracy: 0.9598139294482492
50
Accuracy: 0.9589094198216824
60
Accuracy: 0.960847654735754
70
Accuracy: 0.9605892234138778
80
Accuracy: 0.960847654735754
90
Accuracy: 0.9603307920920016
100
Accuracy: 0.9592970668044967
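Compared with the single decision tree, the random forest improves overall accuracy slightly (about 0.96 versus about 0.95). The overall trend is unchanged: accuracy is lowest at max_depth = 10, then levels off.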
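Because neither classifier above fixes random_state, differences of a few thousandths between single runs may partly be noise. A hedged sketch of averaging over several seeds follows, again on synthetic data standing in for dataset.csv. The helper mean_accuracy, the five seeds, max_depth=20, and n_estimators=25 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def mean_accuracy(make_model, X_train, X_test, y_train, y_test, seeds):
    """Train one model per seed; return mean and std of the test accuracy."""
    scores = []
    for seed in seeds:
        model = make_model(seed).fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in for dataset.csv (assumption: 2000 rows, 29 features)
X, y = make_classification(n_samples=2000, n_features=29, random_state=0)
split = train_test_split(X, y, test_size=0.7, random_state=1)

seeds = range(5)
tree_mean, tree_std = mean_accuracy(
    lambda s: DecisionTreeClassifier(max_depth=20, random_state=s),
    *split, seeds,
)
forest_mean, forest_std = mean_accuracy(
    lambda s: RandomForestClassifier(max_depth=20, n_estimators=25, random_state=s),
    *split, seeds,
)

print(f"decision tree: {tree_mean:.4f} +/- {tree_std:.4f}")
print(f"random forest: {forest_mean:.4f} +/- {forest_std:.4f}")
```

Comparing the means together with their spread gives a steadier basis for saying one model or one depth is better than comparing two single runs.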
cards = pd.read_csv("creditcard.csv")
X = cards.iloc[:, 0:30]
y = cards['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=1) # 30% training and 70% test
# Sweep max_depth over 10, 20, 30, ..., 100
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = DecisionTreeClassifier(max_depth=depth)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
max_depth: 10
Accuracy: 0.9993027863466506
max_depth: 20
Accuracy: 0.999117197100795
max_depth: 30
Accuracy: 0.999132244877486
max_depth: 40
Accuracy: 0.9991121811752314
max_depth: 50
Accuracy: 0.9990770696962857
max_depth: 60
Accuracy: 0.999102149324104
max_depth: 70
Accuracy: 0.999102149324104
max_depth: 80
Accuracy: 0.9991372608030497
max_depth: 90
Accuracy: 0.9990770696962857
max_depth: 100
Accuracy: 0.9990319263662127
DecisionTreeClassifier: accuracy trends slightly downward as max_depth increases, from 0.99930 at max_depth = 10 to 0.99903 at max_depth = 100.
# Sweep max_depth over 10, 20, 30, ..., 100
for depth in range(10, 101, 10):
    print("max_depth:", depth)
    clf = RandomForestClassifier(max_depth=depth)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
max_depth: 10
Accuracy: 0.9994181526346149
max_depth: 20
Accuracy: 0.9994382163368696
max_depth: 30
Accuracy: 0.9994332004113059
max_depth: 40
Accuracy: 0.9994382163368696
max_depth: 50
Accuracy: 0.9994231685601785
max_depth: 60
Accuracy: 0.9994332004113059
max_depth: 70
Accuracy: 0.9994332004113059
max_depth: 80
Accuracy: 0.9994181526346149
max_depth: 90
Accuracy: 0.9994131367090512
max_depth: 100
Accuracy: 0.9994281844857422
RandomForestClassifier: accuracy is relatively stable, staying between about 0.99941 and 0.99944 at every depth.
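Both classifiers score above 0.999 here, but fraud datasets are typically highly imbalanced, so accuracy alone can hide missed fraud cases. The sketch below compares accuracy with a majority-class baseline and reports the metrics already imported at the top of the notebook (confusion_matrix, classification_report, roc_auc_score). It assumes a synthetic dataset with roughly 1% positives in place of creditcard.csv; the class ratio, sizes, and max_depth=10 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for creditcard.csv (assumption: about 1% positive class)
X, y = make_classification(
    n_samples=5000, n_features=30, weights=[0.99], flip_y=0, random_state=0
)
# stratify keeps the rare-class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, stratify=y, random_state=1
)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

clf = RandomForestClassifier(max_depth=10, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
model_acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
cm = confusion_matrix(y_test, y_pred)

print("Baseline accuracy:", baseline_acc)
print("Model accuracy:   ", model_acc)
print("ROC AUC:          ", auc)
print("Confusion matrix (rows = true class, columns = predicted class):")
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```

When the baseline accuracy is already close to the model accuracy, the recall for the positive class and the confusion matrix are more informative than the accuracy score for comparing depths or models.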