https://blog.csdn.net/Dinosoft/article/details/50734539
What you learn on paper always feels shallow; you really have to practice!
I've read plenty of introductory material before, and if I were to write an introduction myself now, I think I'd pick "digit recognizer" as the example: it's interesting enough and it illustrates a lot of issues. Kaggle is a good place to practice, Python is a convenient language, and sklearn is a nice library whose documentation is well suited for learning. So let's use sklearn to practice some machine learning and deepen our understanding! As for the machine learning algorithms themselves, I won't go over them here; see the other posts on this blog.
Reading the Kaggle data
import pandas as pd
import numpy as np
import time
from sklearn.cross_validation import cross_val_score  #in sklearn >= 0.18 this moved to sklearn.model_selection
#read data
dataset = pd.read_csv("./data/train.csv")
X_train = dataset.values[0:, 1:]
y_train = dataset.values[0:, 0]
#for fast evaluation
X_train_small = X_train[:10000, :]
y_train_small = y_train[:10000]
X_test = pd.read_csv("./data/test.csv").values
This code can serve as a template: basic data loading, splitting out X and y, and carving off a small subset for fast iteration. Training turned out to take a while, so print the elapsed time as well. Offline evaluation with cross validation is of course also needed.
The pandas DataFrame is certainly a great thing, but learning it thoroughly takes too much time; I'd suggest starting with just a few of the most commonly used operations.
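For instance, a handful of calls already cover most of what this post needs (a quick sketch; the column names follow Kaggle's train.csv, which has a 'label' column plus 784 pixel columns):
import pandas as pd
dataset = pd.read_csv("./data/train.csv")
print(dataset.shape)                    #(rows, 785): label + 28*28 pixel columns
print(dataset.head())                   #peek at the first few rows
print(dataset['label'].value_counts())  #how balanced the ten digit classes are
X = dataset.values[:, 1:]               #drop the label column -> pixel matrix
y = dataset.values[:, 0]                #the labels themselves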
For digit recognition, from the perspective of how a human brain learns, you would probably first recognize the strokes and then identify the digit from the key structures those strokes form. For example, an 8 is two circles stacked on top of each other. If you had never studied machine learning, that is probably where your thinking would start. Now let's compare how machine learning approaches it.
KNN
KNN has a very direct intuition here: whichever digit a sample looks most like is the digit we predict. It may look a bit crude, but the logic is perfectly sound! It takes a different road, and it's actually a pretty neat idea. KNN has to keep the original training samples around, which has a flavor of rote memorization (it is a non-parametric model). Who knows, maybe cats and dogs "learn" this way.
As a machine learning library-caller and parameter-tuner, I've already picked sklearn, so what remains is tuning. The documentation is at KNeighborsClassifier, and it really is good material for learning.
To speed up prediction, besides the kd-tree structure I'd seen in the book "Statistical Learning Methods", the docs also mention ball_tree. Together with brute force that makes three methods, i.e. {'auto', 'ball_tree', 'kd_tree', 'brute'}. This only affects prediction time, not accuracy.
Reading the docs (now you see what I mean about sklearn's documentation being good), I found that besides specifying k we can also specify a radius. I tried it, but since the data here is high-dimensional I couldn't pick a sensible radius; if it is too small, some samples end up with no neighbors inside the radius at all, which is a pain. Let's stick with k.
I first tried increasing k, expecting accuracy to improve since more samples would be used for the decision, but it actually dropped! Thinking about it, with a larger k the less similar samples get mixed in as well. That won't do; their weight needs to be reduced. Adding weights='distance' indeed helps!
The other parameter is metric='minkowski'. "Minkowski" sounds odd, but it is just a more general distance formula: p=2 gives Euclidean distance and p=1 gives Manhattan distance. The default is 2; setting it to 1 made accuracy drop, and setting it to 3 improved it a little (but the speed fell off a cliff!? does it affect the kd-tree construction?). This seems hard to explain intuitively, and the distance computation obviously also feeds into weights='distance'. A more intuitive reading of the distance here is really "similarity": how do you measure the similarity of two images? Hard to say. Image-processing techniques could probably help, but I won't dig into that here.
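To make the metric concrete, here is the Minkowski distance written out by hand (a standalone illustration, not part of the pipeline below):
import numpy as np
def minkowski(x, y, p):
    #(sum_i |x_i - y_i|^p)^(1/p): p=1 is Manhattan, p=2 is Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))   #7.0   Manhattan
print(minkowski(a, b, 2))   #5.0   Euclidean
print(minkowski(a, b, 3))   #~4.5  larger p emphasizes the biggest coordinate differences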
Code
#knn
from sklearn.neighbors import KNeighborsClassifier
#begin time
start = time.clock()
#progressing
knn_clf=KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', weights='distance', p=3)
score = cross_val_score(knn_clf, X_train_small, y_train_small, cv=3)
print( score.mean() )
#end time
elapsed = (time.clock() - start)
print("Time used:",int(elapsed), "s")
#k=3
#0.942300738697
#0.946100822903 weights='distance'
#0.950799888775 p=3
#k=5
#0.939899237556
#0.94259888029
#k=7
#0.935395994386
#0.938997377902
#k=9
#0.933897851978
Finally, train on the full dataset and submit to Kaggle. Code template:
clf=knn_clf
start = time.clock()
clf.fit(X_train,y_train)
elapsed = (time.clock() - start)
print("Training Time used:",int(elapsed/60) , "min")
result=clf.predict(X_test)
result = np.c_[range(1,len(result)+1), result.astype(int)]
df_result = pd.DataFrame(result, columns=['ImageId', 'Label'])
df_result.to_csv('./results.knn.csv', index=False)
#end time
elapsed = (time.clock() - start)
print("Test Time used:",int(elapsed/60) , "min")123456789101112131415
Submitted score: 0.96943
(‘Training Time used:’, 26s)
(‘Test Time used:’, 1374s)
LR
Now it's the turn of LR, the jack-of-all-trades. Clearly LR is not a great fit for this problem, but as you'll see the result isn't terrible either. The intuitive reading of LR here is that it scores every individual pixel: does a darker value at this pixel argue for the target digit or for the other digits? Digits, however, are recognized mainly by structure, and stroke darkness, slight translations, tilt, and font variations all throw LR off, so the result won't be great (the neural network discussed later makes up for these weaknesses). Dividing by 256 in the code is only to make tuning C more convenient.
#LR also works!
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV  #needed for the grid search below
#begin time
start = time.clock()
#progressing
lr_clf=LogisticRegression(penalty='l2', solver ='lbfgs', multi_class='multinomial', max_iter=800, C=0.2 )
#lr_clf=LogisticRegression(penalty='l1', multi_class='ovr', max_iter=400, C=4 )
parameters = {'penalty':['l2'] , 'C':[2e-2, 4e-2,8e-2, 12e-2, 2e-1]}
#parameters = {'penalty':['l1'] , 'C':[2e0,2e1, 2e2]}
gs_clf = GridSearchCV(lr_clf, parameters, n_jobs=1, verbose=True )
gs_clf.fit( X_train_small.astype('float')/256, y_train_small )
print()
for params, mean_score, scores in gs_clf.grid_scores_:
print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
print()
#end time
elapsed = (time.clock() - start)
print("Time used:",elapsed)
#print the model parameters to take a look, e.g.
#clf.coef_[1,:]
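The commented-out clf.coef_ line above hints at something worth doing: each class's weight vector has one entry per pixel, so it can be reshaped back to 28x28 and viewed as an image. A hedged sketch, assuming matplotlib is installed and gs_clf has been fitted as above:
import matplotlib.pyplot as plt
best_lr = gs_clf.best_estimator_            #the LR refit with the best C found
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for digit, ax in enumerate(axes.ravel()):
    #red/blue shows which pixels vote for or against this digit
    ax.imshow(best_lr.coef_[digit].reshape(28, 28), cmap='RdBu')
    ax.set_title(str(digit))
    ax.axis('off')
plt.show()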
Some tuning results on the small dataset:
0.870 (+/-0.004) for {‘penalty’: ‘l2’, ‘C’: 0.002}
0.900 (+/-0.005) for {‘penalty’: ‘l2’, ‘C’: 0.02}
0.905 (+/-0.001) for {‘penalty’: ‘l2’, ‘C’: 0.2}
0.890 (+/-0.003) for {‘penalty’: ‘l2’, ‘C’: 2.0}
(‘Time used:’, 114.5217506956833)
0.900 (+/-0.005) for {‘penalty’: ‘l2’, ‘C’: 0.02}
0.904 (+/-0.006) for {‘penalty’: ‘l2’, ‘C’: 0.04}
0.908 (+/-0.005) for {‘penalty’: ‘l2’, ‘C’: 0.08}
0.908 (+/-0.005) for {‘penalty’: ‘l2’, ‘C’: 0.12}
0.905 (+/-0.001) for {‘penalty’: ‘l2’, ‘C’: 0.2}
In the end I went with LR, max_iter=800, C=0.2; training is fast anyway, so let it iterate a bit more.
Score: 0.92157
SVM
As mentioned above, LR doesn't handle this kind of nonlinear problem particularly well, and while KNN does better, it is very slow at prediction time. SVM addresses both issues nicely: the RBF kernel fits the nonlinearity, and support vectors are somewhat like KNN's nearest neighbors, except that they sit on the decision boundary and are therefore more "representative", which works better than mechanically picking the nearest neighbors. And since only the support vectors are kept, prediction is much faster.
#svc
from sklearn.svm import SVC,NuSVC
from sklearn.grid_search import GridSearchCV  #in sklearn >= 0.18 this moved to sklearn.model_selection
#begin time
start = time.clock()
#progressing
parameters = {'nu':(0.05, 0.02) , 'gamma':[3e-2, 2e-2, 1e-2]}
svc_clf=NuSVC(nu=0.1, kernel='rbf', verbose=True )
gs_clf = GridSearchCV(svc_clf, parameters, n_jobs=1, verbose=True )
gs_clf.fit( X_train_small.astype('float')/256, y_train_small )
print()
for params, mean_score, scores in gs_clf.grid_scores_:
print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
print()
#end time
elapsed = (time.clock() - start)
print("Time used:",elapsed)
Setting n_jobs=2 seemed to throw an exception, so I had to stick with 1.
Tuning process:
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM]LibSVM
0.968 (+/-0.001) for {‘nu’: 0.05, ‘gamma’: 0.03}
0.968 (+/-0.001) for {‘nu’: 0.02, ‘gamma’: 0.03}
0.967 (+/-0.003) for {‘nu’: 0.05, ‘gamma’: 0.02}
0.968 (+/-0.002) for {‘nu’: 0.02, ‘gamma’: 0.02}
0.961 (+/-0.002) for {‘nu’: 0.05, ‘gamma’: 0.01}
0.963 (+/-0.002) for {‘nu’: 0.02, ‘gamma’: 0.01}
(‘Time used:’, 819.6633204167592)
Let's go with nu=0.02, gamma=0.02.
SVM is simply slow to train and fast to predict. If you're bored while waiting, the meaning of the parameters printed during training can be googled:
optimization finished, #iter = 1456
C = 2.065921
obj = 160.316989, rho = 0.340949
nSV = 599, nBSV = 15
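nSV here is the number of support vectors; after fitting, sklearn exposes the same information on the estimator. A quick check (a sketch that refits a NuSVC with the chosen parameters on the full training data, which takes the few minutes reported below):
from sklearn.svm import NuSVC
svc_best = NuSVC(nu=0.02, gamma=0.02, kernel='rbf')
svc_best.fit(X_train.astype('float') / 256, y_train)
print(svc_best.n_support_)               #support vector count for each digit class
print(svc_best.support_vectors_.shape)   #(total support vectors, 784 pixels)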
Training time is still acceptable:
[LibSVM](‘Training Time used:’, 6, ‘min’)
(‘Test Time used:’, 12, ‘min’)
Score: 0.98286!
Random Forest
Next, let's try an ensemble method. Intuitively, like LR, it also makes its decision from individual pixels, but because the underlying model is a tree structure, it can learn nonlinear boundaries. In theory it should do somewhat better than LR.
from sklearn.ensemble import RandomForestClassifier
#begin time
start = time.clock()
#progressing
parameters = {'criterion':['gini','entropy'] , 'max_features':['auto', 12, 100]}
rf_clf=RandomForestClassifier(n_estimators=400, n_jobs=4, verbose=1)
gs_clf = GridSearchCV(rf_clf, parameters, n_jobs=1, verbose=True )
gs_clf.fit( X_train_small.astype('int'), y_train_small )
print()
for params, mean_score, scores in gs_clf.grid_scores_:
print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() * 2, params))
print()
#end time
elapsed = (time.clock() - start)
print("Time used:",elapsed)12345678910111213141516171819
0.946 (+/-0.002) for {‘max_features’: ‘auto’, ‘criterion’: ‘gini’}
0.945 (+/-0.001) for {‘max_features’: 12, ‘criterion’: ‘gini’}
0.943 (+/-0.005) for {‘max_features’: 100, ‘criterion’: ‘gini’}
0.944 (+/-0.004) for {‘max_features’: ‘auto’, ‘criterion’: ‘entropy’}
0.944 (+/-0.006) for {‘max_features’: 12, ‘criterion’: ‘entropy’}
0.942 (+/-0.007) for {‘max_features’: 100, ‘criterion’: ‘entropy’}
()
(‘Time used:’, 342.1534636337892)
0.946 (+/-0.005) for {‘max_features’: ‘auto’, ‘criterion’: ‘gini’, ‘max_depth’: None}
0.889 (+/-0.004) for {‘max_features’: ‘auto’, ‘criterion’: ‘gini’, ‘max_depth’: 6}
0.945 (+/-0.004) for {‘max_features’: ‘auto’, ‘criterion’: ‘gini’, ‘max_depth’: 18}
0.946 (+/-0.004) for {‘max_features’: ‘auto’, ‘criterion’: ‘gini’, ‘max_depth’: 32}
()
(‘Test Time used:’, 1, ‘min’)
They all look about the same. Let's just submit with the default parameters then.
Forgot to take a screenshot; adding it later: 0.96714.
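As an aside, a random forest also reports per-pixel feature importances, which can be reshaped to 28x28 to see which regions the trees actually split on (a sketch, assuming matplotlib is available and reusing the rf_clf defined above, refit here on the small training set):
import matplotlib.pyplot as plt
rf_clf.fit(X_train_small, y_train_small)
plt.imshow(rf_clf.feature_importances_.reshape(28, 28), cmap='hot')
plt.colorbar()
plt.title('RandomForest feature importances')
plt.show()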
Deep Learning
For image problems like this, the strongest approach today is still a neural network. I studied UFLDL, then looked at theano and found it is essentially a symbolic computation and automatic differentiation library whose code can be compiled for CUDA. Keras then adds a layer of abstraction on top of theano, implementing the common building blocks of neural networks; that's roughly the picture. I don't have much experience building networks or tuning them yet, so let's just copy a demo and run it.
#DL modified from keras's example
'''Train a simple convnet on the MNIST dataset.
Run on GPU: THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_cnn.py
Get to 99.25% test accuracy after 12 epochs (there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
#from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
batch_size = 128
nb_classes = 10
nb_epoch = 60
# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3
#(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
X_train = X_train.astype('float32')
X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
X_test = X_test.astype('float32')
X_test /=255
X_train /= 255
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
#Y_test = np_utils.to_categorical(y_test, nb_classes)
model = Sequential()
model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
border_mode='valid',
input_shape=(1, img_rows, img_cols)))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, nb_conv, nb_conv))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta')
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
show_accuracy=True, verbose=1 )
test_result=model.predict_classes( X_test, batch_size=128, verbose=1)
result = np.c_[range(1,len(test_result)+1), test_result.astype(int)]
df_result = pd.DataFrame(result[:,0:2], columns=['ImageId', 'Label'])
df_result.to_csv('./results.dl.csv', index=False)
At first I didn't enable the GPU and one epoch took 800s; with the GPU it only takes 40s! I noticed that with the GPU, CPU utilization went down but the CPU temperature went up?! I'm not sure whether the heat pipe is conducting the GPU's heat over, or whether the CPU also heats up from exchanging data with the GPU over the bus??
Since I don't know how to tune the network, for now I just naively increased the number of epochs, which did help; in the end I tried 60 and submitted.
Sure enough, it's a heavy weapon: 0.99114!
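If you want to avoid guessing the epoch count, one simple option is to let keras hold out part of the training data and watch the validation accuracy; a minimal sketch using the fit() arguments of the old keras API assumed above:
#hold out 10% of the training data; stop increasing nb_epoch once validation accuracy flattens
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          show_accuracy=True, verbose=1, validation_split=0.1)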
Summary
The linear LR model is clearly the weakest. The neural network really is the strongest for this kind of image problem today. The SVM's support vectors play a very visible role here, accurately picking out the most discriminative "exemplar images". RF is something of a jack-of-all-trades for nonlinear problems, and the default parameters already do well here; it is only slightly worse than KNN, since it relies only on local pixel information. Of course, this comparison applies only to the digit-recognition problem; other problems may give different results, so analyze each problem specifically, consider each model's characteristics, and pick the appropriate one.
I found that actually getting hands-on and analyzing the experimental results deepened my understanding of the models.