I've been working through Machine Learning in Action for a while now, and looking back I realize I never wrote down any notes, so sad. I'll accumulate them bit by bit...
The plan: implement kNN with NumPy, and compare it against neighbors.KNeighborsClassifier() from sklearn.
1. The training set has 1934 handwritten-digit samples, roughly 190 per digit from 0 to 9.
2. The test set has 946 samples, roughly 90-odd per digit.
3. Data files are named digit_instance.txt; for example, 0_3.txt is the 3rd instance of the digit 0.
4. Each sample is a 32x32 grid of 0s and 1s (a quick peek below shows the raw format).
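To get a feel for the format, a few lines of Python print the first rows of one sample; the path assumes the training files live in a trainingDigits directory, as in the test code later on:
with open("trainingDigits/0_3.txt") as f:
    for _ in range(5):
        print(f.readline().rstrip())  # each row is a string of 32 '0'/'1' characters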
Use NumPy to turn each binary image into a vector.
Required imports:
import numpy as np
from os import listdir
import operator
import time
# img2vector
# Parameter: file name
# Returns: NumPy array
# Converts a 32x32 binary image matrix into a 1x1024 vector
def img2vector(file):
    returnVec = np.zeros((1, 1024))
    with open(file) as fr:
        for i in range(32):          # one line of the file per image row
            lineStr = fr.readline()
            for j in range(32):
                returnVec[0, 32*i+j] = int(lineStr[j])
    return returnVec
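A quick sanity check of the conversion, on the same assumed sample file:
vec = img2vector("trainingDigits/0_3.txt")
print(vec.shape)       # (1, 1024)
print(int(vec.sum()))  # number of '1' pixels in the image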
Next, read all the data files and convert them into vectors:
# Read every image data file under a directory and stack them into a matrix
def read_and_convert(filePath):
    dataLabel = []
    fileList = listdir(filePath)
    fileAmount = len(fileList)
    dataMat = np.zeros((fileAmount, 1024))
    for i in range(fileAmount):
        fileNameStr = fileList[i]
        classTag = int(fileNameStr.split(".")[0].split("_")[0])  # label parsed from "digit_instance.txt"
        dataLabel.append(classTag)
        dataMat[i, :] = img2vector(filePath + "/{}".format(fileNameStr))
    return dataMat, dataLabel
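Usage looks like this; given the sample counts above, the training matrix should come out as 1934 x 1024:
trainMat, trainLabel = read_and_convert("trainingDigits")
print(trainMat.shape)   # expected: (1934, 1024)
print(trainLabel[:10])  # one integer label in 0..9 per file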
Now the core part, the kNN classification code:
# classify
# Classifies a single test sample
# Parameters:
#   inX: the sample to classify
#   dataSet: training data matrix
#   labels: class label vector
#   k: number of neighbors
# Returns: the predicted class label
def classify(inX, dataSet, labels, k=3):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # tile(A, reps) repeats array A to match dataSet's shape
    sqDiffMat = np.power(diffMat, 2)
    sqDistance = sqDiffMat.sum(axis=1)
    distance = np.sqrt(sqDistance)                      # Euclidean distance to every training sample
    sortedDistIndicies = distance.argsort()
    classCount = {}
    for i in range(k):                                  # majority vote among the k nearest neighbors
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
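As a side note, the np.tile call isn't strictly necessary: NumPy broadcasting subtracts a single row from every row of the matrix directly. A minimal equivalent sketch (classify_broadcast is my own name, not from the book):
from collections import Counter

def classify_broadcast(inX, dataSet, labels, k=3):
    # broadcasting stretches the 1024-element row inX across all rows of dataSet
    distance = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    kNearest = [labels[i] for i in distance.argsort()[:k]]  # labels of the k nearest neighbors
    return Counter(kNearest).most_common(1)[0][0]           # majority vote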
# Handwritten-digit recognition test
def handwrittingclassify():
    trainFilePath = "trainingDigits"
    testFilePath = "testDigits"
    trainMat, trainLabel = read_and_convert(trainFilePath)
    testMat, testLabel = read_and_convert(testFilePath)
    m, n = testMat.shape
    errorCount = 0.0
    st = time.perf_counter()  # time.clock() was removed in Python 3.8; perf_counter() replaces it
    for i in range(m):
        classifyresult = classify(testMat[i], trainMat, trainLabel, 3)
        if classifyresult != testLabel[i]:
            errorCount += 1
    et = time.perf_counter()
    print("cost {:.4f} s".format(et - st))
    print("total error counts: {}".format(errorCount))
    print("total error rate is: {:.6f}".format(errorCount / float(m)))
    print("accuracy rate is: {:.6f}".format(1 - errorCount / float(m)))

if __name__ == "__main__":
    handwrittingclassify()
Running the code gives the following result for k=3:
cost 45.5809 s
total error counts: 11.0
total error rate is: 0.011628
accuracy rate is: 0.988372
As you can see, the classification accuracy is quite high.
Now let's adjust k and see what changes:
k=2:
cost 46.6430 s
total error counts: 13.0
total error rate is: 0.013742
accuracy rate is: 0.986258
k=4:
cost 46.7659 s
total error counts: 14.0
total error rate is: 0.014799
accuracy rate is: 0.985201
k=5:
cost 46.6895 s
total error counts: 17.0
total error rate is: 0.017970
accuracy rate is: 0.982030
k=6:
cost 45.2715 s
total error counts: 19.0
total error rate is: 0.020085
accuracy rate is: 0.979915
Different values of k give different error rates, and different computational costs. If k is too large, the model tends to underfit; if k is too small, it tends to overfit. Choosing k usually comes down to cross-validation, or to experience; a minimal cross-validation sketch follows. Testing shows that on this dataset, k=3 works best.
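Here is one way to do that selection with sklearn's cross_val_score on the training set (the 5-fold split is an arbitrary choice of mine):
from sklearn import neighbors
from sklearn.model_selection import cross_val_score

trainMat, trainLabel = read_and_convert("trainingDigits")
for k in range(1, 7):
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, trainMat, trainLabel, cv=5)  # 5-fold cross-validation
    print("k={}: mean accuracy {:.6f}".format(k, scores.mean()))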
Now for the sklearn version, reusing the same data-preparation functions:
from sklearn import neighbors

trainFilePath = "trainingDigits"
testFilePath = "testDigits"
# Prepare the training and test data
trainData, trainLabel = read_and_convert(trainFilePath)
testData, testLabel = read_and_convert(testFilePath)
lst = time.perf_counter()
for n in [1, 2, 3, 4, 5, 6]:
    st = time.perf_counter()
    for weights in ['uniform', 'distance']:  # unweighted vote vs. distance-weighted vote
        clf = neighbors.KNeighborsClassifier(n_neighbors=n, weights=weights)
        clf.fit(trainData, trainLabel)
        score = clf.score(testData, testLabel)
        print(str(n) + " neighbors " + weights + " score: {:.6f}".format(score))
    et = time.perf_counter()
    print(str(n) + " neighbors cost: {:.4f} s".format(et - st))
let = time.perf_counter()
print("total time: {:.4f} s".format(let - lst))
1 neighbors uniform score: 0.986258
1 neighbors distance score: 0.986258
1 neighbors cost: 8.8149 s
2 neighbors uniform score: 0.976744
2 neighbors distance score: 0.986258
2 neighbors cost: 9.5726 s
3 neighbors uniform score: 0.987315
3 neighbors distance score: 0.987315
3 neighbors cost: 9.5219 s
4 neighbors uniform score: 0.983087
4 neighbors distance score: 0.989429
4 neighbors cost: 9.2445 s
5 neighbors uniform score: 0.980973
5 neighbors distance score: 0.982030
5 neighbors cost: 9.0160 s
6 neighbors uniform score: 0.977801
6 neighbors distance score: 0.982030
6 neighbors cost: 9.2116 s
total time: 55.3818 s
The dataset here is actually quite small; with more data, the computational cost would grow quickly.
The comparison shows: the hand-rolled kNN is slow, taking 40+ s for a single pass over the test set, while sklearn needs under 10 s per pass. And that is with small values of k; larger k would push the cost up further.
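Much of that gap comes from vectorization: instead of looping over test samples in Python, all test-to-train distances can be computed in one matrix expression, which is roughly how sklearn's brute-force backend works. A sketch of the idea (knn_predict_all is my own name, not an sklearn function):
import numpy as np
from collections import Counter

def knn_predict_all(testMat, trainMat, trainLabel, k=3):
    # all pairwise squared distances at once, via ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
    d2 = ((testMat ** 2).sum(axis=1)[:, None]
          + (trainMat ** 2).sum(axis=1)[None, :]
          - 2.0 * testMat @ trainMat.T)
    nearest = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest training samples per test row
    labelArr = np.asarray(trainLabel)
    return [Counter(labelArr[row]).most_common(1)[0][0] for row in nearest]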