Machine Learning Datasets - Ex 1: The digits (handwritten digit recognition)

小牛编辑
2023-12-01

Machine Learning Datasets / Example 1: The digits dataset

http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html

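This example shows how to work with one of the sample datasets used in machine learning. It is especially suitable for beginners and for teaching.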

(1) Import the libraries and load the built-in handwritten digits dataset

    # The next line is only for IPython notebooks; remove it in other environments
    %matplotlib inline
    from sklearn import datasets
    import matplotlib.pyplot as plt
    # Load the digits dataset
    digits = datasets.load_digits()
    # Plot the last image in the dataset (index -1)
    plt.figure(1, figsize=(3, 3))
    plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
    plt.show()

[Figure: the last image in the digits dataset]

(2) About the dataset

digits = datasets.load_digits() stores a dict-like object (a scikit-learn Bunch) in digits. The following code inspects its contents:

    for key, value in digits.items():
        try:
            # Arrays expose a shape; print it alongside the key
            print(key, value.shape)
        except AttributeError:
            # Non-array entries such as DESCR have no shape
            print(key)
    ('images', (1797L, 8L, 8L))
    ('data', (1797L, 64L))
    ('target_names', (10L,))
    DESCR
    ('target', (1797L,))
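Explanation of the output (printed under Python 2, where the L suffix marks a long integer):
('images', (1797L, 8L, 8L)): there are 1797 images in total, each 8x8 pixels
('data', (1797L, 64L)): data holds each 8x8 matrix flattened into a one-dimensional vector of 64 elements
('target_names', (10L,)): the labels of the 10 classes, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
DESCR: a description of the dataset
('target', (1797L,)): records which digit each of the 1797 images represents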
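The relationship between images, data, and target described above can be checked directly. The following is a minimal sketch; the names `flattened` and `counts` are introduced here for illustration and are not part of the original example.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a 64-element row vector.
flattened = digits.images.reshape(len(digits.images), -1)

# The flattened images should match the `data` array element for element.
print(flattened.shape)
print(np.array_equal(flattened, digits.data))

# Count how many images carry each of the labels 0..9.
counts = np.bincount(digits.target)
print(counts)
```

The same reshape is what you would apply to new 8x8 images before passing them to an estimator that expects one row per sample.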

Next, we use the following code to look at the data. The actual digit that each image represents is stored in digits.target.

    images_and_labels = list(zip(digits.images, digits.target))
    # Show the first four images, each titled with its label
    for index, (image, label) in enumerate(images_and_labels[:4]):
        plt.subplot(2, 4, index + 1)
        plt.axis('off')
        plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
        plt.title('Training: %i' % label)
    plt.show()

[Figure: the first four digit images, each titled "Training: <label>"]

    # Next, print the description that ships with the dataset
    print(digits['DESCR'])
    Optical Recognition of Handwritten Digits Data Set
    ===================================================

    Notes
    -----
    Data Set Characteristics:
        :Number of Instances: 5620
        :Number of Attributes: 64
        :Attribute Information: 8x8 image of integer pixels in the range 0..16.
        :Missing Attribute Values: None
        :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
        :Date: July; 1998

    This is a copy of the test set of the UCI ML hand-written digits datasets
    http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

    The data set contains images of hand-written digits: 10 classes where
    each class refers to a digit.

    Preprocessing programs made available by NIST were used to extract
    normalized bitmaps of handwritten digits from a preprinted form. From a
    total of 43 people, 30 contributed to the training set and different 13
    to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
    4x4 and the number of on pixels are counted in each block. This generates
    an input matrix of 8x8 where each element is an integer in the range
    0..16. This reduces dimensionality and gives invariance to small
    distortions.

    For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
    T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
    L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
    1994.

    References
    ----------
      - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
        Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
        Graduate Studies in Science and Engineering, Bogazici University.
      - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
      - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
        Linear dimensionalityreduction using relevance weighted LDA. School of
        Electrical and Electronic Engineering Nanyang Technological University.
        2005.
      - Claudio Gentile. A New Approximate Maximal Margin Classification
        Algorithm. NIPS. 2000.

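According to this description, the dataset was created in 1998 by E. Alpaydin and C. Kaynak of the Department of Computer Engineering, Bogazici University, Istanbul, Turkey. The handwriting samples come from 43 people in total. Each sample was first captured as a 32x32 bitmap and then processed into an 8x8 image whose grayscale values are integers from 0 to 16.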
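The block-counting step mentioned in the description (32x32 bitmap, non-overlapping 4x4 blocks, count of "on" pixels per block) can be sketched with NumPy. This is only an illustration under stated assumptions: the random `bitmap` is a hypothetical stand-in for a scanned digit, and `downsample_counts` is a helper name invented here, not the NIST preprocessing code.

```python
import numpy as np

# Hypothetical 32x32 binary bitmap standing in for a scanned digit.
rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))


def downsample_counts(bitmap, block=4):
    """Count the 'on' pixels in each non-overlapping block x block tile."""
    rows, cols = bitmap.shape
    # Split both axes into (tile index, offset within tile), then sum the offsets.
    tiles = bitmap.reshape(rows // block, block, cols // block, block)
    return tiles.sum(axis=(1, 3))


small = downsample_counts(bitmap)
print(small.shape)
print(small.min(), small.max())
```

Because each 4x4 block holds 16 pixels, every entry of the result lies between 0 and 16, which matches the value range stated in the description.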

(3) Examples that use this dataset

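Several examples in the scikit-learn example gallery use this handwritten digits dataset. It is well suited to helping machine learning beginners understand how classification works, along with some more advanced applications.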

  • Classification
    • Ex 1: Recognizing hand-written digits
  • Feature Selection
    • Ex 2: Recursive Feature Elimination
    • Ex 3: Recursive Feature Elimination with Cross-Validation
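
As a rough illustration of the classification use case listed above, the sketch below trains a support vector classifier on the first half of the dataset and evaluates it on the second half. The split and the gamma value are assumptions made for this sketch; it is not the code of the examples listed above.

```python
from sklearn import datasets, metrics, svm

digits = datasets.load_digits()
n_samples = len(digits.images)

# Use the flattened 64-element vectors as features.
X = digits.data
y = digits.target

# Assumption: train on the first half, evaluate on the second half.
half = n_samples // 2
classifier = svm.SVC(gamma=0.001)  # gamma chosen for illustration only
classifier.fit(X[:half], y[:half])

predicted = classifier.predict(X[half:])
accuracy = metrics.accuracy_score(y[half:], predicted)
print("Accuracy on the held-out half: %.3f" % accuracy)
```

A simple ordered split keeps the sketch short; a shuffled split or cross-validation would give a more reliable estimate.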