Machine Learning Datasets - Ex 3: The iris dataset

小牛编辑
2023-12-01

Machine Learning Datasets / Example 3: The iris dataset

http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

This example introduces the iris dataset, one of the sample datasets bundled with scikit-learn.

(1) Import the libraries and load the built-in iris dataset

    # This line is only for the IPython/Jupyter notebook interface; remove it elsewhere
    %matplotlib inline
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn import datasets
    from sklearn.decomposition import PCA

    # import some data to play with
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features.
    Y = iris.target

    # pad the axis limits by 0.5 so no point sits on the border
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

    plt.figure(2, figsize=(8, 6))
    plt.clf()

    # Plot the training points, colored by species label
    plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')

    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    # hide the tick marks on both axes
    plt.xticks(())
    plt.yticks(())

[Figure: scatter plot of sepal length against sepal width, colored by species]

(2) About the dataset

iris = datasets.load_iris() stores a dict-like object in iris. The following code lists what it contains:

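    for key, value in iris.items():
        try:
            print(key, value.shape)
        except AttributeError:
            # entries without a .shape (e.g. the DESCR string) are printed by name only
            print(key)
    print(iris['feature_names'])

    Output                     Description
    ('target_names', (3L,))    Three iris species: setosa, versicolor, virginica
    ('data', (150L, 4L))       150 samples with four features each
    ('target', (150L,))        Which species each of the 150 samples belongs to
    DESCR                      Description of the dataset
    feature_names              What the four features mean: sepal length and width, petal length and width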
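
The fields listed above can also be read directly. The following is a minimal sketch, assuming a recent scikit-learn where the object returned by load_iris() allows both dict-style and attribute-style access; the variable names (n_samples, species, counts) are only illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Dict-style and attribute-style access refer to the same arrays.
n_samples, n_features = iris["data"].shape
species = [str(name) for name in iris.target_names]

# Count how many samples carry each integer label (0, 1, 2).
counts = np.bincount(iris.target)

print(n_samples, n_features)
for name, count in zip(species, counts):
    print(name, count)
```

The integer labels in target index into target_names, so label 0 is the first species listed, and so on.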

To visualize the dataset, the code below first uses PCA to reduce the data from four dimensions to three.

    X_reduced = PCA(n_components=3).fit_transform(iris.data)
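
As a rough check on how much information the projection keeps, one can look at the explained variance ratio of each component. This sketch is not part of the original example; it keeps the fitted PCA object in a variable (pca, a name chosen here) so that its explained_variance_ratio_ attribute can be inspected.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# Keep the fitted estimator so its attributes stay accessible.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)

# Fraction of the total variance captured by each principal component,
# ordered from largest to smallest.
ratios = pca.explained_variance_ratio_

print(X_reduced.shape)
print(ratios, ratios.sum())
```

On this dataset the first component should carry most of the variance, which suggests that a plot of the leading components loses little.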

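Next, mpl_toolkits.mplot3d.Axes3D is used to create a 3D plotting space, and scatter places each sample at the coordinates given by its three reduced features. The species labels Y set the color of each point. The plot shows that one of the three iris species is clearly separated from the other two, while the other two cannot be clearly told apart.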

    # To get a better understanding of the interaction of the dimensions,
    # plot the first three PCA dimensions
    fig = plt.figure(1, figsize=(8, 6))
    # add_subplot attaches the 3D axes to the figure; recent matplotlib
    # versions no longer do this automatically for Axes3D(fig, ...)
    ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=Y,
               cmap=plt.cm.Paired)
    ax.set_title("First three PCA directions")
    # label each axis and hide its tick labels
    # (ax.xaxis replaces the deprecated ax.w_xaxis alias)
    ax.set_xlabel("1st eigenvector")
    ax.xaxis.set_ticklabels([])
    ax.set_ylabel("2nd eigenvector")
    ax.yaxis.set_ticklabels([])
    ax.set_zlabel("3rd eigenvector")
    ax.zaxis.set_ticklabels([])

    plt.show()

[Figure: 3D scatter plot of the first three PCA directions, colored by species]
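
The separation visible in the plot can also be checked numerically on a single raw feature. The sketch below is an illustration added here, not part of the original example. It assumes that column 2 of iris.data is petal length and that label 0 is setosa, which matches the feature and class order reported by the dataset.

```python
from sklearn import datasets

iris = datasets.load_iris()

petal_length = iris.data[:, 2]   # third column: petal length in cm
is_setosa = iris.target == 0     # label 0 corresponds to setosa

# If the largest setosa value is below the smallest value of the other
# two species, a single threshold on petal length isolates setosa.
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()
gap_exists = setosa_max < others_min

# versicolor (1) and virginica (2) overlap on the same feature.
versicolor_max = petal_length[iris.target == 1].max()
virginica_min = petal_length[iris.target == 2].min()
overlap = virginica_min < versicolor_max

print(setosa_max, others_min, gap_exists)
print(versicolor_max, virginica_min, overlap)
```

An overlap on one feature does not by itself show that two classes are inseparable in all four dimensions; it only agrees with what the plot suggests.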

    # Next, print the description that ships with this dataset
    print(iris['DESCR'])

    Iris Plants Database

    Notes
    -----
    Data Set Characteristics:
        :Number of Instances: 150 (50 in each of three classes)
        :Number of Attributes: 4 numeric, predictive attributes and the class
        :Attribute Information:
            - sepal length in cm
            - sepal width in cm
            - petal length in cm
            - petal width in cm
            - class:
                    - Iris-Setosa
                    - Iris-Versicolour
                    - Iris-Virginica
        :Summary Statistics:
        ============== ==== ==== ======= ===== ====================
                        Min  Max   Mean    SD   Class Correlation
        ============== ==== ==== ======= ===== ====================
        sepal length:   4.3  7.9   5.84   0.83    0.7826
        sepal width:    2.0  4.4   3.05   0.43   -0.4194
        petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
        petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
        ============== ==== ==== ======= ===== ====================
        :Missing Attribute Values: None
        :Class Distribution: 33.3% for each of 3 classes.
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988

    This is a copy of UCI ML iris datasets.
    http://archive.ics.uci.edu/ml/datasets/Iris

    The famous Iris database, first used by Sir R.A Fisher

    This is perhaps the best known database to be found in the
    pattern recognition literature. Fisher's paper is a classic in the field and
    is referenced frequently to this day. (See Duda & Hart, for example.) The
    data set contains 3 classes of 50 instances each, where each class refers to a
    type of iris plant. One class is linearly separable from the other 2; the
    latter are NOT linearly separable from each other.

    References
    ----------
       - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
         Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
         Mathematical Statistics" (John Wiley, NY, 1950).
       - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
         (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
       - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
         Structure and Classification Rule for Recognition in Partially Exposed
         Environments". IEEE Transactions on Pattern Analysis and Machine
         Intelligence, Vol. PAMI-2, No. 1, 67-71.
       - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
         on Information Theory, May 1972, 431-433.
       - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
         conceptual clustering system finds 3 classes in the data.
       - Many, many more ...

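The description states that the dataset was introduced by Fisher in 1936 and is a classic example in the pattern recognition literature. It uses four features to classify three species of iris.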
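
The summary statistics in the description can be recomputed from iris.data. This is a minimal sketch; using the sample standard deviation (ddof=1) is an assumption made here about how the SD column was computed, and small differences from the printed table are possible because the bundled data may differ slightly between scikit-learn versions.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Per-feature statistics, computed column by column (axis=0).
stats = {
    "min": X.min(axis=0),
    "max": X.max(axis=0),
    "mean": X.mean(axis=0),
    "sd": X.std(axis=0, ddof=1),  # sample standard deviation (assumed)
}

# Print one row per feature: name, min, max, mean, sd.
for i, name in enumerate(iris.feature_names):
    print(f"{name:<20}{stats['min'][i]:6.1f}{stats['max'][i]:6.1f}"
          f"{stats['mean'][i]:7.2f}{stats['sd'][i]:7.2f}")
```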

(3) Related examples

Among the scikit-learn examples, the following use the iris dataset.

  • Classification
    • Ex 3: Plot classification probability
  • Feature Selection
    • Ex 5: Test with permutations the significance of a classification score
    • Ex 6: Univariate Feature Selection
  • General Examples
    • Ex 2: Concatenating multiple feature extraction methods