The Fundamentals of Machine Learning

丁均

2023-12-01

What is Machine Learning?

Machine Learning is the science (and art) of programming computers so they can learn from data.
ML is the field of study that gives computers the ability to learn without being explicitly programmed. --- Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. --- Tom Mitchell, 1997

Types of ML

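By degree of supervision

Supervised Learning

The training data comes with labels assigned by humans. For example, in a spam filter, the emails that users have flagged as spam serve as the training set.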

k-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines
Decision Trees and Random Forests
Neural networks
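
To make "learning from labeled data" concrete, here is a minimal sketch of one algorithm from the list above, logistic regression, trained by plain batch gradient descent. The two features (number of links, number of exclamation marks) and the six toy emails are made up for illustration; a real spam filter would use far richer features and a library implementation.

```python
import math

def train_logistic(samples, labels, lr=0.1, epochs=5000):
    """Fit weights w and bias b by batch gradient descent on the mean log loss."""
    n = len(samples)
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of label 1
            err = p - y                      # gradient of log loss w.r.t. z
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (spam) when the linear score is non-negative, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Hypothetical labeled training set: [links, exclamation marks] -> 1 = spam
emails = [[0, 0], [1, 0], [0, 1], [5, 4], [6, 3], [4, 6]]
is_spam = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(emails, is_spam)
predictions = [predict(w, b, x) for x in emails]
print(predictions)  # [0, 0, 0, 1, 1, 1]
```

The labels are what make this supervised: the update rule compares each prediction with the known answer and adjusts the weights accordingly.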

Unsupervised Learning

The training set is unlabeled.

Clustering (k-Means, Hierarchical Cluster Analysis, Expectation Maximization)
Visualization and dimensionality reduction (Principal Component Analysis, Kernel PCA, Locally-Linear Embedding, t-distributed Stochastic Neighbor Embedding)
Association rule learning (Apriori, Eclat)
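
As an illustration of learning without labels, the following is a small sketch of k-Means (Lloyd's algorithm) on 2-D points. The points and the hand-picked starting centroids are assumptions made for the example; practical implementations choose starting centroids more carefully and stop when assignments no longer change.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up unlabeled data with two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

No label is ever supplied; the grouping comes only from the distances between points.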

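Semisupervised Learning

Only part of the training set is labeled.

Reinforcement Learning

The learning system is called an agent. The agent chooses actions and receives rewards or penalties for them, and it learns from that feedback. AlphaGo works this way.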
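
The reward-and-penalty loop can be sketched with a toy agent. This is an epsilon-greedy agent on a two-action problem, which is vastly simpler than AlphaGo and only meant to show the feedback cycle. The environment (`toy_environment`), the action names, and the reward values are all hypothetical.

```python
import random

def run_agent(reward_for, actions, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly pick the best-known action, sometimes explore."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in actions}   # estimated reward per action
    count = {a: 0 for a in actions}     # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)            # explore
        else:
            action = max(actions, key=value.get)    # exploit
        reward = reward_for(action)
        count[action] += 1
        # Running mean of the rewards observed for this action.
        value[action] += (reward - value[action]) / count[action]
    return value

def toy_environment(action):
    """Hypothetical environment: 'right' is rewarded, anything else is penalized."""
    return 1.0 if action == "right" else -1.0

value = run_agent(toy_environment, ["left", "right"])
best = max(value, key=value.get)
print(best)  # right
```

The agent is never told which action is correct; it only sees the reward or penalty that follows each choice.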

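By training process

Batch Learning

Also known as offline learning, which is what the name suggests: the system is trained on all the available data in one go. It is a poor fit for data that is updated frequently, because each retraining run consumes a lot of resources, and the training set keeps growing as new data arrives.

Online Learning

The data is split into mini-batches and fed to the system incrementally. This is practical when computing resources are tight: mini-batches that have already been learned can be deleted, and the system can be rolled back to an earlier state. Out-of-core learning: training on a dataset that is far larger than the machine's memory.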
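
A minimal sketch of the mini-batch idea, assuming a simple linear model y = a * x + b updated by gradient descent. The names `mini_batches`, `OnlineLinearModel`, `partial_fit`, and `data_stream` are illustrative, and the stream is synthetic (y = 2x + 1 with no noise). Because the data comes from a generator, only one mini-batch is held in memory at a time, which is the essence of out-of-core learning.

```python
def mini_batches(stream, batch_size):
    """Group an iterable of (x, y) pairs into lists of at most batch_size items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

class OnlineLinearModel:
    """y ~ a * x + b, updated one mini-batch at a time."""

    def __init__(self, learning_rate=0.5):
        self.a = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def partial_fit(self, batch):
        # One gradient step on the mean squared error of this mini-batch only.
        n = len(batch)
        grad_a = sum((self.a * x + self.b - y) * x for x, y in batch) / n
        grad_b = sum((self.a * x + self.b - y) for x, y in batch) / n
        self.a -= self.learning_rate * grad_a
        self.b -= self.learning_rate * grad_b

def data_stream(n):
    """Hypothetical data source that yields one example at a time."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0

model = OnlineLinearModel()
for batch in mini_batches(data_stream(5000), batch_size=10):
    model.partial_fit(batch)   # the batch can be discarded after this call
print(round(model.a, 3), round(model.b, 3))
```

Saving `model.a` and `model.b` at intervals would give the checkpoints needed to roll back to an earlier state.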

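By learning logic

Instance-based learning

The system learns the examples by heart and judges new data by measuring its similarity to them. For example, in a spam filter, if the emails in the training set all happen to have an odd word count, the system may conclude that every email with an odd word count is spam.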
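
The similarity comparison can be sketched as a k-Nearest Neighbors classifier: store the training examples and let the closest ones vote. The feature vectors and labels below are made up, and Euclidean distance is just one possible similarity measure.

```python
import math
from collections import Counter

def knn_predict(training_set, query, k=3):
    """Classify query by majority vote among its k nearest stored examples.

    training_set is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training_set, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical memorized examples: (features, label)
training_set = [
    ((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"),
    ((5, 4), "spam"), ((6, 3), "spam"), ((4, 6), "spam"),
]

print(knn_predict(training_set, (5, 5)))  # spam
print(knn_predict(training_set, (1, 1)))  # ham
```

There is no training step beyond storing the data, which is why a quirk of the stored examples (such as the odd word counts above) carries straight into the predictions.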

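Model-based learning

A model is chosen by hand. For example, citizen satisfaction = a * annual income + b is a linear model. You also need to define a criterion for how well the model parameters fit the data (a fitness or cost function), so that the current parameter values can be judged as good or bad.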
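
A sketch of this workflow, assuming mean squared error as the cost function and ordinary least squares as the way to find a and b. The income and satisfaction figures are invented purely for illustration.

```python
def mse(a, b, xs, ys):
    """Cost function: mean squared error of the line y = a * x + b."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Closed-form least-squares fit; returns the (a, b) that minimizes mse."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Made-up data: annual income (thousands) and satisfaction score
income = [20.0, 30.0, 40.0, 50.0, 60.0]
satisfaction = [5.1, 5.9, 7.2, 7.8, 9.0]

a, b = fit_line(income, satisfaction)
print(round(a, 3), round(b, 3))                      # 0.097 3.12
print(mse(a, b, income, satisfaction))               # cost of the fitted line
print(mse(0.1, 3.0, income, satisfaction))           # cost of a hand-guessed line
```

The cost function is what lets two candidate parameter pairs be compared: the fitted line has a lower cost than the hand-guessed one.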

Challenges of Machine Learning

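Insufficient training data (with enough data, all kinds of algorithms perform better)
Nonrepresentative training data (or biased training data)
Poor-quality training data (errors, outliers, and noise should be cleaned out of the data)
Irrelevant features (feature extraction: combining existing features into more useful ones)
Overfitting (the system is overtrained on the training set and treats incidental details as features, so it performs poorly on the test set. The usual remedies are to constrain the model through regularization, to simplify it, and to remove noise from the training set.)
Underfitting (the remedies are to choose a more powerful model, use better features, or reduce the constraints on the model)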
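
One way to see what "constraining the model" means is ridge-style regularization on a straight line: a penalty on the slope is added to the squared error, which pulls the slope toward zero. The sketch below assumes a line y = a * x + b with the penalty applied to a only; the data and the penalty values are made up.

```python
def fit_ridge_line(xs, ys, penalty):
    """Minimize sum((a*x + b - y)^2) + penalty * a^2; the intercept b is not penalized."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / (sxx + penalty)      # penalty = 0 gives ordinary least squares
    b = mean_y - a * mean_x
    return a, b

# Made-up noisy data
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.2, 0.9, 2.3, 2.8, 4.3]

slopes = []
for penalty in (0.0, 10.0, 100.0):
    a, b = fit_ridge_line(xs, ys, penalty)
    slopes.append(a)
    print(penalty, round(a, 3), round(b, 3))
```

A larger penalty gives a flatter, simpler line, which guards against overfitting; set it too high and the model underfits, so the constraint is reduced when underfitting is the problem.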

Testing and Validating

Split the data 80/20 into a training set and a test set.
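
A minimal sketch of such a split, assuming the rows are shuffled first so that neither part depends on the original order. The function name `train_test_split` and the fixed seed are choices made for this example.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and return (training_set, test_set)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

Fixing the seed matters in practice: if the split changed on every run, the model would eventually see the whole dataset during tuning.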

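Tuning against test-set scores alone makes the model and its hyperparameters overfit the test set. (Hyperparameters are settings of the learning algorithm, such as the amount of regularization; a and b in the citizen-satisfaction example are ordinary model parameters, learned from the training data.)

Therefore, split the training set into several equal parts, use some of them as a sub-training set and the rest as a validation set, and tune the model and hyperparameters against the validation results.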
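
The procedure above can be sketched as k-fold cross-validation. In this example the hyperparameter being tuned is a regularization penalty for a one-parameter model y = slope * x; the model, the candidate penalties, and the noise-free data (y = 2x) are all assumptions chosen to keep the sketch short.

```python
def k_fold_splits(rows, k):
    """Yield (sub_training_set, validation_set) pairs, one per fold."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield training, folds[i]

def fit_slope(rows, penalty):
    """Fit y = slope * x with a penalty on the slope (the hyperparameter)."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + penalty)

def mse(slope, rows):
    return sum((slope * x - y) ** 2 for x, y in rows) / len(rows)

def cross_validated_error(rows, penalty, k=5):
    """Average validation error over the k folds for one hyperparameter value."""
    errors = [mse(fit_slope(tr, penalty), va) for tr, va in k_fold_splits(rows, k)]
    return sum(errors) / len(errors)

# Made-up training set; the test set is kept aside and never used here.
training_set = [(float(x), 2.0 * x) for x in range(1, 21)]
candidates = [0.0, 1.0, 10.0]
best_penalty = min(candidates, key=lambda p: cross_validated_error(training_set, p))
print(best_penalty)  # 0.0
```

Because this toy data has no noise, the smallest penalty wins; on noisy data a nonzero penalty may validate better. Either way, the test set is consulted only once, after the hyperparameter has been chosen.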