Most neural networks are built on a few common principles.
Machine learning focuses on improving a computer's performance at a task by using experience.
The training process usually looks like the following steps:
*(Figure: the typical machine learning training loop.)*
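To make that loop concrete, here is a minimal sketch of one common version of it, assuming a tiny one-feature linear model and plain gradient descent (the data, learning rate, and step count are all made up for illustration):

```python
import numpy as np

# Toy data: 100 examples of a single feature, labels from a known line plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=100)

w, b = 0.0, 0.0   # start from an arbitrary initial model
lr = 0.1          # step size for the parameter updates

for step in range(200):
    y_hat = w * x + b                        # 1. make predictions with the current model
    loss = 0.5 * ((y_hat - y) ** 2).mean()   # 2. measure how far off they are
    grad_w = ((y_hat - y) * x).mean()        # 3. compute how to adjust each parameter
    grad_b = (y_hat - y).mean()
    w -= lr * grad_w                         # 4. update the parameters and repeat
    b -= lr * grad_b

print(w, b)  # should end up close to 2.0 and 1.0
```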
The first ingredient is the data that we can learn from.
Supervised learning addresses the task of predicting labels given input features. Our goal is to produce a model that maps any input to a label prediction.
*(Figure: the supervised learning workflow.)*
Typical supervised learning problems include regression, classification, tagging, and recommender systems.
Unsupervised learning is a kind of machine learning that uses data without labels. As a form of unsupervised learning, self-supervised learning leverages unlabeled data to provide supervision during training.
To work with large datasets, there are generally two important things we need to do: (i) acquire the data, and (ii) process it once it is inside the computer.
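As a tiny illustration of both steps, the sketch below "acquires" a dataset (an inline CSV standing in for a real file or download) and processes it into numeric arrays; the column names and values are made up.

```python
import io
import numpy as np
import pandas as pd

# (i) Acquire: in practice this is a download or a file on disk; here a tiny
#     inline CSV with made-up house data stands in for the raw source.
raw = io.StringIO("area,age,price\n120,5,310000\n95,20,210000\n60,2,150000\n")
data = pd.read_csv(raw)

# (ii) Process: convert the table into numeric arrays a model can consume.
features = data[["area", "age"]].to_numpy(dtype=np.float64)   # shape (n, 2)
labels = data["price"].to_numpy(dtype=np.float64)             # shape (n,)
print(features.shape, labels.shape)
```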
Regression refers to a set of methods for modeling the relationship between one or more independent variables and a dependent variable. Regression is a natural choice whenever we want to predict a numerical value.
Suppose that we wish to estimate the prices of houses based on their area and age. The thing we are trying to predict (price) is called a label (or target). The independent variables (age and area) upon which the predictions are based are called features (or covariates).
Typically, we will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each input as $x^{(i)}=[x_1^{(i)}, x_2^{(i)}]^T$ and the corresponding label as $y^{(i)}$.
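As a concrete sketch of this notation, here is a made-up dataset with $n=3$ examples, where each $x^{(i)}$ holds the two features (area, age) and $y^{(i)}$ is the corresponding price:

```python
import numpy as np

# n = 3 examples; each row is one input x^(i) = [area, age]^T (values made up).
X = np.array([[120.0,  5.0],
              [ 95.0, 20.0],
              [ 60.0,  2.0]])
y = np.array([310000.0, 210000.0, 150000.0])   # y^(i): the label (price) of example i

n = X.shape[0]
print(n, X[0], y[0])   # X[0] is x^(1) in the 1-based notation of these notes
```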
The linearity assumption just says that the target (price) can be expressed as a weighted sum of the features (area and age):
$$price=\omega_{area}\cdot area+\omega_{age}\cdot age+b,$$
where $\omega_{area}$ and $\omega_{age}$ are called weights and $b$ is called a bias. The weights determine the influence of each feature on our prediction, and the bias says what value the predicted price should take when all of the features are 0. Even though we will never see any homes with zero area, or that are precisely zero years old, we still need the bias, or else we would limit the expressivity of our model.
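A quick numeric sketch of this weighted sum, with made-up (not learned) weights and bias, shows how a single prediction comes together:

```python
# Hypothetical weights and bias, chosen only for illustration.
w_area, w_age, b = 2500.0, -1500.0, 50000.0

area, age = 120.0, 5.0                      # features of one house
price = w_area * area + w_age * age + b     # weighted sum of the features plus the bias
print(price)                                # 2500*120 - 1500*5 + 50000 = 342500.0
```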
In machine learning, we usually work with high-dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of $d$ features, we express our prediction $\hat{y}$ as:
$$\hat{y}=\omega_1 x_1+\dots+\omega_d x_d+b.$$
Collecting all features into a vector $x\in\R^d$ and all weights into a vector $w\in\R^d$, we can express our model compactly using a dot product:
$$\hat{y}=w^Tx+b.$$
The vector $x$ corresponds to the features of a single data example. We will often find it convenient to refer to the features of our entire dataset of $n$ examples via the design matrix $X\in\R^{n\times d}$. Here, $X$ contains one row for every example and one column for every feature.
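In code, the prediction for a single example is just a dot product between the weight vector and that example's feature vector (reusing the made-up numbers from above):

```python
import numpy as np

w = np.array([2500.0, -1500.0])   # one weight per feature, so w is in R^d with d = 2
b = 50000.0
x = np.array([120.0, 5.0])        # features of a single example, x in R^d

y_hat = np.dot(w, x) + b          # w^T x + b
print(y_hat)                      # 342500.0
```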
For a collection of features $X$, the predictions $\hat{y}\in\R^n$ can be expressed via the matrix-vector product:
$$\hat{y}=Xw+b.$$
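With the design matrix of shape $(n, d)$, all $n$ predictions come out of a single matrix-vector product, and broadcasting adds the scalar bias $b$ to every entry; a minimal sketch with the same made-up numbers:

```python
import numpy as np

# Design matrix: one row per example, one column per feature (n = 3, d = 2).
X = np.array([[120.0,  5.0],
              [ 95.0, 20.0],
              [ 60.0,  2.0]])
w = np.array([2500.0, -1500.0])
b = 50000.0

y_hat = X @ w + b      # shape (n,); the scalar b is broadcast to all n predictions
print(y_hat)           # [342500. 257500. 197000.]
```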
Given features of a training dataset $X$ and corresponding (known) labels $y$, the goal of linear regression is to find the weight vector $w$ and the bias term $b$ such that, given the features of a new data example sampled from the same distribution as $X$, the new example's label will (in expectation) be predicted with the lowest error.
We need to determine a measure of fitness. The loss function quantifies the distance between the real and predicted value of the target. The loss will usually be a non-negative number where smaller values are better and perfect predictions incur a loss of 0.
The most popular loss function in regression problems is the squared error. When our prediction for example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by:
$$l^{(i)}(\textbf{w},b)=\frac{1}{2}\left(\hat{y}^{(i)}-y^{(i)}\right)^2$$
The constant $\frac{1}{2}$ makes no real difference but will prove notationally convenient, canceling out when we take the derivative of the loss.
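The squared error can be written directly from the formula; below it is computed per example (and averaged over the dataset) for the made-up predictions and labels used earlier.

```python
import numpy as np

y_hat = np.array([342500.0, 257500.0, 197000.0])   # model predictions
y     = np.array([310000.0, 210000.0, 150000.0])   # true labels

per_example_loss = 0.5 * (y_hat - y) ** 2          # l^(i)(w, b) for each example i
mean_loss = per_example_loss.mean()                # average loss over the whole dataset
print(per_example_loss, mean_loss)
```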