Dive into Deep Learning

1 Introduction

Most neural networks share a few principles:

  • they alternate linear and nonlinear units, which are organized into layers;
  • they use gradient descent to update the network parameters, with the gradients computed by backpropagation (BP), which applies the chain rule.
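As a minimal sketch of both principles, the snippet below builds a tiny scalar "network" (linear unit, ReLU, linear unit), computes the gradients with the chain rule, and takes one gradient descent step. The function names, the choice of ReLU as the nonlinearity, the squared-error loss, the starting values, and the learning rate are all assumptions made for illustration.

```python
def relu(z):
    """Nonlinear unit: max(0, z)."""
    return z if z > 0.0 else 0.0


def forward(w1, w2, x):
    """Linear unit -> nonlinear unit -> linear unit."""
    h = w1 * x       # linear layer
    a = relu(h)      # nonlinear layer
    return w2 * a    # linear layer


def loss(w1, w2, x, y):
    """Half the squared difference between prediction and target."""
    return 0.5 * (forward(w1, w2, x) - y) ** 2


def gradients(w1, w2, x, y):
    """Backpropagation: apply the chain rule from the loss back to w1 and w2."""
    h = w1 * x
    a = relu(h)
    y_hat = w2 * a
    d_yhat = y_hat - y                        # dL/dy_hat
    d_w2 = d_yhat * a                         # dL/dw2 = dL/dy_hat * dy_hat/dw2
    d_a = d_yhat * w2                         # dL/da
    d_h = d_a * (1.0 if h > 0.0 else 0.0)     # dL/dh, passing through the ReLU
    d_w1 = d_h * x                            # dL/dw1
    return d_w1, d_w2


# Hypothetical starting parameters and a single training example.
w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
g1, g2 = gradients(w1, w2, x, y)

# One gradient descent step with an assumed learning rate of 0.1.
lr = 0.1
new_w1, new_w2 = w1 - lr * g1, w2 - lr * g2
```

With these particular values, the loss after the step is lower than before it, which is all a single gradient descent update is meant to achieve.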

Machine learning studies how computer systems can improve their performance by using experience.

2 Preparatory knowledge

2.1 How to train a model?

The training process usually looks like the following few steps:

  1. Start off with a randomly initialized model that cannot do anything useful.
  2. Grab some of your labeled data.
  3. Tweak the knobs so the model sucks less with respect to those examples.
  4. Repeat Steps 2 and 3 until the model is awesome.
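The steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the toy data (labels generated by y = 3x + 1), the two-parameter linear model, the learning rate, and the number of iterations are all hypothetical choices, not something the text prescribes.

```python
import random

random.seed(0)

# Hypothetical toy data: the labels follow y = 3x + 1 exactly.
examples = [(x, 3.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

# Step 1: start with a randomly initialized model (two "knobs": w and b).
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)


def mean_squared_error(w, b, data):
    """Average squared difference between predictions and labels."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


initial_error = mean_squared_error(w, b, examples)
learning_rate = 0.05  # assumed step size

for step in range(500):                # Step 4: repeat Steps 2 and 3
    x, y = random.choice(examples)     # Step 2: grab some labeled data
    error = (w * x + b) - y
    # Step 3: tweak the knobs to reduce the error on this example.
    w -= learning_rate * error * x
    b -= learning_rate * error

final_error = mean_squared_error(w, b, examples)
```

After the loop, `w` and `b` should be close to the values used to generate the data, and `final_error` should be far smaller than `initial_error`.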


2.2 Key components:

  1. The data that we can learn from.
  2. A model of how to transform the data.
  3. An objective function that quantifies how well (or badly) the model is doing.
  4. An algorithm to adjust the model's parameters to optimize the objective function.

2.3 Kinds of machine learning problems

2.3.1 Supervised Learning

Supervised learning addresses the task of predicting labels given input features. Our goal is to produce a model that maps any input to a label prediction.

Typical supervised problems are regression, classification, tagging, and recommender systems.

2.3.2 Unsupervised and Self-Supervised Learning

Unsupervised learning is the kind of machine learning that uses data without labels. Self-supervised learning is a form of unsupervised learning that derives the supervision signal from the unlabeled data itself, for example by predicting a withheld part of each example from the remaining parts.
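To make the idea of "supervision from unlabeled data" concrete, here is a small sketch of one possible self-supervised setup: predicting the next token from the tokens before it. The task, the helper name `next_token_pairs`, and the example sentence are assumptions chosen for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Turn an unlabeled token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        target = tokens[i]  # the "label" comes from the data itself
        pairs.append((context, target))
    return pairs


# Unlabeled text: nobody annotated it, yet it yields training pairs.
unlabeled_text = "the cat sat on the mat".split()
pairs = next_token_pairs(unlabeled_text, context_size=2)

for context, target in pairs:
    print(context, "->", target)
```

Each pair can then be treated like an ordinary supervised example, even though no human ever provided a label.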

3 Preliminaries

3.1 Data manipulation

To work with large datasets, there are generally two important things we need to do with the data:

  1. acquire them;
  2. process them once they are inside the computer.

3.2 Data preprocessing

4 Linear Neural Networks

4.1 Linear Regression

Regression refers to a set of methods for modeling the relationship between one or more independent variables and a dependent variable. It is the natural choice whenever we want to predict a numerical value.

4.1.1 Basic elements of linear regression

Suppose that we wish to estimate the prices of houses based on their area and age. The thing we are trying to predict (price) is called a label (or target). The independent variables (age and area) upon which the predictions are based are called features (or covariates).

Typically, we will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each input as $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top$ and the corresponding label as $y^{(i)}$.

Linear Model

The linearity assumption just says that the target (price) can be expressed as a weighted sum of the features (area and age):
$$\text{price} = \omega_{\text{area}} \cdot \text{area} + \omega_{\text{age}} \cdot \text{age} + b,$$
where $\omega_{\text{area}}$ and $\omega_{\text{age}}$ are called weights, and $b$ is called a bias. The weights determine the influence of each feature on our prediction, and the bias says what value the predicted price should take when all of the features take the value 0. Even if we will never see any homes with zero area, or that are precisely zero years old, we still need the bias, or else we will limit the expressivity of our model.
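The weighted sum above translates directly into code. In the sketch below, the function name `predict_price` and the parameter values are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def predict_price(area, age, w_area, w_age, b):
    """Weighted sum of the features plus a bias."""
    return w_area * area + w_age * age + b


# Hypothetical parameters, chosen only for illustration.
w_area, w_age, b = 3.0, -1.5, 50.0

# 3.0 * 100 + (-1.5) * 10 + 50
price = predict_price(area=100.0, age=10.0, w_area=w_area, w_age=w_age, b=b)

# With all features equal to zero, the prediction is exactly the bias.
baseline = predict_price(area=0.0, age=0.0, w_area=w_area, w_age=w_age, b=b)
```

Here a positive weight on area raises the predicted price, a negative weight on age lowers it, and `baseline` equals `b`.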

In machine learning, we usually work with high-dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of $d$ features, we express our prediction $\hat{y}$ as:
$$\hat{y} = \omega_1 x_1 + \dots + \omega_d x_d + b.$$
Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly using a dot product:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
The vector $\mathbf{x}$ corresponds to the features of a single data example. We will often find it convenient to refer to the features of our entire dataset of $n$ examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$. Here, $\mathbf{X}$ contains one row for every example and one column for every feature.

For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product:
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b,$$
where the scalar $b$ is added to every entry of $\mathbf{X}\mathbf{w}$.
Given the features of a training dataset $\mathbf{X}$ and the corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given the features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the lowest error.
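The two forms of the model, a dot product for one example and a matrix-vector product for the whole dataset, can be sketched in plain Python. The helper names and the numbers are assumptions for illustration; in practice a tensor library would do this work.

```python
def dot(w, x):
    """Dot product of two equal-length vectors."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def predict_one(w, b, x):
    """Prediction for a single example: w^T x + b."""
    return dot(w, x) + b


def predict_all(w, b, X):
    """Predictions for a design matrix X (one row per example): Xw + b.

    The scalar bias b is added to every entry of Xw.
    """
    return [dot(w, row) + b for row in X]


# Hypothetical parameters and a design matrix with n = 3 examples, d = 2 features.
w = [2.0, -1.0]
b = 0.5
X = [
    [1.0, 2.0],
    [3.0, 0.0],
    [0.0, 0.0],
]

y_hat = predict_all(w, b, X)  # one prediction per row of X
```

Row `i` of `X` gives entry `i` of `y_hat`, so `predict_one(w, b, X[i])` and `y_hat[i]` agree.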

Loss Function

We need a measure of how well the model fits the data. The loss function quantifies the distance between the real and predicted values of the target. The loss will usually be a non-negative number where smaller values are better and perfect predictions incur a loss of 0.

The most popular loss function in regression problems is the squared error. When our prediction for an example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$
The constant $\frac{1}{2}$ makes no real difference but will prove notationally convenient, canceling out when we take the derivative of the loss.
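To see the cancellation, differentiate the loss with respect to the prediction: the factor 2 from the square meets the factor 1/2, leaving $\partial l / \partial \hat{y} = \hat{y} - y$. The sketch below checks this numerically; the function names and the finite-difference check are illustrative assumptions.

```python
def squared_error(y_hat, y):
    """Loss for one example: half the squared difference."""
    return 0.5 * (y_hat - y) ** 2


def squared_error_grad(y_hat, y):
    """Derivative of the loss with respect to y_hat.

    The factor 2 from differentiating the square cancels the 1/2.
    """
    return y_hat - y


def numerical_grad(y_hat, y, eps=1e-6):
    """Central finite difference, used only to check the formula above."""
    return (squared_error(y_hat + eps, y) - squared_error(y_hat - eps, y)) / (2 * eps)


perfect = squared_error(3.0, 3.0)        # perfect prediction, zero loss
off_by_two = squared_error(5.0, 3.0)     # prediction too high by 2
analytic = squared_error_grad(5.0, 3.0)  # exact derivative
numeric = numerical_grad(5.0, 3.0)       # approximate derivative
```

The analytic and numerical derivatives agree up to floating-point error, and a perfect prediction incurs a loss of exactly 0.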
