I am learning how to write neural networks, and right now I am working on a backpropagation algorithm with one input layer, one hidden layer, and one output layer. The algorithm works: when I throw test data like
x_train = np.array([[1., 2., -3., 10.], [0.3, -7.8, 1., 2.]])
y_train = np.array([[10, -3, 6, 1], [1, 1, 6, 1]])
at my algorithm, using the default of 3 hidden units and the default learning rate of 10e-4,
Backprop.train(x_train, y_train, tol = 10e-1)
x_pred = Backprop.predict(x_train),
I get good results:
Tolerances: [10e-1, 10e-2, 10e-3, 10e-4, 10e-5]
Iterations: [2678, 5255, 7106, 14270, 38895]
Mean absolute error: [0.42540, 0.14577, 0.04264, 0.01735, 0.00773]
Sum of squared errors: [1.85383, 0.21345, 0.01882, 0.00311, 0.00071].
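(For reference, a sweep in this format could be produced with a small loop like the one below. This is my own reconstruction, not the original benchmark code; the seeding and the metric formulas are assumptions.)

import numpy as np

tolerances = [10e-1, 10e-2, 10e-3, 10e-4, 10e-5]
for tol in tolerances:
    np.random.seed(0)  # assumed seed, only to keep the random weight initialisation comparable between runs
    Backprop.train(x_train, y_train, tol=tol)
    y_hat = Backprop.predict(x_train)
    mae = np.mean(np.abs(y_hat - y_train))  # mean absolute error
    sse = np.sum((y_hat - y_train) ** 2)    # sum of squared errors
    print(tol, Backprop.iterations, mae, sse)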
As these numbers show, the sum of squared errors drops by roughly one decimal place each time, just as I expected. However, when I use test data like this
X_train = np.random.rand(20, 7)
Y_train = np.random.rand(20, 2)
Tolerances: [10e+1, 10e-0, 10e-1, 10e-2, 10e-3]
Iterations: [11, 19, 63, 80, 7931],
Mean absolute error: [0.30322, 0.25076, 0.25292, 0.24327, 0.24255],
Sum of squared errors: [4.69919, 3.43997, 3.50411, 3.38170, 3.16057],
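(As a point of comparison for these numbers, and my own addition rather than part of the original question: a constant predictor that always outputs the column-wise mean of Y_train lands in a similar error range on uniform random targets of this size, so it can serve as a rough reference.)

import numpy as np

np.random.seed(0)  # assumed seed, only to make the run repeatable
X_train = np.random.rand(20, 7)
Y_train = np.random.rand(20, 2)
baseline = np.tile(Y_train.mean(axis=0), (Y_train.shape[0], 1))  # predict the per-column mean for every sample
print("MAE:", np.mean(np.abs(baseline - Y_train)))
print("SSE:", np.sum((baseline - Y_train) ** 2))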
Nothing about these metrics really changes. I have checked my hidden units, gradients, and weight matrices; they are all different, and the gradients are indeed shrinking, exactly as I set the backprop algorithm up to do with
if ( np.sum(E_hidden**2) + np.sum(E_output**2) ) < tol:
    learning = False
where E_hidden and E_output are my gradient matrices. My question is: how is it possible that, for some data, the metrics stay virtually the same even though the gradients are shrinking? And what can I do about it?
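(One small helper I would use for this kind of check, shown only as a sketch and not part of the original class: called inside the while loop of train, it prints the quantity that is compared against tol next to the actual training error, so the first can be watched shrinking while the second stays flat.)

import numpy as np

def log_progress(m, E_hidden, E_output, y_pred, y_train, every=100):
    # Hypothetical helper, not part of the original Backprop class.
    if m % every == 0:
        grad = np.sum(E_hidden**2) + np.sum(E_output**2)  # the value compared against tol
        sse = np.sum((y_pred.T - y_train) ** 2)           # the actual fit quality
        print(f"iteration {m}: gradient sum {grad:.6f}, SSE {sse:.6f}")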
My Backprop class looks like this:
import numpy as np

class Backprop:

    def sigmoid(r):
        return (1 + np.exp(-r)) ** (-1)

    def train(x_train, y_train, hidden_units = 3, learning_rate = 10e-4, tol = 10e-3):
        # We need y_train to be 2D. There should be as many rows as there are x_train vectors
        N = x_train.shape[0]
        I = x_train.shape[1]
        J = hidden_units
        K = y_train.shape[1]  # Number of output units

        # Add the bias units to x_train
        bias = -np.ones(N).reshape(-1, 1)  # Make it 2D so we can stack it
        # Make the row vectors column vectors for easier use when applying matrices.
        # After the hstack, x_train.shape = (N, I+1); the transpose then gives
        # x_train.shape = (I+1, N) -> N column vectors of respective length I+1
        x_train = np.hstack((x_train, bias)).T

        # Create our weight matrices
        W_input = np.random.rand(J, I+1)   # W_input.shape = (J, I+1)
        W_hidden = np.random.rand(K, J+1)  # W_hidden.shape = (K, J+1)

        m = 0
        learning = True
        while learning:

            ##### ----- Phase 1: Forward Propagation ----- #####

            # Create the total input to the hidden units
            # u_hidden.shape = (J, N) -> N column vectors of respective length J.
            # For every training vector we get J hidden states
            u_hidden = W_input @ x_train
            # Create the hidden units
            h = Backprop.sigmoid(u_hidden)  # h.shape = (J, N)

            # Create the total input to the output units
            bias = -np.ones(N)
            h = np.vstack((h, bias))  # h.shape = (J+1, N)
            u_output = W_hidden @ h   # u_output.shape = (K, N). For every training vector we get K output states.
            # In the code itself the following is not necessary, because, as we remember from the above,
            # the output activation function is the identity function, but let's do it anyway for the sake of clarity
            y_pred = u_output.copy()  # Now, y_pred has the same shape as y_train

            ##### ----- Phase 2: Backward Propagation ----- #####

            # We will calculate the delta terms now and begin with the delta term of the output unit.
            # We will transpose several times now. Before, having column vectors was convenient, because matrix
            # multiplication is more intuitive then. But now we need to work with indices and need the right
            # dimensions. Yes, loops are inefficient, but they provide much more clarity, so that we can easily
            # connect the theory above with our code.

            # We don't need delta_output right now, because we will update W_hidden with a loop. But we do need it
            # for the delta term of the hidden unit.
            delta_output = y_pred.T - y_train

            # Calculate our error gradient for the output units
            E_output = np.zeros((K, J+1))
            for k in range(K):
                for j in range(J+1):
                    for n in range(N):
                        E_output[k, j] += (y_pred.T[n, k] - y_train[n, k]) * h.T[n, j]

            # Calculate our change in W_hidden
            W_delta_output = -learning_rate * E_output
            # Update the old weights
            W_hidden = W_hidden + W_delta_output

            # Let's calculate the delta term of the hidden unit
            delta_hidden = np.zeros((N, J+1))
            for n in range(N):
                for j in range(J+1):
                    for k in range(K):
                        delta_hidden[n, j] += h.T[n, j] * (1 - h.T[n, j]) * delta_output[n, k] * W_delta_output[k, j]

            # Calculate our error gradient for the hidden units, but exclude the hidden bias unit,
            # because W_input and the hidden bias unit don't share any relation at all
            E_hidden = np.zeros((J, I+1))
            for j in range(J):
                for i in range(I+1):
                    for n in range(N):
                        E_hidden[j, i] += delta_hidden[n, j] * x_train.T[n, i]

            # Calculate our change in W_input
            W_delta_hidden = -learning_rate * E_hidden
            W_input = W_input + W_delta_hidden

            if (np.sum(E_hidden**2) + np.sum(E_output**2)) < tol:
                learning = False

            m += 1  # Iteration count

        Backprop.weights = [W_input, W_hidden]
        Backprop.iterations = m
        Backprop.errors = [E_hidden, E_output]

    ##### ----- #####

    def predict(x):
        N = x.shape[0]
        # x1 = Backprop.weights[1][:, :-1] @ Backprop.sigmoid(Backprop.weights[0][:, :-1] @ x.T)
        # Trying this, we see that we really do need to add a bias here as well if we also train with a bias.

        # Add the bias units to x
        bias = -np.ones(N).reshape(-1, 1)  # Make it 2D so we can stack it
        # Make the row vectors column vectors for easier use when applying matrices.
        x = np.hstack((x, bias)).T
        h = Backprop.weights[0] @ x
        u = Backprop.sigmoid(h)  # We need to transform the data using the sigmoidal function
        h = np.vstack((u, bias.reshape(1, -1)))
        return (Backprop.weights[1] @ h).T
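(A side note, not part of the original code: the three index loops in train amount to a handful of matrix products. Below is a sketch of the equivalent vectorized forms, using the same variable names and shapes as inside train.)

import numpy as np

def gradients_vectorized(delta_output, h, x_train, W_delta_output, J):
    # Sketch of vectorized equivalents of the loops in Backprop.train (my own addition).
    # Shapes as inside train: delta_output (N, K), h (J+1, N), x_train (I+1, N), W_delta_output (K, J+1).
    E_output = delta_output.T @ h.T                                   # (K, J+1)
    delta_hidden = h.T * (1 - h.T) * (delta_output @ W_delta_output)  # (N, J+1)
    E_hidden = delta_hidden[:, :J].T @ x_train.T                      # (J, I+1)
    return E_output, delta_hidden, E_hidden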
I found the answer. If, in Backprop.predict, I write
output = (Backprop.weights[1] @ h).T
return output
instead, everything works fine.