From and thanks to: github jcjohnson/pytorch-examples
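This article introduces the basic concepts of PyTorch through self-contained examples. jcjohnson's examples are a good way to understand how PyTorch relates to numpy and TensorFlow, as well as PyTorch's own concepts and design.
At its core, PyTorch provides two main features:
1) An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
2) Automatic differentiation for building and training neural networks
We will use a fully-connected ReLU network as our running example. The network has a single hidden layer and is trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.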
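Written out, the running example looks roughly as follows. This is a sketch based on the code in this article: the shapes are $x \in \mathbb{R}^{N \times D_{in}}$, $W_1 \in \mathbb{R}^{D_{in} \times H}$, $W_2 \in \mathbb{R}^{H \times D_{out}}$, there are no bias terms, and the symbols $h$, $a$ and $\hat{y}$ are introduced here only for illustration. Note that the code minimizes the squared Euclidean distance, summed over the batch.

```latex
\begin{aligned}
h &= x W_1, \qquad a = \max(h, 0), \qquad \hat{y} = a W_2, \\
L &= \sum_{i,j} \left(\hat{y}_{ij} - y_{ij}\right)^2, \\
\frac{\partial L}{\partial \hat{y}} &= 2\,(\hat{y} - y), \qquad
\frac{\partial L}{\partial W_2} = a^{\top} \frac{\partial L}{\partial \hat{y}}, \\
\frac{\partial L}{\partial a} &= \frac{\partial L}{\partial \hat{y}}\, W_2^{\top}, \qquad
\frac{\partial L}{\partial h} = \frac{\partial L}{\partial a} \odot \mathbf{1}[h \ge 0], \qquad
\frac{\partial L}{\partial W_1} = x^{\top} \frac{\partial L}{\partial h}.
\end{aligned}
```

The hand-written backward passes in the numpy and Tensor examples compute exactly these five gradient expressions, in this order.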
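Note: these examples have been updated for PyTorch 0.4, which made several major changes to the core PyTorch API. Most notably, prior to 0.4 Tensors had to be wrapped in Variable objects to use autograd; this functionality has now been added directly to Tensors, and Variables are now deprecated.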
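Numpy is a scientific computing framework that supports operations on n-dimensional arrays. Numpy itself knows nothing about computation graphs, deep learning, or gradients, but a simple neural network is easy to build from numpy operations. Here is a 2-layer network (one hidden layer) implemented in numpy: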
# Code in file tensor/two_layer_net_numpy.py
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
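A hand-written backward pass is easy to get subtly wrong, so it can help to compare it against a numerical estimate. The sketch below is not part of the original examples: it wraps the same forward and backward formulas in a hypothetical helper `loss_and_grads`, and compares them with a central finite-difference estimate from another hypothetical helper, `numeric_grad`, on a deliberately tiny network.

```python
import numpy as np

def loss_and_grads(x, y, w1, w2):
    """Forward and backward pass of the two-layer ReLU network (no biases)."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2

def numeric_grad(f, w, eps=1e-6):
    """Central finite-difference estimate of d f() / d w, one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.RandomState(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes keep the check fast
x, y = rng.randn(N, D_in), rng.randn(N, D_out)
w1, w2 = rng.randn(D_in, H), rng.randn(H, D_out)

_, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)
num_w1 = numeric_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w1)
num_w2 = numeric_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w2)

# Largest absolute disagreement between analytic and numerical gradients.
print(np.abs(grad_w1 - num_w1).max(), np.abs(grad_w2 - num_w2).max())
```

Both printed numbers should be very small. The check can misfire if a hidden pre-activation happens to lie extremely close to zero, because ReLU is not differentiable there.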
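Numpy is powerful, but it cannot use GPUs to accelerate numerical computation. Modern deep neural networks often run 50x faster or more on GPUs, so numpy alone is not enough for modern deep learning.
PyTorch is similar to numpy but supports GPUs. A PyTorch Tensor is conceptually identical to a numpy array: it is an n-dimensional array, and just as numpy does for arrays, PyTorch provides many functions that operate on Tensors. As a result, most numerical computations written with numpy can also be written with Tensors; both are generic tools for scientific computing. To run a Tensor on the GPU, pass a device argument when creating it (for example: device = torch.device('cuda'); x = torch.randn(10, 20, device=device)).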
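As a small illustration of how closely the two libraries line up, the sketch below converts a numpy array to a Tensor, runs the same matrix product in both, and converts the result back. The fallback to the CPU when no GPU is visible is an addition made here for convenience, not something the original examples do.

```python
import numpy as np
import torch

# Pick the GPU when one is visible to PyTorch, otherwise stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a).to(device)  # numpy array -> Tensor on the chosen device

# The same matrix product, once in numpy and once with Tensor methods.
prod_np = a.dot(a.T)
prod_t = t.mm(t.t())

# A Tensor has to be on the CPU before it can be converted back to numpy.
back = prod_t.cpu().numpy()
print(device, np.allclose(prod_np, back))
```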
Here is the 2-layer network implemented with PyTorch Tensors:
# Code in file tensor/two_layer_net_tensor.py
import torch
device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
    # of shape (); we can get its value as a Python number with loss.item().
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
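In the examples above we had to implement both the forward and backward passes of the network by hand. Implementing the backward pass by hand is not a big deal for a small two-layer network, but it quickly becomes very hard for large, complex networks.
Fortunately, we can use automatic differentiation to compute the backward pass of a neural network automatically. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of the network defines a computational graph; nodes in the graph are Tensors, and edges are the functions that produce output Tensors from input Tensors. Backpropagating through this graph then makes it easy to compute gradients.
If we want to compute gradients with respect to some Tensor, we only need to set requires_grad=True when creating it. Every PyTorch operation on that Tensor is then recorded in a computational graph, so that a backward pass can be run through it later. For example, if x is a Tensor with requires_grad=True, then after backpropagation x.grad is another Tensor holding the gradient of some scalar value with respect to x.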
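A minimal sketch of this behaviour, using a made-up scalar function (the sum of squares of x) rather than anything from the original examples:

```python
import torch

# requires_grad=True asks autograd to track operations on x.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The scalar value we differentiate: loss = x1^2 + x2^2 + x3^2.
loss = (x ** 2).sum()

# Backpropagate through the recorded graph; this fills in x.grad.
loss.backward()

# x.grad has the same shape as x and holds d(loss)/dx, which is 2 * x here.
print(loss.item(), x.grad)
```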
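Sometimes you do not want PyTorch to build a computational graph while performing certain operations on Tensors; for example, when training a neural network we usually do not want to backpropagate through the weight update step. In such cases we can use the torch.no_grad() context manager to prevent a graph from being built.
Note that in a neural network the gradients we need for the update are those of the weights; the input data and the target outputs do not need gradients. This matches the two cases above: the weights are created with requires_grad=True, while the weight update itself runs under torch.no_grad().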
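The effect of torch.no_grad() can be seen on a toy weight vector. This is only a sketch; the names `tracked` and `untracked` are made up for the illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = w * 2  # recorded in the computational graph

with torch.no_grad():
    untracked = w * 2  # same arithmetic, but nothing is recorded
    w -= 0.5           # in-place update of a weight, as in a training step

# tracked carries graph information (grad_fn); untracked does not.
print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
# w is still a leaf Tensor that requires gradients after the update.
print(w.requires_grad, w.is_leaf)
```

Outside of the no_grad block, the same in-place update on w would raise an error, because w is a leaf Tensor that requires gradients.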
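Here we use PyTorch Tensors and autograd to implement our 2-layer network; we no longer need to implement the backward pass by hand. The example calls the backward() method on the loss Tensor, which uses autograd to run the backward pass.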
# Code in file autograd/two_layer_net_autograd.py
import torch
device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
    # is a Python number giving its value.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
PyTorch: Defining new autograd functions
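Under the hood, each primitive autograd operator is really two functions that operate on Tensors: the forward function computes output Tensors from input Tensors; the backward function receives the gradient of the output Tensors and computes the gradient of the input Tensors.
In PyTorch we can define our own autograd operator by subclassing torch.autograd.Function and implementing the forward and backward functions as static methods. We can then use the new operator by calling its apply method like a function, passing in Tensors that contain the input data.
In this example we define a custom autograd function that performs the ReLU nonlinearity, and use it to implement our 2-layer network: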
# Code in file autograd/two_layer_net_custom_function.py
import torch
class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    @staticmethod
    def forward(ctx, x):
        """
        In the forward pass we receive a context object and a Tensor containing the
        input; we must return a Tensor containing the output, and we can use the
        context object to cache objects for use in the backward pass.
        """
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive the context object and a Tensor containing
        the gradient of the loss with respect to the output produced during the
        forward pass. We can retrieve cached data from the context object, and must
        compute and return the gradient of the loss with respect to the input to the
        forward function.
        """
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and output
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; we call our
    # custom ReLU implementation using the MyReLU.apply function
    y_pred = MyReLU.apply(x.mm(w1)).mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    # Use autograd to compute the backward pass.
    loss.backward()
    with torch.no_grad():
        # Update weights using gradient descent
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
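When writing a custom Function, it is worth checking the hand-written backward against numerical gradients. One way to do that, sketched here, is torch.autograd.gradcheck. The class is repeated (without docstrings) so that the snippet runs on its own, and the input values are chosen by hand to stay away from zero, where ReLU is not differentiable; gradcheck expects double-precision inputs.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

# Double precision keeps the finite-difference estimate accurate.
x = torch.tensor([[-1.5, 0.5], [2.0, -0.3]],
                 dtype=torch.double, requires_grad=True)

# gradcheck compares backward() with finite differences; it returns True on
# success and raises an error when the two disagree.
ok = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```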
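PyTorch autograd looks a lot like TensorFlow: both frameworks define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static, whereas PyTorch's are dynamic.
In TensorFlow we define the computational graph once and then execute the same graph over and over, possibly feeding different input data each time. In PyTorch, each forward pass defines a new computational graph.
Static graphs are nice because the graph can be optimized up front; for example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or many machines. If the same graph is reused over and over, the cost of this potentially expensive up-front optimization is amortized as the graph is rerun.
One aspect where static and dynamic graphs differ is control flow. For some models we may want to perform a different computation for each data point; for example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop construct has to be part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops in the graph. With dynamic graphs the situation is simpler: since the graph is built on the fly for each example, we can use ordinary imperative control flow to perform a computation that differs for each input.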
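The following sketch shows what ordinary control flow looks like in practice. The helper `repeated_scale` is hypothetical: it multiplies x by the same weight w a variable number of times using a plain Python loop, and autograd differentiates whatever graph each call happened to build.

```python
import torch

def repeated_scale(x, w, steps):
    """Multiply x by w `steps` times with an ordinary Python loop."""
    out = x
    for _ in range(steps):
        out = out * w
    return out

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

grads = {}
for steps in (1, 2, 3):
    out = repeated_scale(x, w, steps)  # a fresh graph is built on every call
    out.backward()
    grads[steps] = w.grad.item()
    w.grad.zero_()                     # clear the gradient before the next run

# out = x * w**steps, so d(out)/dw = steps * x * w**(steps - 1).
print(grads)
```

No special loop operator is needed: the number of iterations can change from call to call, and the gradient follows the path that was actually executed.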
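A further note on TensorFlow: a TF graph contains both variables and the operations computed on them, which is a little different from PyTorch. In addition, TF defines the whole graph first (this is what a static graph means), but at execution time it only runs the parts that are requested (for example the loss and the weight updates in the example below, sess.run([loss, new_w1, new_w2]…)) and returns their results; any other nodes those parts depend on are computed too, but their results are not returned. This is flexible, but it tends to feel unfamiliar to beginners because it does not match ordinary programming habits. PyTorch, by contrast, does not define the graph in advance; it builds the graph as it is used, so a graph can, for example, be defined inside a loop. This matches the way we normally write programs, and it is this dynamic-graph property that makes PyTorch feel simple and easy to use.
To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple 2-layer network. The code uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session):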
# Code in file autograd/tf_two_layer_net.py
import tensorflow as tf
import numpy as np
# First we set up the computational graph:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))
# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))
# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)
# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)
# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())
    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)
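Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.
When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that are optimized during learning.
In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.
In PyTorch, the nn package serves the same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.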
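To make "internal state" concrete, the sketch below builds a two-layer model with the sizes used throughout this article and lists the learnable Tensors it holds. It also checks a summed squared-error loss against the hand-written formula from the earlier examples; the batch of four random rows is arbitrary.

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Each Linear module owns a weight and a bias; ReLU has no parameters.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)

# A loss Module with reduction='sum' matches the manual sum of squared errors.
loss_fn = torch.nn.MSELoss(reduction='sum')
y_pred = model(torch.randn(4, D_in))
y = torch.randn(4, D_out)
loss_module = loss_fn(y_pred, y)
loss_manual = (y_pred - y).pow(2).sum()
print(loss_module.item(), loss_manual.item())
```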
In this example we use the nn package to implement our 2-layer network:
# Code in file nn/two_layer_net_nn.py
import torch
device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
).to(device)
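# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Passing
# reduction='sum' sums the squared errors instead of averaging them; it
# replaces the deprecated size_average=False argument.
loss_fn = torch.nn.MSELoss(reduction='sum')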
learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)
    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    # Zero the gradients before running the backward pass.
    model.zero_grad()
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its data and gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param.data -= learning_rate * param.grad
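Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers such as AdaGrad, RMSProp, and Adam.
The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms. These algorithms use techniques such as momentum and per-parameter adaptive learning rates to choose a better update direction and step size when updating the parameters.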
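To see what an optimizer does on our behalf, the sketch below takes a single step with plain torch.optim.SGD (no momentum) on a small made-up regression problem and compares the result with the manual update rule used so far. The model size, data, and learning rate are arbitrary choices for the illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)
x, y = torch.randn(8, 3), torch.randn(8, 1)
lr = 1e-2

optimizer = torch.optim.SGD(model.parameters(), lr=lr)

loss = (model(x) - y).pow(2).sum()
optimizer.zero_grad()
loss.backward()

# What the manual update would produce, computed before the optimizer runs.
expected = [p.detach() - lr * p.grad for p in model.parameters()]

# One call updates every parameter the optimizer was given.
optimizer.step()

matches = [torch.allclose(p.detach(), e)
           for p, e in zip(model.parameters(), expected)]
print(matches)
```

Switching to another algorithm only changes the line that constructs the optimizer; the zero_grad, backward, step sequence stays the same.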
In this example we define our model with the nn package as before, but optimize it with the Adam algorithm provided by the optim package:
# Code in file nn/two_layer_net_optim.py
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
# reduction='sum' replaces the deprecated size_average=False argument.
loss_fn = torch.nn.MSELoss(reduction='sum')
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)
    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()
    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()
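Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.
In this example we implement our 2-layer network as a custom Module subclass: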
# Code in file nn/two_layer_net_module.py
import torch
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary (differentiable) operations on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
# reduction='sum' replaces the deprecated size_average=False argument.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)
    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
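The reason model.parameters() finds the weights of both Linear layers is that nn.Module registers any Module assigned as an attribute. The sketch below uses a hypothetical `TinyNet` class to show this, along with a common pitfall: a Module stored inside a plain Python list is not registered, so its parameters would be invisible to the optimizer.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(4, 3)   # registered as a submodule
        self.linear2 = torch.nn.Linear(3, 2)   # registered as a submodule
        self.hidden = [torch.nn.Linear(3, 3)]  # plain list: NOT registered

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

model = TinyNet()
names = [name for name, _ in model.named_parameters()]
print(names)

out = model(torch.randn(5, 4))
print(tuple(out.shape))
```

Only the parameters of linear1 and linear2 appear in the list. When a list of layers is needed, torch.nn.ModuleList registers its contents properly.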
PyTorch: Control Flow + Weight Sharing
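As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.
For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
We can easily implement this model as a Module subclass: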
# Code in file nn/dynamic_net.py
import random
import torch
class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
# reduction='sum' replaces the deprecated size_average=False argument.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)
    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())
    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
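Weight sharing can also be looked at in isolation. In the sketch below, a single Linear module (called `shared` here, a name made up for the illustration) is applied a different number of times by a hypothetical helper `run`; the number of learnable parameters stays the same no matter how deep the unrolled network is, and each backward pass produces one gradient for the shared weight.

```python
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(3, 3)
x = torch.randn(2, 3)

def run(depth):
    """Apply the same Linear module `depth` times; return its weight gradient."""
    shared.zero_grad()
    h = x
    for _ in range(depth):
        h = shared(h).clamp(min=0)
    h.sum().backward()
    return shared.weight.grad.clone()

g1 = run(1)  # gradient when the module is used once
g3 = run(3)  # gradient when the same module is used three times

# The parameter count does not depend on how often the module is reused.
n_params = sum(p.numel() for p in shared.parameters())
print(n_params, tuple(g1.shape), tuple(g3.shape))
```

When the module is used several times, autograd sums the contributions from every use into the single weight.grad Tensor.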