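In the neural networks we have seen so far, as in the first figure, the input samples are labeled, i.e. they come as (input, target) pairs. We adjust the parameters of the preceding layers according to the difference between the current output and the target (label), until convergence. Now, however, we only have unlabeled data, as in the figure on the right. So how do we obtain this error?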
Another approach: fine-tune the whole system using labeled samples. (If there is enough data, this is the best option; it is known as end-to-end learning.)
An autoencoder takes an input $$x \in [0, 1]^d$$ and first maps it (with an _encoder_) to a hidden representation $$y \in [0, 1]^{d'}$$ through a deterministic mapping, e.g.:

$$y = s(Wx + b)$$
Where s is a non-linearity such as the sigmoid. The latent representation y, or **code**, is then mapped back (with a _decoder_) into a **reconstruction** z of the same shape as x. The mapping happens through a similar transformation, e.g.:

$$z = s(W'y + b')$$
(Here, the prime symbol does not indicate matrix transposition.) z should be seen as a prediction of x, given the code y. Optionally, the weight matrix $$W'$$ of the reverse mapping may be constrained to be the transpose of the forward mapping: $$W' = W^T$$. This is referred to as _tied weights_. The parameters of this model (namely W, b, $$b'$$ and, if one doesn't use tied weights, also $$W'$$) are optimized such that the average reconstruction error is minimized.
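The encoder and decoder mappings can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the original: the layer sizes, the seed and the initialization range are arbitrary, and examples are stored one per row, so the products are written `x @ W` rather than `Wx`.

```python
import numpy as np

def sigmoid(a):
    """Logistic non-linearity s(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """Hidden code y = s(x W + b); x holds one example per row."""
    return sigmoid(x @ W + b)

def decode(y, W_prime, b_prime):
    """Reconstruction z = s(y W' + b'), same shape as the input."""
    return sigmoid(y @ W_prime + b_prime)

rng = np.random.RandomState(0)            # arbitrary seed (assumption)
d, d_hidden = 6, 3                        # illustrative sizes (assumption)
W = rng.uniform(-0.5, 0.5, size=(d, d_hidden))
b = np.zeros(d_hidden)
b_prime = np.zeros(d)

x = rng.uniform(0.0, 1.0, size=(4, d))    # minibatch of 4 inputs in [0, 1]^d
y = encode(x, W, b)                       # code, shape (4, d_hidden)
z = decode(y, W.T, b_prime)               # tied weights: W' = W^T
```

With tied weights the decoder reuses `W.T`, so only `W`, `b` and `b_prime` would need to be learned.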
The reconstruction error can be measured in many ways, depending on the appropriate distributional assumptions on the input given the code. The traditional _squared error_ $$L(x, z) = \|x - z\|^2$$ can be used. If the input is interpreted as either bit vectors or vectors of bit probabilities, the _cross-entropy_ of the reconstruction can be used:
$$L_{H}(x, z) = -\sum_{k=1}^{d} [x_k \log z_k + (1 - x_k) \log(1 - z_k)]$$
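Both reconstruction errors are easy to write down in NumPy. The sketch below follows the two formulas above; the `eps` clipping is an assumption added here to keep the logarithm finite, and the example vectors are made up for illustration.

```python
import numpy as np

def squared_error(x, z):
    """L(x, z) = ||x - z||^2, one value per example (row)."""
    return np.sum((x - z) ** 2, axis=1)

def cross_entropy(x, z, eps=1e-12):
    """L_H(x, z) = -sum_k [x_k log z_k + (1 - x_k) log(1 - z_k)] per row.

    eps is an assumption added here so that log() stays finite
    when a reconstruction value reaches exactly 0 or 1.
    """
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(x * np.log(z) + (1.0 - x) * np.log(1.0 - z), axis=1)

x = np.array([[1.0, 0.0, 1.0, 0.0]])          # a bit vector
z_good = np.array([[0.9, 0.1, 0.9, 0.1]])     # confident, correct reconstruction
z_bad = np.array([[0.5, 0.5, 0.5, 0.5]])      # uninformative reconstruction

print(squared_error(x, z_good), squared_error(x, z_bad))
print(cross_entropy(x, z_good), cross_entropy(x, z_bad))
```

Under both measures the confident, correct reconstruction scores lower than the uninformative one.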
The hope is that the code y is a _distributed_ representation that captures the coordinates along the main factors of variation in the data. This is similar to the way the projection on principal components would capture the main factors of variation in the data. Indeed, if there is one linear hidden layer (the _code_) and the mean squared error criterion is used to train the network, then the k hidden units learn to project the input in the span of the first k principal components of the data. If the hidden layer is non-linear, the auto-encoder behaves differently from PCA, with the ability to capture multi-modal aspects of the input distribution. The departure from PCA becomes even more important when we consider _stacking multiple encoders_ (and their corresponding decoders) when building a deep auto-encoder [Hinton06].
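The link with PCA can be checked numerically. The sketch below does not train anything: it builds the PCA solution directly from an SVD and treats it as a linear autoencoder with tied weights. The toy data, seed and sizes are assumptions. A trained linear autoencoder is only expected to span the same subspace, not to recover these exact directions.

```python
import numpy as np

rng = np.random.RandomState(0)            # arbitrary seed (assumption)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X = X - X.mean(axis=0)                    # PCA works on centred data

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                              # (5, k): top-k principal directions

Y = X @ W                                 # linear encoder (the code)
Z = Y @ W.T                               # linear decoder with tied weights

pca_error = np.sum((X - Z) ** 2)          # total squared reconstruction error
discarded = np.sum(S[k:] ** 2)            # energy in the dropped components

# Any other orthonormal k-dimensional basis reconstructs no better.
Q, _ = np.linalg.qr(rng.normal(size=(5, k)))
random_error = np.sum((X - X @ Q @ Q.T) ** 2)
```

Here `pca_error` matches `discarded`, and `random_error` is at least as large, which is the sense in which the principal subspace is the best k-dimensional linear code under squared error.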
Because y is viewed as a lossy compression of x, it cannot be a good (small-loss) compression for all x. Optimization makes it a good compression for training examples, and hopefully for other inputs as well, but not for arbitrary inputs. That is the sense in which an auto-encoder generalizes: it gives low reconstruction error on test examples from the same distribution as the training examples, but generally high reconstruction error on samples randomly chosen from the input space.
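This notion of generalization can be illustrated with a toy experiment. The sketch below is an assumption-laden stand-in: instead of training an auto-encoder it fits a 2-unit linear code by PCA, on made-up data lying close to a 2-D subspace of a 10-D input space, and then compares held-out examples with points scattered over the input space.

```python
import numpy as np

rng = np.random.RandomState(1)            # arbitrary seed (assumption)
A = rng.normal(size=(2, 10))              # hypothetical 2-D structure in 10-D

def sample(n):
    """Draw n examples lying close to a 2-D subspace of the input space."""
    return rng.normal(size=(n, 2)) @ A + 0.01 * rng.normal(size=(n, 10))

train, test = sample(500), sample(100)
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
W = Vt[:2].T                              # stand-in for a trained 2-unit linear encoder

def reconstruction_error(x):
    """Mean squared reconstruction error per example."""
    z = (x - mean) @ W @ W.T + mean
    return np.mean(np.sum((x - z) ** 2, axis=1))

arbitrary = rng.uniform(-3.0, 3.0, size=(100, 10))   # spread over the input space
err_test = reconstruction_error(test)
err_arbitrary = reconstruction_error(arbitrary)
```

`err_test` stays close to the noise level, while `err_arbitrary` is orders of magnitude larger, since most of an arbitrary point lies outside the subspace the code has learned.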
We want to implement an auto-encoder using Theano, in the form of a class, that could be afterwards used in constructing a stacked autoencoder. The first step is to create shared variables for the parameters of the autoencoder W, b and $$b'$$. (Since we are using tied weights in this tutorial, $$W^T$$ will be used for $$W'$$):

    import numpy
    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams

    class dA(object):
        def __init__(self, numpy_rng, theano_rng=None, input=None,
                     n_visible=784, n_hidden=500,
                     W=None, bhid=None, bvis=None):
            """
            Initialize the dA class by specifying the number of visible units (the
            dimension d of the input), the number of hidden units (the dimension
            d' of the latent or hidden space) and the corruption level. The
            constructor also receives symbolic variables for the input, weights and
            bias. Such symbolic variables are useful when, for example, the input
            is the result of some computations, or when weights are shared between
            the dA and an MLP layer. When dealing with SdAs this always happens:
            the dA on layer 2 gets as input the output of the dA on layer 1,
            and the weights of the dA are used in the second stage of training
            to construct an MLP.

            :type numpy_rng: numpy.random.RandomState
            :param numpy_rng: random number generator used to generate weights
            :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
            :param theano_rng: Theano random generator; if None is given one is
                               generated based on a seed drawn from `numpy_rng`
            :type input: theano.tensor.TensorType
            :param input: a symbolic description of the input or None for
                          standalone dA
            :type n_visible: int
            :param n_visible: number of visible units
            :type n_hidden: int
            :param n_hidden: number of hidden units
            :type W: theano.tensor.TensorType
            :param W: Theano variable pointing to a set of weights that should be
                      shared between the dA and another architecture; if dA should
                      be standalone set this to None
            :type bhid: theano.tensor.TensorType
            :param bhid: Theano variable pointing to a set of bias values (for the
                         hidden units) that should be shared between the dA and
                         another architecture; None for a standalone dA
            :type bvis: theano.tensor.TensorType
            :param bvis: same as bhid, but for the visible units
            """
            self.n_visible = n_visible
            self.n_hidden = n_hidden

            # create a Theano random generator that gives symbolic random values
            if theano_rng is None:
                theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

            # W is drawn uniformly from +/- 4 * sqrt(6 / (n_hidden + n_visible))
            if W is None:
                initial_W = numpy.asarray(
                    numpy_rng.uniform(
                        low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                        high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                        size=(n_visible, n_hidden)
                    ),
                    dtype=theano.config.floatX
                )
                W = theano.shared(value=initial_W, name='W', borrow=True)

            if bvis is None:
                bvis = theano.shared(
                    value=numpy.zeros(n_visible, dtype=theano.config.floatX),
                    borrow=True
                )

            if bhid is None:
                bhid = theano.shared(
                    value=numpy.zeros(n_hidden, dtype=theano.config.floatX),
                    name='b',
                    borrow=True
                )

            self.W = W
            # b corresponds to the bias of the hidden
            self.b = bhid
            # b_prime corresponds to the bias of the visible
            self.b_prime = bvis
            # tied weights, therefore W_prime is W transpose
            self.W_prime = self.W.T
            self.theano_rng = theano_rng
            # if no input is given, generate a variable representing the input
            if input is None:
                # we use a matrix because we expect a minibatch of several
                # examples, each example being a row
                self.x = T.dmatrix(name='input')
            else:
                self.x = input

            self.params = [self.W, self.b, self.b_prime]

Note that we pass the symbolic `input` to the autoencoder as a parameter. This is so that we can concatenate layers of autoencoders to form a deep network: the symbolic output (the y above) of layer k will be the symbolic input of layer k+1.
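The chaining of layers can be pictured with a small NumPy sketch. The `Encoder` class, the layer widths and the initialization range are hypothetical and only stand in for the encoding half of an autoencoder; the point is simply that the output of layer k is fed to layer k+1.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Encoder:
    """Hypothetical minimal layer: only the encoding half of an autoencoder."""

    def __init__(self, rng, n_in, n_out):
        self.W = rng.uniform(-0.1, 0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def __call__(self, x):
        return sigmoid(x @ self.W + self.b)

rng = np.random.RandomState(0)            # arbitrary seed (assumption)
sizes = [784, 500, 250]                   # illustrative layer widths (assumption)
layers = [Encoder(rng, n_in, n_out)
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.uniform(size=(8, sizes[0]))       # minibatch of 8 examples
h = x
for layer in layers:
    h = layer(h)                          # output of layer k feeds layer k+1
```

In the Theano version the same chaining happens on symbolic variables rather than on arrays, which is why the constructor accepts `input` as an argument.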
Now we can express the computation of the latent representation and of the reconstructed signal:

        def get_hidden_values(self, input):
            """ Computes the values of the hidden layer """
            return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

        def get_reconstructed_input(self, hidden):
            """Computes the reconstructed input given the values of the
            hidden layer
            """
            return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)

And using these functions we can compute the cost and the updates of one stochastic gradient descent step:

        def get_cost_updates(self, corruption_level, learning_rate):
            """ This function computes the cost and the updates for one training
            step of the dA """

            # get_corrupted_input belongs to the denoising variant of the model
            tilde_x = self.get_corrupted_input(self.x, corruption_level)
            y = self.get_hidden_values(tilde_x)
            z = self.get_reconstructed_input(y)
            # note : we sum over the size of a datapoint; if we are using
            #        minibatches, L will be a vector, with one entry per
            #        example in minibatch
            L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
            # note : L is now a vector, where each element is the
            #        cross-entropy cost of the reconstruction of the
            #        corresponding example of the minibatch. We need to
            #        compute the average of all these to get the cost of
            #        the minibatch
            cost = T.mean(L)

            # compute the gradients of the cost of the `dA` with respect
            # to its parameters
            gparams = T.grad(cost, self.params)
            # generate the list of updates
            updates = [
                (param, param - learning_rate * gparam)
                for param, gparam in zip(self.params, gparams)
            ]

            return (cost, updates)

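To make the update rule concrete without Theano's automatic differentiation, here is a rough NumPy sketch of the same cost and of one gradient step, with the gradients written out by hand for the tied-weight case and no input corruption. The function names, toy data, sizes and learning rate are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost_and_grads(x, W, b, b_prime):
    """Mean cross-entropy reconstruction cost and its gradients (tied weights)."""
    n = x.shape[0]
    y = sigmoid(x @ W + b)                 # code
    z = sigmoid(y @ W.T + b_prime)         # reconstruction
    cost = np.mean(-np.sum(x * np.log(z) + (1 - x) * np.log(1 - z), axis=1))

    delta_out = (z - x) / n                # gradient w.r.t. decoder pre-activation
    delta_hid = (delta_out @ W) * y * (1 - y)   # gradient w.r.t. encoder pre-activation
    gW = x.T @ delta_hid + delta_out.T @ y      # W is used by encoder and decoder
    gb = delta_hid.sum(axis=0)
    gb_prime = delta_out.sum(axis=0)
    return cost, (gW, gb, gb_prime)

def sgd_step(x, params, learning_rate):
    """One update: param <- param - learning_rate * gradient."""
    cost, grads = cost_and_grads(x, *params)
    new_params = [p - learning_rate * g for p, g in zip(params, grads)]
    return cost, new_params

rng = np.random.RandomState(0)             # arbitrary seed (assumption)
x = (rng.uniform(size=(20, 16)) > 0.5).astype(float)   # toy binary minibatch
params = [rng.uniform(-0.1, 0.1, size=(16, 8)), np.zeros(8), np.zeros(16)]

costs = []
for _ in range(50):
    cost, params = sgd_step(x, params, learning_rate=0.1)
    costs.append(cost)
```

Because `W` appears in both the encoder and the decoder, its gradient is the sum of two terms; with untied weights each matrix would receive only its own term.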
We can now define a function that, applied iteratively, will update the parameters W, b and b_prime such that the reconstruction cost is approximately minimized.
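
    # rng, theano_rng, x (symbolic input), index (symbolic minibatch index),
    # train_set_x, batch_size, n_train_batches, learning_rate and training_epochs
    # are assumed to be defined earlier in the script, along with the imports of
    # os, sys, timeit, numpy, theano, PIL.Image and the tutorial's tile_raster_images.
    da = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
            n_visible=28 * 28, n_hidden=500)

    cost, updates = da.get_cost_updates(
        corruption_level=0.,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        }
    )

    start_time = timeit.default_timer()
    # go through training epochs
    for epoch in range(training_epochs):
        # go through training set
        c = []
        for batch_index in range(n_train_batches):
            c.append(train_da(batch_index))
        print('Training epoch %d, cost ' % epoch, numpy.mean(c, dtype='float64'))
    end_time = timeit.default_timer()
    training_time = (end_time - start_time)
    print(('The no corruption code for file ' +
           os.path.split(__file__)[1] +
           ' ran for %.2fm' % ((training_time) / 60.)), file=sys.stderr)

    # each row of W.T is one learned filter, drawn as a 28x28 tile
    image = Image.fromarray(
        tile_raster_images(X=da.W.get_value(borrow=True).T,
                           img_shape=(28, 28), tile_shape=(10, 10),
                           tile_spacing=(1, 1)))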

    # Same procedure with 30% input corruption (the denoising setting).
    rng = numpy.random.RandomState(123)
    theano_rng = RandomStreams(rng.randint(2 ** 30))

    da = dA(numpy_rng=rng, theano_rng=theano_rng, input=x,
            n_visible=28 * 28, n_hidden=500)

    cost, updates = da.get_cost_updates(
        corruption_level=0.3,
        learning_rate=learning_rate
    )

    train_da = theano.function(
        [index],
        cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size]
        }
    )

    start_time = timeit.default_timer()
    # go through training epochs
    for epoch in range(training_epochs):
        # go through training set
        c = []
        for batch_index in range(n_train_batches):
            c.append(train_da(batch_index))
        print('Training epoch %d, cost ' % epoch, numpy.mean(c, dtype='float64'))
    end_time = timeit.default_timer()
    training_time = (end_time - start_time)
    print(('The 30% corruption code for file ' +
           os.path.split(__file__)[1] +
           ' ran for %.2fm' % (training_time / 60.)), file=sys.stderr)

    image = Image.fromarray(tile_raster_images(
        X=da.W.get_value(borrow=True).T,
        img_shape=(28, 28), tile_shape=(10, 10),
        tile_spacing=(1, 1)))

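    if __name__ == '__main__':
        test_dA()  # driver function of the tutorial script wrapping the training code (not shown in this excerpt)
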
If there is no constraint besides minimizing the reconstruction error, one might expect an auto-encoder with d inputs and an encoding of dimension d (or greater) to learn the identity function, merely mapping an input to its copy. Such an autoencoder would not differentiate test examples (from the training distribution) from other input configurations.
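The degenerate identity solution is easy to exhibit. The sketch below assumes a linear autoencoder (no sigmoid) with as many hidden units as inputs, which is a simplification of the model in the code above; the sizes and ranges are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)             # arbitrary seed (assumption)
d = 5
W = np.eye(d)                              # encoder with as many hidden units as inputs
W_prime = W.T                              # tied decoder

def reconstruct(x):
    """Linear encode then decode; with W = I this is the identity map."""
    return (x @ W) @ W_prime

train_like = rng.uniform(0.0, 1.0, size=(10, d))      # inputs in the usual range
arbitrary = rng.uniform(-100.0, 100.0, size=(10, d))  # far from any training data

err_train_like = np.max(np.abs(reconstruct(train_like) - train_like))
err_arbitrary = np.max(np.abs(reconstruct(arbitrary) - arbitrary))
```

Both errors are zero, so reconstruction error says nothing about whether an input resembles the training data, which is exactly the failure described above.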
Surprisingly, experiments reported in [Bengio07] suggest that, in practice, when trained with stochastic gradient descent, non-linear auto-encoders with more hidden units than inputs (called overcomplete) yield useful representations. (Here, "useful" means that a network taking the encoding as input has low classification error.)
A simple explanation is that stochastic gradient descent with early stopping is similar to an L2 regularization of the parameters. To achieve perfect reconstruction of continuous inputs, a one-hidden layer auto-encoder with non-linear hidden units (exactly like in the above code) needs very small weights in the first (encoding) layer, to bring the non-linearity of the hidden units into their linear regime, and very large weights in the second (decoding) layer. With binary inputs, very large weights are also needed to completely minimize the reconstruction error. Since the implicit or explicit regularization makes it difficult to reach large-weight solutions, the optimization algorithm finds encodings which only work well for examples similar to those in the training set, which is what we want. It means that the _representation is exploiting statistical regularities present in the training set_, rather than merely learning to replicate the input.
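The small-weight / large-weight trade-off can be seen on a single scalar input. The sketch below is a simplification and not the model from the code above: it assumes one sigmoid hidden unit, a linear output unit and untied weights. Near zero the sigmoid behaves like 0.5 + a / 4, so an encoding weight `scale` can be undone by a decoding weight of 4 / scale.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def max_reconstruction_error(scale, x):
    """Encode with a small weight, decode with a large one.

    Near 0 the sigmoid is almost linear, s(a) ~ 0.5 + a / 4, so a decoder
    weight of 4 / scale and a bias of -2 / scale approximately invert it.
    """
    y = sigmoid(scale * x)                 # small encoding weight
    z = (4.0 / scale) * y - 2.0 / scale    # large decoding weight, linear output
    return np.max(np.abs(x - z))

x = np.linspace(0.0, 1.0, 11)              # continuous inputs in [0, 1]
errors = {scale: max_reconstruction_error(scale, x)
          for scale in (1.0, 0.1, 0.01)}
decoder_weights = {scale: 4.0 / scale for scale in errors}
```

The reconstruction error shrinks as `scale` decreases, but only because the decoding weight grows in proportion, which is the kind of solution that L2-style regularization penalizes.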
There are other ways by which an auto-encoder with more hidden units than inputs could be prevented from learning the identity function, capturing something useful about the input in its hidden representation. One is the addition of _sparsity_ (forcing many of the hidden units to be zero or near-zero). Sparsity has been exploited very successfully by many [Ranzato07] [Lee08]. Another is to add randomness in the transformation from input to reconstruction. This technique is used in Restricted Boltzmann Machines (discussed later in Restricted Boltzmann Machines (RBM)), as well as in Denoising Auto-Encoders, discussed below.
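As one possible way to encourage sparsity, a penalty on the hidden activations can be added to the reconstruction cost. The sketch below uses an L1 penalty, which is a common choice but not necessarily the formulation used in the cited papers; `sparsity_weight`, the sizes and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sparse_cost(x, W, b, b_prime, sparsity_weight):
    """Cross-entropy reconstruction cost plus an L1 penalty on the code.

    sparsity_weight is a hypothetical hyper-parameter; the L1 form is one
    common choice, not necessarily the one used in the cited papers.
    """
    y = sigmoid(x @ W + b)
    z = sigmoid(y @ W.T + b_prime)
    reconstruction = np.mean(
        -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z), axis=1))
    penalty = np.mean(np.sum(np.abs(y), axis=1))   # small when most units are near zero
    return reconstruction + sparsity_weight * penalty

rng = np.random.RandomState(0)             # arbitrary seed (assumption)
x = rng.uniform(size=(10, 8))
W = rng.uniform(-0.1, 0.1, size=(8, 16))   # overcomplete: 16 hidden units, 8 inputs
b, b_prime = np.zeros(16), np.zeros(8)

plain = sparse_cost(x, W, b, b_prime, sparsity_weight=0.0)
penalized = sparse_cost(x, W, b, b_prime, sparsity_weight=0.1)
```

With the penalty switched on, solutions in which many hidden units are active at once cost more, so the optimizer is pushed towards codes where only a few units respond to a given input.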