Application of Information Theory to Blind Source Separation
Natural Gradient
The training law we have developed up to this point requires computation of $W^{-T}$. We can modify this by multiplying the gradient on the right by $W^T W$:
$$\Delta W = \eta\,(W^{-T} + z x^T)\,W^T W,$$
where $z$ is the score-function vector from Mackay's derivation. This becomes (since $W^{-T} W^T W = W$)
$$\Delta W = \eta\,(I + z u^T)\,W,$$
which requires no matrix inversion.
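As a concrete sketch, the inverse-free update can be coded directly. (Assumptions: the score function $z = -\tanh(u)$, a common choice for super-Gaussian sources; the function and variable names here are illustrative, not from the notes.)

```python
import numpy as np

rng = np.random.default_rng(0)

def natural_gradient_step(W, x, eta=0.01):
    """One natural-gradient ICA step, Delta W = eta * (I + z u^T) W."""
    u = W @ x                 # current output u = W x
    z = -np.tanh(u)           # assumed score function (super-Gaussian prior)
    n = W.shape[0]
    return W + eta * (np.eye(n) + np.outer(z, u)) @ W

# Usage: one step from W = I on a random observation.
W = np.eye(3)
x = rng.standard_normal(3)
W_new = natural_gradient_step(W, x)
```

Note that no inverse of $W$ is computed anywhere in the step, which is the practical payoff of the modification.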
This modification to the gradient, multiplying by $W^T W$, is called the natural gradient (Amari, 1998). In this section we examine it, with an eye to the question: what is natural about it? (We will also comment on the scaling of the update formula.) We follow Amari (1998) in the following discussion. Suppose $\Theta$ is some parameter space (e.g., the space of parameters in the weighting matrix $W$). Suppose there is some function $L(\theta)$ defined on $\Theta$. Consider a parameter value $\theta$, and some incremental change to $\theta + d\theta$. If the parameter space is Euclidean, then the squared length of the increment is
$$|d\theta|^2 = \sum_i (d\theta_i)^2.$$
However, not all parameter spaces are Euclidean. Consider, for example, a case where the parameters all lie on a sphere. Then the appropriate distance measure is not simply the sum of the squares of the coordinate increments, especially if $\theta$ is measured in spherical coordinates! So we measure the change differently:
$$|d\theta|^2 = \sum_{i,j} g_{ij}(\theta)\, d\theta_i\, d\theta_j.$$
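To make the sphere example concrete, here is a small numerical sketch using the standard metric on the unit sphere in colatitude/longitude coordinates $(\theta, \phi)$, where $G(\theta) = \mathrm{diag}(1, \sin^2\theta)$. (The coordinates and function names are illustrative additions, not from the notes.)

```python
import numpy as np

def sphere_metric(theta):
    """Riemannian metric tensor G for the unit sphere in (theta, phi),
    with theta the colatitude: ds^2 = d_theta^2 + sin(theta)^2 d_phi^2."""
    return np.diag([1.0, np.sin(theta) ** 2])

def increment_length_sq(theta, d):
    """Weighted squared length d^T G(theta) d of an increment d."""
    return d @ sphere_metric(theta) @ d

# A step purely in phi: near the pole it moves almost no distance,
# though the naive Euclidean sum of squares would report 0.01 either way.
d = np.array([0.0, 0.1])
near_pole = increment_length_sq(0.01, d)         # tiny: sin(0.01)^2 * 0.01
near_equator = increment_length_sq(np.pi / 2, d) # full: 1.0 * 0.01
```

The same increment $d$ has very different lengths at different points, which is exactly why the metric must depend on $\theta$.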
Here, $g_{ij}$ is called the Riemannian metric tensor; it describes the local curvature of the parameter space at the point $\theta$. In terms of vectors, we can write
$$|d\theta|^2 = d\theta^T\, G(\theta)\, d\theta,$$
where $G(\theta) = [g_{ij}(\theta)]$ (a function of $\theta$). $G$ is symmetric. We see that we are simply dealing with a weighted distance, induced from a weighted inner product, defined by
$$\langle x, y\rangle_G = x^T G\, y.$$
When $G = I$, we simply get the Euclidean distance. Now consider the problem of learning by "steepest descent." The question is: do we really go in the right direction if we take into account the curvature of the parameter space? We want to decrease $L$ by moving in a direction $d\theta$ to obtain $L(\theta + d\theta)$, and do the best possible job with the motion. Let us assume that we have a fixed step length,
$$|d\theta|^2 = d\theta^T G\, d\theta = \epsilon^2,$$
for some small positive $\epsilon$.
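The constrained minimization can be carried out explicitly; the following is a sketch of the standard Lagrange-multiplier argument, filling in the step between the constraint and the descent direction.

```latex
\text{Minimize } L(\theta + d\theta) \approx L(\theta) + \nabla L(\theta)^T d\theta
\quad\text{subject to}\quad d\theta^T G\, d\theta = \epsilon^2 .

\text{Form the Lagrangian } \;
J = \nabla L^T d\theta - \lambda \left( d\theta^T G\, d\theta - \epsilon^2 \right).

\frac{\partial J}{\partial (d\theta)} = \nabla L - 2\lambda\, G\, d\theta = 0
\quad\Longrightarrow\quad
d\theta = \frac{1}{2\lambda}\, G^{-1} \nabla L .

\text{Choosing the sign so that } L \text{ decreases gives } \;
d\theta = -\eta\, G^{-1}(\theta)\, \nabla L(\theta).
```

So the steepest-descent direction in the curved space is the ordinary gradient premultiplied by $G^{-1}$.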
Observe that the usual "steepest descent" that we deal with always assumes that $G = I$.
We call
$$\tilde\nabla L(\theta) = G^{-1}(\theta)\, \nabla L(\theta)$$
the natural gradient of $L$ in the Riemannian space. In Euclidean space, it is the same as the usual gradient. Now consider the BSS problem in the context of the natural gradient. We first formulate the problem. We have, as before, signal vectors $s$ with independent components, so that
$$p(s) = \prod_i p_i(s_i),$$
and $x = A s$. The output is
$$u = W x,$$
and we update the matrix by some learning rule
$$W_{t+1} = W_t + \eta_t\, F(x_t, W_t).$$
Previously, we took the learning update to be $F(x, W) = W^{-T} + z x^T$, but this will now change. We observe that in order to obtain equilibrium, the function $F$ must satisfy
$$E[F(x, W)] = 0 \qquad (2)$$
when $W = A^{-1}$ (we stop changing at the correct answer). Now let $T$ be an operator that maps a matrix to a (nonsingular) matrix, and let
$$\tilde F(x, W) = F(x, W)\, T(W).$$
Then $\tilde F$ satisfies (2) when $F$ does (same equilibrium). We want to determine what form the transformation $T$ should take. Let $dW$ be a small deviation from a matrix $W$ to $W + dW$; $dW$ constitutes a "vector" starting from the point $W$. Let us define an inner product $\langle dW_1, dW_2\rangle_W$ at $W$ (its form to be determined).
(Draw a picture of a curved $W$ surface, and the vector on it.) We can pull back the point, mapping to another surface, by right-multiplying by $W^{-1}$. Then $W$ maps to $I$, and $W + dW$ maps to
$$I + dX,$$
where
$$dX = dW\, W^{-1}.$$
A deviation $dW$ at $W$ is equivalent to the deviation $dX$ at $I$ by this mapping. The key idea is that we want the metric to be invariant under this mapping: the inner product of $dW$ at $W$ is to be the same as the inner product of $dW\,Y$ at $WY$ for any (nonsingular) $Y$. Thus we impose the invariance
$$\langle dW_1, dW_2\rangle_W = \langle dW_1 Y,\, dW_2 Y\rangle_{WY}.$$
In particular, when $Y = W^{-1}$, we have $WY = I$. We define the inner product at $I$ by
$$\langle dX_1, dX_2\rangle_I = \operatorname{tr}(dX_1^T\, dX_2),$$
the (unweighted, Euclidean) inner product that induces the Frobenius norm. Under our principle of equivalence (using $dX = dW\,W^{-1}$), we should therefore have
$$\langle dW_1, dW_2\rangle_W = \langle dW_1 W^{-1},\, dW_2 W^{-1}\rangle_I = \operatorname{tr}\!\big(W^{-T}\, dW_1^T\, dW_2\, W^{-1}\big).$$
It follows that the Riemannian metric at $W$ has the form
$$\langle dW_1, dW_2\rangle_W = \operatorname{tr}\!\big(dW_1\,(W^T W)^{-1}\, dW_2^T\big).$$
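The invariance property is easy to check numerically. The sketch below verifies $\langle dW_1, dW_2\rangle_W = \langle dW_1 Y, dW_2 Y\rangle_{WY}$ on random matrices, using the trace form of the metric derived above. (The helper name `inner` is illustrative.)

```python
import numpy as np

rng = np.random.default_rng(1)

def inner(dW1, dW2, W):
    """Riemannian inner product at W: tr(dW1 (W^T W)^{-1} dW2^T)."""
    return np.trace(dW1 @ np.linalg.inv(W.T @ W) @ dW2.T)

# The invariance <dW1, dW2>_W = <dW1 Y, dW2 Y>_{WY} should hold for any
# invertible Y (random Gaussian matrices are almost surely invertible).
n = 4
W, Y, dW1, dW2 = (rng.standard_normal((n, n)) for _ in range(4))
lhs = inner(dW1, dW2, W)
rhs = inner(dW1 @ Y, dW2 @ Y, W @ Y)
```

Algebraically, the $Y$ and $Y^{-1}$ factors cancel inside the trace, which is exactly what the numerical check confirms.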
We can determine an explicit form for the natural gradient using the principle of invariance. We interpret $\tilde\nabla L$ as a vector applied at $W$, and $\nabla L$ as a vector applied at $I$. Then we must have
$$\langle \tilde\nabla L,\, dW\rangle_W = \langle \nabla L,\, dW\rangle_I$$
for every deviation $dW$.
We thus have (using the definition of the inner product)
$$\operatorname{tr}\!\big(W^{-T}\,\tilde\nabla L^T\, dW\, W^{-1}\big) = \operatorname{tr}\!\big(\nabla L^T\, dW\big).$$
Using the cyclic property of the trace we find
$$\operatorname{tr}\!\big(W^{-1}\, W^{-T}\,\tilde\nabla L^T\, dW\big) = \operatorname{tr}\!\big(\nabla L^T\, dW\big).$$
Since this must be true for arbitrary $dW$, we must have
$$W^{-1}\, W^{-T}\,\tilde\nabla L^T = \nabla L^T,$$
or
$$\tilde\nabla L = \nabla L\, W^T W.$$
This is precisely the modification, multiplication on the right by $W^T W$, with which we began the section.
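As a final sanity check, the sketch below verifies numerically that right-multiplying the ordinary gradient $W^{-T} + z x^T$ by $W^T W$ collapses to the inverse-free form $(I + z u^T)W$ with $u = Wx$. (The choice $z = -\tanh(u)$ is arbitrary here; the identity is purely algebraic and holds for any $z$.)

```python
import numpy as np

rng = np.random.default_rng(2)

n = 3
W = rng.standard_normal((n, n))
x = rng.standard_normal(n)
u = W @ x
z = -np.tanh(u)  # any vector works; the identity below is algebraic

# Ordinary gradient (requires the inverse transpose W^{-T}):
grad = np.linalg.inv(W).T + np.outer(z, x)

# Natural gradient: right-multiply by W^T W ...
nat_grad = grad @ W.T @ W

# ... which collapses to the inverse-free form (I + z u^T) W:
inverse_free = (np.eye(n) + np.outer(z, u)) @ W
```

The two expressions agree to machine precision, confirming that the natural-gradient update never needs an explicit matrix inverse.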