Application of Information Theory to Blind Source Separation
Natural Gradient
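The training law we have developed up to this point requires computation of $W^{-T}$. We can modify this by multiplying the gradient on the right by $W^T W$. Since $W^{-T} W^T W = W$, the modified law no longer requires a matrix inverse.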
This modification to the gradient, multiplying by $W^T W$, is called the natural gradient (Amari, 1998). In this section we examine it with an eye to the question: what is natural about it? (Comment on scaling of update formula.) We follow Amari (1998) in the following discussion. Suppose $S = \{w \in \mathbb{R}^n\}$ is some parameter space (e.g., the space of parameters in the weighting matrix). Suppose there is some function $L(w)$ defined on $S$. Consider a parameter value $w$, and some incremental change to $w + dw$. If the parameter space is Euclidean, then the squared length of the increment is
\[ |dw|^2 = \sum_{i=1}^{n} (dw_i)^2 . \]
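However, not all parameter spaces are Euclidean. Consider, for example, a case where the parameters all lie on a sphere. Then the appropriate distance measure is not simply the sum of the squares of the coordinates, especially if $w$ is measured in spherical coordinates! So we measure the change differently:
\[ |dw|^2 = \sum_{i,j} g_{ij}(w)\, dw_i\, dw_j . \]
Here, $g$ is called the Riemannian metric tensor; it describes the local geometry of the parameter space at the point $w$. In matrix form,
\[ |dw|^2 = dw^T G(w)\, dw , \]
where $G = (g_{ij})$. When the step is constrained to $|dw|^2 = \varepsilon^2$ for some small positive $\varepsilon$, the direction that most decreases $L(w + dw)$ is $-G^{-1}(w) \nabla L(w)$ (Amari, 1998).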
Observe that the usual ``steepest descent'' that we deal with always assumes that G=I.
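The role of the metric can be checked numerically. The following is a minimal sketch, not part of the original notes: the metric `G` is an arbitrary symmetric positive-definite matrix and `grad` is a random stand-in for the ordinary gradient of $L$ at $w$ (both are assumptions made for illustration). It compares the first-order change in $L$ along the step $-G^{-1}\nabla L$ with the change along random steps of the same Riemannian length.

```python
import numpy as np

def natural_gradient(grad, G):
    """Return G^{-1} grad by solving G v = grad (avoids forming the inverse)."""
    return np.linalg.solve(G, grad)

def riemannian_length_sq(dw, G):
    """Squared length dw^T G dw of an increment under the metric G."""
    return float(dw @ G @ dw)

rng = np.random.default_rng(0)
n = 3
B = rng.standard_normal((n, n))
G = B @ B.T + n * np.eye(n)      # assumed metric: symmetric positive definite
grad = rng.standard_normal(n)    # stand-in for the ordinary gradient of L

eps = 1e-3
nat = natural_gradient(grad, G)
# Step along -G^{-1} grad, rescaled to Riemannian length eps.
step = -eps * nat / np.sqrt(riemannian_length_sq(nat, G))
best = float(grad @ step)        # first-order change in L along this step

# First-order change along random steps of the same Riemannian length.
changes = []
for _ in range(1000):
    d = rng.standard_normal(n)
    d *= eps / np.sqrt(riemannian_length_sq(d, G))
    changes.append(float(grad @ d))

print("natural-gradient step:", best)
print("best random step:     ", min(changes))
```

With `G = np.eye(n)` the function `natural_gradient` returns `grad` unchanged, which is the ordinary steepest-descent case.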
We call
\[ \tilde{\nabla} L(w) = G^{-1}(w) \nabla L(w) \]
the natural gradient of $L$ in the Riemannian space. In Euclidean space, it is the same as the usual gradient. Now consider the BSS problem in the context of natural gradient. We first formulate the problem. We have, as before, signal vectors $s$ and $x = As$, and we update the matrix by some learning rule
\[ W_{t+1} = W_t + \eta\, F(x_t, W_t) . \]
Previously, we took the learning update $F$ to be the ordinary gradient of the objective, but this will now change. We observe that in order to obtain equilibrium, the function $F$ must satisfy $E[F(x, W)] = 0$ when $W = A^{-1}$ (we stop changing at the correct answer). Now let $dW$ be a small deviation of $W$. Then the perturbed matrix is $W + dW$.
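(Draw a picture of a curved W surface, and the vector on it.) We can pull back the point, mapping to another surface, by right-multiplying by $W^{-1}$. Then $W$ maps to $I$, and $W + dW$ maps to
\[ (W + dW) W^{-1} = I + dX , \]
where
\[ dX = dW\, W^{-1} . \]
A deviation $dW$ at $W$ is equivalent to the deviation $dX$ at $I$ by this mapping. The key idea is that we want the metric to be invariant under this mapping: the inner product of $dW$ at $W$ is to be the same as the inner product of $dW\,Y$ at $WY$ for any invertible $Y$. Thus we impose the invariant
\[ \langle dW, dW \rangle_W = \langle dW\,Y, dW\,Y \rangle_{WY} . \]
In particular, when $Y = W^{-1}$, we have $WY = I$. We define the inner product at $I$ by
\[ \langle dX, dX \rangle_I = \sum_{i,j} (dX_{ij})^2 = \mathrm{tr}(dX^T dX) , \]
the (unweighted, Euclidean) squared Frobenius norm. Under our principle of equivalence (using $dX = dW\,W^{-1}$), we should therefore have
\[ \langle dW, dW \rangle_W = \langle dW\,W^{-1}, dW\,W^{-1} \rangle_I = \mathrm{tr}(W^{-T} dW^T dW\, W^{-1}) . \]
It follows that the Riemannian tensor has the form
\[ g_{ij,kl}(W) = \delta_{ik} \left( W^{-1} W^{-T} \right)_{jl} , \qquad \langle dW, dW \rangle_W = \sum_{i,j,k,l} g_{ij,kl}(W)\, dW_{ij}\, dW_{kl} . \]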
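The invariance requirement can be checked numerically. This is a minimal sketch under stated assumptions: `W`, `Y`, and `dW` are random matrices chosen for illustration (shifted by a multiple of the identity so that they are very likely well conditioned), and the helper name `inner_at` is hypothetical. It evaluates the inner product at $W$ through the pulled-back deviation $dX = dW\,W^{-1}$, compares it with the inner product of $dW\,Y$ at $WY$, and evaluates the same quantity from the component form $\delta_{ik}(W^{-1}W^{-T})_{jl}$ of the metric.

```python
import numpy as np

def inner_at(W, dA, dB):
    """Inner product <dA, dB>_W = tr((dA W^{-1})^T (dB W^{-1}))."""
    Winv = np.linalg.inv(W)
    return float(np.trace((dA @ Winv).T @ (dB @ Winv)))

rng = np.random.default_rng(1)
n = 4
W = rng.standard_normal((n, n)) + n * np.eye(n)   # assumed invertible
Y = rng.standard_normal((n, n)) + n * np.eye(n)   # assumed invertible
dW = rng.standard_normal((n, n))

# Invariance: the same value at W and at WY.
lhs = inner_at(W, dW, dW)
rhs = inner_at(W @ Y, dW @ Y, dW @ Y)

# At the identity the inner product is the squared Frobenius norm.
at_identity = inner_at(np.eye(n), dW, dW)
frobenius_sq = float(np.linalg.norm(dW, "fro") ** 2)

# Component form: sum over i, j, l of dW_ij (W^{-1} W^{-T})_jl dW_il.
Winv = np.linalg.inv(W)
M = Winv @ Winv.T
quad = float(np.einsum("ij,jl,il->", dW, M, dW))

print(lhs, rhs, quad)
print(at_identity, frobenius_sq)
```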
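We can determine an explicit form for the natural gradient using the principle of invariance. We interpret the ordinary gradient $\nabla L$ as the gradient measured with the inner product at $I$, and require the natural gradient $\tilde{\nabla} L$ to produce the same first-order change under the inner product at $W$:
\[ \langle \tilde{\nabla} L, dW \rangle_W = \langle \nabla L, dW \rangle_I . \]
We thus have (using the definition of the inner product)
\[ \mathrm{tr}(W^{-T}\, \tilde{\nabla} L^T\, dW\, W^{-1}) = \mathrm{tr}(\nabla L^T\, dW) . \]
Using the commuting properties of trace we find
\[ \mathrm{tr}(W^{-1} W^{-T}\, \tilde{\nabla} L^T\, dW) = \mathrm{tr}(\nabla L^T\, dW) . \]
Since this must be true for arbitrary $dW$, we must have
\[ W^{-1} W^{-T}\, \tilde{\nabla} L^T = \nabla L^T , \]
or
\[ \tilde{\nabla} L = \nabla L\, W^T W . \]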
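The result $\tilde{\nabla} L = \nabla L\, W^T W$ can be checked with a short numerical sketch. Everything named here is an assumption made for illustration: `grad` is a random stand-in for $\nabla L$, and the BSS gradient is taken to have the form $W^{-T} + z x^T$ with output $u = Wx$ and a hypothetical nonlinearity $z = -\tanh(u)$; the notes' own symbols for these quantities may differ. The sketch checks the defining trace relation for a random $dW$, then shows that right-multiplication by $W^T W$ turns the assumed BSS gradient into $(I + z u^T) W$, which needs no matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
W = rng.standard_normal((n, n)) + n * np.eye(n)   # assumed invertible
Winv = np.linalg.inv(W)

# Stand-in for the ordinary gradient dL/dW (random, for illustration only).
grad = rng.standard_normal((n, n))
nat_grad = grad @ W.T @ W                         # natural gradient

# Defining relation: <nat_grad, dW>_W equals <grad, dW>_I.
dW = rng.standard_normal((n, n))
lhs = float(np.trace(Winv.T @ nat_grad.T @ dW @ Winv))
rhs = float(np.trace(grad.T @ dW))
print(lhs, rhs)

# Assumed BSS gradient W^{-T} + z x^T with u = W x.
x = rng.standard_normal(n)
u = W @ x
z = -np.tanh(u)                                   # hypothetical nonlinearity
bss_grad = Winv.T + np.outer(z, x)                # requires the inverse of W
bss_nat = bss_grad @ W.T @ W                      # natural-gradient form
inverse_free = (np.eye(n) + np.outer(z, u)) @ W   # same matrix, no inverse
print(np.max(np.abs(bss_nat - inverse_free)))
```

The last line prints the largest entrywise difference between the two forms, which should be at the level of floating-point rounding.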







