Application of Information Theory to Blind Source Separation
BSS
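Let
$$x = As,$$
where $s$ is a vector of statistically independent sources $s_i$ and $A$ is an unknown mixing matrix. Now, not knowing either $A$ or $s$, we seek an unmixing matrix $W$ such that
$$u = Wx$$
recovers $s$. Each output is then passed through a nonlinearity, $y_i = g_i(u_i)$.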
If we maximize the joint entropy of the outputs,
$$H(y) = \sum_i H(y_i) - I(y),$$
we should (1) maximize each $H(y_i)$ and (2) minimize the mutual information $I(y)$ among the outputs. As mentioned before, the $H(y_i)$ are maximized when (and if) the outputs are uniformly distributed. The mutual information is minimized when the outputs are all independent. Achieving both of these exactly requires that each $g_i$ have the form of the CDF of $s_i$. So we might contemplate modifying $W$ and also modifying $g$. Or we might (as Bell and Sejnowski do) fix $g$ and not worry about the mismatch. This corresponds to the assumption that $p(s_i)$ is super-Gaussian (heavier tails than a Gaussian has). We can write
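$$H(y_i) = -E[\log p(y_i)],$$
where we have
$$p(y_i) = \frac{p(u_i)}{\left|\partial y_i / \partial u_i\right|},$$
so that
$$H(y_i) = -E\left[\log \frac{p(u_i)}{\left|\partial y_i / \partial u_i\right|}\right].$$
Thus
$$\sum_i H(y_i) = -\sum_i E\left[\log \frac{p(u_i)}{\left|\partial y_i / \partial u_i\right|}\right].$$
Then
$$H(y) = -I(y) - \sum_i E\left[\log \frac{p(u_i)}{\left|\partial y_i / \partial u_i\right|}\right].$$
In the case that
$$p(u_i) = \left|\frac{\partial y_i}{\partial u_i}\right|,$$
the last term vanishes. In other words, we ideally want $y_i = g_i(u_i)$ to be the CDF of $u_i$. When this is not exactly the case (there is a mismatch), the last term remains and may interfere with the minimization of $I(y)$; it acts as an "error term". Now we note that
$$H(y) = H(x) + E[\log |J|],$$
where $J$ is the determinant of the Jacobian of the map from $x$ to $y$. The term $H(x)$ does not depend on $W$.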
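The matching condition discussed above (each $g_i$ equal to the CDF of $u_i$) can be checked numerically. The following is a minimal sketch, not part of the original notes: it assumes a unit-scale Laplacian for $u$ as an example of a super-Gaussian density, uses the logistic function as a stand-in for a fixed, mismatched $g$, and estimates entropy with a simple histogram. The helper names `histogram_entropy` and `laplace_cdf` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def histogram_entropy(samples, bins=50):
    """Histogram estimate of differential entropy (nats) for samples in [0, 1]."""
    counts, edges = np.histogram(samples, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    width = edges[1] - edges[0]
    nz = p > 0
    # The density in each bin is p / width; entropy is -sum p * log(density).
    return -np.sum(p[nz] * np.log(p[nz] / width))

def laplace_cdf(u):
    """CDF of a zero-mean, unit-scale Laplacian (assumed source density)."""
    return np.where(u < 0, 0.5 * np.exp(u), 1.0 - 0.5 * np.exp(-u))

def logistic(u):
    """A fixed squashing function that does not match the Laplacian CDF."""
    return 1.0 / (1.0 + np.exp(-u))

u = rng.laplace(size=200_000)

# Matched case: g is the CDF of u, so y is uniform on [0, 1] (entropy near 0,
# which is the maximum for a variable confined to [0, 1]).
h_matched = histogram_entropy(laplace_cdf(u))

# Mismatched case: the output is not uniform, so its entropy is lower.
h_mismatched = histogram_entropy(logistic(u))

print(f"matched g:    H(y) ~ {h_matched:.4f}")
print(f"mismatched g: H(y) ~ {h_mismatched:.4f}")
```

The gap between the two estimates corresponds to the mismatch term $E[\log(p(u_i)/|\partial y_i/\partial u_i|)]$, which is zero only when $g_i$ is the CDF of $u_i$.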
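Now we come to an important concept: we would like to compute the derivative of $E[\log |J|]$ with respect to $W$, but we cannot compute the expectation. We make the stochastic gradient approximation:
$$E[\log |J|] \approx \log |J|,$$
that is, we replace the expectation by its value at the current sample $x$. To differentiate with respect to $W$, we will consider the elements:
$$\frac{\partial}{\partial w_{ij}} \log |J| = \frac{\partial}{\partial w_{ij}} \log |\det W| + \frac{\partial}{\partial w_{ij}} \sum_k \log \left|\frac{\partial y_k}{\partial u_k}\right|,$$
since
$$J = \det(W) \prod_k \frac{\partial y_k}{\partial u_k}.$$
Thus, for the first term,
$$\frac{\partial}{\partial W} \log |\det W| = W^{-T}.$$
(See Appendix E of Moon and Stirling.) Looking at the second term,
$$\frac{\partial}{\partial w_{ij}} \sum_k \log \left|\frac{\partial y_k}{\partial u_k}\right| = x_j \, \frac{\partial}{\partial u_i} \log \left|\frac{\partial y_i}{\partial u_i}\right|,$$
since $\partial u_i / \partial w_{ij} = x_j$. Let us write
$$p_i(u_i) = \left|\frac{\partial y_i}{\partial u_i}\right|.$$
This looks like a density, and ideally would be so, as discussed above. But we can think of it as simply a function. We thus find, stacking all the results,
$$\frac{\partial}{\partial W} \log |J| = W^{-T} + \begin{bmatrix} \frac{\partial}{\partial u_1} \log p_1(u_1) \\ \vdots \\ \frac{\partial}{\partial u_N} \log p_N(u_N) \end{bmatrix} x^T.$$
This gives us the learning rule:
$$\Delta W \propto \frac{\partial}{\partial W} \log |J|.$$
We will let
$$\phi_i(u_i) = \frac{\partial}{\partial u_i} \log p_i(u_i) = \frac{p_i'(u_i)}{p_i(u_i)}$$
be the learning nonlinearity, also called in the literature the score function. Then
$$\Delta W \propto W^{-T} + \phi(u) x^T,$$
where $\phi(u) = [\phi_1(u_1), \ldots, \phi_N(u_N)]^T$.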
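The gradient expression $W^{-T} + \phi(u) x^T$ for $\log |J|$ can be sanity-checked against finite differences. The sketch below is illustrative only and assumes the logistic function for every $g_i$ (the notes do not state which fixed $g$ is used here). In that case $p_i(u_i) = y_i (1 - y_i)$ and $\phi_i(u_i) = 1 - 2 y_i$. The function names are hypothetical.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_abs_jacobian(W, x):
    """log|J| = log|det W| + sum_i log g'(u_i), with u = W x and logistic g."""
    y = logistic(W @ x)
    return np.log(np.abs(np.linalg.det(W))) + np.sum(np.log(y * (1.0 - y)))

def analytic_gradient(W, x):
    """W^{-T} + phi(u) x^T, where phi(u) = 1 - 2 g(u) for logistic g."""
    y = logistic(W @ x)
    return np.linalg.inv(W).T + np.outer(1.0 - 2.0 * y, x)

def numerical_gradient(W, x, eps=1e-6):
    """Central finite differences of log|J| with respect to each w_ij."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            dW = np.zeros_like(W)
            dW[i, j] = eps
            grad[i, j] = (log_abs_jacobian(W + dW, x)
                          - log_abs_jacobian(W - dW, x)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)  # arbitrary, well away from singular
x = rng.normal(size=3)

max_error = np.max(np.abs(analytic_gradient(W, x) - numerical_gradient(W, x)))
print(f"largest discrepancy between the two gradients: {max_error:.2e}")
```

A small discrepancy supports both pieces of the derivation: the $W^{-T}$ term from $\log |\det W|$ and the outer product $\phi(u) x^T$ from the nonlinearity.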
With $g$ fixed in this way, the approach can only separate super-Gaussian (heavy-tailed) sources.
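As a rough illustration of the learning rule $\Delta W \propto W^{-T} + \phi(u) x^T$, the sketch below separates two synthetic super-Gaussian sources. Everything specific in it is an assumption made for the example: Laplacian sources, the mixing matrix `A`, a logistic $g$ (so $\phi(u) = 1 - 2y$), the step size, and the iteration count. It also averages the update over the whole batch instead of using one sample at a time, which is a smoother variant of the stochastic approximation described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent Laplacian (super-Gaussian) sources, mixed by an assumed A.
n_samples = 10_000
S = rng.laplace(size=(2, n_samples))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def infomax_step(W, X, learning_rate):
    """One batch-averaged update W <- W + lr * (W^{-T} + phi(u) x^T)."""
    Y = logistic(W @ X)
    phi = 1.0 - 2.0 * Y                      # score function for logistic g
    grad = np.linalg.inv(W).T + (phi @ X.T) / X.shape[1]
    return W + learning_rate * grad

def crosstalk(P):
    """Worst-case ratio of the small to the large entry in each row of a 2x2 matrix.

    For perfect separation W A is a scaled permutation and this ratio is 0.
    """
    P = np.abs(P)
    return np.max(np.min(P, axis=1) / np.max(P, axis=1))

W = np.eye(2)
crosstalk_before = crosstalk(W @ A)
for _ in range(3000):
    W = infomax_step(W, X, learning_rate=0.1)
crosstalk_after = crosstalk(W @ A)

print(f"crosstalk before: {crosstalk_before:.3f}")
print(f"crosstalk after:  {crosstalk_after:.3f}")
```

The product `W @ A` is only available here because the mixing matrix is known in a synthetic experiment; in a real blind setting one would have to judge separation from the outputs alone. Replacing the Laplacian sources with sub-Gaussian ones (for example, uniform) would be expected to break the separation, in line with the limitation noted above.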







