
Application of Information Theory to Blind Source Separation



Let $s_i(t), i=1,2,\ldots, n$ be a set of statistically independent signals. We will later examine some other assumptions, but for now assume simply that they are independent. The signals are processed according to


\begin{displaymath}
\mathbf{x}(t) = A \mathbf{s}(t).
\end{displaymath}


Now, not knowing either $\mathbf{s}(t)$ or $A$, we desire to determine a matrix $W$ so that


\begin{displaymath}
\mathbf{y}(t) = W\mathbf{x}(t) = WA \mathbf{s}(t)
\end{displaymath}


recovers $\mathbf{s}(t)$ as fully as possible. Let us take as a criterion the entropy at the output: $H(\mathbf{y})$. (Q: how did they know to try this? A: It seemed plausible, they tried it, and it worked! Moral: think about the implications of ideas, then see if it works.) Then, as shown in the exercises,


\begin{displaymath}
H(\mathbf{y}) = \sum_{i=1}^N H(y_i) - I(y_1, \ldots, y_N).
\end{displaymath}
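Before pressing on, a minimal numerical sketch of the mixing model may help fix ideas. The Laplacian sources and the particular mixing matrix $A$ below are hypothetical choices for illustration; with the ideal separator $W = A^{-1}$, the mixtures are unmixed exactly.

```python
# Sketch of the BSS setup: independent sources, unknown mixing matrix A.
# (Sources and A are illustrative choices, not from the notes.)
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 1000
s = rng.laplace(size=(n, T))       # independent, super-Gaussian sources s_i(t)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])         # mixing matrix (unknown in practice)
x = A @ s                          # observed mixtures x(t) = A s(t)

# If A were known, W = A^{-1} would recover the sources exactly:
W = np.linalg.inv(A)
y = W @ x                          # y(t) = W x(t) = W A s(t) = s(t)
```

The whole problem, of course, is that $A$ is *not* known, so $W$ must be learned from $\mathbf{x}(t)$ alone.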


If we maximize $H(\mathbf{y})$, we should (1) maximize each $H(y_i)$ and (2) minimize $I(y_1, \ldots, y_N)$. As mentioned before, the $H(y_i)$ are maximized when (and if) the outputs are uniformly distributed. The mutual information is minimized when the outputs are all independent! Here each output is formed as $y_i = g(u_i)$, where $\mathbf{u} = W\mathbf{x}$ and $g$ is a monotonic squashing nonlinearity. Achieving both conditions exactly requires that $g$ have the form of the CDF of $s_i$. So we might contemplate modifying $W$, and also modifying $g$. Or we might (as Bell and Sejnowski do) fix $g$ and not worry about the mismatch. This corresponds to the assumption that $p(s_i)$ is super-Gaussian (heavier tails than a Gaussian has). We can write


\begin{displaymath}
H(y_i) = -E[\log p(y_i)],
\end{displaymath}


where we have


\begin{displaymath}
p(y_i) = \frac{p(u_i)}{\left|\frac{\partial y_i}{\partial u_i}\right|}
\end{displaymath}


so that


\begin{displaymath}
H(y_i) = -E\left[\log \frac{p(u_i)}{\left|\frac{\partial y_i}{\partial u_i}\right|}\right]
\end{displaymath}

and




\begin{displaymath}
H(\mathbf{y}) = -\sum_{i=1}^N E\left[\log \frac{p(u_i)}{\left|\frac{\partial y_i}{\partial u_i}\right|}\right] - I(y_1, \ldots, y_N).
\end{displaymath}

Differentiating with respect to $W$,




\begin{displaymath}
\frac{\partial H(\mathbf{y})}{\partial W} = -\frac{\partial I(\mathbf{y})}{\partial W} - \frac{\partial}{\partial W} \sum_{i=1}^N E\left[\log \frac{p(u_i)}{\left|\frac{\partial y_i}{\partial u_i}\right|}\right].
\end{displaymath}


In the case that $p(u_i) = \left|\frac{\partial y_i}{\partial u_i}\right|$, the last term vanishes. In other words, we ideally want $y_i = g_i(u_i)$ to be the CDF of the $u_i$. When this is not exactly the case (there is a mismatch), the last term remains and may interfere with the minimization of $I(\mathbf{y})$. We call the term $\frac{\partial}{\partial W} \sum_{i=1}^N E\left[\log \frac{p(u_i)}{\left|\frac{\partial y_i}{\partial u_i}\right|}\right]$ an ``error term''. Now we note that


\begin{displaymath}
H(\mathbf{y}) = -E[\log p(\mathbf{y})] = -E\left[\log \frac{p(\mathbf{x})}{|J(\mathbf{x})|}\right] = -E[\log p(\mathbf{x})] + E[\log|J(\mathbf{x})|].
\end{displaymath}


The term $-E[\log p(\mathbf{x})]$ does not depend upon $W$, so we obtain


\begin{displaymath}
\frac{\partial H(\mathbf{y})}{\partial W} = \frac{\partial}{\partial W} E[\log|J(\mathbf{x})|].
\end{displaymath}


Now we come to an important concept: we would like to compute the derivative, but cannot compute the expectation. We make the stochastic gradient approximation: $E[\log|J(\mathbf{x})|] \approx \log|J(\mathbf{x})|$. We simply throw the expectation away! Does it work? On average! Now it becomes a matter of grinding through the calculus to take the appropriate partial derivative. Since


\begin{displaymath}
J(\mathbf{x}) = \det \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & & \vdots \\
\frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}
\end{bmatrix}
\end{displaymath}


we will consider the elements:


\begin{displaymath}
\frac{\partial y_i}{\partial x_j} = \frac{\partial y_i}{\partial u_i}\frac{\partial u_i}{\partial x_j} = w_{ij} \frac{\partial y_i}{\partial u_i}
\end{displaymath}


since $\mathbf{u} = W\mathbf{x}$ and $y_i = g(u_i)$. Because of this connection, the partial $\frac{\partial y_i}{\partial u_j}$ is nonzero only when $i = j$. Combining these facts, we find


\begin{displaymath}
J(\mathbf{x}) = \det(W) \prod_{i=1}^N \left|\frac{\partial y_i}{\partial u_i}\right|
\end{displaymath}
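This factored form of the Jacobian determinant can be verified numerically. The sketch below, using a logistic $g$ and an arbitrarily chosen $W$ and $\mathbf{x}$ (illustrative values, not from the notes), compares the factored expression against a finite-difference Jacobian of the map $\mathbf{x} \mapsto g(W\mathbf{x})$.

```python
# Numerical check: J(x) = det(W) * prod_i dy_i/du_i for y = g(Wx).
import numpy as np

def g(u):
    # Logistic squashing nonlinearity, applied elementwise
    return 1.0 / (1.0 + np.exp(-u))

W = np.array([[1.2, -0.4],
              [0.3,  0.9]])        # arbitrary unmixing matrix
x = np.array([0.5, -1.0])          # arbitrary evaluation point
u = W @ x

# Analytic: det(W) times the product of the diagonal derivatives g'(u_i)
gprime = g(u) * (1 - g(u))         # g'(u) = y(1 - y) for the logistic
J_analytic = np.linalg.det(W) * np.prod(gprime)

# Finite-difference Jacobian matrix of x -> g(Wx), column by column
eps = 1e-6
Jmat = np.zeros((2, 2))
for j in range(2):
    dx = np.zeros(2)
    dx[j] = eps
    Jmat[:, j] = (g(W @ (x + dx)) - g(W @ (x - dx))) / (2 * eps)
J_numeric = np.linalg.det(Jmat)
```

Here $\det(W) > 0$ and each $g'(u_i) > 0$, so the absolute values are harmless; the two determinants agree to finite-difference accuracy.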




\begin{displaymath}
\begin{aligned}
\frac{\partial H(\mathbf{y})}{\partial W} &= \frac{\partial}{\partial W}\left(\log|\det W| + \sum_{i=1}^N \log\left|\frac{\partial y_i}{\partial u_i}\right|\right) \\
&= W^{-T} + \sum_{i=1}^N \frac{\partial}{\partial W} \log\left|\frac{\partial y_i}{\partial u_i}\right|.
\end{aligned}
\end{displaymath}


(See Appendix E of Moon and Stirling.) Looking at the second term,


\begin{displaymath}
\frac{\partial}{\partial w_{ij}} \sum_{k=1}^N \log\left|\frac{\partial y_k}{\partial u_k}\right| = \frac{1}{\partial y_i/\partial u_i}\,\frac{\partial^2 y_i}{\partial u_i^2}\, x_j
\end{displaymath}


since $\frac{\partial u_i}{\partial w_{ij}} = x_j$. Let us write


\begin{displaymath}
p(u_i) = \frac{\partial y_i}{\partial u_i}.
\end{displaymath}


This looks like a density, and ideally would be so, as discussed above. But we can think of this as simply a function. We thus find, stacking all the results,


\begin{displaymath}
\frac{\partial}{\partial W} \sum_{i=1}^N \log\left|\frac{\partial y_i}{\partial u_i}\right| = \frac{\partial p(\mathbf{u})/\partial\mathbf{u}}{p(\mathbf{u})}\, \mathbf{x}^T.
\end{displaymath}


This gives us the learning rule:


\begin{displaymath}
\frac{\partial H(\mathbf{y})}{\partial W} = W^{-T} + \left(\frac{\partial p(\mathbf{u})/\partial\mathbf{u}}{p(\mathbf{u})}\right) \mathbf{x}^T.
\end{displaymath}


We will let


\begin{displaymath}
\psi(\mathbf{u}) = -\frac{\partial p(\mathbf{u})/\partial\mathbf{u}}{p(\mathbf{u})}
\end{displaymath}


be the learning nonlinearity, also called in the literature the score function. Then


\begin{displaymath}
\frac{\partial H(\mathbf{y})}{\partial W} = W^{-T} - \psi(\mathbf{u}) \mathbf{x}^T.
\end{displaymath}
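To make the score function concrete, here is a small numerical check for two standard densities. The Gaussian and Laplacian examples are illustrative choices (not from the notes); in each case $\psi(u) = -\frac{d}{du}\log p(u)$ is estimated by central differences on $\log p$ and compared with the closed form.

```python
# Score functions psi(u) = -p'(u)/p(u) = -(d/du) log p(u),
# checked by finite differences for two illustrative densities.
import numpy as np

def score_numeric(logp, u, eps=1e-5):
    # Central-difference estimate of -(d/du) log p(u)
    return -(logp(u + eps) - logp(u - eps)) / (2 * eps)

u = np.array([-2.0, -0.5, 0.5, 2.0])   # evaluation points away from 0

# Gaussian: log p(u) = -u^2/2 + const  ->  psi(u) = u
psi_gauss = score_numeric(lambda v: -v**2 / 2, u)

# Laplacian (super-Gaussian): log p(u) = -|u| + const  ->  psi(u) = sign(u)
psi_lap = score_numeric(lambda v: -np.abs(v), u)
```

Note how the heavier-tailed Laplacian has a bounded score, while the Gaussian score grows linearly; this is one way the source model shows up in the learning nonlinearity.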


For the logistic nonlinearity

\begin{displaymath}
y = g(u) = \frac{1}{1+e^{-u}},
\end{displaymath}

this becomes

\begin{displaymath}
\frac{\partial H(\mathbf{y})}{\partial W} = W^{-T} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^T.
\end{displaymath}

If $g(u) = \tanh(u)$, then $\psi(u) = 2 \tanh(u)$.
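As a sanity check on the logistic learning rule, the analytic gradient $W^{-T} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^T$ can be compared against a finite-difference gradient of $\log|J(\mathbf{x})|$ with respect to $W$. The matrix $W$ and point $\mathbf{x}$ below are arbitrary illustrative choices.

```python
# Verify d/dW log|J(x)| = W^{-T} + (1 - 2y) x^T for logistic g.
import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_abs_J(W, x):
    # log|J(x)| = log|det W| + sum_i log g'(u_i), with g'(u) = y(1-y)
    y = g(W @ x)
    return np.log(abs(np.linalg.det(W))) + np.sum(np.log(y * (1 - y)))

W = np.array([[1.1,  0.2],
              [-0.3, 0.8]])        # arbitrary unmixing matrix
x = np.array([0.7, -0.4])          # arbitrary observation
y = g(W @ x)

# Analytic stochastic gradient from the derivation above
grad_analytic = np.linalg.inv(W).T + np.outer(1 - 2 * y, x)

# Finite-difference gradient, element by element
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(2):
    for j in range(2):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        grad_numeric[i, j] = (log_abs_J(W + dW, x) - log_abs_J(W - dW, x)) / (2 * eps)
```

In a learning algorithm one would ascend this gradient, $W \leftarrow W + \mu\,(W^{-T} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^T)$, one sample at a time, which is exactly where the stochastic gradient approximation earns its keep.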

This approach can only separate super-Gaussian distributions (heavy tails).

Copyright 2008, by the Contributing Authors. admin. (2006, May 17). Application of Information Theory to Blind Source Separation. USU OpenCourseWare. This work is licensed under a Creative Commons License.