
Application of Information Theory to Blind Source Separation



BSS

Let $s_i(t), i=1,2,\ldots, n$ be a set of statistically independent signals. We will later examine some other assumptions, but for now assume simply that they are independent. The signals are processed according to

 

\begin{displaymath}\xbf(t) = A \sbf(t).
\end{displaymath}
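For concreteness, here is a minimal NumPy sketch of this mixing model; the Laplacian sources, the particular 2 x 2 matrix, and the sample count are illustrative choices, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 10000

# n independent super-Gaussian (Laplacian) sources s_i(t), one per row
S = rng.laplace(size=(n, T))

# an arbitrary nonsingular mixing matrix A (unknown to the separator)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

# observed mixtures x(t) = A s(t)
X = A @ S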

 

Now, not knowing either $\sbf(t)$ or $A$, we desire to determine a matrix $W$ so that

 

\begin{displaymath}\ubf(t) = W\xbf(t) = WA \sbf(t)
\end{displaymath}

 

recovers $\sbf(t)$ as fully as possible. In the Bell and Sejnowski network, each output is then passed through a fixed monotonic nonlinearity, $y_i = g(u_i)$. Let us take as a criterion the entropy at the output, $H(\ybf)$. (Q: how did they know to try this? A: It seemed plausible, they tried it, and it worked! Moral: think about the implications of ideas, then see if it works.) Then, as shown in the exercises,

 

\begin{displaymath}H(\ybf) = \sum_{i=1}^N H(y_i) - I(y_1, \ldots, y_N).
\end{displaymath}

 

If we maximize $H(\ybf)$, we should (1) maximize each $H(y_i)$ and (2) minimize $I(y_1, \ldots, y_N)$. As mentioned before, the $H(y_i)$ are maximized when (and if) the outputs are uniformly distributed. The mutual information is minimized when the outputs are all independent! Achieving both of these exactly requires that $g$ have the form of the CDF of $s_i$. So we might contemplate modifying $W$, and also modifying $g$. Or we might (as Bell and Sejnowski do) fix $g$ and not worry about this. This corresponds to the assumption that $p(s_i)$ is super-Gaussian (heavier tails than a Gaussian).
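To see why the CDF is the right choice, apply the change-of-variables formula (used again below): if $g$ is the CDF of $u_i$, then $\partiald{y_i}{u_i} = p(u_i)$, and

\begin{displaymath}p(y_i) = \frac{p(u_i)}{\vert\partiald{y_i}{u_i}\vert} = \frac{p(u_i)}{p(u_i)} = 1, \qquad 0 \le y_i \le 1,
\end{displaymath}

so $y_i$ is uniform on $[0,1]$, the maximum-entropy distribution on that interval. We can write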

 

\begin{displaymath}H(y_i) = -E[\log p(y_i)],
\end{displaymath}

 

where we have

 

\begin{displaymath}p(y_i) = p(u_i)/\vert\partiald{y_i}{u_i}\vert
\end{displaymath}

 

so that

 

\begin{displaymath}H(y_i) = -E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert]
\end{displaymath}

 

Thus

 

\begin{displaymath}H(\ybf) = -\sum_{i=1}^N E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert] -
I(\ybf)
\end{displaymath}

 

Then

 

\begin{displaymath}\partiald{H(\ybf)}{W} = \partiald{-I(\ybf)}{W} - \partiald{}{W}
\sum_{i=1}^N E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert]
\end{displaymath}

 

In the case that $p(u_i) = \vert\partiald{y_i}{u_i}\vert$, the last term vanishes. In other words, we ideally want $y_i = g_i(u_i)$ to be the CDF of the $u_i$. When this is not exactly the case (there is a mismatch), the last term persists and may interfere with the minimization of $I(\ybf)$. We call the term $\partiald{}{W} \sum_{i=1}^N E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert]$ an "error term". Now we note that

 

\begin{displaymath}H(\ybf) = -E[\log p(\ybf)] = -E[\log p(\xbf)/\vert J(\xbf)\vert] = -E[\log
p(\xbf)] + E[\log \vert J(\xbf)\vert].
\end{displaymath}

 

The term $-E[\log p(\xbf)]$ does not depend upon $W$, so we obtain

 

\begin{displaymath}\partiald{H(\ybf)}{W} = \partiald{}{W} E[\log \vert J(\xbf)\vert].
\end{displaymath}

 

Now we come to an important concept: we would like to compute the derivative, but we cannot compute the expectation. We make the stochastic gradient approximation : $E[\log \vert J(\xbf)\vert] \approx \log \vert J(\xbf)\vert$. We just throw the expectation away! Does it work? On average!
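As a toy illustration of why this works on average (the quadratic objective, step size, and Gaussian samples below are hypothetical, purely for illustration), consider minimizing $E[(w-x)^2]$ by stepping along the gradient of a single sample:

import numpy as np

# minimize f(w) = E[(w - x)^2], whose minimizer is E[x] = 3.0,
# by following the single-sample gradient 2*(w - x)
rng = np.random.default_rng(1)
w, lr = 0.0, 0.01
for _ in range(5000):
    x = rng.normal(loc=3.0)    # one fresh sample replaces the expectation
    w -= lr * 2.0 * (w - x)    # noisy step, but correct in expectation
print(w)                       # hovers close to 3.0

Now it becomes a matter of grinding through the calculus to take the appropriate partial derivative. Since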

 

\begin{displaymath}J(\xbf) = \det \begin{bmatrix}\partiald{y_1}{x_1} & \cdots & \partiald{y_1}{x_n} \\
\vdots & \ddots & \vdots \\
\partiald{y_n}{x_1} & \cdots & \partiald{y_n}{x_n}
\end{bmatrix}
\end{displaymath}

 

we will consider the elements:

 

\begin{displaymath}\partiald{y_i}{x_j} = \partiald{y_i}{u_i}\partiald{u_i}{x_j} =
w_{ij} \partiald{y_i}{u_i}
\end{displaymath}

 

since $\ubf = W \xbf$ and $y_i = g(u_i)$. Because of this connection, the partial $\partiald{y_i}{u_j}$ is nonzero only when $i = j$. Combining these facts, we find

 

\begin{displaymath}J(\xbf) = \det(W) \prod_{i=1}^N \vert\partiald{y_i}{u_i}\vert
\end{displaymath}
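Before continuing, a quick numerical sanity check of this factorization (the random $W$ and $\xbf$ and the logistic $g$ are arbitrary choices for the check):

import numpy as np

rng = np.random.default_rng(2)
n = 3
W = rng.normal(size=(n, n))
x = rng.normal(size=(n, 1))
u = W @ x

gprime = np.exp(-u) / (1 + np.exp(-u))**2   # g'(u) for logistic g; always > 0

# the full Jacobian dy/dx has entries g'(u_i) w_ij, i.e. diag(g'(u)) @ W
J = np.diag(gprime.ravel()) @ W
print(np.linalg.det(J), np.linalg.det(W) * np.prod(gprime))   # equal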

 

Thus

 

\begin{displaymath}\begin{aligned}\partiald{H(\ybf)}{W} &= \partiald{}{W} \log\left( \vert\det W\vert \prod_{i=1}^N \vert\partiald{y_i}{u_i}\vert \right) \\
&= W^{-T} + \sum_{i=1}^N \partiald{}{W} \log \vert\partiald{y_i}{u_i}\vert.
\end{aligned}
\end{displaymath}

 

(See appendix E of Moon and Stirling.) Looking at the second term,

 

\begin{displaymath}\partiald{}{w_{ij}} \sum_{k=1}^N \log \vert\partiald{y_k}{u_k}\vert =
\frac{1}{\partiald{y_i}{u_i}} \partiald{}{w_{ij}} \partiald{y_i}{u_i} =
\frac{1}{\partiald{y_i}{u_i}} \partiale{y_i}{u_i}\, x_j
\end{displaymath}

 

since $\partiald{u_i}{w_{ij}} = x_j$. Let us write

 

\begin{displaymath}p(u_i) = \partiald{y_i}{u_i}.
\end{displaymath}

 

This looks like a density, and ideally would be so, as discussed above. But we can think of this as simply a function. We thus find, stacking all the results,

 

\begin{displaymath}\partiald{}{W} \sum_{i=1}^N \log \vert\partiald{y_i}{u_i}\vert =
\frac{\partiald{p(\ubf)}{\ubf}}{p(\ubf)} \xbf^T.
\end{displaymath}

 

Combining the two terms gives us the learning rule:

 

\begin{displaymath}\partiald{H(\ybf)}{W} = W^{-T} +
\left(\frac{\partiald{p(\ubf)}{\ubf}}{p(\ubf)}\right) \xbf^T.
\end{displaymath}

 

We will let

 

\begin{displaymath}\psi(\ubf) = - \frac{\partiald{p(\ubf)}{\ubf}}{p(\ubf)}
\end{displaymath}

 

be the learning nonlinearity, also called in the literature the score function. Then

 

\begin{displaymath}\partiald{H(\ybf)}{W} = W^{-T} - \psi(\ubf) \xbf^T.
\end{displaymath}
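Ascending this gradient one sample at a time gives the separation algorithm. Here is a minimal NumPy sketch (the learning rate, epoch count, and the choice $g = \tanh$, whose score function $2\tanh(u)$ is derived in the second example below, are all illustrative, not prescribed by the lecture):

import numpy as np

def infomax_bss(X, lr=0.005, epochs=50, seed=0):
    # X: (n, T) array of mixtures, one column per sample x(t).
    # Returns W; the rows of W @ X then estimate the sources,
    # up to the usual permutation and scaling ambiguity of BSS.
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = np.eye(n)
    for _ in range(epochs):
        for t in rng.permutation(T):
            x = X[:, t:t+1]                   # single sample, shape (n, 1)
            u = W @ x
            psi = 2.0 * np.tanh(u)            # score function for g = tanh
            W += lr * (np.linalg.inv(W).T - psi @ x.T)   # dH/dW ascent step
    return W

Applied to the mixtures generated at the start of the section, infomax_bss(X) should recover the Laplacian sources up to permutation and scale.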

 


\begin{example}
Let
\begin{displaymath}y = g(u) = \frac{1}{1+e^{-u}}.
\end{displaymath}
Then $\partiald{y}{u} = y(1-y)$ and $\frac{\partiald{p(u)}{u}}{p(u)} = 1 - 2y$, so
\begin{displaymath}\partiald{H(\ybf)}{W} = W^{-T} + (\onebf - 2 \ybf)\xbf^T.
\end{displaymath}
\end{example}

\begin{example}
If $g(u) = \tanh(u)$, then $\psi(u) = 2 \tanh(u).$
\end{example}
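Both score functions can be verified symbolically; a quick check with sympy:

import sympy as sp

u = sp.symbols('u', real=True)
for g in (1 / (1 + sp.exp(-u)), sp.tanh(u)):
    p = sp.diff(g, u)                         # p(u) = dy/du = g'(u)
    psi = sp.simplify(-sp.diff(sp.log(p), u))
    print(psi)
# logistic g: psi = 2 g(u) - 1 (equivalently tanh(u/2))
# g = tanh:   psi = 2 tanh(u)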

This approach can only separate sources with super-Gaussian (heavy-tailed) distributions.

Copyright 2008, by the Contributing Authors. This work is licensed under a Creative Commons License.