
Application of Information Theory to Blind Source Separation



BSS

Let $s_i(t), i=1,2,\ldots, n$ be a set of statistically independent signals. We will later examine some other assumptions, but for now assume simply that they are independent. The signals are processed according to

 

\begin{displaymath}\xbf(t) = A \sbf(t).
\end{displaymath}
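The mixing model can be simulated in a few lines. The following is only a sketch: the Laplacian source distribution, the particular matrix `A`, the sample count, and the seed are all assumptions made for illustration and are not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

n_sources, n_samples = 2, 5000
# Independent, heavy-tailed (Laplacian) sources: an assumed choice for illustration.
s = rng.laplace(size=(n_sources, n_samples))
# Hypothetical mixing matrix; in the blind setting the receiver does not know it.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s  # each column is one observation x(t) = A s(t)

corr_s = np.corrcoef(s)[0, 1]  # sample correlation between the two sources
corr_x = np.corrcoef(x)[0, 1]  # sample correlation between the two mixtures
print(corr_s, corr_x)
```

The sources are nearly uncorrelated while the mixtures are strongly correlated, which is the coupling that W is meant to undo.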

 

Now, not knowing either $\sbf(t)$ or A, we desire to determine a matrix W so that

 

\begin{displaymath}\ybf(t) = W\xbf(t) = WA \sbf(t)
\end{displaymath}

 

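recovers $\sbf(t)$ as fully as possible. In what follows we write $\ubf = W\xbf$ for the unmixed signals and pass each component through a monotone nonlinearity, $y_i = g(u_i)$, so that $\ybf$ denotes the output after the nonlinearity. Let us take as a criterion the joint entropy at the output: $H(\ybf)$. (Q: How did they know to try this? A: It seemed plausible, they tried it, and it worked! Moral: think about the implications of an idea, then see if it works.) Then, as shown in the exercises,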

 

\begin{displaymath}H(\ybf) = \sum_{i=1}^N H(y_i) - I(y_1, \ldots, y_N).
\end{displaymath}
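The decomposition can be checked numerically in the discrete case, which is an analogue of the continuous statement above rather than the same thing. In this sketch the joint pmf is a hypothetical example, and the mutual information is computed independently as a Kullback-Leibler divergence so that the check is not circular.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a pmf given as an array (zeros are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Hypothetical joint pmf of two dependent discrete outputs (y1, y2).
p_joint = np.array([[0.30, 0.10],
                    [0.05, 0.55]])
p1 = p_joint.sum(axis=1)  # marginal of y1
p2 = p_joint.sum(axis=0)  # marginal of y2

# Mutual information computed directly as KL(joint || product of marginals).
mutual_info = float(np.sum(p_joint * np.log(p_joint / np.outer(p1, p2))))

lhs = entropy(p_joint)                          # H(y)
rhs = entropy(p1) + entropy(p2) - mutual_info   # sum of H(y_i) minus I
print(lhs, rhs)
```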

 

If we maximize $H(\ybf)$, we should (1) maximize each $H(y_i)$ and (2) minimize $I(y_1, \ldots, y_N)$. As mentioned before, the $H(y_i)$ are maximized when (and if) the outputs are uniformly distributed. The mutual information is minimized when the outputs are all independent! Achieving both of these exactly requires that $g$ have the form of the CDF of $s_i$. So we might contemplate modifying W, and also modifying g. Or we might (as Bell and Sejnowski do) fix g and not worry about the mismatch. With the sigmoidal choices of g used in the examples at the end of this section, this corresponds to the assumption that $p(s_i)$ is super-Gaussian (heavier tails than a Gaussian has). We can write

 

\begin{displaymath}H(y_i) = -E[\log p(y_i)],
\end{displaymath}

 

where we have

 

\begin{displaymath}p(y_i) = p(u_i)/\vert\partiald{y_i}{u_i}\vert
\end{displaymath}

 

so that

 

\begin{displaymath}H(y_i) = -E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert]
\end{displaymath}

 

Thus

 

\begin{displaymath}H(\ybf) = -\sum_{i=1}^N E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert] -
I(\ybf)
\end{displaymath}

 

Then

 

\begin{displaymath}\partiald{H(\ybf)}{W} = \partiald{-I(\ybf)}{W} - \partiald{}{W}
\sum_{i=1}^N E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert]
\end{displaymath}
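The ratio $p(u_i)/\vert\partiald{y_i}{u_i}\vert$ in these expressions is the density of $y_i$. A small experiment, offered as a sketch under assumed distributions, shows what happens when g is or is not the CDF of its input: a logistic-distributed u passed through the logistic sigmoid (its own CDF) gives a nearly uniform output with entropy near the maximum of 0 nats on [0, 1], while a Gaussian u through the same sigmoid gives a lower entropy. The histogram entropy estimator, the bin count, and the sample sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hist_entropy(y, bins=50):
    """Histogram estimate of the differential entropy of y on [0, 1], in nats."""
    density, edges = np.histogram(y, bins=bins, range=(0.0, 1.0), density=True)
    width = edges[1] - edges[0]
    nz = density > 0
    return float(-np.sum(density[nz] * np.log(density[nz])) * width)

# Matched case: u has the standard logistic density, whose CDF is the sigmoid.
y_matched = sigmoid(rng.logistic(size=200_000))

# Mismatched case: u is Gaussian, so the sigmoid is not its CDF.
y_mismatch = sigmoid(rng.normal(size=200_000))

h_matched = hist_entropy(y_matched)
h_mismatch = hist_entropy(y_mismatch)
print(h_matched, h_mismatch)
```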

 

In the case that $p(u_i) = \vert\partiald{y_i}{u_i}\vert$, the last term vanishes, since the logarithm of the ratio is then zero. In other words, we ideally want $y_i = g_i(u_i)$ to be the CDF of $u_i$. When this is not exactly the case (there is a mismatch), the last term remains and may interfere with the minimization of $I(\ybf)$. We call the term $\partiald{}{W}
\sum_{i=1}^N E[\log p(u_i)/\vert\partiald{y_i}{u_i}\vert]$ an "error term". Now we note that

 

\begin{displaymath}H(\ybf) = -E[\log p(\ybf)] = -E[\log p(\xbf)/\vert J(\xbf)\vert] = -E[\log
p(\xbf)] + E[\log \vert J(\xbf)\vert].
\end{displaymath}

 

The term $ -E[\log p(\xbf)]$ does not depend upon W, so we obtain

 

\begin{displaymath}\partiald{H(\ybf)}{W} = \partiald{}{W} E[\log \vert J(\xbf)\vert].
\end{displaymath}

 

Now we come to an important concept: We would like to compute the derivative, but we cannot compute the expectation. We make the stochastic gradient approximation: $E[\log \vert J(\xbf)\vert] \approx \log \vert J(\xbf)\vert$, evaluated at the current sample $\xbf$. We just throw the expectation away! Does it work? On average! Now it becomes a matter of grinding through the calculus to take the appropriate partial derivative. Since

 

\begin{displaymath}J(\xbf) = \det \begin{bmatrix}\partiald{y_1}{x_1} & \cdots & \partiald{y_1}{x_n} \\
\vdots & & \vdots \\
\partiald{y_n}{x_1} & \cdots & \partiald{y_n}{x_n}
\end{bmatrix}
\end{displaymath}

 

we will consider the elements:

 

\begin{displaymath}\partiald{y_i}{x_j} = \partiald{y_i}{u_i}\partiald{u_i}{x_j} =
w_{ij} \partiald{y_i}{u_i}
\end{displaymath}

 

since $\ubf = W \xbf$ and $y_i = g(u_i)$. Because of this connection, the partial $\partiald{y_i}{u_j}$ is nonzero only when $i=j$. Combining these facts, we find

 

\begin{displaymath}J(\xbf) = \det(W) \prod_{i=1}^N \vert\partiald{y_i}{u_i}\vert
\end{displaymath}
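This factorization is easy to confirm numerically. The sketch below assumes the logistic sigmoid for g and uses a hypothetical W and x picked only for the check; it compares the determinant of a finite-difference Jacobian of $\ybf = g(W\xbf)$ with the closed form.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(W, x):
    """Map an observation x to outputs y = g(W x), with g the logistic sigmoid."""
    return sigmoid(W @ x)

# Hypothetical unmixing matrix and observation, chosen only for the check.
W = np.array([[1.2, -0.5],
              [0.3, 0.8]])
x = np.array([0.7, -1.1])

# Numerical Jacobian dy_i/dx_j by central differences.
eps = 1e-6
n = x.size
jac = np.zeros((n, n))
for j in range(n):
    step = np.zeros(n)
    step[j] = eps
    jac[:, j] = (forward(W, x + step) - forward(W, x - step)) / (2 * eps)

y = forward(W, x)
dy_du = y * (1.0 - y)  # derivative of the sigmoid evaluated at each u_i

numeric = np.linalg.det(jac)
closed_form = np.linalg.det(W) * np.prod(np.abs(dy_du))
print(numeric, closed_form)
```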

 

Thus

 

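\begin{displaymath}\begin{aligned}\partiald{H(\ybf)}{W} &= \partiald{}{W} \log \vert\det W\vert + \partiald{}{W} \sum_{i=1}^N \log \vert\partiald{y_i}{u_i}\vert \\
&= W^{-T} + \sum_{i=1}^N \partiald{}{W} \log \vert\partiald{y_i}{u_i}\vert.
\end{aligned}
\end{displaymath}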

 

(See Appendix E of Moon and Stirling.) Looking at the second term,

 

\begin{displaymath}\partiald{}{w_{ij}} \sum_{k=1}^N \log \vert\partiald{y_k}{u_k}\vert =
\sum_{k=1}^N \frac{1}{\partiald{y_k}{u_k}} \partiald{}{w_{ij}} \partiald{y_k}{u_k} =
\frac{1}{\partiald{y_i}{u_i}} \frac{\partial^2 y_i}{\partial u_i^2} x_j
\end{displaymath}

 

since $\partiald{u_i}{w_{ij}} = x_j$. Let us write

 

\begin{displaymath}p(u_i) = \partiald{y_i}{u_i}.
\end{displaymath}

 

This looks like a density, and ideally would be so, as discussed above. But we can think of this as simply a function. We thus find, stacking all the results,

 

\begin{displaymath}\partiald{}{W} \sum_{i=1}^N \log \vert\partiald{y_i}{u_i}\vert =
\frac{\partiald{p(\ubf)}{\ubf}}{p(\ubf)} \xbf^T.
\end{displaymath}

 

This gives us the learning rule:

 

\begin{displaymath}\partiald{H(\ybf)}{W} = W^{-T} +
\left(\frac{\partiald{p(\ubf)}{\ubf}}{p(\ubf)}\right) \xbf^T.
\end{displaymath}
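The gradient expression can be sanity-checked against finite differences. This sketch assumes the logistic sigmoid for g, for which $p(u) = y(1-y)$ and the ratio of derivative to function is $1 - 2y$; the matrix W and the sample x are hypothetical values chosen only for the check.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_abs_jacobian(W, x):
    """log|J(x)| = log|det W| + sum_i log(dy_i/du_i) for sigmoid outputs."""
    y = sigmoid(W @ x)
    return np.log(np.abs(np.linalg.det(W))) + np.sum(np.log(y * (1.0 - y)))

def analytic_gradient(W, x):
    """W^{-T} + (p'(u)/p(u)) x^T, where p'(u)/p(u) = 1 - 2y for the sigmoid."""
    y = sigmoid(W @ x)
    return np.linalg.inv(W).T + np.outer(1.0 - 2.0 * y, x)

# Hypothetical values, chosen only for the check.
W = np.array([[1.2, -0.5],
              [0.3, 0.8]])
x = np.array([0.7, -1.1])

# Central-difference derivative of log|J(x)| with respect to each w_ij.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (log_abs_jacobian(W + dW, x)
                         - log_abs_jacobian(W - dW, x)) / (2 * eps)

max_error = float(np.max(np.abs(numeric - analytic_gradient(W, x))))
print(max_error)
```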

 

We will let

 

\begin{displaymath}\psi(\ubf) = - \frac{\partiald{p(\ubf)}{\ubf}}{p(\ubf)}
\end{displaymath}

 

be the learning nonlinearity, also called in the literature the score function. Then

 

\begin{displaymath}\partiald{H(\ybf)}{W} = W^{-T} - \psi(\ubf) \xbf^T.
\end{displaymath}

 


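\begin{example}
Let
\begin{displaymath}y = g(u) = \frac{1}{1+e^{-u}}.
\end{displaymath}
Then $\partiald{y}{u} = y(1-y)$ and $\frac{\partial^2 y}{\partial u^2} = y(1-y)(1-2y)$, so that $\psi(u) = 2y - 1$ and
\begin{displaymath}\partiald{H(\ybf)}{W} = W^{-T} + (\onebf - 2 \ybf)\xbf^T.
\end{displaymath}
\end{example}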
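The sigmoid rule can be tried on synthetic data. What follows is a minimal sketch, not the procedure of the notes: it averages the gradient over a whole batch instead of using a single sample, and the Laplacian sources, the mixing matrix `A`, the step size, the iteration count, and the identity initialization are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def objective(W, x):
    """Sample average of log|J(x)| = log|det W| + sum_i log(y_i (1 - y_i))."""
    u = W @ x
    # log(y (1 - y)) written with logaddexp for numerical stability.
    log_slope = -(np.logaddexp(0.0, u) + np.logaddexp(0.0, -u))
    return float(np.log(np.abs(np.linalg.det(W))) + np.mean(np.sum(log_slope, axis=0)))

# Assumed test setup: two Laplacian sources and a hypothetical mixing matrix.
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

W = np.eye(2)
step_size = 0.01  # assumed value; kept small for stable ascent on this data
initial_objective = objective(W, x)
for _ in range(2000):
    y = sigmoid(W @ x)
    # Batch average of W^{-T} + (1 - 2y) x^T over all samples.
    grad = np.linalg.inv(W).T + (1.0 - 2.0 * y) @ x.T / x.shape[1]
    W = W + step_size * grad
final_objective = objective(W, x)

print(initial_objective, final_objective)
print(W @ A)  # ideally close to a scaled permutation matrix
```

Inspecting `W @ A` indicates how well the sources were separated: the closer it is to a scaled permutation matrix, the less crosstalk remains.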

\begin{example}
If $g(u) = \tanh(u)$, then $\psi(u) = 2 \tanh(u).$
\end{example}
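Both examples follow from $\psi(u) = -g''(u)/g'(u)$, which is the definition above with $p(u) = \partiald{y}{u}$. As a sketch, the closed forms can be compared with finite-difference derivatives of g; the helper `score_from_g`, the grid of test points, and the step `eps` are illustrative choices.

```python
import numpy as np

def score_from_g(g, u, eps=1e-4):
    """psi(u) = -g''(u) / g'(u), with both derivatives from central differences."""
    g_prime = (g(u + eps) - g(u - eps)) / (2 * eps)
    g_second = (g(u + eps) - 2 * g(u) + g(u - eps)) / eps**2
    return -g_second / g_prime

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-2.0, 2.0, 9)

psi_sigmoid = score_from_g(sigmoid, u)  # closed form: 2 y - 1
psi_tanh = score_from_g(np.tanh, u)     # closed form: 2 tanh(u)

err_sigmoid = float(np.max(np.abs(psi_sigmoid - (2 * sigmoid(u) - 1))))
err_tanh = float(np.max(np.abs(psi_tanh - 2 * np.tanh(u))))
print(err_sigmoid, err_tanh)
```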

With a fixed sigmoidal g such as these, this approach can only separate sources with super-Gaussian (heavy-tailed) distributions.

Copyright 2008, by the Contributing Authors. Cite/attribute Resource. admin. (2006, May 17). Application of Information Theory to Blind Source Separation. Retrieved November 23, 2009, from Free Online Course Materials — USU OpenCourseWare Web site: http://ocw.usu.edu/Electrical_and_Computer_Engineering/Information_Theory/lecture4_2.htm. This work is licensed under a Creative Commons License. Creative Commons License