
Application of Information Theory to Blind Source Separation



The principles of information theory can be applied to the blind source separation problem. We will briefly state the problem, then develop steps toward its solution.

Background and some preliminary results

We consider first the case of adapting a processing function g which operates on a scalar X via Y = g(X), in order to maximize the mutual information between X and Y. That is, we assume that g(X) = g(X; w, w_0) for some parameters w and w_0, which are to be chosen to maximize I(X;Y). We assume that g is a deterministic function. We have

I(X;Y) = H(Y) - H(Y|X).

But since g is deterministic, H(Y|X) = H(g(X)|X) = 0, so the mutual information is maximized when H(Y) is maximized. (Strictly speaking, with differential entropy H(Y|X) is not zero for a deterministic map; but since H(Y|X) does not depend on the parameters, maximizing H(Y) still maximizes I(X;Y).) Now, assuming the range of g is restricted (a reasonable assumption), what form should g ideally take? The answer is the CDF of X. Draw a picture. Recall that

\begin{displaymath}f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|_{x=g^{-1}(y)} =
\left.\frac{f_X(x)}{|dy/dx|}\right|_{x=g^{-1}(y)}.\end{displaymath}

If g(x) = F_X(x), then dy/dx = f_X(x), and we get f_Y(y) = f_X(x)/f_X(x) = 1 on [0,1], the uniform density. Under the rule for transformations,
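This uniformizing property is easy to check numerically. The sketch below (a standard Gaussian input is an arbitrary choice for the demo) pushes samples through their own CDF and checks that the output has the mean and variance of a Uniform[0,1] variable:

```python
import math
import random

random.seed(1)

def gauss_cdf(x):
    """CDF of a standard Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Pass Gaussian samples through their own CDF; the output should be
# (approximately) uniform on [0, 1].
ys = [gauss_cdf(random.gauss(0.0, 1.0)) for _ in range(100_000)]

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
print(mean_y, var_y)  # Uniform[0,1] has mean 1/2 and variance 1/12
```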

\begin{displaymath}H(Y) = -E[\ln f_Y(Y)] = E\left[\ln\left|\frac{\partial y}{\partial x}\right|\right] - E[\ln f_X(X)].\end{displaymath}

But f_X(x) does not depend on our parameters, so we can ignore it. Of course, we may not know the pdf of X, and may not have the flexibility to choose g to be exactly its CDF. What is frequently done instead is to assume a particular functional form and fit its parameters. Take

\begin{displaymath}y = g(x) = \frac{1}{1+e^{-u}}, \qquad u = wx + w_0.\end{displaymath}
Then an adaptive scheme is to take

\begin{displaymath}\Delta w \propto \frac{\partial H}{\partial w} = \frac{\partial}{\partial w}\ln\left|\frac{\partial y}{\partial x}\right| = \left(\frac{\partial y}{\partial x}\right)^{-1}\frac{\partial}{\partial w}\frac{\partial y}{\partial x}.\end{displaymath}
As examined in the HW, we find

\begin{displaymath}\Delta w \propto \frac{1}{w} + x(1-2y).\end{displaymath}
Similarly, we find

\begin{displaymath}\Delta w_0 \propto 1-2y.\end{displaymath}
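Both gradients can be verified directly (this is the computation the HW asks for). Since y = 1/(1+e^{-u}) with u = wx + w_0, we have dy/du = y(1-y), so
\begin{displaymath}\frac{\partial y}{\partial x} = w\,y(1-y), \qquad \ln\left|\frac{\partial y}{\partial x}\right| = \ln|w| + \ln y + \ln(1-y).\end{displaymath}
Using \partial y/\partial w = y(1-y)x and \partial y/\partial w_0 = y(1-y),
\begin{displaymath}\frac{\partial}{\partial w}\ln\left|\frac{\partial y}{\partial x}\right| = \frac{1}{w} + \left(\frac{1}{y} - \frac{1}{1-y}\right)y(1-y)x = \frac{1}{w} + x(1-2y),\end{displaymath}
and the same computation with respect to w_0 yields 1-2y.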
We define by this means a weight update rule:

\begin{displaymath}w^{[k+1]} = w^{[k]} + \mu_w \Delta w\end{displaymath}
\begin{displaymath}w_0^{[k+1]} = w_0^{[k]} + \mu_0 \Delta w_0.\end{displaymath}
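The update rule can be run online as a sanity check. The sketch below adapts w and w_0 on Gaussian data (the input distribution, step sizes, and sample counts are illustrative assumptions, not from the notes); at convergence E[1-2Y] = 0, so the output mean should settle near 1/2:

```python
import math
import random

random.seed(0)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Online updates: Delta w = 1/w + x(1-2y),  Delta w0 = 1-2y.
w, w0 = 1.0, 0.0
mu_w, mu_0 = 0.01, 0.01
ys = []
for _ in range(20_000):
    x = random.gauss(0.0, 1.0)
    y = sigmoid(w * x + w0)
    w += mu_w * (1.0 / w + x * (1.0 - 2.0 * y))
    w0 += mu_0 * (1.0 - 2.0 * y)
    ys.append(y)

# Average the outputs after the transient has died down.
tail = ys[-5000:]
mean_tail = sum(tail) / len(tail)
print(w, w0, mean_tail)  # mean of y should be close to 1/2
```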


The effect of this learning rule is to drive Y to be as uniform as possible, given the form of g. We can generalize this to N inputs and N outputs. Suppose we take

\begin{displaymath}\mathbf{y} = g(W\mathbf{x} + \mathbf{w}_0),\end{displaymath}
where the function g is applied element-by-element, y_i = g\left(\sum_j W_{ij}x_j + w_{0,i}\right). Then
I(X;Y) = H(Y) - H(Y|X) = H(Y).

We want to determine W and \mathbf{w}_0 to maximize the joint entropy of the output, H(\mathbf{y}). Here W is a matrix and \mathbf{w}_0 is a vector. We have the pdf transformation equation
\begin{displaymath}f_{\mathbf{y}}(\mathbf{y}) = f_{\mathbf{x}}(\mathbf{x})\,\vert J\vert^{-1},\end{displaymath}
where J is the Jacobian of the transformation,  
\begin{displaymath}J = \det \begin{bmatrix}\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n}\\ \vdots & & \vdots \\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n}\end{bmatrix}.\end{displaymath}
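For the sigmoidal network above, the chain rule gives the Jacobian matrix as diag(y_i(1-y_i)) W, so J = det(W) \prod_i y_i(1-y_i). A quick finite-difference check (the dimensions and random values are chosen arbitrarily for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
W = rng.normal(size=(n, n))
w0 = rng.normal(size=n)
x = rng.normal(size=n)

sig = lambda u: 1.0 / (1.0 + np.exp(-u))
y = sig(W @ x + w0)

# Analytic determinant: J = det(diag(y*(1-y)) @ W) = det(W) * prod y_i(1-y_i)
J_analytic = np.linalg.det(W) * np.prod(y * (1.0 - y))

# Finite-difference Jacobian for comparison
eps = 1e-6
Jmat = np.empty((n, n))
for j in range(n):
    dx = np.zeros(n)
    dx[j] = eps
    Jmat[:, j] = (sig(W @ (x + dx) + w0) - y) / eps
J_numeric = np.linalg.det(Jmat)

print(J_analytic, J_numeric)  # the two should agree closely
```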


Then, as before, we find
\begin{displaymath}H(\mathbf{y}) = E\ln\vert J\vert - E\ln f_{\mathbf{x}}(\mathbf{x}),\end{displaymath}


where the second term does not depend upon the parameters. Then


\begin{displaymath}\Delta W = \frac{\partial H(\mathbf{y})}{\partial W} = \frac{\partial}{\partial W}\ln\vert J\vert.\end{displaymath}


As explored in the homework,
\begin{displaymath}\Delta W \propto W^{-T} + (\mathbf{1} - 2\mathbf{y})\mathbf{x}^T \qquad (1)\end{displaymath}
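Iterated over a batch of mixed observations, update (1) will in fact separate super-Gaussian sources. Below is a minimal sketch of that use; the Laplacian sources, mixing matrix, step size, and iteration count are all illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 2, 5000
S = rng.laplace(size=(N, T))          # super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])            # mixing matrix (made up for the demo)
X = A @ S                             # observed mixtures

# Batch gradient ascent on H(y): Delta W = W^{-T} + (1 - 2y) x^T,
# averaged over the batch.
W = np.eye(N)
mu = 0.1
for _ in range(2000):
    U = np.clip(W @ X, -60.0, 60.0)   # clip to avoid overflow in exp
    Y = 1.0 / (1.0 + np.exp(-U))
    W += mu * (np.linalg.inv(W.T) + (1.0 - 2.0 * Y) @ X.T / T)

# Each recovered component should match one true source up to scale/sign:
# look at |correlation| between recovered signals and the originals.
U = W @ X
C = np.abs(np.corrcoef(np.vstack([U, S]))[:N, N:])
print(C)
```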


and similarly,


\begin{displaymath}\Delta \mathbf{w}_0 \propto \mathbf{1} - 2\mathbf{y}.\end{displaymath}


Copyright 2008, Todd Moon. Utah State University OpenCourseWare. This work is licensed under a Creative Commons License.