Application of Information Theory to Blind Source Separation


Natural Gradient

The training law we have developed up to this point requires computation of $W^{-T}$. We can avoid the inverse by modifying the update to


\begin{displaymath}\Delta W \propto \partiald{H(\ybf)}{W} W^T W.\end{displaymath}


This becomes (since $\ubf = W \xbf$)


\begin{displaymath}\Delta W \propto (I - \phi(\ubf)\ubf^T) W.\end{displaymath}


With the natural gradient, the weight update for the logistic nonlinearity becomes
\begin{displaymath}\Delta W \propto (I - 2 \ybf \ubf^T)W.\end{displaymath}
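As a concrete sketch, one online update with the general natural-gradient rule $\Delta W \propto (I - \phi(\ubf)\ubf^T)W$ might look as follows; the choice $\phi(u) = \tanh(u)$ and the step size are illustrative assumptions, not part of the derivation:

```python
import numpy as np

def natural_gradient_step(W, x, eta=0.01):
    """One online natural-gradient ICA update:
    W <- W + eta * (I - phi(u) u^T) W, with u = W x.
    phi(u) = tanh(u) is an illustrative score function."""
    u = W @ x                        # unmixed outputs, u = W x
    phi = np.tanh(u)                 # assumed nonlinearity
    I = np.eye(W.shape[0])
    return W + eta * (I - np.outer(phi, u)) @ W
```

Note that no matrix inverse appears in the update; this is the practical payoff of multiplying by $W^T W$.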

This modification to the gradient, multiplying on the right by $W^T W$, is called the natural gradient (Amari, 1998). In this section we examine it, with an eye to the question: what is natural about it? We follow Amari (1998) in the following discussion. Suppose $S = \{\wbf \in\Rbb^n\}$ is some parameter space (e.g., the space of parameters in the weighting matrix), and suppose there is some function $L(\wbf)$ defined on $S$. Consider a parameter value $\wbf$, and some incremental change to $\wbf + d\wbf$. If the parameter space is Euclidean, then the squared length of the increment is


\begin{displaymath}\Vert d\wbf\Vert^2 = \sum_{i=1}^n (dw_i)^2.\end{displaymath}


However, not all parameter spaces are Euclidean. Consider, for example, a case where the parameters all lie on a sphere. Then the appropriate distance measure is not simply the sum of the squares of the coordinates, especially if $\wbf$ is measured in spherical coordinates! So we measure the change differently:


\begin{displaymath}\Vert d\wbf\Vert^2 = \sum_{i,j} g_{ij}(\wbf) dw_i dw_j.\end{displaymath}


Here, $g$ is called the Riemannian metric tensor; it describes the local curvature of the parameter space at the point $\wbf$. In terms of vectors, we can write


\begin{displaymath}\Vert d\wbf\Vert^2 = d\wbf^T G \, d\wbf,\end{displaymath}


where $G = G(\wbf)$ is symmetric (and, in general, a function of $\wbf$). We see that we are simply dealing with a weighted distance, induced from a weighted inner product defined by


\begin{displaymath}\la \xbf,\ybf\ra_G = \ybf^T G \xbf.\end{displaymath}
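A quick numerical illustration of this weighted inner product, using a small, arbitrarily chosen symmetric positive-definite $G$:

```python
import numpy as np

def inner_G(x, y, G):
    """Weighted inner product <x, y>_G = y^T G x."""
    return y @ G @ x

# An arbitrary symmetric positive-definite metric and an increment dw.
G = np.array([[2.0, 0.5],
              [0.5, 1.0]])
dw = np.array([0.3, -0.4])

sq_len_G = inner_G(dw, dw, G)           # squared length under the metric G
sq_len_E = inner_G(dw, dw, np.eye(2))   # reduces to Euclidean when G = I
```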


When $G(\wbf) = I$, we simply get the Euclidean distance. Now consider the problem of learning by "steepest descent." The question is: do we really move in the best direction if we take into account the curvature of the parameter space? We want to decrease $L(\wbf)$ by moving in a direction $d\wbf$ to obtain $L(\wbf + d\wbf)$, making the largest possible decrease for the motion. Let us assume that we have a fixed step length,


\begin{displaymath}\Vert d\wbf\Vert^2 = \epsilon^2\end{displaymath}


for some small positive $\epsilon$ .
The steepest descent {\em direction} of $L(\wbf)$ in a Riemannian space is then $-G^{-1}(\wbf) \nabla L(\wbf)$, where $\nabla L(\wbf)$ is the ordinary gradient.

Observe that the usual ``steepest descent'' that we deal with always assumes that $G = I$.

To see this, let $\abf$ be a unit vector under the Riemannian metric, so that $\abf^T G \abf = 1$, and put $d\wbf = \epsilon \abf$. Then $L(\wbf + d\wbf) \approx L(\wbf) + \epsilon \nabla L(\wbf)^T \abf$, and minimizing over $\abf$ subject to the unit-length constraint (introducing a Lagrange multiplier, which is chosen to normalize $\abf$ without changing its direction) gives $\abf \propto -G^{-1} \nabla L(\wbf)$. We call


\begin{displaymath}\nablatilde L(\wbf) = G^{-1} \nabla L(\wbf)\end{displaymath}
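For intuition, a small numeric sketch of the definition: with a quadratic loss and metric chosen arbitrarily for illustration (neither is from the notes), the natural gradient is just the ordinary gradient premultiplied by $G^{-1}$, and coincides with it when $G = I$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # example loss L(w) = 0.5 w^T A w, so grad L = A w
G = np.diag([4.0, 1.0])      # example Riemannian metric tensor

def grad_L(w):
    return A @ w

def natural_grad_L(w, G):
    # Natural gradient: G^{-1} grad L(w), computed without forming G^{-1}.
    return np.linalg.solve(G, grad_L(w))

w = np.array([1.0, 1.0])
```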


the natural gradient of L in the Riemannian space. In Euclidean space, it is the same as the usual gradient. Now consider the BSS problem in the context of natural gradient. We first formulate the problem. We have, as before, signal vectors $\sbf(t)$ with independent components, so that


\begin{displaymath}p(\sbf) = \prod_{i=1}^n p_i(s_i)\end{displaymath}


and $\xbf(t) = A\sbf(t)$ . The output is


\begin{displaymath}\ybf(t) = W_t \xbf(t),\end{displaymath}


and we update the matrix by some learning rule


\begin{displaymath}W_{t+1} = W_t - \eta_t F(\xbf,W_t).\end{displaymath}
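Putting the pieces together, here is a minimal simulation of this online learning rule. The Laplacian (super-Gaussian) sources, the mixing matrix, the step size, and the score function $\phi(u) = \tanh(u)$ are all illustrative assumptions, not fixed by the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 5000

# Independent super-Gaussian sources and a hypothetical mixing matrix A.
S = rng.laplace(size=(n, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S                       # observed mixtures x(t) = A s(t)

# Online rule W_{t+1} = W_t - eta_t F(x, W_t), here with
# F(x, W) = -(I - phi(u) u^T) W, i.e. W <- W + eta (I - tanh(u) u^T) W.
W = np.eye(n)
eta = 0.005
I = np.eye(n)
for t in range(T):
    u = W @ X[:, t]
    W += eta * (I - np.outer(np.tanh(u), u)) @ W

P = W @ A   # approaches a scaled permutation matrix at separation
```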


Previously, we took the learning update to be $F(\xbf,W_t) =
\partiald{}{W} H(\ybf)$ , but this will now change. We observe that in order to obtain equilibrium, the function F must satisfy
\begin{displaymath}E[F(\xbf,W)] = 0\end{displaymath} (2)


when $W = A^{-1}$ (we stop changing at the correct answer). Now let $K(W)\mc \Rbb^{\matsize{n}{n}} \rightarrow \Rbb^{\matsize{n}{n}}$ be an operator that maps a matrix to a matrix, and let


\begin{displaymath}\Ftilde(\xbf,W) = K(W)F(\xbf,W).\end{displaymath}


Then $\Ftilde$ satisfies (2) when $F$ does (same equilibrium). We want to determine what form the transformation $K$ should take. Let $dW$ be a small deviation from a matrix $W$ to $W + dW$; $dW$ constitutes a ``vector'' starting from the point $W$. Let us define an inner product at $W$ as


\begin{displaymath}ds^2 = \text{squared length of the vector at $W$} = \la dW,dW\ra_W = \Vert dW\Vert^2.\end{displaymath}


(Draw a picture of a curved $W$ surface, and the vector on it.) We can pull back the point, mapping to another surface, by right-multiplying by $W^{-1}$. Then $W$ maps to $I$, and $W + dW$ maps to


\begin{displaymath}I + dX, \qquad\text{where}\qquad dX = dW\, W^{-1}.\end{displaymath}


A deviation $dW$ at $W$ is equivalent to the deviation $dX$ at $I$ under this mapping. The key idea is that we want the metric to be invariant under this mapping: the inner product of $dW$ at $W$ is to be the same as the inner product of $dW\,Y$ at $WY$ for any $Y$. Thus we impose the invariance


\begin{displaymath}\la dW,dW\ra_{W} = \la dW\,Y,dW\,Y\ra_{WY}.\end{displaymath}


In particular, when $Y = W^{-1}$, we have $WY = I$. We define the inner product at $I$ by


\begin{displaymath}\la dX,dX\ra_I = \sum_{i,j} (dX_{ij})^2 = \trace(dX^T dX),\end{displaymath}


the (unweighted, Euclidean) Frobenius norm. Under our principle of equivalence (using $dX = dW\,W^{-1}$), we should therefore have


\begin{displaymath}\la dW,dW\ra_W = \la dX,dX\ra_I = \trace(dX^T dX) = \trace(W^{-T} dW^T dW\, W^{-1}) = \sum_{i,j,k,l} G_{ij,kl}(W)\, dW_{ij}\, dW_{kl}.\end{displaymath}


It follows that the Riemannian tensor has the form


\begin{displaymath}G_{ij,kl} = \sum_m \delta_{ik} (W^{-1})_{jm} (W^{-1})_{lm}.\end{displaymath}
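This expression for $G_{ij,kl}$ can be checked numerically against the trace form above; the matrices below are arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.standard_normal((n, n)) + 3.0 * np.eye(n)  # well-conditioned test matrix
dW = 0.01 * rng.standard_normal((n, n))
Winv = np.linalg.inv(W)

# ||dW||_W^2 via the trace form: trace(W^{-T} dW^T dW W^{-1}).
lhs = np.trace(Winv.T @ dW.T @ dW @ Winv)

# The same quantity via G_{ij,kl} = delta_{ik} sum_m (W^{-1})_{jm} (W^{-1})_{lm};
# the delta collapses k to i, and Winv[j] @ Winv[l] is the sum over m.
rhs = sum(
    (Winv[j] @ Winv[l]) * dW[i, j] * dW[i, l]
    for i in range(n) for j in range(n) for l in range(n)
)
```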


We can determine an explicit form for the natural gradient using the principle of invariance. We interpret $\nablatilde f(W)$ as a vector applied at W , and $\nabla f(W)$ as a vector applied at I . Then we must have


\begin{displaymath}\la \nablatilde f(W),dW\ra_W = \la \nablatilde f(W)W^{-1},dW\,W^{-1}\ra_{W W^{-1}} \defeq \la \nabla f(W),dW\ra_I.\end{displaymath}


We thus have (using the definition of the inner product)


\begin{displaymath}\trace(W^{-T} \nablatilde f(W)^T dW\, W^{-1}) \defeq \trace( \nabla f(W)^T dW).\end{displaymath}


Using the cyclic property of the trace, we find


\begin{displaymath}\trace(W^{-1}W^{-T} \nablatilde f(W)^T dW) = \trace( \nabla f(W)^T dW),\end{displaymath}

so that

\begin{displaymath}\trace[(W^{-1}W^{-T} \nablatilde f(W)^T - \nabla f(W)^T)dW] = 0.\end{displaymath}


Since this must be true for arbitrary dW , we must have


\begin{displaymath}W^{-1}W^{-T} \nablatilde f(W)^T = \nabla f(W)^T,\end{displaymath}




and multiplying on the left by $W^T W$ and transposing gives

\begin{displaymath}\nablatilde f(W) = \nabla f(W)\, W^T W,\end{displaymath}

which is exactly the natural-gradient modification $\Delta W \propto \partiald{H(\ybf)}{W} W^T W$ introduced at the start of this section.
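As a sanity check, one can verify numerically that $\nablatilde f(W) = \nabla f(W)\, W^T W$ makes the two inner products agree, using arbitrary matrices standing in for $W$, $dW$, and the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.standard_normal((n, n)) + 3.0 * np.eye(n)
dW = rng.standard_normal((n, n))
grad = rng.standard_normal((n, n))   # stand-in for grad f(W)
nat = grad @ W.T @ W                 # natural gradient: grad f(W) W^T W

Winv = np.linalg.inv(W)
# <A, B>_W = trace(W^{-T} A^T B W^{-1});  <A, B>_I = trace(A^T B)
lhs = np.trace(Winv.T @ nat.T @ dW @ Winv)
rhs = np.trace(grad.T @ dW)
```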


Copyright 2008, by the Contributing Authors. Application of Information Theory to Blind Source Separation, USU OpenCourseWare. This work is licensed under a Creative Commons License.