Data Processing Inequality

One very intuitive yet powerful inequality in information theory is the data processing inequality.

Lemma: If random variables X, Y, and Z form a Markov chain X \rightarrow Y \rightarrow Z, then I(X;Y) \ge I(X;Z).

The great thing about the inequality is that, unlike some results in information theory, it works for both discrete and continuous random variables. (Actually, it works even for mixed cases, with some variables continuous and some discrete.) Let’s prove it assuming the variables are continuous.

Proof: I(X;Z)=h(X)-h(X|Z)\overset{(a)}{\le} h(X)-h(X|Y,Z)
\overset{(b)}{=}h(X)-h(X|Y)=I(X;Y), where (a) is because h(X|Z)-h(X|Y,Z)=I(X;Y|Z)\ge 0, and (b) is because X\rightarrow Y\rightarrow Z means X and Z are independent given Y, so h(X|Y,Z)=h(X|Y). \Box
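For the skeptical reader, here is a quick numerical sanity check of the discrete case in Python: it draws a random Markov chain X \rightarrow Y \rightarrow Z and verifies I(X;Y) \ge I(X;Z). The helper function and the alphabet sizes below are my own ad hoc choices, just a sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(p_xy):
    """I(X;Y) in nats for a joint pmf given as a 2-D array."""
    px = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                        # skip zero-probability cells
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])))

# Random chain X -> Y -> Z: p(x), p(y|x), p(z|y) all drawn at random.
nx, ny, nz = 4, 5, 3
p_x   = rng.dirichlet(np.ones(nx))               # p(x)
p_y_x = rng.dirichlet(np.ones(ny), size=nx)      # each row is p(y|x)
p_z_y = rng.dirichlet(np.ones(nz), size=ny)      # each row is p(z|y)

p_xy = p_x[:, None] * p_y_x                      # p(x,y)
p_xz = p_xy @ p_z_y                              # p(x,z) = sum_y p(x,y) p(z|y)

print(f"I(X;Y) = {mutual_information(p_xy):.4f} nats")
print(f"I(X;Z) = {mutual_information(p_xz):.4f} nats")
assert mutual_information(p_xy) >= mutual_information(p_xz) - 1e-12
```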

Note that we can use this simple inequality to show a rather nontrivial result. Consider a continuous random variable X and a continuous invertible function f(\cdot), and let Y=f(X). In general we have h(Y)\neq h(X), even though we would have H(Y)=H(X) if X were discrete. However, for any other continuous random variable Z, h(Z|X)=h(Z|Y) always holds. This can be easily seen by noting that both Y\rightarrow X\rightarrow Z and X\rightarrow Y\rightarrow Z hold: given X, Y=f(X) is determined, and given Y, X=f^{-1}(Y) is determined since f(\cdot) is invertible. Thus we have both I(X;Z) \ge I(Y;Z) and I(Y;Z) \ge I(X;Z), and therefore I(Y;Z)=I(X;Z)\Rightarrow h(Z)-h(Z|Y)=h(Z)-h(Z|X)
\Rightarrow h(Z|Y)=h(Z|X). One may also prove this from first principles, but it would be quite a bit harder.
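To see this concretely, here is a small sketch assuming everything is jointly Gaussian: X \sim \mathcal{N}(0,1), Y = aX + b (an invertible affine map), and Z = X plus independent Gaussian noise. The closed forms h = \frac{1}{2}\log(2\pi e \sigma^2) and I = -\frac{1}{2}\log(1-\rho^2) are the standard Gaussian formulas; the particular parameter values are arbitrary.

```python
import numpy as np

def gaussian_entropy(var):
    """Differential entropy h = 0.5*log(2*pi*e*var) of a 1-D Gaussian, in nats."""
    return 0.5 * np.log(2 * np.pi * np.e * var)

def gaussian_mi(rho):
    """I = -0.5*log(1 - rho^2) for jointly Gaussian variables with correlation rho."""
    return -0.5 * np.log(1 - rho**2)

# X ~ N(0, 1), Y = f(X) = a*X + b (invertible for a != 0), Z = X + noise.
a, b = 2.0, 3.0
var_x, var_n = 1.0, 0.5

var_y = a**2 * var_x                 # Var(aX + b)
var_z = var_x + var_n                # Var(X + N), noise independent of X

rho_xz = np.sqrt(var_x / var_z)      # corr(X, Z)
rho_yz = np.sign(a) * rho_xz         # corr(aX + b, Z): same magnitude as corr(X, Z)

print(f"h(X) = {gaussian_entropy(var_x):.4f},  h(Y) = {gaussian_entropy(var_y):.4f}")
print(f"I(X;Z) = {gaussian_mi(rho_xz):.4f},  I(Y;Z) = {gaussian_mi(rho_yz):.4f}")
```

Running it shows h(Y) = h(X) + \log|a| \neq h(X), yet I(Y;Z) = I(X;Z) exactly, and hence h(Z|Y) = h(Z) - I(Y;Z) = h(Z) - I(X;Z) = h(Z|X), matching the argument above.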
