## Walking Downhill

The problem asks you to show that $f(\vec{x}) \geq 0$. If $f(x_1, x_2) = x_1^2 + x_2^2$, the solution is quite easy: setting the gradient equal to zero and solving the resulting system of equations gives $\vec{x} = \vec{0}$. Since the function is convex, this is the global minimum, and the answer is simply

$f(\vec{x}) \geq f(0,0) = 0$.

However, one can add a wrinkle to make the problem more of a challenge. If $f(x_1, x_2) = \log (1 + x_1^2 + x_2^2)$, then the function is no longer convex. Nevertheless, it is still minimized at $\vec{x} = \vec{0}$, and there is a surprisingly general argument that proves this, one that has been used to considerable effect in the literature.

The argument can be summarized intuitively as follows: show that from any point, there is a downward path to the minimum. For instance, in the above problem, one can define $g(t)$ as follows:

$g(t) = f(x_1 \cdot (1 - t), x_2 \cdot (1 - t))$.

Then it is straightforward to show that $g'(t) \leq 0$ over the interval $[0,1]$ for either choice of $f(x_1, x_2)$ above. Furthermore, $g(0) = f(x_1, x_2)$ and $g(1) = f(0, 0) = 0$, so we can conclude that $f(\vec{x}) \geq 0$.
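The claim that $g'(t) \leq 0$ is easy to check numerically. Below is a minimal sketch (the helper names and test point are my own illustrative choices) that verifies the downhill path for the logarithmic choice of $f$:

```python
import math

# Minimal sketch of the downhill-path check for
# f(x1, x2) = log(1 + x1^2 + x2^2); names are illustrative.

def f(x1, x2):
    return math.log(1 + x1**2 + x2**2)

def g(t, x1, x2):
    # Straight-line path from (x1, x2) at t = 0 to the origin at t = 1.
    return f(x1 * (1 - t), x2 * (1 - t))

x1, x2 = 3.0, -2.0
eps = 1e-6

# Endpoints: g(0) = f(x1, x2) and g(1) = f(0, 0) = 0.
assert abs(g(0.0, x1, x2) - f(x1, x2)) < 1e-12
assert abs(g(1.0, x1, x2)) < 1e-12

# Central-difference estimate of g'(t) stays nonpositive on (0, 1).
for k in range(1, 100):
    t = k / 100
    assert (g(t + eps, x1, x2) - g(t - eps, x1, x2)) / (2 * eps) <= 1e-9
```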

## Entropy Power Inequality

As mentioned before, this type of argument has proved potent in the literature. One of the first places I encountered it was in the proof of the entropy power inequality. There are several equivalent statements of the result (see, e.g., Dembo et al. 1991), one of which is the following. For any two independent random variables $X$ and $Y$ with differential entropies $h(X)$ and $h(Y)$, respectively,

$h(X + Y) \geq h(\tilde{X} + \tilde{Y})$,

where $\tilde{X}$ and $\tilde{Y}$ are independent Gaussian random variables with the same differential entropies as $X$ and $Y$, respectively. The entropy power inequality has been used to prove converses for Gaussian broadcast channels and the quadratic Gaussian CEO problem.

The proof essentially involves transforming the distributions of $X$ and $Y$ to the distributions of $\tilde{X}$ and $\tilde{Y}$ along a path that does not increase the differential entropy of the sum.
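To make the statement concrete, here is a small closed-form sanity check (in nats); the choice of uniform distributions is my own illustration, not from the sources above. With $X$ and $Y$ uniform on $[0,1]$, $h(X) = h(Y) = 0$, and $X + Y$ is triangular on $[0,2]$ with $h(X + Y) = 1/2$:

```python
import math

# Closed-form sanity check of the entropy power inequality (in nats).
# X, Y ~ Uniform[0, 1] is an illustrative choice of my own.

h_X = h_Y = 0.0                      # h(Uniform[0,1]) = log 1 = 0

# Matched Gaussian: h = (1/2) log(2*pi*e*var)  =>  var = 1/(2*pi*e).
var = math.exp(2 * h_X) / (2 * math.pi * math.e)

# Sum of the two independent matched Gaussians has variance 2*var.
h_gauss_sum = 0.5 * math.log(2 * math.pi * math.e * 2 * var)

# Sum of two independent Uniform[0,1] is triangular on [0, 2],
# whose differential entropy is 1/2 nat.
h_uniform_sum = 0.5

assert h_uniform_sum >= h_gauss_sum  # 0.5 >= (1/2) log 2 ~= 0.347
```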

## Parameter Redundancy of the KT-Estimator

I’ll end with another use of the argument, found in Willems, Shtarkov, and Tjalkens’ “The Context-Tree Weighting Method: Basic Properties” from the May 1995 issue of the IEEE Transactions on Information Theory. Many of the results in the paper rely on the Krichevsky-Trofimov (KT) estimator, which estimates the probability of a binary sequence based only on its number of zeroes and ones. With the above argument, one can show that the parameter redundancy of the KT-estimator is uniformly bounded. To state this mathematically, first let $P_e (a, b)$ denote the KT-estimate for a sequence with $a$ zeroes and $b$ ones. Also note that for a Bernoulli sequence with parameter $\theta$ (the probability of a one), the probability that the sequence has $a$ zeroes and $b$ ones is $(1 - \theta)^a \cdot \theta^b$. The result states that for all values of $\theta$,

$\log \frac{(1 - \theta)^a \cdot \theta^b}{P_e (a, b)} \leq \frac{1}{2} \log (a + b) + 1$.

Note that the upper bound does not depend on $\theta$. This result is a key component for the authors to show that the redundancy of the context-tree weighting method is small and thereby demonstrate they have a compelling strategy for universal source coding.
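The bound is easy to probe numerically. The sketch below builds $P_e(a, b)$ from the sequential KT rule ($P_e(0,0) = 1$; adding a zero when the counts are $(a, b)$ multiplies by $(a + 1/2)/(a + b + 1)$, and symmetrically for a one) and checks the inequality over a grid of counts and parameters; the grid sizes are my own choices:

```python
import math

def kt_prob(a, b):
    # P_e(a, b): the estimate depends only on the counts,
    # so add the a zeroes first, then the b ones.
    p = 1.0
    for i in range(a):                  # counts (i, 0) before this zero
        p *= (i + 0.5) / (i + 1)
    for j in range(b):                  # counts (a, j) before this one
        p *= (j + 0.5) / (a + j + 1)
    return p

for a in range(16):
    for b in range(16):
        if a + b == 0:
            continue
        bound = 0.5 * math.log2(a + b) + 1
        for k in range(1, 100):
            theta = k / 100             # Bernoulli parameter grid
            redundancy = math.log2((1 - theta)**a * theta**b / kt_prob(a, b))
            assert redundancy <= bound + 1e-9
```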

The uniform bound above follows from a lower bound on the KT-estimator:

$P_e (a, b) \geq \frac{1}{2} \cdot \frac{1}{\sqrt{a + b}} \cdot (\frac{a}{a+b})^a \cdot (\frac{b}{a+b})^b$.

To prove the result, the authors define

$\Delta(a, b) = \frac{P_e (a, b)}{\frac{1}{\sqrt{a + b}} \cdot (\frac{a}{a+b})^a \cdot (\frac{b}{a+b})^b}$

and find a downward path to show that $\Delta(a, b) \geq \Delta(1,0) = \Delta(0,1) = \frac{1}{2}$.
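A quick numerical check (the grid size and helper names are my own choices) confirms both the base cases and the resulting uniform lower bound $\Delta(a, b) \geq 1/2$:

```python
import math

def kt_prob(a, b):
    # Sequential KT rule: P_e(0,0) = 1; each new zero multiplies
    # by (a + 1/2)/(a + b + 1), each new one by (b + 1/2)/(a + b + 1).
    p = 1.0
    for i in range(a):                  # zeroes first: counts (i, 0)
        p *= (i + 0.5) / (i + 1)
    for j in range(b):                  # then ones: counts (a, j)
        p *= (j + 0.5) / (a + j + 1)
    return p

def delta(a, b):
    # Delta(a, b) as defined above; note 0.0**0 == 1.0 in Python.
    n = a + b
    denom = (1 / math.sqrt(n)) * (a / n) ** a * (b / n) ** b
    return kt_prob(a, b) / denom

assert abs(delta(1, 0) - 0.5) < 1e-12
assert abs(delta(0, 1) - 0.5) < 1e-12
for a in range(30):
    for b in range(30):
        if a + b >= 1:
            assert delta(a, b) >= 0.5 - 1e-12
```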