As I read through “Information theoretic inequalities”, it felt like the icing on the cake: a simpler proof of a result from my Master’s thesis. The result itself wasn’t the focus of my Master’s but instead facilitated the proof of a number of other results, so following a writeup of the initial proof and a sanity check by my advisor, I had submitted the result and its many corollaries as a series of papers to a number of conferences. Between those conference submissions and finishing my Master’s thesis, I hadn’t worried about looking for an elegant proof.

By February of 2006, the thesis had been filed, and paper acceptances had started coming in, so my focus had shifted to trying to find alternate ways to show the same result, and with that I had pulled up Dembo et al.’s “Information theoretic inequalities” among other papers of interest like Amir Dembo’s “Information inequalities and uncertainty principles”; the latter came back from Stanford with notes scrawled in blue ink on its margins, likely from Dembo himself, that read, “Ignore this part.”

My eyes had fixed upon a result in the paper showing that for real numbers where , the following two statements are equivalent:

- For all ,

The first statement falls out of the second by setting , and the second falls from the first by showing that this choice of maximizes the right side of the inequality in the second statement. The equivalence above let me rewrite my claim in a way that reduced the act of computing derivatives and showing inequalities from a multipage undertaking to a few lines, and I would be able to simplify the “walking downhill” technique I was using to show the result.

I started typing up the new proof, and when I went to apply one of the lemmas I had proven to show the result, I caught myself. Because of an operation that I was performing, one of the conditions needed to apply the lemma wouldn’t hold. Then I went back to the original proof and noticed that it suffered from the same problem. I had found a bug in the proof, and if I couldn’t resolve it quickly, I would have to retract papers and withdraw from conferences. If I couldn’t resolve it at all, a lot of the results that I had proven over the past several months would no longer hold, including one of the key ones from my Master’s thesis.

My Master’s thesis had focused on a class of problems known as CEO problems, which try to characterize fundamental limits on compressing noisy data from multiple sensors when those sensors communicate to a central estimation unit via rate-constrained links. The abstract of the paper introducing the CEO problem contains a more business-focused exposition:

A firm’s Chief Executive Officer (CEO) is interested in the data sequence which cannot be observed directly, perhaps because it represents tactical decisions by a competing firm. The CEO deploys a team of agents who observe independently corrupted versions of . Because is only one among many pressing matters to which the CEO must attend, the combined data rate at which the agents may communicate information about their observations to the CEO is limited to, say, bits per second. If the agents were permitted to confer and pool their data, then in the limit as they usually would be able to smooth out their independent observation noises entirely. … In particular, with such data pooling can be made arbitrarily small if exceeds the entropy rate of . Suppose, however, that the agents are not permitted to convene, Agent having to send data based solely on his own noisy observations . We show that then there does not exist a finite value of for which even infinitely many agents can make arbitrarily small.

The inspiration for my work had started from a paper by Yasutada Oohama, who had found the sum-rate-distortion function of the quadratic Gaussian CEO Problem.

My goal had been to extend his results to non-Gaussian sources, and while I could show that his sum-rate-distortion function was an upper bound in those cases, I had struggled to find a corresponding lower bound. My hope had been to apply his techniques to derive such a lower bound, but Oohama’s work relied on a specific property regarding the geometry of Gaussian random variables to derive the bound. Specifically, his lower bound followed by tying the orthogonality principle of conditional expectation to statistical independence, which is true for Gaussian random variables, which allowed him to use the entropy power inequality to derive a lower bound.

The proof in which I had just found a bug was an attempt to generalize the entropy power inequality to get around the fact that for non-Gaussian sources, this statistical independence condition would not hold. My attempts to fix the proof or find a counterexample would take me down the rabbit hole of related results and techniques, from Young’s inequality to the Brascamp-Lieb inequality, all because I had just realized the claim I had based part of my identity on was merely a conjecture.