Invoking Authority

Over dinner recently, a friend mentioned that when he is unfamiliar with a topic, he tends to accept what he reads on that topic at face value. The remark sparked some thoughts I’ve been having about how the way in which information is presented can affect how people interpret it, so I shared one example that has been puzzling me.

Suppose I were to say, “Red is better than blue.” Alternatively, suppose I were to say, “I prefer red to blue.” In what sense are these two sentences different from one another?

Both of the statements are making a statement from my perspective, so one could argue that if I say, “Red is better than blue,” then I must prefer red to blue. On the other hand, based on the handful of people to whom I’ve posed these sentences, the first one is more likely to invite an argument; to argue the first, one might be able to appeal to some objective knowledge about the world that contradicts the statement, but to argue against the second, one has to know something about my preferences, and one could argue there is no greater authority on my preferences than I.

My friend’s intuitions appeared to match up with what others had said about that example, so I tried to construct an analogue that was related to a statement that isn’t concerned with the opinions of the speaker to see if a larger pattern might emerge.

Suppose I were to say, “The sky is blue.” Alternatively, suppose I were to say, “I read in an encyclopedia that the sky is blue.”

Again, the first sentence is more likely to invite an argument than the second. Taking this along with with the previous example, I wonder if this all comes down to authority. What makes me an authority on the color of the sky? On the other hand, it could be argued that I am an authority on my memories, and therefore to say, “I read in an encyclopedia that…” suggests that I did read the information. Furthermore, it could also be argued that the encyclopedia is a more reliable authority on the color of the sky than I am, and therefore the second sentence is more authoritative than the first.

I’d be curious to find out to what extent these types of sentences have been studied and whether the thoughts others have in any way line up with the intuitions I’ve detailed.

Posted in Uncategorized | 1 Comment

Propagating Errors

Jim mentioned Puzzled Pint a few months ago, but it conflicted with an improv rehearsal. However, this month, we decided to make Puzzled Pint our troupe’s social outing.

The first step in Puzzled Pint is to figure out where the event will be held. To do this, one solves a location puzzle. In our case, we needed to traverse a maze, and the puzzle was sensitive in the sense that early mistakes propagate to steps later on. It took a while to realize that we had missed a clue, but once it was resolved, we found ourselves on our way.

What’s interesting about a puzzle with this type of propagating error is that it becomes fairly simple to detect that an error was made since the solution is typically an English word or phrase, but not necessarily obvious what the error was, akin to losing an inter frame in video and seeing an error propagate across subsequent frames until the start of the next GOP. The difference between the puzzle and video coding is that given a codec, there is a well understood way to decode the video frames, whereas in the puzzle, the decoding mechanism is intentionally obfuscated and to be deduced along with the solution.

This makes a mistake in a puzzle all the more challenging to debug. For instance, in an early attempt at the February location puzzle, I decoded ROTATERETAMETOYONE, so I was confident that my method was on the right track because the prefix ROTATE is a word, but it wasn’t necessarily clear whether I was applying a correct method and had simply overlooked a piece in the process, or whether the method I was applying was too simplistic to solve the puzzle, and there was something more complex occurring that needed to be applied before moving to the next word.

Ultimately, someone figured out what was missing, and we had a mechanism to validate that the solution we had arrived at was correct. That leads to a second difference between video coding and puzzle solving; namely, it’s sometimes possible to arrive at and validate a solution by sidestepping the expected method the puzzle wants one to apply. For instance, if one gets enough clues that the entropy of the solution is sufficiently low, taking advantage of one’s knowledge of English can provide an alternate mechanism to arrive at the solution instead of figuring out the intended method. In fact, in some cases, one might reverse-engineer the remaining pieces of the method after discovering the solution.

Posted in Puzzle | Leave a comment

Apples and Oranges

While waiting to do an improv set, Chrysteena told me that she and Jim had three cartons: one with apples, one with oranges, and a third with apples and oranges. Unfortunately, each carton was mislabeled, and they wanted to move the labels to the correct cartons. What is the minimum number of pieces of fruit one would one need to take out of the cartons to correct the labels, and what would be the strategy?

The thing I like about this puzzle is that there’s information encoded in the wording of the problem, and changing one sentence changes the puzzle significantly. “Each carton was mislabeled” indicates that no box contains the correct label, which eliminates 4 out of 6 permutations of labels. If we replace the phrase “each carton was” with “some cartons were” in that sentence, the sentence only eliminates 1 out of 6, and it uncovers a potential ambiguity in how the final sentence of the puzzle is worded, but regardless of how one interprets that sentence, its solution is different from the “each carton was” case. If “each carton was” changes to “the cartons may have been” then the sentence doesn’t eliminate any of the permutations, and depending on how one interprets the wording of the final sentence of the puzzle, it may or may not have the same solution as the “some cartons were” case.

Posted in Uncategorized | Leave a comment

Jigsaw Puzzle Procrastination

“What made you decide to go to graduate school?”

During a workshop about academic careers at Allerton in 2008, Roy Yates responded to the above question in one of my favorite responses to a question of this type: “It was the most socially acceptable way to procrastinate what I wanted to do with the rest of my life.”

Recently, a friend posted on Facebook that she decided to spend time working on a jigsaw puzzle with her daughter when the ISIT 2015 submission deadline was extended. Roy responded to the thread:

I’ve been trying to understand why a puzzle with several hundred pieces is easier to put together than an ISIT paper with a few simple ideas. Each puzzle piece has several possible rotations. The ordering of the pieces would seem to explode combinatorially. You can argue that the picture provides a lot of clues, but I suspect you could do that same 200 piece puzzle upside down, without looking at the picture in a maybe one extra hour. I suppose google could tell me why, but I’d rather tag Lalitha Sankar, Chris Rose, Anand Sarwate, Krish Eswaran and Bobak Nazer.

The question immediately reminded me of a paper I’d encountered during my summer at Xerox PARC back in 2002: David Goldberg, Christopher Malon, and Marshall Bern’s “A Global Approach to Automatic Solution of Jigsaw Puzzles”. The algorithm is for the apictorial case, which corresponds to flipping over all the jigsaw pieces face-down so that the picture doesn’t help.

If one tailors an algorithm based on the picture, the discrepancy between the face-up and face-down cases could be quite severe when compared with the abstraction of Goldberg, Malon, and Bern. For instance, imagine a picture whose pixels follow an even gradient that goes from light to dark vertically and from green to red horizontally. Then by splitting the puzzle pieces by greenness and luminance, one could subdivide the initial set of puzzle pieces into four groups of one fourth the size. In fact, one could continue this subdivision until the number of pieces are small, solve them brute force, creating larger puzzle pieces that can then be solved one level above, similar to a divide-and-conquer approach like the merge sort algorithm, taking roughly O(N \log N) steps for N-piece puzzles.

The problem with the tailor-made algorithm above is that it doesn’t necessarily generalize to all pictures. One could have a picture that is symmetric, and then any sort of division would result in smaller groups that don’t necessarily merge into a single piece. Similar problems may occur if the puzzle is of a painting by Ad Reinhardt, which effectively reduces the problem to the apictorial case.

While allowances could be made, it’s worth noting that any divide-and-conquer approach cannot automatically take advantage of an algorithm like the one in Goldberg, Malon, and Bern, which attempts to solve the corners first. On the other hand and as noted in the paper, solving the corners, if abstracted in the way that the authors suggest, turns out to be equivalent to a traveling salesman problem and is NP-hard.

It’s unclear to me, however, that the traveling salesman problem is the most useful abstraction. For instance, if one assumes that puzzle pieces only fit together if they fit together in the solution, and there is a unique way in which they do, in the apictorial case, for N-piece puzzles, one could test all pieces against each other pairwise in O(N^2) steps to create a graph based on connecting pieces (vertices) that fit together. After that, one could prune the resulting graph one vertex at a time by combining vertices that share an edge at each step, terminating in N steps, thereby solving the puzzle. I suspect there is an even more efficient algorithm one could employ in this case, but this is more to illustrate how an alternate abstraction could make the problem simpler.

Ultimately, it appears there is both a sensitivity to how the jigsaw puzzle problem is abstracted as well as how useful the image is. As to whether a jigsaw puzzle is easier to solve than an ISIT paper is to write, it might be worth revisiting that in a week.

Posted in Puzzle | Leave a comment

The Doors

It was the first evening of the 2008 School of Information Theory, and Paul Cuff, then a grad student at Stanford, had just shared a probability puzzle about doors (not the Monty Hall problem). I don’t remember exactly how we started on puzzles, but I am pretty sure people had grown tired of my Ali G impersonation. At some point, I had shared a puzzle about dice that has a solution in the spirit of Paul’s puzzle, which may have been the inspiration.

We found a strategy that would work, but calculating the probability seemed a little tricky. When I asked Paul, he mentioned that the probability was about 1/3 but that he hadn’t gleaned any special insights from it. Thoughts about the puzzle were quickly replaced with the more common preoccupations of a final-year graduate student: finishing a dissertation and finding a job. The “about 1/3″ result entered my head as a folk theorem.

Recently, I posed the puzzle to Jim and Chrysteena, two friends in the San Francisco improv community. Jim designs puzzles and seemed dissatisfied when he evaluated the probability for the candidate solution purported to be the best. I quoted the folk theorem, but I was dissatisfied, as well, and I thought it was time to revisit the puzzle.

100 Doors Puzzle

There are 100 people and a room with 100 doors. Behind each door is the name of exactly one of the 100 people such that every name is behind exactly one door. The location of the names are unknown to the 100 people. The 100 people are then asked to play the following game:

  1. Each person is given a chance to go into the room and open 50 doors.
  2. After a person opens 50 doors, the room is reset to the state it was in before entry, and the person who entered may not communicate with any of the other people for the remainder of the game.
  3. The game ends in victory if and only if each of the 100 people opens the door with his or her name. In other words, if any one person fails, everyone loses.

If the location of the names behind the doors is chosen at random, find a strategy that maximizes the probability of victory.

Candidate Solution

It helps to look at cases with a smaller number of doors to build intuition. Let’s first consider two doors and two people, in which each person is allowed to open one door. Note that if both people open the same door, then they are guaranteed to lose since we require that they both find their names. Thus, the best strategy is for them to choose separate doors, at which point the probability of them winning is \frac{1}{2}. Note that this matches the probability that one of them finds his or her name, so it is optimal.

Now let’s move to four doors, four people, in which each person is allowed to open two doors. After some thought, one strategy that would generalize to 100 doors is to arrange the names in alphabetical order and assign a number to each name based on where it is in the ordering, e.g. the first name in alphabetical order is assigned the number 1, and the last is assigned 100. Then, each person starts by opening the door with the number corresponding to his or her name. Upon opening the door, that person will see a new name, which has its own number associated with it in the alphabetical ordering, and will then proceed to open the door associated with this new number. The process repeats until all tries are exhausted.

If one follows this approach, the cases leading to victory are those in which the process of going through the subsequent doors will cycle through to each person’s name within 50 tries in the case of 100 doors, or within 2 tries in the case of 4 doors. For instance, with 4 doors, victory occurs in a case like the following:

Door 1 (Bob), Door 2 (Alice), Door 3 (Dennis), Door 4 (Cindy)

Bob would start at door 2, see Alice’s name and then go to door 1, where he would find his name. Following the approach for the others results in something similar. On the other hand, we fail if we have a case like this:

Door 1 (Bob), Door 2 (Cindy), Door 3 (Alice), Door 4 (Dennis)

Again, Bob would start at door 2, which would give him Cindy’s name and lead him to door 3, which would give him Alice’s name, by which point he has exhausted his two tries without finding his name. Since one person failed, everyone loses.

It turns out that for four doors, there are 10 cases among the 24 permutations that have these smaller cyles, resulting in a victory with probability \frac{5}{12}, which is close to \frac{1}{2}. For 100 doors, there’s the folk theorem.

Probability of Victory

One thing that has bothered me in the formulation of the puzzle is that the focus is to maximize the probability, but it’s unclear to me whether the candidate strategy above does this. If “about 1/3″ continues to hold as the probability of victory as the number of doors increases say even beyond 100, then at least there is a sense that this is a scaleable strategy if not the best since no strategy can exceed 1/2. On the other hand, if the probability decays to 0, it raises a new question to understand why that’s the case, if all strategies decay to a victory probability of 0 as the number of doors increase, or if there exists a strategy that is within some bound \Delta < \frac{1}{2} of \frac{1}{2}.

We can start asking these questions if we have an expression for the probability of victory. Let’s consider n = 2k doors, where k represents the number of tries, and n = 100 in the original problem. We have characterized the victory cases as those in which the permutation of doors can be represented as cycles of at most k. To calculate the probability, we can use the cycle index of the symmetric group S_n as

Z(S_n) = \sum_{j_1+2 j_2 + 3 j_3 + \cdots + n j_n = n} \frac{1}{\prod_{i=1}^n i^{j_i} j_i!} \prod_{i=1}^n a_i^{j_i},

where a_i denotes cycles of length i.

It turns out the cycle index of the symmetric group can be expressed as the recurrence

Z(S_n) = \frac{1}{n} \sum_{l=1}^n a_l Z(S_{n-l})

Let’s define F_n(a_1, \ldots, a_n) = Z(S_n) to highlight the dependence on the cycles. Thus, we can rewrite the above equation as

F_n(a_1, \ldots, a_n) = \frac{1}{n} \sum_{l=1}^n a_l F_{n-l}(a_1, \ldots, a_{n-l}).

Let’s define the cumulative distribution function (CDF) F_{n}^{(i)} as F_n(a_1, \ldots, a_n) evaluated at a_l = 1, l \leq i and a_l = 0, l > i. In words, F_{n}^{(i)} is the probability that all cycles are of length i or less.

Thus, if we can find an expression for F_{n}^{(k)}, we can evaluate it at F_{100}^{(50)} to get our result. The results are contained in Propositions 1 and 2 below, where Proposition 2 is reminiscent of our folk theorem, which yields a probability of 0.306, which is slightly less than 1/3. Furthermore, Corollary 1.1 indicates that we decrease monotonically to this probability, so regardless of the number of doors, this strategy is within 0.2 of \frac{1}{2}, the upper bound on how well one could possibly do.

Proposition 1. F_{2k}^{(k)} = F_{n}^{(k)} = 1 - \sum_{m=k+1}^n \frac{1}{m}.
Proof: The first equality is by definition, so we simply need to show the second equality. For i \leq j, we can evaluate the probabilities as

F_{2j}^{(i)} = \frac{1}{2j} \sum_{l=1}^i F_{2j-l}^{(i)}

F_{2j}^{(2j-i)} = \frac{1}{2j} \sum_{l=1}^i F_{2j-l}^{(2j-i)} +\frac{1}{2j}\sum_{m=i+1}^{2j-i} F_{2j-m}^{(2j-m)} = \frac{2j-2i}{2j}+ \frac{1}{2j} \sum_{l=1}^i F_{2j-l}^{(2j-i)}

F_{2j+1}^{(i)} = \frac{1}{2j+1} \sum_{l=1}^i F_{2j+1-l}^{(i)}

F_{2j+1}^{(2j+1-i)} = \frac{1}{2j+1} \sum_{l=1}^i F_{2j+1-l}^{(2j+1-i)} + \frac{1}{2j+1} \sum_{l=i+1}^{2j+1-i} F_{2j+1-l}^{(2j+1-l)} = \frac{2j+1-2i}{2j+1} + \frac{1}{2j+1} \sum_{l=1}^i F_{2j+1-l}^{(2j+1-i)}.

Note that if i \leq j, then i - 1 \leq j, so we can apply the same result to get that

F_{2j}^{(2j+1-i)} = \frac{2j-2(i-1)}{2j}+ \frac{1}{2j} \sum_{l=1}^{i-1} F_{2j-l}^{(2j+1-i)}

and substituting this into our expression for F_{2j+1}^{(2j+1-i)} yields

F_{2j+1}^{(2j+1-i)} = \frac{2j}{2j+1} (F_{2j}^{(2j+1-i)} - \frac{2j-2(i-1)}{2j}) = F_{2j}^{(2j+1-i)} - \frac{1}{2j+1}.

Likewise, one can similarly derive

F_{2j}^{(2j+1-i)} = F_{2j-1}^{(2j+1-i)} - \frac{1}{2j}.

Starting with the base case that F_{2k}^{(k)} = F_{n}^{(k)} = F_{n-1}^{(k)} - \frac{1}{n} and the above inductive steps, one can conclude that

F_{n}^{(k)} = F_{k}^{(k)} - \sum_{m=k+1}^{2k} \frac{1}{m}.

However, F_{k}^{(k)} = 1, so we have our result.

Corollary 1.1. F_{n+2}^{(k+1)} = F_{n}^{(k)} - \frac{1}{(n+2)(n+1)}.

Proposition 2. \lim_{k \rightarrow \infty} F_{2k}^{(k)} = 1 - \ln 2 \approx 0.306
Proof: From Proposition 1, we can write

F_{2k}^{(k)} = 1 - \sum_{m=k+1}^{2k} \frac{1}{m}. Note that we can rewrite the summation

\sum_{m=k+1}^{2k} \frac{1}{m} = \sum_{i=1}^{k} \frac{1}{k+i} = \sum_{i=1}^{k} \frac{1}{1+i/k} \frac{1}{k}.

The key is to observe that this is simply a Riemann sum approximating the integral

\int_{0}^1 \frac{1}{1+x} dx = \ln 2,

which converges as we let k \rightarrow \infty.

Posted in Probability, Puzzle | 8 Comments

Ages of Three Daughters

My uncle’s in town and just reminded me of a puzzle that he’d asked me over ten years ago. A census worker goes to a house and the person who answers the door tells the census worker that she has three daughters.

“I need to know their ages,” replies the census worker.

“Well, the product of their ages is my age: 36.”

“That’s not enough information.”

“Well, the sum of their ages is the same as the number as the house on the right.”

The census worker checks the house next door and responds, “That’s still not enough information.”

“My eldest daughter is asleep upstairs.”

The census worker thanks the homeowner and leaves. How old are the three daughters?

I love this puzzle because on the face of it, it would appear that there isn’t enough information to figure out the puzzle. If you want to figure out the puzzle yourself, do not read beyond this point.

The first thing we learn about the ages of the daughters is that the product of the ages is 36.

Let’s consider all such possibilities (let’s assume the puzzle maker only allows for integer ages):


As the census worker notes, there are still too many to disambiguate the ages.

The next clue is that the sum of the ages is the same as the number of the house next door. We don’t know the sum, but the census worker does, and we also know that the house number is insufficient to disambiguate the ages. Let’s look at the sums of the factorizations above:

1+1+36 = 38
1+2+18 = 21
1+3+12 = 16
1+6+6 = 13
1+4+9 = 14
2+2+9 =13
2+3+6 = 11
3+3+4 = 10

Note that the only sum that is not unique above is 13, so that is the only possibility in which the second clue would not have been enough for the census worker to disambiguate the ages. This leaves two possibilities:


The final clue helps with that one. If there’s an eldest daughter (let’s assume the puzzle maker created a world in which twins are considered to be the same age and that no two daughters who aren’t twins were born within twelve months of each other), then the only viable solution is 2,2,9.

Posted in Information Theory, Puzzle | 1 Comment

Le Poisson

It was the spring of 2003; Toby Berger and I were both in the common room of Phillips Hall, and there was half an hour before his information theory class was set to start. Toby was advising my honors thesis, a project that was exploring mathematical models of how neurons transmit information. I approached Toby to ask about one of the assumptions he had made in his model, and a few minutes later, he took out a sheet of paper and started deriving an analogue of the central limit theorem for Poisson processes. I tried keeping up as he wrote down the steps, justified assumptions where he needed to, and eventually yielded a beautiful result. I left that discussion wanting the power to do what he had just done.

What Toby had done was derive a model for the probability distribution of the inter-arrival times between the spikes of a neuron, and he was explaining the rationale for one part of this model. Specifically, he was trying to justify to me how even though the inter-arrival times of individual neural spikes do not behave like a Poisson process (due to a refractory period), we could look at the afferent spike arrivals into a neuron as Poisson.

I referenced Toby’s derivation in some unpublished work during grad school, but my thoughts never returned to it until a holiday party this past week. During the party, someone posed a question that made me think about the frequency with which a person does an activity, and my mind settled upon Toby’s Poisson limit derivation. Like the spikes of a neuron, an activity, e.g. brushing one’s teeth, can be modeled as a renewal process. We were specifically talking about the distribution of inter-arrival times and how the average per person behavior over a population might compare against an individual’s activity.

I was positing cases in which the distribution of an individual’s activity might be bimodal. For instance, some people might brush their teeth twice a day while others might only brush once a day. The limiting argument then immediately started nagging at me because the inter-arrival distribution of a Poisson process isn’t bimodal, and I wanted to understand how the limiting argument would collapse the distribution into a single mode.

Toby’s Derivation (Asymptotic)

Toby’s argument is a good starting point to think about this problem. Let’s consider m renewal processes. Each process could represent the spikes of a single neuron or the times at which a specific person brushes his or her teeth. Given some fixed time t, we can model the time to the next arrival as non-negative random variables

X_1, \ldots, X_m \sim F_X,

where F_X is a cumulative distribution function that we will assume is continuously differentiable over [0, \infty) and where the probability density function f_X(x) has the property that f_X(0) > 0. I am going to wave my hands that this last assumption isn’t so severe if we assume that the processes are out of phase with one another with respect to independent random phases and that is an assumption one would want to make, anyway.

Let’s superimpose all of the events of these m renewal processes on a single timeline. I will now derive the result as m \to \infty, the superimposed process will look Poisson. By definition, this means that the inter-arrival times will look exponentially distributed. To see this, note that we can write the distribution of the time until the next arrival from our starting point t as

Z_m = m \cdot \min\{X_1, \ldots, X_m \}~.

The multiplicative scaling factor m in Z_m above can be thought of as a per process averaging term since as we superimpose more processes, the points become closer together.

The goal is to argue that Z_m is exponentially distributed. To do this, we can decompose the probability as follows:

  1. By definition, \mathbb{P}(Z_m \geq z) =\mathbb{P}(X_1 \geq z/m, \ldots, X_m \geq z/m )
  2. By independence, \mathbb{P}(Z_m \geq z) = \prod_{i=1}^m \mathbb{P}(X_i \geq z/m)
  3. By identical distributedness, \mathbb{P}(Z_m \geq z) = (1 - F_X(z/m))^m
  4. Taking the limit as k \to \infty via L’Hospital’s rule, we have that \lim_{k \to \infty} \mathbb{P}(Z_m \geq z) = e^{-z f_X(0)}
  5. This limit is exponentially distributed, indicating that our inter-arrival times are exponentially distributed, so by definition the limiting arrival process is Poisson.

Finite Derivation

Let’s return to step 3 in Toby’s derivation, which is just before one takes the limit. We can rewrite this as

\mathbb{P}(Z_m \geq z) = e^{m \ln (1 - F_X(z/m))}~.

We can now apply one of the fundamental inequalities of information theory to the equation above: \ln x \leq x - 1, which can also be written as \ln x \geq 1 - 1/x. By monotonicity of the exponent, our inequality gives the upper and lower bounds

e^{-\frac{m F_X(z/m)}{1 - F_X(z/m)}} \leq \mathbb{P}(Z_m \geq z) \leq e^{-m F_X(z/m)}~.

Note that the denominator in the exponent of the lower bound 1 - F_X(z/m) converges to 1 as m \to \infty. Furthermore, we see that as m \to \infty, the quantity m F_X(z/m) approaches z f_X(0), which results in the exponential limit of the inter-arrival distribution of both bounds. If we examine this quantity more closely, m dilates m F_X(z/m) so that any modes in the distribution are effectively pushed closer together around 0, and it’s directly connected to the fact that we scale the \min operation by the multiplicative factor m.

Posted in Biology, Information Theory, Probability | Leave a comment