the multiplicity of maxes

Posted by piantado on June 20, 2009

Here’s a thought which recently came up: optimization problems with many possible solutions are more likely to have a single best solution than optimization problems with few solutions. A problem with 10 possible solutions is more likely to have multiple “best” solutions than one with 1000 possible solutions. Why might you think this?

Well, you might think about it in the following way: suppose that value or “goodness” or utility of each possible solution to an optimization problem is sampled from some underlying (discrete) distribution. So the optimal solution is the max of this distribution of samples. This assumption probably isn’t true of many real kinds of optimization problems, but let’s go with it for a minute.

If this assumption were true, then as your sample size increases, the best solution takes on a more and more extreme value in the underlying distribution. Analogously, the tallest person in a room with 5 people will be shorter than the tallest person in a room with 25 people. But if the max gets more and more extreme, its value is lower and lower probability (for any distribution, in the limit), and so you are less and less likely to have sampled multiple solutions with that maximum value. So the max should tend to be unique as the problem gets more and more complex. Analogously, if you grab 10 people and measure their height, you are more likely to find two people tied for tallest (rounded to, say, the nearest inch) than if you grabbed 1000 people.

Wait, that’s not obviously true. Actually, it’s not even right.

Here’s a simulation where you draw N samples from a poison distribution (lambda=0.1). The X axis is log N. The red line shows the probability of finding a tie for the maximum value. (Code here.)

What’s surprising, maybe, is that the probability of a tie for first place oscillates (it’s so surprising I thought that R had a bug in it’s random number generation until I wrote it in another language). In fact, for many distributions the limiting probability probability of a tie for the maximum value doesn’t exist. This was originally investigated for geometric distributions; for these distributions, you can picture the game is that N people flip a coin. If a person gets a tail, they are eliminated. The winner is the last person to be eliminated, but there is some probability that on the next flip, everyone left will be eliminated and thus tie as winners. What does the probability that there is a tie for winner, as a function of N? Well, it oscillates.

In 1995 Baryshnikov, Eisenberg, and Stengle proved the general theorem you really want: the limiting probability of a tie for the max exists if and only if

\lim_{j} \frac{P(X = j)}{P(X > j)}=0

where P is the underlying discrete distribution on j=1,2,3,\ldots . If the limit doesn’t exist, you get something that looks like the plot above. So, for many distributions, the probability of a tie for the maximum oscillates as the sample sizes increases. Bizarre, I think.

But here is one way to understand it. Imagine having dawn a sample, finding the max, and slowing adding more samples. As you add more and more samples, you will get the max again before sampling something larger, assuming that the max is sufficiently higher probability than the numbers that follow it (which is one way to read what happens when the above limit is not 0). Then with the new max, you start all over, getting the new max again before finding a new new max. So your probability of getting the max more than once oscillates with the sample size.

This might all be irrelevant to actual optimization problems since I’m not sure which ones have spaces of possible solutions which can be thought of as draws from some underlying distribution. But I thought it was odd anyway that the probability of a tie for the max of N samples oscillates as N increases. So yeah.

Wow! 2

Posted by piantado on February 15, 2009

Here’s the most amazing thing I read recently (in Mar-Apr edition of American Scientist–probably the best science writing around–which attributed this puzzle to David Blackwell).

Suppose that you write down two numbers, each on a slip of paper. You shuffle them up and put them, face-down, on a table. I am free to flip over the first one. Now I say whether or not the first number is larger than the second number. If I am right, you give me $100. If I am wrong, I give you $100.

First question: is this a fair bet for you? You should try for a minute to prove that this is a fair bet (and may “succeed”). Informally, you might think the first number doesn’t give me any information about the second since you could write down anything, so it seems like there is no strategy I can use to, on average, do better than breaking even.

Ahh our dumb intuition. Here’s how I can make money on this bet. This is the remarkable part. I flip over the first slip of paper. I then choose a number at random (say, from a normal distribution). If the random number is larger than the one I flipped over, I tell you that the hidden number is also larger than the one I flipped over. If the random number is smaller than the one I flipped over, I tell you that the hidden number is also smaller than the one I flipped over.

How could this possibly make me money on an even bet? Well suppose that you wrote down numbers X and Y. When my random number is less than both X and Y, then I will win 50% of the time since I will turn over the smaller number first half the time, and the larger number first half the time. In both of these cases, I will tell you the hidden number is the smaller of the two numbers, and so will be right half the time. Similarly, when the random number is larger than both X and Y, I will be right half the time.

But, when the random number is between X and Y, I will will all the time. To see this, suppose that X < Y. When I turn over X, my random number is larger than it, so I say that the unseen number, Y, is larger than the seen number X, and I am right. When I turn over Y first, my random number is smaller than it, so I say that the unseen number, X, is smaller than the seen number Y, and I am right.

So when I randomly choose a number between X and Y, I win all the time. Thus, if I sample from a distribution which assigns nonzero probability to every range, I will on average win more than half the time. Crazy, huh?

visualizing nonindependence

Posted by piantado on October 28, 2008

Here’s a problem I recently had: suppose you are given a sample of some joint distribution on X and Y. That is you are given a bunch of (X,Y) pairs. On thing you may want to know is whether the distribution P( Y | X ) depends on X. You can’t always test this by doing a regression–sometimes the expected of Y may not depend on X, but the distribution of Ys might. Sometimes P(Y | X) may be entirely different types of distributions for various X.

But here’s one fun technique: it is easy to compute the marginal densities P(Y) and P(X) using, say, density estimation. Then we can do a 2D density estimation where we weight each data point (x,y) by 1 / ( P(x) P(y) ).

Weighting the density estimation like this doesn’t recover a function for the actual density. Instead, it gives you a value P(Y | X) / P(Y). To see this, note that the density we get out should approximate

P(X,Y) / ( P(x) P(y) ) = P(X) P(Y | X) / ( P(x) P(y) ) = P(Y | X) / P(Y)

Thus, if P(Y | X) is independent of X–and therefore equal to P(Y)–you will recover a constant function. So if the function looks non-constant, you know that there is some kind of interaction. This thing P(Y | X) / P(Y) can be interpreted as how much the probability changes by conditioning on X. That is, how much the probability of that specific Y value depends on whether or not you know X. If you take logs, you get how much the suprisal of Y changes by conditioning on X.

An example is shown below:

This was generated by taking X to be distributed uniformly in [0,1]. Y is then chosen to be a normal distribution with SD 1, centered at 10*X. I ran this density estimation technique, and plotted the log densities. As things get more red, they get more negative. A script to generate it (and can be modified to play with other data) can be found here. If you squint a bit, you can see how this represents that distribution’s nonindependence: for example, if you don’t know X, a high Y is very possible and maybe as likely as a small Y. But if you know X is small, a high Y is very unlikely. So, when X is small and Y is large, you get red, meaning that by conditioning on X, the probability in that region goes down a lot. Hopefully this can help visualize how they are nonindependent.