A nice algorithm: reservoir sampling

Posted by piantado on July 07, 2009

You are an octopus who loves sushi. Every friday night, you visit the local sushi restaurant and chow down. Your favorite sushi is the Spider Roll and least favorite is Tako. But, you are the kind of octopus who really likes variety—a totally random mix of sushi.

Unfortunately, you go to a sushi restaurant with a conveyor belt of sushi, where each roll moves past you in succession. As each roll passes, you can decide whether to pick it up (and put down one if your arm is currently holding one) or ignore it and wait for the next. Your problem is to find some way of choosing a sample of sushi rolls uniformly, such that each sushi roll that passes you has an equal probability of ending up being in one of your arms at the end. Sadly, you don’t know how long the sushi belt will continue to run. What to do?

One thing you might think to do is as each sushi roll passes you, you flip a coin. If the coin is heads, you pick up the current roll with an arm, and throw out that arm’s sushi. If the coin is tails, you ignore the current roll. Done dumbly, this, of course, biases you against keeping the earlier sushi, since they have a greater chance of being replaced if they’re even picked up. So you figure you can change the coin weight as the sushi roll past, and maybe, with some luck, you can end up with a uniform sample. But, if you don’t know how many sushi rolls will go by, you might have trouble deciding how quickly the coin weight should change…

But there’s a better way. It’s called reservoir sampling. Here’s one way to think about it, before you think about it: suppose that we assigned a random number chosen iid, uniformly from [0,1], to each sushi roll on the conveyor belt—all the ones you have seen and will see.

Now suppose that to take a uniform random sample of 8 of them (one for each arm), you take the rolls with the 8 lowest random numbers. Each roll has an equal probability of having its number in the lowest 8, so your sample of the lowest 8 is uniform.

What have you gotten? Well, something very nice. By thinking of it in those terms, rather than coin-flipping-accept-reject terms, you can make a pretty online algorithm. Each time you see a new sushi roll, you draw a random number. You hold on to the 8 sushi rolls with the lowest numbers, and if the new roll is lower than the largest of your 8, you add it to your 8, and drop the largest. At the end of this process, you are holding in your 8 arms the 8 lowest samples from all of them. And you didn’t need to ever hold on to more than 8 rolls. Cool.

(If you aren’t really an octopus, maybe you’re doing NLP or psycholinguistics and trying to draw N random sentences containing the word “octopus” from a corpus, without holding all occurrences of “octopus” in memory.)

Of course, that algorithm I gave isn’t great. In particular, you have to look at each sushi roll. But of course you could draw random numbers for the next several rolls and not even look at them if they are too large. Amazingly, you can sample 8 rolls uniformly, from N total in less than O(N) time. Thank this guy. In fact, drawing n samples uniformly from N is doable in O(n(1+\log(N/n))) time.

Pretty, clever algorithm.

the multiplicity of maxes

Posted by piantado on June 20, 2009

Here’s a thought which recently came up: optimization problems with many possible solutions are more likely to have a single best solution than optimization problems with few solutions. A problem with 10 possible solutions is more likely to have multiple “best” solutions than one with 1000 possible solutions. Why might you think this?

Well, you might think about it in the following way: suppose that value or “goodness” or utility of each possible solution to an optimization problem is sampled from some underlying (discrete) distribution. So the optimal solution is the max of this distribution of samples. This assumption probably isn’t true of many real kinds of optimization problems, but let’s go with it for a minute.

If this assumption were true, then as your sample size increases, the best solution takes on a more and more extreme value in the underlying distribution. Analogously, the tallest person in a room with 5 people will be shorter than the tallest person in a room with 25 people. But if the max gets more and more extreme, its value is lower and lower probability (for any distribution, in the limit), and so you are less and less likely to have sampled multiple solutions with that maximum value. So the max should tend to be unique as the problem gets more and more complex. Analogously, if you grab 10 people and measure their height, you are more likely to find two people tied for tallest (rounded to, say, the nearest inch) than if you grabbed 1000 people.

Wait, that’s not obviously true. Actually, it’s not even right.

Here’s a simulation where you draw N samples from a poison distribution (lambda=0.1). The X axis is log N. The red line shows the probability of finding a tie for the maximum value. (Code here.)

What’s surprising, maybe, is that the probability of a tie for first place oscillates (it’s so surprising I thought that R had a bug in it’s random number generation until I wrote it in another language). In fact, for many distributions the limiting probability probability of a tie for the maximum value doesn’t exist. This was originally investigated for geometric distributions; for these distributions, you can picture the game is that N people flip a coin. If a person gets a tail, they are eliminated. The winner is the last person to be eliminated, but there is some probability that on the next flip, everyone left will be eliminated and thus tie as winners. What does the probability that there is a tie for winner, as a function of N? Well, it oscillates.

In 1995 Baryshnikov, Eisenberg, and Stengle proved the general theorem you really want: the limiting probability of a tie for the max exists if and only if

\lim_{j} \frac{P(X = j)}{P(X > j)}=0

where P is the underlying discrete distribution on j=1,2,3,\ldots . If the limit doesn’t exist, you get something that looks like the plot above. So, for many distributions, the probability of a tie for the maximum oscillates as the sample sizes increases. Bizarre, I think.

But here is one way to understand it. Imagine having dawn a sample, finding the max, and slowing adding more samples. As you add more and more samples, you will get the max again before sampling something larger, assuming that the max is sufficiently higher probability than the numbers that follow it (which is one way to read what happens when the above limit is not 0). Then with the new max, you start all over, getting the new max again before finding a new new max. So your probability of getting the max more than once oscillates with the sample size.

This might all be irrelevant to actual optimization problems since I’m not sure which ones have spaces of possible solutions which can be thought of as draws from some underlying distribution. But I thought it was odd anyway that the probability of a tie for the max of N samples oscillates as N increases. So yeah.

How long till I take over? 1

Posted by piantado on January 20, 2009

I played around with a fun problem this weekend with Celeste. We started to talk about descendants and ancestors. What are the odds that someone descended from me will be alive T years in the future? What are the odds I will have no descendants in 10000 years? What is the relationship between the number of kids I have and the odds my descendants will all die out?

There are some funny phenomena relevant to this. Mitochondrial Eve is one. The number of people descended from Genghis Kahn is another. In Knuth’s Volume 1, he talks about a “proof” by H. W. Watson that under certain assumptions, infinitely many people will be born in the future, but each family line will die with probability 1 (pg 383). This is of course logically inconsistent, and Knuth proves so: that in an infinitely tall tree, you can find an infinite path of descent. If people live forever, someone’s family will live forever. Which should be obvious.

Here’s a simple model for thinking about infinite family trees. Suppose that there is some bounded population P. To create people for the next generation, the following is repeated: two people are chosen at random, bred, and their (one) child inhabits the next generation. That child is a descendant of their two parents. This is repeated until the next generation is filled up with people. And so on. This neglects a lot of important things–like gender, natural selection, sexual selection, and the grossness of incest. But maybe its not such a bad start…

First, two observations. If the population size is bounded, then it is eventually going to be the case that everyone is a descendant of me, or nobody is. This is because everyone being my descendant, and nobody being my descendant are fixed points—once either is true, it is true for all of time. And each generation there is some probability of either of these happening, so given enough time, one must come true. The second thing to notice is that the expected number of descendants of me at the next time step is 1-(1-p)^2 = p(2-p), where p is the proportion of people who are my descendants at the current time step. This is because the probability of getting someone in the next generation who is not a descendant of me is the probability that neither parent was a descendant of me, or (1-p)(1-p). (This also shows that p=0 and p=1 are the only fixed points). Thus, at each generation the number of my descendants should grow by a factor of (2-p). Things are looking good.

But of course there is a lot of variability, especially when you only have a few kids initially. So how many kids should someone have in order to be pretty sure your genes will take over? Since I know more about perl than stochastic processes, I wrote a little script to figure this out. If anyone wants to solve this analytically, I’d like to know, because I can’t run the perl script very quickly on a population of size 6 billion.

Here are some results, showing the probability that everyone is eventually a descendant of me for various population sizes and number of kids I have, averaged over 100 runs:

size=1000 size=10000 size=100000
1 Kids 0.8 0.78 0.79
2 Kids 0.95 0.96 0.94
3 Kids 1.0 0.99 0.99
4 Kids 1.0 1.0 1.0

So, you don’t need many kids to eventually take have a good odds of taking over. At least when the population is relatively small. But in the larger population, probably most people’s kids reproduce themselves more readily than in this model, so you don’t need many kids to eventually take over. I wonder if anyone has attempted to model the distribution of family names along these lines…

I love dynamic programming

Posted by piantado on September 14, 2008

Here’s a fun problem that came up recently: how many possible English words are there of length L? A word is a “possible” English word if it obeys the constraints of English phonotactics. For example, in English there are no words that begin with the sounds “bn”, meaning that “bnark” is not a “possible” English word.

For simplicity, we’ll assume that English phonotactics is completely captured by the set of triphones, or combinations of three phonemes). For example, “#bn” is never seen in English words (where # denotes the beginning or end of a word), so we’ll assume “#bn” is forbidden in English. But “#st” is allowed because you see it in words like “start” and “star.” So, its relatively easy to take a big list of English word pronunciations and come up with a set of all three phoneme combinations that English allows.

But then the fun part. Given these restrictions about what triples of phonemes are allowed, how could we count this number of possible words that are allowed of length L? The naive algorithm is to generate phoneme strings of length L, and for each string see if it contains only allowed triples of phonemes. This is bad because there are P^L such strings to check (where P is the number of phonemes), and this grows too fast to be practical for anything other than small L.

But there’s a better option: dynamic programming. Because of the local nature of constraints we don’t actually need to search through all possible phoneme combinations of length L. Instead, we just store the number of all words ending in XY for each pair of phonemes XY. We’ll use the notation C(XY,L) to be the number of strings of length L that end in XY. We start with strings of length two and set C(XY,2)=1 if #XY is an allowed string. In other words, for each allowed #XY, there is only one allowed string ending in XY.

Then we repeat the following: set C(YZ,L) to be the sum over all X of C(XY,L-1) such that XYZ is allowed. This counts the number of strings you get by extending each string that ends in XY by one phoneme.

To count the actual number of allowed words, we have to restrict endings to the word boundary symbol #. So set N(L) to the sum over all XY of C(XY,L) such that XY# is allowed. Then N(L) gives you the number of possible English words, according to the triphone constraints. The neat part is that instead of requiring P^L time, this is upper bounded by L*P^3 time, meaning that we have taken a problem which was exponential in L and made it linear in L and polynomial in P. I love dynamic programming.

If you’re curious, here are some (exact) counts of the number of possible words:

length    count    logcount
2    444    6.09582456243222
3    6400    8.76405326934776
4    84247    11.341508239272
5    1278377    14.0611018625933
6    18478098    16.7320966968031
7    268961125    19.4100774103939
8    3915334048    22.0881664879088
9    56997708070    24.7662768946743
10    829696173526    27.4443254147294
11    12077076407040    30.1223302598591
12    175801721337926    32.8003778925875

And here is a perl script to compute it (note, requires CELEX).