Something like this is bound to happen

Posted by piantado on October 28, 2009

Hilarious news from California: the Governator apparently sent a veto message to the legislature in which the first word of each column spells “fuck you.” You can read about it here and here. His spokesman Aaron McLear is quoted as saying “My goodness. What a coincidence. I suppose when you do so many vetoes, something like this is bound to happen.”

I couldn’t resist actually computing the probability. You can pretty fairly assume that each word starting a line is chosen independently. If you take the (token) frequency with which each letter appears at the start of a word, you can find that the probability of actually spelling “fuckyou”: it’s about 10^{-12}. So yes, it is bound to happen, but only for one in 1 trillion (seven-line) vetos. As for the alternative hypothesis, odds are probably better than 1 in 1 trillion that either this is a hoax, someone is trying to get fired, or Schwarzenegger has a sense of humor.

It’s also easy to find the most likely words to spell in the left hand column. Here are the top four letter words with their corresponding log probabilities (of course, its not very likely you’ll get a word at all). There are so many “t”s because it is the most likely letter to start a word (p=0.15). It seems they could have picked some other insults if they really wanted the “coincidence” story to be a little more plausible:

twat 0.000235
watt 0.000235
tats 0.000227
that 0.000182
tact 0.000145
wait 0.000125
twit 0.000124
tits 0.00012
iota 0.000119
swat 9.90e-05
wast 9.90e-05
data 9.71e-05
asst 9.59e-05

Update:

Apparently there’s been some interest in this so I’ll give a little more detail of how to compute this. If you have a list of word frequencies, you can add up the total frequency of all words beginning with “a”, beginning with “b”, beginning with “c”, etc. When you divide each of these numbers by the sum of all your frequencies, you get the probability that a randomly chosen word in a text will begin with each letter. These probabilities are listed below:

a 0.153
b 0.0445
c 0.0411
d 0.0273
e 0.0207
f 0.0431
g 0.0167
h 0.0513
i 0.0809
j 0.0041
k 0.005
l 0.023
m 0.036
n 0.0225
o 0.063
p 0.032
q 0.0021
r 0.0209
s 0.0642
t 0.1521
u 0.0109
v 0.0056
w 0.0663
x 0.0001
y 0.0133
z 0.0001

Note that this is not simply how many words start with each letter—instead, it’s the total probability of all words starting with each letter.

Since to spell “fuckyou” you must first choose “f”, and then “u”, and then “c”, etc, you can compute the total probability of “fuckyou” by simply multiplying the probabilities for each letter (This assumes independence between the words which start each line, which is pretty reasonable). When you do this, you get a probability of 8.82 \cdot 10^{-13}.  Above I just reported the rougher, slightly conservative, order of magnitude estimate of magnitude estimate of 1 in 1 trillion.

As Ted pointed out, what you probably really want to compute is the probability of anything shocking–if Schwarzenegger had said “biteme” instead, it might have also made news headlines. So how surprised should we be to see anything insulting? Well to do this we could sum up the total probability of a big list of insults, but I don’t have one handy.

A nice algorithm: reservoir sampling

Posted by piantado on July 07, 2009

You are an octopus who loves sushi. Every friday night, you visit the local sushi restaurant and chow down. Your favorite sushi is the Spider Roll and least favorite is Tako. But, you are the kind of octopus who really likes variety—a totally random mix of sushi.

Unfortunately, you go to a sushi restaurant with a conveyor belt of sushi, where each roll moves past you in succession. As each roll passes, you can decide whether to pick it up (and put down one if your arm is currently holding one) or ignore it and wait for the next. Your problem is to find some way of choosing a sample of sushi rolls uniformly, such that each sushi roll that passes you has an equal probability of ending up being in one of your arms at the end. Sadly, you don’t know how long the sushi belt will continue to run. What to do?

One thing you might think to do is as each sushi roll passes you, you flip a coin. If the coin is heads, you pick up the current roll with an arm, and throw out that arm’s sushi. If the coin is tails, you ignore the current roll. Done dumbly, this, of course, biases you against keeping the earlier sushi, since they have a greater chance of being replaced if they’re even picked up. So you figure you can change the coin weight as the sushi roll past, and maybe, with some luck, you can end up with a uniform sample. But, if you don’t know how many sushi rolls will go by, you might have trouble deciding how quickly the coin weight should change…

But there’s a better way. It’s called reservoir sampling. Here’s one way to think about it, before you think about it: suppose that we assigned a random number chosen iid, uniformly from [0,1], to each sushi roll on the conveyor belt—all the ones you have seen and will see.

Now suppose that to take a uniform random sample of 8 of them (one for each arm), you take the rolls with the 8 lowest random numbers. Each roll has an equal probability of having its number in the lowest 8, so your sample of the lowest 8 is uniform.

What have you gotten? Well, something very nice. By thinking of it in those terms, rather than coin-flipping-accept-reject terms, you can make a pretty online algorithm. Each time you see a new sushi roll, you draw a random number. You hold on to the 8 sushi rolls with the lowest numbers, and if the new roll is lower than the largest of your 8, you add it to your 8, and drop the largest. At the end of this process, you are holding in your 8 arms the 8 lowest samples from all of them. And you didn’t need to ever hold on to more than 8 rolls. Cool.

(If you aren’t really an octopus, maybe you’re doing NLP or psycholinguistics and trying to draw N random sentences containing the word “octopus” from a corpus, without holding all occurrences of “octopus” in memory.)

Of course, that algorithm I gave isn’t great. In particular, you have to look at each sushi roll. But of course you could draw random numbers for the next several rolls and not even look at them if they are too large. Amazingly, you can sample 8 rolls uniformly, from N total in less than O(N) time. Thank this guy. In fact, drawing n samples uniformly from N is doable in O(n(1+\log(N/n))) time.

Pretty, clever algorithm.

the multiplicity of maxes

Posted by piantado on June 20, 2009

Here’s a thought which recently came up: optimization problems with many possible solutions are more likely to have a single best solution than optimization problems with few solutions. A problem with 10 possible solutions is more likely to have multiple “best” solutions than one with 1000 possible solutions. Why might you think this?

Well, you might think about it in the following way: suppose that value or “goodness” or utility of each possible solution to an optimization problem is sampled from some underlying (discrete) distribution. So the optimal solution is the max of this distribution of samples. This assumption probably isn’t true of many real kinds of optimization problems, but let’s go with it for a minute.

If this assumption were true, then as your sample size increases, the best solution takes on a more and more extreme value in the underlying distribution. Analogously, the tallest person in a room with 5 people will be shorter than the tallest person in a room with 25 people. But if the max gets more and more extreme, its value is lower and lower probability (for any distribution, in the limit), and so you are less and less likely to have sampled multiple solutions with that maximum value. So the max should tend to be unique as the problem gets more and more complex. Analogously, if you grab 10 people and measure their height, you are more likely to find two people tied for tallest (rounded to, say, the nearest inch) than if you grabbed 1000 people.

Wait, that’s not obviously true. Actually, it’s not even right.

Here’s a simulation where you draw N samples from a poison distribution (lambda=0.1). The X axis is log N. The red line shows the probability of finding a tie for the maximum value. (Code here.)

What’s surprising, maybe, is that the probability of a tie for first place oscillates (it’s so surprising I thought that R had a bug in it’s random number generation until I wrote it in another language). In fact, for many distributions the limiting probability probability of a tie for the maximum value doesn’t exist. This was originally investigated for geometric distributions; for these distributions, you can picture the game is that N people flip a coin. If a person gets a tail, they are eliminated. The winner is the last person to be eliminated, but there is some probability that on the next flip, everyone left will be eliminated and thus tie as winners. What does the probability that there is a tie for winner, as a function of N? Well, it oscillates.

In 1995 Baryshnikov, Eisenberg, and Stengle proved the general theorem you really want: the limiting probability of a tie for the max exists if and only if

\lim_{j} \frac{P(X = j)}{P(X > j)}=0

where P is the underlying discrete distribution on j=1,2,3,\ldots . If the limit doesn’t exist, you get something that looks like the plot above. So, for many distributions, the probability of a tie for the maximum oscillates as the sample sizes increases. Bizarre, I think.

But here is one way to understand it. Imagine having dawn a sample, finding the max, and slowing adding more samples. As you add more and more samples, you will get the max again before sampling something larger, assuming that the max is sufficiently higher probability than the numbers that follow it (which is one way to read what happens when the above limit is not 0). Then with the new max, you start all over, getting the new max again before finding a new new max. So your probability of getting the max more than once oscillates with the sample size.

This might all be irrelevant to actual optimization problems since I’m not sure which ones have spaces of possible solutions which can be thought of as draws from some underlying distribution. But I thought it was odd anyway that the probability of a tie for the max of N samples oscillates as N increases. So yeah.

Fair redistricting

Posted by piantado on April 08, 2009

A few days ago when I was riding home on the T with John Kraemer, we ran into his friend, Brian Olson, a guy from google who has some neat political-computational ideas. Pretty quickly on the ride home, he sold me on the idea of fair redistricting, which is meant to provide an impartial way to solve problems of gerrymandering (Hilariously, wikipedia tells me that this is a portmanteau of “Gerry” and “Salamander,” the former being the governor of MA and the latter being the shape of the district chosen to help him).

The idea of fair redistricting is really clever: define an objective measure for the best districting scheme, and use that to define the districts, rather than allowing politicians to haggle over them. You can picture having some metric for how “good” a districting scheme is, and then setting them by law to be the best according to goodness. This is clever because it leaves no room to wiggle. I find it hard to argue with.

Brian Olson’s scheme is to pick a districting such that the average distance of people in the district to the center of the district is minimized. Of course, there are others, but that happens to be relatively straightforward to compute (though finding the best is hard). On his webpage, he provides a program (note: requires protobuf) for computing the best districting scheme for each state according to this metric. This scheme produces regions that look quite reasonable, for instance converting the gerrymandered FL districting on the left to the objectively “better” one on the right.

And not only is it objectively better, but it bears no hallmarks–weird shaped regions–of a gross political system.

There are of course a few additional things you want. You want there to be a unique solution (for some reason, I find this plausible when there are so many constraints.. more on that later maybe). You also want the solution to be findable–or at least able to be approximated. Here that doesn’t mean finding a district whose score is close to the best score. It means finding a map whose districting is close to the best one, which is a slightly different problem. A practical way around this search problem might be to have each party submit their preferred districting, and then choose the one with the highest score according to the objective function.

Anyway this looks like the right direction to move in: formalize our political intuitions of goodness, and go by them without letting the grimy politicians get in the way! Thanks Brian!

Perl 6

Posted by piantado on April 02, 2009

Just saw a talk by Larry Wall, the inventor of Perl. I expected him to resemble a stereotype of computer network admins, but he showed up in a surprising form: wearing all brown, an apparently Australian cowboy hat, and a horseshoe mustache (thanks wikipedia!). He gave a bizarre talk consisting mainly of several hundred slides with single words on them, that somehow wove into his attempts to describe programming, science, and whatever.

Larry is an interesting fellow. Wikipedia says he basically started out in linguistics at Berkeley, wanting to translate the Bible into novel written languages. Instead he joined JPL. He is most famous not for winning the obfuscated C context twice, but for inventing the programming language perl. By the way, in case you are interested in doing any kind of corpus work or anything where you need to shuffle around data, I would definitely recommend perl. It is amazing.

Larry talked about the upcoming Perl 6 release. The driving goal behind this, he explained, was to make Perl a better version of X than X is (where X is any other language). This is achieved by making the Perl syntax/parser itself changable within programs. That is, you can change how the syntax of the perl interpreter itself works. For instance, you can use the perl command

use Lisp;

and then write code in Lisp syntax. This is a super cool idea that means perl literally contains every other programming language as a proper subset.

Larry’s linguistics training seems to bleed into his language design, and he described several design choices that were motivated by noticing what types of things people can handle in natural language. For instance, some pieces of perl code look locally ambiguous, which he shook off by saying that people could deal with ambiguity well (which they can). He also described how someone suggested being able to write conditionals of the form X == 1|2|3 (which were called “quantum superpositions”), which would be true if X==1 or X==2 or X==3. At first he didn’t like this idea, until he noted that you can say in english “If X is 1 or 2 or 3.”

Even better, the new Perl 6 looks like it will have great support for functional programming (allowing things like closures, currying, etc.), objects, and multithreading.

I am strangely excited.

A quirk of the mighty t-test

Posted by piantado on March 12, 2009

This situation came up in Rebecca’s class last Wednesday. Some students had gathered data that, looked something like this made up version:

This shows for each subject, their RTs on two conditions of the experiment (Green/Blue). If you throw this sucker into the paired t-test, you find that the difference between conditions is, in fact, not significant (t = -1.8994, df = 10, p > 0.08), even though the difference in means is large (-1.28), and every subject shows the correct trend. But the signed-rank test comes out (p < 0.003).

As Talia pointed out, there’s one subject, the first one, which has a much bigger effect size than the others, and the t-test takes this into account. And, if you remove this first subject, the T-test comes out significant (t = -5.8054, df = 9, p < 0.0005). This is because the t statistic takes cares about the variance in the differences (for each subject) between conditions. If you leave in the first subject, this variance is large and the t statistic decreases in magnitude.

So adding a data point with the correct trend can actually decrease your t statistic, and make this difference nonsignificant. Why is this counterintuitive? Because the first point is a much bigger effect than the others in the right direction! And removing it actually decreases the mean difference between conditions (to -0.62).

Beware!

A good puzzle 4

Posted by piantado on March 04, 2009

David told me a good one last night. Here it is:

Suppose that there is a machine which can perfectly predict the future. It is never wrong. You get to play the following game with it. Two boxes are in a room and you can come into the room and open either one or both boxes. Yesterday, the machine predicted whether you would open one or both boxes, but you don’t know what it predicted you would do.

However this machine is a jerk. Last night it predicted what you would do, and snuck into the room with a bundle of cash. If the machine predicted you will open both boxes, it put no money in either box. If it predicted you would open only one box, however, it put $5000 in one box and $5000 in the other. You get to the room and must decide to open one box or two before seeing in either.

Well if you open one box and the machine predicted it, you will win some money. But, then again, if you are going to just open one box, then the perfect predictor already predicted it so there is going to be more money in the other box too. So you should just open both boxes because the money is already there. So if you are going to open one, it is rational to really open two.

Then again, the machine is a perfect predictor, and if it knew you would do this, you should only open one box. But then …

Wow! 2

Posted by piantado on February 15, 2009

Here’s the most amazing thing I read recently (in Mar-Apr edition of American Scientist–probably the best science writing around–which attributed this puzzle to David Blackwell).

Suppose that you write down two numbers, each on a slip of paper. You shuffle them up and put them, face-down, on a table. I am free to flip over the first one. Now I say whether or not the first number is larger than the second number. If I am right, you give me $100. If I am wrong, I give you $100.

First question: is this a fair bet for you? You should try for a minute to prove that this is a fair bet (and may “succeed”). Informally, you might think the first number doesn’t give me any information about the second since you could write down anything, so it seems like there is no strategy I can use to, on average, do better than breaking even.

Ahh our dumb intuition. Here’s how I can make money on this bet. This is the remarkable part. I flip over the first slip of paper. I then choose a number at random (say, from a normal distribution). If the random number is larger than the one I flipped over, I tell you that the hidden number is also larger than the one I flipped over. If the random number is smaller than the one I flipped over, I tell you that the hidden number is also smaller than the one I flipped over.

How could this possibly make me money on an even bet? Well suppose that you wrote down numbers X and Y. When my random number is less than both X and Y, then I will win 50% of the time since I will turn over the smaller number first half the time, and the larger number first half the time. In both of these cases, I will tell you the hidden number is the smaller of the two numbers, and so will be right half the time. Similarly, when the random number is larger than both X and Y, I will be right half the time.

But, when the random number is between X and Y, I will will all the time. To see this, suppose that X < Y. When I turn over X, my random number is larger than it, so I say that the unseen number, Y, is larger than the seen number X, and I am right. When I turn over Y first, my random number is smaller than it, so I say that the unseen number, X, is smaller than the seen number Y, and I am right.

So when I randomly choose a number between X and Y, I win all the time. Thus, if I sample from a distribution which assigns nonzero probability to every range, I will on average win more than half the time. Crazy, huh?

How long till I take over? 1

Posted by piantado on January 20, 2009

I played around with a fun problem this weekend with Celeste. We started to talk about descendants and ancestors. What are the odds that someone descended from me will be alive T years in the future? What are the odds I will have no descendants in 10000 years? What is the relationship between the number of kids I have and the odds my descendants will all die out?

There are some funny phenomena relevant to this. Mitochondrial Eve is one. The number of people descended from Genghis Kahn is another. In Knuth’s Volume 1, he talks about a “proof” by H. W. Watson that under certain assumptions, infinitely many people will be born in the future, but each family line will die with probability 1 (pg 383). This is of course logically inconsistent, and Knuth proves so: that in an infinitely tall tree, you can find an infinite path of descent. If people live forever, someone’s family will live forever. Which should be obvious.

Here’s a simple model for thinking about infinite family trees. Suppose that there is some bounded population P. To create people for the next generation, the following is repeated: two people are chosen at random, bred, and their (one) child inhabits the next generation. That child is a descendant of their two parents. This is repeated until the next generation is filled up with people. And so on. This neglects a lot of important things–like gender, natural selection, sexual selection, and the grossness of incest. But maybe its not such a bad start…

First, two observations. If the population size is bounded, then it is eventually going to be the case that everyone is a descendant of me, or nobody is. This is because everyone being my descendant, and nobody being my descendant are fixed points—once either is true, it is true for all of time. And each generation there is some probability of either of these happening, so given enough time, one must come true. The second thing to notice is that the expected number of descendants of me at the next time step is 1-(1-p)^2 = p(2-p), where p is the proportion of people who are my descendants at the current time step. This is because the probability of getting someone in the next generation who is not a descendant of me is the probability that neither parent was a descendant of me, or (1-p)(1-p). (This also shows that p=0 and p=1 are the only fixed points). Thus, at each generation the number of my descendants should grow by a factor of (2-p). Things are looking good.

But of course there is a lot of variability, especially when you only have a few kids initially. So how many kids should someone have in order to be pretty sure your genes will take over? Since I know more about perl than stochastic processes, I wrote a little script to figure this out. If anyone wants to solve this analytically, I’d like to know, because I can’t run the perl script very quickly on a population of size 6 billion.

Here are some results, showing the probability that everyone is eventually a descendant of me for various population sizes and number of kids I have, averaged over 100 runs:

size=1000 size=10000 size=100000
1 Kids 0.8 0.78 0.79
2 Kids 0.95 0.96 0.94
3 Kids 1.0 0.99 0.99
4 Kids 1.0 1.0 1.0

So, you don’t need many kids to eventually take have a good odds of taking over. At least when the population is relatively small. But in the larger population, probably most people’s kids reproduce themselves more readily than in this model, so you don’t need many kids to eventually take over. I wonder if anyone has attempted to model the distribution of family names along these lines…

shiny new concepts

Posted by piantado on December 21, 2008

I was recently watching some kids play in the airport. They had a toy airplane that they would wave around through the air and pretend it would fly. I was thinking about what makes play so much fun, and then realized that whenever I learn something new, I also enjoy flinging it around and seeing what its really made of. For adults, it may be things like calculus, bayes theorem. I remember when I first learned about limits in calculus, it really changed how I thought about things like, say, physical movement and speed and acceleration. I thought about it all the time, past the time I felt like I really got it simply because it was fun to do so. Graduate school has involved a lot of playing with Bayes theorem and information theory and related ideas. Every time I learn a new one (say, explaining away, graphical models, rational analysis, etc) I find myself playing with it in a way that I think is a lot like children’s play. Think of some examples, some neat cases, look for it in new places. See what happens when you throw it at the wall. Of course, for kids the amazing new concepts are simpler things–like elephants, flight, gender, chivalry, and revenge. I wonder if I was just learning these things, if I’d be enthusiastic about thinking about them as a kid is. Maybe we just all like to take our shiny new concepts out for a drive.

But there’s something beside the shoreline, moving across the beachhead

Posted by piantado on December 08, 2008

A funny thing came up in conversation a few days ago. Do you think the following is true:

Men on average have more relationships than women.

Sounds plausible enough, right? The funny thing is that (assuming for simplicity that every man is in a relationship with a woman and vice versa and there are equal numbers of men and women) this can’t possibly be true! Because if exactly one man and one woman is in each relationship, then the sum of all men in relationships and the sum of all women in relationships must be equal. But if there are equal numbers of men and women, the averages must be the same.

Is the first sentence as tempting as,

On average, men are in longer relationships than women.

Or

There are more men in relationships than women.

It just seems very compelling to me that yes, it’s true men have more relationships than women. And it takes a little reasoning to see why this can’t be true. But these two other sentences seem progressively less compelling. I wonder why. Celeste suggested it might be that men talk (or lie!) about more relationships than women do.

I think the mistake is maybe on par with things like the conjunction fallacy. So does “average” really mean arithmetic mean, as opposed to, say, mode, or median? Or when we answer the first question, are we answering a question about typicality or representativeness (as with Linda)? Is it more typical–whatever that means–for a man to have more relationships than a woman?

JFK’s clubs 1

Posted by piantado on November 23, 2008

Just read an interesting letter from TICS by Paul Bloom and Susan Gelman: Psychological essentialism in selecting the 14th Dalai Lama. In it, they discuss the fact that humans tend to act as though physical objects can acquire some kind of nonphysical essence. For example, John F Kennedy’s golf clubs sold at auction for $772,500. That’s pretty silly.

It’s especially silly when you consider that there is no physical characteristic of the golf clubs that link them to JFK. They are just golf clubs! They are worth that much not because they are made of gold, or because the objects are specially different from any others. They are worth that much because of their history. Yet, nothing about their current state contains any real information about that history (or maybe they have JFK’s DNA on them?).You could swap in some other golf clubs from the same manufacturer and nobody would know! So what’s so special about them?

On the one hand, I find this intuition very compelling–that there is something essentially different about those golf clubs as compared to any others. Their history is different. They’ve been through different things. But on the other hand, I don’t believe that they are interestingly different as physical objects from other golf clubs. You could never tell by looking at them who’s owned them! What’s a hardcore materialist to do?

I wonder about psychological essentialism poking its head into other social and political domains. Americans seem to care less about the number of Iraqi deaths than American deaths. We also care less about the rights of immigrants, and noncitizens. How much of our national and social in-group mentality is driven by this idea that somehow this group of people is different from this other group of people? That somehow differences of origin or culture are somehow relevant to what rights a group should have?

Said another way, how could a country seriously believe that its citizens were so essentially different from any other group, that rights should only be guaranteed to citizens? How can anyone seriously believe that those golf clubs are so essentially different as to be worth three quarters of a million dollars? Is thinking that American citizens deserve things other groups don’t any different from thinking JFK’s clubs have intrinsic value?

visualizing nonindependence

Posted by piantado on October 28, 2008

Here’s a problem I recently had: suppose you are given a sample of some joint distribution on X and Y. That is you are given a bunch of (X,Y) pairs. On thing you may want to know is whether the distribution P( Y | X ) depends on X. You can’t always test this by doing a regression–sometimes the expected of Y may not depend on X, but the distribution of Ys might. Sometimes P(Y | X) may be entirely different types of distributions for various X.

But here’s one fun technique: it is easy to compute the marginal densities P(Y) and P(X) using, say, density estimation. Then we can do a 2D density estimation where we weight each data point (x,y) by 1 / ( P(x) P(y) ).

Weighting the density estimation like this doesn’t recover a function for the actual density. Instead, it gives you a value P(Y | X) / P(Y). To see this, note that the density we get out should approximate

P(X,Y) / ( P(x) P(y) ) = P(X) P(Y | X) / ( P(x) P(y) ) = P(Y | X) / P(Y)

Thus, if P(Y | X) is independent of X–and therefore equal to P(Y)–you will recover a constant function. So if the function looks non-constant, you know that there is some kind of interaction. This thing P(Y | X) / P(Y) can be interpreted as how much the probability changes by conditioning on X. That is, how much the probability of that specific Y value depends on whether or not you know X. If you take logs, you get how much the suprisal of Y changes by conditioning on X.

An example is shown below:

This was generated by taking X to be distributed uniformly in [0,1]. Y is then chosen to be a normal distribution with SD 1, centered at 10*X. I ran this density estimation technique, and plotted the log densities. As things get more red, they get more negative. A script to generate it (and can be modified to play with other data) can be found here. If you squint a bit, you can see how this represents that distribution’s nonindependence: for example, if you don’t know X, a high Y is very possible and maybe as likely as a small Y. But if you know X is small, a high Y is very unlikely. So, when X is small and Y is large, you get red, meaning that by conditioning on X, the probability in that region goes down a lot. Hopefully this can help visualize how they are nonindependent.

I love dynamic programming

Posted by piantado on September 14, 2008

Here’s a fun problem that came up recently: how many possible English words are there of length L? A word is a “possible” English word if it obeys the constraints of English phonotactics. For example, in English there are no words that begin with the sounds “bn”, meaning that “bnark” is not a “possible” English word.

For simplicity, we’ll assume that English phonotactics is completely captured by the set of triphones, or combinations of three phonemes). For example, “#bn” is never seen in English words (where # denotes the beginning or end of a word), so we’ll assume “#bn” is forbidden in English. But “#st” is allowed because you see it in words like “start” and “star.” So, its relatively easy to take a big list of English word pronunciations and come up with a set of all three phoneme combinations that English allows.

But then the fun part. Given these restrictions about what triples of phonemes are allowed, how could we count this number of possible words that are allowed of length L? The naive algorithm is to generate phoneme strings of length L, and for each string see if it contains only allowed triples of phonemes. This is bad because there are P^L such strings to check (where P is the number of phonemes), and this grows too fast to be practical for anything other than small L.

But there’s a better option: dynamic programming. Because of the local nature of constraints we don’t actually need to search through all possible phoneme combinations of length L. Instead, we just store the number of all words ending in XY for each pair of phonemes XY. We’ll use the notation C(XY,L) to be the number of strings of length L that end in XY. We start with strings of length two and set C(XY,2)=1 if #XY is an allowed string. In other words, for each allowed #XY, there is only one allowed string ending in XY.

Then we repeat the following: set C(YZ,L) to be the sum over all X of C(XY,L-1) such that XYZ is allowed. This counts the number of strings you get by extending each string that ends in XY by one phoneme.

To count the actual number of allowed words, we have to restrict endings to the word boundary symbol #. So set N(L) to the sum over all XY of C(XY,L) such that XY# is allowed. Then N(L) gives you the number of possible English words, according to the triphone constraints. The neat part is that instead of requiring P^L time, this is upper bounded by L*P^3 time, meaning that we have taken a problem which was exponential in L and made it linear in L and polynomial in P. I love dynamic programming.

If you’re curious, here are some (exact) counts of the number of possible words:

length    count    logcount
2    444    6.09582456243222
3    6400    8.76405326934776
4    84247    11.341508239272
5    1278377    14.0611018625933
6    18478098    16.7320966968031
7    268961125    19.4100774103939
8    3915334048    22.0881664879088
9    56997708070    24.7662768946743
10    829696173526    27.4443254147294
11    12077076407040    30.1223302598591
12    175801721337926    32.8003778925875

And here is a perl script to compute it (note, requires CELEX).

“just” theories and the science of the ordinary

Posted by piantado on August 31, 2008

It’s often hard to explain to people what I do. If I tell them I study how kids learn language, the almost universal response I get is “don’t parents just teach their kids?” Or if I say I study language processing, trying to understand how people convert a sound signal into a meaning and how they deal with things like ambiguity, the typical response is “don’t people just figure it out?” Yes and yes. Parents just teach kids language and people just figure out what sounds mean what. Can I have my PhD now?

I often wonder if physicists and biologists and chemists get these kinds of responses when they try to describe their work. Has there been some physicist somewhere who tried to explain his latest theory of quantum gravity, but was cut off by someone saying “well don’t things just fall?” Or a biologist who was explaining the developmental dynamics of the patterns on butterflys’ wings, but was interrupted by “well don’t they just grow like that?”

Probably there has. But I think it’s worse for us cognitive scientists because we study things which people find easy and effortless. People feel like they understand them. Almost everything we study is done without serious conscious effort, and so it feels to us like we are “just” solving the problem. I remember reading the PDP book and seeing an example in the first few pages about reaching for a cup of coffee. How do people reach for their cup of coffee on their desk? Well you just reach for it.

But when you do, you face all kinds of computational complexities that we are only beginning to be able to solve. You must recognize your coffee cup, perhaps from a view you’ve never seen before. You must recognize other things on the desk, and move your muscles in such a way as to reach for the cup and not knock anything else over. You have to shape your hand to the handle, figure out how much force to exert to pick it up without dropping it, but not crush it.  How to support its weight, how to move it without spilling it, how far to move it, how to translate your three-dimension image of the world into motor actions, etc. etc. You do all this effortlessly–you just pick it up.

The fact that you “just” do it is the remarkable part. It’s not the deeper explanatory theory we want to find; it is a statement of the problem we are trying to solve. How do you reach for the cup? How do kids “just” learn language? How do adults “just” understand it?

The hard part is understanding the computational processes that explain the how and the why. If you’ve programmed a lot, you know the painful process of making everything correct and explicit in a program. You know it’s very hard to take high-level things which are intuitive and make them explicit. “Lift the cup so that you don’t spill it and avoid putting it on your keyboard” are fine, intuitive instructions for another human. But it’s remarkably hard to implement them in a program which require explicitness about everything–how much force to send to which motors, etc etc. If I had to describe what cognitive science is, I would say that’s it’s pretty much that: making explicit the computational processes that underlie what we do. The fact that we, as humans, find it easy to execute those computational processes is mostly irrelevant, except that it prevents people from seeing how remarkable we are as biological-computational machines.

But we aren’t just remarkable; we are also very weird. Take language: you sit on one side of the room and I on the other, and I get some idea in my head. We don’t know what it really means for me to have the idea, but I get one, and decide to convey it to you. To do this, I shake the air a little bit, bumping the molecules back and forth, and they ricochet and bounce between us, eventually bumping your eardrums. You interpret these molecular bumpings in some way, using a code that you inferred as a child from listening to the bumpings around you. And the code is not easy to learn–this is why you can’t understand a language you never heard. For one, it is richly structured in such a way to let us communicate arbitrarily complex ideas–I can talk about John, or John’s grocer, or John’s grocer’s brother’s uncle, or the uncle of John’s grocer’s brother who once flipped off the pope–and (intrusively) construct some new mental object in your brain. Also, the mapping between the bumpings and what the bumpings mean is somewhat arbitrary: not much would change if “dog” meant cat and “cat” meant dog. The code is hard enough to learn that a very smart monkey can’t do it, but easy enough that every human can–without explicit instruction. After feeling some of these bumpings, you get a new idea in your head. The idea you construct has the power to influence your actions and beliefs impressively: it might lead you to cry, laugh, reconsider your politics, or blow up a federal building. Isn’t that strange? Remarkable? What would a martian think of our quaint communication system?

I think seeing and appreciating the weirdness and complexity of the world is why some people–myself included–do basic science. Often, the most fundamental progress in science comes from recognizing that something we thought was intuitive really isn’t. Did you know that hot water can freeze faster than cold water, that you can only insert “fuckin” in certain places in English words (”stu-fuckin-pendous” sounds okay but “stupen-fuckin-dous” doesn’t), or the world is crazily different for tiny organisms (or other low-Reynolds number life)?

What the cognitive scientist has is awe for the ordinary; an appreciation of how hard things really are to do, even though we find them easy. Fortunately, once you see the complexity in one cognitive act, it is easy to find it in everything you do. Give me an algorithm for how you move your fingers to pull your housekey out on your key chain. Or how you catch yourself when tripping. Or how you can decide if an animal off in the distance is a donkey. How can you recognize someone by how they walk? How do you reason about nonexistent or abstract things (can a unicorn trip? sneeze? laugh?)? How you decide if it’s wrong to sleep with your stepsister? Why can’t John McCain smile? And after the mundanely easy, things get even harder–how do the Car Talk guys diagnose a Ford? How does a musician improvise? How does an expert play chess, or writer choose sentences?

I think this is why I do basic science–there is so much to be in awe of when you really look. Anyone can appreciate the concert, but only a lucky few get to appreciate the act of reaching for the instrument.

security is not free 1

Posted by piantado on August 24, 2008

I was standing in airport security a while ago, wondering about how much all of this security cost society. Unfortunately, this is a hard thing to quantify: how much is our time worth? How much does it cost us to stand a few minutes in an airport security line while you fly? I could imagine quantifying it in terms of lost wages or the cost to pay TSA workers, but what is best?

One way that’s good is to figure out how much time is lost per year–how many hours does it take away from American’s lives every year? The TSA reports that there were 708,400,522 people screened in 2006, with an average wait time of 3.79 minutes. This comes out to 5108 people-years of lost time due to airport security screenings. Or, if you assume that people live to be 80 years old, airport security costs an equivalent of 63 person-lifetimes every year. 63 lifetimes. American lost 63 lives worth of time due to airport security. How many lives worth of time lost due to hijackings? None.

It’s worth thinking about what an optimal airport security policy would be. It should be clear that if security were too lax, we would be losing more people to terrorism than airport security wait lines. Conversely, if security were too tight, we would be losing more people due to security wait times. Picture a system in which you were screened for a few weeks before flying. This might totally eliminate terrorism on aircraft (but probably not), yet it would cost us a substantial amount of time. The way to determine if it is too much time is to see if the medicine costs more American lifetimes than the disease. Which, at 63-0, it seems to.

Said another way, what matters is the total number of lives lost–lives lost to security policies plus lives lost to terrorism. If no lives are lost to terrorism, then we could reduce the total number of lives lost by decreasing the security until the number of lives worth of lost time due to security is equal to the number lost due to terrorism. This increases the lives lost to terrorism, but decreases the total lives lost overall.

This means it is not rational to have a “zero tolerance” policy for airline terrorism. The best policy allows as many deaths, on average, from airline terrorism as it does from waiting in security lines. Unfortunately, it’s not clear how much to adjust it–maybe a small change would lead to a huge number of terrorist deaths.

Of course, people may have a different definition of “best” than the one here. After all, waiting in a security line is a guaranteed loss of 3.7 minutes, but potentially dying from terrorism is a small probability of a much bigger loss of minutes. Prospect theory tells us that people are risk averse for small probabilities of losses. So, people might rather pay the 3.7 minutes than have a very small chance of being killed by a terrorist. Even if that’s what people want, though, anything other than an equal number of deaths due to terrorism and security wait times costs society more human lifetimes.

What’s the point? The point is that security is not free. It is not free in terms of our rights and civil liberties. And it is demonstratably nonoptimal in terms of the number of American lifetimes worth of time it costs. The only thing worse than terrorism costing Americans part of their lives is the government doing it more so.

reflections on September 11′th and memory

Posted by piantado on August 22, 2008

In 1991, Anderson and Schooler published a paper, “Reflections of the environment in memory” in which they attempted to provide a rational explanation for the laws of human memory. Human memory shows characteristic power law decays in a number of ways; for example, Ebbinghaus’ data shows a power law relationship between performance (as measured in percent savings when relearning a list) and time elapsed since learning.

Anderson and Schooler attempted to provide a rational explanation for why memory decays like a power law: many things in the environment also decay like power laws, and so human memory may have evolved its “forgetting” function in order to match the probaiblities with which things occur in the environment. Memory may representing what information is needed when. They showed that many naturally occuring phenomena scale like power laws, including the probability of a word appearing in a headline, the probability of words appearing in parental speech, and the probability of receiving an email message from someone.

I became curious about power laws and media reporting. For example, what is the decay law for media news stories: is there a power law relationship between the amount of time that’s passed since an event and the number of news stories that mention it? Does human memory match this pattern, and, maybe more interestingly, does the pattern exist because of how human memory works, or does is it the other way around?

I whipped up a quick perl script to count the number of stories google news finds for “September 11th” restricted to each of the months following September 11th 2001. These show a very clear power law decay.

Here, the red dots are Septembers and the black dots are any other month. The x-axis is number of days since September 11′th 2001, and the Y axis is the number of newspaper stories mentioning September 11′th. The power law decay is clear for Septembers and other months, and the slopes of each line are remarkably similar: -0.5628 for Septembers and -0.5773 for other months. This means that newpaper stories mentioning September 11′th are dying out at about the same rate for September, and other months. The intercept for Septembers is 13.28 and 12.12 for other months.

So media reportage of September 11th shows a power law decay. It would be fun to test other big events too (and small ones) but google banned my script since it requested webpages in quick succession, so I may leave it to someone else to write a program that can scan google news and look like a human.

By the way, using this law we can compute when a month should pass with only one newpaper story referencing September 11′th: 48,467,636 years for a September to pass only mentioning it once, and 3,530,904 years for a non-September month.

p-value sniffing

Posted by piantado on August 22, 2008

Something I see sometimes is people running experiments and repeatedly checking p-values to see when they are significant. I don’t know what this is called, so I’ll call it p-value sniffing–you sniff the p-values every so often, and stop testing more subjects when things smell good for your hypothesis. Everyone knows that p-value sniffing is bad, but I wondered how much it really hurt, so I made a quick R script (here) to play around with it.

In the script, you can set N to be the number of subjects you run. There is a vector called sniff_at which sees which values you sniff p-values at. The script works by pretending to run a t-test experiment in which the null hypothesis is true, and sniffing p-values at each of the sniff_at values. If it finds anything significant at any of these sniffed values, it stops the experiment and rejects the nully hypothesis. The script simulates a bunch of runs, and prints out the average number of times the null hypothesis generates data sets which are significantly different.

Therefore, you can interpret the output of this program as something like adjusted p-values: how often do you get a large enough difference of means under the null hypothesis and sniffing experimental procedure. If you didn’t sniff at all (until your one sniff at the end of the experiment) you should get the correct significance value.

Here is some output for N=40 and N=20, and two different significance levels:

Sniff at p<0.05 p<0.01
40 0.0518 0.0098
20,40 0.0796 0.0161
20,30,40 0.0996 0.02
20,25,30,35,40 0.1156 0.0266
5,10,15,20,25,30,35,40 0.1725 0.0421
30,31,32,33,34,35,36,37,38,39,40 0.1021 0.0217



You can see that when we just test at 40, we get about the right significance levels of p=0.0518 and p=0.0098. But with more sniffing, your stats go to crap, but not as fast as I would have thought. It’s certainly not as bad as doing multiple comparisons, but still not good!