I love dynamic programming

Posted by piantado on September 14, 2008

Here’s a fun problem that came up recently: how many possible English words are there of length L? A word is a “possible” English word if it obeys the constraints of English phonotactics. For example, in English there are no words that begin with the sounds “bn”, meaning that “bnark” is not a “possible” English word.

For simplicity, we’ll assume that English phonotactics is completely captured by the set of triphones, or combinations of three phonemes). For example, “#bn” is never seen in English words (where # denotes the beginning or end of a word), so we’ll assume “#bn” is forbidden in English. But “#st” is allowed because you see it in words like “start” and “star.” So, its relatively easy to take a big list of English word pronunciations and come up with a set of all three phoneme combinations that English allows.

But then the fun part. Given these restrictions about what triples of phonemes are allowed, how could we count this number of possible words that are allowed of length L? The naive algorithm is to generate phoneme strings of length L, and for each string see if it contains only allowed triples of phonemes. This is bad because there are P^L such strings to check (where P is the number of phonemes), and this grows too fast to be practical for anything other than small L.

But there’s a better option: dynamic programming. Because of the local nature of constraints we don’t actually need to search through all possible phoneme combinations of length L. Instead, we just store the number of all words ending in XY for each pair of phonemes XY. We’ll use the notation C(XY,L) to be the number of strings of length L that end in XY. We start with strings of length two and set C(XY,2)=1 if #XY is an allowed string. In other words, for each allowed #XY, there is only one allowed string ending in XY.

Then we repeat the following: set C(YZ,L) to be the sum over all X of C(XY,L-1) such that XYZ is allowed. This counts the number of strings you get by extending each string that ends in XY by one phoneme.

To count the actual number of allowed words, we have to restrict endings to the word boundary symbol #. So set N(L) to the sum over all XY of C(XY,L) such that XY# is allowed. Then N(L) gives you the number of possible English words, according to the triphone constraints. The neat part is that instead of requiring P^L time, this is upper bounded by L*P^3 time, meaning that we have taken a problem which was exponential in L and made it linear in L and polynomial in P. I love dynamic programming.

If you’re curious, here are some (exact) counts of the number of possible words:

length    count    logcount
2    444    6.09582456243222
3    6400    8.76405326934776
4    84247    11.341508239272
5    1278377    14.0611018625933
6    18478098    16.7320966968031
7    268961125    19.4100774103939
8    3915334048    22.0881664879088
9    56997708070    24.7662768946743
10    829696173526    27.4443254147294
11    12077076407040    30.1223302598591
12    175801721337926    32.8003778925875

And here is a perl script to compute it (note, requires CELEX).