2007-02-16

Number of unique words in the English language...

I am trying to use statistics, mostly n-gram statistics, to peel some useful data off of Wikipedia. This poses a bit of a challenge for someone like me who does not own a cluster of machines to do these things on. (hint: Google) The best-specced machine I own is actually a laptop: 1 GiB RAM, Core Duo 1.66 GHz. That is fine for almost anything I can think of, but when it comes to artificial intelligence it's just not anywhere near enough.

Take this scenario: you want a frequency count of every word occurring in Wikipedia, so that you can use it later to exclude not-so-important data. Now, how many unique words could there really be in Wikipedia? Maybe one million? That sounds like a lot, and should be more than the true number, right? Wrong. Well, it depends on how you count, I guess. I do not yet have a useful stemmer to fold different forms of the same word together, and misspellings will always be included. Still, I only count words made up of a-z, A-Z, which, remember, excludes common things like "it's", "don't" and so on.
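The counting itself is nothing fancy. Below is a minimal sketch of the kind of counter I mean; the whitespace tokenization, the WordCounter class name and the plain HashMap<String, Integer> are just illustrative assumptions, not my actual code.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal word-frequency counter sketch. Tokens are whitespace-separated,
    // and only tokens made up purely of a-z/A-Z are counted, so things like
    // "it's" and "don't" are simply dropped.
    public class WordCounter {
        private final Map<String, Integer> freq = new HashMap<String, Integer>();

        public void addArticle(String text) {
            for (String token : text.split("\\s+")) {
                if (!token.matches("[A-Za-z]+")) {
                    continue; // skip anything containing digits, apostrophes, etc.
                }
                Integer count = freq.get(token);
                freq.put(token, count == null ? 1 : count + 1);
            }
        }

        public int uniqueWords() {
            return freq.size();
        }
    }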

I am still finding more than 2 million words, having parsed less than half the articles in Wikipedia. I am also ignoring all namespaces other than the main Wikipedia article namespace, so talk and user pages are not included in this.

So why are these 2 million words a problem? Well, firstly, Java uses 16 bits per character. Let's assume the average word is 8 characters long; that is 16 bytes of character data alone, so each word consumes at least 24 bytes once you add the String object overhead, and probably even more. Add in the storage required for the HashMap entries and the Integer objects that keep track of the frequencies, and you end up with a lot of memory usage. This means that even if I "optimize" my strings to store only a byte[] of ASCII, I can still only fit 2 million words in 400+ MB of RAM!
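To make that "optimize to a byte[]" idea concrete, here is a rough sketch of the kind of ASCII-only HashMap key I have in mind; the AsciiWord class is hypothetical, and the savings are limited because the per-word HashMap entry and Integer overhead remain.

    import java.util.Arrays;

    // Hypothetical memory "optimization": store words as raw ASCII bytes
    // (1 byte per character instead of Java's 2) and use them as HashMap keys.
    public final class AsciiWord {
        private final byte[] chars;

        public AsciiWord(String word) {
            chars = new byte[word.length()];
            for (int i = 0; i < word.length(); i++) {
                chars[i] = (byte) word.charAt(i); // safe, since only a-z/A-Z get this far
            }
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof AsciiWord && Arrays.equals(chars, ((AsciiWord) o).chars);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(chars);
        }
    }

Even with keys like this, each HashMap entry, its Integer value and the assorted object headers still cost a couple of hundred bytes per word, which is roughly where the 400+ MB figure comes from.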

And there are a lot more words than 2 million, unfortunately. If we look at the Google n-gram data, we see that they found 13 million unique words!

At roughly 200 bytes per word, that would require me to have something like 2.6 GB of RAM available to Java, and I just don't.

The obvious solution here is of course to use a proper, disk-based database, but that is painfully slow! Compare: 1000 articles take ~2 seconds in memory, versus about 25 minutes through the database. At that rate the 6M+ articles would take approximately 14 weeks to analyze. This was with Apache Derby (durability=test, autocommit=false). HSQLDB is faster, but is unusable. Maybe I will have to use MySQL after all. Also remember that this step was planned to be a simple pre-optimizer stage for my n-gram goodies...
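For reference, the Derby setup was along these lines. This is only a minimal sketch: the embedded database name, the word_freq(word, freq) table and the placeholder data are illustrative assumptions, not my actual schema or code.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Rough sketch of batched inserts into an embedded Apache Derby database.
    // durability=test disables syncing to disk, autocommit is turned off, and
    // inserts are batched to reduce per-statement overhead.
    public class DerbyFrequencyStore {
        public static void main(String[] args) throws Exception {
            System.setProperty("derby.system.durability", "test");
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");

            Connection conn = DriverManager.getConnection("jdbc:derby:wordfreq;create=true");
            conn.setAutoCommit(false);

            // Fails if the table already exists; fine for a one-off sketch.
            conn.createStatement().executeUpdate(
                "CREATE TABLE word_freq (word VARCHAR(64) PRIMARY KEY, freq INT)");

            PreparedStatement insert =
                conn.prepareStatement("INSERT INTO word_freq (word, freq) VALUES (?, ?)");
            String[] words = {"wikipedia", "article", "statistics"}; // placeholder data
            for (String w : words) {
                insert.setString(1, w);
                insert.setInt(2, 1);
                insert.addBatch();
            }
            insert.executeBatch();
            conn.commit();
            conn.close();
        }
    }

Even with durability relaxed and inserts batched like this, the round trips through the storage layer are what kill the throughput compared to the in-memory HashMap.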
