2009-03-18

More Wikipedia statistics

I'm doing some statistics on the WEX data dump, and I am once again hitting memory limits, at least unless I filter out the uncommon stuff. That is almost entirely possible, since every common word should appear a number of times in any reasonably large chunk of Wikipedia articles. Current stats are:

words: 29657715 sentences: 1795296 articles: 13801 unique words: 1182046
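
The filtering idea mentioned above could look roughly like the sketch below. It is not the code behind these numbers; the function names, `prune_every` and `min_count` are made up for illustration. The assumption is that articles arrive one at a time as plain strings, and that every so often all words seen fewer than `min_count` times are thrown away to keep the table small.

```python
from collections import Counter

def prune_rare(counts, min_count):
    """Delete every word seen fewer than min_count times; return how many were dropped."""
    rare = [word for word, n in counts.items() if n < min_count]
    for word in rare:
        del counts[word]
    return len(rare)

def count_words(articles, prune_every=1000, min_count=2):
    """Count lower-cased, whitespace-separated words, pruning rare ones periodically.

    prune_every and min_count are hypothetical knobs, not values from the post.
    """
    counts = Counter()
    for i, text in enumerate(articles, start=1):
        counts.update(text.lower().split())
        if i % prune_every == 0:
            prune_rare(counts, min_count)
    return counts
```

The price is that the counts become approximate: a word that is pruned and then shows up again starts over from zero. For common words that should hardly matter, since they cross the threshold well before the next pruning pass.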

Words is the number of whitespace-separated strings I've seen. Sentences is the sentence count according to WEX, which by the way is way off for some articles, such as Andre Agassi, where it treats "No." as the end of a sentence when what comes next is actually Agassi's world ranking. Articles is self-explanatory, and unique words are unique words, this time with all characters allowed but everything lower-cased. So, from 14K articles the number of unique words is already more than 1M. This is very different from my older run, which ignored anything that was not a-zA-Z and needed a couple of hundred thousand articles to reach that number.
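
To show why the two runs differ so much, here is a small sketch of the two ways of counting unique words. Both functions are guesses at what the runs did, not the actual code: the first keeps every character and only lower-cases, and the second assumes the older run pulled out runs of ASCII letters and also lower-cased them.

```python
import re

def unique_words_all_chars(text):
    """Unique whitespace-separated strings, lower-cased, punctuation kept."""
    return set(text.lower().split())

def unique_words_letters_only(text):
    """Unique runs of a-z after lower-casing (assumed behaviour of the older run)."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Made-up sample text, not taken from WEX.
sample = "The cat, the Cat and the cat. (cat)"
print(len(unique_words_all_chars(sample)))
print(len(unique_words_letters_only(sample)))
```

With all characters allowed, "cat", "cat,", "cat." and "(cat)" each count as a separate word, so every bit of punctuation, every number and every piece of markup glued to a word inflates the unique count.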
