2007-02-20
More Wikipedia distance!
I recently mentioned a Wikipedia distance service in this post, and I've now found a much more current one: Six Degrees of Wikipedia.
2007-02-16
Number of unique words in the english language...
I am trying to use statistics, mostly n-gram statistics to peel some useful data off of Wikipedia. This poses a bit a challenge for someone like me who do not own a cluster of machines to do these things on. (hint: Google) The best-speced machine I own is actually a laptop: 1GiB RAM, Core Duo 1.66Ghz. This is fine for almost anything I can think of, but when it comes to Artificial Intelligence its just not anywhere near enough.
Take this scenario: You want to have a frequency of all words occurring in Wikipedia, so that you can use it later to exclude not-so-important data. Now, how many unique words could there really be in Wikipedia? Maybe one million? That sounds like a lot, and should be more than the true number, right? Wrong. Well, this depends on how you count I guess. I do not yet have a useful stemmer to remove different forms of the same word, and I will always include misspellings. Still, I only include words with a-z, A-Z in them, remember this will exclude common things like "it's", "don't" and so on.
I am still finding more than 2 million words, in parsing less than half the articles in Wikipedia. I am also ignoring all other namespaces other than the main wikipedia article namespace so talk and userpages are not included in this.
So why is this 2M words a problem? Well, firstly, Java uses 16 bits per character. Lets ponder that each word is in average 8 characters in length, and means that each word now consumes at least 24 bytes, and probably even more. Add in the storage requirements for the HashMap and Integer that is required to keep track of frequencies and you have a lot of memory usage. This means, that even if I "optimize" my strings to only store a byte[] of ASCII, I can still only store 2M words in 400+M of RAM!
And there is a lot more words than 2 million, unfortunately. If we look at the Google n-gram data we see that they found 13 million words!
That would require me to have something like 2.6 GB of RAM available to Java, and I just don't.
The obvious solution here is of course to use a proper, disk-based database but that is painfully slow! Compare: 1000 articles taking ~2 seconds, or it taking about 25 minutes! This would mean that the 6M+ articles would take approximately 14 weeks to analyze. This was with Apache Derby (durability=test, autocommit=false). HSQLDB is faster, but is unusable. Maybe I will have to use MySQL after all. Also remember that this step was planned to be a simple pre-optimizer stage for my n-gram goodies....
Take this scenario: You want to have a frequency of all words occurring in Wikipedia, so that you can use it later to exclude not-so-important data. Now, how many unique words could there really be in Wikipedia? Maybe one million? That sounds like a lot, and should be more than the true number, right? Wrong. Well, this depends on how you count I guess. I do not yet have a useful stemmer to remove different forms of the same word, and I will always include misspellings. Still, I only include words with a-z, A-Z in them, remember this will exclude common things like "it's", "don't" and so on.
I am still finding more than 2 million words, in parsing less than half the articles in Wikipedia. I am also ignoring all other namespaces other than the main wikipedia article namespace so talk and userpages are not included in this.
So why is this 2M words a problem? Well, firstly, Java uses 16 bits per character. Lets ponder that each word is in average 8 characters in length, and means that each word now consumes at least 24 bytes, and probably even more. Add in the storage requirements for the HashMap and Integer that is required to keep track of frequencies and you have a lot of memory usage. This means, that even if I "optimize" my strings to only store a byte[] of ASCII, I can still only store 2M words in 400+M of RAM!
And there is a lot more words than 2 million, unfortunately. If we look at the Google n-gram data we see that they found 13 million words!
That would require me to have something like 2.6 GB of RAM available to Java, and I just don't.
The obvious solution here is of course to use a proper, disk-based database but that is painfully slow! Compare: 1000 articles taking ~2 seconds, or it taking about 25 minutes! This would mean that the 6M+ articles would take approximately 14 weeks to analyze. This was with Apache Derby (durability=test, autocommit=false). HSQLDB is faster, but is unusable. Maybe I will have to use MySQL after all. Also remember that this step was planned to be a simple pre-optimizer stage for my n-gram goodies....
Labels:
ai,
artificial intelligence,
java,
programming,
wikipedia
2007-02-14
Apache Derby versus HSQLDB
So I tried to use HSQLDB for a little hobby AI project of mine, that is, mining information from Wikipedia. Now HSQLDB is so fast it is just silly. Thats all good. Now, I wanted to store a histogram in a table because the number of items was too great to store in RAM/HashMap. I convert the code to use the DB instead of the HashMap, and all is fine. Now I want to see the resulting topmost entries in this histogram thing. So I do something along these lines:
SELECT TOP 10 * FROM histo ORDER BY cnt;
Does this work? No. OutOfMemoryError. Why?? It turns out that HSQLDB does not use indexes for ORDER BY, and hence it tries to build a temporary result that consists of the entire database. I had a look at the source, and I was determined to fix this, however ugly the solution would be.
But then I found Apache Derby, and it looks to be all that I want. It does not seem to be as fast as HSQLDB, but on the other hand I should be mostly I/O bound anyway since my databases will be many times my RAM in size. Also, Derby seems to excel at the embedded and PreparedStatement corner, and that is exactly what I'm doing. 100% of my recurring SQL statements are already Prepared.
HSQLDB is not a very young project, but I must say Derby seems to be about 10 times more mature to me.
HSQLDB 0, Apache Derby 1.
SELECT TOP 10 * FROM histo ORDER BY cnt;
Does this work? No. OutOfMemoryError. Why?? It turns out that HSQLDB does not use indexes for ORDER BY, and hence it tries to build a temporary result that consists of the entire database. I had a look at the source, and I was determined to fix this, however ugly the solution would be.
But then I found Apache Derby, and it looks to be all that I want. It does not seem to be as fast as HSQLDB, but on the other hand I should be mostly I/O bound anyway since my databases will be many times my RAM in size. Also, Derby seems to excel at the embedded and PreparedStatement corner, and that is exactly what I'm doing. 100% of my recurring SQL statements are already Prepared.
HSQLDB is not a very young project, but I must say Derby seems to be about 10 times more mature to me.
HSQLDB 0, Apache Derby 1.
2007-02-13
How to compile ffmpeg statically
It took me some time to figure this one out. I've never understood all the details of shared vs. static linking. Just thought I would share my experience. This was to enable AAC encoding in linux, and not having to depend on the shared libraries. You can of course add any necessary libraries that you need.
- Build libfaac normally (.tar.gz files are somewhat problematic, (linebreaks?) use CVS instead to get properly formatted files. Centos 4.4 was unable to compile this due to outdated autotools, used Debian testing.)
Build ffmpeg with this:
./configure --prefix=/home/x/ffmpeginstall/ --enable-faac --extra-libs=/home/x/ffmpeginstall/lib/libfaac.a --enable-gpl --extra-cflags=-I/home/x/ffmpeginstall/include --disable-ffplay --disable-ffserver --disable-shared --disable-debug --extra-ldflags=-L/home/x/ffmpeginstall/lib
Subscribe to:
Posts (Atom)