2007-03-18

Broken NAD 317 amp

My NAD 317 stereo amp is broken: the right channel is exhibiting some very bad noise regardless of volume and input source. I opened it up but I could see no signs of obvious damage such as open caps or burned areas.

These amps seem to have a problem with the quality of the capacitors in the power supply, but since my problem is virtually non-existent on the left channel I assume the power supply is fine. I guess caps could still be to blame on the right channel, but unfortunately I have no way of measuring these things, and I am not about to jeopardize my speakers just to try and save some money on repairing the amp or buying a new one. If I could just find a specific reference where someone fixed this specific problem, I might be able to fix it myself. As I (almost) did for our Canon G1. Yes, I found the fix and bought the caps, but I actually let a professional solder it. Nicely enough, he did it for free since I did all the hard work of actually dismantling the camera, which was a bitch. :)

2007-02-20

More Wikipedia distance!

I recently mentioned a Wikipedia distance service in this post, and I've now found a much more current one: Six Degrees of Wikipedia.

2007-02-16

Number of unique words in the English language...

I am trying to use statistics, mostly n-gram statistics, to peel some useful data off of Wikipedia. This poses a bit of a challenge for someone like me who does not own a cluster of machines to run these things on (hint: Google). The best-specced machine I own is actually a laptop: 1 GiB RAM, Core Duo 1.66 GHz. This is fine for almost anything I can think of, but when it comes to artificial intelligence it's just not anywhere near enough.

Take this scenario: you want a frequency count of all words occurring in Wikipedia, so that you can use it later to exclude not-so-important data. Now, how many unique words could there really be in Wikipedia? Maybe one million? That sounds like a lot, and should be more than the true number, right? Wrong. Well, it depends on how you count, I guess. I do not yet have a useful stemmer to remove different forms of the same word, and misspellings will always be included. Still, I only count words consisting of the characters a-z and A-Z, which excludes common things like "it's", "don't" and so on.
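
In case anyone is curious, the counting itself is conceptually nothing more than this (a sketch, not my actual code; the class and method names are made up):

import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    private final Map<String, Integer> freq = new HashMap<String, Integer>();

    public void addArticle(String text) {
        for (String token : text.split("\\s+")) {
            // Keep only tokens made purely of a-z/A-Z; "it's", "don't" etc. are dropped.
            if (!token.matches("[a-zA-Z]+")) {
                continue;
            }
            String w = token.toLowerCase();
            Integer old = freq.get(w);
            // Each unique word costs a String key plus an Integer value in the map.
            freq.put(w, old == null ? Integer.valueOf(1) : Integer.valueOf(old.intValue() + 1));
        }
    }

    public int uniqueWords() {
        return freq.size();
    }
}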

I am still finding more than 2 million words, having parsed less than half the articles in Wikipedia. I am also ignoring all namespaces other than the main article namespace, so talk and user pages are not included in this.



So why are these 2M words a problem? Well, firstly, Java uses 16 bits per character. Let's say each word averages 8 characters in length; the characters alone then take 16 bytes, so with object overhead each word consumes at least 24 bytes, and probably even more. Add in the storage requirements for the HashMap and the Integer needed to keep track of frequencies, and you have a lot of memory usage. This means that even if I "optimize" my strings to store only a byte[] of ASCII, I can still only fit 2M words in 400+ MB of RAM!
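
The byte[] "optimization" would look roughly like this, if I ever bother (a sketch; the AsciiWord name is made up and I have not measured it):

import java.util.Arrays;

// Sketch: store words as ASCII bytes instead of Java Strings,
// roughly halving the per-character cost when used as a HashMap key.
public final class AsciiWord {
    private final byte[] chars;

    public AsciiWord(String word) {
        chars = new byte[word.length()];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = (byte) word.charAt(i);   // safe: only a-z/A-Z words get this far
        }
    }

    public int hashCode() {
        return Arrays.hashCode(chars);
    }

    public boolean equals(Object other) {
        return other instanceof AsciiWord
                && Arrays.equals(chars, ((AsciiWord) other).chars);
    }

    public String toString() {
        return new String(chars);   // default charset is fine for pure ASCII
    }
}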

And there are a lot more words than 2 million, unfortunately. If we look at the Google n-gram data we see that they found 13 million unique words!

That would require me to have something like 2.6 GB of RAM available to Java, and I just don't.

The obvious solution here is of course to use a proper, disk-based database, but that is painfully slow! Compare: 1,000 articles take ~2 seconds in memory versus about 25 minutes through the database. At that rate the 6M+ articles would take roughly 6,000 × 25 minutes ≈ 2,500 hours, or approximately 14 weeks, to analyze. This was with Apache Derby (durability=test, autocommit=false). HSQLDB is faster, but unusable, as described in the post below. Maybe I will have to use MySQL after all. Also remember that this step was planned to be a simple pre-optimizer stage for my n-gram goodies....
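
For the record, the Derby settings I mention boil down to something like this (a sketch; the database name is made up):

import java.sql.Connection;
import java.sql.DriverManager;

public class DerbySetup {
    public static Connection open() throws Exception {
        // Trade durability for speed: skip log syncs (test mode only!).
        System.setProperty("derby.system.durability", "test");
        Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
        Connection conn = DriverManager.getConnection("jdbc:derby:wikifreq;create=true");
        conn.setAutoCommit(false);   // commit manually, e.g. once per batch of articles
        return conn;
    }
}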

2007-02-14

Apache Derby versus HSQLDB

So I tried to use HSQLDB for a little hobby AI project of mine, that is, mining information from Wikipedia. Now, HSQLDB is so fast it is just silly. That's all good. Then I wanted to store a histogram in a table, because the number of items was too great to keep in RAM in a HashMap. I converted the code to use the DB instead of the HashMap, and all was fine. Then I wanted to see the topmost entries in this histogram. So I did something along these lines:

SELECT TOP 10 * FROM histo ORDER BY cnt DESC;

Does this work? No. OutOfMemoryError. Why? It turns out that HSQLDB does not use indexes for ORDER BY, so it tries to build a temporary result consisting of the entire table. I had a look at the source and was determined to fix this, however ugly the solution would be.

But then I found Apache Derby, and it looks to be all that I want. It does not seem to be as fast as HSQLDB, but on the other hand I should be mostly I/O bound anyway, since my databases will be many times the size of my RAM. Also, Derby seems to excel in the embedded and PreparedStatement corner, and that is exactly what I'm doing: 100% of my recurring SQL statements are already prepared.
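
Concretely, the recurring statements get reused roughly like this (a sketch; the histo/cnt names come from the query above, the word column is an assumption):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: bump a word's count in the histogram table using reused PreparedStatements.
public class HistogramUpdater {
    private final PreparedStatement update;
    private final PreparedStatement insert;

    public HistogramUpdater(Connection conn) throws SQLException {
        update = conn.prepareStatement("UPDATE histo SET cnt = cnt + 1 WHERE word = ?");
        insert = conn.prepareStatement("INSERT INTO histo (word, cnt) VALUES (?, 1)");
    }

    public void increment(String word) throws SQLException {
        update.setString(1, word);
        if (update.executeUpdate() == 0) {   // no existing row: insert a fresh one
            insert.setString(1, word);
            insert.executeUpdate();
        }
    }
}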

HSQLDB is not a very young project, but I must say Derby seems to be about 10 times more mature to me.

HSQLDB 0, Apache Derby 1.

2007-02-13

How to compile ffmpeg statically

It took me some time to figure this one out; I've never understood all the details of shared vs. static linking. I just thought I would share my experience. The goal was to enable AAC encoding on Linux without having to depend on the shared libraries. You can of course add any other libraries that you need.


  1. Build libfaac normally. (The .tar.gz releases are somewhat problematic (broken line endings?), so use CVS instead to get properly formatted files. CentOS 4.4 was unable to compile it due to outdated autotools; I used Debian testing instead.)

  2. Build ffmpeg with this:

    ./configure --prefix=/home/x/ffmpeginstall/ \
      --enable-faac --extra-libs=/home/x/ffmpeginstall/lib/libfaac.a \
      --enable-gpl --extra-cflags=-I/home/x/ffmpeginstall/include \
      --disable-ffplay --disable-ffserver --disable-shared --disable-debug \
      --extra-ldflags=-L/home/x/ffmpeginstall/lib



2006-12-28

"Wikipedia Distance"

I found the website I was talking about earlier; it's available here: http://www.omnipelagos.com. It seems very rough around the edges, in addition to having very outdated Wikipedia content. I might just make my own version of that site soon, if I find the time for it.

Wikipedia seems incredibly well-linked, since everything seems to be at a distance of 4 or 5. This means that weighting links qualitatively becomes important: a very simple first step is to consider the language links as having higher "connectedness" than random linked words in sentences. The next step could be simple template-based recognition of common high-valued links such as "X is a Y", "X is a sort of Y" and so on. Doing much more than this very quickly becomes an academic exercise in implementing general AI.
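
Something like this is the kind of weighting I have in mind (a sketch; the weights and the pattern are completely made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: assign a weight to a wiki link based on how it appears in the article text.
public class LinkWeighter {
    // "[[Foo]] is a [[Bar]]"-style sentences suggest a strong semantic relation.
    private static final Pattern IS_A =
            Pattern.compile("\\[\\[([^\\]]+)\\]\\] is (a|an|a sort of) \\[\\[([^\\]]+)\\]\\]");

    public double weight(String articleText, String linkTarget, boolean isLanguageLink) {
        if (isLanguageLink) {
            return 1.0;                      // same concept in another language: strongest tie
        }
        Matcher m = IS_A.matcher(articleText);
        while (m.find()) {
            if (m.group(3).equals(linkTarget)) {
                return 0.8;                  // "X is a Y" template match
            }
        }
        return 0.2;                          // ordinary in-sentence link
    }
}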

Oh, oh, this thing would be so much easier if Wikipedia started implementing semantic tags à la Semantic MediaWiki and ontoworld.

2006-05-16

Exploiting Wikipedia for AI purposes

Firstly, I'm pretty sure I've seen a reference to a service which could tell you the wikipedia distance between two articles, but I can't for the life of me find it again. If anyone knows what I'm talking about, please leave me a comment telling me what you know about this!

Secondly, I'm perplexed that nobody has yet parsed Wikipedia and used it for common-sense or other AI-related tasks such as spreading activation. Think ConceptNet, WordNet and Open Mind. They all try to build some form of graph between concepts, you could say, and that is exactly what Wikipedia is. Wikipedia currently has 1M+ articles, easily besting both WordNet and ConceptNet in number of nodes. I'm convinced that the number of internal links also outnumbers the others mentioned here, so I think Wikipedia could be a really useful AI resource. You could even start following external links to get an even finer-grained link between WordNet nodes, although I suspect that would help very little, since very few WordNet entries reference the same external URLs.
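
And the "wikipedia distance" I keep going on about is really just breadth-first search over the link graph, something like this (a sketch, assuming the link graph already fits in an adjacency map in memory):

import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class WikipediaDistance {
    private final Map<String, List<String>> links;   // article title -> outgoing link targets

    public WikipediaDistance(Map<String, List<String>> links) {
        this.links = links;
    }

    // Breadth-first search: number of link hops from one article to another, or -1.
    public int distance(String from, String to) {
        Queue<String> queue = new LinkedList<String>();
        Map<String, Integer> dist = new HashMap<String, Integer>();
        queue.add(from);
        dist.put(from, Integer.valueOf(0));
        while (!queue.isEmpty()) {
            String article = queue.remove();
            int d = dist.get(article).intValue();
            if (article.equals(to)) {
                return d;
            }
            List<String> out = links.get(article);
            if (out == null) {
                continue;
            }
            for (String next : out) {
                if (!dist.containsKey(next)) {
                    dist.put(next, Integer.valueOf(d + 1));
                    queue.add(next);
                }
            }
        }
        return -1;   // not reachable
    }
}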

If I had the opportunity I would experiment with these ideas, but unfortunately it seems I won't have the time, what with finishing up my thesis and then going straight to getting a job. Actually, I am already trying to get a job.

2006-04-01

Groovy is slow!

I have been trying to write a prototype for my thesis work in Groovy, but it seemed a bit slow. Sure, this is to be expected for such a dynamic language, and an interpreted one at that. But when I wrote a few microbenchmarks I noticed that interpretation didn't seem to be the culprit here.

First, I tried adding things to a hashmap. This did not reveal any big difference between Java and Groovy; the difference was about 100%, which I find acceptable. So I continued my search towards creating objects, and here Groovy seems to be adding some substantial overhead:


val = new Integer("0")
for (i in 0..<1000000) {
    Integer tmp2 = new Integer(43)
    val += tmp2
}


versus:


val = new Integer("0")
Integer tmp2 = new Integer(43)
for (i in 0..<1000000) {
    val += tmp2
}


Java: 387 ms and 206 ms; Groovy: 29873 ms and 4527 ms, respectively. Even when not creating Integers in the loop, Groovy is still much, much slower, and adding the object creation makes it intolerably slow! The Groovy code was precompiled with groovyc and then run with the -server parameter to gain some speed.
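
The Java counterpart I compared against is essentially just this (a sketch of the equivalent loop, not the exact code I timed):

// Sketch of the equivalent Java loop (the first variant, with the
// Integer allocation inside the loop), timed the same way.
public class IntegerLoop {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Integer val = new Integer("0");
        for (int i = 0; i < 1000000; i++) {
            Integer tmp2 = new Integer(43);
            val = val + tmp2;   // autoboxing, mirroring Groovy's val += tmp2
        }
        System.out.println(val + " in " + (System.currentTimeMillis() - start) + " ms");
    }
}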


I am reading in and parsing the WordNet database, and that alone takes some 400 seconds in the current version of my code, which is a bit too much to sit around and wait for every time I restart my app! Yes, there are ways around this, and I will use them eventually (such as only reloading changed code instead of restarting the entire JVM, keeping the initialized structures in RAM and ready).

2006-02-22

Rm my Mac

I saw this page http://rm-my-mac.wideopenbsd.org/ which hands out free ssh accounts on a mac mini, for the purpose of hacking it. Not my favorite pastime, for sure, but I'm sure there are others interested out there.

2006-02-08

Thesis: OpenNLP and Groovy..

Just wanted to write about what my thesis ended up being about: question classification. No chatbot, no Loebner contest, I'm afraid. It seems I will use OpenNLP and Java, but the problem is I hate Java; I find it way too slow to prototype in. So I decided to use Groovy. This did not turn out well: it took me about a day to figure out that Groovy's error messages are far from perfect, leaving me to believe that class loading was all broken in Groovy.

Anyway, both Groovy and OpenNLP seem like great projects. I've also downloaded the Stanford parser; it might be useful.

2006-01-02

Piratpartiet

Piratpartiet:
Piratpartiet aims to hold the balance of power after the 2006 election. There are between 800,000 and 1,100,000 active file sharers in Sweden, and they are all tired of being called criminals. We need 225,000 of them on our side to get past the four-percent threshold and end up in that kingmaker position. We then intend to use that position to abolish copyright.


Very interesting, and it may well be that I vote for this party, but I am not entirely convinced yet.



2005-12-15

Thesis subject: winning the Loebner contest

The Open Mind project, together with ThoughtTreasure and Learner, are all very interesting projects in their own right: Learner and Open Mind try to build up a common-sense ontology from voluntary user input on web pages, while TT tries to understand texts. The one thing common to all of these is that some version of their knowledge base is available for download. There is also data available from SUMO and OpenCYC.


The reason all this is interesting to me right now is that I am trying to choose my thesis topic, and I am very much in favor of my idea of entering a chatbot into the Loebner contest. The problem is that I understand this is a huge, enormous, even mind-bogglingly hard task. First, there are the three main parts of a chatbot: input recognition, the inference engine/reasoning/knowledge store, and generating output. Each of these is hard on its own, and I would be trying to build them all in less than 20 weeks, with no deep knowledge about inferencing, knowledge representation or linguistics. Yeah, I would say it is looking dark. But it would be so damn fun to at least try. I might have to restrict myself, though.



2005-11-28

The cashless society (Det kontantlösa samhället)

Det kontantlösa samhället:

An economy without grey money would be as hard to create as an internet without darknets


Why not a Tor for electronic transactions, where users from several different banks relay electronic transfers in order to hide the recipient and the sender? I know that at least one bank offers login with a personal code, which is pretty much a requirement, since neither a hardware token nor one-time scratch codes are particularly easy for a script to use. I don't think the system would work if it were manual. One problem is that the banks would surely block these transfers fairly quickly by detecting the typical patterns.

That said, the police presumably still have the ability to pull records from any bank, so this is really mostly about preventing any single bank from sitting on too much information, not about evading the law.


2005-11-27

Freenet Alphatest

So, Freenet is alpha-testing their next version, 0.7. If you want in, go to #freenet-alphatest on the Freenode IRC network. This version is still far from usable, but I think they need more testers anyway. Basically, you can insert files and download them from a single-threaded CLI interface.

Instructions for installing the node are here.
