2008-06-10

HtmlUnit, MultiThreadedHttpConnectionManager and memory leaks

I have been having a wonderful time at work. A small coding project, built mostly on HtmlUnit, was almost done and working properly. But what happens? By chance I notice that it is leaking memory: PermGen space, even.

This was coded as a plugin for a larger product, and was dynamically reloaded at every invocation. I first had to rework the plugin ClassLoader into a post-delegating classloader so that I could override the version of HttpClient already in the product. At first I could not be sure that my classloader changes were not what caused the leak, but eventually I concluded that they were not. The next step was to narrow down where the leak happened. Long story short, I found that if I changed HtmlUnit to not use the MultiThreadedHttpConnectionManager from HttpClient, it did not leak. I did not really want to do this though, being unsure of how HtmlUnit actually used it, and also because we have multiple threads using HtmlUnit.
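For reference, a post-delegating (child-first) classloader can be sketched roughly like this. It is a minimal illustration and not the code from the plugin: the class name PostDelegatingClassLoader is made up, it assumes the plugin jars are given as URLs, and it uses modern Java syntax, so the generics and the locking call would have to go on a JDK 1.4 target.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical name; a sketch of a child-first ("post-delegating") classloader.
public class PostDelegatingClassLoader extends URLClassLoader {

    public PostDelegatingClassLoader(URL[] pluginJars, ClassLoader parent) {
        super(pluginJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Reuse a class this loader has already defined.
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                try {
                    // Look in the plugin's own jars first, so that a bundled
                    // library can shadow the version shipped with the host.
                    loaded = findClass(name);
                } catch (ClassNotFoundException notInPlugin) {
                    // Only then fall back to the normal parent-first lookup.
                    loaded = super.loadClass(name, false);
                }
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }
}
```

Anything not found in the plugin jars (java.lang.String, for instance) still comes from the parent, so only the libraries the plugin actually bundles get overridden.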

The thing that solved the issue was to call shutdownAll in the connection manager. I am not allowed to access that from my code as a user of HtmlUnit though, and I did want to avoid having to recompile anything, so I used reflection to subvert the access checks. Calling shutdown on the one manager did not work, however, nor did closing the connection, which HtmlUnit already did by the way.
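The reflection workaround looks roughly like the sketch below. It assumes the connection manager is the HttpClient 3.x class org.apache.commons.httpclient.MultiThreadedHttpConnectionManager and that shutdownAll is a static method taking no arguments. The helper class and method names are made up for illustration, and it has to be given the classloader that actually loaded the plugin's copy of HttpClient.

```java
import java.lang.reflect.Method;

// Hypothetical helper; a sketch of calling shutdownAll through reflection.
public class ConnectionManagerShutdown {

    // Assumption: fully qualified name of the HttpClient 3.x connection manager.
    static final String MANAGER_CLASS =
            "org.apache.commons.httpclient.MultiThreadedHttpConnectionManager";

    // Looks up a static no-argument method by name, switches off the
    // access check for it, and invokes it.
    static Object invokeStaticNoArg(ClassLoader loader, String className,
                                    String methodName) throws Exception {
        Class<?> type = Class.forName(className, true, loader);
        Method method = type.getDeclaredMethod(methodName, new Class<?>[0]);
        method.setAccessible(true);
        // The receiver is null because the method is static.
        return method.invoke(null, new Object[0]);
    }

    // Shuts down every connection manager known to the HttpClient copy
    // loaded by the given classloader.
    static void shutdownAllConnectionManagers(ClassLoader pluginLoader)
            throws Exception {
        invokeStaticNoArg(pluginLoader, MANAGER_CLASS, "shutdownAll");
    }
}
```

From plugin code this would be called once the work is done and before the plugin is unloaded, for example as ConnectionManagerShutdown.shutdownAllConnectionManagers(getClass().getClassLoader()).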

I can only assume this is some obscure bug that nobody else ever trips, but now at least if somebody does, they might find this as a reference.

I could not use the latest HtmlUnit because I needed JDK 1.4 compatibility, so this was done with HtmlUnit 1.13. Oh, and 1.14 needed CSS libraries that clashed with the regular DOM libraries, which broke classloading. Not sure why that does not work when I can safely override HttpClient with a newer version.

2008-06-06

Semantic Web is freaking cool! (and on a roll it seems)

Been surfing around for semantic web websites to find ontologies or datasources to (ab)use for Yet Another AI-Project From Me. This is what I found:


  • True Knowledge. Incredibly cool question-answering frontend to an incredibly complex data model, with a moderately complex and severely boring input process. Not free data: I cannot download a dump of their database.

  • Freebase. Took me some time to dig into this, actually, but I like what I am seeing. The data model seems a lot simpler than True Knowledge's, or at least that is what I think (subclassing and transitivity missing?). Inputting stuff is 2 to 10 times quicker and easier. For bulk stuff it is infinitely easier, since TK does not support that at all. Free data, but not RDF!

  • Faviki. Very nice and easy to use semantic social tagging/bookmarking service.

  • RDFScape. Visualizer for Cytoscape. Looks very nice, but I have not had time to play with it yet.

  • Attempto Controlled English. Maybe the least exciting of the bunch, but is useful for my NLP-related project.




I also got access to Twine. Oh my god, what a bore. I just did not see the idea behind it, and the interface turned me off so much that after my third visit I never came back.

True Knowledge has some awesome NLP parsing going on, but it also fails miserably quite often. I have a simple idea to at least get me started: it pretty much builds upon AIML-style patterns to extract meaning from text, specifically Wikipedia.

Freebase has a weak model in my opinion: there does not seem to be a real inheritance hierarchy, and the "upper ontology" is basically missing. The missing upper ontology is not such a big deal though, I think. There should be a set of "uppermost" classes in Freebase that can be mapped to SUMO/YAGO/DBpedia/WordNet or whatever to help with any inferencing or analogical thinking.

I cannot help but think that five years from now, "semantic" will not really exist as a separate thing. Everything will be semantic by then, or long gone. AGI is not far behind, either. I predict a surge of NLP success in the coming few years, mainly with knowledge-intensive approaches. Common sense is still the missing piece of the puzzle. The efforts above do not concentrate on it at all, but rather on knowledge that is useful to humans. Remember, common sense is boring for humans to input and administer, since it is all so basic.