2007-12-05

Domain Model - A Tale of Bad (J2EE) Design
(And a newbie developer trying to fix it)

Where I work we have a product that we can call ABC. ABC is what I would call "legacy": well over 10k lines of code (I think it is somewhere around 50-100k), and the project was started more than five years ago, in 2001 I think. I have been working on it for about 16 months now, and I am getting more comfortable with it every day since I am currently the one assigned to develop it. I have been the only developer for a couple of months now.

ABC is in need of a lot of things; I saw that my first week on the job. So many obvious things are wrong. Where to start? Well, our "service layer" (note: we do not have clearly defined layers, and nobody seems to have known what to call the bigger parts of the system, so if I said "hey, we have a problem with our service layer" nobody would understand what I was talking about) has code in it that is very, very long-winded. Some made-up code:


dto.Part p = DAOFactory.getPartsDAO().getPartById(new Integer(1));
validatePart(p);
if (p.getInventoryID() != null) {
    dto.Inventory inv = DAOFactory.getInventoryDAO().getInventoryById(p.getInventoryID());

    // Do more...
}


Hey, we use a factory! Great. But this code does very little for being so verbose, and note that this sort of code could go on for hundreds of lines. Can we spot any problems? I can, at least now; previously I really could not. I am not a very experienced J2EE developer, as this is my first job with it, fresh from university:


  • I hate the fact that we have to explicitly do "DAO things". Why is this necessary? It is not, if we for example use hibernate a bit more correctly: ABC is currently not using any mappings between entities. With those in place, we could do away with all the "DAOing", as I call it, and instead fetch one object in a transaction and navigate its collection mappings.
    (This DAO-ing also carries severe performance penalties when done this way: it has a select n+1 problem. That can be solved by being "smart", but it takes a lot of manual coding for something hibernate solves perfectly for us, practically for free. Logic is honestly more compact in an HQL query than in this sort of convoluted code: 20 lines of HQL did in some instances replace 100+ lines of Java for us, especially where we were manually fetching objects and checking them instead of simply using a WHERE clause. See the sketch after this list.)

  • We are fetching DTOs from a database, by the looks of it. Is this really the intended use of the DTO pattern? No (as far as I know, anyway). It turns out these are not DTOs in the first place, and secondly, they should not be. Our "DTO layer" here is misnamed and misguided: it is actually our domain model, although an almost completely anemic one[1]. We do in fact have another DTO layer that really is a DTO layer, and which is needed and should stay one. It is used for sending data from web services, and decoupling that from our domain model is, on the contrary, not a bad idea.
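
To give a hedged illustration of what I mean by logic being more compact in HQL, something like the sketch below replaces the fetch-then-check style from the code above. The property names (inventory, quantity) are made up for the example and assume the association mappings are in place; this is not actual ABC code, and it assumes org.hibernate.Session and java.util.List are imported.

Session session = sessionFactory.openSession();
// One query replaces the fetch-validate-fetch dance: the association is
// fetched eagerly and the "checking" happens in the WHERE clause.
List parts = session.createQuery(
        "from Part p join fetch p.inventory inv where inv.quantity > :minQty")
        .setInteger("minQty", 0)
        .list();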



So we have a more or less procedural service layer with 95% of our business logic in it, and a domain model layer with data and only trivial behavior in it. It seems straightforward enough to just move behavior into the domain model and be done with it. One slight problem is that I have introduced alternative "DTOs" (entity beans, domain objects) that are properly mapped using hibernate, adding "Mapped" to their class name and inheriting from the non-collection-mapped hibernate class. For example we might have:

Part has a subclass PartMapped. PartMapped has collection mappings, Part does not.

It can also look like this: we have PartBase, PartMapped and Part. Part is the "original" DTO, with a foreign key in the form of an Integer. We don't really want that integer in our hibernate-collection-mapped variant, so we create a base class that does not have it, and extend that while adding the collections for hibernate to populate.
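
A minimal sketch of that hierarchy (field and collection names here are my own for illustration, not taken from ABC; assume java.util imports and that each class lives in its own file):

// Shared state lives in the base class (getters/setters omitted).
abstract class PartBase {
    private Integer id;
    private String name;
}

// The "original" DTO: carries the foreign key as a plain Integer.
class Part extends PartBase {
    private Integer inventoryID;
}

// The hibernate-mapped variant: no raw foreign key; instead, collections
// that hibernate populates through the collection mappings.
class PartMapped extends PartBase {
    private Set inventories = new HashSet();
}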

Modifying the domain model to fit hibernate is a bad idea, I realize that. We need to stop depending on hibernate. I have been looking at spring lately (which ABC of course does not use) and I think it could help with a lot of this. One reason for having alternate classes for collection-mapped entities is that I do not yet trust hibernate to actually save those back: we still always call a simple save() on an entity with no collection mappings, which makes it very explicit what is happening. This is simply a matter of education though; I need to learn how hibernate works in this regard, and about things like filtered collections.

This inheritance structure makes it not hard, far from it, but a bit smelly to add behaviour to these objects. Mostly this is due to the structure being hibernate-induced, but also because of the duplication arising from it: we would need two different implementations of "public Inventory getInventory()", one for Part and one for PartMapped, one using straight accessors (hibernate needs objects to use the accessors for its collection proxies to work) and the other using old-style DAO-ing to fetch the object. We have a lot of these methods to implement if we were to move behaviour into the domain model proper.
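
To make the duplication concrete, it would look roughly like this; the method bodies are my own sketch, not code from ABC:

// In Part (no collection mappings): fall back to old-style DAO-ing.
public Inventory getInventory() {
    if (getInventoryID() == null) {
        return null;
    }
    return DAOFactory.getInventoryDAO().getInventoryById(getInventoryID());
}

// In PartMapped: hibernate populates the association through its mapping,
// so a plain accessor is all that is needed.
public Inventory getInventory() {
    return inventory;
}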

I assume the right course of action is to abstract away the database plumbing, such as the "Integer id;" fields in all domain objects, let "something" (not specifically hibernate) manage persistence for us, and then start modeling the domain properly. And then document it all and educate everyone that, now, ABC is much more agile and can withstand changes easily.
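
One hedged sketch of what that "something" could look like from the domain side: a plain interface that the domain code depends on, with hibernate (or whatever else) hidden behind an implementation wired in from the outside, for example by spring. The names are my own invention; nothing like this exists in ABC today:

// The domain model only ever sees this interface; a hibernate-backed
// implementation (or a stub for tests) is provided externally.
public interface PartRepository {
    Part findById(Integer id);
    void save(Part part);
}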

It cannot today, that is for sure. I draw the conclusion that earlier developers were "fooled" by EJB patterns or something similar, and were unable to see anything wrong with our anemic domain model. Design patterns are used, but they are localized, and in general there "is no design". No explicit design anyway; there is one that can sort of be inferred from looking at the code, but it is very vague and its boundaries are often broken by the code. It can still serve as a "new vision" for the overall design though.

2007-07-02

Amazon EC2 and S3 performance...

I was playing around with an EC2 instance, and I got this great idea that I should benchmark the disk subsystem. I had expected pretty standard performance from the disk at /mnt, but I was wrong:

Benchmarking /dev/sda1 [1537MB], wait 30 seconds
Results: 1014 seeks/second, 0.99 ms random access time

Benchmarking /dev/sda2 [152704MB], wait 30 seconds
Results: 4494 seeks/second, 0.22 ms random access time

That's pretty damn fast. To get this sort of performance out of a RAID-5 set I think you need a whole lot of disks, maybe 20 or so? It sure as hell beats any disk I have at home by a factor of somewhere between 15 and 60.
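
For reference, the kind of numbers these tools report can be approximated with a few lines of Java against a large file. This is a rough sketch of the idea, not the benchmark I actually ran:

import java.io.RandomAccessFile;
import java.util.Random;

public class SeekBench {
    public static void main(String[] args) throws Exception {
        // args[0]: path to a large file, ideally much bigger than the page cache
        RandomAccessFile f = new RandomAccessFile(args[0], "r");
        long size = f.length();
        byte[] buf = new byte[512];
        Random rnd = new Random();
        int seeks = 1000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < seeks; i++) {
            long pos = (rnd.nextLong() & Long.MAX_VALUE) % (size - buf.length);
            f.seek(pos);
            f.readFully(buf);       // one random 512-byte read per seek
        }
        long elapsed = Math.max(1, System.currentTimeMillis() - start);
        System.out.println(seeks * 1000L / elapsed + " seeks/second, "
                + (double) elapsed / seeks + " ms average access time");
        f.close();
    }
}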

I don't know if this is a Xen artifact with this particular benchmark, but if the numbers are real I have actually just found a use for EC2: running my DB-intensive information extraction jobs. Though those are still most likely better served by just buying a new machine with 4 or 8GB of RAM, which is pretty cheap now.

Edit: Oh, yeah, I understand that this disk array is shared by a number of instances. A good question then is by how many? And how likely are people to actually use the disk for any stressful activity, with it being non-persistent and all?

Edit2: Oh yes, I am stupid. This is most likely due to Xen using a sparse Copy-on-Write format disk image? It would explain it all, hitting the same physical sector over and over....

2007-07-01

Flash random access write performance

I just bought a new 2GB USB stick, a Sandisk Cruzer Micro. I have had a Pretec Tiny 256MB since before. I benchmarked both using h2benchw in windows (and seeker in linux, but only for reads).

Now, these two are actually pretty close in read performance: h2benchw reports 0.77 ms access time for the 2GB stick, and 1.77 ms for the 256MB one. Raw read speeds are, according to hdparm -tT in linux, about 25MB/s for the smaller one, and slightly lower for the Sandisk.


But the really odd difference comes when I set h2benchw up to test random access _write_ performance. The Pretec manages ~10 ms access time here, whereas the Sandisk comes in at a whopping 132 ms!! Wooha. There went some of my hope of using flash as a faster substitute for disks. X-bit Labs reports faster write access times though, specifically for Apacer HT202 sticks at 28 ms for writes. That is unfortunately still in line with hard disks, so the only real benefit is the read access times; in other words, flash will only be great when the read/write ratio is heavily biased towards reads. A typical database load might not perform so well.

2007-05-22

MIDP 2.0 and killing homebrew..

Why is it that MIDP 2.0 decided to completely lock homebrew developers out of J2ME development? If you read this first, you'll notice some people are annoyed at the way root certificates are handled: mainly that there is no certificate guaranteed to be available on every device, and that Java Verified is per-device. This is a big problem.

Another problem I have been having lately is that of free software and J2ME. Now you may ask: why can't you just put up with a few security prompts for your hobby projects? Sure, I am fine with that, as long as they are few. The problem is of course that they are not few! If you want to, for example, read maps from a memory card, you could be reading hundreds of tiles.

I think the problem here is that the APIs are actually becoming useful, and the security prompts for these sorts of applications are becoming a big bottleneck, necessitating signing. But there is no way to sign anything for free! Not as far as I know, anyway. You are also not allowed to import your own root certificate.

I think security is a good thing, but locking out power users is also bad. This whole thing is putting me off my long-needed phone upgrade.

2007-04-10

KR tutorial...

I found this interesting and simple intro to KR and I must say it is good. I don't quite agree that description logics is the way of the future though.

I also fully agree that Google most likely is doing A.I. research. If they are not, I think they are totally misjudging how useful it is and how close we are to useful A.I. applications.

This book that was linked also looks very promising, I have to add it to my Amazon wish list for sure.

2007-04-08

Nokia E70 video playback resolution

I have been trying to find information on what the Nokia 5300 and Nokia E70 are capable of playing back in terms of resolution. I would love an E70 if it can play back full-res video, and my girlfriend wants a 5300. This thread seems to imply the 5300 is unable to play back full-res video, and I suspect the E70 has the same problem, given its huge screen resolution.

2007-04-05

SleekXMPP

Another nice XMPP library that aims to make implementing or testing XEPs easy, something I really long to do.

Some day I am going to start using XMPP "for real". I have been on jabber for a couple of months, but I have also noticed that I never really IM anyone anymore now that I work all day long. I am an antisocial creature, I guess. Also, I only have bots and automated services on jabber, while every human in my roster is on MSN or ICQ.

Canon TX-1

I would very much like a Canon TX1 for when I go to Venezuela in December. Or why not for the summer vacation in Paris/Öland as well?

I see many people complaining about Canon's use of MJPEG. I like it though: it's perfect for editing, since there is no temporal compression that makes re-encoding necessary when you cut the movie. Also the quality is very high. 8GB SDHC cards are not expensive enough to be out of reach in any way. A problem is where to store all the raw footage when you empty the cards though... this will be a serious filler of HDDs.

2007-04-04

Linkbacks and blogger/blogspot

Why, oh why, doesn't Google support pingback? As far as I have understood, blogger does not support any of the linkback protocols at all.

Thankfully there is, as is common, a hack around this. The problem is that the hack this time around is very ugly: Trackback with some help from Greasemonkey:


  • You have to add the links manually

  • It is trackback, as opposed to pingback

I need to do... more m$ stuff.

Wow, I found this blog really interesting: Vista Smalltalk. It is, you might have guessed it, Smalltalk for Vista, sort of. It uses .NET and Microsoft's "AJAX", WPF/E, to do some very cool stuff in the browser. I happen not to like Smalltalk much though; I did a presentation on it when I was at university and I was not impressed. It lacked anything special enough for me to consider it useful, anyway.

The future will be very interesting when it comes to portable applications. It seems Microsoft really has got something going here, and sadly Mozilla's XUL approach will not be able to compete, even though it is not really aimed at the same thing. I really need to learn some .NET stuff for real soon, or I fear I will become obsolete in short order.

2007-03-18

Broken NAD 317 amp

My NAD 317 stereo amp is broken: the right channel is exhibiting some very bad noise regardless of volume and input source. I opened it up but I could see no signs of obvious damage such as open caps or burned areas.

These amps seem to have a problem with the quality of the capacitors in the power supply, but since my problem is virtually non-existent on the left channel I assume the power supply is fine. I guess caps could still be to blame on the right channel, but I unfortunately have no way of measuring these things, and I am not about to jeopardize my speakers just to try to save some money on repairs, or on buying a new amp. If I could just find a reference where someone fixed this specific problem, I might be able to fix it myself, as I (almost) did for our Canon G1. Yes, I found the fix and bought the caps, but I actually let a professional do the soldering. Nicely enough, he did it for free since I did all the hard work of dismantling the camera, which was a bitch. :)

2007-02-20

More Wikipedia distance!

I recently mentioned a Wikipedia distance service in this post, and I've now found a much more current one: Six Degrees of Wikipedia.

2007-02-16

Number of unique words in the English language...

I am trying to use statistics, mostly n-gram statistics, to peel some useful data off of Wikipedia. This poses a bit of a challenge for someone like me who does not own a cluster of machines to do these things on (hint: Google). The best-specced machine I own is actually a laptop: 1GiB RAM, Core Duo 1.66GHz. This is fine for almost anything I can think of, but when it comes to artificial intelligence it's just not anywhere near enough.

Take this scenario: you want a frequency count of all words occurring in Wikipedia, so that you can use it later to exclude not-so-important data. Now, how many unique words could there really be in Wikipedia? Maybe one million? That sounds like a lot, and should be more than the true number, right? Wrong. Well, it depends on how you count, I guess. I do not yet have a useful stemmer to fold different forms of the same word together, and I will always pick up misspellings. Still, I only include words consisting of a-z and A-Z; remember that this excludes common things like "it's", "don't" and so on.

I am still finding more than 2 million words after parsing less than half the articles in Wikipedia. I am also ignoring all namespaces other than the main article namespace, so talk and user pages are not included in this.
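
The counting itself is trivial, something like the sketch below; the tokenization is simplified to the a-z/A-Z rule above, and the actual parsing of the Wikipedia dump is of course more involved:

import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Split article text on anything that is not a-z/A-Z and count the rest.
    public void add(String articleText) {
        for (String word : articleText.split("[^a-zA-Z]+")) {
            if (word.length() == 0) {
                continue;
            }
            Integer old = counts.get(word);
            counts.put(word, old == null ? Integer.valueOf(1)
                                         : Integer.valueOf(old.intValue() + 1));
        }
    }

    // The number of unique words seen so far; this map is what eats the RAM.
    public int uniqueWords() {
        return counts.size();
    }
}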



So why are these 2M words a problem? Well, firstly, Java uses 16 bits per character. Let's say each word averages 8 characters in length; that means each word consumes at least 24 bytes for the string data alone, and probably more. Add in the storage requirements for the HashMap entry and the Integer needed to keep track of the frequency, and you have a lot of memory usage. This means that even if I "optimize" my strings to store only a byte[] of ASCII, 2M words still take 400+MB of RAM!

And there are a lot more words than 2 million, unfortunately. If we look at the Google n-gram data, we see that they found 13 million words!

That would require me to have something like 2.6 GB of RAM available to Java, and I just don't.

The obvious solution here is of course to use a proper, disk-based database, but that is painfully slow! Compare: 1000 articles take ~2 seconds in memory, versus about 25 minutes against the database! That would mean the 6M+ articles take approximately 14 weeks to analyze. This was with Apache Derby (durability=test, autocommit=false). HSQLDB is faster, but unusable. Maybe I will have to use MySQL after all. Also remember that this step was planned to be a simple pre-optimizer stage for my n-gram goodies....

2007-02-14

Apache Derby versus HSQLDB

So I tried to use HSQLDB for a little hobby AI project of mine, namely mining information from Wikipedia. Now, HSQLDB is so fast it is just silly. That's all good. I wanted to store a histogram in a table because the number of items was too great to keep in RAM in a HashMap. I converted the code to use the DB instead of the HashMap, and all was fine. Then I wanted to see the topmost entries in this histogram. So I did something along these lines:

SELECT TOP 10 * FROM histo ORDER BY cnt DESC;

Does this work? No: OutOfMemoryError. Why? It turns out that HSQLDB does not use indexes for ORDER BY, and hence it tries to build a temporary result consisting of the entire table. I had a look at the source, and I was determined to fix this, however ugly the solution would be.

But then I found Apache Derby, and it looks to be all that I want. It does not seem to be as fast as HSQLDB, but on the other hand I should be mostly I/O bound anyway, since my databases will be many times the size of my RAM. Also, Derby seems to excel in the embedded and PreparedStatement corner, and that is exactly how I use it: 100% of my recurring SQL statements are already prepared.
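
A minimal sketch of the embedded-Derby-plus-PreparedStatement setup I mean, reusing the histo table from the query above and the durability=test/autocommit-off settings mentioned earlier. The surrounding code is made up for illustration, and it assumes the table already exists:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

public class HistoStore {
    public static void store(Map<String, Integer> counts) throws Exception {
        System.setProperty("derby.system.durability", "test"); // skip syncing, test runs only
        Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
        Connection conn = DriverManager.getConnection("jdbc:derby:histo;create=true");
        conn.setAutoCommit(false);
        PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO histo (word, cnt) VALUES (?, ?)");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            ins.setString(1, e.getKey());
            ins.setInt(2, e.getValue().intValue());
            ins.addBatch();
        }
        ins.executeBatch();         // one batch instead of one statement per word
        conn.commit();
        conn.close();
    }
}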

HSQLDB is not a very young project, but I must say Derby seems to be about 10 times more mature to me.

HSQLDB 0, Apache Derby 1.

2007-02-13

How to compile ffmpeg statically

It took me some time to figure this one out; I've never understood all the details of shared vs. static linking, so I thought I would share my experience. The goal was to enable AAC encoding in linux without having to depend on shared libraries. You can of course add any other libraries you need.


  1. Build libfaac normally. (The .tar.gz releases are somewhat problematic, possibly due to line breaks; use CVS instead to get properly formatted files. CentOS 4.4 was unable to compile it due to outdated autotools, so I used Debian testing.)

  2. Build ffmpeg with this:

    ./configure --prefix=/home/x/ffmpeginstall/ \
        --enable-faac --extra-libs=/home/x/ffmpeginstall/lib/libfaac.a \
        --enable-gpl --extra-cflags=-I/home/x/ffmpeginstall/include \
        --disable-ffplay --disable-ffserver --disable-shared --disable-debug \
        --extra-ldflags=-L/home/x/ffmpeginstall/lib