2008-09-25

OutOfMemoryErrors and ObjectOutputStream

Ran into an interesting problem today - it turns out that if you use an ObjectOutputStream to write out lots of data, the data doesn't go away after OOS.writeObject() returns. Instead it gets cached in a handle map within the OOS itself, so that if it's told to write that same object instance again it can write a handle reference instead (thus saving space in the stream and avoiding infinite loops on circular references). Kewl, huh? Well, not entirely...

The downside to this approach is that if you're writing LOTS of objects (say 250,000), then even though you've designed your code to avoid holding all those objects in memory, every one of them will be held in memory until you're finished with the OOS entirely, because of that handle cache! (This explains why my simple little app was eating up over 2GB of RAM even though it's supposed to process data objects one at a time.) The fix?
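You can see the handle cache at work directly: write the same instance twice and the second write costs only a few bytes, because the stream remembers every object it has already written. A minimal sketch (the Payload class is a made-up stand-in for whatever you're serializing):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class HandleDemo {
    // Hypothetical data class, ~10 KB per instance.
    static class Payload implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] data = new byte[10_000];
    }

    public static void main(String[] args) throws IOException {
        Payload p = new Payload();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(p);
            oos.flush();
            int afterFirst = buf.size();   // full payload plus class descriptor
            oos.writeObject(p);            // same instance: only a handle ref
            oos.flush();
            int afterSecond = buf.size();
            System.out.println("first write:  " + afterFirst + " bytes");
            System.out.println("second write: " + (afterSecond - afterFirst) + " bytes");
        }
    }
}
```

The second write is a handful of bytes - and that back-reference trick is exactly why the stream must pin every object you've ever handed it.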

Two fixes came up, and I'm not 100% certain which is better in all cases, so I'll discuss them both (in the order in which I thought of them). :)

Solution #1 - Unshared Externalizable/Serializable


There's another interface in town, java.io.Externalizable, which is the cousin of java.io.Serializable, but with a few new quirks:

  1. First, unlike Serializable, Externalizable has methods (imagine that! methods in an interface!) - readExternal(ObjectInput) and writeExternal(ObjectOutput). In practice the args are usually Object(Input|Output)Streams, which implement Object(Input|Output).

  2. The second (and rather weird) quirk is that your Externalizable class must have a PUBLIC no-arg constructor. Considering that the constructor is obtained via reflection, I'm a little puzzled why it has to be public, but... whatever.
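Both quirks in one minimal sketch (Point and its fields are hypothetical stand-ins for a real DAO):

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Point implements Externalizable {
    private int x;
    private int y;

    // Quirk #2: the public no-arg constructor deserialization requires.
    public Point() {}

    public Point(int x, int y) { this.x = x; this.y = y; }

    public int getX() { return x; }
    public int getY() { return y; }

    // Quirk #1: you write and read the fields yourself, in the same order.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        x = in.readInt();
        y = in.readInt();
    }
}
```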


So first you make your DAO implement Externalizable rather than Serializable (Serializable is actually the parent interface of Externalizable), and then you replace all your ObjectOutputStream.writeObject() calls with writeUnshared() (and, on the read side, OIS.readUnshared()), which tells the streams to skip that whole handle-map business for the top-level object.

Now you might think just making your DAO unshared is enough, but NO! (unless your DAO is the data itself, which is nearly impossible, since all DAOs are, at some level, composed of primitives, primitive wrappers, or arrays/collections of primitives and primitive wrappers). So while your DAO itself might be unshared (not cached), its component parts are probably still shared (cached). That's why you have to implement Externalizable or Serializable and take control of how those fields get written. After a bit of testing I'd just go with Serializable - no funky public constructor and no casting of ObjectInput to ObjectInputStream and so forth. Six of one, half a baker's dozen of the other... :)
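A sketch of solution #1 with plain Serializable - writeUnshared()/readUnshared() replace the writeObject()/readObject() pair, and the top-level instances are no longer pinned in the handle table. The Dao class here is a hypothetical placeholder:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical DAO for illustration.
class Dao implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    Dao(String name) { this.name = name; }
}

public class UnsharedDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            // writeUnshared(): the Dao itself is not retained by the stream,
            // though its fields (the String here) may still be cached.
            oos.writeUnshared(new Dao("a"));
            oos.writeUnshared(new Dao("b"));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            Dao a = (Dao) ois.readUnshared();
            Dao b = (Dao) ois.readUnshared();
            System.out.println(a.name + " " + b.name); // prints "a b"
        }
    }
}
```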

Solution #2 - ObjectOutputStream with amnesia


There is a method on OOS called reset() which clears the handle cache and reduces memory use to nearly nothing (beyond what you were using without the OOS). It also injects a TC_RESET marker into the output stream, which, in my DAO's case, increased the output file size by about 50% (e.g., what was a 4.9M file became a 7.2M file). Basically, if you look inside the serialized file you'll see your DAO's (and its component fields') class names between every single instance, where normally they're declared once at the top. So if the size of your serialized data is an issue, this might not be a great fix. But if RAM usage is the concern, this approach is huge: my app, which used to max out at over 400M (in testing) with regular serialization, dropped to 100M with unshared writes, and then dropped all the way to less than 1M of Java heap with reset() between each writeUnshared() (hard to say exactly because there were so few Full GCs, but I'd guess it was about 400-500K). THAT's amazing (to me).
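The size/memory trade-off is easy to demonstrate: write the same records with and without reset() after each one and compare stream sizes. The with-reset stream is larger because the class descriptors get re-written after every reset. A sketch, with a made-up Row record:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ResetDemo {
    // Hypothetical record type for illustration.
    static class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final long id;
        Row(long id) { this.id = id; }
    }

    static int streamSize(boolean resetEach) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            for (long i = 0; i < 1_000; i++) {
                oos.writeUnshared(new Row(i));
                if (resetEach) {
                    // Clears the handle table so nothing stays pinned in
                    // memory, and emits a TC_RESET marker into the stream.
                    oos.reset();
                }
            }
        }
        return buf.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("without reset(): " + streamSize(false) + " bytes");
        System.out.println("with reset():    " + streamSize(true) + " bytes");
    }
}
```

With reset() the stream grows, but the OOS holds on to essentially nothing between writes.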


Performance Impact


I didn't see any performance impact from calling reset() between each writeUnshared(); in fact, I'd say the reset() version runs faster because it avoids all those Full GCs.

Hope this helps folks as much as learning it helped me. :)