Discussion: container-managed persistence

If you followed the tutorial on container-managed persistence with JBoss, you will have seen that creating persistent, distributed objects is not really any more difficult than creating transient ones. The EJB container does most of the hard work; all the programmer needs to do is to tell it which fields are persistent. However, it isn't quite as simple as that, and naive use of CMP can lead to very inefficient programs. To see why, it's necessary to understand at least in outline how the EJB server deals with container-managed persistence.

Technical overview

In the EJB field there is a very strong correspondence between "rows of a database table" and "instances of an object". It is clear that the EJB developers had this notion in mind from the very beginning. While the specification doesn't stipulate that persistence is provided by database tables, in practice it always is. Moreover, it is tacitly assumed that the communication between the Beans and the database will be by means of SQL statements. What does this imply for container-managed persistence?

When a persistent object is instantiated, the EJB container must generate SQL code that will write a row in the table. When the object is deleted, it must generate SQL to remove it. This isn't much of a problem. When one object asks for a reference to another, the container must find (or create) that object's row in the table, read the columns, instantiate the object in the JVM, and write the data from the table into its instance variables. Because this process can be quite slow, the EJB server may choose to defer it. That is, when one object gets a reference to an object that is container-managed, the latter object may be uninitialized. Initialization from the database table takes place later, perhaps when one of the methods is called. This late initialization reduces inefficiencies arising from initializing objects that are never read, but has its own problems, as we shall see.
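The deferred loading described above can be sketched in plain Java. This is only an illustration of the idea, not how JBoss or any other container is actually implemented: "FakeTable", "LazyCd" and the read counter are hypothetical names, and an in-memory map stands in for the database table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a database table: maps an ID to a title
// and counts how many row reads have been issued.
class FakeTable {
    private final Map<String, String> titlesById = new HashMap<>();
    int rowReads = 0;

    void insert(String id, String title) {
        titlesById.put(id, title);
    }

    // Returns the title, or null if there is no such row.
    String selectTitle(String id) {
        rowReads++;
        return titlesById.get(id);
    }
}

// Hypothetical lazily initialized object: creating it touches no rows;
// the row is read on the first call that needs the data.
class LazyCd {
    private final FakeTable table;
    private final String id;
    private String title;
    private boolean loaded = false;

    LazyCd(FakeTable table, String id) {
        this.table = table;
        this.id = id;
    }

    String getTitle() {
        if (!loaded) {
            title = table.selectTitle(id);
            if (title == null) {
                throw new IllegalStateException("No row with ID " + id);
            }
            loaded = true;
        }
        return title;
    }
}

class LazyCdDemo {
    public static void main(String[] args) {
        FakeTable table = new FakeTable();
        table.insert("1", "Nocturnes");

        LazyCd cd = new LazyCd(table, "1");
        System.out.println("Reads after creating reference: " + table.rowReads);
        System.out.println("Title: " + cd.getTitle());
        System.out.println("Reads after first use: " + table.rowReads);
    }
}
```

Note that a "LazyCd" can be created for an ID that has no row at all; the failure only shows up when "getTitle()" is first called. This is the source of the problems discussed later.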

Limitations of CMP

Efficiency limitations

The main limitation is that the EJB container will probably not be able to generate database access statements with the efficiency of a human programmer. Consider this example: suppose I have a database table containing details of my music CD collection. I want to search the collection for any CD which has the text "Chopin" (case insensitive) in either the "title" or "notes" column. In SQL I could write a statement like this:

 SELECT * FROM CD WHERE title LIKE "%chopin%" OR notes LIKE "%chopin%";

The % character is an SQL wild-card and takes care of finding the required string somewhere inside the field; in some databases (MySQL with its default collation, for example) the "LIKE" operator is case-insensitive by default. How could we achieve this with a container-managed EJB? If "CD" is an EJB, the container-supplied method "findAll()" in its home interface will get all the current instances of "CD". In practice it will do this by executing a statement like

 
  SELECT * FROM CD;

and then instantiating CD for each row found. At some point it will probably store the primary key from each row of the database into the appropriate attribute of each CD instance. Then the program must examine the objects one at a time, checking whether they meet the required criteria (i.e., the word "Chopin" in the appropriate attributes). As the program iterates through the objects, the server must cause their attributes to be read from the table; it won't have done this until now because it would try to conserve memory. So for each object examined the server will generate SQL code like this:

 SELECT * FROM CD WHERE ID=xxxx;

Suppose there are 200 CDs known to the system. Rather than executing one SQL statement to get a list of all matching CDs, the CMP scheme has executed over 200 SQL statements to achieve the same effect. We can't improve the situation by calling findByTitle() and then findByNotes(), because these methods only provide exact string matches.
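To make the comparison concrete, here is a rough sketch that counts the statements each approach would issue. It does not talk to a real database or EJB container: "CountingCdTable" and "CmpSearchCost" are hypothetical names, the table is an in-memory list, and each method simply models one SQL statement of the kind shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory stand-in for the CD table that counts the
// statements a container would have to issue against a real database.
class CountingCdTable {
    private final List<String[]> rows = new ArrayList<>(); // {id, title, notes}
    int statements = 0;

    void insert(String id, String title, String notes) {
        rows.add(new String[] {id, title, notes});
    }

    // Models "SELECT * FROM CD": one statement, used here only for the keys.
    List<String> selectAllIds() {
        statements++;
        List<String> ids = new ArrayList<>();
        for (String[] row : rows) {
            ids.add(row[0]);
        }
        return ids;
    }

    // Models "SELECT * FROM CD WHERE ID=...": one statement per object.
    String[] selectById(String id) {
        statements++;
        for (String[] row : rows) {
            if (row[0].equals(id)) {
                return row;
            }
        }
        return null;
    }

    // Models the single hand-written LIKE query: one statement in total.
    List<String> selectIdsMatching(String text) {
        statements++;
        List<String> ids = new ArrayList<>();
        for (String[] row : rows) {
            if (CmpSearchCost.matches(row, text)) {
                ids.add(row[0]);
            }
        }
        return ids;
    }
}

class CmpSearchCost {
    // Case-insensitive substring test on the title and notes columns.
    static boolean matches(String[] row, String text) {
        String needle = text.toLowerCase(Locale.ROOT);
        return row[1].toLowerCase(Locale.ROOT).contains(needle)
                || row[2].toLowerCase(Locale.ROOT).contains(needle);
    }

    // The CMP-style search: find all, then load and test each object.
    static List<String> searchOneByOne(CountingCdTable table, String text) {
        List<String> found = new ArrayList<>();
        for (String id : table.selectAllIds()) {
            String[] row = table.selectById(id);
            if (matches(row, text)) {
                found.add(id);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        CountingCdTable table = new CountingCdTable();
        for (int i = 0; i < 200; i++) {
            String title = (i % 50 == 0) ? "Chopin: Nocturnes " + i : "Album " + i;
            table.insert(String.valueOf(i), title, "");
        }

        List<String> slow = searchOneByOne(table, "chopin");
        System.out.println("One by one: " + slow.size() + " matches, "
                + table.statements + " statements");

        table.statements = 0;
        List<String> fast = table.selectIdsMatching("chopin");
        System.out.println("Single query: " + fast.size() + " matches, "
                + table.statements + " statements");
    }
}
```

Both searches return the same CDs; only the number of statements differs, and that number grows with the size of the collection in the one-by-one case.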

Another efficiency limitation comes from the way the database table is updated when attributes change. There are two main ways to achieve this. The server could execute an instruction like this:

 UPDATE CD SET artist="Bloggs" WHERE ID="200";

for example. This is efficient, but requires that the "Artist" field really be called "artist". This makes it difficult to change the names of columns in the table. Alternatively the server could do a SELECT to get the current column values, delete the whole row, then insert a row with modified values. This allows a number of values to change at once and, because all values are written, it doesn't matter what the columns are called. This is the approach that JBoss uses. The problem is that if a class has ten persistent attributes, and they are altered one after the other, in the worst case this results in ten row deletions and ten row insertions.
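The worst case can be illustrated with a small counting sketch. All the names here ("RowRewriteCounter", "CdRecord", "setAll") are hypothetical, and the sketch assumes that every setter call is written back on its own; a real container may well group changes made within one transaction.

```java
// Hypothetical counter standing in for the database: it records how many
// row deletions and insertions the write-back strategy would cause.
class RowRewriteCounter {
    int deletes = 0;
    int inserts = 0;

    // Models one write-back: delete the old row, insert the new one.
    void rewriteRow() {
        deletes++;
        inserts++;
    }
}

// Hypothetical persistent object with three attributes.
class CdRecord {
    private final RowRewriteCounter db;
    private String title;
    private String artist;
    private String notes;

    CdRecord(RowRewriteCounter db) {
        this.db = db;
    }

    // Fine-grained setters: this sketch assumes the worst case, in which
    // each call is written back separately.
    void setTitle(String title) {
        this.title = title;
        db.rewriteRow();
    }

    void setArtist(String artist) {
        this.artist = artist;
        db.rewriteRow();
    }

    void setNotes(String notes) {
        this.notes = notes;
        db.rewriteRow();
    }

    // Coarse-grained setter: all attributes change, one write-back.
    void setAll(String title, String artist, String notes) {
        this.title = title;
        this.artist = artist;
        this.notes = notes;
        db.rewriteRow();
    }
}

class UpdateCostDemo {
    public static void main(String[] args) {
        RowRewriteCounter db = new RowRewriteCounter();
        CdRecord cd = new CdRecord(db);

        cd.setTitle("Nocturnes");
        cd.setArtist("Bloggs");
        cd.setNotes("Chopin");
        System.out.println("Separate setters: " + db.deletes + " deletes, "
                + db.inserts + " inserts");

        db.deletes = 0;
        db.inserts = 0;
        cd.setAll("Nocturnes", "Bloggs", "Chopin");
        System.out.println("Single call: " + db.deletes + " deletes, "
                + db.inserts + " inserts");
    }
}
```

One design choice this suggests is to give a Bean a method that changes several attributes in a single call, so that the row is rewritten once rather than once per attribute.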

Limitations of late initialization

Suppose we want to find whether a CD with a specific ID exists on the system. With CMP this corresponds to finding whether there is a row in the database table with the corresponding value of the "id" column. The code in Java might look like this:

// Look up the CD Bean's home object by its JNDI name
Object ref = jndiContext.lookup("cd/CD");

// Narrow the reference to the Bean's Home interface
CDHome home = (CDHome)
        PortableRemoteObject.narrow(ref, CDHome.class);

// Find the CD whose primary key is "xxx"
CD cd = home.findByPrimaryKey("xxx");

What will happen if "xxx" is not the ID of a CD that exists? There would seem to be two sensible approaches: either "findByPrimaryKey" could throw an exception, or it could return a null reference. In either case the client could easily tell whether the object exists. In practice, the EJB server may do neither of these things. It may well return a reference to a CD bean instance, which appears to be a perfectly valid object. However, none of the object's attributes will be initialized; initialization won't happen until the object is really required. This is done to improve efficiency; there is, after all, no need to initialize the object unless it will be needed. However, if the program continues to execute on the basis that "cd" refers to a valid object, an exception will be thrown later when the program tries to interact with it.

This may not be a problem; if the ID had been generated from some earlier database access then we may be sure it really exists, and any failure to find it in the database represents a serious failure. However, if the data has come from the user, it is reasonable to expect some errors of typing or memory. Things can be made more predictable by always reading one of the attributes of an object after getting a reference to it, like this:

CD cd = home.findByPrimaryKey("xxx");
String dummy = cd.getId();

If there is no CD whose ID field is "xxx" then this will throw a java.rmi.NoSuchObjectException. This gets around the problem of late initialization, but at the cost of an additional SQL access.
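If this check is needed in several places, it could be wrapped in a small helper, sketched below. The interfaces "CdRef" and "CdRefHome" are simplified, hypothetical stand-ins for the tutorial's "CD" and "CDHome" (the real ones extend the EJB interfaces, and a real finder method also declares FinderException). "LateInitHome" is a fake home, written only so that the sketch runs without a server; it mimics the late-initialization behaviour described above.

```java
import java.rmi.NoSuchObjectException;
import java.rmi.RemoteException;
import java.util.Collections;
import java.util.Set;

// Simplified, hypothetical stand-in for the CD remote interface.
interface CdRef {
    String getId() throws RemoteException;
}

// Simplified, hypothetical stand-in for the CD home interface.
interface CdRefHome {
    CdRef findByPrimaryKey(String id) throws RemoteException;
}

class CdLookup {
    // Returns a usable reference, or null if no CD has this ID.
    static CdRef findOrNull(CdRefHome home, String id) throws RemoteException {
        CdRef cd = home.findByPrimaryKey(id);
        try {
            cd.getId(); // force initialization now, while failure is easy to handle
            return cd;
        } catch (NoSuchObjectException e) {
            return null;
        }
    }
}

// Fake home that mimics late initialization: it always hands back a
// reference, and only fails when the reference is first used.
class LateInitHome implements CdRefHome {
    private final Set<String> existingIds;

    LateInitHome(Set<String> existingIds) {
        this.existingIds = existingIds;
    }

    @Override
    public CdRef findByPrimaryKey(String id) {
        return () -> {
            if (!existingIds.contains(id)) {
                throw new NoSuchObjectException("No CD with ID " + id);
            }
            return id;
        };
    }
}

class CdLookupDemo {
    public static void main(String[] args) throws RemoteException {
        CdRefHome home = new LateInitHome(Collections.singleton("1"));
        System.out.println("ID 1 found: " + (CdLookup.findOrNull(home, "1") != null));
        System.out.println("ID xxx found: " + (CdLookup.findOrNull(home, "xxx") != null));
    }
}
```

The helper still costs the extra access; it only makes sure that a missing CD is detected at the point of lookup rather than somewhere later in the program.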

Suitability of container-managed persistence

In many applications of object-oriented programming we have had to accept that some things that are philosophically objects are in reality implemented as something else. The "something else" may be a row of a database table, or a line of a text file, or whatever; at some point we had to code the interface between the object-oriented system and the "something elses". Entity beans go some way towards eliminating this problem; things that are philosophically objects can be modelled as objects, with methods and persistence. But this comes at a cost. It's worth asking whether the "CD" EJB in the tutorial example really is an object in a meaningful sense. It has attributes, but it doesn't do very much. We don't really gain all that much by making it an object; it could have remained a database row, and been manipulated through the "CDCollection" class. Of course this isn't as elegant, but elegance can come at a high price.

In summary then, container-managed persistence is straightforward to implement using JBoss (or any other EJB server, for that matter) but needs to be used quite carefully if serious inefficiencies are to be avoided.