High-Performance Batch Processing with Java Enterprise Edition

The benefits

Enterprise software developers and corporate IT architects have established the Java Enterprise Edition (JEE) platform as a leading choice for building enterprise software applications. The platform is widely used for everything from eCommerce Websites to back office data aggregation systems. Its versatility and reliability as an enterprise computing platform are well established.

But this wasn't always so. Sun initially trumpeted Java as a desktop platform that would bring rich content to Web applications in the form of Java applets that run locally in a user's Web browser. It was also touted as a thick-client desktop application development tool that would be widely used to build applications that could run on any computer (remember write once, run anywhere?).

Sometime in the late nineties, Java application development changed direction, and the resulting software now runs mostly on corporate servers rather than corporate workstations. Today, a substantial portion of Web applications are delivered on the JEE platform.

Despite the "Enterprise" in its name, the JEE platform was principally designed for handling HTTP requests from Web browsers and performing some business logic in response to each request. It now includes many other technologies, but most of them are related to this mission.

However, as the complexity and variety of Web applications have grown, users and designers of these systems have found many uses for JEE beyond responding to requests from a browser. Many of these uses are common enterprise back office tasks such as batch processing of large volumes of data. While the JEE platform was not originally designed for such purposes, it is versatile enough to provide viable solutions to these problems.

What Is a Batch?
Batch scenarios arise often in business software applications because of a conflict between the enterprise's desire to respond immediately to customer requests and its desire to analyze the resulting transactions. Resolving this conflict requires the speedy capture of the initial transaction with no analysis, followed later by a batch process that aggregates or optimizes the data for reporting, analysis, archiving, or some other large-volume task. It is a safe assumption that nearly every business does some kind of batch processing on its data.

The characteristics of the typical batch process include:
  • A long-running process that must occur on a regularly scheduled basis.
  • The volume of data to be processed is high, usually on the order of thousands to millions of database rows.
  • There may be complex logic or calculations to perform on the data.
  • The process may require a large set of data from some other system that is delivered at a specific time in a large set.
  • The process is run asynchronously from user interactions. It's not part of a user session in an online system. A user does not start it and is not waiting on it to complete.

Why Do Batch Processing in JEE?
The JEE specification was designed for online Web applications and has several limitations with respect to batch processing. For instance, JEE containers are required to manage the life cycle of Enterprise JavaBeans (EJB) and as such might limit the ability to create threads from within these classes.

However, this limitation can be overcome in a couple of ways. First, while most JEE containers discourage developers from creating and managing their own threads, they do not prohibit the practice, especially outside the bounds of EJB classes. Therefore, the batch process can do its own threading using the java.util.concurrent package (available as of Java 5), and on most JEE platforms this causes no trouble. This package provides user-friendly thread pool classes and thread management facilities that make it easier than ever to create multi-threaded applications in Java.
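As a rough illustration of the first approach, the sketch below submits work to a fixed-size thread pool from java.util.concurrent and collects the results. The class name, the processRecord method, and the record IDs are hypothetical placeholders, and the lambda syntax assumes Java 8 or later. It is not a container-specific recipe, and whether application-created threads are acceptable still depends on the JEE server in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchThreadPoolSketch {

    // Hypothetical unit of work: stands in for the real per-record business logic.
    static long processRecord(long recordId) {
        return recordId * 2;
    }

    // Submits one task per record to a fixed-size pool and sums the results.
    static long runBatch(List<Long> recordIds, int threadCount)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Long id : recordIds) {
                Callable<Long> task = () -> processRecord(id);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // blocks until that task has finished
            }
            return total;
        } finally {
            pool.shutdown(); // release the worker threads even if a task failed
        }
    }

    public static void main(String[] args) throws Exception {
        List<Long> ids = Arrays.asList(1L, 2L, 3L, 4L);
        System.out.println(runBatch(ids, 2)); // prints 20
    }
}
```

Shutting the pool down in a finally block matters in a server environment, where leaked threads outlive the batch run and keep consuming resources.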

Second, a more spec-compliant approach to multithreading is to use Java Message Service (JMS) messages to create worker threads within the JEE context. This approach is a little more complex to implement but provides the benefits of complying with the JEE specification while also allowing the batch process to span multiple Java Virtual Machine (JVM) instances in a clustering situation. This will be discussed in more detail below.

Another issue with batch processing on the JEE platform is that by default the container manages transactions and session timeouts. The JEE container is inclined to limit how long resources such as database connections, transactions and beans can be monopolized. This is meant to guarantee a high level of service to all users within an online application, but can be problematic for a long-running batch process.

This issue can be addressed by configuring the batch process not to require JEE transactions and by avoiding entity beans and stateful session beans that might have timeout or locking problems. Also, be sure to use pooled resources such as database connections judiciously, releasing them back to the pool when not in use.
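One way to release pooled connections promptly is Java's try-with-resources statement (Java 7 and later). The sketch below assumes a javax.sql.DataSource backed by the container's connection pool; the class name, the updateChunk method, and the two-parameter SQL statement are hypothetical and only illustrate the shape of the code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

class PooledConnectionSketch {

    // Borrows a connection only for the duration of one chunk of work.
    // The sql argument is assumed to take two long parameters, for example
    // a hypothetical "UPDATE orders SET status = 'DONE' WHERE id >= ? AND id < ?".
    static int updateChunk(DataSource pool, String sql, long fromId, long toId)
            throws SQLException {
        try (Connection con = pool.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, fromId);
            ps.setLong(2, toId);
            return ps.executeUpdate();
        } // the statement and then the connection are closed here, even on error
    }
}
```

With a pooled DataSource, closing the connection normally hands it back to the pool rather than tearing down the physical connection, so borrowing one per chunk keeps the batch from monopolizing it between chunks.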

In addition to these limitations there is a performance question. Other methods can achieve higher performance than the JEE platform. Batch processing typically involves operations on large volumes of rows stored in a relational database, and a stored procedure implemented directly in the database might offer the fastest performance for most applications. However, there are legitimate reasons to implement the logic in JEE instead.

  • Stored procedures are typically implemented in the version of SQL specific to the database platform and are not portable to other databases. This may not matter for a departmental application but is usually not acceptable for an enterprise software product that must be supported on many different databases.
  • The JEE platform provides complementary technology such as JCA connections to other systems, Web service calls to other services and other features that might be useful.
  • Logic implemented in Java can reuse other application logic that is also present in the business layer tier of the application.
  • Well-written Java code is usually easier to understand, maintain, and enhance than a collection of stored procedures.
  • JEE servers usually include clustering capabilities that provide the ability to federate multiple, cheap, commodity servers to improve batch processing performance.

These benefits will often outweigh any performance gain that might be achieved using stored procedures. Furthermore, the difference in performance between a Java solution and a database stored procedure solution can be minimized using the techniques described below.

Techniques for High-Performance Batch Processing on JEE
Now that we've covered the limitations and the alternatives, let's discuss how to architect a batch process on the JEE platform for maximum performance. Batch problems are clear candidates for multi-threaded solutions because the objective is to complete as much work as possible in the shortest time possible and no human interaction is necessary. Parallel processing using multiple threads is necessary to bring all available computing resources to bear on the problem. Today's multicore, multi-CPU servers are especially well suited for multi-threaded processing.
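To make the idea concrete, here is a minimal sketch of splitting a batch into chunks and processing the chunks in parallel, with the pool sized to the number of available processor cores. The class and method names are hypothetical, the range of row IDs is an assumed stand-in for real input data, and processChunk only reports how many rows it was given; real chunk logic would read and write the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedBatchSketch {

    // Splits the half-open range [start, end) into consecutive chunks
    // of at most chunkSize IDs; each chunk is a {from, to} pair.
    static List<long[]> partition(long start, long end, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<long[]> chunks = new ArrayList<>();
        for (long lo = start; lo < end; lo += chunkSize) {
            chunks.add(new long[] {lo, Math.min(lo + chunkSize, end)});
        }
        return chunks;
    }

    // Hypothetical chunk processor: returns the number of rows in the chunk.
    static long processChunk(long lo, long hi) {
        return hi - lo;
    }

    // Runs every chunk on a pool with one thread per available core
    // and returns the total number of rows processed.
    static long runInParallel(long start, long end, long chunkSize) throws Exception {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (long[] chunk : partition(start, end, chunkSize)) {
                tasks.add(() -> processChunk(chunk[0], chunk[1]));
            }
            long processed = 0;
            for (Future<Long> result : pool.invokeAll(tasks)) { // waits for all chunks
                processed += result.get();
            }
            return processed;
        } finally {
            pool.shutdown();
        }
    }
}
```

Chunk size is a tuning knob: chunks that are too small add scheduling overhead, while chunks that are too large leave cores idle near the end of the run.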


More Stories By Colin Hendricks

Colin Hendricks is CTO of Rome Corp. He has worked as a software developer and consultant on high-performance, server-side Java systems for the past 10 years.

Most Recent Comments
Snehal Antani 07/27/08 08:06:36 PM EDT

Kalyan, to answer your questions:

"what are the hiccups?": a key issue with batch processing using Java and application servers relates to JDBC cursors, transactions, and holding cursors across transactions. Checkpointing - committing work periodically so you can restart the job if needed - is important in batch. Checkpointing is achieved by using transactions, JTA transactions specifically. Unfortunately if you use a Type-4 JDBC driver with XA, you're not able to keep cursors open across transactions, therefore you are not easily able to do a "select account from table1" type of query that retrieves all of the accounts to process and leverage some checkpoint strategy as you process those records. There are a few approaches to getting around this: first, we've built a stateful session bean pattern (SFSB) where reads to the DB are done in a local transaction and the writes to the database are done in the global transaction; second, execute smaller queries that are bounded by the checkpoint intervals versus one very large query; third, if you are on z/OS and your data is in DB2 z/OS, use the Type-2 JDBC driver that allows you to hold cursors across transactions; fourth, use Last Participant Support, which is the ability to use a single 1-PC resource in a 2-PC (XA) transaction. This problem will plague *every* Java batch solution and is a pain due to limitations in XA. The WebSphere XD Compute Grid (aka WebSphere Batch) forum has some posts on this topic, please feel free to ask more questions there: http://www-128.ibm.com/developerworks/forums/forum.jspa?forumID=1240&sta.... Within Compute Grid, we've built the SFSB pattern as part of our Batch Datastream Framework (BDS Framework) to make it simpler to leverage. Using LPS or type-2 drivers is pretty straightforward in WebSphere.

Another important gotcha is workload management and ensuring your batch processing doesn't negatively impact your online transaction (OLTP) workloads (and vice versa). The only way to have a good solution in this area is to use a software stack that integrates with the database and the workload manager. Basically, you need an integrated batch and OLTP platform, not just a batch container.

"app's performance would depend on database specifics": yes, of course, but this is business-as-usual. DB vendors have their own knobs and runtime behaviors that will differ, therefore each has to be optimized in its own way.

"what sort of frameworks have you worked with": I've found Hibernate to not be very good for batch processing. You can read more about why here: http://forum.hibernate.org/viewtopic.php?t=988575&view=next&sid=0aada757.... I've seen customers use IBatis, OpenJPA, raw JDBC, Pure Query, and SQLJ/Static SQL. As the article mentions, getting down to the raw SQL query for Batch can be crucial for performance. I tend to stick to raw JDBC and I use the Batch Data Stream Framework (BDS Framework) to manage the connections, prepared statements, restarting, etc. You can read more about this at: http://www-128.ibm.com/developerworks/forums/thread.jspa?threadID=190623...

Kalyan 11/13/07 04:06:33 PM EST

This article looks pretty good in its content. Couple of questions though:

# Have you used this architecture on any of the systems that you have implemented? If so, what are the hiccups that you have come across?

# Though you discourage using stored procs for performance reasons, you say to tweak some database configuration to see if one can get better performance. Wouldn't this make the app's performance (though not logic) dependent on database specifics?

Interacting with databases is the most important part of any batch processing application that has to save data to the persistent store. It'd be interesting to see what sort of frameworks (hibernate, ibatis, etc.) you have worked with in this kind of architecture.

Snehal Antani 08/13/07 04:06:11 PM EDT

Interesting article. I recently published an article describing your Dispatcher-Worker pattern for highly parallel batch jobs in the context of WebSphere XD Compute Grid.

http://www.ibm.com/developerworks/websphere/techjournal/0707_antani/0707...

An interesting extension to your description is depicted in figure 6 of my article: establishing endpoint affinity, which enables new caching opportunities.

The drawback of using straight JEE5 multi-threading packages versus building on an existing enterprise Java batch framework like Compute Grid is that the developer would have to manage threading, which, for enterprise adopters composed of large development teams, could be more trouble than it's worth.