Turbo-Charging Applications with Mid-Tier Distributed Caching

Fast and predictable data access

Caching Topologies
Depending on data usage patterns such as data volatility, frequency of update, and expiration requirements, many different topologies or configurations must be available for use. For example, for relatively small volumes of read-only or rarely updated data, a brute force "replicate everywhere" topology may work. In contrast, large amounts of volatile data (which may grow) may require a topology that will dynamically spread the load over the members in the cluster and repartition when new members are added. A combination of these topologies could also be used, which would provide the benefits of both in-memory access and the ability to grow and load balance the data across the cluster.
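To make the partitioned case concrete, here is a minimal single-process sketch of spreading entries over cluster members and repartitioning when a member joins. The class `PartitionedCacheSketch`, its fixed partition count, and the modulo-based owner assignment are all assumptions made for illustration; they are not how any particular product assigns or moves data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: each "member" is just an in-process map.
class PartitionedCacheSketch {
    private static final int PARTITION_COUNT = 8; // assumed fixed partition count
    private final List<Map<String, String>> members = new ArrayList<>();

    PartitionedCacheSketch(int memberCount) {
        for (int i = 0; i < memberCount; i++) {
            members.add(new HashMap<>());
        }
    }

    // A key always maps to the same partition; the partition's owner
    // depends on how many members are currently in the cluster.
    int ownerOf(String key) {
        int partition = Math.floorMod(key.hashCode(), PARTITION_COUNT);
        return partition % members.size();
    }

    void put(String key, String value) {
        members.get(ownerOf(key)).put(key, value);
    }

    String get(String key) {
        return members.get(ownerOf(key)).get(key);
    }

    // Adding a member triggers repartitioning: every entry is moved to
    // whichever member owns its partition under the new cluster size.
    void addMember() {
        Map<String, String> all = new HashMap<>();
        for (Map<String, String> member : members) {
            all.putAll(member);
            member.clear();
        }
        members.add(new HashMap<>());
        for (Map.Entry<String, String> e : all.entrySet()) {
            put(e.getKey(), e.getValue());
        }
    }

    int memberCount() {
        return members.size();
    }

    int sizeOf(int member) {
        return members.get(member).size();
    }
}
```

A caller only ever uses `put` and `get`; which member holds an entry, and the movement of entries when `addMember` is called, stay hidden behind those two methods.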

The key here is that the developer shouldn't have to code the clustering, replication, data backup, or parallel processing logic required to support the different topology types. The developer should code to a standard API and concentrate on writing business logic. The underlying topology should be changeable declaratively, through configuration, without any change to the code written against that API.

Data Source Integration
When using a mid-tier data grid there are a number of usage patterns for data. Some data will be populated directly from the applications themselves. However, for applications that require data to be cached, there should be a consistent way of loading data from back-end data sources in the case of a cache-miss - that is, when the data being queried isn't available in the grid but does exist in a back-end data store. The developer shouldn't have to write code to deal with it.

Vendors with robust solutions in this space frequently let the data source plug transparently into the grid. For example, in the case of Oracle Coherence, loading directly from the database is configured declaratively by attaching a CacheStore implementation to the deployment configuration. Developers can either implement this standard interface themselves, calling the back-end data store for queries and updates, or use the out-of-the-box integration with persistence solutions such as JBoss Hibernate or Oracle TopLink.

When either of these methods is used, if the data doesn't exist in the data grid, the solution will automatically delegate the data request to the CacheStore implementation, which then retrieves it from the back-end stores.
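The cache-miss delegation described above can be sketched in a few lines. This is a simplified, single-process illustration and not the Coherence CacheStore API: the class `ReadThroughCacheSketch` and its `loader` function are hypothetical stand-ins for the grid and for the component that reads from the back-end store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache: callers only ever call get().
class ReadThroughCacheSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader; // stands in for the back-end lookup
    private int backEndLoads = 0;

    ReadThroughCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            // Cache miss: delegate to the back-end store, then keep the
            // result so the next request is served from memory.
            value = loader.apply(key);
            backEndLoads++;
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    int backEndLoads() {
        return backEndLoads;
    }
}
```

With a plain `Map` playing the part of the database, `new ReadThroughCacheSketch<String, String>(database::get)` hits the back end on the first `get` for a key and serves later requests for that key from memory.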

The capability of refreshing the data objects in the data grid based on time-triggered or other data expiry mechanisms is especially useful for those who use the data grid as a system of record and the official place for accessing data. Having a formal mechanism such as this built into the solution enables expiry policies and other data eviction policies to be matched by the infrastructure, which refreshes the data grid based on policies defined by an administrator. Ideally, the solution shouldn't require customers to poll their back-end systems for changes in data or to schedule jobs to refresh the data grid; such approaches are neither scalable nor manageable.
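As a rough illustration of a time-triggered expiry policy, the sketch below reloads an entry from the back end once it is older than a configured time-to-live. Everything here is assumed for the example: the class `ExpiringCacheSketch`, the `ttlMillis` parameter, and the injectable `clock` (used so the behavior can be exercised without waiting) are hypothetical and do not reflect any product's configuration or API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical cache whose entries expire after a fixed time-to-live.
class ExpiringCacheSketch<K, V> {
    // A cached value together with the time it was loaded.
    private static final class Timed<T> {
        final T value;
        final long loadedAt;

        Timed(T value, long loadedAt) {
            this.value = value;
            this.loadedAt = loadedAt;
        }
    }

    private final Map<K, Timed<V>> entries = new HashMap<>();
    private final Function<K, V> loader; // stands in for the back-end lookup
    private final long ttlMillis;        // expiry policy set by an administrator
    private final LongSupplier clock;    // current time in milliseconds

    ExpiringCacheSketch(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    V get(K key) {
        long now = clock.getAsLong();
        Timed<V> entry = entries.get(key);
        if (entry == null || now - entry.loadedAt >= ttlMillis) {
            // Missing or expired: refresh from the back end.
            entry = new Timed<>(loader.apply(key), now);
            entries.put(key, entry);
        }
        return entry.value;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`. The point of the sketch is that the refresh is driven by the policy inside the cache, not by client code polling the back end.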

Sending the Processing to the Data
The advantage of using a distributed cache or data grid topology is that processing as well as data can be scaled when adding more resources to the grid. In a traditional use case in which we need to read data and do processing on it within a Map (for example, giving a raise to employees), we may have used something similar to the following (ignoring error handling, etc.):

// map holds Employee values; each one is read and updated on the client
for (Employee emp : map.values()) {
    emp.setSalary(emp.getSalary() * 1.1);
}

This (which could be written dozens of ways, of course) would achieve the desired result. However, if the Map weren't local to the Java process, for example because it was hosted or distributed on another server, there would be a lot of network traffic to and from the client. The processing would also be serial (that is, one entry processed at a time), and rewriting it to run in parallel over multiple JVMs, with the coordination that concurrent processing demands, would require a considerable amount of work.

Taking advantage of grid processing and the ability of the data caching topology to load balance and partition data across multiple servers, it makes sense to send the processing to where the data is, rather than bringing the data to the client for processing. A common approach (this example is specific to Oracle Coherence) is to deploy code in the grid that performs the logic local to the nodes in the grid, rather than requiring the programmer to bring all the data to the client.

The example shows how this approach could be used to raise the salary of all employees. First create a class to process the data:

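    import com.tangosol.util.InvocableMap;
    import com.tangosol.util.processor.AbstractProcessor;

    public class RaiseSalary extends AbstractProcessor {
        public RaiseSalary() {
        }

        // Runs on the grid node that holds the entry
        public Object process(InvocableMap.Entry entry) {
            Employee emp = (Employee) entry.getValue();
            emp.setSalary(emp.getSalary() * 1.10);
            entry.setValue(emp);   // store the updated value back in the cache
            return null;
        }
    }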

Now invoke this across the Map (data grid):

    empCache.invokeAll(AlwaysFilter.INSTANCE, new RaiseSalary());

Sending the processing to the data dramatically improves the performance of tasks such as this because now the compute activity is parallelized across the entire grid.
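The same idea can be simulated in a single JVM without any grid product. In the sketch below, each "node" is an in-process map and the processing function is applied on every node in parallel, so entries are never copied out to the caller. The class `GridSketch`, its hash-based placement of keys, and the use of a parallel stream to stand in for independent servers are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical in-process stand-in for a partitioned data grid.
class GridSketch<K, V> {
    private final List<Map<K, V>> nodes = new ArrayList<>();

    GridSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new ConcurrentHashMap<>());
        }
    }

    // Each key lives on exactly one node, chosen by its hash.
    private Map<K, V> ownerOf(K key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    void put(K key, V value) {
        ownerOf(key).put(key, value);
    }

    V get(K key) {
        return ownerOf(key).get(key);
    }

    // Ship the processing to the data: every node updates its own
    // entries in place, and the nodes work in parallel.
    void invokeAll(UnaryOperator<V> processor) {
        nodes.parallelStream()
             .forEach(node -> node.replaceAll((key, value) -> processor.apply(value)));
    }
}
```

For example, with salaries held as whole numbers, `grid.invokeAll(salary -> salary * 110 / 100)` applies a 10% raise on every node at once; the caller sends only the function and receives no entry data back.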

Figure 1 illustrates the benefits of sending the processing to the data.

With multiple nodes in the grid and data distributed in parallel across the nodes, the processing model would scale well and take advantage of the processing capabilities of each node. Also, the fact that data doesn't need to be shipped back and forth between the client and server significantly increases the scalability and performance of such a system. As outlined in the example, using traditional non-grid methods would result in extremely poor performance and limited scalability.


More Stories By Tim Middleton

Tim Middleton is a solution architect with Oracle in Perth, Western Australia. He has over 17 years of experience in the IT industry. During this time he has been involved in the design and implementation of many large and leading-edge technology projects within the government and private sectors. His focus is on providing middleware solutions around SOA, with an emphasis on architectures that are highly available, scalable and reliable. Tim also has extensive development experience with J2EE and application server-based solutions, as well as many years experience as a DBA.
