Java Persistence on the Grid: Approaches to Integration

JPA - the enterprise standard for accessing relational data in Java


The Java Persistence API (JPA) is the enterprise standard for accessing relational data in Java. JPA provides support for mapping Java objects to a database schema and includes a simple programming API and expressive query language for retrieving mapped entities from a database and writing back changes made to these entities. JPA offers developers productivity gains over writing and maintaining their own mapping code, and it provides a single API regardless of the platform, application server, or persistence provider implementation. Besides the productivity gains, the leading implementations offer developers valuable performance and scalability benefits through the inclusion of caching solutions. These caching solutions allow frequently accessed entities to be cached, which reduces both the number of queries going to the database and the processing time spent converting database query results into objects. Caching can have a significant positive effect on application performance.

JPA and Data Grids
A data grid is software that runs on a cluster of typically low-cost hardware to provide data storage and processing services. Data grid products aggregate the processing power and storage capacity of cluster servers and make it available to clients through APIs designed to shield them from the complexity of distributed computing. Data grids are commonly used as scalable distributed caches; however, distributed data processing is also a common feature. As a cache, a data grid provides a way to exceed the heap size of a single server by distributing data across all cluster servers.

Data grids are highly relevant to today's enterprise applications, yet their use is still largely limited to technology specialists. As data grids become mainstream, developers should consider grid architectures when developing applications and be aware that an application might be expected to scale up to a grid in the future.

Consider a banking system that processes incoming deposit and withdrawal requests by validating all fields before writing them to the database. Validation might include whether the account is valid, whether the requester is the account owner, whether the account contains sufficient funds for the request, etc. You can imagine there are many other validations that could be performed in such a system. The amount of data you have to read from the database to perform the validation of a single request can be significant and result in a large number of queries. Fortunately building such a database-centric application in JPA is straightforward. You map each of the classes in your domain to the database and write the necessary JP QL queries to retrieve the objects required for validation. The system may have to read large amounts of data from the database to process each request, but it works.

Now if we want to dramatically increase the throughput of this system we'd have to address its single greatest bottleneck: querying the database for validation data. Most JPA implementations either provide an L2 cache or support the integration of third-party L2 caches. But if we have to handle very large numbers of requests that arrive in a random order it's unlikely we'll have the required reference data in cache. Caches are useful when you're repeatedly accessing the same data. If your access pattern is random then it's unlikely your cache will contain what you need when you need it. Of course you can always increase your cache size to better your odds of a hit, but each server only has so much heap.
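The effect of random access on hit rates can be estimated with simple arithmetic. The sketch below assumes every object is equally likely to be requested, which is a simplification; the class and method names are hypothetical.

```java
// Hypothetical helper: estimates the chance of a cache hit when every
// object is equally likely to be requested (a simplifying assumption).
class CacheHitEstimate {

    // Returns cachedObjects / totalObjects, capped at 1.0.
    static double hitProbability(long cachedObjects, long totalObjects) {
        if (totalObjects <= 0) {
            throw new IllegalArgumentException("totalObjects must be positive");
        }
        return Math.min(1.0, (double) cachedObjects / totalObjects);
    }
}
```

Under that assumption, a cache holding 10,000 of 1,000,000 accounts would serve only about 1 request in 100, and the estimate reaches 1.0 only when the cache can hold everything.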

Data grids provide a way to exceed the heap size of a single server and distribute your cached objects over a cluster of servers. The challenge is to integrate data grid technology with JPA to increase throughput without requiring complete application rewrites. Of course, as is typically the case with software systems, there's more than one approach to integration, each with its advantages and disadvantages. Let's look at different integration architectures and how we could use them.

Data Grid as Middle Tier Object Cache
As we mentioned, data grid products let you spread your cache across a cluster and can be used as a shared middle tier cache (see Figure 1). They provide a single logical heap that's physically spread over multiple servers with a total storage capacity that's the sum of the heaps of all the cluster servers. In the example, this would mean that by adding more servers to the grid its storage capacity could be increased to the point where all data required for validation could be pre-loaded (commonly referred to as "warming" the cache). Since validation data access is our bottleneck, caching all the required data effectively eliminates it.

For example, consider a simple validation method in our banking system:

public boolean isValidAccount(Request request) {
    Account account = entityManager.find(
            Account.class, request.getAccountId());
    if (account == null) {
        return false;
    } else {
        return account.isValid();
    }
}

With the data grid integrated as the L2 cache, the find() will check the grid for the desired Account. If not found, it can then proceed to query the underlying database. However, if the grid is warmed with all the Accounts then there will be no need to query the database. Warming the appropriate caches can eliminate database access from the validation process entirely.
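The lookup order described here can be sketched without any JPA provider. In this minimal sketch a ConcurrentHashMap stands in for the grid and a plain map stands in for the database; GridBackedFinder, warm(), and databaseReads() are hypothetical names, not part of JPA or of any grid product.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one map stands in for the data grid, another for
// the database. The read counter is for illustration and is not thread-safe.
class GridBackedFinder {
    private final Map<Long, String> grid = new ConcurrentHashMap<>();
    private final Map<Long, String> database;
    private int databaseReads = 0;

    GridBackedFinder(Map<Long, String> database) {
        this.database = database;
    }

    // Pre-load ("warm") the grid with every row so later finds do not miss.
    void warm() {
        grid.putAll(database);
    }

    // Check the grid first; fall back to the database only on a miss.
    String find(Long id) {
        String cached = grid.get(id);
        if (cached != null) {
            return cached;
        }
        databaseReads++;
        String loaded = database.get(id);
        if (loaded != null) {
            grid.put(id, loaded); // keep it for the next caller
        }
        return loaded;
    }

    int databaseReads() {
        return databaseReads;
    }
}
```

With a cold grid the first find() for an id costs one database read and later finds for the same id cost none; after warm(), finds for existing ids never reach the database.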

Primary key finds are easily directed to the data grid but what about JP QL queries? Consider this method, which finds the Customer associated with a request using a non-primary key query:

public Customer getTxCustomer(Request request)
        throws NoResultException {
    // Typed query, so getSingleResult() returns a Customer rather than Object
    Customer customer = entityManager
            .createQuery("select c from Customer c "
                    + "where c.masterAccountId = :id", Customer.class)
            .setParameter("id", request.getMasterAccountId())
            .getSingleResult();
    return customer;
}

Querying the data grid for an object that matches an arbitrary criterion is problematic. First it requires that the data grid provides some sort of query framework and second that the JPA/data grid integration can translate from JP QL into this framework. If both requirements are satisfied then it's possible that the query in our example could be directed to the grid and not the database.

One of the most valuable features of this approach is the possibility of parallel query execution. It stands to reason that the query in our example could be executed in parallel on all the servers in the grid to find the desired object. However, a query that returns many objects is much more interesting. Each grid server could execute a query in parallel to identify those objects it holds that match a given criterion. Performing such a query 10 times in parallel on 10,000 objects is going to be much faster than one time on 100,000 objects. The more servers, the smaller the number of objects on each server and the faster the query executes!
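The idea of running one criterion against every partition at once can be illustrated with plain Java collections. This is only a sketch under stated assumptions: each inner list plays the role of one grid server's share of the data, a parallel stream stands in for the grid's distributed execution, and PartitionedQuery is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: each inner list plays the role of the objects held
// by one grid server.
class PartitionedQuery {

    // Run the same filter on every partition in parallel and merge the matches.
    static <T> List<T> query(List<List<T>> partitions, Predicate<T> criterion) {
        return partitions.parallelStream()
                .flatMap(partition -> partition.stream().filter(criterion))
                .collect(Collectors.toList());
    }

    // Spread items round-robin over the given number of partitions.
    static <T> List<List<T>> partition(List<T> items, int servers) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < servers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            partitions.get(i % servers).add(items.get(i));
        }
        return partitions;
    }
}
```

Each partition scans only its own share, so adding partitions shrinks the work per scan; a real grid would also pay network and merge costs that this sketch ignores.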

Unfortunately there's one complication with queries that return multiple results. Unlike a primary key find(), in which a cache miss could automatically result in a database query, it's not clear whether the results obtained from the grid are sufficient. Perhaps only half of the objects you're looking for are in the grid, so a grid query wouldn't return the other half from the database. Warming the cache solves this problem by ensuring all objects are in the grid, but that isn't always possible. However, for a given use case, you may know whether a particular query should be directed to the grid or to the database. The way you influence query execution in JPA is through query hints. Perhaps something like:

Customer customer = entityManager
        .createQuery("select c from Customer c "
                + "where c.masterAccountId = :id", Customer.class)
        .setParameter("id", request.getMasterAccountId())
        // Illustrative, implementation-specific hint name
        .setHint("my-jpa-implementation.dont-query-grid", true)
        .getSingleResult();

Of course there's no standard JPA hint for whether to direct a query to a data grid or not. This means that you'd have to introduce implementation-specific hints into your code. Fortunately the JPA specification requires that implementations ignore hints they don't understand so your code isn't tightly coupled to any particular one through hints.

Updating Objects
Naturally querying is the first thing you think of when looking at JPA on the grid but we also have to consider updates: persisting new objects, modifying existing objects, and deleting objects. When the grid is the L2 cache, it's important to ensure that the grid is only updated after a database transaction has successfully committed. Persisting a new object will result in a database INSERT and the new object will be placed into the grid. Modifying an object will result in a database UPDATE and the updated object being placed into the grid. And finally deleting an object will result in a database DELETE and the object being removed from the grid. The key thing is to update the data grid once the database transaction successfully commits.
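The ordering rule (database commit first, grid second) can be sketched as a small unit of work that buffers changes. Everything here is an illustrative assumption: a map stands in for the grid, a BooleanSupplier stands in for the real INSERT/UPDATE/DELETE plus commit, and CommitOrderedUnitOfWork is a hypothetical name rather than a JPA type.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: buffers changes during a transaction and touches the
// grid (a map here) only if the database commit reports success.
class CommitOrderedUnitOfWork {
    private final Map<Long, String> grid;
    // A null value marks a delete; insertion order is preserved.
    private final Map<Long, String> pending = new LinkedHashMap<>();

    CommitOrderedUnitOfWork(Map<Long, String> grid) {
        this.grid = grid;
    }

    void persistOrUpdate(Long id, String value) {
        pending.put(id, value);
    }

    void remove(Long id) {
        pending.put(id, null);
    }

    // databaseCommit stands in for executing the SQL and committing.
    boolean commit(BooleanSupplier databaseCommit) {
        boolean committed = databaseCommit.getAsBoolean();
        if (committed) {
            for (Map.Entry<Long, String> change : pending.entrySet()) {
                if (change.getValue() == null) {
                    grid.remove(change.getKey());
                } else {
                    grid.put(change.getKey(), change.getValue());
                }
            }
        }
        pending.clear(); // on failure the buffered changes are discarded
        return committed;
    }
}
```

If the commit fails, the grid is left exactly as it was, so readers on other servers never see data the database rejected.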

Data Grid as System of Record
When JPA uses the data grid as a distributed cache, the database is the "system of record." It's the ultimate source of truth and is kept up-to-date at all times. But what if the data grid were the system of record? This is the case in many financial applications dealing with rapidly changing and transient data. What would JPA on the grid look like if there were no database or if the database were used more as a data archive or warehouse than as an online system? (see Figure 2)

In this architecture, all JPA operations that would normally have resulted in SQL directed to the database are directed instead to the data grid. This includes all queries and all updates. Essentially we replace the database entirely with the data grid. With JP QL translation support we can continue to use JPA as our programming API while working with data stored exclusively in the middle tier. For systems that don't need long-term persistent storage this is ideal. And if more storage or query performance is required you simply add servers to the grid.

Database-backed Data Grid
Even with all queries and updates being performed against the data grid, it's still possible to integrate a database for persistent storage. In this architecture, the grid is responsible for propagating the operations performed on the grid to the database. For example, putting an object into the grid would result in a database INSERT. The advantage of this configuration is that data continues to be highly available but updates are communicated back to the database for permanent storage, reporting purposes, etc. Ideally a grid operation wouldn't be propagated to the database synchronously since that would dramatically reduce throughput. Asynchronous writes of updates to a backing database keep the grid responsive and yet still support persistent storage requirements (see Figure 3).
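A write-behind arrangement of this kind can be sketched with a queue of changed keys. This is a minimal sketch under assumptions: maps stand in for the grid and the database, WriteBehindGrid is a hypothetical name, and flush() is called explicitly here, whereas a real grid would run it from a background thread or timer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of write-behind: put() returns once the grid is
// updated; flush() later drains the queued changes to the database.
class WriteBehindGrid {
    private final Map<Long, String> grid = new ConcurrentHashMap<>();
    private final BlockingQueue<Long> dirtyKeys = new LinkedBlockingQueue<>();
    private final Map<Long, String> database;

    WriteBehindGrid(Map<Long, String> database) {
        this.database = database;
    }

    // Update the grid and remember that this key still has to be written.
    void put(Long id, String value) {
        grid.put(id, value);
        dirtyKeys.add(id);
    }

    String get(Long id) {
        return grid.get(id);
    }

    // Write the latest value of each changed key; returns rows written.
    int flush() {
        List<Long> batch = new ArrayList<>();
        dirtyKeys.drainTo(batch);
        int written = 0;
        for (Long id : new LinkedHashSet<>(batch)) { // coalesce repeated updates
            database.put(id, grid.get(id));
            written++;
        }
        return written;
    }
}
```

Coalescing means an object updated several times between flushes costs a single database write, which is one reason asynchronous propagation can reduce database load as well as latency.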

Mix and Match - Heterogeneous Configuration
So far we've looked at a data grid as a cache for JPA and using JPA as a standard API on top of a data grid. The difference in the two architectures is actually fairly minor. For new objects, it boils down to configuring whether or not JPA writes first to the database and then to the data grid or whether it just writes to the grid. The same logic applies to update and delete operations. Querying, as we've seen, is also similar.

If we can configure how to read/write/query on an Entity-by-Entity basis we can mix architectures. Consider a stock trading application. In such an application you've got "enduring" Entities like Companies, Stocks, and Bonds. But you've also got transient Entities like Bids and Asks. An Entity-level configuration would enable JPA to use the data grid as a cache for persistent Entities, like Company, and the data grid as the system of record for transient Entities like Bids.
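One way to picture Entity-level configuration is a registry that maps each Entity class to a mode. This is a hypothetical sketch, not the configuration mechanism of any particular JPA implementation; GridMode, EntityGridConfig, and the placeholder Company and Bid classes are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical modes corresponding to the two architectures discussed.
enum GridMode { CACHE_IN_FRONT_OF_DATABASE, GRID_IS_SYSTEM_OF_RECORD }

// Placeholder Entity classes for the stock trading example.
class Company { }
class Bid { }

// Hypothetical registry deciding, per Entity type, where writes go.
class EntityGridConfig {
    private final Map<Class<?>, GridMode> modes = new HashMap<>();

    void configure(Class<?> entityType, GridMode mode) {
        modes.put(entityType, mode);
    }

    // Unconfigured Entities default to the database-backed cache mode.
    GridMode modeFor(Class<?> entityType) {
        return modes.getOrDefault(entityType, GridMode.CACHE_IN_FRONT_OF_DATABASE);
    }

    // True when a write must reach the database before the grid is updated.
    boolean writesToDatabase(Class<?> entityType) {
        return modeFor(entityType) == GridMode.CACHE_IN_FRONT_OF_DATABASE;
    }
}
```

Configuring Company as CACHE_IN_FRONT_OF_DATABASE and Bid as GRID_IS_SYSTEM_OF_RECORD would send Company writes to the database first while keeping Bid writes in the grid only.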

Scaling JPA with a Data Grid
Hopefully it's clear by now that integrating JPA with data grids is possible and that they can increase system throughput by providing fast access to data managed in the middle tier. But they also offer significantly better scalability for JPA applications as compared with commonly used approaches.

Traditionally, scaling up a JPA application is done by increasing the number of servers in the application cluster and using a load balancer to distribute the work evenly. But as you increase the cluster size you are limited in what you can cache without introducing inter-process messaging and locking. Updates to shared data must be communicated to all cluster servers to ensure no JPA caches contain stale data. For a cluster with N servers this means each update will require N-1 messages. As you increase the number of servers in the cluster, the cost of processing a single concurrent update per server increases quadratically, according to N(N-1), because each of the N servers must message every other server for every update. Worse still, as the cluster grows each server will have to spend a significant amount of its available processing time dealing with incoming update messages. These non-linear communication and update processing costs mean that while traditional approaches to clustering JPA applications that employ caching do work well, they are limited to small-to-medium-size clusters.

A data grid solves this communication problem by having only one shared copy of an object accessible from all servers. An update doesn't require messaging to all servers because they'll each pick up the change next time (if ever) they need the updated object. In a data grid with a scalable peer-to-peer communication architecture (i.e., one without a central message routing bottleneck) an update requires communicating to the server that stores the object and to the server(s) that store a backup copy in case of failover. In this case, the communication cost for processing a single concurrent update per server is described by the linear function C × N, where C is a constant reflecting the number of copies (primary and backups). This linear update cost means that it's possible to scale JPA applications using a data grid to large clusters and achieve much higher throughput than would typically be possible.
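The two growth rates can be compared with a few lines of arithmetic. The sketch counts one message per recipient and assumes every server performs exactly one update; with replicated caches each of the N servers notifies its N-1 peers, while with a grid each update goes to a fixed number of copies. UpdateMessageCost is a hypothetical name.

```java
// Hypothetical sketch: counts messages when every server performs one update.
class UpdateMessageCost {

    // Replicated caches: each of the servers notifies every other server.
    static long replicatedCaches(int servers) {
        return (long) servers * (servers - 1);
    }

    // Data grid: each update goes only to the primary and backup copies.
    static long dataGrid(int servers, int copies) {
        return (long) servers * copies;
    }
}
```

Under these assumptions, with two copies per object, 10 servers exchange 90 messages with replicated caches versus 20 with a grid, and 100 servers exchange 9,900 versus 200.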

Challenges
Of course JPA on the grid isn't without its challenges. The first thing developers familiar with object relational mapping and JPA will undoubtedly be thinking about is cache staleness. This is the most common problem caches introduce. Staleness has two sources: third-party updates to the database, and updates performed by JPA applications running on other servers in the cluster. Dealing with third-party updates is no different with a data grid than it is with any other cache. Most JPA implementations offer a range of techniques to deal with this, including eviction policies, query refresh options, and for extremely volatile data, the ability to disable caching. This is well-worn territory that data grids don't particularly complicate.

As discussed earlier, staleness due to updates made in other cluster servers is traditionally solved by messaging, although this has its limitations. In high-transaction-rate systems where the messaging overhead is significant, JPA applications tend to minimize their cache usage and rely on the database to ensure they have the most recent version of the data. Ironically, as transaction rates increase and the value of caching increases, it's often disabled because the cost of maintaining cache coherence is too high. The use of a data grid to virtually eliminate the messaging and update processing overhead means that high-transaction-rate systems can take advantage of caching to achieve even higher throughput without having to manage staleness.

Querying is another challenge. JPA defines the general-purpose JP QL that is in many ways similar to SQL and includes many of the same notions. The goal of JP QL is to provide an object-based query language that's easy to translate into SQL for execution on a relational database. Of course data grids aren't relational databases and each has its own query framework. The extent to which JP QL can be translated and executed on a particular grid depends on the expressiveness of the grid's query framework.

Another challenging area is object relationships. JPA supports a number of relationship types along with the notion of embedded objects. Relationship support varies by data grid product and each has its subtleties. Issues include: what kind of relationships are supported; whether objects can have relationships across the grid or must be co-located; and what query operators are supported on relationships. The answer to this last question obviously has a big impact on what kind of JP QL queries can be executed.

This list is definitely not exhaustive but it highlights the kinds of issues that have an impact on JPA/data grid integration.

Conclusion
Data grids are not relational databases and so we can't expect a perfect match between JPA and data grids. But even with some limitations, JPA on the grid is an exciting technology that provides a way to evolve JPA applications to leverage the power of data grids to build scalable high-performance systems.


More Stories By Shaun Smith

Shaun Smith is a Principal Product Manager for Oracle TopLink and an active member of the Eclipse community. He's Ecosystem Development Lead for the Eclipse Persistence Services Project (EclipseLink) and a committer on the Eclipse EMF Teneo and Dali Java Persistence Tools projects. He’s currently involved with the development of JPA persistence for OSGi and Oracle TopLink Grid, which integrates Oracle Coherence with Oracle TopLink to provide JPA on the grid.
