Java Persistence on the Grid: Approaches to Integration

JPA - the enterprise standard for accessing relational data in Java

The Java Persistence API (JPA) is the enterprise standard for accessing relational data in Java. JPA provides support for mapping Java objects to a database schema and includes a simple programming API and an expressive query language for retrieving mapped entities from a database and writing back changes made to these entities. JPA offers developers productivity gains over writing and maintaining their own mapping code, and it provides a single API regardless of the platform, application server, or persistence provider implementation. Besides the productivity gains, the leading implementations offer developers valuable performance and scalability benefits through the inclusion of caching solutions. These caching solutions allow frequently accessed entities to be cached, which reduces both the number of queries going to the database and the processing time spent converting database query results into objects. Caching can have a significant positive effect on application performance.

JPA and Data Grids
A data grid is software that runs on a cluster of typically low-cost hardware to provide data storage and processing services. Data grid products aggregate the processing power and storage capacity of cluster servers and make it available to clients through APIs designed to shield them from the complexity of distributed computing. Data grids are commonly used as scalable distributed caches; however, distributed data processing is also a common feature. As a cache, a data grid provides a way to exceed the heap size of a single server by distributing data across all cluster servers.

Data grids are highly relevant to today's enterprise applications, yet their use is still largely limited to technology specialists. As data grids become mainstream, developers should consider grid architectures when designing applications and be aware that an application might be expected to scale up to a grid in the future.

Consider a banking system that processes incoming deposit and withdrawal requests by validating all fields before writing them to the database. Validation might include whether the account is valid, whether the requester is the account owner, whether the account contains sufficient funds for the request, etc. You can imagine there are many other validations that could be performed in such a system. The amount of data you have to read from the database to perform the validation of a single request can be significant and result in a large number of queries. Fortunately building such a database-centric application in JPA is straightforward. You map each of the classes in your domain to the database and write the necessary JP QL queries to retrieve the objects required for validation. The system may have to read large amounts of data from the database to process each request, but it works.

Now if we want to dramatically increase the throughput of this system we'd have to address its single greatest bottleneck: querying the database for validation data. Most JPA implementations either provide an L2 cache or support the integration of third-party L2 caches. But if we have to handle very large numbers of requests that arrive in a random order it's unlikely we'll have the required reference data in cache. Caches are useful when you're repeatedly accessing the same data. If your access pattern is random then it's unlikely your cache will contain what you need when you need it. Of course you can always increase your cache size to better your odds of a hit, but each server only has so much heap.

Data grids provide a way to exceed the heap size of a single server and distribute your cached objects over a cluster of servers. The challenge is to integrate data grid technology with JPA to increase throughput without requiring complete application rewrites. Of course, as is typically the case with software systems, there's more than one approach to integration, each with its advantages and disadvantages. Let's look at different integration architectures and how we could use them.

Data Grid as Middle Tier Object Cache
As we mentioned, data grid products let you spread your cache across a cluster and can be used as a shared middle tier cache (see Figure 1). They provide a single logical heap that's physically spread over multiple servers with a total storage capacity that's the sum of the heaps of all the cluster servers. In the example, this would mean that by adding more servers to the grid its storage capacity could be increased to the point where all data required for validation could be pre-loaded (commonly referred to as "warming" the cache). Since validation data access is our bottleneck, caching all the required data effectively eliminates it.

For example, consider a simple validation method in our banking system:

public boolean isValidAccount(Request request) {
    Account account = entityManager.find(
        Account.class, request.getAccountId());
    if (account == null) {
        return false;
    } else {
        return account.isValid();
    }
}

With the data grid integrated as the L2 cache, the find() will check the grid for the desired Account. If not found, it can then proceed to query the underlying database. However, if the grid is warmed with all the Accounts then there will be no need to query the database. Warming the appropriate caches can eliminate database access from the validation process entirely.
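The lookup order described above can be sketched in plain Java. This is only an illustration under stated assumptions: ReadThroughCache is a hypothetical class, a ConcurrentHashMap stands in for the grid, and a Function stands in for the primary key query against the database. A real JPA/data grid integration would perform these steps inside the persistence provider.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for a grid-backed L2 cache. A real grid spreads the
// map over many servers, but the lookup order is the same.
class ReadThroughCache<K, V> {
    private final Map<K, V> grid = new ConcurrentHashMap<>();
    private final Function<K, V> databaseLoader; // simulates a primary key SELECT
    private final AtomicInteger databaseReads = new AtomicInteger();

    ReadThroughCache(Function<K, V> databaseLoader) {
        this.databaseLoader = databaseLoader;
    }

    // "Warming": pre-load every row so that later finds never miss.
    void warm(Map<K, V> allRows) {
        grid.putAll(allRows);
    }

    // Check the grid first; on a miss, query the database and cache the result.
    Optional<V> find(K id) {
        V cached = grid.get(id);
        if (cached != null) {
            return Optional.of(cached);
        }
        databaseReads.incrementAndGet();
        V loaded = databaseLoader.apply(id);
        if (loaded != null) {
            grid.put(id, loaded);
        }
        return Optional.ofNullable(loaded);
    }

    // Number of times the simulated database was queried.
    int databaseReads() {
        return databaseReads.get();
    }
}
```

With a cold cache the first find() for an id goes to the loader and later finds for the same id do not. After warm() has loaded every row, databaseReads() stays at zero for any id that exists.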

Primary key finds are easily directed to the data grid but what about JP QL queries? Consider this method, which finds the Customer associated with a request using a non-primary key query:

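public Customer getTxCustomer(Request request)
        throws NoResultException {
    // JP QL query on a non-primary key field. The query string is
    // concatenated because a Java string literal can't span lines.
    Customer customer = (Customer) entityManager
        .createQuery("select c from Customer c "
            + "where c.masterAccountId = :id")
        .setParameter("id", request.getMasterAccountId())
        .getSingleResult();
    return customer;
}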

Querying the data grid for an object that matches an arbitrary criterion is problematic. First it requires that the data grid provides some sort of query framework and second that the JPA/data grid integration can translate from JP QL into this framework. If both requirements are satisfied then it's possible that the query in our example could be directed to the grid and not the database.

One of the most valuable features of this approach is the possibility of parallel query execution. It stands to reason that the query in our example could be executed in parallel on all the servers in the grid to find the desired object. However, a query that returns many objects is much more interesting. Each grid server could execute the query in parallel to identify the objects it holds that match a given criterion. Ten servers each scanning 10,000 objects in parallel will finish much sooner than one server scanning 100,000 objects. The more servers, the fewer objects on each server and the faster the query executes!
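A rough sketch of this scatter-and-merge pattern, using Java parallel streams in a single JVM as a stand-in for grid servers. PartitionedQuery and its methods are hypothetical names chosen for illustration; a real data grid ships the criterion to each server over the network and merges the partial results on the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PartitionedQuery {
    // Each inner list stands in for the objects held by one grid server.
    static <T> List<T> query(List<List<T>> partitions, Predicate<T> criterion) {
        return partitions.parallelStream()               // one task per "server"
                .map(partition -> partition.stream()     // local scan of that server's objects
                        .filter(criterion)
                        .collect(Collectors.toList()))
                .flatMap(List::stream)                    // merge the partial results
                .collect(Collectors.toList());
    }

    // Spread items round-robin over the given number of "servers".
    static <T> List<List<T>> partition(List<T> items, int servers) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < servers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            partitions.get(i % servers).add(items.get(i));
        }
        return partitions;
    }
}
```

Each partition is scanned independently, so adding partitions shrinks the work per scan. The merge step is the only point where results from different "servers" meet.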

Unfortunately there's one complication with queries that return multiple results. Unlike a primary key find(), in which a cache miss can automatically result in a database query, it's not clear whether the results obtained from the grid are complete. Perhaps only half of the objects you're looking for are in the grid, in which case a grid query would silently miss the other half that exists only in the database. Warming the cache solves this problem by ensuring all objects are in the grid, but that isn't always possible. However, for a given use case, you may know whether a particular query should be directed to the grid or to the database. The way to influence query execution in JPA is through query hints, perhaps something like:

Customer customer = (Customer) entityManager
    .createQuery("select c from Customer c "
        + "where c.masterAccountId = :id")
    .setParameter("id", request.getMasterAccountId())
    // Illustrative, implementation-specific hint name
    .setHint("my-jpa-implementation.dont-query-grid", true)
    .getSingleResult();

Of course there's no standard JPA hint for whether to direct a query to a data grid or not. This means that you'd have to introduce implementation-specific hints into your code. Fortunately the JPA specification requires that implementations ignore hints they don't understand so your code isn't tightly coupled to any particular one through hints.

Updating Objects
Naturally querying is the first thing you think of when looking at JPA on the grid but we also have to consider updates: persisting new objects, modifying existing objects, and deleting objects. When the grid is the L2 cache, it's important to ensure that the grid is only updated after a database transaction has successfully committed. Persisting a new object will result in a database INSERT and the new object will be placed into the grid. Modifying an object will result in a database UPDATE and the updated object being placed into the grid. And finally deleting an object will result in a database DELETE and the object being removed from the grid. The key thing is to update the data grid once the database transaction successfully commits.
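The ordering rule can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: CommitThenCache is a hypothetical class, two in-memory maps stand in for the database and the grid, a boolean flag simulates whether the database commit succeeds, and a null value in the pending changes marks a delete.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CommitThenCache {
    final Map<Long, String> database = new HashMap<>();        // stand-in for the relational store
    final Map<Long, String> grid = new ConcurrentHashMap<>();  // stand-in for the grid (L2 cache)

    // Applies the pending changes to the database and, only if that succeeds,
    // mirrors them into the grid. A null value marks a delete.
    void commit(Map<Long, String> pendingChanges, boolean databaseCommitSucceeds) {
        if (!databaseCommitSucceeds) {
            // Simulated rollback: neither store is changed.
            throw new IllegalStateException("database commit failed; grid left unchanged");
        }
        applyTo(database, pendingChanges); // INSERT / UPDATE / DELETE
        applyTo(grid, pendingChanges);     // reached only after a successful commit
    }

    private static void applyTo(Map<Long, String> target, Map<Long, String> changes) {
        for (Map.Entry<Long, String> change : changes.entrySet()) {
            if (change.getValue() == null) {
                target.remove(change.getKey());
            } else {
                target.put(change.getKey(), change.getValue());
            }
        }
    }
}
```

A failed commit leaves the grid holding the last committed state, so readers never see data that the database rejected.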

Data Grid as System of Record
When JPA uses the data grid as a distributed cache, the database is the "system of record." It's the ultimate source of truth and is kept up-to-date at all times. But what if the data grid were the system of record? This is often the case in financial applications dealing with rapidly changing and transient data. What would JPA on the grid look like if there were no database, or if the database were used more as a data archive or warehouse than as an online system? (see Figure 2)

In this architecture, all JPA operations that would normally have resulted in SQL directed to the database are directed instead to the data grid. This includes all queries and all updates. Essentially we replace the database entirely with the data grid. With JP QL translation support we can continue to use JPA as our programming API while working with data stored exclusively in the middle tier. For systems that don't need long-term persistent storage this is ideal. And if more storage or query performance is required you simply add servers to the grid.

Database-backed Data Grid
Even with all queries and updates being performed against the data grid, it's still possible to integrate a database for persistent storage. In this architecture, the grid is responsible for propagating the operations performed on the grid to the database. For example, putting an object into the grid would result in a database INSERT. The advantage of this configuration is that data continues to be highly available but updates are communicated back to the database for permanent storage, reporting purposes, etc. Ideally a grid operation wouldn't be propagated to the database synchronously since that would dramatically reduce throughput. Asynchronous writes of updates to a backing database keep the grid responsive and yet still support persistent storage requirements (see Figure 3).
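A minimal sketch of asynchronous propagation, often called write-behind. WriteBehindGrid is a hypothetical class and both the grid and the database are in-memory maps. To keep the example deterministic, flush() is called explicitly; in practice it could be run periodically by a background thread, for example through a ScheduledExecutorService.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

class WriteBehindGrid {
    private final Map<String, String> grid = new ConcurrentHashMap<>();
    private final Map<String, String> database = new ConcurrentHashMap<>(); // stand-in for the backing store
    private final BlockingQueue<String[]> pendingWrites = new LinkedBlockingQueue<>();

    // Returns as soon as the grid is updated; the database write is only queued.
    void put(String key, String value) {
        grid.put(key, value);
        pendingWrites.add(new String[] {key, value});
    }

    String get(String key) {
        return grid.get(key);
    }

    // Drains the queued writes into the database in one batch and
    // returns how many writes were applied.
    int flush() {
        List<String[]> batch = new ArrayList<>();
        pendingWrites.drainTo(batch);
        for (String[] write : batch) {
            database.put(write[0], write[1]); // stands in for an INSERT or UPDATE
        }
        return batch.size();
    }

    String readFromDatabase(String key) {
        return database.get(key);
    }
}
```

The trade-off is visible in the gap between put() and flush(): the grid answers immediately, while the database lags until the next batch is written.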

Mix and Match - Heterogeneous Configuration
So far we've looked at a data grid as a cache for JPA and using JPA as a standard API on top of a data grid. The difference in the two architectures is actually fairly minor. For new objects, it boils down to configuring whether or not JPA writes first to the database and then to the data grid or whether it just writes to the grid. The same logic applies to update and delete operations. Querying, as we've seen, is also similar.

If we can configure how to read/write/query on an Entity-by-Entity basis we can mix architectures. Consider a stock trading application. In such an application you've got "enduring" Entities like Companies, Stocks, and Bonds. But you've also got transient Entities like Bids and Asks. An Entity-level configuration would enable JPA to use the data grid as a cache for persistent Entities, like Company, and the data grid as the system of record for transient Entities like Bids.
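One way to picture Entity-level configuration is a simple lookup from Entity name to grid mode. The sketch below is hypothetical: EntityGridConfig, its Mode values, and the default for unconfigured Entities are all assumptions made for illustration, not part of JPA or of any particular product.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-Entity configuration; the names are illustrative only.
class EntityGridConfig {
    enum Mode { GRID_AS_CACHE, GRID_AS_SYSTEM_OF_RECORD }

    private final Map<String, Mode> modes = new HashMap<>();

    EntityGridConfig configure(String entityName, Mode mode) {
        modes.put(entityName, mode);
        return this;
    }

    // Assumption: Entities that aren't configured keep the database as system of record.
    Mode modeFor(String entityName) {
        return modes.getOrDefault(entityName, Mode.GRID_AS_CACHE);
    }

    // In cache mode writes go to the database first and then to the grid;
    // in system-of-record mode only the grid is written.
    boolean writesToDatabase(String entityName) {
        return modeFor(entityName) == Mode.GRID_AS_CACHE;
    }
}
```

For the stock trading example, Company would be configured as GRID_AS_CACHE and Bid as GRID_AS_SYSTEM_OF_RECORD, so only Company changes would reach the database.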

Scaling JPA with a Data Grid
Hopefully it's clear by now that integrating JPA with data grids is possible and that data grids can increase system throughput by providing fast access to data managed in the middle tier. They also offer significantly better scalability for JPA applications than commonly used approaches.

Traditionally, scaling up a JPA application is done by increasing the number of servers in the application cluster and using a load balancer to distribute the work evenly. But as you increase the cluster size you are limited in what you can cache without introducing inter-process messaging and locking. Updates to shared data must be communicated to all cluster servers to ensure no JPA caches contain stale data. For a cluster with N servers this means each update will require N-1 messages. As you increase the number of servers in the cluster, the cost of processing a single concurrent update per server increases quadratically, as N(N-1) messages in total, because each server must message every other server for every update. Worse still, as the cluster grows each server will have to spend a significant amount of its available processing time dealing with incoming update messages. These non-linear communication and update processing costs mean that while traditional approaches to clustering JPA applications that employ caching do work well, they are limited to small-to-medium-size clusters.

A data grid solves this communication problem by having only one shared copy of an object accessible from all servers. An update doesn't require messaging to all servers because they'll each pick up the change next time (if ever) they need the updated object. In a data grid with a scalable peer-to-peer communication architecture (i.e., one without a central message routing bottleneck) an update requires communicating to the server that stores the object and to the server(s) that store a backup copy in case of failover. In this case, the communication cost for processing a single concurrent update per server is described by the linear function C × N, where C is a constant reflecting the number of copies (primary and backups). This linear update cost means that it's possible to scale a JPA application using a data grid to large clusters and achieve much higher throughput than would typically be possible.
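To make the difference concrete, the following sketch counts messages for one concurrent update on each of N servers. It takes the N-1 messages per update described earlier for cache-to-cache messaging and, for the grid, one message per stored copy. UpdateMessageCost and the choice of two copies (one primary, one backup) are assumptions for illustration only.

```java
class UpdateMessageCost {
    // Cache-to-cache messaging: each of N servers performs one update
    // and notifies the other N-1 servers.
    static long broadcastMessages(int servers) {
        return (long) servers * (servers - 1);
    }

    // Data grid: each of N updates is sent only to the servers holding
    // a copy of the object (primary plus backups).
    static long gridMessages(int servers, int copies) {
        return (long) servers * copies;
    }

    public static void main(String[] args) {
        int copies = 2; // assumed: one primary and one backup
        for (int n : new int[] {2, 10, 100}) {
            System.out.println(n + " servers: broadcast=" + broadcastMessages(n)
                    + ", grid=" + gridMessages(n, copies));
        }
    }
}
```

Under these assumptions the two approaches are close for a handful of servers, but at 100 servers the messaging approach needs 9,900 messages where the grid needs 200.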

Challenges
Of course JPA on the grid isn't without its challenges. The first thing developers familiar with object relational mapping and JPA will undoubtedly be thinking about is cache staleness. This is the most common problem caches introduce. Staleness has two sources: third-party updates to the database, and updates performed by JPA applications running on other servers in the cluster. Dealing with third-party updates is no different with a data grid than it is with any other cache. Most JPA implementations offer a range of techniques to deal with this, including eviction policies, query refresh options, and for extremely volatile data, the ability to disable caching. This is well-worn territory that data grids don't particularly complicate.

As discussed earlier, staleness due to updates made in other cluster servers is traditionally solved by messaging, although that approach has its limitations. In high-transaction-rate systems where the messaging overhead is significant, JPA applications tend to minimize their cache usage and rely on the database to ensure they have the most recent version of the data. Ironically, as transaction rates increase and the value of caching increases, caching is often disabled because the cost of maintaining cache coherence is too high. The use of a data grid to virtually eliminate the messaging and update processing overhead means that high-transaction-rate systems can take advantage of caching to achieve even higher throughput without having to manage staleness.

Querying is another challenge. JPA defines the general-purpose JP QL that is in many ways similar to SQL and includes many of the same notions. The goal of JP QL is to provide an object-based query language that's easy to translate into SQL for execution on a relational database. Of course data grids aren't relational databases and each has its own query framework. The extent to which JP QL can be translated and executed on a particular grid depends on the expressiveness of the grid's query framework.

Another challenging area is object relationships. JPA supports a number of relationship types along with the notion of embedded objects. Relationship support varies by data grid product and each has its subtleties. Issues include: what kind of relationships are supported; whether objects can have relationships across the grid or must be co-located; and what query operators are supported on relationships. The answer to this last question obviously has a big impact on what kind of JP QL queries can be executed.

This list is definitely not exhaustive but it highlights the kinds of issues that have an impact on JPA/data grid integration.

Conclusion
Data grids are not relational databases and so we can't expect a perfect match between JPA and data grids. But even with some limitations, JPA on the grid is an exciting technology that provides a way to evolve JPA applications to leverage the power of data grids to build scalable high-performance systems.

More Stories By Shaun Smith

Shaun Smith is a Principal Product Manager for Oracle TopLink and an active member of the Eclipse community. He's Ecosystem Development Lead for the Eclipse Persistence Services Project (EclipseLink) and a committer on the Eclipse EMF Teneo and Dali Java Persistence Tools projects. He’s currently involved with the development of JPA persistence for OSGi and Oracle TopLink Grid, which integrates Oracle Coherence with Oracle TopLink to provide JPA on the grid.
