Java Persistence on the Grid: Approaches to Integration

JPA - the enterprise standard for accessing relational data in Java

The Java Persistence API (JPA) is the enterprise standard for accessing relational data in Java. JPA provides support for mapping Java objects to a database schema and includes a simple programming API and an expressive query language for retrieving mapped entities from a database and writing back changes made to these entities. JPA offers developers productivity gains over writing and maintaining their own mapping code, and it provides a single API regardless of the platform, application server, or persistence provider implementation. Besides the productivity gains, the leading implementations offer developers valuable performance and scalability benefits through the inclusion of caching solutions. These caching solutions allow frequently accessed entities to be cached, which reduces the number of queries going to the database and the amount of processing time spent converting database query results into objects. Caching can have a significant positive effect on application performance.

JPA and Data Grids
A data grid is software that runs on a cluster of typically low-cost hardware to provide data storage and processing services. Data grid products aggregate the processing power and storage capacity of cluster servers and make it available to clients through APIs designed to shield them from the complexity of distributed computing. Data grids are commonly used as scalable distributed caches; however, distributed data processing is also a common feature. As a cache, a data grid provides a way to exceed the heap size of a single server by distributing data across all cluster servers.

Data grids are highly relevant to today's enterprise applications, yet their use is still largely limited to technology specialists. As data grids become mainstream, developers should consider grid architectures when developing applications and be aware that an application might be expected to scale up to a grid in the future.

Consider a banking system that processes incoming deposit and withdrawal requests by validating all fields before writing them to the database. Validation might include whether the account is valid, whether the requester is the account owner, whether the account contains sufficient funds for the request, etc. You can imagine there are many other validations that could be performed in such a system. The amount of data you have to read from the database to perform the validation of a single request can be significant and result in a large number of queries. Fortunately building such a database-centric application in JPA is straightforward. You map each of the classes in your domain to the database and write the necessary JP QL queries to retrieve the objects required for validation. The system may have to read large amounts of data from the database to process each request, but it works.

Now if we want to dramatically increase the throughput of this system we'd have to address its single greatest bottleneck: querying the database for validation data. Most JPA implementations either provide an L2 cache or support the integration of third-party L2 caches. But if we have to handle very large numbers of requests that arrive in a random order it's unlikely we'll have the required reference data in cache. Caches are useful when you're repeatedly accessing the same data. If your access pattern is random then it's unlikely your cache will contain what you need when you need it. Of course you can always increase your cache size to better your odds of a hit, but each server only has so much heap.

Data grids provide a way to exceed the heap size of a single server and distribute your cached objects over a cluster of servers. The challenge is to integrate data grid technology with JPA to increase throughput without requiring complete application rewrites. As is typically the case with software systems, there's more than one approach to integration, each with its advantages and disadvantages. Let's look at different integration architectures and how we could use them.

Data Grid as Middle Tier Object Cache
As we mentioned, data grid products let you spread your cache across a cluster and can be used as a shared middle tier cache (see Figure 1). They provide a single logical heap that's physically spread over multiple servers with a total storage capacity that's the sum of the heaps of all the cluster servers. In the example, this would mean that by adding more servers to the grid its storage capacity could be increased to the point where all data required for validation could be pre-loaded (commonly referred to as "warming" the cache). Since validation data access is our bottleneck, caching all the required data effectively eliminates it.
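
The "single logical heap" idea can be sketched in plain Java. The class below is a hypothetical illustration, not the API of any data grid product: each in-memory map stands in for the heap of one grid server, and an entry's key hash decides which server owns it. The name PartitionedCache and all of its methods are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each HashMap stands in for the heap of one grid server.
final class PartitionedCache<K, V> {
    private final List<Map<K, V>> servers = new ArrayList<>();

    PartitionedCache(int serverCount) {
        if (serverCount < 1) {
            throw new IllegalArgumentException("need at least one server");
        }
        for (int i = 0; i < serverCount; i++) {
            servers.add(new HashMap<>());
        }
    }

    // The key's hash decides which server owns the entry.
    private Map<K, V> ownerOf(K key) {
        return servers.get(Math.floorMod(key.hashCode(), servers.size()));
    }

    void put(K key, V value) {
        ownerOf(key).put(key, value);
    }

    V get(K key) {
        return ownerOf(key).get(key);
    }

    // Number of entries held by one server.
    int sizeOf(int serverIndex) {
        return servers.get(serverIndex).size();
    }

    // Total number of entries across the whole grid.
    int totalSize() {
        int total = 0;
        for (Map<K, V> server : servers) {
            total += server.size();
        }
        return total;
    }
}
```

Adding servers lowers the number of entries each one has to hold, which is what makes it feasible to warm the grid with the full validation data set. A real grid would also have to rebalance entries when servers join or leave, which this sketch ignores.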

For example, consider a simple validation method in our banking system:

    public boolean isValidAccount(Request request) {
        Account account = entityManager.find(
            Account.class, request.getAccountId());
        if (account == null) {
            return false;
        } else {
            return account.isValid();
        }
    }

With the data grid integrated as the L2 cache, the find() will check the grid for the desired Account. If not found, it can then proceed to query the underlying database. However, if the grid is warmed with all the Accounts then there will be no need to query the database. Warming the appropriate caches can eliminate database access from the validation process entirely.
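
The lookup order described here (grid first, database only on a miss) can be sketched without any JPA dependency. This is a minimal illustration under stated assumptions: ReadThroughCache is a hypothetical class, a HashMap stands in for the grid, and a Function stands in for the database query a JPA provider would issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a read-through lookup: grid first, database on a miss.
final class ReadThroughCache<K, V> {
    private final Map<K, V> grid = new HashMap<>();  // stands in for the data grid
    private final Function<K, V> database;           // stands in for a database query
    private int databaseReads = 0;

    ReadThroughCache(Function<K, V> database) {
        this.database = database;
    }

    // Pre-load ("warm") the grid so later finds never reach the database.
    void warm(Map<K, V> entries) {
        grid.putAll(entries);
    }

    V find(K key) {
        V cached = grid.get(key);
        if (cached != null) {
            return cached;               // grid hit: no database access
        }
        databaseReads++;
        V loaded = database.apply(key);  // grid miss: fall back to the database
        if (loaded != null) {
            grid.put(key, loaded);       // keep it for the next find
        }
        return loaded;
    }

    // How many times find() had to go to the database.
    int databaseReads() {
        return databaseReads;
    }
}
```

Once warm() has loaded every Account, find() is served from the grid alone and the databaseReads counter stops growing for those keys.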

Primary key finds are easily directed to the data grid but what about JP QL queries? Consider this method, which finds the Customer associated with a request using a non-primary key query:

    public Customer getTxCustomer(Request request)
            throws NoResultException {
        // getSingleResult() returns Object on an untyped Query, so cast the result
        Customer customer = (Customer) entityManager
            .createQuery("select c from Customer c "
                + "where c.masterAccountId = :id")
            .setParameter("id", request.getMasterAccountId())
            .getSingleResult();
        return customer;
    }

Querying the data grid for an object that matches an arbitrary criterion is problematic. First it requires that the data grid provides some sort of query framework and second that the JPA/data grid integration can translate from JP QL into this framework. If both requirements are satisfied then it's possible that the query in our example could be directed to the grid and not the database.

One of the most valuable features of this approach is the possibility of parallel query execution. It stands to reason that the query in our example could be executed in parallel on all the servers in the grid to find the desired object. However, a query that returns many objects is much more interesting. Each grid server could execute the query in parallel to identify the objects it holds that match the given criteria. Running the query in parallel on 10 servers holding 10,000 objects each is going to be much faster than running it once over 100,000 objects. The more servers, the smaller the number of objects on each server and the faster the query executes!
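
As a rough, single-JVM illustration of the idea, the sketch below applies one filter to several partitions in parallel and merges the matches. It is not how any particular grid executes queries: ParallelGridQuery is a hypothetical name, each inner list stands in for the objects held by one server, and a parallel stream stands in for work that a real grid would run on separate machines.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: each inner list stands in for the objects held by one grid server.
final class ParallelGridQuery {
    private ParallelGridQuery() {
    }

    // Run the same filter on every partition in parallel and merge the matches.
    static <T> List<T> query(List<List<T>> partitions, Predicate<? super T> criteria) {
        return partitions.parallelStream()
                .flatMap(partition -> partition.stream().filter(criteria))
                .collect(Collectors.toList());
    }
}
```

Each partition is scanned independently, so the work per partition shrinks as the same data is spread over more partitions.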

Unfortunately there's one complication with queries that return multiple results. Unlike a primary key find(), in which a cache miss can automatically result in a database query, a grid query gives no indication of whether its results are complete. Perhaps only half of the objects you're looking for are in the grid; a grid query would return those and silently miss the other half, which exist only in the database. Warming the cache solves this problem by ensuring all objects are in the grid, but that isn't always possible. However, for a given use case, you may know whether a particular query should be directed to the grid or to the database. The way you influence query execution in JPA is through query hints. Perhaps something like:

    Customer customer = (Customer) entityManager
        .createQuery("select c from Customer c "
            + "where c.masterAccountId = :id")
        .setParameter("id", request.getMasterAccountId())
        // hypothetical, implementation-specific hint name
        .setHint("my-jpa-implementation.dont-query-grid", true)
        .getSingleResult();

Of course there's no standard JPA hint for whether to direct a query to a data grid or not. This means that you'd have to introduce implementation-specific hints into your code. Fortunately the JPA specification requires that implementations ignore hints they don't understand so your code isn't tightly coupled to any particular one through hints.

Updating Objects
Naturally querying is the first thing you think of when looking at JPA on the grid but we also have to consider updates: persisting new objects, modifying existing objects, and deleting objects. When the grid is the L2 cache, it's important to ensure that the grid is only updated after a database transaction has successfully committed. Persisting a new object will result in a database INSERT and the new object will be placed into the grid. Modifying an object will result in a database UPDATE and the updated object being placed into the grid. And finally deleting an object will result in a database DELETE and the object being removed from the grid. The key thing is to update the data grid once the database transaction successfully commits.
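
The ordering rule (database commit first, grid update second) can be sketched as follows. This is an illustration only: GridAfterCommit is a hypothetical class, a HashMap stands in for the grid, and the afterCommit() and afterRollback() methods are assumed to be called by whatever transaction machinery the JPA provider uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: grid changes are queued and applied only if the commit succeeds.
final class GridAfterCommit<K, V> {
    private final Map<K, V> grid = new HashMap<>();            // stands in for the data grid
    private final List<Runnable> pending = new ArrayList<>();  // grid changes awaiting commit

    // Persist or modify: the INSERT or UPDATE is assumed to be issued elsewhere.
    void put(K key, V value) {
        pending.add(() -> grid.put(key, value));
    }

    // Delete: the DELETE is assumed to be issued elsewhere.
    void remove(K key) {
        pending.add(() -> grid.remove(key));
    }

    // The database commit succeeded, so it is now safe to publish the changes.
    void afterCommit() {
        for (Runnable change : pending) {
            change.run();
        }
        pending.clear();
    }

    // The database transaction rolled back, so the grid must not see the changes.
    void afterRollback() {
        pending.clear();
    }

    V get(K key) {
        return grid.get(key);
    }
}
```

Because nothing reaches the grid before afterCommit(), other readers never observe an object whose database transaction later fails.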

Data Grid as System of Record
When JPA uses the data grid as a distributed cache, the database is the "system of record." It's the ultimate source of truth and is kept up-to-date at all times. But what if the data grid were the system of record? This is often the case in many financial applications dealing with rapidly changing and transient data. What would JPA on the grid look like if there were no database or if the database were used more as a data archive or warehouse than as an online system? (see Figure 2)

In this architecture, all JPA operations that would normally have resulted in SQL directed to the database are directed instead to the data grid. This includes all queries and all updates. Essentially we replace the database entirely with the data grid. With JP QL translation support we can continue to use JPA as our programming API while working with data stored exclusively in the middle tier. For systems that don't need long-term persistent storage this is ideal. And if more storage or query performance is required you simply add servers to the grid.

Database-backed Data Grid
Even with all queries and updates being performed against the data grid, it's still possible to integrate a database for persistent storage. In this architecture, the grid is responsible for propagating the operations performed on the grid to the database. For example, putting an object into the grid would result in a database INSERT. The advantage of this configuration is that data continues to be highly available but updates are communicated back to the database for permanent storage, reporting purposes, etc. Ideally a grid operation wouldn't be propagated to the database synchronously since that would dramatically reduce throughput. Asynchronous writes of updates to a backing database keep the grid responsive and yet still support persistent storage requirements (see Figure 3).
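
A minimal sketch of the asynchronous ("write-behind") idea follows. The names are hypothetical and the example is deliberately single-threaded: put() only touches the in-memory map and records the key as dirty, while flush() stands in for the background task that a real grid would run to write changes to the database.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

// Hypothetical sketch of write-behind: puts return at once, database writes happen later.
// Not thread-safe; a real grid would flush from a background thread.
final class WriteBehindGrid<K, V> {
    private final Map<K, V> grid = new HashMap<>();         // stands in for the data grid
    private final Set<K> dirtyKeys = new LinkedHashSet<>(); // keys changed since last flush
    private final BiConsumer<K, V> databaseWriter;          // stands in for INSERT/UPDATE

    WriteBehindGrid(BiConsumer<K, V> databaseWriter) {
        this.databaseWriter = databaseWriter;
    }

    // Update the grid only; the database write is deferred.
    void put(K key, V value) {
        grid.put(key, value);
        dirtyKeys.add(key);
    }

    V get(K key) {
        return grid.get(key);
    }

    // Write the latest value of every dirty key to the database; returns the write count.
    int flush() {
        int written = 0;
        for (K key : dirtyKeys) {
            databaseWriter.accept(key, grid.get(key));
            written++;
        }
        dirtyKeys.clear();
        return written;
    }
}
```

One side effect of deferring writes in this sketch is that several updates to the same key between flushes collapse into a single database write. The trade-off is that changes not yet flushed exist only in the grid.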

Mix and Match - Heterogeneous Configuration
So far we've looked at a data grid as a cache for JPA and using JPA as a standard API on top of a data grid. The difference in the two architectures is actually fairly minor. For new objects, it boils down to configuring whether or not JPA writes first to the database and then to the data grid or whether it just writes to the grid. The same logic applies to update and delete operations. Querying, as we've seen, is also similar.

If we can configure how to read/write/query on an Entity-by-Entity basis we can mix architectures. Consider a stock trading application. In such an application you've got "enduring" Entities like Companies, Stocks, and Bonds. But you've also got transient Entities like Bids and Asks. An Entity-level configuration would enable JPA to use the data grid as a cache for persistent Entities, like Company, and the data grid as the system of record for transient Entities like Bids.
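
One way to picture Entity-level configuration is a small lookup from Entity name to storage mode. The sketch below is purely illustrative: EntityStoragePolicy, its Mode values, and the choice of cache mode as the default are all assumptions, not settings from any JPA implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-Entity configuration for a mixed architecture.
final class EntityStoragePolicy {
    enum Mode { GRID_AS_CACHE, GRID_AS_SYSTEM_OF_RECORD }

    private final Map<String, Mode> modes = new HashMap<>();

    void configure(String entityName, Mode mode) {
        modes.put(entityName, mode);
    }

    // Assumption: Entities without explicit configuration use the grid as a cache.
    Mode modeFor(String entityName) {
        return modes.getOrDefault(entityName, Mode.GRID_AS_CACHE);
    }

    // Only cached Entities are written to the database before the grid.
    boolean writesToDatabase(String entityName) {
        return modeFor(entityName) == Mode.GRID_AS_CACHE;
    }
}
```

In the stock trading example, Company would be left in cache mode while Bid would be configured with GRID_AS_SYSTEM_OF_RECORD, so a new Bid would go only to the grid.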

Scaling JPA with a Data Grid
Hopefully it's clear by now that integrating JPA with data grids is possible and that they can increase system throughput by providing fast access to data managed in the middle tier. But they also offer significantly better scalability for JPA applications as compared with commonly used approaches.

Traditionally, scaling up a JPA application is done by increasing the number of servers in the application cluster and using a load balancer to distribute the work evenly. But as you increase the cluster size you are limited in what you can cache without introducing inter-process messaging and locking. Updates to shared data must be communicated to all cluster servers to ensure no JPA caches contain stale data. For a cluster with N servers this means each update will require N-1 messages. As you increase the number of servers in the cluster, the cost of processing a single concurrent update per server increases quadratically, to N(N-1) messages in total, because each of the N servers must message every other server for every update. Worse still, as the cluster grows each server will have to spend a significant amount of its available processing time dealing with incoming update messages. These non-linear communication and update processing costs mean that while traditional approaches to clustering JPA applications that employ caching do work well, they are limited to small-to-medium-size clusters.

A data grid solves this communication problem by having only one shared copy of an object accessible from all servers. An update doesn't require messaging to all servers because they'll each pick up the change next time (if ever) they need the updated object. In a data grid with a scalable peer-to-peer communication architecture (i.e., one without a central message routing bottleneck) an update requires communicating to the server that stores the object and to the server(s) that store a backup copy in case of failover. In this case, the communication cost for processing a single concurrent update per server is described by the linear function C × N, where C is a constant reflecting the number of copies (primary and backups). This linear update cost means that it's possible to scale JPA applications using a data grid to large clusters and achieve much higher throughput than would typically be possible.
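
To make the difference concrete, the two costs can be computed side by side. The sketch takes the per-update figures from the text at face value (N-1 messages per update when every other server must be told, C messages per update in the grid) and assumes exactly one concurrent update per server. The class and method names are hypothetical.

```java
// Hypothetical sketch comparing message counts for one concurrent update per server.
final class UpdateMessageCost {
    private UpdateMessageCost() {
    }

    // Traditional clustering: each of the N updates is sent to the other N-1 servers.
    static long broadcastMessages(int servers) {
        return (long) servers * (servers - 1);
    }

    // Data grid: each of the N updates goes only to the servers holding the C copies.
    static long gridMessages(int servers, int copies) {
        return (long) servers * copies;
    }
}
```

With 10 servers and one backup copy (C = 2), that is 90 messages versus 20. With 100 servers it is 9,900 versus 200, which is why the broadcast approach stops scaling long before the grid does.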

Challenges
Of course JPA on the grid isn't without its challenges. The first thing developers familiar with object relational mapping and JPA will undoubtedly be thinking about is cache staleness. This is the most common problem caches introduce. Staleness has two sources: third-party updates to the database, and updates performed by JPA applications running on other servers in the cluster. Dealing with third-party updates is no different with a data grid than it is with any other cache. Most JPA implementations offer a range of techniques to deal with this, including eviction policies, query refresh options, and for extremely volatile data, the ability to disable caching. This is well-worn territory that data grids don't particularly complicate.

As discussed earlier, staleness due to updates made in other cluster servers is traditionally solved by messaging, although that approach has its limitations. In high-transaction-rate systems where the messaging overhead is significant, JPA applications tend to minimize their cache usage and rely on the database to ensure they have the most recent version of the data. Ironically, as transaction rates increase and the value of caching increases, caching is often disabled because the cost of maintaining cache coherence is too high. The use of a data grid to virtually eliminate the messaging and update processing overhead means that high-transaction-rate systems can take advantage of caching to achieve even higher throughput without having to manage staleness.

Querying is another challenge. JPA defines the general-purpose JP QL, which is in many ways similar to SQL and includes many of the same notions. The goal of JP QL is to provide an object-based query language that's easy to translate into SQL for execution on a relational database. Of course data grids aren't relational databases and each has its own query framework. The extent to which JP QL can be translated and executed on a particular grid depends on the expressiveness of the grid's query framework.

Another challenging area is object relationships. JPA supports a number of relationship types along with the notion of embedded objects. Relationship support varies by data grid product and each has its subtleties. Issues include: what kind of relationships are supported; whether objects can have relationships across the grid or must be co-located; and what query operators are supported on relationships. The answer to this last question obviously has a big impact on what kind of JP QL queries can be executed.

This list is definitely not exhaustive but it highlights the kinds of issues that have an impact on JPA/data grid integration.

Conclusion
Data grids are not relational databases and so we can't expect a perfect match between JPA and data grids. But even with some limitations, JPA on the grid is an exciting technology that provides a way to evolve JPA applications to leverage the power of data grids to build scalable high-performance systems.

More Stories By Shaun Smith

Shaun Smith is a Principal Product Manager for Oracle TopLink and an active member of the Eclipse community. He's Ecosystem Development Lead for the Eclipse Persistence Services Project (EclipseLink) and a committer on the Eclipse EMF Teneo and Dali Java Persistence Tools projects. He’s currently involved with the development of JPA persistence for OSGi and Oracle TopLink Grid, which integrates Oracle Coherence with Oracle TopLink to provide JPA on the grid.
