| Close Window |
Most non-trivial applications involve high degrees of concurrency and many layers of abstraction. Concurrency is associated with resource contention and an increase in deadlock conditions. The multiple layers of abstraction make it more difficult to isolate and fix the deadlock conditions.
Generally, a deadlock happens when two or more concurrent threads of execution each hold a resource and request another resource. Since neither one continues until it acquires the resource, we say each specific thread is blocked; if each thread is blocked on a resource that is held by another thread in the same group, we say the group of threads is deadlocked.
In this article, we'll discuss two broad categories of deadlocks that occur in typical non-trivial J2EE applications: "simple" database deadlocks and cross-resource deadlocks. Although the discussion is based on J2EE, it also applies to other technology platforms.
Database Deadlocks
In a database, one connection can block another if it's holding database locks needed by the other. If two or more connections are blocking each other, none of them can proceed, and they're deadlocked.
Database deadlocks can be tricky because the locks involved often aren't explicit. Typically, updating a row implicitly requires locking that row, doing the update, and then releasing the lock when the enclosing transaction is committed or rolled back. Depending on the database platform, the configured isolation level, and any query hints, the acquired lock can be fine- or coarse-grained and block or not block other queries on the same row, table, or database.
The locks acquired can depend on the internally generated query plan that can change over time as the data size and distribution changes, so a query that acquired one set of locks in one environment can try to acquire a completely different set of locks in another. The database is free to escalate its locks when necessary - instead of locking 10 rows on the same data page, it may choose to lock the entire page, which may block reads or writes to rows that don't need to be locked.
Depending on the database schema, a read or a write can require traversing or updating multiple indexes, validating constraints, executing triggers, etc., each of which can introduce more locks. Further, other applications may be hitting some of the same database schema objects and acquiring locks differently than your application ever does.
All of these factors conspire to make it practically impossible to eliminate the possibility of database deadlocks. Fortunately, database deadlocks are generally recoverable: the database will notice the deadlock, forcibly kill one of the connections (typically the one that has done the least work), and roll back its transaction. This releases all of the locks associated with the terminated transaction, which should allow at least one of the other connection(s) to acquire the locks on which they were blocking.
Because of this typical deadlock-handling behavior, it often suffices simply to retry the entire transaction in the case of a database deadlock. When a database connection is killed, an exception is emitted that can be caught by the application and identified as a database deadlock condition. If that deadlock exception is allowed to propagate out to the layer of code that initiated the transaction, that code can simply start a new transaction and redo all of the earlier work. For this strategy to be correct, the code inside the transaction must have no side-effects until after the transaction has been successfully committed. Note: You'll want to put a limit on the number of times you retry or else a particularly deadlock-prone piece of code may loop forever.
This feels a bit clunky - if something goes wrong, we just try again. However, because of the freedom the database has in the locks that it acquires, it's nearly impossible to guarantee that two threads of execution can't cause a database deadlock. At least this approach guarantees that the application behaves correctly in the rare case that the deadlock happens, and is far less clunky than asking your users to retry the operation.
In a J2EE application, developers can set an EJB call to use either bean-managed transactions (BMT), where the developer specifically starts and commits or rolls back the transaction, or container-managed transactions (CMT), where the application server starts a transaction before calling the method and commits or rolls back the transaction after the method completes. It would be nice if the EJB vendors supplied a retry-on-deadlock parameter that would do this automatically with container-managed transactions. Without this automated feature, developers end up forcing EJB calls to use bean-managed transactions just to be able to retry on deadlocks. (One of the disadvantages to making your EJB calls use bean-managed transactions is that it's not obvious how to get the same semantics as a container-managed transaction with "RequiresNew" or "NotSupported." See www.onjava.com/pub/a/onjava/2005/07/20/transactions.html for how it can be done.)
The specifics of how often you'll run into deadlocks and which locks will block other threads depend a lot on your database platform, hardware, database schema, and queries. In databases that use lock-based concurrency control, like MSSQL, uncommitted writes can block reads and uncommitted reads can block writes, which makes them more deadlock-prone. In MVCC (multi-version concurrency control) databases like Oracle, uncommitted writes won't block reads - the read will simply see the old version of the row. That can introduce other problems but doesn't create as many deadlock opportunities. Familiarize yourself with these database locking schemes and be aware of which type you are using.
There are a number of good references on how to find, fix, and avoid database deadlocks, but none of them will eliminate the possibility of deadlocks.
Cross-Resource Deadlocks
When your deadlock condition isn't completely contained within a database, it can be much more difficult to track down. Since the database is aware of the locks held and requested, it can detect deadlocks that are entirely contained in a database; also, because database transactions provide a nice boundary of what things should and shouldn't be atomic, the database can simply roll back a transaction to recover from a deadlock. Deadlocks that are in other environments, like the JVM, or deadlocks that span environments can be more dangerous because the environment can't (or doesn't) detect them and try to recover. Worse, these deadlocks can have a compounding effect - if two threads are deadlocked while holding some set of resources, any other thread that tries to access one of those resources also becomes blocked, along with any resources that thread has already acquired. Often, these deadlocks can be difficult to track down, but some familiarity with the general patterns usually helps to identify and fix the problem.
There are a few questions to ask when an environment gets into a suspected deadlock condition. The answers to these questions will indicate which of the following scenarios you're dealing with, if any, and give more detail about how to fix the underlying problem. Some of the key things to ask are:
Cross-Resource Deadlock #1: Pool Exhaustion with Escalating Clients
The first deadlock scenario we'll look at happens only under load, when a resource pool is too small and each thread needs more than the available resources from the pool. For example, consider an EJB call that uses a database connection, then makes a nested EJB call that uses a separate database connection from the same connection pool. This will happen if the nested EJB call is declared as "RequiresNew," for example.
Under normal load, or with a sufficiently sized connection pool, the EJB call will get a database connection from the pool, and then call the nested EJB. The nested EJB call will get another database connection from the pool, commit the inner transaction, and return the connection to the pool. The outer EJB call will then commit its transaction, and return its connection to the pool.
However, suppose the connection pool has a maximum size of 10 connections, and there are 10 concurrent calls to the outer EJB. Each of those threads acquires a database connection, emptying the pool. Now, they each try to make the nested EJB call, which needs to acquire a second database connection. None of them can proceed, and none of them will give up their first database connection, so all 10 threads are deadlocked.
When investigating a deadlock of this type, you'll see a large number of threads in your thread dump waiting to acquire resources, and the same number of active database connections, all idle and unblocked. If you can inspect the connection pool at runtime while the application is deadlocked, you should be able to verify that it's actually empty.
The fix for a deadlock of this type is either to increase the size of the connection pool or to refactor the code so that a single thread doesn't require as many database connections at the same time. If the maximum number of database connections required by a single thread is M, and the maximum number of possible concurrent calls is N, the minimum number of connections required in the pool to prevent this problem is (N*(M-1))+1. Alternatively, you could set up the inner EJB call to use a different connection pool, so that even if the outer call's connection pool is empty, the inner calls will be able to proceed using their own connection pool.
Cross-Resource Deadlock #2: Single-Thread, Multiple Conflicting Database Connections
The second cross-resource deadlock scenario can also arise when making nested EJB calls on the same thread, although this case typically happens even in systems that aren't under load. Just as in the example above, the two EJB calls use different connections to the same database. Since the caller can't continue until the nested call completes, the caller's database connection is effectively blocked by the nested call's database connection, although the database isn't aware of this relationship. If the first (outer) connection has acquired a database lock that the second (inner) connection needs, the second connection will block indefinitely waiting for the first connection to be committed or rolled back and a deadlock arises. Since the database isn't aware of the relationship between the two connections, the database won't detect this as a deadlock.
As a concrete example, consider a data-loading EJB call. This EJB call takes in a large object and persists it to the database in several stages. As it performs the data-load, it updates a separate table that records the state of the pending data-load operation. We'd like the state-update to be visible immediately, but we don't want the loaded data to be visible in an incomplete state, so the state-update is done with a call to a "RequiresNew" EJB. Roughly, our (flawed) data-load method looks like the code in Listing 1.
In the example, the updateBatchStatus method makes a "RequiresNew" EJB call to actually update the batch_status database table so the status change is immediately visible, even though the effects of the current transaction aren't yet visible. The call to executeUpdate isn't an EJB call, so it executes in the same transaction as the rest of bulkLoadData.
As described, this code will cause a deadlock even in the absence of concurrency. When bulkLoadData calls the executeUpdate method, it updates an existing database row, which involves acquiring a write-lock on that row. The nested EJB call to updateBatchStatus will execute on a separate database connection and try to execute a very similar query, but it will block because it can't acquire the necessary write-lock. From the database's perspective, this second connection will be allowed to proceed as soon as the first connection's transaction is committed or rolled back, which is why the database doesn't detect this as a deadlock. The VM, however, won't allow the call to bulkLoadData to complete before each call to updateBatchStatus completes, so we have a deadlock.
This example shows one update blocking another update, so it causes a deadlock on any database. If the initial update query was a simple select query instead, this example would only cause deadlocks on databases that use lock-based concurrency control, where one connection's read-lock can block another connection from acquiring a write-lock. In any case, a deadlock of this type is neither timing-dependent nor load-dependent, and the thread dump will show one Java thread waiting for a database response, but that thread will be associated with two active database connections. And one of those database connections will be idle, but blocking the other connection.
This scenario has many subtle variations, where more than one thread and more than two database connections may be involved. For example, the outer EJB call's database connection may have acquired a database lock that blocks another unrelated database connection from proceeding, but that unrelated database connection has acquired a lock that blocks the nested EJB call's database operation. This particular case is timing-dependent, and will show several Java threads waiting for database responses. At least one of those Java threads will be associated with two active database connections.
Cross-Resource Deadlock #3: JVM Locks Crossed with Database Locks
A third deadlock scenario occurs when mixing database locks with JVM
locks. In this case, one thread holds a database lock and is trying to
acquire a JVM lock (trying to enter a synchronized block), and another
thread holds that JVM lock and is trying to acquire a database lock.
Again, the database sees one connection blocking another, but nothing
is stopping that connection from proceeding, so it won't detect a
deadlock. The JVM sees one thread inside a synchronized block and
another trying to enter, so even if the JVM could detect deadlocks and
handle them somehow, it wouldn't detect this case.
For an example that illustrates this deadlock scenario, consider a simple (flawed) read-through cache. This cache is a HashMap backed by a database table. If a cache hit occurs, it will simply return the value from the HashMap; on a cache miss however, it will read the value from the database, add it to the HashMap, and return it as shown in Listing 2.
It's a simple read-through cache. Note that the get() method is synchronized, since we access a non-thread safe container, and we want the containsKey/put combination to be atomic in the case of a cache miss.
This cache is fairly straightforward: the contract is that if we change the data in the table backing the cache, we should call clearCache(), so the cache can avoid handing back stale data. The resulting cache misses will repopulate the cache appropriately.
Let's now consider some code that changes this data and clears the cache:
public void updateData(String key, String value) {
executeUpdate("update cache_table set value='" + value +
"' where key='" + key + "'");
SimpleCache.getInstance().clearCache();
}
In a simple case, this will work with no problems. However, on a database that uses lock-based concurrency control, the query in updateData will prevent the select query in queryForValue from executing, because the update statement will have acquired a write-lock that prevents the select query from acquiring a read-lock on the same row. If the timing is just right, one thread can try to read a given value out of the cache and get a cache miss while another thread updates that value in the database. If the database executes the update statement first, it will block the select statement from continuing. However, the thread executing the select statement came through the synchronized get method, so it has acquired the lock on the SimpleCache. For the thread in updateData to return, it has to call clearCache(), but it can't acquire the lock (clearCache() is synchronized).
When dealing with an instance of this scenario, there will be one Java thread waiting for a response from the database, and one waiting to acquire a JVM lock. Each thread will be associated with a database connection, with one connection blocking the other. The fix is to avoid doing database operations while holding JVM locks: the get() method of our SimpleCache could be rewritten like this:
public Object get(String key) {
synchronized(this) {
if (cache.containsKey(key)) {
return cache.get(key);
}
}
Object value = queryForValue(key);
synchronized(this) {
cache.put(key, value);
}
return value;
}
Since we now know that this deadlock case can happen, we can add a check to our queryForValue method to try to avoid the deadlock condition by using Thread.holdsLock():
private Object queryForValue(String key) {
assert(!Thread.holdsLock(this));
return executeQuery(...);
}
While Thread.holdsLock() can be useful in this case, it only works if we know which locks we're specifically worried about. It would be useful to have a similar method that could determine whether the current thread holds any JVM locks. Then, any piece of code that made any kind of RPC call, database access, etc., could throw an exception or log a warning to indicate that these operations can be dangerous while holding a JVM lock.
Note: even though we've fixed the deadlock problem in this example, it's still flawed because the cache is cleared before updateData's transaction has been committed. If a cache miss happens after the clearCache call but before updateData's transaction is committed, the cache will load the old data because the new data isn't visible yet. The fix here is to clear the cache only after the changes have been committed. Observe that this would only happen in an MVCC database; in a lock-based database, the pending update would block the cache's read, so the cache would read the correct value after the update's transaction was committed.
Rules of Thumb
These guidelines should help you avoid these problems, or at least diagnose and fix them if they do happen.
Resources
© 2008 SYS-CON Media Inc.