Welcome!

Java IoT Authors: Elizabeth White, Pat Romanski, Yeshim Deniz, Carmen Gonzalez, Liz McMillan

Related Topics: Java IoT

Java IoT: Article

Looking Inside Stuck Threads

The transmigration of Java threads

Thread pooling is a common technique that modern application servers adopted to run Java applications efficiently. Even application servers not implemented by Java share the concept of using system resources more compactly to maximize overall throughput. Besides the underlying programming mystery of native OS threads, a Java thread object encapsulates some hurdles to easy-to-use and flexible synchronization at the programming level. JDK 5.0 has built-in thread pooling classes in its 'java.util.concurrent' package to facilitate programming the thread pool quickly. If we're using a J2EE application server, the container inherently enforces thread synchronization from its runtime nature. That means we don't have to fight difficult threading issues day and night, but it doesn't mean we can dismiss them. Instead, we should attend to the thread issues inside the code and the architecture. If we don't, system performance will degrade. A once well-running system will gradually become slower and slower, then application throughput will be blocked and external requests start to queue up. There's some degree of denial of service. In most commercial production environments like telecom, e-commerce and banking, this situation impacts the business and can create unplanned system outages.

While the server operator calls for help, an experienced engineer often asks something outstanding of the application environment. During the incident, we may see either ultra-high or ultra-low CPU usage at the OS level along with applications hanging and threads sticking at the JVM level from a three-dimensional viewpoint. How does one disclose the bottleneck and abnormality at the JVM level? The answer is: When the problem is reproducible then a commercial productive profiling tool or remotely debugging the JVM is an option. But taking copies of the thread dumps is widely used because it's straightforward and instantaneous. And it involves the least overhead.

Thread dumps provide a snapshot of the JVM internals at a special point at a minimal cost. We may give the JVM hosting the applications a signal SIG-QUIT with the JVM process ID (PID) on a Unix-like system (e.g., kill -3 xxxxxx; where 'xxxxxx' is the JVM PID) or have a control-break on the Windows Java console ask the JVM to output its thread information in detail to a standard output when the JVM didn't start in company with a '-Xrs' option before. Due to the importance of the thread dump, it's best to redirect the standard output to a file or pipe the information to a utility that can store and rotate the standard output to log files (see Figure1).

A JVM has a complementary function that enables it to get the thread dump at the undocumented C API level. (We can look at the Java source code that Sun released recently under the GPL to see this feature.) We may utilize this API for a simple debugging framework to address many common issues inside the application. But it requires a JNI implementation in C because there's no pure Java API to force the JVM to generate the thread dump, though we may get similar thread stack traces in JDK 5.0 via the 'getAllStackTraces()' API. Despite this tricky function, we're interested in a snapshot of the thread dump while we have identified the stuck threads (see Figure 2).

With copies of the thread dump collected at intervals of seconds, we may identify the stuck threads from the running state of each thread in the thread pool. Fortunately, some application servers do an automatic health check on the application thread pools. In fact, it acts like a watchdog that periodically check the last running statistics on the threads in the thread pools. Once the threads have run for fixed long-running seconds, it will print out the execution information on the stuck threads either in standard output or log files. Second, some platform JDK vendors have out their diagnostic utilities in the public domain to aid us in detecting stuck threads (e.g., HP's JMeter and IBM's thread analyzer). However, once we isolate the stuck threads, we'll have to figure out why they got stuck from the information (i.e., the stack traces of these stuck threads) about what they were doing when they got stuck. This way we can improve code quality and tier architecture in the next iteration (see Figure 3).

A stuck thread means a thread is blocked and can't return to the thread pool smoothly in a given period of time. When an application thread is blocked unintentionally, it means it can't quickly complete its dispatch and be reused. In most of production situations, the root cause of these stuck threads is also the root cause of bad system performance because it interferes with regular task execution. [It's also a performance issue for producers and healthy consumers. < 1 ] (request frequency) < (healthy thread count for request execution/average measured request execution time per healthy thread.]

Blocking without specifying a network connect or read timeout is the most frequent reason we have seen. When we don't manually configure a timeout for each method call involving networking, it will have a potential blocking behavior by the underlying physical socket read/connect characteristic. While waiting infinitely for the response from the other side, the native OS networking layer probably throws an I/O exception. By default this behavior takes an unexpectedly long time (e.g., 240 seconds). Modern distributed systems need to factor in this situation (especially, Web Services invocations). Though we may set timeouts for well-known protocols via some system properties (e.g., sun.net.client.defaultConnectTimeout and sun.net.client.defaultReadTimeout), the newer version of JDK might provide a generic mechanism to explicitly configure each default timeout value for those whose methods call socket connect/read as a security policy file. For example, com.sun.jndi.ldap.read.timeout (http://java.sun.com/docs/books/tutorial/jndi/newstuff/readtimeout.html) wasn't available prior to JDK 6.0 for LDAP service provider read timeout. Otherwise, when the problematic code isn't under the control of end users, it usually needs to restart the application to temporarily reset the abnormal phenomenon propagated from the other side. In addition, we should take into account whether the service we called is idempotent while analyzing this kind of issue in the design phase because we don't know whether the service at the other end keeps executing when the thread has ended its invocation after a timeout (see Figure 4).

The unexpectedly long execution time of a SQL statement is a common condition that causes a stuck thread. In the thread dump we collected, we can see that the stuck thread was running a network socket read for a long time without changes and the thread's stack trace contains many JDBC driver classes. Under these conditions, we can also check the status of the database it connected with and set the query timeout for all application code using a JDBC statement setQueryTimeout method. (Most JDBC drivers support this feature but we'd have to read the JDBC driver's release note first.) According to the different nature of every SQL query, it would be better to segregate the programs that have a longer execution time in another thread pool and tune the database table with indices for faster access. We would also need to check whether the JDBC driver is certified with the connected database. A sub-issue is the accessed table locked by other processes so the threads for the JDBC query couldn't continue because of table locking.

Resource contention is an issue that's hard to find if we don't get the entire thread dump to analyze. Basically, it's an issue of producers and consumers. Any limited resources on the system (JDBC connections, socket connections, etc.) will impact this issue. The best thing to do is look at the thread dump, get the stuck thread name from the log, and find the bottleneck that's causing the stuck thread.

File descriptor leaking is an issue that causes this phenomenon (Note that a Unix socket implementation requires a file descriptor). So the JVM should have enough file descriptor numbers to host our applications. Generally, we can adjust the open file limit with the Unix shell 'ulimit' command for the current shell. And we can list the open files with the public domain 'lsof' tool. It's intensely interesting that many developers don't explicitly use the 'close()' method in the final block when an object inherently provides a 'close()' method and want JVM to release these unclosed objects when garbage is collected. We should keep firmly in mind that that act is bad without closing the system resource after use. A special case is when the socket connections in the application don't close properly while still being underdeployed and then the application begins to throw an IOException with a 'Too many open files' message after repeated application redeployment.


More Stories By Patrick Yeh

Patrick Yeh (WEN-PIN, YEH) A senior technical consultant of BEA Systems, Taiwan for solving the critical production issues. The core value of this position is to provide the solid technical power on problem solving and to reduce the customer's downtime losses that may have a critical impact on their business (+4 years).

Comments (2) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
Patrick 08/03/07 04:21:08 AM EDT

My friends,
if you need the source code on this article, please give me an email ([email protected])with title named 'Source code about looking inside stuck threads' !

Omar 04/09/07 04:10:36 PM EDT

Hi Patrick,

First of all, excellent article!! Very informative and practical.

You make reference of a utility to monitor stack threads. Where can I download this utility? There seems to be a .jar file an a shared library.

Thanking you in advance,
Omar

@ThingsExpo Stories
Amazon started as an online bookseller 20 years ago. Since then, it has evolved into a technology juggernaut that has disrupted multiple markets and industries and touches many aspects of our lives. It is a relentless technology and business model innovator driving disruption throughout numerous ecosystems. Amazon’s AWS revenues alone are approaching $16B a year making it one of the largest IT companies in the world. With dominant offerings in Cloud, IoT, eCommerce, Big Data, AI, Digital Assis...
SYS-CON Events announced today that Progress, a global leader in application development, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Enterprises today are rapidly adopting the cloud, while continuing to retain business-critical/sensitive data inside the firewall. This is creating two separate data silos – one inside the firewall and the other outside the firewall. Cloud ISVs oft...
The 21st International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Digital Transformation, Machine Learning and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding busin...
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 21st International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @ThingsExpo Silicon Valley Call for Papers is now open.
As cloud adoption continues to transform business, today's global enterprises are challenged with managing a growing amount of information living outside of the data center. The rapid adoption of IoT and increasingly mobile workforce are exacerbating the problem. Ensuring secure data sharing and efficient backup poses capacity and bandwidth considerations as well as policy and regulatory compliance issues.
SYS-CON Events announced today that Interoute has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Interoute is the owner operator of Europe's largest network and a global cloud services platform, which encompasses over 70,000 km of lit fiber, 15 data centers, 17 virtual data centers and 33 colocation centers, with connections to 195 additional partner data centers. Our full-service Unifie...
In order to meet the rapidly changing demands of today’s customers, companies are continually forced to redefine their business strategies in order to meet these needs, stay relevant and continue to see profitable growth. IoT deployment and development is integral in this transformation, and today businesses are increasingly seeing the value of investing their resources into IoT deployments. These technologies are able increase ROI through projects such as connecting supply chains or enabling sm...
SYS-CON Events announced today that Progress, a global leader in application development, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Enterprises today are rapidly adopting the cloud, while continuing to retain business-critical/sensitive data inside the firewall. This is creating two separate data silos – one inside the firewall and the other outside the firewall. Cloud ISVs ofte...
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.
SYS-CON Events announced today that DivvyCloud will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. DivvyCloud software enables organizations to achieve their cloud computing goals by simplifying and automating security, compliance and cost optimization of public and private cloud infrastructure. Using DivvyCloud, customers can leverage programmatic Bots to identify and remediate common cloud problems in rea...
SYS-CON Events announced today that Outscale, a global pure play Infrastructure as a Service provider and strategic partner of Dassault Systèmes, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 2010, Outscale simplifies infrastructure complexities and boosts the business agility of its customers. Outscale delivers a secure, reliable and industrial strength solution for its customers, which in...
New competitors, disruptive technologies, and growing expectations are pushing every business to both adopt and deliver new digital services. This ‘Digital Transformation’ demands rapid delivery and continuous iteration of new competitive services via multiple channels, which in turn demands new service delivery techniques – including DevOps. In this power panel at @DevOpsSummit 20th Cloud Expo, moderated by DevOps Conference Co-Chair Andi Mann, panelists will examine how DevOps helps to meet th...
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
SYS-CON Events announced today that A&I Solutions has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Founded in 1999, A&I Solutions is a leading information technology (IT) software and services provider focusing on best-in-class enterprise solutions. By partnering with industry leaders in technology, A&I assures customers high performance levels across all IT environments including: mai...
Every successful software product evolves from an idea to an enterprise system. Notably, the same way is passed by the product owner's company. In his session at 20th Cloud Expo, Oleg Lola, CEO of MobiDev, will provide a generalized overview of the evolution of a software product, the product owner, the needs that arise at various stages of this process, and the value brought by a software development partner to the product owner as a response to these needs.
Most technology leaders, contemporary and from the hardware era, are reshaping their businesses to do software in the hope of capturing value in IoT. Although IoT is relatively new in the market, it has already gone through many promotional terms such as IoE, IoX, SDX, Edge/Fog, Mist Compute, etc. Ultimately, irrespective of the name, it is about deriving value from independent software assets participating in an ecosystem as one comprehensive solution.
SYS-CON Events announced today that EARP will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. "We are a software house, so we perfectly understand challenges that other software houses face in their projects. We can augment a team, that will work with the same standards and processes as our partners' internal teams. Our teams will deliver the same quality within the required time and budget just as our partn...
SYS-CON Events announced today that delaPlex will exhibit at SYS-CON's @ThingsExpo, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. delaPlex pioneered Software Development as a Service (SDaaS), which provides scalable resources to build, test, and deploy software. It’s a fast and more reliable way to develop a new product or expand your in-house team.
SYS-CON Events announced today that Tappest will exhibit MooseFS at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. MooseFS is a breakthrough concept in the storage industry. It allows you to secure stored data with either duplication or erasure coding using any server. The newest – 4.0 version of the software enables users to maintain the redundancy level with even 50% less hard drive space required. The software func...
In his keynote at @ThingsExpo, Chris Matthieu, Director of IoT Engineering at Citrix and co-founder and CTO of Octoblu, focused on building an IoT platform and company. He provided a behind-the-scenes look at Octoblu’s platform, business, and pivots along the way (including the Citrix acquisition of Octoblu).