Java IoT Authors: Pat Romanski, Stackify Blog, Elizabeth White, Liz McMillan, Mamoon Yunus

Related Topics: Java IoT, Microservices Expo, Machine Learning

Java IoT: Article

Application Performance Monitoring in Production

A step-by-step guide – measuring a distributed system

Last time I explained logical and organizational prerequisites to a successful production level application performance monitoring. I originally wanted to look at the concrete metrics we need on every tier, but was asked how you can correlate data in a distributed environment, so this will be the first thing that we look into. So let’s take a look at the technical prerequisites of successful production monitoring.

Collecting data from distributed environment
The first problem that we have is the distributed nature of most applications. In order to isolate response time problems or errors we need to know which tier and component is responsible. The first step is to record response times on every entry and exit from a tier.

A Simple Transaction Flow showing Tier response time

A Simple Transaction Flow showing Tier response time

The problem with this is twofold. First the externalJira tier will host multiple different Services which will have different characteristics. This is why we need to measure the response time on that service level and not just on the tier level. We need to do this on both sides of the fence, otherwise we will run into an averaging problem. The second problem is that externalJira is called from different other tiers and not just one.

In a complex System average tier response times are not helpful. Tier response times need to be look at with a transaction context as provided by this transaction flow

In a complex System average tier response times are not helpful. Tier response times need to be look at with a transaction context as provided by this transaction flow

When we look at the picture we see that externalJira is called from three different tiers. These tiers sometimes call the same services on externalJira, but with vastly different parameters which leads to different response times of externalJira. We have a double averaging problem:

  • Different tiers calling different services on externalJira skewing the average
  • Different tiers calling the same service on externalJira with different parameters skewing the average

Let’s look at this in a little more detail with following example:

Tier Entry Point Avg. Tier Response Time Service 1 Response Time Service 2 Response Time Service 3 Response Time
100x Payment 1 1000ms 305ms 403ms 102ms
200x Payment 2 1500ms 607ms 151ms 202ms
300x Service 1 500ms 101ms
600x Service 2 166ms 52ms
400x Service 3 89ms

In this table we see which Tier Entry Point calls which Services on other tiers. The Payment 1 Service calls Services 1-3 and measures the response time on its side. The Payment 2 Service calls the same three services but with very different response times. When we look at the times measured on Services 1-3 respectively we will see a completely different timing. We did measure the response time of Service 1-3 irrespective of their calling context and ended up with an average! Service 1 does not contribute 500ms to the response times of either Payment 1 or 2, but the overall average is 500 ms. This average becomes more and more useless the more tiers we add. One of our biggest customers hits 30 JVMs in every single transaction. In such complex environments quick root cause isolation is nearly impossible if you only measure on a tier by tier basis.

In order to correlate the response times in a complex system we need to retain the transaction context of the original caller. One way to solve this is to trace transactions, either by using a monitoring tool that can do that or by modifying code and build it into the application.

Correlating data in a distributed environment
HTTP uses a concept called referrer that enables a webpage to know from which other pages it was called. We can use something similar and leverage this to do our response time monitoring. Let’s assume for the moment that the calls done in our imaginary Application are all WebService HTTP calls. We can then either use the referrer tag or some custom URL query parameter to track the calling services. Once that is achieved we can track response time based on that custom property. Many Monitoring tools allow you to segregate response time based on referrer or query parameters. Another possibility, as always, is to report this yourself via your own JMX Bean. If we do that we will get response that is context aware.

Tier Entry Point Referrer Avg. Tier Response Time Service 1 Response Time Service 2 Response Time Service 3 Response Time
100x Payment 1 1000ms 305ms 403ms 102ms
200x Payment 2 1500ms 607ms 151ms 202ms
100x Service 1 Payment 1 300ms 30ms
200x Service 1 Payment 2 600ms 100ms
100x Service 2 Payment 1 400ms 52ms
200x Service 2 Payment 2 150ms
100x Service 2 Payment 1, Service 1 25 ms
200x Service 2 Payment 2, Service 1 90 ms
100x Service 3 Payment 1, Service 2 50 ms

We now see that Service 2 only calls Service 3 when it is called directly from Payment 1, which means its contribution is far less than the original table suggested.We also still see a difference in the request and the response time of the services. This is due to the involved network communication. By measuring the the response time context aware we can now also see the time that we spend in the communication layer more clearly, which enables us to isolate network bottlenecks and their impact. The average response time table did not allow us to do that.

We can push this to any level we want. E.g. we can divide the Payment 1 WebService call into its three variants supported by our shop: Visa, MasterCard, AMEX. If we push this as a tag/referrer down the chain we get an even more detailed picture of where we spend time.

The problem with this approach is that it is not agnostic to your application or the remoting technology. It requires you to change your code base and monitoring the different response times becomes more complicated with every tier you add. Of course also need to maintain this alongside the actual application features, which increases cost and risk.

This is where professional APM Tools come in. Among other things they do transaction tracing and tagging transparently without code changes. They can also split measure response time in a way that is context aware; they can differentiate between an AMEX and a Visa CreditCard payment via Business Transactions. And finally they allow you to focus on the entry response time, in our case Payment 1 and Payment 2. In case you have a problem, you can drill down to the next level from there. So there is no need to keep an eye on all the deeper level response times.

dynaTrace automatically traces calls across tiers (synchronous and asynchronous), captures contextual information per transactions and highlights which tiers contribute how much to the response time

dynaTrace automatically traces calls across tiers (synchronous and asynchronous), captures contextual information per transactions and highlights which tiers contribute how much to the response time

Beyond response time
By having the response time distribution across services and tiers we can quickly isolate the offending tier/service in case we face a performance problem. I stated before that monitoring must not only allow us to detect a problem but also isolate the root cause. To do this we need to measure everything that can impact the response time either directly or indirectly. In generally I like to distinguish between usage, utilization and impact measures.

Usage and Utilization Measurement
A usage measure describes how much a particular application or transaction uses a particular resource. A usage metric can usually be counted and is not time based. An exception is the maybe best known usage measure of CPU time. But CPU time is not really time based, it is based on cpu cycles; and there is a limited number of cpu cycles that can be executed in a specific time. We can directly measure how much CPU time is consumed by our request, by looking at the threads consumed CPU time. In addition we can measure the CPU usage on the process and system level. Most of the time we are measuring a limited resource and as such we also have a utilization measure, e.g. the CPU utilization of a system. Other examples include the number of database calls of a transaction or the connection pool usage. What is important is that the usage is a characteristic of the transaction and does not increase if performance goes down. If the specific resource is fully utilized, we have to wait for it, but then will use the same amount we always do!

While Response time and Overall CPU Usage fluctuates the average CPU usage per transaction is stable

While Response time and Overall CPU Usage fluctuates the average CPU usage per transaction is stable

To illustrate that we can again think about the CPU time. An AMEX Credit Card payment Transaction will always use roughly the same amount of CPU time (unless there is an severe error in your code). If the CPU is utilized the response time will go up, because it has to wait for CPU, but the amount of CPU time consumed will stay the same.This is what is illustrated in the chart. The same should be true if you measure the amount of database statements executed per transaction, how many webservices were called or how many connections were used. If a usage measure has a high volatility then you either are not measuring on a granular enough business transaction level (e.g. AMEX and Visa Payment may very well have different usage measures) or it is an indicator for an architectural problem within the application. This in itself is, of course, useful information for the R&D department. The attentive reader will note that caches might also lead to volatile response times, but then they should only do so during the warmup phase. If we still have high volatility after that due to the cache, then the cache configuration is not optimal.

The bottom line is that a usage measure is ideally suited to measure which sort of transactions utilize your system and resources the most. If one of your resources reaches 100% utilization you can use this to easily identify which transactions or applications are the main contributors. With this information you can plan capacity properly or change the deployment to better distribute the load. Usage measures on a transaction level are also the starting point for every performance optimization activity and are therefore most important for the R&D department.

Unfortunately the very fact that makes a usage measure ideal for performance tuning, makes it unsuitable for troubleshooting in case of scalability or performance problems in production. If the connection pool is exhausted all the time, we can assume that it has a negative impact on performance, but we do not know which transactions are impacted. Turning this around means that if you have a performance problem with a particular transaction type you can not automatically assume that the performance would be better if the connection pool were not exhausted!

The response time will increase, but all your transaction usage measures will stay the same, so how do you isolate the root cause?

Impact Measurement
In contrast to usage measures, impact measures are always time based. We measure the time that we have to wait for a specific resource or a specific request. An example is the getConnection method in case of a database connection pool. If the connection pool is exhausted the getConnection call will wait until a connection is free. That means if we have a performance problem due to an exhaustion of the database connection pool, we can measure that impact by measuring the getConnection method. The important point is that we can measure this inside the transaction and therefore know that it negatively impacts the AMEX , but not the Visa transactions. Other examples are the execution time of a specific database statement. If the database slows down in a specific area we will see this impact on the AMEX transaction by measuring how long it has to wait for its database statements.

The Database Statement impacts the response time by contributing 20%

The Database Statement impacts the response time by contributing 20%

When we take that thought further we will see that every time a transaction calls a different service, we can measure the impact that the external service has by measuring its response time at the calling point. This closes the cycle to our tier response times and why we need to measure them in a context-aware fashion. If we would only measure the overall average response time of that external service we would never know the impact it has on our transaction.

This brings us to a problem that we have with impact measurement on the system level and in general.

Lack of transaction context
In an Application Server we can measure the utilization of the connection pool, which tells us if there is a resource shortage. We can measure average wait time and/or the average number of threads waiting on a connection from the pool, which similar to the Load Average tells us that the resource shortage does indeed have an impact on our application. But both the usage and the impact measure lack transaction context. We can correlate the measurements on time basis if we know which transactions use which resources, but we will have to live with a level of uncertainty. This forces us to do guesswork in case we have a performance problem. Instead of zeroing in on the root cause quickly and directly, this is the main reason that trouble shooting performance problems takes a long time and lots of experts. The only way to avoid that guesswork is to measure the impact directly in the transaction, by either modifying the code or use a motoring tool that leverages byte-code injection and provides a transaction context.

Of course there are some things that just cannot be measured directly. The CPU again is a good example. We cannot directly measure that we wait for a CPU to be assigned to us, at least not easily. So how do we tackle this? We measure the usage, utilization and indirect impact. In case of the CPU the indirect impact is measured via the Load Average. If the load average indicates that our process is indeed waiting for CPU we need to rule out any other cause for the performance problem. We do this by measuring the usage and impact of all other resources and services used by the application. To quote Sherlock Holmes: ” If all directly measured root causes can be ruled out than the only logical conclusion is that the performance problem is caused by whatever resource cannot be directly measured.”  In other words if nothing else can explain the increased response time, you can be sufficiently certain that the CPU exhaustion is the root cause.

What about log files
As a last item I want to look at how to store and report the monitoring data. I was asked before whether log files are an option.The idea was to change the code and measure the application from inside (as hinted at several times by me) and write this to a log file. The answer is a definitive NO; log files are not a good option.

The first problem is the distributed nature of most applications. At the beginning I explained how to measure distributed transactions. It becomes clear that while you can write all this information to a log file periodically, it will be highly unusable, because you would have to retrieve all the log files, correlate the distributed transaction manually and on top of that correlate it with system and application server metrics taken during that time. While doable, it is a nightmare if we talk about more than two tiers.

Trying to corrlate log files from all the involved servers and databases is nearly impossible in bigger systems

Trying to manually correlate log files from all the involved servers and databases is nearly impossible in bigger systems. You need a tool that automates this.

The second problem is lack of context. If you only write averages to the log file you will quickly run into the averaging problem. One can of course refine this to no end, but it will take a long time to reach a satisfactory level of granularity and you will have to maintain this in addition to application functionality, which is what should really matter to you. On the other hand if you write the measured data for every transaction you will never be able to correlate the data without tremendous effort and will also have to face a third problem.

Both logging all of the measures and aggregating the measures before logging them will lead to overhead which will have a negative impact on performance. On the other hand if you only turn on this performance logging in case you already have a problem, we are not talking about monitoring anymore. You will not be able to isolate the cause of an already occurred problem until it happens again.

The same is true if you do that automatically and e.g. automatically start capturing data once you realize something is wrong. It sounds intuitive, but it really means that you already missed the original root cause of why it is slow.

On the other hand log messages often provide valuable error or warning information that is needed to pinpoint problems quickly. The solution is to capture log messages the same way that we measure response times and execution counts. Within the transaction itself. This way we get the valuable log information within the transaction context and do not have to correlate dozens of log files manually.

The dynaTrace Purepath includes Log messages and exceptions in the context of a single transaction

The dynaTrace Purepath includes Log messages and exceptions in the context of a single transaction

In addition a viable Monitoring solution must store, aggregate and automatically correlate all retrieved measurements outside the monitored application. It must store it permanently, or at least for some time. This way you can analyze the data after the problem happened and do not have to actively wait until it happens again.

By now it should be clear why we need to measure everything that we can in the context of the calling transaction. By doing this we can create an accurate picture of what is going on. It enables us to rule out possible root causes for a problem and zero in on the real cause quickly. It also enables us to identify resource and capacity issues on an application and service level instead of just the server level. This is equally important for capacity planning and cost accounting.

As a next step we will look at the exact metrics we need to measure in each tier, how to interpret and correlate them to our transaction response time.

Related reading:

  1. Application Performance Monitoring in production – A Step-by-Step Guide – Part 1 // Setting up Application Performance Monitoring is a big task,...
  2. Troubleshooting response time problems – why you cannot trust your system metrics // Production Monitoring is about ensuring the stability and health...
  3. Application Performance and Architectural Problems You can Find in One Hour // When we get engaged with prospects that are interested...
  4. Hands-On Guide: Verifying FIFA World Cup Web Site against Performance Best Practices Whether you call it Football, Futbol, Fussball, Futebol, Calcio or...
  5. Week 9 – How to Measure Application Performance Measurement is the most central concept in any performance-related activity....

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes application performance management in large scale production environments with special focus on virtualized and cloud environments. His current focus is how to effectively leverage BigData Solutions and how these technologies impact and change the application landscape.

@ThingsExpo Stories
Detecting internal user threats in the Big Data eco-system is challenging and cumbersome. Many organizations monitor internal usage of the Big Data eco-system using a set of alerts. This is not a scalable process given the increase in the number of alerts with the accelerating growth in data volume and user base. Organizations are increasingly leveraging machine learning to monitor only those data elements that are sensitive and critical, autonomously establish monitoring policies, and to detect...
"We're a cybersecurity firm that specializes in engineering security solutions both at the software and hardware level. Security cannot be an after-the-fact afterthought, which is what it's become," stated Richard Blech, Chief Executive Officer at Secure Channels, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
To get the most out of their data, successful companies are not focusing on queries and data lakes, they are actively integrating analytics into their operations with a data-first application development approach. Real-time adjustments to improve revenues, reduce costs, or mitigate risk rely on applications that minimize latency on a variety of data sources. Jack Norris reviews best practices to show how companies develop, deploy, and dynamically update these applications and how this data-first...
Intelligent Automation is now one of the key business imperatives for CIOs and CISOs impacting all areas of business today. In his session at 21st Cloud Expo, Brian Boeggeman, VP Alliances & Partnerships at Ayehu, will talk about how business value is created and delivered through intelligent automation to today’s enterprises. The open ecosystem platform approach toward Intelligent Automation that Ayehu delivers to the market is core to enabling the creation of the self-driving enterprise.
The question before companies today is not whether to become intelligent, it’s a question of how and how fast. The key is to adopt and deploy an intelligent application strategy while simultaneously preparing to scale that intelligence. In her session at 21st Cloud Expo, Sangeeta Chakraborty, Chief Customer Officer at Ayasdi, will provide a tactical framework to become a truly intelligent enterprise, including how to identify the right applications for AI, how to build a Center of Excellence to ...
SYS-CON Events announced today that Massive Networks will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Massive Networks mission is simple. To help your business operate seamlessly with fast, reliable, and secure internet and network solutions. Improve your customer's experience with outstanding connections to your cloud.
Everything run by electricity will eventually be connected to the Internet. Get ahead of the Internet of Things revolution and join Akvelon expert and IoT industry leader, Sergey Grebnov, in his session at @ThingsExpo, for an educational dive into the world of managing your home, workplace and all the devices they contain with the power of machine-based AI and intelligent Bot services for a completely streamlined experience.
Because IoT devices are deployed in mission-critical environments more than ever before, it’s increasingly imperative they be truly smart. IoT sensors simply stockpiling data isn’t useful. IoT must be artificially and naturally intelligent in order to provide more value In his session at @ThingsExpo, John Crupi, Vice President and Engineering System Architect at Greenwave Systems, will discuss how IoT artificial intelligence (AI) can be carried out via edge analytics and machine learning techn...
SYS-CON Events announced today that Datera, that offers a radically new data management architecture, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Datera is transforming the traditional datacenter model through modern cloud simplicity. The technology industry is at another major inflection point. The rise of mobile, the Internet of Things, data storage and Big...
Consumers increasingly expect their electronic "things" to be connected to smart phones, tablets and the Internet. When that thing happens to be a medical device, the risks and benefits of connectivity must be carefully weighed. Once the decision is made that connecting the device is beneficial, medical device manufacturers must design their products to maintain patient safety and prevent compromised personal health information in the face of cybersecurity threats. In his session at @ThingsExpo...
In his session at @ThingsExpo, Arvind Radhakrishnen discussed how IoT offers new business models in banking and financial services organizations with the capability to revolutionize products, payments, channels, business processes and asset management built on strong architectural foundation. The following topics were covered: How IoT stands to impact various business parameters including customer experience, cost and risk management within BFS organizations.
SYS-CON Events announced today that CA Technologies has been named "Platinum Sponsor" of SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business - from apparel to energy - is being rewritten by software. From planning to development to management to security, CA creates software that fuels transformation for companies in the applic...
From 2013, NTT Communications has been providing cPaaS service, SkyWay. Its customer’s expectations for leveraging WebRTC technology are not only typical real-time communication use cases such as Web conference, remote education, but also IoT use cases such as remote camera monitoring, smart-glass, and robotic. Because of this, NTT Communications has numerous IoT business use-cases that its customers are developing on top of PaaS. WebRTC will lead IoT businesses to be more innovative and address...
An increasing number of companies are creating products that combine data with analytical capabilities. Running interactive queries on Big Data requires complex architectures to store and query data effectively, typically involving data streams, an choosing efficient file format/database and multiple independent systems that are tied together through custom-engineered pipelines. In his session at @BigDataExpo at @ThingsExpo, Tomer Levi, a senior software engineer at Intel’s Advanced Analytics ...
SYS-CON Events announced today that App2Cloud will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct. 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. App2Cloud is an online Platform, specializing in migrating legacy applications to any Cloud Providers (AWS, Azure, Google Cloud).
Recently, IoT seems emerging as a solution vehicle for data analytics on real-world scenarios from setting a room temperature setting to predicting a component failure of an aircraft. Compared with developing an application or deploying a cloud service, is an IoT solution unique? If so, how? How does a typical IoT solution architecture consist? And what are the essential components and how are they relevant to each other? How does the security play out? What are the best practices in formulating...
SYS-CON Events announced today that MobiDev, a client-oriented software development company, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MobiDev is a software company that develops and delivers turn-key mobile apps, websites, web services, and complex software systems for startups and enterprises. Since 2009 it has grown from a small group of passionate engineers and business...
Internet of @ThingsExpo, taking place October 31 - November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 21st Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devic...
SYS-CON Events announced today that Dasher Technologies will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Dasher Technologies, Inc. ® is a premier IT solution provider that delivers expert technical resources along with trusted account executives to architect and deliver complete IT solutions and services to help our clients execute their goals, plans and objectives. Since 1999, we'v...
SYS-CON Events announced today that Ayehu will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct. 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Ayehu provides IT Process Automation & Orchestration solutions for IT and Security professionals to identify and resolve critical incidents and enable rapid containment, eradication, and recovery from cyber security breaches. Ayehu provides customers greater control over IT infrastructure throu...