Application Performance Monitoring in Production

A step-by-step guide – Part 1

Setting up Application Performance Monitoring is a big task, but like everything else it can be broken down into simple steps. You have to know what you want to achieve and subsequently where to start. So let’s start at the beginning and take a top-down approach.

Know What You Want
The first thing to do is to be clear about what we want when monitoring the application. Let’s face it: we do not really want to ensure that CPU utilization stays below 90 percent or that network latency stays under one millisecond. We are also not really interested in garbage collection activity or whether the database connection pool is utilized. We monitor all of these things only in order to reach our main goal. And the main goal for this article series is to ensure the health and stability of our application and business services. To ensure that, we need to leverage all of the mentioned metrics.

What does health and stability of the application mean though? A healthy and stable application performs its function without errors and delivers accurate results within a predefined satisfactory time frame. In technical terms this means low response time and/or high throughput and a low to non-existent error rate. If we monitor and ensure this, then the health and stability of the application is likewise guaranteed.

Define Your KPIs
First we need to define what satisfactory performance means. In the case of an end-user-facing application, things like first impression and page load time are good KPIs. The good thing is that defining satisfactory is relatively simple here, as the user will tolerate up to 3-4 seconds but will get frustrated after that. Other interactions, like a credit card payment or a search, have very different thresholds though, and you need to define them. In addition to response time you also need to define how many concurrent users you want, or need, to be able to serve without impacting the overall response time. These two KPIs, response time and concurrent users, will get you very far if you apply them on a granular enough level.

If we are talking about a transaction oriented application your main KPI will be throughput. The desired throughput will depend on the transaction type. Most likely you will have a time window in which you have to process a certain known number of transactions, which dictates what satisfactory performance means to you.
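The arithmetic behind such a throughput KPI can be sketched in a few lines. This is only an illustration: the class name `ThroughputKpi` and the numbers in the usage note are assumptions and do not come from a real system.

```java
import java.time.Duration;

/** Derives a throughput KPI from a known batch size and its processing window. */
final class ThroughputKpi {

    /** Transactions per second needed to finish the batch inside the window. */
    static double requiredPerSecond(long transactions, Duration window) {
        if (window.isZero() || window.isNegative()) {
            throw new IllegalArgumentException("window must be positive");
        }
        return transactions / (window.toMillis() / 1000.0);
    }

    /** True when the measured throughput is at least the required throughput. */
    static boolean isMet(double measuredPerSecond, long transactions, Duration window) {
        return measuredPerSecond >= requiredPerSecond(transactions, window);
    }
}
```

For a hypothetical batch of 1,800,000 transactions that has to finish within two hours, `requiredPerSecond` yields 250 transactions per second, which would then be the threshold for satisfactory performance.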

Resource and hardware usage can be considered secondary KPIs. As long as the primary KPI is not met, we will not look too closely at the secondary ones. On the other hand, as soon as the primary KPI is met, optimizations must always be aimed at improving the secondary KPIs.

If we take a strict top-down approach and measure end-to-end, we will not need more detailed KPIs for response time or throughput. We do of course need to measure in more detail than that in order to ensure performance.

Know What, Where and How to Measure
In addition to specifying a KPI for e.g. the response time of the search feature we also need to define where we measure it.

The different places where we can measure response time

This picture shows several different places where we can measure the response time of our application. In order to have objective and comparable measurements we need to define where we measure it. This needs to be communicated to all involved parties. This way you ensure that everybody talks about the same thing. In general, the closer you come to the end user, the closer the measurement gets to the real world, and also the harder it is to measure.

We also need to define how we measure. If we measure the average, we need to define how it is calculated. Averages themselves are alright if you talk about throughput, but very inaccurate for response time. The average tells you nearly nothing about the actual user experience, because it ignores the volatility. Even if you are only interested in throughput, volatility is interesting: it is harder to plan capacity for a highly volatile application than for one that is stable. Personally I prefer percentiles over averages, as they give us a good picture of the distribution and thus the volatility.
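To make the difference between an average and a percentile concrete, here is a minimal sketch. It assumes the nearest-rank definition of a percentile, which is only one of several common definitions, and the class name and sample values are made up for illustration.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over a sample of response times in milliseconds. */
final class ResponseTimePercentiles {

    /** Smallest sample value such that at least p percent of all samples are <= it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** Arithmetic mean, shown only for comparison with the percentiles. */
    static double average(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical sample: eight fast requests and two very slow ones.
        long[] samples = {900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 9000, 20000};
        System.out.println("average = " + average(samples));
        System.out.println("p50 = " + percentile(samples, 50));
        System.out.println("p90 = " + percentile(samples, 90));
    }
}
```

For this made-up sample the average is 3900 ms, although half of the requests finish within 1300 ms and the 90th percentile is 9000 ms. The average describes almost none of the actual requests, while the percentiles show both the typical case and the volatility.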

50th, 75th, 90th and 95th percentile of End User Response time of the Page Load

In the picture we see that the page load time of our sample has a very high volatility. While 50 percent of all page requests are loaded in 3 seconds, the slowest 10 percent take between 5 and 20 seconds! That not only bodes ill for our end user experience and performance goals, but also for our capacity management (we’d need to over-provision a lot to compensate). High volatility in itself indicates instability and is not desirable. It can also mean that we are not measuring the response time at a granular enough level. It might not be enough to measure the response time of, e.g., the payment transactions in general. The CreditCard and DebitCard payment transactions might have very different characteristics, and we should measure them separately. Without that granularity, measuring response time becomes meaningless, because we will not see performance problems and monitoring a trend will be impossible.

This brings us to the next point: what do we measure? Most monitoring solutions allow monitoring on a URL level, a servlet level (JMX/App Servers) or a network level. In many cases the URL level is good enough, as we can use pattern matching on specific URI parameters.
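URI-based categorization could look roughly like the following sketch. The class `UriMeasureMatcher`, the measure names and the URI layout in the usage note are all assumptions made for illustration. A monitoring product would normally offer this as configuration rather than code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Maps request URIs to named measures using regular expressions. */
final class UriMeasureMatcher {

    // Insertion order matters: the first matching rule wins.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    UriMeasureMatcher addRule(String measureName, String regex) {
        rules.put(measureName, Pattern.compile(regex));
        return this;
    }

    /** Returns the measure name for the URI, or "other" when no rule matches. */
    String measureFor(String uri) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(uri).find()) {
                return rule.getKey();
            }
        }
        return "other";
    }
}
```

With a hypothetical rule such as `addRule("shop.payment.creditcard", "^/shop/payment\\?.*\\btype=CreditCard\\b")`, a request to `/shop/payment?order=7&type=CreditCard` is counted under its own measure instead of being lumped together with all other payment requests.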

Create measures by matching the URI of our Application and Transaction type

For Ajax, WebService Transactions or SOA applications in general this will not be enough. WebService frameworks often provide a single URI entry point per application or service and distinguish between different business transactions in the SOAP message. Transaction oriented applications have different transaction types which will have very different performance characteristics, yet the entry point to the application will be the same nearly every time (e.g. JMS). The transaction type will only be available within the request and within the executed code. In our Credit/Debit card example we would most likely see this only as part of the SOAP message. So what we need to do is to identify the transaction within our application. We can do this by modifying the code and providing the measures ourselves (e.g. via JMX). If we do not want to modify our code, we could also use aspects to inject the measurement, or use one of the many monitoring solutions that support this kind of categorization via business transactions.

We want to measure response time of requests that call a method with a given parameter

In our case we would measure the response time of every transaction and label it as a DebitCard payment transaction when the shown method is executed and the argument of the first parameter is “DebitCard”. This way we can measure the different types of transactions even if they cannot be distinguished via the URI.
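If we chose to modify the code ourselves, the idea could be sketched as follows. `BusinessTransactionTimer` and `PaymentService` are hypothetical classes invented for this example, and the payment logic is only a placeholder. An aspect or a monitoring agent would achieve the same labeling without touching the code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Records count and total duration per business transaction name. */
final class BusinessTransactionTimer {

    static final class Stats {
        final LongAdder count = new LongAdder();
        final LongAdder totalNanos = new LongAdder();
    }

    private final Map<String, Stats> statsByName = new ConcurrentHashMap<>();

    /** Runs the work and records its duration under the given transaction name. */
    <T> T time(String transactionName, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            Stats stats = statsByName.computeIfAbsent(transactionName, n -> new Stats());
            stats.count.increment();
            stats.totalNanos.add(System.nanoTime() - start);
        }
    }

    long countFor(String transactionName) {
        Stats stats = statsByName.get(transactionName);
        return stats == null ? 0 : stats.count.sum();
    }
}

/** Hypothetical payment service; the first parameter carries the payment type. */
final class PaymentService {
    private final BusinessTransactionTimer timer;

    PaymentService(BusinessTransactionTimer timer) {
        this.timer = timer;
    }

    boolean pay(String paymentType, long amountCents) {
        // The argument value, not the URI, decides the business transaction name.
        return timer.time("payment." + paymentType, () -> doPay(amountCents));
    }

    private boolean doPay(long amountCents) {
        return amountCents > 0; // placeholder for the real payment logic
    }
}
```

A call to `pay("DebitCard", ...)` is recorded under `payment.DebitCard` and a call with `"CreditCard"` under `payment.CreditCard`, even though both enter the application through the same entry point.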

Think About Errors
Apart from performance we also need to take errors into account. Very often we see applications where most transactions respond within 1.5 seconds and sometimes a lot faster, e.g. 0.2 seconds. More often than not these very fast transactions represent errors. The result is that the more errors you have, the better your average response time will get, which is of course misleading.

Show error rate, warning rate and response time of two business transactions

We need to count errors on the business transaction level as well. If you don’t want to have your response time skewed by those errors, you should exclude erroneous transactions from your response time measurement. The error rate of your transaction would be another KPI on which you can put a static threshold. An increased error rate is often the first sign of an impending performance problem, so you should watch it carefully.
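The effect of fast errors on the average is easy to show with a small sketch. The class `TransactionSummary` and the numbers in the usage note are assumptions chosen to mirror the 1.5 second and 0.2 second figures above.

```java
import java.util.List;

/** Error rate and average response time for one business transaction. */
final class TransactionSummary {

    /** One measured transaction: its duration and whether it ended in an error. */
    static final class Sample {
        final long millis;
        final boolean failed;

        Sample(long millis, boolean failed) {
            this.millis = millis;
            this.failed = failed;
        }
    }

    /** Share of failed transactions, between 0.0 and 1.0. */
    static double errorRate(List<Sample> samples) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long failed = samples.stream().filter(s -> s.failed).count();
        return (double) failed / samples.size();
    }

    /** Average duration; failed transactions are skipped when excludeFailed is true. */
    static double averageMillis(List<Sample> samples, boolean excludeFailed) {
        return samples.stream()
                .filter(s -> !(excludeFailed && s.failed))
                .mapToLong(s -> s.millis)
                .average()
                .orElse(Double.NaN);
    }
}
```

With four successful transactions of 1500 ms each and one failed transaction of 200 ms, the average over all five is 1240 ms, which looks better than the 1500 ms that successful users actually experience. The error rate of 0.2 is reported separately and can get its own threshold.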

I will cover how to monitor errors in more detail in one of my next posts.

What Are Problems?
It sounds like a silly question but I decided to ask it anyway, because in order to detect problems, we first need to understand them.

ITIL defines a problem as a reoccurring incident or an incident with high impact. In our case this means that a single transaction that exceeds our response time goal is not considered a problem. If you are monitoring a big system you will not have the time or the means to analyze every single violation anyway. But it is a problem if the response time goal is exceeded by 20% of your end user requests. This is one key reason why I prefer percentiles over averages. I know I have a problem if the 80th percentile exceeds the response time goal.

The same can be said for errors and exceptions. A single exception or error might be interesting to the developer. We should therefore save the information so that it can be fixed in a later release. But as far as operations is concerned, it will be ignored if it only happens once or twice. On the other hand if the same error happens again and again we need to treat it as a problem as it clearly violates our goal of ensuring a healthy application.

Alerting in a production environment must be set up around this idea. If we were to produce an alert for every single incident, we would have a so-called alarm storm and would either go mad or ignore the alerts entirely. On the other hand, if we wait until the average is higher than the response time goal, customers will be calling our support line before we are aware of the problem.
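Such an alerting rule can be sketched as a check over a window of recent measurements. The class `ResponseTimeAlertRule`, the 4000 ms goal and the window contents are illustrative assumptions. Saying that more than 20 percent of the requests exceed the goal is the same as saying that the 80th percentile (nearest rank) exceeds it.

```java
/** Flags a problem only when too many requests in a window miss the response time goal. */
final class ResponseTimeAlertRule {
    private final long goalMillis;
    private final int toleratedViolationPercent;

    ResponseTimeAlertRule(long goalMillis, int toleratedViolationPercent) {
        this.goalMillis = goalMillis;
        this.toleratedViolationPercent = toleratedViolationPercent;
    }

    /** True when more than the tolerated share of the window exceeds the goal. */
    boolean isProblem(long[] windowMillis) {
        long violations = 0;
        for (long millis : windowMillis) {
            if (millis > goalMillis) {
                violations++;
            }
        }
        // Integer arithmetic avoids rounding: violations / size > percent / 100.
        return violations * 100 > (long) windowMillis.length * toleratedViolationPercent;
    }
}
```

With a goal of 4000 ms and a tolerance of 20 percent, a window of ten requests with two slow ones does not raise an alert, while a window with three slow ones does. A single slow request never triggers anything, which is what keeps the alarm storm away.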

Know Your System and Application
The goal of monitoring is to ensure proper performance. Knowing there is a problem is not enough; we need to isolate the root cause quickly. We can only do that if we know our application and which other resources or services it uses. It is best to have a system diagram or flow chart that describes our application. You will most likely want to have at least two or three different detail levels of this.

  1. System Topology
    This should include all your applications, services, resources and the communication patterns on a high level. It gives us an idea of what exists and which other applications might influence ours.
  2. Application Topology
    This would concentrate on the topology of the application itself. It is a subset of the system topology and would only include communication flows as seen from that application's point of view. It would end where the application calls third party applications.
  3. Transaction Response Flow
    Here we would see the individual business transaction type. This is the level that we use for response time measurement.

Maintaining this can be tricky, but many monitoring tools provide this automatically these days. Once we know which other applications and services our transaction is using we can break down the response time into its contributors. We do this by measuring the request on the calling side, inside our application and on the receiving end.

Show response time distribution throughout the system of a single transaction type

This way we get a definite picture of where response time is spent. In addition, we will also see if we lose time on the network in the form of latency.
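The breakdown itself is simple subtraction once the three measurements exist. The sketch below assumes a single synchronous remote call per transaction. The class name and the numbers in the usage note are made up for illustration. Because each side measures a duration with its own clock, the clocks of the two hosts do not need to be synchronized.

```java
/** Splits one transaction's response time into own, network and remote time. */
final class ResponseTimeBreakdown {
    final long ownMillis;     // spent in our own application code
    final long networkMillis; // lost between the calling and the receiving side
    final long remoteMillis;  // spent inside the called service

    private ResponseTimeBreakdown(long ownMillis, long networkMillis, long remoteMillis) {
        this.ownMillis = ownMillis;
        this.networkMillis = networkMillis;
        this.remoteMillis = remoteMillis;
    }

    /**
     * @param totalMillis        the whole transaction, measured inside our application
     * @param callerSideMillis   the remote call, measured on the calling side
     * @param receiverSideMillis the same call, measured on the receiving end
     */
    static ResponseTimeBreakdown of(long totalMillis, long callerSideMillis, long receiverSideMillis) {
        if (receiverSideMillis > callerSideMillis || callerSideMillis > totalMillis) {
            throw new IllegalArgumentException("expected receiver <= caller <= total");
        }
        return new ResponseTimeBreakdown(
                totalMillis - callerSideMillis,
                callerSideMillis - receiverSideMillis,
                receiverSideMillis);
    }
}
```

For a hypothetical transaction that takes 1200 ms in total, with a remote call that takes 700 ms as seen by the caller and 550 ms as seen by the receiver, this gives 500 ms in our own code, 150 ms on the network and 550 ms in the called service.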

Next Steps
At this point we can monitor health, stability and performance of our application, and we can isolate the tier that causes a problem when we have one. If we do this for all of our applications, we will also get a good picture of how the applications impact each other. The next steps are to monitor each application tier in more detail, including used resources and system metrics. In the coming weeks I will explain how to monitor each of these tiers with the specific goal of allowing production level root cause analysis. At every level we will focus on monitoring the tier from an application and transaction point of view, as this is the only way we can accurately measure performance impact on the end user.

Finally I will also cover System Monitoring. Our goal is however not to monitor and describe the system itself, but measure how it affects the application. In terms of Application Performance Monitoring, system monitoring is an integral part and not a separate discipline.

More Stories By Michael Kopp

Michael Kopp has over 12 years of experience as an architect and developer in the Enterprise Java space. Before coming to CompuwareAPM dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space. In 2009 he joined dynaTrace as a technology strategist in the center of excellence. He specializes in application performance management in large scale production environments with special focus on virtualized and cloud environments. His current focus is how to effectively leverage BigData Solutions and how these technologies impact and change the application landscape.
