|By Michael Kopp||
|December 12, 2012 07:30 AM EST||
Anyone who ever monitored or analyzed an application uses or has used averages. They are simple to understand and calculate. We tend to ignore just how wrong the picture is that averages paint of the world. To emphasis the point let me give you a real-world example outside of the performance space that I read recently in a newspaper.
The article was explaining that the average salary in a certain region in Europe was 1900 Euro's (to be clear this would be quite good in that region!). However when looking closer they found out that the majority, namely 9 out of 10 people, only earned around 1000 Euros and one would earn 10.000 (I over simplified this of course, but you get the idea). If you do the math you will see that the average of this is indeed 1900, but we can all agree that this does not represent the "average" salary as we would use the word in day to day live. So now let's apply this thinking to application performance.
The Average Response Time
The average response time is by far the most commonly used metric in application performance management. We assume that this represents a "normal" transaction, however this would only be true if the response time is always the same (all transaction run at equal speed) or the response time distribution is roughly bell curved.
A Bell curve represents the "normal" distribution of response times in which the average and the median are the same. It rarely ever occurs in real applications
In a Bell Curve the average (mean) and median are the same. In other words observed performance would represent the majority (half or more than half) of the transactions.
In reality most applications have few very heavy outliers; a statistician would say that the curve has a long tail. A long tail does not imply many slow transactions, but few that are magnitudes slower than the norm.
This is a typical Response Time Distribution with few but heavy outliers - it has a long tail. The average here is dragged to the right by the long tail.
We recognize that the average no longer represents the bulk of the transactions but can be a lot higher than the median.
You can now argue that this is not a problem as long as the average doesn't look better than the median. I would disagree, but let's look at another real-world scenario experienced by many of our customers:
This is another typical Response Time Distribution. Here we have quite a few very fast transactions that drag the average to the left of the actual median
In this case a considerable percentage of transactions are very, very fast (10-20 percent), while the bulk of transactions are several times slower. The median would still tell us the true story, but the average all of a sudden looks a lot faster than most of our transactions actually are. This is very typical in search engines or when caches are involved - some transactions are very fast, but the bulk are normal. Another reason for this scenario are failed transactions, more specifically transactions that failed fast. Many real-world applications have a failure rate of 1-10 percent (due to user errors or validation errors). These failed transactions are often magnitudes faster than the real ones and consequently distorted an average.
Of course performance analysts are not stupid and regularly try to compensate with higher frequency charts (compensating by looking at smaller aggregates visually) and by taking in minimum and maximum observed response times. However we can often only do this if we know the application very well, those unfamiliar with the application might easily misinterpret the charts. Because of the depth and type of knowledge required for this, it's difficult to communicate your analysis to other people - think how many arguments between IT teams have been caused by this. And that's before we even begin to think about communicating with business stakeholders!
A better metric by far are percentiles, because they allow us to understand the distribution. But before we look at percentiles, let's take a look a key feature in every production monitoring solution: Automatic Baselining and Alerting.
Automatic Baselining and Alerting
In real-world environments, performance gets attention when it is poor and has a negative impact on the business and users. But how can we identify performance issues quickly to prevent negative effects? We cannot alert on every slow transaction, since there are always some. In addition, most operations teams have to maintain a large number of applications and are not familiar with all of them, so manually setting thresholds can be inaccurate, quite painful and time-consuming.
The industry has come up with a solution called Automatic Baselining. Baselining calculates out the "normal" performance and only alerts us when an application slows down or produces more errors than usual. Most approaches rely on averages and standard deviations.
Without going into statistical details, this approach again assumes that the response times are distributed over a bell curve:
The Standard Deviation represents 33% of all transactions with the mean as the middle. 2xStandard Deviation represents 66% and thus the majority, everything outside could be considered an outlier. However most real world scenarios are not bell curved...
Typically, transactions that are outside two times standard deviation are treated as slow and captured for analysis. An alert is raised if the average moves significantly. In a bell curve this would account for the slowest 16.5 percent (and you can of course adjust that); however; if the response time distribution does not represent a bell curve, it becomes inaccurate. We either end up with a lot of false positives (transactions that are a lot slower than the average but when looking at the curve lie within the norm) or we miss a lot of problems (false negatives). In addition if the curve is not a bell curve, then the average can differ a lot from the median; applying a standard deviation to such an average can lead to quite a different result than you would expect. To work around this problem these algorithms have many tunable variables and a lot of "hacks" for specific use cases.
Why I Love Percentiles
A percentile tells me which part of the curve I am looking at and how many transactions are represented by that metric. To visualize this look at the following chart:
This chart shows the 50th and 90th percentile along with the average of the same transaction. It shows that the average is influenced far mor heavily by the 90th, thus by outliers and not by the bulk of the transactions
The green line represents the average. As you can see it is very volatile. The other two lines represent the 50th and 90th percentile. As we can see the 50th percentile (or median) is rather stable but has a couple of jumps. These jumps represent real performance degradation for the majority (50%) of the transactions. The 90th percentile (this is the start of the "tail") is a lot more volatile, which means that the outliers slowness depends on data or user behavior. What's important here is that the average is heavily influenced (dragged) by the 90th percentile, the tail, rather than the bulk of the transactions.
If the 50th percentile (median) of a response time is 500ms that means that 50% of my transactions are either as fast or faster than 500ms. If the 90th percentile of the same transaction is at 1000ms it means that 90% are as fast or faster and only 10% are slower. The average in this case could either be lower than 500ms (on a heavy front curve), a lot higher (long tail) or somewhere in between. A percentile gives me a much better sense of my real world performance, because it shows me a slice of my response time curve.
For exactly that reason percentiles are perfect for automatic baselining. If the 50th percentile moves from 500ms to 600ms I know that 50% of my transactions suffered a 20% performance degradation. You need to react to that.
In many cases we see that the 75th or 90th percentile does not change at all in such a scenario. This means the slow transactions didn't get any slower, only the normal ones did. Depending on how long your tail is the average might not have moved at all in such a scenario.!
In other cases we see the 98th percentile degrading from 1s to 1.5 seconds while the 95th is stable at 900ms. This means that your application as a whole is stable, but a few outliers got worse, nothing to worry about immediately. Percentile-based alerts do not suffer from false positives, are a lot less volatile and don't miss any important performance degradations! Consequently a baselining approach that uses percentiles does not require a lot of tuning variables to work effectively.
The screenshot below shows the Median (50th Percentile) for a particular transaction jumping from about 50ms to about 500ms and triggering an alert as it is significantly above the calculated baseline (green line). The chart labeled "Slow Response Time" on the other hand shows the 90th percentile for the same transaction. These "outliers" also show an increase in response time but not significant enough to trigger an alert.
Here we see an automatic baselining dashboard with a violation at the 50th percentile. The violation is quite clear, at the same time the 90th percentile (right upper chart) does not violate. Because the outliers are so much slower than the bulk of the transaction an average would have been influenced by them and would not have have reacted quite as dramatically as the 50th percentile. We might have missed this clear violation!
How Can We Use Percentiles for Tuning?
Percentiles are also great for tuning, and giving your optimizations a particular goal. Let's say that something within my application is too slow in general and I need to make it faster. In this case I want to focus on bringing down the 90th percentile. This would ensure sure that the overall response time of the application goes down. In other cases I have unacceptably long outliers I want to focus on bringing down response time for transactions beyond the 98th or 99th percentile (only outliers). We see a lot of applications that have perfectly acceptable performance for the 90th percentile, with the 98th percentile being magnitudes worse.
In throughput oriented applications on the other hand I would want to make the majority of my transactions very fast, while accepting that an optimization makes a few outliers slower. I might therefore make sure that the 75th percentile goes down while trying to keep the 90th percentile stable or not getting a lot worse.
I could not make the same kind of observations with averages, minimum and maximum, but with percentiles they are very easy indeed.
Averages are ineffective because they are too simplistic and one-dimensional. Percentiles are a really great and easy way of understanding the real performance characteristics of your application. They also provide a great basis for automatic baselining, behavioral learning and optimizing your application with a proper focus. In short, percentiles are great!
|rtalexander 11/21/12 12:58:00 AM EST|
Hey, could you post a reference or two that covers the theory and/or practicalities of the approach you describe?
From telemedicine to smart cars, digital homes and industrial monitoring, the explosive growth of IoT has created exciting new business opportunities for real time calls and messaging. In his session at @ThingsExpo, Ivelin Ivanov, CEO and Co-Founder of Telestax, shared some of the new revenue sources that IoT created for Restcomm – the open source telephony platform from Telestax. Ivelin Ivanov is a technology entrepreneur who founded Mobicents, an Open Source VoIP Platform, to help create, deploy, and manage applications integrating voice, video and data. He is the co-founder of TeleStax, a...
Apr. 21, 2015 05:00 AM EDT Reads: 4,746
SYS-CON Events announced today that Akana, formerly SOA Software, has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. Akana’s comprehensive suite of API Management, API Security, Integrated SOA Governance, and Cloud Integration solutions helps businesses accelerate digital transformation by securely extending their reach across multiple channels – mobile, cloud and Internet of Things. Akana enables enterprises to share data as APIs, connect and integrate applications, drive part...
Apr. 21, 2015 05:00 AM EDT Reads: 1,883
VoxImplant has announced full WebRTC support in the newest versions of its Android SDK and iOS SDK. The updated SDKs, which enable audio and video calls on mobile devices, are now compatible with the WebRTC standard to allow any mobile app to communicate with WebRTC-enabled browsers, including Google Chrome, Mozilla Firefox, Opera, and, when available, Microsoft Spartan. The WebRTC-updated SDKs represent VoxImplant's continued leadership in simplifying the development of real-time communications (RTC) services for app developers. VoxImplant (built by Zingaya, the real-time communication servi...
Apr. 21, 2015 04:45 AM EDT Reads: 2,005
SYS-CON Events announced today that CommVault has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. A singular vision – a belief in a better way to address current and future data management needs – guides CommVault in the development of Singular Information Management® solutions for high-performance data protection, universal availability and sim...
Apr. 21, 2015 04:45 AM EDT Reads: 1,805
SYS-CON Events announced today that SafeLogic has been named “Bag Sponsor” of SYS-CON's 16th International Cloud Expo® New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. SafeLogic provides security products for applications in mobile and server/appliance environments. SafeLogic’s flagship product CryptoComply is a FIPS 140-2 validated cryptographic engine designed to secure data on servers, workstations, appliances, mobile devices, and in the Cloud.
Apr. 21, 2015 04:30 AM EDT Reads: 1,920
The IoT Bootcamp is coming to Cloud Expo | @ThingsExpo on June 9-10 at the Javits Center in New York. Instructor. Registration is now available at http://iotbootcamp.sys-con.com/ Instructor Janakiram MSV previously taught the famously successful Multi-Cloud Bootcamp at Cloud Expo | @ThingsExpo in November in Santa Clara. Now he is expanding the focus to Janakiram is the founder and CTO of Get Cloud Ready Consulting, a niche Cloud Migration and Cloud Operations firm that recently got acquired by Aditi Technologies. He is a Microsoft Regional Director for Hyderabad, India, and one of the f...
Apr. 21, 2015 04:00 AM EDT Reads: 1,327
SYS-CON Events announced today that StorPool Storage will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. StorPool is distributed storage software that allows service providers, enterprises and other cloud builders to run data storage on standard x86 servers, instead of using expensive and inefficient storage arrays (SAN).
Apr. 21, 2015 04:00 AM EDT Reads: 2,061
SYS-CON Events announced today that kintone has been named “Bronze Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY, and the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. kintone promotes cloud-based workgroup productivity, transparency and profitability with a seamless collaboration space, build your own business application (BYOA) platform, and workflow automation system.
Apr. 21, 2015 03:00 AM EDT Reads: 1,569
SYS-CON Events announced today that Site24x7, the cloud infrastructure monitoring service, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Site24x7 is a cloud infrastructure monitoring service that helps monitor the uptime and performance of websites, online applications, servers, mobile websites and custom APIs. The monitoring is done from 50+ locations across the world and from various wireless carriers, thus providing a global perspective of the end-user experience. Site24x7 supports monitoring H...
Apr. 21, 2015 02:30 AM EDT Reads: 1,922
SYS-CON Events announced today that Intelligent Systems Services will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Established in 1994, Intelligent Systems Services Inc. is located near Washington, DC, with representatives and partners nationwide. ISS’s well-established track record is based on the continuous pursuit of excellence in designing, implementing and supporting nationwide clients’ mission-critical systems. ISS has completed many successful projects in Healthcare, Commercial, Manufacturing, ...
Apr. 21, 2015 02:15 AM EDT Reads: 2,507
WebRTC is an up-and-coming standard that enables real-time voice and video to be directly embedded into browsers making the browser a primary user interface for communications and collaboration. WebRTC runs in a number of browsers today and is currently supported in over a billion installed browsers globally, across a range of platform OS and devices. Today, organizations that choose to deploy WebRTC applications and use a host machine that supports audio through USB or Bluetooth can use Plantronics products to connect and transit or receive the audio associated with the WebRTC session.
Apr. 21, 2015 02:00 AM EDT Reads: 1,722
The best mobile applications are augmented by dedicated servers, the Internet and Cloud services. Mobile developers should focus on one thing: writing the next socially disruptive viral app. Thanks to the cloud, they can focus on the overall solution, not the underlying plumbing. From iOS to Android and Windows, developers can leverage cloud services to create a common cross-platform backend to persist user settings, app data, broadcast notifications, run jobs, etc. This session provides a high level technical overview of many cloud services available to mobile app developers, includi...
Apr. 21, 2015 02:00 AM EDT Reads: 690
SYS-CON Events announced today that B2Cloud, a provider of enterprise resource planning software, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. B2cloud develops the software you need. They have the ideal tools to help you work with your clients. B2Cloud’s main solutions include AGIS – ERP, CLOHC, AGIS – Invoice, and IZUM
Apr. 21, 2015 02:00 AM EDT Reads: 3,324
The WebRTC Summit 2015 New York, to be held June 9-11, 2015, at the Javits Center in New York, NY, announces that its Call for Papers is open. Topics include all aspects of improving IT delivery by eliminating waste through automated business models leveraging cloud technologies. WebRTC Summit is co-located with 16th International Cloud Expo, @ThingsExpo, Big Data Expo, and DevOps Summit.
Apr. 21, 2015 01:45 AM EDT Reads: 2,331
SYS-CON Events announced today that Tufin, the market-leading provider of Security Policy Orchestration Solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. As the market leader of Security Policy Orchestration, Tufin automates and accelerates network configuration changes while maintaining security and compliance. Tufin's award-winning Orchestration Suite™ gives IT organizations the power and agility to enforce security policy across complex, multi-vendor enterprise networks. With more than 1...
Apr. 21, 2015 01:45 AM EDT Reads: 3,439
As enterprises move to all-IP networks and cloud-based applications, communications service providers (CSPs) – facing increased competition from over-the-top providers delivering content via the Internet and independently of CSPs – must be able to offer seamless cloud-based communication and collaboration solutions that can scale for small, midsize, and large enterprises, as well as public sector organizations, in order to keep and grow market share. The latest version of Oracle Communications Unified Communications Suite gives CSPs the capability to do just that. In addition, its integration ...
Apr. 21, 2015 01:30 AM EDT Reads: 4,068
SYS-CON Events announced today that Cloudian, Inc., the leading provider of hybrid cloud storage solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Cloudian, Inc., is a Foster City, California - based software company specializing in cloud storage software. The main product is Cloudian, an Amazon S3-compliant cloud object storage platform, the bedrock of cloud computing systems, that enables cloud service providers and enterprises to build reliable, affordable and scalable cloud storage solu...
Apr. 21, 2015 01:00 AM EDT Reads: 2,523
SYS-CON Events announced today that Gridstore™, the leader in hyper-converged infrastructure purpose-built to optimize Microsoft workloads, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Gridstore™ is the leader in hyper-converged infrastructure purpose-built for Microsoft workloads and designed to accelerate applications in virtualized environments. Gridstore’s hyper-converged infrastructure is the industry’s first all flash version of HyperConverged Appliances that include both compute and storag...
Apr. 21, 2015 12:45 AM EDT Reads: 4,443
SYS-CON Events announced today that IDenticard will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. IDenticard™ is the security division of Brady Corp (NYSE: BRC), a $1.5 billion manufacturer of identification products. We have small-company values with the strength and stability of a major corporation. IDenticard offers local sales, support and service to our customers across the United States and Canada. Our partner network encompasses some 300 of the world's leading systems integrators and security s...
Apr. 21, 2015 12:00 AM EDT Reads: 5,018
BroadSoft on Tuesday announced that it is a recipient of the 2014 Frost & Sullivan Market Leadership Award in the Hosted/Cloud Internet Protocol (IP) Telephony market for Latin America. According to Frost & Sullivan market research, the Latin America (LATAM) hosted/cloud Internet Protocol (IP) telephony market, including integrated unified communications and collaboration (UC&C) applications, is currently experiencing a rapid growth trajectory and is expected to exhibit a tenfold rise in annual revenues in the 2013-2020 period. With more than 600 cloud deployments internationally, BroadSoft w...
Apr. 21, 2015 12:00 AM EDT Reads: 2,105