Part 2: An Integrated Approach to Load Test Analysis

The Follow-up Test

In Part 1, I demonstrated how to add more depth to the analysis of a Compuware APM Web Load Test by combining the external load results with the application and infrastructure data collected by the Compuware PureStack Technology. But now that we have tested the system once, what would happen if we tested it again after we identified and "resolved" the issues we found? Would running a test using the same parameters as in the initial test show a clear performance improvement? Would the system be able to achieve the desired load of 200 virtual users with little or no performance degradation?

This article takes you through the steps you should follow in order to directly compare the results of two load tests and measure the performance improvement (or degradation) that occurred with the fixes put in place.

Step 1: Identify issues and implement changes based on initial results
During the April 14 load test session, Andreas Grabner and I found substantial performance concerns for the application under a load well in excess of what the APM Community currently sees even on its busiest days. The problem was that the load that triggered these issues was still well short of the 200 virtual user (VU) goal the application team wanted to reach.

An Aggregated Data View Showing External and Internal Performance Indicators During the April 14 Load Test

During the April 14 load test execution, a number of environment issues were identified. The critical ones the systems team addressed included:

  • Deployment of critical APM Community applications to different machines, so that the performance of one layer cannot negatively affect another layer
  • Optimization of the way APM Community pages are built in the application layer to reduce CPU usage
  • Optimization of cache settings in Confluence to reduce roundtrips to the database when loading commonly used objects
  • An increase in CPU power on the virtualized machines so that they can handle more load

Step 2: Re-Run the test (With the same parameters!)
Once these steps were complete, a second test cycle was scheduled to determine if the updated environment would be able to reach the desired 200 VU target without encountering response time degradation. The follow-up load test was executed exactly one week later, on April 21, and used the same parameters as the initial test (see previous post for load ramping details). Using the same test parameters (load ramp, test scripts, testing locations, databanks, etc.) is critical in order to allow a like-for-like comparison to occur. Any deviation in the test configuration can skew the results and potentially lead to an unintended sense of confidence (or fear of implosion) regarding the application environment.
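As a simple illustration of that discipline, the parameters can be captured in a single structure that both runs reuse unchanged. The sketch below is purely illustrative; the names and values are assumptions, not the actual Web Load Test configuration.

# Hypothetical parameter set, captured once and reused for both runs so the
# April 14 and April 21 results can be compared like-for-like.
LOAD_TEST_PARAMS = {
    "target_virtual_users": 200,            # peak VU goal for both tests
    "ramp_schedule": [                      # (minutes from start, VUs) steps
        (0, 10), (30, 50), (60, 100), (90, 167), (110, 200),
    ],
    "test_scripts": ["community_browse", "community_search"],
    "load_locations": ["us-east", "us-west", "eu-west"],
    "databank": "community_test_users.csv",
    "duration_minutes": 120,
}

def run_load_test(params):
    # Placeholder for launching a test run with a fixed parameter set.
    ...

# Re-running with the identical structure is what makes the two result sets
# directly comparable.
run_load_test(LOAD_TEST_PARAMS)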

When the April 21 round of load testing was complete and we began to analyze the results, the initial data (higher throughput, faster response times, lower CPU utilization and a reduction in database load) suggested that this load test was substantially more successful than the previous execution. This initial conclusion was based on performance charts containing the same metrics we used to analyze the April 14 test, giving a direct comparison of the critical measures and showing whether the pattern of performance had changed dramatically between the two test executions.

Step 3: Compare the Results
So, to start the comparative analysis, we took three key metrics of the April 14 and April 21 results and charted them together: External Web Load Test Average Response Time; Web Load Test Transactions per Minute; and percentage of CPU Utilization on the web server. Using just these three comparisons, it is clear that the two load tests had very different performance profiles.

Starting with the Web Load Test Average Response Times (the time required to completely download all of the content in the scripted synthetic transactions used in the load test), it is very clear that after 08:50 EDT - 40 minutes into both tests - the response times diverged and remained on different paths for the remainder of the comparative test run. From this point on, the April 21 load test averaged load times that were around 50% faster than the April 14 test (Note: the Moving Average of percentage change averages 5 minutes of response time change to produce a clearer trend line). It took nearly 20 more minutes for Average Transaction Response Time to reach 20 seconds on April 21, even with load being applied at the same volume as in the April 14 test.

Comparison of Response Times showing improvement in April 21 APM Community Load Test resulting in lower average response times
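To make that smoothing step concrete, here is a minimal sketch of how the comparison could be computed, assuming each test's per-minute average response times have been exported to CSV; the file and column names are illustrative, not the actual Web Load Test export format.

import pandas as pd

# Per-minute average response times exported from each test run
# (columns assumed: "minute_offset", "avg_response_s").
apr14 = pd.read_csv("wlt_response_2013-04-14.csv", index_col="minute_offset")
apr21 = pd.read_csv("wlt_response_2013-04-21.csv", index_col="minute_offset")

# Percent change of April 21 relative to April 14, minute by minute
# (negative values mean April 21 was faster).
pct_change = (apr21["avg_response_s"] - apr14["avg_response_s"]) \
             / apr14["avg_response_s"] * 100

# 5-minute moving average to smooth the noise into a clearer trend line,
# as in the comparison chart described above.
trend = pct_change.rolling(window=5, min_periods=1).mean()

# Behaviour from roughly 40 minutes in, where the two runs diverge.
print(trend.loc[40:])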

The Web Load Transactions per Minute (the number of WLT transactions executed in a minute at that point of the load test) showed a pattern where the April 21 test also diverged from the April 14 test at 08:50 EDT. With the faster WLT Average Response Times, the April 21 test saw the system process 40-50% more transactions per minute than the April 14 test from 08:50 EDT until the end of the test cycle.

Comparison of Transactions per Minute showing improvement in April 21 APM Community Load Test resulting in higher transactions per minute

Much of this improvement can be tracked to the third metric: CPU utilization on the web server (the percentage of CPU used by the system and applications for performing all necessary activities on the machine). Throughout the April 21 test, with more hardware and optimized page-rendering processes helping out, the web server's CPU was under far less stress, reaching 100% utilization much later than in the April 14 test.

Comparison of CPU Utilization showing improvement in April 21 APM Community Load Test resulting in lower CPU utilization up until 09:40 EDT

These three metrics are directly tied to the Number of Web Requests per Minute recorded at the Confluence application layer for the April 21 test. This metric peaked at 125-140 per minute during the April 21 test, compared to a peak of approximately 100 Web Requests per minute in the April 14 test.
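The improvement in peak application-layer throughput can be estimated directly from the figures quoted above (taking the midpoint of the April 21 range):

apr14_peak_rpm = 100.0             # approximate peak Web Requests/min, April 14
apr21_peak_rpm = (125 + 140) / 2   # midpoint of the April 21 peak range

improvement_pct = (apr21_peak_rpm - apr14_peak_rpm) / apr14_peak_rpm * 100
print(f"Peak web requests/min improved by roughly {improvement_pct:.0f}%")
# about a third more than the April 14 peak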

Despite the seeming success of the second load test on April 21, there were still issues that appeared. Building an integrated results chart for the April 21 load test shows that multiple performance events occurred once the load test reached the 100% CPU Utilization boundary (red vertical line in chart below). This appears to indicate that despite the improvements to the environment discussed above, there is still a CPU bottleneck present at higher loads.

An area of extreme contrast between the two tests was recorded in the Database Results. Database stats were clearly visible in the data from April 14 test (see the aggregated performance metric chart in Step 1), including a large spike in the number and length of queries just before the application reached the CPU bottleneck. But in order to find the same metrics in the April 21 test, you have to break out your microscope and look very closely at the bottom of the chart.

External and Internal Performance Metrics for April 21 Load Test

The reduction in database load was the direct result of the optimized cache settings enabled after the April 14 load test. With more of the data being stored in the application cache, the number of calls to the database decreased, removing this layer as a potential bottleneck at this load volume.
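Confluence's cache is configured rather than hand-written, but the effect of the optimized settings resembles memoizing lookups of commonly used objects, as in this illustrative sketch (the function names are hypothetical):

from functools import lru_cache

def load_object_from_database(object_id):
    # Placeholder for an expensive database query.
    ...

@lru_cache(maxsize=10_000)
def get_cached_object(object_id):
    # The first call per object hits the database; subsequent calls are
    # served from memory, which is why the database layer all but
    # disappeared from the April 21 charts.
    return load_object_from_database(object_id)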

Step 4: Results and Next Steps
The lack of a sudden spike in Confluence/Atlassian processing time in the April 21 test (along with the accompanying database spike) was due to the removal of an application layer process that had been scheduled to run during the load test period. This process, and its effects on the systems and user experience, was quickly recognized once Andreas reviewed his data. Once the job that caused this issue was identified, it was removed in time for the April 21 test, completely eliminating a performance bottleneck that was encountered early in the April 14 test.

Lesson learned: Don't schedule system-intensive jobs to run during peak traffic periods; find a window with the lowest traffic volume to perform these tasks so that the fewest visitors possible are affected.
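A minimal sketch of putting that lesson into practice, assuming hourly request counts are available (for example, from web server logs); the file and column names are illustrative:

import pandas as pd

traffic = pd.read_csv("hourly_requests.csv")        # columns: "hour", "requests"
by_hour = traffic.groupby("hour")["requests"].mean()

quietest_hour = by_hour.idxmin()
print(f"Schedule system-intensive jobs around {quietest_hour:02d}:00, "
      f"when average traffic is lowest ({by_hour.min():.0f} requests/hour).")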

As we noted at the start of this post, it appears on the surface that the April 21 load test was more successful than the April 14 test. Yet, despite the improved performance of the April 21 load test, the results show that there are still performance concerns that need to be addressed. These concerns center around a dramatic spike in response times between 09:40 and 09:50 EDT, occurring after the load test had been running for 90 minutes.

When the system began to show degraded performance, it could easily be tracked using the three key metrics: WLT Average Response Time, WLT Transactions per Minute, and CPU Utilization. As running transactions began to take much longer to execute, both the number of incoming web requests reaching the application layer and the number of transactions per minute executed by the load generation system decreased. The root cause can be seen in the chart below, which removes some of the data series for clarity.

At 167 VUs, the system redlines and suffers a sudden, 10-minute degradation of performance, after which it recovers as the test stabilizes at 200 VUs

The period of degradation that was detected during the load test started at 09:40 EDT and coincided with the following (a sketch for locating this co-occurrence in the exported metrics follows the list):

  • Web Load Test achieving 167 VUs
  • CPU on the web server measuring 100%
  • Web Load Test Transactions per Minute averaging 130
  • Confluence Web Requests (the application layer of the APM Community Portal) measured at 135 per minute
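
Here is a minimal sketch of locating that co-occurrence, assuming the external and server-side series have been exported and joined on a shared per-minute timestamp; the column names are illustrative, not the actual tool export format:

import pandas as pd

metrics = pd.read_csv("apr21_joined_metrics.csv", parse_dates=["timestamp"])

# First minute at which the listed conditions co-occur.
onset = metrics[(metrics["virtual_users"] >= 167)
                & (metrics["web_server_cpu_pct"] >= 100)].head(1)

if not onset.empty:
    print(onset[["timestamp", "wlt_transactions_per_min",
                 "confluence_web_requests_per_min"]])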

Interestingly, after 10 minutes, this issue cleared up completely, except for transaction response times. The response times did not return to pre-spike values, but were now averaging almost 20 seconds higher than before the spike. With the system now peaked at 200 VUs and no additional load being generated, it was interesting to see that other metrics returned immediately to their pre-spike levels - notably Transactions per Minute and Web Requests per Minute. So, with 33 more VUs than before the spike, the system again appeared to be directly affected by a CPU bottleneck, as a higher load could not increase the number of requests processed at the application layer.

Out of this sea of metrics, we determined that the April 21 load test showed a clear comparative improvement over the April 14 load test, but the second test was still unable to reach the target of 200 VUs without suffering a bottleneck that caused performance to degrade dramatically.

Analyzing the degradation
To find the cause of the CPU bottleneck that prevented the April 21 test from reaching the goal of 200 VUs with little or no performance degradation, we have to dig deeper into the server-side metrics, especially those related to the health of the application server. The dip in transactions throughout the system is aligned with the issue captured when the system hit 167 VUs. The question is: was the dip in transactions processed and the rise in transaction response times the result of this load volume, or a symptom of the actual cause of the performance degradation?

When the system degrades, the server-side data shows that heavy Garbage Collection (GC) could be a problem, as this automated process occurred at the same time. Executing such an intensive system process when the web server's CPU is already exhausted can clearly cause a very large performance degradation.

Increased GC is normal while increasing the load - but there is an unusual spike exactly when we see a dip in transaction throughput
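
One way to back up that visual observation is to quantify the relationship between GC time and transaction throughput. The sketch below assumes both series have been exported per minute; the column names are illustrative:

import pandas as pd

df = pd.read_csv("apr21_gc_vs_throughput.csv", parse_dates=["timestamp"])

# A strong negative correlation supports reading the GC spike and the
# throughput dip as the same event seen from two angles.
corr = df["gc_time_ms_per_min"].corr(df["transactions_per_min"])
print(f"GC time vs. throughput correlation: {corr:.2f}")

# Flag minutes where GC time is more than three standard deviations above
# its mean - the "unusual spike" called out in the chart above.
threshold = df["gc_time_ms_per_min"].mean() + 3 * df["gc_time_ms_per_min"].std()
print(df.loc[df["gc_time_ms_per_min"] > threshold, ["timestamp"]])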

Looking at the application-server-specific transaction response times, it is easy to spot the potential problem. The following charts show that "yet another" background job executes every hour, taking CPU cycles away from the already exhausted system.

A background job executes every hour taking up to 300s in CPU cycles at the time when virtual users experienced the performance degradation

Looking at these transactions reveals that the job is an hourly update job that synchronizes the cached user objects with the user directory database. This takes a considerable amount of time because we have 65k+ users on the APM Community system. This update job causes a lot of objects to be created and destroyed - hence the increased memory and GC activity.

The synchronization job is the root cause of the degradation by consuming a lot of CPU as well as allocating memory which causes high GC activity
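
How to make the job less disruptive is the follow-up question. As a purely hypothetical sketch (not Confluence's actual synchronization code), processing the directory in bounded batches instead of materializing all 65k+ users at once would keep the number of live temporary objects, and therefore the GC pressure, small:

BATCH_SIZE = 500

def fetch_directory_users(offset, limit):
    # Placeholder for a paged query against the user directory database.
    return []

def refresh_cached_user(user):
    # Placeholder for updating one cached user object in place.
    ...

def synchronize_users():
    offset = 0
    while True:
        batch = fetch_directory_users(offset, BATCH_SIZE)
        if not batch:
            break
        for user in batch:
            refresh_cached_user(user)  # update in place instead of recreating
        offset += BATCH_SIZE           # the previous batch becomes collectable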

As with the April 14 load test, the April 21 load test exposed issues with the system that prevented the achievement of the 200 VU goal. But now, we have a clear culprit for the prevention of this goal, so efforts can focus on reducing or eliminating the effect that this update process has on the system when it is under peak load.

Conclusion
In both tests, regardless of how you measure the "success" of a load test, something was learned about the system by aggregating metrics from inside and outside the infrastructure being tested. We now know that the optimizations performed after the April 14 load test allowed the system to process 40-50% more transactions per minute up to 167 VUs, at which point a scheduled system process caused a severe application degradation.

This data could only be turned into actionable information because we had a process in place that allowed results captured from inside the firewall to be easily aligned with the external results from the load test system. By doing this, the customer, albeit in a very controlled form, becomes a factor in the analysis of system performance.

By creating a full performance perspective, PureStack delivers more than just deeper technical metrics on a system under load. PureStack places the experience of the visitor at the same level of importance as CPU, database, and web requests processed by the application layer when the results are analyzed. The importance of the user experience then dictates how infrastructure issues are prioritized and resolved, as the effect these issues have on end users provides real-world feedback into the true cost of performance issues that occur to your application during peak periods.

Using the data from this load test, it was realized that additional changes to the system were needed, especially in the area of page rendering, in order to further reduce CPU load and allow the system to reach and maintain a peak load of 200 virtual users. With the upgrade to the Confluence application software - deployed in early July 2013 - it was expected that the desired goal would be reached. But in case this is not sufficient, an additional load test on the new Confluence system is expected to occur in July 2013, once the system has completely stabilized. Using the same transaction paths as in the April 14 and 21 load tests, the system will be verified to confirm that the upgrade is delivering the expected performance.

More Stories By Stephen Pierzchala

With more than a decade in the web performance industry, Stephen Pierzchala has advised many organizations, from Fortune 500 companies to startups, on how to improve the performance of their web applications by helping them develop and evolve the speed, conversion, and customer experience metrics necessary to effectively measure, manage, and evolve online web and mobile applications that improve performance and increase revenue. Working on projects for top companies in the online retail, financial services, content delivery, ad-delivery, and enterprise software industries, he has developed new approaches to web performance data analysis. Stephen has led web performance methodology, CDN assessment, SaaS load testing, technical troubleshooting, and performance assessments, demonstrating the value of web performance. He is noted for his technical analyses and knowledge of web performance from the outside-in.
