Part 2: An Integrated Approach to Load Test Analysis

The Follow-up Test

In Part 1, I demonstrated how to add more depth to the analysis of a Compuware APM Web Load Test by combining the external load results with the application and infrastructure data collected by the Compuware PureStack Technology. But now that we have tested the system once, what would happen if we tested it again after identifying and "resolving" the issues we found? Would running a test with the same parameters as the initial test show a clear performance improvement? Would the system be able to achieve the desired load of 200 virtual users with little or no performance degradation?

This article takes you through the steps you should follow in order to directly compare the results of two load tests and measure the performance improvement (or degradation) that occurred with the fixes put in place.

Step 1: Identify issues and implement changes based on initial results
During the April 14 load test session, Andreas Grabner and I found substantial performance concerns for the application under a load well in excess of what is currently seen even on the APM Community's busiest days. The problem was that the load that triggered these performance issues was still well short of the 200 virtual users (VUs) the application team wanted to reach.

An Aggregated Data View Showing External and Internal Performance Indicators During the April 14 Load Test

During the April 14 load test execution, a number of environment issues were identified. The critical ones the systems team addressed by included:

  • Deploying critical APM Community applications to separate machines, so that the performance of one layer cannot negatively affect another
  • Optimizing the way APM Community pages are built in the application layer to reduce CPU usage
  • Optimizing cache settings in Confluence to reduce roundtrips to the database when loading commonly used objects
  • Increasing the CPU power of the virtualized machines so that they can handle more load

Step 2: Re-Run the test (With the same parameters!)
Once these steps were complete, a second test cycle was scheduled to determine whether the updated environment could reach the desired 200 VU target without encountering response-time degradation. The follow-up load test was executed exactly one week later, on April 21, and used the same parameters as the initial test (see the previous post for load-ramping details). Using the same test parameters (load ramp, test scripts, testing locations, databanks, etc.) is critical to allow a like-for-like comparison. Any deviation in the test configuration can skew the results and lead to an unwarranted sense of confidence (or fear of implosion) regarding the application environment.
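To make the parity requirement concrete, here is a minimal sketch (in Python, with purely illustrative field names - not Compuware's actual test configuration format) of a check that refuses a comparison between two test runs whose parameters differ:

    # Hypothetical test definitions; every field name here is illustrative.
    BASELINE = {
        "load_ramp": [(0, 10), (40, 100), (90, 200)],   # (minute, target VUs)
        "scripts": ["homepage", "search", "forum_post"],
        "locations": ["us-east", "eu-west"],
        "databank": "community_users.csv",
    }

    def assert_same_parameters(baseline, followup):
        """Refuse a like-for-like comparison if any test parameter differs."""
        for key, value in baseline.items():
            if followup.get(key) != value:
                raise ValueError(f"Parameter {key!r} differs; results would be skewed")

    followup = dict(BASELINE)            # April 21 reused the April 14 configuration
    assert_same_parameters(BASELINE, followup)
    print("Configurations match; like-for-like comparison is valid")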

When the April 21 round of load testing was complete and we began to analyze the results, the initial data (higher throughput, faster response times, lower CPU utilization, and reduced database load) suggested that this load test was substantially more successful than the previous execution. This initial conclusion was based on performance charts containing the same metrics we used to analyze the April 14 test, charted as a direct comparison of critical measures to show whether the pattern of performance had changed dramatically between the two test executions.

Step 3: Compare the Results
So, to start the comparative analysis, we took three key metrics of the April 14 and April 21 results and charted them together: External Web Load Test Average Response Time; Web Load Test Transactions per Minute; and percentage of CPU Utilization on the web server. Using just these three comparisons, it is clear that the two load tests had very different performance profiles.

Starting with the Web Load Test Average Response Times (the time required to completely download all of the content in the scripted synthetic transactions used in the load test), it is clear that after 08:50 EDT - 40 minutes into both tests - the response times diverged and remained on different paths for the remainder of the comparative run. From this point on, the April 21 load test averaged load times around 50% faster than the April 14 test (note: the moving average of percentage change averages 5 minutes of response-time change to produce a clearer trend line). It took nearly 20 more minutes for Average Transaction Response Time to reach 20 seconds on April 21, even with load being applied at the same volume as in the April 14 test.
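For readers who want to reproduce that trend line, here is a small sketch of the calculation described in the note above - a per-minute percentage change between the two runs, smoothed over a 5-minute window. The sample values are illustrative, not the recorded test data:

    from statistics import mean

    def pct_change(baseline, followup):
        """Per-minute percentage change of the follow-up test vs. the baseline."""
        return [100.0 * (f - b) / b for b, f in zip(baseline, followup)]

    def moving_average(series, window=5):
        """Average each point over the trailing 5 minutes, as in the trend line."""
        return [mean(series[max(0, i - window + 1):i + 1]) for i in range(len(series))]

    # Illustrative per-minute response times in seconds, not the recorded data.
    apr14 = [8.0, 9.5, 12.0, 16.0, 21.0, 24.0]
    apr21 = [7.5, 8.0, 8.5, 9.0, 11.0, 13.0]
    print(moving_average(pct_change(apr14, apr21)))   # negative = faster on April 21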

Comparison of Response Times showing improvement in April 21 APM Community Load Test resulting in lower average response times

The Web Load Transactions per Minute (the number of WLT transactions executed in a minute at that point of the load test) showed a pattern where the April 21 test also diverged from the April 14 test at 08:50 EDT. With the faster WLT Average Response Times, the April 21 test saw the system process 40-50% more transactions per minute than the April 14 test from 08:50 EDT until the end of the test cycle.

Comparison of Transactions per Minute showing improvement in April 21 APM Community Load Test resulting in higher transactions per minute

Much of this improvement can be traced to the third metric: CPU utilization on the web server (the percentage of CPU used by the system and applications to perform all necessary activities on the machine). With more hardware and optimized page-rendering processes helping out, the web server's CPU was less heavily stressed throughout the April 21 test, reaching 100% utilization much later than in the April 14 test.

Comparison of CPU Utilization showing improvement in April 21 APM Community Load Test resulting in lower CPU utilization up until 09:40 EDT

These three metrics are directly tied to the Number of Web Requests per Minute recorded at the Confluence application layer. This metric peaked at 125-140 requests per minute during the April 21 test, compared to a peak of approximately 100 in the April 14 test.

Despite the seeming success of the second load test on April 21, issues still appeared. Building an integrated results chart for the April 21 load test shows that multiple performance events occurred once the load test reached the 100% CPU utilization boundary (red vertical line in the chart below). This indicates that despite the improvements to the environment discussed above, a CPU bottleneck is still present at higher loads.
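One way to locate that boundary in the raw data is to find the first minute at which CPU utilization reaches 100% and then list the performance events recorded after it. A minimal sketch, using hypothetical per-minute samples:

    # Illustrative samples: find the minute the web server CPU first saturates,
    # then list the performance events recorded at or after that boundary.
    cpu_pct = {"09:30": 88, "09:35": 94, "09:40": 100, "09:45": 100, "09:50": 97}
    events  = {"09:42": "response-time spike", "09:48": "throughput dip"}  # hypothetical

    boundary = next(t for t, pct in sorted(cpu_pct.items()) if pct >= 100)
    after = {t: e for t, e in events.items() if t >= boundary}
    print(boundary, after)    # the events cluster past the 100% CPU boundary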

An area of extreme contrast between the two tests was recorded in the database results. Database stats were clearly visible in the data from the April 14 test (see the aggregated performance metric chart in Step 1), including a large spike in the number and length of queries just before the application reached the CPU bottleneck. To find the same metrics in the April 21 test, however, you have to break out your microscope and look very closely at the bottom of the chart.

External and Internal Performance Metrics for April 21 Load Test

The reduction in database load was the direct result of the optimized cache settings enabled after the April 14 load test. With more of the data being stored in the application cache, the number of calls to the database decreased, removing this layer as a potential bottleneck at this load volume.
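The mechanism is easy to illustrate. The sketch below (a generic memoization cache in Python, not Confluence's actual caching layer) shows how caching commonly used objects collapses repeated database roundtrips into a single call:

    import functools

    db_calls = 0

    def load_from_db(object_id):
        """Stand-in for a database roundtrip; counts how often the DB is hit."""
        global db_calls
        db_calls += 1
        return {"id": object_id}

    @functools.lru_cache(maxsize=4096)
    def load_object(object_id):
        """Cached lookup: only the first request per object reaches the database."""
        return load_from_db(object_id)

    for _ in range(1000):
        load_object("sidebar-config")     # a commonly used object
    print(db_calls)                       # 1 -- the other 999 roundtrips were avoided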

Step 4: Results and Next Steps
The lack of a sudden spike in Confluence/Atlassian processing time in the April 21 test (along with the accompanying database spike) was due to the removal of an application-layer process that had been scheduled to run during the load test period. This process, and its effects on the systems and on user experience, was quickly recognized once Andreas reviewed his data. The job that caused this issue was identified and removed in time for the April 21 test, completely eliminating a performance bottleneck that had been encountered early in the April 14 test.

Lesson learned: Don't schedule system-intensive jobs to run during peak traffic periods; find a window with the lowest traffic volume to perform these tasks so that the fewest visitors possible are affected.
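Operationally, that window can be picked straight from traffic data. A minimal sketch, assuming hypothetical hourly visit counts:

    # Hypothetical hourly visit counts for a 24-hour day (index = hour of day).
    hourly_visits = [120, 80, 45, 30, 25, 40, 90, 300, 800, 950, 900, 870,
                     820, 840, 860, 830, 780, 700, 560, 430, 330, 260, 200, 150]

    quietest = min(range(24), key=lambda h: hourly_visits[h])
    print(f"Schedule system-intensive jobs around {quietest:02d}:00")   # 04:00 here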

As we noted at the start of this post, it appears on the surface that the April 21 load test was more successful than the April 14 test. Yet, despite the improved performance, the results still show performance concerns that need to be addressed. These concerns center on a dramatic spike in response times between 09:40 and 09:50 EDT, occurring after the load test had been running for 90 minutes.

When the system began to show degraded performance, it could easily be tracked using the three key metrics: WLT Average Response Time, WLT Transactions per Minute, and CPU utilization. Running transactions began to take much longer to execute, decreasing both the number of incoming web requests to the application layer and the number of transactions per minute executed by the load generation system. The root cause can be seen in the chart below, which removes some of the data series.

At 167 VUs, the system redlines, and has a sudden, 10-minute degradation of performance, after which it recovers when the test stabilizes at 200 VUs

The period of degradation detected during the load test started at 09:40 EDT and coincided with the following (combined into a single check in the sketch after this list):

  • Web Load Test achieving 167 VUs
  • CPU on the web server measuring 100%
  • Web Load Test Transactions per Minute averaging 130
  • Confluence Web Requests (the application layer of the APM Community Portal) measured at 135 per minute
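Here is that single check, combining the four observed signals into one alert condition. The threshold values come from the observations in the list above; the metric names are illustrative:

    def degradation_alert(sample):
        """True when all four signals from the 09:40 EDT window hold at once."""
        return (sample["virtual_users"] >= 167
                and sample["web_cpu_pct"] >= 100
                and sample["wlt_tpm"] >= 130
                and sample["confluence_rpm"] >= 135)

    sample_0940 = {"virtual_users": 167, "web_cpu_pct": 100,
                   "wlt_tpm": 130, "confluence_rpm": 135}
    print(degradation_alert(sample_0940))   # True -> flag the degradation window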

Interestingly, after 10 minutes this issue cleared up completely, except for transaction response times. The response times did not return to pre-spike values, but were now averaging almost 20 seconds higher than before the spike. With the system now peaked at 200 VUs and no additional load being generated, it was interesting to see that other metrics - notably Transactions per Minute and Web Requests per Minute - returned immediately to their pre-spike levels. So, with 33 more VUs than before the spike, the system again appeared to be directly limited by a CPU bottleneck, as the higher load could not increase the number of requests processed at the application layer.

Out of this sea of metrics we determined that the April 21 load test showed a comparative improvement in the application over the April 14 test, but the second test was still unable to reach the target of 200 VUs without hitting a bottleneck that caused performance to degrade dramatically.

Analyzing the degradation
To find the cause of the CPU bottleneck that prevented the April 21 test from reaching the goal of 200 VUs with little or no performance degradation, we have to dig deeper into the server-side metrics, especially those related to the health of the application server. The dip in transactions throughout the system is aligned with the issue captured when the system hit 167 VUs. The question is: Was the dip in transactions processed and the rise in transaction response times the result of this load volume, or a symptom of the actual cause of the performance degradation?

When the system degrades, the server-side data shows that high garbage collection (GC) activity could be a problem, as this automated process ran at the same time. Executing a very intensive system process when the web server CPU is already exhausted can clearly cause a very large performance degradation.

Increased GC is normal while increasing the load - but there is an unusual spike exactly when we see a dip in transaction throughput
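To confirm that the GC spike and the throughput dip move together rather than by coincidence, a simple correlation over the per-minute samples is enough. A minimal sketch with illustrative values (a strongly negative result means GC time rises exactly as throughput falls):

    from statistics import mean

    def pearson(xs, ys):
        """Pearson correlation coefficient between two equal-length series."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
        return cov / var

    # Illustrative per-minute samples around the spike, not the recorded data.
    gc_seconds = [1.0, 1.2, 1.1, 6.5, 7.0, 6.8, 1.3, 1.1]
    tx_per_min = [130, 132, 131, 70, 65, 68, 128, 130]
    print(pearson(gc_seconds, tx_per_min))   # strongly negative, close to -1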

Looking at the application-server-specific transaction response times, it is easy to spot the potential problem. The following charts show that "yet another" background job executes every hour, taking CPU cycles away from the already exhausted system.

A background job executes every hour taking up to 300s in CPU cycles at the time when virtual users experienced the performance degradation

Looking at these transactions reveals that the job is an hourly update job that synchronizes the cached user objects with the user directory database. This takes a considerable amount of time because we have 65k+ users on the APM Community system. This update job causes a lot of objects to be created and destroyed - hence the increased memory and GC activity.

The synchronization job is the root cause of the degradation by consuming a lot of CPU as well as allocating memory which causes high GC activity
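To make the mechanism concrete, here is a schematic sketch of such a sync pass - emphatically not Confluence's actual code. Every run rebuilds a fresh object per user just to compare it against the cache, so with 65k+ users most of those allocations become garbage almost immediately, which is exactly the churn that drives GC activity:

    def sync_user_cache(cache, directory_rows):
        """Schematic hourly sync: builds one fresh object per user on every pass."""
        for username, email in directory_rows:
            fresh = {"username": username, "email": email}   # new object per user
            if cache.get(username) != fresh:                 # most entries are unchanged,
                cache[username] = fresh                      # so most 'fresh' objects die young

    directory = [(f"user{i}", f"user{i}@example.com") for i in range(65_000)]
    cache = {}
    sync_user_cache(cache, directory)   # first pass populates; hourly reruns churn
    print(len(cache))                   # 65000 cached user objects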

As with the April 14 load test, the April 21 load test exposed issues with the system that prevented the achievement of the 200 VU goal. But now we have a clear culprit, so efforts can focus on reducing or eliminating the effect this update process has on the system when it is under peak load.

Conclusion
In both tests, regardless of how you measure the "success" of a load test, something was learned about the system by aggregating metrics from inside and outside the infrastructure being tested. We now know that the optimizations performed after the April 14 load test allowed the system to process 40-50% more transactions per minute, up to the 167 VU mark, at which point a scheduled system process caused a severe application degradation.

This data could only be turned into actionable information because we had a process in place that allowed results captured from inside the firewall to be easily aligned with the external results from the load test system. By doing this, the customer, albeit in a very controlled form, becomes a factor in the analysis of system performance.
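The alignment step itself is simple once both sides emit per-minute samples: join the external and internal series on their shared timestamps. A minimal sketch with illustrative values:

    # Minimal sketch of the alignment step: join external load-test samples with
    # internal server metrics on their shared per-minute timestamps.
    external = {"09:40": {"avg_response_s": 24.0}, "09:41": {"avg_response_s": 25.5}}
    internal = {"09:40": {"cpu_pct": 100, "gc_s": 6.5}, "09:41": {"cpu_pct": 100, "gc_s": 7.0}}

    aligned = {ts: {**external[ts], **internal[ts]}
               for ts in external.keys() & internal.keys()}
    for ts in sorted(aligned):
        print(ts, aligned[ts])   # one row per minute: outside-in and inside views together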

By creating a full performance perspective, PureStack delivers more than just deeper technical metrics on a system under load. PureStack places the experience of the visitor at the same level of importance as CPU, database, and web requests processed by the application layer when the results are analyzed. The importance of the user experience then dictates how infrastructure issues are prioritized and resolved, as the effect these issues have on end users provides real-world feedback on the true cost of performance issues that affect your application during peak periods.

Using the data from this load test, we realized that additional changes to the system were needed, especially in the area of page rendering, to further reduce CPU load and allow the system to reach and maintain a peak load of 200 virtual users. With the upgrade to the Confluence application software - deployed in early July 2013 - it was expected that the desired goal would be reached. But in case this is not sufficient, an additional load test on the new Confluence system is expected in July 2013, once the system has been completely stabilized. Using the same transaction paths as in the April 14 and 21 load tests, the system will be verified to confirm that the upgrade delivers the expected performance.

More Stories By Stephen Pierzchala

With more than a decade in the web performance industry, Stephen Pierzchala has advised many organizations, from Fortune 500 companies to startups, on how to improve the performance of their web applications by helping them develop and evolve the speed, conversion, and customer experience metrics needed to effectively measure and manage online web and mobile applications. Working on projects for top companies in the online retail, financial services, content delivery, ad-delivery, and enterprise software industries, he has developed new approaches to web performance data analysis. Stephen has led web performance methodology, CDN assessment, SaaS load testing, technical troubleshooting, and performance assessments, demonstrating the value of web performance. He is noted for his technical analyses and knowledge of web performance from the outside in.
