Part 2: An Integrated Approach to Load Test Analysis

The Follow-up Test

In Part 1, I demonstrated how to add more depth to the analysis of a Compuware APM Web Load Test by combining the external load results with the application and infrastructure data collected by the Compuware PureStack Technology. But now that we have tested the system once, what would happen if we tested it again after we identified and "resolved" the issues we found? Would running a test using the same parameters as in the initial test show a clear performance improvement? Would the system be able to achieve the desired load of 200 virtual users with little or no performance degradation?

This article takes you through the steps you should follow in order to directly compare the results of two load tests and measure the performance improvement (or degradation) that occurred with the fixes put in place.

Step 1: Identify issues and implement changes based on initial results
During the April 14 load test session, Andreas Grabner and I found substantial performance concerns for the application under a load well in excess of what is currently seen even on the APM Community's busiest days. The problem was that the load that caused these performance issues was still well short of the 200 virtual user (VU) goal that the application team wanted to reach.

An Aggregated Data View Showing External and Internal Performance Indicators During the April 14 Load Test

During the April 14 load test execution, a number of environment issues were identified. The critical ones the systems team addressed included:

  • Deploying critical APM Community applications to different machines, preventing the performance of one application layer from negatively affecting another
  • Optimizing the way APM Community pages are built in the application layer to reduce CPU usage
  • Optimizing cache settings in Confluence to reduce roundtrips to the database when loading commonly used objects
  • Increasing the CPU power on the virtualized machines so that they can handle more load

Step 2: Re-Run the test (With the same parameters!)
Once these steps were complete, a second test cycle was scheduled to determine if the updated environment would be able to reach the desired 200 VU target without encountering response time degradation. The follow-up load test was executed exactly one week later, on April 21, and used the same parameters as the initial test (see previous post for load ramping details). Using the same test parameters (load ramp, test scripts, testing locations, databanks, etc.) is critical in order to allow a like-for-like comparison to occur. Any deviation in the test configuration can skew the results and potentially lead to an unintended sense of confidence (or fear of implosion) regarding the application environment.
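As a concrete illustration of that parity requirement, the sketch below captures a run's parameters in a single value object so a follow-up test can be verified as like-for-like before any results are compared. This is a hypothetical Java sketch; the field names and sample values are illustrative, not part of the Compuware tooling.

```java
import java.util.List;

// Hypothetical sketch: capture every parameter that defines a load test run
// so a rerun can be verified as a true like-for-like comparison. The field
// names and sample values are illustrative, not from the Compuware tooling.
public record LoadTestConfig(
        int targetVirtualUsers,       // e.g. the 200 VU goal
        List<Integer> rampStepsVUs,   // VU count at each ramp step
        int rampStepMinutes,          // minutes between ramp steps
        List<String> testScripts,     // scripted synthetic transactions
        List<String> testLocations,   // external load generation sites
        String databankVersion) {     // input data used by the scripts

    public static void main(String[] args) {
        LoadTestConfig april14 = new LoadTestConfig(
                200, List.of(25, 50, 100, 167, 200), 20,
                List.of("community-home", "forum-post"),
                List.of("us-east", "eu-west"), "v1");
        LoadTestConfig april21 = new LoadTestConfig(
                200, List.of(25, 50, 100, 167, 200), 20,
                List.of("community-home", "forum-post"),
                List.of("us-east", "eu-west"), "v1");

        // Records compare by value, so any accidental deviation in the
        // rerun's configuration is caught before results are compared.
        if (!april14.equals(april21)) {
            throw new IllegalStateException("Test parameters diverged; results are not comparable");
        }
        System.out.println("Configurations match; like-for-like comparison is valid");
    }
}
```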

When the April 21 round of load testing was complete and we began to analyze the results, the initial data (higher throughput, faster response times, lower CPU utilization, and a reduction in database load) suggested that this load test was substantially more successful than the previous test execution. This initial conclusion was based on performance charts containing the same metrics we used to analyze the April 14 test, which provided a direct comparison of critical measures and showed whether the pattern of performance had changed dramatically between the two test executions.

Step 3: Compare the Results
So, to start the comparative analysis, we took three key metrics of the April 14 and April 21 results and charted them together: External Web Load Test Average Response Time; Web Load Test Transactions per Minute; and percentage of CPU Utilization on the web server. Using just these three comparisons, it is clear that the two load tests had very different performance profiles.

Starting with the Web Load Test Average Response Times (the time required to completely download all of the content in the scripted synthetic transactions used in the load test), it is very clear that after 08:50 EDT - 40 minutes into both tests - the response times diverged and remained on different paths for the remainder of the comparative test run. From this point on, the April 21 load test averaged load times that were around 50% faster than those of the April 14 test (note: the moving average of percentage change averages 5 minutes of response time change to produce a clearer trend line). It took nearly 20 more minutes for Average Transaction Response Time to reach 20 seconds on April 21, even with load being applied at the same volume as in the April 14 test.

Comparison of Response Times showing improvement in April 21 APM Community Load Test resulting in lower average response times
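The smoothing mentioned in the note above is straightforward to reproduce. The sketch below, with made-up sample values, computes the per-minute percentage change between the two runs' response times and then applies a 5-minute moving average to produce the clearer trend line:

```java
import java.util.Arrays;

// Sketch of the smoothing described above: per-minute percentage change
// between the two test runs, then a 5-minute moving average to produce
// a clearer trend line. The sample values below are made up.
public class ResponseTimeComparison {

    // Percentage change of run B relative to run A, minute by minute.
    static double[] percentChange(double[] a, double[] b) {
        double[] change = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            change[i] = (b[i] - a[i]) / a[i] * 100.0;
        }
        return change;
    }

    // Trailing moving average over the given window (5 samples = 5 minutes).
    static double[] movingAverage(double[] series, int window) {
        double[] smoothed = new double[series.length];
        for (int i = 0; i < series.length; i++) {
            int start = Math.max(0, i - window + 1);
            double sum = 0;
            for (int j = start; j <= i; j++) sum += series[j];
            smoothed[i] = sum / (i - start + 1);
        }
        return smoothed;
    }

    public static void main(String[] args) {
        double[] april14 = {12.0, 14.5, 18.0, 22.0, 26.0, 30.0}; // seconds, made up
        double[] april21 = {11.5, 12.0, 12.5, 13.0, 14.0, 15.5}; // seconds, made up
        double[] trend = movingAverage(percentChange(april14, april21), 5);
        System.out.println(Arrays.toString(trend)); // negative = April 21 faster
    }
}
```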

The Web Load Transactions per Minute (the number of WLT transactions executed in a minute at that point of the load test) showed a pattern where the April 21 test also diverged from the April 14 test at 08:50 EDT. With the faster WLT Average Response Times, the April 21 test saw the system process 40-50% more transactions per minute than the April 14 test from 08:50 EDT until the end of the test cycle.

Comparison of Transactions per Minute showing improvement in April 21 APM Community Load Test resulting in higher transactions per minute

Much of this improvement can be tracked to the third metric: CPU utilization on the web server (the percentage of CPU used by the system and applications to perform all necessary activities on the machine). Throughout the April 21 test, with more hardware and optimized page-rendering processes helping out, the web server's CPU was less heavily stressed, reaching 100% utilization much later than in the April 14 test.

Comparison of CPU Utilization showing improvement in April 21 APM Community Load Test resulting in lower CPU utilization up until 09:40 EDT

These three metrics are directly tied to the Number of Web Requests per Minute recorded at the Confluence application layer for the April 21 test. This metric peaked at 125-140 per minute during the April 21 test, compared to a peak of approximately 100 Web Requests per minute in the April 14 test.

Despite the seeming success of the second load test on April 21, there were still issues that appeared. Building an integrated results chart for the April 21 load test shows that multiple performance events occurred once the load test reached the 100% CPU Utilization boundary (red vertical line in chart below). This appears to indicate that despite the improvements to the environment discussed above, there is still a CPU bottleneck present at higher loads.

An area of extreme contrast between the two tests was recorded in the Database Results. Database stats were clearly visible in the data from April 14 test (see the aggregated performance metric chart in Step 1), including a large spike in the number and length of queries just before the application reached the CPU bottleneck. But in order to find the same metrics in the April 21 test, you have to break out your microscope and look very closely at the bottom of the chart.

External and Internal Performance Metrics for April 21 Load Test

The reduction in database load was the direct result of the optimized cache settings enabled after the April 14 load test. With more of the data being stored in the application cache, the number of calls to the database decreased, removing this layer as a potential bottleneck at this load volume.
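Confluence's caches are tuned through configuration rather than code, but the underlying effect can be illustrated with a minimal read-through cache: repeat lookups are answered from memory, so a commonly used object costs one database roundtrip in total instead of one per request. The class below is a rough Java sketch of that idea, not Confluence's implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Rough illustration of the effect behind the Confluence cache tuning: a
// read-through cache answers repeat lookups from memory instead of issuing
// a database roundtrip. Confluence's actual caches are configured rather
// than hand-written; this only demonstrates why the database load dropped.
public class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> databaseLoader; // stands in for the expensive DB query
    private final AtomicLong roundtrips = new AtomicLong();

    public ReadThroughCache(Function<K, V> databaseLoader) {
        this.databaseLoader = databaseLoader;
    }

    public V get(K key) {
        // computeIfAbsent invokes the loader only on a miss, so a commonly
        // used object costs one roundtrip total instead of one per request.
        return cache.computeIfAbsent(key, k -> {
            roundtrips.incrementAndGet();
            return databaseLoader.apply(k);
        });
    }

    public static void main(String[] args) {
        ReadThroughCache<String, String> userCache =
                new ReadThroughCache<>(id -> "user-record-for-" + id); // fake DB lookup
        for (int i = 0; i < 10_000; i++) {
            userCache.get("admin"); // 10,000 requests for the same object...
        }
        System.out.println("Database roundtrips: " + userCache.roundtrips.get()); // ...one roundtrip
    }
}
```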

Step 4: Results and Next Steps
The lack of a sudden spike in Confluence/Atlassian processing time in the April 21 test (along with the accompanying database spike) was due to the removal of an application layer process that had been scheduled to run during the load test period. This process, and its effects on the systems and user experience, was quickly recognized once Andreas reviewed his data. Once the job that caused this issue was identified, it was removed in time for the April 21 test, completely eliminating a performance bottleneck that was encountered early in the April 14 test.

Lesson learned: Don't schedule system-intensive jobs to run during peak traffic periods; find a window with the lowest traffic volume to perform these tasks so that the fewest visitors possible are affected.
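One hedged sketch of how that lesson might be applied in code: compute the delay to the next low-traffic window and pin the intensive job there. The 03:00 window below is an assumption chosen for illustration; the right window depends on your own traffic profile:

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hedged sketch: run a system-intensive job once a day in a fixed
// low-traffic window instead of letting it land in peak periods.
// The 03:00 window is an assumption, not taken from the article.
public class OffPeakScheduler {
    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("America/New_York");
        LocalTime offPeakWindow = LocalTime.of(3, 0); // assumed quietest hour

        ZonedDateTime now = ZonedDateTime.now(zone);
        ZonedDateTime nextRun = now.with(offPeakWindow);
        if (!nextRun.isAfter(now)) {
            nextRun = nextRun.plusDays(1); // today's window already passed
        }
        long initialDelaySec = Duration.between(now, nextRun).toSeconds();

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> System.out.println("Running intensive maintenance job"),
                initialDelaySec,
                TimeUnit.DAYS.toSeconds(1), // repeat daily, always off-peak
                TimeUnit.SECONDS);
    }
}
```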

As we noted at the start of this post, it appears on the surface that the April 21 load test was more successful than the April 14 test. Yet despite the improved performance, the results show that there are still performance concerns that need to be addressed. These concerns center on a dramatic spike in response times between 09:40 and 09:50 EDT, occurring after the load test had been running for 90 minutes.

When the system began to show degraded performance, it could easily be tracked using the three key metrics: WLT Average Response Time, WLT Transactions per Minute, and CPU Utilization. Running transactions began to take much longer to execute, decreasing both the number of incoming web requests to the application layer and the number of transactions per minute executed by the load generation system; the root cause can be seen in the chart below, which removes some of the data series for clarity.

At 167 VUs, the system redlines, and has a sudden, 10-minute degradation of performance, after which it recovers when the test stabilizes at 200 VUs

The period of degradation detected during the load test started at 09:40 EDT and coincided with the following:

  • Web Load Test achieving 167 VUs
  • CPU on the web server measuring 100%
  • Web Load Test Transactions per Minute averaging 130
  • Confluence Web Requests (the application layer of the APM Community Portal) measured at 135 per minute
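Because these conditions lined up within the same minute, a correlated alarm is easy to automate. The sketch below flags any minute where response times spike while throughput drops and the CPU is saturated; the threshold multipliers are assumptions chosen for illustration, not values from the test:

```java
// Sketch of automating the correlation above: flag any minute where the
// average response time spikes while throughput drops and the CPU is
// saturated. The threshold multipliers are illustrative assumptions.
public class DegradationDetector {

    record MinuteSample(String time, double avgResponseSec,
                        double transactionsPerMin, double cpuPercent) {}

    static boolean degraded(MinuteSample prev, MinuteSample cur) {
        boolean responseSpike  = cur.avgResponseSec() > prev.avgResponseSec() * 1.5;
        boolean throughputDrop = cur.transactionsPerMin() < prev.transactionsPerMin() * 0.8;
        boolean cpuSaturated   = cur.cpuPercent() >= 99.0;
        return responseSpike && throughputDrop && cpuSaturated;
    }

    public static void main(String[] args) {
        // Sample values are made up; the ~130 TPM and 100% CPU echo the
        // figures reported for the test, the rest is illustrative.
        MinuteSample before = new MinuteSample("09:39", 20.0, 130, 100);
        MinuteSample spike  = new MinuteSample("09:40", 45.0,  90, 100);
        System.out.println(degraded(before, spike)); // true: raise the alarm
    }
}
```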

Interestingly, after 10 minutes, this issue cleared up completely, except for transaction response times. The response times did not return to pre-spike values, but were now averaging almost 20 seconds higher than before the spike. With the system now peaked at 200 VUs and no additional load being generated, it was interesting to see that other metrics returned immediately to their pre-spike levels - notably Transactions per Minute and Web Requests per Minute. So, with 33 more VUs than before the spike, the system again appeared to be directly affected by a CPU bottleneck, as a higher load could not increase the number of requests processed at the application layer.

Out of this sea of metrics, we determined that the April 21 load test showed a comparative improvement over the April 14 test, but the second test was still unable to reach the target of 200 VUs without suffering a bottleneck that caused performance to degrade dramatically.

Analyzing the degradation
To find the cause of the CPU bottleneck that prevented the April 21 test from reaching the goal of 200 VUs with little or no performance degradation, we have to dig deeper into the server-side metrics, especially those related to the health of the application server. The dip in transactions throughout the system is aligned with the issue captured when the system hit 167 VUs. The question is: was the dip in transactions processed and the rise in transaction response times the result of this load volume, or a symptom of the actual cause of the performance degradation?

When the system degrades, the server-side data shows that heavy Garbage Collection (GC) could be a problem, as this automated process occurred at the same time. It is clear that executing a very intensive system process when the web server CPU is already exhausted can cause a very large performance degradation.

Increased GC is normal while increasing the load - but there is an unusual spike exactly when we see a dip in transaction throughput
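One way to watch for this kind of GC spike from inside the JVM, independent of any APM tooling, is to sample the standard JMX collector beans during the test and log the per-interval deltas. This is a generic sketch using the platform GarbageCollectorMXBean API, not how the PureStack data was actually gathered:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch of watching for a GC spike from inside the JVM: sample the
// collector MXBeans once a minute during the test and log the deltas in
// collection count and time. Standard JMX, not a Compuware-specific API.
public class GcMonitor {
    public static void main(String[] args) throws InterruptedException {
        long lastCount = 0, lastTimeMs = 0;
        while (true) {
            long count = 0, timeMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                count += gc.getCollectionCount();
                timeMs += gc.getCollectionTime();
            }
            // A sudden jump in GC time per interval, like the spike at the
            // transaction-throughput dip, shows up here as a large delta.
            System.out.printf("GC collections: +%d, GC time: +%d ms%n",
                    count - lastCount, timeMs - lastTimeMs);
            lastCount = count;
            lastTimeMs = timeMs;
            Thread.sleep(60_000); // sample once per minute
        }
    }
}
```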

Looking at the application server's transaction response times, it is easy to spot the potential problem. The following charts show that "yet another" background job executes every hour, taking CPU cycles away from the already exhausted system.

A background job executes every hour taking up to 300s in CPU cycles at the time when virtual users experienced the performance degradation

Looking at these transactions reveals that the job is an hourly update job that synchronizes the cached user objects with the user directory database. This takes a considerable amount of time because we have 65k+ users on the APM Community system. This update job causes a lot of objects to be created and destroyed - hence the increased memory and GC activity.

The synchronization job is the root cause of the degradation by consuming a lot of CPU as well as allocating memory which causes high GC activity

As with the April 14 load test, the April 21 load test exposed issues with the system that prevented the achievement of the 200 VU goal. But now, we have a clear culprit for the prevention of this goal, so efforts can focus on reducing or eliminating the effect that this update process has on the system when it is under peak load.
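One possible direction for that effort, sketched below under stated assumptions: make the synchronization incremental, so object churn is proportional to the number of changed users rather than the full 65k+ population. The User type and the changedSince() source are hypothetical stand-ins for the real user directory query:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical mitigation sketch: synchronize only the users modified since
// the last run, updating cache entries in place, so allocation (and thus GC
// pressure) scales with the change set instead of the full 65k+ user base.
// The User type and changedSince() source are illustrative stand-ins.
public class IncrementalUserSync {

    record User(String id, String displayName) {}

    private final Map<String, User> cache = new ConcurrentHashMap<>();
    private long lastSyncEpochMs = 0;

    // Stands in for a query against the user directory database that
    // returns only rows modified since the given timestamp.
    List<User> changedSince(long epochMs) {
        return List.of(new User("u42", "Renamed User"));
    }

    void sync() {
        long syncStart = System.currentTimeMillis();
        for (User changed : changedSince(lastSyncEpochMs)) {
            cache.put(changed.id(), changed); // touch only modified entries
        }
        lastSyncEpochMs = syncStart;
    }

    public static void main(String[] args) {
        IncrementalUserSync job = new IncrementalUserSync();
        job.sync(); // the hourly run now allocates per changed user, not per user
        System.out.println("Cached users updated: " + job.cache.size());
    }
}
```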

Conclusion
In both tests, regardless of how you measure the "success" of a load test, something was learned about the system by aggregating metrics from inside and outside the infrastructure being tested. We now know that the optimizations performed after the April 14 load test allowed the system to process 40-50% more transactions per minute, up to 167 VUs, at which point a scheduled system process caused a severe application degradation.

This data could only be turned into actionable information because we had a process in place that allowed results captured inside the firewall to be easily aligned with the external results from the load test system. By doing this, the customer, albeit in a very controlled form, becomes a factor in the analysis of system performance.

By creating a full performance perspective, PureStack delivers more than just deeper technical metrics on a system under load. PureStack places the experience of the visitor at the same level of importance as CPU, database, and web requests processed by the application layer when the results are analyzed. The importance of the user experience then dictates how infrastructure issues are prioritized and resolved, as the effect these issues have on end users provides real-world feedback on the true cost of performance issues that affect your application during peak periods.

Using the data from this load test, we realized that additional changes to the system were needed, especially in the area of page rendering, in order to further reduce CPU load and allow the system to reach and maintain a peak load of 200 virtual users. With the upgrade to the Confluence application software - deployed in early July 2013 - it was expected that the desired goal would be reached. But rather than assuming this is sufficient, an additional load test on the new Confluence system is expected in July 2013, once the system has been completely stabilized. Using the same transaction paths as in the April 14 and 21 load tests, the system will be verified to confirm that the upgrade is delivering the expected performance.

More Stories By Stephen Pierzchala

With more than a decade in the web performance industry, Stephen Pierzchala has advised many organizations, from Fortune 500 companies to startups, on how to improve the performance of their web applications by helping them develop and evolve the unique speed, conversion, and customer experience metrics necessary to effectively measure, manage, and evolve online web and mobile applications. Working on projects for top companies in the online retail, financial services, content delivery, ad-delivery, and enterprise software industries, he has developed new approaches to web performance data analysis. Stephen has led web performance methodology, CDN assessment, SaaS load testing, technical troubleshooting, and performance assessments, demonstrating the value of web performance. He is noted for his technical analyses and knowledge of web performance from the outside-in.


