|By Michael Kopp||
|November 4, 2012 09:00 AM EST||
An eCommerce site that crashes seven times during the Christmas season being down for up to five hours each time it crashes is a site that loses a lot of money and reputation. It happened to one of our customers who told this story at our annual performance conference earlier this month. Among the several reasons that led to these crashes I want to share more details on one of them that I see more often with other websites as well. Load balancers on a round-robin instead of least-busy can easily lead to app server crashes caused by heap memory exhaustion. Let's dig into some details on how to identify these problems and how to avoid it.
The Symptom: Crashing Tomcat Instances
The website is deployed on six Tomcats with three front-end Apache Web Servers. During peak load hours individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed due to out-of-memory exceptions and with that also brought down the rest of the site as load couldn't be handled any more with the remaining servers. Figure 1 shows the actual flow of transactions through the system highlighting unevenly distributed response time in the application servers and functional errors being reported on all tiers (red-colored server icon):
Even with equally distributed load (Round Robin Load Balancer Setting) one of the Tomcats spiked in Response Time Contribution before crashing
Once the App Server started rejecting incoming connections we can observe the first ripple effect of errors. We can see a very high number of exceptions in the database layer, exceptions thrown between application tiers with the web app responding with HTTP 500s:
Within 30 minutes the application serves 43000 pages with an HTTP 500 Response correlating to Exceptions in the Database and Inter-Tier Communication
The Root Cause: Inefficient Database Statements and Connection Pool Usage
The exceptions caught in the Database Layer (JDBC) were already a very good hint of the root cause of this problem. A closer look at the Exceptions shows that connection pools are exhausted, which causes problems in the different components of the application:
Exhausted Connection Pool causes Exceptions that impact Data Access Layer as well as Widget Rendering
Looking at the performance breakdown by application layer reveals how much performance impact connection pooling has on the overall transaction response time:
Due to the connection pool problem a single request had to wait 3.8s on average to obtain a connection from the pool
Now - it was not only the size of the pool that was the problem - but - several very inefficient database statements that took long to execute for certain business transactions of the application. This caused the application server to hold on to the connection longer than normal. As the load balancer was configured with Round Robin the app server still got additional requests served. Eventually - just by the random nature of incoming requests - one app server received several of these requests that executed these inefficient database calls. Once the connection pool was exhausted the application started throwing exceptions that ultimately also led to a crash of the JVM. Once the first app server crashed, it didn't take too long to take the other app servers down as well.
The Solution: Optimizing App and Load Balancer
The problem was fixed by looking at the slowest database statements and optimizing them for performance by, e.g, adding indices on the database or making the SQL statements more efficient. They also optimized the pool size to accommodate the expected load during peak hours.
They started by optimizing SQL Statements that took long to execute and those that got executed several times within the same transaction
They also changed the load balancer setting from Round-Robin to Least-Busy, which was the preferred setting from the LB vendor but had simply forgotten to configure in the production environment.
The Result: Site Has Not Been Down Since
Since they made the changes to the application and the load balancer the site has never gone down since. Now - the next holiday season is coming up and they are ready for the next seasonal spikes. Even though they are really confident that everything will work without any problems, they learned their lesson and are approaching performance proactively through proper load testing.
Next Steps: Proactive Performance Management
The lesson learned was that these problems could have been found prior to the holiday shopping season by doing proper load testing. They did load testing before but never encountered this problem because of two reasons: 1) they didn't test using expected peak volumes for long enough sessions and 2) didn't use a tool that simulated real customer behavior variations (too few scripts and the scripts were too simple) that tested their highly interactive web site.
Their strategy for proactive performance management is that they:
- Perform load tests on the production system during low traffic hours (2 a.m.-6 a.m.), accepting the risk of minor sales losses in case of a crash, versus major sales losses during the holiday shopping season.
- Multiply the hourly load test volume by 2.5 since their actual peaks are 10 hours long.
- Use a load testing service that uses real browsers in different locations around the U.S.
- Use an APM solution that identifies problems within the application while running the load test.
If you want to read more on common performance problems that are not found prior to moving to production check out my recent series of blogs: Supersized Content, Deployment Mistakes or Excessive Logging
SYS-CON Events announced today that BMC will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. BMC delivers software solutions that help IT transform digital enterprises for the ultimate competitive business advantage. BMC has worked with thousands of leading companies to create and deliver powerful IT management services. From mainframe to cloud to mobile, BMC pairs high-speed digital innovation with robust IT industrialization – allowing customers to provide amazing user experiences with optimized IT per...
May. 30, 2015 08:15 AM EDT Reads: 1,825
We’re entering a new era of computing technology that many are calling the Internet of Things (IoT). Machine to machine, machine to infrastructure, machine to environment, the Internet of Everything, the Internet of Intelligent Things, intelligent systems – call it what you want, but it’s happening, and its potential is huge. IoT is comprised of smart machines interacting and communicating with other machines, objects, environments and infrastructures. As a result, huge volumes of data are being generated, and that data is being processed into useful actions that can “command and control” thi...
May. 30, 2015 08:00 AM EDT Reads: 1,443
Building low-cost wearable devices can enhance the quality of our lives. In his session at Internet of @ThingsExpo, Sai Yamanoor, Embedded Software Engineer at Altschool, provided an example of putting together a small keychain within a $50 budget that educates the user about the air quality in their surroundings. He also provided examples such as building a wearable device that provides transit or recreational information. He then reviewed the resources available to build wearable devices at home including open source hardware, the raw materials required and the options available to power s...
May. 30, 2015 05:30 AM EDT Reads: 4,547
In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect at GE, and Ibrahim Gokcen, who leads GE's advanced IoT analytics, focused on the Internet of Things / Industrial Internet and how to make it operational for business end-users. Learn about the challenges posed by machine and sensor data and how to marry it with enterprise data. They also discussed the tips and tricks to provide the Industrial Internet as an end-user consumable service using Big Data Analytics and Industrial Cloud.
May. 30, 2015 04:30 AM EDT Reads: 5,859
We certainly live in interesting technological times. And no more interesting than the current competing IoT standards for connectivity. Various standards bodies, approaches, and ecosystems are vying for mindshare and positioning for a competitive edge. It is clear that when the dust settles, we will have new protocols, evolved protocols, that will change the way we interact with devices and infrastructure. We will also have evolved web protocols, like HTTP/2, that will be changing the very core of our infrastructures. At the same time, we have old approaches made new again like micro-services...
May. 30, 2015 03:30 AM EDT Reads: 5,822
How do APIs and IoT relate? The answer is not as simple as merely adding an API on top of a dumb device, but rather about understanding the architectural patterns for implementing an IoT fabric. There are typically two or three trends: Exposing the device to a management framework Exposing that management framework to a business centric logic Exposing that business layer and data to end users. This last trend is the IoT stack, which involves a new shift in the separation of what stuff happens, where data lives and where the interface lies. For instance, it's a mix of architectural styles ...
May. 30, 2015 03:00 AM EDT Reads: 6,190
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, examined three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective storage designed to handle the massive surge in back-end data in a world where timely analytics is e...
May. 30, 2015 02:00 AM EDT Reads: 6,564
Collecting data in the field and configuring multitudes of unique devices is a time-consuming, labor-intensive process that can stretch IT resources. Horan & Bird [H&B], Australia’s fifth-largest Solar Panel Installer, wanted to automate sensor data collection and monitoring from its solar panels and integrate the data with its business and marketing systems. After data was collected and structured, two major areas needed to be addressed: improving developer workflows and extending access to a business application to multiple users (multi-tenancy). Docker, a container technology, was used to ...
May. 30, 2015 01:00 AM EDT Reads: 2,832
The true value of the Internet of Things (IoT) lies not just in the data, but through the services that protect the data, perform the analysis and present findings in a usable way. With many IoT elements rooted in traditional IT components, Big Data and IoT isn’t just a play for enterprise. In fact, the IoT presents SMBs with the prospect of launching entirely new activities and exploring innovative areas. CompTIA research identifies several areas where IoT is expected to have the greatest impact.
May. 29, 2015 09:00 PM EDT Reads: 5,560
2015 predictions circa 1970: houses anticipate our needs and adapt, city infrastructure is citizen and situation aware, office buildings identify and preprocess you. Today smart buildings have no such collective conscience, no shared set of fundamental services to identify, predict and synchronize around us. LiveSpace and M2Mi are changing that. LiveSpace Smart Environment devices deliver over the M2Mi IoT Platform real time presence, awareness and intent analytics as a service to local connected devices. In her session at @ThingsExpo, Sarah Cooper, VP Business of Development at M2Mi, will d...
May. 29, 2015 04:27 PM EDT Reads: 988
The Industrial Internet revolution is now underway, enabled by connected machines and billions of devices that communicate and collaborate. The massive amounts of Big Data requiring real-time analysis is flooding legacy IT systems and giving way to cloud environments that can handle the unpredictable workloads. Yet many barriers remain until we can fully realize the opportunities and benefits from the convergence of machines and devices with Big Data and the cloud, including interoperability, data security and privacy.
May. 29, 2015 03:45 PM EDT Reads: 5,136
Explosive growth in connected devices. Enormous amounts of data for collection and analysis. Critical use of data for split-second decision making and actionable information. All three are factors in making the Internet of Things a reality. Yet, any one factor would have an IT organization pondering its infrastructure strategy. How should your organization enhance its IT framework to enable an Internet of Things implementation? In this session, James Kirkland, Red Hat's Chief Architect for the Internet of Things and Intelligent Systems, will describe how to revolutionize your architecture and...
May. 29, 2015 02:33 PM EDT Reads: 932
The Internet of Things is tied together with a thin strand that is known as time. Coincidentally, at the core of nearly all data analytics is a timestamp. When working with time series data there are a few core principles that everyone should consider, especially across datasets where time is the common boundary. In his session at Internet of @ThingsExpo, Jim Scott, Director of Enterprise Strategy & Architecture at MapR Technologies, discussed single-value, geo-spatial, and log time series data. By focusing on enterprise applications and the data center, he will use OpenTSDB as an example t...
May. 29, 2015 02:00 PM EDT Reads: 6,914
All major researchers estimate there will be tens of billions devices - computers, smartphones, tablets, and sensors - connected to the Internet by 2020. This number will continue to grow at a rapid pace for the next several decades. With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo, June 9-11, 2015, at the Javits Center in New York City. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be
May. 29, 2015 01:15 PM EDT Reads: 3,131
Scott Jenson leads a project called The Physical Web within the Chrome team at Google. Project members are working to take the scalability and openness of the web and use it to talk to the exponentially exploding range of smart devices. Nearly every company today working on the IoT comes up with the same basic solution: use my server and you'll be fine. But if we really believe there will be trillions of these devices, that just can't scale. We need a system that is open a scalable and by using the URL as a basic building block, we open this up and get the same resilience that the web enjoys.
May. 29, 2015 01:00 PM EDT Reads: 7,557
We are reaching the end of the beginning with WebRTC, and real systems using this technology have begun to appear. One challenge that faces every WebRTC deployment (in some form or another) is identity management. For example, if you have an existing service – possibly built on a variety of different PaaS/SaaS offerings – and you want to add real-time communications you are faced with a challenge relating to user management, authentication, authorization, and validation. Service providers will want to use their existing identities, but these will have credentials already that are (hopefully) i...
May. 29, 2015 01:00 PM EDT Reads: 4,812
SYS-CON Events announced today that MetraTech, now part of Ericsson, has been named “Silver Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York, NY. Ericsson is the driving force behind the Networked Society- a world leader in communications infrastructure, software and services. Some 40% of the world’s mobile traffic runs through networks Ericsson has supplied, serving more than 2.5 billion subscribers.
May. 29, 2015 01:00 PM EDT Reads: 2,512
Thanks to widespread Internet adoption and more than 10 billion connected devices around the world, companies became more excited than ever about the Internet of Things in 2014. Add in the hype around Google Glass and the Nest Thermostat, and nearly every business, including those from traditionally low-tech industries, wanted in. But despite the buzz, some very real business questions emerged – mainly, not if a device can be connected, or even when, but why? Why does connecting to the cloud create greater value for the user? Why do connected features improve the overall experience? And why do...
May. 29, 2015 12:42 PM EDT Reads: 1,138
SYS-CON Events announced today that O'Reilly Media has been named “Media Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York City, NY. O'Reilly Media spreads the knowledge of innovators through its books, online services, magazines, and conferences. Since 1978, O'Reilly Media has been a chronicler and catalyst of cutting-edge development, homing in on the technology trends that really matter and spurring their adoption by amplifying "faint signals" from the alpha geeks who are creating the future. An active participa...
May. 29, 2015 12:30 PM EDT Reads: 1,426
Imagine a world where targeting, attribution, and analytics are just as intrinsic to the physical world as they currently are to display advertising. Advances in technologies and changes in consumer behavior have opened the door to a whole new category of personalized marketing experience based on direct interactions with products. The products themselves now have a voice. What will they say? Who will control it? And what does it take for brands to win in this new world? In his session at @ThingsExpo, Zack Bennett, Vice President of Customer Success at EVRYTHNG, will answer these questions a...
May. 29, 2015 12:13 PM EDT Reads: 945