By Andreas Grabner
November 4, 2013 03:00 PM EST
I personally don't like the term "War Room" for the firefighting situations that many software companies have to deal with when systems go down or have problems. The way these war rooms typically play out is that key personnel (engineers, operations, business) are summoned into a room until the problem is solved. This was the case back during the Apollo 13 mission and it still is today, as the famous Facebook war room from Dec 2012 shows:
The War Room back then - And Now: Not a whole lot different
What's the problem with these pictures? There are a lot of people in the room who have no clue whether the problem at hand is actually something they can fix or are responsible for. All of these people are summoned without first figuring out who should look at the problem. Why is that? Because the collected "evidence" in the form of infrastructure monitoring data, log files, user complaints, etc., just shows symptoms but doesn't tell us anything about the actual impact and root cause of issues:
Would you know whom to bring into a war room based on these "facts"? Would you want to be one of them?
Looking at the previous image, it is hard to tell which people need to get in a room. Do we just need an Ops guy to restart the process that consumes all of the CPU? Or do we need an application expert who sifts through log files? Do we need to contact our mobile solution provider because it is an actual problem in the 3rd party mobile native app? The typical MO is to simply call in everybody to figure out the root cause of the problem, pulling critical resources from other important projects without even knowing whether these folks can actually help solve the problem. How can we change this? By asking the right questions first!
The 10 Real Questions to Ask
You don't need nice and shiny dashboards that show you an aggregated overview of Twitter statuses, infrastructure health or insight into slow application transactions. You need data to answer the following questions - whether it is presented in nice dashboards or log files doesn't really matter:
Having answers to these 10 questions avoids calling too many people into a war room and improves the handling of critical application problems
1. Is an individual user complaining?
Is it "just" the CEO that complains about a problem with your newly deployed internal app because a report doesn't work on his old IE6? Or is it "just" the end user in a remote location that still uses dial-up? Knowing whether a problem just happens for a single or a very small group of users is important to prioritize.
Analyzing the problem of the complaining user lets us assess whether it is a problem related to just "that" user, e.g., an unsupported browser version or slow network connectivity
2. Are "all" users impacted?
A large number of users may be impacted even though no individual really complains about it. You still need to know, because it is critical that you fix any problem that impacts a large number of your users.
Having the evidence that a large number of people in a certain region, using a certain browser or a certain device are impacted makes it easy to prioritize this issue
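The distinction between questions 1 and 2 can be sketched as a simple grouping over visit data. The following is a minimal illustration only: the record fields (`user`, `region`, `browser`, `failed`) and the sample values are assumptions made for this example, not the output of any particular monitoring product.

```python
from collections import defaultdict

# Hypothetical visit records; field names and values are assumptions.
visits = [
    {"user": "u1", "region": "EU", "browser": "IE6", "failed": True},
    {"user": "u2", "region": "EU", "browser": "Chrome", "failed": False},
    {"user": "u3", "region": "US", "browser": "Chrome", "failed": False},
    {"user": "u4", "region": "US", "browser": "Chrome", "failed": False},
    {"user": "u1", "region": "EU", "browser": "IE6", "failed": True},
]

def impact_by(records, dimension):
    """Return {value: (impacted_users, total_users)} for one dimension."""
    total = defaultdict(set)
    impacted = defaultdict(set)
    for rec in records:
        total[rec[dimension]].add(rec["user"])
        if rec["failed"]:
            impacted[rec[dimension]].add(rec["user"])
    # Count distinct users so one user retrying many times is not overweighted.
    return {value: (len(impacted[value]), len(users))
            for value, users in total.items()}

print(impact_by(visits, "browser"))
print(impact_by(visits, "region"))
```

If a single browser or region accounts for nearly all impacted users, the problem is probably tied to that group; if the impact is spread evenly, it is more likely a general issue.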
3. Is the problem in the application?
The next question, once we know whether users are impacted, is whether the problem is in the application or not. This allows us to call in the application experts, architects and developers if needed. Looking at the performance distribution gives us an overview of where our hotspots really are:
Where are the performance and problem hotspots? Is it really the application? Or do we need to involve other teams?
4. Is there a problem in the delivery chain?
Modern web applications rely on a long list of services along the delivery chain that lie outside of our own data center. That includes CDNs, third-party services, ISPs or mobile networks. Knowing the status of these services and their impact on the end user performance of our own application allows us to decide whether to look into our own data center or to call up Akamai, Facebook & Co:
Do CDNs or other third-party services experience any performance issues and is that the root cause of our complaining users?
5. Is one uncritical transaction impacted?
When the error rate goes up - is it a critical transaction such as search? Or is it a rather uncritical one such as the Contact page? Or is a bot causing lots of errors because it crawls pages that do not exist or that require authentication, and with that skews the overall error rate?
Analyzing which transactions drive the error rate may show you that the errors are not critical because they are either caused by a bot or occur on pages that are not business critical
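A per-transaction error rate with and without bot traffic can be sketched as follows. This is an assumption-laden illustration: the request fields (`transaction`, `status`, `user_agent`), the sample data, and the substring-based bot check are all hypothetical, and real bot detection is usually more involved.

```python
from collections import defaultdict

# Crude heuristic for illustration only; an assumption, not a real bot detector.
BOT_MARKERS = ("bot", "crawler", "spider")

def is_bot(user_agent):
    agent = user_agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

def error_rates(requests, include_bots=False):
    """Return {transaction: errors / total}, counting HTTP status >= 400 as an error."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for req in requests:
        if not include_bots and is_bot(req["user_agent"]):
            continue
        totals[req["transaction"]] += 1
        if req["status"] >= 400:
            errors[req["transaction"]] += 1
    return {name: errors[name] / count for name, count in totals.items()}

# Hypothetical request log entries.
requests = [
    {"transaction": "search", "status": 200, "user_agent": "Mozilla/5.0"},
    {"transaction": "search", "status": 200, "user_agent": "Mozilla/5.0"},
    {"transaction": "search", "status": 500, "user_agent": "Mozilla/5.0"},
    {"transaction": "search", "status": 200, "user_agent": "Mozilla/5.0"},
    {"transaction": "contact", "status": 404, "user_agent": "Googlebot/2.1"},
    {"transaction": "contact", "status": 404, "user_agent": "Googlebot/2.1"},
    {"transaction": "contact", "status": 200, "user_agent": "Mozilla/5.0"},
]

print(error_rates(requests, include_bots=True))
print(error_rates(requests))
```

Comparing the two results shows whether a spike comes from real users on a business-critical transaction or from crawler traffic on a page nobody depends on.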
6. Are critical transactions impacted?
What if your critical transactions are impacted, such as the landing page, login, search, or entering a ticket in your support system? These transactions matter to you, your end users, and your colleagues who need the back office software for their daily tasks. If they are impacted you need to act fast, so it is important to monitor them for failure rate as well as performance. Problems here take priority over transactions that are not vital to your business - and you also know which subject matter experts to call:
Monitoring your critical transactions allows you to identify problems in those areas that are critical for your business
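One way to picture this kind of monitoring is a threshold check per critical transaction that also names the team to call. The sketch below is purely illustrative: the transaction names, thresholds, owners, and the precomputed `stats` values are all assumptions, and a real setup would take these from your monitoring data.

```python
# Hypothetical thresholds and owners for critical transactions (assumptions).
CRITICAL = {
    "login": {"max_failure_rate": 0.01, "max_p95_ms": 800, "owner": "identity team"},
    "search": {"max_failure_rate": 0.02, "max_p95_ms": 1000, "owner": "search team"},
}

def breaches(stats):
    """Return [(transaction, owner, reasons)] for critical transactions over threshold."""
    result = []
    for name, limits in CRITICAL.items():
        current = stats.get(name)
        if current is None:
            continue  # no data for this transaction in the current window
        reasons = []
        if current["failure_rate"] > limits["max_failure_rate"]:
            reasons.append("failure rate")
        if current["p95_ms"] > limits["max_p95_ms"]:
            reasons.append("response time")
        if reasons:
            result.append((name, limits["owner"], reasons))
    return result

# Hypothetical measurements for the current time window.
stats = {
    "login": {"failure_rate": 0.05, "p95_ms": 400},
    "search": {"failure_rate": 0.0, "p95_ms": 1500},
    "contact": {"failure_rate": 0.30, "p95_ms": 200},
}

for name, owner, reasons in breaches(stats):
    print(f"{name}: call {owner} ({', '.join(reasons)})")
```

The uncritical `contact` transaction is ignored even though its failure rate is high, which mirrors the prioritization described above.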
7. Is the problem related to bad coding?
If application response time is getting slower, the first question is whether it is because of bad coding. Analyzing the performance hotspot down to the code level can tell you whether most of the time is spent in inefficient algorithms or in code that does not follow coding and architectural best practices:
Throwing thousands of exceptions to control program flow is not a good coding practice and also impacts performance
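The pattern in the caption above can be illustrated with a small sketch. It is written in Python for brevity and is not taken from the application shown in the screenshot; the function names and input values are hypothetical. Note that in Python `try`/`except` is perfectly idiomatic when failures are rare; the point here is the case where invalid input is the norm and an exception is raised for every bad value.

```python
def parse_with_exceptions(values):
    """Relies on an exception for every invalid value (flow control by exception)."""
    parsed, raised = [], 0
    for value in values:
        try:
            parsed.append(int(value))
        except ValueError:
            raised += 1  # each invalid value costs one raised exception
    return parsed, raised

def parse_with_check(values):
    """Validates first; handles non-negative integer strings only."""
    parsed = []
    for value in values:
        if value.isdecimal():
            parsed.append(int(value))
    return parsed

# Hypothetical input where most values are invalid.
values = ["1", "x", "22", "", "y"]

parsed, raised = parse_with_exceptions(values)
print(parsed, raised)
print(parse_with_check(values))
```

Both functions produce the same parsed values for this input, but the first one pays for an exception on every invalid entry, which adds up when thousands of them occur per request.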
8. Does the infrastructure cause an issue?
What if it is not the app itself, but the app is running low on resources provided by the infrastructure? What if the CPU required to run the Garbage Collector is not available because the already overutilized machine also runs lots of other services? In that case it is time to think about the infrastructure - better distributing these applications and services or scaling the infrastructure:
Where does the memory shortage come from? Does it impact other processes on that machine? Which processes to move to a different machine?
9. Is the AppServer the issue?
Depending on the AppServer you are using, you have multiple configuration options to optimize it for your environment. The question remains whether the AppServer might be responsible for performance issues through an incorrect setting or a corrupt deployment. Resource pool sizing (threads, database connections, ...), security settings and logging options can all impact performance. If it turns out that the AppServer is the problem, contact your IBM, Oracle, Microsoft ... specialist:
A global synchronized logging feature of IBM's WebSphere caused this performance issue, which can be resolved through configuration settings in the AppServer
10. Is the problem in the virtual machine?
Leveraging virtual compute power - whether it is from your locally running VM server farm or from one of the cloud providers - provides lots of flexibility. But it can also be the reason for performance problems if the virtual machines are not properly sized or are battling for resources with other virtual machines on the same virtual server. Knowing the impact of virtualization on the application allows you to call in the VM experts and not the app developers to solve a problem:
Understanding what is going on in EC2, Azure or your VMware ESX Server allows you to figure out whether the virtualized environment is the root cause
Have an Answer to These Questions?
Now that you have an idea about the right questions to ask before you call a war room session together - or before you accept a call into such a scenario - you can start focusing on preventing these sessions. Whether you are a developer, architect or on the business side, make sure you have the real facts available in order to get through these situations as fast as possible by calling in the RIGHT people and giving them the RIGHT data to analyze.
Better than spending time in War Rooms, however, is to reduce the number of times these situations come up. If you want to learn more about this, check out some of the other blogs we recently wrote, such as Performance-focused DevOps or - in case you happen to be getting ready for the holiday shopping season - Verify Readiness in Test & Pre-Production.