Week 1 – The Proactivity of Troubleshooting
Troubleshooting of performance problems is very often, if not almost always, viewed as a reactive activity. Frankly, I have often seen it done as reactive firefighting; effective troubleshooting, however, builds on a solid diagnostics process. If you handle troubleshooting as firefighting rather than basing it on solid diagnosis, this is a clear sign that you have not taken the right proactive measures.
The goal of troubleshooting is to resolve an immediate performance problem, ideally yesterday. Contrary to what some might expect, this does not start when the problem occurs: troubleshooting done right means having a defined process as part of your performance management activities. While we try to avoid performance problems as often as possible, we have to accept that they are a normal part of our work. To plan properly in advance, we have to define what has to happen when, and what information is necessary to troubleshoot efficiently.
The metric to measure the effectiveness of problem resolution is mean-time-to-repair (MTTR). The beauty of this metric is that it is easy to measure: it is the time from when a problem occurs to when it is solved. The actual process behind it, however, is much more complex. Let’s look at the various steps of problem resolution.
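As a minimal sketch of how MTTR can be computed, assume each incident is recorded as a pair of timestamps: when the problem occurred and when it was solved. The function name `mean_time_to_repair` and the incident data below are hypothetical and serve only as an illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average the time from occurrence to resolution over all incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Start the sum from an empty timedelta so durations add up correctly.
    total = sum((solved - occurred for occurred, solved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (time the problem occurred, time it was solved)
incidents = [
    (datetime(2010, 1, 4, 9, 0), datetime(2010, 1, 4, 11, 30)),    # 2 h 30 min
    (datetime(2010, 1, 12, 14, 0), datetime(2010, 1, 12, 14, 45)),  # 45 min
    (datetime(2010, 1, 20, 22, 15), datetime(2010, 1, 21, 1, 0)),   # 2 h 45 min
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 2:00:00
```

Note that the start time here is when the problem occurred, not when somebody noticed it, so slow detection lengthens MTTR just as slow analysis does.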
First, we have to know that we have a problem. This means we need adequate monitoring of our application as well as proper alerting and raising of incidents. Effective monitoring is therefore a prerequisite for effective problem resolution and must answer the following questions:
- What has happened?
- When did it happen?
- Who is impacted?
- What is the difference compared to before the problem?
- Why did the problem happen?
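One way to make these questions concrete is to require every raised incident to carry a field for each of them. The sketch below is only an assumption about how such a record could look; the class `Incident` and all of its field names are hypothetical, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    what: str                          # What has happened?
    when: datetime                     # When did it happen?
    impacted: list                     # Who is impacted?
    baseline_value: float              # Measurement before the problem
    observed_value: float              # Measurement during the problem
    suspected_cause: str = "unknown"   # Why did the problem happen?

    def change_vs_baseline(self):
        """Relative difference compared to before the problem."""
        return (self.observed_value - self.baseline_value) / self.baseline_value


# Hypothetical example: search response time rose from 300 ms to 1200 ms.
incident = Incident(
    what="search response time degraded",
    when=datetime(2010, 1, 12, 14, 0),
    impacted=["all logged-in users"],
    baseline_value=300.0,
    observed_value=1200.0,
)
print(incident.change_vs_baseline())  # 3.0
```

The cause defaults to "unknown" because monitoring can rarely answer the last question on its own; that is the job of the analysis step.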
The next step is problem analysis. This process step, also referred to as triage, aims at identifying the problem’s root cause. Following a structured approach is the key to success here. This is where “the rubber meets the road”: if you cannot find the root cause quickly, your processes are not effective. Experience shows that this is also where many companies have the greatest optimization potential (to put a positive spin on it). Besides detailed technical knowledge about the application, the database, the network or the operating system, the (immediate) availability of the required information is crucial. If you don’t have this information, you have to start guessing.
In the problem resolution phase the problem gets fixed. This can range from a “simple” configuration change to complex changes in the application. Choosing the optimal solution for a problem is a challenge in itself, and it requires more brain work than actual coding. Very often the smartest people are therefore assigned to it, as the solution not only has to be reliable but also has to be developed as quickly as possible. This also means that scarce development resources are blocked from the other work they need to do.
Regression analysis ideally happens in parallel with the resolution process. Each of us knows regression problems far too well. Avoiding them efficiently was and is one of the key drivers for increased test automation. A central concept in regression analysis is having a baseline to measure against, in addition to the actual test cases of course. If you have not collected such information, you cannot do any regression analysis. Ideally you have this information stored in some performance repository. Otherwise you will first have to run tests to get a baseline, which slows down your resolution process.
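To illustrate the baseline idea, here is a minimal sketch that compares current response times against a stored baseline. It assumes the performance repository can be read as a plain dictionary of response-time samples per transaction; the function `find_regressions`, the 10% tolerance and the sample data are all hypothetical.

```python
from statistics import median


def find_regressions(baseline, current, tolerance=0.10):
    """Return transactions whose median response time grew by more than
    `tolerance` (as a fraction) relative to the baseline."""
    regressions = {}
    for name, base_samples in baseline.items():
        if name not in current:
            continue  # no current measurement, nothing to compare
        base = median(base_samples)
        now = median(current[name])
        if base > 0 and (now - base) / base > tolerance:
            regressions[name] = (base, now)
    return regressions


# Hypothetical response times in milliseconds
baseline = {"login": [120, 125, 118], "search": [300, 310, 295]}
current = {"login": [122, 127, 119], "search": [420, 450, 410]}

print(find_regressions(baseline, current))  # {'search': (300, 420)}
```

The median is used here instead of the average so that a single outlier does not trigger or hide a regression; that is a design choice of this sketch, not a requirement of regression analysis.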
When development is finished, the application is tested in a final large-scale system test. (I have heard rumors of projects where this does not happen.) All these activities occur under massive time pressure and heavy management scrutiny. If you find new problems in this phase, you have to make sure they get fixed immediately. Part of this process step is therefore having all required analysis data right at hand: having to re-run a test to get proper diagnostics data is one of the worst things that can happen to you.
Back in production, the application must be continuously monitored to ensure that the problem is really solved and does not happen again. Additionally, you have to verify that no other parts of the application are negatively impacted by the changes.
As we can see, problem resolution (or troubleshooting) processes are highly complex. The involvement of a number of departments makes proper information delivery vital for success. The first impression that these processes are purely reactive also proves wrong. The definition of a proper process and its responsibilities, the information to collect, and the availability of the required infrastructure must all be managed beforehand, that is, proactively.
This post is part of our 2010 Application Performance Almanach.

That was very kind of you to put a positive spin on the inability to find the root cause. I think many times a lot of responsibility is placed on a single individual. This works well most of the time, but when that person is unavailable things can get out of hand pretty quickly. Of course, this may not be true in larger companies.
Myra Shelley
Management Software
http://www.managepro.com
Myra,
I would not call it inability. Most organizations will eventually find the root cause. However, the investment in time and human resources is higher than it should be.
Even the cleverest people have trouble identifying the root cause if they are missing proper data.