In my last post, I discussed the issues facing IT Operations today that make real-time analytics-based solutions a “must have”. In this post, I’d like to address more specifically the problems these solutions can solve. Since I want to give enough detail on the problems and how real-time analytics can help, I’ll spread this over a couple of posts.
“Too many alerts that don’t help me solve problems”
One of the biggest issues facing Operations teams is that when performance problems occur with mission-critical applications, they have to expend a massive amount of manual effort to identify and repair them. They sift through an endless stream of alerts from their siloed monitoring solutions and try to correlate them by hand, based on tribal knowledge. Much of this effort stems from static threshold-based monitoring solutions, which generate alert storms for perfectly normal behavior while masking the abnormalities that are the earliest precursors to problems.
Real-time analytics-based solutions solve this by eliminating the need for static thresholds. Instead, these solutions learn the normal behavior of every metric being collected. Armed with this understanding of normal, they alert only on the abnormal behaviors that are the true precursors to problems, so there is no longer any need to sift through alert storms trying to determine which alerts are relevant to the current problem and which are not. Sophisticated dynamic thresholding algorithms, built on clustering calculations, learn normal behavior down to the most granular level possible. Algorithmic sophistication is critical because different metrics behave very differently and cannot all be modeled with a single algorithm or assumed distribution. These solutions must also have mechanisms to handle seasonal events that differ greatly from normal behavior but do not indicate a real problem. Without this level of sophistication, large numbers of false positives result, as we saw in early forays into dynamic thresholding that assumed normally distributed data (which IT metric data rarely is). My company’s solution, Integrien Alive, can import large amounts of historical data very quickly and use it to calculate normal behavior immediately, removing the need for a learning period. Alive also performs topology-based rollup of alerts, producing a smaller total number of alerts with better context. The bottom line: dynamic threshold-based alerting eliminates a ton of manual effort in problem solving.
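To make the idea concrete, here is a minimal sketch of seasonal, distribution-free dynamic thresholding. This is my own illustration, not Alive’s algorithm: it keeps a separate empirical-quantile band per metric per hour of the week, so it never assumes normally distributed data, and it absorbs weekly seasonality by construction.

```python
# A minimal sketch of dynamic thresholding -- NOT Integrien Alive's
# actual algorithm. It learns a per-metric "normal" band from samples
# using empirical quantiles (no normality assumption) and keeps a
# separate band per hour-of-week to absorb seasonal patterns.
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

class DynamicThreshold:
    def __init__(self, lower_q: float = 0.01, upper_q: float = 0.99):
        self.lower_q = lower_q
        self.upper_q = upper_q
        # One sample list per (metric, hour-of-week) bucket.
        self.samples: Dict[Tuple[str, int], List[float]] = defaultdict(list)

    @staticmethod
    def _bucket(ts: datetime) -> int:
        # 168 seasonal buckets: hour 0 of Monday .. hour 23 of Sunday.
        return ts.weekday() * 24 + ts.hour

    def learn(self, metric: str, ts: datetime, value: float) -> None:
        self.samples[(metric, self._bucket(ts))].append(value)

    def is_abnormal(self, metric: str, ts: datetime, value: float) -> bool:
        data = sorted(self.samples[(metric, self._bucket(ts))])
        if len(data) < 20:          # too little history: stay quiet
            return False
        lo = data[int(self.lower_q * (len(data) - 1))]
        hi = data[int(self.upper_q * (len(data) - 1))]
        return not (lo <= value <= hi)
```

Bulk-loading historical samples through learn() is the analogue of seeding the model with imported history, so alerting can begin immediately instead of waiting out a learning period.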
“We don’t understand what leads to problems so we are always reactive”
Because the IT Operations team relies on human correlation of alerts after a problem occurs, it is always in a reactive state. In some cases the team may have tribal knowledge of the behavior patterns of certain problems that gives them a heads-up, but even then it is usually too late to act before the problem occurs. Some Operations teams attempt to capture their tribal knowledge in correlation rules, as in the hypothetical sketch below. These rules may help for a while, but as the business or infrastructure changes they soon become obsolete, creating still more manual effort to maintain them. The root problem is the sheer number of devices and metrics being managed in today’s infrastructures: no human can correlate hundreds of thousands (even millions) of metrics from tens of thousands of devices. This is another problem that real-time analytics solutions are built to solve.
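For illustration only, here is what such a hand-written rule might look like (every name and threshold is invented):

```python
# A hypothetical hand-coded correlation rule (all names and thresholds
# invented). It freezes today's tribal knowledge: "if web latency and
# database CPU are both high, page the DBA." Add a caching tier, move
# the database, or resize a host, and the rule silently stops matching
# the real topology -- exactly the obsolescence described above.
from typing import Dict, Optional

def correlate(metrics: Dict[str, float]) -> Optional[str]:
    if metrics.get("web01.latency_ms", 0) > 500 and \
       metrics.get("db01.cpu_pct", 0) > 90:
        return "page-dba"
    return None
```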
Armed with the knowledge of normal, these solutions can correlate previous alert and metric behaviors and predict future abnormal behavior based on currently observed abnormalities. For example, the solution can alert to a problem in the application server tier based on the number of devices in that tier that are performing abnormally. The Level 3 application expert receiving this alert is also provided with predictions of future abnormal behavior; in this case, the alert indicates that a key database performance indicator will be breached in 15 minutes or less with 86% probability, bringing down the database. The Level 3 expert forwards this information to the DBA, who makes a quick configuration change that avoids the database crash. This type of automated correlation allows a proactive approach to problem solving that isn’t possible with manual methods.
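A hedged sketch of what tier-level alerting with a predictive probability might look like (this is my illustration, not Alive’s correlation engine; the metric names and bookkeeping are invented):

```python
# If enough devices in a tier are abnormal, raise an alert and attach a
# probability estimated from how often this pattern historically
# preceded a key-indicator breach. All names here are hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class PredictiveAlert:
    tier: str
    abnormal_fraction: float   # share of devices behaving abnormally
    predicted_breach: str      # key indicator expected to be breached
    probability: float         # estimated from historical outcomes
    lead_time_minutes: int

def check_tier(tier: str,
               device_abnormal: Dict[str, bool],
               history: List[Tuple[bool, bool]],
               alert_fraction: float = 0.5) -> Optional[PredictiveAlert]:
    """history holds (tier_was_abnormal, breach_followed) pairs recorded
    for past 15-minute windows -- hypothetical bookkeeping."""
    frac = sum(device_abnormal.values()) / max(len(device_abnormal), 1)
    if frac < alert_fraction:
        return None
    # Empirical P(key-indicator breach within 15 min | tier abnormal).
    outcomes = [breach for was_abnormal, breach in history if was_abnormal]
    prob = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return PredictiveAlert(tier, frac, "db_response_time", prob, 15)
```

Fed live device states and a window history in which tier-wide abnormality usually preceded a database breach, this sketch would emit exactly the kind of predictive alert described above.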
My company’s Integrien Alive solution takes correlation one step further, allowing users to set key performance indicators at the business, user experience, or IT infrastructure level. When one of these key indicators is breached, Alive captures a model of the building pattern of abnormalities that led to the problem, reaching back up to an hour before it occurred. These problem models (called Problem Fingerprints™) focus troubleshooting efforts and reduce Mean Time To Identify (MTTI) and Mean Time To Repair (MTTR) the first time the problem occurs, by indicating exactly which tiers of the application (and which specific metrics) are performing abnormally. Once these models have been captured, Alive can scan real-time metric data for a return of the pattern. If the pattern defined in one of the Problem Fingerprints is matched with high enough probability, Alive sends a predictive alert informing the Operations team of the looming problem: the probability it will occur, when it is likely to occur, what to look for, and how it was solved previously.
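Here is a deliberately simplified sketch of fingerprint capture and matching, assuming a set-similarity model (the actual Problem Fingerprint algorithm is proprietary and certainly richer): a fingerprint is the set of (resource, metric) pairs that went abnormal in the hour before a key-indicator breach, and a live window is matched against it with Jaccard similarity.

```python
# A simplified illustration assuming a set-similarity model; not the
# actual (proprietary) Problem Fingerprint algorithm.
from typing import Optional, Set, Tuple

Abnormality = Tuple[str, str]  # (resource, metric), e.g. ("db01", "lock_waits")

def capture_fingerprint(pre_breach: Set[Abnormality]) -> Set[Abnormality]:
    # A real system would also record ordering and lead times, so the
    # predictive alert can say *when* the breach is likely to occur.
    return set(pre_breach)

def match_score(fingerprint: Set[Abnormality], live: Set[Abnormality]) -> float:
    # Jaccard similarity: 1.0 means the live window exactly repeats
    # the pattern that preceded the previous breach.
    union = fingerprint | live
    return len(fingerprint & live) / len(union) if union else 0.0

def predictive_alert(fingerprint: Set[Abnormality],
                     live: Set[Abnormality],
                     threshold: float = 0.8) -> Optional[str]:
    score = match_score(fingerprint, live)
    if score >= threshold:
        return f"Known problem pattern matched ({score:.0%}); breach likely"
    return None
```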
As we’ve seen in this discussion, real-time analytics solutions are all about:
- large reductions in the manual effort associated with static threshold-based alerting
- increased focus for troubleshooting efforts to reduce MTTI/MTTR
- predictive alerting to allow proactive performance management
In my next post we’ll discuss additional problems solved by real-time analytics. We’ll also delve into Business Service Management (BSM) and how these solutions are an essential catalyst to achieving it.