Category — Blog
Quest Software’s Foglight Group is Blogging
If you’re a Quest Software Foglight user, check out the Foglight community that’s recently launched. (Where was my heads-up, Tyler and Greg?)
Greg Crow has a post sharing some insight into Quest’s thoughts on the Business Service Management (BSM) space and Tyler Jewell shares some of Quest’s thoughts in the Performance Management space.
Congrats to Quest Software and the Foglight team for opening up some and sharing your thoughts and ideas with the community. To my knowledge, there are ZERO IBM Tivoli, HP, CA, BMC or Compuware Product Managers blogging outside the firewall or actively participating in public forums. Keep it up Tyler, Greg and crew. Post often!
May 27, 2008 4 Comments
What Problems Can Real Time Analytics Solutions Solve?
In my last post, I discussed the issues facing IT Operations today that make real time analytics-based solutions a “must have”. In this post, I’d like to address more specifically the problems they can solve. Since I want to give enough detail on the problems and how real time analytics can help, I’ll spread this over a couple of posts.
“Too many alerts that don’t help me solve problems”
One of the biggest issues facing Operations teams is that when performance problems occur with mission critical applications, they have to expend a massive amount of manual effort to identify and repair them. They sift through the endless stream of alerts from their siloed monitoring solutions and try to humanly correlate them based on tribal knowledge. Much of this effort is because static threshold-based monitoring solutions give them alert storms for perfectly normal behavior and mask abnormalities that are the earliest precursors to problems.
Real time analytics-based solutions solve this by obviating the need for static thresholds. Instead, these solutions learn the normal behavior of every metric being collected. Armed with this understanding of normal, these solutions alert only to the abnormal behaviors that are the true precursors to problems. There is no longer a need to sift through alert storms and try to determine which alerts are relevant to the current problem and which are not. Sophisticated dynamic thresholding algorithms are used to learn the normal behavior down to the most granular level possible using clustering calculations. Algorithmic sophistication is critical, as different metrics behave very differently and cannot be modeled with a single algorithm or assumed distribution. These solutions must also have mechanisms to handle seasonal events that may differ greatly from normal behavior but do not indicate a real problem. Without this level of sophistication, a large number of false positives result, as was seen in early forays into dynamic thresholding that assumed normally distributed data (which IT metric data rarely is).
My company’s solution, Integrien Alive, provides mechanisms to import large amounts of historical data very quickly and use it to calculate normal behavior immediately, removing the need for a learning period. Alive also performs topology-based rollup of alerts to provide a smaller total number of alerts with better context. The bottom line is that dynamic threshold-based alerting eliminates a ton of manual effort in problem solving.
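To make the idea of learning "normal" a bit more concrete, here is a minimal Python sketch. It is not how Integrien Alive works internally (that isn't described in this post). It assumes a simple stand-in technique: a separate band per hour of day, built from the median and the median absolute deviation (MAD), so that no normal distribution is assumed and a daily seasonal pattern is respected. The function names, the `k` multiplier and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def learn_baseline(samples):
    """Learn (median, MAD) per hour of day from (hour, value) history.

    Assumption: one bucket per hour of day is enough to capture seasonality.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    baseline = {}
    for hour, values in by_hour.items():
        mid = median(values)
        mad = median([abs(v - mid) for v in values])
        baseline[hour] = (mid, mad)
    return baseline


def is_abnormal(baseline, hour, value, k=5.0):
    """Flag a value that sits more than k MADs away from that hour's median."""
    mid, mad = baseline[hour]
    spread = max(mad, 1e-9)  # avoid a zero-width band on perfectly flat history
    return abs(value - mid) > k * spread


if __name__ == "__main__":
    # Hypothetical CPU history: quiet at 03:00, busy at 14:00.
    history = [(3, 10), (3, 11), (3, 9), (3, 10), (3, 12),
               (14, 80), (14, 85), (14, 78), (14, 82), (14, 90)]
    baseline = learn_baseline(history)
    print(is_abnormal(baseline, 14, 85))  # busy-hour value
    print(is_abnormal(baseline, 3, 85))   # same value in the quiet hour
```

The point of the sketch is that the same reading (85) can be normal at 14:00 and abnormal at 03:00, which a single static threshold cannot express.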
“We don’t understand what leads to problems so we are always reactive”
Because the IT Operations team is relying on human correlation of alerts after a problem occurs, they are always in a reactive state. In some cases they may have tribal knowledge of the patterns of behavior of certain problems that gives them a heads-up; however, even in these cases it is most often too late to do anything before the problem occurs. Some Operations teams attempt to capture their tribal knowledge in correlation rules. These rules may help for a while; however, as the business or infrastructure changes, they soon become obsolete, resulting in more manual effort to manage them. The problem is the sheer number of devices and metrics being managed in today’s infrastructures. There is no way a human can correlate hundreds of thousands (even millions) of metrics from tens of thousands of devices. This is another problem that real time analytics solutions are built to solve.
Armed with the knowledge of normal, these solutions can correlate previous alert and metric behaviors and predict future abnormal behaviors based on currently observed abnormalities. For example, the solution can alert to a problem in the application server tier based on the number of devices in that tier that are performing abnormally. The Level 3 application expert receiving this alert is also provided with predictions of future abnormal behavior. In this case, the alert indicates that a key database performance indicator will be breached in 15 minutes or less with 86% probability, bringing down the database. The Level 3 application expert forwards this information to the DBA, who then makes a quick configuration change that avoids the database crash. This type of automated correlation allows a proactive approach to problem solving that isn’t possible with manual methods.
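As a rough illustration of the two ideas in that example (a tier-level alert and a predicted follow-on breach), here is a small Python sketch. It is an assumption-laden toy, not any vendor's correlation engine: the tier alert is just the fraction of devices currently flagged abnormal, and the "probability" is the share of past precursor events that were followed by a breach within a time window. All names, timestamps and the 15-minute window are hypothetical.

```python
def tier_abnormal_fraction(device_states):
    """device_states maps device name -> True if currently abnormal."""
    if not device_states:
        return 0.0
    return sum(1 for abnormal in device_states.values() if abnormal) / len(device_states)


def follow_on_probability(precursor_times, breach_times, window):
    """Share of past precursor events followed by a breach within `window` minutes.

    Times are minutes on a shared clock; this is a simple empirical frequency,
    not a calibrated statistical model.
    """
    if not precursor_times:
        return 0.0
    hits = sum(
        1 for t in precursor_times
        if any(0 <= b - t <= window for b in breach_times)
    )
    return hits / len(precursor_times)


if __name__ == "__main__":
    app_tier = {"app1": True, "app2": True, "app3": False, "app4": True}
    print(tier_abnormal_fraction(app_tier))

    # Hypothetical history: when the app tier went abnormal, and when the
    # database indicator was breached.
    precursors = [0, 100, 200, 300]
    breaches = [10, 112, 330]
    print(follow_on_probability(precursors, breaches, window=15))
```

A real solution would weigh many more signals, but the shape is the same: observe what is abnormal now, then look up how often that situation has led to a specific problem before.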
My company’s Integrien Alive solution takes correlation one step further, allowing users to set key performance indicators at the business, user experience, or IT infrastructure level. When these key indicators are breached, Alive captures a model of the building pattern of abnormalities that led to the problem, up to an hour before it occurred. These problem models (called Problem Fingerprints™) focus troubleshooting efforts and reduce Mean Time To Identify (MTTI) and Mean Time To Repair (MTTR) the first time the problem occurs by indicating exactly which tiers of the application (and what specific metrics) are performing abnormally. Once these models have been captured, Alive can scan real-time metric data for a return of that pattern. If a problem pattern defined in one of the Problem Fingerprints is matched with high enough probability, Alive sends a predictive alert informing the Operations team of the looming problem, the probability the problem will occur, when it is likely to occur, what to look for and how it was solved previously.
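For readers who think in code, the pattern-matching step can be pictured with a very simplified Python sketch. This is not the actual Problem Fingerprints implementation. It assumes a fingerprint is just a named set of metrics that were abnormal before a past problem, and it scores a match as the share of those metrics that are abnormal right now. That score is a plain overlap ratio, not a calibrated probability, and the metric names and the 0.7 cutoff are hypothetical.

```python
def match_score(fingerprint, abnormal_now):
    """Share of the fingerprint's metrics that are abnormal right now (0.0 to 1.0)."""
    fingerprint = set(fingerprint)
    if not fingerprint:
        return 0.0
    return len(fingerprint & set(abnormal_now)) / len(fingerprint)


def predictive_alerts(fingerprints, abnormal_now, min_score=0.7):
    """Return (name, score) for each stored fingerprint matched at or above min_score."""
    matches = []
    for name, metrics in fingerprints.items():
        score = match_score(metrics, abnormal_now)
        if score >= min_score:
            matches.append((name, score))
    # Strongest match first.
    return sorted(matches, key=lambda m: m[1], reverse=True)


if __name__ == "__main__":
    stored = {
        "db-crash": {"app.heap", "app.threads", "db.locks", "db.latency"},
        "web-slowdown": {"web.cpu", "web.queue", "cache.miss"},
    }
    abnormal_now = {"app.heap", "db.locks", "db.latency", "web.cpu"}
    for name, score in predictive_alerts(stored, abnormal_now):
        print(name, score)
```

In this toy data, three of the four "db-crash" metrics are already abnormal, so that fingerprint clears the cutoff while "web-slowdown" does not.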
As we’ve seen in this discussion, real time analytics solutions are all about:
- large reductions in the manual effort associated with static threshold-based alerting
- increased focus for troubleshooting efforts to reduce MTTI/MTTR
- predictive alerting to allow proactive performance management
In my next post we’ll discuss additional problems solved by real time analytics. We’ll also delve into BSM and how these solutions are an essential catalyst to achieving it.
April 3, 2008 1 Comment
Why Real Time Analytics?
First of all, I’d like to thank Doug for inviting me to the conversation. I’m really looking forward to discussing how solutions such as Integrien’s will be an essential part of the future of Business Service Management. We’ll look at what problems these solutions address, how they do what they do and real life implementation and operationalization issues. Now to the topic at hand! Why Real Time Analytics? Why now?
IT Operations executives are facing a dichotomous situation today. On one hand, their infrastructures are growing rapidly. I was recently talking with the Senior Director of Operations at a large social networking site and he told me that his server infrastructure (at the time 15K servers) was growing at 6% a week! They are also dealing with increasing complexity due to new technologies like virtualization and SOAs. Look at virtualization alone. While it can provide the tremendous cost benefits of server consolidation, it increases the complexity of the management problem considerably. You now have to deal with the hypervisor, virtual machines and guest operating systems as problem sources, in addition to the physical servers, O/S and applications. Let’s not even get into the issue of dynamically moving VMs…
On the other hand, Operations executives are being asked to reduce their budgets or at the very least keep them flat. Consider that historically, increasing infrastructure and complexity have been handled by throwing more bodies at it. Hmmm…, if 70% or more of the Operations budget is currently labor spend and you have to reduce or keep it flat, how can you scale to meet the needs of the business?
There is an obvious parallel with what happened in manufacturing 30+ years ago with the advent of Total Quality Management (TQM) practices. Increasing complexity and scale in the manufacturing process was making it impossible to keep up with the quality and just-in-time delivery demands imposed by customers. Deterministic, rules-based methods of analyzing manufacturing lines that relied on trying to identify and measure every variable to stop problems from slipping by were no longer working. Some manufacturers (such as Toyota) started to take a different approach using advanced probability and statistics, collecting a subset of variables and looking at possible outcomes to get more proactive in approaching problems. The manufacturers that adopted these techniques gained a competitive advantage that persists to this day.
IT Operations is in a similar situation today. Alerting based on static monitoring thresholds and the collection of more metric data at faster intervals simply hasn’t provided a proactive approach to business service performance and availability issues. It also requires massive manual effort of IT staff to process the alert storms and perform manual or rule-based correlation to solve problems. Consider as well that in a system with thousands of servers and hundreds of thousands, even millions of metrics, the correlation problem is humanly unsolvable. Manual efforts can no longer scale in the face of increasing infrastructure and complexity.
That is why real time analytics-based solutions that leverage existing monitoring infrastructures are a necessity in today’s environment. I’ll go into the requirements for these solutions in a future post; however, the basic premise is to use advanced statistics and probability to learn normal system behaviors, alert only to abnormal behaviors that are true precursors to problems, and apply advanced correlation techniques to predict future abnormalities and specific performance and availability problems. This new approach will be a competitive advantage for the companies who adopt it now, just as it was for the early adopters of similar approaches in manufacturing years earlier.
February 27, 2008 5 Comments
Managed Objects joins the BSM Conversation
Ahh, Frank, you’ve been holding out on me. I guess a guest blog post is out of the question now?
Managed Objects has joined in the Business Service Management (BSM) conversation and launched a new blog. A cool domain name, blog title and a fresh feel, “BSM Communique’ : An Exclusive Blog from the Pioneers, Innovators and Leaders of BSM” has been opened for business. Take a look and join in the conversation.
I hope to see lots of good content within this blog. I hope that they share insight into their technologies and products and how practitioners can be successful implementing BSM using the Managed Objects suite. I’ll also continue to encourage you to talk about how clients who use other domain tools from IBM, HP, BMC, etc. can better prepare their investments in those technologies so they can fully leverage the capabilities within the Managed Objects suite.
I hope that this blog isn’t just another sales and marketing vehicle (I know you will control that, right Frank?) and that it can be of real value to prospective and actual Managed Objects clients. If the message is just to the higher level audience, I’m afraid little will be done to advance BSM success within their client base. Use your blog to clearly define what BSM means to Managed Objects. How are you different? What sets you apart from the growing BSM field? Why do I want to invest in BSM with Managed Objects? Be open. Be honest. Be transparent!
Welcome to the BSM conversation! (Now how do I leave comments on your blog?)
February 21, 2008 No Comments
BSM “Cool Post of the Day”
I don’t know what’s cooler, that the President of FireScope went to the Van Halen concert or that he compared Business Service Management (BSM) to Van Halen’s innovative approach to going on the road with Eddie Van Halen’s son Wolfgang.
What can I draw from this? I think that the next generation of BSM will require out-of-the-box thinking and new, innovative ways to address the challenges of IT and business alignment. The next generation of BSM will take a different approach. It will find a new way of doing the things that the incumbents continue to do. It will find the diamonds in the rough, like making BSM easier to implement and maintain, quicker time to value and powerful visualizations that get the point across in less than 10 seconds.
BSM compared to Van Halen and the President of an emerging BSM company who made the analogy…priceless!
January 28, 2008 3 Comments
BMC’er Brian Alexander blogs about BSM
A newcomer to the Business Service Management (BSM) conversation has emerged from BMC over at BSM Views. He’s got a great tag line for his blog, “Taking the BS out of BSM”! Wish I would’ve thought of that!!
Brian shares his blog’s mission on his About page.
“As IT becomes an increasingly critical driver of business, the pace of maturity and change increases within the IT department, and many smart strategies are being employed with, at times, great results. IT now has a direct effect on the bottom line. In my work with BMC Software, I talk IT strategy with diverse organizations weekly basis, and this blog is a attempt to share many of the good ideas I am exposed to.”
Welcome to the BSM conversation Brian! I look forward to what you have to say and helping you “Take the BS out of BSM”.
January 25, 2008 No Comments
Tivoli TTUC 2007 Blog
Looks like our execs started a blog during the TTUC conference last week with the intention of getting our valued clients involved in the conversation on how Tivoli can improve.
I hope this stays alive and can serve as a viable way to dialogue directly with those who make things happen here. This could be the perfect forum for our “tools group” heroes, who are a few layers removed from dealing with management and IBM Tivoli routinely, to share their stories from the deckplates of their organization.
Not much here yet, but all it takes is someone to jump in!
TTUC 2007 — Heroic stories, lessons learned and Service Management value
May 17, 2007 No Comments
