thoughts on business, service and technology operations and management

Category — Integrien

Top 5 Reasons for a Predictive/Proactive Solution

Let’s see if we can find the five leading reasons (maybe more, maybe fewer) why we need a proactive or predictive solution these days.

#1: I don’t have effective change control in place that spans into and incorporates the monitoring that I do on end point systems, applications and services.

#2: My boss wants me to “do more with less” so I need to figure out a way to clean up the mess I have today in my resource monitoring and event management solution.

#3: I know that when this thingy begins to slow down and that thingy drops packets that my transactions begin to fail. Now how do I write that policy to correlate all my thingys?

#4: My tool is better than your tool. I need to figure out a way to make you believe that your tool is always wrong so you’ll work my trouble ticket.

#5: My manager told us that we need to become more proactive. I sent the DBA an email to tell him that we were going to have an outage on this database in three hours. He’d already gone home for the day.

These are tongue in cheek, but the underlying themes of each one are very valid in nearly all operations and application support groups. Why are we interested in predictive and proactive tools when we probably don’t have our own house in order in the first place?

How would you write the business justification and capital purchase plan to explain why you need them? How will you quantify your reasoning? Are you willing to give up one or more FTEs to purchase this solution? Have you had an honest look into the far-reaching corners of your organization to see where the real root causes may be that spark your interest in these solutions? Are you ‘really’ ready to try to be proactive or predictive? Are you ‘really’ doing reactive well? What do predictive and proactive really mean to you? How would you describe the core capabilities such a solution should have? How would you associate expected value and ROI with having those capabilities? Where else should we be looking for help in these areas (BI, operational BI, BPM, BAM, analytic databases, statistical modeling and forecasting, etc.)?

Please share your thoughts and ideas on why proactive and predictive solutions are of interest these days.

June 10, 2008   5 Comments

Does a “Proactive/Predictive” Tool make for a “Proactive/Predictive” Organization?

Just some rambling thoughts here…feel free to join in.

Is another tool what’s really required here? What should/could be done in domain specific resource monitoring solutions that addresses the problems at the edge? Should I really be monitoring everything that comes out of the box in a default configuration? Why do I have all of these profiles, situations, thresholds, events, etc. in the first place? Do I even know what I’m monitoring and why?

What if I have a multi-vendor, multi-sourced environment where I may or may not have visibility? What if I don’t have a CMDB or other source of topology, relationships and dependencies? What if I don’t even know the state and status of the applications, databases or services to begin with? What will I be able to do with investments into these technologies?

What if I have adopted a “manager of managers” concept where I have a consolidated operations eventing environment with feeds from across the entire business environment (facilities, plant, IT, datacenter, logistics, telephony, manufacturing, contact centers, etc.)? Shouldn’t this dynamic “learning” and “thresholding” concept really be applied at this level for some sort of “intelligent event management” free from manual intervention, policies, codebooks, etc.? How about the context of the business calendar and schedule merged with the IT operations calendar and schedule? I doubt that this can all be “learned” magically.

If I invest in a BMC ProactiveNet, Netuitive or Integrien (or other fundamental dynamic “learning” or “trending” tool - my favorite was a company called Premonitia, now defunct, based on research from acoustic modeling of whales and shrimp IIRC), how will I recognize and measure the value from that investment? How should the operations environment change to adopt the promises of the “secret sauce” within these emerging technology areas? Will IT operations and second/third tier support teams need to change the ways they work today? If so, how? Does IT operations know how to respond to a future state that hasn’t occurred or to someone stating that a service is “slow”? I think most operations and support teams are still in their infancy here.

I’m all for emerging technologies that speak towards making the lives of the folks on the front line better and for sensing, isolating and resolving issues within complex IT environments before they impact the business services, but will investing in these tools really improve the status quo within the typical operations environment? The Next Generation Operations Center, Command Center, Service Management Center or whatever we want to call it must be enabled with these types of technology, but it must also be prepared to think, operate and respond differently than it does today.

How are you changing? Will you change? Where’s your value proposition? Is it at the front line, second/third line of the support process, at the LoB? Is it about efficiencies in workflow? Do more, with less? Automation? Availability? Becoming proactive? Do you know the real root causes prompting your interests in this technology? What are your vendors doing about it? What is your monitoring tools group doing about it? Should they be doing something different?

Please share your thoughts on how best to operationalize and really recognize value from your investments into these technologies or what you’re doing to address the real root causes of the symptoms this technology addresses.

June 3, 2008   13 Comments

What Problems Can Real Time Analytics Solutions Solve?

In my last post, I discussed the issues facing IT Operations today that make real time analytics-based solutions a “must have”. In this post, I’d like to address more specifically the problems they can solve. Since I want to give enough detail on the problems and how real time analytics can help, I’ll spread this over a couple of posts.

“Too many alerts that don’t help me solve problems”

One of the biggest issues facing Operations teams is that when performance problems occur with mission critical applications, they have to expend a massive amount of manual effort to identify and repair them. They sift through the endless stream of alerts from their siloed monitoring solutions and try to humanly correlate them based on tribal knowledge. Much of this effort is because static threshold-based monitoring solutions give them alert storms for perfectly normal behavior and mask abnormalities that are the earliest precursors to problems.

Real time analytics-based solutions solve this by obviating the need for static thresholds. Instead, these solutions learn the normal behavior of every metric being collected. Armed with this understanding of normal, these solutions alert only to the abnormal behaviors that are the true precursors to problems. There is no longer a need to sift through alert storms and try to determine which alerts are relevant to the current problem and which are not. Sophisticated dynamic thresholding algorithms are used to learn the normal behavior down to the most granular level possible using clustering calculations. Algorithmic sophistication is critical as different metrics behave very differently and cannot be modeled with a single algorithm or assumed distribution. These solutions must also have mechanisms to handle seasonal events that may differ greatly from normal behavior, but do not indicate a real problem. Without this level of sophistication, large numbers of false positives result, as was seen in early forays into dynamic thresholding that assumed normally distributed data (which IT metric data rarely is). My company’s solution, Integrien Alive, provides mechanisms to import large amounts of historical data very quickly and use it to calculate normal behavior immediately to remove the need for a learning period. Alive also performs topology-based rollup of alerts to provide a smaller total number of alerts with better context. The bottom line is that dynamic threshold-based alerting eliminates a ton of manual effort in problem solving.
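To make the idea of learning “normal” a little more concrete, here is a minimal sketch of one simple way to build a dynamic band for a single metric without assuming normally distributed data. It is not Alive’s algorithm. The function names, the hour-of-week bucketing (a crude stand-in for seasonality handling) and the 5th/95th percentile cutoffs are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def learn_baseline(samples, low_pct=5, high_pct=95):
    """Learn a (low, high) band per hour-of-week from historical samples.

    samples: iterable of (hour_of_week, value) pairs, hour_of_week in 0..167.
    Percentiles are used instead of mean/stddev so that no particular
    distribution is assumed (an illustrative choice, not a vendor method).
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)

    baseline = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # not enough history to say what normal looks like
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        baseline[hour] = (cuts[low_pct - 1], cuts[high_pct - 1])
    return baseline


def is_abnormal(baseline, hour, value):
    """True if value falls outside the learned band for that hour."""
    band = baseline.get(hour)
    if band is None:
        return False  # no baseline yet, so stay quiet rather than guess
    low, high = band
    return value < low or value > high


# Hypothetical history: CPU% is high during a busy hour, low overnight.
history = [(9, v) for v in range(70, 91)] + [(3, v) for v in range(5, 16)]
baseline = learn_baseline(history)

print(is_abnormal(baseline, 9, 80))  # busy hour, typical load
print(is_abnormal(baseline, 3, 60))  # quiet hour, unusual load
```

With this made-up history, a static threshold of 75% would fire on perfectly normal 9 a.m. load and stay silent for a 60% reading at 3 a.m., which is the alert storm and masking problem described above. The per-hour band treats the first as normal and flags the second.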

“We don’t understand what leads to problems so we are always reactive”

Because the IT Operations team is relying on human correlation of alerts after a problem occurs, they are always in a reactive state. In some cases they may have tribal knowledge of the patterns of behavior of certain problems that gives them a heads up; however, even in these cases it is most often too late to do anything before the problem occurs. Some Operations teams attempt to capture their tribal knowledge in correlation rules. These rules may help for a while; however, as the business or infrastructure changes they soon become obsolete, resulting in more manual effort to manage them. The problem is the sheer number of devices and metrics being managed in today’s infrastructures. There is no way a human can correlate hundreds of thousands (even millions) of metrics from tens of thousands of devices. This is another problem that real time analytics solutions are built to solve.

Armed with the knowledge of normal, these solutions can correlate previous alert and metric behaviors and predict future abnormal behaviors based on currently observed abnormalities. For example, the solution can alert to a problem in the application server tier based on the number of devices in that tier that are performing abnormally. The Level 3 application expert receiving this alert is also provided with predictions of future abnormal behavior. In this case, the alert indicates that a key database performance indicator will be breached in 15 minutes or less with 86% probability, bringing down the database. The Level 3 application expert forwards this information to the DBA, who then makes a quick configuration change that avoids the database crash. This type of automated correlation allows a proactive approach to problem solving that isn’t possible with manual methods.
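A rough way to see where a statement like “breached in 15 minutes or less with 86% probability” could come from is to count, in history, how often one abnormality was followed by another within a time window. The sketch below does only that. It is an illustrative assumption, not how any particular product computes its predictions, and the function name and the minutes-since-start timestamps are made up for the example.

```python
from bisect import bisect_right


def follow_probability(precursor_times, target_times, window_minutes=15):
    """Estimate P(target abnormality within the window | precursor abnormality).

    Both arguments are lists of timestamps in minutes. The estimate is the
    fraction of precursor events that were followed by a target event no
    more than window_minutes later.
    """
    if not precursor_times:
        return 0.0
    targets = sorted(target_times)
    followed = 0
    for t in precursor_times:
        i = bisect_right(targets, t)  # first target strictly after t
        if i < len(targets) and targets[i] - t <= window_minutes:
            followed += 1
    return followed / len(precursor_times)


# Hypothetical history: app-tier abnormalities and database KPI breaches.
app_tier_abnormal = [0, 100, 200, 300]
db_kpi_breached = [10, 112, 260, 305]

print(follow_probability(app_tier_abnormal, db_kpi_breached))
```

In this made-up history, three of the four application-tier abnormalities were followed by a database breach within 15 minutes, so the estimate is 0.75. A real solution has to do this across an enormous number of metric combinations and guard against coincidence, which is exactly why it cannot be done by hand.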

My company’s Integrien Alive solution takes correlation one step further, allowing users to set key performance indicators at the business, user experience, or IT infrastructure level. When these key indicators are breached, Alive captures a model of the building pattern of abnormalities that led to the problem, up to an hour before it occurred. These problem models (called Problem Fingerprints™) focus troubleshooting efforts and reduce Mean Time To Identify (MTTI) and Mean Time To Repair (MTTR) the first time the problem occurs by indicating exactly which tiers of the application (and what specific metrics) are performing abnormally. Once these models have been captured, Alive can scan real-time metric data for a return of that pattern. If a problem pattern defined in one of the Problem Fingerprints is matched with high enough probability, Alive sends a predictive alert informing the Operations team of the looming problem, the probability the problem will occur, when it is likely to occur, what to look for and how it was solved previously.
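As a simplified illustration of the capture-then-match idea (and not a description of how Problem Fingerprints are actually implemented), imagine storing each past problem as the set of metrics that were abnormal beforehand, then checking how much of that set is abnormal right now. The `Fingerprint` class, the overlap score and the 0.75 cutoff below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical record of the abnormalities that preceded a past problem."""
    name: str
    abnormal_metrics: frozenset  # e.g. {("app", "heap_used"), ("db", "locks")}
    resolution_note: str


def match_score(fingerprint, current_abnormal):
    """Fraction of the fingerprint's metrics that are abnormal right now."""
    if not fingerprint.abnormal_metrics:
        return 0.0
    overlap = fingerprint.abnormal_metrics & current_abnormal
    return len(overlap) / len(fingerprint.abnormal_metrics)


def predictive_alerts(fingerprints, current_abnormal, min_score=0.75):
    """Return (score, fingerprint) pairs at or above min_score, best first."""
    current = frozenset(current_abnormal)
    scored = [(match_score(fp, current), fp) for fp in fingerprints]
    hits = [pair for pair in scored if pair[0] >= min_score]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Hypothetical library of captured problems.
db_crash = Fingerprint(
    name="database crash",
    abnormal_metrics=frozenset({
        ("app", "thread_pool_wait"), ("app", "response_time"),
        ("db", "lock_waits"), ("db", "connections"),
    }),
    resolution_note="Raise the connection pool limit on the DB listener.",
)
disk_full = Fingerprint(
    name="log volume full",
    abnormal_metrics=frozenset({("web", "disk_used"), ("web", "log_rate")}),
    resolution_note="Rotate and archive web server logs.",
)

now_abnormal = {
    ("app", "thread_pool_wait"), ("app", "response_time"),
    ("db", "lock_waits"), ("net", "retransmits"),
}

for score, fp in predictive_alerts([db_crash, disk_full], now_abnormal):
    print(f"{fp.name}: score {score:.2f} - {fp.resolution_note}")
```

Note that the overlap score here is just a fraction of matching metrics, not a calibrated probability, and it ignores the order and timing in which the abnormalities built up. It is only meant to show why a stored pattern can point troubleshooting at specific tiers and metrics and carry the previous fix along with it.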

As we’ve seen in this discussion, real time analytics solutions are all about:

  • large reductions in the manual effort associated with static threshold-based alerting
  • increased focus for troubleshooting efforts to reduce MTTI/MTTR
  • predictive alerting to allow proactive performance management

In my next post we’ll discuss additional problems solved by real time analytics. We’ll also delve into BSM and how these solutions are an essential catalyst to achieving it.

April 3, 2008   1 Comment

Why Real Time Analytics?

First of all, I’d like to thank Doug for inviting me to the conversation. I’m really looking forward to discussing how solutions such as Integrien’s will be an essential part of the future of Business Service Management. We’ll look at what problems these solutions address, how they do what they do and real life implementation and operationalization issues. Now to the topic at hand! Why Real Time Analytics? Why now? 

IT Operations executives are facing a dichotomous situation today. On one hand, their infrastructures are growing rapidly. I was recently talking with the Senior Director of Operations at a large social networking site and he told me that his server infrastructure (at the time 15K servers) was growing at 6% a week! They are also dealing with increasing complexity due to new technologies like virtualization and SOAs. Look at virtualization alone. While it can provide the tremendous cost benefits of server consolidation, it increases the complexity of the management problem considerably. You now have to deal with the hypervisor, virtual machines and guest operating systems as problem sources, in addition to the physical servers, O/S and applications. Let’s not even get into the issue of dynamically moving VMs…

On the other hand, Operations executives are being asked to reduce their budgets or at the very least keep them flat. Consider that historically, increasing infrastructure and complexity have been handled by throwing more bodies at it. Hmmm…, if 70% or more of the Operations budget is currently labor spend and you have to reduce or keep it flat, how can you scale to meet the needs of the business? 

There is an obvious parallel with what happened in manufacturing 30+ years ago with the advent of Total Quality Management (TQM) practices. Increasing complexity and scale in the manufacturing process was making it impossible to keep up with the quality and just-in-time delivery demands imposed by customers. Deterministic, rules-based methods of analyzing manufacturing lines that relied on trying to identify and measure every variable to stop problems from slipping by were no longer working. Some manufacturers (such as Toyota) started to take a different approach using advanced probability and statistics, collecting a subset of variables and looking at possible outcomes to get more proactive in approaching problems. The manufacturers that adopted these techniques gained a competitive advantage that persists to this day.

IT Operations is in a similar situation today. Alerting based on static monitoring thresholds and the collection of more metric data at faster intervals simply hasn’t provided a proactive approach to business service performance and availability issues. It also requires massive manual effort of IT staff to process the alert storms and perform manual or rule-based correlation to solve problems. Consider as well that in a system with thousands of servers and hundreds of thousands, even millions of metrics, the correlation problem is humanly unsolvable. Manual efforts can no longer scale in the face of increasing infrastructure and complexity.
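A quick back-of-the-envelope calculation shows why. If we assume, purely for illustration, that correlating means at least comparing metrics two at a time, the number of pairs grows with the square of the metric count. The `metric_pairs` helper below is a made-up name for this example.

```python
from math import comb


def metric_pairs(metric_count):
    """Number of distinct pairs of metrics that could be compared."""
    return comb(metric_count, 2)


for count in (1_000, 100_000, 1_000_000):
    print(f"{count:>9,} metrics -> {metric_pairs(count):>15,} pairs")
```

At 100,000 metrics that is roughly five billion pairs, and at a million metrics roughly five hundred billion, before even considering patterns that involve three or more metrics or time lags between them.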

That is why real time analytics-based solutions that leverage existing monitoring infrastructures are a necessity in today’s environment. I’ll go into the requirements for these solutions in a future post; however, the basic premise is to use advanced statistics and probability to learn normal system behaviors, alert only to abnormal behaviors that are true precursors to problems and perform advanced correlation techniques to predict future abnormalities and specific performance and availability problems. This new approach will be a competitive advantage for the companies who adopt it now, just as it was for the early adopters of similar approaches in manufacturing years earlier.

February 27, 2008   5 Comments

SME Guest Author: Steve Henning

One of the goals that I have for this blog is to complement my thoughts and views with those of other like-minded people. I started this blog over two years ago with an invitation to other folks internal to the IBM Tivoli Business Service Management (BSM) team to contribute via this blog, but they have never been able to commit the five or ten minutes to share their thoughts. I have always envisioned that the comments that folks leave would be a significant part of the “spirit” that this blog has. I think we’ve had some good conversations via the comment threads recently but I’m still always looking for more.

With that, I’m introducing what I hope will be the first of many SME Guest Authors for my blog. My goal here is simple, to enable others to share their thoughts on business, service and technology operations and management. I’ve laid out some “ground rules” for these SME Guest Authors and set them free. Free to discuss how emerging technologies and products align with the goals and objectives of Business Service Management. Free to talk about how practitioners can be truly successful. Free to offer practical implementation knowledge and insight. Free to make the sales and marketing slicks “come to life” and become something believable, implementable and manageable over the lifecycle (and free from the sales and marketing hype you all deal with weekly).

I’m introducing Steve Henning. Steve is currently the VP of Product for an exciting company called Integrien. While I do not know as much as I’d like to about their company, technology or products, I do know that what they bring to the market is something that could dramatically improve the status quo within the typical operations environment. I believe that capabilities such as theirs are desperately needed within any maturing Business Service Management solution and will play a key role in the next generation of Business Service Management solutions. I’ve invited Steve to share his thoughts and ideas on the technology and capabilities in this market segment. I expect that the conversations will be straight up, down to earth, transparent and honest and something that we can all learn from. Please welcome Steve to the conversation!

February 27, 2008   No Comments

A deeper look at Netuitive

I’ve been following Netuitive for over two years now. You can do a search for them on my blog and see all of the various activities and how they’ve evolved over the years. I was very skeptical of their early claims such as “BSM by Lunch”, which I’m glad they’ve now backed away from to focus on their core competencies and value add to the overall BSM solution stack. I wish they had stuck with the blog they started at BSM Digest, but I understand the challenges.

The power of getting accurate, trusted events free from false positives and false negatives is CRITICAL to the underpinning of any good BSM solution. If you’re putting garbage on the dashboards of your tools that operations, support and executives have to see, you’re NOT going to be successful with your BSM strategy. I’m also now very interested in Integrien and ProactiveNet (BMC) and look forward to digging deeper into their solutions. Nearly every client I’ve seen has the same problems that these vendors are addressing, and so did we when I ran the monitoring tools group at EarthLink. They’re the ONLY ones filling these gaps as best I can tell.

I’m looking for a really good discussion on Netuitive’s Active Behavior Profile (ABP). Which of your nine patents apply to this concept? Does every managed element type have a unique ABP or does every actual component have its own ABP? If I want to model/manage a Windows 2000 server differently from a Windows 2003 server, how does this work? Is this where Templates come into play? What data streams are “mashed up” in the ABP? Templates?

What vendors do you play best with? Where are the key details of how/what you leverage from each of these vendor solutions? Please share some details. If a client only has the out of the box hardware and OS monitoring using their vendor’s solution with out of the box configurations, what can I expect to see in SI? What will I be missing? Do you recommend certain things be turned on to get health, workload and other outputs? Please discuss. Is one vendor’s CPU or Memory treated the same as another vendor’s?

When Trusted Alarms are sent outbound towards an event management solution as SNMP Traps, do they include group and function information? What can be mapped into varbinds? How is the trap constructed? Where does this happen?

I’d love to see some hard tangible ROI discussed on how these products are helping. I’m also very interested to know if the typical reactive based operations and support organizations are ready to get more proactive based on what these three vendors can provide. Can they mature from the comfortable, reactive “it’s broke” world and operate in a proactive, predictive “it’s a problem in this area, trust me” world?

Look forward to the discussion!

February 8, 2008   1 Comment

Integrien Alive

I’m always skeptical of what I see in a demo until I can dig into what’s under the covers, but what I saw in the Integrien Alive demo impressed me. It looks like it could be a solid foundation for Business Service Management (BSM) in the future, with focus by Integrien in key areas such as dashboard visualization, modeling and alignment to business services and applications.

It looks like Integrien competes firmly with Netuitive and the former ProactiveNet (now BMC), maybe Firescope and Managed Objects to some degree.

Effective, trusted and value oriented Business Service Management absolutely depends on an accurate data stream, whether it be events, metrics, KPIs, etc. Running monitoring tools with their default out of the box configurations and thresholds, combined with poor monitoring and event management lifecycles, has led to the development of solutions such as Integrien’s to “take back control” and give you back trusted insight into IT infrastructure.

I’d love to see or hear more about Integrien technology. Anyone have any first hand experience? IMO, we have a gap in the IBM Tivoli portfolio in this technology and capability area.

January 16, 2008   7 Comments