First of all, I’d like to thank Doug for inviting me to the conversation. I’m really looking forward to discussing how solutions such as Integrien’s will be an essential part of the future of Business Service Management. We’ll look at what problems these solutions address, how they work, and real-life implementation and operationalization issues. Now to the topic at hand! Why Real Time Analytics? Why now?
IT Operations executives are facing a dichotomous situation today. On one hand, their infrastructures are growing rapidly. I was recently talking with the Senior Director of Operations at a large social networking site, and he told me that his server infrastructure (at the time 15K servers) was growing at 6% a week! They are also dealing with increasing complexity due to new technologies like virtualization and SOAs. Look at virtualization alone. While it can provide the tremendous cost benefits of server consolidation, it increases the complexity of the management problem considerably. You now have to deal with the hypervisor, virtual machines and guest operating systems as problem sources, in addition to the physical servers, operating systems and applications. Let’s not even get into the issue of dynamically moving VMs…
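To put that growth rate in perspective, a quick back-of-the-envelope calculation (assuming the 6% weekly rate compounds) shows just how fast such a fleet outruns any manual approach:

```python
import math

servers = 15_000        # fleet size at the time of the conversation
weekly_growth = 0.06    # 6% growth per week

# Weeks for a compounding fleet to double: solve (1.06)^n = 2 for n
doubling_weeks = math.log(2) / math.log(1 + weekly_growth)
print(f"Fleet doubles in about {doubling_weeks:.1f} weeks")   # ~11.9 weeks

# Fleet size after one year of sustained compounding growth
after_year = servers * (1 + weekly_growth) ** 52
print(f"After 52 weeks: ~{after_year:,.0f} servers")          # ~310,000 servers
```

At that pace the infrastructure doubles roughly every three months, which is exactly why headcount-based scaling breaks down.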
On the other hand, Operations executives are being asked to reduce their budgets or at the very least keep them flat. Consider that historically, increasing infrastructure and complexity have been handled by throwing more bodies at the problem. Hmmm… if 70% or more of the Operations budget is currently labor spend and you have to reduce it or keep it flat, how can you scale to meet the needs of the business?
There is an obvious parallel with what happened in manufacturing 30+ years ago with the advent of Total Quality Management (TQM) practices. Increasing complexity and scale in the manufacturing process was making it impossible to keep up with the quality and just-in-time delivery demands imposed by customers. Deterministic, rules-based methods of analyzing manufacturing lines that relied on trying to identify and measure every variable to stop problems from slipping by were no longer working. Some manufacturers (such as Toyota) started to take a different approach using advanced probability and statistics, collecting a subset of variables and looking at possible outcomes to get more proactive in approaching problems. The manufacturers that adopted these techniques gained a competitive advantage that persists to this day.
IT Operations is in a similar situation today. Alerting based on static monitoring thresholds and the collection of more metric data at faster intervals simply hasn’t provided a proactive approach to business service performance and availability issues. It also requires massive manual effort from IT staff to process the alert storms and perform manual or rule-based correlation to solve problems. Consider as well that in a system with thousands of servers and hundreds of thousands, even millions, of metrics, the correlation problem is humanly unsolvable. Manual efforts can no longer scale in the face of increasing infrastructure and complexity.
That is why real time analytics-based solutions that leverage existing monitoring infrastructures are a necessity in today’s environment. I’ll go into the requirements for these solutions in a future post; however, the basic premise is to use advanced statistics and probability to learn normal system behaviors, alert only on abnormal behaviors that are true precursors to problems, and perform advanced correlation techniques to predict future abnormalities and specific performance and availability problems. This new approach will be a competitive advantage for the companies that adopt it now, just as it was for the early adopters of similar approaches in manufacturing decades ago.
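To make the "learn normal, alert on abnormal" premise concrete, here is a minimal sketch of a dynamic baseline. This is my own illustrative toy, not Integrien’s actual algorithm: it learns a metric’s recent mean and standard deviation and flags only values that fall well outside the learned range, instead of comparing against a hand-set static threshold:

```python
from collections import deque
from statistics import mean, stdev

class DynamicBaseline:
    """Learn a metric's normal range from recent history and flag only
    statistically abnormal values (a stand-in for a static threshold)."""

    def __init__(self, window=60, sigmas=3.0):
        self.history = deque(maxlen=window)  # recent observations
        self.sigmas = sigmas                 # tolerance in standard deviations

    def observe(self, value):
        """Record a new observation; return True if it is abnormal
        relative to the behavior learned so far."""
        abnormal = False
        if len(self.history) >= 10:          # need enough data to baseline
            mu, sd = mean(self.history), stdev(self.history)
            if sd > 0 and abs(value - mu) > self.sigmas * sd:
                abnormal = True
        self.history.append(value)
        return abnormal

baseline = DynamicBaseline()
for v in [50, 52, 49, 51, 50, 48, 53, 50, 51, 49, 50, 52]:
    baseline.observe(v)          # learns that values near 50 are normal

print(baseline.observe(51))      # within the learned range -> False
print(baseline.observe(95))      # far outside the learned range -> True
```

The point of the sketch is that the threshold adapts to each metric automatically, which is what lets this style of analysis scale to millions of metrics where manually tuned static thresholds cannot.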