Just some rambling thoughts here…feel free to join in.
Is another tool what’s really required here? What should/could be done in domain-specific resource monitoring solutions that addresses the problems at the edge? Should I really be monitoring everything that comes out of the box in a default configuration? Why do I have all of these profiles, situations, thresholds, events, etc. in the first place? Do I even know what I’m monitoring and why?
What if I have a multi-vendor, multi-sourced environment where I may or may not have visibility? What if I don’t have a CMDB or other source of topology, relationships and dependencies? What if I don’t even know the state and status of the applications, databases or services to begin with? What will I be able to do with investments into these technologies?
What if I have adopted a “manager of managers” concept where I have a consolidated operations eventing environment with feeds from across the entire business environment (facilities, plant, IT, datacenter, logistics, telephony, manufacturing, contact centers, etc.)? Shouldn’t this dynamic “learning” and “thresholding” concept be really applied at this level for some sort of “intelligent event management” free from manual intervention, policies, codebooks, etc? How about the context of the business calendar and schedule merged with the IT operations calendar and schedule? I doubt that this can all be “learned” magically.
If I invest in a BMC ProactiveNet, Netuitive or Integrien (or other fundamental dynamic “learning” or “trending” tool – my favorite was a company called Premonitia – now defunct, based on research from acoustic modelling of whales and shrimp IIRC), how will I recognize and measure the value from that investment? How should the operations environment change to adopt the promises of the “secret sauce” within these emerging technology areas? Will IT operations and second/third tier support teams need to change the ways they work today? If so, how? Does IT operations know how to respond to a future state that hasn’t occurred, or to someone stating that a service is “slow”? I think most operations and support teams are still in their infancy here.
I’m all for emerging technologies that speak towards making the lives of the folks on the front line better and for sensing, isolating and resolving issues within complex IT environments before they impact the business services, but will investing in these tools really improve the status quo within the typical operations environment? The Next Generation Operations Center, Command Center, Service Management Center or whatever we want to call it must be enabled with these types of technology, but must also be prepared to think, operate and respond differently than it does today.
How are you changing? Will you change? Where’s your value proposition? Is it at the front line, second/third line of the support process, at the LoB? Is it about efficiencies in workflow? Do more, with less? Automation? Availability? Becoming proactive? Do you know the real root causes prompting your interests in this technology? What are your vendors doing about it? What is your monitoring tools group doing about it? Should they be doing something different?
Please share your thoughts on how best to operationalize and really recognize value from your investments into these technologies or what you’re doing to address the real root causes of the symptoms this technology addresses.
I have given this subject some thought. I’ll share some of my ideas, perhaps not in proper order, but hey, it’s an initial attempt at setting up a discussion with you.
In my case I see several obstacles in the path towards my goal, which I call Automated Impact Management. Just a name for something that ties technical status and performance information to true business implication.
In practice I have built many systems for different customers, all trying to obtain status and performance information by implementing monitoring tools.
On the other hand I have seen other implementations of various tools to map business to assets, assets to problems etc.
Trying to optimise from all ends, one soon runs into the 80/20 rule, where it takes a lot of effort to go beyond the 80% coverage.
Even when processes are in place to continue improvements, this results in continued dependency on manual intervention and on people being alert. Together with the continuing changes within most organisations, this means they will never get into a comfortable position with IT in the business.
I’ve come up with the idea of taking a two-forked approach, a pincer movement so to speak. Looking from bottom-up, so from the various IT systems upward to the business:
First, I leave the common practices and procedures aforementioned as is. They are needed but should be focused on the quality of information. Effort should be continued in the direction of CMDB accuracy, quality monitors and procedures to enforce change management etc.
Then I wondered why don’t we just do the opposite to enhance this effort? So, before we start filtering and enriching information, we just hook up any measurement device we can find to any system we can and pump a steady (but standardised) stream of information into an automated learning system, a neural network type of application. This will enable us to focus on the quantity of information, the bulk of which we humans do not want to see and want to get rid of. A (smart) neural network will be able to do a couple of extra things:
– It will be able to detect patterns in the stream. For instance it can’t help but find a relationship between dates and times and certain types of alerts. “Always happens on Friday afternoon” kind of things;
– If properly configured it will be able to auto-detect new sources of information. “I never saw something coming from there before” kind of things;
– It will detect anomalies. “This never happens, so it must be important” kind of things;
– It will detect relationships. “Hey if that happens, something else always happens too” kind of things;
– It might be able to analyse gaps. “This is missing” kind of things.
Imagine an application that can do this, connected up to the information quality system. It can help automate the improvement of the information quality, can offer a ‘second opinion’ on problems found and could help adapt to changes much faster.
So this results in a system covering more aspects of the IT environment, providing a much better status picture.
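To make the list above a bit more concrete, here is a minimal sketch of three of those behaviours: spotting a new source, flagging a rare event type, and finding the usual day and hour for an alert. It is deliberately not a neural network; plain counting stands in for the learning system, and every name in it (`StreamLearner`, `observe`, `usual_slot`, the `rare_below` and `min_events` settings, the sample host names) is made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


class StreamLearner:
    """Counting-based stand-in for the 'learning system' (not a neural network)."""

    def __init__(self, rare_below=0.05, min_events=20):
        self.rare_below = rare_below  # share of history below which a type counts as rare
        self.min_events = min_events  # do not judge rarity before this many events
        self.sources = set()
        self.type_counts = Counter()
        self.slot_counts = Counter()  # (event_type, weekday, hour) -> count
        self.total = 0

    def observe(self, source, event_type, when):
        """Record one event and return a list of findings about it."""
        findings = []
        if source not in self.sources:
            findings.append(f"new source: {source}")
            self.sources.add(source)
        # Judge rarity against the history seen so far, before counting this event.
        if self.total >= self.min_events:
            share = self.type_counts[event_type] / self.total
            if share < self.rare_below:
                findings.append(f"anomaly: {event_type} is rare ({share:.0%} of history)")
        self.type_counts[event_type] += 1
        self.slot_counts[(event_type, when.weekday(), when.hour)] += 1
        self.total += 1
        return findings

    def usual_slot(self, event_type):
        """Return the (weekday, hour) in which this event type occurs most often."""
        slots = {key[1:]: n for key, n in self.slot_counts.items() if key[0] == event_type}
        return max(slots, key=slots.get) if slots else None


if __name__ == "__main__":
    learner = StreamLearner()
    start = datetime(2008, 6, 6, 15, 0)  # a Friday afternoon
    for week in range(25):
        learner.observe("mq01", "queue_full", start + timedelta(weeks=week))
    print(learner.observe("db07", "disk_failure", start))
    print(learner.usual_slot("queue_full"))
```

Relationship detection and gap analysis from the list are left out here; they would need the events grouped into time windows, which is a design decision in its own right.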
The next thing needed is something that provides accurate business information. Perhaps even another learning system that uses feedback information on impact to relate to situations. This might enable us to not only put a cost on outage, but apply a figure to slowdowns or partial outages as well, perhaps even allow prediction.
What do you think?
Martin
At the cost of being trite, I’d say it’s not just the technology – it’s the people.
I’ll give two examples:
1. The Ops guys are monitoring a queue because last week, a transaction got stuck and there was major downtime. This week, a transaction got stuck on the same queue but when the application people were alerted they said “it can wait”.
For how long are the Ops guys going to be monitoring that queue? Not long, I’d say.
2. Some database is raising hell because of some misconfiguration or other. The DBAs claim that they’re busy and that they’ll fix it “next week”.
My point is that the biggest problems are communication gaps between Ops people, IT people, application people and managers.
Until some system can prioritize between them and give them a shared view, those gaps will remain. CMDBs are excellent examples of shared views, but they’re only the beginning.
I’d like to see how application guys can dynamically change monitoring thresholds on the fly – these are the only people who know how exactly the application should be working.
Well – application people may know how things work. Or should be working.
But they hardly know how the application behaves at a given point in time during the day, let alone how it will behave over the next two to four hours.
Robert, the short answer to your last question: I believe the monitors should not have any thresholds, and should just forward all the status and performance information to the pattern recognition tool I mentioned. We should go for automated anomaly detection and less dependency on people.
Good discussion!
I agree with what you’re saying Martin, and this is where I see technologies and concepts from BPM, BAM, BEM, CEP/ESP and, to some degree, event BI converging with traditional IT operations management and monitoring data. This type of thing is also done today in the financial, insurance and security information and event management (SIEM) spaces.
But along with those capabilities must come a rethinking of how those things are operationalized. How do we need to change the way the operations and application support teams work? How do we get them to become more proactive and potentially do more work? I’m thinking that this is where you’ve absolutely got to have some sort of autonomic, automation or intelligent system type capability to handle more and more of these proactive/predictive type things. (I’m generally not a believer in automation until you’ve got things in order and controlled first)
What can we do as an industry to put a reference model together to help guide the discussion in this area? I think that it’s something required to help clear up some of the FUD out there and speak about the spiffy patents and secret sauce in more layman’s terms.
Thoughts?
Doug
OK,
I’ll modify what I said – how do we get the Application, IT people, Ops people and even Managers to cooperate over how the pattern-matcher should match patterns?
True, the apps people may not know what happens in Real Life, but if they put out a new version, then there should be some change in the way the system behaves – and the Ops people need to know what the change is before it happens.
Well, IMO, this is where a “begin with the end in mind” approach must come into play. Aside from developing and deploying the application to meet the stated business goals and objectives, emphasis must be put on instrumentation, management and monitoring early in the development process. The operations and monitoring tools teams must collaborate with application development and support teams to establish standards and practices for this. The “throw it over the wall and they will figure it out” approach that exists in almost all shops I’ve seen must stop or you’ll never be able to adopt an emerging “smart” system.
Solid development to operations processes (ITIL or other best practices based) can and should be considered here to really address these types of organizational and operational challenges.
Doug
The proactive approach I’m implementing is focusing on performance issues. When the operations team is able to effectively respond to performance events then this is a sign they are proactively working issues. If the operations team is waiting for availability alerts then they are stuck in a reactive mode.
To implement effective performance management, there is a need for a tool to correlate performance metrics, which is separate from, but similar to, the event correlation engine.
The approach we’re taking is to install a learning tool to analyze centrally collected performance metrics in real-time from the various domain managers. Gartner recently labeled this as a performance management database (PMDB). http://www.gartner.com/DisplayDocument?id=606707
The tool we’ve implemented does two levels of performance metric correlation. The first level generates composite performance events from learned thresholds for a single element. For example, the correlation engine analyzes all the performance metrics on one server. The second level prioritizes the performance events by their impact on the customer. Incorporating BPM or QoS performance metrics allows the tool to correlate spikes in customer experience to anomalies in application infrastructure performance metrics.
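As a rough illustration of those two levels (and only that; this is not how the tool itself works), a sketch could look like the following. It assumes a "learned threshold" is simply a band of mean plus or minus k standard deviations over recent history, and that customer impact is available as a per-server score. The function names, the `min_metrics` and `k` parameters, and the `customer_impact` mapping are all hypothetical.

```python
from statistics import mean, stdev


def learned_band(history, k=3.0):
    """Learn a normal range for one metric as mean +/- k standard deviations."""
    m, s = mean(history), stdev(history)
    return m - k * s, m + k * s


def composite_event(server, history, current, min_metrics=2, k=3.0):
    """Level 1: one composite event per server when several metrics leave their band."""
    abnormal = {}
    for metric, value in current.items():
        low, high = learned_band(history[metric], k)
        if not low <= value <= high:
            abnormal[metric] = value
    if len(abnormal) >= min_metrics:
        return {"server": server, "metrics": abnormal}
    return None


def prioritise(events, customer_impact):
    """Level 2: order composite events by an assumed customer-impact score, highest first."""
    return sorted(events, key=lambda e: customer_impact.get(e["server"], 0.0), reverse=True)
```

In practice the impact score would have to come from the BPM or QoS metrics mentioned above; here it is just a number handed in by the caller.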
Good to hear the progress Jay!
Are you looking into such a database? (not the product you’re using, but an analytical db such as Vertica?) What do you think about the convergence of Business Intelligence (BI) or Business Performance Management (BPM) into the IT management area as something like “Operational BI”? Do you think what’s been fairly mature in those markets can be adapted for IT type situations? Can we move entirely away from an event based world and work exclusively from graphs, charts and trends?
Can you share some scenarios where you feel the approach you’re implementing works best? How about where it doesn’t work best? Is there a certain best practices scenario?
How does this dynamic capability change the way you or the tools administrators work? How about Operations? Can you quantify the value in some way?
Tks,
Doug
Nice to see the robust discussion here!
Clearly, a solution like our own Integrien Alive can provide tremendous benefits; however, unless an organization is willing to embrace the technology AND examine where their current processes are lacking, its full potential is unlikely to be realized. That said, I certainly don’t believe the solutions we are speaking about require wholesale changes to processes already established. It is more about Operations teams taking the time to determine what is working and what isn’t, and committing to correcting what isn’t. Many Operations teams I speak with are stagnating with antiquated procedures that just don’t work as effectively in today’s more complex environment. Because so much of their work is time consuming and manual, no one has the time to give much thought to process improvement.
This is precisely why analytics solutions are needed. Automating the manual efforts that go into problem solving can free up enough cycles to allow the team to commit to continuous process improvement. Solutions that can pinpoint the tiers of the application that are behaving abnormally in the context of a business performance or user experience problem eliminate bridge calls and focus the individuals who can actually solve the problem. It also isn’t very difficult to measure the impact that reducing manual effort through automation provides to the business.
On the proactive front, once again, I don’t see a lot of changes to procedures that are required. For example, let’s say that a Level 3 application expert receives an alert from Integrien Alive that the application server tier is acting abnormally. As he/she is looking at the constituent alarms that make up the alert, he/she can also see predictions of future abnormal behavior based on correlation techniques. He/she may see a prediction that a key indicator for the database will exceed its threshold with 95% probability in 15 minutes or less. Since this will bring down the DB and also bring down this mission critical business application, he/she forwards the alert to a DBA who makes a configuration change that averts the problem. No need for a change to procedures; the solution just provides an advance warning that can be acted on. I’d say that measuring the benefit is pretty simple here as well. Just measure the amount of downtime you saved based on the last time a configuration problem brought down the DB.
We’re not looking at any additional databases to store this information because the current database schema is easy to work with. We are trying to apply the same BI analysis to the application performance datasets. By connecting statistics tools like MATLAB and Simulink to the database, we have the capability to perform regression analysis, adjust standard deviation settings, and run “what-if” scenarios with different datasets.
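For readers wondering what an advance warning of this kind boils down to, here is a deliberately simple sketch. It fits a straight line through recent samples and estimates how many minutes remain before the line crosses a threshold. It does not produce a probability and makes no claim about how any vendor's correlation techniques work; the function name `minutes_until_breach` and the sample figures are made up.

```python
def minutes_until_breach(samples, threshold):
    """Estimate minutes from the last sample until a fitted line crosses the threshold.

    samples: list of (minute, value) pairs, oldest first.
    Returns None when there is too little data or no upward trend.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None  # all samples share one timestamp, so no trend can be fitted
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / spread
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None  # flat or falling: the line never reaches the threshold
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical readings of a database indicator taken every five minutes.
    readings = [(0, 50), (5, 55), (10, 60), (15, 65)]
    print(minutes_until_breach(readings, threshold=80))
```

A straight line is the crudest possible model; the point is only that the output is a lead time someone can act on.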
For best practices, we’re still working on it. Maybe that can be a separate discussion in a couple of months.
The shift from working availability alerts to proactive/performance alerts does have a dramatic impact on the existing processes, and that is also a work in progress. The standard responses for availability alerts are different from those for performance alerts, especially a composite performance alert. Also, the operations teams usually don’t send proactive/performance alerts to the higher tier support teams, so this takes buy-in from multiple groups.
Those are the short answers to some of the challenges we’re facing during implementation.
Jay,
I was really thinking about emerging “analytics databases” and technology rather than just a database to aggregate and store your data. I think there’s a lot that these emerging vendors, vendors who are evaluating their play here, and future start ups can learn from the well established BI, BPM and statistical tool vendors out there. Another large service provider plans to manage their entire environment based on performance trend charts and graphs with smart analytics – and NO MORE EVENTS!
Thanks for being honest about the realities of the organizational readiness required to adopt predictive or proactive tools. Since you’re a customer of one of the three vendors in the space, is your vendor helping in any way to work through the organizational and operational challenges? Do you see a need for this type of consulting service from your vendor or from some other source?
I’d love to hear your comments and thoughts on this other post I made here: http://dougmcclure.net/blog/2008/06/top-5-reasons-for-a-predictiveproactive-solution/
Tks Jay, and ping me when you want to talk f2f again!
Doug