During my time at EarthLink in the early 2000s, the Service and Technology Management/Monitoring team always seemed to be on the receiving end of the pointing finger whenever a service, application or technology issue wasn't detected, events weren't acted upon, or tickets weren't created by the NOC or other support teams. It was always the fault of 'our tools'! After one of the numerous re-orgs that seemed to occur, a new Director was convinced he had a way to 'fix' the problem of the NOC and others not getting events about issues in our datacenters and networks, or of the NOC being unable to 'find' the right event to act upon because there were so many of them.
His idea was to tie a portion of my annual bonus to the quality of the monitoring we implemented, measured ultimately by the percentage of false positive events (an event was sent but determined by the NOC engineer not to be a problem) and false negative events (a problem occurred but the NOC didn't 'see' an event for it). The NOC Problem Management teams would ultimately determine what events existed (or didn't) for the issues and report monthly on the quality of the event stream delivered to the front line of the NOC. My boss set a goal of less than 10% false positive and false negative events per month.
As you can imagine, the amount of work required in the early 2000s to achieve this was pretty significant. I can recall spending countless hours writing Crystal Reports-based report templates to query our historical event database. I can recall the custom web/PHP apps we created to front-end that database, allowing queries to be written across standard event fields to make things somewhat easier. Those days were pretty painful, to say the least.
A lot has changed over the years, and we now find pretty much the same types of interfaces and capabilities from IBM/CSI (fka Tivoli) in the area of analyzing and reporting on events. In the early 2000s we had Netcool/Reporter (an OEM from someone), then post-acquisition we gained Tivoli Common Reporter (TCR) based on Eclipse BIRT, and now we have TCR based on our Cognos technology. Reporting-based solutions on top of databases really haven't changed much: define a report template, execute the report, see the results. Of course things get a bit easier as we evolve over time, but a report's a report in my opinion, and someone needs to be really familiar with the reporting platform of choice (and databases, data management, SQL, etc.) to get the desired reports and value. As an example, we have a number of fairly simple Netcool/OMNIbus TCR reports today that any of our customers can use and extend.
What did I spend a lot of time doing back then? Searching for the needle in the haystack! Did we get an event? What time did it come in? Did we have the right flags set so that it was seen in a NOC event list? Did they touch the event? Did they ticket the event? I ended up with all kinds of reports to slice and dice the historical event data, and I had to change the report time periods and re-run them, over and over and over. Painful to say the least!
We hear from a lot of clients these days that they have too many events. The event (and resulting ticket) volumes I'm hearing about are gigantanormous!! This is the same thing I heard from our NOC back then: too many events and not enough smart stuff to pre-process them into the best possible set for a front-line NOC engineer to spend their precious time investigating. I recall that when I took over the monitoring team there were well over 150K events in Netcool/OMNIbus. The reasons were many, and tools maintenance/upkeep and adherence to best practices certainly played a part back then. But the task of finding out why they were there, and whether or not they should be, was challenging. You guessed it – more reports!
Over a year ago I was introduced to some pretty powerful open source projects called elasticsearch, logstash and kibana, and this is where I had my 'aha' moment: search is the new report, the new dashboard, the new UI into real-time and historical data like those events I'd been reporting on from an Oracle database for so many years. I thought surely that search would let me more easily and intuitively find those needles in the haystack. I dove into the concepts of search and learned about facets, directed search, free-form search and metadata search. I started experimenting and quickly discovered how easy it was to pull these products together into a compelling and powerful search-oriented interface. Had these existed in the early 2000s, wow, my life would have been much easier!
Here we are today, and IBM/CSI released a product in June that sits squarely in this area of search as an interface for interacting with, acting upon and visualizing all kinds of data. With our new SmartCloud Analytics Log Analysis v1102 refresh release we introduce powerful technology support toolkits for logstash and delimited (DSV) data sources that allow a real-time event stream from Netcool/OMNIbus to be consumed using a very simple socket gateway. What I want to write about here is how you can build your own Event Analysis solution using SCALA.
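To make that idea a little more concrete before the full walk-through, here is a minimal sketch of what the logstash side of such a pipeline could look like. Treat everything in it as an assumption for illustration: the port number, the pipe delimiter and the column names would all need to match whatever your OMNIbus socket gateway actually emits and what your DSV Insight Pack expects, and the real SCALA integration comes from the Log Analysis logstash toolkit rather than the plain stdout output shown here.

    input {
      # Listen for the delimited event stream emitted by an OMNIbus socket
      # gateway. The port is an assumption; match it to your gateway setup.
      tcp {
        port => 5555
        type => "omnibus_event"
      }
    }

    filter {
      # Split each delimited record into named fields. The separator and the
      # column names are assumptions; align them with the columns your
      # gateway writes and with your DSV Insight Pack definition.
      csv {
        separator => "|"
        columns => ["Identifier", "Node", "Severity", "Summary", "LastOccurrence"]
      }
    }

    output {
      # Print parsed events while testing; in a real deployment you would
      # forward them on to SCALA via the Log Analysis logstash toolkit.
      stdout { debug => true }
    }

Even this little sketch shows the shape of the approach: the gateway turns OMNIbus events into a delimited stream, logstash parses them into fields, and SCALA then lets you search and visualize those fields instead of writing yet another report.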
While you sit on pins and needles and await my next post, spend your weekend checking out these resources! Leave me some comments on what you've had to do in the past in this area. What does event analysis or event analytics mean to you, and how can they help you with event hot-spot reduction, improved monitoring, optimized IT Operations, etc.?
- SmartCloud Analytics Log Analysis
- Download the SCALA v1103 Trial Version, Insight Packs and Docs for DSV and Logstash Toolkits
- Check out Logstash! (if you download, get version 1.13 here. Our 1.2.x support will come soon.)
Comments on this entry are closed.
In an effort to build my own event analysis dashboard from SCALA I ended up finding this nice article (great fan from now on), but I am wondering if it is possible to add an implementation of a custom-made algorithm, like regression or whatever, for predictive insight?
PS: not spam
Yes, I have been playing with many types of analytics in this area. Once your data is in LA (and the backend Solr), it's accessible to many of the great big data and analytics tools. If you integrate LA with Hadoop, you've now got all your event/log/etc. data in HDFS and available to do all kinds of things with.
Reach out to me directly if you’d like to get involved with some of our prototyping work in this area!