thoughts on business, service and technology operations and management in a big data and analytics world

To catch up, check out part 1, part 2 and part 3.

I wanted to get an up-to-date configuration out based on some recent work for our upcoming Pulse 2014 demo, making use of the latest versions: logstash v1.3.3 and our SCALA v1.2.0 release. Nothing is significantly different per se, but the logstash syntax and internal event flow/routing have changed significantly from v1.1.x.

I’ve included an example logstash v1.3.3 configuration file in my git repo here. It should be simple to follow the flow from inputs through filters to outputs. The use of tags and conditionals is key to controlling filter activation and output routing. It’s very powerful stuff!
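To picture what tag-driven routing means, here is a small Python model of the idea. It is only a conceptual sketch and not logstash code: the `Event` class, the `apply_filters` and `route` functions, the tag names `netcool` and `drop`, and the "TBSM" text check are all hypothetical names picked for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    message: str
    tags: set = field(default_factory=set)


def apply_filters(event):
    """Run a filter only on events that carry the tag it is conditioned on."""
    if "netcool" in event.tags and "TBSM" in event.message:
        # Hypothetical rule: mark internal events so no output picks them up.
        event.tags.add("drop")
    return event


def route(event):
    """Return the names of the outputs this event would be sent to."""
    if "drop" in event.tags:
        return []
    outputs = []
    if "netcool" in event.tags:
        outputs += ["file", "scala"]
    return outputs
```

An event tagged `netcool` reaches both outputs, while an untagged event matches no conditional and goes nowhere. That is the behaviour the real configuration expresses with conditionals around its filter and output stanzas.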

I’ll get another post out this week covering our next key component, the SCALA DSV pack, which consumes the events routed via logstash to SCALA.


These are my links for November 19th through December 11th:

  • The Netflix Tech Blog: Announcing Suro: Backbone of Netflix’s Data Pipeline – Suro, which we are proud to announce as our latest offering as part of the NetflixOSS family, serves as the backbone of our data pipeline. It consists of a producer client, a collector server, and plugin framework that allows events to be dynamically filtered and dispatched to multiple consumers.
  • Sensu | An open source monitoring framework – Designed for the Cloud The Cloud introduces new challenges to monitoring tools, Sensu was created with them in mind. Sensu will scale along with the infrastructure that it monitors.
  • datastack.io – data integration as a service – collect data. share insights. Kinda Logstash or Heka. But without the pain.
  • Glassbeam Begins Where Splunk Ends – Going Beyond Operational Intelligence with IoT Logs | Glassbeam – Glassbeam SCALAR is a flexible, hyper scale cloud-based platform capable of organizing and analyzing complex log bundles including syslogs, support logs, time series data and unstructured data generated by machines and applications. By creating structure on the fly based on the data and its semantics, Glassbeam’s platform allows traditional BI tools to plug into this parsed multi-structured data so companies can leverage existing BI and analytics investments without having to recreate their reports and dashboards. By mining machine data for product and customer intelligence, Glassbeam goes beyond traditional log management tools to leverage this valuable data across the enterprise. With a focus on providing value to the business user, Glassbeam’s platform and applications enable users to reduce costs, increase revenues and accelerate product time to market. In fact, Enterprise Apps Today’s Drew Robb recognized this critical value proposition naming Glassbeam a hot Big Data startup for analytics, which is attracting interest from investors, partners and customers. Today’s acquisition serves to showcase a market that is heating up, and new requirements around data analytics. But this is only the start and Glassbeam deliberately picks up where Splunk ends. We remain committed to cutting through the clutter and providing a clear view of operational AND business analytics to users across the enterprise.
  • Splunk Buys Cloudmeter to Boost Operational Intelligence Portfolio – The acquisition of Cloudmeter rounds out Splunk's portfolio with a capability to analyze machine data from a wider range of sources. Financial terms of the deal were not disclosed. The transaction was funded with cash from Splunk's balance sheet, the company said. Indeed, the addition of Cloudmeter will enhance the ability of Splunk customers to analyze machine data directly from their networks and correlate it with other machine-generated data to gain insights across Splunk's core use cases in application and infrastructure management, IT operations, security and business analytics.
  • Netuitive Files for Ground-Breaking New Patent in IT Operations Analytics – Press Release – Digital Journal – The patent filing is led by Dr. Elizabeth A. Nichols, Chief Data Scientist for Netuitive, a quantitative analytics expert focused on extending Netuitive's portfolio of IT Operations Analytics (ITOA) solutions to new applications and services. "Netuitive is committed to delivering industry leading IT Operations Analytics that proactively address business performance," said Dr. Nichols. "In addition, Netuitive's research and development is actively focused on new algorithm initiatives that will further advance our abilities to monitor new managed elements associated with next-generation IT architecture and online business applications."
  • Legume for Logstash – Legume Web Interface for Logstash & Elasticsearch Legume is a zeroconfig web interface run entirely on the client side that allows to browse and search log messages in Elasticsearch indexed by Logstash.
  • Deploying an application to Liberty profile on Cloud Foundry | WASdev – As part of the partnership between Pivotal and IBM we have created the WebSphere Application Server Liberty Buildpack, which enables Cloud Foundry users to easily deploy apps on Liberty profile.
  • IBM’s project Neo takes aim at the data discovery and visualisation market – MWD’s Insights blog – Project Neo is IBM’s answer to data visualisation and discovery for business users. It promises to help those who don’t possess specialist skills or training in analytics, to visually interact with their data and surface interesting trends and patterns by using a more simplistic dashboard interface that helps and guides users in the analysis process. Whereas previous tool incarnations are often predisposed to using data models, scripting or require knowledge of a query language, Project Neo takes a different tack. It aims to bypass this approach by enabling users to ask questions in plain English against a raw dataset (including CSV or Excel files) and return results in the form of interactive visualisations.
  • Machine learning is way easier than it looks | Inside Intercom – Like all of the best frameworks we have for understanding our world, e.g. Newton’s Laws of Motion, Jobs to be Done, Supply & Demand — the best ideas and concepts in machine learning are simple. The majority of literature on machine learning, however, is riddled with complex notation, formulae and superfluous language. It puts walls up around fundamentally simple ideas.

    Let’s take a practical example. Say we wanted to include a “you might also like” section at the bottom of this post. How would we go about that?

  • Where Are My AWS Logs? – Logentries Blog – Over my time at Logentries, we’ve had users contact us about where to find their logs while they were setting up Logentries. As a result, we recently released a feature for Amazon Web Services called the AWS Connector, which automatically discovers your log files across your Linux EC2 instances, no matter how many instances you have. Finding your linux logs however may only be a first step in the process as AWS logs can be all over the map… so to speak…. So where are they located? Here’s where you can start to find some of these.
  • Responsive Log Management… Like Beauty, it’s in the Eye of the Bug-holder | – As a software engineer, I’m responsible for the code I write and responsible for what we ship. But designing, building, and deploying SaaS is a real challenge – it means software developers are now responsible for making sure the live system runs well too. This is a real challenge, but with Loggly I get real-time telemetry on how my code is running, how my systems are behaving – and how well our software meets the need of our customers.
  • Mahout Explained in 5 Minutes or Less – blog.credera.com – In the spectrum of big data tools, Apache Mahout is a machine-learning engine that fits into the data mining category of the big data landscape. It is one of the more interesting tools in the big data toolbox because it allows you to extract actionable tasks from a big data set. What do we mean by actionable tasks? Things such as purchase recommendations based on a similar customer’s buying habits, or determining whether a user comment is spam based on the word clusters it contains.
  • Change management using Evolven’s IT Operations Analytics – TechRepublic – Evolven is designed to track and report change across an array of operating systems, databases, servers, and more to help pinpoint inconsistencies. It can also assist you in preventing issues and determining root causes of problems. Evolven can be helpful with automation—to find out why things didn’t work as expected and what to do next—and can also alert you to suspicious or unauthorized changes in your environment.

    Human and technological policies go hand-in-hand to balance each other and ensure the best possible results. Whereas my last article on the subject referenced the human processes IT departments should follow during change management, I’ll now take a look at technology that can back those processes up by examining what Evolven does and what benefits it can bring

  • Fluentd vs Logstash – Jason Wilder’s Blog – Fluentd and Logstash are two open-source projects that focus on the problem of centralized logs. Both projects address the collection and transport aspect of centralized logging using different approaches.

    This post will walk through a sample deployment to see how each differs from the other. We’ll look at the dependencies, features, deployment architecture and potential issues. The point is not to figure out which one is the best, but rather to see which one would be a better fit for your environment.

  • astanway/crucible · GitHub – Crucible is a refinement and feedback suite for algorithm testing. It was designed to be used to create anomaly detection algorithms, but it is very simple and can probably be extended to work with your particular domain. It evolved out of a need to test and rapidly generate standardized feedback for iterating on anomaly detection algorithms.
  • Now in Public Beta – Log Search & Log Watch | The AppFirst Blog – The decision to open our new log applications to the public was not one taken lightly. Giving our customers the ability to search all of their log files for any keywords is quite taxing on our system, so we had to take several precautions. To ensure the reliability of our entire architecture, we decided to create a separate web server solely responsible for retrieving log data from our persistence storage HBase. By making this an isolated subsystem, we don’t run the risk of a potentially large query bogging everything else down as well.
  • Log Insight: Remote Syslog Architectures | VMware Cloud Management – VMware Blogs – When architecting a syslog solution, it is important to understand the requirements both from a business and a product perspective. I would like to discuss the different remote syslog architectures that are possible when using vCenter Log Insight.
  • Why We Need a New Model for Anomaly Detection: #1 | Metafor Software –

    I’m not talking about anomaly detection in stable enterprise IT environments. Those are doing just fine. Those infrastructures have mature, tested procedures for rolling out software updates and implementing new applications on an infrequent basis (still running FORTRAN written in the 70s, on servers from the 70s, yeah, that’s a thing).

    I’m talking about anomaly detection in the cloud, where the number of virtual machines fluctuates as often as application roll outs. Current solutions for anomaly detection track dozens or even hundreds of metrics per server in an attempt to divine normal performance and spot anomalous behavior. An ideal solution would adapt itself to the quirks of each metric, to different application scenarios, and to machine re-configurations.

    This is a problem that lends itself to machine learning techniques, but it’s still an incredibly difficult problem to solve. Why?

  • Beyond The Pretty Charts – A Report From #devopsdays in Austin | Metafor Software – Don’t just look at timeline charts. We’ve fallen into the trap of looking at all the pretty charts as time series charts. When we do that, we end up missing some important characteristics. For example, a simple histogram of the data, instead of just a time chart, can tell you a lot about anomalies and distribution. Using different kinds of visualization is crucial to giving us a different aspect on our data.
  • Server Anomaly Detection | Predictive IT Analytics | Config Drift Monitoring | Metafor Software – Know about problems before your threshold based monitoring tool does. Get alerted to issues your thresholds will never catch.

    Metafor’s machine learning algorithms alert you to anomalous behavior in your servers, clusters, applications, and KPIs.


If you’d like to catch up, check out the first three posts in this tutorial, starting here.

The easiest way to get started is by having a good understanding of the structure of your Netcool events. With a fairly default deployment we know there are a number of standard alerts.status fields of interest, such as first and last occurrence, node, agent, alert group, alert key and manager, to name a few. Nearly every customer I have ever worked with has extended their alerts.status schema to accommodate the various probe and gateway level integrations they have, as well as to support event enrichment, auto-ticketing, etc.

There’s definitely a level of maturity here that needs to be understood through brief analysis of your events via the AEL. Which slots are you populating with a high degree of completeness? Which ones help you understand the context of an event beyond the node name? Which ones are used to determine if the event is ever acted upon? Which ones will help you assess the event streams, ask questions and take actions on investigating event validity within your environment? Your goal is to ensure you have the best possible set of fields that will enable your event analysis, event analytics and most importantly the decisions, actions and next steps you will be able to take based upon your analysis.
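One way to put a number on "a high degree of completeness" is to measure, per field, the fraction of events in which that field is actually populated. The sketch below assumes you have already exported a sample of events into a list of dictionaries keyed by field name. The function name `field_completeness` and that input shape are my own assumptions, not anything provided by Netcool/OMNIbus.

```python
from collections import Counter


def field_completeness(events):
    """Return {field: fraction of events in which the field is non-empty}."""
    total = len(events)
    filled = Counter()
    for event in events:
        for name, value in event.items():
            # None, empty strings and whitespace count as "not populated".
            # A numeric 0 (for example Severity 0) still counts as populated.
            if value is not None and str(value).strip() != "":
                filled[name] += 1
    names = {name for event in events for name in event}
    return {
        name: (filled[name] / total if total else 0.0)
        for name in sorted(names)
    }
```

Fields that score close to 1.0 are good candidates for the gateway map. Fields that are almost always empty probably are not worth sending.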

One place you can get a complete snapshot of the alerts.status configuration is the ../omnibus/var/Tivoli_eif.NCOMS.alerts.status.def file. I used this to get the list of all the field names for easy copy and paste when building my socket gateway mapping file.

With the fields of interest identified, download and install the Netcool/OMNIbus socket gateway in accordance with the install instructions in the docs. If you don’t already own the socket gateway, check with your sales rep. In most cases since you’re using it to route events from one C&SI product to another, there isn’t a charge. But, IANAL and T&C’s change with the wind so check. If you have a problem with this, ping me and I can suggest a number of other alternative approaches.

Once installed, the first configuration activity is to update the gateway’s socket.map file with the fields you’re interested in.

  • Make a backup copy of the original.
  • Remove the default fields you’re not interested in.
  • Add fields you are interested in.
  • Place the fields in a logical order.
  • NOTE: I’m placing the @Identifier first as the socket gateway inserts an event type (INSERT, UPDATE, DELETE) in front of each event it sends across so we don’t want that to mess up any other slot.

This is the socket map I used within our pretty default environment when sending events from ITM, APM/ITCAM, BSM, etc. For a bare bones set up, the ones I’ve highlighted in bold are probably good enough to get started.

'' = '@Identifier',
'' = '@LastOccurrence' CONVERT TO DATE,
'' = '@FirstOccurrence' CONVERT TO DATE,
'' = '@Node',

'' = '@NodeAlias',
'' = '@Summary',
'' = '@Severity',
'' = '@Manager',
'' = '@Agent',
'' = '@AlertGroup',
'' = '@AlertKey',
'' = '@Type',
'' = '@Tally',
'' = '@Class',
'' = '@Grade',

'' = '@Location',
'' = '@ITMDisplayItem',
'' = '@ITMEventData',
'' = '@ITMTime',
'' = '@ITMHostname',
'' = '@ITMSitType',
'' = '@ITMThruNode',
'' = '@ITMSitGroup',
'' = '@ITMSitFullName',
'' = '@ITMApplLabel',
'' = '@ITMSitOrigin',
'' = '@CAM_Application_Name',
'' = '@CAM_Transaction_Name',
'' = '@CAM_SubTransaction_Name',
'' = '@CAM_Client_Name',
'' = '@CAM_Server_Name',
'' = '@CAM_Profile_Name',
'' = '@CAM_Response_Time',
'' = '@CAM_Percent_Available',
'' = '@CAM_Expected_Value',
'' = '@CAM_Actual_Value',
'' = '@CAM_Details',
'' = '@CAM_Total_Requests',
'' = '@BSMAccelerator_Service',
'' = '@BSMAccelerator_Function'
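Because the gateway puts the event type in front of the first mapped field, anything reading the stream has to peel that prefix off before treating the rest as CSV. The following is a minimal Python sketch of that step, assuming each line looks like `INSERT: "identifier",...` and the values arrive in map order. The `FIELDS` list (shortened to the first four fields) and the `parse_gateway_line` helper are hypothetical names for illustration.

```python
import csv
import io

# Must mirror the order of the socket map; only the first few fields are listed.
FIELDS = ["Identifier", "LastOccurrence", "FirstOccurrence", "Node"]
EVENT_TYPES = ("INSERT", "UPDATE", "DELETE")


def parse_gateway_line(line, fields=FIELDS):
    """Split off the leading event type, then map CSV values to field names."""
    event_type, sep, rest = line.partition(":")
    event_type = event_type.strip()
    if not sep or event_type not in EVENT_TYPES:
        raise ValueError("missing event type prefix: %r" % line[:40])
    # The csv module handles quoted values that contain commas or colons.
    reader = csv.reader(io.StringIO(rest.strip()))
    values = next(reader, [])
    record = dict(zip(fields, values))
    record["EventType"] = event_type
    return record
```

Keeping the identifier first means the prefix only ever needs to be separated from one well-known field, which is why the split on the first colon is safe here even though the identifier itself contains colons.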

Next, we need to set up some simple filtering to control the event types we send across the gateway. The socket.reader.tblrep.def file defines what comes across the socket gateway and what filters we might want to apply. Here are a couple of examples I’ve used.

Send only INSERTS and UPDATES (not DELETES, as they don’t send across the entire event structure) and filter out all of the internal TBSM events, which are Class 12000.

USING MAP 'StatusMap'
FILTER WITH 'Class !=12000';

Send only INSERTS and UPDATES (not DELETES, as they don’t send across the entire event structure) and filter out events with Severity 0, 1 and 2.

USING MAP 'StatusMap'
FILTER WITH 'Severity >=3';
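Before changing a gateway filter, it can help to estimate how much of the event stream it would let through. Here is a rough Python sketch that mirrors the two filter expressions above against a sample of events held as dictionaries. The function names and the sample shape are assumptions for illustration only, and the gateway itself evaluates the real filters.

```python
def passes_class_filter(event):
    """Mirror of FILTER WITH 'Class != 12000' (drops internal TBSM events)."""
    return int(event["Class"]) != 12000


def passes_severity_filter(event):
    """Mirror of FILTER WITH 'Severity >= 3' (drops Severity 0, 1 and 2)."""
    return int(event["Severity"]) >= 3


def replay(events, predicate):
    """Return only the events the given filter would send across."""
    return [event for event in events if predicate(event)]
```

Comparing `len(replay(sample, passes_severity_filter))` with `len(sample)` gives a quick feel for the reduction in volume.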

I was unable to get a more complex filter working, which I would have liked to use for finer filtering, so these had to do.

Next, the core socket gateway properties need to be configured. Edit the NCO_GATE.props file as follows.

#Update these based on your install preferences
MessageLevel : 'warn'
MessageLog : '$OMNIHOME/log/NCO_GATE.log'
Name : 'NCO_GATE'
PropsFile : '$OMNIHOME/etc/NCO_GATE.props'

#This will be the IP and Port for your logstash installation and the TCP Input you use
Gate.Socket.Host : ''
Gate.Socket.Port : 1234

#These will create a comma separated (CSV) event format with fields wrapped in " ".
Gate.Socket.EndString : '"'
Gate.Socket.StartString : '"'
Gate.Socket.Separator : ','

#This sets First/Last Occurrence format to mimic ISO8601 format supported by SCALA
Gate.Socket.DateFormat : '%Y-%m-%dT%H:%M:%S%Z'
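The date format above uses strftime-style directives, so its output is easy to preview. Note that `%Z` emits a time zone abbreviation such as EDT rather than a numeric offset, which is why the result only mimics ISO 8601. The Python sketch below formats a timestamp the same way and splits one back apart. The helper names are hypothetical, and the fixed EDT zone in the usage note is just an example value.

```python
import re
from datetime import datetime, timedelta, timezone

GATEWAY_FORMAT = "%Y-%m-%dT%H:%M:%S%Z"


def format_occurrence(moment):
    """Format a timezone-aware datetime the way the gateway is configured to."""
    return moment.strftime(GATEWAY_FORMAT)


def parse_occurrence(text):
    """Split a gateway timestamp into (naive datetime, zone abbreviation).

    The abbreviation is split off by hand because parsing %Z with strptime
    only recognises a few zone names and depends on the platform.
    """
    match = re.fullmatch(
        r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})([A-Za-z]*)", text
    )
    if match is None:
        raise ValueError("unexpected timestamp: %r" % text)
    stamp, zone = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), zone
```

For example, with `edt = timezone(timedelta(hours=-4), "EDT")`, calling `format_occurrence(datetime(2013, 9, 27, 13, 46, 44, tzinfo=edt))` gives `2013-09-27T13:46:44EDT`.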

Here’s how to start the socket gateway for reference later. We’ll need the remote end of the TCP connection to be started up first.

../omnibus/bin/nco_g_socket &

You can check that your gateway is running by running the ps aux | grep nco_g command. To stop the gateway, kill the process.
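If the remote end is not ready yet and you just want to confirm that the gateway connects and sends something, a throwaway TCP listener can stand in for it. This is a minimal Python sketch using the standard `socketserver` module. The `GatewayLineHandler` and `make_server` names are hypothetical, and port 1234 simply matches the Gate.Socket.Port value above. It is a test aid, not a replacement for Logstash.

```python
import socketserver


class GatewayLineHandler(socketserver.StreamRequestHandler):
    """Collect and print every non-empty line one client connection sends."""

    def handle(self):
        for raw in self.rfile:
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            if line:
                self.server.received.append(line)
                print(line)


def make_server(host="127.0.0.1", port=1234):
    """Create a listener; call serve_forever() on it to start accepting."""
    server = socketserver.TCPServer((host, port), GatewayLineHandler)
    server.received = []
    return server
```

To use it, run `make_server("0.0.0.0", 1234).serve_forever()` on the box the gateway points at, start the gateway, and watch for lines beginning with INSERT or UPDATE. Stop it with Ctrl-C before starting the real listener on the same port.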

Check the output file you created on the Logstash server to verify that you’ve captured some events from the gateway. If you see some there, we’re all set for our next activity: setting up annotation and indexing of the events in SCALA v1.1.0.3.


Now that I’m done with what felt like months of work for our big demo at IBM’s IOD show last week, let me get this series done! Next up, we’ll walk through the use of Logstash as the collection and mediation tool for streaming in events from Netcool/OMNIbus and getting them indexed within SCALA v1.1.0.3. We’re still using Logstash v1.1.13 here. We’ll have support for Logstash v1.2.x in our next release very soon. NOTE: With SCALA v1.1.0.3 now available, that will be what I mention moving forward.

To catch up, check out part 1 and part 2.

On a separate system if at all possible, prepare for installation of Logstash v1.1.13 and the SCALA Logstash toolkit.

  • Download logstash v1.1.13 from here
  • Create a new directory for the logstash environment. I generally create /opt/logstash.
  • Copy the SCALA Logstash Toolkit to this directory
  • Review the SCALA Logstash Toolkit installation steps
  • Explode the SCALA Logstash Toolkit
  • Copy the logstash-1.1.13-flatjar.jar package to the /opt/logstash/lstoolkit directory
  • Update the install configuration file install-scala-logstash.conf
  • Update the eif.conf file
  • Run the ./install-scala-logstash.sh script.

The lstoolkit directory contains the following files:

- LogstashLogAnalysis_v1.1.0.0.zip
- install-scala-logstash.conf
- startlogstash-scala.sh
- install-scala-logstash.sh
- logstash-1.1.13-flatjar.jar
- start-logstash.conf
- logstash/

- conf/
-- logstash-scala.conf
- outputs/
-- eif-
-- scala_custom_eif.rb
- unity/

Next, we need to make a few simple changes to the Logstash configuration file to get us up and running. In this simple scenario, the Logstash configuration file should be updated to look similar to this:

#Create your TCP input which your Netcool/OMNIbus socket gateway will connect to
input {
  tcp {
    type => "netcool"
    format => "plain"
    port => 1234
    data_timeout => -1
  }
} #End of Inputs

filter {
  #Use the Mutate filter to set the hostname and log path to anything you want. This is used in the SCALA LogSource definition.
  mutate {
    type => "netcool"
  }

  #Have some events you want to drop out? I used the Grep filter type to filter out some poorly formatted events whose summary message included commas which broke SCALA DSV processing
  grep {
    type => "netcool"
    match => [ "@message", ".*WAS_YN_WebAppNoActivity_W.* | .*WAS_YN_WebAppActivity_H.*" ]
    negate => true
  }
} #End of Filters

output {
  #Create a simple output file of all your raw CSV delimited events for future use, replay, etc.
  file {
    type => "netcool"
    message_format => "%{@message}"
    path => "/opt/logstash/raw-events-csv.log"
  }

  #Create one or more outputs to spray events to as many SCALA boxes as you'd like
  scala_custom_eif {
    eif_config => "logstash/outputs/eif-"
    debug_log => "/tmp/scala/scala-logstash-"
    debug_level => "debug"
  }
} #End of Outputs
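The drop rule in the filters is worth trying against a few sample messages before relying on it: with negate set to true, any event whose message matches the pattern is dropped and everything else passes through. The Python sketch below models that behaviour. It deliberately uses a simplified pattern, on the assumption that the intent is to drop events mentioning either situation name, and `keep_event` is a hypothetical helper name. One thing to keep in mind is that in most regex dialects, unless an extended mode is switched on, spaces around `|` are literal characters that must be present in the message for that branch to match.

```python
import re

# Simplified pattern (an assumption): match either situation name anywhere.
DROP_PATTERN = re.compile(
    r"WAS_YN_WebAppNoActivity_W|WAS_YN_WebAppActivity_H"
)


def keep_event(message):
    """Return False for messages a negated match on DROP_PATTERN would drop."""
    return DROP_PATTERN.search(message) is None
```

Running a saved copy of the raw event file through `keep_event` line by line is a cheap way to see exactly which events would never reach SCALA.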

Note: If you have multiple SCALA systems, you can spray events to each of them by having more than one output stanza for the scala_custom_eif plugin. Each one must have its own unique eif_config and debug_log configurations. I just put in the IP address of my end points to easily identify each one.

To start up Logstash, use the ./startlogstash-scala.sh script. You may wish to update this to send Logstash to the background when starting up. To stop Logstash, use ps aux | grep logstash and kill the Logstash process.

Once we complete the next series of tasks in Netcool/OMNIbus, we can peek at the output file we created via Logstash and see raw CSV events that resemble the example below. This is what’s sent across the socket gateway.

INSERT: "WAS_YN_EJBConNoActivity_W:syswasslesNode01:syswassles:KYNS::ITM_EJB_Containers",
2013-09-27T13:46:44EDT,
2013-09-27T13:46:44EDT,
"WAS_YN_EJBConNoActivity_W[(Method_Invocation_Rate=0.000 ) ON syswasslesNode01:syswassles:KYNS (Method_Invocation_Rate=0 )]",
"tivoli_eif probe on systbsmsles",
"09/27/2013 08:29:45.000",

This is the event passed in from the TCP Input and through the filters to the scala_custom_eif output:

2013-09-27T13:46:42.601000 #21554] DEBUG -- : scala_custom_eif: Received event: #"tcp://",


"@message"=>"INSERT: \"WAS_YN_EJBConNoActivity_W:syswasslesNode01:syswassles:KYNS::ITM_EJB_Containers\",2013-09-27T13:46:44EDT,2013-09-27T13:46:44EDT,\"syswasslesNode01:syswassles:KYNS\",\"syswasslesNode01:syswassles:KYNS\",\"WAS_YN_EJBConNoActivity_W[(Method_Invocation_Rate=0.000 ) ON syswasslesNode01:syswassles:KYNS (Method_Invocation_Rate=0 )]\",1,\"tivoli_eif probe on systbsmsles\",\"ITM\",\"ITM_EJB_Containers\",\"WAS_YN_EJBConNoActivity_W\",20,2,6601,1,\"\",\"\",\"~\",\"09/27/2013 08:29:45.000\",\"sysitm.poc.ibm.com\",\"S\",\"TEMS\",\"\",\"WAS_YN_EJBConNoActivity_W\",\"\",\"syswasslesNode01:syswassles:KYNS\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",0,\"\",\"\"\n",

This is the event sent out of the scala_custom_eif output in the IBM Event Integration Framework (EIF) format fit for consumption by the SCALA EIF Receiver.

2013-09-27T13:46:42.602000 #21554] DEBUG -- : scala_custom_eif: Sending tec event: AllRecords;hostname='s3systbsmsles';RemoteHost='';text='INSERT: "WAS_YN_EJBConNoActivity_W:syswasslesNode01:syswassles:KYNS::ITM_EJB_Containers",
2013-09-27T13:46:44EDT,
2013-09-27T13:46:44EDT,
"WAS_YN_EJBConNoActivity_W[(Method_Invocation_Rate=0.000 ) ON syswasslesNode01:syswassles:KYNS (Method_Invocation_Rate=0 )]",
"tivoli_eif probe on systbsmsles",
"09/27/2013 08:29:45.000",

Logstash is far more powerful than what I’ve shown in this very simple example. I’d encourage you to investigate its capabilities further through the website, user group or IRC.

Up next, we’ll walk through the configuration of Netcool/OMNIbus and get our events flowing towards Logstash and SCALA.


Wish I was there to see this talk on how Loggly has evolved at the AWS re:Invent show! Very impressive scale numbers (EPS) for logging geeks out there. Check out their use of tools like Kafka, Storm and ElasticSearch in this deck. This is definitely something anyone planning on building or buying “logging as a service” needs to review.


Jason Wilder published a nice overview and comparison of Logstash and Fluentd today and it’s well worth a read if you’re looking for a tool to help with data (log, metric, event, etc.) collection, mediation and routing.

We chose to use Logstash as part of our integration and mediation toolkit for our IT Operations Analytics (ITOA) portfolio and appreciate the flexibility it offers.


I’m liking what these guys at Metafor Software are doing in the IT Operations Analytics area. They’re out there in the community really talking about this stuff and the value it can provide. Lots of participation in key community events and they have a great blog with easy to read, intuitive posts explaining things so even I can understand. Watch some of their presentation videos!

Adding them to my watch list for sure!

Check out their website here and their blog.

Others in this space: Netuitive, NEC, Prelert, IBM, Evolven, BMC, VMware. Others?


These are my links for October 1st through November 18th:

  • Subscribing to the WebSphere MQ FTE Transfer log topic (Computing minutiea – notes from a small island.) – A customer was just asking me about the 'transfer log' for WebSphere MQ File Transfer Edition. They had mistakenly took the term to refer to a log file which contained a record of all transfers. In fact, WMQ FTE provides for auditing by publishing transfer-related information on a well-known topic name, namely "SYSTEM.FTE/Log/".
    I explained to them that there are two built in facilities which take advantage of these publications to provide a full auditing solution.

    The WMQ Explorer plug-in for FTE, subscribes and presents the information in a tabular report.
    The Database logger, subscribes and then stores the information in database tables.

  • Splunk Drives Operational Intelligence with Amazon Web Services – availability of new Amazon Machine Images (AMIs) for Splunk® Enterprise 6 and Hunk™: Splunk Analytics for Hadoop. The new AMIs further accelerate the speed at which organizations can deploy Splunk software and gain critical visibility into their cloud-based applications and data. Splunk also released the new version of the Splunk App for Amazon Web Services (AWS), which leverages the newly announced AWS CloudTrail, a new service that logs all AWS API calls, to enable organizations to improve monitoring, security and compliance across all applications and infrastructure running in AWS. The Splunk Enterprise AMI and Hunk AMI are available in the AWS Marketplace. The Splunk App for AWS is available on Splunk Apps.
  • Splunk and Prelert Predict: What’s the Difference? – In conclusion, the Prelert Anomaly Detective is different from Splunk’s ‘predict’ command in the following ways:

    Less false alerts on data with non-Gaussian profiles;
    Easily scales to analyze multiple items simultaneously, even across sourcetypes;
    Automatically scores each anomaly based on the severity of the deviations; and
    Can be easily operationalized to a Real-Time search (with alerts) without manually having to read/write summary indexes.

    Prelert seeks to complement Splunk, as Anomaly Detective extends the capabilities of Splunk’s “predict” by filtering out noise, analyzing multiple items simultaneously, and isolating true anomalies without setting thresholds.

  • The-Field-Guide-to-Data-Science/ at master · booz-allen-hamilton/The-Field-Guide-to-Data-Science · GitHub – We cannot capture all that is Data Science. Nor can we keep up – the pace at which this field progresses outdates work as fast as it is produced. As a result, we have opened this field guide to the world as a living document to bend and grow with the community, technology, expertise, and evolving techniques. Therefore, if you find the guide to be useful, neat, or even lacking, then we encourage you to add your expertise, including:

    Case studies from which you have learned
    Citations for journal articles or papers that inspire you
    Algorithms and techniques that you love
    Your thoughts and comments on other people’s additions

  • csvfix – CSVfix is a tool for manipulating CSV data – Google Project Hosting – CSVfix is a command-line tool specifically designed to deal with CSV data. With it you can, among other things:

    Reorder, remove, split and merge fields
    Convert case, trim leading & trailing spaces
    Search for specific content using regular expressions
    Filter out duplicate data or data on exclusion lists
    Enrich with data from other sources
    Add sequence numbers and file source information
    Split large CSV files into smaller files based on field contents
    Perform arithmetic calculations on individual fields
    Validate CSV data against a collection of validation rules
    Convert between CSV and fixed format, XML, SQL and DSV

  • RDataMining.com: R and Data Mining – This website presents documents, examples, tutorials and resources on R and data mining
  • IBM IT Operations Analytics: Solving the IT Operations Big Data Challenge – YouTube – IBM IT Operations Analytics: Solving the IT Operations Big Data Challenge
  • IBM IT Operations Analytics: Achieving Actionable Insights from IT Operations Big Data – YouTube – IBM IT Operations Analytics: Achieving Actionable Insights from IT Operations Big Data
  • IT Operations Analytics: The Magic Inside – Video Blog – YouTube – IT Operations Analytics: The Magic Inside
  • Agile Insights: Big Data Capture & Software Analytics | New Relic – Software Analytics is about gathering billions and billions of metrics from your live production software, including user clickstreams, mobile activity, end user experiences and transactions, and then making sense of those — providing you with business insights. Software analytics includes Application Performance Management, but extends to User Behavior, Business Transactions, Customer Insights and much, much more.
  • iis and logstash – IIS grok pattern
  • Lumberjack – a Light Weight Log Shipper for Logstash | beingasysadmin – Lumberjack is one such input plugin designed for logstash. Though the plugin is still in beta state, i decided to give it a try. By default we can also use logstash itself for shipping logs to centralized Logstash server, the JVM made it difficult to work with many of my constrained machines. Lumberjack claims to be a light weight log shipper which uses SSL and we can add custom fields for each line of log which we ships.
  • Machine Learning Platform – Text Analysis Service | Datumbox – Power-up your own Intelligent Applications by using our cutting edge Machine Learning platform. Sign-up today and start building intelligent services with our powerful & easy-to-use API.
  • BigML – Machine Learning Made Easy – Easily add data-driven decisions and predictive power to your company
  • Forecasting: principles and practice | OTexts – This textbook is intended to provide a comprehensive introduction to forecasting methods and to present enough information about each method for readers to be able to use them sensibly. We don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details. The book is written for three audiences: (1) people finding themselves doing forecasting in business when they may not have had any formal training in the area; (2) undergraduate students studying business; (3) MBA students doing a forecasting elective. We use it ourselves for a second-year subject for students undertaking a Bachelor of Commerce degree at Monash University, Australia.
  • Is Splunk Cloud a Cop-Out? – “It's a tacit admission by Splunk that Storm isn't really competitive with the new breed of log management SaaS guys like Loggly, SumoLogic, Logentries, etc. They’ve got some work to do with Splunk Cloud, including on the pricing model, but I do expect it to be a formidable offering in the still very nascent log management SaaS space. I'd be careful to sell Splunk short, they're still the team to beat in this space. And there's still a lot of opportunity out there.”
  • Orange – Data Mining Fruitful & Fun – Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.
  • Apache Mahout: Scalable machine learning and data mining – Currently Mahout supports mainly four use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from exisiting categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies, which individual items usually appear together.
  • Did Splunk Just Surrender on SaaS? | – Today, it appears that Splunk has thrown in the towel on Software as a Service (SaaS) and replaced Splunk Storm with a hosted software model. We were always skeptical that a company with such a phenomenally successful enterprise software business would disrupt its own business with a serious SaaS offering. And with today’s announcement of Splunk Cloud now it seems that the doubts were justified.
  • Using ElasticSearch And Logstash To Serve Billions Of Searchable Events For Customers | Blog | Elasticsearch – There are quite a bit of projects and services out there focussed on logging events. We ultimately picked Logstash, a tool for collecting, parsing, mangling and passing on logs.

    Internally, the events pushed out via our webhooks are also used in other parts of our system. We currently use Redis for this. Logstash has a Redis input plugin that retrieves log events from a Redis list. After some minor filtering, the events can then be sent out via an output plugin. A very commonly used output plugin is the Elasticsearch plugin.

    A great way to use Elasticsearch’s very rich API is by setting up Kibana, a tool to “make sense of a mountain of logs”. The new incarnation, Kibana 3, is fully client side JavaScript, and will be the default interface for Logstash. Unlike previous versions, it no longer depends on a Logstash-like schema, but is now usable for any Elasticsearch index.