
dougmcclure.net

thoughts on business, service and technology operations and management in the digital transformation era

(Photo: logging in Finnish Lapland)

To catch up on this series, start back here.

We started with a simple deployment planning activity – get some servers built so we can get the necessary software installed, then start hacking some configurations together to allow sample data to be collected and indexed and some simple searches and visualizations from that data to be available. Enter fun activity #1 – SoftLayer and associated VPNs. I would do anything for some way to stay logged into my SoftLayer VPN for longer than a day! Nothing pains me more than having 20 or more PuTTY sessions up and then getting kicked off the VPN when the 24 hour timer ends. Please tell me how I can change this behavior or some magical way to reconnect the VPN and all those PuTTY sessions!

My partner Hao on the Dev/Ops team sent me some sample logs from a handful of Cloud Foundry (CF) components, including the Cloud Controller and DEA components. Turns out each CF component type and end point system (CCI – SoftLayer Cloud Computing Instance) could have upwards of 10-20 different log types. I started out as I do with any project like this and began to immerse myself in the log samples. Before I was asked to start, Hao had been kicking the tires on our Log Analysis solution on his own and hit some of the fundamental challenges in this space in terms of how to get logs in. He specifically called out logs wrapped in JSON, logs with timestamps in Epoch format, and logs without timestamps as a few of his key challenges.

I think most of us in this space start out in a common way – find the timestamps, find the message boundaries, find the unique message patterns that might exist within each log type, and then think about what meaningful data should be extracted from each log type to enable problem isolation and resolution activities. I talk to clients about finding the sweet spot in their log analysis architecture along the spectrum from simple consolidation and archival of everything, to enabling everything for search, to gaining high-value insights from log data. Just because you have all kinds of logs doesn’t mean you must send them all into your log solution or invest a lot of time trying to parse and annotate every possible log message into unique and detailed fields. The effort put into integration, parsing and annotation of logs, the frequency of use, and the value to the primary persona (Dev/Ops, IT Ops, App/Dev, etc.) are all dimensions and trade-offs to consider before starting to boil the ocean. Simple indexing and search can go a long way before deep parsing and annotation is really required across all log types.

What this means for our first few milestones is that there are a lot of log types and we don’t have any firm requirements for exactly what’s needed yet. We don’t know which logs will prove most valuable to the global BlueMix Fabric Dev/Ops team, though obviously some of these log types will be more useful in problem isolation and resolution activities. Out of two dozen unique log sources in the CF environment, we took an approach that kept things as simple as possible: bring everything in initially, getting as much data in as quickly as possible, then use iterative feedback from the global Dev/Ops team on which log types provide the most useful insights and work on parsing and annotating those for high value search, analysis and apps.

From the beginning of my discussions with the Dev/Ops team, we needed to keep in line with their deployment model, base images and automation approaches as much as possible. We didn’t want to deploy the Tivoli Log File Agent (not supported on Ubuntu anyway) or install anything else to move the log data off the end point systems. We decided to make use of the standard installation of rsyslog 4.2 (ancient!) on the Ubuntu 10.04 virtual cloud computing instances (CCIs) used within the BlueMix/CF/SoftLayer environment.

We’re using the standard rsyslog imfile module to map the 10-20 different log files per CCI into the standard rsyslog message format for shipping to a centralized rsyslog server. Each file we ship using imfile gets a functional tag assigned (e.g. DEA, CCNG), which is useful downstream for filtering and parsing. On the centralized rsyslog server we’re using a custom template to create a simple CSV output format which we send to logstash for parsing and annotation. The point here is that we took a simple approach and normalized all of the CF logs to a standard rsyslog format, which gives us a number of standardized, well-formatted slots – including one slot containing the entire original message content – to use downstream. I’ll spend a number of posts on rsyslog and logstash later!

Here’s an example of the typical imfile configuration:

### warden.log
$InputFileName /var/vcap/sys/log/warden/warden.log
$InputFileTag DEA_WARDEN_warden
$InputFileStateFile stat-DEA_WARDEN_warden
$InputFileSeverity debug
$InputFileFacility local3
$InputRunFileMonitor
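Since each of these imfile inputs lands on the local3 facility, shipping them off-box is just a matching selector plus an action queue so a dropped link doesn’t lose messages. Here’s a sketch of what that side of the config can look like – the hostname, spool path and queue settings are illustrative, not our actual deployment values:

```
# Spool directory for rsyslog queue files (illustrative path)
$WorkDirectory /var/spool/rsyslog

# Disk-assisted in-memory queue: buffer to disk if the central server
# is unreachable, and persist the queue across restarts
$ActionQueueType LinkedList
$ActionQueueFileName fwd_central
$ActionQueueSaveOnShutdown on

# @@ = TCP, @ = UDP; forward everything tagged onto local3
local3.* @@central-rsyslog.example.com:514
```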

This is the template we use to take the incoming rsyslog stream from all of the CCIs and turn it into a simple CSV formatted message structure. In reality I could even simplify this a bit more by removing a couple of the fields I’m ultimately not indexing in the Log Analysis solution.

template(name="scalaLogFormatDSV" type="list") {
property(name="timestamp" dateFormat="rfc3339" position.from="1" position.to="19")
constant(value="Z,")
property(name="hostname")
constant(value=",")
property(name="fromhost")
constant(value=",")
property(name="syslogtag")
constant(value=",")
property(name="programname")
constant(value=",")
property(name="procid")
constant(value=",")
property(name="syslogfacility-text")
constant(value=",")
property(name="syslogseverity-text")
constant(value=",")
property(name="app-name")
constant(value=",")
property(name="msg")
constant(value="\n")
}

This gives a simple output message format like this for each of the 24 or more log types in this environment.

2014-08-07T14:42:17Z,localhost,10.x.x.x,DEA_WARDEN_warden,DEA_WARDEN_warden,-,local3,debug,DEA_WARDEN_warden, {"timestamp":1407422537.8179543,"message":"info (took 0.065380)","log_level":"debug","source":"Warden::Container::Linux","data":{"handle":"123foobar","request":{"handle":"123foobar"},"response":{"state":"active","events":[],"host_ip":"10.x.x.x","container_ip":"10.x.x.x","container_path":"/var/vcap/data/warden/depot/123foobar","memory_stat":"#","cpu_stat":"#","disk_stat":"#","bandwidth_stat":"#","job_ids":[123foobar]}},"thread_id":123foobar,"fiber_id":123foobar,"process_id":123foobar,"file":"/var/vcap/data/packages/warden/43.2/warden/lib/warden/container/base.rb","lineno":300,"method":"dispatch"}
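To give a sense of how these normalized lines can be handled downstream, here’s a sketch of a logstash filter for this CSV format. The field names are my own and the JSON/date handling is illustrative, not our actual configuration:

```
filter {
  # grok rather than the csv filter, since the trailing msg slot can
  # itself contain commas (as in the JSON payload above)
  grok {
    match => [ "message",
      "^%{TIMESTAMP_ISO8601:syslog_time},%{DATA:hostname},%{DATA:fromhost},%{DATA:syslogtag},%{DATA:programname},%{DATA:procid},%{DATA:facility},%{DATA:severity},%{DATA:appname},%{GREEDYDATA:payload}$" ]
  }
  # many CF components wrap their payload in JSON with an epoch timestamp
  if [payload] =~ /^\s*\{/ {
    json {
      source => "payload"
      target => "cf"
    }
    date {
      match => [ "[cf][timestamp]", "UNIX" ]
    }
  }
}
```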

So much like the picture in this blog post, we’ve got lots of logs in all shapes and sizes. Some are huge, some are small, and we’re getting them into a uniform format (e.g. uniform size/length for the log truck) in preparation for getting the most value from them (e.g. lumber). Up next, shipping an aggregated stream to logstash for parsing these normalized messages.


SCALA Log Analysis Tuning App v2

As part of this BlueMix Fabric Log Solution project, getting visibility into everything in my log solution architecture is pretty important. I’ve got a lot of instrumentation across the end-to-end pipeline, so metrics are overflowing in my environment. I started work on a simple app to pull all of this together so I can trend and visualize it over time and see the impact of tuning activities.

This is a first cut at pulling some of the metrics out of the Distributed EIF Receiver / Unity Generic Receiver logs. I’m shipping them with the logstash-forwarder, parsing them in logstash and sending them to the internal elasticsearch server for easy search and visualization using kibana. ELK at its finest!
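The receiving side of that pipeline looks roughly like this – a sketch only, with illustrative port, certificate paths and index name rather than my actual deployment config:

```
input {
  # logstash-forwarder ships over the lumberjack protocol, TLS required
  lumberjack {
    port => 5043
    ssl_certificate => "/etc/pki/logstash-forwarder.crt"
    ssl_key => "/etc/pki/logstash-forwarder.key"
  }
}
output {
  # daily indices make retention pruning and kibana time-slicing easy
  elasticsearch {
    host => "localhost"
    protocol => "http"
    index => "scala-metrics-%{+YYYY.MM.dd}"
  }
}
```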

I’ll update this as I go for other SCALA logs as well as others I’m using frequently such as rsyslog impstats.

Ping me or check my github soon for the configurations.


A few months back, I was asked to help deploy our Log Analysis solution for our BlueMix Fabric Dev/Ops team. Their pain point – getting value and insights from massive amounts of Cloud Foundry (CF) log data across multiple development, staging and production environments in order to provide a highly available BlueMix offering. No problem, I thought. A log is a log is a log. I’d done this a number of times for various applications and technologies using our Log Analysis solution. Three months later, as we move this into an environment most closely resembling our production environment, things have gotten very interesting, to say the least, in terms of designing for a scale-out log solution supporting 100s to 1,000s of GB of log data each day.

I want to share my journey here on my blog so that others who choose to do something similar may benefit, whether in their own Cloud Foundry environment or within other very large application or technology environments using our Log Analysis solution. Parts of this are certainly reusable with other similar log collection, consolidation, search and visualization solutions available today and are not all dependent on use of the IBM Log Analysis solution. The overall architecture and design approach, decisions and many of the configurations are reusable for anyone desiring to design, build and deploy a log management solution using modern products, tools and techniques.

Within most highly dynamic and growth projects, start-ups, etc., management and monitoring stuff is often an afterthought, a “we’ll get to it later” kind of thing. There were no firm business or technical requirements guiding us as we began this project. I think everyone on the global Dev/Ops team knew it should be done, and many were trying to attack the problem with scripts and one-off tools to help them keep their heads above water and deal with daily problems. What we need, expect and desire from the solution has evolved over each milestone of this project and will continue to as more of the global Dev/Ops team begins to use the solution on a daily basis. There are, however, a few fundamental architecture and design goals that I’ve anchored my work on this project around, based on our early experiences in the project:

Architecture Design Goals – We didn’t start with these from day zero, but they quickly became the focus of our work as we discovered the operational characteristics of each BlueMix Cloud Foundry environment.

  • Support a sustained message volume of XX MB|GB/s <– not hiding numbers here, we just haven’t set a target yet!
  • Message delivery quality of XX %
  • End to end source to search availability in XX minutes – when a record is written, when is it available for search?
  • Absorb a sustained burst in message volume of XX MB|GB/s over XX minutes
  • Process all rsyslog disk assist cache and/or buffers from burst within XX minutes
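The arithmetic behind those last two goals is simple enough to sketch out. The numbers below are purely hypothetical placeholders – as noted above, we haven’t set real targets yet:

```python
# Back-of-the-envelope sizing for the rsyslog disk-assist queue during
# a burst. All rates and durations are hypothetical, not project targets.

def spool_needed_mb(burst_rate_mb_s, sustained_rate_mb_s, burst_minutes):
    """Disk spool that accumulates while inflow exceeds the sustained
    drain rate of the downstream pipeline."""
    excess = burst_rate_mb_s - sustained_rate_mb_s
    return max(excess, 0) * burst_minutes * 60

def drain_minutes(spool_mb, sustained_rate_mb_s, normal_rate_mb_s):
    """Time to work off the spool once the burst ends, given the
    headroom between pipeline capacity and normal inbound load."""
    headroom = sustained_rate_mb_s - normal_rate_mb_s
    return spool_mb / (headroom * 60)

# e.g. a 25 MB/s burst for 10 minutes against a 10 MB/s pipeline
# accumulates (25-10) * 10 * 60 = 9000 MB of spool; with 5 MB/s of
# normal traffic still arriving, the 5 MB/s of headroom drains it
# in 9000 / (5*60) = 30 minutes.
spool = spool_needed_mb(25, 10, 10)
minutes = drain_minutes(spool, 10, 5)
```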

Need to Understand – With anything new, there are lots of unanswered questions and concerns from various parts of the Dev/Ops team. We need to work towards being able to answer some fundamental questions such as these.

  • Total daily message volume (GB/TB), messages/sec, network utilization
  • Total message volume by CF component type (e.g. Cloud Controller) by day/week/month
  • Total message volume by end point CCI (with a given CF component)
  • Retention period (e.g. 30 days) – system resources required, pruning frequency
  • Find the high value log types and/or messages needed for problem isolation/resolution and determine parsing/annotation requirements
  • Find the lower value log types and/or messages and enable filtering at the edge or consolidation elsewhere
Need to Answer – We need to know how to scale up the solution as the overall BlueMix offering grows.

  • How must the log solution architecture scale with expected BlueMix growth?
  • What is the impact on current solution resources when new environments, components or end points are added?
  • How to scale architecture, control costs and provide good UX across a global deployment of BlueMix environments (datacenters)?

That’s a good intro to what I’ve been up to lately along with all of the normal customer and development activity! A lot more to come for sure. Up next: The path to milestone 1 – Sample Cloud Controller and DEA Logs.
