A few months back, I was asked to help deploy our Log Analysis solution for our BlueMix Fabric Dev/Ops team. Their pain point: getting value and insights from massive amounts of Cloud Foundry (CF) log data across multiple development, staging and production environments in order to provide a highly available BlueMix offering. No problem, I thought. A log is a log is a log. I’d done this a number of times for various applications and technologies using our Log Analysis solution. Three months later, as we move this into an environment that closely resembles production, things have gotten very interesting, to say the least, when it comes to designing a scale-out log solution that supports hundreds to thousands of GB of log data each day.
I want to share my journey here on my blog so that others who choose to do something similar, in their own Cloud Foundry environment or in other very large application or technology environments, can benefit. Parts of this are certainly reusable with other log collection, consolidation, search and visualization solutions available today and do not depend on the IBM Log Analysis solution. The overall architecture and design approach, the decisions and many of the configurations are reusable by anyone who wants to design, build and deploy a log management solution using modern products, tools and techniques.
In most highly dynamic, fast-growing projects, start-ups, etc., management and monitoring is often an afterthought, a “we’ll get to it later” kind of thing. There were no firm business or technical requirements guiding us as we began this project. I think everyone on the global Dev/Ops team knew it should be done, and many were attacking the problem with scripts and one-off tools to keep their heads above water and deal with daily problems. What we need, expect and desire from the solution has evolved with each milestone of this project and will continue to evolve as more of the global Dev/Ops team begins to use the solution on a daily basis. There are, however, a few fundamental architecture and design goals that I’ve anchored my work around, based on our early experiences in the project:
Architecture Design Goals – We didn’t start with these from day zero, but they quickly became the focus of our work as we discovered the operational characteristics of each BlueMix Cloud Foundry environment.
- Support a sustained message volume of XX MB|GB/s <-- not hiding numbers here, we just haven't set a target yet!
- Message delivery quality of XX %
- End to end source to search availability in XX minutes – when a record is written, when is it available for search?
- Absorb a sustained burst in message volume of XX MB|GB/s over XX minutes
- Process all rsyslog disk assist cache and/or buffers from burst within XX minutes
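The last two goals are linked: how long it takes to work off a burst depends on how much spare processing capacity is left once traffic returns to normal. Here is a back-of-the-envelope sketch of that arithmetic in Python. The function names and every number in the example are made up for illustration; they are not BlueMix targets or measurements, and the model assumes constant rates, which real traffic won't honor.

```python
def burst_backlog_mb(burst_rate_mb_s, drain_rate_mb_s, burst_minutes):
    """MB that pile up in the disk-assisted queue while a burst lasts.

    drain_rate_mb_s is the rate the downstream pipeline can actually
    process (an assumed, measured-elsewhere number).
    """
    excess_mb_s = max(0.0, burst_rate_mb_s - drain_rate_mb_s)
    return excess_mb_s * burst_minutes * 60


def drain_minutes(backlog_mb, drain_rate_mb_s, sustained_rate_mb_s):
    """Minutes needed to clear the backlog after the burst ends.

    Only the capacity left over after normal (sustained) traffic
    is available for working off the backlog.
    """
    spare_mb_s = drain_rate_mb_s - sustained_rate_mb_s
    if spare_mb_s <= 0:
        raise ValueError("no spare capacity: the backlog never drains")
    return backlog_mb / spare_mb_s / 60


# Hypothetical numbers: a 10 minute burst at 50 MB/s into a pipeline
# that processes 30 MB/s and normally receives 20 MB/s.
backlog = burst_backlog_mb(50, 30, 10)
minutes = drain_minutes(backlog, 30, 20)
print(f"backlog: {backlog:.0f} MB, drained in {minutes:.0f} minutes")
```

With these placeholder values the queue has to hold about 12 GB on disk and needs 20 minutes to catch up, which is the kind of number the XX placeholders above will eventually have to pin down.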
Need to Understand – With anything new, there are lots of unanswered questions and concerns from various parts of the Dev/Ops team. We need to work towards being able to answer some fundamental questions such as these.
- Total daily message volume (GB/TB), messages/sec, network utilization
- Total message volume by CF component type (e.g. Cloud Controller) per day/week/month
- Total message volume by end point CCI (within a given CF component)
- Retention period (e.g. 30 days) – system resources required, pruning frequency
- Find the high value log types and/or messages needed for problem isolation/resolution and determine parsing/annotation requirements
- Find the lower value log types and/or messages and enable filtering at the edge or consolidation elsewhere
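Several of these questions reduce to simple arithmetic once a daily volume is known. The sketch below shows one way to turn a daily volume into messages/sec, network utilization and retention storage. The names, the average message size and the index overhead factor are all assumptions for illustration; none of them come from our environments, and the overhead of any particular indexing product has to be measured.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_stats(gb_per_day, avg_message_bytes):
    """Return (messages/sec, Mbit/sec) averaged over a full day.

    Uses decimal GB (1 GB = 1e9 bytes); peak rates will be higher
    than these daily averages.
    """
    bytes_per_day = gb_per_day * 1e9
    msgs_per_sec = bytes_per_day / avg_message_bytes / SECONDS_PER_DAY
    mbit_per_sec = bytes_per_day * 8 / SECONDS_PER_DAY / 1e6
    return msgs_per_sec, mbit_per_sec


def retention_storage_gb(gb_per_day, retention_days, index_overhead=1.0):
    """Raw storage for the retention window, scaled by an assumed
    index overhead factor (1.0 means indexed size equals raw size)."""
    return gb_per_day * retention_days * index_overhead


# Hypothetical environment: 864 GB/day, 1,000-byte average message,
# 30 day retention, index 1.5x the size of the raw logs.
msgs, mbit = daily_volume_stats(864, 1000)
storage = retention_storage_gb(864, 30, index_overhead=1.5)
print(f"{msgs:,.0f} msgs/sec, {mbit:.0f} Mbit/s, {storage:,.0f} GB retained")
```

Running the same calculation per CF component type shows quickly which log types dominate storage, and that is a reasonable starting point for deciding what to filter at the edge.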
Need to Answer – We need to know how to scale up the solution as the overall BlueMix offering grows.
- How must the log solution architecture scale with expected BlueMix growth?
- What is the impact on current solution resources when new environments, components or end points are added?
- How do we scale the architecture, control costs and provide a good UX across a global deployment of BlueMix environments (datacenters)?
That’s a good intro to what I’ve been up to lately, along with all of the normal customer and development activity! A lot more to come for sure. Up next: The path to milestone 1 – Sample Cloud Controller and DEA Logs