{"id":7912,"date":"2019-01-31T06:00:12","date_gmt":"2019-01-31T10:00:12","guid":{"rendered":"http:\/\/dougmcclure.net\/blog\/?p=7912"},"modified":"2019-01-31T09:47:10","modified_gmt":"2019-01-31T13:47:10","slug":"exploring-operational-impacts-of-running-a-default-nagios-pagerduty-integration","status":"publish","type":"post","link":"https:\/\/dougmcclure.net\/blog\/2019\/01\/exploring-operational-impacts-of-running-a-default-nagios-pagerduty-integration\/","title":{"rendered":"Exploring Operational Impacts of Running a Default Nagios &#8211; PagerDuty Integration"},"content":{"rendered":"\n<p>In the prior blog post, I walked through how following the PagerDuty &#8211; Nagios XI integration guides leads us to the creation of a \u201cMonitoring Service\u201d. At the end of that post, I mentioned I\u2019d talk about some of the reasons why running in this default configuration isn\u2019t best practice and how this impacts an ops team\u2019s response when using PagerDuty. I\u2019ll talk about this today as well as lay out the next few blog posts about moving to better practices when integrating Nagios with PagerDuty. 
<\/p>\n\n\n\n<p>The items called out here are by no means an exhaustive or complete list, but they do represent many of the significant problem areas I see in both small and large PagerDuty customer environments, and the areas where I spend the most time optimizing.<\/p>\n\n\n\n<p><strong><em>Your PagerDuty Foundation Isn\u2019t Ready for Event Intelligence, Visibility, Analytics and Modern Incident Response!<\/em><\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignleft is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/upload.wikimedia.org\/wikipedia\/en\/9\/95\/Crooked_house_dudley.jpg\" alt=\"\" width=\"193\" height=\"145\"\/><figcaption>&#8220;Weak Foundation?&#8221;<\/figcaption><\/figure><\/div>\n\n\n\n<p>When the sum of all the PagerDuty parts converges in a best-practice configuration, PagerDuty\u2019s platform capabilities ensure people (responder, team lead, manager, exec, etc.) receive notifications with the right context at the right time so the appropriate response can be taken.<\/p>\n\n\n\n<p>If the context conveyed via a PagerDuty service and incoming events is overly generalized or named after a monitoring tool like &#8220;Nagios Service&#8221;, the ability to respond with the right urgency and understanding (context), and then take the appropriate action, can be significantly impacted.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><em>For example, if an on-call responder is paged at 3 AM for a problem with the \u201cNagios Service\u201d, what\u2019s the appropriate response? Does the &#8220;SERVICE_DESC&#8221; in your Nagios alerts prompt the desired response?<\/em><\/li><li><em>If the MTTA\/R is increasing for the \u201cNagios Service\u201d, what is the root cause? Is it a single server, a systemic problem across everything, or something specific to certain teams?<\/em><\/li><li><em>There can only be one Escalation Policy (EP) for that \u201cMonitoring Service\u201d. 
This means all events from Nagios into your &#8220;Nagios Service&#8221; go to the same responder(s) or schedule(s)! If you own it all, great, but chances are you&#8217;ve got many responsible groups to deal with.<\/em><\/li><li><em>There can only be one <strong>automated<\/strong> Response Play for that \u201cMonitoring Service\u201d. Mature operations teams seek to automate operational response where seconds count, using very specific Response Plays for applications, functional technology types, and specific teams or responders. This isn\u2019t possible with a single automated Response Play on your \u201cMonitoring Service\u201d. Don&#8217;t hit the big red panic button for 60% disk full events! (* Multiple Response Plays can be configured and launched manually via the Incident UI or Mobile App.)<\/em><\/li><li><em>Responder Notification (Urgency) is broadly applied (High Urgency by default &#8211; aka the &#8220;Wake You Up at 3 AM&#8221; setting) to everything that may be coming in, rather than specifically applied based on the required response. Maybe you want to use PagerDuty&#8217;s Dynamic Notifications on that &#8220;Monitoring Service&#8221;, but do you &#8216;trust&#8217; that incoming events have a severity that accurately maps to the needed urgency of an on-call responder&#8217;s response? I&#8217;ll bet you&#8217;re probably sending in everything as &#8216;CRITICAL&#8217; 
anyway. If 1 of 20 servers in your web tier has a &#8216;CRITICAL&#8217; disk failure &#8211; does that warrant a high-urgency page at 3 AM?<\/em><\/li><\/ul>\n\n\n\n<p><strong><em>Sometimes Things [ Are | Are NOT ] Better Together!<\/em><\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignleft is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/data.whicdn.com\/images\/251841015\/superthumb.jpg?t=1469721452\" alt=\"\" width=\"202\" height=\"168\"\/><figcaption>mmmm bacon&#8230;<\/figcaption><\/figure><\/div>\n\n\n\n<p>One of our better practices is to use Alert Grouping as a means of controlling the noise from poorly configured thresholds or alerting logic, or just plain old \u201cSHTF\u201d situations and alert storms that happen in any ops environment. Without the use of something like PagerDuty&#8217;s Time-Based Alert Grouping (TBAG), every single incoming event results in a unique incident, which sends notifications to the on-call responder(s), rinse and repeat for every&#8230;single&#8230;event.<\/p>\n\n\n\n<p>The situation that many customers fear when talking about &#8220;smart stuff&#8221; that is supposed to do whiz-bang grouping, correlating or other AI-ML-EIEIO magic is that the wrong alerts are grouped and things get missed. <\/p>\n\n\n\n<p>PagerDuty TBAG is a hard, time-based approach to group things, so if a Network Link 5% Packet Loss event (BFD!) happens in the same time window as a MySQL Process Failure event (Oh, shit!), those things likely don\u2019t relate, yet they are grouped together. 
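<\/p>\n\n\n\n<p><em>To see why purely time-based grouping can bury a critical alert, here&#8217;s a toy sketch (my own illustration, not PagerDuty&#8217;s actual implementation) of the windowing behavior: any alert arriving within the window of a group&#8217;s first alert joins that group, regardless of content:<\/em><\/p>

```python
# Toy sketch of time-based alert grouping (illustration only -- this is
# NOT PagerDuty's actual TBAG code, just the windowing idea).

def group_by_window(alerts, window_seconds):
    """Group (timestamp, summary) alerts purely by arrival time: an alert
    joins the current group if it lands within window_seconds of that
    group's first alert; otherwise it starts a new group (incident)."""
    groups = []
    for ts, summary in sorted(alerts):
        if groups and ts - groups[-1][0][0] <= window_seconds:
            groups[-1].append((ts, summary))  # same window: merged, content ignored
        else:
            groups.append([(ts, summary)])    # window expired: new incident
    return groups

alerts = [
    (0,   "Network Link 5% Packet Loss"),  # minor, but arrives first
    (90,  "MySQL Process Failure"),        # critical, lands in the same window
    (700, "Disk 60% Full"),                # outside the window: new incident
]
for group in group_by_window(alerts, window_seconds=300):
    print(group[0][1], "+", len(group) - 1, "grouped alert(s)")
```

<p><em>In this sketch the MySQL failure only surfaces inside the packet-loss incident&#8217;s group &#8211; which is exactly the risk.<\/em><\/p>\n\n\n\n<p>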
The first event&#8217;s description becomes the incident&#8217;s description, someone is paged for the Network Link 5% Packet Loss item, and the on-call responder dismisses that incident because a quick scan of it in the mobile app doesn\u2019t prompt closer investigation or an urgent response. All the while, the critical, business-impacting MySQL Process Failure alert goes unnoticed as it\u2019s grouped in with the Network Link 5% Packet Loss incident. See why the concern? Not a fun discussion with the boss&#8230;<\/p>\n\n\n\n<p>PagerDuty&#8217;s Intelligent Alert Grouping (IAG) \u2018learns\u2019 based upon historical TBAG grouping and responders manually merging alerts into incidents. If IAG makes sense in your future (and it will, unless responders are dedicated to doing this manual correlation and merging, which can be challenging within the PD Alert UI), you won&#8217;t want to taint what IAG learns with the bogus groups that broad-based \u201cMonitoring Services\u201d can produce.<\/p>\n\n\n\n<p>Net net: you probably don&#8217;t want to use alert grouping on big, broad-based &#8220;Monitoring Services&#8221; for fear that things are grouped incorrectly and something uber important is missed. <\/p>\n\n\n\n<p><strong><em>The Journey along the \u201cSignal to Insight to Action\u201d Path Leads to a Peaceful On-Call Experience!<\/em><\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignleft is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/714\/1*OrfT4OOCuXzxuraUOKRFUg.jpeg\" alt=\"\" width=\"218\" height=\"152\"\/><figcaption>You can get there&#8230;<\/figcaption><\/figure><\/div>\n\n\n\n<p>All PagerDuty customers are entitled to use Global Event Routing and certain Global Event Rules to process and route incoming events to the appropriate service. 
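<\/p>\n\n\n\n<p><em>As a sketch of what that looks like from the Nagios side, the snippet below shapes a Nagios alert as a PagerDuty Events API v2 event; the routing key is a placeholder, and the state-to-severity mapping is my own illustration rather than the stock integration&#8217;s:<\/em><\/p>

```python
import json

# Placeholder key: with Global Event Routing, one routing key feeds your
# global rules, which then pick the destination service.
ROUTING_KEY = "YOUR_GLOBAL_ROUTING_KEY"

# Illustrative mapping from Nagios states to Events API v2 severities.
SEVERITY_MAP = {"CRITICAL": "critical", "WARNING": "warning",
                "UNKNOWN": "error", "OK": "info"}

def build_event(host, service_desc, state, plugin_output):
    """Shape a Nagios service alert as an Events API v2 'trigger' event."""
    return {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "dedup_key": f"{host}:{service_desc}",  # one incident per host/service pair
        "payload": {
            "summary": f"{state}: {service_desc} on {host}",
            "source": host,
            "severity": SEVERITY_MAP.get(state, "critical"),
            "custom_details": {"plugin_output": plugin_output},
        },
    }

event = build_event("web-07", "Disk Usage /var", "WARNING",
                    "DISK WARNING - 82% used")
print(json.dumps(event, indent=2))
# POST this JSON to https://events.pagerduty.com/v2/enqueue to send it.
```

<p><em>Note the severity field &#8211; if everything arrives as &#8216;CRITICAL&#8217;, none of the routing or urgency logic downstream has anything useful to key on.<\/em><\/p>\n\n\n\n<p>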
If you&#8217;re following the default Nagios &#8211; PagerDuty integration guide and directly integrating with the service, you&#8217;re bypassing this very powerful feature.<\/p>\n\n\n\n<p>Building upon this basic Global Event Routing capability is the broader Event Intelligence offering, with its own associated Global Event Rules providing a growing toolbox of capabilities to deal with the operational realities of your environment. <\/p>\n\n\n\n<p><strong><em>When deployed properly, you&#8217;ll efficiently move from signal to insight to action by ensuring the right events land on the right services at the right time, so the right responder\/team has the right context to take the right action.<\/em><\/strong> Whew, that&#8217;s a mouthful &#8211; but that&#8217;s the real goal here, right? If you could avoid waking up Fred, Sally and Shika at 3 AM with non-actionable, low-urgency events, WHY WOULDN&#8217;T YOU WANT TO DO THAT?<\/p>\n\n\n\n<p><strong>Any of this sound familiar?<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><em>Alert fatigue from too much noise in your monitoring tools? No problem, there\u2019s a rule for dealing with that!<\/em><\/li><li><em>False positive alerts waking people up at 3 AM due to recurring maintenance windows? No problem, there\u2019s a rule for dealing with that!<\/em><\/li><li><em>Crappy alert metadata leading to missed issues or long MTTA\/R because on-call responders don\u2019t grok what the alert is trying to tell them or don\u2019t know what to do next? 
No problem, there\u2019s a rule for dealing with that!<\/em><\/li><\/ul>\n\n\n\n<p><strong><em>Don\u2019t Let \u201cBusiness As Usual\u201d or &#8220;We&#8217;ve Always Done it This Way&#8221; Hold You Back!<\/em><\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignleft is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/powerupyourmarketing.com\/wp-content\/uploads\/2018\/05\/Negative_Patterns.jpg\" alt=\"\" width=\"216\" height=\"144\"\/><figcaption>Insanity?<\/figcaption><\/figure><\/div>\n\n\n\n<p>Imagine a situation where Nagios is deployed and monitoring ALL of your infrastructure &#8211; dozens, hundreds, maybe thousands of nodes, services, interfaces, URLs, etc. <strong>This would take considerable time and effort to move away from! (Worse, you probably have at least a dozen tools all set up similarly with PagerDuty!)<\/strong><\/p>\n\n\n\n<p>Imagine the sheer amount of manual work to move from your \u201cBusiness as Usual\u201d configuration of Nagios and PagerDuty &#8220;Monitoring Services&#8221; to something better &#8211; maybe you&#8217;re nervously thinking about how you might unpack your &#8220;Nagios Service&#8221; &#8211; it may go something like this:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><em>Discover and map out exactly what\u2019s being monitored by Nagios <strong>&#8211; &#8220;I think Nagios Ned can help me with that\u2026&#8221;<\/strong><\/em><\/li><li><em>Discover server, application, \u2018thing\u2019 owners <strong>&#8211; &#8220;ugh, I have to talk with that group\/person&#8230;&#8221;<\/strong><\/em><\/li><li><em>Discover context of what that &#8216;thing&#8217; does, what it supports, and what is impacted when problems are found <strong>&#8211; &#8220;uh oh, I\u2019m feeling really uncomfortable\u2026&#8221;<\/strong><\/em><\/li><li><em>Discover what the appropriate operational response needs to be for all event classes\/types and who\u2019s responsible <strong>&#8211; &#8220;more 
meetings\u2026fml&#8230;&#8221;<\/strong><\/em><\/li><li><em>Translate all of the above to appropriate PagerDuty configurations following best practices <strong>&#8211; &#8220;a whole lot of point+click coming my way\u2026&#8221;<\/strong><\/em><\/li><li><em>\u2026<\/em><\/li><\/ul>\n\n\n\n<p><em>There is a much better way and I have some \u2018magic pixie dust\u2019 that can help you optimize this!<\/em><\/p>\n\n\n\n<p><strong><em>Where do we go from here?<\/em><\/strong><\/p>\n\n\n\n<p>The next few posts I&#8217;ve got in mind build out something like this:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Growing up from the Nagios &#8211; PagerDuty defaults &#8211; Crawling away from the default &#8220;Monitoring Service&#8221;<\/li><li>Introducing the Global Event Routing API &#8211; Walking in with your eyes wide open<\/li><li>Extending Nagios with Custom Attributes &#8211; Running with PagerDuty like a champ<\/li><li>Applying Event Intelligence to improve your Nagios + PagerDuty experience for on-call responders<\/li><li>Magic Pixie Dust &#8211; How PagerDuty can help you ADAPT to a better way of doing things in ops and on-call when using Nagios<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In the prior blog post, I walked through how following the PagerDuty &#8211; Nagios XI integration guides leads us to the creation of a \u201cMonitoring Service\u201d. 
At the end of that post, I mentioned I\u2019d talk about some of the reasons why running in this default configuration isn\u2019t best practice and how this impacts an [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1079,1082,1081,1084,1077,1078,1080],"tags":[931,929,1088,1087,1073,1072],"class_list":{"0":"post-7912","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-event-intelligence","7":"category-event-routing","8":"category-event-rules","9":"category-integrations","10":"category-pagerduty","11":"category-pagerduty-best-practices","12":"category-service-design","13":"tag-best-practices","14":"tag-events","15":"tag-incident-response","16":"tag-nagios","17":"tag-nagios-xi","18":"tag-pagerduty"},"_links":{"self":[{"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/posts\/7912","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/comments?post=7912"}],"version-history":[{"count":5,"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/posts\/7912\/revisions"}],"predecessor-version":[{"id":7936,"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/posts\/7912\/revisions\/7936"}],"wp:attachment":[{"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/media?parent=7912"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/categories?post=7912"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dougmcclure.net\/blog\/wp-json\/wp\/v2\/tags?post=7912"}],"curies":[{"name":"wp","hr
ef":"https:\/\/api.w.org\/{rel}","templated":true}]}}