Times are changing in the IT world: the most popular monitoring stacks now consist of roughly six to eight tools. A typical stack combines an open source monitoring tool (Nagios, Zabbix, or Icinga) with an APM tool (New Relic, AppDynamics, or Dynatrace), an error detection tool such as Sentry, and a log analytics tool (Splunk or Logstash), plus other specialized or custom tools.
While the proliferation of monitoring tools offers organizations unprecedented insight into their application performance, it has inadvertently created a new problem: some of today’s most popular monitoring tools are also the noisiest. Take Nagios, for example: frustrated with noisy alerts, many teams find themselves shopping for an alternative. However, the cost of switching is high, and most Nagios alternatives don’t offer a fundamentally better trade-off between granular visibility and alert noise.
But the truth is, there’s a better way to manage alert floods: alert correlation. Alert correlation tools can be used on top of your existing monitoring platforms to fight alert overload and boost production health. Let’s discuss some best practices for implementing a solid strategy:
- Centralize and normalize your alerts
Too much time is wasted shuffling between consoles with different interfaces and jargon. Select a solution that centralizes and normalizes your alerts into a unified model so that they “speak the same language”, and then automatically clusters them into related incidents, letting you reduce noise and spot critical issues faster (a minimal sketch of such a model follows this list).
- Correlate alerts with code deployments and infrastructure changes
It’s not enough to know that you have a problem; you also need to know why it’s happening and what to do about it. Correlating your alerts with your code deployments and infrastructure changes arms you with contextual data (e.g., metrics, runbooks, and relationships) so you can quickly pinpoint the root of the problem (see the second sketch after this list).
- Integrate with your outbound workflows
Once you’ve identified a clustered incident, you’ll want to empower your teams to act quickly. Integrating with a collaboration tool (Slack, HipChat) or ticketing platform (Jira, ServiceNow) lets you automatically route information about the incident through the correct workflows. And because you are only passing along consolidated incidents, rather than every single alert, you’ll avoid noisy on-call pages and ticketing clutter (the third sketch below shows one such outbound hook).
- Proactively identify potential issues
The ultimate “promise” of alert correlation is not only to identify the root cause of critical issues and resolve them faster, but to detect underlying issues before they have the chance to cause a real problem. Too often, alert fatigue leads Ops teams to miss patterns in low-severity alerts that are precursors to high-severity incidents: a trickle of CPU load warnings, for instance, can quickly evolve into a full outage. By ignoring low-severity alerts, you force yourself into a reactive mode; by automatically clustering them into related incidents, you can spot “brewing storms” before they significantly impact production (the final sketch below shows a simple version of this check).
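To make the first practice concrete, here is a minimal sketch of a unified alert model and a naive clustering rule. The field names, the severity mapping, and the service-plus-time-window grouping are illustrative assumptions, not any particular vendor’s schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

@dataclass
class NormalizedAlert:
    source: str       # e.g. "nagios", "newrelic", "sentry"
    service: str      # service or host the alert refers to
    severity: str     # unified scale, e.g. "low" | "high" | "critical"
    summary: str
    timestamp: datetime

def normalize_nagios(raw: dict) -> NormalizedAlert:
    """Map an assumed Nagios-style payload onto the unified model."""
    return NormalizedAlert(
        source="nagios",
        service=raw["host_name"],
        severity={"WARNING": "low", "CRITICAL": "critical"}.get(raw["state"], "low"),
        summary=raw["output"],
        timestamp=datetime.fromtimestamp(raw["timestamp"]),
    )

def cluster_by_service(alerts: Iterable[NormalizedAlert],
                       window: timedelta = timedelta(minutes=10)) -> list[list[NormalizedAlert]]:
    """Group alerts that hit the same service within a short time window."""
    incidents: list[list[NormalizedAlert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            if (incident[-1].service == alert.service
                    and alert.timestamp - incident[-1].timestamp <= window):
                incident.append(alert)
                break
        else:
            incidents.append([alert])
    return incidents
```

Real correlation engines use far richer signals (topology, text similarity, history), but even this toy grouping collapses a burst of per-host alerts into a handful of incidents.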
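For the second practice, correlation with changes can start as simply as matching an incident’s service and start time against a feed of recent deployments. The Deployment record and the 30-minute lookback are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    service: str
    version: str
    deployed_at: datetime

def recent_changes(service: str, incident_start: datetime,
                   deployments: list[Deployment],
                   lookback: timedelta = timedelta(minutes=30)) -> list[Deployment]:
    """Return deployments to the affected service shortly before the incident began."""
    return [
        d for d in deployments
        if d.service == service
        and incident_start - lookback <= d.deployed_at <= incident_start
    ]
```

Attaching the result to the incident record answers the responder’s first question, “what changed?”, without a manual dig through the CI/CD history.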
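For the third practice, an outbound hook might look like the sketch below: one Slack message per consolidated incident rather than a page per alert. It uses Slack’s incoming-webhook JSON format; the webhook URL is a placeholder and error handling is kept to a minimum.

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_incident(service: str, alert_count: int, top_summary: str) -> None:
    """Post one message per consolidated incident, summarizing the alerts it groups."""
    payload = {
        "text": (f":rotating_light: Incident on *{service}* "
                 f"({alert_count} correlated alerts)\n> {top_summary}")
    }
    response = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
    response.raise_for_status()
```

The same shape works for a ticketing platform: create one Jira or ServiceNow ticket per incident and append later alerts to it instead of opening new tickets.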
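Finally, a “brewing storm” check can be as simple as counting recent low-severity alerts per service, reusing the NormalizedAlert model from the first sketch. The one-hour window and five-alert threshold are arbitrary values chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def brewing_storms(alerts: "list[NormalizedAlert]",  # unified model from the first sketch
                   now: datetime,
                   window: timedelta = timedelta(hours=1),
                   threshold: int = 5) -> list[str]:
    """Return services with an unusual pile-up of recent low-severity alerts."""
    recent_low = [
        a.service for a in alerts
        if a.severity == "low" and now - a.timestamp <= window
    ]
    return [service for service, count in Counter(recent_low).items()
            if count >= threshold]
```

Flagged services are exactly the places to look before the next critical page arrives.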
While implementing an alert correlation strategy certainly has immediate advantages, such as reducing pager fatigue and MTTR, the long-term benefits are perhaps even more important. By surfacing insights that are buried deep inside your stream of noisy alerts, you’ll be able to identify potential issues before they have a chance to cause damage, allowing your team to focus on proactive efforts rather than fighting fires.