A Better Way of Zabbix Alerting Using Slack

Get only the alerts you need in Slack.

Is there a way to get Zabbix alerts in Slack that is clear and concise? Well, there is.

Every self-respecting company has some sort of alerting, be it Zabbix, Nagios, Datadog or something else. On the other hand, most major companies also use Slack for direct communication between colleagues and customers. More experienced users will set up file sharing or DevOps integrations. Some will even be tempted to send notifications to users when certain actions happen. Let’s call that alerting for Slack.

Back in the day, we used to get tons of Zabbix alert e-mails, most of which would end up in a user’s mailbox. That mailbox would quickly hit the Exchange mailbox limit, the user would get frustrated by the number of alerts, stop reading them, move them to a separate folder, and so on, all defeating the purpose of alerting. Worst case: we would get so many alerts that the mail queue would fill up and the Exchange server would become completely unresponsive. Luckily, we don’t do this anymore.


We tried the same with SMS alerts. That failed wonderfully for almost the same reason.

We never even started with alerting via Slack. It would just clutter up the channel with messages, and people would get lost or would have to scroll too much to see if they had missed something. Using a separate channel might help, but you don’t want to scroll two or more pages to catch up on alerts that might have already been handled.

A few weeks ago, I was playing with the Zabbix API and thinking: ‘What if Zabbix were a bit smarter and could use some of the powerful features of the Slack API?’

The most important feature to (ab)use here is the ability to delete messages. This would be the process (a rough code sketch of the decision logic follows the list):

  • Zabbix generates an alert for an event and sends it to a script (like it usually does for external alerts).

  • This script talks with the Zabbix API and figures out:

    • Is this a new alert? Then just post it into a channel.

    • Is it a recurring alert for the same event? Then delete the previously posted message and repost a new one with a counter (e.g. “alerted 7 times”).

    • Is this a handled alert (the status is OK)? Then post this message to the channel and shortly after remove all messages concerning the same event from this channel.

  • Configure the channel retention to 1 day. You’re not meant to talk in this channel; if you do anyway, Slack will clean up any crud older than a day. Alerts will never be that old, since our Zabbix trigger action is configured to re-alert every 30 minutes (so older messages get removed and the new alert is reposted).
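
Here is that decision logic as a rough sketch, with the Slack calls abstracted away into hypothetical helper methods (this is not the actual tool’s code):

import java.util.List;

// Sketch of the alert-handling decision logic described above. The three
// helper methods are placeholders for the real Slack API calls.
public abstract class AlertHandler {

    /** Timestamps of Slack messages already posted for this event (found via the "#ZBID:<id>" tag). */
    protected abstract List<String> findMessageTimestampsForEvent(String eventId);

    protected abstract void postMessage(String text);

    protected abstract void deleteMessage(String timestamp);

    public void handle(String eventId, boolean isRecovery, String text, int alertCount) {
        List<String> previous = findMessageTimestampsForEvent(eventId);

        if (isRecovery) {
            // Problem resolved: post the OK message, then remove everything
            // related to this event so the channel stays clean.
            postMessage("OK: " + text + " #ZBID:" + eventId + ":R");
            previous.forEach(this::deleteMessage);
        } else if (previous.isEmpty()) {
            // New alert: just post it into the channel.
            postMessage(text + " #ZBID:" + eventId);
        } else {
            // Recurring alert: delete the old message and repost with a counter.
            previous.forEach(this::deleteMessage);
            postMessage(text + " (alerted " + alertCount + " times) #ZBID:" + eventId);
        }
    }
}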

We also avoided using the bloated attachments from Zabbix. They are too large and we wanted to keep the message short and concise.

Why would you want this?
  • Quick bird’s-eye view. If the channel is empty, nothing is going on. If it’s blood red, you have a serious problem (or you just spilled tomato juice on your screen).

  • Alerts are never more than one page scroll away. It is quite unusual to have more than a whole page of alerts.

  • Slack notifications and the unread indicator make sure you notice an alert even quicker than you would an e-mail in your mailbox.

  • You can @mention other users for a specific alert, so they can act quicker.

How did we do this?

In Zabbix, we created a media type for Slack; the script behind it is a Java program. Why Java? Because most of the work I do these days involves Java, and I didn’t want to spend too much time coding this tool (I clocked in at four hours).

Java also allows us to use this tool on other platforms, and there was a good Slack API client I could use.

In our Debian install, the scripts are located under /home/zabbix/bin, which is where we placed a symlink to the Java binary, sparing us yet another intermediate wrapper script.

root@server:/home/zabbix/bin# ls -al java
lrwxrwxrwx 1 zabbix zabbix 22 Nov  8 18:19 java -> /etc/alternatives/java
root@server:/home/zabbix/bin#

The remaining media type arguments are passed straight through to the Java binary and the .jar.

The program is packaged into a fat jar using the Spring Boot Maven plugin’s repackage goal.

<plugin>
	<groupId>org.springframework.boot</groupId>
	<artifactId>spring-boot-maven-plugin</artifactId>
	<version>2.1.0.RELEASE</version>
	<executions>
		<execution>
			<goals>
				<goal>repackage</goal>
			</goals>
			<configuration>
				<mainClass>com.foreach.ZabbixSlack</mainClass>
			</configuration>
		</execution>
	</executions>
</plugin>

To be able to find the messages for a particular Zabbix event, we had to include something unique to that event in the message we send to Slack. We ended up using the event ID {EVENT.ID}, which you can add to the message in the Zabbix trigger action.

Our trigger action looks like this:

[Image: Zabbix trigger action example]

Note the #ZBID:{EVENT.ID} and #ZBID:{EVENT.ID}:R tags, which signal to the script whether it should handle a normal event or a recovery event.
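
For illustration, pulling that tag back out of a Slack message could look like this (a self-contained sketch, not necessarily how the actual tool does it):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract the Zabbix event ID and the recovery flag from a Slack message.
// The "#ZBID:<event id>" and "#ZBID:<event id>:R" tags come from the Zabbix action configuration.
public final class ZbidTag {

    private static final Pattern TAG = Pattern.compile("#ZBID:(\\d+)(:R)?");

    public static void main(String[] args) {
        Matcher matcher = TAG.matcher("Disk space low on web01 #ZBID:746387025:R");
        if (matcher.find()) {
            String eventId = matcher.group(1);           // "746387025"
            boolean recovery = matcher.group(2) != null; // true -> recovery event
            System.out.println(eventId + " recovery=" + recovery);
        }
    }
}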

The rest is a piece of cake: you can use the search.messages method to find the messages containing a #ZBID:746387025 string and delete or update them, depending on the scenario.
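
The real program uses the allbegray/slack-api client for this, but as a rough sketch, the two relevant calls against the raw Slack Web API could look like this with Java’s built-in HTTP client (the token is a placeholder and JSON parsing is left out):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Sketch: locate and delete Slack messages for a Zabbix event using the raw
// Slack Web API (search.messages and chat.delete).
public class SlackCleanup {

    private static final String TOKEN = System.getenv("SLACK_TOKEN"); // token with search scope
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    static String searchMessages(String eventId) throws Exception {
        String query = URLEncoder.encode("#ZBID:" + eventId, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://slack.com/api/search.messages?query=" + query))
                .header("Authorization", "Bearer " + TOKEN)
                .GET()
                .build();
        // The JSON response contains messages.matches[], each with a channel id and a ts;
        // parse it with a JSON library (e.g. Jackson) and feed the results to deleteMessage().
        return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    static void deleteMessage(String channelId, String ts) throws Exception {
        String body = String.format("{\"channel\":\"%s\",\"ts\":\"%s\"}", channelId, ts);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://slack.com/api/chat.delete"))
                .header("Authorization", "Bearer " + TOKEN)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HTTP.send(request, HttpResponse.BodyHandlers.ofString());
    }
}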

Beware though: the first snag I hit was that search.messages seems to have a bit of a delay before your message appears in the results (I assume it needs indexing in whatever search solution Slack uses). It can take about 20-30 seconds for a message to show up in the API result.

The first iteration of our script would poll every few seconds to see if the message had appeared, breaking out of the loop after a maximum of 60 seconds. That wasn’t very efficient: handling one recovery event would take about 30-60 seconds, and since Zabbix doesn’t send alerts asynchronously, it would build up a very long alert queue.

A simple optimisation was to use channels.history, which seems to include new messages in near real time. The call can return up to 1,000 messages, and since we keep this channel clean, it is very unlikely we would ever go past that.

Just in case we do get more results (there is a has_more flag in the API result), we fall back to the slower search.messages, without waiting or polling for long: those messages would be more than 1,000 messages old, so waiting for indexing is pointless.
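
A sketch of that lookup strategy, with the actual Slack calls hidden behind hypothetical placeholder methods:

import java.util.List;
import java.util.Optional;

// Sketch: check the near-real-time channel history first, and only fall back to
// the slower search.messages call when the history page was truncated.
public abstract class MessageLookup {

    /** One page of channels.history (up to 1000 messages) plus its has_more flag. */
    protected abstract HistoryPage fetchChannelHistory();

    /** Fallback via search.messages (subject to the indexing delay). */
    protected abstract Optional<String> searchByTag(String tag);

    public Optional<String> findMessageTs(String eventId) {
        String tag = "#ZBID:" + eventId;
        HistoryPage page = fetchChannelHistory();

        Optional<String> hit = page.messages.stream()
                .filter(message -> message.text.contains(tag))
                .map(message -> message.ts)
                .findFirst();

        if (hit.isPresent() || !page.hasMore) {
            // Found it, or the whole channel fit into one page: nothing more to do.
            return hit;
        }
        // More than 1000 messages in the channel (unlikely when it is kept clean):
        // fall back to search.messages, without polling for indexing.
        return searchByTag(tag);
    }

    public static class HistoryPage {
        public final List<HistoryMessage> messages;
        public final boolean hasMore;

        public HistoryPage(List<HistoryMessage> messages, boolean hasMore) {
            this.messages = messages;
            this.hasMore = hasMore;
        }
    }

    public static class HistoryMessage {
        public final String text;
        public final String ts;

        public HistoryMessage(String text, String ts) {
            this.text = text;
            this.ts = ts;
        }
    }
}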

A second issue was that allbegray/slack-api didn’t have a retry mechanism for when a lot of messages are sent to Slack; you would just get a SlackResponseRateLimitException. This was quickly fixed by recursively calling the same method with a back-off time that Slack provides in its API response (the Retry-After HTTP header).
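
A minimal sketch of such a retry wrapper; the real tool catches allbegray/slack-api’s SlackResponseRateLimitException, while the exception class and accessor below are stand-ins so the sketch is self-contained:

import java.util.concurrent.Callable;

// Sketch of the retry-with-back-off fix described above.
public final class RateLimitRetry {

    /** Stand-in for the rate-limit exception thrown by the Slack client on HTTP 429. */
    public static class RateLimitedException extends RuntimeException {
        final long retryAfterSeconds; // value of Slack's Retry-After response header

        public RateLimitedException(long retryAfterSeconds) {
            this.retryAfterSeconds = retryAfterSeconds;
        }
    }

    public static <T> T withRetry(Callable<T> slackCall) throws Exception {
        try {
            return slackCall.call();
        } catch (RateLimitedException rateLimited) {
            // Wait for the back-off period Slack asked for, then simply call the method again.
            Thread.sleep(rateLimited.retryAfterSeconds * 1000L);
            return withRetry(slackCall);
        }
    }
}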

Managing users

As it stands, you will have configured your trigger action, but you still need to configure a user to receive alerts via this specific media type.

In a previous setup, all users had to configure their email address and/or SMS number to receive alerts. Some people would forget to do this, would not receive alerts, and the follow-up of Zabbix alerts became cumbersome.

Now, we use LDAP authentication for our users and these users are automatically provisioned via the Zabbix API by an external program (when a user is created in Active Directory).
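
As a rough illustration (not our actual provisioning program), creating a user through the Zabbix JSON-RPC API could look like this; the URL, auth token and group id are placeholders, and parameter names such as alias differ between Zabbix versions, so check the API docs for your release:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: provision a user via the Zabbix JSON-RPC API (user.create).
public class ZabbixProvisioning {

    public static void main(String[] args) throws Exception {
        String body = "{"
                + "\"jsonrpc\":\"2.0\","
                + "\"method\":\"user.create\","
                + "\"params\":{\"alias\":\"jdoe\",\"usrgrps\":[{\"usrgrpid\":\"7\"}]},"
                + "\"auth\":\"<api token>\","
                + "\"id\":1}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://zabbix.example.com/api_jsonrpc.php"))
                .header("Content-Type", "application/json-rpc")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        String response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
        System.out.println(response);
    }
}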

The media types for these users are empty, and we want to avoid new users having to configure their email or Slack channel(s) in their Zabbix settings.

Therefore, we created an internal Zabbix-notifier user per team that wants to receive Slack alerts. Each user is linked to one or more specific Slack channels where team members can join to follow up on alerts. This way, alerts will keep working, even if a user leaves the company and is deprovisioned from the list of Zabbix users. Right now, following up on alerts is as simple as joining the Slack channel (after some poking by team members).

If you're interested in the rest of the program, give us a shout in the comments or on Twitter, or head over to GitHub for the source code.

This blog was written by the specialists at Foreach.

Foreach has since become part of iO. Want to know more? Feel free to get in touch!
