Alerting

Alerts notify you when your data does something unexpected, such as going above or below a set threshold, or stopping suddenly. Select a recipient for your notifications and you can get immediate feedback via email, PagerDuty, Slack, or HipChat when critical changes occur.

Alert Overview

Your alerts are listed in the alert overview section. We list them in three categories.

  • Healthy Alerts
    Alerts that are currently running and within acceptable boundaries.
  • Triggered Alerts
    Alerts that are currently running and outside acceptable boundaries. These alerts will have already notified you via the set notification channel.
  • Muted Alerts
    Alerts which have been manually silenced. These alerts will not notify you until they become active again.

You can click on the metric name to see a recent graph of that metric. Clicking the pencil icon or the alert name opens the edit alert dialog. The mute icon allows you to silence the alert for a certain amount of time.

Alert Report Page

Creating An Alert

Alert Name and Metric

Click the “Add Alert” button in the top bar to open the alert creation panel.

  • Name
    This name is used in notifications. It is a reminder of why you added the alert, so make it clear and descriptive, e.g. “European Servers CPU usage”.
  • Metric Pattern
    This is the data that is tested against your criteria (which you’ll add on the next screen) e.g. “my.server.cpu”.

You can check a graph of your desired metric with the “Check Metric Graph” button. When you’re finished, click on “Confirm Metric Choice” to proceed to the Alert Criteria screen.

Adding an Alert

Set the Alert Name and Metric

Alert Criteria Panel

There are three ways to define the criteria that will result in a notification being sent.

  • Outside of Bounds
    An alert notification will be sent if the metric data you’ve selected goes either above the “above” threshold or below the “below” threshold. This is useful when your data fits inside an expected range, e.g. the response time of a web server.
  • Below / Above a Threshold
    If you enter only one of the “above” or “below” values, only that threshold is checked. This is useful when there’s an upper or lower bound that the data should not cross, for example the CPU load of a server.
  • Missing
    An alert notification will be sent to you if the metric does not arrive at all for a certain time period. This is useful for detecting when a system goes down entirely.
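
The three criteria types above can be sketched in a few lines of Python. This is only an illustration of the logic, not the service's implementation: the function names are hypothetical, and whether a value exactly equal to a threshold counts as crossing it is an assumption (here it does not).

```python
from typing import Optional


def check_thresholds(
    value: float,
    above: Optional[float] = None,
    below: Optional[float] = None,
) -> bool:
    """Return True if the value breaches a configured threshold.

    Setting both thresholds models "Outside of Bounds"; setting only
    one models "Below / Above a Threshold".
    Assumption: the comparison is strict (equal does not trigger).
    """
    if above is not None and value > above:
        return True
    if below is not None and value < below:
        return True
    return False


def check_missing(seconds_since_last_point: float, window_seconds: float) -> bool:
    """Return True if no data has arrived for the whole window ("Missing")."""
    return seconds_since_last_point >= window_seconds
```

For example, `check_thresholds(250.0, above=200.0, below=10.0)` reports a breach for a response time outside the 10 to 200 range, while `check_thresholds(0.5, above=0.9)` does not for a CPU load under its upper bound.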

The Alerting Notification Interval lets you control how often you are notified about an alert.

  • On state change
    A notification will be sent only when the alert transitions from healthy to triggered or vice versa. An alert that continues alerting will not send subsequent notifications.
  • Every
    A notification will be sent when the alert triggers and recovers. Subsequent notifications will then be paused for the configured time period. This allows you to stop ‘flapping’ behaviour that would give you lots of notifications in a short period of time.
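
The difference between the two interval settings can be illustrated with a small state machine. This is a sketch of one possible reading of the descriptions above, not the service's actual behaviour: the `Notifier` class is hypothetical, and it assumes that in “Every” mode a state change arriving inside the pause window is simply not notified.

```python
from typing import Optional


class Notifier:
    """Hypothetical model of the two notification interval settings."""

    def __init__(self, pause_seconds: Optional[float] = None):
        # None models "On state change"; a number models "Every <period>".
        self.pause_seconds = pause_seconds
        self.state = "healthy"
        self.last_sent: Optional[float] = None

    def observe(self, new_state: str, now: float) -> bool:
        """Record the alert state at time `now` (in seconds).

        Returns True if a notification would be sent.
        """
        changed = new_state != self.state
        self.state = new_state
        if not changed:
            # An alert that keeps alerting sends nothing further.
            return False
        in_pause = (
            self.pause_seconds is not None
            and self.last_sent is not None
            and now - self.last_sent < self.pause_seconds
        )
        if in_pause:
            # Assumed behaviour: flapping inside the pause window is suppressed.
            return False
        self.last_sent = now
        return True
```

With `Notifier()` every transition produces a notification. With `Notifier(pause_seconds=300)` an alert that triggers at 0 seconds and flaps at 30 and 60 seconds notifies only once, until a transition occurs after the 300 second pause has elapsed.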

Set the Alert Criteria and Select your Notification Channel

Managing An Alert

From the Alert Overview page, you can hover your mouse over an individual alert to see actions related to managing it.

Editing an Alert
  • View an alert
    Click the eye icon to open the overview popup for an alert. This displays an embedded Grafana graph and a history log of the last 3 days of data. There is also a link to the Grafana composer to view more detailed information on the metric being alerted on.
  • Edit an alert
    An alert can be edited to change its metric, criteria or notification channel, but any changes may take a few minutes to take effect.
  • Mute an alert
    An alert can be silenced for a specified time period. Currently the available times are 30 minutes, 6 hours, 1 day and 1 week.
  • Delete an alert
    An alert can be deleted from your panel here. This action is irreversible.

Notification Channels

Defining a notification channel allows you to receive a notification when an alert triggers. Currently we support six different ways to notify your team when an event occurs. You can see the available notification channels and add new ones on the Notification Channel Page.

  • Email
    Send an email to your team when the alert is triggered.
  • PagerDuty
    The PagerDuty notification uses a PagerDuty integration key. See the PagerDuty documentation for how to find it.
  • Slack
    Send an immediate notification to a Slack channel. The Slack notification requires an endpoint for your channel; see the Slack documentation for details.
  • HipChat
    Send notifications and show an alert overview in a HipChat channel. You need to set up the Hosted Graphite add-on using the HipChat interface before linking it to your alerts. See our blog post for more details.
  • VictorOps
    You can send your alerts into your VictorOps hub to integrate with all your existing monitoring and alerting infrastructure. Check our VictorOps page.
  • Webhook
    Allows you to set up a webhook that we will notify with real-time information for your defined alerts.

The webhook notification will be JSON-encoded in the following format:

{
 "name": "The name of the triggered alert.",
 "criteria": "The defined alert criteria for the alert.",
 "graph": "PNG of the grafana rendered graph.",
 "value": "The current value of the metric.",
 "metric": "The name of the metric.",
 "status": "The current status of the metric."
}
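
A webhook endpoint that accepts this payload could look roughly like the following, using only the Python standard library. It is a minimal sketch under stated assumptions: that the notification arrives as an HTTP POST with the JSON document as the request body, and that the six fields shown above are always present. The names `parse_alert` and `AlertHandler` and the port number are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Field names taken from the payload format documented above.
EXPECTED_FIELDS = ("name", "criteria", "graph", "value", "metric", "status")


def parse_alert(body: bytes) -> dict:
    """Decode a webhook body and return the documented fields.

    Raises ValueError if the body is not a JSON object or lacks a field.
    """
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("payload is not a JSON object")
    missing = [field for field in EXPECTED_FIELDS if field not in payload]
    if missing:
        raise ValueError(f"payload is missing fields: {missing}")
    return {field: payload[field] for field in EXPECTED_FIELDS}


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):  # assumption: notifications are delivered via POST
        length = int(self.headers.get("Content-Length", 0))
        try:
            alert = parse_alert(self.rfile.read(length))
        except ValueError:
            # Covers invalid JSON, bad encoding and missing fields.
            self.send_response(400)
            self.end_headers()
            return
        print(f"{alert['name']}: {alert['metric']} is {alert['status']} "
              f"(value {alert['value']})")
        self.send_response(204)
        self.end_headers()


# To try it locally (port chosen arbitrarily):
# HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```

Rejecting malformed bodies with a 400 response keeps the handler from acting on partial data. How the sender reacts to non-2xx responses is not covered here.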

Auto-Resolve Notifications

For Email, HipChat, Webhook and Slack notifications, incidents are automatically resolved.

For VictorOps and PagerDuty notification channels, we can automatically resolve your alerts when they have reached a recovered state. This can be enabled on the Notification Channel Page or via our API, as outlined in our alerting API docs.

Troubleshooting your Alerts

The Alerting feature is new and in active development. Please contact support if you think you’ve found a bug, or have any questions or suggestions.

  • Is your metric arriving?
    If you are not receiving notifications as expected, please check the alert overview page and select the alert in question. You can use this to check that the metric values for the last few hours are as expected.
  • Are some events being ignored?
    We alert on a 30-second resolution. This means that finer data (5-second data, for example) is averaged, and we alert on the 30-second aggregate.
  • Why is my wildcard metric not alerting?
    If an alert metric contains a wildcard (e.g. webservers.*) and is set to notify when data is missing, the alert will only trigger when all metrics grouped by that wildcard are missing.
    Similarly, if an alert is set to notify when a wildcard metric crosses a specified threshold, then it will trigger when one or more associated metrics cross that threshold.
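
The last two points can be made concrete with a short sketch. It shows how averaging into 30-second buckets can hide a brief spike, and how the “all” versus “one or more” rule applies to wildcard metrics. The function names and sample data are hypothetical, and aligning buckets to multiples of 30 seconds is an assumption.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple


def aggregate_30s(
    points: Sequence[Tuple[int, float]], resolution: int = 30
) -> Dict[int, float]:
    """Average (timestamp, value) points into buckets of `resolution` seconds."""
    buckets: Dict[int, List[float]] = {}
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % resolution
        buckets.setdefault(bucket_start, []).append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def wildcard_missing(latest: Dict[str, Optional[float]]) -> bool:
    """A missing-data alert triggers only when every matched metric is missing."""
    return all(value is None for value in latest.values())


def wildcard_above(latest: Dict[str, Optional[float]], threshold: float) -> bool:
    """A threshold alert triggers when one or more matched metrics cross it."""
    return any(value is not None and value > threshold for value in latest.values())


# A single 5-second spike of 90 averages out to 15 over the 30-second bucket,
# so an alert with an "above" threshold of 50 would not see it.
spike = [(0, 0.0), (5, 0.0), (10, 90.0), (15, 0.0), (20, 0.0), (25, 0.0)]
averaged = aggregate_30s(spike)

# One server still reporting is enough to keep a missing-data alert quiet.
servers = {"webservers.a": None, "webservers.b": 0.4}
still_reporting = not wildcard_missing(servers)
```

If a short spike matters, sending the metric at a coarser interval or alerting on a separately tracked maximum may be worth considering. Which options are available depends on how your metrics are sent.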