Ganglia Nagios Integration

Ng Zhi An edited this page Jul 28, 2014 · 2 revisions

Ganglia Nagios integration is a new feature included with Ganglia Web 2.2.0 and later. It is based on the following implementation

[http://vuksan.com/linux/nagios_scripts.html#check_ganglia_metrics]

with the exception that it uses a shell script wrapper, which is more efficient because the PHP interpreter doesn't need to be spawned each time a metric is checked.

There are 5 different Ganglia checks

  • Check heartbeat
  • Check single metric on a specific host
  • Check multiple metrics on a specific host
  • Check multiple metrics on a range of hosts defined with a regular expression
  • Check value(s) is same on a set of hosts

Check Heartbeat

Ganglia uses heartbeat packets to determine whether a machine has gone down. The heartbeat timer is reset every time a new packet is received. This check saves you from having to use something like check_ping to make sure a machine is alive. To use this check, copy the check_heartbeat.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default

GANGLIA_URL="http://localhost/ganglia2/nagios/check_heartbeat.php"

Define the check command in Nagios. The threshold is the number of seconds since the last reported heartbeat after which a critical alert is raised.

define command {
  command_name  check_ganglia_heartbeat
  command_line  /bin/sh /var/www/html/ganglia/nagios/check_heartbeat.sh host=$HOSTADDRESS$ threshold=$ARG1$
}

Now, for every host you want monitored, change check_command to

  check_command   check_ganglia_heartbeat!50

This will mark any node that last reported to Ganglia 50 or more seconds ago as CRITICAL.
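
The threshold rule can be pictured with a minimal shell sketch. It is an illustration only, not the logic of check_heartbeat.php itself; heartbeat_state is a hypothetical helper and the ages passed to it are made-up values.

```shell
#!/bin/sh
# Illustrative sketch of the threshold rule; not the code shipped with
# Ganglia Web. heartbeat_state is a hypothetical helper name.
# Usage: heartbeat_state SECONDS_SINCE_LAST_HEARTBEAT THRESHOLD
heartbeat_state() {
  # A host is critical once its last heartbeat is THRESHOLD or more seconds old.
  if [ "$1" -ge "$2" ]; then
    echo "CRITICAL"
  else
    echo "OK"
  fi
}

heartbeat_state 12 50   # last heartbeat 12 seconds ago
heartbeat_state 75 50   # last heartbeat 75 seconds ago
```

With a threshold of 50, the first call reports OK and the second reports CRITICAL.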

Check single metric on a specific host

To use it, copy the check_ganglia_metric.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default

GANGLIA_URL="http://localhost/ganglia2/nagios/check_metric.php"

The Nagios configuration consists of defining the following command

define command {
  command_name  check_ganglia_metric
  command_line  /bin/sh /var/www/html/ganglia/nagios/check_ganglia_metric.sh host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}

Now you can use it in a service check. For instance, to be alerted if the 1-minute load average goes over 5, add the following directive

        check_command         check_ganglia_metric!load_one!more!5

To alert when free disk space drops below 10 GB

        check_command         check_ganglia_metric!disk_free!less!10

Keep in mind that the operator describes the condition that is treated as the "critical" state. For instance, with notequal the state is critical if the value is NOT equal to the critical value.
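
The operator semantics can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the code in check_metric.php: metric_state is a hypothetical helper, only the operators named on this page (more, less, notequal) are handled, and the sample values are made up.

```shell
#!/bin/sh
# Illustrative sketch of the operator semantics; not the code shipped with
# Ganglia Web. metric_state is a hypothetical helper name.
# Usage: metric_state VALUE OPERATOR CRITICAL_VALUE
metric_state() {
  awk -v v="$1" -v op="$2" -v c="$3" 'BEGIN {
    if (op == "more") {
      crit = (v + 0 > c + 0)      # critical when the value exceeds the limit
    } else if (op == "less") {
      crit = (v + 0 < c + 0)      # critical when the value drops below the limit
    } else if (op == "notequal") {
      crit = (v != c)             # critical when the value differs
    } else {
      print "UNKNOWN"             # operator not covered by this sketch
      exit
    }
    print (crit ? "CRITICAL" : "OK")
  }'
}

metric_state 6.2 more 5     # load_one of 6.2 against a limit of 5
metric_state 42 less 10     # disk_free of 42 against a limit of 10
```

The first call reports CRITICAL because 6.2 is more than 5; the second reports OK because 42 is not less than 10.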

Check multiple metrics on a specific host

Check multiple metrics is a modification of the check single metric script. It checks multiple metrics on the same host. For example, instead of separate checks for disk utilization on /, /tmp and /var, which may produce three separate alerts, you get a single alert any time disk utilization on any of them goes below or above its threshold.

To use it, copy the check_multiple_metrics.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default

GANGLIA_URL="http://localhost/ganglia2/nagios/check_multiple_metrics.php"

Then define a check command in Nagios

define command {
  command_name  check_ganglia_multiple_metrics
  command_line  /bin/sh /var/www/html/ganglia/nagios/check_multiple_metrics.sh host=$HOSTADDRESS$ checks='$ARG1$'
}

Then add a list of checks that are delimited with :. Each check consists of

metric_name,operator,critical_value

E.g.

  check_command       check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20
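
To make the format of the checks argument concrete, here is a minimal shell sketch that splits such a string into its parts. How check_multiple_metrics.php actually parses the string is not shown on this page, so treat this purely as an illustration; print_checks is a hypothetical helper.

```shell
#!/bin/sh
# Illustrative sketch: split a checks string into its parts.
# print_checks is a hypothetical helper, not part of Ganglia Web.
print_checks() {
  # Split on ":" into individual checks, then on "," into the three fields.
  printf '%s\n' "$1" | tr ':' '\n' | while IFS=, read -r metric operator critical; do
    printf 'metric=%s operator=%s critical_value=%s\n' "$metric" "$operator" "$critical"
  done
}

print_checks 'disk_free_rootfs,less,10:disk_free_tmp,less,20'
```

This prints one line per check, which is a quick way to confirm that a long checks string contains the metrics and limits you intended before putting it into a check_command.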

WARNING: A drawback of check multiple metrics is that in certain instances you may not be aware of the scale of a problem. For example, say you get an alert for /tmp nearing full. It arrives over the weekend, so you figure it's not THAT critical. After the alert, /var starts rapidly filling up, which may be really serious. Unfortunately, you will not get another alert (unless you have an aggressive notification interval), because the service is already in a critical state. Beware.

Check multiple metrics on a range of hosts defined with a regular expression

Use this check to check one or more metrics on a range of hosts defined using a regular expression. This is useful when you want to get a single alert if a particular metric is critical across a number of hosts.

To use it, copy the check_host_regex.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default

GANGLIA_URL="http://localhost/ganglia2/nagios/check_host_regex.php"

Then define a check command in Nagios

define command {
  command_name  check_ganglia_host_regex
  command_line  /bin/sh /usr/share/ganglia-web2/nagios/check_host_regex.sh hreg='$ARG1$' checks='$ARG2$'
}

Then add a list of checks that are delimited with :. Each check consists of

metric_name,operator,critical_value

For example, to check free space on / and /tmp for any machine whose name starts with web- or app-, you would use something like this

  check_command       check_ganglia_host_regex!^web-|^app-!disk_free_rootfs,less,10:disk_free_tmp,less,10
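
Before using a host pattern in a check, it can help to preview which host names it selects. The sketch below does this with grep -E. It assumes that POSIX extended regular expressions behave the same as the matching done by check_host_regex.php for simple patterns like this one; matching_hosts is a hypothetical helper and the host names are made up.

```shell
#!/bin/sh
# Illustrative sketch: preview which host names a pattern selects.
# matching_hosts is a hypothetical helper; grep -E is assumed to be close
# enough to the matching performed on the Ganglia side for simple patterns.
# Usage: matching_hosts PATTERN HOST...
matching_hosts() {
  pattern=$1
  shift
  # Print one host per line and keep only those the pattern matches.
  printf '%s\n' "$@" | grep -E -e "$pattern"
}

matching_hosts '^web-|^app-' web-01 app-02 db-01 myweb-03
```

Only web-01 and app-02 are printed. myweb-03 is skipped because the ^ anchors require the name to start with web- or app-.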

DOWNSIDES: As with check multiple metrics on a single host, the scale of a problem may not be apparent in certain situations, since only a single alert will be generated. Also, since Nagios and Ganglia are currently decoupled, you may get an alert when a machine is under scheduled maintenance and, e.g., you start writing to /tmp.

Check value(s) is same on a set of hosts

Use this check to verify that one or more metrics have the same value across a range of hosts. For example, say you want to make sure the SVN revision of the deployed code is the same across all servers. You would send the SVN revision as, e.g., a string metric, then list it as a metric that needs to be the same everywhere.

To use it, copy the check_value_same_everywhere.sh script from the nagios subdirectory in the Ganglia Web tarball. Make sure that the Ganglia Web URL inside the script is correct. This is the default

GANGLIA_URL="http://localhost/ganglia2/nagios/check_value_same_everywhere.php"

Then define a check command in Nagios

define command {
  command_name  check_ganglia_value_same_everywhere
  command_line  /bin/sh /usr/share/ganglia-web2/nagios/check_value_same_everywhere.sh hreg='$ARG1$' checks='$ARG2$'
}

e.g.

  check_command       check_ganglia_value_same_everywhere!^web-|^app-!svn_revision,num_config_files
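
The idea behind this check can be sketched in shell as counting the distinct values reported for one metric. This is an illustration only: same_everywhere is a hypothetical helper, the revision numbers are made up, and reporting a mismatch as CRITICAL is an assumption rather than something stated on this page.

```shell
#!/bin/sh
# Illustrative sketch: decide whether every host reported the same value.
# same_everywhere is a hypothetical helper, not part of Ganglia Web.
# Usage: same_everywhere VALUE_FROM_HOST_1 VALUE_FROM_HOST_2 ...
same_everywhere() {
  # Count the distinct values; tr strips padding some wc versions add.
  distinct=$(printf '%s\n' "$@" | sort -u | wc -l | tr -d ' ')
  if [ "$distinct" -eq 1 ]; then
    echo "OK"
  else
    echo "CRITICAL"   # assumption: a mismatch is treated as critical
  fi
}

same_everywhere 4711 4711 4711   # all hosts agree
same_everywhere 4711 4711 4698   # one host differs
```

The first call reports OK and the second reports CRITICAL, since one host carries a different revision.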