Monitoring plugin for checking the status of IP SLAs on Cisco devices
This plugin can check the status of one or more IP SLA entries on a Cisco IOS device. IP SLAs can be used to monitor IP service levels for various IP applications and services. See the Cisco website for more details on SLA entries and their use.
One simple usage example is a multi-connection failover routing setup, with SLAs that ping the other end of each line. When a line or route goes down, the corresponding SLA goes down as well, and this plugin can detect that. This is just one example; SLAs can be configured for various other tasks. For more information on IP SLAs, see the IP SLA section of the manual for your Cisco device, for example the manual for the Cisco 4500 series.
At the moment, the rtt-types echo, pathEcho and jitter are supported and tested (also known as icmp-echo, path-echo and udp-jitter). Other types still need to be implemented or at least tested. Suggestions and help are always welcome.
- v1.0.0 (2016-02-08)
  - Initial release
- v1.0.1 (2017-02-22)
  - Fixed a bug that appeared when OIDs were returned in text form when the RTT-MIB was installed on the system.
- v1.0.2 (2017-03-12)
  - Added round trip time perf data.
  - Added warning when checking unsupported IP SLA types.
  - IMPORTANT backward-incompatible change: repurposed the '--version' parameter from setting the SNMP version to displaying the script's version. To specify the SNMP version, use '-v' or '--snmp-version'.
- v1.1.0 (2017-06-28)
  - Added support for rtt-type jitter with MOS and ICPIF thresholds and extensive perf data
  - Removed SLA tag suffix in perf data when checking only one entry
- v1.1.1 (2017-11-20)
  - Fixed a bug regarding dict sorting when using Python 3 in combination with the parameter "--entries all"
- v1.1.2 (2020-09-16)
  - Added support for 'Average Jitter' warning and critical thresholds
- v1.1.3 (2021-05-21)
  - Fix for certain devices missing some OIDs. (Thanks to Luca Lesinigo for reporting and providing a fix)
Requirements
- Python version 2 or 3 (tested on 2.7+ and 3.4+; please read the instructions below for 3.7+)
- easysnmp (lightweight and fast SNMP library for Python, see https://github.com/fgimian/easysnmp). See https://easysnmp.readthedocs.org/en/latest/ for installation instructions.
Place the check script anywhere you'd like (e.g. /usr/local/lib/nagios/plugins) and run it.
Easysnmp and Python 3.7+
With Python 3.7+, the easysnmp library has a bug
(see issue-108).
A patch by @nerosketch is available and can be used instead.
For example, on Ubuntu 20.04, you could do this:
sudo apt install libsnmp-dev python3-pip
git clone https://github.com/nerosketch/easysnmp.git
cd easysnmp
git checkout fix_is108
# Use with "sudo" for a system-wide install
pip3 install .
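After installing, it can be useful to confirm that the Python 3 interpreter which will run the check can actually find the library. The snippet below is only a small sketch and not part of the plugin; the helper name `module_available` is made up for illustration.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be found."""
    # find_spec returns None when no such module is installed.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # 'easysnmp' is the library required by the check script.
    print("easysnmp found:", module_available("easysnmp"))
```

Run it with the same interpreter (for example python3) that will execute the check script, since each interpreter has its own set of installed packages.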
You can use this plugin to check a single SLA or multiple SLAs of the same type.
Checking multiple SLAs is mostly useful to check general internet connectivity. For example, you could set up echo SLAs to 4 IPs with an expected uptime near 100%. You could then use either the --warning or the --warning-pct parameter to let the script issue a warning if, say, 2 go down, and in the same way use a critical parameter to issue a critical when more than 2 go down.
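To make the threshold idea concrete, here is a minimal sketch of how a number of failed SLAs could be mapped to a status. It is not the plugin's actual code. It assumes that a threshold is reached when the value is greater than or equal to it, and that a count threshold takes precedence over a percentage threshold when both are given. The function name `evaluate_failed` is hypothetical; the percentage defaults (50 and 100) are the ones listed in the help text.

```python
def evaluate_failed(total, failed, warning=None, critical=None,
                    warning_pct=50.0, critical_pct=100.0):
    """Map a failed-SLA count to a Nagios-style status name.

    Assumptions of this sketch: thresholds compare with >=, and a count
    threshold (warning/critical) wins over the percentage threshold.
    """
    if total <= 0:
        return "UNKNOWN"  # nothing was checked, so no verdict is possible
    failed_pct = 100.0 * failed / total

    if critical is not None:
        is_critical = failed >= critical
    else:
        is_critical = failed_pct >= critical_pct

    if warning is not None:
        is_warning = failed >= warning
    else:
        is_warning = failed_pct >= warning_pct

    if is_critical:
        return "CRITICAL"
    if is_warning:
        return "WARNING"
    return "OK"


# The scenario from the text: 4 SLAs, warning at 2 down, critical above 2.
print(evaluate_failed(4, 2, warning=2, critical=3))
print(evaluate_failed(4, 3, warning=2, critical=3))
```

Under these assumptions the first call yields WARNING and the second CRITICAL.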
Use "--mode list" for a quick overview of the SLAs available on your Cisco device. Use "--mode check" to do the actual checking.
Performance data is output when using the --perf parameter. See the Output chapter for a description of the performance values.
For a complete overview of command-line options, run the check with the parameter --help.
$ ./check_cisco_ip_sla.py --help
usage: check_cisco_ip_sla.py [-h] [--version] [-H HOSTNAME] [-v {1,2,3}]
[-c COMMUNITY] [-u SECURITY_NAME]
[-l {noAuthNoPriv,authNoPriv,authPriv}]
[-p PASSWORD] [-a {MD5,SHA}] [-A AUTH_PASSWORD]
[-x {DES,AES}] [-X PRIV_PASSWORD]
[-m {list,check}] [-e ENTRIES] [--perf]
[--critical-pct CRITICAL_PCT]
[--warning-pct WARNING_PCT] [--critical CRITICAL]
[--warning WARNING]
[--critical-jitter CRITICAL_JITTER]
[--warning-jitter WARNING_JITTER]
[--critical-mos CRITICAL_MOS]
[--warning-mos WARNING_MOS]
[--critical-icpif CRITICAL_ICPIF]
[--warning-icpif WARNING_ICPIF]
[--verbose {0,1,2}]
Monitoring check plugin to check Cisco SLA status for one or more entries. If
a checked SLA entry is not in active state, the status is raised to WARNING.
The script returns the worst status found for each checked SLA entry where
UNKNOWN is worse than CRITICAL and CRITICAL is worse than WARNING.
optional arguments:
-h, --help show this help message and exit
--version The version of this script
-H HOSTNAME, --hostname HOSTNAME
Hostname or ip-address
-v {1,2,3}, --snmp-version {1,2,3}
SNMP version (default '2')
-c COMMUNITY, --community COMMUNITY
SNMP v1/v2 Community string (default 'public')
-u SECURITY_NAME, --security-name SECURITY_NAME
SNMP v3 security name (username)
-l {noAuthNoPriv,authNoPriv,authPriv}, --security-level {noAuthNoPriv,authNoPriv,authPriv}
SNMP v3 security level (default 'authPriv')
-p PASSWORD, --password PASSWORD
SNMP v3 password (used for both authentication and
privacy)
-a {MD5,SHA}, --auth-protocol {MD5,SHA}
SNMP v3 authentication protocol (default 'SHA')
-A AUTH_PASSWORD, --auth-password AUTH_PASSWORD
SNMP v3 authentication password, overrides --password
if set
-x {DES,AES}, --priv-protocol {DES,AES}
SNMP v3 privacy protocol (default 'AES')
-X PRIV_PASSWORD, --priv-password PRIV_PASSWORD
SNMP v3 privacy password, overrides --password if set
-m {list,check}, --mode {list,check}
Operation mode
-e ENTRIES, --entries ENTRIES
SLA entry (or entries) to check, specify a single
value, a comma-separated list or 'all' to check all
entries available. All entries must be of the same
type. (default 'all')
--perf Return performance data (failed percentage, round-trip
times)
--critical-pct CRITICAL_PCT
Critical threshold in percentage of failed SLAs
(default '100')
--warning-pct WARNING_PCT
Warning threshold in percentage of failed SLAs
(default '50')
--critical CRITICAL Critical threshold in amount of failed SLAs
--warning WARNING Warning threshold in amount of failed SLAs
--critical-jitter CRITICAL_JITTER
Critical threshold for the Average Jitter value of
jitter SLAs
--warning-jitter WARNING_JITTER
Warning threshold for the Average Jitter value of
jitter SLAs
--critical-mos CRITICAL_MOS
Critical threshold for the MOS value of jitter SLAs
(1.00 .. 5.00)
--warning-mos WARNING_MOS
Warning threshold for the MOS value of jitter SLAs
(1.00 .. 5.00)
--critical-icpif CRITICAL_ICPIF
Critical threshold for the ICPIF value of jitter SLAs
--warning-icpif WARNING_ICPIF
Warning threshold for the ICPIF value of jitter SLAs
--verbose {0,1,2} Verbose output
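The help text states that the script returns the worst status found, where UNKNOWN is worse than CRITICAL and CRITICAL is worse than WARNING. A minimal sketch of such an aggregation is shown here. It is an illustration only, not the plugin's own code; the names `SEVERITY` and `worst_status` are hypothetical, and treating OK as the least severe status is an assumption.

```python
# Severity order taken from the help text: UNKNOWN > CRITICAL > WARNING.
# Placing OK at the bottom of the order is an assumption of this sketch.
SEVERITY = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]


def worst_status(statuses):
    """Return the most severe status name; an empty input yields 'OK'."""
    worst = "OK"
    for status in statuses:
        if SEVERITY.index(status) > SEVERITY.index(worst):
            worst = status
    return worst


print(worst_status(["OK", "WARNING", "OK"]))
print(worst_status(["CRITICAL", "UNKNOWN", "WARNING"]))
```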
This monitoring plugin follows the Nagios plugin guidelines for output. In check mode, the return value indicates the status (0 = OK, 1 = WARNING, 2 = CRITICAL and 3 = UNKNOWN). The status is also printed as output, together with a short textual description. Examples can be seen below.
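If you want to call the check from your own script rather than from a monitoring system, the return value can be translated back into a status name. The following is a rough sketch under a few assumptions: the helper name `run_check` is made up, and treating any return code other than 0 to 3 as UNKNOWN is a choice of this example, not something the plugin documents.

```python
import subprocess
import sys

# Return codes as described above.
STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command):
    """Run a check command (list of arguments) and return
    (status_name, first_line_of_output)."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            universal_newlines=True)
    output, _ = proc.communicate()
    # Assumption: any unexpected return code is reported as UNKNOWN.
    status = STATUS_NAMES.get(proc.returncode, "UNKNOWN")
    lines = output.splitlines()
    return status, (lines[0] if lines else "")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the check command.
    print(run_check(sys.argv[1:]))
```

The command passed in would be the check script followed by its usual parameters, for example --hostname, --community, --mode check and --entries.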
Performance data
Use the --perf parameter to make the script output performance data.
For all SLA types the Round Trip Time of the latest operation is returned:
- 'rtt': The RTT of the latest operation
or when checking multiple SLA entries at once:
- 'rtt <entry>': The RTT of the latest operation for each entry (e.g. 'rtt 10')
For jitter SLAs the following additional values are returned:
- 'RTT avg': The average, min and max of the successfully measured RTTs (example: 'RTT avg'=12.2ms;9;24)
- 'RTT variance': The variance of the measured RTTs (example: 'RTT variance'=571.4)
- 'RTT std dev': The standard deviation of the measured RTTs (example: 'RTT std dev'=23.9)
- 'Avg jitter': The average jitter (example: 'Avg jitter'=2)
- 'Avg jitter SD': The average jitter from Source to Destination (example: 'Avg jitter SD'=3)
- 'Avg jitter DS': The average jitter from Destination to Source (example: 'Avg jitter DS'=1)
- 'Avg latency SD': The average latency from Source to Destination (example: 'Avg latency SD'=7)
- 'Avg latency DS': The average latency from Destination to Source (example: 'Avg latency DS'=10)
- 'MOS': The Mean Opinion Score value (example: 'MOS'=4.23)
- 'ICPIF': The Impairment Calculated Planning Impairment Factor value (example: 'ICPIF'=11)
- 'Packet loss SD': Packet loss from Source to Destination (example: 'Packet loss SD'=0)
- 'Packet loss DS': Packet loss from Destination to Source (example: 'Packet loss DS'=0)
- 'Packet out of seq': The number of packets arrived out of sequence (example: 'Packet out of seq'=0)
- 'Packet MIA': The number of packets that are lost for which the direction cannot be determined (example: 'Packet MIA'=0)
- 'Packet late arrival': The number of packets that arrived after the timeout (example: 'Packet late arrival'=0)
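The performance values follow the usual Nagios layout of 'label'=value, optionally followed by a unit and semicolon-separated extra fields, after a "|" in the plugin output. The sketch below shows one way to read them back in Python. It is an illustration and not part of the plugin; the name `parse_perf_data` is hypothetical, and it only handles single-quoted labels, which is how all labels appear in this document.

```python
import re

# 'label'=value[unit] followed by zero or more ;field parts
PERF_ITEM = re.compile(r"'([^']+)'=([^\s;]+)((?:;[^\s;]*)*)")
# Leading number, then whatever unit text follows it
NUMBER = re.compile(r"^(-?\d+(?:\.\d+)?)(.*)$")


def parse_perf_data(plugin_output):
    """Return {label: (value, unit, extra_fields)} for one output line."""
    if "|" not in plugin_output:
        return {}
    perf_part = plugin_output.split("|", 1)[1]
    result = {}
    for label, raw_value, rest in PERF_ITEM.findall(perf_part):
        match = NUMBER.match(raw_value)
        if match is None:
            continue  # skip values this sketch does not understand
        value = float(match.group(1))
        unit = match.group(2)
        # rest looks like ';50;100;0;100', so drop the empty first piece
        extras = rest.split(";")[1:] if rest else []
        result[label] = (value, unit, extras)
    return result


line = "OK - 4 OK | 'Failed%'=0.0%;50;100;0;100 'rtt 10'=1ms 'rtt 20'=4ms"
for label, fields in sorted(parse_perf_data(line).items()):
    print(label, fields)
```

The extra fields are kept as strings because their meaning depends on the value: for 'Failed%' they are the warning, critical, min and max fields, while for 'RTT avg' the plugin uses them for the min and max RTT.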
General use cases:
Get a list of all SLAs available on a device
$ ./check_cisco_ip_sla.py --hostname 192.168.0.1 --community public --mode list
SLAs available:
ID Type Tag
---- ------ ----------------------------
10 echo New York
20 echo Tokio
30 echo Amsterdam
40 echo London
2600 jitter Jitter from Site-X to Site-Y
Check an SLA
$ ./check_cisco_ip_sla.py --hostname 192.168.0.1 --community public --mode check --entries 10
OK - 1 OK
Check multiple SLAs, warning if one goes down, critical if two go down
$ ./check_cisco_ip_sla.py --hostname 192.168.0.1 --community public --mode check --entries 10,20,30,40 \
--warning-pct 25 --critical-pct 50
OK - 4 OK
Check via SNMPv3
$ ./check_cisco_ip_sla.py --hostname 192.168.0.1 -v 3 -m list \
--security-name example_user --security-level authPriv --password example_password \
--auth-protocol SHA --priv-protocol AES
SLAs available:
ID Type Tag
--- ------ --------
10 echo New York
Check with performance data
$ ./check_cisco_ip_sla.py --hostname 192.168.0.1 -v 2 -c public --mode check --perf
OK - 4 OK | 'Failed%'=0.0%;50;100;0;100 'rtt 10'=1ms 'rtt 20'=4ms 'rtt 30'=1ms 'rtt 40'=12ms
Check jitter with performance data
$ ./check_cisco_ip_sla.py --hostname 192.168.0.1 -v 2 -c public --mode check --entries 2600 --perf
OK - 1 OK | 'RTT avg'=24.6;17;31 'RTT variance'=571.4 'RTT std dev'=23.9 'Avg jitter'=2 'Avg jitter SD'=3 'Avg jitter DS'=1 'Avg latency SD'=7 'Avg latency DS'=10 'MOS'=4.23 'ICPIF'=11 'Packet loss SD'=0 'Packet loss DS'=0 'Packet out of seq'=0 'Packet MIA'=0 'Packet late arrival'=0 'rtt'=17ms
Command definition examples:
define command {
command_name check_cisco_sla
command_line path/to/check_cisco_ip_sla.py --hostname $HOSTADDRESS$ -v 3 --security-name "$ARG1$" --password "$ARG2$" --mode check --entries "$ARG3$" --warning-pct "$ARG4$" --critical-pct "$ARG5$"
;command_example !username!password!10,20!60!80
;$ARG1$ SNMP-v3 Username
;$ARG2$ SNMP-v3 auth and priv password
;$ARG3$ SLA(s) as comma separated list
;$ARG4$ Warning threshold (percentage SLAs failed)
;$ARG5$ Critical threshold (percentage SLAs failed)
}
define command {
command_name check_cisco_sla_jitter
command_line path/to/check_cisco_ip_sla.py --hostname $HOSTADDRESS$ -v 3 --security-name "$ARG1$" --password "$ARG2$" --mode check --entries "$ARG3$" --warning-mos "$ARG4$" --critical-mos "$ARG5$" --perf
;command_example !username!password!10,20!3!2
;$ARG1$ SNMP-v3 Username
;$ARG2$ SNMP-v3 auth and priv password
;$ARG3$ SLA(s) as comma separated list
;$ARG4$ Warning threshold (MOS value)
;$ARG5$ Critical threshold (MOS value)
}
define command {
command_name check_cisco_sla_v2
command_line path/to/check_cisco_ip_sla.py --hostname $HOSTADDRESS$ -v 2 --community "$ARG1$" --mode check --entries "$ARG2$" --warning-pct "$ARG3$" --critical-pct "$ARG4$"
;command_example !public!10,20!60!80
;$ARG1$ SNMP Community
;$ARG2$ SLA(s) as comma separated list
;$ARG3$ Warning threshold (percentage SLAs failed)
;$ARG4$ Critical threshold (percentage SLAs failed)
}
define command {
command_name check_cisco_sla_v2_all
command_line path/to/check_cisco_ip_sla.py --hostname $HOSTADDRESS$ -v 2 --community "$ARG1$" --mode check --entries all --warning-pct "$ARG2$" --critical-pct "$ARG3$" --perf
;command_example !public!60!80
;$ARG1$ SNMP Community
;$ARG2$ Warning threshold (percentage SLAs failed)
;$ARG3$ Critical threshold (percentage SLAs failed)
}
Service template examples:
define service {
name cisco-sla-check
service_description Cisco SLA - Check entries
use generic-service
process_perf_data 0
check_command check_cisco_sla!~!~!~!~!~
;$ARG1$ SNMP-v3 Username
;$ARG2$ SNMP-v3 auth and priv password
;$ARG3$ SLA(s) as comma separated list
;$ARG4$ Warning threshold (percentage SLAs failed)
;$ARG5$ Critical threshold (percentage SLAs failed)
register 0
}
define service {
name cisco-sla-check-jitter
service_description Cisco SLA - Check jitter
use generic-service
check_command check_cisco_sla_jitter!~!~!~!~!~
;$ARG1$ SNMP-v3 Username
;$ARG2$ SNMP-v3 auth and priv password
;$ARG3$ SLA(s) as comma separated list
;$ARG4$ Warning threshold (MOS value)
;$ARG5$ Critical threshold (MOS value)
register 0
}
define service {
name cisco-sla-check-v2-all
service_description Cisco SLA - Check entries
use generic-service
check_command check_cisco_sla_v2_all!~!~!~
;$ARG1$ SNMP Community
;$ARG2$ Warning level (in percent SLAs failed)
;$ARG3$ Critical level (in percent SLAs failed)
register 0
}
Service definition examples:
define service {
host_name cisco03.example.com
service_description Cisco SLA - Check line to NY
use cisco-sla-check
check_command check_cisco_sla!$USER11$!$USER12$!10!60!80
}
define service {
host_name cisco03.example.com
service_description Cisco SLA - Check jitter
use cisco-sla-check-jitter
check_command check_cisco_sla_jitter!$USER11$!$USER12$!2600!3!2
}
define service {
host_name cisco02.example.com
service_description Cisco SLA - Check entries
use cisco-sla-check-v2-all
check_command check_cisco_sla_v2_all!$USER10$!60!80
}