|
check_smartmon must be run with sufficient permissions to access the device. The
command runs as the nagios user, via net-mgmt/nrpe.
The following are the entries I add to /usr/local/etc/nrpe.cfg to monitor the
two HDDs in this system:
command[check_smartmon_ad2]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
command[check_smartmon_ad4]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad4
After changing the above configuration file, remember to restart nrpe:
# /usr/local/etc/rc.d/nrpe2 restart
Stopping nrpe2.
Starting nrpe2.
To allow the nagios user to run these commands via sudo, I add the following
via the visudo command:
nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad2
nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad4
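Because each sudoers entry spells out the full command line, it has to stay in step with the matching nrpe.cfg entry. The following is only a sketch of one way to generate both lines from a single place; the function name smartmon_entries is hypothetical, and the plugin path and line formats are copied from the entries above. It only prints the lines; it does not edit either file.

```shell
#!/bin/sh
# Hypothetical helper: print the nrpe.cfg line and the matching sudoers line
# for one device. The plugin path is the one used in the entries above.
PLUGIN=/usr/local/libexec/nagios/check_smartmon

# usage: smartmon_entries DEVICE [EXTRA_ARGS]
smartmon_entries() {
    dev=$1
    extra="${2:+ $2}"   # leading space only when extra arguments are given
    cmd="$PLUGIN -d /dev/$dev$extra"
    printf 'command[check_smartmon_%s]=sudo %s\n' "$dev" "$cmd"
    printf 'nagios ALL=(ALL) NOPASSWD:%s\n' "$cmd"
}

# Print the entries for the two drives monitored above.
for dev in ad2 ad4; do
    smartmon_entries "$dev"
done
```

The first line of each pair goes into /usr/local/etc/nrpe.cfg and the second is pasted in via visudo.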
From the nagios system, I ran this command to verify that nrpe would return the
expected results:
$ /usr/local/libexec/nagios/check_nrpe2 -H bast -c check_smartmon_ad2
OK: device is functional and stable (temperature: 42)
Good. So we know NRPE will perform the command and return the expected results.
Now it's a simple matter of configuring nagios to run the above command.
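Nagios decides the state of a service from the plugin's exit status, not only from the text it prints. By the usual plugin convention, 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. As a small sketch for checking this by hand, the hypothetical function status_name below turns an exit status into its name; any other value is treated as UNKNOWN here.

```shell
#!/bin/sh
# Hypothetical helper: map a plugin exit status to its conventional
# Nagios state name. Values other than 0, 1 and 2 are reported as UNKNOWN.
status_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Usage on the nagios system, with the same command as above:
#   /usr/local/libexec/nagios/check_nrpe2 -H bast -c check_smartmon_ad2
#   status_name $?
```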
Guess what? The check turned up news:
WARNING: device temperature (57) exceeds warning temperature threshold (55)
I started a long self-test on ad6:
# smartctl -t long /dev/ad6
smartctl version 5.38 [i386-portbld-freebsd8.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 54 minutes for test to complete.
Test will complete after Sat Mar 13 20:38:33 2010
Use smartctl -X to abort test.
And soon after that:
CRITICAL: device temperature (61) exceeds critical temperature threshold (60)
Nice.
I checked the HDD temperatures manually, by putting my hand on each drive, and
they all felt about the same. I concluded SMART was wrong,
which is not unknown. I changed nrpe.cfg to allow for the higher reading:
command[check_smartmon_ad6]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad6 -w 65 -c 70
I also ran visudo and updated the ad6 entry to allow nagios to run the amended command.
The sudoers entry includes the arguments, so it must match the new command line exactly.
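To see what the new -w and -c values change, here is a minimal sketch of the threshold comparison. It is not the plugin's own code: the function name classify_temp is hypothetical, and the strict "greater than" test is an assumption based on the word "exceeds" in the messages above. The 55 and 60 thresholds are the ones quoted in those messages.

```shell
#!/bin/sh
# Hypothetical sketch of a warning/critical temperature comparison.
# Assumes a reading must be strictly above a threshold to trip it.
# usage: classify_temp TEMP WARN CRIT
classify_temp() {
    temp=$1
    warn=$2
    crit=$3
    if [ "$temp" -gt "$crit" ]; then
        echo CRITICAL
    elif [ "$temp" -gt "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}

classify_temp 61 55 60   # thresholds from the messages above: prints CRITICAL
classify_temp 61 65 70   # thresholds after the nrpe.cfg change: prints OK
```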
|