Reviewer
Tony Reix <tony.reix@bull.net>
Change summary
- v0.1: Initial release
- v0.2: Separated the requirements into monitoring, performance and global view
| Version Number | Date of Revision |
| 0.2 | 2004/11/19 |
| 0.1 | 2004/11/15 |
Reference documents
Nagios Homepage:
http://www.nagios.org/
Nagios plugin development page:
http://sourceforge.net/projects/nagiosplug/
Nagios plugin development guidelines:
http://nagiosplug.sourceforge.net/developer-guidelines.html
Network Monitoring with Nagios and MRTG:
http://nagios.sourceforge.net/download/contrib/documentation/misc/monitoring_nagios.doc
1. Requirements
We have identified several useful pieces of information that could be monitored in a classic network using NFS.
This information can be separated into three sets:
- Monitoring: the state (up/down) of the daemons (on both servers and clients), the logs, and checks on a single machine that everything is running correctly.
- Performance: check that the resources used by NFS are not too high: transfer rates, CPU consumption, ...
- A global view: when there are many exports and mounts, it is very difficult to keep track of what has been done. It is easier to remember it if the network is presented in a global view showing every mount and export.
We will present how to implement the monitoring of all this information in the Nagios tool.
2.1 Nagios architecture
Nagios is a monitoring tool, widely used on Linux networks.
Nagios is built on a server/agents architecture.
Usually, on a network, a Nagios server runs on one host, and plugins run on all the remote hosts that need to be monitored. These plugins send information to the server, which displays it in a GUI.
So Nagios is composed of three parts:
- A scheduler: this is the server part of Nagios.
At regular intervals, the scheduler checks the plugins and, according to their results, takes some actions.
- A GUI: the interface of Nagios (with the configuration, the alerts, ...). It is displayed in web pages generated by CGI.
It can show state buttons (green for OK, red for error), sounds, MRTG graphs, ...
- The plugins. They are configurable by the user. They check a service and return a result to the Nagios server.
A soft alert is raised when a plugin returns a warning or an error.
Then, on the GUI, a green button turns red and a sound is emitted.
When this soft alert has been raised a number of times (the number is configurable), a hard alert is raised and the Nagios server sends notifications: email, SMS, ...
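The escalation from soft to hard alerts described above can be sketched as a counter of consecutive failing checks. This is only an illustration of the idea, not the actual Nagios code; the function name and the `max_attempts` parameter (standing for the configurable number of failures) are assumptions made for the example.

```python
def classify_alerts(results, max_attempts=3):
    """Label each check result as 'ok', 'soft' or 'hard'.

    results: iterable of plugin return codes (0 = OK, non-zero = problem).
    max_attempts: hypothetical name for the configurable number of
    consecutive failures after which a soft alert becomes a hard alert.
    """
    states = []
    failures = 0
    for code in results:
        if code == 0:
            # A successful check resets the failure counter.
            failures = 0
            states.append("ok")
        else:
            failures += 1
            states.append("hard" if failures >= max_attempts else "soft")
    return states


if __name__ == "__main__":
    print(classify_alerts([0, 2, 2, 2, 0]))
```

In this sketch, notifications (email, SMS, ...) would be sent only for the results labelled 'hard'.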
2.2 Plugins
A plugin is a small program (in Perl, C, Python, ...) that checks a service (a daemon, some free space on a disk, ...). It must return a value and a short line of text (Nagios only reads the first line of text).
The output should be in the format: METRIC STATUS: information text|performance data
The allowed return values are 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN); the STATUS word in the output line matches the returned value.
The warning and critical thresholds are parameters, set by the user and passed as arguments to the plugin.
A plugin can also return performance data in the format: "label1=value1 label2=value2 ..."
These data are stored by Nagios and may later be displayed with MRTG (http://people.ee.ethz.ch/~oetiker/webtools/mrtg/)
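As an illustration of the plugin contract described above, here is a minimal sketch of a plugin skeleton. The metric name NFSD_CPU, the command-line layout and the helper names are hypothetical, chosen only for the example, and the measured value is passed on the command line instead of being measured.

```python
import sys

# Return values listed above.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING",
                CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warning, critical):
    """Map a measured value to a return value using the user's thresholds."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def format_output(metric, code, text, perfdata):
    """Build the single output line 'METRIC STATUS: text|label=value ...'."""
    perf = " ".join(f"{label}={val}" for label, val in perfdata.items())
    return f"{metric} {STATUS_NAMES[code]}: {text}|{perf}"


def main(argv):
    # Hypothetical command line: plugin <value> <warning> <critical>
    try:
        value, warning, critical = (float(arg) for arg in argv[1:4])
    except ValueError:
        print("NFSD_CPU UNKNOWN: usage: plugin <value> <warning> <critical>")
        return UNKNOWN
    code = evaluate(value, warning, critical)
    print(format_output("NFSD_CPU", code,
                        f"nfsd uses {value}% CPU", {"cpu": value}))
    return code


if __name__ == "__main__":
    # The return value of the plugin is the exit status of the process.
    sys.exit(main(sys.argv))
```

A real plugin would replace the first argument with an actual measurement and keep only the thresholds as arguments.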
The plugins can be run:
- Locally, on the Nagios server.
Such a plugin can still check remote hosts: for example, check_ping pings remote hosts to check whether they are running.
- Remotely, through a remote Nagios server, with ssh, with SNMP, with NRPE (Nagios Remote Plugin Executor), or with NSCA (Nagios Service Check Acceptor).
This means that the plugin either waits for a verification request from the Nagios server before sending its result, or runs on its own and sends the result to the Nagios server.
3. Implementation in Nagios
3.1 Monitoring
Here we have both classic monitoring (daemons up or down) and some checks, on a single host, of the NFS mounts and exports.
- check_rpc is an existing plugin that checks whether an RPC service is registered and running.
It uses rpcinfo.
The user configures it by specifying which RPC ports to check.
Thus it can be used to check whether an NFS daemon is running on a remote host, and which NFS version this daemon provides.
- Check that all the daemons are running correctly on servers (nfsd, idmapd, mountd, svcgssd) and on clients (idmapd, gssd).
- Display the NFS logs from both servers and clients; grep these logs for interesting strings.
- Check the RPC errors reported by the nfsstat command (strings "bad*").
- Detect when a mount point hangs (for whatever reason: the NFS server is down, the network is broken, ...).
- Detect (in the /etc/fstab file) mounts in the wrong order: trying to mount a subdirectory before its parent directory.
- Check NFS security: authentication failures, krb5 (mainly from the errors found in the logs).
3.2 Performance
The main complaint about NFS on a big network is: "It's lagging".
So performance must be checked in order to know whether it really is an NFS host that is lagging, and which one.
- The percentage of CPU taken by the nfsd processes.
It is unusual for nfsd to take too much CPU.
- Check the transfer rate of each NFS server and compare it with the maximum transfer rate of the filesystem.
Unfortunately, this cannot be done easily, mainly because there is no way to know what is due to the NFS server and what is due to the network.
- Check, from the file /proc/net/rpc/nfsd, the number of nfsd processes effectively used. The file gives the load of each nfsd process. Raise an alert if the number chosen (that is, the load) is too low or too high.
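The check based on /proc/net/rpc/nfsd could be sketched as follows. This sketch ASSUMES that the file contains a line "th <threads> <number of times all threads were busy> <busy-time histogram ...>", as on Linux 2.6 kernels; this layout should be verified on the target kernel. The two thresholds used to decide "too low" and "too high" are hypothetical heuristics, not rules taken from the NFS documentation.

```python
def parse_th_line(stats_text):
    """Return (threads, all_busy_count, histogram) from the 'th' line.

    ASSUMPTION: the line looks like
    'th <threads> <times all threads were busy> <histogram values...>'.
    """
    for line in stats_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "th":
            if len(fields) < 3:
                raise ValueError("unexpected 'th' line: " + line)
            threads = int(fields[1])
            all_busy = int(fields[2])
            histogram = [float(value) for value in fields[3:]]
            return threads, all_busy, histogram
    raise ValueError("no 'th' line found")


def thread_count_verdict(stats_text, max_all_busy=0):
    """Hypothetical heuristic: 'too low', 'too high' or 'ok'."""
    _threads, all_busy, histogram = parse_th_line(stats_text)
    # Every thread was busy at the same time: probably too few threads.
    if all_busy > max_all_busy:
        return "too low"
    # No time spent in the upper half of the histogram: probably too many.
    upper_half = histogram[len(histogram) // 2:]
    if histogram and sum(upper_half) == 0:
        return "too high"
    return "ok"


if __name__ == "__main__":
    with open("/proc/net/rpc/nfsd") as stats:
        print(thread_count_verdict(stats.read()))
```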
3.3 Global view
We have seen that, with NFS, it is sometimes difficult to remember all of the exports and mounts on a big network with many NFS servers. It seems useful to create a piece of software (a Nagios view? an HTML view?) that can display a multi-host tree view or a multi-host table view of the network:
- the hosts
- the exported directories on each host
- all of the mounted directories
Other information may then be added:
- Detect (in the /etc/fstab file) cross mounts: host A wants to mount a filesystem from host B, and host B wants to mount a filesystem from host A. If this is not detected before a shutdown, both hosts A and B will hang at the next reboot.
- Display, and allow changes to, the information about replication, migration and load balancing.
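The cross-mount detection mentioned above could be sketched as follows, assuming the /etc/fstab files of all hosts have already been collected into a dictionary (how they are collected is out of scope here). The function names are hypothetical, and the sketch assumes that the host names used as dictionary keys are the same as the server names written in the fstab entries.

```python
def nfs_servers(fstab_text):
    """Return the set of NFS servers referenced by one host's fstab."""
    servers = set()
    for line in fstab_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        device, fstype = fields[0], fields[2]
        # NFS entries look like 'server:/exported/dir /mount/point nfs ...'
        if fstype.startswith("nfs") and ":" in device:
            servers.add(device.split(":", 1)[0])
    return servers


def cross_mounts(fstabs):
    """fstabs maps a host name to the text of its fstab.

    Return the sorted list of (A, B) pairs that mount from each other.
    """
    mounts_from = {host: nfs_servers(text) for host, text in fstabs.items()}
    pairs = set()
    for host, servers in mounts_from.items():
        for server in servers:
            if server != host and host in mounts_from.get(server, set()):
                pairs.add(tuple(sorted((host, server))))
    return sorted(pairs)


if __name__ == "__main__":
    example = {
        "A": "B:/data /mnt/b nfs defaults 0 0\n",
        "B": "A:/home /mnt/a nfs defaults 0 0\n",
    }
    print(cross_mounts(example))
```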
4. Other developments
Some points cannot be handled in Nagios or in any other monitoring tool, mainly because they need to be checked just before or during the mount.
So they could be done, for example, in Webmin (http://webmin.com), in the Mount module:
- Detect a directory whose content disappears when a mount is done over it: display an error to the user when they try to mount on a directory which is not empty.
- Detect a mount that disappears when another mount is made over it.
- It is not possible to unmount a filesystem while users are working on it: help to unmount such a filesystem. For example, display a list of these users, send them a warning, ...
- It is not really useful to mount a mount: if host C tries to mount the directory 'dir' from host B, raise an alert if 'dir' has already been mounted on B from host A, and explain that 'dir' should be mounted directly on C from A.
- Sort the /etc/fstab file so that the mount of a subdirectory of a filesystem is always done after the mount of this filesystem.
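The sorting of /etc/fstab proposed in the last point could be sketched as follows. The idea used here is an assumption of this example, not a Webmin feature: a parent directory always has fewer path components than its subdirectories, so a stable sort of the entries by the depth of their mount point puts every parent before its children. Comments, blank lines and entries without a mount point (swap, ...) stay where they are. The function names are hypothetical.

```python
import posixpath


def mount_depth(line):
    """Depth of the mount point (field 2), or None for a non-mount line."""
    fields = line.split()
    if (len(fields) < 2 or fields[0].startswith("#")
            or not fields[1].startswith("/")):
        return None
    path = posixpath.normpath(fields[1])
    return 0 if path == "/" else path.count("/")


def sort_fstab(fstab_text):
    """Return the fstab text with mount entries ordered parents first."""
    lines = fstab_text.splitlines()
    # Positions of the lines that are real mount entries.
    slots = [i for i, line in enumerate(lines)
             if mount_depth(line) is not None]
    # sorted() is stable: entries of equal depth keep their relative order.
    entries = sorted((lines[i] for i in slots), key=mount_depth)
    for i, entry in zip(slots, entries):
        lines[i] = entry
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    with open("/etc/fstab") as fstab:
        print(sort_fstab(fstab.read()), end="")
```

The sketch only prints the sorted content; writing it back to /etc/fstab is deliberately left to the administrator.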