Very recently I was asked to set up Nagios to monitor Lotus Notes/Domino servers. There were more than 500 servers across the globe. It was an all-Windows shop, and monitoring was being done with GSX, HP Systems Insight Manager and IBM Director. The client wanted a comprehensive solution with a single monitoring interface to look at, and after an initial discussion they decided to go ahead with Nagios.
This document looks at monitoring Lotus Notes/Domino servers using SNMP through Nagios. I have provided some of the required OIDs and their initial warning and critical threshold values in tabular format. There are many more interesting OIDs listed in the domino.mib file. I have also attached the Nagios command definition file and service definition files at the end of the document. Certain checks require additional plugins, which can be downloaded from http://www.barbich.net/websvn/wsvn/nagios/nagios/plugins/check_lotus_state.pl.
Note – I recently found that the required plugins are not available on the original site anymore, so I have made my copy available with this document. You can download the scripts from the link at the bottom of the document.
To start with, I asked the Windows administrators to install the Lotus/Domino SNMP Agent on all servers. I then got hold of a copy of the domino.mib file, which is located in C:\system32.
Next I listed all the interesting parameters from the domino.mib file and queried a set of test servers to find out whether each one actually returned a value. The OID list follows, with what each OID means. Most of these checks are only valid on the active node. This is important to know if the Domino servers are in an HA cluster (active-standby pair). If there is only one Domino server, the checks simply apply to that server.
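The probing step described above can be sketched as a small script. This is only a minimal sketch, assuming the Net-SNMP `snmpget` command is installed on the monitoring host, the agent answers SNMP v2c, and the community string is `public`; the helper names are hypothetical and not part of any plugin mentioned here. The `enterprises` shorthand used in the tables expands to the numeric prefix `1.3.6.1.4.1`.

```python
import subprocess

ENTERPRISES = "1.3.6.1.4.1"  # numeric form of iso.org.dod.internet.private.enterprises


def numeric_oid(oid):
    """Expand the 'enterprises.' shorthand used in the tables into a numeric OID."""
    prefix = "enterprises."
    if oid.startswith(prefix):
        return "." + ENTERPRISES + "." + oid[len(prefix):]
    return oid


def snmpget_command(host, oid, community="public"):
    """Build the snmpget argument list (assumes SNMP v2c; community is a placeholder)."""
    return ["snmpget", "-v", "2c", "-c", community, "-Oqv", host, numeric_oid(oid)]


def probe(host, oid, community="public", timeout=10):
    """Return the polled value as text, or None if nothing usable came back."""
    try:
        result = subprocess.run(
            snmpget_command(host, oid, community),
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # snmpget missing, or the agent did not answer in time.
        return None
    value = result.stdout.strip()
    # With v2c an unknown OID is reported as "No Such Object/Instance ..." text.
    if result.returncode != 0 or not value or value.startswith("No Such"):
        return None
    return value
```

Calling `probe("domino-test-01", "enterprises.334.72.1.1.4.1.0")` against a test server (the host name is a placeholder) returns the dead-mail count as a string, or `None` when the OID is not populated on that server.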
Monitoring Checks on Active Node
Nagios Service Check | OID | Description | Thresholds (w = warning, c = critical) |
---|---|---|---|
dead-mail | enterprises.334.72.1.1.4.1.0 | Number of dead (undeliverable) mail messages | w 80, c 100 |
routing-failures | enterprises.334.72.1.1.4.3.0 | Total number of routing failures since the server started | w 100, c 150 |
pending-routing | enterprises.334.72.1.1.4.6.0 | Number of mail messages waiting to be routed | w 10, c 20 |
pending-local | enterprises.334.72.1.1.4.7.0 | Number of pending mail messages awaiting local delivery | w 10, c 20 |
average-hops | enterprises.334.72.1.1.4.10.0 | Average number of server hops for mail delivery | w 10, c 15 |
max-mail-delivery-time | enterprises.334.72.1.1.4.12.0 | Maximum time for mail delivery in seconds | w 300, c 600 |
router-unable-to-transfer | enterprises.334.72.1.1.4.19.0 | Number of mail messages the router was unable to transfer | w 80, c 100 |
mail-held-in-queue | enterprises.334.72.1.1.4.21.0 | Number of mail messages in message queue on hold | w 80, c 100 |
mails-pending | enterprises.334.72.1.1.4.31.0 | Number of mail messages pending | w 80, c 100 |
mailbox-dns-pending | enterprises.334.72.1.1.4.34.0 | Number of mail messages in MAIL.BOX waiting for DNS | w 10, c 20 |
databases-in-cache | enterprises.334.72.1.1.10.15.0 | The number of databases currently in the cache. Administrators should monitor this number to see whether it approaches the NSF_DBCACHE_MAXENTRIES setting. If it does, this indicates the cache is under pressure. If this situation occurs frequently, the administrator should increase the setting for NSF_DBCACHE_MAXENTRIES | w 80, c 100 |
database-cache-hits | enterprises.334.72.1.1.10.17.0 | The number of times an lnDBCacheInitialDbOpen is satisfied by finding a database in the cache. A high ‘hits-to-opens’ ratio indicates the database cache is working effectively, since most users are opening databases in the cache without having to wait for the usual time required by an initial (non-cache) open. If the ratio is low (in other words, more users are having to wait for databases not in the cache to open), the administrator can increase the NSF_DBCACHE_MAXENTRIES setting | w, c |
database-cache-overcrowding | enterprises.334.72.1.1.10.21.0 | The number of times a database is not placed into the cache when it is closed because lnDBCacheCurrentEntries equals or exceeds lnDBCacheMaxEntries*1.5. This number should stay low. If it begins to rise, you should increase the NSF_DBCACHE_MAXENTRIES setting | w 10, c 20 |
replicator-status | enterprises.334.72.1.1.6.1.3.0 | Status of the Replicator task | |
router-status | enterprises.334.72.1.1.6.1.4.0 | Status of the Router task | |
replication-failed | enterprises.334.72.1.1.5.4.0 | Number of replications that generated an error | |
server-availability-index | enterprises.334.72.1.1.6.3.19.0 | Current percentage index of server’s availability. Value range is 0-100. Zero (0) indicates no available resources; a value of 100 indicates server completely available | |
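To illustrate how the warning and critical values in the table translate into Nagios states, here is a minimal sketch. It assumes that higher values are worse and that a value equal to a threshold already triggers it; both are assumptions, not something the attached definitions guarantee. Only a few rows from the table are copied in, and the function name is hypothetical. The return values are the standard Nagios plugin exit codes.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

# A few rows from the table above: service -> (warning, critical)
THRESHOLDS = {
    "dead-mail": (80, 100),
    "routing-failures": (100, 150),
    "pending-routing": (10, 20),
    "max-mail-delivery-time": (300, 600),
}


def evaluate(service, value):
    """Map a polled value to a Nagios state; higher values are treated as worse."""
    if service not in THRESHOLDS or value is None:
        return UNKNOWN
    warn, crit = THRESHOLDS[service]
    if value >= crit:      # assumption: reaching the threshold triggers it
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

A check script would exit with the returned code, for example `sys.exit(evaluate("dead-mail", 85))` for a WARNING. Checks where a low value is the bad case, such as server-availability-index, would need the comparison reversed.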
Interesting OIDs to plot for trend analysis
OID | Description |
---|---|
enterprises.334.72.1.1.4.2.0 | Number of messages received by router |
enterprises.334.72.1.1.4.4.0 | Total number of mail messages routed since the server started |
enterprises.334.72.1.1.4.5.0 | Number of messages router attempted to transfer |
enterprises.334.72.1.1.4.8.0 | Notes server’s mail domain |
enterprises.334.72.1.1.4.11.0 | Average size of mail messages delivered in bytes |
enterprises.334.72.1.1.4.13.0 | Maximum number of server hops for mail delivery |
enterprises.334.72.1.1.4.14.0 | Maximum size of mail delivered in bytes |
enterprises.334.72.1.1.4.15.0 | Minimum time for mail delivery in seconds |
enterprises.334.72.1.1.4.16.0 | Minimum number of server hops for mail delivery |
enterprises.334.72.1.1.4.17.0 | Minimum size of mail delivered in bytes |
enterprises.334.72.1.1.4.18.0 | Total mail transferred in kilobytes |
enterprises.334.72.1.1.4.20.0 | Count of actual mail items delivered (may be different from delivered which counts individual messages) |
enterprises.334.72.1.1.4.26.0 | Peak transfer rate |
enterprises.334.72.1.1.4.27.0 | Peak number of messages transferred |
enterprises.334.72.1.1.4.32.0 | Number of mail messages moved from MAIL.BOX via SMTP |
enterprises.334.72.1.1.15.1.24.0 | cache cmd hit rate |
enterprises.334.72.1.1.15.1.26.0 | cache db hit rate |
enterprises.334.72.1.1.11.6.0 | hourly access denials |
enterprises.334.72.1.1.15.1.13.0 | req per 5 min |
enterprises.334.72.1.1.11.9.0 | unsuccessful run |
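Several of the values above are running totals since the server started, so a graph of the raw number only ever climbs. For trend analysis it is usually more useful to plot the change between two polls. The following is a minimal sketch of that idea, together with a helper that formats a value as a Nagios performance-data item so a graphing add-on can pick it up. The function names are hypothetical, and treating a decreasing counter as a server restart is an assumption.

```python
def rate_per_minute(prev_value, prev_time, curr_value, curr_time):
    """Per-minute rate between two samples of a since-startup counter.

    Times are in seconds. Returns None when no time has elapsed.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        return None
    delta = curr_value - prev_value
    if delta < 0:
        # Counter went backwards: assume the server restarted and count from zero.
        delta = curr_value
    return delta * 60.0 / elapsed


def perfdata(label, value, uom=""):
    """Format one Nagios performance-data item as 'label'=value[UOM]."""
    return "'{}'={}{}".format(label, value, uom)
```

A plugin would print its status text, then a `|` character, then the performance data, for example `print("OK - mail routed | " + perfdata("mail routed", 42))`.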