Flapping (Hosts)
How monitoring features provide detection of flapping hosts
Monitoring supports optional detection of hosts that are "flapping." Flapping occurs when a host changes state too frequently, resulting in a storm of problem and recovery notifications. Flapping can indicate configuration problems (e.g., thresholds set too low), troublesome services, or real network problems.
Whenever monitoring checks the status of a host, it also checks whether the host has started or stopped flapping. It does this by:
- Storing the results of the last 21 checks of the host.
- Analyzing the historical check results and determining where state changes/transitions occur.
- Using the state transitions to determine a percent state change value (a measure of change) for the host.
- Comparing the percent state change value against low and high flapping thresholds.
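The first three steps above can be sketched in Python. This is a minimal illustration, not the product's actual implementation: the names `HISTORY_SIZE` and `percent_state_change` are hypothetical, and the sketch assumes every transition counts equally, whereas a real implementation may weight recent transitions more heavily. With 21 stored results there are at most 20 transitions, so the sketch divides by 20.

```python
from collections import deque

HISTORY_SIZE = 21  # number of check results kept, as described above


def percent_state_change(history):
    """Return an unweighted percent state change for a sequence of states.

    Assumption: each transition counts equally and the denominator is the
    maximum number of transitions a full history can hold (HISTORY_SIZE - 1).
    """
    states = list(history)
    if len(states) < 2:
        return 0.0
    # A transition occurs wherever two consecutive results differ.
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * transitions / (HISTORY_SIZE - 1)


# A bounded deque drops the oldest result once 21 results are stored.
history = deque(maxlen=HISTORY_SIZE)
for state in ["UP"] * 15 + ["DOWN", "UP", "DOWN", "UP", "DOWN", "UP"]:
    history.append(state)

print(percent_state_change(history))  # 6 transitions out of 20 -> 30.0
```

Using a bounded `deque` keeps the history at a fixed size without any manual trimming.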
A host is determined to have started flapping when its percent state change first exceeds the high flapping threshold. A host is determined to have stopped flapping when its percent state change drops below the low flapping threshold (assuming that it was previously flapping).
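The two thresholds form a hysteresis band: a value between them leaves the flapping state unchanged. The sketch below illustrates that rule. The function name `update_flapping` and the threshold values 20.0 and 30.0 are assumptions chosen for illustration, not documented defaults.

```python
def update_flapping(is_flapping, percent_change, low_threshold, high_threshold):
    """Return the new flapping state given the latest percent state change."""
    if not is_flapping and percent_change > high_threshold:
        return True   # started flapping: exceeded the high threshold
    if is_flapping and percent_change < low_threshold:
        return False  # stopped flapping: dropped below the low threshold
    return is_flapping  # inside the band, or no change: keep the current state


# Illustrative thresholds only (hypothetical values).
LOW, HIGH = 20.0, 30.0

flapping = False
observed = []
for value in [10.0, 35.0, 25.0, 15.0]:
    flapping = update_flapping(flapping, value, LOW, HIGH)
    observed.append(flapping)

print(observed)  # [False, True, True, False]
```

Note that 25.0 does not clear the flapping state, because the host must fall below the low threshold before it is considered stable again.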
Flap detection for a host is attempted whenever:
- The host is checked (actively or passively).
- A service associated with the host is checked, but only sometimes: specifically, when at least x amount of time has passed since flap detection was last performed for the host, where x is the average check interval of all services associated with the host.
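The two triggers above can be expressed as a small predicate. This is a sketch under stated assumptions: the function name, the `trigger` strings, and the use of seconds for all times are hypothetical and do not reflect the product's internal interface.

```python
from statistics import mean


def should_check_host_flapping(trigger, now, last_flap_check, service_check_intervals):
    """Decide whether host flap detection should run.

    trigger: "host_check" or "service_check" (hypothetical labels).
    now, last_flap_check: timestamps in seconds.
    service_check_intervals: check intervals, in seconds, of all services
    associated with the host.
    """
    if trigger == "host_check":
        # Any active or passive host check triggers flap detection.
        return True
    if trigger == "service_check" and service_check_intervals:
        # Run only if at least the average service check interval has elapsed.
        return now - last_flap_check >= mean(service_check_intervals)
    return False


intervals = [300, 300, 600]  # average is 400 seconds
print(should_check_host_flapping("service_check", 1000, 800, intervals))  # False
print(should_check_host_flapping("service_check", 1300, 800, intervals))  # True
```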
Flap Handling
When a host first starts flapping, monitoring will:
- Log a message indicating that the host is flapping.
- Add a non-persistent comment to the host indicating that it is flapping.
- Send a flapping start notification for the host to appropriate contacts.
- Suppress other notifications for the host (this is one of the filters in the notification logic).
When a host stops flapping, monitoring will:
- Log a message indicating that the host has stopped flapping.
- Delete the comment that was originally added to the host when it started flapping.
- Send a flapping stop notification for the host to appropriate contacts.
- Remove the block on notifications for the host (notifications will still be bound to the normal notification logic).
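The start and stop actions listed above mirror each other, which the following sketch makes explicit. Everything here is hypothetical: the `Host` class, its fields, and the notification labels `FLAPPINGSTART` and `FLAPPINGSTOP` are stand-ins for illustration, and lists are used in place of a real log and notification system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    name: str
    is_flapping: bool = False
    notifications_suppressed: bool = False
    flapping_comment: Optional[str] = None
    log: list = field(default_factory=list)   # stand-in for the log file
    sent: list = field(default_factory=list)  # stand-in for sent notifications


def start_flapping(host):
    """Apply the four actions taken when a host starts flapping."""
    host.log.append(f"{host.name} started flapping")
    host.flapping_comment = "Host is flapping"  # non-persistent comment
    host.sent.append("FLAPPINGSTART")
    host.notifications_suppressed = True        # filter other notifications
    host.is_flapping = True


def stop_flapping(host):
    """Undo the flapping actions when a host stops flapping."""
    host.log.append(f"{host.name} stopped flapping")
    host.flapping_comment = None                # delete the flapping comment
    host.sent.append("FLAPPINGSTOP")
    host.notifications_suppressed = False       # normal notification logic applies
    host.is_flapping = False


host = Host("web01")
start_flapping(host)
stop_flapping(host)
print(host.sent)  # ['FLAPPINGSTART', 'FLAPPINGSTOP']
```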