Flapping (Services)
How monitoring features detect flapping services
Monitoring supports optional detection of services that are flapping. Flapping occurs when a service changes state too frequently, resulting in a storm of problem and recovery notifications. Flapping can be indicative of configuration problems (e.g., thresholds set too low), troublesome services, or real network problems.
Whenever monitoring checks the status of a service, it also checks whether that service has started or stopped flapping. It does this by:
- Storing the results of the last 21 checks of the service.
- Analyzing the historical check results and determining where state changes/transitions occur.
- Using the state transitions to determine a percent state change value (a measure of change) for the service.
- Comparing the percent state change value against low and high flapping thresholds.
A service is determined to have started flapping when its percent state change first exceeds the high flapping threshold. A service is determined to have stopped flapping when its percent state change falls below the low flapping threshold (assuming that it was previously flapping).
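The detection steps above can be sketched in Python as follows. This is a minimal illustration, not the product's actual implementation. It assumes every state transition counts equally, so 21 stored results give 20 possible transitions; a real implementation may weight recent transitions differently. The function names and the threshold values `low=5.0` and `high=20.0` are hypothetical and chosen only for the example.

```python
from collections import deque

HISTORY_SIZE = 21  # number of check results kept, as described above


def percent_state_change(history):
    """Percentage of adjacent check pairs whose state differs.

    Assumption: every transition has equal weight and the denominator
    is always the 20 possible transitions in a full 21-entry history.
    """
    states = list(history)[-HISTORY_SIZE:]
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * transitions / (HISTORY_SIZE - 1)


def update_flapping(is_flapping, pct, low=5.0, high=20.0):
    """Apply the two thresholds (illustrative values) and return the new flag."""
    if not is_flapping and pct > high:
        return True   # started flapping: value exceeded the high threshold
    if is_flapping and pct < low:
        return False  # stopped flapping: value fell below the low threshold
    return is_flapping  # between the thresholds: keep the previous state


# Usage: a bounded history automatically drops the oldest result.
history = deque(maxlen=HISTORY_SIZE)
for state in ["OK", "CRITICAL"] * 10 + ["OK"]:
    history.append(state)
flapping = update_flapping(False, percent_state_change(history))
```

Using two thresholds instead of one gives hysteresis: a service whose value hovers between the low and high thresholds keeps its current flapping status rather than toggling on every check.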
Flap Handling
When a service is first detected as flapping, monitoring will:
- Log a message indicating that the service is flapping.
- Add a non-persistent comment to the service indicating that it is flapping.
- Send a flapping start notification for the service to appropriate contacts.
- Suppress other notifications for the service (this is one of the filters in the notification logic).
When a service stops flapping, monitoring will:
- Log a message indicating that the service has stopped flapping.
- Delete the comment that was originally added to the service when it started flapping.
- Send a flapping stop notification for the service to appropriate contacts.
- Remove the block on notifications for the service (notifications will still be bound to the normal notification logic).
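The start and stop actions listed above mirror each other, which can be sketched as a small state holder. This is a hedged illustration only: the class `ServiceFlapState`, its fields, and the labels `FLAPPINGSTART` and `FLAPPINGSTOP` are hypothetical names for the example, and the log and notification lists stand in for whatever the monitoring engine really writes to.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceFlapState:
    """Hypothetical per-service record of flap handling side effects."""
    name: str
    is_flapping: bool = False
    comment: Optional[str] = None                   # non-persistent flapping comment
    log: List[str] = field(default_factory=list)    # stand-in for the event log
    sent: List[str] = field(default_factory=list)   # stand-in for sent notifications

    def notifications_blocked(self) -> bool:
        # While flapping, other notifications for the service are suppressed.
        return self.is_flapping

    def set_flapping(self, flapping: bool) -> None:
        if flapping == self.is_flapping:
            return  # no transition, so nothing to do
        self.is_flapping = flapping
        if flapping:
            self.log.append(f"{self.name}: started flapping")
            self.comment = "Service is flapping"
            self.sent.append("FLAPPINGSTART")
        else:
            self.log.append(f"{self.name}: stopped flapping")
            self.comment = None  # delete the comment added at flap start
            self.sent.append("FLAPPINGSTOP")


svc = ServiceFlapState("web")
svc.set_flapping(True)   # logs, comments, notifies, and blocks notifications
svc.set_flapping(False)  # reverses each of those actions
```

Acting only on a change of the flag keeps each start or stop notification from being sent more than once per flapping episode.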