Flapping (Services)

How monitoring features detect flapping services

Monitoring supports optional detection of services that are flapping. Flapping occurs when a service changes state too frequently, resulting in a storm of problem and recovery notifications. Flapping can be indicative of configuration problems (e.g., thresholds set too low), troublesome services, or real network problems.

Whenever monitoring checks the status of a service, it also checks whether that service has started or stopped flapping. It does this by:

Storing the results of the last 21 checks of the service.
Analyzing the historical check results and determining where state changes/transitions occur.
Using the state transitions to determine a percent state change value (a measure of change) for the service.
Comparing the percent state change value against low and high flapping thresholds.
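
As a rough illustration of the steps above, the following Python sketch keeps the last 21 check results and derives a percent state change value from them. It assumes a simple unweighted calculation (the number of transitions divided by the number of adjacent result pairs); the actual calculation may differ, for example by weighting recent transitions more heavily. The names HISTORY_SIZE and percent_state_change are hypothetical and chosen only for this example.

```python
from collections import deque

# From the text: the results of the last 21 checks are stored.
HISTORY_SIZE = 21


def percent_state_change(history):
    """Return the share of adjacent result pairs that differ, as a percentage.

    Assumption: every transition counts equally (no weighting).
    """
    states = list(history)
    if len(states) < 2:
        return 0.0
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * transitions / (len(states) - 1)


# A bounded deque drops the oldest result once 21 results are stored.
history = deque(maxlen=HISTORY_SIZE)
sample = ["OK"] * 14 + ["CRITICAL", "OK", "CRITICAL", "OK", "CRITICAL", "OK", "OK"]
for state in sample:
    history.append(state)

# 6 transitions across 20 adjacent pairs
print(percent_state_change(history))  # prints 30.0
```

With 21 stored results there are 20 adjacent pairs, so under this assumption each transition contributes 5 percentage points.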

A service is determined to have started flapping when its percent state change first exceeds the high flapping threshold. A service is determined to have stopped flapping when its percent state change drops below the low flapping threshold (assuming that it was previously flapping).
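
The start and stop rule above behaves like a two-threshold hysteresis, which can be sketched as follows. The function name update_flapping and the threshold values 25.0 and 50.0 are illustrative assumptions, not documented defaults.

```python
def update_flapping(is_flapping, pct_change, low_threshold, high_threshold):
    """Return the new flapping state for one check.

    Hypothetical helper: starts flapping above the high threshold, stops
    below the low threshold, and otherwise keeps the previous state.
    """
    if not is_flapping and pct_change > high_threshold:
        return True
    if is_flapping and pct_change < low_threshold:
        return False
    return is_flapping


# Illustrative thresholds only; real values come from the configuration.
LOW, HIGH = 25.0, 50.0

flapping = False
states = []
for pct in [10.0, 55.0, 40.0, 30.0, 20.0, 10.0]:
    flapping = update_flapping(flapping, pct, LOW, HIGH)
    states.append(flapping)

print(states)  # prints [False, True, True, True, False, False]
```

Values between the two thresholds leave the state unchanged, so a service hovering near a single cutoff does not toggle between flapping and not flapping on every check.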

Flap Handling

When a service is first detected as flapping, monitoring will:

Log a message indicating that the service is flapping.
Add a non-persistent comment to the service indicating that it is flapping.
Send a flapping start notification for the service to appropriate contacts.
Suppress other notifications for the service (this is one of the filters in the notification logic).

When a service stops flapping, monitoring will:

Log a message indicating that the service has stopped flapping.
Delete the comment that was originally added to the service when it started flapping.
Send a flapping stop notification for the service to appropriate contacts.
Remove the block on notifications for the service (notifications are still subject to the normal notification logic).
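
The two handling sequences can be modeled as a pair of event handlers. This is only a minimal sketch under stated assumptions: the Service class, its fields, the comment text, and the notification labels "flapping_start" and "flapping_stop" are all hypothetical stand-ins, not the monitoring system's actual data model or API.

```python
from dataclasses import dataclass, field

# Hypothetical comment text used to mark a flapping service.
FLAP_COMMENT = "Service is flapping"


@dataclass
class Service:
    """Hypothetical in-memory model of a monitored service."""
    name: str
    is_flapping: bool = False
    notifications_suppressed: bool = False
    comments: list = field(default_factory=list)
    log: list = field(default_factory=list)
    sent: list = field(default_factory=list)


def on_flapping_start(svc):
    """Apply the four start-of-flapping steps in the order listed."""
    svc.is_flapping = True
    svc.log.append(f"{svc.name}: started flapping")
    svc.comments.append(FLAP_COMMENT)      # non-persistent comment
    svc.sent.append("flapping_start")      # notify contacts first
    svc.notifications_suppressed = True    # then suppress other notifications


def on_flapping_stop(svc):
    """Apply the four stop-of-flapping steps in the order listed."""
    svc.is_flapping = False
    svc.log.append(f"{svc.name}: stopped flapping")
    if FLAP_COMMENT in svc.comments:
        svc.comments.remove(FLAP_COMMENT)  # delete the flapping comment
    svc.sent.append("flapping_stop")
    svc.notifications_suppressed = False   # normal notification logic resumes


svc = Service("web")
on_flapping_start(svc)
on_flapping_stop(svc)
print(svc.sent)  # prints ['flapping_start', 'flapping_stop']
```

In this sketch the flapping start notification is sent before suppression is switched on, mirroring the order of the steps in the text.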