Flapping (Services)

How monitoring features detect flapping services

Monitoring supports optional detection of services that are flapping. Flapping occurs when a service changes state too frequently, resulting in a storm of problem and recovery notifications. Flapping can indicate configuration problems (e.g., thresholds set too low), troublesome services, or real network problems.

Whenever monitoring checks the status of a service, it also determines whether the service has started or stopped flapping. It does this by:

  • Storing the results of the last 21 checks of the service.
  • Analyzing the historical check results and determining where state changes/transitions occur.
  • Using the state transitions to determine a percent state change value (a measure of change) for the service.
  • Comparing the percent state change value against low and high flapping thresholds.
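
As a rough illustration of the first three steps above, the following Python sketch keeps the last 21 results in a fixed-size buffer and counts how many adjacent results differ. The function name, the state labels, and the equal weighting of every transition are assumptions made for this example; the actual calculation may weight transitions differently.

```python
from collections import deque

HISTORY_SIZE = 21  # number of check results kept, per the list above


def percent_state_change(history):
    """Return the share of adjacent result pairs whose state differs, in percent.

    Assumption: every transition counts equally, and the total is divided by
    the maximum number of transitions a full history can hold (20).
    """
    states = list(history)
    if len(states) < 2:
        return 0.0
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * transitions / (HISTORY_SIZE - 1)


# A deque with maxlen drops the oldest result once 21 results are stored.
history = deque(maxlen=HISTORY_SIZE)
sample = ["OK"] * 14 + ["CRITICAL", "OK", "CRITICAL", "OK", "WARNING", "OK", "OK"]
for state in sample:
    history.append(state)

print(percent_state_change(history))  # 6 transitions out of 20 -> 30.0
```

The resulting value is what would then be compared against the low and high flapping thresholds.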

A service is determined to have started flapping when its percent state change first exceeds the high flapping threshold. A service is determined to have stopped flapping when its percent state change falls below the low flapping threshold (assuming that it was previously flapping).
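
The two-threshold rule can be sketched as a small state update. This is a minimal example, not the actual implementation: the function name and the threshold values (20% and 30%) are hypothetical and chosen only for illustration.

```python
# Hypothetical threshold values, in percent; real values are configurable.
LOW_THRESHOLD = 20.0
HIGH_THRESHOLD = 30.0


def update_flapping(is_flapping, pct_change, low=LOW_THRESHOLD, high=HIGH_THRESHOLD):
    """Return the new flapping status given the current one and the latest value."""
    if not is_flapping and pct_change > high:
        return True   # started flapping: value exceeded the high threshold
    if is_flapping and pct_change < low:
        return False  # stopped flapping: value fell below the low threshold
    return is_flapping  # between the thresholds, the status is unchanged


state = False
trace = []
for pct in [10.0, 35.0, 25.0, 22.0, 15.0]:
    state = update_flapping(state, pct)
    trace.append(state)

print(trace)  # [False, True, True, True, False]
```

Using separate thresholds means that a value hovering between them (25.0 and 22.0 in the example) does not toggle the flapping status back and forth.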

Flap Handling

When a service is first detected as flapping, monitoring will:

  • Log a message indicating that the service is flapping.
  • Add a non-persistent comment to the service indicating that it is flapping.
  • Send a flapping start notification for the service to appropriate contacts.
  • Suppress other notifications for the service (this is one of the filters in the notification logic).

When a service stops flapping, monitoring will:

  • Log a message indicating that the service has stopped flapping.
  • Delete the comment that was originally added to the service when it started flapping.
  • Send a flapping stop notification for the service to appropriate contacts.
  • Remove the block on notifications for the service (notifications remain subject to the normal notification logic).
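
The start and stop actions mirror each other, which the following Python sketch tries to make explicit. It is an in-memory model only: the class, its attributes, and the notification labels are hypothetical names invented for this example and do not correspond to any real interface.

```python
class ServiceFlapState:
    """Toy model of the flap handling steps (all names are hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.comment = None                    # non-persistent flapping comment
        self.notifications_suppressed = False
        self.log = []                          # stands in for the log file
        self.sent = []                         # notifications actually sent

    def on_flap_start(self):
        self.log.append(f"{self.name}: started flapping")
        self.comment = "Service is flapping"
        self.sent.append("flapping start")
        self.notifications_suppressed = True   # suppress other notifications

    def on_flap_stop(self):
        self.log.append(f"{self.name}: stopped flapping")
        self.comment = None                    # delete the flapping comment
        self.sent.append("flapping stop")
        self.notifications_suppressed = False  # lift the block

    def notify(self, kind):
        """Send an ordinary notification unless flapping suppresses it."""
        if self.notifications_suppressed:
            return False
        self.sent.append(kind)
        return True


svc = ServiceFlapState("web")
svc.on_flap_start()
print(svc.notify("problem"))   # False: suppressed while flapping
svc.on_flap_stop()
print(svc.notify("recovery"))  # True: the block has been removed
```

In this sketch the suppression check is the only filter; in practice a notification that passes it would still go through the rest of the normal notification logic.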