State Types

Explanation of the two (2) state types (soft and hard) used on the Health Tab

The current state of monitored services and hosts is determined by two (2) components:

  • The status of the service or host (for example, OK, WARNING, UP, or DOWN).
  • The state type of the service or host.

There are two (2) state types: SOFT and HARD. These state types are a crucial part of the monitoring logic; they are used to determine when event handlers are executed and when notifications are initially sent out.

Service and Host Check Retries

To prevent false alarms from transient problems, the monitoring tool allows users to define how many times a service or host should be (re)checked before it is considered to have a real problem. This is controlled by the max_check_attempts option in the host and service definitions. Understanding how hosts and services are (re)checked is important in understanding how state types work.

Set the Service and Host Check Retry Counts

Service and host check retry counts are set or changed manually by editing the Icinga configuration files and then running puppet. Host check retry counts cannot be set with the Sonexion System Manager GUI.

The max_check_attempts value is set to 3 for all hosts and services in the file /etc/puppet/modules/icinga/files/objects/templates.cfg. It can also be set to a different value for each service or host. The configuration files that hold the hosts and services in Icinga are in the directory: /etc/icinga/objects/. They are generated by the puppet module /etc/puppet/modules/icinga/ and by /opt/xyratex/python/t0/monitoring/monitoring2puppet.py.
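To see which objects set max_check_attempts, the object files can be scanned with a short script. The following is a minimal Python sketch that assumes the Icinga 1.x "define <type> { ... }" object syntax. The sample text, the template names, and the function name are hypothetical and are not taken from the shipped templates.cfg.

```python
import re

# Hypothetical sample in the Icinga 1.x object-definition syntax.
# The real templates.cfg on a system will differ.
SAMPLE_CFG = """
define host {
    name                generic-host      ; hypothetical template name
    max_check_attempts  3
    register            0
}

define service {
    name                generic-service   ; hypothetical template name
    max_check_attempts  5
    register            0
}
"""

# Matches one "define <type> { ... }" block; bodies do not nest.
BLOCK_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def max_check_attempts_by_object(cfg_text):
    """Return {(object_type, name): max_check_attempts} for blocks that set it."""
    found = {}
    for obj_type, body in BLOCK_RE.findall(cfg_text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        if "max_check_attempts" in directives:
            label = (directives.get("name")
                     or directives.get("service_description")
                     or directives.get("host_name")
                     or "?")
            found[(obj_type, label)] = int(directives["max_check_attempts"])
    return found


if __name__ == "__main__":
    for (obj_type, label), value in max_check_attempts_by_object(SAMPLE_CFG).items():
        print(f"{obj_type} {label}: max_check_attempts={value}")
```

To inspect a real file, read its contents and pass the text to max_check_attempts_by_object() in place of SAMPLE_CFG.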

Soft States

Soft states occur in the following situations:

  • When a service or host check results in a non-OK or non-UP state and the service or host has not yet been (re)checked the number of times specified by the max_check_attempts directive in the service or host definition. This situation is called a soft error.
  • When a service or host recovers from a soft error, it is considered a soft recovery.

The following things occur when hosts or services experience SOFT state changes:

  • The SOFT state is logged.
  • Event handlers are executed to handle the SOFT state.

SOFT states are only logged when the log_service_retries or log_host_retries options in the main configuration file are enabled.
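Whether SOFT states will be logged can be checked by reading the two options from the main configuration file. The following is a minimal Python sketch that assumes the file uses the key=value line format of Icinga 1.x. The sample text and the function names are hypothetical, and treating a missing option as disabled is an assumption of the sketch rather than documented behavior.

```python
# Hypothetical excerpt of a main configuration file (key=value lines).
SAMPLE_MAIN_CFG = """
# logging options
log_service_retries=1
log_host_retries=0
"""


def parse_main_config(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key.strip()] = value.strip()
    return options


def soft_state_logging(options):
    """Report whether SOFT states are logged for services and for hosts.

    Assumption: an option that is absent is treated as disabled.
    """
    return {
        "service": options.get("log_service_retries", "0") == "1",
        "host": options.get("log_host_retries", "0") == "1",
    }


if __name__ == "__main__":
    print(soft_state_logging(parse_main_config(SAMPLE_MAIN_CFG)))
```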

The main configuration file is /etc/puppet/modules/icinga/files/icinga.cfg. It can be edited on both MGMT nodes. After changing the file, restart puppet on those nodes by running the following command with root privileges:

/opt/xyratex/bin/beUpdatePuppet -s -g mgmt

The only significant action taken during a SOFT state is the execution of event handlers. Event handlers are useful for proactively fixing a problem before it turns into a HARD state.

Event handlers can be adjusted manually in Icinga. For more information on event handlers, see the Icinga documentation.

Hard States

Hard states occur for hosts and services in the following situations:

  • When a host or service check results in a non-UP or non-OK state and it has been (re)checked the number of times specified by the max_check_attempts option in the host or service definition. This situation is called a hard error.
  • When a host or service transitions from one hard error state to another error state (for example, WARNING to CRITICAL).
  • When a service check results in a non-OK state and its corresponding host is either DOWN or UNREACHABLE.
  • When a host or service recovers from a hard error state. This situation is called a hard recovery.
  • When a passive host check is received. Passive host checks are treated as HARD unless the passive_host_checks_are_soft option is enabled.

The following things occur when hosts or services experience HARD state changes:

  • The HARD state is logged.
  • Event handlers are executed to handle the HARD state.
  • Contacts who are subscribers are notified of the host or service problem or recovery. The contacts and notifications can be configured using the cscli alerts_config command.

The following example shows how state types are determined, when state changes occur, and when event handlers are executed and notifications are sent out. The table shows consecutive checks of a service over time. The service has a max_check_attempts value of 3.

TIME | CHECK # | STATE    | STATE TYPE | STATE CHANGE | NOTES
0    | 1       | OK       | HARD       | No           | Initial state of the service.
1    | 1       | CRITICAL | SOFT       | Yes          | First detection of a non-OK state. Event handlers execute.
2    | 2       | WARNING  | SOFT       | Yes          | Service continues to be in a non-OK state. Event handlers execute.
3    | 3       | CRITICAL | HARD       | Yes          | Max check attempts have been reached, so service goes into a HARD state. Event handlers execute and a problem notification is sent out. Check # is reset to 1 immediately after this happens.
4    | 1       | WARNING  | HARD       | Yes          | Service changes to a HARD WARNING state. Event handlers execute and a problem notification is sent out.
5    | 1       | WARNING  | HARD       | No           | Service stabilizes in a HARD problem state. Depending on what the notification interval for the service is, another notification might be sent out.
6    | 1       | OK       | HARD       | Yes          | Service experiences a HARD recovery. Event handlers execute and a recovery notification is sent out.
7    | 1       | OK       | HARD       | No           | Service is still OK.
8    | 1       | UNKNOWN  | SOFT       | Yes          | Service is detected as changing to a SOFT non-OK state. Event handlers execute.
9    | 2       | OK       | SOFT       | Yes          | Service experiences a SOFT recovery. Event handlers execute, but notifications are not sent, as this was not a "real" problem. State type is set HARD and check # is reset to 1 immediately after this happens.
10   | 1       | OK       | HARD       | No           | Service stabilizes in an OK state.
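
The walkthrough in the table above can be reproduced with a small state machine. The following Python sketch is an illustration of the rules described in this section, not the Icinga implementation. It covers a single actively checked service, ignores the status of the parent host and passive checks, sends no repeat notifications based on the notification interval, and treats a SOFT retry that returns the same non-OK state as no state change. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckRow:
    time: int
    attempt: int      # the "CHECK #" column
    state: str
    state_type: str   # "SOFT" or "HARD"
    changed: bool     # the "STATE CHANGE" column
    notified: bool    # True when a problem or recovery notification is sent


def simulate(results, max_check_attempts=3):
    """Apply the SOFT/HARD rules to consecutive check results of one service."""
    rows = []
    # The service is assumed to start in a HARD OK state.
    prev_state, prev_type, prev_attempt = "OK", "HARD", 1
    for time, state in enumerate(results):
        notified = False
        if state == "OK":
            if prev_state == "OK":
                # Still OK: nothing changes.
                attempt, state_type, changed = 1, "HARD", False
            elif prev_type == "SOFT":
                # SOFT recovery: event handlers run, no notification.
                attempt, state_type, changed = prev_attempt + 1, "SOFT", True
            else:
                # HARD recovery: a recovery notification is sent.
                attempt, state_type, changed = 1, "HARD", True
                notified = True
        elif prev_state == "OK":
            # First detection of a non-OK state.
            attempt = 1
            state_type = "HARD" if max_check_attempts <= 1 else "SOFT"
            changed = True
            notified = state_type == "HARD"
        elif prev_type == "SOFT":
            # Retry while the problem is still SOFT.
            attempt = prev_attempt + 1
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
            changed = state != prev_state or state_type == "HARD"
            notified = state_type == "HARD"
        else:
            # Already in a HARD problem state.
            attempt, state_type = 1, "HARD"
            changed = state != prev_state
            notified = changed
        rows.append(CheckRow(time, attempt, state, state_type, changed, notified))
        # After any OK result or any HARD result, the check # resets to 1
        # and the stored state type is HARD.
        prev_state = state
        if state == "OK" or state_type == "HARD":
            prev_type, prev_attempt = "HARD", 1
        else:
            prev_type, prev_attempt = "SOFT", attempt
    return rows


# The sequence of check results from the table.
RESULTS = ["OK", "CRITICAL", "WARNING", "CRITICAL", "WARNING", "WARNING",
           "OK", "OK", "UNKNOWN", "OK", "OK"]

if __name__ == "__main__":
    for row in simulate(RESULTS):
        print(row.time, row.attempt, row.state, row.state_type,
              "Yes" if row.changed else "No",
              "notification" if row.notified else "")
```

Changing max_check_attempts in the call to simulate() shows how a higher retry count keeps a short-lived problem in the SOFT state, so no notification is sent for it.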