Troubleshooting

Overview of troubleshooting system problems

When the Lustre storage system is experiencing problems, the three principal resources for debugging an issue are support bundles, system logs, and GEM logs. The following actions and referenced topics may help to diagnose and resolve problems or get assistance from Hewlett Packard Enterprise support.

Table 1.
Action | Description and Reference
Read the Manage a Service Issue topic. | Provides suggested steps to understand and diagnose host and service problems.
Review event logs. | See the Health Tab topic.
Review information on the Tactical Monitoring Overview (TMO). | The TMO is designed to serve as a bird's-eye view of all network monitoring activity. It allows the operator to quickly see network outages, host status, and service status. See the Health Tab topic.
Read system monitoring topics. | The views and reports available on the Health tab expose the details of host or service alerts and notifications, and provide detailed information about the underlying causes of alerted problems. See the View the Host Status and View the Service Status topics.
Review or create a support bundle. | When a Lustre error or a system event (such as failover) occurs, the system automatically triggers a process to collect system data and diagnostics and bundle them in support files. System administrators can also collect a support bundle manually. These support bundle files are helpful when diagnosing system problems. Also, if a problem is escalated to Hewlett Packard Enterprise support, a support bundle must be attached to the support ticket. See the topics View Support Files, Collect Support Files Automatically, and Manually Collect a Support File.

Work with Support Files

The system support functionality enables system administrators to collect diagnostic information, including logs and configuration settings, automatically or manually. When a Lustre error occurs, the system automatically collects this diagnostic information. As needed, system administrators can also manually collect a diagnostic payload and browse the contents.

Support Bundle Contents

Data related to system errors is collected in support files, which are packaged together into support bundles. A support bundle is a standard UNIX compressed archive (tar-gzip) that includes:

  • System logs for all nodes for the 45-minute period before the error occurred
  • List of all cluster nodes and information for each node:
    • Software version
    • Linux kernel and patches
    • Sonexion RPMs
    • OSTs mounted on the node
    • Power states
    • Resource states
    • Relevant processes
    • Sysrq data
  • Current Apache/WSGI logs from the MGS/MDS
  • Application state data (MySQL database dump)
  • Log files for RAID configuration (MDRAID examine output)
  • Log files for local Lustre users (if any are defined)
  • Diagnostic and performance test logs
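Because a support bundle is a tar-gzip archive, its contents can be listed with standard tools once the file has been downloaded. The following is a minimal sketch using Python's standard tarfile module. The function name list_bundle_members is illustrative only, and the sketch makes no assumption about how files are named or arranged inside a real bundle.

```python
import tarfile


def list_bundle_members(bundle_path):
    """Return (name, size_in_bytes) for each regular file in a tar-gzip archive.

    bundle_path is the location of a downloaded support bundle; the caller
    supplies it, since bundle file names are not assumed here.
    """
    # "r:gz" opens the archive read-only and expects gzip compression.
    with tarfile.open(bundle_path, mode="r:gz") as archive:
        # Skip directories and links so that only real files are reported.
        return [
            (member.name, member.size)
            for member in archive.getmembers()
            if member.isfile()
        ]
```

Calling list_bundle_members with the path to a downloaded bundle returns the file names and sizes without extracting anything, which can be a quick way to confirm that a bundle is readable before attaching it to a support ticket.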

The subtopics that follow describe the support file collection process and provide procedures for working with the diagnostic information.