RAS – Reliability, Availability, and Serviceability

Overview of the RAS system to monitor and maintain L300 and L300N systems

This section provides an overview of the RAS system used to monitor and maintain ClusterStor L300 and L300N systems.

Reliability, Availability, and Serviceability (RAS) is a set of attributes that reflect the robustness of a hardware and/or software system.

  • Reliability is a function of time: it expresses the probability that a system that is working now will still be working at a given future time.
  • Availability is a measure of how often the system is available for use (for example, its uptime percentage). Availability and reliability may sound like the same concept, but they differ: a system that fails often but recovers quickly can be highly available yet unreliable.
  • Serviceability is a broad term that describes the ease of servicing or repairing a system.
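
The difference between availability and reliability can be made concrete with a small calculation. The following is a minimal sketch, not an L300 measurement: it assumes a constant failure rate (an exponential failure model), and the MTBF and MTTR figures and the function names are hypothetical.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of running mission_hours with no failure at all.

    Assumes a constant failure rate (exponential model).
    """
    return math.exp(-mission_hours / mtbf_hours)


# Hypothetical system: fails about once a day, recovers in about 30 seconds.
mtbf = 24.0              # mean time between failures, in hours
mttr = 30.0 / 3600.0     # mean time to repair, in hours

a = availability(mtbf, mttr)
r = reliability(mtbf, 24.0 * 30)  # chance of 30 days with no failure

print(f"availability:       {a:.4%}")
print(f"30-day reliability: {r:.2e}")
```

Under these assumptions the system is up more than 99.9% of the time, yet it is almost certain to fail at least once in any 30-day period: high availability, poor reliability.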

Reliability and Availability (RA)

Reliability and availability features keep a system working when failures occur. Examples of RA features include:
  • Hardware redundancy (power, servers, networks)
  • Storage stacks (SCSI, RAID, local file system)
  • HA stack
  • Clustered file system

Serviceability

Serviceability consists of on-product capabilities and off-product tools, processes, and staff. Serviceability supports the needs of the product (its RA model, for example); HPC and archive products will behave differently. There are cost constraints associated with serviceability. Most importantly, the customer should not be impacted: maintenance operations should cause no loss of availability. Serviceability focuses on Level 1 service personnel and, eventually, end users.

In L300 and L300N products, serviceability includes the following capabilities:
  • Inventory discovery (part/serial numbers, firmware versions, location)
  • Monitoring, diagnostics, and fault isolation, based on:
    • data collection from functional codes (RAID, HA, FS)
    • discovery and data collection from hardware (SES, IPMI, SNMP)
  • Reporting: service manager policy engine (product-specific)
  • Reporting: service notifications (user email, SNMP)
  • Reporting: live event telemetry stream (interesting events)
  • Guided repair assistance for service
  • Supports disks, PSUs/PCMs and ESM controller server modules

Figure: RAS Service Notification Scenario

RAS for L300 and L300N provides an "expert" system:
  • Architected specifically for L300 and L300N
  • With topological awareness that enables correct issue prioritization
  • Designed to reduce false positives and repetitive alerts
  • With event-driven data end-to-end across the system, allowing real-time updates
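
One common way to reduce repetitive alerts is to suppress repeats of the same event inside a time window. The sketch below illustrates that general technique only; it is not the L300 RAS implementation, and the class name, component and fault names, and window length are all hypothetical.

```python
from typing import Dict, Tuple


class AlertDeduplicator:
    """Suppress repeats of the same (component, fault) within a time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # Maps (component, fault) to the time an alert was last emitted.
        self._last_emitted: Dict[Tuple[str, str], float] = {}

    def should_notify(self, component: str, fault: str, timestamp: float) -> bool:
        key = (component, fault)
        last = self._last_emitted.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            return False  # repeat inside the window: suppress it
        self._last_emitted[key] = timestamp
        return True


# Hypothetical event stream: (component, fault, time in seconds).
dedup = AlertDeduplicator(window_seconds=300)
events = [
    ("disk-17", "predicted-failure", 0.0),
    ("disk-17", "predicted-failure", 60.0),   # repeat, suppressed
    ("psu-2", "no-input-power", 90.0),        # different component, sent
    ("disk-17", "predicted-failure", 400.0),  # window expired, sent again
]
sent = [e for e in events if dedup.should_notify(*e)]
print(sent)
```

The window here is measured from the last alert that was actually sent, so a fault that keeps recurring is still reported periodically rather than silenced indefinitely.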

The L300 and L300N RAS service console provides a real-time GUI with guided repair assistance.

Remote support includes email notifications, user alert emails, and telemetry data. RAS-based SUs are designed to be applied with no downtime or interruption.