RAS – Reliability, Availability, and Serviceability
This section provides an overview of the RAS system used to monitor and maintain ClusterStor™ L300 and L300N systems.
Reliability, Availability, and Serviceability (RAS) is a set of attributes that reflect the robustness of a hardware and/or software system.
- Reliability is the probability that a system that is working now will still be working at a given future time; it is therefore a function of time.
- Availability is the measure of how often the system is available for use (such as a system’s uptime percentage). Availability and reliability may sound like the same concept, but they are different: a system can be highly available yet unreliable, for example when it fails often but recovers quickly each time.
- Serviceability is a broad term that describes the ease of system service or repair.
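The difference between availability and reliability can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a constant failure rate (the exponential model), and the function names and the example MTBF/MTTR figures are hypothetical, not values taken from L300 or L300N systems.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of running for mission_hours with no failure at all,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-mission_hours / mtbf_hours)


# Hypothetical system: fails about once an hour, recovers in about one second.
mtbf = 1.0            # mean time between failures, in hours
mttr = 1.0 / 3600.0   # mean time to repair, in hours

print(f"availability:        {availability(mtbf, mttr):.4%}")
print(f"24-hour reliability: {reliability(24.0, mtbf):.2e}")
```

Under these assumptions the system is up more than 99.9% of the time, yet the probability that it runs a full day without a single failure is effectively zero.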
Reliability and Availability (RA)
- Hardware redundancy (power, servers, networks)
- Storage stacks (SCSI, RAID, local file system)
- HA stack
- Clustered file system
Serviceability
Serviceability consists of on-product capabilities and off-product tools, processes, and staff. Serviceability supports the needs of the product (its RA model, for example); HPC and archive products, for instance, have different requirements. There are cost constraints associated with serviceability. Most importantly, the customer should not be impacted: maintenance operations should cause no loss of availability. Serviceability focuses on Level 1 service personnel and, eventually, end users.
- Inventory discovery (part/serial numbers, firmware versions, location)
- Monitoring, diagnostics, and fault isolation:
  - Data collection from functional codes (RAID, HA, FS)
  - Discovery and data collection from hardware (SES, IPMI, SNMP)
- Reporting service manager policy engine (product-specific)
- Reporting service notifications (user email, SNMP)
- Reporting live event telemetry stream (interesting events)
- Guided repair assistance for service
  - Supports disks, PSUs/PCMs, and ESM controller server modules

Key characteristics of the RAS system:
- Architected specifically for L300 and L300N
- Topologically aware, which enables correct issue prioritization
- Designed to reduce false positives and repetitive alerts
- Event-driven end to end across the system, which allows real-time updates
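As a rough illustration of how topological awareness can cut down repetitive alerts, the following minimal sketch suppresses repeats of an already-open fault and faults on components whose parent is already faulted. It is not the RAS implementation: the `AlertFilter` class, the `PARENT` map, and the component names are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass, field

# Hypothetical topology: each component maps to its parent (None = top level).
PARENT = {
    "disk-07": "enclosure-1",
    "disk-08": "enclosure-1",
    "psu-0": "enclosure-1",
    "enclosure-1": None,
}


@dataclass
class AlertFilter:
    parent: dict
    faulted: set = field(default_factory=set)  # components with an open fault

    def _ancestors(self, component):
        """Yield the parent, grandparent, and so on of a component."""
        node = self.parent.get(component)
        while node is not None:
            yield node
            node = self.parent.get(node)

    def handle(self, component: str, state: str) -> bool:
        """Process one event; return True if it should raise an alert."""
        if state == "ok":
            self.faulted.discard(component)   # fault cleared, nothing to alert
            return False
        if component in self.faulted:
            return False                      # repeat of an already-open fault
        self.faulted.add(component)
        # A fault below an already-faulted ancestor is treated as a symptom.
        return not any(a in self.faulted for a in self._ancestors(component))


alerts = AlertFilter(PARENT)
for event in [("enclosure-1", "fault"), ("disk-07", "fault"),
              ("enclosure-1", "fault"), ("enclosure-1", "ok"),
              ("disk-08", "fault")]:
    print(event, "-> alert" if alerts.handle(*event) else "-> suppressed")
```

Because each event is evaluated as it arrives, the filter fits an event-driven flow: only the enclosure fault and the later, independent disk fault would be surfaced to the operator in this example.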
The L300 and L300N RAS service console provides a real-time GUI with guided repair assistance.
Remote support includes email notifications, user alert emails, and telemetry data. RAS-based SUs are designed to be applied with no downtime or interruption.