Redefining the standard for system availability

How HPE Nimble Storage uses predictive analytics to achieve more than six nines availability across its entire installed base


Introduction

Businesses in every sector are increasingly reliant on applications to handle everything from back-end operations to the delivery of new products, services, and customer experiences. That is why infrastructure system availability and the elimination of unplanned downtime are more important than ever before.

For far too long, superior storage availability has been possible only through expensive on-site service contracts and excessively redundant hardware. Since its founding, HPE Nimble Storage has pursued an ambitious mission to break the mold: not only to build better availability into its products but also to enable continuous improvement over time.

 

HPE Nimble Storage delivers over six nines (99.999928%) of measured availability across its entire installed base. This translates to an impact of less than 25 seconds annually.
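Converting an availability percentage into annual downtime is simple arithmetic. The short Python sketch below assumes a 365.25-day year, and the helper name is illustrative. It shows that 99.999928% corresponds to roughly 22.7 seconds of downtime per year, consistent with the under-25-seconds figure, while a system at exactly six nines (99.9999%) would be allowed about 31.6 seconds.

```python
# Convert a measured availability percentage into expected downtime per year.
# Assumption: a 365.25-day year; a 365-day year changes the result by well under 0.1 s here.

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def annual_downtime_seconds(availability_pct: float) -> float:
    """Return the downtime per year implied by an availability percentage."""
    unavailability = 1.0 - availability_pct / 100.0
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    # Five nines, exactly six nines, and the measured installed-base value.
    for pct in (99.999, 99.9999, 99.999928):
        print(f"{pct}% -> {annual_downtime_seconds(pct):.1f} s/year")
```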

 

It is important to understand that published availability values are not all created equal; many are theoretical projections rather than measurements. The details of how availability is delivered distinguish one value from another, and understanding them reduces business risk. With respect to availability from HPE Nimble Storage:

 

  • It is measured and based on real, achieved values, not theoretical projections. You can be confident about future availability levels only when metrics about past performance are transparent and proven by actual data and customers.

 

  • It is measured for the entire installed base, including every model and OS release. Showing improvement on the latest products and releases is easy; the challenge is delivering high availability across the complete installed base, including systems that have been in operation for over six years.

 

  • It is continuously improving. It starts out more reliable than competing systems and keeps improving, drawing on over six years of installed-base learning and insights.

 

  • It is standard for all products, not requiring special terms or service. Building best-in-class availability into every product without charging a premium or requiring a special service contract or configuration is fundamental to HPE Nimble Storage.

 

This innovation raises the question: how does HPE Nimble Storage do it?

The basis for system reliability at HPE Nimble Storage starts with the architecture of the storage platform, which has no single point of failure because every critical component is redundant. Dual controllers allow nondisruptive upgrades and avoid any performance impact if a controller fails. Moreover, the software architecture is fault tolerant and delivers extremely robust data integrity, including Triple+ Parity RAID and end-to-end integrity validation.

 

However, some degree of unpredictability cannot be engineered out through system design because of the complexity across infrastructure layers. This has not stopped HPE Nimble Storage from continuing to improve significantly and progress towards a zero-downtime lifecycle. The measured availability of HPE Nimble Storage arrays keeps getting better through predictive analytics, installed-base learning, and our commitment to a transformed support experience. HPE Nimble Storage is redefining the standard.

 

The following sections of this paper dive into the details, revealing the unique approach that has enabled HPE Nimble Storage to continuously improve and exceed six nines of measured availability across the entire installed base.

  • How availability is measured

    The data that HPE Nimble Storage collects from storage arrays allows availability to be measured to the microsecond.

    While most arrays experience no downtime, any periods of downtime that do occur are automatically identified, categorized, and archived, allowing HPE Nimble Storage to track availability across the installed base as well as by software release, model, or any other dimension. These records are rigorously maintained, and every downtime event is investigated to make certain that the impact to the customer is accurately captured. Overall availability numbers are monitored regularly, allowing us to identify areas where further improvements can be made.

     

    Since availability tracking is such a powerful tool, it is important to make it as complete as possible. All arrays are included, with the exception of internal systems used for development and testing. Moreover, any issue that results in unplanned downtime is included, even one caused by a third-party problem. Periods when an array is not expected to be available are filtered out, for example, a general power outage or a customer shutting the array down to move it to a new location.

  • Preventing downtime with HPE InfoSight Predictive Analytics

    Since its inception, HPE Nimble Storage has incorporated advanced analytics into the core architecture of every system to radically improve operational reliability, not only for the storage arrays but also for infrastructure layers beyond storage.

    The complexity and variability across applications, infrastructure, and configurations has made downtime‑inducing problems all but inevitable.

     

    To combat this longstanding issue, HPE Nimble Storage took a unique approach, embedding diagnostic sensors into every module of code from day one and building a foundation for real-time, deep health and performance analytics. Today, each system contains thousands of sensor collectors, and HPE InfoSight Predictive Analytics collects and correlates millions of sensor data points per second across the installed base, enabling global visibility and learning.


Figure 1. Measured availability of the installed base over time
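The measurement rules described under "How availability is measured" can be illustrated with a small sketch. It assumes a hypothetical record layout (the `Array` and `Downtime` classes are illustrative, not the actual HPE InfoSight schema): internal development and test arrays are dropped, unplanned downtime counts regardless of cause, and periods when an array was not expected to be available are removed from the service-time total instead of being counted as downtime. The `key` argument groups results by any dimension, such as model or release.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Array:
    """Hypothetical array record."""
    array_id: str
    model: str
    release: str
    internal: bool           # development/test systems are excluded from tracking
    observed_seconds: float  # time in service during the measurement window


@dataclass
class Downtime:
    """Hypothetical downtime record."""
    array_id: str
    seconds: float
    expected_available: bool  # False for site power outages, planned moves, etc.


def availability_by(arrays, events, key=lambda a: "all"):
    """Return {group: availability %} over customer arrays, grouped by `key`."""
    fleet = {a.array_id: a for a in arrays if not a.internal}
    service = defaultdict(float)  # seconds the arrays were expected to be available
    down = defaultdict(float)     # seconds of unplanned downtime, whatever the cause
    for a in fleet.values():
        service[key(a)] += a.observed_seconds
    for e in events:
        a = fleet.get(e.array_id)
        if a is None:
            continue  # internal system, or an array outside the tracked fleet
        if e.expected_available:
            down[key(a)] += e.seconds     # counted even if a third party caused it
        else:
            service[key(a)] -= e.seconds  # filtered out: the array was not expected to be up
    return {g: 100.0 * (1.0 - down[g] / s) for g, s in service.items() if s > 0}


if __name__ == "__main__":
    arrays = [
        Array("a1", "model-A", "r1", False, 1_000_000.0),
        Array("a2", "model-B", "r2", False, 1_000_000.0),
        Array("lab", "model-A", "r2", True, 1_000_000.0),
    ]
    events = [
        Downtime("a1", 1.0, True),      # unplanned outage: counted
        Downtime("a2", 3600.0, False),  # customer relocation: filtered out
        Downtime("lab", 500.0, True),   # internal test system: ignored
    ]
    print(availability_by(arrays, events))
    print(availability_by(arrays, events, key=lambda a: a.model))
```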

Infrastructure that learns

HPE InfoSight applies data science to identify, predict, and prevent problems across infrastructure layers. For any new problem experienced in the installed base, a predictive health signature is created, and HPE InfoSight uses pattern-matching algorithms to continuously search for that signature across all systems.

 

If a signature is detected, HPE InfoSight either prevents the problem from occurring or proactively resolves it with a prescriptive resolution, even if the problem is outside of storage. There are no false alerts because machine learning normalizes performance behavior across the installed base.
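This paper does not describe how HPE InfoSight's machine learning normalizes performance behavior. As an illustration of the general idea only, the sketch below swaps in a much simpler technique: a z-score that compares one system's metric with a peer group of comparable systems and flags it only when it falls well outside normal peer variation. The peer grouping, the 3.0 threshold, the latency values, and the function names are all assumptions.

```python
import math
import statistics


def peer_zscore(value: float, peer_values: list[float]) -> float:
    """Standard deviations between `value` and the mean of comparable peer systems."""
    mean = statistics.fmean(peer_values)
    stdev = statistics.pstdev(peer_values)
    if stdev == 0:
        # Peers are identical: any deviation at all is treated as infinitely unusual.
        return 0.0 if value == mean else math.copysign(math.inf, value - mean)
    return (value - mean) / stdev


def is_anomalous(value: float, peer_values: list[float], threshold: float = 3.0) -> bool:
    """Flag a metric only if it is unusual relative to peers, not merely high in absolute terms."""
    return abs(peer_zscore(value, peer_values)) > threshold


if __name__ == "__main__":
    # Hypothetical read latencies (ms) of peer systems running a similar workload.
    peers = [1.0, 1.2, 0.9, 1.1, 1.0]
    print(is_anomalous(1.3, peers))  # False: within normal peer variation
    print(is_anomalous(1.5, peers))  # True: far outside it
```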

 

Each system continually gets smarter, learning from the installed base, and downtime events are increasingly prevented.

 

Non-storage factors, such as misconfigurations and host, network, or VM problems, can impact the I/O path. HPE InfoSight correlates sensor data across the infrastructure and resolves problems beyond storage, uncovering the root causes of issues affecting data delivery from storage to virtual machines (VMs). In fact, 54% of the issues HPE InfoSight resolves are outside of storage. Because HPE Nimble Storage has been at this for over ten years, HPE InfoSight has more diagnostic sensor data and predictive insights than any other vendor.

 

With HPE InfoSight and the power of predictive analytics, measured availability is greater than six nines today and continues improving for all systems. This availability value is not limited to the latest model or software version as it is for other vendors, but instead is representative of the entire HPE Nimble Storage installed base.


Figure 2. Number of predictive health signatures

  • Guiding principle for preventing issues

    If HPE Nimble Storage has seen or knows about a problem, no customer should experience the same problem in their environment—regardless of the complexity or location of the root cause.

    This guiding principle has created a methodical focus on clearly understanding the root cause of every issue and case, even those outside of storage, to prevent any customer from experiencing the same issue.

     

    See once, prevent for all

    HPE InfoSight enables a new and better support experience, one that applies data science and intelligent case automation to help minimize the possibility of a known issue ever being experienced in the installed base. Integral to this support experience are the PEAK engineers, a special team with expertise across the infrastructure layers. These engineers are responsible for case assessment, rapid and definitive root cause analysis, defining case automation rules, and overseeing problem resolution before problems can affect customers. Figure 3 outlines the team’s standard operating procedure.

     

    • Data analysis: HPE InfoSight continuously monitors and analyzes sensor telemetry from the global installed base, collecting millions of sensor data points per second from over 10,000 customers.

     

    • Case creation: HPE InfoSight predicts a potential problem or a customer creates a case (Note: Ninety percent of cases are auto-created and 86% of cases are auto-resolved and closed before the customer knows of an issue).

     

    • Root cause analysis: For complex issues, a dedicated PEAK engineer is assigned and works with engineering and HPE InfoSight to quickly diagnose the root cause, including problems outside of storage. A signature is created identifying the parameters, including OS, performance metrics, application and workload profiles, and third-party configurations.

     

    • Problem resolution: The PEAK engineer develops the resolution plan, verifies the completion of fixes, and closes the case.

     

    • Installed-base prevention: HPE InfoSight applies pattern-matching algorithms on the signature to identify, predict, and prevent other systems from experiencing the same problem.

     


Figure 3. Rapid root cause to automated prevention
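The signature and installed-base prevention steps above can be sketched as follows. This is a hypothetical model, not the HPE InfoSight implementation: a signature combines affected OS releases, exact-match configuration attributes (for example, a hypervisor build), and metric levels at which the problem appears, and scanning the installed base means testing every system against every known signature. All class names, field names, and values are illustrative.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Signature:
    """Hypothetical health signature: the conditions that together identify one known problem."""
    signature_id: str
    os_versions: set[str]                                        # affected OS releases
    conditions: dict[str, str] = field(default_factory=dict)     # exact-match configuration
    min_metrics: dict[str, float] = field(default_factory=dict)  # metric -> level where problem appears

    def matches(self, system: dict) -> bool:
        if system.get("os_version") not in self.os_versions:
            return False
        if any(system.get(k) != v for k, v in self.conditions.items()):
            return False
        metrics = system.get("metrics", {})
        # A missing metric never satisfies a threshold.
        return all(metrics.get(m, -math.inf) >= t for m, t in self.min_metrics.items())


def scan_installed_base(signatures, systems):
    """Return (system_id, signature_id) for every system exposed to a known problem."""
    return [(s["system_id"], sig.signature_id)
            for s in systems for sig in signatures if sig.matches(s)]


if __name__ == "__main__":
    sig = Signature("SIG-1", {"os-5.0"}, {"hypervisor": "hv-7"}, {"queue_depth": 64})
    systems = [
        {"system_id": "s1", "os_version": "os-5.0", "hypervisor": "hv-7", "metrics": {"queue_depth": 128}},
        {"system_id": "s2", "os_version": "os-5.1", "hypervisor": "hv-7", "metrics": {"queue_depth": 128}},
        {"system_id": "s3", "os_version": "os-5.0", "hypervisor": "hv-6", "metrics": {"queue_depth": 128}},
        {"system_id": "s4", "os_version": "os-5.0", "hypervisor": "hv-7", "metrics": {"queue_depth": 16}},
    ]
    print(scan_installed_base([sig], systems))  # only s1 meets every condition
```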

Customized upgrade paths

The PEAK engineers can invoke a deny list mechanism that prevents customers from upgrading to specific NimbleOS versions associated with a problem that has been identified in other environments with similar configurations. HPE InfoSight, in turn, creates customized upgrade paths for each customer. This means customers can know with certainty that the upgrades available are safe, as identified problems have been mitigated.
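A deny list and per-customer upgrade paths can be modeled as a filter over the available releases. In this sketch, each deny-list entry names a release and, optionally, the configuration attributes under which it is withheld; a system's upgrade path is every newer release that is not denied for its configuration. The version numbers, attribute names, and classes are hypothetical and do not reflect actual NimbleOS releases or HPE InfoSight internals.

```python
from dataclasses import dataclass, field


@dataclass
class DenyEntry:
    """Hypothetical deny-list entry: a release to withhold from systems matching `conditions`."""
    version: str
    conditions: dict[str, str] = field(default_factory=dict)  # empty means "deny for every system"


def version_key(version: str) -> tuple[int, ...]:
    """Sort dotted numeric versions numerically, so 5.10 ranks above 5.9."""
    return tuple(int(part) for part in version.split("."))


def is_denied(version: str, config: dict, deny_list: list[DenyEntry]) -> bool:
    return any(
        entry.version == version
        and all(config.get(k) == v for k, v in entry.conditions.items())
        for entry in deny_list
    )


def upgrade_path(current: str, available: list[str], config: dict,
                 deny_list: list[DenyEntry]) -> list[str]:
    """Releases newer than `current` that are not denied for this system's configuration."""
    newer = (v for v in available if version_key(v) > version_key(current))
    return sorted((v for v in newer if not is_denied(v, config, deny_list)), key=version_key)


if __name__ == "__main__":
    releases = ["5.0.8", "5.1.4", "5.2.1", "6.0.0"]
    deny = [DenyEntry("5.2.1", {"protocol": "fc"}), DenyEntry("6.0.0")]
    print(upgrade_path("5.0.8", releases, {"protocol": "fc"}, deny))     # ['5.1.4']
    print(upgrade_path("5.0.8", releases, {"protocol": "iscsi"}, deny))  # ['5.1.4', '5.2.1']
```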

 

HPE Nimble Storage’s laser focus on preventing known issues, combined with HPE InfoSight Predictive Analytics, has resulted in a 19.3% year-over-year decrease in customer-involved support cases.¹ This achievement has been made despite the customer base growing 900% over the same period. Net result: downtime events are prevented, and valuable customer time can be spent driving business value rather than on maintenance, troubleshooting, and problem resolution.

Figure 4. 19.3% YoY decrease in customer-involved cases

Infrastructure is an investment. Rather than choosing a depreciating asset, you can choose one that actually improves over time.

 

Businesses are increasing their reliance on software applications and even the smallest amount of downtime can have tremendous consequences. A robust design that incorporates flash technology is a requirement today. However, system design alone cannot overcome the complexity in infrastructure that causes unplanned downtime.

 

HPE Nimble Storage combines robust system design with predictive analytics to deliver the highest measured availability in the storage industry and a transformed support experience. Building predictive analytics into the core architecture from day one allows infrastructure to learn, no matter how long it has been deployed. This is reflected in the following:

 

  • Measured availability greater than six nines (99.999928%) across more than 10,000 customers, maximizing customer uptime.
  • Over 86% of support cases are automatically resolved by HPE InfoSight, saving the time and money otherwise spent diagnosing and troubleshooting.
  • Fifty-four percent of issues that HPE InfoSight resolves are outside of storage, addressing a full spectrum of issues that impact infrastructure uptime.

 

Intuition says that reliability will go down and the likelihood of problems will increase as systems age. However, HPE Nimble Storage has flipped that paradigm with HPE InfoSight Predictive Analytics.

Learn more at

hpe.com/storage/nimblestorage


  • 1 HPE Nimble Storage internally tracks monthly manual cases. 