Basic Troubleshooting

This article contains basic troubleshooting steps for some common issues that may occur while using HPE Ezmeral Container Platform.

For each issue, this article lists the problem/impact, the preparation/best practice, and the recovery procedure.

Remote DataTap HDFS storage failure

Follow all applicable best practices and other guidelines from your storage vendor to mitigate any storage failure.

Obtain any applicable recovery procedures from your storage vendor.

The Caching Node (cnode) service will automatically retry when certain HDFS errors occur before propagating the error to the application/interface.
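The retry-then-propagate behavior described for the cnode service can be illustrated with a minimal sketch. This is not the cnode implementation: the exception class, the function name, the attempt count, and the backoff delays are all hypothetical and chosen only for illustration.

```python
import time


class TransientStorageError(Exception):
    """Hypothetical stand-in for an HDFS error that is worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(); retry transient failures, then propagate the last error.

    The attempt count and delays are illustrative assumptions, not values
    taken from the platform.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientStorageError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Errors that are not classified as transient are raised immediately, which mirrors the idea that only certain HDFS errors are retried.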

Controller host failure

When the Controller host fails:

If platform HA is enabled, then any pending virtual cluster and/or tenant creation will resume when the Standby Controller takes over.
Virtual clusters that are already running will continue to run, and users can continue interacting with them normally, either directly (routable container network) or via the Gateway hosts (non-routable container network).
Running jobs may be interrupted, and the affected users may need to restart the affected jobs.

HPE recommends storing any critical system files (such as custom keytab files and TLS certificates) on a shared file server that is mounted on both the Primary and Standby Controller hosts.

The Arbiter host will detect the primary Controller host failure and begin a failover transition to the Standby Controller host. The deployment will then be running in a degraded state until the Standby Controller becomes the primary Controller.

The interface will be in Lockdown mode during the transition, and no administration tasks will be possible during this period. Users may need to restart any running jobs that were interrupted as a result of the failure/transition.

Standby Controller host failure

If the Standby Controller host fails or crashes, then the primary Controller host will continue operating; however, the platform will be running in a degraded state and will not be protected against any failure of the primary Controller host. The interface displays a warning message when this occurs.

If the Standby Controller host is also being used as a Worker host, then please see Worker host failure, below.

See Controller host failure, above.

HPE Ezmeral Container Platform analyzes the cause of the host failure and attempts to recover the failed host automatically. If recovery is possible, then the failed host will come back up, and HPE Ezmeral Container Platform will resume normal operation.

If the problem cannot be resolved, then the affected host will be left in a degraded state. You will need to manually diagnose and (if possible) repair the problem, and then reboot that host. If rebooting solves the problem, then the failed host will come back up, and HPE Ezmeral Container Platform will resume normal operation with High Availability protection enabled. HPE Ezmeral Container Platform does not currently support designating another Worker host as the new Standby Controller.

Please contact HPE Technical Support for assistance if you are unable to resolve the issue.

Arbiter host failure

If the Arbiter host fails or crashes, then the Controller and Standby Controller hosts will continue operating; however, the platform will be running in a degraded state and will not be protected against any failure of the Controller or Standby Controller host. The interface displays a warning message when this occurs.

If the Arbiter host is also set up as a Worker host, then please see Worker host failure, below, for additional information.

See Standby Controller host failure, above.
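The single-host failure cases described above for the Controller, Standby Controller, and Arbiter hosts can be summarized as a small decision table. The sketch below is illustrative only: the function name and the returned labels are hypothetical, and combinations of failures that this article does not describe are reported as "not covered" rather than guessed.

```python
def ha_outcome(primary_up: bool, standby_up: bool, arbiter_up: bool) -> str:
    """Summarize the platform HA state for the cases this article describes."""
    down = [
        name
        for name, up in (
            ("primary", primary_up),
            ("standby", standby_up),
            ("arbiter", arbiter_up),
        )
        if not up
    ]
    if not down:
        return "protected"  # all three hosts are healthy
    if down == ["primary"]:
        # The Arbiter starts a failover; the interface is in Lockdown mode
        # until the Standby Controller becomes the primary Controller.
        return "failover"
    if down in (["standby"], ["arbiter"]):
        # The platform keeps running but is not protected against a
        # further Controller host failure.
        return "degraded"
    return "not covered"  # multiple failures are outside this article
```

For example, `ha_outcome(True, False, True)` returns `"degraded"`, matching the Standby Controller host failure case.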

Worker host failure

If a Worker host fails or crashes, then any virtual nodes (Docker containers) running on that host will be down. There is no impact to the deployment itself.

If cluster HA is enabled, then HPE Ezmeral Container Platform normally ensures that the standby virtual nodes are not placed on the same hosts as the master virtual node.

The impact on any running applications will depend on the specific applications. A Hadoop cluster with built-in HA should not be affected. Data that is not stored in remote storage will be lost.

End users should create virtual Hadoop clusters with the YARN and HDFS High Availability option enabled, as described in Creating a New Cluster. This ensures that the Hadoop cluster does not have a single point of failure.

The use of node tags and affinity (service-to-node-role) allows you to place virtual nodes/containers in physically diverse locations (such as different server racks) to reduce any single points of failure.
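The idea of spreading virtual nodes across physically diverse locations can be sketched as a simple round-robin over racks. This is not how the platform schedules containers; the function name, the rack and host names, and the placement policy are assumptions made for illustration.

```python
def spread_across_racks(node_names, hosts_by_rack):
    """Assign each virtual node to a (rack, host) pair, rotating through racks.

    hosts_by_rack maps a rack label to a list of host names. Consecutive
    nodes land in different racks whenever more than one rack has hosts.
    """
    racks = sorted(rack for rack, hosts in hosts_by_rack.items() if hosts)
    if not racks:
        raise ValueError("no hosts available for placement")

    next_host = {rack: 0 for rack in racks}  # per-rack rotation counter
    placement = {}
    for index, node in enumerate(node_names):
        rack = racks[index % len(racks)]
        hosts = hosts_by_rack[rack]
        placement[node] = (rack, hosts[next_host[rack] % len(hosts)])
        next_host[rack] += 1
    return placement
```

With two racks, a master node and its standby are listed first and therefore end up in different racks, so losing one rack or one Worker host does not take down both.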

HPE recommends storing application data in a dedicated, shared HDFS storage resource to help streamline the backup and recovery process.

IT can diagnose and repair the Worker host, and then reboot the machine and re-install the HPE Ezmeral Container Platform bundle as a Worker host. Alternatively, they can add a new Worker host to the deployment. See EPIC Worker Installation Overview and Gateway Installation Tab.

Virtual nodes residing on the affected Worker host are subject to the specific High Availability options that are available to the Big Data application residing in that virtual cluster, if any. Applications are recovered via a backup/restore process. Users should therefore store critical data on a storage server output directory to facilitate re-importing data/code to the newly spun-up container.

Gateway host failure

The Gateway host may fail or crash while one or more users are connected to virtual clusters.

HPE highly recommends setting up multiple Gateway hosts to provide both High Availability and load balancing.

IT can either diagnose and repair the failed Gateway host, or provision a new Gateway host.

If the deployment has two or more Gateway hosts, then sessions connected through the failed host will be moved to the available hosts. In-flight TCP connections might need to be reset as they are moved to the backup host.

The deployment includes a load-balancer in front of the Gateway hosts, and users should therefore experience no performance impacts.
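Because in-flight TCP connections may be reset when a Gateway host fails, a client that talks to Gateway hosts directly can simply reconnect and try the next host. The sketch below is a client-side illustration under that assumption; the function name and the host names and port in the usage note are hypothetical, and a deployment that is reached only through the load balancer would reconnect to the load balancer address instead.

```python
import socket


def connect_with_failover(gateways, timeout=5.0, connect=socket.create_connection):
    """Return a connection to the first reachable (host, port) pair.

    The connect parameter exists so the logic can be exercised without a
    network; it defaults to socket.create_connection.
    """
    last_error = None
    for host, port in gateways:
        try:
            return connect((host, port), timeout)
        except OSError as error:  # refused, timed out, unreachable, ...
            last_error = error
    raise ConnectionError(f"no Gateway host reachable: {last_error}")
```

A caller would pass a list such as `[("gw1.example.com", 10001), ("gw2.example.com", 10001)]` and call the function again whenever an established connection is reset.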

Virtual Node (Container) failure

If one of the virtual nodes/containers in a cluster crashes or fails, then the impact will be the same as a physical node failure in a bare-metal cluster. The most common failure/crash causes are out-of-memory or out-of-disk space errors.

The impact on any running applications will depend on the specific applications. A Hadoop cluster with built-in HA should not be affected. Data that is not stored in remote storage will be lost.

End users should create virtual Hadoop clusters with the YARN and HDFS High Availability option enabled, as described in Creating a New Cluster. This ensures that the Hadoop cluster does not have a single point of failure.

Users may use Nagios to create automated alerts if a container failure is detected.
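Since out-of-disk-space errors are among the most common causes of container failure, one possible automated alert is a Nagios-style disk check. The sketch below follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), and 2 (CRITICAL); the thresholds, function names, and message format are assumptions, not values from the platform.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_usage(used_fraction, warn=0.80, crit=0.90):
    """Map a used-space fraction (0.0 to 1.0) to a Nagios status code."""
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=0.80, crit=0.90):
    """Return (status_code, message) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    status = classify_usage(used_fraction, warn, crit)
    return status, f"DISK {LABELS[status]} - {used_fraction:.0%} used on {path}"
```

A plugin wrapper would print the message and pass the status code to `sys.exit()` so that Nagios can raise the alert.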

The HPE Ezmeral Container Platform interface displays container failures in the Services Status tab of the Cluster Details screen (see Services Status Tab). A container marked in red with all services grayed out is probably dead. Authorized users can log in to the container directly to attempt diagnosis and repair. They can also reboot the container to bring it back online.

Utilize HDFS backup and DR best practices (such as DistCp) to maintain up-to-date application data and results in the DR environment.
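As an illustration of a DistCp-based backup, the sketch below builds and runs a `hadoop distcp -update` command, which copies only files that are missing or different on the target. The function names and the example URIs in the usage note are hypothetical; the optional `-delete` flag removes target files that no longer exist in the source and should be used with care.

```python
import subprocess


def build_distcp_command(source_uri, target_uri, delete_missing=False):
    """Build the argument list for an incremental DistCp copy."""
    command = ["hadoop", "distcp", "-update"]
    if delete_missing:
        # -delete is only valid together with -update or -overwrite.
        command.append("-delete")
    command += [source_uri, target_uri]
    return command


def run_distcp(source_uri, target_uri, delete_missing=False):
    """Run DistCp and raise CalledProcessError if the copy fails."""
    command = build_distcp_command(source_uri, target_uri, delete_missing)
    subprocess.run(command, check=True)
```

For example, `run_distcp("hdfs://prod-nn:8020/data/results", "hdfs://dr-nn:8020/data/results")` could be scheduled to keep a DR copy current, assuming the `hadoop` client is installed and both clusters are reachable.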

If recovery is not possible, then users can re-create the cluster and re-import the data/code to resume operation. As with the loss of a Worker host, this process depends on the type of Big Data application.

Expired TLS certificates or Keytab files

The underlying KDC keytab files that the virtual cluster uses to access HDFS may have expired. Hadoop services will not be able to run because they cannot access the underlying HDFS.

Ensure that the expiration date of all applicable TLS certificates and/or keytab files is sufficient for the lifespan of the cluster.

Use ActionScripts to automatically replace keytab files when needed. See About ActionScripts.
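One way to keep track of certificate lifetimes is to compute the number of days left before the certificate's "not after" date. The sketch below uses the standard-library `ssl.cert_time_to_seconds` function, which parses the date format that `ssl.SSLSocket.getpeercert()` returns in its `notAfter` field; the function name and the dates in the usage note are hypothetical, and keytab expiration is not covered here.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return whole days until a certificate expires (negative if expired).

    not_after uses the certificate time format, for example
    "Jan 15 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days
```

A periodic job could compare the result against the planned lifespan of the cluster and warn well before the value approaches zero.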

HPE Ezmeral Container Platform can experience adverse operational impact caused by changes in system files or system settings.

Changes to various system settings may cause unpredictable behavior in HPE Ezmeral Container Platform. Some examples include:

/etc/sysconfig/iptables
/etc/sysconfig/network
SELinux context
/etc/rsyslog.d/bds
umask settings
ipforward settings
RPM package deletion/changes:
- Do not manually install Network Manager
- RHEL subscription becomes inactive
Do not delete or alter any service user accounts.

Coordinate with your system administration teams to ensure that Chef/Puppet or other configuration management systems do not modify these settings/files on the HPE Ezmeral Container Platform hosts.
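To notice when a configuration management system has modified one of these files, a simple approach is to record a checksum baseline and compare against it later. The sketch below is a generic illustration and is not part of HPE Ezmeral Container Platform; the function names are hypothetical, and it covers file contents only, not SELinux context, umask, or kernel settings.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return {path: SHA-256 hex digest}, or None for paths that are not files."""
    result = {}
    for path in paths:
        file_path = Path(path)
        if file_path.is_file():
            result[str(path)] = hashlib.sha256(file_path.read_bytes()).hexdigest()
        else:
            result[str(path)] = None  # missing, or a directory
    return result


def changed_files(baseline, current):
    """List baseline paths whose fingerprint differs from the current one."""
    return sorted(
        path for path in baseline if baseline[path] != current.get(path)
    )
```

A baseline could be taken with `fingerprint(["/etc/sysconfig/iptables", "/etc/sysconfig/network"])` after installation and compared against a fresh fingerprint after each patch or configuration run.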

Either on a regular basis (e.g., weekly) or whenever there is a significant configuration change in the environment (e.g., an OS patch update or a network configuration change), perform the HPE Ezmeral Container Platform configuration checks described in Config Checks Tab, and pay attention to any problems or warnings reported by these checks.

Contact HPE Technical Support for assistance.