Issues and Workarounds

This topic describes issues and workarounds in HPE Ezmeral Container Platform version 5.3.x.

Issues are categorized as follows:

Issues Identified in Version 5.3.6

See issues for older releases.

Issues Identified in Version 5.3.5

The following issues were identified in HPE Ezmeral Container Platform 5.3.5. Unless otherwise noted, these issues also apply to later 5.3.x releases.

General Platform Issues 5.3.5

The following issues were identified in HPE Ezmeral Container Platform 5.3.5. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZCP-1408 and EZCP-1711: The Kubernetes and EPIC usage detail scripts shipped with HPE Ezmeral Container Platform 5.3.5 and 5.3.6 contain errors
Symptom: The Kubernetes and EPIC usage detail scripts that were shipped with HPE Ezmeral Container Platform 5.3.5 and 5.3.6 contain errors related to time zone handling and HTTPS connections. Field parse errors are returned when the time zone is a positive offset from UTC. Connection errors are returned when the script attempts to use HTTP instead of HTTPS. Both the EPIC and Kubernetes usage detail scripts are affected:
  • K8Scsv.py
  • K8Susage.py
  • bdcsv.py
  • bdusage.py
Workaround:
  1. Download and install the corrected usage detail scripts (link opens an external website in a new browser tab or window).
  2. Copy the downloaded tar file to the Controller (and to the Shadow and Arbiter controller if this is an HA configuration) and extract it to the following directory:
    /opt/bluedata/common-install/scripts/monitoring/
EZCP-859: When upgrading HPE Ezmeral Container Platform 5.3.1 to 5.3.5 or 5.3.6, upgrading NVIDIA plugin add-on from version 1.0.0-beta-5 to 0.9.0-1 requires multiple steps.

Symptom: When upgrading HPE Ezmeral Container Platform 5.3.1 to 5.3.5 or 5.3.6, upgrading directly from the Beta 5 version of the NVIDIA plugin add-on to the required plugin version is not supported.

Cause: After upgrading HPE Ezmeral Container Platform from 5.3.1 to 5.3.5 or 5.3.6, upgrading directly from the NVIDIA plugin add-on version 1.0.0-beta-5 to version 0.9.0-1, which is the required add-on version for HPE Ezmeral Container Platform 5.3.5 or 5.3.6, is not supported. You must upgrade NVIDIA plugin add-on from 1.0.0-beta-5 to 1.0.0-beta-6, and then to 0.9.0-1.

Workaround: In the following procedure, you upgrade the NVIDIA plugin add-on from version 1.0.0-beta-5 to version 1.0.0-beta-6, and then from version 1.0.0-beta-6 to version 0.9.0-1.

  1. (For 5.3.1 to 5.3.5 only) If you are upgrading HPE Ezmeral Container Platform 5.3.1 to 5.3.5 and are using an air-gap environment:
    1. Download the supplemental air-gap images:
    2. Import the images to your air-gap registry. For more information about this step, see Configuring Air Gap Kubernetes Host Settings.

  2. If you have GPU hosts (either A100 GPU or non-A100 GPU) installed, remove those hosts from the Kubernetes cluster. See Expanding or Shrinking a Kubernetes Cluster.
  3. Remove those GPU hosts from HPE Ezmeral Container Platform. See Decommissioning/Deleting a Kubernetes Host.
  4. Upgrade HPE Ezmeral Container Platform 5.3.1 to 5.3.5 or 5.3.6. See Upgrading to HPE Ezmeral Container Platform 5.3.
  5. Ensure that the NVIDIA driver version on the GPU host is 470.57.02 or later. See GPU Driver Installation.
  6. Add the GPU hosts back to HPE Ezmeral Container Platform as Kubernetes hosts.
  7. After installing the Kubernetes hosts, SSH to each GPU host and run $ rpm -qa | grep nvidia.

    The expected output is:
    nvidia-container-runtime-3.5.0-1.x86_64
    libnvidia-container1-1.4.0-1.x86_64
    nvidia-container-toolkit-1.5.1-2.x86_64
    libnvidia-container-tools-1.4.0-1.x86_64
  8. SSH to the Controller, and execute the following command:
    echo '{"user":"<site-admin-username>","password":"<site-admin-password>"}' > /tmp/cred.json
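The credentials file must be valid JSON with "user" and "password" keys, or the add-on upgrade script cannot read it. A minimal Python sketch that writes it safely (the placeholder values are stand-ins for the real site admin account):

```python
import json

def write_cred_file(path, user, password):
    """Write the credentials file for the add-on upgrade script as valid JSON."""
    with open(path, "w") as f:
        json.dump({"user": user, "password": password}, f)

# placeholder credentials; substitute the real site admin account
write_cred_file("/tmp/cred.json", "<site-admin-username>", "<site-admin-password>")
```

Using json.dump avoids quoting mistakes that hand-typed echo commands are prone to.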
  9. Execute the add-on upgrade script as a dry run by specifying the -t and --required-only parameters.
    For example:
    cd /opt/bluedata/common-install/scripts/k8s-addons
    python k8s_addon_upgrade.py -c <CONTROLLER_IP> -f /tmp/cred.json -k <CLUSTER_NAME> --required-only -t

    Result:

    The output of the command lists all the manifest add-ons (both required and optional) with their versions, and all the add-on versions that are currently deployed on the cluster. In that output, you can see the following information about the NVIDIA plugin add-on:

    
    Cluster my-k8s-cluster required add-ons info:
    ...
    nvidia-plugin:
          deployed version: 1.0.0-beta-5, tools: 0.4
          manifest version: 0.9.0-0, tools: 0.4
    ...
    
  10. Change directories to the directory that contains the Kubernetes manifest:
    cd /opt/bluedata/common-install/manifest
  11. Make a backup copy of the k8s_manifest.json manifest file.
  12. Edit the Kubernetes manifest file to change the version of the NVIDIA plugin to 1.0.0-beta-6:
        "nvidia-plugin" : {
          "_version" : 2,
          "required" : true,
          "version" : "1.0.0-beta-6",
          "system" : true,
          "order" : 40,
          "tools_version" : "0.4",
          "deployment" : "hpecp-bootstrap-nvidia-plugin",
          "label" : {
            "name" : "hpecp-bootstrap-nvidia-plugin",
            "description" : ""
          }
        }
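The backup-and-edit steps can also be scripted. A sketch, assuming the add-on entry is addressable as a top-level key in the JSON manifest (the real k8s_manifest.json layout may nest it differently, so adjust the lookup path):

```python
import json
import shutil

def set_addon_version(manifest_path, addon, new_version):
    """Back up the manifest file, then rewrite the add-on's version field."""
    shutil.copy2(manifest_path, manifest_path + ".bak")  # backup copy
    with open(manifest_path) as f:
        manifest = json.load(f)
    manifest[addon]["version"] = new_version             # e.g. "1.0.0-beta-6"
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
```

Round-tripping through json.load/json.dump keeps the file syntactically valid, which a hand edit does not guarantee.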
  13. Refresh the manifest by doing one of the following:
    • Execute the following command:
      /opt/bluedata/common-install/bd_mgmt/bin/bd_mgmt k8s manifest_refresh
    • Click Kubernetes Manifest Refresh in the HPE Ezmeral Container Platform web interface.
  14. Change directories back to the scripts/k8s-addons directory and execute the add-on upgrade script as a dry run by specifying the -t and --required-only parameters.

    For example:

    cd /opt/bluedata/common-install/scripts/k8s-addons
    python k8s_addon_upgrade.py -c <CONTROLLER_IP> -f /tmp/cred.json -k <CLUSTER_NAME> --required-only -t

    Result:

    The resulting output shows that the manifest version of the NVIDIA plugin add-on is 1.0.0-beta-6.

    
    Cluster my-k8s-cluster required add-ons info:
    ...
    nvidia-plugin:
          deployed version: 1.0.0-beta-5, tools: 0.4
          manifest version: 1.0.0-beta-6, tools: 0.4
    ...
    
  15. On each Kubernetes cluster, upgrade the NVIDIA plugin add-on.

    For example:

    cd /opt/bluedata/common-install/scripts/k8s-addons
    python k8s_addon_upgrade.py -c <CONTROLLER_IP> -f /tmp/cred.json -k <CLUSTER_NAME> -a nvidia-plugin
  16. Repeat steps 5 through 15, this time specifying "version" : "0.9.0-1" in the edited Kubernetes manifest file.
  17. After the upgrade, the output of a script dry run command shows the following for the NVIDIA plugin add-on:

    
    Cluster my-k8s-cluster required add-ons info:
    ...
    nvidia-plugin:
          deployed version: 0.9.0-1, tools: 0.4
          manifest version: 0.9.0-1, tools: 0.4
    ...
    
  18. Install the GPU Kubernetes hosts as workers. See Kubernetes Worker Installation Overview, beginning with Step 7.
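The dry-run output shown in the procedure above can be checked mechanically rather than by eye. A hypothetical parser sketch, assuming the "deployed version: ... / manifest version: ..." format shown in the example output:

```python
import re

def addon_versions(output, addon):
    """Return (deployed, manifest) version strings for one add-on
    from the add-on upgrade script's dry-run output."""
    block = output.split(addon + ":", 1)[1]
    deployed = re.search(r"deployed version:\s*([^,]+),", block).group(1)
    manifest = re.search(r"manifest version:\s*([^,]+),", block).group(1)
    return deployed, manifest

sample = """nvidia-plugin:
      deployed version: 1.0.0-beta-5, tools: 0.4
      manifest version: 0.9.0-0, tools: 0.4
"""
assert addon_versions(sample, "nvidia-plugin") == ("1.0.0-beta-5", "0.9.0-0")
```

After the final upgrade step, the two returned versions should be equal.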

Kubernetes Issues in 5.3.5

The following issues were identified in HPE Ezmeral Container Platform 5.3.5. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZCP-1661: When you attempt to create a Kubernetes cluster in an air-gap configuration, the NVIDIA gpu-feature-discovery pod fails to come up.

Symptom: When you attempt to create a Kubernetes cluster in an air-gap configuration, the NVIDIA gpu-feature-discovery pod fails to come up, and the screen displays a Failed to pull image error with ImagePullBackOff status.

This issue was fixed in HPE Ezmeral Container Platform 5.3.6.

Cause: In an air-gap installation, the NVIDIA gpu-feature-discovery daemonset does not use the air-gap registry URL.

Workaround: If you are using an air-gap configuration, perform the following steps to update the manifest file:
  1. Download the supplemental image file:
  2. Import the images to your air-gap registry. See either Existing Container Registry or New Container Registry, as appropriate.
  3. Download the updated Kubernetes manifest file:
  4. Update and refresh the manifest file using the steps in Updating the Kubernetes Manifest.
See also EZKDF-404 in HPE Ezmeral Data Fabric on Kubernetes Issues (5.3.1).

EZKDF-404, "Clusters that Implement HPE Ezmeral Data Fabric on Kubernetes fail to start after Kubernetes version or HPE Ezmeral Container Platform version upgrade," also applies to upgrading Kubernetes versions in HPE Ezmeral Container Platform 5.3.5 deployments that implement HPE Ezmeral Data Fabric on Kubernetes.

EZCP-1590: When you attempt to upgrade a Kubernetes cluster from Kubernetes 1.18.x to any higher version of Kubernetes, one or more of the workers fails to upgrade.

Symptom: When you attempt to upgrade a Kubernetes cluster that implements either HPE Ezmeral Data Fabric on Kubernetes or Embedded Data Fabric from Kubernetes 1.18.x to a higher version, pods fail to come up, and the Kubernetes Cluster screen displays the message: one or more workers failed to upgrade.

Cause: The 1.0.x and 1.1.x versions of the HPE CSI driver are not supported on Kubernetes 1.19.x or later.

Workaround: Upgrade the HPE CSI driver to version 1.2.5-1.0.5. For instructions, see Upgrading the CSI Plug-In.

EZCP-1608: When the Istio add-on is deployed on a Kubernetes cluster, one or more workers fail to upgrade the Kubernetes version.
Symptom: When you deploy the Istio add-on on a Kubernetes cluster, one or more workers fail to upgrade the Kubernetes version. The Kubernetes version upgrade fails with the following errors:
  • Warning: one or more workers failed to upgrade on the Kubernetes Cluster screen.
  • Upgrade error: Failed to drain node on the individual Kubernetes Host Status screen.

This issue also occurs when the application user deploys PodDisruptionBudget (PDB) objects to the application workloads.

Cause: Some PodDisruptionBudget (PDB) objects for Istio resources have a minimum replica count of 1. This prevents kubectl drain from succeeding during the Kubernetes upgrade.

Workaround: Before initiating the Kubernetes upgrade from the Kubernetes Cluster screen, execute the following commands on the Kubernetes Master:
kubectl -n istio-system delete poddisruptionbudget/istiod
kubectl -n istio-system delete poddisruptionbudget/istio-ingressgateway
kubectl -n istio-system delete poddisruptionbudget/istio-egressgateway
Note: You can also apply this workaround whenever a Kubernetes upgrade fails with the Failed to drain node error on the Kubernetes hosts or workers. Execute the preceding kubectl commands on the Kubernetes Master, and then continue the Kubernetes upgrade on the remaining workers by using the Retry Kubernetes Upgrade on Failed Workers action on the cluster from the Kubernetes Cluster screen.
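The underlying cause is mechanical: a PDB permits an eviction only while the number of healthy pods exceeds its minAvailable, so a single-replica deployment with minAvailable: 1 never allows one and kubectl drain waits forever. A simplified sketch of that arithmetic (real PDBs also support maxUnavailable and percentage values):

```python
def disruptions_allowed(healthy_pods, min_available):
    """Evictions a PDB permits right now: healthy pods minus minAvailable,
    floored at zero. Zero means kubectl drain blocks on that pod."""
    return max(0, healthy_pods - min_available)

# istiod deployed with 1 replica and minAvailable: 1 never permits an eviction
assert disruptions_allowed(1, 1) == 0
# with 2 replicas, one pod at a time may be evicted
assert disruptions_allowed(2, 1) == 1
```

Deleting the PDB (as in the workaround) removes the constraint entirely, which is why the drain then succeeds.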
EZCP-561: When Istio mTLS is enabled in STRICT mode, the Kiali Dashboard and KubeDirector service endpoints are not accessible through NodePort

Symptom: When Istio is configured to use Mutual Transport Layer Security (mTLS) in STRICT mode, the following issues occur:

  • None of the KubeDirector service endpoints are accessible through the NodePort service.
  • If mTLS in STRICT mode is enabled in a tenant, the Kiali Dashboard is not accessible through NodePort. Clicking on the endpoint results in an error.

Workaround: If possible, configure Istio to use PERMISSIVE mode (the default mode).

Issues Identified in Version 5.3.1

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

General Platform Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZCP-1675: If you delete a host from HPE Ezmeral Container Platform and install the host again, then the host addition fails with a precheck error.

Symptom: If you delete a host from HPE Ezmeral Container Platform, and then install the host again, host addition fails with the error Checking AppArmor configuration: FAILED.

Cause: During SLES deployments, the Zypper package manager enables the AppArmor service. If you delete a host from HPE Ezmeral Container Platform without disabling AppArmor and then install the host again, the host addition fails with the precheck error Checking AppArmor configuration: FAILED.

Workaround: Before adding the host to HPE Ezmeral Container Platform, disable and delete the AppArmor service using the following commands:
sudo systemctl stop apparmor
sudo systemctl disable apparmor
sudo rm -f /usr/lib/systemd/system/apparmor.service
EZESC-253: After upgrade, UI becomes inaccessible and browser displays internal error 500

Symptom: Following an upgrade to HPE Ezmeral Container Platform 5.3, the UI for the controller is inaccessible, failing with internal error 500. The system fails with the error:

No space left on device

The /var/lib/monitoring/logs directory contains large hpecp-monitoring_access and hpecp-monitoring_audit logs.

Cause:

Dangling Search Guard indexes exist after the upgrade. You might see log entries similar to the following:

[WARN ][o.e.g.DanglingIndicesState] [xxxxx] [[searchguard/xxx-xxxxxxxxx-xxxxxxx]] can not be imported as a dangling index, as index with same name already exists in cluster metadata

Workaround: Search Guard indexes are not used by HPE Ezmeral Container Platform 5.3. You can remove the Search Guard indexes, delete the large log files, and resume monitoring on the HA nodes.

  1. Remove the Search Guard indexes using one of the following methods:

    • If Elasticsearch is running, you can delete the Search Guard index through the Elasticsearch REST API.

      For example:

      curl --insecure  -u $(bdconfig --getvalue bdshared_elasticsearch_admin):$(bdconfig --getvalue bdshared_elasticsearch_adminpass) --silent -X DELETE https://localhost:9210/searchguard
    • If Elasticsearch is not able to run, you must identify and delete SearchGuard indexes manually:

      1. Identify the indexes.

        Change the directory to /var/lib/monitoring/elasticsearch/nodes/0, then enter the following command:

        find . -name "state-*.st" -print | xargs grep searchguard

        All the indices that are from Search Guard are displayed. You can use matching entries to determine which indexes to remove.

        For example, the following line identifies a state file that contains the word searchguard. The index name is part of the full file path of that file. In this example, the index name is xtSTTUb7RgOeUlCXWH8dAg:

        ./indices/xtSTTUb7RgOeUlCXWH8dAg/_state/state-45.st matches
      2. Use the rm command to remove the index.

        For example:

        rm -rf ./indices/xtSTTUb7RgOeUlCXWH8dAg
  2. Delete the large log files.
  3. On the HA cluster nodes only, restart monitoring. For example, from the controller, enter the following command:
    HPECP_ONLY_RESTART_ES=1 /opt/bluedata/bundles/hpe-cp-*/startscript.sh --action enable_monitoring
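The manual identification in step 1 can be sketched in Python: walk the index state files, search for the marker string, and report the index directory names (the path layout is taken from the example above; verify it against your deployment before deleting anything):

```python
import os

def find_marked_indexes(nodes_dir, marker=b"searchguard"):
    """Return index directory names whose state-*.st files contain the marker.
    nodes_dir is e.g. /var/lib/monitoring/elasticsearch/nodes/0."""
    indices_dir = os.path.join(nodes_dir, "indices")
    found = set()
    for root, _dirs, files in os.walk(indices_dir):
        for name in files:
            if name.startswith("state-") and name.endswith(".st"):
                with open(os.path.join(root, name), "rb") as f:
                    if marker in f.read():
                        # the index name is the directory directly under indices/
                        rel = os.path.relpath(root, indices_dir)
                        found.add(rel.split(os.sep)[0])
    return sorted(found)
```

Each returned name corresponds to an indices/<name> directory that the rm command in step 1 would remove.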

Kubernetes Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZESC-232: "Failed to pull image" ImagePullBackoff Errors received on Kubernetes clusters

When working with Kubernetes clusters in HPE Ezmeral Container Platform, you receive errors similar to the following:

Failed to pull image "bluedata/hpe-agent:1.1.5": rpc error: code = Unknown desc = Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

Cause: Kubernetes clusters running on any version of HPE Ezmeral Container Platform can occasionally encounter problems caused by the pull rate limit that Docker Hub applies to all free and anonymous accounts. These limits can cause cluster creation and application deployment to fail. If Kubernetes pods in a non-air-gap environment fail to reach the Ready state and show ImagePullBackoff or related errors, this is the most likely cause.

Workaround: Do one of the following:

  • Wait until the current rate limiting timeout has expired, then retry.
  • Create a local image registry, then configure the air-gap settings to use that registry. For more information about air gap, see Kubernetes Air-Gap Requirements.
    Note:

    Hewlett Packard Enterprise strongly recommends performing air-gap configuration steps before adding Kubernetes hosts to the HPE Ezmeral Container Platform environment. Kubernetes hosts do not implement air-gap changes until the hosts are rebooted or the Kubernetes version is upgraded.

  • Upgrade your Docker Hub account as described in https://www.docker.com/increase-rate-limits (link opens an external website in a new browser tab/window), then on all hosts, do the following:

    1. Execute a docker login operation with the credentials of the upgraded account.

      Docker will create or update its config.json file after a successful login (or you might want to use an existing configuration file).

    2. Ensure that kubelet uses the new config.json file by placing it in one of the known search locations kubelet uses for credential files:

      1. Create a .docker directory directly under the root of the filesystem and place the config.json file in that directory. For example: /.docker/config.json

      2. Restart kubelet:

        systemctl restart kubelet
      3. Verify that kubelet has restarted:
        systemctl status kubelet

      Kubelet then uses that config.json file and the paid account that generated it, ensuring that no image pull rate limit is exceeded.

    The following article (link opens an external website in a new browser tab/window) shows all the locations that kubelet searches for Docker credentials files:

    https://kubernetes.io/docs/concepts/containers/images/#configuring-nodes-to-authenticate-to-a-private-registry

  • Create a Docker proxy cache as described in the following article (link opens an external website in a new browser tab/window):

    https://docs.docker.com/registry/recipes/mirror/

EZESC-245: After upgrade to HPE Ezmeral Container Platform 5.3, pods such as those for Web Terminal or Jupyter Notebook fail.

Symptom: After upgrade to HPE Ezmeral Container Platform 5.3, pods such as those for Web Terminal or Jupyter Notebook fail with errors similar to the following:

error while creating mount source path '/var/lib/kubelet/pods/ ...

Cause: In release 5.2, the CSI yaml file defines /var/lib/kubelet. In release 5.3.1, /var/lib/kubelet is a symbolic link to /var/lib/docker/kubelet. However, because CSI performs mount operations that do not support the use of this symbolic link, the release 5.3.1 CSI yaml file defines /var/lib/docker/kubelet. Kubernetes interprets this difference as a sandbox change, and pods fail to start.

Workaround:

  • After the upgrade, if any pods are in a waiting state with Reason: CrashLoopBackOff, you can use the following command to delete those pods:

    kubectl -n <namespace> delete pod <pod-name>

    The pod will be restarted with no loss of persistent data.

  • For each Kubernetes host that is added to the cluster after the upgrade, do the following:

    1. Log in to the host.
    2. Unlink /var/lib/kubelet.
    3. Move /var/lib/docker/kubelet to /var/lib/kubelet.

      As a result of this step, you might see some Device busy errors. You can ignore those errors because you will be rebooting the host in a later step.

    4. In the /var/lib/kubelet/kubeadm-flags.env file, remove --root-dir=/var/lib/docker/kubelet.
    5. In the file /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf, change EnvironmentFile=-/var/lib/docker/kubelet/kubeadm-flags.env to EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
    6. Reboot the host.
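Steps 4 and 5 above are plain text edits, so they can be expressed as simple string transformations. A sketch (the file paths and flag values come from the procedure; this is an illustration, not a supported tool):

```python
ROOT_DIR_FLAG = "--root-dir=/var/lib/docker/kubelet"

def strip_root_dir_flag(text):
    """Step 4: remove the --root-dir flag from kubeadm-flags.env content."""
    return text.replace(" " + ROOT_DIR_FLAG, "").replace(ROOT_DIR_FLAG, "")

def fix_env_file_path(text):
    """Step 5: point EnvironmentFile in 10-kubeadm.conf back at /var/lib/kubelet."""
    return text.replace(
        "EnvironmentFile=-/var/lib/docker/kubelet/kubeadm-flags.env",
        "EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env")
```

Read each file, pass its content through the matching function, and write it back before rebooting the host.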
EZESC-240: Kubernetes pods do not come up after host reboot following yum update

Symptom: Following the execution of a yum update command on the host, Kubernetes pods do not come up after host reboot.

Cause: By default, the yum update command attempts to update packages from all enabled repositories, including the repository that manages the kubeadm, kubelet, and kubectl packages. However, updating that repository using yum is not the correct upgrade procedure. It can install a package version that is not compatible with the current HPE Ezmeral Container Platform deployment, causing failures in which applications do not run correctly. In addition, the installed package version no longer matches the package version listed in the HPE Ezmeral Container Platform UI.

Workaround:
  1. On each host, confirm that an incorrect Kubernetes package has been installed by running the following command:
    yum list installed | grep 'kubeadm\|kubelet\|kubectl'

    You run the yum list command instead of a kubectl command because the incorrect version numbers do not appear on a host that has not been rebooted since the incorrect package was installed.

  2. Identify which package name and version to reinstall by using the yum history and yum history info commands.

    The versions listed in the output following the word Updated match the (correct) Kubernetes version that the cluster should be running. The output is similar to the following:

    # yum history info 7
    ...
    Packages Altered:
        Updated kubeadm-1.17.9-0.x86_64 @kubernetes
        Update          1.21.0-0.x86_64 @kubernetes
        Updated kubectl-1.17.9-0.x86_64 @kubernetes
        Update          1.21.0-0.x86_64 @kubernetes
        Updated kubelet-1.17.9-0.x86_64 @kubernetes
        Update          1.21.0-0.x86_64 @kubernetes
    
  3. For each updated Kubernetes package, run the yum downgrade command, specifying the package name listed on the Updated line. For example:

    yum downgrade kubeadm-1.17.9-0.x86_64 -y
  4. After you downgrade the Kubernetes packages on all the affected hosts, reboot the hosts that were running the incorrect package versions.

    1. Run the kubectl get nodes command and examine the output.
    2. Reboot each host that is running the wrong Kubernetes version.
  5. After rebooting the affected hosts, allow several minutes for the correct versions to be reflected in the output of kubectl get nodes command.

    Then run the kubectl get po -A command to verify that all pods are in the expected state.

  6. On each host, prevent the yum update command from updating the Kubernetes repo.

    1. Open the following file in an editor:

      /etc/yum.repos.d/bd-kubernetes.repo
    2. The parameter enabled=1 indicates updates are enabled. Change the parameter to enabled=0.
    3. When you use yum to update other packages, run the yum update command without the -y option so that you can individually deny any Kubernetes package updates that show as available.

    4. Before you update a host through the HPE Ezmeral Container Platform UI, edit the /etc/yum.repos.d/bd-kubernetes.repo file to enable updates by setting enabled=1. After you update the host, edit the file again to disable updates of the Kubernetes repo.
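The "Packages Altered" output from step 2 can be parsed to build the downgrade list for step 3 automatically. A sketch, assuming the yum history info format shown above (Updated lines carry the package version that should be restored):

```python
import re

def downgrade_targets(history_info_text):
    """Return package NVRAs from 'Updated' lines of yum history info output,
    i.e. the versions to pass to yum downgrade."""
    return re.findall(r"^\s*Updated\s+(\S+)", history_info_text, re.MULTILINE)

sample = """Packages Altered:
    Updated kubeadm-1.17.9-0.x86_64 @kubernetes
    Update          1.21.0-0.x86_64 @kubernetes
    Updated kubelet-1.17.9-0.x86_64 @kubernetes
"""
assert downgrade_targets(sample) == ["kubeadm-1.17.9-0.x86_64",
                                     "kubelet-1.17.9-0.x86_64"]
```

Each returned name is an argument for a yum downgrade command, for example yum downgrade kubeadm-1.17.9-0.x86_64 -y.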

EZESC-244: Kubernetes host upgrade fails with error "repomd.xml signature could not be verified"

Symptom: The HPE Ezmeral Container Platform UI reports that a Kubernetes host upgrade failed. The upgrade log for that host server contains an error similar to the following: repomd.xml signature could not be verified for kubernetes

Cause: This is a known Kubernetes issue. For an example, see https://github.com/kubernetes/kubernetes/issues/100757

Workaround:
  1. Edit the file /etc/yum.repos.d/bd-kubernetes.repo to set the following: repo_gpgcheck=0
  2. Retry the upgrade from the web UI.
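The .repo file uses an INI-style layout, so the edit in step 1 can be scripted with the standard library. A sketch that sets an option in every section of the file (section names vary, so the function does not assume one):

```python
import configparser

def set_repo_option(path, option, value):
    """Set an option (e.g. repo_gpgcheck=0) in every section of a yum .repo file."""
    cp = configparser.ConfigParser()
    cp.read(path)
    for section in cp.sections():
        cp.set(section, option, value)
    with open(path, "w") as f:
        cp.write(f)
```

Note that configparser rewrites the file with normalized "option = value" spacing; yum accepts either form.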
EZCP-811: Webterms do not work for imported clusters. You will encounter an error if you try to start a webterm on an imported cluster.
Workaround: Execute the following command using either the Kubeconfig file used to import the cluster or a Kubeconfig file for the imported cluster downloaded from the HPE Ezmeral Container Platform as described in Downloading Admin Kubeconfig:
kubectl patch hpecpconfigs hpecp-global-config -n hpecp --type merge --patch '{"spec":{"fsMount":{"enabled":false} } }'

After the command is issued, starting a webterm should not generate an error.

EZCP-813: After upgrading to HPE Ezmeral Container Platform 5.3.x, Kubernetes Dashboards and graphs show no data for existing clusters and hosts

After upgrading to HPE Ezmeral Container Platform 5.3.x, Usage tab and Load charts of the Kubernetes dashboard show no data or show incomplete data for existing clusters.

Workaround: Existing system add-ons for monitoring must be upgraded and deployed on the existing clusters. New required system add-ons must be deployed on the existing clusters. See Upgrading Kubernetes Add-Ons.

EZCP-823: Kubernetes Upgrade dialog empty or not showing latest Kubernetes version after upgrade to HPE Ezmeral Container Platform 5.3.x.

Workaround: Refresh the browser screen.

EZCP-854: Istio add-on upgrade fails

This issue was fixed in HPE Ezmeral Container Platform 5.3.5 with the introduction of Istio 1.9.

When you use the add-on upgrade script to upgrade the optional Istio add-on, the add-on upgrade fails.

Workaround: Do not upgrade the Istio add-on. An upgrade of the Istio add-on is not needed because the add-on versions are the same in both HPE Ezmeral Container Platform 5.2 and HPE Ezmeral Container Platform 5.3.1.

KubeDirector Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZESC-217: 503 Service Unavailable error when connecting to training engine instance

Attempts to connect to a training engine instance from a JupyterLab Notebook fail. When you attempt to connect to the service endpoint of the training engine instance in a browser, the error "503 Service Unavailable" is returned.

Cause: The High Availability Proxy (HAProxy) service is not available on the gateway host.

Workaround: Start the HAProxy service. From a master node, enter the following command:

kubectl exec -c app -n <tenant-namespace> <trainingengineinstance-loadbalancer-pod> -- systemctl restart haproxy

Kubeflow Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

Katib Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

The following issues occur in Katib, which is a Kubernetes-native project for automated machine learning.

HPE Ezmeral Data Fabric on Kubernetes Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZKDF-404: Clusters that Implement HPE Ezmeral Data Fabric on Kubernetes fail to start after Kubernetes version or HPE Ezmeral Container Platform version upgrade.

The following advice applies to deployments that have separate Data Fabric clusters, and deployments that combine compute and Data Fabric nodes in the same cluster. This advice does not apply to deployments that implement Embedded Data Fabric only.

Attempts to upgrade or patch Kubernetes or upgrade HPE Ezmeral Container Platform in deployments that include HPE Ezmeral Data Fabric on Kubernetes can fail in ways that require a significant number of recovery steps.

Contact your Hewlett Packard Enterprise support representative for upgrade assistance for any of the following:

  • Upgrading or patching the Kubernetes version on any cluster that implements HPE Ezmeral Data Fabric on Kubernetes.
  • Upgrading HPE Ezmeral Data Fabric on Kubernetes independently of an upgrade to HPE Ezmeral Container Platform.
  • Upgrading HPE Ezmeral Container Platform on deployments that implement HPE Ezmeral Data Fabric on Kubernetes.

If your environment deploys a version of HPE Ezmeral Container Platform prior to version 5.3.5, Hewlett Packard Enterprise recommends that you upgrade to HPE Ezmeral Container Platform 5.3.5 or later before you add HPE Ezmeral Data Fabric on Kubernetes.

EZESC-563: ZooKeeper issue when running the saveAsNewAPIHadoopFile method on an HPE Ezmeral Data Fabric on Kubernetes cluster.
Symptom: Running the saveAsNewAPIHadoopFile method on an HPE Ezmeral Data Fabric on Kubernetes cluster generates the following error:
ERROR MapRZKRMFinderUtils: Unable to determine ResourceManager service address from Zookeeper at xxx.xxx.xxx.xxx

Workaround: Set the yarn.resourcemanager.ha.custom-ha-enabled and yarn.resourcemanager.recovery.enabled properties in the /opt/mapr/hadoop/hadoop-2.7.4/etc/hadoop/yarn-site.xml configuration file to false.
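The properties live in a Hadoop-style XML configuration file, so the edit can be scripted with xml.etree. A sketch that sets a property value, adding the property if it is absent (the property names come from the workaround; back up the file before running anything like this):

```python
import xml.etree.ElementTree as ET

def set_yarn_property(path, name, value):
    """Set or create a <property> in a Hadoop-style configuration file."""
    tree = ET.parse(path)
    root = tree.getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            prop.find("value").text = value
            break
    else:
        # property not present: append a new <property><name/><value/> block
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    tree.write(path)
```

Call it once per property, for example set_yarn_property(path, "yarn.resourcemanager.recovery.enabled", "false").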

EZKDF-109: After CLDB upgrade, MFS pods remain in a bad state

Workaround: Use the following command to restart the MAST gateway:

kubectl exec -it -n <namespace> <mfs-pod> -- /opt/mapr/initscripts/mapr-mastgateway restart

Embedded Data Fabric Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZCP-1351: Unknown UID warning for UID 997 in new HPE Ezmeral Container Platform deployments with Embedded Data Fabric

In a new HPE Ezmeral Container Platform deployment that implements Embedded Data Fabric, in the Embedded Data Fabric container, the cldb.log contains multiple instances of an error similar to the following:

2021-07-16 22:23:23,490 WARN CLDBServer [RPC-5]: Exception while fetching username for id 997
java.lang.SecurityException: Unknown uid

Workaround: Use SSH and log in to the HPE Ezmeral Container Platform Controller, then create a user that has UID 997. If this is an HA deployment, also add the user in the Shadow and Arbiter controllers.

The following is an example of the command to add the user:

bdmapr --root useradd -u 997 -g 5000 -s /sbin/nologin hpecpserviceaccount

Legacy EPIC Issues (5.3.1)

The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

EZCP-846: Exception error occurs when tagging a shadow or arbiter host

When attempting to add or update a tag on a shadow or arbiter host, an exception error occurs, preventing you from applying the tag.

Workaround: Contact your support specialist.

Previously-Identified Issues

The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

Issues are categorized as follows:

General Platform Issues (Prior Releases)

The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

BDP-2879: The Python ML and DL Toolkit lists a deleted Training cluster in the %attachments list.

Workaround: Ignore the deleted cluster. No jobs will be submitted to deleted clusters.

BDP-841: When enabling multi-domain authentication, the password field must be filled out for all domains before submitting changes to any domain; otherwise, the web interface fails to respond.

HAATHI-15068: Unable to create a tenant or FS mount if any host is down.

Workaround: Consider removing the Kubernetes host from the Kubernetes cluster or wait until the host is back up and running.

HAATHI-12781: When HPE Ezmeral Container Platform is installed on RedHat 7.x systems, system reboots are observed under heavy load.

Workaround: Update the RedHat kernel to the newest kernel version.

HAATHI-14220: Adding a license when one or more Worker hosts is in an error state may cause an error.

Workaround: Remove the affected hosts before uploading the license.

HAATHI-12810: After restarting the container that handles monitoring, the service may fail to restart and will show as red in the Services tab of the Platform Administrator Dashboard screen.

Workaround: Restart the service manually from the Controller host by executing the command systemctl restart bds-monitoring.

HAATHI-12829: For RHEL/CentOS 7.x OS installs, if a server is physically rebooted, some services that depend on network services may be down as shown in the Services tab of the Platform Administrator Dashboard screen.

Workaround: Execute the following commands on the Controller host:

$ systemctl stop NetworkManager
$ systemctl disable NetworkManager
$ systemctl restart network
$ systemctl restart bds-controller
$ systemctl restart bds-worker

HAATHI-13253: HPE Ezmeral Container Platform does not compress or archive Nagios log files.

Workaround: Manually archive files as needed in the /srv/bluedata/nagios directory on the Controller.
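A hedged sketch of one way to automate this with find. The retention thresholds (7 and 90 days) are assumptions, and a temp directory stands in for /srv/bluedata/nagios so the example is safe to run anywhere:

```shell
# Demo directory; substitute /srv/bluedata/nagios on the Controller.
logdir=$(mktemp -d)
touch "$logdir/nagios.log"
# Compress plain log files older than 7 days (age threshold is an assumption):
find "$logdir" -name '*.log' -mtime +7 -exec gzip {} \;
# Delete compressed archives older than 90 days:
find "$logdir" -name '*.log.gz' -mtime +90 -delete
```

A cron entry on the Controller could run the two find commands on a schedule.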

BDP-1511: Platform HA must be enabled before creating Kubernetes clusters.

Workaround: If you enable Platform HA after Kubernetes cluster creation, then reconfigure host monitoring as follows:

  1. On a Kubernetes master node, bring up the monitoring bootstrap deployment:

    kubectl -n hpecp-bootstrap scale deployment/hpecp-bootstrap-hpecp-monitoring --replicas=1
  2. Exec into the bootstrap pod:

    kubectl -n hpecp-bootstrap exec -it $(kubectl -n hpecp-bootstrap get -o jsonpath='{.items[0].metadata.name}' pods -l name=hpecp-bootstrap-hpecp-monitoring) -c hpecp-monitoring -- bash
  3. Delete the running deployment (if it exists):

    kubectl -n kube-system delete -f /workspace/monitoring.yaml
  4. Export or change any needed bds_xxx environment variables (for example, after enabling HA):

    export bds_ha_enabled='Yes'
    export bds_ha_nodes='<controller IP list>'

    (e.g. export bds_ha_nodes='16.143.21.35,16.143.21.237,16.143.21.38')

  5. Run startscript install:

    /usr/local/bin/startscript --install

    This places metricbeat.yaml in the workspace folder.

  6. Deploy metricbeat deployment:

    kubectl -n kube-system create -f /workspace/monitoring.yaml
  7. Exit the bootstrap pod and scale down bootstrap deployment:

    kubectl -n hpecp-bootstrap scale deployment/hpecp-bootstrap-hpecp-monitoring --replicas=0
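The steps above can be sketched as a single non-interactive script. This is an assumption-laden consolidation, not a supported procedure: it wraps the documented commands in a function (defined but not invoked here, since it needs cluster access), runs steps 3 through 6 inside the pod through one exec call instead of an interactive shell, and reuses the example controller IPs from step 4.

```shell
reconfigure_monitoring() {
  # Step 1: bring up the monitoring bootstrap deployment
  kubectl -n hpecp-bootstrap scale deployment/hpecp-bootstrap-hpecp-monitoring --replicas=1
  # Step 2: find the bootstrap pod; steps 3-6 then run inside it via exec
  pod=$(kubectl -n hpecp-bootstrap get pods -l name=hpecp-bootstrap-hpecp-monitoring \
        -o jsonpath='{.items[0].metadata.name}')
  kubectl -n hpecp-bootstrap exec "$pod" -c hpecp-monitoring -- bash -c '
    kubectl -n kube-system delete -f /workspace/monitoring.yaml || true   # step 3
    export bds_ha_enabled="Yes"                                           # step 4
    export bds_ha_nodes="16.143.21.35,16.143.21.237,16.143.21.38"
    /usr/local/bin/startscript --install                                  # step 5
    kubectl -n kube-system create -f /workspace/monitoring.yaml           # step 6
  '
  # Step 7: scale the bootstrap deployment back down
  kubectl -n hpecp-bootstrap scale deployment/hpecp-bootstrap-hpecp-monitoring --replicas=0
}
```

Run the function from a Kubernetes master node after substituting your own controller IP list.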

BDP-685: Kubernetes cluster creation fails with an "internal error."

Workaround: Remove the Kubernetes hosts, verify that all system clocks are synchronized, and then re-add the hosts and recreate the Kubernetes cluster.

BDP-852: All uploaded files and new folders created by AD/LDAP users via the HPE Ezmeral Container Platform FS mounts interface will have root ownership and full permission for all tenant members.

Workaround: None at this time.

BDP-1868: An admin kubeconfig file downloaded from an imported external Kubernetes cluster will not contain expected edits from the HPE Ezmeral Container Platform web interface.

Workaround: Manually edit the kubeconfig file after download.

Kubernetes Issues (Prior Releases)

The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

BDP-582: Kubernetes manifest customizations revert to defaults after an HPE Ezmeral Container Platform failover

Symptom: In deployments that use a customized external Kubernetes manifest feed, unless the manifest feed is also configured on the shadow controller, manifest feeds revert to the default location after an HPE Ezmeral Container Platform failover. Therefore, manifest customizations are no longer in effect.

Workaround: After failover, manually restore the manifest feed. For example, if the feed location is https://bd-poonam.s3-us-west-1.amazonaws.com/epic-5.1/k8S_Manifest.json, then execute the following commands:

On the Primary Controller:

/opt/bluedata/common-install/bd_mgmt/bin/bd_mgmt k8s manifest_change_feed https://bd-poonam.s3-us-west-1.amazonaws.com/epic-5.1/k8S_Manifest.json

On the Shadow Controller:

bdconfig --set bdshared_k8s_manifestfeed=https://bd-poonam.s3-us-west-1.amazonaws.com/epic-5.1/k8S_Manifest.json

For more details, see Updating the Kubernetes Manifest.

BDP-574: Unable to add a Kubernetes host when Platform HA (High Availability) is being enabled.

Workaround: Wait until Platform HA finishes before adding the Kubernetes host.

HAATHI-15093: A GPU is visible in a non-GPU-requesting pod. When an app is scheduled on a device that has a GPU, the app can access the GPU even when it has not requested one. This is a known issue with the NVIDIA k8s-device-plugin.

Workaround: You must manually create an environment variable in the KubeDirectorCluster YAML that 'hides' the GPU from the app. The variable is named NVIDIA_VISIBLE_DEVICES with the value VOID. For example:

apiVersion: "kubedirector.bluedata.io/apiVersion"
kind: "KubeDirectorCluster"
metadata:
  name: "sample-name"
spec:
  app: sample-app
  roles:
    - id: samplerole
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"
        limits:
          memory: "4Gi"
          cpu: "2"
      env:
        - name: "NVIDIA_VISIBLE_DEVICES"
          value: "VOID"

Kubeflow Issues (Prior Releases)

The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

Legacy EPIC Issues (Prior Releases)

The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

HAATHI-12651: Certain action script commands, such as tailf or top, will continue to run a process on the host after the action script is killed.

Workaround: Please contact HPE Technical Support.

HAATHI-12654: In virtual clusters with AD/LDAP user account integration and Edge nodes, there is some divergence between users who can SSH into a node and users who are allowed to run ActionScripts on the node. The latter is more permissive on Edge nodes. All ActionScript invocations are captured in the audit log.

HAATHI-12698: ActionScripts are not available for virtual clusters that were created prior to upgrading to version 5.2.

Workaround: Create new virtual clusters with the same specifications (distribution, number/flavor of virtual nodes, etc.), and then delete the old virtual clusters.

HAATHI-12656: Entering incorrect information when defining the Kerberos realm of a remote DataTap (for example, the correct value for Realm but incorrect values for Host and/or Port) could cause all further attempts to create a DataTap using that realm to fail.

Workaround: Edit /etc/krb5.conf on the Controller and enter the correct values for Host and/or Port. After this, re-create the DataTap.
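For reference, a realm stanza in /etc/krb5.conf has this general shape; the realm name, host, and ports below are placeholders, not values from your deployment:

```
[realms]
  EXAMPLE.COM = {
    kdc = kdc.example.com:88
    admin_server = kdc.example.com:749
  }
```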

Application Issues (Prior Releases)

The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.

HAATHI-14109: When using CEPH for persistent storage, a discrepancy between the client and server versions will cause HPE Ezmeral Container Platform to fail to load App Store images with the error "Failed to map the volume."

Workaround: Remove the persistent storage until the client and server versions are the same.

HAATHI-14192: Running the Impala shell fails on a container where the Impala daemon is not running.

Workaround: Use the -i option to point to a worker node where the daemon is running, for example: impala-shell -i <worker hostname>.

HAATHI-14461: Notebooks with a name that includes one or more spaces cannot be committed to GitHub. When working in an AI/ML project that includes a GitHub repository, creating a JupyterHub notebook with a name that includes one or more spaces will cause an error when trying to commit that notebook to GitHub.

Workaround: Do not include any spaces when naming a JupyterHub notebook.

HAATHI-10733: Hive jobs that use DataTap paths may fail with a SemanticException error. When Hive creates a table, the location where the table metadata is stored comes from the Hive configuration parameter fs.defaultFS by default (which will point to the cluster file system). If a Hive job references DataTap paths outside of the file system where the table metadata is stored, then the job will fail with a SemanticException error because Hive enforces that all data sources must come from the same file system.

Workaround: Explicitly set the table metadata location to a path on the same DataTap that you will use for the job inputs and/or outputs, using the LOCATION clause when creating the table. For example, if you intend to use the TenantStorage DataTap, you would set the table metadata location to some path on that DataTap such as:

CREATE TABLE docs (c1 INT, c2 STRING)
LOCATION 'dtap://TenantStorage/hive-table-docs';

HAATHI-12546: Some http links in applications running on HPE Ezmeral Container Platform show the hostname of the instance. These links will not work when HPE Ezmeral Container Platform is installed with the non-routable network option.

Workaround: See "Configure Client to use Hostname instead of IP Address," below.

HAATHI-13254: If a user updates an app inside a container instead of via the App Store screen, then cluster expansion will fail.

Workaround: Expand the cluster before performing the update. Once the update is complete, edit the classpath to point to the correct .jar files, such as hadoop-common-*.jar.

DOC-9: Cloudera Manager reports incorrect values for a node's resources. Cloudera Manager accesses the Linux /proc file system to determine the characteristics of the nodes it is managing. Because container technology is used to implement virtual nodes, this file system reports information about the host rather than about the individual node, causing Cloudera Manager to report inflated values for a node's CPU count, memory, and disk.

Workaround: Use the web interface to see a node's virtual hardware configuration (flavor).
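A quick way to see the discrepancy from inside a virtual node; a minimal sketch, and the cgroup path shown is an assumption that depends on the cgroup version in use:

```shell
# Inside a virtual node, /proc reflects the host, so tools that read it
# (such as Cloudera Manager) report host-wide values rather than the flavor.
host_cpus=$(grep -c '^processor' /proc/cpuinfo)
echo "CPU count seen via /proc: $host_cpus"
# The node's real allocation lives in cgroup limit files instead
# (cgroup v1 path shown; the path varies by setup and cgroup version):
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us 2>/dev/null || echo "cgroup v1 quota file not present"
```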

DOC-19: Spark applications may wait indefinitely if no free vCPUs are available. This is a general Spark behavior, but it is worth some emphasis in an environment where various virtual hardware resources (possibly in small amounts) can be quickly provisioned for use with Spark.

A Spark application will be stuck in the Waiting state if all vCPUs in the cluster are already considered to be in use (by the Spark framework and other running Spark applications). In Spark version 1.5, the thrift server is configured to use 2 vCPUs on the Spark master node by default. You can reduce this to 1 vCPU by editing the total-executor-cores argument value in the /etc/init.d/hive-thriftserver script, and then restarting the thrift server (sudo service hive-thriftserver restart).
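A hedged sketch of the edit using sed. A temp file stands in for /etc/init.d/hive-thriftserver, and the exact option string inside that script is an assumption; check the real file before editing it:

```shell
# Stand-in for /etc/init.d/hive-thriftserver (hypothetical contents):
script=$(mktemp)
echo 'exec spark-submit --total-executor-cores 2 "$@"' > "$script"
# Lower the thrift server's vCPU reservation from 2 to 1:
sed -i 's/--total-executor-cores 2/--total-executor-cores 1/' "$script"
grep -- '--total-executor-cores 1' "$script"
```

After applying the equivalent edit to the real init script, restart the thrift server for the change to take effect.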

K8S-1887: A MapR software version alarm is generated, indicating that “One or more services on the node are running an unexpected version.” The alarm includes a “recommended action” to stop and restart the node.

Workaround: You can ignore the alarm and recommended action for container-based HPE Ezmeral Data Fabric.