This topic describes issues and workarounds in HPE Ezmeral Container Platform version 5.3.x.
See issues for older releases.
The following issues were identified in HPE Ezmeral Container Platform 5.3.5. Unless otherwise noted, these issues also apply to later 5.3.x releases.
Symptom: When upgrading HPE Ezmeral Container Platform 5.3.1 to 5.3.5 or 5.3.6, upgrading directly from the Beta 5 version of the NVIDIA plugin add-on to the required plugin version is not supported.
Cause: After upgrading HPE Ezmeral Container Platform from 5.3.1 to 5.3.5 or 5.3.6, upgrading the NVIDIA plugin add-on directly from version 1.0.0-beta-5 to version 0.9.0-1, which is the required add-on version for HPE Ezmeral Container Platform 5.3.5 and 5.3.6, is not supported. You must upgrade the NVIDIA plugin add-on from 1.0.0-beta-5 to 1.0.0-beta-6, and then to 0.9.0-1.
Workaround: In the following procedure, you will upgrade the NVIDIA plugin add-on from version 1.0.0-beta-5 to version 1.0.0-beta-6, and then from version 1.0.0-beta-6 to version 0.9.0-1.
Import the images to your air-gap registry. For more information about this step, see Configuring Air Gap Kubernetes Host Settings.
After installing the Kubernetes hosts, SSH to each GPU host and verify the installed NVIDIA packages:
$ rpm -qa | grep nvidia
nvidia-container-runtime-3.5.0-1.x86_64
libnvidia-container1-1.4.0-1.x86_64
nvidia-container-toolkit-1.5.1-2.x86_64
libnvidia-container-tools-1.4.0-1.x86_64
Create a credentials file containing the Site Admin credentials. For example:
echo '{"user":"<site-admin-username>","password":"<site-admin-password>"}' >> /tmp/cred.json
Execute the add-on upgrade script as a dry run by specifying the -t and --required-only parameters:
cd /opt/bluedata/common-install/scripts/k8s-addons
python k8s_addon_upgrade.py -c <CONTROLLER_IP> -f /tmp/cred.json -k <CLUSTER_NAME> --required-only -t
Result:
The output of the command lists all the manifest add-ons (both required and optional) and their versions, as well as the add-on versions that are currently deployed on the cluster. In that result, you can see the following information about the NVIDIA plugin add-on:
Cluster my-k8s-cluster required add-ons info:
...
nvidia-plugin:
deployed version: 1.0.0-beta-5, tools: 0.4
manifest version: 0.9.0-0, tools: 0.4
...
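The credentials file passed with -f must be valid JSON. A minimal sanity check before running the upgrade script is sketched below; the /tmp/cred.json path and field names follow the example in this procedure, and the literal values are placeholders for the real Site Admin credentials:

```shell
# Write the credentials file as valid JSON (note: unquoted placeholders
# such as {"user":<name>} are NOT valid JSON and will be rejected).
cred=/tmp/cred.json
printf '{"user":"%s","password":"%s"}\n' "admin" "secret" > "$cred"

# Validate before invoking k8s_addon_upgrade.py:
# python3 -m json.tool exits non-zero on invalid JSON.
if python3 -m json.tool "$cred" > /dev/null 2>&1; then
  echo "cred.json OK"
else
  echo "cred.json is not valid JSON" >&2
fi
```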
Change to the /opt/bluedata/common-install/manifest directory and edit the k8s_manifest.json manifest file, changing the NVIDIA plugin add-on version to 1.0.0-beta-6:
cd /opt/bluedata/common-install/manifest
"nvidia-plugin" : {
    "_version" : 2,
    "required" : true,
    "version" : "1.0.0-beta-6",
    "system" : true,
    "order" : 40,
    "tools_version" : "0.4",
    "deployment" : "hpecp-bootstrap-nvidia-plugin",
    "label" : {
        "name" : "hpecp-bootstrap-nvidia-plugin",
        "description" : ""
    }
}
Refresh the manifest:
/opt/bluedata/common-install/bd_mgmt/bin/bd_mgmt k8s manifest_refresh
Change to the scripts/k8s-addons directory and execute the add-on upgrade script as a dry run by specifying the -t and --required-only parameters. For example:
cd /opt/bluedata/common-install/scripts/k8s-addons
python k8s_addon_upgrade.py -c <CONTROLLER_IP> -f /tmp/cred.json -k <CLUSTER_NAME> --required-only -t
Result:
The resulting output shows that the manifest version of the NVIDIA plugin add-on is 1.0.0-beta-6.
Cluster my-k8s-cluster required add-ons info:
...
nvidia-plugin:
deployed version: 1.0.0-beta-5, tools: 0.4
manifest version: 1.0.0-beta-6, tools: 0.4
...
Execute the add-on upgrade script to upgrade the NVIDIA plugin add-on. For example:
cd /opt/bluedata/common-install/scripts/k8s-addons
python k8s_addon_upgrade.py -c <CONTROLLER_IP> -f /tmp/cred.json -k <CLUSTER_NAME> -a nvidia-plugin
Repeat the manifest edit and refresh steps, this time specifying "version" : "0.9.0-1" in the edited Kubernetes manifest file, and run the add-on upgrade script again.
After the upgrade, the output of a script dry run command shows the following for the NVIDIA plugin add-on:
Cluster my-k8s-cluster required add-ons info:
...
nvidia-plugin:
deployed version: 0.9.0-1, tools: 0.4
manifest version: 0.9.0-1, tools: 0.4
...
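When scripting this procedure, the dry-run output can be checked mechanically instead of by eye. A small sketch, assuming the output format shown above (on a live system, capture the real output of the dry-run command into the variable instead of the canned sample):

```shell
# Canned dry-run output standing in for:
#   out=$(python k8s_addon_upgrade.py ... --required-only -t)
out='nvidia-plugin:
  deployed version: 0.9.0-1, tools: 0.4
  manifest version: 0.9.0-1, tools: 0.4'

# Pull out the two version strings and compare them.
deployed=$(printf '%s\n' "$out" | awk -F': ' '/deployed version/ {print $2}' | cut -d, -f1)
manifest=$(printf '%s\n' "$out" | awk -F': ' '/manifest version/ {print $2}' | cut -d, -f1)

if [ "$deployed" = "$manifest" ]; then
  echo "add-on up to date ($deployed)"
else
  echo "upgrade needed: $deployed -> $manifest"
fi
```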
The following issues were identified in HPE Ezmeral Container Platform 5.3.5. Unless otherwise noted, these issues also apply to later 5.3.x releases.
Symptom: When you attempt to create a Kubernetes cluster in an air-gap configuration, the NVIDIA gpu-feature-discovery pod fails to come up, and the screen displays a Failed to pull image error with status ImagePullBackoff.
This issue was fixed in HPE Ezmeral Container Platform 5.3.6.
Cause: In an air-gap installation, the NVIDIA gpu-feature-discovery daemonset does not use the air-gap registry URL.
EZKDF-404, "Clusters that Implement HPE Ezmeral Data Fabric on Kubernetes fail to start after Kubernetes version or HPE Ezmeral Container Platform version upgrade," also applies to upgrading Kubernetes versions in HPE Ezmeral Container Platform 5.3.5 deployments that implement HPE Ezmeral Data Fabric on Kubernetes.
Symptom: When you attempt to upgrade a Kubernetes cluster that implements either HPE Ezmeral Data Fabric on Kubernetes or Embedded Data Fabric from Kubernetes 1.18.x to a higher version, pods fail to come up, and the Kubernetes Cluster screen displays the message: one or more workers failed to upgrade.
Cause: The 1.0.x and 1.1.x versions of the HPE CSI driver are not supported on Kubernetes 1.19.x or later.
Workaround: Upgrade the HPE CSI driver to version
1.2.5-1.0.5. For instructions, see Upgrading the CSI Plug-In.
Symptom: During a Kubernetes upgrade, the following errors appear:
Warning: one or more workers failed to upgrade on the Kubernetes Cluster screen.
Upgrade error: Failed to drain node on the individual Kubernetes Host Status screen.
This issue also occurs when the application user deploys PodDisruptionBudget (PDB) objects to the application workloads.
Cause: Some PodDisruptionBudget (PDB) objects for Istio resources have a minimum replica count of 1. This prevents the kubectl drain command from succeeding during the Kubernetes upgrade.
Workaround: Delete the Istio PDB objects:
kubectl -n istio-system delete poddisruptionbudget/istiod
kubectl -n istio-system delete poddisruptionbudget/istio-ingressgateway
kubectl -n istio-system delete poddisruptionbudget/istio-egressgateway
If you see a Failed to drain node error on the Kubernetes hosts or workers, execute the preceding kubectl commands on the Kubernetes Master, and continue with the Kubernetes upgrade on the remaining workers using the Retry Kubernetes Upgrade on Failed Workers action on the cluster from the Kubernetes Cluster screen.
In STRICT mode, the Kiali Dashboard and KubeDirector service endpoints are not accessible through NodePort
Symptom: When Istio is configured to use Mutual Transport Layer Security (mTLS) in STRICT mode, the following issues occur:
When STRICT mode is enabled in a tenant, the Kiali Dashboard is not accessible through NodePort. Clicking on the endpoint results in an error.
Workaround: If possible, configure Istio to use PERMISSIVE mode (the default mode).
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
Symptom: If you delete a host from HPE Ezmeral Container Platform, and then install the host again,
host addition fails with the error Checking AppArmor
configuration: FAILED.
Cause: During SLES deployments, the Zypper package manager
enables AppArmor service. If you delete a host from
HPE Ezmeral Container Platform without disabling
AppArmor and install the host again, then the
host addition fails with the precheck error Checking AppArmor
configuration: FAILED.
Workaround: Before installing the host again, stop and disable AppArmor:
sudo systemctl stop apparmor
sudo systemctl disable apparmor
sudo rm -f /usr/lib/systemd/system/apparmor.service
Symptom: Following an upgrade to HPE Ezmeral Container Platform 5.3, the UI for the controller is inaccessible, failing with internal error 500. The system fails with the error:
No space left on device
The /var/lib/monitoring/logs directory contains large hpecp-monitoring_access and hpecp-monitoring_audit logs.
Cause:
Dangling Search Guard indexes exist after the upgrade. You might see log entries similar to the following:
[WARN ][o.e.g.DanglingIndicesState] [xxxxx]
[[searchguard/xxx-xxxxxxxxx-xxxxxxx]] can not be imported as a
dangling index, as index with same name already exists in cluster
metadata
Workaround: Search Guard indexes are not used by HPE Ezmeral Container Platform 5.3. You can remove the Search Guard indexes, delete the large log files, and resume monitoring on the HA nodes.
Remove the Search Guard indexes using one of the following methods:
If Elasticsearch is running, you can delete the Search Guard index through the Elasticsearch REST API.
For example:
curl --insecure -u $(bdconfig --getvalue bdshared_elasticsearch_admin):$(bdconfig --getvalue bdshared_elasticsearch_adminpass) --silent -X DELETE https://localhost:9210/searchguard
If Elasticsearch is not able to run, you must identify and delete SearchGuard indexes manually:
Change the directory to /var/lib/monitoring/elasticsearch/nodes/0, then enter the following command:
find . -name "state-*.st" -print | xargs grep searchguard
All the indices that are from Search Guard are displayed. You can use matching entries to determine which indexes to remove.
For example, the following line identifies a state file that contains the word searchguard. The index name is part of the full file path of that file. In this example, the index name is xtSTTUb7RgOeUlCXWH8dAg:
./indices/xtSTTUb7RgOeUlCXWH8dAg/_state/state-45.st matches
Use the rm command to remove the index.
index.
For example:
rm -rf ./indices/xtSTTUb7RgOeUlCXWH8dAg
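The identify-and-remove steps above can be combined into a single pass. The sketch below builds a fake index tree so it is safe to run anywhere; on a real node you would run only the find/sed/rm pipeline from inside /var/lib/monitoring/elasticsearch/nodes/0:

```shell
# Build a fake Elasticsearch data tree: one Search Guard index, one other.
esdir=$(mktemp -d)
mkdir -p "$esdir/indices/xtSTTUb7RgOeUlCXWH8dAg/_state"
printf 'searchguard' > "$esdir/indices/xtSTTUb7RgOeUlCXWH8dAg/_state/state-45.st"
mkdir -p "$esdir/indices/keepMeIndex/_state"
printf 'other' > "$esdir/indices/keepMeIndex/_state/state-1.st"

cd "$esdir"
# For every state file mentioning searchguard, derive the index directory
# (two levels up from the _state file) and remove it.
find . -name "state-*.st" -print | xargs grep -l searchguard \
  | sed 's|/_state/state-.*||' \
  | while read -r idx; do rm -rf "$idx"; done
ls indices
```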
Resume monitoring on the HA nodes:
HPECP_ONLY_RESTART_ES=1 /opt/bluedata/bundles/hpe-cp-*/startscript.sh --action enable_monitoring
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
When working with Kubernetes clusters in HPE Ezmeral Container Platform, you receive errors similar to the following:
Failed to pull image "bluedata/hpe-agent:1.1.5": rpc error: code
= Unknown desc = Error response from daemon: toomanyrequests: You
have reached your pull rate limit. You may increase the limit by
authenticating and upgrading: https://www.docker.com/increase-rate-limit
Cause: Kubernetes clusters running on any version of HPE Ezmeral Container Platform can occasionally encounter problems caused by the pull rate limit that Docker Hub applies to all free and anonymous accounts. These limits can cause cluster creation and application deployment to fail. If Kubernetes pods in a non-Air-gap environment are failing to come into Ready state and are showing ImagePullBackoff or related errors, this is the most likely cause.
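To confirm this cause quickly, filter the pod listing for pull failures rather than reading every pod's events by hand. A sketch using canned kubectl get pods -A output (on a live cluster, pipe the real command output instead):

```shell
# Canned output standing in for: kubectl get pods -A
pods='NAMESPACE  NAME           READY  STATUS            RESTARTS
hpecp      hpe-agent-abc  0/1    ImagePullBackOff  0
default    web-0          1/1    Running           0'

# Column 4 is STATUS; flag pods stuck on image pulls.
stuck=$(printf '%s\n' "$pods" \
  | awk '$4 == "ImagePullBackOff" || $4 == "ErrImagePull" {print $2}')
echo "pods with pull failures: $stuck"
```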
Workaround: Do one of the following:
Hewlett Packard Enterprise strongly recommends performing air-gap configuration steps before adding Kubernetes hosts to the HPE Ezmeral Container Platform environment. Kubernetes hosts do not implement air-gap changes until the hosts are rebooted or the Kubernetes version is upgraded.
Upgrade your Docker Hub account as described in https://www.docker.com/increase-rate-limits (link opens an external website in a new browser tab/window), then on all hosts, do the following:
Perform a docker login operation with the credentials of the upgraded account. Docker will create or update its config.json file after a successful login (or you might want to use an existing configuration file).
Ensure that kubelet uses the new
config.json file by placing it in
one of the known search locations kubelet uses for
credential files:
Create a .docker directory
directly under the root of the filesystem and
place the config.json file in that directory. For
example: /.docker/config.json
Restart kubelet:
systemctl restart kubelet
systemctl status kubelet
Kubelet will then choose that config.json file and use the paid account that generated that config, ensuring that no image pull rate limit will be exceeded.
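The copy step can be sketched as follows. ROOT would be / on a real host; a temporary directory stands in here so the sketch is safe to run, and the config.json content is a placeholder for the file docker login produces:

```shell
# Stand-in for the filesystem root of a Kubernetes host.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/.docker"

# Stand-in for ~/.docker/config.json written by `docker login`.
srcdir=$(mktemp -d)
printf '{"auths":{}}' > "$srcdir/config.json"

# Place the credentials where kubelet searches for them.
cp "$srcdir/config.json" "$ROOT/.docker/config.json"
# On a real host, you would now run: systemctl restart kubelet
ls "$ROOT/.docker"
```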
The following article (link opens an external website in a new browser tab/window) shows all the locations that kubelet searches for Docker credentials files:
Create a Docker proxy cache as described in the following article (link opens an external website in a new browser tab/window):
Symptom: After upgrade to HPE Ezmeral Container Platform 5.3, pods such as those for Web Terminal or Jupyter Notebook fail with errors similar to the following:
error while creating mount source path '/var/lib/kubelet/pods/
...
Cause: In release 5.2, the CSI yaml file defines /var/lib/kubelet. In release 5.3.1, /var/lib/kubelet is a symbolic link to /var/lib/docker/kubelet. However, because CSI performs mount operations that do not support the use of this symbolic link, the release 5.3.1 CSI yaml file defines /var/lib/docker/kubelet. This difference is interpreted in Kubernetes as a sandbox change, and pods fail to start.
Workaround:
After the upgrade, if any pods are in a waiting state with Reason: CrashLoopBackOff, you can use the following command to delete those pods:
kubectl -n <namespace> delete pod <pod-name>
The pod will be restarted with no loss of persistent data.
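To find every pod in that state and build the corresponding delete commands, the pod listing can be filtered mechanically. A sketch using canned kubectl get pods -A output (feed it the real command output on a live cluster):

```shell
# Canned output standing in for: kubectl get pods -A
pods='NAMESPACE  NAME       READY  STATUS            RESTARTS
ns1        webterm-0  0/1    CrashLoopBackOff  7
ns2        jupyter-0  1/1    Running           0'

# Emit one delete command per pod stuck in CrashLoopBackOff.
cmds=$(printf '%s\n' "$pods" \
  | awk '$4 == "CrashLoopBackOff" {print "kubectl -n " $1 " delete pod " $2}')
echo "$cmds"
```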
For each Kubernetes host that is added to the cluster after the upgrade, do the following:
Remove the symbolic link /var/lib/kubelet, then move the contents of /var/lib/docker/kubelet to /var/lib/kubelet. As a result of this step, you might see some Device busy errors. You can ignore those errors because you will be rebooting the host in a later step.
In the /var/lib/kubelet/kubeadm-flags.env file, remove --root-dir=/var/lib/docker/kubelet.
In /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf, change EnvironmentFile=-/var/lib/docker/kubelet/kubeadm-flags.env to EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
Symptom: Following the execution of a yum update command on the host, Kubernetes pods do not come up after host reboot.
Cause: The yum update command, by default, attempts to update packages from all enabled repositories, including the repository that manages the kubeadm, kubelet, and kubectl packages.
However, updating that repository using yum is not the
correct upgrade procedure, and it can result in the installation of a
package version that is not compatible with the current HPE Ezmeral Container Platform deployment, which leads to failures
that result in applications not running correctly. In addition, the
installed package version no longer matches the package version listed
in the HPE Ezmeral Container Platform UI.
yum list installed | grep 'kubeadm\|kubelet\|kubectl'
You run the yum list command instead of a kubectl command because the incorrect version numbers will not appear on a host that has not been rebooted since the incorrect package was installed.
Identify which package name and version to reinstall by using
the yum history and yum history
info commands.
The versions listed in the output following the word
Updated match the (correct) Kubernetes
version that the cluster should be running. The output is
similar to the following:
# yum history info 7
...
Packages Altered:
Updated kubeadm-1.17.9-0.x86_64 @kubernetes
Update 1.21.0-0.x86_64 @kubernetes
Updated kubectl-1.17.9-0.x86_64 @kubernetes
Update 1.21.0-0.x86_64 @kubernetes
Updated kubelet-1.17.9-0.x86_64 @kubernetes
Update 1.21.0-0.x86_64 @kubernetes
For each updated Kubernetes package, run the yum
downgrade command, specifying the package name
listed on the Updated line. For
example:
yum downgrade kubeadm-1.17.9-0.x86_64 -y
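Deriving those downgrade commands from the yum history info output can be automated. A sketch over a sample transaction mirroring the output shown above (on a real host, pipe the actual yum history info output in):

```shell
# Canned transaction standing in for: yum history info <id>
hist='Packages Altered:
    Updated kubeadm-1.17.9-0.x86_64 @kubernetes
    Update      1.21.0-0.x86_64     @kubernetes
    Updated kubelet-1.17.9-0.x86_64 @kubernetes
    Update      1.21.0-0.x86_64     @kubernetes'

# Each "Updated" line names the correct (pre-update) package version,
# which is exactly what yum downgrade needs.
cmds=$(printf '%s\n' "$hist" \
  | awk '$1 == "Updated" {print "yum downgrade " $2 " -y"}')
echo "$cmds"
```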
After you downgrade the Kubernetes packages on all the affected hosts, reboot the hosts that were running the incorrect package versions.
Run the kubectl get nodes command and examine the output. After rebooting the affected hosts, allow several minutes for the correct versions to be reflected in the output of the kubectl get nodes command.
Then run the kubectl get po -A command to verify that all pods are in the expected state.
On each host, prevent the yum update command
from updating the Kubernetes repo.
Open the following file in an editor:
/etc/yum.repos.d/bd-kubernetes.repo
The enabled=1 parameter indicates that updates are enabled. Change the parameter to enabled=0.
When you use yum to update other packages, run the yum update command without the -y option so that you can individually deny any Kubernetes package updates that show as available.
Before you update a host through the HPE Ezmeral Container Platform UI, edit the
/etc/yum.repos.d/bd-kubernetes.repo
file to enable updates by setting
enabled=1. After you update the
host, edit the file again to disable updates of the
Kubernetes repo.
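The enable/disable edit can be scripted with sed instead of a hand edit. A sketch against a stand-in copy of the repo file (on a real host, target /etc/yum.repos.d/bd-kubernetes.repo and re-enable before a UI-driven host update):

```shell
# Stand-in for /etc/yum.repos.d/bd-kubernetes.repo.
repo=$(mktemp)
printf '[kubernetes]\nname=Kubernetes\nenabled=1\nrepo_gpgcheck=1\n' > "$repo"

# Disable updates of the Kubernetes repo:
sed -i 's/^enabled=1$/enabled=0/' "$repo"
grep '^enabled=' "$repo"
# To re-enable before updating the host through the UI:
#   sed -i 's/^enabled=0$/enabled=1/' "$repo"
```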
Symptom: The HPE Ezmeral Container Platform UI
reports that a Kubernetes host upgrade failed. The upgrade log for that
host server contains an error similar to the following:
repomd.xml signature could not be verified for
kubernetes
Cause: This is a known Kubernetes issue. For an example, see https://github.com/kubernetes/kubernetes/issues/100757
Workaround: Edit /etc/yum.repos.d/bd-kubernetes.repo to set the following: repo_gpgcheck=0
Workaround: Run the following command:
kubectl patch hpecpconfigs hpecp-global-config -n hpecp --type merge --patch '{"spec":{"fsMount":{"enabled":false} } }'
After the command is issued, starting a webterm should not generate an error.
After upgrading to HPE Ezmeral Container Platform 5.3.x, the Usage tab and Load charts of the Kubernetes dashboard show no data or show incomplete data for existing clusters.
Workaround: Existing system add-ons for monitoring must be upgraded and deployed on the existing clusters. New required system add-ons must be deployed on the existing clusters. See Upgrading Kubernetes Add-Ons.
Workaround: Refresh the browser screen.
When you use the add-on upgrade script to upgrade the optional Istio add-on, the add-on upgrade fails.
This issue was fixed in HPE Ezmeral Container Platform 5.3.5 with the introduction of Istio 1.9.
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
Attempts to connect to a training engine instance from a JupyterLab Notebook fail. When you attempt to connect to the service endpoint of the training engine instance in a browser, the error "503 Service Unavailable" is returned.
Cause: The High Availability Proxy (HAProxy) service is not available on the gateway host.
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
If you specify an external user group, the group is not taken into account when a user logs in to Kubeflow. The user is allowed to log in to Kubeflow regardless of which groups the user belongs to. See the following for more information:
Occasionally, the v1beta1.webhook.cert-manager.io apiservice is unavailable for a period of time after deploying Kubeflow services (applying a manifest). To make the service available, restart the service as follows:
kubectl delete apiservices v1beta1.webhook.cert-manager.io
There is an issue with Istio authorization for HTTP traffic in which the
KFServing predict request returns 503 Service Unavailable.
See the following for more information:
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
The following issues occur in Katib, which is a Kubernetes-native project for automated machine learning.
Suggestion pods running after experiment completes:
Katib with Kubernetes 1.19 and higher:
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
The following advice applies to deployments that have separate Data Fabric clusters, and deployments that combine compute and Data Fabric nodes in the same cluster. This advice does not apply to deployments that implement Embedded Data Fabric only.
Attempts to upgrade or patch Kubernetes or upgrade HPE Ezmeral Container Platform in deployments that include HPE Ezmeral Data Fabric on Kubernetes can fail in ways that require a significant number of recovery steps.
Contact your Hewlett Packard Enterprise support representative for upgrade assistance for any of the following:
If your environment deploys a version of HPE Ezmeral Container Platform prior to version 5.3.5, Hewlett Packard Enterprise recommends that you upgrade to HPE Ezmeral Container Platform 5.3.5 or later before you add HPE Ezmeral Data Fabric on Kubernetes.
Calling the saveAsNewAPIHadoopFile method on an HPE Ezmeral Data Fabric on Kubernetes cluster generates the following error:
ERROR MapRZKRMFinderUtils: Unable to determine ResourceManager service address from Zookeeper at xxx.xxx.xxx.xxx
Workaround: Use the following command to restart the MAST gateway:
kubectl exec -it -n <namespace> <mfs-pod> -- /opt/mapr/initscripts/mapr-mastgateway restart
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
In a new HPE Ezmeral Container Platform deployment that
implements Embedded Data Fabric, in the
Embedded Data Fabric container, the
cldb.log contains multiple instances of an error
similar to the following:
2021-07-16 22:23:23,490 WARN CLDBServer [RPC-5]: Exception while fetching username for id 997
java.lang.SecurityException: Unknown uid
Workaround: Use SSH and log in to the HPE Ezmeral Container Platform Controller, then create a user that has UID 997. If this is an HA deployment, also add the user in the Shadow and Arbiter controllers.
The following is an example of the command to add the user:
bdmapr --root useradd -u 997 -g 5000 -s /sbin/nologin hpecpserviceaccount
The following issues were identified in HPE Ezmeral Container Platform 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
When attempting to add or update a tag on a shadow or arbiter host, an exception error occurs, preventing you from applying the tag.
Workaround: Contact your support specialist.
The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
Issues are categorized as follows:
The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
BDP-2879: The Python ML and DL Toolkit lists a deleted Training cluster in the %attachments list.
Workaround: Ignore the deleted cluster. No jobs will be submitted to deleted clusters.
BDP-841: When enabling multi-domain authentication, the password field must be filled out for all domains before submitting changes to any domain, otherwise the web interface will fail to react.
HAATHI-15068: Unable to create a tenant or FS mount if any host is down.
Workaround: Consider removing the Kubernetes host from the Kubernetes cluster or wait until the host is back up and running.
HAATHI-12781: When HPE Ezmeral Container Platform is installed on RedHat 7.x systems, system reboots are observed under heavy load.
Workaround: Update the RedHat kernel to the newest kernel version.
HAATHI-14220: Adding a license when one or more Worker hosts is in an error state may cause an error.
Workaround: Remove the affected hosts before uploading the license.
HAATHI-12810: After restarting the container that handles monitoring, the service may fail to restart and will show red in the Services tab of the Platform Administrator Dashboard screen.
Workaround: Restart the service manually from the Controller host by executing
the command systemctl restart bds-monitoring.
HAATHI-12829: For RHEL/CentOS 7.x OS installs, if a server is physically rebooted, some services that depend on network services may be down as shown in the Services tab of the Platform Administrator Dashboard screen.
Workaround: Execute the following commands on the Controller host:
$ systemctl stop NetworkManager
$ systemctl disable NetworkManager
$ systemctl restart network
$ systemctl restart bds-controller
$ systemctl restart bds-worker
HAATHI-13253: HPE Ezmeral Container Platform does not compress or archive Nagios log files.
Workaround: Manually archive files as needed in the
/srv/bluedata/nagios directory on the Controller.
BDP-1511: Platform HA must be enabled before creating Kubernetes clusters.
Workaround: If you enable Platform HA after Kubernetes cluster creation, then reconfigure host monitoring as follows:
On a Kubernetes master node bring up the monitoring bootstrap deployment:
kubectl -n hpecp-bootstrap scale deployment/hpecp-bootstrap-hpecp-monitoring --replicas=1
Exec into the bootstrap pod
kubectl -n hpecp-bootstrap exec -it $(kubectl -n hpecp-bootstrap get -o jsonpath='{.items[0].metadata.name}' pods -l name=hpecp-bootstrap-hpecp-monitoring) -c hpecp-monitoring -- bash
Delete the running deployment (if it exists):
kubectl -n kube-system delete -f /workspace/monitoring.yaml
Export or change any needed bds_xxx environment variables (for example, when redeploying after enabling HA):
export bds_ha_enabled='Yes'
export bds_ha_nodes='<controller IP list>'
(e.g. export
bds_ha_nodes='16.143.21.35,16.143.21.237,16.143.21.38')
Run startscript install:
/usr/local/bin/startscript --install
This places metricbeat.yaml in the workspace folder.
Deploy metricbeat deployment:
kubectl -n kube-system create -f /workspace/monitoring.yaml
Exit the bootstrap pod and scale down bootstrap deployment:
kubectl -n hpecp-bootstrap scale deployment/hpecp-bootstrap-hpecp-monitoring --replicas=0
BDP-685: Kubernetes cluster creation fails with an "internal error."
Workaround: Remove the Kubernetes hosts, verify that all system clocks are synchronized, and then re-add the hosts and recreate the Kubernetes cluster.
BDP-852: All uploaded files and new folders created by AD/LDAP users via the HPE Ezmeral Container Platform FS mounts interface will have root ownership and full permission for all tenant members.
Workaround: None at this time.
BDP-1868: An admin kubeconfig file downloaded from an imported external Kubernetes cluster will not contain expected edits from the HPE Ezmeral Container Platform web interface.
Workaround: Manually edit the kubeconfig file after download.
The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
BDP-582: Kubernetes manifest customizations revert to defaults after an HPE Ezmeral Container Platform failover
Symptom: In deployments that use a customized external Kubernetes manifest feed, unless the manifest feed is also configured on the shadow controller, manifest feeds revert to the default location after an HPE Ezmeral Container Platform failover. Therefore, manifest customizations are no longer in effect.
Workaround: After failover, manually restore the manifest feed. For
example, if the feed location is
https://bd-poonam.s3-us-west-1.amazonaws.com/epic-5.1/k8S_Manifest.json,
then execute the following commands:
On the Primary Controller:
/opt/bluedata/common-install/bd_mgmt/bin/bd_mgmt k8s manifest_change_feed https://bd-poonam.s3-us-west-1.amazonaws.com/epic-5.1/k8S_Manifest.json
On the Shadow Controller:
bdconfig --set bdshared_k8s_manifestfeed=https://bd-poonam.s3-us-west-1.amazonaws.com/epic-5.1/k8S_Manifest.json
For more details, see Updating the Kubernetes Manifest.
BDP-574: Unable to add a Kubernetes host when Platform HA (High Availability) is being enabled.
Workaround: Wait until Platform HA finishes before adding the Kubernetes host.
HAATHI-15093 : A GPU is visible in a non-GPU-requesting pod. When an app spawns on a device having a GPU, it is able to access the GPU even when there are no requests for one. This is a known issue with the NVIDIA k8s-device-plugin.
Workaround: You must manually create an environment variable in the
Kubedirectorcluster YAML that 'hides' the GPU from the App. The
variable is named NVIDIA_VISIBLE_DEVICES with value
VOID. For example:
apiVersion: "kubedirector.bluedata.io/apiVersion"
kind: "KubeDirectorCluster"
metadata:
  name: "sample-name"
spec:
  app: sample-app
  roles:
  - id: samplerole
    resources:
      requests:
        memory: "4Gi"
        cpu: "2"
      limits:
        memory: "4Gi"
        cpu: "2"
    env:
    - name: "NVIDIA_VISIBLE_DEVICES"
      value: "VOID"
The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
HAATHI-12651: Certain action script commands, such as tailf or
top, will continue to run a process on the host after the
action script is killed.
Workaround: Please contact HPE Technical Support.
HAATHI-12654: In virtual clusters with AD/LDAP user account integration and Edge nodes, there is some divergence between users who can SSH into a node and users who are allowed to run ActionScripts on the node. The latter is more permissive on Edge nodes. All ActionScript invocations are captured in the audit log.
HAATHI-12698: ActionScripts are not available for virtual clusters that were created prior to upgrading to version 5.2.
Workaround: Create new virtual clusters with the same specifications (distribution, number/flavor of virtual nodes, etc.), and then delete the old virtual clusters.
HAATHI-12656: Entering incorrect information when defining the Kerberos realm of a remote DataTap could cause all further attempts to create a DataTap using that realm to fail (such as using the correct value for realm but incorrect values for Host and/or Port).
Workaround: Edit /etc/krb5.conf on the Controller and
enter the correct values for Host and/or Port. After this, re-create
the DataTap.
The following issues were identified in a version of HPE Ezmeral Container Platform prior to version 5.3.1. Unless otherwise noted, these issues also apply to later 5.3.x releases.
HAATHI-14109: When using CEPH for persistent storage, a discrepancy between the client and server versions will cause HPE Ezmeral Container Platform to fail to load App Store images with the error "Failed to map the volume."
Workaround: Remove the persistent storage until the client and server versions are the same.
HAATHI-14192: Running the Impala shell on a container where the Impala daemon is not running.
Workaround: Use the -i option to refer to the worker
node, e.g. impala-shell -i <worker hostname>.
HAATHI-14461: Notebooks with a name that includes one or more spaces cannot be committed to GitHub. When working in an AI/ML project that includes a GitHub repository, creating a Jupyterhub notebook with a name that includes one or more spaces will cause an error when trying to commit that notebook to GitHub.
Workaround: Do not include any spaces when naming a Jupyterhub notebook.
HAATHI-10733: Hive jobs that use DataTap paths may fail with a SemanticException
error. When Hive creates a table, the location where the table metadata is
stored comes from the Hive configuration parameter fs.defaultFS by
default (which will point to the cluster file system). If a Hive job references
DataTap paths outside of the file system where the table metadata is stored, then
the job will fail with a SemanticException error because Hive
enforces that all data sources must come from the same file system.
Workaround: Explicitly set the table metadata location to a path on the
same DataTap that you will use for the job inputs and/or outputs, using the
LOCATION clause when creating the table. For example, if you
intend to use the TenantStorage DataTap, you would set the table metadata
location to some path on that DataTap such as:
CREATE TABLE docs (c1 INT, c2 STRING) LOCATION
'dtap://TenantStorage/hive-table-docs'
HAATHI-12546: Some http links in applications running on HPE Ezmeral Container Platform show the hostname of the instance. These links will not work when HPE Ezmeral Container Platform is installed with the non-routable network option.
Workaround: See "Configure Client to use Hostname instead of IP Address, below."
HAATHI-13254: If a user updates an app inside a container instead of via the App Store screen, then cluster expansion will fail.
Workaround: Expand the cluster before performing the upgrade. Once the update is complete, edit the classpath to point to the correct .jar files, such as hadoop-common-*.jar.
DOC-9: Cloudera Manager reports incorrect values for a node's resources.
Cloudera Manager accesses the Linux /proc file system to determine
the characteristics of the nodes it is managing. Because container technology is
used to implement virtual nodes, this file system reports information about the host
rather than about the individual node, causing Cloudera Manager to report inflated
values for a node's CPU count, memory, and disk.
Workaround: Use the web interface to see a node's virtual hardware configuration (flavor).
DOC-19: Spark applications may wait indefinitely if no free vCPUs are available. This is a general Spark behavior, but it is worth some emphasis in an environment where various virtual hardware resources (possibly in small amounts) can be quickly provisioned for use with Spark.
A Spark application will be stuck in the Waiting state if all vCPUs in the
cluster are already considered to be in-use (by the Spark framework and other
running Spark applications). In Spark version 1.5, the thrift server is configured
to use 2 vCPUs on the Spark master node by default. You can reduce this to 1 vCPU by
editing the total-executor-cores argument value in the /etc/init.d/hive-thriftserver script, and then restarting the thrift server ($ sudo service hive-thriftserver restart).
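That edit can also be scripted. A sketch against a stand-in file follows; the exact option format inside /etc/init.d/hive-thriftserver is an assumption here, so verify the real script before applying the same substitution to it:

```shell
# Stand-in for /etc/init.d/hive-thriftserver; the argument format below
# is assumed, not taken from the real script.
script=$(mktemp)
printf 'SPARK_ARGS="--total-executor-cores 2"\n' > "$script"

# Reduce the thrift server's vCPU reservation from 2 to 1.
sed -i 's/--total-executor-cores 2/--total-executor-cores 1/' "$script"
cat "$script"
# Then restart the service: sudo service hive-thriftserver restart
```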
K8S-1887: A MapR software version alarm is generated, indicating that “One or more services on the node are running an unexpected version.” The alarm includes a “recommended action” to stop and restart the node.
Workaround: You can ignore the alarm and recommended action for container-based HPE Ezmeral Data Fabric.