CGE Error Messages and Resolution Information

Describes common CGE issues and provides information about troubleshooting.

The most common errors encountered while using CGE involve a failure to connect to a database server. The exact error depends on what goes wrong. The following table lists common error messages along with troubleshooting techniques.

Table 1. CGE Error Messages and Troubleshooting Information
Error Message | Description | Resolution
Error message: Unable to establish a connection to the database server at host:port as it does not appear to be running
Description: The CLI tried to connect to a database server running on the given host and port combination but was unable to establish a connection. This typically means one of two things:
  1. There is no database server running on that host and port
  2. Firewall rules are preventing access to that host and port
Resolution:
  • Verify that you have passed the correct host and port to the CLI
  • Verify that there is a database server running on that host and port
  • Verify that there are no firewall rules that are preventing access to the host and port. Contact a system administrator for additional information.
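The checks for a running server and for blocking firewall rules can be approximated from the client machine with a plain TCP probe. The following is a minimal sketch and not part of CGE: the function name can_reach is hypothetical, and a successful connection only shows that something is listening on that host and port, not that it is a CGE database server.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_reach("dbhost", 3750) (both values are placeholders) returns False when nothing is listening there or when a firewall drops the traffic. The probe cannot tell those two cases apart, so a False result still calls for the manual checks listed above.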
Error message: Unable to authenticate to the database server at host:port. You do not have any SSH keys present in your configured identity Directory
Description: The CLI tried to connect to a database server running on the given host and port combination. A connection was established successfully, but authentication to the database server failed because there are no SSH keys configured.
Resolution: Create at least one SSH key and place it in the configured identity directory.
Error message: Unable to authenticate to the database server at host:port. Your SSH key(s) from your configured identity directory are not in the authorized_keys file of the database or its owner
Description: The CLI tried to connect to a database server running on the given host and port combination. A connection was established successfully, but authentication to the database server failed because none of the SSH keys were in the authorized_keys file that the database is using.

This may also be caused by the CLI selecting the wrong SSH identity. As described in the SSH identities section, the first identity found by searching several default locations is used, but this may not always be the desired identity.

Resolution:
  • Review the database logs (if possible) to see which authorized_keys file is in use:
    • If you launched the database server, the file is either in the database directory itself or in the ~/.cge directory
    • If another user launched the database server, contact them to find out which authorized_keys file is in use
  • Add the public key to the relevant authorized_keys file, or ask the relevant user to do so
  • Use the --identity option to specify the desired identity directory
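Before editing an authorized_keys file or asking another user to do so, it can help to confirm whether the public key is already listed. The sketch below assumes the file follows the usual OpenSSH layout (optional options, key type, base64 key data, optional comment, one key per line). This section does not state the format CGE uses, so treat that as an assumption. The function names are hypothetical, and quoted options that contain spaces are not handled.

```python
from pathlib import Path
from typing import Optional


def key_blob(line: str) -> Optional[str]:
    """Extract the base64 key data from one OpenSSH-style public key line."""
    fields = line.split()
    for i, field in enumerate(fields[:-1]):
        # Key type tokens look like ssh-rsa, ssh-ed25519, ecdsa-sha2-nistp256.
        if field.startswith(("ssh-", "ecdsa-")):
            return fields[i + 1]
    return None


def is_authorized(public_key_file: Path, authorized_keys_file: Path) -> bool:
    """Return True if the public key appears in the authorized_keys file."""
    wanted = key_blob(public_key_file.read_text())
    if wanted is None:
        return False
    for line in authorized_keys_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if key_blob(line) == wanted:
            return True
    return False
```

Comparing only the key data, rather than whole lines, means a differing trailing comment does not cause a false mismatch.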
Error message: Host key for host host:port is not trusted, please run in interactive mode and trust this key or manually add the host key to your known_hosts file in your configured identity idDirectory
Description: The CLI tried to connect to a database server running on the given host and port combination. A connection was successfully established, but the database server was unable to prove its identity to the CLI because the host key provided by the database server was not trusted.

This error is usually seen only the first time a connection to a specific server instance is established. Once the key is trusted (see resolution steps), this error should no longer be seen for this host and port combination.

Resolution:
  • If CGE is being run in interactive mode, the system prompts you to trust the host key. Enter Yes to do so.
  • To use CGE non-interactively, add the --trust-keys option to commands to automatically trust previously unknown host keys
Error message: Timed out attempting to establish a database connection (waited N seconds), database server may be too busy to service your request currently
Description: The CLI tried to connect to a database server running on the given host and port combination but was unable to establish a connection within the timeout interval. This means that the database server is currently busy processing another request and cannot accept the request at this time.
Resolution:
  • Check the database logs to see what the database is currently doing
    • If the last log message states: "Trying to read RPN message from network..." then the database is ready, otherwise the database is busy
  • If the database is busy, there are a number of options that can be used to troubleshoot the issue:
    • Execute the request again later
    • Increase the timeout with the --timeout option to wait for a longer period of time
    • Disable the timeout by setting --timeout 0 to wait indefinitely until the database server is ready to process the next request
  • In rare cases, the database may have hung. This is the most likely cause if the database is busy and no new log messages have appeared for a long period of time. In that case, kill and restart the database server, then retry your commands.
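The log check described above can be scripted. The following is a minimal sketch, assuming the database log is a plain-text file that you can read; the function names are hypothetical and the log location is not given in this section. It reads the whole file, which is fine for a quick check but wasteful for very large logs.

```python
from pathlib import Path

# Log message that indicates the database is waiting for a request.
READY_MARKER = "Trying to read RPN message from network..."


def last_log_line(log_path: Path) -> str:
    """Return the last non-empty line of the log, or an empty string."""
    lines = log_path.read_text(errors="replace").splitlines()
    non_empty = [ln.strip() for ln in lines if ln.strip()]
    return non_empty[-1] if non_empty else ""


def database_is_ready(log_path: Path) -> bool:
    """Return True if the last log message shows the database is ready."""
    return READY_MARKER in last_log_line(log_path)
```

If database_is_ready returns False on repeated checks and the last line never changes, that matches the hung-database case described in the final bullet.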
Error message: Server failed to start up
Description: One or more of the CGE job steps failed to launch because CGE was not found.
Resolution: Try relaunching CGE. In addition, ensure that all compute nodes are correctly configured. In particular, verify the following:
  • The same version of CGE is installed on all compute nodes and the login nodes
  • All shared file systems are mounted, and mounted in the same place, on all compute nodes and the login nodes
  • The munged process is running on all compute nodes
If any of the preceding conditions is not true and relaunching the CGE CLI does not correct the problem, contact Cray Support.
Error message: Not enough symmetric heap for new sorting keys
Description: There is not enough symmetric heap for new sorting keys.
Resolution: Use the -H option to cge-launch to set the symmetric heap to a larger value. As a starting point, try doubling the default value shown near the top of the log.

The symmetric heap value is an upper bound on a resource that is allocated as needed, so using a larger than necessary value does not mean that this amount will be allocated. It only means that no more than this amount will be allocated. It is better to overestimate by a bit than to underestimate.

Error message: [PE_64]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 20219 listen_sock = 5 Address already in use
Description: This may be due to leftover cge-server processes.
Resolution: Follow the instructions documented in Terminate Orphaned cge-server Jobs.
Error message: Error: Timed out waiting for the server to start running
Description: When a computational loop during a database build takes an extremely long time without producing any indication of forward progress (generally some kind of output in the log), cge-launch may decide that the start up sequence has hung and terminate it with this message.
Resolution: Change the interval used to detect a start up hang from its default setting of 900 seconds to a longer interval. To change this, use the --startupTimeout=seconds option to cge-launch. If you know that a dataset is very computationally intensive to build and is prone to such timeouts, setting this value to 3600 seconds (an hour) is almost certain to eliminate this failure, at the cost of taking much longer to detect an actual hang in start up.
Error message: HTTP Errors are reported by a tool or API
Description: A request submitted to the HTTP interface provided by the cge-cli fe command was not successful. If the request was submitted via a tool or API, only minimal error details may be reported directly to you. See the resolutions for ways to find more detailed error information.
Resolution:
  • Submit the same request using a browser. The browser window may contain additional error messages that indicate the underlying error. Review these carefully, since they may indicate one of the other common errors detailed in this table.
  • Review the front end logs, which record the HTTP error and associated error details. These may indicate one of the other common errors detailed in this table.
  • If there is no obvious cause and there are no additional error messages in the browser or front end logs, review the database logs for error messages that may indicate whether and why the request failed on the database server.
  • In rare cases, the offending request may have caused the database server to crash, in which case it will be necessary to relaunch it before making further requests
    • If a crash has occurred, report this to your Cray support representative
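When a browser is not convenient, the same request can be repeated from a script that keeps the response body instead of discarding it. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and this section does not document the URL layout of the front end or the content of its error responses, so whether the body carries useful detail is an assumption.

```python
import urllib.error
import urllib.request
from typing import Tuple


def fetch_with_details(url: str, timeout: float = 30.0) -> Tuple[int, str]:
    """Issue a GET request and return (status code, response body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError is also a response object; its body often holds the
        # detailed message that a tool or API wrapper hides.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Connection failures raise urllib.error.URLError rather than HTTPError and are deliberately left uncaught here, since they point to the connection errors described earlier in the table rather than to an HTTP-level failure.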
Error message: :inet_listen_socket_setup :inet_setup_listen_socket : bind failed port 1371 listen_sock = 5 Address already in use
Description: A previous cge-launch or HPC/mrun job failed or was killed, and the inet_listen socket is likely in the TIME_WAIT state on one or more of the compute nodes.
Resolution: Wait 60-90 seconds for the inet_listen socket (port 1371) to clear from the TIME_WAIT state. If the problem persists, the likely cause is that some other program has an active socket connection to port 1371 on one or more compute nodes. That application must release port 1371 on the affected nodes before new cge-launch or HPC/mrun jobs can be run on those nodes.
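Whether the port has cleared can be checked by attempting a bind on the affected compute node. The following is a minimal sketch, not a CGE tool: the function names are hypothetical, and it uses a plain bind without SO_REUSEADDR, which is an assumption about how closely it mirrors the server's own bind. The default polling settings cover roughly the 90 seconds mentioned above.

```python
import socket
import time


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can be bound to the port on this node."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use", the same failure as above.
            return False
        return True


def wait_for_port(port: int, attempts: int = 9, delay: float = 10.0,
                  host: str = "") -> bool:
    """Poll until the port can be bound; return False if it never clears."""
    for _ in range(attempts):
        if port_is_free(port, host):
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_port(1371) on the node named in the error returns True once the port is available. A False result after the full wait suggests that another program is holding the port, which is the persistent case described in the resolution.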
Error message: User user does not have permission to perform operation operation
Description: An action was requested for which the requesting user did not have the appropriate permissions.
Resolution:
  • Submit the request as a user who does have the appropriate permissions
  • Contact the database owner and ask if you can be granted the appropriate permissions