Reducing Failure Detection Time for File Clients

Describes how to set the time for Hadoop and POSIX clients to detect node failures.

To reduce the amount of time it takes (Hadoop and FUSE-based POSIX) clients to detect (CLDB and data node) failure, define the property, fs.mapr.connect.timeout, in the core-site.xml file. The value for this property can be set in 100 milliseconds and will be rounded up to the nearest 100 milliseconds. The minimum value for this property is 100 milliseconds, which can be incremented only by units of 100 milliseconds. Suppose a value of 260 milliseconds is specified, by default, the value will automatically be rounded up to 300 milliseconds. The default value for this property is 0, which means that the Linux TCP timeout setting will be used for connections if this property is not set.

Your entry in core-site.xml file should look similar to the following:

<property>
  <name>fs.mapr.connect.timeout</name>
  <value>200</value>
  <description>file client wait time of 200 milliseconds</description>
</property>

This setting (for hadoop and FUSE-based POSIX clients) ensures that the clients wait only for the specified amount of time to establish a connection. That is, it is used only for the first request sent to CLDB or a data node before or after a failure. For subsequent requests, the default system connection timeout value is used. In the event of a failure after a connection has been established, the client will wait for the connection to timeout (based on the system timeout value) before it contacts the next (CLDB or data) node to process the request.

For example, suppose the value for this parameter is 100 milliseconds and the Linux TCP connection timeout value is 30 seconds. When a hadoop or FUSE-based POSIX client contacts CLDB or a data node for the first time to establish a connection, the client will wait for 100 milliseconds before trying the next CLDB or data node. After a connection is established, for subsequent requests, the client will wait for 30 seconds for a response. If the node goes down after a connection has been established, the client will wait for 30 seconds before trying the next node. If the client contacts a recovered node for the first time, it will wait for 100 milliseconds to establish the connection.

Note: HPE Ezmeral Data Fabric filesystem does not use this property internally; it is used by Hadoop and FUSE-based POSIX clients only. This setting is not applicable to NFS gateway and loopbacknfs POSIX clients.

Detecting CLDB failures

When a connection with CLDB is established, CLDB returns the list of reachable and unreachable CLDB nodes on the cluster.

Populating the cache

The client stores information about the unreachable CLDB nodes in /tmp/cldbinfo/unreachableCldbs file on the client host. The format of this file is the same as the mapr-clusters.conf file (i.e., "clustername ip:port"). For example:

cat /tmp/cldbinfo/unreachableCldbs
  object_pools 10.10.104.33:7222 10.10.104.34:7222

The client reads the mapr-clusters.conf file and the unreachableCldbs file to determine the CLDB to connect to. It then tries to reach the available CLDB nodes first; it tries the unreachable CLDB nodes only if the available CLDB is unable to service its request.

Invalidating the cache

If the available CLDB is unable to service the client request, the client tries the unreachable CLDB. If an unreachable CLDB becomes reachable again, it is removed from the /tmp/cldbinfo/unreachableCldbs file, making it reachable for all subsequent IOs and if a reachable CLDB becomes unreachable, it is added to the /tmp/cldbinfo/unreachableCldbs file.