DataTaps expand access to shared data by specifying a named path to a storage
resource. Applications running within virtual clusters that can use HDFS file system
protocols can then access paths within that resource by that name, because DataTaps
implement the Hadoop File System API. This allows you to run jobs using your existing
data systems without making time-consuming copies or transfers of your data.
Tenant/Project Administrator users can quickly and easily build, edit, and remove
DataTaps using the DataTaps screen, as described in The DataTaps Screen (Admin). Tenant
Member users can access DataTaps by name.
Each DataTap requires the following properties to be configured, depending on the type of
storage being connected to (MapR, HDFS, HDFS with Kerberos, NFS, or GCS):
- Name: A unique name for each DataTap. This name may contain letters (A-Z or
a-z), digits (0-9), and hyphens (-), but may not contain spaces. You can use the
name of a valid DataTap to compose DataTap URIs that you pass to applications as
arguments. Each such URI maps to some path on the storage system that the DataTap
points to. The path indicated by a URI might or might not exist at the time you
start a job, depending on what the application wants to do with that path. Sometimes
the path must indicate a directory or file that already exists, because the
application intends to use it as input. Sometimes, the path must not currently
exist, because the application expects to create it. The semantics of these paths
are entirely application-dependent, and are identical to their behavior when
running the application on a physical Hadoop or Spark platform.
- Description: Brief description of the DataTap, such as the type of data or
the purpose of the DataTap.
- Type: Type of file system used by the shared storage resource associated with
the DataTap (MapR, HDFS, NFS, or GCS). The type is completely
transparent to the job or other process using the DataTap.
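As a minimal illustration of the naming rule above, the check below sketches it in Python. The regular expression is an assumption derived from the stated rule (letters, digits, and hyphens only, no spaces); it is not the platform's actual validator.

```python
import re

# Assumed pattern from the stated rule: letters (A-Z, a-z), digits (0-9),
# and hyphens (-) only; spaces are not allowed. Illustration only.
DATATAP_NAME = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_datatap_name(name: str) -> bool:
    """Return True if the name uses only letters, digits, and hyphens."""
    return bool(DATATAP_NAME.match(name))

print(is_valid_datatap_name("sales-data-2023"))  # True
print(is_valid_datatap_name("sales data"))       # False (contains a space)
```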
The following fields depend on the DataTap type:
MapR
Note:
All of the links to MapR articles in this section will open
in a new browser tab/window.
A MapR DataTap is configured as follows:
- Cluster Name: Name of the MapR cluster. See the
MapR articles Creating the Cluster and Creating a Volume.
- CLDB Hosts: DNS name or address of the container location
database of a MapR cluster. See the MapR article Viewing CLDB Information.
- Port: Port for the namenode service on the host used to
access the MapR file system. See the MapR article Specifying Ports.
- Mount Path: Complete path to the directory containing the data within the
specified MapR file system. You can leave this field blank if you intend the
DataTap to point at the root of the MapR cluster. See the MapR articles Viewing Volume Details and Creating a Volume.
- MapR Secure: Check this check box if the MapR cluster is
secured. When the MapR cluster is secured, all network connections
require authentication, and data in motion is protected with wire-level encryption.
MapR can apply security protection directly to data as it
comes into and out of the platform without requiring an external security
manager server or a particular security plug-in for each ecosystem component.
The security semantics are applied automatically to data being retrieved or
stored by any ecosystem component, application, or user. See the MapR article
Security.
- Ticket Source: Select the ticket source. This will be one of the
following:
- Upload Ticket File: Upload a new ticket file.
- Use the Existing One: Use the details of a previously uploaded ticket file.
- Ticket File: This will be one of the following:
- When Upload Ticket File is selected, the Browse button is
enabled to select the ticket file.
- When Use the Existing One is selected, this is the name of the
existing ticket file.
- Enable Impersonation: Enable user impersonation. To enable user
impersonation, user authentication (such as AD/LDAP) must be configured on the
MapR cluster.
- Select Ticket Type: Select the ticket type. This will be one of the
following:
- User: Grants access to individual users with no impersonation
support. The ticket UID is used as the identity of the entity using this
ticket.
- Service: Accesses services running on client nodes with no
impersonation support. The ticket UID is used as the identity of the
entity using this ticket.
- Service (with impersonation): Accesses services running on client
nodes to run jobs on behalf of any user. The ticket cannot be used to
impersonate the root or mapr users.
- Tenant: Allows tenant users to access tenant volumes in a
multi-tenant environment. The ticket can impersonate any user.
- Ticket User: Username to be included in the ticket for
authentication.
- MapR Tenant Volume: Indicates whether or not the mount path is a
MapR tenant volume. See the MapR article Setting Up a Tenant.
HDFS
An HDFS DataTap is configured as follows:
- Host: DNS name or IP address of the server providing access to the
storage resource. For example, this could be the host running the namenode
service of an HDFS cluster.
- Standby NameNode: DNS name or IP address of a standby namenode host that
an HDFS DataTap will try to reach if it cannot contact the primary host. This
field is optional; when used, it provides high-availability access to the
specified HDFS DataTap.
- Port: For HDFS DataTaps, this is the port for the namenode server on the
host used to access the HDFS file system.
- Path: Complete path to the directory containing the data within the
specified HDFS file system. You can leave this field blank if you intend the
DataTap to point at the root of the specified file system.
- Kerberos parameters: If the HDFS DataTap has Kerberos enabled, then you
will need to specify additional parameters. HPE Ezmeral Container Platform
supports two modes of user access/authentication.
- Proxy mode permits a "proxy user" to be configured to have access to the
remote HDFS cluster. Individual users are granted access to the remote
HDFS cluster via the proxy user configuration. The Hadoop distribution of
the compute cluster does not need to match that of the remote HDFS
cluster. See Sample HDFS Proxy DataTap.
- Passthrough mode passes the credentials of the current user to the
remote HDFS cluster for authentication. See Sample HDFS Passthrough
DataTap for an example.
- HDFS file systems configured with TDE encryption as well as cross-realm Kerberos
authentication are supported. See HDFS DataTap TDE Configuration and HDFS DataTap
Cross-Realm Kerberos Authentication for additional configuration
instructions.
NFS
Note: This option is not available for Kubernetes tenants.
An NFS DataTap is configured as follows:
- Host: DNS name or IP address of the server providing
access to the storage resource.
- Share: The exported share on the selected
host.
- Path: Complete path to the directory containing the data
within the specified NFS share. You can leave this field blank if you intend the
DataTap to point at the root of the specified share.
GCS
A GCS DataTap is configured as follows:
- Bucket Name: Specify the bucket name for GCS.
- Credential File Source: This will be one of the
following:
- When Upload Credential File is selected, the
Browse button is enabled to select the
credential file. The credential file is a JSON file that
contains the service account key.
- When Use the Existing One is selected, enter the
name of the previously uploaded credential file.
- Proxy: Optional. Specify the HTTP proxy used to access GCS.
- Mount Path: Enter a path within the bucket that will serve as the starting
point for the DataTap. If the path is not specified, the starting point
defaults to the root of the bucket.
Using a DataTap
The storage pointed to by a DataTap can be accessed via a URI that includes
the name of the DataTap.
A DataTap points to the top of the “path” configured for the given DataTap. The URI
has the following form:
dtap://datatap_name/
In this example, datatap_name is the name of the DataTap that you
wish to use. You can access files and directories further in the hierarchy by
appending path components to the URI:
dtap://datatap_name/some_subdirectory/another_subdirectory/some_file
For example, the URI dtap://mydatatap/home/mydirectory means that
the data is located within the /home/mydirectory directory in the
storage that the DataTap named mydatatap points to.
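The URI composition described above can be sketched as a small helper. The function below is hypothetical, for illustration only; it is not part of any platform SDK. Applications simply receive the resulting string as an argument.

```python
def dtap_uri(datatap_name: str, *path_components: str) -> str:
    """Build a dtap:// URI from a DataTap name and optional path components.

    Hypothetical helper for illustration; the DataTap name itself is
    configured by a Tenant/Project Administrator on the DataTaps screen.
    """
    path = "/".join(path_components)
    return f"dtap://{datatap_name}/{path}" if path else f"dtap://{datatap_name}/"

print(dtap_uri("mydatatap"))                         # dtap://mydatatap/
print(dtap_uri("mydatatap", "home", "mydirectory"))  # dtap://mydatatap/home/mydirectory
```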
DataTaps exist on a per-tenant basis. This means that a DataTap created for Tenant A
cannot be used by Tenant B. You may, however, create a DataTap for Tenant B with the
exact same properties as its counterpart for Tenant A, thus allowing both tenants to
access the same storage resource. Further, multiple jobs within a tenant may use a
given DataTap simultaneously. While such sharing can be useful, be aware that the
same cautions and restrictions apply to these use cases as for other types of shared
storage: multiple jobs modifying files at the same location may lead to file access
errors and/or unexpected job results.
Users who have a Tenant Administrator role may view and modify detailed DataTap
information. Members may only view general DataTap information and are unable to
create, edit, or remove a DataTap.
CAUTION:
Data conflicts may occur if more than one DataTap points to a
location being used by multiple jobs at once.
CAUTION:
Editing or deleting a DataTap while it is being used by one or
more running jobs may cause errors in the affected jobs.