Data Sources map a name to a path on a storage resource. AI/ML
clusters within projects can then access paths within that resource using that
name. This allows you to access data for AI/ML projects using your existing data systems
without making time-consuming copies or transfers of your data. Project Member
and Project Administrator users can quickly build, edit, and remove Data Sources
using the Data Sources screen, as described in The Data Sources Screen.
Each Data Source requires the following properties to be configured, depending on
the type of storage being connected to (MapR, HDFS, HDFS with Kerberos, or
NFS):
- Name: A unique name for each Data Source. This name may contain letters (A-Z
or a-z), digits (0-9), and hyphens (-), but may not contain spaces. You can use the
name of a valid Data Source to compose Data Source URIs that you pass
to applications as arguments. Each such URI maps to some path on the storage system
that the Data Source points to. The path indicated by a URI might or might
not exist at the time you start a job, depending on what the application wants to do
with that path. Sometimes the path must indicate a directory or file that already
exists, because the application intends to use it as input. Sometimes, the path must
not currently exist, because the application expects to create it. The semantics of
these paths are entirely application-dependent, and are identical to their behavior
when running the application on a physical Hadoop or Spark platform.
- Type: Type of file system used by the shared storage resource associated with
the Data Source (MAPR, HDFS, or NFS). This is completely
transparent to the end job or other process using the Data Source.
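As an illustration only, the naming rule above can be expressed as a simple check. The helper below is hypothetical and not part of the product; it simply encodes the allowed character set (letters, digits, and hyphens):

```python
import re

# Allowed characters per the naming rule above: letters (A-Z or a-z),
# digits (0-9), and hyphens (-). Spaces are not permitted.
# Illustrative helper only; not part of the product API.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_data_source_name(name: str) -> bool:
    """Return True if name contains only letters, digits, and hyphens."""
    return bool(NAME_PATTERN.match(name))
```

For example, `training-data-01` passes this check, while `my datasource` (which contains a space) does not.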
The following fields depend on the Data Source type:
MapR
Note: All of the links to HPE Ezmeral Data Fabric
articles in this section will open in a new browser tab/window.
A MapR Data Source is configured as follows:
- Cluster Name: Name of the MapR cluster. See the HPE Ezmeral Data Fabric articles Creating the Cluster and Creating a Volume.
- CLDB Hosts: DNS name or address of the service providing access to the
storage resource. For example, this could be the namenode of a MapR cluster. See
the HPE Ezmeral Data Fabric article Viewing CLDB Information.
- Port: Port for the namenode server on the host used to access the MapR
file system. See the HPE Ezmeral Data Fabric article Specifying Ports.
- Mount Path: Complete path to the directory containing the data within the
specified MapR file system. You can leave this field blank if you intend the
Data Source to point at the root of the specified share/volume/file system. See
the HPE Ezmeral Data Fabric articles Viewing Volume Details and Creating a Volume.
- MapR Secure: Checking this check box enables the MapR Secure feature.
MapR includes both the MapR Data Platform and MEP components, and is secure
out-of-the-box on all new installations. All network connections require
authentication, and all data in motion is protected with wire-level encryption.
MapR allows applying direct security protection for data as it comes into and
out of the platform without requiring an external security manager server or a
particular security plug-in for each ecosystem component. The security semantics
are applied automatically to data being retrieved or stored by any ecosystem
component, application, or user. See the HPE Ezmeral Data Fabric article Security.
- Ticket: Enter the complete path to the MapR ticket. MapR uses tickets for
authentication. Tickets contain keys that are used to authenticate users and
MapR servers. In addition, certificates are used to implement server
authentication. Every user who wants to access a cluster must have a MapR user
ticket (maprticket_&lt;uid&gt;), and every node in the cluster
must have a MapR server ticket (maprserverticket). Tickets are
encrypted to protect their contents. See the HPE Ezmeral Data Fabric articles Tickets and How Tickets Work.
- Ticket Type: Select the ticket type. This will be one of the following:
  - User: Grants access to individual users with no impersonation support. The ticket UID is used as the identity of the entity using this ticket.
  - Service: Accesses services running on client nodes with no impersonation support. The ticket UID is used as the identity of the entity using this ticket.
  - Service (with impersonation): Accesses services running on client nodes to run jobs on behalf of any user. The ticket cannot be used to impersonate the root or mapr users.
  - Tenant: Allows tenant users to access tenant volumes in a multi-tenant environment. The ticket can impersonate any user.
- Ticket User: Username to be used by the ticket for authentication.
- MapR Tenant Volume: Volume to be accessed by the Data Source. See the
HPE Ezmeral Data Fabric article Enabling and Restricting Access to Tenant
Volume and Data.
- Enable Impersonation: Enable user impersonation. See the HPE Ezmeral Data Fabric article Impersonation.
HDFS or NFS
An HDFS or NFS Data Source is configured as follows:
- Host: DNS name or address of the service providing access to the storage
resource. For example, this could be the namenode of an HDFS cluster.
- Share: For NFS Data Sources, this is the exported share on the selected
host.
- Port: For HDFS Data Sources, this is the port for the namenode server on
the host used to access the HDFS file system.
- Path: Complete path to the directory containing the data within the
specified NFS share or HDFS file system. You can leave this field blank if you
intend the Data Source to point at the root of the specified
share/volume/file system.
- Standby NameNode: DNS name or IP address of a standby namenode that an
HDFS Data Source will try to reach if it cannot contact the primary host.
This field is optional; when used, it provides high-availability access to the
specified HDFS Data Source.
- Kerberos parameters: If the HDFS Data Source has Kerberos enabled,
then you will need to specify additional parameters. See Kerberos Security - Data Sources.
Using a Data Source
The storage pointed to by a Data Source can be accessed by any AI/ML activity in an
HPE Ezmeral ML Ops virtual node via a URI that
includes the name of the Data Source.
A Data Source points to the top of the “path” configured for the given Data Source.
The URI has the following form:
dtap://data_source_name/
Here, data_source_name is the name of the Data Source
that you wish to use. You can access files and directories further in the hierarchy
by appending path components to the URI:
dtap://data_source_name/some_subdirectory/another_subdirectory/some_file
For example, the URI dtap://mydatasource/home/mydirectory means
that the data is located within the /home/mydirectory directory
in the storage that the Data Source named mydatasource points to.
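The URI composition described above can be sketched as a small helper. This is an illustrative sketch only; `dtap_uri` and the Data Source name `mydatasource` are hypothetical, not part of the product:

```python
def dtap_uri(data_source_name: str, *path_parts: str) -> str:
    """Build a dtap:// URI rooted at the named Data Source.

    With no path parts, the URI points at the top of the Data Source's
    configured path; each additional part descends one level in the
    hierarchy. Illustrative helper only; not a product function.
    """
    base = f"dtap://{data_source_name}/"
    return base + "/".join(path_parts)
```

For example, `dtap_uri("mydatasource", "home", "mydirectory")` yields `dtap://mydatasource/home/mydirectory`, which you could pass to an application as an ordinary path argument.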
Data Sources exist on a per-project basis. This means that a Data Source created for
Project A cannot be used by Project B. You may, however, create a Data Source for
Project B with the exact same properties as its counterpart for Project A, thus
allowing both projects to access the same storage resource. Further, multiple jobs
within a project may use a given Data Source simultaneously. While this can
be useful for sharing data or models across projects, be aware that the same cautions
and restrictions apply to these use cases as for other types of shared storage:
multiple jobs modifying files at the same location may lead to file access errors
and/or unexpected job results.
Users who have a Project Administrator role may view and modify detailed Data Source
information. Project Members may only view general Data Source information and are
unable to create, edit, or remove a Data Source.
CAUTION:
Data conflicts may occur if more than one Data Source points to a
location being used by multiple jobs at once.
CAUTION:
Editing or deleting a Data Source while it is being used by one or
more running jobs may cause errors in the affected jobs.