Data Sources map a name to a path on a storage resource. AI/ML
clusters within projects can then access paths within that resource using that
name. This allows you to access data for AI/ML projects using your existing data systems
without making time-consuming copies or transfers of your data. Project Member
and Project Administrator users can quickly build, edit, and remove Data Sources
using the Data Sources screen, as described in The Data Sources Screen.
Each Data Source requires the following properties to be configured, depending on
the type of storage being connected to (MapR, HDFS, HDFS with Kerberos, or
NFS):
- Name: A unique name for each Data Source. This name may contain letters (A-Z
or a-z), digits (0-9), and hyphens (-), but may not contain spaces. You can use the
name of a valid Data Source to compose Data Source URIs that you pass
to applications as arguments. Each such URI maps to some path on the storage system
that the Data Source points to. The path indicated by a URI might or might
not exist at the time you start a job, depending on what the application wants to do
with that path. Sometimes the path must indicate a directory or file that already
exists, because the application intends to use it as input. Sometimes, the path must
not currently exist, because the application expects to create it. The semantics of
these paths are entirely application-dependent, and are identical to their behavior
when running the application on a physical Hadoop or Spark platform.
- Type: Type of file system used by the shared storage resource associated with
the Data Source (MAPR, HDFS, or NFS). This is completely
transparent to the end job or other process using the Data Source.
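As an illustration only, the naming rule above can be expressed as a simple check. The helper below is hypothetical and not part of the product; it simply encodes the allowed character set (letters, digits, and hyphens):

```python
import re

# Allowed characters per the naming rule above: letters (A-Z or a-z),
# digits (0-9), and hyphens (-). Spaces are not permitted.
# Illustrative helper only; not part of the product API.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9-]+$")

def is_valid_data_source_name(name: str) -> bool:
    """Return True if name contains only letters, digits, and hyphens."""
    return bool(NAME_PATTERN.match(name))
```

For example, `training-data-01` passes this check, while `my datasource` (which contains a space) does not.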
The following fields depend on the Data Source type:
MapR
Note: All of the links to HPE Ezmeral Data Fabric
articles in this section will open in a new browser tab/window.
A MapR Data Source is configured as follows:
- Cluster Name: Name of the MapR cluster. See the HPE Ezmeral Data Fabric articles Creating the Cluster and Creating a Volume.
- CLDB Hosts: DNS name or address of the service providing access to the
storage resource. For example, this could be the namenode of a MapR cluster. See
the HPE Ezmeral Data Fabric article Viewing CLDB Information.
- Port: Port for the namenode server on the host used to access the MapR
file system. See the HPE Ezmeral Data Fabric article Specifying Ports.
- Mount Path: Complete path to the directory containing the data within the
specified MapR file system. You can leave this field blank if you intend the
Data Source to point at the root of the specified share/volume/file system. See
the HPE Ezmeral Data Fabric articles Viewing Volume Details and Creating a Volume.
- MapR Secure: Checking this check box enables the MapR Secure feature.
MapR includes both the MapR Data Platform and MEP components, and is secure
out-of-the-box on all new installations. All network connections require
authentication, and all data in motion is protected with wire-level encryption.
MapR allows applying direct security protection for data as it comes into and
out of the platform without requiring an external security manager server or a
particular security plug-in for each ecosystem component. The security semantics
are applied automatically to data being retrieved or stored by any ecosystem
component, application, or user. See the HPE Ezmeral Data Fabric article Security.
- Ticket: Enter the complete path to the MapR ticket. MapR uses tickets for
authentication. Tickets contain keys that are used to authenticate users and
MapR servers. In addition, certificates are used to implement server
authentication. Every user who wants to access a cluster must have a MapR user
ticket (maprticket_&lt;uid&gt;), and every node in the cluster
must have a MapR server ticket (maprserverticket). Tickets are
encrypted to protect their contents. See the HPE Ezmeral Data Fabric articles Tickets and How Tickets Work.
- Ticket Type: Select the ticket type. This will be one of the following:
  - User: Grants access to individual users with no impersonation support. The ticket UID is used as the identity of the entity using this ticket.
  - Service: Accesses services running on client nodes with no impersonation support. The ticket UID is used as the identity of the entity using this ticket.
  - Service (with impersonation): Accesses services running on client nodes to run jobs on behalf of any user. The ticket cannot be used to impersonate the root or mapr users.
  - Tenant: Allows tenant users to access tenant volumes in a multi-tenant environment. The ticket can impersonate any user.
- Ticket User: Username to be used by the ticket for authentication.
- MapR Tenant Volume: Volume to be accessed by the Data Source. See the
HPE Ezmeral Data Fabric article Enabling and Restricting Access to Tenant
Volume and Data.
- Enable Impersonation: Enable user impersonation. See the HPE Ezmeral Data Fabric article Impersonation.
HDFS or NFS
An HDFS or NFS Data Source is configured as follows:
- Host: DNS name or address of the service providing access to the storage
resource. For example, this could be the namenode of an HDFS cluster.
- Share: For NFS Data Sources, this is the exported share on the selected
host.
- Port: For HDFS Data Sources, this is the port for the namenode server on
the host used to access the HDFS file system.
- Path: Complete path to the directory containing the data within the
specified NFS share or HDFS file system. You can leave this field blank if you
intend the Data Source to point at the root of the specified
share/volume/file system.
- Standby NameNode: DNS name or IP address of a standby namenode that an
HDFS Data Source will try to reach if it cannot contact the primary host.
This field is optional; when used, it provides high-availability access to the
specified HDFS Data Source.
- Kerberos parameters: If the HDFS Data Source has Kerberos enabled,
then you will need to specify additional parameters. See Kerberos Security - Data Sources.
Using a Data Source
The storage pointed to by a Data Source can be accessed by any AI/ML activity in an
HPE Ezmeral ML Ops virtual node via a URI that
includes the name of the Data Source.
A Data Source points to the top of the “path” configured for the given Data Source.
The URI has the following form:
dtap://data_source_name/
Here, data_source_name is the name of the Data Source
that you wish to use. You can access files and directories further in the hierarchy
by appending path components to the URI:
dtap://data_source_name/some_subdirectory/another_subdirectory/some_file
For example, the URI dtap://mydatasource/home/mydirectory means
that the data is located within the /home/mydirectory directory
in the storage that the Data Source named mydatasource points to.
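The URI composition described above can be sketched as a small helper. This is an illustrative sketch only; `dtap_uri` and the Data Source name `mydatasource` are hypothetical, not part of the product:

```python
def dtap_uri(data_source_name: str, *path_parts: str) -> str:
    """Build a dtap:// URI rooted at the named Data Source.

    With no path parts, the URI points at the top of the Data Source's
    configured path; each additional part descends one level in the
    hierarchy. Illustrative helper only; not a product function.
    """
    base = f"dtap://{data_source_name}/"
    return base + "/".join(path_parts)
```

For example, `dtap_uri("mydatasource", "home", "mydirectory")` yields `dtap://mydatasource/home/mydirectory`, which you could pass to an application as an ordinary path argument.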
Data Sources exist on a per-project basis. This means that a Data Source created for
Project A cannot be used by Project B. You may, however, create a Data Source for
Project B with the exact same properties as its counterpart for Project A, thus
allowing both projects to access the same storage resource. Further, multiple jobs
within a project may use a given Data Source simultaneously. While this can
be useful for sharing data or models across projects, be aware that the same cautions
and restrictions apply to these use cases as for other types of shared storage:
multiple jobs modifying files at the same location may lead to file access errors
and/or unexpected job results.
Users who have a Project Administrator role may view and modify detailed Data Source
information. Project Members may only view general Data Source information and are
unable to create, edit, or remove a Data Source.
CAUTION:
Data conflicts may occur if more than one Data Source points to a
location being used by multiple jobs at once.
CAUTION:
Editing or deleting a Data Source while it is being used by one or
more running jobs may cause errors in the affected jobs.