Data Sources

Each Data Source specifies a named path to a storage resource. AI/ML clusters within projects can then access paths within that resource using that name. This allows you to access data for AI/ML projects using your existing data systems without the need to make time-consuming copies or transfers of your data. Project Member and Administrator users can quickly and easily build, edit, and remove Data Sources using the Data Sources screen, as described in The Data Sources Screen.

Each Data Source requires the following properties to be configured, depending on the type of storage being connected to (MapR, HDFS, HDFS with Kerberos, or NFS):

The following fields depend on the Data Source type:

MapR

A MapR Data Source is configured as follows:

HDFS or NFS

An HDFS or NFS Data Source is configured as follows:

Using a Data Source

The storage pointed to by a Data Source can be accessed by any AI/ML activity in an HPE Ezmeral ML Ops virtual node via a URI that includes the name of the Data Source.

A Data Source points to the top of the “path” configured for the given Data Source. The URI has the following form:

dtap://data_source_name/

In this example, data_source_name is the name of the Data Source that you wish to use. You can access files and directories further in the hierarchy by appending path components to the URI:

dtap://data_source_name/some_subdirectory/another_subdirectory/some_file

For example, the URI dtap://mydatasource/home/mydirectory means that the data is located within the /home/mydirectory directory in the storage that the Data Source named mydatasource points to.
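
The URI form described above can be sketched in a few lines of Python. This is only an illustration of how such URIs are composed and taken apart; the helper names build_dtap_uri and split_dtap_uri are hypothetical and are not part of any HPE Ezmeral API. The sketch assumes that a Data Source name contains no "/" characters.

```python
from urllib.parse import urlsplit

DTAP_SCHEME = "dtap"


def build_dtap_uri(data_source_name, *path_components):
    """Return dtap://<data_source_name>/<path components joined by '/'>."""
    if not data_source_name or "/" in data_source_name:
        raise ValueError("Data Source name must be non-empty and contain no '/'")
    # Strip stray slashes so components are joined by exactly one separator.
    parts = [p.strip("/") for p in path_components if p.strip("/")]
    return f"{DTAP_SCHEME}://{data_source_name}/" + "/".join(parts)


def split_dtap_uri(uri):
    """Split a dtap URI into (data_source_name, path_within_data_source)."""
    parsed = urlsplit(uri)
    if parsed.scheme != DTAP_SCHEME or not parsed.netloc:
        raise ValueError(f"not a dtap URI: {uri!r}")
    # An empty path means the top of the Data Source.
    return parsed.netloc, parsed.path or "/"


top_level = build_dtap_uri("mydatasource")
nested = build_dtap_uri("mydatasource", "home", "mydirectory")
name, path = split_dtap_uri(nested)
```

Here, nested is the same URI as in the example above, and splitting it yields the Data Source name mydatasource and the path /home/mydirectory within the storage that the Data Source points to.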

Data Sources exist on a per-project basis. This means that a Data Source created for Project A cannot be used by Project B. You may, however, create a Data Source for Project B with the exact same properties as its counterpart for Project A, thus allowing both projects to access the same storage resource. Further, multiple jobs within a project may use a given Data Source simultaneously. While such sharing can be useful for shared data or models across projects, be aware that the same cautions and restrictions apply to these use cases as for other types of shared storage: multiple jobs modifying files at the same location may lead to file access errors and/or unexpected job results.
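
One possible way to reduce the risk of jobs modifying files at the same location is to have each job write under its own subdirectory of the shared Data Source. The following Python sketch illustrates that convention only; the function name job_output_uri, the outputs directory, and the job identifiers are hypothetical and are not defined by the product.

```python
def job_output_uri(data_source_name, job_id, filename):
    """Build a per-job output URI so that concurrent jobs do not share a file path."""
    if not job_id or "/" in job_id:
        raise ValueError("job_id must be non-empty and contain no '/'")
    # Hypothetical layout: dtap://<data source>/outputs/<job id>/<file name>
    return f"dtap://{data_source_name}/outputs/{job_id}/{filename}"


# Two jobs using the same Data Source write the same file name to different locations.
uri_a = job_output_uri("mydatasource", "job-001", "model.bin")
uri_b = job_output_uri("mydatasource", "job-002", "model.bin")
```

With this layout, both jobs can still read shared input data from a common path, while their outputs never collide.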

Users who have a Project Administrator role may view and modify detailed Data Source information. Project Members may only view general Data Source information and are unable to create, edit, or remove a Data Source.

CAUTION:
Data conflicts may occur if more than one Data Source points to a location that multiple jobs are using at the same time.

CAUTION:
Editing or deleting a Data Source while one or more running jobs are using it may cause errors in the affected jobs.