The CGE Database Build Process

How dataset.nt and dataset.nq files are picked from a database, converted to RDF and then stored in various forms, upon which queries can be executed. This topic also covers memory requirements for building a CGE database as well as sample RDF files.

CGE is launched using the cge-launch command, and the database directory is specified using the -d option of that command. Initially, this directory contains RDF data in N-Triples or N-Quads format. When the application is first launched on a new database directory, the database is compiled and stored in an internal format in the same directory. Subsequent launches with the same database directory use the compiled database. The update command can then be used to add data to an existing database or to update it. For more information, see the cge-launch and update man pages.

Data must be presented in this directory in one of the following ways to enable CGE to recognize raw RDF data to be built:

In a single file called dataset.nq (for N-Quads form data)
In a single file called dataset.nt (for N-Triples form data)
In multiple files listed in a file called graph.info

Data to RDF Triples Conversion

CGE reads RDF data in N-Triples or N-Quads format. There are many third-party tools that may be used to convert data into RDF. D2R is often used to extract data from an RDBMS into RDF format. The TopBraid Composer by TopQuadrant® can also be used to convert Excel, TSV, UML, or XML data. Conversion of data to RDF is beyond the scope of this publication.

Internal Representation

Once the data has been translated into RDF, the user must place the data in the directory where CGE builds its compiled database files. If the RDF is contained in a single file, rename this file to dataset.nt or dataset.nq. A dataset.nt file is in N-Triples format, whereas a dataset.nq file is in N-Quads format. If the RDF is spread over more than one file, a file named graph.info must be created instead. This file contains a list of RDF files, one file per line. Each file name in graph.info may optionally be followed by a graph name. If a graph name is specified, it is applied to any triples found in the corresponding RDF file.

Following is a sample of a dataset.nt file that has been extracted from the Lehigh University Benchmark (LUBM) synthetic dataset:

<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse>
<http://www.Department14.University0.edu/GraduateCourse17> .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#TeachingAssistant> .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#teachingAssistantOf>
<http://www.Department14.University0.edu/Course6> .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse>
<http://www.Department14.University0.edu/GraduateCourse18> .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent> .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#name>
"GraduateStudent87" .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress>
"GraduateStudent87@Department14.University0.edu" .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#undergraduateDegreeFrom>
<http://www.University843.edu> .
<http://www.Department14.University0.edu/GraduateStudent87>
<http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#advisor>
<http://www.Department14.University0.edu/AssistantProfessor6> .

In the actual file, each triple (subject, predicate, object, and terminating period) must appear on a single line. The triples in the sample above are wrapped across multiple lines only due to lack of space.

The specification for N-Triples can be found at https://www.w3.org/TR/n-triples/
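The one-statement-per-line layout can be checked before a build with a short script. The following is a minimal Python sketch, not part of CGE: the function name classify_statement and the term pattern are illustrative assumptions, and the pattern is a simplification of the W3C grammar (for example, it does not accept a comment after the terminating period).

```python
import re

# Simplified term pattern: an IRI, a blank node label, or a literal with an
# optional datatype or language tag. This is NOT the full W3C grammar.
TERM = re.compile(
    r'<[^<>\s]*>'
    r'|_:[A-Za-z0-9_]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^<>\s]*>|@[A-Za-z0-9-]+)?'
)

def classify_statement(line):
    """Return 'triple', 'quad', 'blank', 'comment', or 'invalid' for one line."""
    text = line.strip()
    if not text:
        return "blank"
    if text.startswith("#"):
        return "comment"
    pos = 0
    terms = 0
    while True:
        match = TERM.match(text, pos)
        if match is None:
            break
        terms += 1
        pos = match.end()
        # Skip the spaces or tabs that separate terms.
        while pos < len(text) and text[pos] in " \t":
            pos += 1
    # After the terms, only the terminating period may remain.
    if text[pos:] != ".":
        return "invalid"
    return {3: "triple", 4: "quad"}.get(terms, "invalid")
```

A line holding only a subject, such as the first line of a statement that was wrapped for display, is reported as invalid by this check, while three terms followed by a period are reported as a triple and four terms as a quad.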

Following is a sample of a graph.info file:

# example graph.info file
 
# filenames can be absolute
/lustre/scratch/users/jdoe/database1/dbtriples1.nt
 
# or they can be relative to the database directory, which is where the graph.info file resides
database2/dbtriples2.nt
 
# they can specify a named subgraph with a URI
/lustre/scratch/users/jdoe/database3/dbquads3.nq     <http://cray.com/namedGraphs/Graph3>

Triples and quads are supported in both the .nt and .nq files. Quads in the RDF file are not affected by the optional graph name specified in the graph.info file. Lines containing only white space or lines beginning with the comment character (‘#’) are ignored. If the file is a mix of triples and quads, only the triples become part of the graph specified in the graph.info file.

As mentioned earlier, when the application is launched via the cge-launch command, the -d parameter specifies the database directory.
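The graph.info conventions described above (one file per line, an optional graph name after the file name, relative paths resolved against the database directory, and comment or whitespace-only lines ignored) can be illustrated with a small reader. This is a Python sketch for inspecting a graph.info file before a build, not CGE's own parser; the function name parse_graph_info is hypothetical, and the sketch assumes that file names contain no white space.

```python
from pathlib import Path

def parse_graph_info(graph_info_path):
    """Return a list of (rdf_file_path, graph_name_or_None) tuples."""
    graph_info_path = Path(graph_info_path)
    # Relative entries are resolved against the directory holding graph.info.
    db_dir = graph_info_path.parent
    entries = []
    for raw in graph_info_path.read_text(encoding="utf-8").splitlines():
        # Skip whitespace-only lines and lines beginning with '#'.
        if not raw.strip() or raw.startswith("#"):
            continue
        # Assumption: the file name has no white space, so the first
        # token is the file and anything after it is the graph name.
        parts = raw.split(None, 1)
        rdf_path = Path(parts[0])
        if not rdf_path.is_absolute():
            rdf_path = db_dir / rdf_path
        graph_name = parts[1].strip() if len(parts) > 1 else None
        entries.append((rdf_path, graph_name))
    return entries
```

Applied to the sample file shown earlier, this would yield three entries, with only the last one carrying a graph name.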

Warning: The -d parameter is mandatory. Launching CGE without specifying it will result in an error.

This directory must already exist and must have been populated with dataset.nt, dataset.nq, rules, and/or a graph.info file. If a compiled database is not present, a database is built using the graph.info, dataset.nt, or dataset.nq file in that directory.

When the database has been built, the following files are saved in the database directory:

dbQuads
string_table_chars
string_table_chars.index
graph.info (created if not already present; this file is used only to load a database from RDF files and is not used once the database is compiled)

CGE can begin executing queries and updates once the database has been built. When the application is subsequently launched via the cge-launch command specifying the same directory, the dbQuads file is detected, and the compiled database is read rather than the RDF.

CAUTION: If a user attempts to create a new database and the input data files do not contain any valid triples, CGE will exit with an error. The recommended way of creating an empty database is to create a completely empty input file using the touch command and then start the database.
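The empty-input approach can also be scripted. The following Python sketch performs the equivalent of the touch command; the function name prepare_empty_input is hypothetical, and the choice of dataset.nt as the name of the empty input file is an assumption based on the file names CGE recognizes, not something this topic states explicitly.

```python
from pathlib import Path

def prepare_empty_input(db_dir, input_name="dataset.nt"):
    """Create db_dir if needed and place a zero-length input file in it."""
    db_dir = Path(db_dir)
    db_dir.mkdir(parents=True, exist_ok=True)
    input_file = db_dir / input_name
    # Refuse to proceed if a non-empty input file is already there.
    if input_file.exists() and input_file.stat().st_size > 0:
        raise ValueError(f"{input_file} already contains data")
    input_file.touch()  # same effect as the touch command
    return input_file
```

The directory prepared this way would then be passed to cge-launch through the -d option.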

CGE searches for a dataset in the following places when loading a dataset:

If dbQuads exists, it will be used.
If dbQuads does not exist, but graph.info exists, graph.info will be opened and read to obtain a list of source data files, which will then be used to build a new dataset.
If neither dbQuads nor graph.info exists, but dataset.nt (or dataset.nq) exists, dataset.nt or dataset.nq will be used to build a new dataset.
If none of the above files exist, CGE will fail.

In each of these cases, if the file exists but is in some way invalid, CGE will fail.
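The search order above can be mirrored in a short pre-flight check that reports which source a launch would use for a given directory. This is a Python sketch of the documented order only, not CGE code; the function name find_dataset_source and its return labels are hypothetical, and it does not validate file contents.

```python
from pathlib import Path

def find_dataset_source(db_dir):
    """Return (kind, files) following the documented search order."""
    db_dir = Path(db_dir)
    # 1. A compiled database takes precedence.
    if (db_dir / "dbQuads").exists():
        return "compiled", [db_dir / "dbQuads"]
    # 2. Otherwise graph.info lists the source data files.
    if (db_dir / "graph.info").exists():
        return "graph.info", [db_dir / "graph.info"]
    # 3. Otherwise a single dataset.nt or dataset.nq file is used.
    #    The text does not say what happens if both exist, so this
    #    sketch simply reports whichever are present.
    raw = [db_dir / name for name in ("dataset.nt", "dataset.nq")
           if (db_dir / name).exists()]
    if raw:
        return "raw", raw
    # 4. Nothing usable: the launch would fail.
    raise FileNotFoundError(f"no dataset found in {db_dir}")
```

Running such a check before launch makes it easier to notice, for example, that a leftover dbQuads file would cause newly added RDF files to be ignored.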

Memory Requirements

Memory Requirement for reading a database from RDF - The amount of memory required to read a database from RDF depends on the number of triples/quads in the database, the number of unique strings in the dictionary, and the length of those strings. As a rule of thumb, however, the main memory should be 4 times the size of the RDF file(s). For example, for a 100 GiB triples file, at least 400 GiB (4 * 100) should be used.
Memory Requirement for loading a compiled database - A compiled database consists primarily of the dbQuads file, containing the compiled quads, and the string_table_chars files, containing the dictionary. To enable CGE to load the database and execute meaningful queries, the main memory should be 20 times the sum of the sizes of dbQuads and the string_table_chars file. For example, if dbQuads is 32 GiB and string_table_chars is 256 GiB, at least 20 * (32 + 256) = 5760 GiB of memory should be used.
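The two rules of thumb above reduce to simple arithmetic, shown here as a Python sketch. The function names are hypothetical; the factors 4 and 20 are the ones stated in this topic and are rough guidance, not exact requirements.

```python
def memory_to_build_gib(rdf_size_gib):
    """Rule of thumb for reading from RDF: 4 times the RDF file size."""
    return 4 * rdf_size_gib

def memory_to_load_gib(db_quads_gib, string_table_gib):
    """Rule of thumb for a compiled database: 20 times the combined size."""
    return 20 * (db_quads_gib + string_table_gib)

# The two examples from the text.
print(memory_to_build_gib(100))     # 400
print(memory_to_load_gib(32, 256))  # 5760
```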