The CGE Database Build Process
How dataset.nt and dataset.nq files are picked from a database, converted to RDF, and then stored in various forms upon which queries can be executed. This topic also covers memory requirements for building a CGE database and includes sample RDF files.
CGE is launched using the cge-launch command, and the database directory is specified with the -d option of that command. Initially, this directory contains RDF data in N-Triples or N-Quads format. When the application is first launched on a new database directory, the database is compiled and stored in an internal format in the same directory. Subsequent launches with the same database directory use the compiled database. The update command can then be used to add data to an existing database or to update it. For more information, see the cge-launch and update man pages. The RDF data can be placed in the database directory in one of the following ways:
- In a single file called dataset.nq (for N-Quads form data)
- In a single file called dataset.nt (for N-Triples form data)
- In multiple files listed in a file called graph.info
Data to RDF Triples Conversion
CGE reads RDF data in N-triples or N-quads format. There are many third-party tools that may be used to convert data into RDF. D2R is often used to extract data from an RDBMS into RDF format. The TopBraid Composer by TopQuadrant® can also be used to convert Excel, TSV, UML, or XML data. Conversion of data to RDF is beyond the scope of this publication.
Internal Representation
Once the data has been translated into RDF, the user must place the data in the directory where CGE builds its compiled database files. If the RDF is contained in a single file, rename this file to dataset.nt or dataset.nq. A dataset.nt file has N-Triples format, whereas a dataset.nq file has N-Quads format. If the RDF is spread across more than one file, create a file named graph.info in the same directory. This file contains a list of RDF files, one file per line. Each file name in graph.info may optionally be followed by a graph name. If a graph name is specified, it is applied to any triples found in the corresponding RDF file.
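As an illustration of the multi-file layout, the following minimal Python sketch writes a graph.info file with one RDF file per line and an optional graph name after the file name. The function name write_graph_info and the example file names are hypothetical, and writing the graph name as an IRI in angle brackets is an assumption based on the sample graph.info file in this topic. It is not part of CGE.

```python
from pathlib import Path
from typing import Iterable, Optional, Tuple


def write_graph_info(db_dir: Path,
                     entries: Iterable[Tuple[str, Optional[str]]]) -> Path:
    """Write db_dir/graph.info with one RDF file name per line.

    Each entry is (file_name, graph_iri_or_None). When a graph IRI is
    given, it is written after the file name as <IRI> (assumed syntax).
    """
    lines = []
    for file_name, graph_iri in entries:
        if graph_iri:
            lines.append(f"{file_name} <{graph_iri}>")
        else:
            lines.append(file_name)
    target = db_dir / "graph.info"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

A call such as write_graph_info(Path("/path/to/db"), [("part1.nt", None), ("part2.nt", "http://example.org/g")]) would list two files, the second with a graph name. The path and names here are placeholders.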
Following is a sample of a dataset.nt file that has been extracted from the Lehigh University Benchmark (LUBM) synthetic dataset:
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department14.University0.edu/GraduateCourse17> .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#TeachingAssistant> .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#teachingAssistantOf> <http://www.Department14.University0.edu/Course6> .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse> <http://www.Department14.University0.edu/GraduateCourse18> .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent> .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#name> "GraduateStudent87" .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress> "GraduateStudent87@Department14.University0.edu" .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#undergraduateDegreeFrom> <http://www.University843.edu> .
<http://www.Department14.University0.edu/GraduateStudent87> <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#advisor> <http://www.Department14.University0.edu/AssistantProfessor6> .
Each triple must appear on its own line in the file. Long triples in the sample above may wrap across several display lines due to lack of space.
The specification for N-Triples can be found at https://www.w3.org/TR/n-triples/
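The one-statement-per-line layout can be checked before a build with a short script. The following Python sketch classifies a line as a triple, a quad, or invalid by counting its terms. It uses a deliberately simplified term pattern, not the full W3C grammar, and the name classify_line is hypothetical. It is a pre-check under those assumptions and does not reflect how CGE itself parses RDF.

```python
import re

# Simplified term pattern: IRI, blank node, or literal with an optional
# datatype or language tag. This is NOT the full N-Triples grammar.
_TERM = re.compile(
    r'<[^<>\s]*>'                                              # IRI
    r'|_:\S+'                                                  # blank node
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^<>\s]*>|@[A-Za-z0-9-]+)?'    # literal
)


def classify_line(line: str) -> str:
    """Return 'blank', 'comment', 'triple', 'quad', or 'invalid'."""
    text = line.strip()
    if not text:
        return "blank"
    if text.startswith("#"):
        return "comment"
    if not text.endswith("."):
        return "invalid"          # every statement ends with ' .'
    body = text[:-1].strip()      # drop the terminating dot
    terms = []
    pos = 0
    while pos < len(body):
        if body[pos].isspace():
            pos += 1
            continue
        match = _TERM.match(body, pos)
        if match is None:
            return "invalid"      # e.g. a broken IRI or a second statement
        terms.append(match.group())
        pos = match.end()
    # Three terms form a triple, four terms form a quad.
    return {3: "triple", 4: "quad"}.get(len(terms), "invalid")
```

Because the terminating dot of a first statement is not a valid term, a line that holds two statements is reported as invalid, which is the condition this check is meant to catch.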
Following is a sample graph.info file:
# example graph.info file
# filenames can be absolute
/lustre/scratch/users/jdoe/database1/dbtriples1.nt
# or they can be relative to the database directory, which is where the graph.info file resides
database2/dbtriples2.nt
# they can specify a named subgraph with a URI
/lustre/scratch/users/jdoe/database3/dbquads3.nq <http://cray.com/namedGraphs/Graph3>
Triples and quads are supported in both the .nt and .nq files. Quads in the RDF file are not affected by the optional graph name specified in the graph.info file. Lines containing only white space or lines beginning with the comment character (‘#’) are ignored. If the file is a mix of triples and quads, the triples become part of the graph specified in the graph.info file.
As mentioned earlier, when the application is launched via the cge-launch command, the -d parameter specifies the database directory. Once the database has been compiled, this directory contains the following files:
- dbQuads
- string_table_chars
- string_table_chars.index
- graph.info, which is created if not already present. This file is used only to load a database from RDF files and is not used once the database is compiled.
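The graph.info rules described in this topic (blank lines and lines beginning with '#' are ignored, each remaining line names one RDF file with an optional graph name, and relative file names are resolved against the database directory) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that file names contain no white space. It illustrates the file format only, not CGE's own reader.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple


def parse_graph_info(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (file_name, graph_name_or_None) for each entry line."""
    entries = []
    for raw in text.splitlines():
        # Skip lines that are only white space or that begin with '#'.
        if not raw.strip() or raw.startswith("#"):
            continue
        # Assumption: the file name has no white space, so the first
        # token is the file name and the rest, if any, is the graph name.
        parts = raw.strip().split(None, 1)
        graph = parts[1].strip() if len(parts) > 1 else None
        entries.append((parts[0], graph))
    return entries


def resolve_rdf_path(db_dir: str, file_name: str) -> str:
    """Resolve a relative file name against the database directory.

    Joining with an absolute file name leaves it unchanged.
    """
    return str(PurePosixPath(db_dir) / file_name)
```

Applied to the sample graph.info file in this topic, parse_graph_info yields three entries, and only the third carries a graph name.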
CGE searches for a dataset in the following places when loading a dataset:
- If dbQuads exists, it will be used.
- If dbQuads does not exist, but graph.info exists, graph.info will be opened and read to obtain a list of source data files, which will then be used to build a new dataset.
- If neither dbQuads nor graph.info exists, but dataset.nt (or dataset.nq) exists, dataset.nt or dataset.nq will be used to build a new dataset.
- If none of the above files exist, CGE will fail.
In each of these cases, if the file exists but is in some way invalid, CGE will fail.
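The search order above can be mirrored in a small pre-flight check that reports which source a given database directory would provide. The following Python sketch is an illustration only. The function name choose_dataset_source is hypothetical, checking dataset.nt before dataset.nq is an assumption (this topic does not state which takes precedence when both are present), and file validity is not checked.

```python
from pathlib import Path
from typing import Tuple


def choose_dataset_source(db_dir: Path) -> Tuple[str, Path]:
    """Return (kind, path) for the first source found in db_dir.

    kind is 'compiled', 'graph.info', or 'single-file'.
    """
    # 1. A compiled database takes priority.
    db_quads = db_dir / "dbQuads"
    if db_quads.exists():
        return ("compiled", db_quads)
    # 2. Otherwise, a graph.info file lists the RDF source files.
    graph_info = db_dir / "graph.info"
    if graph_info.exists():
        return ("graph.info", graph_info)
    # 3. Otherwise, fall back to a single RDF file.
    #    Assumption: dataset.nt is checked before dataset.nq.
    for name in ("dataset.nt", "dataset.nq"):
        candidate = db_dir / name
        if candidate.exists():
            return ("single-file", candidate)
    # 4. Nothing usable was found.
    raise FileNotFoundError(f"no dataset found in {db_dir}")
```

Running such a check before launching can make it clear whether a launch would reuse a compiled database or trigger a new build from RDF.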
Memory Requirements
- Memory Requirement for reading a database from RDF - The amount of memory required to read a database from RDF depends on the number of triples/quads in the database, the number of unique strings in the dictionary, and the length of those strings. As a rule of thumb, however, the main memory should be 4 times the size of the RDF file(s). For example, for a 100 GiB triples file, at least 400 GiB (4 * 100) should be used.
- Memory Requirement for loading a compiled database - A compiled database consists primarily of the dbQuads file, containing the compiled quads, and the string_table_chars files, containing the dictionary. To enable CGE to load the database and execute meaningful queries, the main memory should be 20 times the sum of the sizes of dbQuads and the string_table_chars file. For example, if dbQuads is 32 GiB and string_table_chars is 256 GiB, at least 20 * (32 + 256) = 5760 GiB of memory should be used.
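The two rules of thumb above reduce to simple arithmetic. The following Python sketch encodes them so that the figures can be recomputed for other file sizes. The function names are hypothetical, and the results are the rule-of-thumb minimums stated in this topic rather than exact requirements.

```python
def rdf_build_memory_gib(rdf_size_gib: float) -> float:
    """Rule of thumb for reading a database from RDF: 4x the RDF size."""
    return 4 * rdf_size_gib


def compiled_load_memory_gib(db_quads_gib: float,
                             string_table_chars_gib: float) -> float:
    """Rule of thumb for loading a compiled database:
    20x the sum of the dbQuads and string_table_chars sizes."""
    return 20 * (db_quads_gib + string_table_chars_gib)


# The two examples from the text.
print(rdf_build_memory_gib(100))          # 400
print(compiled_load_memory_gib(32, 256))  # 5760
```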