Loading Data into a DataFrame Using an Explicit Schema

If you know the schema of your data, you can specify an explicit schema when loading a DataFrame.

The following example loads data into a user profile table using an explicit schema:

Scala
import org.apache.spark.sql.SparkSession
import com.mapr.db.spark.sql._
              
val addressSchema = StructType(StructField("Pin", IntegerType) ::
                    StructField("city", StringType) ::
                    StructField("street", StringType) :: Nil)
              
val personSchema = StructType(StructField("_id", StringType) ::
                   StructField("first_name", StringType) ::
                   StructField("last_name", StringType) :: 
                   StructField("address", addressSchema) ::
                   StructField("interests", ArrayType(StringType)) :: Nil)
              
val df = sparkSession.loadFromMapRDB("/tmp/user_profiles", personSchema)
Java
import com.mapr.db.spark.sql.api.java.MapRDBJavaSession;
import org.apache.spark.sql.SparkSession;
              
StructField[] addressSchema = {
                               new StructField("Pin", IntegerType, true, Metadata.empty()),
                               new StructField("city",StringType, true, Metadata.empty()),
                               new StructField("street", StringType, true, Metadata.empty())
                              };
StructField[] schemaFields = {
   new StructField("_id", StringType, true, Metadata.empty()),
   new StructField("first_name", StringType, true, Metadata.empty()),
   new StructField("address", new StructType(addressSchema), true, Metadata.empty()),
   new StructField("interests", new ArrayType(StringType, true), true, Metadata.empty())
                             };
StructType personSchema = new StructType(schemaFields);
MapRDBJavaSession maprSession = new MapRDBJavaSession(sparkSession);
Dataset<Row> df = maprSession.loadFromMapRDB("/tmp/user_profiles", personSchema);
Python
from pyspark.sql import SparkSession
              
addressSchema = [StructField("Pin", IntegerType(), True),
                 StructField("city", StringType(), True),
                 StructField("street", StringType(), True)]
schemaFields = [StructField("_id", StringType(), True),
                StructField("first_name", StringType(), True),
                StructField("last_name", StringType(), True),
                StructField("address", StructType(addressSchema), True),
                StructField("interests", ArrayType(StringType()), True)]
personSchema = StructType(schemaFields)
              
df = spark_session.loadFromMapRDB("/tmp/user_profiles", personSchema)

To create the DataFrame object named df, pass the schema as a parameter to the load call. Invoke the loadFromMapRDB method on a SparkSession object.

The resulting schema of the object is the following:

df.printSchema()
 ----------------------------------
 root
 |-- _id: String (nullable = true)
 |-- first_name: String (nullable = true)
 |-- last_name: String (nullable = true) 
 |-- address: Struct (nullable = true)
 |    |-- Pin: integer (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- interests: array (nullable = true)
 |    |-- element: string (containsNull = true)

When specifying StructField in a schema, optionally specify whether the field is nullable. In the example above, all fields are nullable.

Depending on the nullability of the field in the schema and the existence of fields in the HPE Ezmeral Data Fabric Database table, the load returns an InvalidSchema exception in the following cases: