When an API is available on an Apache Spark RDD but not on an Apache Spark DataFrame, you can operate on the RDD first and then convert the result to a DataFrame.
You can convert an RDD to a DataFrame in one of two ways:

- toDF
- createDataFrame call on a SparkSession object

Using toDF

The toDF method is available through MapRDBTableScanRDD. The following example loads an RDD that filters on first_name equal to "Peter" and projects the _id and first_name fields, and then converts the RDD to a DataFrame:
import com.mapr.db.spark.sql._

val df = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name")
  .toDF()
Using createDataFrame

With this approach, you convert an RDD[Row] to a DataFrame by calling createDataFrame on a SparkSession object. The API for the call is as follows:
def createDataFrame(RDD, schema: StructType)

You might need to first convert an RDD[OJAIDocument] to an RDD[Row]. The following example shows how to do this:
val df = sparkSession.createDataFrame(
  rdd.map(doc => MapRDBSpark.docToRow(doc, schema)), schema)

In this example, rdd is of type RDD[OJAIDocument]. The docToRow call converts each OJAIDocument to a Row, producing the RDD[Row] that is then passed to createDataFrame.
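Putting the pieces together, the schema argument can be built with Spark's standard StructType. The following is a minimal end-to-end sketch of the createDataFrame approach; the field types, nullability, and the assumption that both fields are strings are illustrative choices, not part of the connector API, and <table-name> is the same placeholder used above:

import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.mapr.db.spark.MapRDBSpark
import com.mapr.db.spark.sql._

// Assumed schema: both projected fields are strings; adjust to your table
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = false),
  StructField("first_name", StringType, nullable = true)
))

// Load the RDD[OJAIDocument], narrowed to the fields in the schema
val rdd = sc.loadFromMapRDB(<table-name>)
  .where(field("first_name") === "Peter")
  .select("_id", "first_name")

// Convert each OJAIDocument to a Row, then build the DataFrame
val df = sparkSession.createDataFrame(
  rdd.map(doc => MapRDBSpark.docToRow(doc, schema)), schema)

Because the schema is supplied explicitly, this approach gives you control over column types and nullability, whereas toDF infers them from the scanned documents.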