Spark Streaming is a micro-batching, stream-processing framework built on top of Spark. HBase APIs and Spark Streaming make great companions: alongside Spark Streaming, HBase can serve as a place to grab reference or profile data on the fly, and as a place to store counts or aggregates in a way that supports Spark Streaming's promise of only-once processing.
The Spark Streaming integration points of the HPE Ezmeral Data Fabric Database Binary Connector for Apache Spark are similar to its normal Spark integration points. You can use the following commands directly on a Spark Streaming DStream:
bulkPut | Enables massively parallel sending of puts to HBase APIs.
bulkDelete | Enables massively parallel sending of deletes to HBase APIs.
bulkGet | Enables massively parallel sending of gets to HBase APIs to create a new RDD.
mapPartition | Enables the Spark map function with a Connection object to allow full access to HBase APIs.
hBaseRDD | Simplifies a distributed scan to create an RDD.
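The other operations follow the same pattern as the bulkPut example that follows. As a minimal sketch, the code below applies hbaseBulkDelete to a stream of row keys; keyDStream (a hypothetical DStream[Array[Byte]] of row keys) and the batch size of 4 are illustrative assumptions, and hbaseContext and tableName are set up as in the bulkPut example below.

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.spark.HBaseDStreamFunctions._

// keyDStream is a hypothetical DStream[Array[Byte]] of row keys.
// Convert each row key into a Delete and send the deletes to
// HBase in batches of 4.
keyDStream.hbaseBulkDelete(
  hbaseContext,
  TableName.valueOf(tableName),
  (deleteRecord) => new Delete(deleteRecord),
  4)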
bulkPut Example with DStreams
Before calling the hbaseBulkPut method, make sure you import the HBaseDStreamFunctions class:

import org.apache.hadoop.hbase.spark.HBaseDStreamFunctions._
val sc = new SparkContext("local", "test")
val config = new HBaseConfiguration()
val hbaseContext = new HBaseContext(sc, config)
val ssc = new StreamingContext(sc, Milliseconds(200))

val rdd1 = ...
val rdd2 = ...

// Each stream record is a row key paired with an array of
// (column family, column qualifier, value) tuples.
val queue = mutable.Queue[RDD[(Array[Byte],
  Array[(Array[Byte], Array[Byte], Array[Byte])])]]()
queue += rdd1
queue += rdd2
val dStream = ssc.queueStream(queue)

dStream.hbaseBulkPut(
  hbaseContext,
  TableName.valueOf(tableName),
  (putRecord) => {
    // Build a Put from the row key, then add every
    // (family, qualifier, value) tuple to it.
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })
The hbaseBulkPut function has three inputs: the hbaseContext, which carries the configuration broadcast information that links to the HBase connections in the executors; the name of the table to put data into; and a function that converts a record in the DStream into an HBase Put object.
object.The code snippet above has been extracted from https://github.com/mapr/hbase/blob/1.1.8-mapr-1703/hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/HBaseDStreamFunctionsSuite.scala.