Spark Streaming Development Guide
2016-04-12
TinyKing86
Overview
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards.
Figure 1: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Internally, Spark Streaming works as follows:
Figure 2
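As an illustration of the window operations mentioned above, the following sketch counts words over a sliding window. This is not from the original article: it assumes a StreamingContext ssc and a DStream[String] named lines, created as in the full example below, and the 30-second window and 10-second slide are arbitrary choices.

import org.apache.spark.streaming.{Seconds, StreamingContext}
// Assumption: `ssc` is a StreamingContext and `lines` a DStream[String],
// set up as in the NetworkWordCount example below.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Count words over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()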
Example
Scala
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent from a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
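To build a standalone version of this program, the Spark Streaming artifact has to be on the classpath. A minimal build.sbt might look like the following; the version numbers are assumptions matching the Spark 1.x era of this article, so adjust them to your installation.

// build.sbt (hypothetical; pick versions matching your Spark installation)
name := "NetworkWordCount"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1"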
Running the example
$ ./bin/run-example streaming.NetworkWordCount localhost 9999
On a Unix-like system, you can then feed the example data with Netcat in a separate terminal:
# TERMINAL 1:
# Running Netcat
$ nc -lk 9999
hello world
...
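The running example should then print the counts of each batch to its own terminal, roughly as shown below. This is an illustrative sketch of DStream.print() output; the actual timestamp and counts depend on when and what you type.

# TERMINAL 2:
# Output of the running NetworkWordCount
-------------------------------------------
Time: 1460448000000 ms
-------------------------------------------
(hello,1)
(world,1)
...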
To be completed...