Spark Streaming Development Guide
2016-04-12
TinyKing86
Overview
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed with complex algorithms expressed through high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards.
Figure 1: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Internally, Spark Streaming works as follows:
Figure 2
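As an illustration of the window operations mentioned above, the following sketch counts words over a sliding window. This is not from the original article: it assumes a StreamingContext ssc and a DStream[String] named lines, created as in the full example below, and the 30-second window and 10-second slide are arbitrary choices.

import org.apache.spark.streaming.{Seconds, StreamingContext}
// Assumption: `ssc` is a StreamingContext and `lines` a DStream[String],
// set up as in the NetworkWordCount example below.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Count words over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()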
Example
Scala
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent from a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
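To build a standalone version of this program, the Spark Streaming artifact has to be on the classpath. A minimal build.sbt might look like the following; the version numbers are assumptions matching the Spark 1.x era of this article, so adjust them to your installation.

// build.sbt (hypothetical; pick versions matching your Spark installation)
name := "NetworkWordCount"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1"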
Running the example
$ ./bin/run-example streaming.NetworkWordCount localhost 9999
On a Unix-like system, you can then feed the example data with Netcat in a separate terminal:
# TERMINAL 1:
# Running Netcat
$ nc -lk 9999
hello world
...
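The running example should then print the counts of each batch to its own terminal, roughly as shown below. This is an illustrative sketch of DStream.print() output; the actual timestamp and counts depend on when and what you type.

# TERMINAL 2:
# Output of the running NetworkWordCount
-------------------------------------------
Time: 1460448000000 ms
-------------------------------------------
(hello,1)
(world,1)
...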
To be completed...