
Spark Streaming Programming Guide

2016-04-12  TinyKing86

Original article: Spark Streaming Programming Guide

Overview

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards.

[Figure 1]
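
As a concrete illustration of one of the ingestion sources listed above, here is a minimal sketch of reading lines from Kafka with the Spark 1.x receiver-based API. It assumes the spark-streaming-kafka artifact is on the classpath; the ZooKeeper quorum, consumer group, and topic name are hypothetical placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaIngest")
val ssc = new StreamingContext(conf, Seconds(1))

// Receiver-based Kafka stream; messages arrive as (key, value) pairs.
// "zk-host:2181", "example-group", and "example-topic" are placeholders.
val kafkaLines = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",           // ZooKeeper quorum (hypothetical)
  "example-group",          // consumer group id (hypothetical)
  Map("example-topic" -> 1) // topic -> number of receiver threads
).map(_._2)                 // keep only the message value

kafkaLines.print()
ssc.start()
ssc.awaitTermination()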

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to produce the final stream of results in batches. Internally, Spark Streaming works as follows:


[Figure 2]

Example

Scala

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._  // not necessary since Spark 1.3
   
// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
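
The overview also mentions window operations, which the example above does not use. Continuing from that example, here is a minimal sketch of a sliding windowed count; the 30-second window and 10-second slide are arbitrary illustrative values (both must be multiples of the batch interval):

// Count words over the last 30 seconds of data, recomputed every 10 seconds.
// The window and slide durations are illustrative choices, not prescribed values.
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()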

Running the Example

$ ./bin/run-example streaming.NetworkWordCount localhost 9999

On a Unix-like system, you can feed it data with Netcat in a separate terminal:

# TERMINAL 1:
# Running Netcat

$ nc -lk 9999
hello world

...
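
In the terminal running NetworkWordCount, each batch should then print counts roughly like the following (the timestamp is illustrative):

# TERMINAL 2:
# Running NetworkWordCount

-------------------------------------------
Time: 1357008430000 ms
-------------------------------------------
(hello,1)
(world,1)

...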

To be completed...
