storm介绍

2018-11-30 本文已影响0人 hexg1016

1 基本概念

1.1 Nimbus

Storm集群主节点，负责资源分配和任务调度。我们提交任务和截止任务都是在Nimbus上操作的。一个Storm集群只有一个Nimbus节点。

1.2 Supervisor

Storm集群工作节点，接受Nimbus分配任务，管理所有Worker。

1.3 topology

The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.

一个实时程序的逻辑被打包成一个storm的拓扑单元。storm的拓扑单元和mapreduce的工作类似。一个关键的不同是mapreduce执行完就结束了，但是拓扑一直运行，除非你关闭它。拓扑单元是一系列的通过消息流组相连的spout和bolt组成。

1.4 streams

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

streams是storm中的核心概念。一个消息流就是一个无限的的元组，元组由分布式，平行的方式被创建和处理。默认的元组可以包含 integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays数据类型。你可以定义自己的元组序列化方式。

1.5 spouts

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.

spout是一个topology中的消息源头。通常来讲，spout读取外部的数据，并将消息发射给一个topology。spout可以是可靠和不可靠的。

1.6 bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.

bolts是topology中的处理单元。它可以处理任何事情，例如过滤数据，处理，聚合，连接，和数据库交互等。

1.7 Stream groupings

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.

topology中的一部分任务就是定义每个bolt如何接受数据。一个流组定义了留如何在多个bolt的任务中被分发。storm中有八种内置的stream groupings

1.7.1 Shuffle grouping

Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.

数据元组被随机的分配给不同的bolt任务处理。

1.7.2 Fields grouping

The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.

数据流按照指定的字段进行划分。以user-id为例，相同user-id被分发到相同的bolt处理

1.7.3 Partial Key grouping

The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.

消息流按照字段分发给不同的bolt，和field grouping类似，但是这里是负载均衡的分发到不同的bolt，因此不能保证相同的field被分配到相同的bolt。因此，在统计类型的bolt中，需要在后续再添加一个汇总bolt。

1.7.4 All grouping

The stream is replicated across all the bolt's tasks. Use this grouping with care.

消息被复制给所有的bolt任务。

1.7.5 Global grouping

The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.

这整个数据流被发送给负载最低的一个bolt。

1.7.6 None grouping

This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).

None grouping表明你不关心消息是怎么分组的，当前None grouping和shuffle groupings是相等的。

1.7.7 Direct grouping

This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the [emitDirect](javadocs/org/apache/storm/task/OutputCollector.html#emitDirect(int, int, java.util.List) methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).

1.7.8 Local or shuffle grouping

If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.

1.8 tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

每个任务使用一个线程，stream groups定义如何在一个任务和另外一个任务中发送数据元组。

1.9 workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

每个worker对应一个java虚拟机进程，他包括一个或者多个task。storm在所有的worker中分发task。